Linux Trace Kernel
 help / color / mirror / Atom feed
* Re: [PATCH v6 11/43] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Fuad Tabba @ 2026-05-21  7:09 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-11-91ab5a8b19a4@google.com>

Hi Ackerley,

On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
<devnull+ackerleytng.google.com@kernel.org> wrote:
>
> From: Ackerley Tng <ackerleytng@google.com>
>
> When converting memory to private in guest_memfd, it is necessary to ensure
> that the pages are not currently being accessed by any other part of the
> kernel or userspace to avoid any current user writing to guest private
> memory.
>
> guest_memfd checks for unexpected refcounts to determine whether a page is
> still in use. The only expected refcounts after unmapping the range
> requested for conversion are those that are held by guest_memfd itself.
>
> Update the kvm_memory_attributes2 structure to include an error_offset
> field. This allows KVM to report the exact offset where a conversion
> failed to userspace. If the safety check fails, return -EAGAIN and copy
> the error_offset back to userspace so that it can potentially retry the
> operation or handle the failure gracefully.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> ---
>  include/uapi/linux/kvm.h |  3 ++-
>  virt/kvm/guest_memfd.c   | 65 ++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6bbf68a83813..0b55258573d3d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1658,7 +1658,8 @@ struct kvm_memory_attributes2 {
>         __u64 size;
>         __u64 attributes;
>         __u64 flags;
> -       __u64 reserved[12];
> +       __u64 error_offset;
> +       __u64 reserved[11];
>  };
>
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 91e89b188f583..9d82642a025e9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -572,9 +572,42 @@ static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
>         return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>  }
>
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> +                                           size_t nr_pages, pgoff_t *err_index)
> +{
> +       struct address_space *mapping = inode->i_mapping;
> +       const int filemap_get_folios_refcount = 1;
> +       pgoff_t last = start + nr_pages - 1;
> +       struct folio_batch fbatch;
> +       bool safe = true;
> +       int i;
> +
> +       folio_batch_init(&fbatch);
> +       while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
> +
> +               for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +                       struct folio *folio = fbatch.folios[i];
> +
> +                       if (folio_ref_count(folio) !=
> +                           folio_nr_pages(folio) + filemap_get_folios_refcount) {
> +                               safe = false;
> +                               *err_index = folio->index;
> +                               break;

https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=11

Sashiko raised a few issues here, but I think this one might be
genuine. Can you look into it please?

If that's right, when huge page support lands, if start falls in the
middle of a large folio, returning folio->index as the err_index will
return an offset strictly less than the requested start. A naive
userspace retry loop resuming from error_offset would step backwards
and corrupt attributes on memory it didn't intend to convert.
err_index should be clamped to max(start, folio->index).

Cheers,
/fuad

> +                       }
> +               }
> +
> +               folio_batch_release(&fbatch);
> +               cond_resched();
> +       }
> +
> +       return safe;
> +}
> +
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> -                                    size_t nr_pages, uint64_t attrs)
> +                                    size_t nr_pages, uint64_t attrs,
> +                                    pgoff_t *err_index)
>  {
> +       bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>         struct address_space *mapping = inode->i_mapping;
>         struct gmem_inode *gi = GMEM_I(inode);
>         pgoff_t end = start + nr_pages;
> @@ -588,8 +621,21 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>
>         mas_init(&mas, mt, start);
>         r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> -       if (r)
> +       if (r) {
> +               *err_index = start;
>                 goto out;
> +       }
> +
> +       if (to_private) {
> +               unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> +               if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> +                                                    err_index)) {
> +                       mas_destroy(&mas);
> +                       r = -EAGAIN;
> +                       goto out;
> +               }
> +       }
>
>         /*
>          * From this point on guest_memfd has performed necessary
> @@ -609,9 +655,10 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>         struct gmem_file *f = file->private_data;
>         struct inode *inode = file_inode(file);
>         struct kvm_memory_attributes2 attrs;
> +       pgoff_t err_index;
>         size_t nr_pages;
>         pgoff_t index;
> -       int i;
> +       int i, r;
>
>         if (copy_from_user(&attrs, argp, sizeof(attrs)))
>                 return -EFAULT;
> @@ -635,8 +682,16 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>
>         nr_pages = attrs.size >> PAGE_SHIFT;
>         index = attrs.offset >> PAGE_SHIFT;
> -       return __kvm_gmem_set_attributes(inode, index, nr_pages,
> -                                        attrs.attributes);
> +       r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> +                                     &err_index);
> +       if (r) {
> +               attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> +               if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +                       return -EFAULT;
> +       }
> +
> +       return r;
>  }
>
>  static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
>
> --
> 2.54.0.563.g4f69b47b94-goog
>
>

^ permalink raw reply

* Re: [PATCH 2/3] rv/rtapp/sleep: Update nanosleep rule
From: Gabriele Monaco @ 2026-05-21  7:04 UTC (permalink / raw)
  To: Nam Cao; +Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <a2175a6647bb88797ad8275eddb4b20b749474ec.1779176466.git.namcao@linutronix.de>

On Tue, 2026-05-19 at 09:49 +0200, Nam Cao wrote:
> diff --git a/kernel/trace/rv/monitors/sleep/sleep.c
> b/kernel/trace/rv/monitors/sleep/sleep.c
> index 0a36f5519e6b..e01ac56b3f4a 100644
> --- a/kernel/trace/rv/monitors/sleep/sleep.c
> +++ b/kernel/trace/rv/monitors/sleep/sleep.c
> @@ -43,9 +43,7 @@ static void ltl_atoms_init(struct task_struct *task, struct
> ltl_monitor *mon, bo
>  	ltl_atom_set(mon, LTL_WOKEN_BY_EQUAL_OR_HIGHER_PRIO, false);
>  
>  	if (task_creation) {
> -		ltl_atom_set(mon, LTL_KTHREAD_SHOULD_STOP, false);

Was it intentional to remove this too? It seems to me now it's cleared
only for existing user threads.

Aren't we sure new kthreads aren't stopping?

> -		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_MONOTONIC, false);
> -		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_TAI, false);
> +		ltl_atom_set(mon, LTL_NANOSLEEP_CLOCK_REALTIME, false);
>  		ltl_atom_set(mon, LTL_NANOSLEEP_TIMER_ABSTIME, false);
>  		ltl_atom_set(mon, LTL_CLOCK_NANOSLEEP, false);
>  		ltl_atom_set(mon, LTL_FUTEX_WAIT, false);

You missed removing a couple of these in the next branch, the monitor
doesn't build.

Thanks,
Gabriele


^ permalink raw reply

* Re: [PATCH 3/3] rv/rtapp: Add wakeup monitor
From: Gabriele Monaco @ 2026-05-21  6:40 UTC (permalink / raw)
  To: Nam Cao; +Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <bf775af281cb1540d51ac9a8531b45e723d73226.1779176466.git.namcao@linutronix.de>

On Tue, 2026-05-19 at 09:49 +0200, Nam Cao wrote:
> Add a wakeup monitor to detect a lower-priority task waking up a
> higher-priority task.
> 
> The rtapp/sleep monitor already detects this. However, that monitor
> triggers an error in the context of the woken task and user only gets the
> stacktrace of that task. It is also extremely useful to get the stacktrace
> of the waking task, which this monitor offers. In other words, this monitor
> complements the rtapp/sleep monitor.
> 
> Signed-off-by: Nam Cao <namcao@linutronix.de>

Looks neat, so the idea here is that we are looking at the same events
sleep would react for, just from a different perspective, so it does
make sense to run them both together.

You may want to set

  depends on RV_PER_TASK_MONITORS >= 3

in rtapp/Kconfig. Though if we plan to add more per-task monitors we may
need to find a better solution.

> diff --git a/tools/verification/models/rtapp/wakeup.ltl
> b/tools/verification/models/rtapp/wakeup.ltl
> new file mode 100644
> index 000000000000..a5d63ca0811a
> --- /dev/null
> +++ b/tools/verification/models/rtapp/wakeup.ltl
> @@ -0,0 +1,5 @@
> +RULE = always (((RT and USER_THREAD) imply
> +		(not (WOKEN_BY_LOWER_PRIO or WOKEN_BY_SOFTIRQ)) or
> ALLOWLIST))
> +
> +ALLOWLIST = BLOCK_ON_RT_MUTEX
> +         or FUTEX_LOCK_PI

So here the events and atoms are similar to the ones in sleep, but since
we fail on the waking event, we are going to see it from the perspective
of the waker task, right?

But are those really equivalent? Why do we do RT and USER_THREAD here
while there is a much more nuanced set of conditions in sleep?

If I understand it correctly, sleep can monitor some kernel threads but
this monitor does not, is there a reason for that? Are we just not
interested in the waker for kernel threads?

> diff --git a/kernel/trace/rv/monitors/wakeup/Kconfig
> b/kernel/trace/rv/monitors/wakeup/Kconfig
> new file mode 100644
> index 000000000000..3cf11c5cd5f7
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/wakeup/Kconfig
> @@ -0,0 +1,17 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_WAKEUP
> +	depends on RV
> +	depends on RV_MON_RTAPP
> +	depends on HAVE_SYSCALL_TRACEPOINTS
> +	select TRACE_IRQFLAGS
> +	default y
> +	select LTL_MON_EVENTS_ID
> +	bool "wakeup monitor"
> +	help
> +	  This monitor detects a lower-priority task waking up a
> +	  higher-priority task. The RV_MON_SLEEP monitor already
> +	  detects this case, but this monitor detects in the context
> +	  of the waking task instead. This and RV_MON_SLEEP can be
> +	  enabled together to get the stacktrace of both the waking
> +	  task and the woken task.

I'm not sure if there is any better terminology, but "waking" task makes
me think of the task that is about to be woken, though it can mean also
that task that is waking another (what you probably mean here).

What about using the waker/wakee terminology?

I see the kernel (events/sched.h) uses waking as well, but it says
waking context (which a bit clearer to me than waking task).
May be worth running it through an LLM which can produce more
English-native unambiguous wording, or maybe I'm just flipping..

Also please document it in Documentation/trace/rv/monitor_rtapp.rst

Thanks,
Gabriele


^ permalink raw reply

* Re: [PATCH v20 09/10] ring-buffer: Show persistent buffer dropped events in trace file
From: Masami Hiramatsu @ 2026-05-21  6:34 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.332045380@kernel.org>

On Wed, 20 May 2026 14:49:47 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When the persistent ring buffer is validated on boot up, if a subbuffer is
> deemed invalid, it resets the buffer and continues. Currently, these lost
> events are not shown in the trace file output.
> 
> Have the trace iterator look for subbuffers that have the RB_MISSED_EVENTS
> set and set the iter->missed_events flag when it is detected. This will
> then have the trace file shows "LOST EVENTS" when it reads across a
> subbuffer that was corrupted and invalidated.

I think it is good to show the dropped events. BTW, is it better to
comment out the line, just for parser?
For example, add a '#' at the like.

# CPU:5 [LOST EVENTS]

Ah, but it is already done...

Thanks,

> 
> For example:
> 
>  <...>-1016    [005] ...1.  6230.660403: preempt_disable: caller=__mod_memcg_state+0x1c8/0x200 parent=__mod_memcg_state+0x1c8/0x200
> CPU:5 [LOST EVENTS]
>  <...>-1016    [005] .....  6230.660673: kmem_cache_alloc: call_site=__anon_vma_prepare+0x1ad/0x1e0 ptr=000000006e40294c name=anon_vma bytes_req=200 bytes_alloc=208 gfp_flags=GFP_KERNEL node=-1 accounted=true
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index ae5c645b59c9..9cdbee171cdc 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -3518,6 +3518,9 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
>  	else
>  		rb_inc_page(&iter->head_page);
>  
> +	if (rb_page_commit(iter->head_page) & RB_MISSED_EVENTS)
> +		iter->missed_events = -1;
> +
>  	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
>  	iter->head = 0;
>  	iter->next_event = 0;
> @@ -5579,6 +5582,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
>  	iter->head_page = cpu_buffer->reader_page;
>  	iter->head = cpu_buffer->reader_page->read;
>  	iter->next_event = iter->head;
> +	iter->missed_events = 0;
>  
>  	iter->cache_reader_page = iter->head_page;
>  	iter->cache_read = cpu_buffer->read;
> @@ -7053,7 +7057,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	struct ring_buffer_event *event;
>  	struct buffer_data_page *dpage;
>  	struct buffer_page *reader;
> -	unsigned long missed_events;
> +	long missed_events;
>  	unsigned int commit;
>  	unsigned int read;
>  	u64 save_timestamp;
> @@ -7179,6 +7183,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  		local_set(&reader->entries, 0);
>  		reader->read = 0;
>  		data_page->data = dpage;
> +		if (!missed_events && rb_data_page_commit(dpage) & RB_MISSED_EVENTS)
> +			missed_events = -1;
>  
>  		/*
>  		 * Use the real_end for the data size,
> @@ -7196,10 +7202,12 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>  	 * Set a flag in the commit field if we lost events
>  	 */
>  	if (missed_events) {
> -		/* If there is room at the end of the page to save the
> +		/*
> +		 * If there is room at the end of the page to save the
>  		 * missed events, then record it there.
>  		 */
> -		if (buffer->subbuf_size - commit >= sizeof(missed_events)) {
> +		if (missed_events > 0 &&
> +		    buffer->subbuf_size - commit >= sizeof(missed_events)) {
>  			memcpy(&dpage->data[commit], &missed_events,
>  			       sizeof(missed_events));
>  			local_add(RB_MISSED_STORED, &dpage->commit);
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v20 08/10] ring-buffer: Have dropped subbuffers be persistent across reboots
From: Masami Hiramatsu @ 2026-05-21  6:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.193087991@kernel.org>

On Wed, 20 May 2026 14:49:46 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> When the persistent ring buffer detects a corrupted subbuffer, it will
> zero its size and report dropped pages in the dmesg, then it continues
> normally.
> 
> But if a reboot happens without clearing or restarting tracing on the
> persistent ring buffer, the next boot will show no pages are dropped.
> 
> If the persistent ring buffer is still the same, then it should still
> report dropped pages so the user knows that the buffer has missing events.
> 
> Add the RB_MISSED_EVENTS flag to the commit value of the subbuffer so that
> the next boot will still show that pages were dropped.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index bda53a2d2159..ae5c645b59c9 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1915,7 +1915,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
>  	 * Even after clearing these bits, a commit value greater than the
>  	 * subbuf_size is considered invalid.
>  	 */
> -	tail = rb_data_page_size(dpage);
> +	tail = rb_data_page_commit(dpage);
>  	if (tail <= meta->subbuf_size - BUF_PAGE_HDR_SIZE)
>  		ret = rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
>  	else
> @@ -1929,7 +1929,7 @@ static int __rb_validate_buffer(struct buffer_page *bpage, int cpu,
>  	 */
>  	if (ret < 0 || (prev_ts && prev_ts > ts) || (next_ts && ts > next_ts)) {
>  		local_set(&bpage->entries, 0);
> -		local_set(&dpage->commit, 0);
> +		local_set(&dpage->commit, RB_MISSED_EVENTS);
>  		dpage->time_stamp = prev_ts ? prev_ts : next_ts;
>  		ret = -1;
>  	} else {
> @@ -3444,7 +3444,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
>  	 * is a mb(), which will synchronize with the rmb here.
>  	 * (see rb_tail_page_update() and __rb_reserve_next())
>  	 */
> -	commit = rb_page_commit(iter_head_page);
> +	commit = rb_page_size(iter_head_page);
>  	smp_rmb();
>  
>  	/* An event needs to be at least 8 bytes in size */
> @@ -3473,7 +3473,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
>  
>  	/* Make sure the page didn't change since we read this */
>  	if (iter->page_stamp != iter_head_page->page->time_stamp ||
> -	    commit > rb_page_commit(iter_head_page))
> +	    commit > rb_page_size(iter_head_page))
>  		goto reset;
>  
>  	iter->next_event = iter->head + length;
> @@ -5643,7 +5643,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
>  	 * (see rb_tail_page_update())
>  	 */
>  	smp_rmb();
> -	commit = rb_page_commit(commit_page);
> +	commit = rb_page_size(commit_page);
>  	/* We want to make sure that the commit page doesn't change */
>  	smp_rmb();
>  
> @@ -5836,6 +5836,7 @@ __rb_get_reader_page(struct ring_buffer_per_cpu *cpu_buffer)
>  	 */
>  	local_set(&cpu_buffer->reader_page->write, 0);
>  	local_set(&cpu_buffer->reader_page->entries, 0);
> +	rb_init_data_page(cpu_buffer->reader_page->page);
>  	cpu_buffer->reader_page->real_end = 0;
>  
>   spin:
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Balbir Singh @ 2026-05-21  6:23 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <20260222084842.1824063-1-gourry@gourry.net>

On Sun, Feb 22, 2026 at 03:48:15AM -0500, Gregory Price wrote:
> Topic type: MM
> 
> Presenter: Gregory Price <gourry@gourry.net>
> 
> This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> managed by the buddy allocator but excluded from normal allocations.
> 
> I present it with an end-to-end Compressed RAM service (mm/cram.c)
> that would otherwise not be possible (or would be considerably more
> difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
> 

Do we have updates/notes from the meeting?

> 
> TL;DR
> ===
> 
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
> 
> 
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
>                                        online_type, private_context);
> 
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);

Do we want to provide kernel level control over allocation of private
pages, I assumed that only user space applications? I would assume
node affinity would be the way to do so, unless we have multiple

> 
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
>         .migrate_to     = my_migrate_to,
>         .folio_migrate  = my_folio_migrate,
>         .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>

Could you explain this further? Why does OPS_MIGRATION
and OPS_MEMPOLICY needs to be set explictly?

> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef;  /* Faults onto private node */
> 
> /* And to be clear, no one else gets my memory */
> buf2 = malloc(4096);  /* Standard allocation */
> buf2[0] = 0xdeadbeef; /* Can never land on private node */
> 
> /* But i can choose to migrate it to the private node */
> move_pages(0, 1, &buf, &private_node, NULL, ...);
> 
> /* And more fun things like this */
> 
> 
> Patchwork
> ===
> A fully working branch based on cxl/next can be found here:
> https://github.com/gourryinverse/linux/tree/private_compression
> 
> A QEMU device which can inject high/low interrupts can be found here:
> https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
> 
> The additional patches on these branches are CXL and DAX driver
> housecleaning only tangentially relevant to this RFC, so i've
> omitted them for the sake of trying to keep it somewhat clean
> here.  Those patches should (hopefully) be going upstream anyway.
> 
> Patches 1-22: Core Private Node Infrastructure
> 
>   Patch  1:      Introduce N_MEMORY_PRIVATE scaffolding
>   Patch  2:      Introduce __GFP_PRIVATE
>   Patch  3:      Apply allocation isolation mechanisms
>   Patch  4:      Add N_MEMORY nodes to private fallback lists
>   Patches 5-9:   Filter operations not yet supported
>   Patch 10:      free_folio callback
>   Patch 11:      split_folio callback
>   Patches 12-20: mm/ service opt-ins:
>                    Migration, Mempolicy, Demotion, Write Protect,
>                    Reclaim, OOM, NUMA Balancing, Compaction,
>                    LongTerm Pinning
>   Patch 21:      memory_failure callback
>   Patch 22:      Memory hotplug plumbing for private nodes
> 
> Patch 23: mm/cram -- Compressed RAM Management
> 
> Patches 24-27: CXL Driver examples
>   Sysram Regions with Private node support
>   Basic Driver Example: (MIGRATION | MEMPOLICY)
>   Compression Driver Example (Generic)
> 
> 
> Background
> ===
> 
> Today, drivers that want mm-like services on non-general-purpose
> memory either use ZONE_DEVICE (self-managed memory) or hotplug into
> N_MEMORY and accept the risk of uncontrolled allocation.
> 
> Neither option provides what we really want - the ability to:
> 	1) selectively participate in mm/ subsystems, while
> 	2) isolating that memory from general purpose use.
> 
> Some device-attached memory cannot be managed as fully general-purpose
> system RAM.  CXL devices with inline compression, for example, may
> corrupt data or crash the machine if the compression ratio drops
> below a threshold -- we simply run out of physical memory.
> 
> This is a hard problem to solve: how does an operating system deal
> with a device that basically lies about how much capacity it has?
> 
> (We'll discuss that in the CRAM section)
> 
> 
> Core Proposal: N_MEMORY_PRIVATE
> ===
> 
> Introduce N_MEMORY_PRIVATE, a NUMA node state for memory managed by
> the buddy allocator, but excluded from normal allocation paths.
> 
> Private nodes:
> 
>   - Are filtered from zonelist fallback: all existing callers to
>     get_page_from_freelist cannot reach these nodes through any
>     normal fallback mechanism.
> 
>   - Filter allocation requests on __GFP_PRIVATE
>     	numa_zone_allowed() excludes them otherwise. 
> 
>     Applies to systems with and without cpusets.
> 
>     GFP_PRIVATE is (__GFP_PRIVATE | __GFP_THISNODE).
> 
>     Services use it when they need to allocate specifically from
>     a private node (e.g., CRAM allocating a destination folio).
> 
>     No existing allocator path sets __GFP_PRIVATE, so private nodes
>     are unreachable by default.
> 
>   - Use standard struct page / folio.  No ZONE_DEVICE, no pgmap,
>     no struct page metadata limitations.
> 
>   - Use a node-scoped metadata structure to accomplish filtering
>     and callback support.
> 
>   - May participate in the buddy allocator, reclaim, compaction,
>     and LRU like normal memory, gated by an opt-in set of flags.
> 
> The key abstraction is node_private_ops: a per-node callback table
> registered by a driver or service.  
> 
> Each callback is individually gated by an NP_OPS_* capability flag.
> 
> A driver opts in only to the mm/ operations it needs.
> 
> It is similar to ZONE_DEVICE's pgmap at a node granularity.
> 
> In fact...
> 
> 
> Re-use of ZONE_DEVICE Hooks
> ===
> 
> The callback insertion points deliberately mirror existing ZONE_DEVICE
> hooks to minimize the surface area of the mechanism.
> 
> I believe this could subsume most DEVICE_COHERENT users, and greatly
> simplify the device-managed memory development process (no more
> per-driver allocator and migration code).
> 
> (Also it's just "So Fresh, So Clean").
> 
> The base set of callbacks introduced include:
> 
>   free_folio           - mirrors ZONE_DEVICE's
>                          free_zone_device_page() hook in
>                          __folio_put() / folios_put_refs()
> 
>   folio_split          - mirrors ZONE_DEVICE's
>   			 called when a huge page is split up
> 
>   migrate_to           - demote_folio_list() custom demotion (same
>                          site as ZONE_DEVICE demotion rejection)
> 
>   folio_migrate        - called when private node folio is moved to
>                          another location (e.g. compaction)
> 
>   handle_fault         - mirrors the ZONE_DEVICE fault dispatch in
>                          handle_pte_fault() (do_wp_page path)
> 
>   reclaim_policy       - called by reclaim to let a driver own the
>                          boost lifecycle (driver can driver node reclaim)
> 
>   memory_failure       - parallels memory_failure_dev_pagemap(),
>                          but for online pages that enter the normal
>                          hwpoison path
> 
> At skip sites (mlock, madvise, KSM, user migration), a unified
> folio_is_private_managed() predicate covers both ZONE_DEVICE and
> N_MEMORY_PRIVATE folios, consolidating existing zone_device checks
> with private node checks rather than adding new ones.
> 
>   static inline bool folio_is_private_managed(struct folio *folio)
>   {
>           return folio_is_zone_device(folio) ||
>                  folio_is_private_node(folio);
>   }
> 
> Most integration points become a one-line swap:
> 
>   -     if (folio_is_zone_device(folio))
>   +     if (unlikely(folio_is_private_managed(folio)))
> 
> 
> Where a one-line integration is insufficient, the integration is
> kept as clean as possible with zone_device, rather than simply
> adding more call-sites on top of it:
> 
> static inline bool folio_managed_handle_fault(struct folio *folio,
>   struct vm_fault *vmf, vm_fault_t *ret)
> {
>   /* Zone device pages use swap entries; handled in do_swap_page */
>   if (folio_is_zone_device(folio))
>     return false;
> 
>   if (folio_is_private_node(folio)) {
>     const struct node_private_ops *ops = folio_node_private_ops(folio);
> 
>     if (ops && ops->handle_fault) {
>       *ret = ops->handle_fault(vmf);
>       return true;
>     }
>   }
>   return false;
> }
> 
> 
> 
> Flag-gated behavior (NP_OPS_*) controls:
> ===
> 
> We use OPS flags to denote what mm/ services we want to allow on our
> private node.   I've plumbed these through so far:
> 
>   NP_OPS_MIGRATION       - Node supports migration
>   NP_OPS_MEMPOLICY       - Node supports mempolicy actions
>   NP_OPS_DEMOTION        - Node appears in demotion target lists
>   NP_OPS_PROTECT_WRITE   - Node memory is read-only (wrprotect)
>   NP_OPS_RECLAIM         - Node supports reclaim
>   NP_OPS_NUMA_BALANCING  - Node supports numa balancing
>   NP_OPS_COMPACTION      - Node supports compaction
>   NP_OPS_LONGTERM_PIN    - Node supports longterm pinning
>   NP_OPS_OOM_ELIGIBLE	 - (MIGRATION | DEMOTION), node is reachable
>                            as normal system ram storage, so it should
> 			   be considered in OOM pressure calculations.
> 
> I wasn't quite sure how to classify ksm, khugepaged, madvise, and
> mlock - so i have omitted those for now.
> 
> Most hooks are straightforward.
> 
> Including a node as a demotion-eligible target was as simple as:
> 
> static void establish_demotion_targets(void)
> {
>   ..... snip .....
>   /*
>    * Include private nodes that have opted in to demotion
>    * via NP_OPS_DEMOTION.  A node might have custom migrate
>    */
>   all_memory = node_states[N_MEMORY];
>   for_each_node_state(node, N_MEMORY_PRIVATE) {
>       if (node_private_has_flag(node, NP_OPS_DEMOTION))
>       node_set(node, all_memory);
>   }
>   ..... snip .....
> }
> 
> The Migration and Mempolicy support are the two most complex pieces,
> and most useful things are built on top of Migration (meaning the
> remaining implementations are usually simple).
> 
> 
> Private Node Hotplug Lifecycle
> ===
> 
> Registration follows a strict order enforced by
> add_private_memory_driver_managed():
> 
>   1. Driver calls add_private_memory_driver_managed(nid, start,
>      size, resource_name, mhp_flags, online_type, &np).
> 
>   2. node_private_register(nid, &np) stores the driver's
>      node_private in pgdat and sets pgdat->private.  N_MEMORY and
>      N_MEMORY_PRIVATE are mutually exclusive -- registration fails
>      with -EBUSY if the node already has N_MEMORY set.
> 
>      Only one driver may register per private node.
> 
>   3. Memory is hotplugged via __add_memory_driver_managed().
> 
>      When online_pages() runs, it checks pgdat->private and sets
>      N_MEMORY_PRIVATE instead of N_MEMORY.  
> 
>      Zonelist construction gives private nodes a self-only NOFALLBACK
>      list and an N_MEMORY fallback list (so kernel/slab allocations on
>      behalf of private node work can fall back to DRAM).
> 
>   4. kswapd and kcompactd are NOT started for private nodes.  The
>      owning service is responsible for driving reclaim if needed
>      (e.g., CRAM uses watermark_boost to wake kswapd on demand).
> 
> Teardown is the reverse:
> 
>   1. Driver calls offline_and_remove_private_memory(nid, start,
>      size).
> 
>   2. offline_pages() offlines the memory.  When the last block is
>      offlined, N_MEMORY_PRIVATE is cleared automatically.
> 
>   3. node_private_unregister() clears pgdat->node_private and
>      drops the refcount.  It refuses to unregister (-EBUSY) if
>      N_MEMORY_PRIVATE is still set (other memory ranges remain).
> 
> The driver is responsible for ensuring memory is hot-unpluggable
> before teardown.  The service must ensure all memory is cleaned
> up before hot-unplug - or the service must support migration (so
> memory_hotplug.c can evacuate the memory itself).
> 
> In the CRAM example, the service supports migration, so memory
> hot-unplug can remove memory without any special infrastructure.
> 
> 
> Application: Compressed RAM (mm/cram)
> ===
> 
> Compressed RAM has a serious design issue:  Its capacity a lie.
> 
> A compression device reports more capacity than it physically has.
> If workloads write faster than the OS can reclaim from the device,
> we run out of real backing store and corrupt data or crash.
> 
> I call this problem: "Trying to Out Run A Bear"
> 
> I.e. This is only stable as long as we stay ahead of the pressure.
> 
> We don't want to design a system where stability depends on outrunning
> a bear - I am slow and do not know where to acquire bear spray.
> 
>   Fun fact:   Grizzly bears have a top-speed of 56-64 km/h.
>   Unfun Fact: Humans typically top out at ~24 km/h.
> 
> This MVP takes a conservative position:
> 
>    all compressed memory is mapped read-only.
> 
>   - Folios reach the private node only via reclaim (demotion)
>   - migrate_to implements custom demotion with backpressure.
>   - fixup_migration_pte write-protects PTEs on arrival.
>   - wrprotect hooks prevent silent upgrades
>   - handle_fault promotes folios back to DRAM on write.
>   - free_folio scrubs stale data before buddy free.
> 
> Because pages are read-only, writes can never cause runaway
> compression ratio loss behind the allocator's back.  Every write
> goes through handle_fault, which promotes the folio to DRAM first.
> 
> The device only ever sees net compression (demotion in) and explicit
> decompression (promotion out via fault or reclaim), and has a much
> wider timeframe to respond to poor compression scenarios.
> 
> That means there's no bear to out run. The bears are safely asleep in
> their bear den, and even if they show up we have a bear-proof cage.
> 
> The backpressure system is our bear-proof cage: the driver reports real
> device utilization (generalized via watermark_boost on the private
> node's zone), and CRAM throttles demotion when capacity is tight.
> 
> If compression ratios are bad, we stop demoting pages and start
> evicting pages aggressively.
> 
> The service as designed is ~350 functional lines of code because it
> re-uses mm/ services:
> 
>   - Existing reclaim/vmscan code handles demotion.
>   - Existing migration code handles migration to/from.
>   - Existing page fault handling dispatches faults.
> 
> The driver contains all the CXL nastiness core developers don't want
> anything to do with - No vendor logic touches mm/ internals.
> 
> 
> 
> Future CRAM : Loosening the read-only constraint
> ===
> 
> The read-only model is safe but conservative.  For workloads where
> compressed pages are occasionally written, the promotion fault adds
> latency.  A future optimization could allow a tunable fraction of
> compressed pages to be mapped writable, accepting some risk of
> write-driven decompression in exchange for lower overhead.
> 
> The private node ops make this straightforward:
> 
>   - Adjust fixup_migration_pte to selectively skip
>     write-protection.
>   - Use the backpressure system to either revoke writable mappings,
>     deny additional demotions, or evict when device pressure rises.
> 
> This comes at a mild memory overhead: 32MB of DRAM per 1TB of CRAM.
> (1 bit per 4KB page).
> 
> This is not proposed here, but it should be somewhat trivial.
> 
> 
> Discussion Topics
> ===
> 0. Obviously I've included the set as an RFC, please rip it apart.
> 
> 1. Is N_MEMORY_PRIVATE the right isolation abstraction, or should
>    this extend ZONE_DEVICE?  Prior feedback pushed away from new
>    ZONE logic, but this will likely be debated further.
> 
>    My comments on this:
> 
>    ZONE_DEVICE requires re-implementing every service you want to
>    provide to your device memory, including basic allocation.
> 
>    Private nodes use real struct pages with no metadata
>    limitations, participate in the buddy allocator, and get NUMA
>    topology for free.
> 
> 2. Can this subsume ZONE_DEVICE COHERENT users?  The architecture
>    was designed with this in mind, but it is only a thought experiment.
> 
> 3. Is a dedicated mm/ service (cram) the right place for compressed
>    memory management, or should this be purely driver-side until
>    more devices exist?
> 
>    I wrote it this way because I forsee more "innovation" in the
>    compressed RAM space given current... uh... "Market Conditions".
> 
>    I don't see CRAM being CXL-specific, though the only solutions I've
>    seen have been CXL.  Nothing is stopping someone from soldering such
>    memory directly to a PCB.
> 
> 5. Where is your hardware-backed data that shows this works?
> 
>    I should have some by conference time.
> 
> Thanks for reading
> Gregory (Gourry)
> 
> 
> Gregory Price (27):
>   numa: introduce N_MEMORY_PRIVATE node state
>   mm,cpuset: gate allocations from N_MEMORY_PRIVATE behind __GFP_PRIVATE
>   mm/page_alloc: add numa_zone_allowed() and wire it up
>   mm/page_alloc: Add private node handling to build_zonelists
>   mm: introduce folio_is_private_managed() unified predicate
>   mm/mlock: skip mlock for managed-memory folios
>   mm/madvise: skip madvise for managed-memory folios
>   mm/ksm: skip KSM for managed-memory folios
>   mm/khugepaged: skip private node folios when trying to collapse.
>   mm/swap: add free_folio callback for folio release cleanup
>   mm/huge_memory.c: add private node folio split notification callback
>   mm/migrate: NP_OPS_MIGRATION - support private node user migration
>   mm/mempolicy: NP_OPS_MEMPOLICY - support private node mempolicy
>   mm/memory-tiers: NP_OPS_DEMOTION - support private node demotion
>   mm/mprotect: NP_OPS_PROTECT_WRITE - gate PTE/PMD write-upgrades
>   mm: NP_OPS_RECLAIM - private node reclaim participation
>   mm/oom: NP_OPS_OOM_ELIGIBLE - private node OOM participation
>   mm/memory: NP_OPS_NUMA_BALANCING - private node NUMA balancing
>   mm/compaction: NP_OPS_COMPACTION - private node compaction support
>   mm/gup: NP_OPS_LONGTERM_PIN - private node longterm pin support
>   mm/memory-failure: add memory_failure callback to node_private_ops
>   mm/memory_hotplug: add add_private_memory_driver_managed()
>   mm/cram: add compressed ram memory management subsystem
>   cxl/core: Add cxl_sysram region type
>   cxl/core: Add private node support to cxl_sysram
>   cxl: add cxl_mempolicy sample PCI driver
>   cxl: add cxl_compression PCI driver
> 
>  drivers/base/node.c                           |  250 +++-
>  drivers/cxl/Kconfig                           |    2 +
>  drivers/cxl/Makefile                          |    2 +
>  drivers/cxl/core/Makefile                     |    1 +
>  drivers/cxl/core/core.h                       |    4 +
>  drivers/cxl/core/port.c                       |    2 +
>  drivers/cxl/core/region_sysram.c              |  381 ++++++
>  drivers/cxl/cxl.h                             |   53 +
>  drivers/cxl/type3_drivers/Kconfig             |    3 +
>  drivers/cxl/type3_drivers/Makefile            |    3 +
>  .../cxl/type3_drivers/cxl_compression/Kconfig |   20 +
>  .../type3_drivers/cxl_compression/Makefile    |    4 +
>  .../cxl_compression/compression.c             | 1025 +++++++++++++++++
>  .../cxl/type3_drivers/cxl_mempolicy/Kconfig   |   16 +
>  .../cxl/type3_drivers/cxl_mempolicy/Makefile  |    4 +
>  .../type3_drivers/cxl_mempolicy/mempolicy.c   |  297 +++++
>  include/linux/cpuset.h                        |    9 -
>  include/linux/cram.h                          |   66 ++
>  include/linux/gfp_types.h                     |   15 +-
>  include/linux/memory-tiers.h                  |    9 +
>  include/linux/memory_hotplug.h                |   11 +
>  include/linux/migrate.h                       |   17 +-
>  include/linux/mm.h                            |   22 +
>  include/linux/mmzone.h                        |   16 +
>  include/linux/node_private.h                  |  532 +++++++++
>  include/linux/nodemask.h                      |    1 +
>  include/trace/events/mmflags.h                |    4 +-
>  include/uapi/linux/mempolicy.h                |    1 +
>  kernel/cgroup/cpuset.c                        |   49 +-
>  mm/Kconfig                                    |   10 +
>  mm/Makefile                                   |    1 +
>  mm/compaction.c                               |   32 +-
>  mm/cram.c                                     |  508 ++++++++
>  mm/damon/paddr.c                              |    3 +
>  mm/huge_memory.c                              |   23 +-
>  mm/hugetlb.c                                  |    2 +-
>  mm/internal.h                                 |  226 +++-
>  mm/khugepaged.c                               |    7 +-
>  mm/ksm.c                                      |    9 +-
>  mm/madvise.c                                  |    5 +-
>  mm/memory-failure.c                           |   15 +
>  mm/memory-tiers.c                             |   46 +-
>  mm/memory.c                                   |   26 +
>  mm/memory_hotplug.c                           |  122 +-
>  mm/mempolicy.c                                |   69 +-
>  mm/migrate.c                                  |   63 +-
>  mm/mlock.c                                    |    5 +-
>  mm/mprotect.c                                 |    4 +-
>  mm/oom_kill.c                                 |   52 +-
>  mm/page_alloc.c                               |   79 +-
>  mm/rmap.c                                     |    4 +-
>  mm/slub.c                                     |    3 +-
>  mm/swap.c                                     |   21 +-
>  mm/vmscan.c                                   |   55 +-
>  54 files changed, 4057 insertions(+), 152 deletions(-)
>  create mode 100644 drivers/cxl/core/region_sysram.c
>  create mode 100644 drivers/cxl/type3_drivers/Kconfig
>  create mode 100644 drivers/cxl/type3_drivers/Makefile
>  create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Kconfig
>  create mode 100644 drivers/cxl/type3_drivers/cxl_compression/Makefile
>  create mode 100644 drivers/cxl/type3_drivers/cxl_compression/compression.c
>  create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Kconfig
>  create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/Makefile
>  create mode 100644 drivers/cxl/type3_drivers/cxl_mempolicy/mempolicy.c
>  create mode 100644 include/linux/cram.h
>  create mode 100644 include/linux/node_private.h
>  create mode 100644 mm/cram.c
> 
> -- 
> 2.53.0
> 
> 

Balbir

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Vernon Yang @ 2026-05-21  5:11 UTC (permalink / raw)
  To: Wei Yang
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260521024654.2a7teoe665porz76@master>

On Thu, May 21, 2026 at 02:46:54AM +0000, Wei Yang wrote:
> On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
> >On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> >> Enable khugepaged to collapse to mTHP orders. This patch implements the
> >> main scanning logic using a bitmap to track occupied pages and a stack
> >> structure that allows us to find optimal collapse sizes.
> >>
> >> Previous to this patch, PMD collapse had 3 main phases, a light weight
> >> scanning phase (mmap_read_lock) that determines a potential PMD
> >> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> >> phase (mmap_write_lock).
> >>
> >> To enabled mTHP collapse we make the following changes:
> >>
> >> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> >> orders are enabled, we remove the restriction of max_ptes_none during the
> >> scan phase to avoid missing potential mTHP collapse candidates. Once we
> >> have scanned the full PMD range and updated the bitmap to track occupied
> >> pages, we use the bitmap to find the optimal mTHP size.
> >>
> >> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> >> and determine the best eligible order for the collapse. A stack structure
> >> is used instead of traditional recursion to manage the search. This also
> >> prevents a traditional recursive approach when the kernel stack struct is
> >> limited. The algorithm recursively splits the bitmap into smaller chunks to
> >> find the highest order mTHPs that satisfy the collapse criteria. We start
> >> by attempting the PMD order, then moved on the consecutively lower orders
> >> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> >> indicating the number of PTEs from the start of the PMD, and the order of
> >> the potential collapse candidate.
> >>
> >> The algorithm for consuming the bitmap works as such:
> >>     1) push (0, HPAGE_PMD_ORDER) onto the stack
> >>     2) pop the stack
> >>     3) check if the number of set bits in that (offset,order) pair
> >>        statisfy the max_ptes_none threshold for that order
> >>     4) if yes, attempt collapse
> >>     5) if no (or collapse fails), push two new stack items representing
> >>        the left and right halves of the current bitmap range, at the
> >>        next lower order
> >>     6) repeat at step (2) until stack is empty.
> >>
> >> Below is a diagram representing the algorithm and stack items:
> >>
> >>                             offset   mid_offset
> >>                             |        |
> >>                             |        |
> >>                             v        v
> >>           ____________________________________
> >>          |          PTE Page Table            |
> >>          --------------------------------------
> >> 			    <-------><------->
> >>                              order-1  order-1
> >>
> >> mTHP collapses reject regions containing swapped out or shared pages.
> >> This is because adding new entries can lead to new none pages, and these
> >> may lead to constant promotion into a higher order mTHP. A similar
> >> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> >> introducing at least 2x the number of pages, and on a future scan will
> >> satisfy the promotion condition once again. This issue is prevented via
> >> the collapse_max_ptes_none() function which imposes the max_ptes_none
> >> restrictions above.
> >>
> >> We currently only support mTHP collapse for max_ptes_none values of 0
> >> and HPAGE_PMD_NR - 1. resulting in the following behavior:
> >>
> >>     - max_ptes_none=0: Never introduce new empty pages during collapse
> >>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
> >>       available mTHP order
> >>
> >> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> >> attempts. There should be no behavior change for PMD collapse.
> >>
> >> Once we determine what mTHP sizes fits best in that PMD range a collapse
> >> is attempted. A minimum collapse order of 2 is used as this is the lowest
> >> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
> >>
> >> Currently madv_collapse is not supported and will only attempt PMD
> >> collapse.
> >>
> >> We can also remove the check for is_khugepaged inside the PMD scan as
> >> the collapse_max_ptes_none() function handles this logic now.
> >>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >> ---
> >>  mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
> >>  1 file changed, 174 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 3492b135d667..39bf7ea8a6e8 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >>
> >>  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >>
> >> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> >> +/*
> >> + * mthp_collapse() does an iterative DFS over a binary tree, from
> >> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> >> + * size needed for a DFS on a binary tree is height + 1, where
> >> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> >> + *
> >> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> >> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> >> + */
> >> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> >> +
> >> +/*
> >> + * Defines a range of PTE entries in a PTE page table which are being
> >> + * considered for mTHP collapse.
> >> + *
> >> + * @offset: the offset of the first PTE entry in a PMD range.
> >> + * @order: the order of the PTE entries being considered for collapse.
> >> + */
> >> +struct mthp_range {
> >> +	u16 offset;
> >> +	u8 order;
> >> +};
> >> +
> >>  struct collapse_control {
> >>  	bool is_khugepaged;
> >>
> >> @@ -111,6 +135,12 @@ struct collapse_control {
> >>
> >>  	/* nodemask for allocation fallback */
> >>  	nodemask_t alloc_nmask;
> >> +
> >> +	/* Each bit represents a single occupied (!none/zero) page. */
> >> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> >> +	/* A mask of the current range being considered for mTHP collapse. */
> >> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> >>  };
> >>
> >>  /**
> >> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> >>  	return result;
> >>  }
> >>
> >> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> >> +				     u16 offset, u8 order)
> >> +{
> >> +	const int size = *stack_size;
> >> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> >> +
> >> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> >> +	stack->order = order;
> >> +	stack->offset = offset;
> >> +	(*stack_size)++;
> >> +}
> >> +
> >> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> >> +						 int *stack_size)
> >> +{
> >> +	const int size = *stack_size;
> >> +
> >> +	VM_WARN_ON_ONCE(size <= 0);
> >> +	(*stack_size)--;
> >> +	return cc->mthp_bitmap_stack[size - 1];
> >> +}
> >> +
> >> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> >> +						u16 offset, unsigned int nr_ptes)
> >> +{
> >> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> >> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> >> +}
> >> +
> >> +/*
> >> + * mthp_collapse() consumes the bitmap that is generated during
> >> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> >> + *
> >> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> >> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> >> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> >> + * variables (offset, order), indicating the number of PTEs from the start of
> >> + * the PMD, and the order of the potential collapse candidate respectively. We
> >> + * start at the PMD order and check if it is eligible for collapse; if not, we
> >> + * add two entries to the stack at a lower order to represent the left and right
> >> + * halves of the PTE page table we are examining.
> >> + *
> >> + *                         offset       mid_offset
> >> + *                         |         |
> >> + *                         |         |
> >> + *                         v         v
> >> + *      --------------------------------------
> >> + *      |          cc->mthp_bitmap            |
> >> + *      --------------------------------------
> >> + *                         <-------><------->
> >> + *                          order-1  order-1
> >> + *
> >> + * For each of these, we determine how many PTE entries are occupied in the
> >> + * range of PTE entries we propose to collapse, then we compare this to a
> >> + * threshold number of PTE entries which would need to be occupied for a
> >> + * collapse to be permitted at that order (accounting for max_ptes_none).
> >> + *
> >> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> >> + * mTHP.
> >> + */
> >> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> >> +		int referenced, int unmapped, struct collapse_control *cc,
> >> +		unsigned long enabled_orders)
> >> +{
> >> +	unsigned int nr_occupied_ptes, nr_ptes;
> >> +	int max_ptes_none, collapsed = 0, stack_size = 0;
> >> +	unsigned long collapse_address;
> >> +	struct mthp_range range;
> >> +	u16 offset;
> >> +	u8 order;
> >> +
> >> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> >> +
> >> +	while (stack_size) {
> >> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> >> +		order = range.order;
> >> +		offset = range.offset;
> >> +		nr_ptes = 1UL << order;
> >> +
> >> +		if (!test_bit(order, &enabled_orders))
> >> +			goto next_order;
> >> +
> >> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> >> +
> >> +		if (max_ptes_none < 0)
> >> +			return collapsed;
> >> +
> >> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> >> +							       nr_ptes);
> >> +
> >> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> >> +			int ret;
> >> +
> >> +			collapse_address = address + offset * PAGE_SIZE;
> >> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> >> +						 unmapped, cc, order);
> >> +			if (ret == SCAN_SUCCEED) {
> >> +				collapsed += nr_ptes;
> >> +				continue;
> >> +			}
> >> +		}
> >> +
> >> +next_order:
> >> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >
> >Hi Nico, thank you very much for your contributions to this series.
> >
> >I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
> >for some reason (e.g. allocate folio), it goes to next_order and
> >continues splitting to the next small order. However, enabled_orders
> >only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
> >without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
> >before exiting. For khugepaged, e.g. setting only 2MB to always, also
> >same phenomenon.
>
> Yes, but it does no actual work since it is checked after pop up.
>
> >
> >This does not affect the overall functionality of mthp collapse, just
> >redundant.
> >
> >The redundant operations can be easily skipped with the following
> >modification. If I miss some thing, please let me know. Thanks!
> >
> >diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >index 1a25af3d6d0f..fa407cce525c 100644
> >--- a/mm/khugepaged.c
> >+++ b/mm/khugepaged.c
> >@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> > 		}
> >
> > next_order:
> >-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
> >+		if ((BIT(order) - 1) & enabled_orders) {
> > 			const u8 next_order = order - 1;
> > 			const u16 mid_offset = offset + (nr_ptes / 2);
> >
>
> This would stop the iteration if there are other lower enabled order, right?
             ^^^^                                  ^^^^^^^^^^^^^^^^^^^

NO :)

For more details, please refer to the following information.

|              Scenario               | Old Behavior (order > 2) | New Behavior ((BIT(order)-1) & enabled_orders) |
|-------------------------------------|--------------------------|------------------------------------------------|
| MADV_COLLAPSE                       | Splits 9,8,7,...,3       | No split                                       |
| khugepaged, only 2MB enabled        | Splits 9,8,7,...,3       | No split                                       |
| khugepaged, only 2MB + 64KB enabled | Splits 9,8,7,...,3       | Splits 9,8,7,...,5                             |
| khugepaged, only 32KB enabled       | Splits 9,8,7,...,3       | Splits 9,8,7,...,4                             |
| khugepaged, only 16KB enabled       | Splits 9,8,7,...,3       | Splits 9,8,7,...,3                             |
| khugepaged, all mTHP enabled        | Splits 9,8,7,...,3       | Splits 9,8,7,...,3                             |

--
Cheers,
Vernon

^ permalink raw reply

* [PATCH] tracing: Do not call map->ops->elt_free() if elt_alloc() is not succeeded
From: Masami Hiramatsu (Google) @ 2026-05-21  4:49 UTC (permalink / raw)
  To: Steven Rostedt, Tom Zanussi
  Cc: linux-trace-kernel, linux-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Rosen Penev
In-Reply-To: <20260520223101.34710-1-rosenp@gmail.com>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

In paths where tracing_map_elt_alloc() failed to allocate objects,
the map->ops->elt_alloc() call was never successful. In this case,
map->ops->elt_free() should not be called.

This bug was found by Sashiko.
Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com

Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
---
 kernel/trace/tracing_map.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index bf1a507695b6..0dd7927df22a 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -386,13 +386,11 @@ static void tracing_map_elt_init_fields(struct tracing_map_elt *elt)
 	}
 }
 
-static void tracing_map_elt_free(struct tracing_map_elt *elt)
+static void __tracing_map_elt_free(struct tracing_map_elt *elt)
 {
 	if (!elt)
 		return;
 
-	if (elt->map->ops && elt->map->ops->elt_free)
-		elt->map->ops->elt_free(elt);
 	kfree(elt->fields);
 	kfree(elt->vars);
 	kfree(elt->var_set);
@@ -400,6 +398,17 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
 	kfree(elt);
 }
 
+static void tracing_map_elt_free(struct tracing_map_elt *elt)
+{
+	if (!elt)
+		return;
+
+	/* Only objects initialized with alloc_elt() should be passed to free_elt().*/
+	if (elt->map->ops && elt->map->ops->elt_free)
+		elt->map->ops->elt_free(elt);
+	__tracing_map_elt_free(elt);
+}
+
 static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 {
 	struct tracing_map_elt *elt;
@@ -444,7 +453,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
 	}
 	return elt;
  free:
-	tracing_map_elt_free(elt);
+	__tracing_map_elt_free(elt);
 
 	return ERR_PTR(err);
 }


^ permalink raw reply related

* Re: [PATCH] trace: allocate fields with elt struct
From: Masami Hiramatsu @ 2026-05-21  4:48 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, open list:TRACING
In-Reply-To: <20260520223101.34710-1-rosenp@gmail.com>

Hi Rosen,

Please check Sashiko's comment:

https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com

One actual bug was found. I'll sent a fix.

Thanks,

On Wed, 20 May 2026 15:31:01 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Use a flexible array member to embed the fields array in the
> tracing_map_elt allocation, reducing the number of allocations
> per element.
> 
> Since the fields are now embedded in the struct, taking the address
> of a field through a const-qualified elt pointer yields a
> const-qualified pointer. Rather than adding casts, switch the
> comparison functions to take const void * parameters. These are
> all read-only operations.
> 
> Assisted-by: OpenCode:BigPickle
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/tracing_map.c | 41 ++++++++++++++++----------------------
>  kernel/trace/tracing_map.h |  8 ++++----
>  2 files changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
> index 1404bf752d99..97f7e3cde262 100644
> --- a/kernel/trace/tracing_map.c
> +++ b/kernel/trace/tracing_map.c
> @@ -125,32 +125,32 @@ u64 tracing_map_read_var_once(struct tracing_map_elt *elt, unsigned int i)
>  	return (u64)atomic64_read(&elt->vars[i]);
>  }
>  
> -int tracing_map_cmp_string(void *val_a, void *val_b)
> +int tracing_map_cmp_string(const void *val_a, const void *val_b)
>  {
> -	char *a = val_a;
> -	char *b = val_b;
> +	const char *a = val_a;
> +	const char *b = val_b;
>  
>  	return strcmp(a, b);
>  }
>  
> -int tracing_map_cmp_none(void *val_a, void *val_b)
> +int tracing_map_cmp_none(const void *val_a, const void *val_b)
>  {
>  	return 0;
>  }
>  
> -static int tracing_map_cmp_atomic64(void *val_a, void *val_b)
> +static int tracing_map_cmp_atomic64(const void *val_a, const void *val_b)
>  {
> -	u64 a = atomic64_read((atomic64_t *)val_a);
> -	u64 b = atomic64_read((atomic64_t *)val_b);
> +	u64 a = atomic64_read((const atomic64_t *)val_a);
> +	u64 b = atomic64_read((const atomic64_t *)val_b);
>  
>  	return (a > b) ? 1 : ((a < b) ? -1 : 0);
>  }
>  
>  #define DEFINE_TRACING_MAP_CMP_FN(type)					\
> -static int tracing_map_cmp_##type(void *val_a, void *val_b)		\
> +static int tracing_map_cmp_##type(const void *val_a, const void *val_b)	\
>  {									\
> -	type a = (type)(*(u64 *)val_a);					\
> -	type b = (type)(*(u64 *)val_b);					\
> +	type a = (type)(*(const u64 *)val_a);				\
> +	type b = (type)(*(const u64 *)val_b);				\
>  									\
>  	return (a > b) ? 1 : ((a < b) ? -1 : 0);			\
>  }
> @@ -385,7 +385,6 @@ static void tracing_map_elt_free(struct tracing_map_elt *elt)
>  
>  	if (elt->map->ops && elt->map->ops->elt_free)
>  		elt->map->ops->elt_free(elt);
> -	kfree(elt->fields);
>  	kfree(elt->vars);
>  	kfree(elt->var_set);
>  	kfree(elt->key);
> @@ -397,7 +396,7 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
>  	struct tracing_map_elt *elt;
>  	int err = 0;
>  
> -	elt = kzalloc_obj(*elt);
> +	elt = kzalloc_flex(*elt, fields, map->n_fields);
>  	if (!elt)
>  		return ERR_PTR(-ENOMEM);
>  
> @@ -409,12 +408,6 @@ static struct tracing_map_elt *tracing_map_elt_alloc(struct tracing_map *map)
>  		goto free;
>  	}
>  
> -	elt->fields = kzalloc_objs(*elt->fields, map->n_fields);
> -	if (!elt->fields) {
> -		err = -ENOMEM;
> -		goto free;
> -	}
> -
>  	elt->vars = kzalloc_objs(*elt->vars, map->n_vars);
>  	if (!elt->vars) {
>  		err = -ENOMEM;
> @@ -848,10 +841,10 @@ static int cmp_entries_sum(const void *A, const void *B)
>  {
>  	const struct tracing_map_elt *elt_a, *elt_b;
>  	const struct tracing_map_sort_entry *a, *b;
> -	struct tracing_map_sort_key *sort_key;
> -	struct tracing_map_field *field;
> +	const struct tracing_map_sort_key *sort_key;
> +	const struct tracing_map_field *field;
>  	tracing_map_cmp_fn_t cmp_fn;
> -	void *val_a, *val_b;
> +	const void *val_a, *val_b;
>  	int ret = 0;
>  
>  	a = *(const struct tracing_map_sort_entry **)A;
> @@ -879,10 +872,10 @@ static int cmp_entries_key(const void *A, const void *B)
>  {
>  	const struct tracing_map_elt *elt_a, *elt_b;
>  	const struct tracing_map_sort_entry *a, *b;
> -	struct tracing_map_sort_key *sort_key;
> -	struct tracing_map_field *field;
> +	const struct tracing_map_sort_key *sort_key;
> +	const struct tracing_map_field *field;
>  	tracing_map_cmp_fn_t cmp_fn;
> -	void *val_a, *val_b;
> +	const void *val_a, *val_b;
>  	int ret = 0;
>  
>  	a = *(const struct tracing_map_sort_entry **)A;
> diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
> index 18a02959d77b..90a7fde5dd02 100644
> --- a/kernel/trace/tracing_map.h
> +++ b/kernel/trace/tracing_map.h
> @@ -13,7 +13,7 @@
>  #define TRACING_MAP_VARS_MAX		16
>  #define TRACING_MAP_SORT_KEYS_MAX	2
>  
> -typedef int (*tracing_map_cmp_fn_t) (void *val_a, void *val_b);
> +typedef int (*tracing_map_cmp_fn_t) (const void *val_a, const void *val_b);
>  
>  /*
>   * This is an overview of the tracing_map data structures and how they
> @@ -137,11 +137,11 @@ struct tracing_map_field {
>  
>  struct tracing_map_elt {
>  	struct tracing_map		*map;
> -	struct tracing_map_field	*fields;
>  	atomic64_t			*vars;
>  	bool				*var_set;
>  	void				*key;
>  	void				*private_data;
> +	struct tracing_map_field	fields[];
>  };
>  
>  struct tracing_map_entry {
> @@ -260,8 +260,8 @@ tracing_map_lookup(struct tracing_map *map, void *key);
>  
>  extern tracing_map_cmp_fn_t tracing_map_cmp_num(int field_size,
>  						int field_is_signed);
> -extern int tracing_map_cmp_string(void *val_a, void *val_b);
> -extern int tracing_map_cmp_none(void *val_a, void *val_b);
> +extern int tracing_map_cmp_string(const void *val_a, const void *val_b);
> +extern int tracing_map_cmp_none(const void *val_a, const void *val_b);
>  
>  extern void tracing_map_update_sum(struct tracing_map_elt *elt,
>  				   unsigned int i, u64 n);
> -- 
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] gpu: host1x: trace: fix string fields in host1x traces
From: Mikko Perttunen @ 2026-05-21  4:33 UTC (permalink / raw)
  To: Steven Rostedt, Thierry Reding
  Cc: Artur Kowalski, Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, linux-tegra, Thierry Reding, David Airlie,
	Simona Vetter
In-Reply-To: <ag2iF9bZJcBQ93lh@orome>

On Wednesday, May 20, 2026 9:03 PM Thierry Reding wrote:
> On Tue, May 19, 2026 at 02:10:59PM -0400, Steven Rostedt wrote:
> > On Tue, 19 May 2026 12:16:43 +0200
> > Artur Kowalski <arturkow2000@gmail.com> wrote:
> > 
> > > Use __assign_str and __get_str as required by tracing subsystem. Fixes
> > > string fields being rejected by the verifier and unreadable from
> > > userspace.
> > 
> > Does anyone use these tracepoints? The fact that they have been broken
> > for 5 years and nobody noticed makes me think they are useless.
> > 
> > I rather remove them than fix them, but if someone thinks that these
> > are still useful then by all means apply this patch.
> > 
> > Acked-by: Steven Rostedt <rostedt@goodmis.org>
> 
> I know that Mikko used them a lot early on, but this driver is pretty
> mature now, so we rarely need this low level of tracing. I'll defer to
> Mikko on whether we still need these.
> 
> Thierry

Yeah, these have been quite useful in the past when debugging why a job
is failing. Without the cdma traces it can be cumbersome to find out
exactly what is being sent to the hardware in some cases.

My preference is for keeping them for now.

Mikko



^ permalink raw reply

* Re: [PATCH v2] ring-buffer: Fix reporting of missed events in iterator
From: Masami Hiramatsu @ 2026-05-21  3:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Linux trace kernel, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260520220801.4fd09d13@fedora>

On Wed, 20 May 2026 22:08:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From c3651ad3ac95b331c7aa010d163704a3702855da Mon Sep 17 00:00:00 2001
> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Wed, 20 May 2026 14:28:17 -0400
> Subject: [PATCH] ring-buffer: Fix reporting of missed events in iterator
> 
> When tracing is active while reading the trace file, if the iterator
> reading the buffer detects that the writer has passed the iterator head,
> it will reset and set a "missed events" flag. This flag is passed to the
> output processing to show the user that events were missed:
> 
>   CPU:4 [LOST EVENTS]
> 
> The problem is that the flag is reset after it is checked in
> ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
> ring buffers and it will check if they are dropped when figuring out which
> buffer to print next. This prematurely clears the missed_events flag if
> the CPU buffer with the missed events is not the one that is printed next.
> 
> On the iteration where the CPU buffer with the missed events is printed,
> the check if it had missed events would return false and the output does
> not show that events were missed.
> 
> Do not reset the missed_events flag when checking if there were missed
> events, but instead clear it when moving the iterator head to the next
> event.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

> Cc: stable@vger.kernel.org
> Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events")
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
> Changes since v1: https://patch.msgid.link/20260520142817.4050abab@gandalf.local.home
> 
> - Added clearing iter->missed_events in rb_iter_reset() (Sashiko)
> 
>  kernel/trace/ring_buffer.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 7288383b1f27..7b07d2004cc6 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -5429,6 +5429,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
>  	iter->head_page = cpu_buffer->reader_page;
>  	iter->head = cpu_buffer->reader_page->read;
>  	iter->next_event = iter->head;
> +	iter->missed_events = 0;
>  
>  	iter->cache_reader_page = iter->head_page;
>  	iter->cache_read = cpu_buffer->read;
> @@ -6108,10 +6109,7 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
>   */
>  bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
>  {
> -	bool ret = iter->missed_events != 0;
> -
> -	iter->missed_events = 0;
> -	return ret;
> +	return iter->missed_events != 0;
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
>  
> @@ -6273,7 +6271,7 @@ void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
>  	unsigned long flags;
>  
>  	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> -
> +	iter->missed_events = 0;
>  	rb_advance_iter(iter);
>  
>  	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> -- 
> 2.53.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH] tracing: Use flexible array for entry fetch code
From: Masami Hiramatsu @ 2026-05-21  3:09 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Kees Cook, Gustavo A. R. Silva,
	open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be|_ptr)?b
In-Reply-To: <20260520215817.16560-1-rosenp@gmail.com>

On Wed, 20 May 2026 14:58:17 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Store probe entry fetch instructions in the probe_entry_arg
> allocation instead of allocating a separate instruction array.
> 
> This keeps the entry fetch code tied to the entry argument lifetime while
> leaving regular probe_arg instruction arrays separately allocated and
> freed.
> 

Thanks, this looks good to me.

> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  kernel/trace/trace_probe.c | 8 +-------
>  kernel/trace/trace_probe.h | 2 +-
>  2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index e0d3a0da26af..39f040c863e8 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -838,15 +838,10 @@ static int __store_entry_arg(struct trace_probe *tp, int argnum)
>  	int i, offset, last_offset = 0;
>  
>  	if (!earg) {
> -		earg = kzalloc_obj(*tp->entry_arg);
> +		earg = kzalloc_flex(*earg, code, 2 * tp->nr_args + 1);
>  		if (!earg)
>  			return -ENOMEM;
>  		earg->size = 2 * tp->nr_args + 1;
> -		earg->code = kzalloc_objs(struct fetch_insn, earg->size);
> -		if (!earg->code) {
> -			kfree(earg);
> -			return -ENOMEM;
> -		}
>  		/* Fill the code buffer with 'end' to simplify it */
>  		for (i = 0; i < earg->size; i++)
>  			earg->code[i].op = FETCH_OP_END;
> @@ -2051,7 +2046,6 @@ void trace_probe_cleanup(struct trace_probe *tp)
>  		traceprobe_free_probe_arg(&tp->args[i]);
>  
>  	if (tp->entry_arg) {
> -		kfree(tp->entry_arg->code);
>  		kfree(tp->entry_arg);
>  		tp->entry_arg = NULL;
>  	}
> diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
> index 262d8707a3df..1076f1df347b 100644
> --- a/kernel/trace/trace_probe.h
> +++ b/kernel/trace/trace_probe.h
> @@ -238,8 +238,8 @@ struct probe_arg {
>  };
>  
>  struct probe_entry_arg {
> -	struct fetch_insn	*code;
>  	unsigned int		size;	/* The entry data size */
> +	struct fetch_insn	code[] __counted_by(size);
>  };
>  
>  struct trace_uprobe_filter {
> -- 
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re:
From: Steven Rostedt @ 2026-05-21  3:00 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
In-Reply-To: <20260520184716.199133347@kernel.org>


This is what happens when I have a typo in the quilt send mail "To:" fields.

[ Yes, I still use quilt to send patch series! ]

-- Steve

^ permalink raw reply

* Re: [PATCH v20 07/10] ring-buffer: Skip invalid sub-buffers for iterator
From: Steven Rostedt @ 2026-05-21  2:58 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: linux-kernel, linux-trace-kernel, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260521115125.a2a1d90c5913bd4d27cb9107@kernel.org>

On Thu, 21 May 2026 11:51:25 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> > Fixes: **TBD** ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>  
> 
> Should we merge this into the original one?
> 
> Anyway, this looks good to me.

I could. I never know how to do this really. That is, I always feel
uncomfortable modifying someone else's patch. But I could add:

  [ SDR: Added skip of invalid sub-buffer for iterator case ]

Above my SOB.

-- Steve

^ permalink raw reply

* Re: [PATCH v5] tracing/eprobes: Allow use of BTF names to dereference pointers
From: Steven Rostedt @ 2026-05-21  2:55 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: sashiko-bot, sashiko-reviews, bpf, LKML, Linux trace kernel
In-Reply-To: <20260521105804.a66cf40d48f796ff66ffface@kernel.org>

On Thu, 21 May 2026 10:58:04 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> On Wed, 20 May 2026 12:48:32 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 20 May 2026 15:20:21 +0900
> > Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> >   
> > > > > > @@ -515,6 +542,10 @@ static void clear_btf_context(struct traceprobe_parse_context *ctx)
> > > > > >  		ctx->params = NULL;
> > > > > >  		ctx->nr_params = 0;
> > > > > >  	}
> > > > > > +	if (ctx->struct_btf) {
> > > > > > +		btf_put(ctx->struct_btf);
> > > > > > +		ctx->last_struct = NULL;      
> > > > > 
> > > > > [Severity: Low]
> > > > > Should ctx->struct_btf be explicitly set to NULL after btf_put() drops
> > > > > the reference?    
> > > > 
> > > > I'm thinking of dropping it in the '(' switch case.    
> > > 
> > > Can you consider making the '(' switch case part as a helper
> > > function because it depends on CONFIG_DEBUG_INFO_BTF?  
> > 
> > Should we just encapsulate that entire case statement with:
> > 
> > #ifdef CONFIG_DEBUG_INFO_BTF
> > [..]
> > #endif  
> 
> Yeah that is possible, and I rather like to make it a separate
> function for simplifying switch-case block for readability.
> 

Hmm, but as a separate function, you mean something like this?

	case '(':
		ret = handle_typecast(...);
		break;

And have;

#ifdef CONFIG_DEBUG_INFO_BTF
static int handle_typecast(...)
{
	[ do the real work here ]
}
#else
static int handle_typecast(...)
{
	trace_probe_log_err(ctx->offset, NOSUP_BTFARG);
	return -EOPNOTSUPP;
}
#endif

  ?

-- Steve

^ permalink raw reply

* Re: [PATCHv2] tracing: simplify pages allocation
From: Masami Hiramatsu @ 2026-05-21  2:55 UTC (permalink / raw)
  To: Rosen Penev
  Cc: linux-trace-kernel, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Kees Cook, Gustavo A. R. Silva,
	open list:TRACING,
	open list:KERNEL HARDENING (not covered by other areas):Keyword:b__counted_by(_le|_be|_ptr)?b
In-Reply-To: <20260520215006.12008-1-rosenp@gmail.com>

On Wed, 20 May 2026 14:50:06 -0700
Rosen Penev <rosenp@gmail.com> wrote:

> Change to a flexible array member to allocate together with the array
> struct.
> 
> Simplifies code slightly by removing no longer correct null checks for
> pages and removing kfrees.

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks!

> 
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  v2: add back kfree(a). Accidentally removed.
>  kernel/trace/tracing_map.c | 30 +++++++++++-------------------
>  kernel/trace/tracing_map.h |  2 +-
>  2 files changed, 12 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
> index c59326ad7a84..97f7e3cde262 100644
> --- a/kernel/trace/tracing_map.c
> +++ b/kernel/trace/tracing_map.c
> @@ -288,9 +288,6 @@ static void tracing_map_array_clear(struct tracing_map_array *a)
>  {
>  	unsigned int i;
> 
> -	if (!a->pages)
> -		return;
> -
>  	for (i = 0; i < a->n_pages; i++)
>  		memset(a->pages[i], 0, PAGE_SIZE);
>  }
> @@ -302,9 +299,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
>  	if (!a)
>  		return;
> 
> -	if (!a->pages)
> -		goto free;
> -
>  	for (i = 0; i < a->n_pages; i++) {
>  		if (!a->pages[i])
>  			break;
> @@ -312,9 +306,6 @@ static void tracing_map_array_free(struct tracing_map_array *a)
>  		free_page((unsigned long)a->pages[i]);
>  	}
> 
> -	kfree(a->pages);
> -
> - free:
>  	kfree(a);
>  }
> 
> @@ -322,24 +313,25 @@ static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts,
>  						  unsigned int entry_size)
>  {
>  	struct tracing_map_array *a;
> +	unsigned int entry_size_shift;
> +	unsigned int entries_per_page;
> +	unsigned int n_pages;
>  	unsigned int i;
> 
> -	a = kzalloc_obj(*a);
> +	entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
> +	entries_per_page = PAGE_SIZE / (1 << entry_size_shift);
> +	n_pages = max(1, n_elts / entries_per_page);
> +
> +	a = kzalloc_flex(*a, pages, n_pages);
>  	if (!a)
>  		return NULL;
> 
> -	a->entry_size_shift = fls(roundup_pow_of_two(entry_size) - 1);
> -	a->entries_per_page = PAGE_SIZE / (1 << a->entry_size_shift);
> -	a->n_pages = n_elts / a->entries_per_page;
> -	if (!a->n_pages)
> -		a->n_pages = 1;
> +	a->entry_size_shift = entry_size_shift;
> +	a->entries_per_page = entries_per_page;
> +	a->n_pages = n_pages;
>  	a->entry_shift = fls(a->entries_per_page) - 1;
>  	a->entry_mask = (1 << a->entry_shift) - 1;
> 
> -	a->pages = kcalloc(a->n_pages, sizeof(void *), GFP_KERNEL);
> -	if (!a->pages)
> -		goto free;
> -
>  	for (i = 0; i < a->n_pages; i++) {
>  		a->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
>  		if (!a->pages[i])
> diff --git a/kernel/trace/tracing_map.h b/kernel/trace/tracing_map.h
> index ed64136782d8..90a7fde5dd02 100644
> --- a/kernel/trace/tracing_map.h
> +++ b/kernel/trace/tracing_map.h
> @@ -167,7 +167,7 @@ struct tracing_map_array {
>  	unsigned int entry_shift;
>  	unsigned int entry_mask;
>  	unsigned int n_pages;
> -	void **pages;
> +	void *pages[] __counted_by(n_pages);
>  };
> 
>  #define TRACING_MAP_ARRAY_ELT(array, idx)				\
> --
> 2.54.0
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v20 07/10] ring-buffer: Skip invalid sub-buffers for iterator
From: Masami Hiramatsu @ 2026-05-21  2:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, Masami Hiramatsu,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Ian Rogers
In-Reply-To: <20260520185018.051228084@kernel.org>

On Wed, 20 May 2026 14:49:45 -0400
Steven Rostedt <rostedt@kernel.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> On bootup if the persistent ring buffer finds an invalid sub-buffer, it
> only invalidates the invalid sub-buffer and continues. Several sub-buffers
> may be invalid and this can cause the iterator to loop more than 3 times
> looking for a new event. If that happens, then it returns NULL. Having a
> NULL return early can confuse the iterator looking for the next event, and
> may show events out of order.
> 
> Have the same logic for the consuming read for the iterator that will
> allow the loop to find the next event to happen the number of sub-buffers
> and not just 3.
> 
> Fixes: **TBD** ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Should we merge this into the original one?

Anyway, this looks good to me.

Thanks,

> ---
>  kernel/trace/ring_buffer.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index c6c2f92bfc24..bda53a2d2159 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -6103,12 +6103,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
>  	struct ring_buffer_per_cpu *cpu_buffer;
>  	struct ring_buffer_event *event;
>  	int nr_loops = 0;
> +	int max_loops;
>  
>  	if (ts)
>  		*ts = 0;
>  
>  	cpu_buffer = iter->cpu_buffer;
>  	buffer = cpu_buffer->buffer;
> +	max_loops = cpu_buffer->ring_meta ? cpu_buffer->nr_pages : 3;
>  
>  	/*
>  	 * Check if someone performed a consuming read to the buffer
> @@ -6131,7 +6133,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
>  	 * the ring buffer with an active write as the consumer is.
>  	 * Do not warn if the three failures is reached.
>  	 */
> -	if (++nr_loops > 3)
> +	if (++nr_loops > max_loops)
>  		return NULL;
>  
>  	if (rb_per_cpu_empty(cpu_buffer))
> -- 
> 2.53.0
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Wei Yang @ 2026-05-21  2:46 UTC (permalink / raw)
  To: Vernon Yang
  Cc: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, akpm, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, liam, ljs, mathieu.desnoyers,
	matthew.brost, mhiramat, mhocko, peterx, pfalcato, rakie.kim,
	raquini, rdunlap, richard.weiyang, rientjes, rostedt, rppt,
	ryan.roberts, shivankg, sunnanyong, surenb, thomas.hellstrom,
	tiwai, usamaarif642, vbabka, vishal.moola, wangkefeng.wang, will,
	willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <8f9834db-8981-4eb1-ae46-94908943da3d@gmail.com>

On Thu, May 21, 2026 at 10:36:15AM +0800, Vernon Yang wrote:
>On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
>> Enable khugepaged to collapse to mTHP orders. This patch implements the
>> main scanning logic using a bitmap to track occupied pages and a stack
>> structure that allows us to find optimal collapse sizes.
>>
>> Previous to this patch, PMD collapse had 3 main phases, a light weight
>> scanning phase (mmap_read_lock) that determines a potential PMD
>> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
>> phase (mmap_write_lock).
>>
>> To enabled mTHP collapse we make the following changes:
>>
>> During PMD scan phase, track occupied pages in a bitmap. When mTHP
>> orders are enabled, we remove the restriction of max_ptes_none during the
>> scan phase to avoid missing potential mTHP collapse candidates. Once we
>> have scanned the full PMD range and updated the bitmap to track occupied
>> pages, we use the bitmap to find the optimal mTHP size.
>>
>> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
>> and determine the best eligible order for the collapse. A stack structure
>> is used instead of traditional recursion to manage the search. This also
>> prevents a traditional recursive approach when the kernel stack struct is
>> limited. The algorithm recursively splits the bitmap into smaller chunks to
>> find the highest order mTHPs that satisfy the collapse criteria. We start
>> by attempting the PMD order, then moved on the consecutively lower orders
>> (mTHP collapse). The stack maintains a pair of variables (offset, order),
>> indicating the number of PTEs from the start of the PMD, and the order of
>> the potential collapse candidate.
>>
>> The algorithm for consuming the bitmap works as such:
>>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>>     2) pop the stack
>>     3) check if the number of set bits in that (offset,order) pair
>>        statisfy the max_ptes_none threshold for that order
>>     4) if yes, attempt collapse
>>     5) if no (or collapse fails), push two new stack items representing
>>        the left and right halves of the current bitmap range, at the
>>        next lower order
>>     6) repeat at step (2) until stack is empty.
>>
>> Below is a diagram representing the algorithm and stack items:
>>
>>                             offset   mid_offset
>>                             |        |
>>                             |        |
>>                             v        v
>>           ____________________________________
>>          |          PTE Page Table            |
>>          --------------------------------------
>> 			    <-------><------->
>>                              order-1  order-1
>>
>> mTHP collapses reject regions containing swapped out or shared pages.
>> This is because adding new entries can lead to new none pages, and these
>> may lead to constant promotion into a higher order mTHP. A similar
>> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
>> introducing at least 2x the number of pages, and on a future scan will
>> satisfy the promotion condition once again. This issue is prevented via
>> the collapse_max_ptes_none() function which imposes the max_ptes_none
>> restrictions above.
>>
>> We currently only support mTHP collapse for max_ptes_none values of 0
>> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>>
>>     - max_ptes_none=0: Never introduce new empty pages during collapse
>>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>>       available mTHP order
>>
>> Any other max_ptes_none value will emit a warning and skip mTHP collapse
>> attempts. There should be no behavior change for PMD collapse.
>>
>> Once we determine what mTHP sizes fits best in that PMD range a collapse
>> is attempted. A minimum collapse order of 2 is used as this is the lowest
>> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>>
>> Currently madv_collapse is not supported and will only attempt PMD
>> collapse.
>>
>> We can also remove the check for is_khugepaged inside the PMD scan as
>> the collapse_max_ptes_none() function handles this logic now.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>  mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 174 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 3492b135d667..39bf7ea8a6e8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>>
>>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>>
>> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
>> +/*
>> + * mthp_collapse() does an iterative DFS over a binary tree, from
>> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
>> + * size needed for a DFS on a binary tree is height + 1, where
>> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
>> + *
>> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
>> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
>> + */
>> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
>> +
>> +/*
>> + * Defines a range of PTE entries in a PTE page table which are being
>> + * considered for mTHP collapse.
>> + *
>> + * @offset: the offset of the first PTE entry in a PMD range.
>> + * @order: the order of the PTE entries being considered for collapse.
>> + */
>> +struct mthp_range {
>> +	u16 offset;
>> +	u8 order;
>> +};
>> +
>>  struct collapse_control {
>>  	bool is_khugepaged;
>>
>> @@ -111,6 +135,12 @@ struct collapse_control {
>>
>>  	/* nodemask for allocation fallback */
>>  	nodemask_t alloc_nmask;
>> +
>> +	/* Each bit represents a single occupied (!none/zero) page. */
>> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
>> +	/* A mask of the current range being considered for mTHP collapse. */
>> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>>  };
>>
>>  /**
>> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>>  	return result;
>>  }
>>
>> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
>> +				     u16 offset, u8 order)
>> +{
>> +	const int size = *stack_size;
>> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
>> +
>> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
>> +	stack->order = order;
>> +	stack->offset = offset;
>> +	(*stack_size)++;
>> +}
>> +
>> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
>> +						 int *stack_size)
>> +{
>> +	const int size = *stack_size;
>> +
>> +	VM_WARN_ON_ONCE(size <= 0);
>> +	(*stack_size)--;
>> +	return cc->mthp_bitmap_stack[size - 1];
>> +}
>> +
>> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
>> +						u16 offset, unsigned int nr_ptes)
>> +{
>> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
>> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
>> +}
>> +
>> +/*
>> + * mthp_collapse() consumes the bitmap that is generated during
>> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
>> + *
>> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
>> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
>> + * of the bitmap for collapse eligibility. The stack maintains a pair of
>> + * variables (offset, order), indicating the number of PTEs from the start of
>> + * the PMD, and the order of the potential collapse candidate respectively. We
>> + * start at the PMD order and check if it is eligible for collapse; if not, we
>> + * add two entries to the stack at a lower order to represent the left and right
>> + * halves of the PTE page table we are examining.
>> + *
>> + *                         offset       mid_offset
>> + *                         |         |
>> + *                         |         |
>> + *                         v         v
>> + *      --------------------------------------
>> + *      |          cc->mthp_bitmap            |
>> + *      --------------------------------------
>> + *                         <-------><------->
>> + *                          order-1  order-1
>> + *
>> + * For each of these, we determine how many PTE entries are occupied in the
>> + * range of PTE entries we propose to collapse, then we compare this to a
>> + * threshold number of PTE entries which would need to be occupied for a
>> + * collapse to be permitted at that order (accounting for max_ptes_none).
>> + *
>> + * If a collapse is permitted, we attempt to collapse the PTE range into a
>> + * mTHP.
>> + */
>> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
>> +		int referenced, int unmapped, struct collapse_control *cc,
>> +		unsigned long enabled_orders)
>> +{
>> +	unsigned int nr_occupied_ptes, nr_ptes;
>> +	int max_ptes_none, collapsed = 0, stack_size = 0;
>> +	unsigned long collapse_address;
>> +	struct mthp_range range;
>> +	u16 offset;
>> +	u8 order;
>> +
>> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>> +
>> +	while (stack_size) {
>> +		range = collapse_mthp_stack_pop(cc, &stack_size);
>> +		order = range.order;
>> +		offset = range.offset;
>> +		nr_ptes = 1UL << order;
>> +
>> +		if (!test_bit(order, &enabled_orders))
>> +			goto next_order;
>> +
>> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
>> +
>> +		if (max_ptes_none < 0)
>> +			return collapsed;
>> +
>> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
>> +							       nr_ptes);
>> +
>> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
>> +			int ret;
>> +
>> +			collapse_address = address + offset * PAGE_SIZE;
>> +			ret = collapse_huge_page(mm, collapse_address, referenced,
>> +						 unmapped, cc, order);
>> +			if (ret == SCAN_SUCCEED) {
>> +				collapsed += nr_ptes;
>> +				continue;
>> +			}
>> +		}
>> +
>> +next_order:
>> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>
>Hi Nico, thank you very much for your contributions to this series.
>
>I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
>for some reason (e.g. allocate folio), it goes to next_order and
>continues splitting to the next small order. However, enabled_orders
>only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
>without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
>before exiting. For khugepaged, e.g. setting only 2MB to always, also
>same phenomenon.

Yes, but it does no actual work since it is checked after pop up.

>
>This does not affect the overall functionality of mthp collapse, just
>redundant.
>
>The redundant operations can be easily skipped with the following
>modification. If I miss some thing, please let me know. Thanks!
>
>diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>index 1a25af3d6d0f..fa407cce525c 100644
>--- a/mm/khugepaged.c
>+++ b/mm/khugepaged.c
>@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> 		}
>
> next_order:
>-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
>+		if ((BIT(order) - 1) & enabled_orders) {
> 			const u8 next_order = order - 1;
> 			const u16 mid_offset = offset + (nr_ptes / 2);
>

This would stop the iteration if there are other lower enabled order, right?

>Cheers,
>Vernon

-- 
Wei Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH v2] tools/bootconfig: Fix buf leaks in apply_xbc
From: Masami Hiramatsu @ 2026-05-21  2:38 UTC (permalink / raw)
  To: lihongtao; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <20260520030126.147782-1-lihongtao@kylinos.cn>

On Wed, 20 May 2026 11:01:26 +0800
lihongtao <lihongtao@kylinos.cn> wrote:

> From: Hongtao Lee <lihongtao@kylinos.cn>
> 
> If data calloc failed, free the buf before return.
> 

Thanks! Let me pick it.

> Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
> Signed-off-by: Hongtao Lee <lihongtao@kylinos.cn>
> ---
> V1 -> V2: Change Email Signed name
>  tools/bootconfig/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
> index 643f707b8f1d..ddabde20585f 100644
> --- a/tools/bootconfig/main.c
> +++ b/tools/bootconfig/main.c
> @@ -390,8 +390,10 @@ static int apply_xbc(const char *path, const char *xbc_path)
>  
>  	/* Backup the bootconfig data */
>  	data = calloc(size + BOOTCONFIG_ALIGN + BOOTCONFIG_FOOTER_SIZE, 1);
> -	if (!data)
> +	if (!data) {
> +		free(buf);
>  		return -ENOMEM;
> +	}
>  	memcpy(data, buf, size);
>  
>  	/* Check the data format */
> -- 
> 2.25.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH mm-unstable v17 11/14] mm/khugepaged: Introduce mTHP collapse support
From: Vernon Yang @ 2026-05-21  2:36 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	akpm, anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, liam, ljs, mathieu.desnoyers, matthew.brost, mhiramat,
	mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260511185817.686831-12-npache@redhat.com>

On Mon, May 11, 2026 at 12:58:11PM -0600, Nico Pache wrote:
> Enable khugepaged to collapse to mTHP orders. This patch implements the
> main scanning logic using a bitmap to track occupied pages and a stack
> structure that allows us to find optimal collapse sizes.
>
> Previous to this patch, PMD collapse had 3 main phases, a light weight
> scanning phase (mmap_read_lock) that determines a potential PMD
> collapse, an alloc phase (mmap unlocked), then finally heavier collapse
> phase (mmap_write_lock).
>
> To enabled mTHP collapse we make the following changes:
>
> During PMD scan phase, track occupied pages in a bitmap. When mTHP
> orders are enabled, we remove the restriction of max_ptes_none during the
> scan phase to avoid missing potential mTHP collapse candidates. Once we
> have scanned the full PMD range and updated the bitmap to track occupied
> pages, we use the bitmap to find the optimal mTHP size.
>
> Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> and determine the best eligible order for the collapse. A stack structure
> is used instead of traditional recursion to manage the search. This also
> prevents a traditional recursive approach when the kernel stack struct is
> limited. The algorithm recursively splits the bitmap into smaller chunks to
> find the highest order mTHPs that satisfy the collapse criteria. We start
> by attempting the PMD order, then moved on the consecutively lower orders
> (mTHP collapse). The stack maintains a pair of variables (offset, order),
> indicating the number of PTEs from the start of the PMD, and the order of
> the potential collapse candidate.
>
> The algorithm for consuming the bitmap works as such:
>     1) push (0, HPAGE_PMD_ORDER) onto the stack
>     2) pop the stack
>     3) check if the number of set bits in that (offset,order) pair
>        statisfy the max_ptes_none threshold for that order
>     4) if yes, attempt collapse
>     5) if no (or collapse fails), push two new stack items representing
>        the left and right halves of the current bitmap range, at the
>        next lower order
>     6) repeat at step (2) until stack is empty.
>
> Below is a diagram representing the algorithm and stack items:
>
>                             offset   mid_offset
>                             |        |
>                             |        |
>                             v        v
>           ____________________________________
>          |          PTE Page Table            |
>          --------------------------------------
> 			    <-------><------->
>                              order-1  order-1
>
> mTHP collapses reject regions containing swapped out or shared pages.
> This is because adding new entries can lead to new none pages, and these
> may lead to constant promotion into a higher order mTHP. A similar
> issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> introducing at least 2x the number of pages, and on a future scan will
> satisfy the promotion condition once again. This issue is prevented via
> the collapse_max_ptes_none() function which imposes the max_ptes_none
> restrictions above.
>
> We currently only support mTHP collapse for max_ptes_none values of 0
> and HPAGE_PMD_NR - 1. resulting in the following behavior:
>
>     - max_ptes_none=0: Never introduce new empty pages during collapse
>     - max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest
>       available mTHP order
>
> Any other max_ptes_none value will emit a warning and skip mTHP collapse
> attempts. There should be no behavior change for PMD collapse.
>
> Once we determine what mTHP sizes fits best in that PMD range a collapse
> is attempted. A minimum collapse order of 2 is used as this is the lowest
> order supported by anon memory as defined by THP_ORDERS_ALL_ANON.
>
> Currently madv_collapse is not supported and will only attempt PMD
> collapse.
>
> We can also remove the check for is_khugepaged inside the PMD scan as
> the collapse_max_ptes_none() function handles this logic now.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 182 +++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 174 insertions(+), 8 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3492b135d667..39bf7ea8a6e8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,6 +100,30 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> +/*
> + * mthp_collapse() does an iterative DFS over a binary tree, from
> + * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> + * size needed for a DFS on a binary tree is height + 1, where
> + * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> + *
> + * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> + * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> + */
> +#define MTHP_STACK_SIZE	(ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> +
> +/*
> + * Defines a range of PTE entries in a PTE page table which are being
> + * considered for mTHP collapse.
> + *
> + * @offset: the offset of the first PTE entry in a PMD range.
> + * @order: the order of the PTE entries being considered for collapse.
> + */
> +struct mthp_range {
> +	u16 offset;
> +	u8 order;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>
> @@ -111,6 +135,12 @@ struct collapse_control {
>
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* Each bit represents a single occupied (!none/zero) page. */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_PTRS_PER_PTE);
> +	/* A mask of the current range being considered for mTHP collapse. */
> +	DECLARE_BITMAP(mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
>  };
>
>  /**
> @@ -1404,20 +1434,140 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
>  	return result;
>  }
>
> +static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> +				     u16 offset, u8 order)
> +{
> +	const int size = *stack_size;
> +	struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> +
> +	VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> +	stack->order = order;
> +	stack->offset = offset;
> +	(*stack_size)++;
> +}
> +
> +static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> +						 int *stack_size)
> +{
> +	const int size = *stack_size;
> +
> +	VM_WARN_ON_ONCE(size <= 0);
> +	(*stack_size)--;
> +	return cc->mthp_bitmap_stack[size - 1];
> +}
> +
> +static unsigned int collapse_mthp_count_present(struct collapse_control *cc,
> +						u16 offset, unsigned int nr_ptes)
> +{
> +	bitmap_zero(cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +	bitmap_set(cc->mthp_bitmap_mask, offset, nr_ptes);
> +	return bitmap_weight_and(cc->mthp_bitmap, cc->mthp_bitmap_mask, MAX_PTRS_PER_PTE);
> +}
> +
> +/*
> + * mthp_collapse() consumes the bitmap that is generated during
> + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> + *
> + * Each bit in cc->mthp_bitmap represents a single occupied (!none/zero) page.
> + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> + * of the bitmap for collapse eligibility. The stack maintains a pair of
> + * variables (offset, order), indicating the number of PTEs from the start of
> + * the PMD, and the order of the potential collapse candidate respectively. We
> + * start at the PMD order and check if it is eligible for collapse; if not, we
> + * add two entries to the stack at a lower order to represent the left and right
> + * halves of the PTE page table we are examining.
> + *
> + *                         offset       mid_offset
> + *                         |         |
> + *                         |         |
> + *                         v         v
> + *      --------------------------------------
> + *      |          cc->mthp_bitmap            |
> + *      --------------------------------------
> + *                         <-------><------->
> + *                          order-1  order-1
> + *
> + * For each of these, we determine how many PTE entries are occupied in the
> + * range of PTE entries we propose to collapse, then we compare this to a
> + * threshold number of PTE entries which would need to be occupied for a
> + * collapse to be permitted at that order (accounting for max_ptes_none).
> + *
> + * If a collapse is permitted, we attempt to collapse the PTE range into a
> + * mTHP.
> + */
> +static int mthp_collapse(struct mm_struct *mm, unsigned long address,
> +		int referenced, int unmapped, struct collapse_control *cc,
> +		unsigned long enabled_orders)
> +{
> +	unsigned int nr_occupied_ptes, nr_ptes;
> +	int max_ptes_none, collapsed = 0, stack_size = 0;
> +	unsigned long collapse_address;
> +	struct mthp_range range;
> +	u16 offset;
> +	u8 order;
> +
> +	collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
> +
> +	while (stack_size) {
> +		range = collapse_mthp_stack_pop(cc, &stack_size);
> +		order = range.order;
> +		offset = range.offset;
> +		nr_ptes = 1UL << order;
> +
> +		if (!test_bit(order, &enabled_orders))
> +			goto next_order;
> +
> +		max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> +
> +		if (max_ptes_none < 0)
> +			return collapsed;
> +
> +		nr_occupied_ptes = collapse_mthp_count_present(cc, offset,
> +							       nr_ptes);
> +
> +		if (nr_occupied_ptes >= nr_ptes - max_ptes_none) {
> +			int ret;
> +
> +			collapse_address = address + offset * PAGE_SIZE;
> +			ret = collapse_huge_page(mm, collapse_address, referenced,
> +						 unmapped, cc, order);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += nr_ptes;
> +				continue;
> +			}
> +		}
> +
> +next_order:
> +		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {

Hi Nico, thank you very much for your contributions to this series.

I found a minor issue, for MADV_COLLAPSE, if collapse_huge_page() fails
for some reason (e.g. allocate folio), it goes to next_order and
continues splitting to the next small order. However, enabled_orders
only supports HPAGE_PMD_ORDER, so it keeps runing the split operations
without any effective work until KHUGEPAGED_MIN_MTHP_ORDER is reached
before exiting. For khugepaged, e.g. setting only 2MB to always, also
same phenomenon.

This does not affect the overall functionality of mthp collapse, just
redundant.

The redundant operations can be easily skipped with the following
modification. If I miss some thing, please let me know. Thanks!

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1a25af3d6d0f..fa407cce525c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1574,7 +1574,7 @@ static int mthp_collapse(struct mm_struct *mm, unsigned long address,
 		}

 next_order:
-		if (order > KHUGEPAGED_MIN_MTHP_ORDER) {
+		if ((BIT(order) - 1) & enabled_orders) {
 			const u8 next_order = order - 1;
 			const u16 mid_offset = offset + (nr_ptes / 2);

--
Cheers,
Vernon

> +			const u8 next_order = order - 1;
> +			const u16 mid_offset = offset + (nr_ptes / 2);
> +
> +			collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> +						 next_order);
> +			collapse_mthp_stack_push(cc, &stack_size, offset,
> +						 next_order);
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long start_addr,
>  		bool *lock_dropped, struct collapse_control *cc)
>  {
> -	const int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
> +	int max_ptes_none = collapse_max_ptes_none(cc, vma, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_shared = collapse_max_ptes_shared(cc, HPAGE_PMD_ORDER);
>  	const unsigned int max_ptes_swap = collapse_max_ptes_swap(cc, HPAGE_PMD_ORDER);
> +	enum tva_type tva_flags = cc->is_khugepaged ? TVA_KHUGEPAGED : TVA_FORCED_COLLAPSE;
>  	pmd_t *pmd;
> -	pte_t *pte, *_pte;
> -	int none_or_zero = 0, shared = 0, referenced = 0;
> +	pte_t *pte, *_pte, pteval;
> +	int i;
> +	int none_or_zero = 0, shared = 0, nr_collapsed = 0, referenced = 0;
>  	enum scan_result result = SCAN_FAIL;
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	unsigned long addr;
> +	unsigned long enabled_orders;
>  	spinlock_t *ptl;
>  	int node = NUMA_NO_NODE, unmapped = 0;
>
> @@ -1429,8 +1579,19 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> +	bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>  	nodes_clear(cc->alloc_nmask);
> +
> +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> +
> +	/*
> +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> +	 * scan all pages to populate the bitmap for mTHP collapse.
> +	 */
> +	if (enabled_orders != BIT(HPAGE_PMD_ORDER))
> +		max_ptes_none = KHUGEPAGED_MAX_PTES_LIMIT;
> +
>  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
>  	if (!pte) {
>  		cc->progress++;
> @@ -1438,11 +1599,13 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  		goto out;
>  	}
>
> -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, addr += PAGE_SIZE) {
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		_pte = pte + i;
> +		addr = start_addr + i * PAGE_SIZE;
> +		pteval = ptep_get(_pte);
> +
>  		cc->progress++;
>
> -		pte_t pteval = ptep_get(_pte);
>  		if (pte_none_or_zero(pteval)) {
>  			if (++none_or_zero > max_ptes_none) {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -1522,6 +1685,8 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  			}
>  		}
>
> +		/* Set bit for occupied pages */
> +		__set_bit(i, cc->mthp_bitmap);
>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to cc->node_load[].
> @@ -1580,10 +1745,11 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>  	if (result == SCAN_SUCCEED) {
>  		/* collapse_huge_page expects the lock to be dropped before calling */
>  		mmap_read_unlock(mm);
> -		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc, HPAGE_PMD_ORDER);
> +		nr_collapsed = mthp_collapse(mm, start_addr, referenced, unmapped,
> +					      cc, enabled_orders);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*lock_dropped = true;
> +		result = nr_collapsed ? SCAN_SUCCEED : SCAN_FAIL;
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.54.0
>
>

^ permalink raw reply related

* [PATCH v4 1/2] tracing: Return ERR_PTR() from expr_str()
From: Pengpeng Hou @ 2026-05-21  2:28 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, linux-trace-kernel, linux-kernel, Pengpeng Hou

expr_str() already has failure cases for invalid recursion depth and
allocation failure, but it currently reports them as a bare NULL. Teach
it to return ERR_PTR()-encoded errors and update parse_unary() and
parse_expr() to propagate those errors.

This keeps the error conversion separate from the string-building change
so the follow-up seq_buf patch can stay focused on the overflow fix
itself.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v3: https://lore.kernel.org/all/20260417223002.2-tracing-expr-v3-pengpeng@iscas.ac.cn/
- split the ERR_PTR() conversion into a separate patch as requested by
  Steven
- use __free(kfree) and return_ptr() in expr_str()
- propagate ERR_PTR() errors from parse_unary() and parse_expr()

 kernel/trace/trace_events_hist.c | 37 +++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0dbbf6cca9bc..0b33bb8ef6f7 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -1769,18 +1769,18 @@ static void expr_field_str(struct hist_field *field, char *expr)
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
-	char *expr;
+	char *expr __free(kfree) = NULL;
 
 	if (level > 1)
-		return NULL;
+		return ERR_PTR(-EINVAL);
 
 	expr = kzalloc(MAX_FILTER_STR_VAL, GFP_KERNEL);
 	if (!expr)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	if (!field->operands[0]) {
 		expr_field_str(field, expr);
-		return expr;
+		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
@@ -1788,16 +1788,15 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 
 		strcat(expr, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
-		if (!subexpr) {
-			kfree(expr);
-			return NULL;
-		}
+		if (IS_ERR(subexpr))
+			return subexpr;
+
 		strcat(expr, subexpr);
 		strcat(expr, ")");
 
 		kfree(subexpr);
 
-		return expr;
+		return_ptr(expr);
 	}
 
 	expr_field_str(field->operands[0], expr);
@@ -1816,13 +1815,12 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 		strcat(expr, "*");
 		break;
 	default:
-		kfree(expr);
-		return NULL;
+		return ERR_PTR(-EINVAL);
 	}
 
 	expr_field_str(field->operands[1], expr);
 
-	return expr;
+	return_ptr(expr);
 }
 
 /*
@@ -2636,6 +2634,11 @@ static struct hist_field *parse_unary(struct hist_trigger_data *hist_data,
 	expr->is_signed = operand1->is_signed;
 	expr->operator = FIELD_OP_UNARY_MINUS;
 	expr->name = expr_str(expr, 0);
+	if (IS_ERR(expr->name)) {
+		ret = PTR_ERR(expr->name);
+		expr->name = NULL;
+		goto free;
+	}
 	expr->type = kstrdup_const(operand1->type, GFP_KERNEL);
 	if (!expr->type) {
 		ret = -ENOMEM;
@@ -2848,6 +2851,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		destroy_hist_field(operand1, 0);
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	} else {
 		/* The operand sizes should be the same, so just pick one */
 		expr->size = operand1->size;
@@ -2861,6 +2869,11 @@ static struct hist_field *parse_expr(struct hist_trigger_data *hist_data,
 		}
 
 		expr->name = expr_str(expr, 0);
+		if (IS_ERR(expr->name)) {
+			ret = PTR_ERR(expr->name);
+			expr->name = NULL;
+			goto free_expr;
+		}
 	}
 
 	return expr;
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* [PATCH v4 2/2] tracing: Bound histogram expression strings with seq_buf
From: Pengpeng Hou @ 2026-05-21  2:28 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, linux-trace-kernel, linux-kernel, Pengpeng Hou
In-Reply-To: <20260521022817.38453-1-pengpeng@iscas.ac.cn>

expr_str() allocates a fixed MAX_FILTER_STR_VAL buffer and then builds
expression names with a series of raw strcat() appends. Nested operands,
constants and field flags can push the rendered string past that fixed
limit before the name is attached to the hist field.

Build the expression strings with seq_buf and return -E2BIG when the
rendered name would exceed MAX_FILTER_STR_VAL.

Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
Changes since v3: https://lore.kernel.org/all/20260417223002.2-tracing-expr-v3-pengpeng@iscas.ac.cn/
- rebase on top of v7.1-rc3
- keep the ERR_PTR() conversion in patch 1
- use seq_buf for expression construction and return -E2BIG on overflow

 kernel/trace/trace_events_hist.c | 62 +++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 24 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 0b33bb8ef6f7..c878dc8f0cb9 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/security.h>
+#include <linux/seq_buf.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/stacktrace.h>
@@ -1743,33 +1744,37 @@ static const char *get_hist_field_flags(struct hist_field *hist_field)
 	return flags_str;
 }
 
-static void expr_field_str(struct hist_field *field, char *expr)
+static bool expr_field_str(struct hist_field *field, struct seq_buf *s)
 {
+	const char *field_name;
+
 	if (field->flags & HIST_FIELD_FL_VAR_REF) {
 		if (!field->system)
-			strcat(expr, "$");
-	} else if (field->flags & HIST_FIELD_FL_CONST) {
-		char str[HIST_CONST_DIGITS_MAX];
+			seq_buf_putc(s, '$');
+	} else if (field->flags & HIST_FIELD_FL_CONST)
+		seq_buf_printf(s, "%llu", field->constant);
 
-		snprintf(str, HIST_CONST_DIGITS_MAX, "%llu", field->constant);
-		strcat(expr, str);
-	}
+	field_name = hist_field_name(field, 0);
+	if (!field_name)
+		return false;
 
-	strcat(expr, hist_field_name(field, 0));
+	seq_buf_puts(s, field_name);
 
 	if (field->flags && !(field->flags & HIST_FIELD_FL_VAR_REF)) {
 		const char *flags_str = get_hist_field_flags(field);
 
-		if (flags_str) {
-			strcat(expr, ".");
-			strcat(expr, flags_str);
-		}
+		if (flags_str)
+			seq_buf_printf(s, ".%s", flags_str);
 	}
+
+	seq_buf_str(s);
+	return !seq_buf_has_overflowed(s);
 }
 
 static char *expr_str(struct hist_field *field, unsigned int level)
 {
 	char *expr __free(kfree) = NULL;
+	struct seq_buf s;
 
 	if (level > 1)
 		return ERR_PTR(-EINVAL);
@@ -1778,47 +1783,56 @@ static char *expr_str(struct hist_field *field, unsigned int level)
 	if (!expr)
 		return ERR_PTR(-ENOMEM);
 
+	seq_buf_init(&s, expr, MAX_FILTER_STR_VAL);
+
 	if (!field->operands[0]) {
-		expr_field_str(field, expr);
+		if (!expr_field_str(field, &s))
+			return ERR_PTR(-E2BIG);
+
 		return_ptr(expr);
 	}
 
 	if (field->operator == FIELD_OP_UNARY_MINUS) {
-		char *subexpr;
+		char *subexpr __free(kfree) = NULL;
 
-		strcat(expr, "-(");
+		seq_buf_puts(&s, "-(");
 		subexpr = expr_str(field->operands[0], ++level);
 		if (IS_ERR(subexpr))
 			return subexpr;
 
-		strcat(expr, subexpr);
-		strcat(expr, ")");
+		seq_buf_puts(&s, subexpr);
+		seq_buf_putc(&s, ')');
+		seq_buf_str(&s);
 
-		kfree(subexpr);
+		if (seq_buf_has_overflowed(&s))
+			return ERR_PTR(-E2BIG);
 
 		return_ptr(expr);
 	}
 
-	expr_field_str(field->operands[0], expr);
+	if (!expr_field_str(field->operands[0], &s))
+		return ERR_PTR(-E2BIG);
 
 	switch (field->operator) {
 	case FIELD_OP_MINUS:
-		strcat(expr, "-");
+		seq_buf_putc(&s, '-');
 		break;
 	case FIELD_OP_PLUS:
-		strcat(expr, "+");
+		seq_buf_putc(&s, '+');
 		break;
 	case FIELD_OP_DIV:
-		strcat(expr, "/");
+		seq_buf_putc(&s, '/');
 		break;
 	case FIELD_OP_MULT:
-		strcat(expr, "*");
+		seq_buf_putc(&s, '*');
 		break;
 	default:
 		return ERR_PTR(-EINVAL);
 	}
 
-	expr_field_str(field->operands[1], expr);
+	if (seq_buf_has_overflowed(&s) ||
+	    !expr_field_str(field->operands[1], &s))
+		return ERR_PTR(-E2BIG);
 
 	return_ptr(expr);
 }
-- 
2.50.1 (Apple Git-155)


^ permalink raw reply related

* Re: [PATCH v6 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-05-21  2:22 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, bvanassche,
	johannes.thumshirn, kch, dlemoal, ritesh.list, loberman, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel
In-Reply-To: <fc307bd1-2c41-4bb1-8a10-b9ffde685d30@oracle.com>

On Mon, May 18, 2026 at 09:14:49AM +0100, John Garry wrote:
> On 17/05/2026 22:36, Aaron Tomlin wrote:
> > In high-performance storage environments, particularly when utilising
> > RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
> > latency spikes can occur when fast devices are starved of available
> > tags.
> > 
> > This patch introduces two new debugfs attributes for each block
> > hardware queue:
> >    - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
> >    - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag
> 
> How would these counters be used? You are just saying that we may have
> performance latency spikes and so here are two new counters.

[ ... ]

> > These files expose atomic counters that increment each time a submitting
> > context is forced into an uninterruptible sleep via io_schedule() due to
> > the complete exhaustion of physical driver tags or software scheduler
> > tags, respectively.
> > 
> > To ensure negligible performance overhead even in production
> > environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
> > tracking logic utilises dynamically allocated per-CPU counters. When
> > this configuration is disabled, the tracking logic compiles down to a
> > safe no-op.
> 
> How does one normalise the values which are measured? I mean, during a
> period of high contention, we may get a bunch of threads waiting for a
> driver tag and the value in wait_on_hw_tag may jump considerably - how do
> you normalize this value in wait_on_hw_tag for meaningful analysis?

Hi John,

Thanks for the review.

Based on feedback from Jens regarding this series [1], I am actually going to
drop Patch 2 entirely in v7. 

To answer your questions in the context of the new tracepoint approach:

Storage engineers can use bpftrace(8) to hook the newly proposed
"block_rq_tag_wait" tracepoint. This allows us to track tag starvation
dynamically, filter it by specific devices, or even group it by the
specific process (comm) experiencing the latency spike.

Moving this to userspace completely solves the normalization problem you
highlighted. For example, using bpftrace, userspace can hook both the tag
starvation event (i.e., block_rq_tag_wait) and the standard block issue
event (i.e., block_rq_issue) over a defined time window (e.g., 5 seconds).

Userspace can then divide the waits by the total issues to get a
meaningful, normalized starvation ratio (e.g., "4% of I/Os were starved of
hardware tags in the last 5 seconds"). This is far more useful for analysis
than a contextless debugfs integer jumping by an arbitrary amount.

Thanks again for taking a look. I'll make sure to Cc you on v7.

[1]: https://lore.kernel.org/lkml/t47wegcgfc43nimo4vqfdedqih43ydfietb7tsaobeitxgdhxs@6lnzvbj5rhab/


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply

* [PATCH v2] ring-buffer: Fix reporting of missed events in iterator
From: Steven Rostedt @ 2026-05-21  2:08 UTC (permalink / raw)
  To: LKML, Linux trace kernel; +Cc: Masami Hiramatsu, Mathieu Desnoyers

From c3651ad3ac95b331c7aa010d163704a3702855da Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 20 May 2026 14:28:17 -0400
Subject: [PATCH] ring-buffer: Fix reporting of missed events in iterator

When tracing is active while reading the trace file, if the iterator
reading the buffer detects that the writer has passed the iterator head,
it will reset and set a "missed events" flag. This flag is passed to the
output processing to show the user that events were missed:

  CPU:4 [LOST EVENTS]

The problem is that the flag is reset after it is checked in
ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
ring buffers and it will check if they are dropped when figuring out which
buffer to print next. This prematurely clears the missed_events flag if
the CPU buffer with the missed events is not the one that is printed next.

On the iteration where the CPU buffer with the missed events is printed,
the check if it had missed events would return false and the output does
not show that events were missed.

Do not reset the missed_events flag when checking if there were missed
events, but instead clear it when moving the iterator head to the next
event.

Cc: stable@vger.kernel.org
Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20260520142817.4050abab@gandalf.local.home

- Added clearing iter->missed_events in rb_iter_reset() (Sashiko)

 kernel/trace/ring_buffer.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7288383b1f27..7b07d2004cc6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5429,6 +5429,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
 	iter->head_page = cpu_buffer->reader_page;
 	iter->head = cpu_buffer->reader_page->read;
 	iter->next_event = iter->head;
+	iter->missed_events = 0;
 
 	iter->cache_reader_page = iter->head_page;
 	iter->cache_read = cpu_buffer->read;
@@ -6108,10 +6109,7 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
  */
 bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
 {
-	bool ret = iter->missed_events != 0;
-
-	iter->missed_events = 0;
-	return ret;
+	return iter->missed_events != 0;
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
 
@@ -6273,7 +6271,7 @@ void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-
+	iter->missed_events = 0;
 	rb_advance_iter(iter);
 
 	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v6 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-05-21  2:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: rostedt, mhiramat, mathieu.desnoyers, bvanassche,
	johannes.thumshirn, kch, dlemoal, ritesh.list, loberman, neelx,
	sean, mproche, chjohnst, linux-block, linux-kernel,
	linux-trace-kernel
In-Reply-To: <189ddfd7-f579-4c86-bcfc-334cf574bdfc@kernel.dk>

On Mon, May 18, 2026 at 07:31:45AM -0600, Jens Axboe wrote:
> Why not just issue the trace points? Then there's close to zero
> overhead, rather than needing to need added counters for this, and the
> kernel to keep track. If you just issue the get/put tag kind of traces,
> then userspace can keep track. That's what blktrace has done for decades
> for things like inflight/queue depth accounting.
> 
> IOW, seems to me, this could be done with basically zero kernel
> additions outside of perhaps a trace point or two.

Hi Jens,

Thanks for taking a look.

You make a completely fair point.
I agree that pushing the accounting to userspace is the right approach,
especially given the proposed hard-coded tracepoint. For example, with
bpftrace(8):

# bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
  Attaching 1 probe...
  ^C
  @tag_waits[4]: 12
  @tag_waits[12]: 87


I will drop Patch 2 from this series, in the next iteration.


Kind regards,
-- 
Aaron Tomlin

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox