Linux Trace Kernel
* Re: [PATCH] Documentation/rv: Replace stale website link
From: Gabriele Monaco @ 2026-04-27  9:57 UTC (permalink / raw)
  To: Jonathan Corbet, rdunlap, Steven Rostedt, linux-trace-kernel,
	linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <875x5crb4g.fsf@trenco.lwn.net>

On Mon, 2026-04-27 at 03:44 -0600, Jonathan Corbet wrote:
> Since, as you say, it can be found online, is there a reason not to
> include a link here?

Mmh, perhaps I was being overly cautious about the link breaking again.

The paper is published, so I assume it will always be available in some
form. It is currently hosted by the university at [1], which seems
unlikely to change, and can be found via DOI at [2], which should never
change (at least that's what I believe a DOI is for) but leads to the
publisher's website rather than the open-access PDF.

I think the reference to the paper I included is robust yet easy to use
with any scientific or even general purpose search engine. But if you
believe using either of the two links is more appropriate, I can send a
V2 with the change.

Thanks,
Gabriele

[1] -
https://www.iris.sssup.it/bitstream/11382/533630/1/Elsevier-JSA-2020.pdf
[2] - https://doi.org/10.1016/j.sysarc.2020.101729



* Re: [PATCH] Documentation: fix spelling mistake "stucture" -> "structure"
From: Jonathan Corbet @ 2026-04-27  9:56 UTC (permalink / raw)
  To: Ninad Naik, rostedt, mhiramat, mathieu.desnoyers, skhan
  Cc: linux-trace-kernel, linux-doc, linux-kernel, me,
	linux-kernel-mentees, Ninad Naik
In-Reply-To: <20260419184527.779828-1-ninadnaik07@gmail.com>

Ninad Naik <ninadnaik07@gmail.com> writes:

> Fixing a spelling mistake in Documentation/trace/histogram-design.rst.
>
> Signed-off-by: Ninad Naik <ninadnaik07@gmail.com>
> ---
>  Documentation/trace/histogram-design.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
> index e92f56ebd0b5..41a726cd3536 100644
> --- a/Documentation/trace/histogram-design.rst
> +++ b/Documentation/trace/histogram-design.rst
> @@ -247,7 +247,7 @@ field's size and offset, is used to grab that subkey's data from the
>  current trace record.
>  
>  Note, the hist field function use to be a function pointer in the
> -hist_field stucture. Due to spectre mitigation, it was converted into
> +hist_field structure. Due to spectre mitigation, it was converted into
>  a fn_num and hist_fn_call() is used to call the associated hist field

Applied, thanks.

jon


* Re: [PATCH] Documentation/rv: Replace stale website link
From: Jonathan Corbet @ 2026-04-27  9:44 UTC (permalink / raw)
  To: Gabriele Monaco, rdunlap, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel, linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <20260427085526.111835-1-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> The sched monitor page was linking to Daniel's website which is now
> down. The main purpose of the link was to point to a source for the
> models from the original author and that can be found also in his
> published paper.
>
> Replace the link with a reference to Daniel's "A thread synchronization
> model for the PREEMPT_RT Linux kernel" which can be found online and
> includes the models definitions as well as the work behind them (not the
> original patches but since they're based on a 5.0 kernel and are mostly
> included upstream, there's little value in keeping them in the docs).
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>  Documentation/trace/rv/monitor_sched.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
> index 0b96d6e147c6..661171bd7c5e 100644
> --- a/Documentation/trace/rv/monitor_sched.rst
> +++ b/Documentation/trace/rv/monitor_sched.rst
> @@ -365,4 +365,4 @@ constraints when processing the events::
>  References
>  ----------
>  
> -[1] - https://bristot.me/linux-task-model
> +[1] - Daniel Bristot de Oliveira et al.: A thread synchronization model for the PREEMPT_RT Linux kernel, J. Syst. Archit., 2020.

Since, as you say, it can be found online, is there a reason not to
include a link here?

jon


* Re: [PATCH] tools/rv: harden monitor name lookup bounds checks
From: Gabriele Monaco @ 2026-04-27  9:38 UTC (permalink / raw)
  To: unknownbbqrx; +Cc: rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <69972ccf-31ee-4906-9907-0ead76bd60b9@smtp-relay.sendinblue.com>

On Thu, 2026-04-23 at 17:44 +0300, unknownbbqrx wrote:
> 
> Bound monitor-name derived copies in __ikm_find_monitor_name() and
> avoid unbounded writes from sprintf()/memcpy().
> 
> Pass the output buffer size from the caller, validate extracted line
> length from rv/available_monitors, and use snprintf() with truncation
> checks when building container monitor names.
> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>

Hi,

Thanks for the fix. However, __ikm_find_monitor_name() is already a bit
sloppy (strstr() accepts any substring as a valid monitor), so I have a
patch to refactor it, which I'm about to send.
That will make your fix obsolete, so I'm likely not going to take this
patch.

Thanks anyway,
Gabriele

> ---
>  tools/verification/rv/src/in_kernel.c | 34 +++++++++++++++++++++----
> --
>  1 file changed, 27 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/verification/rv/src/in_kernel.c
> b/tools/verification/rv/src/in_kernel.c
> index d32453824..f17eac9b6 100644
> --- a/tools/verification/rv/src/in_kernel.c
> +++ b/tools/verification/rv/src/in_kernel.c
> @@ -56,9 +56,12 @@ static int __ikm_read_enable(char *monitor_name)
>   * The string out_name is populated with the full name, which can be
>   * equal to monitor_name or container/monitor_name if nested
>   */
> -static int __ikm_find_monitor_name(char *monitor_name, char
> *out_name)
> +static int __ikm_find_monitor_name(char *monitor_name, char
> *out_name,
> +				  size_t out_name_size)
>  {
> -	char *available_monitors, container[MAX_DA_NAME_LEN+1],
> *cursor, *end;
> +	char *available_monitors, container[MAX_DA_NAME_LEN + 2],
> *cursor, *end;
> +	size_t len;
> +	int n;
>  	int retval = 1;
>  
>  	available_monitors = tracefs_instance_file_read(NULL,
> "rv/available_monitors", NULL);
> @@ -72,17 +75,34 @@ static int __ikm_find_monitor_name(char
> *monitor_name, char *out_name)
>  	}
>  
>  	for (; cursor > available_monitors; cursor--)
> -		if (*(cursor-1) == '\n')
> +		if (*(cursor - 1) == '\n')
>  			break;
> +
>  	end = strstr(cursor, "\n");
> -	memcpy(out_name, cursor, end-cursor);
> -	out_name[end-cursor] = '\0';
> +	if (!end) {
> +		retval = -1;
> +		goto out_free;
> +	}
> +
> +	len = end - cursor;
> +	if (len >= out_name_size) {
> +		retval = -1;
> +		goto out_free;
> +	}
> +
> +	memcpy(out_name, cursor, len);
> +	out_name[len] = '\0';
>  
>  	cursor = strstr(out_name, ":");
>  	if (cursor)
>  		*cursor = '/';
>  	else {
> -		sprintf(container, "%s:", monitor_name);
> +		n = snprintf(container, sizeof(container), "%s:",
> monitor_name);
> +		if (n < 0 || (size_t)n >= sizeof(container)) {
> +			retval = -1;
> +			goto out_free;
> +		}
> +
>  		if (strstr(available_monitors, container))
>  			config_is_container = 1;
>  	}
> @@ -782,7 +802,7 @@ int ikm_run_monitor(char *monitor_name, int argc,
> char **argv)
>  	else
>  		nested_name = monitor_name;
>  
> -	retval = __ikm_find_monitor_name(monitor_name, full_name);
> +	retval = __ikm_find_monitor_name(monitor_name, full_name,
> sizeof(full_name));
>  	if (!retval)
>  		return 0;
>  	if (retval < 0) {
> 
> base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
> prerequisite-patch-id: b61dd51dee390277603975bf729a687113185c3a
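
To illustrate the substring problem mentioned in the reply above, here is a
minimal sketch of an exact-match lookup. It is not the refactor announced in
this thread and not the tools/rv code: find_monitor_exact() is a hypothetical
helper, and the list format (one "monitor" or "container:monitor" entry per
line) is an assumption taken from the comment in the quoted patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not the tools/rv implementation: look up `name` in a
 * newline-separated list whose entries are either "monitor" or
 * "container:monitor". On an exact match of the monitor part, copy the
 * entry to `out` with ':' replaced by '/'.
 * Returns 1 on match, 0 if not found, -1 if `out` is too small.
 */
static int find_monitor_exact(const char *list, const char *name,
			      char *out, size_t out_size)
{
	size_t name_len = strlen(name);
	const char *line = list;

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);
		const char *colon = memchr(line, ':', len);
		/* The monitor part starts after the container prefix, if any. */
		const char *mon = colon ? colon + 1 : line;
		size_t mon_len = len - (size_t)(mon - line);

		if (mon_len == name_len && !memcmp(mon, name, name_len)) {
			if (len >= out_size)
				return -1;
			memcpy(out, line, len);
			out[len] = '\0';
			if (colon)
				out[colon - line] = '/';
			return 1;
		}
		line = end ? end + 1 : line + len;
	}
	return 0;
}
```

Comparing whole entries means a partial name such as "sc" no longer matches
an entry such as "sched:sco", which a plain strstr() over the file contents
would accept.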



* Re: [PATCH 1/1] tools/rv: ensure monitor name and desc are NUL-terminated
From: Gabriele Monaco @ 2026-04-27  9:32 UTC (permalink / raw)
  To: unknownbbqrx; +Cc: rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <dc9ea036-de62-4e1f-be63-8e14d675bcca@smtp-relay.sendinblue.com>

On Thu, 2026-04-23 at 17:19 +0300, unknownbbqrx wrote:
> 
> ikm_fill_monitor_definition() copies monitor name and description
> with
> strncpy(), but does not guarantee NUL termination when source strings
> are
> equal to or longer than the destination buffers.
> 
> Clamp copies to sizeof(dst) - 1 and explicitly append '\0' for both
> fields
> to keep them safe for later string operations.

Hi,

thanks for the fix!
Looks good to me.

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Fixes: 6d60f89691fc9 ("tools/rv: Add in-kernel monitor interface")

On a side note, you sent 2 patches and apparently sent them both twice
(did you issue git send-email twice? They seem equivalent to me). Next
time you could merge them into the same series by preparing them in the
same branch and passing them all to git format-patch/send-email [1].
In general you'd also add a cover letter, which can be very simple in
this case.
That's usually tidier and easier to apply for maintainers/reviewers.
(You can ignore it this time)

Also add the Fixes: tag if you're fixing something (e.g. a potential
buffer overflow in this case). I did it for you now, but you can find
the commit you're fixing using git blame.

[1] -
https://www.kernel.org/doc/html/latest/process/submitting-patches.html

> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>
> ---
>  tools/verification/rv/src/in_kernel.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/verification/rv/src/in_kernel.c
> b/tools/verification/rv/src/in_kernel.c
> index 4bb746ea6..d32453824 100644
> --- a/tools/verification/rv/src/in_kernel.c
> +++ b/tools/verification/rv/src/in_kernel.c
> @@ -215,10 +215,11 @@ static int ikm_fill_monitor_definition(char
> *name, struct monitor *ikm, char *co
>  		return -1;
>  	}
>  
> -	strncpy(ikm->name, nested_name, MAX_DA_NAME_LEN);
> +	strncpy(ikm->name, nested_name, sizeof(ikm->name) - 1);
> +	ikm->name[sizeof(ikm->name) - 1] = '\0';
>  	ikm->enabled = enabled;
> -	strncpy(ikm->desc, desc, MAX_DESCRIPTION);
> -
> +	strncpy(ikm->desc, desc, sizeof(ikm->desc) - 1);
> +	ikm->desc[sizeof(ikm->desc) - 1] = '\0';
>  	free(desc);
>  
>  	return 0;
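
The pattern used by the patch above can be shown in isolation. strncpy()
does not write a terminating NUL when the source is at least as long as the
given count, so the copy is clamped to one byte less than the buffer and the
last byte is set explicitly. copy_terminated() is a hypothetical name used
only for this sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst (dst_size bytes, must be > 0), truncating if needed and
 * always leaving dst NUL-terminated. Hypothetical helper for illustration.
 */
static void copy_terminated(char *dst, size_t dst_size, const char *src)
{
	/* Copy at most dst_size - 1 bytes so the last byte stays free. */
	strncpy(dst, src, dst_size - 1);
	/* strncpy() adds no NUL on truncation, so terminate explicitly. */
	dst[dst_size - 1] = '\0';
}
```

With a 4-byte buffer, copying "abcdef" leaves "abc" in the buffer, and later
strlen() or printf("%s") calls stay within bounds.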



* [PATCH] Documentation/rv: Replace stale website link
From: Gabriele Monaco @ 2026-04-27  8:55 UTC (permalink / raw)
  To: rdunlap, Steven Rostedt, Gabriele Monaco, Jonathan Corbet,
	linux-trace-kernel, linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <b845c448-1655-4860-9b6d-93d6f8426740@infradead.org>

The sched monitor page was linking to Daniel's website which is now
down. The main purpose of the link was to point to a source for the
models from the original author and that can be found also in his
published paper.

Replace the link with a reference to Daniel's "A thread synchronization
model for the PREEMPT_RT Linux kernel" which can be found online and
includes the models definitions as well as the work behind them (not the
original patches but since they're based on a 5.0 kernel and are mostly
included upstream, there's little value in keeping them in the docs).

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 Documentation/trace/rv/monitor_sched.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
index 0b96d6e147c6..661171bd7c5e 100644
--- a/Documentation/trace/rv/monitor_sched.rst
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -365,4 +365,4 @@ constraints when processing the events::
 References
 ----------
 
-[1] - https://bristot.me/linux-task-model
+[1] - Daniel Bristot de Oliveira et al.: A thread synchronization model for the PREEMPT_RT Linux kernel, J. Syst. Archit., 2020.

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.53.0



* [PATCH] kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
From: Jianpeng Chang @ 2026-04-27  7:35 UTC (permalink / raw)
  To: naveen, davem, mhiramat, catalin.marinas, mark.rutland
  Cc: linux-kernel, linux-trace-kernel, stable, Jianpeng Chang

When kprobe_add_area_blacklist() iterates through a section like
.kprobes.text, the start address may not correspond to a named symbol.
On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y (introduced by
commit baaf553d3bc3 ("arm64: Implement
HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag
-fpatchable-function-entry=4,2 inserts 2 NOPs before each function entry
point for ftrace call_ops. These pre-function NOPs sit at the section base
address, before the first named function symbol. The compiler emits a $x
mapping symbol at offset 0x00 to mark the start of code, but
find_kallsyms_symbol() ignores mapping symbols.

Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no
pre-function NOPs are inserted, the first function starts at offset
0x00, and the bug does not trigger.

This only affects modules that have a .kprobes.text section (i.e. those
using the __kprobes annotation). Modules using NOKPROBE_SYMBOL() instead
(like kretprobe_example.ko) blacklist exact function addresses via the
_kprobe_blacklist section and are not affected.

For kprobe_example.ko on ARM64 with -fpatchable-function-entry=4,2,
the .kprobes.text section layout is:

  offset 0x00: $x + 2 NOPs    (mapping symbol + ftrace preamble)
  offset 0x08: handler_post   (64 bytes)
  offset 0x50: handler_pre    (68 bytes)

kprobe_add_area_blacklist() starts iterating from the section base
address (offset 0x00), which only has the $x mapping symbol.
kprobe_add_ksym_blacklist() then calls kallsyms_lookup_size_offset()
for this address, which goes through:

  kallsyms_lookup_size_offset()
    -> module_address_lookup()
      -> find_kallsyms_symbol()

find_kallsyms_symbol() scans all module symbols to find the closest
preceding symbol.

Since no named text symbol exists at offset 0x00,
find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo symbol
whose address is in the temporary image) as the "best" match. The
computed "size" = next_text_symbol - modinfo_symbol spans across
these two unrelated memory regions, creating a blacklist entry with
a bogus range of tens of terabytes.

Whether this causes a visible failure depends on address randomization.
Here is what happens on Raspberry Pi 4/5:

  - On RPi5, the bogus size was ~35 TB. start + size stayed within
    64-bit range, so the blacklist entry covered the entire kernel
    text. register_kprobe() in the module's own init function failed
    with -EINVAL.

  - On RPi4, the bogus size was ~75 TB. start + size overflowed
    64 bits and wrapped to a small address near zero. The range
    check (addr >= start && addr < end) then failed because end
    wrapped around, so the bogus entry was accidentally harmless
    and kprobes worked by luck.

The same bug exists on both machines, but randomization determines whether
the integer overflow masks it or not.
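
The two outcomes can be reproduced with plain 64-bit arithmetic. This is a
user-space sketch under assumed values, not kernel code: range_covers() is a
hypothetical name, and the addresses in the note below are made up to show
the effect, not taken from either board.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range test in the form quoted above: addr >= start && addr < end. */
static bool range_covers(uint64_t start, uint64_t size, uint64_t addr)
{
	uint64_t end = start + size;	/* unsigned addition wraps modulo 2^64 */

	return addr >= start && addr < end;
}
```

With an assumed start of 0xffffd00000000000 and a probe address of
0xffffe00000000000, a 35 TB size gives end = 0xfffff30000000000 and the
probe address is covered. A 75 TB size wraps end around to
0x00001b0000000000, so the same probe address is not covered and the bogus
entry has no effect.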

Fix this by checking the offset returned by kallsyms_lookup_size_offset().
A non-zero offset means the address is not at a symbol boundary, so skip
forward to the next symbol instead of creating a blacklist entry with a
wrong size.
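
The effect of skipping forward can be modelled in user space. Everything in
this sketch is an assumption made for illustration: the symbol table, the
toy lookup_size_offset() (closest preceding symbol, size taken as the
distance to the next symbol) and blacklist_area() are hypothetical and only
mimic the shape of the kernel interfaces. The offsets follow the
.kprobes.text layout listed earlier, with an assumed section base of 0x1000.

```c
#include <stddef.h>

/* Hypothetical symbol start addresses, sorted; the last one is an end marker. */
static const unsigned long sym_start[] = { 0x0f00, 0x1008, 0x1050, 0x1094 };
#define NR_SYMS (sizeof(sym_start) / sizeof(sym_start[0]))

struct range { unsigned long start, end; };

/* Toy lookup: closest preceding symbol, size = distance to the next one. */
static int lookup_size_offset(unsigned long addr, unsigned long *size,
			      unsigned long *offset)
{
	size_t i;

	for (i = 0; i + 1 < NR_SYMS; i++) {
		if (addr >= sym_start[i] && addr < sym_start[i + 1]) {
			*size = sym_start[i + 1] - sym_start[i];
			*offset = addr - sym_start[i];
			return 1;
		}
	}
	return 0;
}

/*
 * Walk [start, end) and record one range per symbol. An address that is not
 * on a symbol boundary is skipped by advancing to the next symbol.
 * Returns the number of ranges written to out, or -1 if a lookup fails or
 * out is full.
 */
static int blacklist_area(unsigned long start, unsigned long end,
			  struct range *out, int max)
{
	unsigned long entry = start, size, offset;
	int n = 0;

	while (entry < end) {
		if (!lookup_size_offset(entry, &size, &offset))
			return -1;
		if (offset) {
			entry += size - offset;	/* skip to the next symbol */
			continue;
		}
		if (n == max)
			return -1;
		out[n].start = entry;
		out[n].end = entry + size;
		n++;
		entry += size;
	}
	return n;
}
```

Starting at 0x1000, the first lookup resolves to the unrelated symbol at
0x0f00 with a non-zero offset, so the walk advances 8 bytes to 0x1008 and
records ranges only for the two real symbols.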

Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
---
Hi,

This patch skips non-symbol addresses, which fixes the bogus blacklist
entry but leaves the NOP gap at the start of .kprobes.text unblacklisted.

We could instead keep allocating the entry rather than returning early, so
that the gap itself is added to the blacklist, or do some more work to fold
the gap into the first symbol's blacklist entry. I'm not sure whether this
is necessary, or whether there is a better way.

Thanks,
Jianpeng

 kernel/kprobes.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index bfc89083daa9..be700fb03198 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2503,6 +2503,10 @@ int kprobe_add_ksym_blacklist(unsigned long entry)
 	    !kallsyms_lookup_size_offset(entry, &size, &offset))
 		return -EINVAL;
 
+	/* Not on a symbol boundary -- skip to the next symbol */
+	if (offset)
+		return (int)(size - offset);
+
 	ent = kmalloc_obj(*ent);
 	if (!ent)
 		return -ENOMEM;
-- 
2.54.0



* Re: [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: SUVONOV BUNYOD @ 2026-04-27  6:21 UTC (permalink / raw)
  To: akpm, vbabka, linux-mm
  Cc: rostedt, mhiramat, mathieu desnoyers, linux-trace-kernel,
	linux-kernel, surenb, mhocko, jackmanb, hannes, ziy, david,
	vishal moola, corbet, skhan, linux-doc
In-Reply-To: <20260427060142.131055-1-b.suvonov@sjtu.edu.cn>

Thank you for reviewing v1, Vishal.

All of your concerns except for the last one should be covered in v2.

> If you're trying to trace all pages as they come onto the pcp lists,
> should you also account for the free_frozen_page_commit() path?
>
>>          }
>>          spin_unlock_irqrestore(&zone->lock, flags);

No, the intent is not to trace every insertion into PCP lists. This patch
is trying to make buddy <-> PCP traffic observable by adding a new
mm_page_pcpu_refill event symmetric with the existing mm_page_pcpu_drain
event.

I also added the zone_locked tracepoints because my research focuses on
analyzing which kernel mm subsystems and other parts are under stress for
a given workload. The best way to see this for PCP is to count zone lock
acquisitions, since the whole purpose of PCP is to reduce the number of
zone lock acquisitions in the first place.


* [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: Bunyod Suvonov @ 2026-04-27  6:01 UTC (permalink / raw)
  To: akpm, vbabka, linux-mm
  Cc: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel,
	linux-kernel, surenb, mhocko, jackmanb, hannes, ziy, david,
	vishal.moola, corbet, skhan, linux-doc, Bunyod Suvonov
In-Reply-To: <20260425091335.346504-1-b.suvonov@sjtu.edu.cn>

mm_page_pcpu_drain traces page blocks drained from the per-cpu page
lists back to the buddy allocator. There is no matching tracepoint for
the opposite direction, where rmqueue_bulk() refills a PCP list from the
buddy allocator.

Add mm_page_pcpu_refill as the counterpart to mm_page_pcpu_drain. The
pair makes PCP traffic observable in both directions: refill shows page
blocks moving from the buddy allocator into PCP lists, while drain shows
page blocks moving from PCP lists back to the buddy allocator. Comparing
the two helps identify PCP churn, imbalance between CPUs, and cases where
pages repeatedly cycle between PCP lists and the buddy allocator instead
of being served efficiently from PCP.

PCP refill and drain activity can also require entering the buddy
allocator under zone->lock. The per-page-block refill and drain events do
not directly count those lock acquisitions, because a single bulk
operation can move multiple page blocks.

Add mm_page_pcpu_refill_zone_locked and
mm_page_pcpu_drain_zone_locked to trace successful PCP bulk operations
after acquiring the zone lock. These events make it possible to count how
often PCP refill and drain paths enter the zone-locked buddy allocator.
Frequent events can indicate that PCP lists are under pressure and are
not avoiding the zone lock as effectively as expected.
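
The relationship between the two kinds of events can be sketched with a toy
counter model. This is not kernel code: struct pcp_counts and model_refill()
are hypothetical, and the `available` parameter stands in for the buddy
allocator running out of blocks part-way through a bulk operation.

```c
struct pcp_counts {
	unsigned long zone_locked_events;	/* one per bulk operation */
	unsigned long refill_events;		/* one per page block moved */
	unsigned long nr_pages_target;		/* sum of count << order */
};

/*
 * Toy model of one bulk refill: `count` blocks of 2^order pages are
 * requested under a single lock acquisition, but only `available` blocks
 * can be taken from the buddy allocator.
 */
static void model_refill(struct pcp_counts *c, unsigned int order,
			 unsigned long count, unsigned long available)
{
	unsigned long i;

	c->zone_locked_events++;
	c->nr_pages_target += count << order;
	for (i = 0; i < count && i < available; i++)
		c->refill_events++;
}
```

Two bulk operations, one moving 8 order-0 blocks and one moving 3 of 4
requested order-2 blocks, produce 2 zone_locked events but 11 refill events.
Only the zone_locked count reflects how often the lock was taken.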

mm_page_alloc_zone_locked is not a reliable substitute for PCP refill
activity. It is emitted from __rmqueue_smallest(), which is reached with
zone->lock already held by both rmqueue_bulk() and the direct buddy
allocation path. Its percpu_refill field is derived from the allocation
order and migratetype, so it does not reliably identify whether the
allocation came from a PCP refill.

Document the new kmem tracepoints.

Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
---
Changes since v1:
- Add mm_page_pcpu_refill as the per-page-block counterpart to
  mm_page_pcpu_drain.
- Add mm_page_pcpu_refill_zone_locked and
  mm_page_pcpu_drain_zone_locked to count PCP bulk operations that
  acquired zone->lock.
- Document the new kmem tracepoints and clarify the PCP refill/drain
  semantics.

 Documentation/trace/events-kmem.rst | 65 ++++++++++++++++++-----------
 include/trace/events/kmem.h         | 58 +++++++++++++++++++++++--
 mm/page_alloc.c                     |  5 +++
 3 files changed, 100 insertions(+), 28 deletions(-)

diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 68fa75247488..9f935db1ea88 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -75,30 +75,47 @@ contention on the lruvec->lru_lock.
 =============================
 ::
 
-  mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
-  mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
-
-In front of the page allocator is a per-cpu page allocator. It exists only
-for order-0 pages, reduces contention on the zone->lock and reduces the
-amount of writing on struct page.
-
-When a per-CPU list is empty or pages of the wrong type are allocated,
-the zone->lock will be taken once and the per-CPU list refilled. The event
-triggered is mm_page_alloc_zone_locked for each page allocated with the
-event indicating whether it is for a percpu_refill or not.
-
-When the per-CPU list is too full, a number of pages are freed, each one
-which triggers a mm_page_pcpu_drain event.
-
-The individual nature of the events is so that pages can be tracked
-between allocation and freeing. A number of drain or refill pages that occur
-consecutively imply the zone->lock being taken once. Large amounts of per-CPU
-refills and drains could imply an imbalance between CPUs where too much work
-is being concentrated in one place. It could also indicate that the per-CPU
-lists should be a larger size. Finally, large amounts of refills on one CPU
-and drains on another could be a factor in causing large amounts of cache
-line bounces due to writes between CPUs and worth investigating if pages
-can be allocated and freed on the same CPU through some algorithm change.
+  mm_page_alloc_zone_locked	page=%p pfn=0x%lx order=%u migratetype=%d percpu_refill=%d
+  mm_page_pcpu_refill		page=%p pfn=0x%lx order=%d migratetype=%d
+  mm_page_pcpu_drain		page=%p pfn=0x%lx order=%d migratetype=%d
+  mm_page_pcpu_refill_zone_locked nid=%d zid=%d nr_pages=%lu
+  mm_page_pcpu_drain_zone_locked  nid=%d zid=%d nr_pages=%lu
+
+In front of the buddy allocator are per-cpu page lists. They reduce
+contention on the zone->lock and reduce the amount of writing on struct
+page.
+
+When an allocation finds the target per-CPU list empty, the zone->lock may
+be taken once and the per-CPU list refilled from the buddy allocator. The
+mm_page_pcpu_refill_zone_locked event is emitted once after the refill path
+successfully acquires the zone lock. The mm_page_pcpu_refill event is
+emitted for each page block added to the per-CPU list.
+
+When per-CPU pages are drained back to the buddy allocator, for example
+because a per-CPU list is above its high mark, PCP high is decayed, or an
+explicit drain is requested, the drain path takes the zone lock. The
+mm_page_pcpu_drain_zone_locked event is emitted once after the drain path
+successfully acquires the zone lock. The mm_page_pcpu_drain event is emitted
+for each page block drained from the per-CPU list.
+
+The individual refill and drain events allow pages to be tracked between
+allocation and freeing. The zone_locked events allow the bulk operations to
+be counted directly. A single zone_locked event may be followed by multiple
+refill or drain events, depending on how many page blocks are moved while
+holding the zone lock. The nr_pages field in the zone_locked events is the
+target number of base pages for the bulk operation when the zone lock is
+acquired. The individual refill or drain events describe the page blocks
+actually moved.
+
+Large amounts of per-CPU refills and drains could imply an imbalance between
+CPUs where too much work is being concentrated in one place. Frequent
+zone_locked events can indicate that the per-CPU lists are under pressure
+and are not avoiding the zone lock as effectively as expected. It could also
+indicate that the per-CPU lists should be a larger size. Finally, large
+amounts of refills on one CPU and drains on another could be a factor in
+causing large amounts of cache line bounces due to writes between CPUs and
+worth investigating if pages can be allocated and freed on the same CPU
+through some algorithm change.
 
 5. External Fragmentation
 =========================
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..68f5d4a84da6 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -243,16 +243,52 @@ DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
 	TP_ARGS(page, order, migratetype, percpu_refill)
 );
 
-TRACE_EVENT(mm_page_pcpu_drain,
+DECLARE_EVENT_CLASS(mm_page_pcpu_zone_locked,
+
+	TP_PROTO(int nid, int zid, unsigned long nr_pages),
+
+	TP_ARGS(nid, zid, nr_pages),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, zid)
+		__field(unsigned long, nr_pages)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->nr_pages	= nr_pages;
+	),
+
+	TP_printk("nid=%d zid=%d nr_pages=%lu",
+		__entry->nid, __entry->zid, __entry->nr_pages)
+);
+
+DEFINE_EVENT(mm_page_pcpu_zone_locked, mm_page_pcpu_refill_zone_locked,
+
+	TP_PROTO(int nid, int zid, unsigned long nr_pages),
+
+	TP_ARGS(nid, zid, nr_pages)
+);
+
+DEFINE_EVENT(mm_page_pcpu_zone_locked, mm_page_pcpu_drain_zone_locked,
+
+	TP_PROTO(int nid, int zid, unsigned long nr_pages),
+
+	TP_ARGS(nid, zid, nr_pages)
+);
+
+DECLARE_EVENT_CLASS(mm_page_pcpu,
 
 	TP_PROTO(struct page *page, unsigned int order, int migratetype),
 
 	TP_ARGS(page, order, migratetype),
 
 	TP_STRUCT__entry(
-		__field(	unsigned long,	pfn		)
-		__field(	unsigned int,	order		)
-		__field(	int,		migratetype	)
+		__field(unsigned long, pfn)
+		__field(unsigned int, order)
+		__field(int, migratetype)
 	),
 
 	TP_fast_assign(
@@ -266,6 +302,20 @@ TRACE_EVENT(mm_page_pcpu_drain,
 		__entry->order, __entry->migratetype)
 );
 
+DEFINE_EVENT(mm_page_pcpu, mm_page_pcpu_refill,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype)
+);
+
+DEFINE_EVENT(mm_page_pcpu, mm_page_pcpu_drain,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype)
+);
+
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..9323bdbce731 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1470,6 +1470,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	pindex = pindex - 1;
 
 	spin_lock_irqsave(&zone->lock, flags);
+	trace_mm_page_pcpu_drain_zone_locked(zone_to_nid(zone), zone_idx(zone),
+					     count);
 
 	while (count > 0) {
 		struct list_head *list;
@@ -2527,6 +2529,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 	}
+	trace_mm_page_pcpu_refill_zone_locked(zone_to_nid(zone), zone_idx(zone),
+					      count << order);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 					      alloc_flags, &rmqm);
@@ -2544,6 +2548,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
+		trace_mm_page_pcpu_refill(page, order, migratetype);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-- 
2.53.0


* [PATCH v5 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-04-27  2:01 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260427020142.358912-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices are starved of available
tags.

This patch introduces two new debugfs attributes for each block
hardware queue:
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

These files expose per-CPU counters, summed on read, that increment each
time a submitting context is forced into an uninterruptible sleep via
io_schedule() due to the complete exhaustion of physical driver tags or
software scheduler tags, respectively.

To ensure negligible performance overhead even in production
environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
tracking logic utilises dynamically allocated per-CPU counters. When
this configuration is disabled, the tracking logic compiles down to a
safe no-op.
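
The per-CPU counting scheme described above can be modelled in user space
as one slot per CPU that is written locally and summed by the reader. This
is a sketch under stated assumptions, not the patch's code: the fixed
NR_CPUS_MODEL, struct starvation_stats, stats_inc() and stats_sum() are
hypothetical stand-ins for the kernel's per-CPU allocation and accessors.

```c
#include <stddef.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for the model */

struct starvation_stats {
	unsigned long wait_on_hw_tag[NR_CPUS_MODEL];
	unsigned long wait_on_sched_tag[NR_CPUS_MODEL];
};

/* Writer side: each CPU touches only its own slot. */
static void stats_inc(struct starvation_stats *s, int cpu, int is_sched)
{
	if (is_sched)
		s->wait_on_sched_tag[cpu]++;
	else
		s->wait_on_hw_tag[cpu]++;
}

/* Reader side: the reported value is the sum over all slots. */
static unsigned long stats_sum(const unsigned long *slots)
{
	unsigned long total = 0;
	size_t cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		total += slots[cpu];
	return total;
}
```

Because writers never share a slot, the hot path needs no shared cache line
or atomic operation. The cost moves to the rarely used read path, which has
to visit every slot.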

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h |  19 +++++++
 block/blk-mq-tag.c     |   4 ++
 block/blk-mq.c         |   5 ++
 include/linux/blk-mq.h |  12 +++++
 5 files changed, 149 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..1a993bcea5c9 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -7,6 +7,7 @@
 #include <linux/blkdev.h>
 #include <linux/build_bug.h>
 #include <linux/debugfs.h>
+#include <linux/percpu.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+/**
+ * hctx_wait_on_hw_tag_show - display hardware tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of physical hardware driver tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_hw_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
+/**
+ * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of software scheduler tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_sched_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
 #define CTX_RQ_SEQ_OPS(name, type)					\
 static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
 	__acquires(&ctx->lock)						\
@@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
 	{"type", 0400, hctx_type_show},
+	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
+	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
 	{},
 };
 
@@ -815,3 +866,61 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 	debugfs_remove_recursive(hctx->sched_debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 }
+
+/**
+ * blk_mq_debugfs_alloc_hctx_stats - Allocate per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ * @gfp: memory allocation flags
+ *
+ * Allocates the per-cpu memory for tracking hardware and scheduler tag
+ * starvation.
+ */
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx, gfp_t gfp)
+{
+	if (!hctx->wait_on_hw_tag)
+		hctx->wait_on_hw_tag = alloc_percpu_gfp(unsigned long,
+							gfp);
+	if (!hctx->wait_on_sched_tag)
+		hctx->wait_on_sched_tag = alloc_percpu_gfp(unsigned long,
+							   gfp);
+}
+
+/**
+ * blk_mq_debugfs_free_hctx_stats - Free per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ *
+ * Frees the per-cpu memory used for tracking hardware and scheduler tag
+ * starvation. This must only be called during hardware queue teardown when
+ * the queue is safely frozen and no active I/O submissions can race to
+ * increment the statistics.
+ */
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+	free_percpu(hctx->wait_on_hw_tag);
+	hctx->wait_on_hw_tag = NULL;
+	free_percpu(hctx->wait_on_sched_tag);
+	hctx->wait_on_sched_tag = NULL;
+}
+
+/**
+ * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
+ * @hctx: hardware context associated with the tag allocation
+ * @is_sched: true if the starved pool is the software scheduler
+ *
+ * Evaluates the exhausted tag pool and safely increments the appropriate
+ * per-cpu debugfs starvation counter.
+ *
+ * Note: The per-cpu pointers are explicitly checked to prevent a NULL
+ * pointer dereference in the event that the system was under heavy memory
+ * pressure and the initial per-cpu allocation failed.
+ */
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched)
+{
+	unsigned long __percpu *tags = is_sched ?
+			READ_ONCE(hctx->wait_on_sched_tag) :
+			READ_ONCE(hctx->wait_on_hw_tag);
+
+	if (likely(tags))
+		this_cpu_inc(*tags);
+}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 49bb1aaa83dc..7a7c0f376a2b 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
 	const struct seq_operations *seq_ops;
 };
 
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched);
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
@@ -26,6 +28,9 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_register_hctxs(struct request_queue *q);
 void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+				     gfp_t gfp);
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_sched(struct request_queue *q);
 void blk_mq_debugfs_unregister_sched(struct request_queue *q);
@@ -35,6 +40,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 #else
+static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+						bool is_sched)
+{
+}
+
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
 }
@@ -56,6 +66,15 @@ static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
 {
 }
 
+static inline void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+						   gfp_t gfp)
+{
+}
+
+static inline void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+}
+
 static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
 {
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 66138dd043d4..3cc6a97a87a0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -17,6 +17,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 /*
  * Recalculate wakeup batch when tag is shared by hctx.
@@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		trace_block_rq_tag_wait(data->q, data->hctx,
 					data->rq_flags & RQF_SCHED_TAGS);
 
+		blk_mq_debugfs_inc_wait_tags(data->hctx,
+					     data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..cd52bf6f82ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3991,6 +3991,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 			blk_free_flush_queue_callback);
 	hctx->fq = NULL;
 
+	blk_mq_debugfs_free_hctx_stats(hctx);
+
 	spin_lock(&q->unused_hctx_lock);
 	list_add(&hctx->hctx_list, &q->unused_hctx_list);
 	spin_unlock(&q->unused_hctx_lock);
@@ -4016,6 +4018,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
 
+	blk_mq_debugfs_alloc_hctx_stats(hctx, gfp);
+
 	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
 	if (!hctx->fq)
 		goto fail;
@@ -4041,6 +4045,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	blk_free_flush_queue(hctx->fq);
 	hctx->fq = NULL;
  fail:
+	blk_mq_debugfs_free_hctx_stats(hctx);
 	return -1;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..41d61488d683 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
 	struct dentry		*debugfs_dir;
 	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
 	struct dentry		*sched_debugfs_dir;
+	/**
+	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to physical
+	 * hardware tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_hw_tag;
+	/**
+	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to software
+	 * scheduler tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_sched_tag;
 #endif
 
 	/**
-- 
2.51.0


^ permalink raw reply related

* [PATCH v5 1/2] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-04-27  2:01 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260427020142.358912-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptibly via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait trace point in the tag
allocation slow-path. It triggers immediately before the thread yields
the CPU, exposing the exact hardware context (hctx) that is starved, the
specific pool experiencing starvation (hardware or software scheduler),
and the total pool depth.

This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.
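
As an illustration of how a monitoring agent might consume the event, here
is a minimal user-space sketch that parses the text produced by the
TP_printk() format in this patch. The struct and function names are
hypothetical and not part of the patch; it assumes the agent has already
stripped the generic ftrace prefix (task, CPU, timestamp, event name) from
the line.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type for a user-space consumer; not part of the patch. */
struct tag_wait_event {
	unsigned int major;
	unsigned int minor;
	unsigned int hctx_id;
	unsigned int depth;
	int is_sched_tag;	/* 1 = scheduler pool, 0 = hardware pool */
};

/*
 * Parse a payload in the TP_printk() format of block_rq_tag_wait:
 *   "%d,%d hctx=%u starved on %s tags (depth=%u)"
 * Returns 0 on success, -1 if the text does not match.
 */
static int parse_tag_wait(const char *payload, struct tag_wait_event *ev)
{
	char pool[16];

	if (sscanf(payload, "%u,%u hctx=%u starved on %15s tags (depth=%u)",
		   &ev->major, &ev->minor, &ev->hctx_id, pool,
		   &ev->depth) != 5)
		return -1;

	if (strcmp(pool, "scheduler") == 0)
		ev->is_sched_tag = 1;
	else if (strcmp(pool, "hardware") == 0)
		ev->is_sched_tag = 0;
	else
		return -1;

	return 0;
}
```

An agent could aggregate the parsed events per (major, minor, hctx_id) to
see which hardware context is starved most often.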

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-tag.c           |  4 ++++
 include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..66138dd043d4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..7c1026d1cb35 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+	TP_ARGS(q, hctx, is_sched_tag),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( bool,		is_sched_tag		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= q->disk ? disk_devt(q->disk) : 0;
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+
+		if (is_sched_tag)
+			__entry->nr_tags = hctx->sched_tags->nr_tags;
+		else
+			__entry->nr_tags = hctx->tags->nr_tags;
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0


^ permalink raw reply related

* [PATCH v5 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-04-27  2:01 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel

Hi Jens, Steve, Masami,

In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states; there is no simple,
out-of-the-box diagnostic path.

This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:

  - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
    allocation slow-path to capture precise, event-based starvation.

  - Patch 2 complements this by exposing "wait_on_hw_tag" and
    "wait_on_sched_tag" per-CPU counters via debugfs for quick,
    point-in-time cumulative polling.

Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.
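
To make the counter scheme in patch 2 concrete, here is a small user-space
model of it: each CPU increments only its own slot, and the debugfs read
side sums all slots. This is only a sketch. NR_CPUS_SKETCH and the struct
and function names are made up for illustration, and the flat arrays stand
in for the kernel's per-CPU allocations (alloc_percpu_gfp(), this_cpu_inc()
and for_each_possible_cpu() in the real patch).

```c
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

/*
 * Model of the two per-CPU counters. In the kernel each CPU gets its own
 * copy in per-CPU memory; this flat array only models the indexing.
 */
struct wait_stats {
	unsigned long wait_on_hw_tag[NR_CPUS_SKETCH];
	unsigned long wait_on_sched_tag[NR_CPUS_SKETCH];
};

/* Writer side: bump the slot of the CPU that is about to block. */
static void stats_inc(struct wait_stats *s, int cpu, int is_sched)
{
	if (!s)		/* mirrors the NULL check for a failed allocation */
		return;

	if (is_sched)
		s->wait_on_sched_tag[cpu]++;
	else
		s->wait_on_hw_tag[cpu]++;
}

/* Reader side: the cumulative value is the sum over all CPU slots. */
static unsigned long stats_sum(const unsigned long *slots)
{
	unsigned long count = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		count += slots[cpu];

	return count;
}
```

The sum is a point-in-time snapshot: slots may keep changing while they
are being added up, which is acceptable for a diagnostic counter.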

Please let me know your thoughts.


Changes since v4 [1]:
 - Prevented a NULL pointer dereference in the tracepoint fast-assign for
   disk-less request queues by safely checking q->disk before resolving the
   dev_t

 - Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
   the per-CPU counter allocation from the volatile debugfs lifecycle and
   tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
   and blk_mq_exit_hctx())

 - Fixed a potential compiler double-fetch bug by wrapping the per-CPU
   pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()

 - Passed the appropriate gfp_t flags down to the allocation routines to
   maintain the strict GFP_NOIO context

 - Updated kernel-doc descriptions to clarify that the NULL pointer 
   checks guard against memory allocation failures under pressure, rather 
   than initialisation race conditions
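
For readers unfamiliar with the double-fetch concern, the pattern can be
sketched in user space as follows. READ_ONCE_SKETCH is a stand-in for the
kernel's READ_ONCE() built on the GNU __typeof__ extension, and the struct
is a hypothetical reduction of blk_mq_hw_ctx to the two counter pointers.
The point is that the pointer is loaded once into a local, so the NULL
check and the increment are guaranteed to see the same value.

```c
#include <stddef.h>

/* User-space stand-in for READ_ONCE(); relies on GNU __typeof__. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical reduction of the hctx to the two counter pointers. */
struct hctx_sketch {
	unsigned long *wait_on_hw_tag;		/* NULL if allocation failed */
	unsigned long *wait_on_sched_tag;	/* NULL if allocation failed */
};

static void inc_wait_tags(struct hctx_sketch *hctx, int is_sched)
{
	/* Single load: the compiler may not re-read the field later. */
	unsigned long *tags = is_sched ?
			READ_ONCE_SKETCH(hctx->wait_on_sched_tag) :
			READ_ONCE_SKETCH(hctx->wait_on_hw_tag);

	if (tags)
		(*tags)++;
}
```

Without the single load, a compiler would be free to read the field once
for the check and again for the dereference.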

Changes since v3 [2]:
 - Transitioned tracking architecture from shared atomic_t variables to
   dynamically allocated per-CPU counters to resolve cache line bouncing
   (Bart Van Assche)

Changes since v2 [3]:
 - Added "Reviewed-by:" and "Tested-by:" tags for patch 1

 - Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)

 - Introduced atomic counters via debugfs 

Changes since v1 [4]:
 - Improved the description of the trace point (Damien Le Moal)

 - Removed the redundant "active requests" (Laurence Oberman)

 - Introduced pool-specific starvation tracking

[1]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/


Aaron Tomlin (2):
  blk-mq: add tracepoint block_rq_tag_wait
  blk-mq: expose tag starvation counts via debugfs

 block/blk-mq-debugfs.c       | 109 +++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h       |  19 ++++++
 block/blk-mq-tag.c           |   8 +++
 block/blk-mq.c               |   5 ++
 include/linux/blk-mq.h       |  12 ++++
 include/trace/events/block.h |  43 ++++++++++++++
 6 files changed, 196 insertions(+)

-- 
2.51.0


^ permalink raw reply

* Re: [PATCH v3 3/4] testing: add nfsd-io-bench NFS server benchmark suite
From: Ritesh Harjani @ 2026-04-26 23:54 UTC (permalink / raw)
  To: Jeff Layton, Andrew Morton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Christoph Hellwig, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever, linux-fsdevel,
	linux-kernel, linux-nfs, linux-mm, linux-trace-kernel, Zorro Lang
In-Reply-To: <a1e784d7006fe5d4331d41a0638be117ac67fb21.camel@kernel.org>

Jeff Layton <jlayton@kernel.org> writes:

> On Sun, 2026-04-26 at 05:34 -0700, Andrew Morton wrote:
>> So how are we to maintain this?

Maybe in xfstests? It has tests/perf/, but that has just 1 test.
Maybe others can tell whether it makes sense to maintain such fio-based
performance benchmarking scripts in there.

-ritesh

^ permalink raw reply

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Ritesh Harjani @ 2026-04-26 22:31 UTC (permalink / raw)
  To: Jeff Layton, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Christoph Hellwig, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel, Jeff Layton
In-Reply-To: <20260426-dontcache-v3-2-79eb37da9547@kernel.org>

Jeff Layton <jlayton@kernel.org> writes:

> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context.  Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
>
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background.  This moves writeback submission
> completely off the writer's hot path.
>
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the new NR_DONTCACHE_DIRTY counter to determine how many pages to write
> back.  The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
>

Yup, so we wake up the writeback flusher, which will write that
"number" of dirty pages. The dirty pages written by writeback can be
of any type though: DONTCACHE or normal (non-dontcache) dirty
pages. IIUC, writeback doesn't distinguish between them while writing.


IMO, the commit msg could also explain why this approach was taken.
IIUC, it is because writing NR_DONTCACHE_DIRTY pages still reduces
page cache pressure and the amount of work that reclaim has to do,
even though some of those pages may be non-dontcache pages if there
was a parallel buffered write in the system.


Also, should the following change be documented somewhere? In the man
page maybe? I.e.:
Earlier, RWF_DONTCACHE writes made sure that the dirty pages were
immediately submitted for writeback, and completion would release those
pages. But now, in certain cases when there are mixed buffered writes in
the system, those dontcache dirty pages might be written back after a
delay (whenever writeback next kicks in).
However, for RWF_DONTCACHE reads it should not affect anything.

> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.
>
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility, and target the correct cgroup writeback domain via
> unlocked_inode_to_wb_begin().
>
> dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> ~503 GB, compared to a v6.19-ish baseline):
>

Can we please also test parallel buffered writes and dontcache writes? 
Since this patch series definitely affects that.

BTW - adding these numbers in the commit msg itself is very helpful.

>   Single-client sequential write (MB/s):
>                        baseline    patched     change
>   buffered              1449.8     1440.1      -0.7%
>   dontcache             1347.9     1461.5      +8.4%
>   direct                1450.0     1440.1      -0.7%
>
>   Single-client sequential write latency (us):
>                        baseline    patched     change
>   dontcache p50         3031.0    10551.3    +248.1%
>   dontcache p99        74973.2    21626.9     -71.2%
>   dontcache p99.9      85459.0    23199.7     -72.9%
>
>   Single-client random write (MB/s):
>                        baseline    patched     change
>   dontcache              284.2      295.4      +3.9%
>
>   Single-client random write p99.9 latency (us):
>                        baseline    patched     change
>   dontcache             2277.4      872.4     -61.7%
>
>   Multi-writer aggregate throughput (MB/s):

Can you please describe this test scenario if possible? Above you
mentioned we are writing a file_size of 2x RAM_SIZE, but your
multi-client test says something else:

local num_clients=4
+	mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
+	client_size="$(( mem_kb / 1024 / num_clients ))M"

Also, the multi-writer case spawns parallel fio jobs and then parses
and aggregates the bandwidth results, instead of using fio to spawn
multiple parallel threads... which is ok, but a bit weird.
Why not let fio do the aggregate bandwidth and latency calculation
instead?
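
To spell out the sizing concern, here is the quoted shell arithmetic
redone as a small C sketch. It assumes the two quoted lines are the whole
sizing logic for the multi-client case; the function names are made up
for illustration.

```c
/* Mirrors the quoted shell: mem_kb / 1024 / num_clients, result in MiB. */
static unsigned long client_size_mb(unsigned long mem_kb,
				    unsigned long num_clients)
{
	return mem_kb / 1024 / num_clients;
}

/* Total data written by all clients together, in MiB. */
static unsigned long aggregate_size_mb(unsigned long mem_kb,
				       unsigned long num_clients)
{
	return client_size_mb(mem_kb, num_clients) * num_clients;
}
```

With MemTotal at 256 GiB (268435456 kB) and 4 clients, each client gets
65536M and the aggregate is 262144M, i.e. about 1x RAM rather than 2x.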

>                        baseline    patched     change
>   buffered              1619.5     1611.2      -0.5%
>   dontcache             1281.1     1629.4     +27.2%
>   direct                1545.4     1609.4      +4.1%
>
>   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
>                        baseline    patched     change
>   writer (MB/s)         1297.6     1471.1     +13.4%
>   readers avg (MB/s)     855.0      462.4     -45.9%
>
> nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> file size ~502 GB, compared to v6.19-ish baseline):
>
>   Single-client sequential write (MB/s):
>                        baseline    patched     change
>   buffered              4844.2     4653.4      -3.9%
>   dontcache             3028.3     3723.1     +22.9%
>   direct                 957.6      987.8      +3.2%
>
>   Single-client sequential write p99.9 latency (us):
>                        baseline    patched     change
>   dontcache            759169.0   175112.2     -76.9%
>
>   Single-client random write (MB/s):
>                        baseline    patched     change
>   dontcache              590.0     1561.0    +164.6%
>
>   Multi-writer aggregate throughput (MB/s):
>                        baseline    patched     change
>   buffered              9636.3     9422.9      -2.2%
>   dontcache             1894.9     9442.6    +398.3%
>   direct                 809.6      975.1     +20.4%
>
>   Noisy neighbor (dontcache writer + random readers):
>                        baseline    patched     change
>   writer (MB/s)         1854.5     4063.6    +119.1%
>   readers avg (MB/s)     131.2      101.6     -22.5%
>
> The NFS results show even larger improvements than the local benchmarks.
> Multi-writer dontcache throughput improves nearly 5x, matching buffered
> I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> buffered.
>

Nice :)
Some explanation here of why there is a 5x improvement with NFS compared
to local filesystems, please?
(I am not very familiar with the NFS side, but a possible reasoning would
help.)
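
For reference, the "nearly 5x" figure can be reproduced from the quoted
multi-writer dontcache numbers with the arithmetic below. This is only a
sketch of how the "change" column appears to be derived; the helper names
are mine.

```c
/* Relative change from baseline to patched, in percent. */
static double pct_change(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* Speed-up factor; e.g. 5.0 would mean "5x". */
static double speedup(double baseline, double patched)
{
	return patched / baseline;
}
```

For NFS, 1894.9 -> 9442.6 MB/s gives about +398.3%, i.e. a factor of
roughly 4.98. For the local run, 1281.1 -> 1629.4 MB/s gives about +27.2%.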

-ritesh


^ permalink raw reply

* Re: [PATCH] mm/page_alloc: add tracepoint for PCP refills
From: Vishal Moola @ 2026-04-26 22:06 UTC (permalink / raw)
  To: Bunyod Suvonov
  Cc: akpm, vbabka, linux-mm, rostedt, mhiramat, mathieu.desnoyers,
	linux-trace-kernel, linux-kernel, surenb, mhocko, jackmanb,
	hannes, ziy
In-Reply-To: <20260425091335.346504-1-b.suvonov@sjtu.edu.cn>

On Sat, Apr 25, 2026 at 05:13:35PM +0800, Bunyod Suvonov wrote:
> The page allocator already has mm_page_pcpu_drain to trace pages
> drained from the per-cpu page lists back to the buddy allocator. There
> is no matching tracepoint for the opposite direction, where
> rmqueue_bulk() refills a PCP list from the buddy allocator.

This sounds like a reasonable idea. Does this tracepoint show us
something that a workload might care about? Not opposed, just curious.

For future versions, would you mind including documentation about it
in Documentation/trace/events-kmem.rst?

> mm_page_alloc_zone_locked is not a good substitute for this. It is
> emitted from __rmqueue_smallest(), which is used both by rmqueue_bulk()
> and by the direct buddy allocation path. Its percpu_refill field is
> derived from the allocation order and migratetype, so it does not
> reliably identify whether the allocation came from a PCP refill.
> 
> Add mm_page_pcpu_refill and emit it from rmqueue_bulk() for each page
> added to the PCP list. The new tracepoint uses the same page, order and
> migratetype fields as mm_page_pcpu_drain, making refill and drain
> activity directly comparable.
> 
> Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
> ---
>  include/trace/events/kmem.h | 23 +++++++++++++++++++++++
>  mm/page_alloc.c             |  1 +
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index cd7920c81f85..16985604fc51 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -243,6 +243,29 @@ DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
>  	TP_ARGS(page, order, migratetype, percpu_refill)
>  );
>  
> +TRACE_EVENT(mm_page_pcpu_refill,
> +
> +	TP_PROTO(struct page *page, unsigned int order, int migratetype),
> +
> +	TP_ARGS(page, order, migratetype),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	pfn		)
> +		__field(	unsigned int,	order		)
> +		__field(	int,		migratetype	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->pfn		= page ? page_to_pfn(page) : -1UL;
> +		__entry->order		= order;
> +		__entry->migratetype	= migratetype;
> +	),
> +
> +	TP_printk("page=%p pfn=0x%lx order=%d migratetype=%d",
> +		pfn_to_page(__entry->pfn), __entry->pfn,
> +		__entry->order, __entry->migratetype)
> +);
> +
>  TRACE_EVENT(mm_page_pcpu_drain,
>  
>  	TP_PROTO(struct page *page, unsigned int order, int migratetype),
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 65e205111553..a60b73ed39a4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2544,6 +2544,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> +		trace_mm_page_pcpu_refill(page, order, migratetype);

If you're trying to trace all pages as they come onto the pcp lists,
should you also account for the free_frozen_page_commit() path?

>  	}
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
> -- 
> 2.53.0
> 

^ permalink raw reply

* [RFC PATCH 16/19] mm/damon: trace probe_hits
From: SeongJae Park @ 2026-04-26 20:52 UTC (permalink / raw)
  Cc: SeongJae Park, Andrew Morton, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel
In-Reply-To: <20260426205222.93895-1-sj@kernel.org>

Introduce a new tracepoint for exposing the per-region per-probe
positive sample count via tracefs.

Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 41 ++++++++++++++++++++++++++++++++++++
 mm/damon/core.c              |  1 +
 2 files changed, 42 insertions(+)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 7e25f4469b81b..121d7bc3a2c27 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -130,6 +130,47 @@ TRACE_EVENT(damon_monitor_intervals_tune,
 	TP_printk("sample_us=%lu", __entry->sample_us)
 );
 
+TRACE_EVENT(damon_aggregated_v2,
+
+	TP_PROTO(unsigned int target_id, struct damon_region *r,
+		unsigned int nr_regions),
+
+	TP_ARGS(target_id, r, nr_regions),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned int, nr_regions)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_accesses)
+		__field(unsigned int, age)
+		__field(unsigned char, probe_hit0)
+		__field(unsigned char, probe_hit1)
+		__field(unsigned char, probe_hit2)
+		__field(unsigned char, probe_hit3)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = target_id;
+		__entry->nr_regions = nr_regions;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_accesses = r->nr_accesses;
+		__entry->age = r->age;
+		__entry->probe_hit0 = r->probe_hits[0];
+		__entry->probe_hit1 = r->probe_hits[1];
+		__entry->probe_hit2 = r->probe_hits[2];
+		__entry->probe_hit3 = r->probe_hits[3];
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u %hhu %hhu %hhu %hhu",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end,
+			__entry->nr_accesses, __entry->age,
+			__entry->probe_hit0, __entry->probe_hit1,
+			__entry->probe_hit2, __entry->probe_hit3)
+);
+
 TRACE_EVENT(damon_aggregated,
 
 	TP_PROTO(unsigned int target_id, struct damon_region *r,
diff --git a/mm/damon/core.c b/mm/damon/core.c
index fe14971d72747..54834b74efef4 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1924,6 +1924,7 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
 			int i;
 
 			trace_damon_aggregated(ti, r, damon_nr_regions(t));
+			trace_damon_aggregated_v2(ti, r, damon_nr_regions(t));
 			damon_warn_fix_nr_accesses_corruption(r);
 			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
-- 
2.47.3

^ permalink raw reply related

* [RFC PATCH 00/19] mm/damon: introduce data attributes monitoring
From: SeongJae Park @ 2026-04-26 20:52 UTC (permalink / raw)
  Cc: SeongJae Park, Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel

TL; DR
======

Extend DAMON for monitoring general data attributes other than accesses.
In the short term, this enables light-weight page type (e.g., the cgroup
a page belongs to) aware monitoring.  In the long term, this will help
extend DAMON to multiple access event capture primitives (e.g., page
faults and PMU) and eventually pivot DAMON to a "Data Attributes
Monitoring and Operations eNgine".

Background: High Cost of Page Level Properties Monitoring
=========================================================

DAMON was initially introduced as a Data Access MONitor.  It has been
extended for not only access monitoring but also data access-aware
system operations (DAMOS).  But the monitoring part is still only for
data accesses.

Data access patterns are good information, but some users need more
holistic views.  In particular, users want to see the access pattern
information together with the types of the memory.  For example, users
who work on making huge pages efficient want to know how much of the
DAMON-found hot/cold regions are backed by huge pages.  Users who run
multiple workloads with different cgroups want to know how much of the
DAMON-found hot/cold regions belong to specific cgroups.

For this user demand, we developed a DAMOS extension for page level
properties based monitoring [1], which landed in 6.14.  Using the
feature, users can specify the page level data properties that they are
interested in, in a flexible format that uses DAMOS filters.  Then,
DAMON applies the filters to each folio of the entire DAMON region and
lets users know how many bytes of memory in each DAMON region passed the
given filters.

This gives page level detailed and deterministic information to users.
But, because the operation is done at page level, the overhead is
proportional to the memory size.  It was useful for test or debugging
purposes on a small number of machines.  But it was obviously too heavy
to be always enabled on all machines running real user workloads.
For real world workloads, it was recommended to use the feature with
user-space controlled sampling approaches.  For example, users could do
the page level monitoring only once per hour, on a randomly selected one
percent of the machines in their fleet.  If the runtime is long enough
and the fleet is big enough, it should provide statistically meaningful
data.

But users are too busy to implement such controls on their own.

Data Attributes Monitoring
==========================

Extend DAMON to monitor not only data accesses, but also general data
attributes.  Do the extension while keeping the main promise of DAMON,
the bounded and best-effort minimum overhead.

Allow users to specify which data attributes, in addition to data
accesses, they want to monitor.  Users can install one 'data probe' per
data attribute of interest for this purpose.  A 'data probe' should be
applicable to any memory, and determine whether the given memory has the
corresponding data attribute.  E.g., whether the memory at physical
address 42 belongs to cgroup A.  Each 'data probe' is configured with
filters that are very similar to the DAMOS filters.

When DAMON checks whether the sampling address memory of each region has
been accessed since the last check, it also applies the registered data
probes, if any.  In the same way it accounts the number of access
check-positive samples (nr_accesses), it accounts the number of positive
samples of each data probe in another per-region counters array, namely
'probe_hits'.  When DAMON resets nr_accesses every aggregation interval,
it resets 'probe_hits' together.
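
The accounting described above could be sketched in plain userspace C
as below.  This is only an illustration, not the code of this series:
sketch_region, sketch_probe_fn, SKETCH_MAX_PROBES and all function names
are hypothetical, and the real damon_region, damon_probe and filter
handling are more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit; this RFC version allows up to four data probes. */
#define SKETCH_MAX_PROBES 4

/* Simplified, userspace-only stand-in for a DAMON region. */
struct sketch_region {
	unsigned long start;
	unsigned long end;
	unsigned int nr_accesses;
	unsigned int probe_hits[SKETCH_MAX_PROBES];
};

/* A probe answers "does the memory at addr have my attribute?". */
typedef bool (*sketch_probe_fn)(unsigned long addr);

/* Account one sample, reusing the sample address of the access check. */
void sketch_account_sample(struct sketch_region *r,
		unsigned long sample_addr, bool accessed,
		const sketch_probe_fn *probes, size_t nr_probes)
{
	size_t i;

	assert(nr_probes <= SKETCH_MAX_PROBES);
	if (accessed)
		r->nr_accesses++;
	for (i = 0; i < nr_probes; i++) {
		if (probes[i](sample_addr))
			r->probe_hits[i]++;
	}
}

/* Reset both kinds of counters together, once per aggregation interval. */
void sketch_reset_aggregated(struct sketch_region *r)
{
	size_t i;

	r->nr_accesses = 0;
	for (i = 0; i < SKETCH_MAX_PROBES; i++)
		r->probe_hits[i] = 0;
}

/* Made-up example probe: the attribute holds for addresses below 50. */
bool sketch_probe_low_half(unsigned long addr)
{
	return addr < 50;
}
```

The point of the sketch is that the probes run on the very address that
is already being sampled for the access check, so no extra memory is
visited for the attribute check.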

Users can read 'probe_hits' just before the values are reset.  In this
way, users can know how much of the hot/cold memory regions has the data
attributes of their interest.  E.g., 30 percent of this system's hot
memory belongs to cgroup A and 80 percent of the hot cgroup A memory is
backed by huge pages.
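
One possible way to turn per-region counters into such a percentage is
sketched below.  It is an assumption-laden illustration rather than
something this series provides: it assumes every region is sampled
samples_per_aggr times per aggregation interval, treats a region as hot
when nr_accesses reaches a caller-chosen hot_threshold, and estimates
the attribute-positive bytes of a region as size * probe_hits /
samples_per_aggr.  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-region snapshot, read just before the counters reset. */
struct sketch_snapshot {
	unsigned long sz;		/* region size in bytes */
	unsigned int nr_accesses;	/* access check-positive samples */
	unsigned int probe_hits;	/* positive samples of one data probe */
};

/*
 * Estimate the percent of hot memory that has the probed attribute.
 * Returns 0 when no region is hot.
 */
unsigned int sketch_hot_attr_percent(const struct sketch_snapshot *regions,
		size_t nr_regions, unsigned int samples_per_aggr,
		unsigned int hot_threshold)
{
	unsigned long long hot_bytes = 0, attr_bytes = 0;
	size_t i;

	assert(samples_per_aggr > 0);
	for (i = 0; i < nr_regions; i++) {
		if (regions[i].nr_accesses < hot_threshold)
			continue;
		hot_bytes += regions[i].sz;
		/* Size-weighted estimate of the attribute-positive bytes. */
		attr_bytes += (unsigned long long)regions[i].sz *
			regions[i].probe_hits / samples_per_aggr;
	}
	if (!hot_bytes)
		return 0;
	return (unsigned int)(attr_bytes * 100 / hot_bytes);
}
```

Because the counters come from sampling, the result is an estimate whose
quality depends on the number of samples per region.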

Patches Sequence
================

The first eight patches implement the core feature, interface and the
working support.  Patch 1 introduces the data probe data structure,
namely damon_probe.  Patch 2 extends damon_ctx for installing data
probes.  Patch 3 introduces another data structure for filters of each
data probe, namely damon_filter.  Patch 4 updates the damon_ctx commit
function to handle the probes.  Patch 5 extends damon_region for the
per-region per-probe positive samples counter, namely probe_hits.
Patch 6 extends damon_operations for applying probes on the underlying
DAMON operations implementation.  Patch 7 updates kdamond_fn() to invoke
the probes applying callback.  Patch 8 finally implements the probes
support on paddr ops.

Eight changes for user interface (patches 9-16) come next.  Patches 9-13
implement sysfs directories and files for setting data probes, namely
probes directory, probe directory, filters directory, filter directory
and filter directory internal files, respectively.  Patch 14 connects
the user inputs that are made via the sysfs files to DAMON core.
Patch 15 implements a sysfs file for showing the per-region per-probe
positive samples count, namely probe_hits.  Patch 16 introduces a new
tracepoint for showing the counts via tracefs.

Patch 17 adds a selftest for the sysfs files.

Patches 18 and 19 document the design and usage of the new feature,
respectively.

Discussions
===========

This allows the page properties monitoring with overhead that is low
enough to be always enabled on real world workloads.  Because the
sampling time for access check is reused for data attributes check, the
upper-bounded and best-effort minimum overhead of DAMON is kept.
Because the sampling memory for access check is reused for data
attributes check, additional overhead is minimal.

Still, DAMOS-based page level properties monitoring should be useful,
because it provides deterministic page level information.  When in
doubt about the sampling based information, running the DAMOS-based one
together and comparing the results would be useful for debugging and
tuning.

Plan for Dropping RFC tag
=========================

The user ABI for reading probe_hits is not yet convincing.  It is
exposed to users by a tracepoint and a new sysfs file.  For the
tracepoint, a new one, namely damon:damon_aggregated_v2, is introduced.
The name is not convincing, and its internal mechanism seems to have
room to be improved before dropping RFC.  For the sysfs, a file named
'probe_hits' is added under the DAMOS-tried region directory.  Reading
it returns four probe_hits values with ',' as a separator.  With the
current maximum number of data probes (four), this works.  But this can
make future changes of the limit difficult.  I will try to find a better
way before dropping the RFC tag.  Maybe a 'probe_hits/' directory having
files named '0' to 'N-1', one for each of the user-registered 'N' data
probes.
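
For illustration, a userspace reader of the current comma-separated
format could look like the sketch below.  The exact file content (for
example whether it ends with a newline) is an assumption here, and
sketch_parse_probe_hits() and SKETCH_MAX_PROBES are hypothetical names,
not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical limit; this RFC version allows up to four data probes. */
#define SKETCH_MAX_PROBES 4

/*
 * Parse "a,b,c,d" (optionally newline-terminated) into hits[].
 * Returns the number of values parsed, or -1 on malformed input or
 * when the input has more than max values.
 */
int sketch_parse_probe_hits(const char *buf, unsigned long *hits, size_t max)
{
	const char *p = buf;
	char *end = NULL;
	size_t n = 0;

	assert(buf && hits);
	if (max == 0)
		return -1;
	while (n < max) {
		/* Accept plain decimal digits only (no sign or space). */
		if (*p < '0' || *p > '9')
			return -1;
		errno = 0;
		hits[n] = strtoul(p, &end, 10);
		if (end == p || errno)
			return -1;
		n++;
		if (*end != ',')
			break;
		p = end + 1;
	}
	if (*end == '\n')
		end++;
	return *end == '\0' ? (int)n : -1;
}
```

A parser like this has the value count baked into its callers, which is
one reason a fixed-width single-file format makes later changes of the
probes limit harder than a per-probe file layout would.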

I'm currently hoping to drop the RFC tag by 7.2-rc1.

Future Works: Short Term
========================

This series introduces only a single type of data attribute:
anonymous page.  Once this lands, I will extend it for
cgroup-belonging, so that we can do cgroup-level monitoring with low
overhead.  After that, I may further work on supporting all DAMOS filter
types.  And as demands are found, we could extend the types.

This version of the implementation limits the maximum number of data
probes to four.  I will try to find a way to remove the limit in future,
if it is easy to do.  I personally think it should be enough for common
use cases, though, and therefore I am not giving it high priority at the
moment.

Future Works: Long Term
=======================

There are user requests for extending DAMON with detailed access
information, for example, per-CPU/thread/read/write monitoring.  For
that, I was working [2] on extending DAMON to use page fault events as
another access check primitive, and making the infrastructure flexible
for future use of yet another access check primitive.  Actually there is
another ongoing work [3] for extending DAMON with PMU events.  The
motivation of that work is reducing the overhead, though.

In my work [2], I was introducing a new interface for access sampling
primitives control.  Now I think this data probe interface can be used
for that, too.  That is, data access becomes just one type of data
attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
and PMU event-confirmed access will be different types of data
attributes.

The regions adjustment mechanism is currently working based on the
access information.  That's because DAMON is designed for data access
monitoring.  That is, data access information is the primary interest,
and therefore DAMON adjusts regions in a way that can best-present the
information.

Once data access becomes just one of the data attributes, there is no
reason to treat data access as that special.  There might be some users
who are not interested in access at all but want to know the location of
memory of a specific type.  The data probes interface will allow doing
that.  Further, we could extend the interface to let users set any data
attribute as the 'primary' attribute.  Then, DAMON will split and merge
regions in a way that can best-present the 'primary' attribute.

DAMOS will also be extended, to specify targets based on not only the
data access pattern, but all user-registered data attributes.  From this
stage, we may be able to call DAMON a "Data Attributes Monitoring and
Operations eNgine".

[1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
[2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
[3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com

SeongJae Park (19):
  mm/damon/core: introduce struct damon_probe
  mm/damon/core: embed damon_probe objects in damon_ctx
  mm/damon/core: introduce damon_filter
  mm/damon/core: commit probes
  mm/damon/core: introduce damon_region->probe_hits
  mm/damon/core: introduce damon_ops->apply_probes
  mm/damon/core: do data attributes monitoring
  mm/damon/paddr: support data attributes monitoring
  mm/damon/sysfs: implement probes dir
  mm/damon/sysfs: implement probe dir
  mm/damon/sysfs: implement filters directory
  mm/damon/sysfs: implement filter dir
  mm/damon/sysfs: implement filter dir files
  mm/damon/sysfs: setup probes on DAMON core API parameters
  mm/damon/sysfs-schemes: implement tried_region/probe_hits file
  mm/damon: trace probe_hits
  selftests/damon/sysfs.sh: test probes dir
  Docs/mm/damon/design: document data attributes monitoring
  Docs/admin-guide/mm/damon/usage: document data attributes monitoring

 Documentation/admin-guide/mm/damon/usage.rst |  44 +-
 Documentation/mm/damon/design.rst            |  37 ++
 include/linux/damon.h                        |  60 +++
 include/trace/events/damon.h                 |  41 ++
 mm/damon/core.c                              | 182 +++++++
 mm/damon/paddr.c                             |  45 ++
 mm/damon/sysfs-schemes.c                     |  30 ++
 mm/damon/sysfs.c                             | 502 +++++++++++++++++++
 tools/testing/selftests/damon/sysfs.sh       |  48 ++
 9 files changed, 982 insertions(+), 7 deletions(-)


base-commit: 8f22aa2e28454419ed2031119ad32ea4a6c9f1f1
-- 
2.47.3


* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Matthew Wilcox @ 2026-04-26 20:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Christoph Hellwig,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chuck Lever, linux-fsdevel, linux-kernel,
	linux-nfs, linux-mm, linux-trace-kernel
In-Reply-To: <20260426-dontcache-v3-2-79eb37da9547@kernel.org>

On Sun, Apr 26, 2026 at 07:56:08AM -0400, Jeff Layton wrote:
>   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
>                        baseline    patched     change
>   writer (MB/s)         1297.6     1471.1     +13.4%
>   readers avg (MB/s)     855.0      462.4     -45.9%

hm.  This wasn't what I thought of when I thought of "noisy neighbour".
I'd have process A doing DONTCACHE writes to file A and process B doing
normal buffered writes to file B.



* [PATCH] mm/damon: fix damos_stat tracepoint format for sz_applied
From: SeongJae Park @ 2026-04-26 19:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: SeongJae Park, # 7 . 0 . x, Masami Hiramatsu, Mathieu Desnoyers,
	Steven Rostedt, damon, linux-kernel, linux-mm, linux-trace-kernel

The print format is wrongly marking sz_applied as sz_tried.  Fix it.

Fixes: 804c26b961da ("mm/damon/core: add trace point for damos stat per apply interval")
Cc: <stable@vger.kernel.org> # 7.0.x
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/trace/events/damon.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index 24fc402ab3c85..7e25f4469b81b 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -41,7 +41,7 @@ TRACE_EVENT(damos_stat_after_apply_interval,
 	),
 
 	TP_printk("ctx_idx=%u scheme_idx=%u nr_tried=%lu sz_tried=%lu "
-			"nr_applied=%lu sz_tried=%lu sz_ops_filter_passed=%lu "
+			"nr_applied=%lu sz_applied=%lu sz_ops_filter_passed=%lu "
 			"qt_exceeds=%lu nr_snapshots=%lu",
 			__entry->context_idx, __entry->scheme_idx,
 			__entry->nr_tried, __entry->sz_tried,

base-commit: 2e98f54b5a2b874905c71f3bc40eb8c0e8e757f0
-- 
2.47.3


* [syzbot ci] Re: mm: improve write performance with RWF_DONTCACHE
From: syzbot ci @ 2026-04-26 19:02 UTC (permalink / raw)
  To: akpm, axboe, axelrasmussen, baohua, brauner, chuck.lever, david,
	hch, jack, jlayton, kasong, liam.howlett, linux-fsdevel,
	linux-kernel, linux-mm, linux-nfs, linux-trace-kernel, ljs,
	mathieu.desnoyers, mhiramat, mhocko, qi.zheng, ritesh.list,
	rostedt, rppt, shakeel.butt, snitzer, surenb, vbabka, viro,
	weixugc, willy, yuanchu
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260426-dontcache-v3-0-79eb37da9547@kernel.org>

syzbot ci has tested the following series

[v3] mm: improve write performance with RWF_DONTCACHE
https://lore.kernel.org/all/20260426-dontcache-v3-0-79eb37da9547@kernel.org
* [PATCH v3 1/4] mm: add NR_DONTCACHE_DIRTY node page counter
* [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
* [PATCH v3 3/4] testing: add nfsd-io-bench NFS server benchmark suite
* [PATCH v3 4/4] testing: add dontcache-bench local filesystem benchmark suite

and found the following issue:
WARNING in __mod_memcg_lruvec_state

Full report is available here:
https://ci.syzbot.org/series/e53aef43-ac7a-4cb7-8714-bb927aaee659

***

WARNING in __mod_memcg_lruvec_state

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      27d128c1cff64c3b8012cc56dd5a1391bb4f1821
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/c10ddd10-bb16-48c2-90fb-3625d3b258aa/config
syz repro: https://ci.syzbot.org/findings/1e8993c1-818b-4ddf-b90b-30f051b3a9d6/syz_repro

------------[ cut here ]------------
__mod_memcg_lruvec_state: missing stat item 21
WARNING: mm/memcontrol.c:911 at __mod_memcg_lruvec_state+0x1f3/0x360 mm/memcontrol.c:911, CPU#0: syz.0.17/5831
Modules linked in:
CPU: 0 UID: 0 PID: 5831 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__mod_memcg_lruvec_state+0x1fc/0x360 mm/memcontrol.c:911
Code: 00 11 85 c0 74 31 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d e9 95 2e 72 09 cc 48 8d 3d 7d c4 fd 0d 48 c7 c6 5d b4 f5 8d 89 da <67> 48 0f b9 3a eb d5 90 0f 0b 90 eb 90 e8 02 22 fb fe eb c8 48 8d
RSP: 0018:ffffc900039e7520 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000015 RCX: dffffc0000000000
RDX: 0000000000000015 RSI: ffffffff8df5b45d RDI: ffffffff90363d90
RBP: 0000000000000001 R08: ffffffff82388833 R09: ffffffff8e95cd60
R10: dffffc0000000000 R11: fffff940008c3f49 R12: ffff8881026eee80
R13: 00000000000000ff R14: 0000000000000001 R15: ffff888173a80e00
FS:  00007f5f76bca6c0(0000) GS:ffff88818dc95000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055d77c624128 CR3: 0000000171fde000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 mod_memcg_lruvec_state+0xa7/0x220 mm/memcontrol.c:941
 mod_lruvec_state mm/memcontrol.c:964 [inline]
 lruvec_stat_mod_folio+0x239/0x3e0 mm/memcontrol.c:984
 folio_account_dirtied mm/page-writeback.c:2634 [inline]
 __folio_mark_dirty+0x633/0xec0 mm/page-writeback.c:2692
 mark_buffer_dirty+0x261/0x410 fs/buffer.c:1110
 block_commit_write+0x15d/0x270 fs/buffer.c:2115
 block_write_end+0x6e/0xb0 fs/buffer.c:2191
 ext4_write_end+0x27d/0xa30 fs/ext4/inode.c:1458
 ext4_da_write_end+0x86/0xcb0 fs/ext4/inode.c:3296
 generic_perform_write+0x620/0x8f0 mm/filemap.c:4350
 ext4_buffered_write_iter+0xcb/0x370 fs/ext4/file.c:316
 ext4_file_write_iter+0x298/0x1bd0 fs/ext4/file.c:-1
 do_iter_readv_writev+0x619/0x8c0 fs/read_write.c:-1
 vfs_writev+0x33c/0x990 fs/read_write.c:1059
 do_pwritev fs/read_write.c:1155 [inline]
 __do_sys_pwritev2 fs/read_write.c:1213 [inline]
 __se_sys_pwritev2+0x184/0x2a0 fs/read_write.c:1204
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5f75d9cdd9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5f76bca028 EFLAGS: 00000246 ORIG_RAX: 0000000000000148
RAX: ffffffffffffffda RBX: 00007f5f76015fa0 RCX: 00007f5f75d9cdd9
RDX: 0000000000000001 RSI: 00002000000001c0 RDI: 0000000000000004
RBP: 00007f5f75e32d69 R08: 0000000000000001 R09: 0000000000000081
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5f76016038 R14: 00007f5f76015fa0 R15: 00007fffe7503ad8
 </TASK>
----------------
Code disassembly (best guess):
   0:	00 11                	add    %dl,(%rcx)
   2:	85 c0                	test   %eax,%eax
   4:	74 31                	je     0x37
   6:	48 83 c4 08          	add    $0x8,%rsp
   a:	5b                   	pop    %rbx
   b:	41 5c                	pop    %r12
   d:	41 5d                	pop    %r13
   f:	41 5e                	pop    %r14
  11:	41 5f                	pop    %r15
  13:	5d                   	pop    %rbp
  14:	e9 95 2e 72 09       	jmp    0x9722eae
  19:	cc                   	int3
  1a:	48 8d 3d 7d c4 fd 0d 	lea    0xdfdc47d(%rip),%rdi        # 0xdfdc49e
  21:	48 c7 c6 5d b4 f5 8d 	mov    $0xffffffff8df5b45d,%rsi
  28:	89 da                	mov    %ebx,%edx
* 2a:	67 48 0f b9 3a       	ud1    (%edx),%rdi <-- trapping instruction
  2f:	eb d5                	jmp    0x6
  31:	90                   	nop
  32:	0f 0b                	ud2
  34:	90                   	nop
  35:	eb 90                	jmp    0xffffffc7
  37:	e8 02 22 fb fe       	call   0xfefb223e
  3c:	eb c8                	jmp    0x6
  3e:	48                   	rex.W
  3f:	8d                   	.byte 0x8d


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.


* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jeff Layton @ 2026-04-26 18:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426052854.8372fb9d4c616f16a8aa0a0f@linux-foundation.org>

On Sun, 2026-04-26 at 05:28 -0700, Andrew Morton wrote:
> Naive questions...
> 
> On Sun, 26 Apr 2026 07:56:08 -0400 Jeff Layton <jlayton@kernel.org> wrote:
> 
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> 
> So in the current case, when generic_write_sync() returns, all that
> memory is written back and clean&reclaimable (or freed?), yes?
> 
> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background.  This moves writeback submission
> > completely off the writer's hot path.
> 
> Whereas after this change, that pagecache is probably still dirty,
> unreclaimable, waiting for the flusher to do its thing?
> 
> So is there potential that the system will get all gummed up with
> dirty, to-be-written-soon pagecache?  Is there something which limits
> this buildup?
> 
> > ...
> > 
> > dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> > RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> > ~503 GB, compared to a v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              1449.8     1440.1      -0.7%
> >   dontcache             1347.9     1461.5      +8.4%
> >   direct                1450.0     1440.1      -0.7%
> > 
> >   Single-client sequential write latency (us):
> >                        baseline    patched     change
> >   dontcache p50         3031.0    10551.3    +248.1%
> >   dontcache p99        74973.2    21626.9     -71.2%
> >   dontcache p99.9      85459.0    23199.7     -72.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              284.2      295.4      +3.9%
> > 
> >   Single-client random write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache             2277.4      872.4     -61.7%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              1619.5     1611.2      -0.5%
> >   dontcache             1281.1     1629.4     +27.2%
> >   direct                1545.4     1609.4      +4.1%
> > 
> >   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1297.6     1471.1     +13.4%
> >   readers avg (MB/s)     855.0      462.4     -45.9%
> 
> These results look ambiguous.  Sometimes better, sometimes worse?
> 

Forgot to comment on this part earlier...

This is the "mixed-mode" (dontcache writes + buffered reads). I played
with a bunch of different settings under nfsd, and those settings
turned out to perform the best with this benchmark.

I suspect what's happening is that the increase in write throughput
from writing via the flusher thread is crowding out reads. So, read
throughput suffers in this test from that. There are a number of ways
we could probably make that more fair.

> > nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> > NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> > file size ~502 GB, compared to v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              4844.2     4653.4      -3.9%
> >   dontcache             3028.3     3723.1     +22.9%
> >   direct                 957.6      987.8      +3.2%
> > 
> >   Single-client sequential write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache            759169.0   175112.2     -76.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              590.0     1561.0    +164.6%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              9636.3     9422.9      -2.2%
> >   dontcache             1894.9     9442.6    +398.3%
> >   direct                 809.6      975.1     +20.4%
> > 
> >   Noisy neighbor (dontcache writer + random readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1854.5     4063.6    +119.1%
> >   readers avg (MB/s)     131.2      101.6     -22.5%
> 
> Ditto but less so.
> 

Same reason for the drop, I think.

> > The NFS results show even larger improvements than the local benchmarks.
> > Multi-writer dontcache throughput improves nearly 5x, matching buffered
> > I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> > buffered.
> 
> It sounds that you like the results, so OK ;)

I think it's a win overall. As with anything writeback-related, it's a
game of tradeoffs. The good news is that DONTCACHE is still fairly new
and not many applications are using it yet, so the blast radius from
any change here should be rather small.

As a side note: I've long thought that we in general wait too long to
kick off writeback with normal buffered I/O, particularly with modern
memory sizes. DONTCACHE gives us a place to experiment with this
scheme, but we may want to think about kicking off writeback earlier in
the normal buffered case too.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Song Chen @ 2026-04-26 14:26 UTC (permalink / raw)
  To: Masami Hiramatsu (Google), Petr Mladek
  Cc: Petr Pavlu, rafael, lenb, mturquette, sboyd, viresh.kumar, agk,
	snitzer, mpatocka, bmarzins, song, yukuai, linan122, jason.wessel,
	danielt, dianders, horms, davem, edumazet, kuba, pabeni, paulmck,
	frederic, mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, joe.lawrence, rostedt, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260420112707.aa3627ca9f975eeaf7d8ea0e@kernel.org>

Hi,


On 4/20/26 10:27, Masami Hiramatsu (Google) wrote:
> On Thu, 16 Apr 2026 16:49:32 +0200
> Petr Mladek <pmladek@suse.com> wrote:
> 
>> On Thu 2026-04-16 13:18:30, Petr Pavlu wrote:
>>> On 4/15/26 8:43 AM, Song Chen wrote:
>>>> On 4/14/26 22:33, Petr Pavlu wrote:
>>>>> On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
>>>>>> diff --git a/include/linux/module.h b/include/linux/module.h
>>>>>> index 14f391b186c6..0bdd56f9defd 100644
>>>>>> --- a/include/linux/module.h
>>>>>> +++ b/include/linux/module.h
>>>>>> @@ -308,6 +308,14 @@ enum module_state {
>>>>>>        MODULE_STATE_COMING,    /* Full formed, running module_init. */
>>>>>>        MODULE_STATE_GOING,    /* Going away. */
>>>>>>        MODULE_STATE_UNFORMED,    /* Still setting it up. */
>>>>>> +    MODULE_STATE_FORMED,
>>>>>
>>>>> I don't see a reason to add a new module state. Why is it necessary and
>>>>> how does it fit with the existing states?
>>>>>
>>>> because once notifier fails in state MODULE_STATE_UNFORMED (now only ftrace has something to do in this state), notifier chain will roll back by calling blocking_notifier_call_chain_robust, i'm afraid MODULE_STATE_GOING is going to jeopardise the notifiers which don't handle it appropriately, like:
>>>>
>>>> case MODULE_STATE_COMING:
>>>>       kmalloc();
>>>> case MODULE_STATE_GOING:
>>>>       kfree();
>>>
>>> My understanding is that the current module "state machine" operates as
>>> follows. Transitions marked with an asterisk (*) are announced via the
>>> module notifier.
>>>
>>> ---> UNFORMED --*> COMING --*> LIVE --*> GOING -.
>>>          ^            |                     ^    |
>>>          |            '---------------------*    |
>>>          '---------------------------------------'
>>>
>>> The new code aims to replace the current ftrace_module_init() call in
>>> load_module(). To achieve this, it adds a notification for the UNFORMED
>>> state (only when loading a module) and introduces a new FORMED state for
>>> rollback. FORMED is purely a fake state because it never appears in
>>> module::state. The new structure is as follows:
>>>
>>>          ,--*> (FORMED)
>>>          |
>>> --*> UNFORMED --*> COMING --*> LIVE --*> GOING -.
>>>          ^            |                     ^    |
>>>          |            '---------------------*    |
>>>          '---------------------------------------'
>>>
>>> I'm afraid this is quite complex and inconsistent. Unless it can be kept
>>> simple, we would be just replacing one special handling with a different
>>> complexity, which is not worth it.
>>
>>>>>
>>>>>> +    if (err)
>>>>>> +        goto ddebug_cleanup;
>>>>>>          /* Finally it's fully formed, ready to start executing. */
>>>>>>        err = complete_formation(mod, info);
>>>>>> -    if (err)
>>>>>> +    if (err) {
>>>>>> +        blocking_notifier_call_chain_reverse(&module_notify_list,
>>>>>> +                MODULE_STATE_FORMED, mod);
>>>>>>            goto ddebug_cleanup;
>>>>>> +    }
>>>>>>    -    err = prepare_coming_module(mod);
>>>>>> +    err = prepare_module_state_transaction(mod,
>>>>>> +                MODULE_STATE_COMING, MODULE_STATE_GOING);
>>>>>>        if (err)
>>>>>>            goto bug_cleanup;
>>>>>>    @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>>>>>>        destroy_params(mod->kp, mod->num_kp);
>>>>>>        blocking_notifier_call_chain(&module_notify_list,
>>>>>>                         MODULE_STATE_GOING, mod);
>>>>>
>>>>> My understanding is that all notifier chains for MODULE_STATE_GOING
>>>>> should be reversed.
>>>> yes, all, from lowest priority notifier to highest.
>>>> I will resend patch 1 which was failed due to my proxy setting.
>>>
>>> What I meant here is that the call:
>>>
>>> blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod);
>>>
>>> should be replaced with:
>>>
>>> blocking_notifier_call_chain_reverse(&module_notify_list, MODULE_STATE_GOING, mod);
>>>
>>>>
>>>>>
>>>>>> -    klp_module_going(mod);
>>>>>>     bug_cleanup:
>>>>>>        mod->state = MODULE_STATE_GOING;
>>>>>>        /* module_bug_cleanup needs module_mutex protection */
>>>>>
>>>>> The patch removes the klp_module_going() cleanup call in load_module().
>>>>> Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
>>>>> should be removed and appropriately replaced with a cleanup via
>>>>> a notifier.
>>>>>
>>>>      err = prepare_module_state_transaction(mod,
>>>>                  MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);
>>>>      if (err)
>>>>          goto ddebug_cleanup;
>>>>
>>>> ftrace will be cleanup in blocking_notifier_call_chain_robust rolling back.
>>>>
>>>>      err = prepare_module_state_transaction(mod,
>>>>                  MODULE_STATE_COMING, MODULE_STATE_GOING);
>>>>
>>>> each notifier including ftrace and klp will be cleanup in blocking_notifier_call_chain_robust rolling back.
>>>>
>>>> if all notifiers are successful in MODULE_STATE_COMING, they all will be clean up in
>>>>   coming_cleanup:
>>>>      mod->state = MODULE_STATE_GOING;
>>>>      destroy_params(mod->kp, mod->num_kp);
>>>>      blocking_notifier_call_chain(&module_notify_list,
>>>>                       MODULE_STATE_GOING, mod);
>>>>
>>>> if  something wrong underneath.
>>>
>>> My point is that the patch leaves a call to ftrace_release_mod() in
>>> load_module(), which I expected to be handled via a notifier.
>>
>> I think that I have got it. The ftrace code needs two notifiers when
>> the module is being loaded and two when it is going.
>>
>> This is why Song added the new state. But I think that we would
>> need two new states to call:
>>
>>      + ftrace_module_init() in MODULE_STATE_UNFORMED
>>      + ftrace_module_enable() in MODULE_STATE_FORMED
>>
>> and
>>
>>      + ftrace_free_mem() in MODULE_STATE_PRE_GOING
>>      + ftrace_free_mem() in MODULE_STATE_GOING
>>
>>
>> By using the ascii art:
>>
>>   -*> UNFORMED -*> FORMED -> COMING -*> LIVE -*> PRE_GOING -*> GOING -.
>>                |          |         |                ^           ^    ^
>>                |          |         '----------------'           |    |
>>                |          '--------------------------------------'    |
>>                '------------------------------------------------------'
>>
>>
>> But I think that this is not worth it.
> 
> Agree.
> 
> If this needs to be ordered so strictly, why would we use a "single"
> module notifier chain for this complex situation?
> 
> I think the notifier call chain is just for announcing a single signal,
> instead of sending several different signals, especially if there is
> any dependency among the callbacks.
> 
> If notification callbacks need to be ordered, they are currently
> sorted by representing priority numerically, but this is quite
> fragile for updating. It has to look up other registered priorities
> and adjust the order among dependencies each time. For this reason,
> this mechanism is not suitable for global ordering. (It's like line
> numbers in BASIC.)
> It is probably only useful for representing dependencies between
> two components maintained by the same maintainer.
> 
> I'm against a general-purpose system that makes everything modular.
> It unnecessarily complicates things. If there are processes that
> require strict ordering, especially processes that must be performed
> before each stage as part of the framework, they should be called
> directly from the framework, not via notification callbacks.
> 
> This makes it simpler and more robust to maintain.
> 
> Only the framework's end users should utilize notification callbacks.
> 
> Thank you,
> 
> 

My motivation was to decouple ftrace and klp from the module loader and
make blocking_notifier_chain more generic, but it doesn't become
completely generic. I understand your and Petr's comments and agree.

Thanks

Best regards

Song

>>
>> Best Regards,
>> Petr
>>
> 
> 


^ permalink raw reply

* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Song Chen @ 2026-04-26 14:14 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
	jikos, mbenes, pmladek, joe.lawrence, rostedt, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260420144429.57b45f2beece690bceea96ec@kernel.org>

Hi Hiramatsu san,


On 4/20/26 13:44, Masami Hiramatsu (Google) wrote:
> Hi Song,
> 
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
> 
>> From: Song Chen <chensong_2000@189.cn>
>>
>> The current notifier chain implementation uses a single-linked list
>> (struct notifier_block *next), which only supports forward traversal
>> in priority order. This makes it difficult to handle cleanup/teardown
>> scenarios that require notifiers to be called in reverse priority order.
> 
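For illustration, here is a minimal userspace sketch of what the quoted
patch description proposes: a priority-sorted chain that can be walked in
both directions. It is not the kernel's implementation. struct demo_nb,
struct demo_chain, the demo_* functions and the recording callback are all
hypothetical names, and locking is omitted entirely.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct notifier_block with an added prev
 * pointer. This is not the kernel's definition. */
struct demo_nb {
	int (*call)(struct demo_nb *nb, unsigned long event);
	int priority;
	struct demo_nb *next, *prev;
};

struct demo_chain {
	struct demo_nb *head, *tail;
};

/* Insert so that higher priority comes first; entries with equal
 * priority keep their registration order. */
static void demo_register(struct demo_chain *c, struct demo_nb *n)
{
	struct demo_nb *pos = c->head;

	while (pos && pos->priority >= n->priority)
		pos = pos->next;

	n->next = pos;
	n->prev = pos ? pos->prev : c->tail;
	if (n->prev)
		n->prev->next = n;
	else
		c->head = n;
	if (pos)
		pos->prev = n;
	else
		c->tail = n;
}

/* Forward walk: highest priority first (for example on load). */
static void demo_call_forward(struct demo_chain *c, unsigned long event)
{
	for (struct demo_nb *nb = c->head; nb; nb = nb->next)
		nb->call(nb, event);
}

/* Reverse walk: lowest priority first (for example on unload). */
static void demo_call_reverse(struct demo_chain *c, unsigned long event)
{
	for (struct demo_nb *nb = c->tail; nb; nb = nb->prev)
		nb->call(nb, event);
}

/* Example callback: records the priority of each callback as it runs,
 * so the call order can be inspected afterwards. */
static int demo_log[8];
static int demo_log_len;

static int demo_record(struct demo_nb *nb, unsigned long event)
{
	(void)event;
	if (demo_log_len < 8)
		demo_log[demo_log_len++] = nb->priority;
	return 0;
}
```

Note that the reverse walk only mirrors the numeric order. It does not
remove the need to pick priority numbers that encode the dependencies
correctly in the first place.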
> What about introducing a new notification callback API that allows you
> to describe dependencies between callback functions?
> 
> For example, when registering a callback, you could register a string
> as an ID and specify whether to call it before or after that ID,
> or you could register a comparison function that is called when adding
> to a list. (I prefer @name and @depends fields so that it can be easily
> maintained.)
> 
> This would allow for better dependency building when adding to the list.
> 
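One possible reading of the @name / @depends idea, as a rough userspace
sketch only. No such API exists in the kernel today; struct dep_cb and
dep_register() are hypothetical names, and the policy chosen here (a
callback is placed directly after the one it depends on, and registration
fails if that one is not registered yet) is just an assumption made to keep
the example small.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback descriptor, not an existing kernel structure. */
struct dep_cb {
	const char *name;	/* ID of this callback */
	const char *depends;	/* ID that must run before this one, or NULL */
	int (*call)(struct dep_cb *cb, unsigned long event);
	struct dep_cb *next;
};

/*
 * Register cb on the list at *head.
 * - With @depends set, cb is inserted right after the named callback.
 * - Without @depends, cb is appended at the tail.
 * Returns 0 on success, -1 if the dependency is not registered yet.
 */
static int dep_register(struct dep_cb **head, struct dep_cb *cb)
{
	struct dep_cb **link = head;

	if (cb->depends) {
		while (*link && strcmp((*link)->name, cb->depends) != 0)
			link = &(*link)->next;
		if (!*link)
			return -1;	/* dependency not found */
		link = &(*link)->next;	/* slot right after the dependency */
	} else {
		while (*link)
			link = &(*link)->next;	/* no constraint: append */
	}

	cb->next = *link;
	*link = cb;
	return 0;
}
```

With this policy, every callback ends up after the one it depends on
without anyone choosing a priority number. A teardown path would walk the
same list in the opposite order.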

Is the new notification callback API going to replace
blocking_notifier_chain in the module loader? Or would it be an extension
inside blocking_notifier_chain that introduces less complexity?
>>
>> A concrete example is the ordering dependency between ftrace and
>> livepatch during module load/unload. See the details here [1].
> 
> If this only concerns notification callback issues with the ftrace
> and livepatch modules, it's far more robust to simply call the
> necessary processing directly when the modules load and unload,
> rather than registering notification callbacks externally.
> 
> There are fprobe, kprobe and their trace-events, all of which use
> ftrace as their foundation layer. In this case, I always need to
> consider callback order when a module is unloaded.
> 
> If ftrace is working as a part of module callbacks, it will conflict
> with the fprobe/kprobe module callbacks. Of course we can reorder it by
> modifying its priority. But this is ugly, because when we introduce
> another new feature which depends on another layer, we need to
> reorder the callbacks' priority numbers on the list.
> 
> Based on the above, I don't think this can be resolved simply by
> changing the list of notification callbacks to a bidirectional list.
> 
> Thank you,
> 

Understood, many thanks for your proposal. I will think about it.

best regards,

Song


^ permalink raw reply

* Re: [PATCH v3 3/4] testing: add nfsd-io-bench NFS server benchmark suite
From: Jeff Layton @ 2026-04-26 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426053455.4c06140446976964e6fbb8ab@linux-foundation.org>

On Sun, 2026-04-26 at 05:34 -0700, Andrew Morton wrote:
> On Sun, 26 Apr 2026 07:56:09 -0400 Jeff Layton <jlayton@kernel.org> wrote:
> 
> > Add a benchmark suite for testing NFSD I/O mode performance using fio
> > with the libnfs backend against an NFS server on localhost.  Tests
> > buffered, dontcache, and direct I/O modes via NFSD debugfs controls.
> > 
> > Includes:
> >  - fio job files for sequential/random read/write, multi-writer,
> >    noisy-neighbor, and latency-sensitive reader workloads
> >  - run-benchmarks.sh: orchestrates test matrix with mode switching
> >  - parse-results.sh: extracts metrics from fio JSON output
> >  - setup-server.sh: configures NFS export for testing
> > 
> > Assisted-by: Claude:claude-opus-4-6
> 
> OK, question.
> 
> >  10 files changed, 1024 insertions(+)
> 
> Seems that this code was largely machine-generated.  So I assume that
> you're in possession of the scripts/prompts/whatever which were used to
> generate this code.
> 
> (Can you please briefly describe the process which you used here?)
> 

It's been a while since it generated these, but I think I just asked it
to concoct a set of benchmarks for DONTCACHE writes when that involved
file sizes that were larger than the machine's memory. 

I ended up asking it to make some changes (e.g. the mixed-mode test,
and some of the perf stuff), but it seemed to do a reasonable job of
creating it.

> So how are we to maintain this?  Will other developers have to go in
> and hack this machine-generated output by hand?  Or would it be better
> to provide (in-tree) other developers with the means to regenerate this code,
> presumably using Claude?
> 
> IOW, this feels a bit like shipping the .s file without giving us the .c
> file!

As I mentioned in the cover letter, I mostly included this in the
series to demonstrate how this was tested. I'm not sure if the two
benchmark suites are suitable for inclusion. I'm fine with leaving
those two patches out of the merge. I found the testcases useful for
this, but they are indeed AI slop, and I'm not sure they have long-term
value or will be maintainable.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jeff Layton @ 2026-04-26 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426052854.8372fb9d4c616f16a8aa0a0f@linux-foundation.org>

On Sun, 2026-04-26 at 05:28 -0700, Andrew Morton wrote:
> Naive questions...
> 
> On Sun, 26 Apr 2026 07:56:08 -0400 Jeff Layton <jlayton@kernel.org> wrote:
> 
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> 
> So in the current case, when generic_write_sync() returns, all that
> memory is written back and clean&reclaimable (or freed?), yes?
> 

No. Before returning, it submits the I/Os for the portion that it wrote
rather than leaving it to the flusher to take care of things, but it
doesn't wait for the I/Os to complete.

> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background.  This moves writeback submission
> > completely off the writer's hot path.
> 
> Whereas after this change, that pagecache is probably still dirty,
> unreclaimable, waiting for the flusher to do its thing?
> 

Correct, but that's sort of the case today too since DONTCACHE I/Os
don't wait for the completion. With this change we're just deferring
the I/O submission to the flusher thread (which should hopefully soon
wake and take care of business). If the flusher thread can't keep up,
then eventually balance_dirty_pages() will kick in and start slowing
things down.

> So is there potential that the system will get all gummed up with
> dirty, to-be-written-soon pagecache?  Is there something which limits
> this buildup?
> 

Today in this situation, the writers are limited by the backing device
throughput. Once the I/O submission queues are full, then the DONTCACHE
writers end up stacking up on those. With this change, the writers will
be more limited by traditional VM limits in this situation. 

In the test runs I did, the peak pagecache with DONTCACHE writes was
higher than with the unpatched version but still considerably less than
with normal buffered I/O. That's the cost of deferring the I/O
submission to the flusher.

One thing we could consider is going back to submitting the writes
inline when the number of dirty pages is high. But, that could have a
detrimental effect on performance too.

> > ...
> > 
> > dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> > RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> > ~503 GB, compared to a v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              1449.8     1440.1      -0.7%
> >   dontcache             1347.9     1461.5      +8.4%
> >   direct                1450.0     1440.1      -0.7%
> > 
> >   Single-client sequential write latency (us):
> >                        baseline    patched     change
> >   dontcache p50         3031.0    10551.3    +248.1%
> >   dontcache p99        74973.2    21626.9     -71.2%
> >   dontcache p99.9      85459.0    23199.7     -72.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              284.2      295.4      +3.9%
> > 
> >   Single-client random write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache             2277.4      872.4     -61.7%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              1619.5     1611.2      -0.5%
> >   dontcache             1281.1     1629.4     +27.2%
> >   direct                1545.4     1609.4      +4.1%
> > 
> >   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1297.6     1471.1     +13.4%
> >   readers avg (MB/s)     855.0      462.4     -45.9%
> 
> These results look ambiguous.  Sometimes better, sometimes worse?
> 
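A note on reading the "change" column in the tables above: the figures are
consistent with (patched - baseline) / baseline. The sign therefore means
opposite things depending on the metric. For throughput rows (MB/s) a
positive change is an improvement, while for latency rows (us) a negative
change is the improvement. A minimal check of that arithmetic, where
pct_change() is a hypothetical helper and not part of either benchmark
suite:

```c
#include <assert.h>

/* Relative change of patched vs. baseline, in percent. */
static double pct_change(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

For example, pct_change(74973.2, 21626.9) reproduces the roughly -71.2%
reported for dontcache p99 latency (better), and pct_change(3031.0, 10551.3)
reproduces the roughly +248.1% reported for dontcache p50 latency (worse).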
> > nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> > NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> > file size ~502 GB, compared to v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              4844.2     4653.4      -3.9%
> >   dontcache             3028.3     3723.1     +22.9%
> >   direct                 957.6      987.8      +3.2%
> > 
> >   Single-client sequential write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache            759169.0   175112.2     -76.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              590.0     1561.0    +164.6%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              9636.3     9422.9      -2.2%
> >   dontcache             1894.9     9442.6    +398.3%
> >   direct                 809.6      975.1     +20.4%
> > 
> >   Noisy neighbor (dontcache writer + random readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1854.5     4063.6    +119.1%
> >   readers avg (MB/s)     131.2      101.6     -22.5%
> 
> Ditto but less so.
> 
> > The NFS results show even larger improvements than the local benchmarks.
> > Multi-writer dontcache throughput improves nearly 5x, matching buffered
> > I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> > buffered.
> 
> It sounds like you like the results, so OK ;)

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

