Linux Trace Kernel
* Re: [PATCH v3 3/4] testing: add nfsd-io-bench NFS server benchmark suite
From: Ritesh Harjani @ 2026-04-26 23:54 UTC (permalink / raw)
  To: Jeff Layton, Andrew Morton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Christoph Hellwig, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever, linux-fsdevel,
	linux-kernel, linux-nfs, linux-mm, linux-trace-kernel, Zorro Lang
In-Reply-To: <a1e784d7006fe5d4331d41a0638be117ac67fb21.camel@kernel.org>

Jeff Layton <jlayton@kernel.org> writes:

> On Sun, 2026-04-26 at 05:34 -0700, Andrew Morton wrote:
>> So how are we to maintain this?

Maybe in xfstests? It has tests/perf/, but that has just one test.
Maybe others can tell whether it makes sense to maintain such fio-based
performance benchmarking scripts in there.

-ritesh


* [PATCH v5 0/2] blk-mq: introduce tag starvation observability
From: Aaron Tomlin @ 2026-04-27  2:01 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel

Hi Jens, Steve, Masami,

In high-performance storage environments, particularly when utilising RAID
controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe latency
spikes can occur when fast devices are starved of available tags.
Currently, diagnosing this specific queue contention requires deploying
dynamic kprobes or inferring sleep states, which lacks a simple,
out-of-the-box diagnostic path.

This short series introduces dedicated, low-overhead observability for tag
exhaustion events in the block layer:

  - Patch 1 introduces the "block_rq_tag_wait" tracepoint in the tag
    allocation slow-path to capture precise, event-based starvation.

  - Patch 2 complements this by exposing "wait_on_hw_tag" and
    "wait_on_sched_tag" per-CPU counters via debugfs for quick,
    point-in-time cumulative polling.

Together, these provide storage engineers with zero-configuration
mechanisms to definitively identify shared-tag bottlenecks.

Please let me know your thoughts.


Changes since v4 [1]:
 - Prevented a NULL pointer dereference in the tracepoint fast-assign for
   disk-less request queues by safely checking q->disk before resolving the
   dev_t

 - Fixed a Use-After-Free (UAF) and permanent memory leak by decoupling
   the per-CPU counter allocation from the volatile debugfs lifecycle and
   tying it directly to the core hctx lifecycle (i.e., blk_mq_init_hctx()
   and blk_mq_exit_hctx())

 - Fixed a potential compiler double-fetch bug by wrapping the per-CPU
   pointer evaluations with READ_ONCE() in blk_mq_debugfs_inc_wait_tags()

 - Passed the appropriate gfp_t flags down to the allocation routines to
   maintain the strict GFP_NOIO context

 - Updated kernel-doc descriptions to clarify that the NULL pointer 
   checks guard against memory allocation failures under pressure, rather 
   than initialisation race conditions

Changes since v3 [2]:
 - Transitioned tracking architecture from shared atomic_t variables to
   dynamically allocated per-CPU counters to resolve cache line bouncing
   (Bart Van Assche)

Changes since v2 [3]:
 - Added "Reviewed-by:" and "Tested-by:" tags for patch 1

 - Evaluate is_sched_tag directly within TP_fast_assign (Steven Rostedt)

 - Introduced atomic counters via debugfs 

Changes since v1 [4]:
 - Improved the description of the trace point (Damien Le Moal)

 - Removed the redundant "active requests" (Laurence Oberman)

 - Introduced pool-specific starvation tracking

[1]: https://lore.kernel.org/lkml/20260419023036.1419514-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20260319221956.332770-1-atomlin@atomlin.com/
[3]: https://lore.kernel.org/lkml/20260319015300.287653-1-atomlin@atomlin.com/
[4]: https://lore.kernel.org/lkml/20260317182835.258183-1-atomlin@atomlin.com/


Aaron Tomlin (2):
  blk-mq: add tracepoint block_rq_tag_wait
  blk-mq: expose tag starvation counts via debugfs

 block/blk-mq-debugfs.c       | 109 +++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h       |  19 ++++++
 block/blk-mq-tag.c           |   8 +++
 block/blk-mq.c               |   5 ++
 include/linux/blk-mq.h       |  12 ++++
 include/trace/events/block.h |  43 ++++++++++++++
 6 files changed, 196 insertions(+)

-- 
2.51.0



* [PATCH v5 1/2] blk-mq: add tracepoint block_rq_tag_wait
From: Aaron Tomlin @ 2026-04-27  2:01 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260427020142.358912-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait trace point in the tag
allocation slow-path. It triggers immediately before the thread yields
the CPU, exposing the exact hardware context (hctx) that is starved, the
specific pool experiencing starvation (hardware or software scheduler),
and the total pool depth.

This provides storage engineers and performance monitoring agents
with a zero-configuration, low-overhead mechanism to definitively
identify shared-tag bottlenecks and tune I/O schedulers or cgroup
throttling accordingly.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-tag.c           |  4 ++++
 include/trace/events/block.h | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..66138dd043d4 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -13,6 +13,7 @@
 #include <linux/kmemleak.h>
 
 #include <linux/delay.h>
+#include <trace/events/block.h>
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -187,6 +188,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		if (tag != BLK_MQ_NO_TAG)
 			break;
 
+		trace_block_rq_tag_wait(data->q, data->hctx,
+					data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..7c1026d1cb35 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -226,6 +226,49 @@ DECLARE_EVENT_CLASS(block_rq,
 		  IOPRIO_PRIO_LEVEL(__entry->ioprio), __entry->comm)
 );
 
+/**
+ * block_rq_tag_wait - triggered when a request is starved of a tag
+ * @q: request queue of the target device
+ * @hctx: hardware context of the request experiencing starvation
+ * @is_sched_tag: indicates whether the starved pool is the software scheduler
+ *
+ * Called immediately before the submitting context is forced to block due
+ * to the exhaustion of available tags (i.e., physical hardware driver tags
+ * or software scheduler tags). This trace point indicates that the context
+ * will be placed into an uninterruptible state via io_schedule() until an
+ * active request completes and relinquishes its assigned tag.
+ */
+TRACE_EVENT(block_rq_tag_wait,
+
+	TP_PROTO(struct request_queue *q, struct blk_mq_hw_ctx *hctx, bool is_sched_tag),
+
+	TP_ARGS(q, hctx, is_sched_tag),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( u32,		hctx_id			)
+		__field( u32,		nr_tags			)
+		__field( bool,		is_sched_tag		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= q->disk ? disk_devt(q->disk) : 0;
+		__entry->hctx_id	= hctx->queue_num;
+		__entry->is_sched_tag	= is_sched_tag;
+
+		if (is_sched_tag)
+			__entry->nr_tags = hctx->sched_tags->nr_tags;
+		else
+			__entry->nr_tags = hctx->tags->nr_tags;
+	),
+
+	TP_printk("%d,%d hctx=%u starved on %s tags (depth=%u)",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->hctx_id,
+		  __entry->is_sched_tag ? "scheduler" : "hardware",
+		  __entry->nr_tags)
+);
+
 /**
  * block_rq_insert - insert block operation request into queue
  * @rq: block IO operation request
-- 
2.51.0

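For readers who want to see what a block_rq_tag_wait line would look like, here is a minimal user-space sketch of the TP_printk() formatting above. It is not kernel code: the SKETCH_* macros, struct tag_wait_entry and format_tag_wait() are hypothetical names invented for illustration. The only kernel fact assumed is the in-kernel dev_t layout (major in the bits above a 20-bit minor). The sketch prints with %u where the patch uses %d, which gives the same text for valid device numbers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed in-kernel dev_t layout: major number above a 20-bit minor. */
#define SKETCH_MINORBITS	20
#define SKETCH_MINORMASK	((1U << SKETCH_MINORBITS) - 1)
#define SKETCH_MAJOR(dev)	((unsigned int)((dev) >> SKETCH_MINORBITS))
#define SKETCH_MINOR(dev)	((unsigned int)((dev) & SKETCH_MINORMASK))
#define SKETCH_MKDEV(ma, mi)	(((uint32_t)(ma) << SKETCH_MINORBITS) | (uint32_t)(mi))

/* Hypothetical mirror of the fields recorded by TP_STRUCT__entry(). */
struct tag_wait_entry {
	uint32_t dev;		/* 0 when the queue has no disk */
	uint32_t hctx_id;	/* hctx->queue_num */
	uint32_t nr_tags;	/* depth of the starved pool */
	bool is_sched_tag;	/* true: scheduler tags, false: driver tags */
};

/* Render one event the way the TP_printk() format string would. */
static int format_tag_wait(char *buf, size_t len,
			   const struct tag_wait_entry *e)
{
	return snprintf(buf, len,
			"%u,%u hctx=%u starved on %s tags (depth=%u)",
			SKETCH_MAJOR(e->dev), SKETCH_MINOR(e->dev),
			e->hctx_id,
			e->is_sched_tag ? "scheduler" : "hardware",
			e->nr_tags);
}
```

With dev = SKETCH_MKDEV(8, 16), hctx_id = 3, nr_tags = 64 and is_sched_tag = false, the buffer holds "8,16 hctx=3 starved on hardware tags (depth=64)".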


* [PATCH v5 2/2] blk-mq: expose tag starvation counts via debugfs
From: Aaron Tomlin @ 2026-04-27  2:01 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: bvanassche, johannes.thumshirn, kch, dlemoal, ritesh.list,
	loberman, neelx, sean, mproche, chjohnst, linux-block,
	linux-kernel, linux-trace-kernel
In-Reply-To: <20260427020142.358912-1-atomlin@atomlin.com>

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices are starved of available
tags.

This patch introduces two new debugfs attributes for each block
hardware queue:
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_hw_tag
  - /sys/kernel/debug/block/[device]/hctxN/wait_on_sched_tag

These files expose cumulative counters that increment each time a submitting
context is forced into an uninterruptible sleep via io_schedule() due to
the complete exhaustion of physical driver tags or software scheduler
tags, respectively.

To ensure negligible performance overhead even in production
environments where CONFIG_BLK_DEBUG_FS is actively enabled, this
tracking logic utilises dynamically allocated per-CPU counters. When
this configuration is disabled, the tracking logic compiles down to a
safe no-op.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 block/blk-mq-debugfs.c | 109 +++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-debugfs.h |  19 +++++++
 block/blk-mq-tag.c     |   4 ++
 block/blk-mq.c         |   5 ++
 include/linux/blk-mq.h |  12 +++++
 5 files changed, 149 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..1a993bcea5c9 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -7,6 +7,7 @@
 #include <linux/blkdev.h>
 #include <linux/build_bug.h>
 #include <linux/debugfs.h>
+#include <linux/percpu.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -484,6 +485,54 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+/**
+ * hctx_wait_on_hw_tag_show - display hardware tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of physical hardware driver tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_hw_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_hw_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_hw_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
+/**
+ * hctx_wait_on_sched_tag_show - display scheduler tag starvation count
+ * @data: generic pointer to the associated hardware context (hctx)
+ * @m: seq_file pointer for debugfs output formatting
+ *
+ * Prints the cumulative number of times a submitting context was forced
+ * to block due to the exhaustion of software scheduler tags.
+ *
+ * Return: 0 on success.
+ */
+static int hctx_wait_on_sched_tag_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	unsigned long count = 0;
+	int cpu;
+
+	if (hctx->wait_on_sched_tag) {
+		for_each_possible_cpu(cpu)
+			count += *per_cpu_ptr(hctx->wait_on_sched_tag, cpu);
+	}
+	seq_printf(m, "%lu\n", count);
+	return 0;
+}
+
 #define CTX_RQ_SEQ_OPS(name, type)					\
 static void *ctx_##name##_rq_list_start(struct seq_file *m, loff_t *pos) \
 	__acquires(&ctx->lock)						\
@@ -599,6 +648,8 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
 	{"type", 0400, hctx_type_show},
+	{"wait_on_hw_tag", 0400, hctx_wait_on_hw_tag_show},
+	{"wait_on_sched_tag", 0400, hctx_wait_on_sched_tag_show},
 	{},
 };
 
@@ -815,3 +866,61 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
 	debugfs_remove_recursive(hctx->sched_debugfs_dir);
 	hctx->sched_debugfs_dir = NULL;
 }
+
+/**
+ * blk_mq_debugfs_alloc_hctx_stats - Allocate per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ * @gfp: memory allocation flags
+ *
+ * Allocates the per-cpu memory for tracking hardware and scheduler tag
+ * starvation.
+ */
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx, gfp_t gfp)
+{
+	if (!hctx->wait_on_hw_tag)
+		hctx->wait_on_hw_tag = alloc_percpu_gfp(unsigned long,
+							gfp);
+	if (!hctx->wait_on_sched_tag)
+		hctx->wait_on_sched_tag = alloc_percpu_gfp(unsigned long,
+							   gfp);
+}
+
+/**
+ * blk_mq_debugfs_free_hctx_stats - Free per-cpu starvation statistics
+ * @hctx: hardware context associated with the tag allocation
+ *
+ * Frees the per-cpu memory used for tracking hardware and scheduler tag
+ * starvation. This must only be called during hardware queue teardown when
+ * the queue is safely frozen and no active I/O submissions can race to
+ * increment the statistics.
+ */
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+	free_percpu(hctx->wait_on_hw_tag);
+	hctx->wait_on_hw_tag = NULL;
+	free_percpu(hctx->wait_on_sched_tag);
+	hctx->wait_on_sched_tag = NULL;
+}
+
+/**
+ * blk_mq_debugfs_inc_wait_tags - increment the tag starvation counters
+ * @hctx: hardware context associated with the tag allocation
+ * @is_sched: true if the starved pool is the software scheduler
+ *
+ * Evaluates the exhausted tag pool and safely increments the appropriate
+ * per-cpu debugfs starvation counter.
+ *
+ * Note: The per-cpu pointers are explicitly checked to prevent a NULL
+ * pointer dereference in the event that the system was under heavy memory
+ * pressure and the initial per-cpu allocation failed.
+ */
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched)
+{
+	unsigned long __percpu *tags = is_sched ?
+			READ_ONCE(hctx->wait_on_sched_tag) :
+			READ_ONCE(hctx->wait_on_hw_tag);
+
+	if (likely(tags))
+		this_cpu_inc(*tags);
+}
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index 49bb1aaa83dc..7a7c0f376a2b 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -17,6 +17,8 @@ struct blk_mq_debugfs_attr {
 	const struct seq_operations *seq_ops;
 };
 
+void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+				  bool is_sched);
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
@@ -26,6 +28,9 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
 void blk_mq_debugfs_register_hctxs(struct request_queue *q);
 void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);
+void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+				     gfp_t gfp);
+void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_sched(struct request_queue *q);
 void blk_mq_debugfs_unregister_sched(struct request_queue *q);
@@ -35,6 +40,11 @@ void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_debugfs_register_rq_qos(struct request_queue *q);
 #else
+static inline void blk_mq_debugfs_inc_wait_tags(struct blk_mq_hw_ctx *hctx,
+						bool is_sched)
+{
+}
+
 static inline void blk_mq_debugfs_register(struct request_queue *q)
 {
 }
@@ -56,6 +66,15 @@ static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
 {
 }
 
+static inline void blk_mq_debugfs_alloc_hctx_stats(struct blk_mq_hw_ctx *hctx,
+						   gfp_t gfp)
+{
+}
+
+static inline void blk_mq_debugfs_free_hctx_stats(struct blk_mq_hw_ctx *hctx)
+{
+}
+
 static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
 {
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 66138dd043d4..3cc6a97a87a0 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -17,6 +17,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-mq-debugfs.h"
 
 /*
  * Recalculate wakeup batch when tag is shared by hctx.
@@ -191,6 +192,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		trace_block_rq_tag_wait(data->q, data->hctx,
 					data->rq_flags & RQF_SCHED_TAGS);
 
+		blk_mq_debugfs_inc_wait_tags(data->hctx,
+					     data->rq_flags & RQF_SCHED_TAGS);
+
 		bt_prev = bt;
 		io_schedule();
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..cd52bf6f82ce 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3991,6 +3991,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 			blk_free_flush_queue_callback);
 	hctx->fq = NULL;
 
+	blk_mq_debugfs_free_hctx_stats(hctx);
+
 	spin_lock(&q->unused_hctx_lock);
 	list_add(&hctx->hctx_list, &q->unused_hctx_list);
 	spin_unlock(&q->unused_hctx_lock);
@@ -4016,6 +4018,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	gfp_t gfp = GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY;
 
+	blk_mq_debugfs_alloc_hctx_stats(hctx, gfp);
+
 	hctx->fq = blk_alloc_flush_queue(hctx->numa_node, set->cmd_size, gfp);
 	if (!hctx->fq)
 		goto fail;
@@ -4041,6 +4045,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	blk_free_flush_queue(hctx->fq);
 	hctx->fq = NULL;
  fail:
+	blk_mq_debugfs_free_hctx_stats(hctx);
 	return -1;
 }
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..41d61488d683 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -453,6 +453,18 @@ struct blk_mq_hw_ctx {
 	struct dentry		*debugfs_dir;
 	/** @sched_debugfs_dir:	debugfs directory for the scheduler. */
 	struct dentry		*sched_debugfs_dir;
+	/**
+	 * @wait_on_hw_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to physical
+	 * hardware tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_hw_tag;
+	/**
+	 * @wait_on_sched_tag: Cumulative per-cpu counter incremented each
+	 * time a submitting context is forced to block due to software
+	 * scheduler tag exhaustion.
+	 */
+	unsigned long __percpu	*wait_on_sched_tag;
 #endif
 
 	/**
-- 
2.51.0

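The counter scheme in this patch (one slot per CPU, incremented locally, summed only when the debugfs file is read, with a NULL check for a failed allocation) can be sketched in user space roughly as follows. This is an illustration under stated assumptions, not the kernel implementation: struct hctx_sketch, SKETCH_NR_CPUS and the sketch_*() helpers are hypothetical, the CPU index is passed explicitly instead of using this_cpu_inc(), and a plain calloc() array stands in for alloc_percpu_gfp(). Unlike real per-CPU memory, the array slots share cache lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS 4	/* hypothetical number of possible CPUs */

/* Hypothetical stand-in for the two new blk_mq_hw_ctx fields. */
struct hctx_sketch {
	unsigned long *wait_on_hw_tag;		/* one slot per CPU, or NULL */
	unsigned long *wait_on_sched_tag;	/* one slot per CPU, or NULL */
};

/* Allocate zeroed slots; on failure the pointer simply stays NULL. */
static void sketch_alloc_stats(struct hctx_sketch *h)
{
	if (!h->wait_on_hw_tag)
		h->wait_on_hw_tag = calloc(SKETCH_NR_CPUS, sizeof(unsigned long));
	if (!h->wait_on_sched_tag)
		h->wait_on_sched_tag = calloc(SKETCH_NR_CPUS, sizeof(unsigned long));
}

/* Free both arrays and reset the pointers, as the teardown path does. */
static void sketch_free_stats(struct hctx_sketch *h)
{
	free(h->wait_on_hw_tag);
	h->wait_on_hw_tag = NULL;
	free(h->wait_on_sched_tag);
	h->wait_on_sched_tag = NULL;
}

/* Bump the slot of the given CPU; do nothing if allocation had failed. */
static void sketch_inc_wait_tags(struct hctx_sketch *h, unsigned int cpu,
				 bool is_sched)
{
	unsigned long *tags = is_sched ? h->wait_on_sched_tag
				       : h->wait_on_hw_tag;

	assert(cpu < SKETCH_NR_CPUS);
	if (tags)
		tags[cpu]++;
}

/* Read side: sum every slot, reporting 0 when nothing was allocated. */
static unsigned long sketch_sum(const unsigned long *tags)
{
	unsigned long count = 0;
	unsigned int cpu;

	if (tags) {
		for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
			count += tags[cpu];
	}
	return count;
}
```

The point of the split is that the hot path touches only its own slot, while the cost of walking all slots is paid by the rare reader.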


* [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: Bunyod Suvonov @ 2026-04-27  6:01 UTC (permalink / raw)
  To: akpm, vbabka, linux-mm
  Cc: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel,
	linux-kernel, surenb, mhocko, jackmanb, hannes, ziy, david,
	vishal.moola, corbet, skhan, linux-doc, Bunyod Suvonov
In-Reply-To: <20260425091335.346504-1-b.suvonov@sjtu.edu.cn>

mm_page_pcpu_drain traces page blocks drained from the per-cpu page
lists back to the buddy allocator. There is no matching tracepoint for
the opposite direction, where rmqueue_bulk() refills a PCP list from the
buddy allocator.

Add mm_page_pcpu_refill as the counterpart to mm_page_pcpu_drain. The
pair makes PCP traffic observable in both directions: refill shows page
blocks moving from the buddy allocator into PCP lists, while drain shows
page blocks moving from PCP lists back to the buddy allocator. Comparing
the two helps identify PCP churn, imbalance between CPUs, and cases where
pages repeatedly cycle between PCP lists and the buddy allocator instead
of being served efficiently from PCP.

PCP refill and drain activity can also require entering the buddy
allocator under zone->lock. The per-page-block refill and drain events do
not directly count those lock acquisitions, because a single bulk
operation can move multiple page blocks.

Add mm_page_pcpu_refill_zone_locked and
mm_page_pcpu_drain_zone_locked to trace successful PCP bulk operations
after acquiring the zone lock. These events make it possible to count how
often PCP refill and drain paths enter the zone-locked buddy allocator.
Frequent events can indicate that PCP lists are under pressure and are
not avoiding the zone lock as effectively as expected.

mm_page_alloc_zone_locked is not a reliable substitute for PCP refill
activity. It is emitted from __rmqueue_smallest(), which is reached with
zone->lock already held by both rmqueue_bulk() and the direct buddy
allocation path. Its percpu_refill field is derived from the allocation
order and migratetype, so it does not reliably identify whether the
allocation came from a PCP refill.

Document the new kmem tracepoints.

Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
---
Changes since v1:
- Add mm_page_pcpu_refill as the per-page-block counterpart to
  mm_page_pcpu_drain.
- Add mm_page_pcpu_refill_zone_locked and
  mm_page_pcpu_drain_zone_locked to count PCP bulk operations that
  acquired zone->lock.
- Document the new kmem tracepoints and clarify the PCP refill/drain
  semantics.

 Documentation/trace/events-kmem.rst | 65 ++++++++++++++++++-----------
 include/trace/events/kmem.h         | 58 +++++++++++++++++++++++--
 mm/page_alloc.c                     |  5 +++
 3 files changed, 100 insertions(+), 28 deletions(-)

diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 68fa75247488..9f935db1ea88 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -75,30 +75,47 @@ contention on the lruvec->lru_lock.
 =============================
 ::
 
-  mm_page_alloc_zone_locked	page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
-  mm_page_pcpu_drain		page=%p pfn=%lu order=%d cpu=%d migratetype=%d
-
-In front of the page allocator is a per-cpu page allocator. It exists only
-for order-0 pages, reduces contention on the zone->lock and reduces the
-amount of writing on struct page.
-
-When a per-CPU list is empty or pages of the wrong type are allocated,
-the zone->lock will be taken once and the per-CPU list refilled. The event
-triggered is mm_page_alloc_zone_locked for each page allocated with the
-event indicating whether it is for a percpu_refill or not.
-
-When the per-CPU list is too full, a number of pages are freed, each one
-which triggers a mm_page_pcpu_drain event.
-
-The individual nature of the events is so that pages can be tracked
-between allocation and freeing. A number of drain or refill pages that occur
-consecutively imply the zone->lock being taken once. Large amounts of per-CPU
-refills and drains could imply an imbalance between CPUs where too much work
-is being concentrated in one place. It could also indicate that the per-CPU
-lists should be a larger size. Finally, large amounts of refills on one CPU
-and drains on another could be a factor in causing large amounts of cache
-line bounces due to writes between CPUs and worth investigating if pages
-can be allocated and freed on the same CPU through some algorithm change.
+  mm_page_alloc_zone_locked	page=%p pfn=0x%lx order=%u migratetype=%d percpu_refill=%d
+  mm_page_pcpu_refill		page=%p pfn=0x%lx order=%d migratetype=%d
+  mm_page_pcpu_drain		page=%p pfn=0x%lx order=%d migratetype=%d
+  mm_page_pcpu_refill_zone_locked nid=%d zid=%d nr_pages=%lu
+  mm_page_pcpu_drain_zone_locked  nid=%d zid=%d nr_pages=%lu
+
+In front of the buddy allocator are per-cpu page lists. They reduce
+contention on the zone->lock and reduce the amount of writing on struct
+page.
+
+When an allocation finds the target per-CPU list empty, the zone->lock may
+be taken once and the per-CPU list refilled from the buddy allocator. The
+mm_page_pcpu_refill_zone_locked event is emitted once after the refill path
+successfully acquires the zone lock. The mm_page_pcpu_refill event is
+emitted for each page block added to the per-CPU list.
+
+When per-CPU pages are drained back to the buddy allocator, for example
+because a per-CPU list is above its high mark, PCP high is decayed, or an
+explicit drain is requested, the drain path takes the zone lock. The
+mm_page_pcpu_drain_zone_locked event is emitted once after the drain path
+successfully acquires the zone lock. The mm_page_pcpu_drain event is emitted
+for each page block drained from the per-CPU list.
+
+The individual refill and drain events allow pages to be tracked between
+allocation and freeing. The zone_locked events allow the bulk operations to
+be counted directly. A single zone_locked event may be followed by multiple
+refill or drain events, depending on how many page blocks are moved while
+holding the zone lock. The nr_pages field in the zone_locked events is the
+target number of base pages for the bulk operation when the zone lock is
+acquired. The individual refill or drain events describe the page blocks
+actually moved.
+
+Large amounts of per-CPU refills and drains could imply an imbalance between
+CPUs where too much work is being concentrated in one place. Frequent
+zone_locked events can indicate that the per-CPU lists are under pressure
+and are not avoiding the zone lock as effectively as expected. It could also
+indicate that the per-CPU lists should be a larger size. Finally, large
+amounts of refills on one CPU and drains on another could be a factor in
+causing large amounts of cache line bounces due to writes between CPUs and
+worth investigating if pages can be allocated and freed on the same CPU
+through some algorithm change.
 
 5. External Fragmentation
 =========================
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..68f5d4a84da6 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -243,16 +243,52 @@ DEFINE_EVENT(mm_page, mm_page_alloc_zone_locked,
 	TP_ARGS(page, order, migratetype, percpu_refill)
 );
 
-TRACE_EVENT(mm_page_pcpu_drain,
+DECLARE_EVENT_CLASS(mm_page_pcpu_zone_locked,
+
+	TP_PROTO(int nid, int zid, unsigned long nr_pages),
+
+	TP_ARGS(nid, zid, nr_pages),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, zid)
+		__field(unsigned long, nr_pages)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->nr_pages	= nr_pages;
+	),
+
+	TP_printk("nid=%d zid=%d nr_pages=%lu",
+		__entry->nid, __entry->zid, __entry->nr_pages)
+);
+
+DEFINE_EVENT(mm_page_pcpu_zone_locked, mm_page_pcpu_refill_zone_locked,
+
+	TP_PROTO(int nid, int zid, unsigned long nr_pages),
+
+	TP_ARGS(nid, zid, nr_pages)
+);
+
+DEFINE_EVENT(mm_page_pcpu_zone_locked, mm_page_pcpu_drain_zone_locked,
+
+	TP_PROTO(int nid, int zid, unsigned long nr_pages),
+
+	TP_ARGS(nid, zid, nr_pages)
+);
+
+DECLARE_EVENT_CLASS(mm_page_pcpu,
 
 	TP_PROTO(struct page *page, unsigned int order, int migratetype),
 
 	TP_ARGS(page, order, migratetype),
 
 	TP_STRUCT__entry(
-		__field(	unsigned long,	pfn		)
-		__field(	unsigned int,	order		)
-		__field(	int,		migratetype	)
+		__field(unsigned long, pfn)
+		__field(unsigned int, order)
+		__field(int, migratetype)
 	),
 
 	TP_fast_assign(
@@ -266,6 +302,20 @@ TRACE_EVENT(mm_page_pcpu_drain,
 		__entry->order, __entry->migratetype)
 );
 
+DEFINE_EVENT(mm_page_pcpu, mm_page_pcpu_refill,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype)
+);
+
+DEFINE_EVENT(mm_page_pcpu, mm_page_pcpu_drain,
+
+	TP_PROTO(struct page *page, unsigned int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype)
+);
+
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..9323bdbce731 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1470,6 +1470,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	pindex = pindex - 1;
 
 	spin_lock_irqsave(&zone->lock, flags);
+	trace_mm_page_pcpu_drain_zone_locked(zone_to_nid(zone), zone_idx(zone),
+					     count);
 
 	while (count > 0) {
 		struct list_head *list;
@@ -2527,6 +2529,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 	}
+	trace_mm_page_pcpu_refill_zone_locked(zone_to_nid(zone), zone_idx(zone),
+					      count << order);
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 					      alloc_flags, &rmqm);
@@ -2544,6 +2548,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
+		trace_mm_page_pcpu_refill(page, order, migratetype);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-- 
2.53.0

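The relationship between the two refill events in this patch can be illustrated with a small user-space model. It is only a sketch: struct pcp_trace_counts and sketch_rmqueue_bulk() are hypothetical, the free_blocks parameter stands in for however many page blocks the buddy allocator can actually supply, and the counters stand in for emitted trace events. It shows one zone_locked event per bulk operation carrying the target count << order, and one refill event per page block actually moved.

```c
/* Hypothetical tally of what the tracepoints would have reported. */
struct pcp_trace_counts {
	unsigned long zone_locked_events;	/* mm_page_pcpu_refill_zone_locked */
	unsigned long zone_locked_nr_pages;	/* nr_pages of the latest such event */
	unsigned long refill_events;		/* mm_page_pcpu_refill */
};

/*
 * Model of one bulk refill: "count" page blocks of the given order are
 * requested, but only "free_blocks" are available from the buddy side.
 * Returns the number of page blocks actually moved.
 */
static unsigned long sketch_rmqueue_bulk(unsigned int order,
					 unsigned long count,
					 unsigned long free_blocks,
					 struct pcp_trace_counts *tc)
{
	unsigned long i;

	/* zone->lock would be held from here on: one event per operation. */
	tc->zone_locked_events++;
	tc->zone_locked_nr_pages = count << order;	/* target, in base pages */

	for (i = 0; i < count; i++) {
		if (free_blocks == 0)
			break;		/* models __rmqueue() returning NULL */
		free_blocks--;
		tc->refill_events++;	/* one event per block added to the list */
	}
	return i;
}
```

For order = 2, count = 8 and free_blocks = 5, the model records a single zone_locked event with nr_pages = 32 and five refill events, which is why the two kinds of event have to be counted separately.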

* Re: [PATCH v2] mm/page_alloc: trace PCP refills and PCP zone lock usage
From: SUVONOV BUNYOD @ 2026-04-27  6:21 UTC (permalink / raw)
  To: akpm, vbabka, linux-mm
  Cc: rostedt, mhiramat, mathieu.desnoyers, linux-trace-kernel,
	linux-kernel, surenb, mhocko, jackmanb, hannes, ziy, david,
	vishal.moola, corbet, skhan, linux-doc
In-Reply-To: <20260427060142.131055-1-b.suvonov@sjtu.edu.cn>

Thank you for reviewing v1, Vishal.

All of your concerns except for the last one should be covered in v2.

> If you're trying to trace all pages as they come onto the pcp lists,
> should you also account for the free_frozen_page_commit() path?
>
>>          }
>>          spin_unlock_irqrestore(&zone->lock, flags);

No, the intent is not to trace every insertion into PCP lists. This patch
is trying to make buddy <-> PCP traffic observable by adding a new
mm_page_pcpu_refill event symmetric with the existing mm_page_pcpu_drain
event.

I also added zone_locked tracepoints because my research focuses on
analyzing which kernel mm subsystems and other parts are under stress
for a given workload. The best way to see this for PCP is to count zone
lock acquisitions, as the whole purpose of PCP is to reduce the number
of zone lock acquisitions in the first place.


* [PATCH] kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
From: Jianpeng Chang @ 2026-04-27  7:35 UTC (permalink / raw)
  To: naveen, davem, mhiramat, catalin.marinas, mark.rutland
  Cc: linux-kernel, linux-trace-kernel, stable, Jianpeng Chang

When kprobe_add_area_blacklist() iterates through a section like
.kprobes.text, the start address may not correspond to a named symbol.
On ARM64 with CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS=y (introduced by
commit baaf553d3bc3 ("arm64: Implement
HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")), the compiler flag
-fpatchable-function-entry=4,2 inserts 2 NOPs before each function entry
point for ftrace call_ops. These pre-function NOPs sit at the section base
address, before the first named function symbol. The compiler emits a $x
mapping symbol at offset 0x00 to mark the start of code, but
find_kallsyms_symbol() ignores mapping symbols.

Without CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS (e.g. defconfig), no
pre-function NOPs are inserted, the first function starts at offset
0x00, and the bug does not trigger.

This only affects modules that have a .kprobes.text section (i.e. those
using the __kprobes annotation). Modules using NOKPROBE_SYMBOL() instead
(like kretprobe_example.ko) blacklist exact function addresses via the
_kprobe_blacklist section and are not affected.

For kprobe_example.ko on ARM64 with -fpatchable-function-entry=4,2,
the .kprobes.text section layout is:

  offset 0x00: $x + 2 NOPs    (mapping symbol + ftrace preamble)
  offset 0x08: handler_post   (64 bytes)
  offset 0x50: handler_pre    (68 bytes)

kprobe_add_area_blacklist() starts iterating from the section base
address (offset 0x00), which only has the $x mapping symbol.
kprobe_add_ksym_blacklist() then calls kallsyms_lookup_size_offset()
for this address, which goes through:

  kallsyms_lookup_size_offset()
    -> module_address_lookup()
      -> find_kallsyms_symbol()

find_kallsyms_symbol() scans all module symbols to find the closest
preceding symbol.

Since no named text symbol exists at offset 0x00,
find_kallsyms_symbol() picks __UNIQUE_ID_vermagic (a .modinfo symbol
whose address is in the temporary image) as the "best" match. The
computed "size" = next_text_symbol - modinfo_symbol spans across
these two unrelated memory regions, creating a blacklist entry with
a bogus range of tens of terabytes.

Whether this causes a visible failure depends on address randomization.
Here is what happens on Raspberry Pi 4/5:

  - On RPi5, the bogus size was ~35 TB. start + size stayed within
    64-bit range, so the blacklist entry covered the entire kernel
    text. register_kprobe() in the module's own init function failed
    with -EINVAL.

  - On RPi4, the bogus size was ~75 TB. start + size overflowed
    64 bits and wrapped to a small address near zero. The range
    check (addr >= start && addr < end) then failed because end
    wrapped around, so the bogus entry was accidentally harmless
    and kprobes worked by luck.

The same bug exists on both machines, but randomization determines whether
the integer overflow masks it or not.

Fix this by checking the offset returned by kallsyms_lookup_size_offset().
A non-zero offset means the address is not at a symbol boundary, so skip
forward to the next symbol instead of creating a blacklist entry with a
wrong size.

Fixes: baaf553d3bc3 ("arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS")
Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com>
---
Hi,

This patch skips non-symbol addresses, which fixes the bogus blacklist
entry but leaves the NOP gap at the start of .kprobes.text unblacklisted.

We could instead go on to allocate the ent rather than returning, so that
the gap is added to the blacklist, or do some more work to merge the gap
into the first symbol's blacklist entry. I'm not sure whether this is
necessary, or whether there is a better way.

Thanks,
Jianpeng

 kernel/kprobes.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index bfc89083daa9..be700fb03198 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -2503,6 +2503,10 @@ int kprobe_add_ksym_blacklist(unsigned long entry)
 	    !kallsyms_lookup_size_offset(entry, &size, &offset))
 		return -EINVAL;
 
+	/* Not on a symbol boundary -- skip to the next symbol */
+	if (offset)
+		return (int)(size - offset);
+
 	ent = kmalloc_obj(*ent);
 	if (!ent)
 		return -ENOMEM;
-- 
2.54.0


^ permalink raw reply related

* [PATCH] Documentation/rv: Replace stale website link
From: Gabriele Monaco @ 2026-04-27  8:55 UTC (permalink / raw)
  To: rdunlap, Steven Rostedt, Gabriele Monaco, Jonathan Corbet,
	linux-trace-kernel, linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <b845c448-1655-4860-9b6d-93d6f8426740@infradead.org>

The sched monitor page was linking to Daniel's website which is now
down. The main purpose of the link was to point to a source for the
models from the original author and that can be found also in his
published paper.

Replace the link with a reference to Daniel's "A thread synchronization
model for the PREEMPT_RT Linux kernel", which can be found online and
includes the model definitions as well as the work behind them (not the
original patches, but since they're based on a 5.0 kernel and are mostly
included upstream, there's little value in keeping them in the docs).

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 Documentation/trace/rv/monitor_sched.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
index 0b96d6e147c6..661171bd7c5e 100644
--- a/Documentation/trace/rv/monitor_sched.rst
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -365,4 +365,4 @@ constraints when processing the events::
 References
 ----------
 
-[1] - https://bristot.me/linux-task-model
+[1] - Daniel Bristot de Oliveira et al.: A thread synchronization model for the PREEMPT_RT Linux kernel, J. Syst. Archit., 2020.

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH 1/1] tools/rv: ensure monitor name and desc are NUL-terminated
From: Gabriele Monaco @ 2026-04-27  9:32 UTC (permalink / raw)
  To: unknownbbqrx; +Cc: rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <dc9ea036-de62-4e1f-be63-8e14d675bcca@smtp-relay.sendinblue.com>

On Thu, 2026-04-23 at 17:19 +0300, unknownbbqrx wrote:
> 
> ikm_fill_monitor_definition() copies monitor name and description
> with
> strncpy(), but does not guarantee NUL termination when source strings
> are
> equal to or longer than the destination buffers.
> 
> Clamp copies to sizeof(dst) - 1 and explicitly append '\0' for both
> fields
> to keep them safe for later string operations.

Hi,

thanks for the fix!
Looks good to me.

Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Fixes: 6d60f89691fc9 ("tools/rv: Add in-kernel monitor interface")

On a side note, you sent 2 patches and apparently sent them both twice
(did you issue git send-email twice? They seem equivalent to me). Next
time you could merge them into the same series by preparing them on the
same branch and passing them all to git format-patch/send-email [1]. In
general you'd also add a cover letter, which can be very simple in this
case.
That's usually tidier and easier to apply for maintainers/reviewers.
(You can ignore it this time)

Also add the Fixes: tag if you're fixing something (e.g. a potential
buffer overflow in this case), I did it for you now but you can find
the commit you're fixing using git blame.

[1] -
https://www.kernel.org/doc/html/latest/process/submitting-patches.html

> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>
> ---
>  tools/verification/rv/src/in_kernel.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/verification/rv/src/in_kernel.c
> b/tools/verification/rv/src/in_kernel.c
> index 4bb746ea6..d32453824 100644
> --- a/tools/verification/rv/src/in_kernel.c
> +++ b/tools/verification/rv/src/in_kernel.c
> @@ -215,10 +215,11 @@ static int ikm_fill_monitor_definition(char
> *name, struct monitor *ikm, char *co
>  		return -1;
>  	}
>  
> -	strncpy(ikm->name, nested_name, MAX_DA_NAME_LEN);
> +	strncpy(ikm->name, nested_name, sizeof(ikm->name) - 1);
> +	ikm->name[sizeof(ikm->name) - 1] = '\0';
>  	ikm->enabled = enabled;
> -	strncpy(ikm->desc, desc, MAX_DESCRIPTION);
> -
> +	strncpy(ikm->desc, desc, sizeof(ikm->desc) - 1);
> +	ikm->desc[sizeof(ikm->desc) - 1] = '\0';
>  	free(desc);
>  
>  	return 0;


^ permalink raw reply

* Re: [PATCH] tools/rv: harden monitor name lookup bounds checks
From: Gabriele Monaco @ 2026-04-27  9:38 UTC (permalink / raw)
  To: unknownbbqrx; +Cc: rostedt, linux-trace-kernel, linux-kernel
In-Reply-To: <69972ccf-31ee-4906-9907-0ead76bd60b9@smtp-relay.sendinblue.com>

On Thu, 2026-04-23 at 17:44 +0300, unknownbbqrx wrote:
> 
> Bound monitor-name derived copies in __ikm_find_monitor_name() and
> avoid unbounded writes from sprintf()/memcpy().
> 
> Pass the output buffer size from the caller, validate extracted line
> length from rv/available_monitors, and use snprintf() with truncation
> checks when building container monitor names.
> 
> Signed-off-by: unknownbbqrx <dev@unknownbbqr.xyz>

Hi,

thanks for the fix, however __ikm_find_monitor_name() is already a bit
sloppy (strstr can take any substring as a valid monitor) so I have a
patch to refactor it, which I'm about to send.
This will make your fix obsolete so I'm likely not going to take this
patch.

Thanks anyway,
Gabriele

> ---
>  tools/verification/rv/src/in_kernel.c | 34 +++++++++++++++++++++----
> --
>  1 file changed, 27 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/verification/rv/src/in_kernel.c
> b/tools/verification/rv/src/in_kernel.c
> index d32453824..f17eac9b6 100644
> --- a/tools/verification/rv/src/in_kernel.c
> +++ b/tools/verification/rv/src/in_kernel.c
> @@ -56,9 +56,12 @@ static int __ikm_read_enable(char *monitor_name)
>   * The string out_name is populated with the full name, which can be
>   * equal to monitor_name or container/monitor_name if nested
>   */
> -static int __ikm_find_monitor_name(char *monitor_name, char
> *out_name)
> +static int __ikm_find_monitor_name(char *monitor_name, char
> *out_name,
> +				  size_t out_name_size)
>  {
> -	char *available_monitors, container[MAX_DA_NAME_LEN+1],
> *cursor, *end;
> +	char *available_monitors, container[MAX_DA_NAME_LEN + 2],
> *cursor, *end;
> +	size_t len;
> +	int n;
>  	int retval = 1;
>  
>  	available_monitors = tracefs_instance_file_read(NULL,
> "rv/available_monitors", NULL);
> @@ -72,17 +75,34 @@ static int __ikm_find_monitor_name(char
> *monitor_name, char *out_name)
>  	}
>  
>  	for (; cursor > available_monitors; cursor--)
> -		if (*(cursor-1) == '\n')
> +		if (*(cursor - 1) == '\n')
>  			break;
> +
>  	end = strstr(cursor, "\n");
> -	memcpy(out_name, cursor, end-cursor);
> -	out_name[end-cursor] = '\0';
> +	if (!end) {
> +		retval = -1;
> +		goto out_free;
> +	}
> +
> +	len = end - cursor;
> +	if (len >= out_name_size) {
> +		retval = -1;
> +		goto out_free;
> +	}
> +
> +	memcpy(out_name, cursor, len);
> +	out_name[len] = '\0';
>  
>  	cursor = strstr(out_name, ":");
>  	if (cursor)
>  		*cursor = '/';
>  	else {
> -		sprintf(container, "%s:", monitor_name);
> +		n = snprintf(container, sizeof(container), "%s:",
> monitor_name);
> +		if (n < 0 || (size_t)n >= sizeof(container)) {
> +			retval = -1;
> +			goto out_free;
> +		}
> +
>  		if (strstr(available_monitors, container))
>  			config_is_container = 1;
>  	}
> @@ -782,7 +802,7 @@ int ikm_run_monitor(char *monitor_name, int argc,
> char **argv)
>  	else
>  		nested_name = monitor_name;
>  
> -	retval = __ikm_find_monitor_name(monitor_name, full_name);
> +	retval = __ikm_find_monitor_name(monitor_name, full_name,
> sizeof(full_name));
>  	if (!retval)
>  		return 0;
>  	if (retval < 0) {
> 
> base-commit: 2e68039281932e6dc37718a1ea7cbb8e2cda42e6
> prerequisite-patch-id: b61dd51dee390277603975bf729a687113185c3a


^ permalink raw reply

* Re: [PATCH] Documentation/rv: Replace stale website link
From: Jonathan Corbet @ 2026-04-27  9:44 UTC (permalink / raw)
  To: Gabriele Monaco, rdunlap, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel, linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <20260427085526.111835-1-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> The sched monitor page was linking to Daniel's website which is now
> down. The main purpose of the link was to point to a source for the
> models from the original author and that can be found also in his
> published paper.
>
> Replace the link with a reference to Daniel's "A thread synchronization
> model for the PREEMPT_RT Linux kernel" which can be found online and
> includes the models definitions as well as the work behind them (not the
> original patches but since they're based on a 5.0 kernel and are mostly
> included upstream, there's little value in keeping them in the docs).
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>  Documentation/trace/rv/monitor_sched.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
> index 0b96d6e147c6..661171bd7c5e 100644
> --- a/Documentation/trace/rv/monitor_sched.rst
> +++ b/Documentation/trace/rv/monitor_sched.rst
> @@ -365,4 +365,4 @@ constraints when processing the events::
>  References
>  ----------
>  
> -[1] - https://bristot.me/linux-task-model
> +[1] - Daniel Bristot de Oliveira et al.: A thread synchronization model for the PREEMPT_RT Linux kernel, J. Syst. Archit., 2020.

Since, as you say, it can be found online, is there a reason not to
include a link here?

jon

^ permalink raw reply

* Re: [PATCH] Documentation: fix spelling mistake "stucture" -> "structure"
From: Jonathan Corbet @ 2026-04-27  9:56 UTC (permalink / raw)
  To: Ninad Naik, rostedt, mhiramat, mathieu.desnoyers, skhan
  Cc: linux-trace-kernel, linux-doc, linux-kernel, me,
	linux-kernel-mentees, Ninad Naik
In-Reply-To: <20260419184527.779828-1-ninadnaik07@gmail.com>

Ninad Naik <ninadnaik07@gmail.com> writes:

> Fixing a spelling mistake in Documentation/trace/histogram-design.rst.
>
> Signed-off-by: Ninad Naik <ninadnaik07@gmail.com>
> ---
>  Documentation/trace/histogram-design.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
> index e92f56ebd0b5..41a726cd3536 100644
> --- a/Documentation/trace/histogram-design.rst
> +++ b/Documentation/trace/histogram-design.rst
> @@ -247,7 +247,7 @@ field's size and offset, is used to grab that subkey's data from the
>  current trace record.
>  
>  Note, the hist field function use to be a function pointer in the
> -hist_field stucture. Due to spectre mitigation, it was converted into
> +hist_field structure. Due to spectre mitigation, it was converted into
>  a fn_num and hist_fn_call() is used to call the associated hist field

Applied, thanks.

jon

^ permalink raw reply

* Re: [PATCH] Documentation/rv: Replace stale website link
From: Gabriele Monaco @ 2026-04-27  9:57 UTC (permalink / raw)
  To: Jonathan Corbet, rdunlap, Steven Rostedt, linux-trace-kernel,
	linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <875x5crb4g.fsf@trenco.lwn.net>

On Mon, 2026-04-27 at 03:44 -0600, Jonathan Corbet wrote:
> Since, as you say, it can be found online, is there a reason not to
> include a link here?

Mmh, perhaps being overly cautious for the link not to break again?

The paper is published so I assume it's always going to be available in
some way. It is currently hosted by the university at [1], which may be
unlikely to change, and can be found via DOI at [2], which should never
change (at least that's what I believe a DOI is for) but leads to the
publisher's website rather than the open-access PDF.

I think the reference to the paper I included is robust yet easy to use
with any scientific or even general purpose search engine. But if you
believe using either of the two links is more appropriate, I can send a
V2 with the change.

Thanks,
Gabriele

[1] -
https://www.iris.sssup.it/bitstream/11382/533630/1/Elsevier-JSA-2020.pdf
[2] - https://doi.org/10.1016/j.sysarc.2020.101729


^ permalink raw reply

* Re: [PATCH] Documentation/rv: Replace stale website link
From: Jonathan Corbet @ 2026-04-27 10:09 UTC (permalink / raw)
  To: Gabriele Monaco, rdunlap, Steven Rostedt, linux-trace-kernel,
	linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <6d7e529c7cb0ad599669e3f33e5b6168e92a8861.camel@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> On Mon, 2026-04-27 at 03:44 -0600, Jonathan Corbet wrote:
>> Since, as you say, it can be found online, is there a reason not to
>> include a link here?
>
> Mmh, perhaps being overly cautious for the link not to break again?
>
> The paper is published so I assume it's always going to be available in
> some way. It is currently hosted by the university at [1], which may be
> unlikely to change, and can be found via DOI at [2], which should never
> change (at least that's what I believe a DOI is for) but brings to the
> publisher's website rather than the open-access PDF.
>
> I think the reference to the paper I included is robust yet easy to use
> with any scientific or even general purpose search engine. But if you
> believe using either of the two links is more appropriate, I can send a
> V2 with the change.

I will defer to others in the end, but to me it seems that we should
make life easier for our readers whenever we can.  Providing a link
seems better than requiring them to search for it themselves.

Thanks,

jon

^ permalink raw reply

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jeff Layton @ 2026-04-27 10:44 UTC (permalink / raw)
  To: Ritesh Harjani, Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Christoph Hellwig, Kairui Song, Qi Zheng, Shakeel Butt,
	Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever
  Cc: linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <qzo1s6a4.ritesh.list@gmail.com>

On Mon, 2026-04-27 at 04:01 +0530, Ritesh Harjani wrote:
> Jeff Layton <jlayton@kernel.org> writes:
> 
> > The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> > filemap_flush_range() on every write, submitting writeback inline in
> > the writer's context.  Perf lock contention profiling shows the
> > performance problem is not lock contention but the writeback submission
> > work itself — walking the page tree and submitting I/O blocks the writer
> > for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> > (dontcache).
> > 
> > Replace the inline filemap_flush_range() call with a flusher kick that
> > drains dirty pages in the background.  This moves writeback submission
> > completely off the writer's hot path.
> > 
> > To avoid flushing unrelated buffered dirty data, add a dedicated
> > WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> > the new NR_DONTCACHE_DIRTY counter to determine how many pages to write
> > back.  The flusher writes back that many pages from the oldest dirty
> > inodes (not restricted to dontcache-specific inodes). This helps
> > preserve I/O batching while limiting the scope of expedited writeback.
> > 
> 
> Yup, so we wake up the writeback flusher, which will write that many
> dirty pages. The pages written back can be of any type, though: either
> DONTCACHE or normal (non-dontcache) dirty pages. IIUC, writeback doesn't
> distinguish between them while writing.
> 

Correct. This was the approach that Jan and HCH suggested in the
responses to the last posting.

> 
> IMO, the commit msg could also explain why the above approach was
> taken. IIUC, that is because writing NR_DONTCACHE_DIRTY pages still
> reduces page cache pressure and the amount of work that reclaim has
> to do, even though some of those pages may be non-dontcache pages if
> there was a parallel buffered write in the system.
> 

Good suggestion. I'll add that.

> 
> Also, should the following change be documented somewhere, maybe in the
> man page? i.e.
> Earlier, RWF_DONTCACHE writes made sure that the dirty pages were
> immediately submitted for writeback, and completion would release those
> pages. But now, in certain cases when there are mixed buffered writes in
> the system, those dontcache dirty pages might be written back after a
> delay (whenever writeback next kicks in).
> However, for RWF_DONTCACHE reads, it should not affect anything.
> 

Looks like DONTCACHE is documented in the preadv/writev manpage. Here's
the current blurb about writes:

    Additionally, any range dirtied by a write operation with
    RWF_DONTCACHE set will get kicked off for writeback.  This is
    similar to calling sync_file_range(2) with SYNC_FILE_RANGE_WRITE
    to start writeback on the given range.  RWF_DONTCACHE is a hint,
    or best effort, where no hard guarantees are given on the state of
    the page cache once the operation completes.

I don't think this verbiage is invalid after this change. Kicking off
writeback is still just a hint, like it was before. We could mention
how that I/O can compete with regular buffered I/O, but that seems
like adding info that will just be confusing for users.

> > Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> > DONTCACHE writes into a single flusher wakeup without per-write
> > allocations.
> > 
> > Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> > visibility, and target the correct cgroup writeback domain via
> > unlocked_inode_to_wb_begin().
> > 
> > dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> > RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> > ~503 GB, compared to a v6.19-ish baseline):
> > 
> 
> Can we please also test parallel buffered writes and dontcache writes? 
> Since this patch series definitely affects that.
>
> BTW - adding these numbers in the commit msg itself is very helpful.
> 

To be clear, this only affects DONTCACHE, not normal buffered writes,
but I guess you're referring to the fact that DONTCACHE and buffered
writes can compete now.

Can you clarify specifically what you'd like me to test here? Are you
saying you want me to test DONTCACHE and buffered writes together at the
same time (i.e. make them compete)?

I should be able to do that for the local benchmarks, but nfsd's iomode
settings are global and that won't be possible there.

> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              1449.8     1440.1      -0.7%
> >   dontcache             1347.9     1461.5      +8.4%
> >   direct                1450.0     1440.1      -0.7%
> > 
> >   Single-client sequential write latency (us):
> >                        baseline    patched     change
> >   dontcache p50         3031.0    10551.3    +248.1%
> >   dontcache p99        74973.2    21626.9     -71.2%
> >   dontcache p99.9      85459.0    23199.7     -72.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              284.2      295.4      +3.9%
> > 
> >   Single-client random write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache             2277.4      872.4     -61.7%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> 
> Can you please describe this test scenario if possible? Above, you
> mentioned we are writing a file_size of 2x RAM_SIZE, but your
> multi-client test says something else:
> 
> local num_clients=4
> +	mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
> +	client_size="$(( mem_kb / 1024 / num_clients ))M"
> 
> Also the multi-writer case is spawning parallel fio jobs, and then
> parsing and aggregating the bandwidth results instead of using fio to
> spawn multiple parallel threads... which is ok, but a bit weird.
> Why not let fio do the aggregate bandwidth, and latency calculation
> instead?
> 

That's what I get for asking Claude to roll a testsuite. I'm not that
well-versed in fio, but that makes sense. I'll have a look at reworking
it along those lines.

> >                        baseline    patched     change
> >   buffered              1619.5     1611.2      -0.5%
> >   dontcache             1281.1     1629.4     +27.2%
> >   direct                1545.4     1609.4      +4.1%
> > 
> >   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1297.6     1471.1     +13.4%
> >   readers avg (MB/s)     855.0      462.4     -45.9%
> > 
> > nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> > NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> > file size ~502 GB, compared to v6.19-ish baseline):
> > 
> >   Single-client sequential write (MB/s):
> >                        baseline    patched     change
> >   buffered              4844.2     4653.4      -3.9%
> >   dontcache             3028.3     3723.1     +22.9%
> >   direct                 957.6      987.8      +3.2%
> > 
> >   Single-client sequential write p99.9 latency (us):
> >                        baseline    patched     change
> >   dontcache            759169.0   175112.2     -76.9%
> > 
> >   Single-client random write (MB/s):
> >                        baseline    patched     change
> >   dontcache              590.0     1561.0    +164.6%
> > 
> >   Multi-writer aggregate throughput (MB/s):
> >                        baseline    patched     change
> >   buffered              9636.3     9422.9      -2.2%
> >   dontcache             1894.9     9442.6    +398.3%
> >   direct                 809.6      975.1     +20.4%
> > 
> >   Noisy neighbor (dontcache writer + random readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1854.5     4063.6    +119.1%
> >   readers avg (MB/s)     131.2      101.6     -22.5%
> > 
> > The NFS results show even larger improvements than the local benchmarks.
> > Multi-writer dontcache throughput improves nearly 5x, matching buffered
> > I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> > buffered.
> > 
> 
> Nice :)
> Could you add some explanation here of why there is a 5x improvement
> with NFS compared to local filesystems?
> (I am not very familiar with the NFS side, but a possible reason would help)
> 

I suspect that it's because of the "scattered" nature of nfsd writes.
When the client sends a write to nfsd, we wake a nfsd thread to service
it. So, if there are a lot of writes operating in parallel, they all
get done in the context of different tasks.

My hunch is that this I/O pattern (writing to the same file from a bunch
of different threads) particularly suffers from the DONTCACHE inline
write behavior. The threads all end up competing to submit jobs to the
queue, and that causes the performance to fall off sharply.

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jeff Layton @ 2026-04-27 10:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Mike Snitzer, Jens Axboe, Ritesh Harjani, Christoph Hellwig,
	Kairui Song, Qi Zheng, Shakeel Butt, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chuck Lever, linux-fsdevel, linux-kernel,
	linux-nfs, linux-mm, linux-trace-kernel
In-Reply-To: <ae55M8xBIVYZXPFN@casper.infradead.org>

On Sun, 2026-04-26 at 21:44 +0100, Matthew Wilcox wrote:
> On Sun, Apr 26, 2026 at 07:56:08AM -0400, Jeff Layton wrote:
> >   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
> >                        baseline    patched     change
> >   writer (MB/s)         1297.6     1471.1     +13.4%
> >   readers avg (MB/s)     855.0      462.4     -45.9%
> 
> hm.  This wasn't what I thought of when I thought of "noisy neighbour".
> I'd have process A doing DONTCACHE writes to file A and process B doing
> normal buffered writes to file B.

Originally, I was benchmarking this via nfsd and only later did I add
the suite for local benchmarks. With nfsd, setting the iomode affects
all reads or writes.

So initially, I had it testing them with both reads and writes set to
the same setting, but then later I decided to play with different modes
for reads and writes. The best performing one was buffered reads +
dontcache writes. It's possible a mix of different modes will be better
on a local fs.

I can't easily do a benchmark like you're suggesting with nfsd, but it
should be possible on a local benchmark. I'll see what I can come up
with.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply

* [PATCH] fprobe: Add unregister_fprobe_sync() for synchronous unregistration
From: Masami Hiramatsu (Google) @ 2026-04-27 12:09 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Mathieu Desnoyers, Jonathan Corbet, linux-kernel,
	linux-trace-kernel, linux-doc

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Currently, unregister_fprobe() removes the ftrace hooks but does not
wait for the RCU grace period to expire. This is efficient for batch
unregistration of multiple fprobes (to avoid multiple RCU grace period
latencies), but it leaves a window where probe handlers might still be
running on other CPUs after the function returns.
If a caller needs to free the fprobe structure or unload the module
immediately after unregistration, they must manually call
synchronize_rcu() to prevent use-after-free issues.

To simplify this use case, introduce unregister_fprobe_sync(). This
function unregisters the fprobe and waits for the RCU grace period to
complete before returning.

Also, update the documentation of unregister_fprobe() to clarify its
non-blocking behavior and suggest using unregister_fprobe_sync() for the
last probe in a batch. Finally, update the fprobe sample module to use
the synchronous version on exit to ensure safe module unloading, and use
it in the unexpected-error path of trace_fprobe as well.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Documentation/trace/fprobe.rst  |   15 ++++++++++++---
 include/linux/fprobe.h          |    5 +++++
 kernel/trace/fprobe.c           |   30 ++++++++++++++++++++++++++++++
 kernel/trace/trace_fprobe.c     |    9 +++++++--
 samples/fprobe/fprobe_example.c |    2 +-
 5 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 95998b189ae3..eee4860ab29a 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -65,6 +65,12 @@ To disable (remove from functions) this fprobe, call::
 
   unregister_fprobe(&fp);
 
+Or if you need to wait for the RCU grace period to ensure no handlers
+are running on any CPU (e.g., before freeing the `fprobe` structure),
+use::
+
+  unregister_fprobe_sync(&fp);
+
 You can temporally (soft) disable the fprobe by::
 
   disable_fprobe(&fp);
@@ -81,9 +87,12 @@ Same as ftrace, the registered callbacks will start being called some time
 after the register_fprobe() is called and before it returns. See
 Documentation/trace/ftrace.rst.
 
-Also, the unregister_fprobe() will guarantee that both enter and exit
-handlers are no longer being called by functions after unregister_fprobe()
-returns as same as unregister_ftrace_function().
+Also, the `unregister_fprobe_sync()` will guarantee that both enter and exit
+handlers are no longer being called by functions after it returns.
+On the other hand, `unregister_fprobe()` does not wait for the RCU grace period,
+so handlers might still be running on other CPUs for a short time after it returns.
+This is useful when you unregister multiple fprobes in a batch to avoid
+waiting for the RCU grace period for each one.
 
 The fprobe entry/exit handler
 =============================
diff --git a/include/linux/fprobe.h b/include/linux/fprobe.h
index 0a3bcd1718f3..6ae452e250a1 100644
--- a/include/linux/fprobe.h
+++ b/include/linux/fprobe.h
@@ -94,6 +94,7 @@ int register_fprobe(struct fprobe *fp, const char *filter, const char *notfilter
 int register_fprobe_ips(struct fprobe *fp, unsigned long *addrs, int num);
 int register_fprobe_syms(struct fprobe *fp, const char **syms, int num);
 int unregister_fprobe(struct fprobe *fp);
+int unregister_fprobe_sync(struct fprobe *fp);
 bool fprobe_is_registered(struct fprobe *fp);
 int fprobe_count_ips_from_filter(const char *filter, const char *notfilter);
 #else
@@ -113,6 +114,10 @@ static inline int unregister_fprobe(struct fprobe *fp)
 {
 	return -EOPNOTSUPP;
 }
+static inline int unregister_fprobe_sync(struct fprobe *fp)
+{
+	return -EOPNOTSUPP;
+}
 static inline bool fprobe_is_registered(struct fprobe *fp)
 {
 	return false;
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index cc49ebd2a773..5f3e48385a47 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -1097,6 +1097,9 @@ static int unregister_fprobe_nolock(struct fprobe *fp)
  * @fp: A fprobe data structure to be unregistered.
  *
  * Unregister fprobe (and remove ftrace hooks from the function entries).
+ * Note: This function does not wait for RCU grace period, since user
+ * may use several fprobes (and then unregister them one by one). In that
+ * case, it is recommended to use unregister_fprobe_sync() for the last fprobe.
  *
  * Return 0 if @fp is unregistered successfully, -errno if not.
  */
@@ -1110,6 +1113,33 @@ int unregister_fprobe(struct fprobe *fp)
 }
 EXPORT_SYMBOL_GPL(unregister_fprobe);
 
+/**
+ * unregister_fprobe_sync() - Unregister fprobe synchronously with RCU grace period.
+ * @fp: A fprobe data structure to be unregistered.
+ *
+ * Unregister fprobe (and remove ftrace hooks from the function entries) and
+ * wait for the RCU grace period to finish. This is useful for preventing
+ * the fprobe from being used after it is unregistered.
+ *
+ * Return 0 if @fp is unregistered successfully, -errno if not.
+ */
+int unregister_fprobe_sync(struct fprobe *fp)
+{
+	int ret;
+
+	guard(mutex)(&fprobe_mutex);
+	if (!fp || !fprobe_registered(fp))
+		return -EINVAL;
+
+	ret = unregister_fprobe_nolock(fp);
+	if (ret)
+		return ret;
+
+	synchronize_rcu();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(unregister_fprobe_sync);
+
 static int __init fprobe_initcall(void)
 {
 	rhltable_init(&fprobe_ip_table, &fprobe_rht_params);
diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
index 9f5f08c0e7c2..fa5b41f7f306 100644
--- a/kernel/trace/trace_fprobe.c
+++ b/kernel/trace/trace_fprobe.c
@@ -845,8 +845,13 @@ static int __register_trace_fprobe(struct trace_fprobe *tf)
 /* Internal unregister function - just handle fprobe and flags */
 static void __unregister_trace_fprobe(struct trace_fprobe *tf)
 {
-	if (trace_fprobe_is_registered(tf))
-		unregister_fprobe(&tf->fp);
+	/*
+	 * Here, @tf must NOT be busy, so it MUST be unregistered already.
+	 * But if it is unexpectedly registered, unregister it synchronously.
+	 */
+	if (WARN_ON_ONCE(trace_fprobe_is_registered(tf)))
+		unregister_fprobe_sync(&tf->fp);
+
 	if (tf->tuser) {
 		tracepoint_user_put(tf->tuser);
 		tf->tuser = NULL;
diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c
index bfe98ce826f3..382d2f67672a 100644
--- a/samples/fprobe/fprobe_example.c
+++ b/samples/fprobe/fprobe_example.c
@@ -142,7 +142,7 @@ static int __init fprobe_init(void)
 
 static void __exit fprobe_exit(void)
 {
-	unregister_fprobe(&sample_probe);
+	unregister_fprobe_sync(&sample_probe);
 
 	pr_info("fprobe at %s unregistered. %ld times hit, %ld times missed\n",
 		symbol, nhit, sample_probe.nmissed);
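
For reference, the batch pattern mentioned in the documentation hunk can be
sketched in userspace as below. This is only an illustration under
assumptions: struct fprobe and both unregister functions are simplified
stand-ins rather than the kernel implementations, grace_period_waits
replaces the real synchronize_rcu() call, and unregister_fprobe_batch() is
a hypothetical helper that is not part of this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins so that the pattern can be compiled and run. */
struct fprobe {
	int registered;
};

static int grace_period_waits;	/* counts simulated synchronize_rcu() calls */

static int unregister_fprobe(struct fprobe *fp)
{
	if (!fp || !fp->registered)
		return -1;	/* the kernel side returns -EINVAL here */
	fp->registered = 0;
	return 0;
}

static int unregister_fprobe_sync(struct fprobe *fp)
{
	int ret = unregister_fprobe(fp);

	if (ret)
		return ret;
	grace_period_waits++;	/* stands in for synchronize_rcu() */
	return 0;
}

/* Hypothetical helper: wait for a grace period only on the last probe. */
static int unregister_fprobe_batch(struct fprobe *fps, size_t num)
{
	int ret = 0;

	assert(fps || !num);
	for (size_t i = 0; i < num; i++) {
		int err = (i + 1 == num) ? unregister_fprobe_sync(&fps[i])
					 : unregister_fprobe(&fps[i]);

		if (err)
			ret = err;	/* remember the failure, keep going */
	}
	return ret;
}
```

With N registered probes this waits once instead of N times. Note that in
this sketch no wait happens if unregistering the last probe fails, so a
real caller would need to handle that case before freeing anything.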


^ permalink raw reply related

* Re: [PATCH 0/9] rtla/tests: Extend runtime test coverage
From: Wander Lairson Costa @ 2026-04-27 12:20 UTC (permalink / raw)
  To: Tomas Glozar
  Cc: Steven Rostedt, John Kacur, Luis Goncalves, Crystal Wood,
	Costa Shulyupin, LKML, linux-trace-kernel
In-Reply-To: <20260423130558.882022-1-tglozar@redhat.com>

On Thu, Apr 23, 2026 at 03:05:49PM +0200, Tomas Glozar wrote:
> This patchset introduces some new tests to cover more options, especially
> histogram and thread options. Most of the new tests use positive and negative
> output matches, sometimes in combination with action scripts, to verify that
> RTLA is applying the settings correctly.
> 
> Tests were reorganized a little, adding two new sections: thread tests and
> histogram tests, next to basic tests.
> 
> Additionally, coverage of existing tests is extended by adding new matches and
> by extending tests to cover both top and hist tools where possible. For the
> latter, new helpers check_top_hist and check_top_q_hist are added to engine.sh.
> 
> As part of the new action scripts, detection of measurement threads is made more
> robust by following child processes of either RTLA (user workload) or kthreadd
> (kernel workload) rather than grepping through the comms of all processes, which
> might have led to false positives.
> 
> These changes significantly improve test coverage and make the test suite more
> robust against false positives from unrelated processes.

Reviewed-by: Wander Lairson Costa <wander@redhat.com>

> 
> Tomas Glozar (9):
>   rtla/tests: Cover both top and hist tools where possible
>   rtla/tests: Add get_workload_pids() helper
>   rtla/tests: Check -c/--cpus thread affinity
>   rtla/tests: Use negative match when testing --aa-only
>   rtla/tests: Extend timerlat top --aa-only coverage
>   rtla/tests: Cover all hist options in runtime tests
>   rtla/tests: Add runtime test for -H/--house-keeping
>   rtla/tests: Add runtime test for -k and -u options
>   rtla/tests: Add runtime tests for -C/--cgroup
> 
>  tools/tracing/rtla/tests/engine.sh            |  15 +++
>  tools/tracing/rtla/tests/osnoise.t            |  73 +++++++----
>  .../rtla/tests/scripts/check-cgroup-match.sh  |  17 +++
>  .../tracing/rtla/tests/scripts/check-cpus.sh  |   9 ++
>  .../tests/scripts/check-housekeeping-cpus.sh  |   4 +
>  .../rtla/tests/scripts/check-priority.sh      |   8 +-
>  .../scripts/check-user-kernel-threads.sh      |  16 +++
>  .../tests/scripts/lib/get_workload_pids.sh    |  11 ++
>  tools/tracing/rtla/tests/timerlat.t           | 113 +++++++++++-------
>  9 files changed, 194 insertions(+), 72 deletions(-)
>  create mode 100755 tools/tracing/rtla/tests/scripts/check-cgroup-match.sh
>  create mode 100755 tools/tracing/rtla/tests/scripts/check-cpus.sh
>  create mode 100755 tools/tracing/rtla/tests/scripts/check-housekeeping-cpus.sh
>  create mode 100755 tools/tracing/rtla/tests/scripts/check-user-kernel-threads.sh
>  create mode 100644 tools/tracing/rtla/tests/scripts/lib/get_workload_pids.sh
> 
> -- 
> 2.53.0
> 


^ permalink raw reply

* Re: [PATCH v3 2/4] mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking
From: Jan Kara @ 2026-04-27 12:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara,
	Matthew Wilcox (Oracle), Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Mike Snitzer, Jens Axboe,
	Ritesh Harjani, Christoph Hellwig, Kairui Song, Qi Zheng,
	Shakeel Butt, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Chuck Lever,
	linux-fsdevel, linux-kernel, linux-nfs, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426-dontcache-v3-2-79eb37da9547@kernel.org>

On Sun 26-04-26 07:56:08, Jeff Layton wrote:
> The IOCB_DONTCACHE writeback path in generic_write_sync() calls
> filemap_flush_range() on every write, submitting writeback inline in
> the writer's context.  Perf lock contention profiling shows the
> performance problem is not lock contention but the writeback submission
> work itself — walking the page tree and submitting I/O blocks the writer
> for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms
> (dontcache).
> 
> Replace the inline filemap_flush_range() call with a flusher kick that
> drains dirty pages in the background.  This moves writeback submission
> completely off the writer's hot path.
> 
> To avoid flushing unrelated buffered dirty data, add a dedicated
> WB_start_dontcache bit and wb_check_start_dontcache() handler that uses
> the new NR_DONTCACHE_DIRTY counter to determine how many pages to write
> back.  The flusher writes back that many pages from the oldest dirty
> inodes (not restricted to dontcache-specific inodes). This helps
> preserve I/O batching while limiting the scope of expedited writeback.
> 
> Like WB_start_all, the WB_start_dontcache bit coalesces multiple
> DONTCACHE writes into a single flusher wakeup without per-write
> allocations.
> 
> Also add WB_REASON_DONTCACHE as a new writeback reason for tracing
> visibility, and target the correct cgroup writeback domain via
> unlocked_inode_to_wb_begin().
> 
> dontcache-bench results on dual-socket Xeon Gold 6138 (80 CPUs, 256 GB
> RAM, Samsung MZ1LB1T9HALS 1.7 TB NVMe, local XFS, io_uring, file size
> ~503 GB, compared to a v6.19-ish baseline):
> 
>   Single-client sequential write (MB/s):
>                        baseline    patched     change
>   buffered              1449.8     1440.1      -0.7%
>   dontcache             1347.9     1461.5      +8.4%
>   direct                1450.0     1440.1      -0.7%
> 
>   Single-client sequential write latency (us):
>                        baseline    patched     change
>   dontcache p50         3031.0    10551.3    +248.1%
>   dontcache p99        74973.2    21626.9     -71.2%
>   dontcache p99.9      85459.0    23199.7     -72.9%
> 
>   Single-client random write (MB/s):
>                        baseline    patched     change
>   dontcache              284.2      295.4      +3.9%
> 
>   Single-client random write p99.9 latency (us):
>                        baseline    patched     change
>   dontcache             2277.4      872.4     -61.7%
> 
>   Multi-writer aggregate throughput (MB/s):
>                        baseline    patched     change
>   buffered              1619.5     1611.2      -0.5%
>   dontcache             1281.1     1629.4     +27.2%
>   direct                1545.4     1609.4      +4.1%
> 
>   Mixed-mode noisy neighbor (dontcache writer + buffered readers):
>                        baseline    patched     change
>   writer (MB/s)         1297.6     1471.1     +13.4%
>   readers avg (MB/s)     855.0      462.4     -45.9%
> 
> nfsd-io-bench results on same hardware (XFS on NVMe, NFSv3 via fio
> NFS engine with libnfs, 1024 NFSD threads, pool_mode=pernode,
> file size ~502 GB, compared to v6.19-ish baseline):
> 
>   Single-client sequential write (MB/s):
>                        baseline    patched     change
>   buffered              4844.2     4653.4      -3.9%
>   dontcache             3028.3     3723.1     +22.9%
>   direct                 957.6      987.8      +3.2%
> 
>   Single-client sequential write p99.9 latency (us):
>                        baseline    patched     change
>   dontcache            759169.0   175112.2     -76.9%
> 
>   Single-client random write (MB/s):
>                        baseline    patched     change
>   dontcache              590.0     1561.0    +164.6%
> 
>   Multi-writer aggregate throughput (MB/s):
>                        baseline    patched     change
>   buffered              9636.3     9422.9      -2.2%
>   dontcache             1894.9     9442.6    +398.3%
>   direct                 809.6      975.1     +20.4%
> 
>   Noisy neighbor (dontcache writer + random readers):
>                        baseline    patched     change
>   writer (MB/s)         1854.5     4063.6    +119.1%
>   readers avg (MB/s)     131.2      101.6     -22.5%
> 
> The NFS results show even larger improvements than the local benchmarks.
> Multi-writer dontcache throughput improves nearly 5x, matching buffered
> I/O. Dirty page footprint drops 85-95% in sequential workloads vs.
> buffered.
> 
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

One comment regarding how the writeback is started:

> +static long wb_check_start_dontcache(struct bdi_writeback *wb)
> +{
> +	long nr_pages;
> +
> +	if (!test_bit(WB_start_dontcache, &wb->state))
> +		return 0;
> +
> +	nr_pages = global_node_page_state(NR_DONTCACHE_DIRTY);
> +	if (nr_pages) {
> +		struct wb_writeback_work work = {
> +			.nr_pages	= wb_split_bdi_pages(wb, nr_pages),
> +			.sync_mode	= WB_SYNC_NONE,
> +			.range_cyclic	= 1,
> +			.reason		= WB_REASON_DONTCACHE,
> +		};
> +
> +		nr_pages = wb_writeback(wb, &work);
> +	}
> +
> +	clear_bit(WB_start_dontcache, &wb->state);
> +	return nr_pages;
> +}

So this will end up splitting NR_DONTCACHE_DIRTY folios among per-cgroup wb
structures based on their writeback bandwidth. This is a reasonable thing
for global writeback where the bandwidth more or less corresponds to the
amount of dirty folios. However with DONTCACHE I expect big differences in
NR_DONTCACHE_DIRTY among different cgroups, not necessarily
corresponding to wb throughput. In particular if you do DONTCACHE writes
from one cgroup and normal writes from another cgroup this will
systematically underestimate the amount needed to write by a factor of
about two.

So I think the stat should be a per bdi_writeback one (instead of a per node
one) which would avoid the need to split the value to wb here.
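
To make the factor of two concrete, here is a small userspace model of the
proportional split. It assumes that wb_split_bdi_pages() scales nr_pages by
this wb's share of the bdi's total write bandwidth, rounding up, and falls
back to the full count when there is no meaningful share. split_pages() and
the numbers below are made up for illustration; this is not the kernel code.

```c
#include <assert.h>

/*
 * Userspace model of a bandwidth-proportional split (assumed to mirror
 * what wb_split_bdi_pages() does).
 */
static long split_pages(long nr_pages, unsigned long this_bw,
			unsigned long tot_bw)
{
	assert(nr_pages >= 0);

	/* No meaningful share: fall back to the full count. */
	if (!tot_bw || this_bw >= tot_bw)
		return nr_pages;

	/* Round up so that a small share never truncates to zero. */
	return (long)(((unsigned long long)nr_pages * this_bw + tot_bw - 1) /
		      tot_bw);
}
```

Take 10000 DONTCACHE dirty pages that all belong to cgroup A, with the wbs
of cgroups A and B each at half of the total bandwidth. Then
split_pages(10000, 500, 1000) gives 5000 for A's wb, i.e. half of what A
actually needs written.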

								Honza


-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] Documentation/rv: Replace stale website link
From: Gabriele Monaco @ 2026-04-27 12:56 UTC (permalink / raw)
  To: Jonathan Corbet, rdunlap, Steven Rostedt, linux-trace-kernel,
	linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <87340gpvdg.fsf@trenco.lwn.net>

On Mon, 2026-04-27 at 04:09 -0600, Jonathan Corbet wrote:
> Gabriele Monaco <gmonaco@redhat.com> writes:
> 
> > On Mon, 2026-04-27 at 03:44 -0600, Jonathan Corbet wrote:
> > > Since, as you say, it can be found online, is there a reason not
> > > to include a link here?
> > 
> > Mmh, perhaps being overly cautious for the link not to break again?
> > 
> > The paper is published so I assume it's always going to be
> > available in some way. It is currently hosted by the university at
> > [1], which may be unlikely to change, and can be found via DOI at
> > [2], which should never change (at least that's what I believe a
> > DOI is for) but brings to the publisher's website rather than the
> > open-access PDF.
> > 
> > I think the reference to the paper I included is robust yet easy to
> > use with any scientific or even general purpose search engine. But
> > if you believe using either of the two links is more appropriate, I
> > can send a V2 with the change.
> 
> I will defer to others in the end, but to me it seems that we should
> make life easier for our readers whenever we can.  Providing a link
> seems better than requiring them to search for it themselves.

Alright, makes sense. I'm going to send a V2 with [1] (the open access
PDF). In the remote case that the link stops working, we can update it.

Thanks,
Gabriele


^ permalink raw reply

* Re: [RFC PATCH 00/19] mm/damon: introduce data attributes monitoring
From: Gutierrez Asier @ 2026-04-27 13:16 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Liam R. Howlett, Andrew Morton, David Hildenbrand,
	Jonathan Corbet, Lorenzo Stoakes, Masami Hiramatsu,
	Mathieu Desnoyers, Michal Hocko, Mike Rapoport, Shuah Khan,
	Shuah Khan, Steven Rostedt, Suren Baghdasaryan, Vlastimil Babka,
	damon, linux-doc, linux-kernel, linux-kselftest, linux-mm,
	linux-trace-kernel
In-Reply-To: <20260426205222.93895-1-sj@kernel.org>

Hi SeongJae,

On 4/26/2026 11:52 PM, SeongJae Park wrote:
> TL; DR
> ======
> 
> Extend DAMON for monitoring general data attributes other than accesses.
> This is for enabling light-weight page type (e.g., belonging cgroup)
> aware monitoring in the short term.  In the long term, this will help
> extend DAMON to multiple access event capture primitives (e.g., page
> faults and PMU) and eventually pivot DAMON to a "Data Attributes
> Monitoring and Operations eNgine".

Very interesting. Looking forward to seeing this in upstream.

> 
> Background: High Cost of Page Level Properties Monitoring
> =========================================================
> 
> DAMON is initially introduced as a Data Access MONitor.  It has been
> extended for not only access monitoring but also data access-aware
> system operations (DAMOS).  But still the monitoring part is only for
> data accesses.
> 
> Data access patterns are good information, but some users need more
> holistic views.  Particularly, users want to show the access pattern
> information together with the types of the memory.  For example, users
> who work for making huge pages efficiently want to know how much of
> DAMON-found hot/cold regions are backed by huge pages.  Users who run
> multiple workloads with different cgroups want to know how much of
> DAMON-found hot/cold regions belong to specific cgroups.
> 
> For the user demand, we developed a DAMOS extension for page level
> properties based monitoring [1], which has landed on 6.14.  Using the
> feature, users can inform the page level data properties that they are
> interested in, in a flexible format that uses DAMOS filters.  Then,
> DAMON applies the filters to each folio of the entire DAMON region and
> lets users know how many bytes of memory in each DAMON region passed the
> given filters.
> 
> This gives page level detailed and deterministic information to users.
> But, because the operation is done at page level, the overhead is
> proportional to the memory size.  It was useful for test or debugging
> purposes on a small number of machines.  But it was obviously too heavy
> to be enabled always on all machines running the real user workloads.
> For real world workloads, it was recommended to use the feature with
> user-space controlled sampling approaches.  For example, users could do
> the page level monitoring only once per hour, on randomly selected one
> percent of machines of their fleet.  If the runtime is long enough and
> the fleet is big enough, it should provide statistically meaningful
> data.
> 
> But users are too busy to implement such controls on their own.
> 
> Data Attributes Monitoring
> ==========================
> 
> Extend DAMON to monitor not only data accesses, but also general data
> attributes.  Do the extension while keeping the main promise of DAMON,
> the bounded and best-effort minimum overhead.
> 
> Allow users to specify what data attributes in addition to the data
> access they want to monitor.  Users can install one 'data probe' per
> data attribute of their interest for this purpose.  The 'data probe'
> should be able to be applied to any memory, and determine if the given
> memory has the appropriate data attribute.  E.g., if memory of physical
> address 42 belongs to cgroup A.  Each 'data probe' is configured with
> filters that are very similar to the DAMOS filters.
> 
> When DAMON checks if each sampling address memory of each region is
> accessed since the last check, it applies data probes if registered.
> Same to the number of access check-positive samples accounting
> (nr_accesses), it accounts the number of each data probe-positive
> samples in another per-region counters array, namely 'probe_hits'. When
> DAMON resets nr_accesses every aggregation interval, it resets
> 'probe_hits' together.
> 
> Users can read 'probe_hits' just before the values are reset.  In this
> way, users can know how many hot/cold memory regions have data
> attributes of their interest.  E.g., 30 percent of this system's hot
> memory is belonging to cgroup A and 80 percent of the hot cgroup A
> memory is backed by huge pages.
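
For reference, the accounting described above can be modelled in userspace
roughly as follows. This is only a sketch under assumptions: only
nr_accesses and probe_hits are names from the cover letter, while
struct region_model, MAX_PROBES, account_sample() and reset_aggregation()
are hypothetical and are not taken from the patches. It also assumes that
every sample is probed regardless of the access check result.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROBES 4	/* cover letter: at most four data probes */

/* Hypothetical stand-in for the per-region counters. */
struct region_model {
	unsigned int nr_accesses;
	unsigned int probe_hits[MAX_PROBES];
};

/* Account one sample: the access check result plus each probe's result. */
static void account_sample(struct region_model *r, bool accessed,
			   const bool *probe_positive, int nr_probes)
{
	assert(nr_probes >= 0 && nr_probes <= MAX_PROBES);

	if (accessed)
		r->nr_accesses++;
	for (int i = 0; i < nr_probes; i++)
		if (probe_positive[i])
			r->probe_hits[i]++;
}

/* End of an aggregation interval: both kinds of counters reset together. */
static void reset_aggregation(struct region_model *r)
{
	r->nr_accesses = 0;
	for (int i = 0; i < MAX_PROBES; i++)
		r->probe_hits[i] = 0;
}
```

A reader of the counters would then sample them right before
reset_aggregation() runs, as the cover letter describes.
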
> 
> Patches Sequence
> ================
> 
> First eight patches implement the core feature, interface and the
> working support.  Patch 1 introduces data probe data structure, namely
> damon_probe.  Patch 2 extends damon_ctx for installing data probes.
> Patch 3 introduces another data structure for filters of each data
> probe, namely damon_filter.  Patch 4 updates damon_ctx commit function
> to handle the probes.  Patch 5 extends damon_region for the per-region
> per-probe positive samples counter, namely probe_hits.  Patch 6 extends
> damon_operations for applying probes on the underlying DAMON operations
> implementation.  Patch 7 updates kdamond_fn() to invoke the probes
> applying callback.  Patch 8 finally implements the probes support on
> paddr ops.
> 
> Eight changes for user interface (patches 9-16) come next.  Patches 9-13
> implement sysfs directories and files for setting data probes, namely
> probes directory, probe directory, filters directory, filter directory
> and filter directory internal files, respectively.  Patch 14 connects
> the user inputs that are made via the sysfs files to DAMON core.
> Patch 15 implements sysfs files for showing the per-region per-probe
> positive samples count, namely probe_hits.  Patch 16 introduces a new
> tracepoint for showing the counts via tracefs.
> 
> Patch 17 adds a selftest for the sysfs files.
> 
> Patches 18 and 19 document the design and usage of the new feature,
> respectively.
> 
> Discussions
> ===========
> 
> This allows the page properties monitoring with overhead that is low
> enough to be enabled always on real world workloads.  Because the
> sampling time for access check is reused for data attributes check, the
> upper-bounded and best-effort minimum overhead of DAMON is kept.
> Because the sampling memory for access check is reused for data
> attributes check, additional overhead is minimal.
> 
> Still DAMOS-based page level properties monitoring should be useful,
> because it provides a deterministic page level information.  When in
> doubt of the sampling based information, running DAMOS-based one
> together and comparing the results would be useful, for debugging and
> tuning.
> 
> Plan for Dropping RFC tag
> =========================
> 
> The user ABI for reading probe_hits is not yet convincing.  It is
> exposed to users by a tracepoint and new sysfs file.  For the
> tracepoint, a new one namely damon:damon_aggregated_v2 is introduced.
> The name is not convincing, and its internal mechanism seems to have
> room to be improved before dropping RFC.  For the sysfs, a file under
> the DAMOS-tried region directory namely 'probe_hits' is added.  Reading
> it returns four probe_hits values with ',' as a separator.  With the
> maximum number of data probes, this should work.  This can make future
> changes of the limit difficult.  I will try to find a better way before
> dropping the RFC tag.  Maybe 'probe_hits/' directory having files of
> name '0' to 'N-1' for each of user-registered 'N' data probes.
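
As an illustration of what a consumer of the comma-separated format
described above would have to do, here is a userspace sketch. The exact
file contents are an assumption (decimal values, ',' separators, an
optional trailing newline), and parse_probe_hits() and MAX_PROBES are
hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PROBES 4	/* cover letter: at most four data probes */

/*
 * Parse "a,b,c,d" into @hits.  Returns the number of values parsed,
 * or -1 on malformed input or more than MAX_PROBES values.
 */
static int parse_probe_hits(const char *buf, unsigned long hits[MAX_PROBES])
{
	int n = 0;

	assert(buf && hits);
	while (n < MAX_PROBES) {
		char *end;
		unsigned long v = strtoul(buf, &end, 10);

		if (end == buf)
			return -1;	/* no digits where a value should be */
		hits[n++] = v;
		if (*end == '\0' || *end == '\n')
			return n;	/* clean end of input */
		if (*end != ',')
			return -1;	/* unexpected separator */
		buf = end + 1;
	}
	return -1;	/* input continues past MAX_PROBES values */
}
```

A parser like this has the limit baked in, which is one way to see why
changing the maximum number of probes later would be awkward with a
single-file format.
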
> 
> I'm currently hoping to drop the RFC tag by 7.2-rc1.
> 
> Future Works: Short Term
> ========================
> 
> This series is introducing only a single type of data attribute:
> anonymous page.  Once this is landed, I will extend it for
> cgroup-belonging, so that we can do cgroup-level monitoring with low
> overhead.  After that, I may further work on supporting all DAMOS filter
> types.  And as demands are found, we could extend the types.
> 
> This version of implementation is limiting the maximum number of data
> probes to four.  I will try to find a way to remove the limit in future,
> if it is easy to do.  I personally think it should be enough for common
> use cases, though, and therefore not giving high priority at the moment.
> 
> Future Works: Long Term
> =======================
> 
> There are user requests for extending DAMON with detailed access
> information, for example, per-CPUs/threads/read/writes monitoring.  For
> that, I was working [2] on extending DAMON to use page fault events as
> another access check primitives, and making the infrastructure flexible
> for future use of yet another access check primitive.  Actually there is
> another ongoing work [3] for extending DAMON with PMU events.  The
> motivation of the work is reducing the overhead, though.
> 
> In my work [2], I was introducing a new interface for access sampling
> primitives control.  Now I think this data probe interface can be used
> for that, too.  That is, data access becomes just one type of data
> attribute.  Also, pg_idle-confirmed access, page fault-confirmed access,
> and PMU event-confirmed access will be different types of data
> attributes.
> 
> The regions adjustment mechanism is currently working based on the
> access information.  That's because DAMON is designed for data access
> monitoring.  That is, data access information is the primary interest,
> and therefore DAMON adjusts regions in a way that can best-present the
> information.
> 
> Once data access becomes just one of data attributes, there is no reason
> to think data access that special.  There might be some users not
> interested in access at all but want to know the location of memory of
> specific type.  Data probes interface will allow doing that.  Further,
> we could extend the interface to let users set any data attribute as the
> 'primary' attribute.  Then, DAMON will split and merge regions in a way
> that can best-present the 'primary' attributes.
> 
> DAMOS will also be extended, to specify targets based on not only the
> data access pattern, but all user-registered data attributes.  From this
> stage, we may be able to call DAMON as a "Data Attributes Monitoring and
> Operations eNgine".
> 
> [1] https://lore.kernel.org/20250106193401.109161-1-sj@kernel.org
> [2] https://lore.kernel.org/20251208062943.68824-1-sj@kernel.org/
> [3] https://lore.kernel.org/20260423004211.7037-1-akinobu.mita@gmail.com
> 
> SeongJae Park (19):
>   mm/damon/core: introduce struct damon_probe
>   mm/damon/core: embed damon_probe objects in damon_ctx
>   mm/damon/core: introduce damon_filter
>   mm/damon/core: commit probes
>   mm/damon/core: introduce damon_region->probe_hits
>   mm/damon/core: introduce damon_ops->apply_probes
>   mm/damon/core: do data attributes monitoring
>   mm/damon/paddr: support data attributes monitoring
>   mm/damon/sysfs: implement probes dir
>   mm/damon/sysfs: implement probe dir
>   mm/damon/sysfs: implement filters directory
>   mm/damon/sysfs: implement filter dir
>   mm/damon/sysfs: implement filter dir files
>   mm/damon/sysfs: setup probes on DAMON core API parameters
>   mm/damon/sysfs-schemes: implement tried_region/probe_hits file
>   mm/damon: trace probe_hits
>   selftests/damon/sysfs.sh: test probes dir
>   Docs/mm/damon/design: document data attributes monitoring
>   Docs/admin-guide/mm/damon/usage: document data attributes monitoring
> 
>  Documentation/admin-guide/mm/damon/usage.rst |  44 +-
>  Documentation/mm/damon/design.rst            |  37 ++
>  include/linux/damon.h                        |  60 +++
>  include/trace/events/damon.h                 |  41 ++
>  mm/damon/core.c                              | 182 +++++++
>  mm/damon/paddr.c                             |  45 ++
>  mm/damon/sysfs-schemes.c                     |  30 ++
>  mm/damon/sysfs.c                             | 502 +++++++++++++++++++
>  tools/testing/selftests/damon/sysfs.sh       |  48 ++
>  9 files changed, 982 insertions(+), 7 deletions(-)
> 
> 
> base-commit: 8f22aa2e28454419ed2031119ad32ea4a6c9f1f1

My main concern is about potential pollution of sysfs. DAMON is already
complex to set up, with a lot of knobs. Adding more configuration options
may make admin's job more complex.

Do you plan to support this extension in damo user space?

-- 
Asier Gutierrez
Huawei


^ permalink raw reply

* [PATCH v2] Documentation/rv: Replace stale website link
From: Gabriele Monaco @ 2026-04-27 13:17 UTC (permalink / raw)
  To: rdunlap, Steven Rostedt, Gabriele Monaco, Jonathan Corbet,
	linux-trace-kernel, linux-doc, linux-kernel
  Cc: matteo.martelli, skhan

The sched monitor page was linking to Daniel's website which is now
down. The main purpose of the link was to point to a source for the
models from the original author and that can be found also in his
published paper.

Replace the link with a reference to Daniel's "A thread synchronization
model for the PREEMPT_RT Linux kernel" which can be found online and
includes the model definitions as well as the work behind them (not the
original patches but since they're based on a 5.0 kernel and are mostly
included upstream, there's little value in keeping them in the docs).

Fixes: 03abeaa63c08 ("Documentation/rv: Add docs for the sched monitors")
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
V2: Add link to the PDF and fixed RST references

 Documentation/trace/rv/monitor_sched.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
index 0b96d6e147c6..d3ba7edc202f 100644
--- a/Documentation/trace/rv/monitor_sched.rst
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -36,7 +36,7 @@ Specifications
 --------------
 
 The specifications included in sched are currently a work in progress, adapting the ones
-defined in by Daniel Bristot in [1].
+defined by Daniel Bristot in [1]_.
 
 Currently we included the following:
 
@@ -365,4 +365,7 @@ constraints when processing the events::
 References
 ----------
 
-[1] - https://bristot.me/linux-task-model
+.. [1] Daniel Bristot de Oliveira et al.:
+       `A thread synchronization model for the PREEMPT_RT Linux kernel
+       <https://www.iris.sssup.it/bitstream/11382/533630/1/Elsevier-JSA-2020.pdf>`_,
+       J. Syst. Archit., 2020.

base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.53.0


^ permalink raw reply related

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Arun George @ 2026-04-27 12:32 UTC (permalink / raw)
  To: Gregory Price
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman, gost.dev, arungeorge05, cpgs
In-Reply-To: <20260222084842.1824063-1-gourry@gourry.net>

On 22/02/26 03:48AM, Gregory Price wrote:
>Topic type: MM
>
>Presenter: Gregory Price <gourry@gourry.net>
>
>This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
>managed by the buddy allocator but excluded from normal allocations.
>
>I present it with an end-to-end Compressed RAM service (mm/cram.c)
>that would otherwise not be possible (or would be considerably more
>difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
>
>
>TL;DR
>===
>
Appreciate the work as we also chase the same problem statement.
A few queries please.

I see the current support relies on read-only mappings which might
limit the performance. Any particular workload you are targeting with
this (which can tolerate this latency)?

Any deployments you think of where the goal is a capacity expansion
with a compromise in performance?

On the device side, are you targeting devices beyond compressed RAM,
such as memory backed by NAND?

The TL;DR talked about mmap/mbind way of user space allocation from
the private node. But the allocation is controlled by GFP flag
N_MEMORY_PRIVATE. Does the user space path of allocation set this
flag along the way?

And I believe the bear-proof cage might work in normal scenarios, but
may not work in all of them. We might not be able to rely fully on the
control path (backpressure): it could become slow, slower still, or
even die. Should the device respond with something like a 'bus error'
if the host tries to write when it cannot take any more writes?

Are there any workloads (VMs?) where this 'bus error' or a similar
error could be an OK / recoverable scenario?

This is assuming that checking with the device on every operation
(whether it is safe to write or not) could be slow.

--- Arun George

^ permalink raw reply

* Re: [PATCH v2 2/2] module/kallsyms: sort function symbols and use binary search
From: Petr Pavlu @ 2026-04-27 13:51 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-modules, Sami Tolvanen, Luis Chamberlain, linux-kernel,
	linux-trace-kernel, live-patching, Daniel Gomez, Aaron Tomlin,
	Steven Rostedt, Masami Hiramatsu, Jordan Rome, Viktor Malik
In-Reply-To: <20260424091330.GA31168@wp.pl>

On 4/24/26 11:13 AM, Stanislaw Gruszka wrote:
> On Thu, Apr 23, 2026 at 04:00:04PM +0200, Petr Pavlu wrote:
>> On 3/27/26 12:00 PM, Stanislaw Gruszka wrote:
[...]
>>> diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c
>>> index f23126d804b2..d69e99e67707 100644
>>> --- a/kernel/module/kallsyms.c
>>> +++ b/kernel/module/kallsyms.c
>>> @@ -10,6 +10,7 @@
>>>  #include <linux/kallsyms.h>
>>>  #include <linux/buildid.h>
>>>  #include <linux/bsearch.h>
>>> +#include <linux/sort.h>
>>>  #include "internal.h"
>>>  
>>>  /* Lookup exported symbol in given range of kernel_symbols */
>>> @@ -103,6 +104,95 @@ static bool is_core_symbol(const Elf_Sym *src, const Elf_Shdr *sechdrs,
>>>  	return true;
>>>  }
>>>  
>>> +static inline bool is_func_symbol(const Elf_Sym *sym)
>>> +{
>>> +	return sym->st_shndx != SHN_UNDEF && sym->st_size != 0 &&
>>> +	       ELF_ST_TYPE(sym->st_info) == STT_FUNC;
>>> +}
>>> +
>>> +static unsigned int bsearch_func_symbol(struct mod_kallsyms *kallsyms,
>>> +					unsigned long addr,
>>> +					unsigned long *bestval,
>>> +					unsigned long *nextval)
>>> +
>>> +{
>>> +	unsigned int mid, low = 1, high = kallsyms->num_func_syms + 1;
>>> +	unsigned int best = 0;
>>> +	unsigned long thisval;
>>> +
>>> +	while (low < high) {
>>> +		mid = low + (high - low) / 2;
>>> +		thisval = kallsyms_symbol_value(&kallsyms->symtab[mid]);
>>> +
>>> +		if (thisval <= addr) {
>>> +			*bestval = thisval;
>>> +			best = mid;
>>> +			low = mid + 1;
>>
>> If thisval == addr, the search moves to the right and finds the last
>> symbol with the same address. I believe it should do the opposite and
>> return the first symbol to match the behavior of
>> search_kallsyms_symbol().
> 
> In the case of multiple symbols sharing the same address, we have
> to pick one and ignore the others. I don’t think it matters much which
> one is chosen in practice. Also, I expect function symbol addresses
> to be unique, so this shouldn’t be a real issue.

I think that the code should consistently pick the same answer. If
someone uses aliases for their functions, the logic shouldn't
arbitrarily return one of them, but preferably the first one, which
should normally be the actual implementation.
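
To illustrate the "first symbol wins" behaviour, here is a minimal
userspace sketch. It is not the patch's code: it works on a plain
sorted array of addresses, and the helper names (first_ge, first_gt,
find_first_best) are made up for the example. It looks up the greatest
address <= addr and then returns the first entry that has that address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the first element >= key in sorted vals[0..n). */
static size_t first_ge(const unsigned long *vals, size_t n, unsigned long key)
{
	size_t low = 0, high = n;

	assert(vals || n == 0);
	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (vals[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

/* Hypothetical helper: index of the first element > key in sorted vals[0..n). */
static size_t first_gt(const unsigned long *vals, size_t n, unsigned long key)
{
	size_t low = 0, high = n;

	assert(vals || n == 0);
	while (low < high) {
		size_t mid = low + (high - low) / 2;

		if (vals[mid] <= key)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

/*
 * Return the index of the first entry holding the greatest address that is
 * <= addr, or n if every entry is above addr.
 */
static size_t find_first_best(const unsigned long *vals, size_t n,
			      unsigned long addr)
{
	size_t ub = first_gt(vals, n, addr);

	if (ub == 0)
		return n;
	/* vals[ub - 1] is the best address; step back to its first occurrence. */
	return first_ge(vals, n, vals[ub - 1]);
}
```

With addresses { 0x10, 0x20, 0x20, 0x20, 0x30 }, a lookup of 0x20 or
0x25 resolves to index 1, the first of the three aliases, rather than
index 3.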

> 
>>> +		} else {
>>> +			*nextval = thisval;
>>> +			high = mid;
>>> +		}
>>> +	}
>>> +
>>> +	return best;
>>> +}
>>> +
>>> +static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms,
>>> +					unsigned int symnum)
>>> +{
>>> +	return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
>>> +}
>>> +
>>> +static unsigned int search_kallsyms_symbol(struct mod_kallsyms *kallsyms,
>>> +					   unsigned long addr,
>>> +					   unsigned long *bestval,
>>> +					   unsigned long *nextval)
>>> +{
>>> +	unsigned int i, best = 0;
>>> +
>>> +	/*
>>> +	 * Scan for closest preceding symbol and next symbol. (ELF starts
>>> +	 * real symbols at 1). Skip the initial function symbols range
>>> +	 * if num_func_syms is non-zero, those are handled separately for
>>> +	 * the core TEXT segment lookup.
>>> +	 */
>>> +	for (i = 1 + kallsyms->num_func_syms; i < kallsyms->num_symtab; i++) {
>>> +		const Elf_Sym *sym = &kallsyms->symtab[i];
>>> +		unsigned long thisval = kallsyms_symbol_value(sym);
>>> +
>>> +		if (sym->st_shndx == SHN_UNDEF)
>>> +			continue;
>>> +
>>> +		/*
>>> +		 * We ignore unnamed symbols: they're uninformative
>>> +		 * and inserted at a whim.
>>> +		 */
>>> +		if (*kallsyms_symbol_name(kallsyms, i) == '\0' ||
>>> +		    is_mapping_symbol(kallsyms_symbol_name(kallsyms, i)))
>>> +			continue;
>>> +
>>> +		if (thisval <= addr && thisval > *bestval) {
>>> +			best = i;
>>> +			*bestval = thisval;
>>> +		}
>>> +		if (thisval > addr && thisval < *nextval)
>>> +			*nextval = thisval;
>>> +	}
>>> +
>>> +	return best;
>>> +}
>>> +
>>> +static int elf_sym_cmp(const void *a, const void *b)
>>> +{
>>> +	unsigned long val_a = kallsyms_symbol_value((const Elf_Sym *)a);
>>> +	unsigned long val_b = kallsyms_symbol_value((const Elf_Sym *)b);
>>> +
>>> +	if (val_a < val_b)
>>> +		return -1;
>>> +
>>> +	return val_a > val_b;
>>
>> Does this comparison function and the sort() call result in stable
>> sorting? If val_a and val_b are the same, the sorting should preserve
>> the original order.
> 
> The kernel’s sort() implementation is not stable.

Ok, I see it is a heapsort. It would require additional data to keep
information about the original indexes for elf_sym_cmp() to use as
a tiebreaker.
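
A rough userspace sketch of such a tiebreaker, assuming a hypothetical
side structure (struct sym_key) that records each symbol's original
index next to its address. It uses qsort() from libc in place of the
kernel's sort(); with the index as a secondary key, the result no
longer depends on whether the underlying sort is stable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sort key: symbol address plus its original table index. */
struct sym_key {
	unsigned long addr;
	unsigned int orig_idx;
};

static int sym_key_cmp(const void *a, const void *b)
{
	const struct sym_key *ka = a;
	const struct sym_key *kb = b;

	if (ka->addr != kb->addr)
		return ka->addr < kb->addr ? -1 : 1;
	/* Same address: fall back to the original order. */
	if (ka->orig_idx != kb->orig_idx)
		return ka->orig_idx < kb->orig_idx ? -1 : 1;
	return 0;
}

/* Sort keys by address, preserving original order among equal addresses. */
static void sort_sym_keys(struct sym_key *keys, size_t n)
{
	assert(keys || n == 0);
	if (n > 1)
		qsort(keys, n, sizeof(*keys), sym_key_cmp);
}
```

The cost is the extra orig_idx storage per symbol for the duration of
the sort, which is the "additional data" mentioned above.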

> 
>>> +}
>>> +
>>>  /*
>>>   * We only allocate and copy the strings needed by the parts of symtab
>>>   * we keep.  This is simple, but has the effect of making multiple
>>> @@ -115,9 +205,10 @@ void layout_symtab(struct module *mod, struct load_info *info)
>>>  	Elf_Shdr *symsect = info->sechdrs + info->index.sym;
>>>  	Elf_Shdr *strsect = info->sechdrs + info->index.str;
>>>  	const Elf_Sym *src;
>>> -	unsigned int i, nsrc, ndst, strtab_size = 0;
>>> +	unsigned int i, nsrc, ndst, nfunc, strtab_size = 0;
>>>  	struct module_memory *mod_mem_data = &mod->mem[MOD_DATA];
>>>  	struct module_memory *mod_mem_init_data = &mod->mem[MOD_INIT_DATA];
>>> +	bool is_lp_mod = is_livepatch_module(mod);
>>>  
>>>  	/* Put symbol section at end of init part of module. */
>>>  	symsect->sh_flags |= SHF_ALLOC;
>>> @@ -129,12 +220,14 @@ void layout_symtab(struct module *mod, struct load_info *info)
>>>  	nsrc = symsect->sh_size / sizeof(*src);
>>>  
>>>  	/* Compute total space required for the core symbols' strtab. */
>>> -	for (ndst = i = 0; i < nsrc; i++) {
>>> -		if (i == 0 || is_livepatch_module(mod) ||
>>> +	for (ndst = nfunc = i = 0; i < nsrc; i++) {
>>> +		if (i == 0 || is_lp_mod ||
>>>  		    is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
>>>  				   info->index.pcpu)) {
>>>  			strtab_size += strlen(&info->strtab[src[i].st_name]) + 1;
>>>  			ndst++;
>>> +			if (!is_lp_mod && is_func_symbol(src + i))
>>> +				nfunc++;
>>>  		}
>>>  	}
>>>  
>>> @@ -156,6 +249,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
>>>  	mod_mem_init_data->size = ALIGN(mod_mem_init_data->size,
>>>  					__alignof__(struct mod_kallsyms));
>>>  	info->mod_kallsyms_init_off = mod_mem_init_data->size;
>>> +	info->num_func_syms = nfunc;
>>>  
>>>  	mod_mem_init_data->size += sizeof(struct mod_kallsyms);
>>>  	info->init_typeoffs = mod_mem_init_data->size;
>>> @@ -169,7 +263,7 @@ void layout_symtab(struct module *mod, struct load_info *info)
>>>   */
>>>  void add_kallsyms(struct module *mod, const struct load_info *info)
>>>  {
>>> -	unsigned int i, ndst;
>>> +	unsigned int i, di, nfunc, ndst;
>>>  	const Elf_Sym *src;
>>>  	Elf_Sym *dst;
>>>  	char *s;
>>> @@ -178,6 +272,7 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
>>>  	void *data_base = mod->mem[MOD_DATA].base;
>>>  	void *init_data_base = mod->mem[MOD_INIT_DATA].base;
>>>  	struct mod_kallsyms *kallsyms;
>>> +	bool is_lp_mod = is_livepatch_module(mod);
>>>  
>>>  	kallsyms = init_data_base + info->mod_kallsyms_init_off;
>>
>> This code is followed by the initialization of kallsyms:
>>
>> 	kallsyms->symtab = (void *)symsec->sh_addr;
>> 	kallsyms->num_symtab = symsec->sh_size / sizeof(Elf_Sym);
>> 	/* Make sure we get permanent strtab: don't use info->strtab. */
>> 	kallsyms->strtab = (void *)info->sechdrs[info->index.str].sh_addr;
>> 	kallsyms->typetab = init_data_base + info->init_typeoffs;
>>
>> I suggest adding 'kallsyms->num_func_syms = 0;' after the initialization
>> of kallsyms->num_symtab.
> 
> I relied on zeroed memory initialization, but I can add this explicitly
> for clarity.
> 
>>> @@ -194,19 +289,28 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
>>>  	mod->core_kallsyms.symtab = dst = data_base + info->symoffs;
>>>  	mod->core_kallsyms.strtab = s = data_base + info->stroffs;
>>>  	mod->core_kallsyms.typetab = data_base + info->core_typeoffs;
>>> +
>>>  	strtab_size = info->core_typeoffs - info->stroffs;
>>>  	src = kallsyms->symtab;
>>> -	for (ndst = i = 0; i < kallsyms->num_symtab; i++) {
>>> +	ndst = info->num_func_syms + 1;
>>> +
>>> +	for (nfunc = i = 0; i < kallsyms->num_symtab; i++) {
>>>  		kallsyms->typetab[i] = elf_type(src + i, info);
>>> -		if (i == 0 || is_livepatch_module(mod) ||
>>> +		if (i == 0 || is_lp_mod ||
>>>  		    is_core_symbol(src + i, info->sechdrs, info->hdr->e_shnum,
>>>  				   info->index.pcpu)) {
>>>  			ssize_t ret;
>>>  
>>> -			mod->core_kallsyms.typetab[ndst] =
>>> -				kallsyms->typetab[i];
>>> -			dst[ndst] = src[i];
>>> -			dst[ndst++].st_name = s - mod->core_kallsyms.strtab;
>>> +			if (i == 0)
>>> +				di = 0;
>>> +			else if (!is_lp_mod && is_func_symbol(src + i))
>>> +				di = 1 + nfunc++;
>>> +			else
>>> +				di = ndst++;
>>> +
>>> +			mod->core_kallsyms.typetab[di] = kallsyms->typetab[i];
>>> +			dst[di] = src[i];
>>> +			dst[di].st_name = s - mod->core_kallsyms.strtab;
>>>  			ret = strscpy(s, &kallsyms->strtab[src[i].st_name],
>>>  				      strtab_size);
>>>  			if (ret < 0)
>>> @@ -216,9 +320,13 @@ void add_kallsyms(struct module *mod, const struct load_info *info)
>>>  		}
>>>  	}
>>>  
>>> +	WARN_ON_ONCE(nfunc != info->num_func_syms);
>>> +	sort(dst + 1, nfunc, sizeof(Elf_Sym), elf_sym_cmp, NULL);
>>> +
>>
>> The code sorts mod->core_kallsyms.symtab but mod->core_kallsyms.typetab
>> is not reordered accordingly.
> 
> Right, but for function symbols the typetab entries are all 't',
> so swapping them does not change the type value. The 'T' vs 't'
> distinction is handled later when printing (based on export status).
> But a comment explaining why adjusting
> mod->core_kallsyms.typetab is skipped is needed.

Modules can also contain weak functions with elf_type() = 'w'.
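
To show what keeping the two tables in sync could look like, here is a
small userspace sketch. It is an illustration only: it uses plain
arrays instead of Elf_Sym, and a simple insertion sort (quadratic, so
not a replacement for sort()) that moves each type character together
with its address. The function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sort vals[] in ascending order and apply the same moves to types[], so
 * that types[i] still describes vals[i] afterwards. Insertion sort with a
 * strict comparison keeps equal addresses in their original order.
 */
static void sort_syms_with_types(unsigned long *vals, char *types, size_t n)
{
	assert((vals && types) || n == 0);

	for (size_t i = 1; i < n; i++) {
		unsigned long v = vals[i];
		char t = types[i];
		size_t j = i;

		while (j > 0 && vals[j - 1] > v) {
			vals[j] = vals[j - 1];
			types[j] = types[j - 1];
			j--;
		}
		vals[j] = v;
		types[j] = t;
	}
}
```

With a weak function ('w') among regular ones ('t'), the 'w' entry
stays attached to its address after sorting instead of being left at
its old position.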

> 
>>>  	/* Set up to point into init section. */
>>>  	rcu_assign_pointer(mod->kallsyms, kallsyms);
>>>  	mod->core_kallsyms.num_symtab = ndst;
>>> +	mod->core_kallsyms.num_func_syms = nfunc;
>>>  }
>>>  
>>>  #if IS_ENABLED(CONFIG_STACKTRACE_BUILD_ID)
>>> @@ -241,11 +349,6 @@ void init_build_id(struct module *mod, const struct load_info *info)
>>>  }
>>>  #endif
>>>  
>>> -static const char *kallsyms_symbol_name(struct mod_kallsyms *kallsyms, unsigned int symnum)
>>> -{
>>> -	return kallsyms->strtab + kallsyms->symtab[symnum].st_name;
>>> -}
>>> -
>>>  /*
>>>   * Given a module and address, find the corresponding symbol and return its name
>>>   * while providing its size and offset if needed.
>>> @@ -255,7 +358,10 @@ static const char *find_kallsyms_symbol(struct module *mod,
>>>  					unsigned long *size,
>>>  					unsigned long *offset)
>>>  {
>>> -	unsigned int i, best = 0;
>>> +	unsigned int (*search)(struct mod_kallsyms *kallsyms,
>>> +			       unsigned long addr, unsigned long *bestval,
>>> +			       unsigned long *nextval);
>>> +	unsigned int best;
>>>  	unsigned long nextval, bestval;
>>>  	struct mod_kallsyms *kallsyms = rcu_dereference(mod->kallsyms);
>>>  	struct module_memory *mod_mem = NULL;
>>> @@ -266,6 +372,11 @@ static const char *find_kallsyms_symbol(struct module *mod,
>>>  			continue;
>>>  #endif
>>>  		if (within_module_mem_type(addr, mod, type)) {
>>> +			if (type == MOD_TEXT && kallsyms->num_func_syms > 0)
>>> +				search = bsearch_func_symbol;
>>
>> I'm not sure if it is ok to limit the search only to function symbols
>> when the address lies in MOD_TEXT. The text can theoretically contain
>> non-function symbols.
> 
> Yes, the patch assumes that the only valid symbols in the MOD_TEXT
> are functions. If there are defined OBJECT symbols in .text, the patch
> would break lookup for those.
> 
> While it’s theoretically possible (e.g. hand-written assembly placing
> data in .text ?), I’m not sure this is a practical concern. In general,
> having data in executable segments is discouraged for security reasons. 
> 
>> Could this optimization be adjusted to sort all
>> MOD_TEXT symbols (excluding anonymous and mapping symbols) and move them
>> to the front of the symbol table?
> 
> That’s possible. We could track .text sections indices in
> __layout_sections() and include all valid symbols from those sections,
> and also reorder typetab accordingly.
> 
> However, this adds complexity. I would prefer to first confirm whether
> OBJECT symbols in MOD_TEXT are a real issue before going in that direction.

I'm not aware of specific OBJECT symbols that end up in MOD_TEXT.
Nonetheless, it is a valid case and it is preferable that an
optimization doesn't break their lookup by address.

In general, I'm worried about the several edge cases and inconsistencies
that this optimization introduces. This also includes the fact that it
doesn't work for livepatch modules.

An alternative could be to keep the symbol table untouched and have
a separate array with symbol indexes that is sorted by their addresses,
but it requires evaluation if the additional memory usage is worth it.
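
A minimal userspace sketch of that alternative, under stated
assumptions: struct toy_sym stands in for Elf_Sym, qsort() stands in
for the kernel's sort(), and all function names are hypothetical. The
symbol table itself is never reordered; only the index array is sorted,
so any per-symbol side table (such as typetab) stays valid.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an ELF symbol; not the kernel's Elf_Sym. */
struct toy_sym {
	unsigned long value;
	const char *name;
};

/* qsort() has no context argument, so the table is passed via a static. */
static const struct toy_sym *cmp_table;

static int idx_cmp(const void *a, const void *b)
{
	unsigned int ia = *(const unsigned int *)a;
	unsigned int ib = *(const unsigned int *)b;
	unsigned long va = cmp_table[ia].value;
	unsigned long vb = cmp_table[ib].value;

	if (va != vb)
		return va < vb ? -1 : 1;
	/* Equal addresses: lower table index first, so aliases keep table order. */
	return (ia > ib) - (ia < ib);
}

/* Fill idx[0..n) with table indexes ordered by symbol address. */
static void build_addr_index(const struct toy_sym *tab, unsigned int n,
			     unsigned int *idx)
{
	assert((tab && idx) || n == 0);
	for (unsigned int i = 0; i < n; i++)
		idx[i] = i;
	cmp_table = tab;
	if (n > 1)
		qsort(idx, n, sizeof(*idx), idx_cmp);
}

/*
 * Return the table index of the first symbol with the greatest address
 * <= addr, or n if there is none.
 */
static unsigned int lookup_by_addr(const struct toy_sym *tab,
				   const unsigned int *idx, unsigned int n,
				   unsigned long addr)
{
	unsigned int low = 0, high = n, best;

	/* Find the first position whose address is above addr. */
	while (low < high) {
		unsigned int mid = low + (high - low) / 2;

		if (tab[idx[mid]].value <= addr)
			low = mid + 1;
		else
			high = mid;
	}
	if (low == 0)
		return n;

	/* Step back over aliases to the first symbol at that address. */
	best = low - 1;
	while (best > 0 && tab[idx[best - 1]].value == tab[idx[best]].value)
		best--;
	return idx[best];
}
```

The memory cost in this sketch is one unsigned int per indexed symbol,
which is the trade-off that would need to be evaluated.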

-- 
Thanks,
Petr

^ permalink raw reply

* [RFC PATCH 00/12] rv: Add selftests to tools and KUnit tests
From: Gabriele Monaco @ 2026-04-27 15:11 UTC (permalink / raw)
  To: linux-trace-kernel, linux-kernel
  Cc: Gabriele Monaco, Steven Rostedt, Nam Cao, Thomas Weissschuh,
	Tomas Glozar, John Kacur, Wen Yang

This series adds support for the make check target to the rv userspace
tool and the rvgen script, which allows their functionality to be
validated quickly. The selftest framework is inspired by the one used
in RTLA.

A few bugs in both tools were also discovered and are fixed as part of
this series.

Additionally, it adds unit tests for models. This is achieved by running
the handler functions directly within KUnit, emulating all module
paths as if real kernel events fired.

Unit tests emulate a series of events that are expected to trigger
violations and check that a reaction occurred. Stub structs and
functions are used so that the kernel is not affected by the test.
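
The testing pattern described above can be pictured with a toy
userspace sketch. This is not the KUnit code from the series: the
monitor, its states, its events and the react hook below are all
hypothetical, and a plain function pointer stands in for the stubbed
reaction. The test feeds events straight to the handler and counts
reactions instead of letting a real reactor run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-state model: START is valid when idle, STOP when running. */
enum toy_state { TOY_IDLE, TOY_RUNNING };
enum toy_event { TOY_START, TOY_STOP };

struct toy_monitor {
	enum toy_state state;
	void (*react)(struct toy_monitor *mon);	/* replaceable reaction hook */
	unsigned int reactions;
};

/* Stub reaction: only records that a violation was reported. */
static void toy_count_reaction(struct toy_monitor *mon)
{
	mon->reactions++;
}

/*
 * Handle one event. Returns true if the model allows it; otherwise calls
 * the reaction hook, resets the monitor to idle and returns false.
 */
static bool toy_handle_event(struct toy_monitor *mon, enum toy_event ev)
{
	assert(mon && mon->react);

	if (mon->state == TOY_IDLE && ev == TOY_START) {
		mon->state = TOY_RUNNING;
		return true;
	}
	if (mon->state == TOY_RUNNING && ev == TOY_STOP) {
		mon->state = TOY_IDLE;
		return true;
	}

	mon->react(mon);
	mon->state = TOY_IDLE;
	return false;
}
```

A test then replays a sequence such as START, START and checks that
exactly one reaction was counted, without any real event source being
involved.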

To: linux-trace-kernel@vger.kernel.org
To: linux-kernel@vger.kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nam Cao <namcao@linutronix.de>
Cc: Thomas Weissschuh <thomas.weissschuh@linutronix.de>
Cc: Tomas Glozar <tglozar@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Wen Yang <wen.yang@linux.dev>

Gabriele Monaco (12):
  tools/rv: Fix substring match bug in monitor name search
  tools/rv: Fix substring match when listing container monitors
  tools/rv: Fix exit status when monitor execution fails
  tools/rv: Fix cleanup after failed trace setup
  tools/rv: Add selftests
  verification/rvgen: Fix options shared among commands
  verification/rvgen: Add golden and spec folders for tests
  verification/rvgen: Add selftests
  rv: Add KUnit stub to rv_react() and rv_*_task_monitor_slot()
  rv: Add KUnit tests for some DA/HA monitors
  rv: Add KUnit stubs for current and smp_processor_id()
  rv: Add KUnit tests for some LTL monitors

 include/rv/da_monitor.h                       |  32 +++
 include/rv/kunit_stubs.h                      |  17 ++
 include/rv/ltl_monitor.h                      |  32 +++
 kernel/trace/rv/Kconfig                       |  14 +
 kernel/trace/rv/Makefile                      |   3 +
 kernel/trace/rv/monitors/nomiss/nomiss.c      |  30 +++
 kernel/trace/rv/monitors/opid/opid.c          |  27 ++
 .../trace/rv/monitors/pagefault/pagefault.c   |  26 +-
 kernel/trace/rv/monitors/sco/sco.c            |  23 ++
 kernel/trace/rv/monitors/sleep/sleep.c        |  64 ++++-
 kernel/trace/rv/monitors/sssw/sssw.c          |  27 ++
 kernel/trace/rv/monitors/sts/sts.c            |  35 +++
 kernel/trace/rv/rv.c                          |   5 +
 kernel/trace/rv/rv_monitors_test.c            |  99 +++++++
 kernel/trace/rv/rv_monitors_test.h            |  90 +++++++
 kernel/trace/rv/rv_reactors.c                 |   3 +
 tools/verification/rv/Makefile                |   5 +-
 tools/verification/rv/src/in_kernel.c         |  58 ++--
 tools/verification/rv/src/rv.c                |   2 +-
 tools/verification/rv/tests/rv_list.t         |  48 ++++
 tools/verification/rv/tests/rv_mon.t          |  95 +++++++
 tools/verification/rvgen/Makefile             |   4 +
 tools/verification/rvgen/__main__.py          |  10 +-
 .../rvgen/tests/golden/da_global/Kconfig      |   9 +
 .../rvgen/tests/golden/da_global/da_global.c  |  95 +++++++
 .../rvgen/tests/golden/da_global/da_global.h  |  47 ++++
 .../tests/golden/da_global/da_global_trace.h  |  15 ++
 .../tests/golden/da_perobj_parent/Kconfig     |  11 +
 .../da_perobj_parent/da_perobj_parent.c       | 110 ++++++++
 .../da_perobj_parent/da_perobj_parent.h       |  64 +++++
 .../da_perobj_parent/da_perobj_parent_trace.h |  15 ++
 .../tests/golden/da_pertask_desc/Kconfig      |   9 +
 .../golden/da_pertask_desc/da_pertask_desc.c  | 105 ++++++++
 .../golden/da_pertask_desc/da_pertask_desc.h  |  64 +++++
 .../da_pertask_desc/da_pertask_desc_trace.h   |  15 ++
 .../rvgen/tests/golden/ha_percpu/Kconfig      |   9 +
 .../rvgen/tests/golden/ha_percpu/ha_percpu.c  | 244 +++++++++++++++++
 .../rvgen/tests/golden/ha_percpu/ha_percpu.h  |  72 +++++
 .../tests/golden/ha_percpu/ha_percpu_trace.h  |  19 ++
 .../rvgen/tests/golden/ltl_pertask/Kconfig    |   9 +
 .../tests/golden/ltl_pertask/ltl_pertask.c    | 107 ++++++++
 .../tests/golden/ltl_pertask/ltl_pertask.h    | 108 ++++++++
 .../golden/ltl_pertask/ltl_pertask_trace.h    |  14 +
 .../rvgen/tests/golden/test_container/Kconfig |   5 +
 .../golden/test_container/test_container.c    |  35 +++
 .../golden/test_container/test_container.h    |   3 +
 .../rvgen/tests/golden/test_da/Kconfig        |   9 +
 .../rvgen/tests/golden/test_da/test_da.c      |  95 +++++++
 .../rvgen/tests/golden/test_da/test_da.h      |  47 ++++
 .../tests/golden/test_da/test_da_trace.h      |  15 ++
 .../rvgen/tests/golden/test_ha/Kconfig        |   9 +
 .../rvgen/tests/golden/test_ha/test_ha.c      | 247 ++++++++++++++++++
 .../rvgen/tests/golden/test_ha/test_ha.h      |  72 +++++
 .../tests/golden/test_ha/test_ha_trace.h      |  19 ++
 .../rvgen/tests/golden/test_ltl/Kconfig       |  11 +
 .../rvgen/tests/golden/test_ltl/test_ltl.c    | 108 ++++++++
 .../rvgen/tests/golden/test_ltl/test_ltl.h    | 108 ++++++++
 .../tests/golden/test_ltl/test_ltl_trace.h    |  14 +
 .../rvgen/tests/rvgen_container.t             |  20 ++
 .../verification/rvgen/tests/rvgen_monitor.t  |  87 ++++++
 .../rvgen/tests/specs/test_da.dot             |  16 ++
 .../rvgen/tests/specs/test_da2.dot            |  18 ++
 .../rvgen/tests/specs/test_ha.dot             |  27 ++
 .../rvgen/tests/specs/test_invalid.dot        |   8 +
 .../rvgen/tests/specs/test_invalid.ltl        |   1 +
 .../rvgen/tests/specs/test_invalid_ha.dot     |  16 ++
 .../rvgen/tests/specs/test_ltl.ltl            |   1 +
 tools/verification/tests/engine.sh            | 156 +++++++++++
 68 files changed, 2993 insertions(+), 44 deletions(-)
 create mode 100644 include/rv/kunit_stubs.h
 create mode 100644 kernel/trace/rv/rv_monitors_test.c
 create mode 100644 kernel/trace/rv/rv_monitors_test.h
 create mode 100644 tools/verification/rv/tests/rv_list.t
 create mode 100644 tools/verification/rv/tests/rv_mon.t
 create mode 100644 tools/verification/rvgen/tests/golden/da_global/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/da_global/da_global.c
 create mode 100644 tools/verification/rvgen/tests/golden/da_global/da_global.h
 create mode 100644 tools/verification/rvgen/tests/golden/da_global/da_global_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/da_perobj_parent.c
 create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/da_perobj_parent.h
 create mode 100644 tools/verification/rvgen/tests/golden/da_perobj_parent/da_perobj_parent_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/da_pertask_desc.c
 create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/da_pertask_desc.h
 create mode 100644 tools/verification/rvgen/tests/golden/da_pertask_desc/da_pertask_desc_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/ha_percpu.c
 create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/ha_percpu.h
 create mode 100644 tools/verification/rvgen/tests/golden/ha_percpu/ha_percpu_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/ltl_pertask.c
 create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/ltl_pertask.h
 create mode 100644 tools/verification/rvgen/tests/golden/ltl_pertask/ltl_pertask_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_container/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/test_container/test_container.c
 create mode 100644 tools/verification/rvgen/tests/golden/test_container/test_container.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_da/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/test_da/test_da.c
 create mode 100644 tools/verification/rvgen/tests/golden/test_da/test_da.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_da/test_da_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_ha/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/test_ha/test_ha.c
 create mode 100644 tools/verification/rvgen/tests/golden/test_ha/test_ha.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_ha/test_ha_trace.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/Kconfig
 create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/test_ltl.c
 create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/test_ltl.h
 create mode 100644 tools/verification/rvgen/tests/golden/test_ltl/test_ltl_trace.h
 create mode 100644 tools/verification/rvgen/tests/rvgen_container.t
 create mode 100644 tools/verification/rvgen/tests/rvgen_monitor.t
 create mode 100644 tools/verification/rvgen/tests/specs/test_da.dot
 create mode 100644 tools/verification/rvgen/tests/specs/test_da2.dot
 create mode 100644 tools/verification/rvgen/tests/specs/test_ha.dot
 create mode 100644 tools/verification/rvgen/tests/specs/test_invalid.dot
 create mode 100644 tools/verification/rvgen/tests/specs/test_invalid.ltl
 create mode 100644 tools/verification/rvgen/tests/specs/test_invalid_ha.dot
 create mode 100644 tools/verification/rvgen/tests/specs/test_ltl.ltl
 create mode 100644 tools/verification/tests/engine.sh


base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
-- 
2.53.0


^ permalink raw reply

