Linux Trace Kernel


* [RFC v8 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-05-15  9:18 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.

Inode cache objects can be recycled, so also reset the i_data_sem lockdep
subclass to I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new
inode may inherit a stale subclass (journal/quota/ea) from the object's
previous user and trigger lockdep warnings.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
  without changing the subclassing logic.
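
Illustration (not part of the patch): the false positive described above
can be reproduced with a toy dependency graph. This is only a userspace
sketch under stated assumptions. The node ids, struct dep_graph, dep_add()
and fc_ordering_reports_cycle() are hypothetical names made up for this
model; real lockdep keys its graph on lock class plus subclass and is far
more involved. The sketch replays the two orderings from the commit message
and reports whether the second one closes a cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical node ids; one node per (lock class, subclass) pair. */
enum {
	NODE_S_FC_LOCK,
	NODE_I_DATA_SEM_NORMAL,
	NODE_I_DATA_SEM_JOURNAL,
	NODE_MAX
};

struct dep_graph {
	/* edge[a][b]: lock b was acquired while lock a was held */
	bool edge[NODE_MAX][NODE_MAX];
};

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool dep_reachable(const struct dep_graph *g, int from, int to,
			  bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NODE_MAX; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    dep_reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record "held -> taken". Returns true if the new edge closes a cycle,
 * which is the condition a lock-order validator would complain about.
 */
static bool dep_add(struct dep_graph *g, int held, int taken)
{
	bool seen[NODE_MAX] = { false };
	bool cycle;

	assert(held >= 0 && held < NODE_MAX);
	assert(taken >= 0 && taken < NODE_MAX);
	cycle = dep_reachable(g, taken, held, seen);
	g->edge[held][taken] = true;
	return cycle;
}

/* Replay the two lock orderings described in the commit message. */
bool fc_ordering_reports_cycle(bool journal_has_own_subclass)
{
	struct dep_graph g;
	int journal_node = journal_has_own_subclass ?
		NODE_I_DATA_SEM_JOURNAL : NODE_I_DATA_SEM_NORMAL;
	bool cycle = false;

	memset(&g, 0, sizeof(g));
	/* Fast commit: s_fc_lock held, journal inode i_data_sem taken. */
	cycle |= dep_add(&g, NODE_S_FC_LOCK, journal_node);
	/* Inode update: data inode i_data_sem held, s_fc_lock taken. */
	cycle |= dep_add(&g, NODE_I_DATA_SEM_NORMAL, NODE_S_FC_LOCK);
	return cycle;
}
```

In this model, fc_ordering_reports_cycle(false) folds both i_data_sem
instances into one node and a cycle is reported, while
fc_ordering_reports_cycle(true) gives the journal inode its own node and
the graph stays acyclic.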

 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e337a37bb6fb..115a3c94db16 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1015,12 +1015,14 @@ do {										\
  *			  than the first
  *  I_DATA_SEM_QUOTA  - Used for quota inodes only
  *  I_DATA_SEM_EA     - Used for ea_inodes only
+ *  I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL = 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
 
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3c869f0001c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1431,6 +1431,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
 
@@ -5910,6 +5913,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		  journal_inode, journal_inode->i_size);
 	return journal_inode;
-- 
2.53.0


* [RFC v8 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
From: Li Chen @ 2026-05-15  9:18 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.

Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in RFC v8:
- Base the series on "ext4: fix fast commit wait/wake bit mapping on
  64-bit", so the FC_COMMITTING/FC_FLUSHING_DATA wait and wake paths use
  the shared helper mapping.
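
Illustration (not part of the patch): the requeue decision made in
ext4_fc_cleanup() can be sketched in userspace as below. struct
fc_inode_model, fc_should_requeue() and fc_should_requeue_tid_only() are
hypothetical names for this model only. tid_geq() is written to mirror the
wraparound-safe comparison of the jbd2 helper with the same name, assuming
32-bit transaction IDs. The point is only to show why a tid-based test is
not usable when the callback is invoked with tid == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;	/* assumption: 32-bit transaction IDs that wrap */

/* Wraparound-safe "x >= y" on transaction IDs. */
static bool tid_geq(tid_t x, tid_t y)
{
	int32_t difference = (int32_t)(x - y);

	return difference >= 0;
}

/* Hypothetical stand-in for the per-inode state that cleanup consults. */
struct fc_inode_model {
	tid_t sync_tid;		/* models ei->i_sync_tid */
	bool requeue_flag;	/* models EXT4_STATE_FC_REQUEUE */
};

/*
 * Decision modeled on the patched cleanup path: trust the tid after a
 * full commit, trust the REQUEUE flag after a fast commit.
 */
bool fc_should_requeue(const struct fc_inode_model *in, bool full, tid_t tid)
{
	assert(in);
	if (full)
		return !tid_geq(tid, in->sync_tid);
	return in->requeue_flag;
}

/* What a purely tid-based test would answer, for comparison. */
bool fc_should_requeue_tid_only(const struct fc_inode_model *in, tid_t tid)
{
	assert(in);
	return !tid_geq(tid, in->sync_tid);
}
```

With sync_tid = 42 and tid = 0, the tid-only variant answers "requeue" even
for an inode that was not touched during the commit, whereas the flag-based
variant requeues only inodes whose requeue_flag was set.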

 fs/ext4/ext4.h        |   1 +
 fs/ext4/fast_commit.c | 106 ++++++++++++++++++++----------------------
 2 files changed, 51 insertions(+), 56 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 115a3c94db16..927173bc8381 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1991,6 +1991,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_FC_REQUEUE,		/* Inode modified during fast commit */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 0c49144e8ca2..673668860e2d 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -62,9 +62,8 @@
  *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
  *     needed for log writing.
  * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
- *     starting of new handles. If new handles try to start an update on
- *     any of the inodes that are being committed, ext4_fc_track_inode()
- *     will block until those inodes have finished the fast commit.
+ *     starting of new handles. Updates to inodes being fast committed are
+ *     tracked for requeue rather than blocking.
  * [6] Commit all the directory entry updates in the fast commit space.
  * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
@@ -218,6 +217,7 @@ void ext4_fc_init_inode(struct inode *inode)
 
 	ext4_fc_reset_inode(inode);
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	ei->i_fc_snap = NULL;
@@ -245,7 +245,10 @@ void ext4_fc_del(struct inode *inode)
 	struct ext4_fc_dentry_update *fc_dentry;
 	wait_queue_head_t *wq;
 	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
-	int wait_bit = ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
+	int committing_wait_bit =
+		ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
+	int flushing_wait_bit =
+		ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
 	int alloc_ctx;
 
 	if (ext4_fc_disabled(inode->i_sb))
@@ -259,26 +262,26 @@ void ext4_fc_del(struct inode *inode)
 	}
 
 	/*
-	 * Since ext4_fc_del is called from ext4_evict_inode while having a
-	 * handle open, there is no need for us to wait here even if a fast
-	 * commit is going on. That is because, if this inode is being
-	 * committed, ext4_mark_inode_dirty would have waited for inode commit
-	 * operation to finish before we come here. So, by the time we come
-	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
-	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
-	 * here.
-	 *
-	 * We may come here without any handles open in the "no_delete" case of
-	 * ext4_evict_inode as well. However, if that happens, we first mark the
-	 * file system as fast commit ineligible anyway. So, even in that case,
-	 * it is okay to remove the inode from the fc list.
+	 * Wait for ongoing fast commit to finish. We cannot remove the inode
+	 * from fast commit lists while it is being committed.
 	 */
-	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
-		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		DEFINE_WAIT_BIT(wait, wait_word, committing_wait_bit);
+
+		wq = bit_waitqueue(wait_word, committing_wait_bit);
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			ext4_fc_unlock(inode->i_sb, alloc_ctx);
+			schedule();
+			alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
-		DEFINE_WAIT_BIT(wait, wait_word, wait_bit);
+		DEFINE_WAIT_BIT(wait, wait_word, flushing_wait_bit);
 
-		wq = bit_waitqueue(wait_word, wait_bit);
+		wq = bit_waitqueue(wait_word, flushing_wait_bit);
 		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 		if (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
 			ext4_fc_unlock(inode->i_sb, alloc_ctx);
@@ -287,19 +290,22 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+
 	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
-	 * Since this inode is getting removed, let's also remove all FC
-	 * dentry create references, since it is not needed to log it anyways.
+	 * Since this inode is getting removed, let's also remove all FC dentry
+	 * create references, since it is not needed to log it anyways.
 	 */
 	if (list_empty(&ei->i_fc_dilist)) {
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
 
-	fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist);
+	fc_dentry = list_first_entry(&ei->i_fc_dilist,
+				     struct ext4_fc_dentry_update,
+				     fcd_dilist);
 	WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT);
 	list_del_init(&fc_dentry->fcd_list);
 	list_del_init(&fc_dentry->fcd_dilist);
@@ -371,6 +377,8 @@ static int ext4_fc_track_template(
 
 	tid = handle->h_transaction->t_tid;
 	spin_lock(&ei->i_fc_lock);
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -541,10 +549,6 @@ static int __track_inode(handle_t *handle, struct inode *inode, void *arg,
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	wait_queue_head_t *wq;
-	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
-	int wait_bit = ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -560,21 +564,11 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		return;
 
 	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
+	 * Fast commit snapshots inode state at commit time, so there's no need
+	 * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already
+	 * on the commit queue, ext4_fc_cleanup() will requeue it for the new
+	 * transaction once the current commit finishes.
 	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-		DEFINE_WAIT_BIT(wait, wait_word, wait_bit);
-
-		wq = bit_waitqueue(wait_word, wait_bit);
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
-			schedule();
-		finish_wait(wq, &wait.wq_entry);
-	}
 
 	/*
 	 * From this point on, this inode will not be committed either
@@ -1499,32 +1493,32 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) {
+		bool requeue;
+
 		ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN],
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
 		ext4_fc_free_inode_snap(&ei->vfs_inode);
+		spin_lock(&ei->i_fc_lock);
+		if (full)
+			requeue = !tid_geq(tid, ei->i_sync_tid);
+		else
+			requeue = ext4_test_inode_state(&ei->vfs_inode,
+							EXT4_STATE_FC_REQUEUE);
+		if (!requeue)
+			ext4_fc_reset_inode(&ei->vfs_inode);
+		ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
-		if (tid_geq(tid, ei->i_sync_tid)) {
-			ext4_fc_reset_inode(&ei->vfs_inode);
-		} else if (full) {
-			/*
-			 * We are called after a full commit, inode has been
-			 * modified while the commit was running. Re-enqueue
-			 * the inode into STAGING, which will then be splice
-			 * back into MAIN. This cannot happen during
-			 * fastcommit because the journal is locked all the
-			 * time in that case (and tid doesn't increase so
-			 * tid check above isn't reliable).
-			 */
+		spin_unlock(&ei->i_fc_lock);
+		if (requeue)
 			list_add_tail(&ei->i_fc_list,
 				      &sbi->s_fc_q[FC_Q_STAGING]);
-		}
 		/*
 		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
 		 * visible before we send the wakeup. Pairs with implicit
-		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
 		wake_up_bit(ext4_inode_state_wait_word(&ei->vfs_inode),
-- 
2.53.0


* [RFC v8 4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
From: Li Chen @ 2026-05-15  9:18 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.

ext4_fc_del() also has to re-check EXT4_STATE_FC_COMMITTING after
waiting on EXT4_STATE_FC_FLUSHING_DATA. The commit thread clears
FLUSHING_DATA before it sets COMMITTING, so a waiter woken from the
flush wait must not delete the inode based on an old COMMITTING
check.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v8:
- Factor out small ext4_fc_wait_inode_state()/ext4_fc_wake_inode_state()
  helpers so the repeated FC state wait/wake mapping is kept in one place.
- Re-check EXT4_STATE_FC_COMMITTING after waking from
  EXT4_STATE_FC_FLUSHING_DATA in ext4_fc_del(), so list deletion only
  happens after both predicates pass under the same s_fc_lock critical
  section.
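
Illustration (not part of the patch): the need to re-check
FC_COMMITTING after waking from FC_FLUSHING_DATA can be shown with a
deterministic, single-threaded model. Everything here is hypothetical:
struct fc_state_model, commit_thread_step() and the two fc_del_*()
functions exist only in this sketch, and "waiting" is simulated by running
the modeled commit thread until the bit clears. The model collapses
"clear FLUSHING_DATA, later set COMMITTING" into one step, which
corresponds to the worst case where the waiter only runs again after both
have happened.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two per-inode fast-commit state bits. */
struct fc_state_model {
	bool flushing;		/* models EXT4_STATE_FC_FLUSHING_DATA */
	bool committing;	/* models EXT4_STATE_FC_COMMITTING */
	int commit_steps;	/* progress of the simulated commit thread */
};

/* One step of the simulated commit thread. */
static void commit_thread_step(struct fc_state_model *s)
{
	switch (s->commit_steps++) {
	case 0:
		/* Data flush done, inode is now marked as being committed. */
		s->flushing = false;
		s->committing = true;
		break;
	case 1:
		/* Cleanup clears the bit. */
		s->committing = false;
		break;
	default:
		break;
	}
}

/* Simulated sleep: let the commit thread run until *bit is clear. */
static void wait_bit_clear(struct fc_state_model *s, const bool *bit)
{
	while (*bit) {
		assert(s->commit_steps < 2);	/* model must make progress */
		commit_thread_step(s);
	}
}

/* Two back-to-back waits, with no re-check after the second one. */
bool fc_del_naive_is_safe(struct fc_state_model *s)
{
	wait_bit_clear(s, &s->committing);
	wait_bit_clear(s, &s->flushing);
	return !s->committing && !s->flushing;
}

/* Loop until both bits are observed clear in the same pass. */
bool fc_del_rechecked_is_safe(struct fc_state_model *s)
{
	for (;;) {
		wait_bit_clear(s, &s->committing);
		if (!s->flushing)
			break;
		wait_bit_clear(s, &s->flushing);
	}
	return !s->committing && !s->flushing;
}
```

Starting from { flushing = true, committing = false }, the naive variant
returns with committing set, which in the kernel would mean deleting an
inode the commit thread is still using. The re-checking variant goes around
the loop once more and returns only after both bits are clear.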

 fs/ext4/fast_commit.c | 124 +++++++++++++++++++++++++-----------------
 1 file changed, 75 insertions(+), 49 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 673668860e2d..8a6981e50ffe 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -235,6 +235,37 @@ static bool ext4_fc_eligible(struct super_block *sb)
 		!(ext4_test_mount_flag(sb, EXT4_MF_FC_INELIGIBLE));
 }
 
+/*
+ * Wait for an inode fast-commit state bit to clear while dropping the
+ * fast-commit lock around schedule().
+ */
+static void ext4_fc_wait_inode_state(struct inode *inode, int bit,
+				     int *alloc_ctx)
+{
+	wait_queue_head_t *wq;
+	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
+	int wait_bit = ext4_inode_state_wait_bit(bit);
+
+	while (ext4_test_inode_state(inode, bit)) {
+		DEFINE_WAIT_BIT(wait, wait_word, wait_bit);
+
+		wq = bit_waitqueue(wait_word, wait_bit);
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, bit)) {
+			ext4_fc_unlock(inode->i_sb, *alloc_ctx);
+			schedule();
+			*alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+}
+
+static inline void ext4_fc_wake_inode_state(struct inode *inode, int bit)
+{
+	wake_up_bit(ext4_inode_state_wait_word(inode),
+		    ext4_inode_state_wait_bit(bit));
+}
+
 /*
  * Remove inode from fast commit list. If the inode is being committed
  * we wait until inode commit is done.
@@ -243,12 +274,6 @@ void ext4_fc_del(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_dentry_update *fc_dentry;
-	wait_queue_head_t *wq;
-	unsigned long *wait_word = ext4_inode_state_wait_word(inode);
-	int committing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
-	int flushing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
 	int alloc_ctx;
 
 	if (ext4_fc_disabled(inode->i_sb))
@@ -263,32 +288,19 @@ void ext4_fc_del(struct inode *inode)
 
 	/*
 	 * Wait for ongoing fast commit to finish. We cannot remove the inode
-	 * from fast commit lists while it is being committed.
+	 * from fast commit lists while it is being committed. If we wake from
+	 * FC_FLUSHING_DATA, re-check FC_COMMITTING before deleting because the
+	 * commit thread sets FC_COMMITTING only after clearing FLUSHING_DATA.
 	 */
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-		DEFINE_WAIT_BIT(wait, wait_word, committing_wait_bit);
+	for (;;) {
+		ext4_fc_wait_inode_state(inode, EXT4_STATE_FC_COMMITTING,
+					 &alloc_ctx);
 
-		wq = bit_waitqueue(wait_word, committing_wait_bit);
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-			ext4_fc_unlock(inode->i_sb, alloc_ctx);
-			schedule();
-			alloc_ctx = ext4_fc_lock(inode->i_sb);
-		}
-		finish_wait(wq, &wait.wq_entry);
-	}
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
-		DEFINE_WAIT_BIT(wait, wait_word, flushing_wait_bit);
+		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA))
+			break;
 
-		wq = bit_waitqueue(wait_word, flushing_wait_bit);
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
-			ext4_fc_unlock(inode->i_sb, alloc_ctx);
-			schedule();
-			alloc_ctx = ext4_fc_lock(inode->i_sb);
-		}
-		finish_wait(wq, &wait.wq_entry);
+		ext4_fc_wait_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA,
+					 &alloc_ctx);
 	}
 
 	ext4_fc_free_inode_snap(inode);
@@ -1184,13 +1196,12 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
-		inodes[i] = igrab(&iter->vfs_inode);
-		if (inodes[i])
-			i++;
+		inodes[i++] = &iter->vfs_inode;
 	}
 
 	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
 		struct ext4_inode_info *ei;
+		struct inode *inode;
 
 		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
 			continue;
@@ -1200,12 +1211,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		/* See the comment in ext4_fc_commit_dentry_updates(). */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				      struct ext4_inode_info, i_fc_dilist);
+		inode = &ei->vfs_inode;
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
-		inodes[i] = igrab(&ei->vfs_inode);
-		if (inodes[i])
-			i++;
+		/*
+		 * Create-only inodes may only be referenced via fcd_dilist and
+		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
+		 * we are snapshotting, but inode eviction calls ext4_fc_del(),
+		 * which waits for FC_COMMITTING to clear. Mark them FC_COMMITTING
+		 * so the inode stays pinned and the snapshot stays valid until
+		 * ext4_fc_cleanup().
+		 */
+		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+		inodes[i++] = inode;
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
@@ -1215,10 +1234,6 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 			break;
 	}
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		if (inodes[nr_inodes])
-			iput(inodes[nr_inodes]);
-	}
 	kvfree(inodes);
 	return ret;
 }
@@ -1234,8 +1249,6 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
-	int flushing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_FLUSHING_DATA);
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1261,8 +1274,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		ext4_clear_inode_state(&iter->vfs_inode,
 				       EXT4_STATE_FC_FLUSHING_DATA);
-		wake_up_bit(ext4_inode_state_wait_word(&iter->vfs_inode),
-			    flushing_wait_bit);
+		ext4_fc_wake_inode_state(&iter->vfs_inode,
+					 EXT4_STATE_FC_FLUSHING_DATA);
 	}
 
 	/*
@@ -1285,8 +1298,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	jbd2_journal_lock_updates(journal);
 	/*
 	 * The journal is now locked. No more handles can start and all the
-	 * previous handles are now drained. We now mark the inodes on the
-	 * commit queue as being committed.
+	 * previous handles are now drained. Snapshotting happens in this
+	 * window so log writing can consume only stable snapshots without
+	 * doing logical-to-physical mapping.
 	 */
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
@@ -1482,8 +1496,6 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 	struct ext4_inode_info *ei;
 	struct ext4_fc_dentry_update *fc_dentry;
 	int alloc_ctx;
-	int committing_wait_bit =
-		ext4_inode_state_wait_bit(EXT4_STATE_FC_COMMITTING);
 
 	if (full && sbi->s_fc_bh)
 		sbi->s_fc_bh = NULL;
@@ -1521,8 +1533,8 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
-		wake_up_bit(ext4_inode_state_wait_word(&ei->vfs_inode),
-			    committing_wait_bit);
+		ext4_fc_wake_inode_state(&ei->vfs_inode,
+					 EXT4_STATE_FC_COMMITTING);
 	}
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
@@ -1537,6 +1549,20 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					      struct ext4_inode_info,
 					      i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
+			spin_lock(&ei->i_fc_lock);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_REQUEUE);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_COMMITTING);
+			spin_unlock(&ei->i_fc_lock);
+			/*
+			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+			 * visible before we send the wakeup. Pairs with implicit
+			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 */
+			smp_mb();
+			ext4_fc_wake_inode_state(&ei->vfs_inode,
+						 EXT4_STATE_FC_COMMITTING);
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
-- 
2.53.0


* [RFC v8 5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
From: Li Chen @ 2026-05-15  9:18 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v7:
- Address Sashiko review by guarding snapshot range arithmetic near
  EXT_MAX_BLOCKS to avoid cur_lblk / remaining-range wraparound in the
  snapshot walk.
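
Illustration (not part of the patch): the wraparound guard in the
snapshot walk can be exercised in userspace with a simplified model.
struct es_model, es_lookup() and count_snapshot_ranges() are hypothetical
names for this sketch. The model assumes 32-bit logical block numbers,
treats the cache as a flat array, returns -1 where the patch returns
distinct error codes, and leaves out the delayed/hole/unwritten handling
and the per-extent length caps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* assumption: 32-bit logical block numbers */

/* Hypothetical cached extent covering [lblk, lblk + len - 1]. */
struct es_model {
	lblk_t lblk;
	lblk_t len;
};

/* Find the cached extent containing 'lblk', or NULL on a cache miss. */
static const struct es_model *es_lookup(const struct es_model *es, size_t n,
					lblk_t lblk)
{
	for (size_t i = 0; i < n; i++) {
		if (lblk >= es[i].lblk && lblk - es[i].lblk < es[i].len)
			return &es[i];
	}
	return NULL;
}

/*
 * Walk [start, end] and count the ranges that would be recorded.
 * Returns -1 on a cache miss or when max_ranges would be exceeded.
 */
int count_snapshot_ranges(const struct es_model *es, size_t n,
			  lblk_t start, lblk_t end, unsigned int max_ranges)
{
	lblk_t cur = start;
	unsigned int nr = 0;

	assert(start <= end);
	while (cur <= end) {
		const struct es_model *e = es_lookup(es, n, cur);
		uint64_t remaining = (uint64_t)end - cur + 1;
		lblk_t len;

		if (!e)
			return -1;	/* cache miss: caller falls back */
		/* Part of the extent at or after cur, clamped to the range. */
		len = e->len - (cur - e->lblk);
		if (len > remaining)
			len = (lblk_t)remaining;
		if (nr >= max_ranges)
			return -1;	/* too fragmented for this model */
		nr++;
		/* Stop before "cur += len" could wrap around to 0. */
		if ((uint64_t)len > (uint64_t)end - cur)
			break;
		cur += len;
	}
	return (int)nr;
}
```

Without the final guard, a walk whose end is the largest 32-bit block
number would advance cur past end, wrap to 0, and keep satisfying
cur <= end. With the guard, the last range ends the loop explicitly.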

 fs/ext4/fast_commit.c | 253 ++++++++++++++++++++++++++++++------------
 1 file changed, 179 insertions(+), 74 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 8a6981e50ffe..9e73c83b0e25 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -184,6 +184,15 @@
 
 #include <trace/events/ext4.h>
 static struct kmem_cache *ext4_fc_dentry_cachep;
+static struct kmem_cache *ext4_fc_range_cachep;
+
+/*
+ * Avoid spending unbounded time/memory snapshotting highly fragmented files
+ * under jbd2_journal_lock_updates(). If we exceed this limit, fall back to
+ * full commit.
+ */
+#define EXT4_FC_SNAPSHOT_MAX_INODES	1024
+#define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
@@ -939,7 +948,7 @@ static void ext4_fc_free_ranges(struct list_head *head)
 
 	list_for_each_entry_safe(range, range_n, head, list) {
 		list_del(&range->list);
-		kfree(range);
+		kmem_cache_free(ext4_fc_range_cachep, range);
 	}
 }
 
@@ -957,16 +966,19 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 }
 
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
-				       struct list_head *ranges)
+				       struct list_head *ranges,
+				       unsigned int nr_ranges_total,
+				       unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
-	struct ext4_map_blocks map;
-	int ret;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
 		spin_unlock(&ei->i_fc_lock);
+		if (nr_rangesp)
+			*nr_rangesp = 0;
 		return 0;
 	}
 	start_lblk = ei->i_fc_lblk_start;
@@ -980,61 +992,82 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		   (unsigned long long)inode->i_ino);
 
 	while (cur_lblk <= end_lblk) {
+		struct extent_status es;
 		struct ext4_fc_range *range;
+		ext4_lblk_t len;
+		u64 remaining = (u64)end_lblk - cur_lblk + 1;
 
-		map.m_lblk = cur_lblk;
-		map.m_len = end_lblk - cur_lblk + 1;
-		ret = ext4_map_blocks(NULL, inode, &map,
-				      EXT4_GET_BLOCKS_IO_SUBMIT |
-				      EXT4_EX_NOCACHE);
-		if (ret < 0)
-			return -ECANCELED;
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+			return -EAGAIN;
+
+		if (ext4_es_is_delayed(&es))
+			return -EAGAIN;
 
-		if (map.m_len == 0) {
+		len = es.es_len - (cur_lblk - es.es_lblk);
+		if (len > remaining)
+			len = remaining;
+		if (len == 0) {
 			cur_lblk++;
 			continue;
 		}
 
-		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+			return -E2BIG;
+
+		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range)
 			return -ENOMEM;
+		nr_ranges++;
 
-		range->lblk = map.m_lblk;
-		range->len = map.m_len;
+		range->lblk = cur_lblk;
+		range->len = len;
 		range->pblk = 0;
 		range->unwritten = false;
 
-		if (ret == 0) {
+		if (ext4_es_is_hole(&es)) {
 			range->tag = EXT4_FC_TAG_DEL_RANGE;
-		} else {
-			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
-				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
-
-			/* Limit the number of blocks in one extent */
-			map.m_len = min(max, map.m_len);
+		} else if (ext4_es_is_written(&es) ||
+			   ext4_es_is_unwritten(&es)) {
+			unsigned int max;
 
 			range->tag = EXT4_FC_TAG_ADD_RANGE;
-			range->len = map.m_len;
-			range->pblk = map.m_pblk;
-			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
+			range->pblk = ext4_es_pblock(&es) +
+				      (cur_lblk - es.es_lblk);
+			range->unwritten = ext4_es_is_unwritten(&es);
+
+			max = range->unwritten ? EXT_UNWRITTEN_MAX_LEN :
+						 EXT_INIT_MAX_LEN;
+			if (range->len > max)
+				range->len = max;
+		} else {
+			kmem_cache_free(ext4_fc_range_cachep, range);
+			return -EAGAIN;
 		}
 
 		INIT_LIST_HEAD(&range->list);
 		list_add_tail(&range->list, ranges);
 
-		cur_lblk += map.m_len;
+		if ((u64)range->len > (u64)end_lblk - cur_lblk)
+			break;
+
+		cur_lblk += range->len;
 	}
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-static int ext4_fc_snapshot_inode(struct inode *inode)
+static int ext4_fc_snapshot_inode(struct inode *inode,
+				  unsigned int nr_ranges_total,
+				  unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
 	LIST_HEAD(ranges);
+	unsigned int nr_ranges = 0;
 	int ret;
 	int alloc_ctx;
 
@@ -1058,7 +1091,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
 	brelse(iloc.bh);
 
-	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
+					  &nr_ranges);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1071,10 +1105,11 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
 {
@@ -1153,49 +1188,32 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
-static int ext4_fc_snapshot_inodes(journal_t *journal)
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp);
+
+static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
+				   unsigned int inodes_size)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_inode_info *iter;
 	struct ext4_fc_dentry_update *fc_dentry;
-	struct inode **inodes;
-	unsigned int nr_inodes = 0;
 	unsigned int i = 0;
+	unsigned int idx;
+	unsigned int nr_ranges = 0;
 	int ret = 0;
 	int alloc_ctx;
 
-	alloc_ctx = ext4_fc_lock(sb);
-	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
-		nr_inodes++;
-
-	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
-		struct ext4_inode_info *ei;
-
-		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
-			continue;
-		if (list_empty(&fc_dentry->fcd_dilist))
-			continue;
-
-		/* See the comment in ext4_fc_commit_dentry_updates(). */
-		ei = list_first_entry(&fc_dentry->fcd_dilist,
-				      struct ext4_inode_info, i_fc_dilist);
-		if (!list_empty(&ei->i_fc_list))
-			continue;
-
-		nr_inodes++;
-	}
-	ext4_fc_unlock(sb, alloc_ctx);
-
-	if (!nr_inodes)
+	if (!inodes_size)
 		return 0;
 
-	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
-	if (!inodes)
-		return -ENOMEM;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		inodes[i++] = &iter->vfs_inode;
 	}
 
@@ -1215,6 +1233,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		/*
 		 * Create-only inodes may only be referenced via fcd_dilist and
 		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
@@ -1226,15 +1248,22 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 		inodes[i++] = inode;
 	}
+unlock:
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+	if (ret)
+		return ret;
+
+	for (idx = 0; idx < i; idx++) {
+		unsigned int inode_ranges = 0;
+
+		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
+					     &inode_ranges);
 		if (ret)
 			break;
+		nr_ranges += inode_ranges;
 	}
 
-	kvfree(inodes);
 	return ret;
 }
 
@@ -1245,6 +1274,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
+	struct inode **inodes;
+	unsigned int inodes_size;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
@@ -1294,6 +1325,10 @@ static int ext4_fc_perform_commit(journal_t *journal)
 		return ret;
 
 
+	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
+	if (ret)
+		return ret;
+
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
 	/*
@@ -1309,8 +1344,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
 	jbd2_journal_unlock_updates(journal);
+	kvfree(inodes);
 	if (ret)
 		return ret;
 
@@ -1366,6 +1402,64 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	return ret;
 }
 
+static unsigned int ext4_fc_count_snapshot_inodes(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	unsigned int nr_inodes = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	return nr_inodes;
+}
+
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp)
+{
+	unsigned int nr_inodes = ext4_fc_count_snapshot_inodes(sb);
+	struct inode **inodes;
+
+	*inodesp = NULL;
+	*nr_inodesp = 0;
+
+	if (!nr_inodes)
+		return 0;
+
+	if (nr_inodes > EXT4_FC_SNAPSHOT_MAX_INODES)
+		return -E2BIG;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	*inodesp = inodes;
+	*nr_inodesp = nr_inodes;
+	return 0;
+}
+
 static void ext4_fc_update_stats(struct super_block *sb, int status,
 				 u64 commit_time, int nblks, tid_t commit_tid)
 {
@@ -1458,7 +1552,10 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
 	ret = ext4_fc_perform_commit(journal);
 	if (ret < 0) {
-		status = EXT4_FC_STATUS_FAILED;
+		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
+			status = EXT4_FC_STATUS_INELIGIBLE;
+		else
+			status = EXT4_FC_STATUS_FAILED;
 		goto fallback;
 	}
 	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
@@ -1539,26 +1636,27 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
 		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
-					     struct ext4_fc_dentry_update,
-					     fcd_list);
+						 struct ext4_fc_dentry_update,
+						 fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
 		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
-		    !list_empty(&fc_dentry->fcd_dilist)) {
+			!list_empty(&fc_dentry->fcd_dilist)) {
 			/* See the comment in ext4_fc_commit_dentry_updates(). */
 			ei = list_first_entry(&fc_dentry->fcd_dilist,
-					      struct ext4_inode_info,
-					      i_fc_dilist);
+						  struct ext4_inode_info,
+						  i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
 			spin_lock(&ei->i_fc_lock);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_REQUEUE);
+						   EXT4_STATE_FC_REQUEUE);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_COMMITTING);
+						   EXT4_STATE_FC_COMMITTING);
 			spin_unlock(&ei->i_fc_lock);
 			/*
 			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
-			 * visible before we send the wakeup. Pairs with implicit
-			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 * visible before we send the wakeup. Pairs with
+			 * implicit barrier in prepare_to_wait() in
+			 * ext4_fc_del().
 			 */
 			smp_mb();
 			ext4_fc_wake_inode_state(&ei->vfs_inode,
@@ -2538,13 +2636,20 @@ int __init ext4_fc_init_dentry_cache(void)
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
 					   SLAB_RECLAIM_ACCOUNT);
 
-	if (ext4_fc_dentry_cachep == NULL)
+	if (!ext4_fc_dentry_cachep)
 		return -ENOMEM;
 
+	ext4_fc_range_cachep = KMEM_CACHE(ext4_fc_range, SLAB_RECLAIM_ACCOUNT);
+	if (!ext4_fc_range_cachep) {
+		kmem_cache_destroy(ext4_fc_dentry_cachep);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void ext4_fc_destroy_dentry_cache(void)
 {
+	kmem_cache_destroy(ext4_fc_range_cachep);
 	kmem_cache_destroy(ext4_fc_dentry_cachep);
 }
-- 
2.53.0


* [RFC v8 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-ext4, linux-kernel,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
Changes in v8:
- Use trace_call__ext4_fc_lock_updates() at the guarded call site as
  suggested by Steven Rostedt, avoiding a second static_branch check.

Changes in v7:
- Address Sashiko review by reporting the number of successfully snapshotted
  inodes in ext4_fc_lock_updates when snapshotting stops early.

Changes in v6:
- Drop explicit ext4_fc_snap_err assignments and rely on enum
  auto-increment.
- Treat locked_ns as trace-only in this patch and calculate it only when
  ext4_fc_lock_updates is enabled, as suggested by Steven Rostedt.

 fs/ext4/ext4.h              | 15 ++++++++
 fs/ext4/fast_commit.c       | 74 +++++++++++++++++++++++++++++--------
 include/trace/events/ext4.h | 61 ++++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 927173bc8381..dd09d00a73af 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1027,6 +1027,21 @@ enum {
 
 struct ext4_fc_inode_snap;
 
+/*
+ * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
+ * Keep these stable for tooling.
+ */
+enum ext4_fc_snap_err {
+	EXT4_FC_SNAP_ERR_NONE = 0,
+	EXT4_FC_SNAP_ERR_ES_MISS,
+	EXT4_FC_SNAP_ERR_ES_DELAYED,
+	EXT4_FC_SNAP_ERR_ES_OTHER,
+	EXT4_FC_SNAP_ERR_INODES_CAP,
+	EXT4_FC_SNAP_ERR_RANGES_CAP,
+	EXT4_FC_SNAP_ERR_NOMEM,
+	EXT4_FC_SNAP_ERR_INODE_LOC,
+};
+
 /*
  * fourth extended file system inode data in memory
  */
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 9e73c83b0e25..dc08f8ff43d9 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -194,6 +194,12 @@ static struct kmem_cache *ext4_fc_range_cachep;
 #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
 #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
+static inline void ext4_fc_set_snap_err(int *snap_err, int err)
+{
+	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
+		*snap_err = err;
+}
+
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
 	BUFFER_TRACE(bh, "");
@@ -968,11 +974,12 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       struct list_head *ranges,
 				       unsigned int nr_ranges_total,
-				       unsigned int *nr_rangesp)
+				       unsigned int *nr_rangesp,
+				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	unsigned int nr_ranges = 0;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
@@ -997,11 +1004,16 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		ext4_lblk_t len;
 		u64 remaining = (u64)end_lblk - cur_lblk + 1;
 
-		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
+		}
 
-		if (ext4_es_is_delayed(&es))
+		if (ext4_es_is_delayed(&es)) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
+		}
 
 		len = es.es_len - (cur_lblk - es.es_lblk);
 		if (len > remaining)
@@ -1011,12 +1023,17 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 			continue;
 		}
 
-		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
+		}
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
-		if (!range)
+		if (!range) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
+		}
 		nr_ranges++;
 
 		range->lblk = cur_lblk;
@@ -1041,6 +1058,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
 
@@ -1060,7 +1078,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int nr_ranges_total,
-				  unsigned int *nr_rangesp)
+				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
@@ -1072,8 +1090,10 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	int alloc_ctx;
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
-	if (ret)
+	if (ret) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
+	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
 		inode_len = EXT4_INODE_SIZE(inode->i_sb);
@@ -1082,6 +1102,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
 	}
@@ -1092,7 +1113,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	brelse(iloc.bh);
 
 	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
-					  &nr_ranges);
+					  &nr_ranges, snap_err);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1193,7 +1214,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
 					 unsigned int *nr_inodesp);
 
 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
-				   unsigned int inodes_size)
+				   unsigned int inodes_size,
+				   unsigned int *nr_inodesp,
+				   unsigned int *nr_rangesp,
+				   int *snap_err)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1211,6 +1235,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1234,6 +1260,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1258,16 +1286,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 		unsigned int inode_ranges = 0;
 
 		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
-					     &inode_ranges);
+					     &inode_ranges, snap_err);
 		if (ret)
 			break;
 		nr_ranges += inode_ranges;
 	}
 
+	if (nr_inodesp)
+		*nr_inodesp = idx;
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return ret;
 }
 
-static int ext4_fc_perform_commit(journal_t *journal)
+static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1276,10 +1308,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct inode *inode;
 	struct inode **inodes;
 	unsigned int inodes_size;
+	unsigned int snap_inodes = 0;
+	unsigned int snap_ranges = 0;
+	int snap_err = EXT4_FC_SNAP_ERR_NONE;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
+	ktime_t lock_start;
+	u64 locked_ns;
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1324,13 +1361,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	if (ret)
 		return ret;
 
-
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
 	if (ret)
 		return ret;
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
+	lock_start = ktime_get();
 	/*
 	 * The journal is now locked. No more handles can start and all the
 	 * previous handles are now drained. Snapshotting happens in this
@@ -1344,8 +1381,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
+				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
+	if (trace_ext4_fc_lock_updates_enabled()) {
+		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+						 snap_inodes, snap_ranges,
+						 ret, snap_err);
+	}
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -1550,7 +1594,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
-	ret = ext4_fc_perform_commit(journal);
+	ret = ext4_fc_perform_commit(journal, commit_tid);
 	if (ret < 0) {
 		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
 			status = EXT4_FC_STATUS_INELIGIBLE;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f493642cf121..7028a28316fa 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
+#undef EM
+#undef EMe
+#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+
+#define TRACE_SNAP_ERR						\
+	EM(NONE)						\
+	EM(ES_MISS)						\
+	EM(ES_DELAYED)						\
+	EM(ES_OTHER)						\
+	EM(INODES_CAP)						\
+	EM(RANGES_CAP)						\
+	EM(NOMEM)						\
+	EMe(INODE_LOC)
+
+TRACE_SNAP_ERR
+
+#undef EM
+#undef EMe
+
 #define show_fc_reason(reason)						\
 	__print_symbolic(reason,					\
 		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
@@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
 		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 );
 
+#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
+#define EMe(a)	{ EXT4_FC_SNAP_ERR_##a, #a }
+
+TRACE_EVENT(ext4_fc_lock_updates,
+	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
+		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
+		     int snap_err),
+
+	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
+
+	TP_STRUCT__entry(/* entry */
+		__field(dev_t, dev)
+		__field(tid_t, tid)
+		__field(u64, locked_ns)
+		__field(unsigned int, nr_inodes)
+		__field(unsigned int, nr_ranges)
+		__field(int, err)
+		__field(int, snap_err)
+	),
+
+	TP_fast_assign(/* assign */
+		__entry->dev = sb->s_dev;
+		__entry->tid = commit_tid;
+		__entry->locked_ns = locked_ns;
+		__entry->nr_inodes = nr_inodes;
+		__entry->nr_ranges = nr_ranges;
+		__entry->err = err;
+		__entry->snap_err = snap_err;
+	),
+
+	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
+		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
+		  __entry->err, __print_symbolic(__entry->snap_err,
+						 TRACE_SNAP_ERR))
+);
+
+#undef EM
+#undef EMe
+#undef TRACE_SNAP_ERR
+
 #define FC_REASON_NAME_STAT(reason)					\
 	show_fc_reason(reason),						\
 	__entry->fc_ineligible_rc[reason]
-- 
2.53.0


* [RFC v8 7/7] ext4: fast commit: export snapshot stats in fc_info
From: Li Chen @ 2026-05-15  9:18 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel
In-Reply-To: <20260515091829.194810-1-me@linux.beauty>

Snapshot-based fast commit can fall back to a full commit when the
commit-time snapshot cannot be built (e.g. on extent status cache misses).
It is useful to quantify the updates-locked window and to see why
snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
---
Changes in v8:
- Treat stale snapshot inode sizing as a capacity fallback instead of
  letting log writing later report a missing snapshot.
- Use atomic64_t for the snapshot counters so fc_info cannot observe
  torn 64-bit values on 32-bit systems.

Changes in v7:
- Address Sashiko review by using READ_ONCE() + div64_u64() for the fc_info
  lock_updates average.

Changes in v6:
- Start consuming locked_ns in fc_info, so this patch intentionally moves
  lock_updates_ns_{total,max,samples} accounting here.
- Guard the tracepoint call with trace_ext4_fc_lock_updates_enabled() and
  use trace_call__ext4_fc_lock_updates() to avoid the double static_branch
  at the guarded call site.
- Update the stats unconditionally while avoiding extra tracepoint
  overhead when ext4_fc_lock_updates is disabled.

 fs/ext4/ext4.h        |  31 ++++++++++++++
 fs/ext4/fast_commit.c |  96 ++++++++++++++++++++++++++++++++++++++-----
 fs/ext4/super.c       |   1 +
 3 files changed, 118 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index dd09d00a73af..ddc903738c6b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1550,6 +1550,36 @@ struct ext4_orphan_info {
 						 * file blocks */
 };
 
+/*
+ * Ext4 fast commit snapshot statistics.
+ *
+ * These are best-effort counters intended for debugging / performance
+ * introspection; they are not exact under concurrent updates.
+ */
+struct ext4_fc_snap_stats {
+	atomic64_t lock_updates_ns_total;
+	atomic64_t lock_updates_ns_max;
+	atomic64_t lock_updates_samples;
+
+	atomic64_t snap_inodes;
+	atomic64_t snap_ranges;
+
+	atomic64_t snap_fail_es_miss;
+	atomic64_t snap_fail_es_delayed;
+	atomic64_t snap_fail_es_other;
+
+	atomic64_t snap_fail_inodes_cap;
+	atomic64_t snap_fail_ranges_cap;
+	atomic64_t snap_fail_nomem;
+	atomic64_t snap_fail_inode_loc;
+
+	/*
+	 * Missing inode snapshots during log writing should never happen.
+	 * Keep this counter to help catch unexpected regressions.
+	 */
+	atomic64_t snap_fail_no_snap;
+};
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1824,6 +1854,7 @@ struct ext4_sb_info {
 	struct mutex s_fc_lock;
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
+	struct ext4_fc_snap_stats s_fc_snap_stats;
 	tid_t s_fc_ineligible_tid;
 #ifdef CONFIG_EXT4_DEBUG
 	int s_fc_debug_max_replay;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index dc08f8ff43d9..4ef796b9b6cb 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -281,6 +281,19 @@ static inline void ext4_fc_wake_inode_state(struct inode *inode, int bit)
 		    ext4_inode_state_wait_bit(bit));
 }
 
+static void ext4_fc_snap_stats_update_max(atomic64_t *stat, u64 value)
+{
+	u64 old = atomic64_read(stat);
+
+	while (value > old) {
+		u64 prev = atomic64_cmpxchg(stat, old, value);
+
+		if (prev == old)
+			break;
+		old = prev;
+	}
+}
+
 /*
  * Remove inode from fast commit list. If the inode is being committed
  * we wait until inode commit is done.
@@ -868,6 +881,8 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode fc_inode;
 	struct ext4_fc_tl tl;
 	u8 *dst;
@@ -875,13 +890,17 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	int inode_len;
 	int ret;
 
-	if (!snap)
+	if (!snap) {
+		atomic64_inc(&stats->snap_fail_no_snap);
 		return -ECANCELED;
+	}
 
 	src = snap->inode_buf;
 	inode_len = snap->inode_len;
-	if (!src || inode_len == 0)
+	if (!src || inode_len == 0) {
+		atomic64_inc(&stats->snap_fail_no_snap);
 		return -ECANCELED;
+	}
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -911,13 +930,17 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_add_range fc_ext;
 	struct ext4_fc_del_range lrange;
 	struct ext4_extent *ex;
 	struct ext4_fc_range *range;
 
-	if (!snap)
+	if (!snap) {
+		atomic64_inc(&stats->snap_fail_no_snap);
 		return -ECANCELED;
+	}
 
 	list_for_each_entry(range, &snap->data_list, list) {
 		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
@@ -978,6 +1001,8 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
 	unsigned int nr_ranges = 0;
 
@@ -1005,11 +1030,13 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		u64 remaining = (u64)end_lblk - cur_lblk + 1;
 
 		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			atomic64_inc(&stats->snap_fail_es_miss);
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
 		}
 
 		if (ext4_es_is_delayed(&es)) {
+			atomic64_inc(&stats->snap_fail_es_delayed);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
@@ -1024,6 +1051,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		}
 
 		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			atomic64_inc(&stats->snap_fail_ranges_cap);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
@@ -1031,6 +1059,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range) {
+			atomic64_inc(&stats->snap_fail_nomem);
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
 		}
@@ -1058,6 +1087,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			atomic64_inc(&stats->snap_fail_es_other);
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
@@ -1081,6 +1111,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
@@ -1091,6 +1123,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
 	if (ret) {
+		atomic64_inc(&stats->snap_fail_inode_loc);
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
 	}
@@ -1102,6 +1135,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		atomic64_inc(&stats->snap_fail_nomem);
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
@@ -1126,6 +1160,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	atomic64_inc(&stats->snap_inodes);
+	atomic64_add(nr_ranges, &stats->snap_ranges);
 	if (nr_rangesp)
 		*nr_rangesp = nr_ranges;
 	return 0;
@@ -1229,12 +1265,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	int ret = 0;
 	int alloc_ctx;
 
-	if (!inodes_size)
-		return 0;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			atomic64_inc(&sbi->s_fc_snap_stats.snap_fail_inodes_cap);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1260,6 +1294,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			atomic64_inc(&sbi->s_fc_snap_stats.snap_fail_inodes_cap);
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1303,6 +1338,7 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
@@ -1362,8 +1398,13 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 		return ret;
 
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
-	if (ret)
+	if (ret) {
+		if (ret == -E2BIG)
+			atomic64_inc(&snap_stats->snap_fail_inodes_cap);
+		else if (ret == -ENOMEM)
+			atomic64_inc(&snap_stats->snap_fail_nomem);
 		return ret;
+	}
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
@@ -1384,12 +1425,15 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
 				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
-	if (trace_ext4_fc_lock_updates_enabled()) {
-		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
-		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
-						 snap_inodes, snap_ranges,
-						 ret, snap_err);
-	}
+	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	atomic64_add(locked_ns, &snap_stats->lock_updates_ns_total);
+	atomic64_inc(&snap_stats->lock_updates_samples);
+	ext4_fc_snap_stats_update_max(&snap_stats->lock_updates_ns_max,
+				      locked_ns);
+	if (trace_ext4_fc_lock_updates_enabled())
+		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+						 snap_inodes, snap_ranges,
+						 ret, snap_err);
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -2657,11 +2701,26 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 {
 	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
 	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
+	u64 lock_avg_ns = 0;
+	u64 lock_updates_samples;
+	u64 lock_updates_ns_total;
+	u64 lock_updates_ns_max;
 	int i;
 
 	if (v != SEQ_START_TOKEN)
 		return 0;
 
+	lock_updates_samples =
+		atomic64_read(&snap_stats->lock_updates_samples);
+	lock_updates_ns_total =
+		atomic64_read(&snap_stats->lock_updates_ns_total);
+	lock_updates_ns_max =
+		atomic64_read(&snap_stats->lock_updates_ns_max);
+	if (lock_updates_samples)
+		lock_avg_ns = div64_u64(lock_updates_ns_total,
+					lock_updates_samples);
+
 	seq_printf(seq,
 		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
 		   stats->fc_num_commits, stats->fc_ineligible_commits,
@@ -2672,6 +2731,23 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
 			stats->fc_ineligible_reason_count[i]);
 
+	seq_printf(seq,
+		   "Snapshot stats:\n%llu inodes\n%llu ranges\n%lluus lock_updates_avg\n%lluus lock_updates_max\n",
+		   atomic64_read(&snap_stats->snap_inodes),
+		   atomic64_read(&snap_stats->snap_ranges),
+		   div_u64(lock_avg_ns, 1000),
+		   div_u64(lock_updates_ns_max, 1000));
+	seq_printf(seq,
+		   "Snapshot failures:\n%llu es_miss\n%llu es_delayed\n%llu es_other\n%llu inodes_cap\n%llu ranges_cap\n%llu nomem\n%llu inode_loc\n%llu no_snap\n",
+		   atomic64_read(&snap_stats->snap_fail_es_miss),
+		   atomic64_read(&snap_stats->snap_fail_es_delayed),
+		   atomic64_read(&snap_stats->snap_fail_es_other),
+		   atomic64_read(&snap_stats->snap_fail_inodes_cap),
+		   atomic64_read(&snap_stats->snap_fail_ranges_cap),
+		   atomic64_read(&snap_stats->snap_fail_nomem),
+		   atomic64_read(&snap_stats->snap_fail_inode_loc),
+		   atomic64_read(&snap_stats->snap_fail_no_snap));
+
 	return 0;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3c869f0001c5..f1f8819a2a23 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4544,6 +4544,7 @@ static void ext4_fast_commit_init(struct super_block *sb)
 	sbi->s_fc_ineligible_tid = 0;
 	mutex_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	memset(&sbi->s_fc_snap_stats, 0, sizeof(sbi->s_fc_snap_stats));
 	sbi->s_fc_replay_state.fc_regions = NULL;
 	sbi->s_fc_replay_state.fc_regions_size = 0;
 	sbi->s_fc_replay_state.fc_regions_used = 0;
-- 
2.53.0


* Re: [RFC PATCH v2 08/10] rv/tlob: add tlob hybrid automaton monitor
From: Gabriele Monaco @ 2026-05-15  9:53 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <fe5ed6a9a0a911e6ec74dc06c453786a2c4fb6d1.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Introduce tlob (task latency over budget), a per-task hybrid-automaton
> RV monitor that measures elapsed time (CLOCK_MONOTONIC) across
> a user-delimited code section and fires an error_env_tlob tracepoint
> when the elapsed time exceeds a configurable per-invocation budget.
> 
> The monitor is built on RV_MON_PER_OBJ with HA_TIMER_HRTIMER.  Three
> states track the scheduler status of the monitored task:
> 
>   running  --(sleep)-------> sleeping
>   running  --(preempt)-----> waiting
>   sleeping --(wakeup)------> waiting
>   waiting  --(switch_in)--> running
> 
> A single clock invariant clk_elapsed < BUDGET_NS() is active in all
> three states.  The budget hrtimer is rearmed on each DA transition for
> the remaining budget, keeping the absolute deadline fixed at
> start_time + BUDGET_NS.
> 
> Per-task state is stored in the DA framework's hash table keyed by
> task->pid.  Storage is pre-allocated by tlob_start_task() with
> GFP_KERNEL via da_create_or_get() before the scheduler tracepoints
> can fire, using DA_SKIP_AUTO_ALLOC so that no kmalloc occurs on the
> tracepoint hot path.  This avoids both the kmalloc_nolock() restriction
> (requires HAVE_ALIGNED_STRUCT_PAGE) and latency issues under PREEMPT_RT.
> 
> Nested monitoring is handled by nest_depth: tlob_start_task() on an
> already-monitored pid returns -EEXIST and increments nest_depth without
> disturbing the outer window; only the outermost tlob_stop_task()
> performs real cleanup.
> 
> Two userspace interfaces are provided.  The ioctl interface exposes
> in-process self-instrumentation via /dev/rv with TLOB_IOCTL_TRACE_START
> and TLOB_IOCTL_TRACE_STOP.  The uprobe interface enables external
> monitoring of unmodified binaries via tracefs:
> 
>   echo "p PATH:OFFSET_START OFFSET_STOP threshold=NS" \
>       > /sys/kernel/tracing/rv/monitors/tlob/monitor
> 
> Violations are reported via error_env_tlob (HA clock-invariant)
> regardless of which interface triggered them.
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
[...]
> diff --git a/include/linux/rv.h b/include/linux/rv.h
> index 541ba404926a..1ea91bb3f1c2 100644
> --- a/include/linux/rv.h
> +++ b/include/linux/rv.h
> @@ -21,6 +21,13 @@
>  #include <linux/list.h>
>  #include <linux/types.h>
>  
> +/* Forward declaration: poll_table is only needed by rv_chardev_ops::poll.
> + * Avoid pulling in <linux/poll.h> from rv.h — that header is included by
> + * sched.h, and poll.h → fs.h → rcupdate.h creates a header-ordering cycle
> + * with migrate_disable() on UML/non-SMP targets.
> + */
> +struct poll_table_struct;
> +
>  /*
>   * Deterministic automaton per-object variables.
>   */
> @@ -158,6 +165,44 @@ int rv_register_monitor(struct rv_monitor *monitor, struct rv_monitor *parent);
>  int rv_get_task_monitor_slot(void);
>  void rv_put_task_monitor_slot(int slot);

Could you move everything that isn't strictly tlob-related to another
patch? This adds the ioctl functionality; can it stay on its own until
you wire it up with tlob?

[...]

> diff --git a/include/rv/automata.h b/include/rv/automata.h
> index 4a4eb40cf09a..ae819638d85a 100644
> --- a/include/rv/automata.h
> +++ b/include/rv/automata.h
> @@ -41,6 +41,21 @@ static char *model_get_event_name(enum events event)
>  	return RV_AUTOMATON_NAME.event_names[event];
>  }
>  
> +/*
> + * model_get_timer_event_name - label used when the HA timer fires (no event).
> + *
> + * Monitors may define MONITOR_TIMER_EVENT_NAME before including the model
> + * header to give the timer-fired violation a semantically meaningful label
> + * (e.g. "budget_exceeded" for tlob).  Defaults to "none".
> + */
> +#ifndef MONITOR_TIMER_EVENT_NAME
> +#define MONITOR_TIMER_EVENT_NAME "none"
> +#endif

Why don't you just override EVENT_NONE_LBL (and if you prefer call it
MONITOR_TIMER_EVENT_NAME) without the need for another function?
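
To illustrate the override pattern, here is a minimal userspace sketch. It
assumes the framework header guards its default label with #ifndef; that
guard and the event_none_label() helper are stand-ins for this example and
are not taken from the current rv headers.

```c
#include <string.h>

/* Monitor side: define the label before including the framework header. */
#define EVENT_NONE_LBL "budget_exceeded"

/*
 * Framework side (assumed for this sketch): only provide the default
 * label when the monitor did not override it.
 */
#ifndef EVENT_NONE_LBL
#define EVENT_NONE_LBL "none"
#endif

/* Hypothetical helper standing in for the code that reports the label. */
static const char *event_none_label(void)
{
	return EVENT_NONE_LBL;
}
```

With this shape a monitor gets its own label from a single #define and no
extra accessor function is needed.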

> +static inline char *model_get_timer_event_name(void)
> +{
> +	return MONITOR_TIMER_EVENT_NAME;
> +}
> +

[...]

> diff --git a/include/rv/rv_uprobe.h b/include/rv/rv_uprobe.h
> index 084cdb36a2ff..9106c5c9275e 100644
> --- a/include/rv/rv_uprobe.h
> +++ b/include/rv/rv_uprobe.h
> @@ -79,9 +79,41 @@ struct rv_uprobe *rv_uprobe_attach(const char *binpath, loff_t offset,
>   * for any in-progress handler to finish, then releases the path reference
>   * and frees the rv_uprobe struct.  The caller's priv data is NOT freed.
>   *
> + * When removing a single probe, prefer this over the three-phase API.
>   * Safe to call from process context only (uprobe_unregister_sync() may
>   * schedule).
>   */
>  void rv_uprobe_detach(struct rv_uprobe *p);

Why don't you put all this in the patch about uprobes?

>  
> +/**
> + * rv_uprobe_unregister_nosync - dequeue an uprobe without waiting
> + * @p:  probe to dequeue; may be NULL (no-op)
> + *
> + * Removes the uprobe from the uprobe subsystem but does NOT wait for
> + * in-flight handlers to complete.  The caller must call rv_uprobe_sync()
> + * before calling rv_uprobe_free() on the same probe.
> + *
> + * Use this to batch multiple deregistrations before a single rv_uprobe_sync().
> + */
> +void rv_uprobe_unregister_nosync(struct rv_uprobe *p);
> +
> +/**
> + * rv_uprobe_sync - wait for all in-flight uprobe handlers to complete
> + *
> + * Global barrier: waits for every in-flight uprobe handler across the system
> + * to finish.  Call once after a batch of rv_uprobe_unregister_nosync() calls
> + * and before any rv_uprobe_free() call.
> + */
> +void rv_uprobe_sync(void);
> +
> +/**
> + * rv_uprobe_free - release resources of a previously deregistered probe
> + * @p:  probe to free; may be NULL (no-op)
> + *
> + * Releases the path reference and frees the rv_uprobe struct.  Must only
> + * be called after rv_uprobe_sync() has returned.  The caller's priv data
> + * is NOT freed.
> + */
> +void rv_uprobe_free(struct rv_uprobe *p);
> +
>  #endif /* _RV_UPROBE_H */
> diff --git a/include/uapi/linux/rv.h b/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000000..a34e5426393b
> --- /dev/null
> +++ b/include/uapi/linux/rv.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC ('r').
> + *
> + * Usage examples and design rationale are in:
> + *   Documentation/trace/rv/monitor_tlob.rst
> + */

Same as above, this could be in a separate patch.

> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
[...]
> diff --git a/kernel/trace/rv/Makefile b/kernel/trace/rv/Makefile
> index f139b904bea3..8a5b5c84aff9 100644
> --- a/kernel/trace/rv/Makefile
> +++ b/kernel/trace/rv/Makefile
> @@ -2,7 +2,7 @@
>  
>  ccflags-y += -I $(src)		# needed for trace events
>  
> -obj-$(CONFIG_RV) += rv.o
> +obj-$(CONFIG_RV) += rv.o rv_chardev.o

Same here.

>  obj-$(CONFIG_RV_MON_WIP) += monitors/wip/wip.o
>  obj-$(CONFIG_RV_MON_WWNR) += monitors/wwnr/wwnr.o
>  obj-$(CONFIG_RV_MON_SCHED) += monitors/sched/sched.o
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/Kconfig
> @@ -0,0 +1,69 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +config RV_MON_TLOB
> +	depends on RV
> +	select RV_UPROBE
> +	select HA_MON_EVENTS_ID
> +	bool "tlob monitor"
> +	help
> +	  Enable the tlob (task latency over budget) monitor.  This monitor
> +	  tracks the elapsed time (CLOCK_MONOTONIC) of a marked code path
> +	  within a task (including both on-CPU and off-CPU time) and reports
> +	  a violation when the elapsed time exceeds a configurable budget.
> +
> +	  The monitor uses a three-state hybrid automaton (running, waiting,
> +	  sleeping) stored per object using RV_MON_PER_OBJ.  A single HA
> +	  clock invariant (clk_elapsed < BUDGET_NS) is enforced in all three
> +	  states via a per-task hrtimer.
> +
> +	  States: running (initial, on-CPU), waiting (in runqueue, off-CPU),
> +	          sleeping (blocked on resource, off-CPU).
> +	  Key transitions:
> +	    running  --(sleep)------> sleeping
> +	    running  --(preempt)----> waiting
> +	    sleeping --(wakeup)-----> waiting
> +	    waiting  --(switch_in)--> running
> +	  task_start calls da_handle_start_event() to set the initial state,
> +	  then arms the budget timer directly via ha_reset_clk_ns() +
> +	  ha_start_timer_ns().  task_stop cancels the timer synchronously via
> +	  ha_cancel_timer_sync() then calls da_monitor_reset().
> +
> +	  Two userspace interfaces are provided:
> +
> +	  tracefs uprobe binding (external, unmodified binaries):
> +	    echo "p PATH:OFFSET_START OFFSET_STOP threshold=NS" \
> +	        > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +	  The uprobe at offset_start fires tlob_start_task(); the uprobe at
> +	  offset_stop fires tlob_stop_task().  Both are plain entry uprobes
> +	  so a mistyped offset cannot corrupt the call stack.
> +
> +	  /dev/rv ioctl (in-process self-instrumentation):
> +	    ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +	    do_critical_work();
> +	    ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	    /* ret == -EOVERFLOW when budget exceeded */
> +	  Allows conditional monitoring, sub-function granularity, and
> +	  inline reaction to violations without polling the trace buffer.
> +
> +	  Up to TLOB_MAX_MONITORED tasks may be monitored simultaneously.
> +
> +	  Violations are always reported via the standard error_env_tlob RV
> +	  tracepoint regardless of which interface triggered them.  The
> +	  tracefs interface requires only tracefs write permissions, avoiding
> +	  the CAP_BPF privilege needed for equivalent eBPF-based approaches.
> +
> +	  For further information, see:
> +	    Documentation/trace/rv/monitor_tlob.rst
> +
> +config TLOB_KUNIT_TEST

Do you need to add this here? Since you have a patch adding KUnit tests
to tlob, can't you put everything KUnit-related there?

That's also going to simplify things since RV KUnits aren't stable right
now.

> +	tristate "KUnit tests for tlob monitor" if !KUNIT_ALL_TESTS

I couldn't build it as a module; do we need it that way?

  ERROR: modpost: "sched_setscheduler_nocheck" [kernel/trace/rv/monitors/tlob/tlob_kunit.ko] undefined!

> +	depends on RV_MON_TLOB && KUNIT
> +	default KUNIT_ALL_TESTS
> +	help
> +	  Enable KUnit in-kernel unit tests for the tlob RV monitor.
> +
> +	  Tests cover automaton state transitions, the start/stop task
> +	  interface, scheduler context-switch accounting, and the uprobe
> +	  format string parser.
> +
> +	  Say Y or M here to run the tlob KUnit test suite; otherwise say N.
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> new file mode 100644
> index 000000000000..475e972ae9aa
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -0,0 +1,1307 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob: task latency over budget monitor
> + *
> + * Track the elapsed wall-clock time of a marked code path and detect when
> + * a monitored task exceeds its per-task latency budget.  CLOCK_MONOTONIC
> + * is used so both on-CPU and off-CPU time count toward the budget.
> + *
> + * On a budget violation, two tracepoints are emitted from the hrtimer
> + * callback: error_env_tlob signals the violation, and detail_env_tlob
> + * provides a per-state time breakdown (running_ns, waiting_ns, sleeping_ns)
> + * that pinpoints whether the overrun occurred in running, waiting, or sleeping state.
> + *
> + * The monitor uses RV_MON_PER_OBJ: per-task state (struct tlob_task_state)
> + * is stored as monitor_target in the framework's hash table.
> + *
> + * One HA clock invariant is enforced:
> + *   clk_elapsed < BUDGET_NS()   (active in all states)
> + *
> + * task_start uses da_handle_start_event() to set the initial state, then
> + * calls ha_reset_clk_ns() + ha_start_timer_ns() directly to initialise the
> + * clock and arm the budget timer.  No synthetic event is needed.
> + * The HA timer is cancelled synchronously by ha_cancel_timer_sync() in
> + * tlob_stop_task().
> + *
> + * Copyright (C) 2026 Wen Yang <wen.yang@linux.dev>
> + */
> +#include <linux/completion.h>
> +#include <linux/hrtimer.h>
> +#include <linux/kernel.h>
> +#include <linux/ktime.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/namei.h>
> +#include <linux/refcount.h>
> +#include <linux/rv.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/tracefs.h>
> +#include <linux/uaccess.h>
> +#include <kunit/visibility.h>
> +#include <rv/instrumentation.h>
> +#include <rv/rv_uprobe.h>
> +#include <uapi/linux/rv.h>
> +#include "../../rv.h"
> +
> +#define MODULE_NAME "tlob"
> +
> +#include <trace/events/sched.h>
> +#include <rv_trace.h>
> +
> +/*
> + * Per-fd private data; one instance per open /dev/rv fd.
> + * monitoring: set while TRACE_START is active; cleared at TRACE_STOP.
> + * budget_exceeded: set by hrtimer callback; read at TRACE_STOP to report
> + * -EOVERFLOW even when cleanup was claimed by a concurrent stop_all or
> + * a task-exit handler.
> + */
> +struct tlob_fpriv {
> +	struct task_struct	*task;
> +	bool			monitoring;
> +	bool			budget_exceeded;
> +};
> +
> +/*
> + * Per-task latency monitoring state.  One instance per monitoring window.
> + * Stored as monitor_target in da_monitor_storage; freed via call_rcu.
> + */
> +struct tlob_task_state {
> +	struct task_struct	*task;		/* via get_task_struct */
> +	u64			threshold_us;	/* budget in microseconds */
> +
> +	/* 1 = cleanup claimed; ha_setup_invariants won't restart the timer.
> */
> +	atomic_t		stopping;
> +
> +	/* Serialises the ns accumulators; held briefly (hardirq-safe). */
> +	raw_spinlock_t		entry_lock;
> +	u64			running_ns;	/* time in running state  */
> +	u64			waiting_ns;	/* time in waiting state  */
> +	u64			sleeping_ns;	/* time in sleeping state */
> +	ktime_t			last_ts;
> +
> +	/* store-release in TRACE_START ioctl, load-acquire in reset_notify.
> */
> +	struct tlob_fpriv	*fpriv;
> +
> +	struct rcu_head		rcu;		/* for call_rcu()
> teardown */
> +};
> +
> +#define RV_MON_TYPE RV_MON_PER_OBJ
> +#define HA_TIMER_TYPE HA_TIMER_HRTIMER
> +/* Pool mode: da_handle_start_event uses da_fill_empty_storage, not kmalloc.
> */
> +#define DA_SKIP_AUTO_ALLOC
> +
> +/* Type for da_monitor_storage.target; must be defined before the includes.
> */
> +typedef struct tlob_task_state *monitor_target;
> +
> +/* Forward-declared so da_monitor_reset_hook works before ha_monitor.h. */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon);
> +#define da_monitor_reset_hook tlob_reset_notify
> +
> +/*
> + * When the hrtimer fires (budget elapsed), the HA framework emits
> + * error_env_tlob with this label instead of the generic "none".
> + */
> +#define MONITOR_TIMER_EVENT_NAME "budget_exceeded"
> +
> +#include "tlob.h"
> +#include <rv/ha_monitor.h>
> +
> +/*
> + * Called from da_monitor_reset() on both normal stop and hrtimer expiry.
> + * On violation (stopping==0), emits detail_env_tlob.
> + */
> +static inline void tlob_reset_notify(struct da_monitor *da_mon)
> +{
> +	struct ha_monitor *ha_mon = to_ha_monitor(da_mon);
> +	struct tlob_task_state *ws;
> +
> +	ha_monitor_reset_env(da_mon);
> +
> +	ws = ha_get_target(ha_mon);
> +	if (!ws)
> +		return;
> +
> +	/*
> +	 * Emit per-state breakdown on budget violation only.
> +	 * stopping==0: timer callback owns this path (genuine overrun).
> +	 * stopping==1: normal stop claimed ownership first; skip.
> +	 */
> +	if (!atomic_read(&ws->stopping)) {
> +		unsigned int curr_state = READ_ONCE(da_mon->curr_state);
> +		u64 running_ns, waiting_ns, sleeping_ns, partial_ns;
> +		struct tlob_fpriv *fp;
> +		unsigned long flags;
> +
> +		/*
> +		 * Snapshot accumulators; partial_ns covers curr_state time
> +		 * not yet folded in (transition-out pending).
> +		 */
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		partial_ns   = ktime_get_ns() - ktime_to_ns(ws->last_ts);
> +		running_ns   = ws->running_ns  +
> +			       (curr_state == running_tlob  ? partial_ns :
> 0);
> +		waiting_ns   = ws->waiting_ns  +
> +			       (curr_state == waiting_tlob  ? partial_ns :
> 0);
> +		sleeping_ns  = ws->sleeping_ns +
> +			       (curr_state == sleeping_tlob ? partial_ns :
> 0);
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +
> +		trace_detail_env_tlob(da_get_id(da_mon), ws->threshold_us,
> +				      running_ns, waiting_ns, sleeping_ns);
> +
> +		/*
> +		 * Latch violation in the fd so TRACE_STOP can return -EOVERFLOW
> +		 * even if a concurrent stop_all or task-exit handler claims
> +		 * cleanup first.  Pairs with smp_store_release in TRACE_START.
> +		 */
> +		fp = smp_load_acquire(&ws->fpriv);
> +		if (fp)
> +			WRITE_ONCE(fp->budget_exceeded, true);
> +	}
> +}
> +
> +#define BUDGET_US(ha_mon) (ha_get_target(ha_mon)->threshold_us)
> +#define BUDGET_NS(ha_mon) (BUDGET_US(ha_mon) * 1000ULL)
> +
> +/* HA constraint functions (called by ha_monitor_handle_constraint) */
> +
> +static u64 ha_get_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64
> time_ns)
> +{
> +	if (env == clk_elapsed_tlob)
> +		return ha_get_clk_ns(ha_mon, env, time_ns);
> +	return ENV_INVALID_VALUE;
> +}
> +
> +static void ha_reset_env(struct ha_monitor *ha_mon, enum envs_tlob env, u64
> time_ns)
> +{
> +	if (env == clk_elapsed_tlob)
> +		ha_reset_clk_ns(ha_mon, env, time_ns);
> +}
> +
> +/*
> + * ha_verify_invariants - clk_elapsed < BUDGET_NS must hold in all states.
> + */
> +static inline bool ha_verify_invariants(struct ha_monitor *ha_mon,
> +					enum states curr_state, enum events
> event,
> +					enum states next_state, u64 time_ns)
> +{
> +	if (curr_state == running_tlob)
> +		return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob,
> time_ns);
> +	else if (curr_state == sleeping_tlob)
> +		return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob,
> time_ns);
> +	else if (curr_state == waiting_tlob)
> +		return ha_check_invariant_ns(ha_mon, clk_elapsed_tlob,
> time_ns);
> +	return true;
> +}
> +
> +/*
> + * Convert invariant (deadline) to guard (reset anchor) on state transitions.
> + * Skip if uninitialised (ENV_INVALID_VALUE): the race between
> + * da_handle_start_event() and ha_reset_clk_ns() would give U64_MAX -
> BUDGET_NS.
> + */
> +static inline void ha_convert_inv_guard(struct ha_monitor *ha_mon,
> +					enum states curr_state, enum events
> event,
> +					enum states next_state, u64 time_ns)
> +{
> +	if (curr_state == next_state)
> +		return;
> +	if (curr_state == running_tlob &&
> +	    !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon),
> time_ns);
> +	else if (curr_state == sleeping_tlob &&
> +		 !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon),
> time_ns);
> +	else if (curr_state == waiting_tlob &&
> +		 !ha_monitor_env_invalid(ha_mon, clk_elapsed_tlob))
> +		ha_inv_to_guard(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon),
> time_ns);
> +}
> +
> +/* No per-event guard conditions for tlob; invariants suffice. */
> +static inline bool ha_verify_guards(struct ha_monitor *ha_mon,
> +				    enum states curr_state, enum events
> event,
> +				    enum states next_state, u64 time_ns)
> +{
> +	return true;
> +}
> +
> +/*
> + * Arm or cancel the HA budget timer on state transitions.
> + * Guard on stopping: sched_switch events can arrive after
> ha_cancel_timer_sync,
> + * restarting the timer and triggering an ODEBUG "activate active" splat.
> + */
> +static inline void ha_setup_invariants(struct ha_monitor *ha_mon,
> +				       enum states curr_state, enum events
> event,
> +				       enum states next_state, u64 time_ns)
> +{
> +	if (next_state == curr_state)
> +		return;
> +	if (next_state == running_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> +	} else if (next_state == sleeping_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> +	} else if (next_state == waiting_tlob) {
> +		if (!atomic_read_acquire(&ha_get_target(ha_mon)->stopping))
> +			ha_start_timer_ns(ha_mon, clk_elapsed_tlob,
> BUDGET_NS(ha_mon), time_ns);
> +	} else if (curr_state == running_tlob)
> +		ha_cancel_timer(ha_mon);
> +	else if (curr_state == waiting_tlob)
> +		ha_cancel_timer(ha_mon);
> +	else if (curr_state == sleeping_tlob)
> +		ha_cancel_timer(ha_mon);
> +}
> +
> +static bool ha_verify_constraint(struct ha_monitor *ha_mon,
> +				 enum states curr_state, enum events event,
> +				 enum states next_state, u64 time_ns)
> +{
> +	if (!ha_verify_invariants(ha_mon, curr_state, event, next_state,
> time_ns))
> +		return false;
> +
> +	ha_convert_inv_guard(ha_mon, curr_state, event, next_state, time_ns);
> +
> +	if (!ha_verify_guards(ha_mon, curr_state, event, next_state,
> time_ns))
> +		return false;
> +
> +	ha_setup_invariants(ha_mon, curr_state, event, next_state, time_ns);
> +
> +	return true;
> +}
> +
> +static struct kmem_cache *tlob_state_cache;
> +
> +static atomic_t tlob_num_monitored = ATOMIC_INIT(0);
> +
> +/* Uprobe binding list; protected by tlob_uprobe_mutex. */
> +static LIST_HEAD(tlob_uprobe_list);
> +static DEFINE_MUTEX(tlob_uprobe_mutex);
> +
> +/*
> + * Serialises duplicate-check + da_create_or_get() to prevent two concurrent
> + * callers for the same pid from both inserting into the hash table.
> + */
> +static DEFINE_MUTEX(tlob_start_mutex);
> +
> +/*
> + * Counts open /dev/rv fds plus one synthetic ref held while enabled.
> + * __tlob_destroy_monitor() drops the synthetic ref and waits for zero
> + * before teardown, preventing kmem_cache_zalloc() on a destroyed cache.
> + */
> +static refcount_t tlob_fd_refcount = REFCOUNT_INIT(0);
> +static DECLARE_COMPLETION(tlob_fd_released);
> +
> +/* Per-uprobe-binding state: a start + stop probe pair for one binary region.
> */
> +struct tlob_uprobe_binding {
> +	struct list_head	list;
> +	u64			threshold_us;
> +	char			binpath[TLOB_MAX_PATH];
> +	loff_t			offset_start;
> +	loff_t			offset_stop;
> +	struct rv_uprobe	*start_probe;
> +	struct rv_uprobe	*stop_probe;
> +};
> +
> +/* RCU callback: free the slab once no readers remain. */
> +static void tlob_free_rcu(struct rcu_head *head)
> +{
> +	struct tlob_task_state *ws =
> +		container_of(head, struct tlob_task_state, rcu);
> +	kmem_cache_free(tlob_state_cache, ws);
> +}
> +
> +/*
> + * handle_sched_switch - advance the DA on every context switch.
> + *
> + * Generates three DA events:
> + *   prev, prev_state != 0  -> sleep_tlob    (running -> sleeping)
> + *   prev, prev_state == 0  -> preempt_tlob  (running -> waiting)
> + *   next                   -> switch_in_tlob (waiting -> running)
> + */
> +static void handle_sched_switch(void *data, bool preempt_unused,
> +				struct task_struct *prev,
> +				struct task_struct *next,
> +				unsigned int prev_state)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +	bool do_prev = false, do_next = false;
> +	bool prev_preempted;
> +	ktime_t now;
> +

Perhaps keep the handler simpler by moving this reporting to a helper
function and use guard(rcu)() there.

> +	rcu_read_lock();
> +
> +	ws = da_get_target_by_id(prev->pid);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		now = ktime_get();
> +		ws->running_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> +		ws->last_ts = now;
> +		/* prev_state == 0: TASK_RUNNING (preempted); != 0: sleeping.
> */
> +		prev_preempted = (prev_state == 0);
> +		do_prev = true;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +
> +	ws = da_get_target_by_id(next->pid);
> +	if (ws) {
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		now = ktime_get();
> +		ws->waiting_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> +		ws->last_ts = now;
> +		do_next = true;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +	}
> +
> +	rcu_read_unlock();
> +

You probably don't need these. da_handle_event should skip tasks without
a monitor.

> +	if (do_prev)
> +		da_handle_event(prev->pid, NULL,
> +				prev_preempted ? preempt_tlob : sleep_tlob);
> +	if (do_next)
> +		da_handle_event(next->pid, NULL, switch_in_tlob);
> +}
> +
> +/*
> + * handle_sched_wakeup - sleeping -> waiting transition.
> + *
> + * try_to_wake_up() skips TASK_RUNNING tasks, so this never fires for a
> + * task already in running or waiting state.
> + */
> +static void handle_sched_wakeup(void *data, struct task_struct *p)
> +{
> +	struct tlob_task_state *ws;
> +	unsigned long flags;
> +	bool found = false;
> +

Same as above to keep the handler simple.

> +	rcu_read_lock();
> +	ws = da_get_target_by_id(p->pid);
> +	if (ws) {
> +		ktime_t now = ktime_get();
> +
> +		raw_spin_lock_irqsave(&ws->entry_lock, flags);
> +		ws->sleeping_ns += ktime_to_ns(ktime_sub(now, ws->last_ts));
> +		ws->last_ts = now;
> +		raw_spin_unlock_irqrestore(&ws->entry_lock, flags);
> +		found = true;
> +	}
> +	rcu_read_unlock();
> +
> +	if (found)

You probably don't need this. da_handle_event should skip tasks without
a monitor.

> +		da_handle_event(p->pid, NULL, wakeup_tlob);
> +}
> +
> +/*
> + * handle_sched_process_exit - clean up if a task exits without TRACE_STOP.
> + *
> + * Called in do_exit() context; the task still has a valid pid here.
> + */
> +static void handle_sched_process_exit(void *data, struct task_struct *p,
> +				       bool group_dead)
> +{
> +	struct tlob_task_state *ws;
> +	bool found = false;
> +

> +	rcu_read_lock();
> +	ws = da_get_target_by_id(p->pid);
> +	found = !!ws;
> +	rcu_read_unlock();
> +
> +	if (found)

You can skip all this here.

> +		tlob_stop_task(p);
> +}
> +
> +
> +
> +/**
> + * tlob_start_task - begin monitoring @task with budget @threshold_us us.
> + * @task:         Task to monitor; may be current or another task.
> + * @threshold_us: Latency budget in microseconds (wall-clock; running + waiting + sleeping). > 0.
> + *
> + * Returns 0, -ENODEV, -EALREADY, -ENOSPC, or -ENOMEM.
> + */
> +int tlob_start_task(struct task_struct *task, u64 threshold_us)
> +{
> +	struct tlob_task_state *ws_existing;
> +	struct tlob_task_state *ws;
> +	struct da_monitor *da_mon;
> +	struct ha_monitor *ha_mon;
> +	u64 now_ns;
> +	int ret;
> +
> +	if (!da_monitor_enabled())
> +		return -ENODEV;
> +
> +	if (threshold_us == 0)
> +		return -ERANGE;
> +
> +	/* Serialise duplicate-check + da_create_or_get for the same pid. */
> +	guard(mutex)(&tlob_start_mutex);
> +
> +	rcu_read_lock();

That should be a scoped_guard(rcu). Definitely use guards if you have
return paths: the compiler is going to clean up (unlock) for you.
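
As a rough illustration of why guards help with early returns, here is a
userspace sketch built on the same __attribute__((cleanup)) compiler
extension (GCC/Clang) that the kernel's guard helpers rely on. Every fake_*
name and check_not_monitored() are hypothetical stand-ins; in the patch the
real call would be guard(rcu)() or scoped_guard(rcu).

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for RCU read-side nesting; counts lock depth for illustration. */
static int fake_rcu_depth;

static void fake_rcu_read_lock(void)   { fake_rcu_depth++; }
static void fake_rcu_read_unlock(void) { fake_rcu_depth--; }

/* Cleanup callback: runs automatically when the guard variable leaves scope. */
static void fake_rcu_guard_exit(int *unused)
{
	(void)unused;
	fake_rcu_read_unlock();
}

/* Hypothetical guard macro modelled on the kernel's guard(rcu)(). */
#define fake_guard_rcu() \
	int fake_rcu_guard __attribute__((cleanup(fake_rcu_guard_exit), unused)) = \
		(fake_rcu_read_lock(), 0)

/* Hypothetical lookup: pretend only pid 42 is already monitored. */
static void *fake_get_target_by_id(int pid)
{
	static int dummy;

	return pid == 42 ? &dummy : NULL;
}

/* Duplicate check with an early return and no explicit unlock call. */
static int check_not_monitored(int pid)
{
	fake_guard_rcu();

	if (fake_get_target_by_id(pid))
		return -EALREADY;
	return 0;
}
```

Both return paths leave the lock depth balanced without a hand-written
unlock, which is the property that matters in tlob_start_task().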

> +	ws_existing = da_get_target_by_id(task->pid);
> +	if (ws_existing) {
> +		rcu_read_unlock();
> +		return -EALREADY;
> +	}
> +	rcu_read_unlock();
> +
> +	ws = kmem_cache_zalloc(tlob_state_cache, GFP_KERNEL);
> +	if (!ws)
> +		return -ENOMEM;
> +
> +	ws->task = task;
> +	get_task_struct(task);
> +	ws->threshold_us = threshold_us;
> +	ws->last_ts = ktime_get();
> +	raw_spin_lock_init(&ws->entry_lock);
> +
> +	/* Claim a pool slot (no kmalloc; DA_SKIP_AUTO_ALLOC + prealloc). */
> +	ret = da_create_or_get(task->pid, ws);
> +	if (ret) {
> +		put_task_struct(task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return ret;
> +	}
> +
> +	atomic_inc(&tlob_num_monitored);
> +
> +	/* Hold RCU across handle + timer setup to keep da_mon valid. */
> +	rcu_read_lock();

Same here about guards.
Sadly there doesn't seem to be a cleanup helper for kmem_cache_free;
it would be worth adding one. You also have a lot of other things to do
here, so it isn't a big deal.

> +	da_handle_start_event(task->pid, ws, switch_in_tlob);
> +	da_mon = da_get_monitor(task->pid, NULL);
> +	if (unlikely(!da_mon)) {
> +		/* Slot registered; missing da_mon means concurrent destroy.
> */
> +		rcu_read_unlock();
> +		da_destroy_storage(task->pid);
> +		atomic_dec(&tlob_num_monitored);
> +		put_task_struct(task);
> +		kmem_cache_free(tlob_state_cache, ws);
> +		return -ENOMEM;
> +	}
> +	ha_mon = to_ha_monitor(da_mon);
> +	now_ns = ktime_get_ns();
> +	ha_reset_env(ha_mon, clk_elapsed_tlob, now_ns);
> +	ha_start_timer_ns(ha_mon, clk_elapsed_tlob, BUDGET_NS(ha_mon),
> now_ns);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_start_task);
> +
> +/**
> + * tlob_stop_task - stop monitoring @task.
> + * @task: Task to stop.
> + *
> + * CAS on ws->stopping (0->1) under RCU claims cleanup ownership;
> + * the winner cancels the timer synchronously and frees all resources.
> + *
> + * Returns 0, -EOVERFLOW (budget exceeded), -ESRCH (not monitored),
> + * or -EAGAIN (concurrent caller claimed cleanup).
> + */
> +int tlob_stop_task(struct task_struct *task)
> +{
> +	struct da_monitor *da_mon;
> +	struct ha_monitor *ha_mon;
> +	struct tlob_task_state *ws;
> +	bool budget_exceeded;
> +
> +	rcu_read_lock();
> +	ws = da_get_target_by_id(task->pid);
> +	if (!ws) {
> +		rcu_read_unlock();
> +		return -ESRCH;
> +	}
> +
> +	da_mon = da_get_monitor(task->pid, NULL);
> +	if (unlikely(!da_mon)) {
> +		/* ws in hash but da_mon gone; internal inconsistency. */
> +		rcu_read_unlock();
> +		WARN_ON_ONCE(1);
> +		return -ESRCH;
> +	}
> +
> +	ha_mon = to_ha_monitor(da_mon);
> +
> +	/*
> +	 * CAS (0->1) claims cleanup ownership under RCU (ws guaranteed
> valid).
> +	 * _release pairs with atomic_read_acquire in ha_setup_invariants.
> +	 */
> +	if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
> +		rcu_read_unlock();
> +		return -EAGAIN;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	/* Wait for in-flight timer callback before reading da_monitoring. */
> +	ha_cancel_timer_sync(ha_mon);
> +
> +	/* Timer fired first -> budget exceeded; otherwise reset normally. */
> +	rcu_read_lock();
> +	budget_exceeded = !da_monitoring(da_mon);
> +	if (!budget_exceeded)
> +		da_monitor_reset(da_mon);
> +	rcu_read_unlock();
> +	da_destroy_storage(task->pid);
> +	atomic_dec(&tlob_num_monitored);
> +
> +	put_task_struct(ws->task);
> +	call_rcu(&ws->rcu, tlob_free_rcu);
> +	return budget_exceeded ? -EOVERFLOW : 0;
> +}
> +EXPORT_SYMBOL_GPL(tlob_stop_task);
> +
> +static void tlob_stop_all(void)
> +{

All this function does should be done by da_monitor_destroy. It does
have some concurrency issues I'm trying to fix, but there's no reason
not to use it.

We could add a way to pass some additional deallocation for all the
other cleanup you're doing on each storage.

Something like a da_extra_cleanup() you can define as whatever you need
and gets called in all per-obj destruction paths.

In general, let's try to use/extend the RV API as much as possible
rather than re-implementing things.
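
A userspace sketch of what such a hook could look like, assuming the
framework defaults it to a no-op and calls it from its per-object destroy
path. da_extra_cleanup() does not exist today, and all fake_* names and
demo_refs_after_destroy() are hypothetical stand-ins for this example.

```c
#include <stdlib.h>

/* Hypothetical per-object payload a monitor attaches to its storage. */
struct fake_target {
	int *task_refs;	/* stand-in for a task_struct reference count */
};

/* Monitor side: define the hook before the framework code is included. */
static void fake_tlob_cleanup(struct fake_target *t)
{
	(*t->task_refs)--;	/* stand-in for put_task_struct() */
	free(t);		/* stand-in for the slab free */
}
#define da_extra_cleanup(target) fake_tlob_cleanup(target)

/* Framework side (assumed): default to a no-op when no hook is defined. */
#ifndef da_extra_cleanup
#define da_extra_cleanup(target) do { } while (0)
#endif

/* Hypothetical framework destroy path that every teardown goes through. */
static void fake_da_destroy_storage(struct fake_target *t)
{
	da_extra_cleanup(t);
}

/* Allocate one target, destroy it, and report the remaining references. */
static int demo_refs_after_destroy(void)
{
	int refs = 1;
	struct fake_target *t = malloc(sizeof(*t));

	if (!t)
		return -1;
	t->task_refs = &refs;
	fake_da_destroy_storage(t);
	return refs;
}
```

The monitor then only describes its own extra teardown, and the iteration
and synchronisation stay in the framework.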

> +	struct da_monitor_storage *ms;
> +	pid_t pids[TLOB_MAX_MONITORED];
> +	int bkt, n = 0;
> +
> +	/* Snapshot pids under RCU; re-derive ws under a fresh lock below. */
> +	rcu_read_lock();
> +	hash_for_each_rcu(da_monitor_ht, bkt, ms, node) {
> +		if (ms->target && n < TLOB_MAX_MONITORED)
> +			pids[n++] = ms->id;
> +	}
> +	rcu_read_unlock();
> +
> +	for (int i = 0; i < n; i++) {
> +		pid_t pid = pids[i];
> +		struct da_monitor *da_mon;
> +		struct ha_monitor *ha_mon;
> +		struct tlob_task_state *ws;
> +
> +		rcu_read_lock();
> +		da_mon = da_get_monitor(pid, NULL);
> +		if (!da_mon) {
> +			/* Cleaned up by tlob_stop_task or exit handler. */
> +			rcu_read_unlock();
> +			continue;
> +		}
> +
> +		ws = da_get_target(da_mon);
> +		ha_mon = to_ha_monitor(da_mon);
> +
> +		/* CAS (0->1) claims ownership; skip if another caller won.
> */
> +		if (atomic_cmpxchg_release(&ws->stopping, 0, 1) != 0) {
> +			rcu_read_unlock();
> +			continue;
> +		}
> +		rcu_read_unlock();
> +
> +		ha_cancel_timer_sync(ha_mon);
> +
> +		scoped_guard(rcu) {
> +			da_monitor_reset(da_mon);
> +		}
> +		da_destroy_storage(pid);
> +		atomic_dec(&tlob_num_monitored);
> +		put_task_struct(ws->task);
> +		call_rcu(&ws->rcu, tlob_free_rcu);
> +	}
> +}
> +
> +static int tlob_uprobe_entry_handler(struct rv_uprobe *p, struct pt_regs
> *regs,
> +				     __u64 *data)
> +{
> +	struct tlob_uprobe_binding *b = p->priv;
> +
> +	tlob_start_task(current, b->threshold_us);
> +	return 0;
> +}
> +
> +static int tlob_uprobe_stop_handler(struct rv_uprobe *p, struct pt_regs
> *regs,
> +				    __u64 *data)
> +{
> +	tlob_stop_task(current);
> +	return 0;
> +}
> +
> +/*
> + * Register start + stop entry uprobes for a binding.
> + * Called with tlob_uprobe_mutex held.
> + */
> +static int tlob_add_uprobe(u64 threshold_us, const char *binpath,
> +			   loff_t offset_start, loff_t offset_stop)
> +{
> +	struct tlob_uprobe_binding *b, *tmp_b;
> +	char pathbuf[TLOB_MAX_PATH];
> +	struct path path;
> +	char *canon;
> +	int ret;
> +
> +	if (binpath[0] != '/')
> +		return -EINVAL;
> +
> +	b = kzalloc_obj(*b, GFP_KERNEL);
> +	if (!b)
> +		return -ENOMEM;
> +
> +	b->threshold_us = threshold_us;
> +	b->offset_start = offset_start;
> +	b->offset_stop  = offset_stop;
> +
> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &path);
> +	if (ret)
> +		goto err_free;
> +
> +	if (!d_is_reg(path.dentry)) {
> +		ret = -EINVAL;
> +		goto err_path;
> +	}
> +
> +	/* Reject duplicate start offset for the same binary. */
> +	list_for_each_entry(tmp_b, &tlob_uprobe_list, list) {
> +		if (tmp_b->offset_start == offset_start &&
> +		    tmp_b->start_probe->path.dentry == path.dentry) {
> +			ret = -EEXIST;
> +			goto err_path;
> +		}
> +	}
> +
> +	canon = d_path(&path, pathbuf, sizeof(pathbuf));
> +	if (IS_ERR(canon)) {
> +		ret = PTR_ERR(canon);
> +		goto err_path;
> +	}
> +	strscpy(b->binpath, canon, sizeof(b->binpath));
> +
> +	/* Both probes share b (priv) and path; attach_path refs path itself.
> */
> +	b->start_probe = rv_uprobe_attach_path(&path, offset_start,
> +					       tlob_uprobe_entry_handler,
> NULL, b);
> +	if (IS_ERR(b->start_probe)) {
> +		ret = PTR_ERR(b->start_probe);
> +		b->start_probe = NULL;
> +		goto err_path;
> +	}
> +
> +	b->stop_probe = rv_uprobe_attach_path(&path, offset_stop,
> +					      tlob_uprobe_stop_handler, NULL,
> b);
> +	if (IS_ERR(b->stop_probe)) {
> +		ret = PTR_ERR(b->stop_probe);
> +		b->stop_probe = NULL;
> +		goto err_start;
> +	}
> +
> +	path_put(&path);
> +	list_add_tail(&b->list, &tlob_uprobe_list);
> +	return 0;
> +
> +err_start:
> +	rv_uprobe_detach(b->start_probe);
> +err_path:
> +	path_put(&path);
> +err_free:
> +	kfree(b);
> +	return ret;
> +}
> +
> +static int tlob_remove_uprobe_by_key(loff_t offset_start, const char
> *binpath)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +	struct path remove_path;
> +	int ret;
> +
> +	ret = kern_path(binpath, LOOKUP_FOLLOW, &remove_path);
> +	if (ret)
> +		return ret;
> +
> +	ret = -ENOENT;
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		if (b->offset_start != offset_start)
> +			continue;
> +		if (b->start_probe->path.dentry != remove_path.dentry)
> +			continue;
> +		list_del(&b->list);
> +		rv_uprobe_detach(b->start_probe);
> +		rv_uprobe_detach(b->stop_probe);
> +		kfree(b);
> +		ret = 0;
> +		break;
> +	}
> +
> +	path_put(&remove_path);
> +	return ret;
> +}
> +
> +static void tlob_remove_all_uprobes(void)
> +{
> +	struct tlob_uprobe_binding *b, *tmp;
> +	LIST_HEAD(pending);
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry_safe(b, tmp, &tlob_uprobe_list, list) {
> +		list_move(&b->list, &pending);
> +		rv_uprobe_unregister_nosync(b->start_probe);
> +		rv_uprobe_unregister_nosync(b->stop_probe);
> +	}
> +	mutex_unlock(&tlob_uprobe_mutex);
> +
> +	if (list_empty(&pending))
> +		return;
> +
> +	/*
> +	 * One global barrier for all probes dequeued above; no new handlers
> +	 * for any of them can fire after this returns.
> +	 */
> +	rv_uprobe_sync();
> +
> +	list_for_each_entry_safe(b, tmp, &pending, list) {
> +		rv_uprobe_free(b->start_probe);
> +		rv_uprobe_free(b->stop_probe);
> +		kfree(b);
> +	}
> +}
> +
> +static ssize_t tlob_monitor_read(struct file *file,
> +				 char __user *ubuf,
> +				 size_t count, loff_t *ppos)
> +{
> +	const int line_sz = TLOB_MAX_PATH + 128;
> +	struct tlob_uprobe_binding *b;
> +	char *buf, *p;
> +	int n = 0, buf_sz, pos = 0;
> +	ssize_t ret;
> +
> +	mutex_lock(&tlob_uprobe_mutex);
> +	list_for_each_entry(b, &tlob_uprobe_list, list)
> +		n++;
> +
> +	buf_sz = (n ? n : 1) * line_sz + 1;
> +	buf = kmalloc(buf_sz, GFP_KERNEL);
> +	if (!buf) {
> +		mutex_unlock(&tlob_uprobe_mutex);
> +		return -ENOMEM;
> +	}
> +
> +	list_for_each_entry(b, &tlob_uprobe_list, list) {
> +		p = b->binpath;
> +		pos += scnprintf(buf + pos, buf_sz - pos,
> +				 "p %s:0x%llx 0x%llx threshold=%llu\n",
> +				 p,
> +				 (unsigned long long)b->offset_start,
> +				 (unsigned long long)b->offset_stop,
> +				 b->threshold_us);
> +	}
> +	mutex_unlock(&tlob_uprobe_mutex);
> +
> +	ret = simple_read_from_buffer(ubuf, count, ppos, buf, pos);
> +	kfree(buf);
> +	return ret;
> +}
> +
> +/*
> + * Parse "p PATH:OFFSET_START OFFSET_STOP threshold=US".
> + * PATH may contain ':'; the last ':' separates path from offset.
> + * Returns 0 or -EINVAL.
> + */
> +static int tlob_parse_uprobe_line(char *buf, u64 *thr_out,
> +				  char **path_out,
> +				  loff_t *start_out, loff_t *stop_out)
> +{
> +	unsigned long long thr = 0, stop_val = 0;
> +	long long start_val;
> +	char *p, *path_token, *token, *colon;
> +	bool got_stop = false, got_thr = false;
> +	int n;
> +
> +	/* Must start with "p " */
> +	if (buf[0] != 'p' || buf[1] != ' ')
> +		return -EINVAL;
> +
> +	p = buf + 2;
> +	while (*p == ' ')
> +		p++;
> +
> +	/* First space-delimited token is PATH:OFFSET_START */
> +	path_token = strsep(&p, " \t");
> +	if (!path_token || !*path_token)
> +		return -EINVAL;
> +
> +	/* Split at last ':' to handle paths that contain ':'. */
> +	colon = strrchr(path_token, ':');
> +	if (!colon || colon - path_token < 2)
> +		return -EINVAL;
> +	*colon = '\0';
> +
> +	if (path_token[0] != '/')
> +		return -EINVAL;
> +
> +	n = 0;
> +	if (sscanf(colon + 1, "%lli%n", &start_val, &n) != 1 || n == 0)
> +		return -EINVAL;
> +	if (start_val < 0)
> +		return -EINVAL;
> +
> +	/* Remaining tokens: OFFSET_STOP threshold=US */
> +	while (p && (token = strsep(&p, " \t")) != NULL) {
> +		if (!*token)
> +			continue;
> +		if (strncmp(token, "threshold=", 10) == 0) {
> +			if (kstrtoull(token + 10, 0, &thr))
> +				return -EINVAL;
> +			got_thr = true;
> +		} else if (!got_stop) {
> +			long long sv;
> +
> +			n = 0;
> +			if (sscanf(token, "%lli%n", &sv, &n) != 1 || n == 0)
> +				return -EINVAL;
> +			if (sv < 0)
> +				return -EINVAL;
> +			stop_val = (unsigned long long)sv;
> +			got_stop = true;
> +		} else {
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (!got_stop || !got_thr || thr == 0)
> +		return -EINVAL;
> +	if (start_val == (long long)stop_val)
> +		return -EINVAL;
> +
> +	*thr_out   = thr;
> +	*path_out  = path_token;
> +	*start_out = (loff_t)start_val;
> +	*stop_out  = (loff_t)stop_val;
> +	return 0;
> +}
> +
> +/* Parse "-PATH:OFFSET_START" (ftrace uprobe_events removal convention). */
> +static int tlob_parse_remove_line(char *buf, char **path_out, loff_t
> *start_out)
> +{
> +	char *binpath, *colon;
> +	long long off;
> +	int n = 0;
> +
> +	if (buf[0] != '-')
> +		return -EINVAL;
> +	binpath = buf + 1;
> +	if (binpath[0] != '/')
> +		return -EINVAL;
> +	colon = strrchr(binpath, ':');
> +	if (!colon || colon - binpath < 2)
> +		return -EINVAL;
> +	*colon = '\0';
> +	if (sscanf(colon + 1, "%lli%n", &off, &n) != 1 || n == 0)
> +		return -EINVAL;
> +	*path_out  = binpath;
> +	*start_out = (loff_t)off;
> +	return 0;
> +}
> +
> +VISIBLE_IF_KUNIT int tlob_create_or_delete_uprobe(char *buf)
> +{
> +	loff_t offset_start, offset_stop;
> +	u64 threshold_us;
> +	char *binpath;
> +	int ret;
> +
> +	if (buf[0] == '-') {
> +		ret = tlob_parse_remove_line(buf, &binpath, &offset_start);
> +		if (ret)
> +			return ret;
> +		mutex_lock(&tlob_uprobe_mutex);
> +		ret = tlob_remove_uprobe_by_key(offset_start, binpath);
> +		mutex_unlock(&tlob_uprobe_mutex);
> +		return ret;
> +	}
> +	ret = tlob_parse_uprobe_line(buf, &threshold_us, &binpath,
> +				     &offset_start, &offset_stop);
> +	if (ret)
> +		return ret;
> +	mutex_lock(&tlob_uprobe_mutex);
> +	ret = tlob_add_uprobe(threshold_us, binpath, offset_start,
> offset_stop);
> +	mutex_unlock(&tlob_uprobe_mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_create_or_delete_uprobe);
> +
> +static ssize_t tlob_monitor_write(struct file *file,
> +				  const char __user *ubuf,
> +				  size_t count, loff_t *ppos)
> +{
> +	char buf[TLOB_MAX_PATH + 128];
> +
> +	if (count >= sizeof(buf))
> +		return -EINVAL;
> +	if (copy_from_user(buf, ubuf, count))
> +		return -EFAULT;
> +	buf[count] = '\0';
> +	if (count > 0 && buf[count - 1] == '\n')
> +		buf[count - 1] = '\0';
> +	return tlob_create_or_delete_uprobe(buf) ?: (ssize_t)count;
> +}
> +
> +static const struct file_operations tlob_monitor_fops = {
> +	.open	= simple_open,
> +	.read	= tlob_monitor_read,
> +	.write	= tlob_monitor_write,
> +	.llseek	= noop_llseek,
> +};
> +
> +static int __tlob_init_monitor(void)
> +{
> +	int retval;
> +
> +	tlob_state_cache = kmem_cache_create("tlob_task_state",
> +					     sizeof(struct tlob_task_state),
> +					     0, 0, NULL);
> +	if (!tlob_state_cache)
> +		return -ENOMEM;
> +
> +	atomic_set(&tlob_num_monitored, 0);
> +
> +	retval = da_monitor_init_prealloc(TLOB_MAX_MONITORED);
> +	if (retval) {
> +		kmem_cache_destroy(tlob_state_cache);
> +		tlob_state_cache = NULL;
> +		return retval;
> +	}
> +
> +	/* Synthetic reference: held while the monitor is enabled. */
> +	reinit_completion(&tlob_fd_released);
> +	refcount_set(&tlob_fd_refcount, 1);
> +
> +	rv_this.enabled = 1;
> +	return 0;
> +}
> +
> +static void __tlob_destroy_monitor(void)
> +{
> +	rv_this.enabled = 0;
> +	/*
> +	 * Remove uprobes first so stop_task can't race with tlob_stop_all().
> +	 * rv_uprobe_sync() inside ensures all in-flight handlers have
> finished.
> +	 */
> +	tlob_remove_all_uprobes();
> +	tlob_stop_all();
> +	/* Wait for tlob_free_rcu and da_pool_return_cb before pool teardown.
> */
> +	synchronize_rcu();
> +
> +	/*
> +	 * Drop the synthetic ref and wait for all open fds to close before
> +	 * teardown; prevents kmem_cache_zalloc() on the destroyed cache.
> +	 */
> +	if (!refcount_dec_and_test(&tlob_fd_refcount))
> +		wait_for_completion(&tlob_fd_released);
> +
> +	da_monitor_destroy();
> +	kmem_cache_destroy(tlob_state_cache);
> +	tlob_state_cache = NULL;
> +}
> +
> +/* KUnit wrappers that acquire rv_interface_lock around monitor init/destroy.
> */
> +#if IS_ENABLED(CONFIG_KUNIT)
> +int tlob_init_monitor(void)
> +{
> +	int ret;
> +
> +	mutex_lock(&rv_interface_lock);
> +	ret = __tlob_init_monitor();
> +	mutex_unlock(&rv_interface_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(tlob_init_monitor);
> +
> +void tlob_destroy_monitor(void)
> +{
> +	mutex_lock(&rv_interface_lock);
> +	__tlob_destroy_monitor();
> +	mutex_unlock(&rv_interface_lock);
> +}
> +EXPORT_SYMBOL_GPL(tlob_destroy_monitor);
> +
> +int tlob_num_monitored_read(void)
> +{
> +	return atomic_read(&tlob_num_monitored);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_num_monitored_read);
> +
> +/* Tracepoint probes for KUnit; rv_trace.h is only included here. */
> +static struct tlob_captured_event     tlob_kunit_last_event;
> +static struct tlob_captured_error_env tlob_kunit_last_error_env;
> +static atomic_t tlob_kunit_event_cnt    = ATOMIC_INIT(0);
> +static atomic_t tlob_kunit_error_env_cnt = ATOMIC_INIT(0);
> +
> +static void tlob_kunit_event_probe(void *data, int id, char *state, char
> *event,
> +				   char *next_state, bool final_state)
> +{
> +	tlob_kunit_last_event.id = id;
> +	strscpy(tlob_kunit_last_event.state, state,
> +		sizeof(tlob_kunit_last_event.state));
> +	strscpy(tlob_kunit_last_event.event, event,
> +		sizeof(tlob_kunit_last_event.event));
> +	strscpy(tlob_kunit_last_event.next_state, next_state,
> +		sizeof(tlob_kunit_last_event.next_state));
> +	tlob_kunit_last_event.final_state = final_state;
> +	atomic_inc(&tlob_kunit_event_cnt);
> +}
> +
> +static void tlob_kunit_error_env_probe(void *data, int id, char *state,
> +				       char *event, char *env)
> +{
> +	tlob_kunit_last_error_env.id = id;
> +	strscpy(tlob_kunit_last_error_env.state, state,
> +		sizeof(tlob_kunit_last_error_env.state));
> +	strscpy(tlob_kunit_last_error_env.event, event,
> +		sizeof(tlob_kunit_last_error_env.event));
> +	strscpy(tlob_kunit_last_error_env.env, env,
> +		sizeof(tlob_kunit_last_error_env.env));
> +	atomic_inc(&tlob_kunit_error_env_cnt);
> +}
> +
> +int tlob_register_kunit_probes(void)
> +{
> +	int ret;
> +
> +	atomic_set(&tlob_kunit_event_cnt, 0);
> +	atomic_set(&tlob_kunit_error_env_cnt, 0);
> +
> +	ret = register_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +	if (ret)
> +		return ret;
> +	ret = register_trace_error_env_tlob(tlob_kunit_error_env_probe,
> NULL);
> +	if (ret) {
> +		unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +		return ret;
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_register_kunit_probes);
> +
> +void tlob_unregister_kunit_probes(void)
> +{
> +	unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +	unregister_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> +	tracepoint_synchronize_unregister();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_unregister_kunit_probes);
> +
> +int tlob_event_count_read(void)
> +{
> +	return atomic_read(&tlob_kunit_event_cnt);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_read);
> +
> +void tlob_event_count_reset(void)
> +{
> +	atomic_set(&tlob_kunit_event_cnt, 0);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_event_count_reset);
> +
> +int tlob_error_env_count_read(void)
> +{
> +	return atomic_read(&tlob_kunit_error_env_cnt);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_read);
> +
> +void tlob_error_env_count_reset(void)
> +{
> +	atomic_set(&tlob_kunit_error_env_cnt, 0);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_reset);
> +
> +const struct tlob_captured_event *tlob_last_event_read(void)
> +{
> +	return &tlob_kunit_last_event;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_event_read);
> +
> +const struct tlob_captured_error_env *tlob_last_error_env_read(void)
> +{
> +	return &tlob_kunit_last_error_env;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_error_env_read);
> +
> +#endif /* CONFIG_KUNIT */
> +
> +VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
> +{
> +	rv_attach_trace_probe("tlob", sched_switch, handle_sched_switch);
> +	rv_attach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +	rv_attach_trace_probe("tlob", sched_process_exit,
> handle_sched_process_exit);
> +	return 0;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_enable_hooks);
> +
> +VISIBLE_IF_KUNIT void tlob_disable_hooks(void)
> +{
> +	rv_detach_trace_probe("tlob", sched_switch, handle_sched_switch);
> +	rv_detach_trace_probe("tlob", sched_wakeup, handle_sched_wakeup);
> +	rv_detach_trace_probe("tlob", sched_process_exit,
> handle_sched_process_exit);
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_disable_hooks);
> +
> +static int enable_tlob(void)
> +{
> +	int retval;
> +
> +	retval = __tlob_init_monitor();
> +	if (retval)
> +		return retval;
> +
> +	return tlob_enable_hooks();
> +}
> +
> +static void disable_tlob(void)
> +{
> +	tlob_disable_hooks();
> +	__tlob_destroy_monitor();
> +}
> +
> +static struct rv_monitor rv_this = {
> +	.name		= "tlob",
> +	.description	= "Per-task latency-over-budget monitor.",
> +	.enable		= enable_tlob,
> +	.disable	= disable_tlob,
> +	.reset		= da_monitor_reset_all,
> +	.enabled	= 0,
> +};
> +
> +static void *tlob_chardev_bind(void)
> +{
> +	struct tlob_fpriv *fp;
> +
> +	fp = kzalloc_obj(*fp, GFP_KERNEL);
> +	if (!fp)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Pin cache/pool for fd lifetime; balanced in tlob_chardev_release.
> +	 * If the synthetic ref has already been dropped
> (__tlob_destroy_monitor
> +	 * ran to completion), reject the bind so the caller gets ENODEV
> instead
> +	 * of corrupting a zero refcount.
> +	 */
> +	if (!refcount_inc_not_zero(&tlob_fd_refcount)) {
> +		kfree(fp);
> +		return ERR_PTR(-ENODEV);
> +	}
> +	return fp;
> +}
> +
> +static void tlob_chardev_release(void *priv)
> +{
> +	struct tlob_fpriv *fp = priv;
> +
> +	if (fp->monitoring) {
> +		/* All return values are safe on close. */
> +		(void)tlob_stop_task(fp->task);
> +		put_task_struct(fp->task);
> +	}
> +
> +	kfree(fp);
> +
> +	/* Release fd's pin; if last, wake __tlob_destroy_monitor. */
> +	if (refcount_dec_and_test(&tlob_fd_refcount))
> +		complete(&tlob_fd_released);
> +}
> +
> +static long tlob_chardev_ioctl(void *priv, unsigned int cmd, unsigned long
> arg)
> +{
> +	struct tlob_fpriv *fp = priv;
> +	struct tlob_start_args args;
> +	struct task_struct *task;
> +	int ret;
> +
> +	switch (cmd) {
> +	case TLOB_IOCTL_TRACE_START:
> +		if (fp->monitoring)
> +			return -EALREADY;
> +
> +		if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
> +			return -EFAULT;
> +
> +		ret = tlob_start_task(current, args.threshold_us);
> +		if (ret)
> +			return ret;
> +
> +		fp->task = current;
> +		get_task_struct(current);
> +		fp->budget_exceeded = false;
> +
> +		/* Link fd so hrtimer callback can latch budget_exceeded. */
> +		scoped_guard(rcu) {
> +			struct tlob_task_state *ws =
> da_get_target_by_id(current->pid);
> +
> +			if (ws)
> +				smp_store_release(&ws->fpriv, fp);
> +		}
> +
> +		fp->monitoring = true;
> +		return 0;
> +
> +	case TLOB_IOCTL_TRACE_STOP:
> +		if (!fp->monitoring)
> +			return -EINVAL;
> +
> +		task = fp->task;
> +		fp->monitoring = false;
> +		fp->task = NULL;
> +
> +		ret = tlob_stop_task(task);
> +		put_task_struct(task);
> +
> +		/*
> +		 * -EOVERFLOW: budget exceeded; propagate to caller.
> +		 * -EAGAIN: concurrent stop_all claimed cleanup; fall through
> to
> +		 *   budget_exceeded latch set by the hrtimer callback.
> +		 * -ESRCH: task exited before TRACE_STOP (process-exit
> handler
> +		 *   claimed cleanup); same latch applies.  Not an internal
> error.
> +		 */
> +		if (ret == -EAGAIN || ret == -ESRCH)
> +			return READ_ONCE(fp->budget_exceeded) ? -EOVERFLOW :
> 0;
> +		return ret;
> +
> +	default:
> +		return -ENOTTY;
> +	}
> +}
> +
> +static const struct rv_chardev_ops tlob_chardev_ops = {
> +	.owner   = THIS_MODULE,
> +	.bind    = tlob_chardev_bind,
> +	.ioctl   = tlob_chardev_ioctl,
> +	.release = tlob_chardev_release,
> +};
> +
> +static int __init register_tlob(void)
> +{
> +	int ret;
> +
> +	ret = rv_chardev_register_monitor("tlob", &tlob_chardev_ops);
> +	if (ret)
> +		return ret;
> +
> +	ret = rv_register_monitor(&rv_this, NULL);
> +	if (ret) {
> +		rv_chardev_unregister_monitor("tlob");
> +		return ret;
> +	}
> +
> +	if (rv_this.root_d) {
> +		if (!tracefs_create_file("monitor", 0644, rv_this.root_d,
> NULL,
> +					 &tlob_monitor_fops)) {
> +			rv_unregister_monitor(&rv_this);
> +			rv_chardev_unregister_monitor("tlob");
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit unregister_tlob(void)
> +{
> +	rv_chardev_unregister_monitor("tlob");
> +	rv_unregister_monitor(&rv_this);
> +}
> +
> +module_init(register_tlob);
> +module_exit(unregister_tlob);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Wen Yang <wen.yang@linux.dev>");
> +MODULE_DESCRIPTION("tlob: task latency over budget per-task monitor.");
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.h
> b/kernel/trace/rv/monitors/tlob/tlob.h
> new file mode 100644
> index 000000000000..71c1735d27d2
> --- /dev/null
[...]
> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
> index ee4e68102f17..a45c4763dbe5 100644
> --- a/kernel/trace/rv/rv.c
> +++ b/kernel/trace/rv/rv.c
> @@ -142,10 +142,17 @@
>  #include <linux/module.h>
>  #include <linux/init.h>
>  #include <linux/slab.h>
> +#include <kunit/visibility.h>
>  
>  #ifdef CONFIG_RV_MON_EVENTS
>  #define CREATE_TRACE_POINTS
>  #include <rv_trace.h>
> +
> +#ifdef CONFIG_RV_MON_TLOB
> +EXPORT_TRACEPOINT_SYMBOL_GPL(error_tlob);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(event_tlob);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(error_env_tlob);
> +#endif

Can't this stay in tlob.c? That way you keep the shared file clean and
skip the ifdeffery.

>  #endif
>  
>  #include "rv.h"
> @@ -696,6 +703,33 @@ static void turn_monitoring_on(void)
>  	WRITE_ONCE(monitoring_on, true);
>  }
>  
> +#if IS_ENABLED(CONFIG_KUNIT)
> +/**
> + * rv_kunit_monitoring_on - enable the global monitoring_on flag for KUnit
> tests.
> + *
> + * KUnit test suite_init functions must call this before initialising any
> + * monitor, mirroring the turn_monitoring_on() call in rv_init_interface().
> + * The matching rv_kunit_monitoring_off() must be called in suite_exit to
> + * restore the flag so that test suites do not interfere with each other.
> + */
> +void rv_kunit_monitoring_on(void)
> +{
> +	turn_monitoring_on();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_on);
> +
> +/**
> + * rv_kunit_monitoring_off - disable the global monitoring_on flag for KUnit
> tests.
> + *
> + * Must be called in suite_exit to restore global state after
> rv_kunit_monitoring_on().
> + */
> +void rv_kunit_monitoring_off(void)
> +{
> +	turn_monitoring_off();
> +}
> +EXPORT_SYMBOL_IF_KUNIT(rv_kunit_monitoring_off);
> +#endif /* CONFIG_KUNIT */
> +
>  static void turn_monitoring_on_with_reset(void)
>  {
>  	lockdep_assert_held(&rv_interface_lock);
> @@ -846,6 +880,10 @@ int __init rv_init_interface(void)
>  	if (retval)
>  		return 1;
>  
> +	retval = rv_chardev_init();
> +	if (retval)
> +		return 1;
> +

Both of those can stay in separate patches as mentioned above.

>  	turn_monitoring_on();
>  
>  	rv_root.root_dir = no_free_ptr(root_dir);
> diff --git a/kernel/trace/rv/rv.h b/kernel/trace/rv/rv.h
> index 2c0f51ff9d5c..82c9a2b57596 100644
> --- a/kernel/trace/rv/rv.h
> +++ b/kernel/trace/rv/rv.h
> @@ -31,6 +31,8 @@ int rv_enable_monitor(struct rv_monitor *mon);
>  bool rv_is_container_monitor(struct rv_monitor *mon);
>  bool rv_is_nested_monitor(struct rv_monitor *mon);
>  
> +int rv_chardev_init(void);
> +

Same here.

>  #ifdef CONFIG_RV_REACTORS
>  int reactor_populate_monitor(struct rv_monitor *mon, struct dentry *root);
>  int init_rv_reactors(struct dentry *root_dir);
> diff --git a/kernel/trace/rv/rv_chardev.c b/kernel/trace/rv/rv_chardev.c
> new file mode 100644
> index 000000000000..1fba1642ebc1
> --- /dev/null
> +++ b/kernel/trace/rv/rv_chardev.c
> @@ -0,0 +1,201 @@
> +// SPDX-License-Identifier: GPL-2.0
> +

And here.

> diff --git a/kernel/trace/rv/rv_uprobe.c b/kernel/trace/rv/rv_uprobe.c
> index bc28399cfd4b..1ba7b80c1d87 100644
> --- a/kernel/trace/rv/rv_uprobe.c
> +++ b/kernel/trace/rv/rv_uprobe.c

Also this probably belongs in the uprobes patch.

> @@ -132,13 +132,10 @@ EXPORT_SYMBOL_GPL(rv_uprobe_attach);
>   */
>  void rv_uprobe_detach(struct rv_uprobe *p)
>  {
> -	struct rv_uprobe_impl *impl;
> -
>  	if (!p)
>  		return;
>  
> -	impl = container_of(p, struct rv_uprobe_impl, pub);
> -	uprobe_unregister_nosync(impl->uprobe, &impl->uc);
> +	rv_uprobe_unregister_nosync(p);
>  	/*
>  	 * uprobe_unregister_sync() is a global barrier: it waits for all
>  	 * in-flight uprobe handlers across the entire system to complete,
> @@ -146,8 +143,47 @@ void rv_uprobe_detach(struct rv_uprobe *p)
>  	 * guarantees that no handler touching impl->pub.priv is running by
>  	 * the time we return, even if the caller immediately frees priv.
>  	 */
> +	rv_uprobe_sync();
> +	rv_uprobe_free(p);
> +}
> +EXPORT_SYMBOL_GPL(rv_uprobe_detach);

[...]

> diff --git a/tools/include/uapi/linux/rv.h b/tools/include/uapi/linux/rv.h
> new file mode 100644
> index 000000000000..a34e5426393b
> --- /dev/null
> +++ b/tools/include/uapi/linux/rv.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * UAPI definitions for Runtime Verification (RV) monitors.
> + *
> + * All RV monitors that expose an ioctl self-instrumentation interface
> + * share the magic byte RV_IOC_MAGIC ('r').
> + *
> + * Usage examples and design rationale are in:
> + *   Documentation/trace/rv/monitor_tlob.rst
> + */

And this in a new ioctl patch.

> +
> +#ifndef _UAPI_LINUX_RV_H
> +#define _UAPI_LINUX_RV_H

Thanks,
Gabriele



* Re: [PATCH] tracing: samples: avoid warning about __aeabi_unwind_cpp_pr1
From: Arnd Bergmann @ 2026-05-15 10:56 UTC (permalink / raw)
  To: Vincent Donnefort, Steven Rostedt
  Cc: Arnd Bergmann, Masami Hiramatsu, Nathan Chancellor, Marc Zyngier,
	Mathieu Desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <agWb6DvB1MdJ12cB@google.com>

On Thu, May 14, 2026, at 11:54, Vincent Donnefort wrote:
> On Wed, May 13, 2026 at 10:59:39AM -0400, Steven Rostedt wrote:
>> 
>> Vincent,
>> 
>> Is this patch needed? That is, did it fall through the cracks?
>
> Yes, I believe it is! 
>
> Reviewed-by: Vincent Donnefort <vdonnefort@google.com>

Just today, I came across yet another one:

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                 U __s390_indirect_jump_r1
                 U __s390_indirect_jump_r10
                 U __s390_indirect_jump_r14
                 U __s390_indirect_jump_r2
                 U __s390_indirect_jump_r5
                 U __s390_indirect_jump_r7
                 U __s390_indirect_jump_r8
                 U __s390_indirect_jump_r9

I'll send a replacement patch that addresses both, since the
old one hasn't been applied yet.

      Arnd


* [PATCH] [v2] tracing: samples: avoid unexpected symbol warnings (arm, s390)
From: Arnd Bergmann @ 2026-05-15 10:57 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu, Marc Zyngier, Nathan Chancellor,
	Vincent Donnefort
  Cc: Arnd Bergmann, Mathieu Desnoyers, Paolo Bonzini, linux-kernel,
	linux-trace-kernel

From: Arnd Bergmann <arnd@arndb.de>

The now more verbose check found more architecture-specific symbols
missing from the allowlist during randconfig testing on s390
and 32-bit arm:

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
         U __aeabi_unwind_cpp_pr1

Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                 U __s390_indirect_jump_r1
                 U __s390_indirect_jump_r10
                 U __s390_indirect_jump_r14
                 U __s390_indirect_jump_r2
                 U __s390_indirect_jump_r5
                 U __s390_indirect_jump_r7
                 U __s390_indirect_jump_r8
                 U __s390_indirect_jump_r9
make[6]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:160: kernel/trace/simple_ring_buffer.o.checked] Error 1

Add these to the list and keep it roughly sorted into sanitizer
and architecture symbols.

Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
v2: add both s390 and arm symbols
---
 kernel/trace/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 1decdce8cbef..9b0834134cae 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -143,8 +143,8 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
 targets += undefsyms_base.o
 KASAN_SANITIZE_undefsyms_base.o := y
 
-UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
-		      __msan simple_ring_buffer \
+UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __msan \
+		      __aeabi_unwind_cpp __s390_indirect_jump __x86_indirect_thunk simple_ring_buffer \
 		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
 
 quiet_cmd_check_undefined = NM      $<
-- 
2.39.5
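
For readers following the allowlist change above, here is a small userspace sketch of the matching it relies on. It assumes each UNDEFINED_ALLOWLIST entry is treated as a symbol-name prefix, which is what the diff suggests (`__s390_indirect_jump` is meant to cover `__s390_indirect_jump_r1` and friends). The real check lives in kernel/trace/Makefile and is not shown in this thread, so sym_allowed() is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Static entries copied from the v2 UNDEFINED_ALLOWLIST in the diff. */
static const char *const allowlist[] = {
	/* sanitizer and coverage runtimes */
	"__asan", "__gcov", "__kasan", "__kcsan", "__hwasan", "__sancov",
	"__sanitizer", "__tsan", "__ubsan", "__msan",
	/* architecture-specific helpers */
	"__aeabi_unwind_cpp", "__s390_indirect_jump", "__x86_indirect_thunk",
	/* the object's own namespace */
	"simple_ring_buffer",
};

/* Assumption: an undefined symbol is accepted if any entry is a prefix of it. */
static bool sym_allowed(const char *sym)
{
	for (size_t i = 0; i < sizeof(allowlist) / sizeof(allowlist[0]); i++) {
		if (strncmp(sym, allowlist[i], strlen(allowlist[i])) == 0)
			return true;
	}
	return false;
}
```

Under that assumption, a single entry per architecture covers the whole family of per-register thunks, so the list does not need to grow with each new register variant.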



* Re: [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: Jakub Sitnicki @ 2026-05-15 11:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260514135342.22130-3-jolsa@kernel.org>

On Thu, May 14, 2026 at 03:53 PM +02, Jiri Olsa wrote:
> We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> fixing has_nop_combo to reflect that.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/lib/bpf/usdt.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/tools/lib/bpf/usdt.c b/tools/lib/bpf/usdt.c
> index e3710933fd52..7e62e4d5bedd 100644
> --- a/tools/lib/bpf/usdt.c
> +++ b/tools/lib/bpf/usdt.c
> @@ -305,7 +305,7 @@ struct usdt_manager *usdt_manager_new(struct bpf_object *obj)
>  
>  	/*
>  	 * Detect kernel support for uprobe() syscall, it's presence means we can
> -	 * take advantage of faster nop5 uprobe handling.
> +	 * take advantage of faster nop10 uprobe handling.
>  	 * Added in: 56101b69c919 ("uprobes/x86: Add uprobe syscall to speed up uprobe")
>  	 */
>  	man->has_uprobe_syscall = kernel_supports(obj, FEAT_UPROBE_SYSCALL);
> @@ -596,14 +596,14 @@ static int parse_usdt_spec(struct usdt_spec *spec, const struct usdt_note *note,
>  #if defined(__x86_64__)
>  static bool has_nop_combo(int fd, long off)
>  {
> -	unsigned char nop_combo[6] = {
> -		0x90, 0x0f, 0x1f, 0x44, 0x00, 0x00 /* nop,nop5 */
> +	unsigned char nop_combo[11] = {
> +		0x90, 0x66, 0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00,
>  	};
> -	unsigned char buf[6];
> +	unsigned char buf[11];
>  
> -	if (pread(fd, buf, 6, off) != 6)
> +	if (pread(fd, buf, 11, off) != 11)
>  		return false;
> -	return memcmp(buf, nop_combo, 6) == 0;
> +	return memcmp(buf, nop_combo, 11) == 0;
>  }
>  #else
>  static bool has_nop_combo(int fd, long off)

Nit: I would use ARRAY_SIZE(buf) instead of repeating the literal size
in multiple places. Otherwise:

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
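
A rough sketch of what the nit could look like, written as standalone userspace C. The byte pattern is the one from the patch; the local ARRAY_SIZE definition and the ssize_t cast are assumptions made to keep the example self-contained, and libbpf's own helper macro may differ.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#endif

/* Returns true if the bytes at @off in @fd are "nop" followed by a 10-byte nop. */
static bool has_nop_combo(int fd, long off)
{
	static const unsigned char nop_combo[] = {
		0x90,						/* nop   */
		0x66, 0x66, 0x0f, 0x1f, 0x84,
		0x00, 0x00, 0x00, 0x00, 0x00,			/* nop10 */
	};
	unsigned char buf[ARRAY_SIZE(nop_combo)];

	/* The size is derived from the pattern, so it is stated only once. */
	if (pread(fd, buf, ARRAY_SIZE(buf), off) != (ssize_t)ARRAY_SIZE(buf))
		return false;
	return memcmp(buf, nop_combo, ARRAY_SIZE(buf)) == 0;
}
```

Sizing buf from nop_combo means a future change of the pattern only touches the initializer.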


* Re: [PATCH 1/7] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Jiri Olsa @ 2026-05-15 12:31 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <2qkbqj7c2bi7li4crheoarasvokrtxbb7ikofdv5zvsvgww5lx@bjd73tm2prfj>

On Thu, May 14, 2026 at 06:54:37PM +0200, Jakub Sitnicki wrote:
> On Thu, May 14, 2026 at 03:53:36PM +0200, Jiri Olsa wrote:
> > Andrii reported an issue with optimized uprobes [1] that can clobber
> > redzone area with call instruction storing return address on stack
> > where user code may keep temporary data without adjusting rsp.
> > 
> > Fixing this by moving the optimized uprobes on top of 10-bytes nop
> > instruction, so we can squeeze another instruction to escape the
> > redzone area before doing the call, like:
> > 
> >   lea -0x80(%rsp), %rsp
> >   call tramp
> > 
> > Note the lea instruction is used to adjust the rsp register without
> > changing the flags.
> > 
> > The optimized uprobe performance stays the same:
> > 
> >         uprobe-nop     :    3.129 ± 0.013M/s
> >         uprobe-push    :    3.045 ± 0.006M/s
> >         uprobe-ret     :    1.095 ± 0.004M/s
> >   -->   uprobe-nop10   :    7.170 ± 0.020M/s
> >         uretprobe-nop  :    2.143 ± 0.021M/s
> >         uretprobe-push :    2.090 ± 0.000M/s
> >         uretprobe-ret  :    0.942 ± 0.000M/s
> >   -->   uretprobe-nop10:    3.381 ± 0.003M/s
> >         usdt-nop       :    3.245 ± 0.004M/s
> >   -->   usdt-nop10     :    7.256 ± 0.023M/s
> > 
> > [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> > Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  arch/x86/kernel/uprobes.c | 121 +++++++++++++++++++++++++++-----------
> >  1 file changed, 86 insertions(+), 35 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index ebb1baf1eb1d..f7c4101a4039 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -636,9 +636,21 @@ struct uprobe_trampoline {
> >  	unsigned long		vaddr;
> >  };
> >  
> > +#define LEA_INSN_SIZE		5
> > +#define OPT_INSN_SIZE		(LEA_INSN_SIZE + CALL_INSN_SIZE)
> > +#define OPT_JMP8_OFFSET		(OPT_INSN_SIZE - JMP8_INSN_SIZE)
> > +#define REDZONE_SIZE		0x80
> > +
> > +static const u8 lea_rsp[] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };
> > +
> > +static bool is_lea_insn(const uprobe_opcode_t *insn)
> > +{
> > +	return !memcmp(insn, lea_rsp, LEA_INSN_SIZE);
> > +}
> > +
> 
> Just a thought. See if below maybe reads better when plugged in.
> is_call_insn can then be removed, I think.
> 
> static bool is_call_past_redzone_insns(const uprobe_opcode_t *insn)
> {
> 	static const u8 lea_rsp_call[] = {
> 		0x48, 0x8d, 0x64, 0x24, REDZONE_SIZE, /* lea -0x80(%rsp), %rsp */
> 		CALL_INSN_OPCODE
> 	};
> 
> 	return !memcmp(insn, lea_rsp_call, ARRAY_SIZE(lea_rsp_call));
> }

yep, might be easier to unify that, thanks

jirka
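
For readers following along, here is a minimal userspace sketch of the unified check suggested above. It is not the kernel code: the `len` parameter and the local `REDZONE_SIZE`/`CALL_INSN_OPCODE` defines are assumptions added so the snippet builds on its own (0xE8 is the x86 near-call opcode). Only the 5-byte lea plus the call opcode byte are compared; the call's 4-byte displacement is left unchecked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REDZONE_SIZE     0x80
#define CALL_INSN_OPCODE 0xE8	/* x86 near call, rel32 */

/*
 * Match "lea -0x80(%rsp), %rsp" immediately followed by a call opcode.
 * 'len' is the number of readable bytes at 'insn' (illustrative only).
 */
static bool is_call_past_redzone_insns(const uint8_t *insn, size_t len)
{
	static const uint8_t lea_rsp_call[] = {
		0x48, 0x8d, 0x64, 0x24, REDZONE_SIZE, /* lea -0x80(%rsp), %rsp */
		CALL_INSN_OPCODE
	};

	if (len < sizeof(lea_rsp_call))
		return false;
	return memcmp(insn, lea_rsp_call, sizeof(lea_rsp_call)) == 0;
}
```

Folding both instructions into one byte array means a single memcmp() covers the whole escape sequence, which is why a separate call-opcode helper may become unnecessary.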

^ permalink raw reply

* Re: [PATCH 2/7] libbpf: Change has_nop_combo to work on top of nop10
From: Jiri Olsa @ 2026-05-15 12:32 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: oleg, peterz, mingo, mhiramat, andrii, bpf, linux-trace-kernel,
	ast, daniel, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai
In-Reply-To: <430564f1eb4e1fb284b40510811502671e4aa8ef59afadddd546e4286402f760@mail.kernel.org>

On Thu, May 14, 2026 at 02:55:25PM +0000, bot+bpf-ci@kernel.org wrote:
> >     libbpf: Change has_nop_combo to work on top of nop10
> >
> >     We now expect nop combo with 10 bytes nop instead of 5 bytes nop,
> >     fixing has_nop_combo to reflect that.
> >
> >     Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> 
> This looks like a bug fix for kernel/libbpf integration. Should this
> include:
> 
>   Fixes: 9734c3ac2f23 ("uprobes/x86: Move optimized uprobe from nop5 to nop10")

I think it should be this one:

  41a5c7df4466 libbpf: Add support to detect nop,nop5 instructions combo for usdt probe

jirka

^ permalink raw reply

* Re: [RFC PATCH v2 08/10] rv/tlob: add tlob hybrid automaton monitor
From: Gabriele Monaco @ 2026-05-15 13:08 UTC (permalink / raw)
  To: wen.yang, Steven Rostedt; +Cc: linux-trace-kernel, linux-kernel
In-Reply-To: <fe5ed6a9a0a911e6ec74dc06c453786a2c4fb6d1.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> diff --git a/Documentation/trace/rv/monitor_tlob.rst b/Documentation/trace/rv/monitor_tlob.rst
> new file mode 100644
> index 000000000000..91b592630b3f
> --- /dev/null
> +++ b/Documentation/trace/rv/monitor_tlob.rst
> +Usage
> +-----
> +
> +tracefs interface (uprobe-based external monitoring)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The ``monitor`` tracefs file instruments an unmodified binary via uprobes.
> +The format follows the ftrace ``uprobe_events`` convention (``PATH:OFFSET``
> +for the probe location, ``key=value`` for configuration parameters)::
> +
> +  p PATH:OFFSET_START OFFSET_STOP threshold=US
> +
> +The uprobe at ``OFFSET_START`` fires ``tlob_start_task()``; the uprobe at
> +``OFFSET_STOP`` fires ``tlob_stop_task()``.  Both offsets are ELF file
> +offsets of entry points in ``PATH``.  ``PATH`` may contain ``:``; the last
> +``:`` in the ``PATH:OFFSET_START`` token is the separator.
> +
> +To remove a binding, use ``-PATH:OFFSET_START``::
> +
> +  echo 1 > /sys/kernel/tracing/rv/monitors/tlob/enable
> +
> +  echo "p /usr/bin/myapp:0x12a0 0x12f0 threshold=5000" \
> +      > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # Remove a binding
> +  echo "-/usr/bin/myapp:0x12a0" > /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # List registered bindings
> +  cat /sys/kernel/tracing/rv/monitors/tlob/monitor
> +
> +  # Read violations from the trace buffer
> +  cat /sys/kernel/tracing/trace
> +
> +ioctl self-instrumentation (/dev/rv)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I'm not particularly fond of ioctls: they aren't that flexible, and used
this way I don't really see the added value.

In short, you're adding this so a program can instrument itself using
ioctls instead of uprobes. Couldn't the same thing be achieved using
uprobes alone, e.g. by registering a function address or the current
instruction pointer?

If you really cannot do it with uprobes alone, wouldn't a sysfs/tracefs file
achieve a similar purpose without much of the boilerplate code?

> +
> +``/dev/rv`` is a shared RV character device.  Before using any monitor-specific
> +ioctl, the fd must be bound to a monitor via ``RV_IOCTL_BIND_MONITOR``.  Each
> +open fd has independent per-fd monitoring state::
> +
> +  int fd = open("/dev/rv", O_RDWR);
> +
> +  /* Bind this fd to the tlob monitor. */
> +  struct rv_bind_args bind = { .monitor_name = "tlob" };
> +  ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
> +
> +  struct tlob_start_args args = {
> +      .threshold_us = 50000,   /* 50 ms in microseconds */
> +  };
> +  ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +
> +  /* ... code path under observation ... */
> +
> +  int ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +  /* ret == 0:          within budget  */
> +  /* ret == -EOVERFLOW: budget exceeded */
> +
> +  close(fd);
> +
> +``TRACE_STOP`` returns ``-EOVERFLOW`` whenever the budget was exceeded.
> +The HA timer calls ``da_monitor_reset()`` (storage remains); the
> +synchronous ``ha_cancel_timer_sync()`` in ``tlob_stop_task()`` ensures the
> +callback has completed before checking ``da_monitoring()``.
> +
> +Violation events
> +~~~~~~~~~~~~~~~~

Since you are not documenting the detail_env_tlob tracepoint, is it
really required?
It deviates from the original RV purpose (run a model and spot
violations) by adding further accounting. I'm fine with that if there is a
documented need.
In that case I would at the very least document its usage (though I'd
really like to be rid of it and let the curious user implement the
accounting themselves).
> +
> +Budget violations are always reported via the ``error_env_tlob`` RV
> +tracepoint (HA clock-invariant violation), regardless of which interface
> +triggered them::
> +
> +  cat /sys/kernel/tracing/trace
> +
> +To capture violations in a file::
> +
> +  trace-cmd record -e error_env_tlob &
> +  # ... run workload ...
> +  trace-cmd report
> +

This is standard tracepoint usage; there's nothing specific to tlob that we
should document here. If you feel the existing RV documentation should cover
this subject in more depth, feel free to contribute there.

> +tracefs files
> +-------------
> +
> +The following files are created under
> +``/sys/kernel/tracing/rv/monitors/tlob/``:
> +
> +``enable`` (rw)
> +  Write ``1`` to enable the monitor; write ``0`` to disable it.
> +
> +``desc`` (ro)
> +  Human-readable description of the monitor.

Same here, standard RV.

> +
> +``monitor`` (rw)
> +  Write ``p PATH:OFFSET_START OFFSET_STOP threshold=US``
> +  to bind two entry uprobes.  Write ``-PATH:OFFSET_START`` to remove a
> +  binding.  Read to list registered bindings in the same format.

And this is duplicating what was mentioned above about uprobes, isn't it?
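
For reference, the `PATH:OFFSET_START` rule from the quoted text (the last `:` is the separator, so `PATH` itself may contain `:`) can be sketched in plain userspace C as below. This is only an illustration under stated assumptions: `split_path_offset()` is a hypothetical helper, not the parser from the patch, and the patch's real validation rules may be stricter.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "PATH:OFFSET" in place at the last ':' (hypothetical helper).
 * Returns 0 on success, -EINVAL on a malformed token.
 */
static int split_path_offset(char *token, const char **path,
			     unsigned long *offset)
{
	char *sep = strrchr(token, ':');
	char *end;

	if (!sep || sep == token)		/* no separator, or empty PATH */
		return -EINVAL;
	if (!isdigit((unsigned char)sep[1]))	/* rejects "", "-1", "abc" */
		return -EINVAL;

	errno = 0;
	*offset = strtoul(sep + 1, &end, 0);	/* base 0: decimal or 0x hex */
	if (errno || *end != '\0')
		return -EINVAL;

	*sep = '\0';				/* terminate PATH */
	*path = token;
	return 0;
}
```

With this sketch, `/opt/my:app/bin:0x100` splits into the path `/opt/my:app/bin` and the offset 0x100, while `:0x100` and `/usr/bin/myapp:-1` are rejected, in line with the valid and invalid examples listed in the KUnit patch of this series.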

> +
> +Kernel API
> +----------
> +
> +.. kernel-doc:: kernel/trace/rv/monitors/tlob/tlob.c
> +   :functions: tlob_start_task tlob_stop_task
> +
> +``tlob_start_task(task, threshold_us)``
> +  Begin monitoring *task* with a total latency budget of *threshold_us*
> +  microseconds.  Allocates per-task state, sets initial DA state to
> +  ``running``, resets ``clk_elapsed``, and arms the HA budget timer.
> +  Returns 0, -ENODEV (monitor disabled), -ERANGE (zero threshold),
> +  -EALREADY (already monitoring), -ENOSPC (at capacity), or -ENOMEM.
> +
> +``tlob_stop_task(task)``
> +  Stop monitoring *task*.  Synchronously cancels the HA timer via
> +  ``ha_cancel_timer_sync()``, checks ``da_monitoring()`` to determine outcome.
> +  Returns 0 (clean stop, within budget), -EOVERFLOW (budget was exceeded),
> +  -ESRCH (not monitored), or -EAGAIN (concurrent stop racing).
> +

Is kernel code going to use this API? RV monitors are meant to be
enabled by userspace. What's the use-case here?

> +Design notes
> +------------
> +
> +State transitions are driven by two tracepoints:
> +
> +- ``sched_switch``: ``prev_state == 0`` (``TASK_RUNNING``, preempted,
> +  stays on runqueue) → running→waiting; ``prev_state != 0`` (voluntarily
> +  blocked, leaves runqueue) → running→sleeping; ``next`` pointer →
> +  waiting→running.
> +- ``sched_wakeup``: task moves back onto the runqueue → sleeping→waiting.
> +
> +No ``waiting → sleeping`` edge exists because a task can only block
> +itself while executing on CPU.  ``try_to_wake_up()`` is also a no-op
> +when ``__state == TASK_RUNNING``, so ``sched_wakeup`` never fires while
> +the task is in ``waiting`` state.

That's probably a bit too detailed for this page.
If you really want this information somewhere, couldn't it stay in the
code?
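
As an illustration of how compactly the quoted design notes fit in code, the four edges can be written as a small transition table. This is a sketch only: the enum names and `tlob_next_state()` are hypothetical, not the generated automaton from the patch, and `TLOB_INVALID` here just means that the model has no such edge.

```c
/* States and events named after the quoted design notes (names assumed). */
enum tlob_state {
	TLOB_RUNNING, TLOB_WAITING, TLOB_SLEEPING, TLOB_STATE_MAX,
	TLOB_INVALID = -1
};

enum tlob_event {
	TLOB_PREEMPT,	/* sched_switch, prev_state == 0 */
	TLOB_BLOCK,	/* sched_switch, prev_state != 0 */
	TLOB_SWITCH_IN,	/* sched_switch, task is next    */
	TLOB_WAKEUP,	/* sched_wakeup                  */
	TLOB_EVENT_MAX
};

static const int tlob_edges[TLOB_STATE_MAX][TLOB_EVENT_MAX] = {
	/*                  PREEMPT       BLOCK          SWITCH_IN     WAKEUP */
	[TLOB_RUNNING]  = { TLOB_WAITING, TLOB_SLEEPING, TLOB_INVALID, TLOB_INVALID },
	[TLOB_WAITING]  = { TLOB_INVALID, TLOB_INVALID,  TLOB_RUNNING, TLOB_INVALID },
	[TLOB_SLEEPING] = { TLOB_INVALID, TLOB_INVALID,  TLOB_INVALID, TLOB_WAITING },
};

/* Return the next state, or TLOB_INVALID when no edge exists. */
static int tlob_next_state(int state, int event)
{
	if (state < 0 || state >= TLOB_STATE_MAX ||
	    event < 0 || event >= TLOB_EVENT_MAX)
		return TLOB_INVALID;
	return tlob_edges[state][event];
}
```

The absent waiting-to-sleeping edge shows up as `TLOB_INVALID` in the WAITING row under BLOCK, which is the property the quoted paragraph explains in prose.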

Thanks,
Gabriele


^ permalink raw reply

* Re: [RFC PATCH v2 09/10] rv/tlob: add KUnit tests for the tlob monitor
From: Gabriele Monaco @ 2026-05-15 13:13 UTC (permalink / raw)
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <a12d14297b33b9b8d425bc1b813a8aecbd54bcc6.1778522945.git.wen.yang@linux.dev>

On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add five KUnit test suites gated behind CONFIG_TLOB_KUNIT_TEST
> (depends on RV_MON_TLOB && KUNIT; default KUNIT_ALL_TESTS) with a
> .kunitconfig fragment for the kunit.py runner.
> 
> tlob_task_api tests the start/stop API, error returns (-EALREADY,
> -ESRCH, -EOVERFLOW, -ENOSPC, -ERANGE).
> tlob_sched_integration covers context-switch accounting and monitoring
> a kthread.  tlob_uprobe_format exercises the uprobe line parser.
> tlob_trace_output checks sched_switch and error_env_tlob field layout.
> tlob_violation_react verifies error_env_tlob fires once on budget
> expiry and zero times when the budget is not exceeded.
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>

That's quite extensive, but what caught my eye are the tests registering
tracepoint handlers. Once you go there you're no longer doing unit testing.
What's the advantage of testing the entire monitor here over doing that in
selftests?

Thanks,
Gabriele

> ---
>  kernel/trace/rv/monitors/tlob/.kunitconfig |   5 +
>  kernel/trace/rv/monitors/tlob/tlob.c       |  26 +
>  kernel/trace/rv/monitors/tlob/tlob_kunit.c | 881 +++++++++++++++++++++
>  3 files changed, 912 insertions(+)
>  create mode 100644 kernel/trace/rv/monitors/tlob/.kunitconfig
>  create mode 100644 kernel/trace/rv/monitors/tlob/tlob_kunit.c
> 
> diff --git a/kernel/trace/rv/monitors/tlob/.kunitconfig b/kernel/trace/rv/monitors/tlob/.kunitconfig
> new file mode 100644
> index 000000000000..977c58601ab7
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/.kunitconfig
> @@ -0,0 +1,5 @@
> +CONFIG_FTRACE=y
> +CONFIG_KUNIT=y
> +CONFIG_RV=y
> +CONFIG_RV_MON_TLOB=y
> +CONFIG_TLOB_KUNIT_TEST=y
> diff --git a/kernel/trace/rv/monitors/tlob/tlob.c b/kernel/trace/rv/monitors/tlob/tlob.c
> index 475e972ae9aa..90e7035a0b55 100644
> --- a/kernel/trace/rv/monitors/tlob/tlob.c
> +++ b/kernel/trace/rv/monitors/tlob/tlob.c
> @@ -1024,6 +1024,7 @@ EXPORT_SYMBOL_IF_KUNIT(tlob_num_monitored_read);
>  /* Tracepoint probes for KUnit; rv_trace.h is only included here. */
>  static struct tlob_captured_event     tlob_kunit_last_event;
>  static struct tlob_captured_error_env tlob_kunit_last_error_env;
> +static struct tlob_captured_detail    tlob_kunit_last_detail;
>  static atomic_t tlob_kunit_event_cnt    = ATOMIC_INIT(0);
>  static atomic_t tlob_kunit_error_env_cnt = ATOMIC_INIT(0);
>  
> @@ -1054,6 +1055,17 @@ static void tlob_kunit_error_env_probe(void *data, int id, char *state,
>  	atomic_inc(&tlob_kunit_error_env_cnt);
>  }
>  
> +static void tlob_kunit_detail_probe(void *data, int pid, u64 threshold_us,
> +				    u64 running_ns, u64 waiting_ns,
> +				    u64 sleeping_ns)
> +{
> +	tlob_kunit_last_detail.pid		= pid;
> +	tlob_kunit_last_detail.threshold_us	= threshold_us;
> +	tlob_kunit_last_detail.running_ns	= running_ns;
> +	tlob_kunit_last_detail.waiting_ns	= waiting_ns;
> +	tlob_kunit_last_detail.sleeping_ns	= sleeping_ns;
> +}
> +
>  int tlob_register_kunit_probes(void)
>  {
>  	int ret;
> @@ -1069,6 +1081,12 @@ int tlob_register_kunit_probes(void)
>  		unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
>  		return ret;
>  	}
> +	ret = register_trace_detail_env_tlob(tlob_kunit_detail_probe, NULL);
> +	if (ret) {
> +		unregister_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> +		unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
> +		return ret;
> +	}
>  	return 0;
>  }
>  EXPORT_SYMBOL_IF_KUNIT(tlob_register_kunit_probes);
> @@ -1077,6 +1095,7 @@ void tlob_unregister_kunit_probes(void)
>  {
>  	unregister_trace_event_tlob(tlob_kunit_event_probe, NULL);
>  	unregister_trace_error_env_tlob(tlob_kunit_error_env_probe, NULL);
> +	unregister_trace_detail_env_tlob(tlob_kunit_detail_probe, NULL);
>  	tracepoint_synchronize_unregister();
>  }
>  EXPORT_SYMBOL_IF_KUNIT(tlob_unregister_kunit_probes);
> @@ -1105,6 +1124,7 @@ void tlob_error_env_count_reset(void)
>  }
>  EXPORT_SYMBOL_IF_KUNIT(tlob_error_env_count_reset);
>  
> +
>  const struct tlob_captured_event *tlob_last_event_read(void)
>  {
>  	return &tlob_kunit_last_event;
> @@ -1117,6 +1137,12 @@ const struct tlob_captured_error_env *tlob_last_error_env_read(void)
>  }
>  EXPORT_SYMBOL_IF_KUNIT(tlob_last_error_env_read);
>  
> +const struct tlob_captured_detail *tlob_last_detail_read(void)
> +{
> +	return &tlob_kunit_last_detail;
> +}
> +EXPORT_SYMBOL_IF_KUNIT(tlob_last_detail_read);
> +
>  #endif /* CONFIG_KUNIT */
>  
>  VISIBLE_IF_KUNIT int tlob_enable_hooks(void)
> diff --git a/kernel/trace/rv/monitors/tlob/tlob_kunit.c b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> new file mode 100644
> index 000000000000..ed2e7c7abaf8
> --- /dev/null
> +++ b/kernel/trace/rv/monitors/tlob/tlob_kunit.c
> @@ -0,0 +1,881 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit tests for the tlob RV monitor.
> + *
> + * tlob_task_api:          start/stop lifecycle, error paths, violations.
> + * tlob_sched_integration: per-state accounting across real context switches.
> + * tlob_uprobe_format:     uprobe binding format; add/remove acceptance and rejection.
> + * tlob_trace_output:      trace event format for event_tlob, error_env_tlob.
> + * tlob_violation_react:   error count per budget expiry; per-state breakdown.
> + *
> + * tlob_add_uprobe() duplicate-(binary, offset_start) constraint is not covered
> + * here: kern_path() requires a real filesystem; see selftests instead.
> + */
> +#include <kunit/test.h>
> +#include <linux/atomic.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
> +#include <linux/ktime.h>
> +#include <linux/mutex.h>
> +#include <linux/sched.h>
> +#include <linux/sched/rt.h>
> +#include <linux/sched/task.h>
> +
> +#include "tlob.h"
> +
> +MODULE_IMPORT_NS("EXPORTED_FOR_KUNIT_TESTING");
> +
> +/*
> + * Kthread cleanup guard: registers a kunit action that stops a kthread on
> + * test exit, even when a KUNIT_ASSERT fires before normal teardown code runs.
> + *
> + * Caller must call get_task_struct() before registering the guard.
> + * Set guard->task = NULL before normal-path teardown to prevent double-stop.
> + * Pass the completion to unblock on early exit, or NULL if not needed.
> + */
> +struct tlob_kthread_guard {
> +	struct task_struct	*task;
> +	struct completion	*unblock;
> +};
> +
> +static void kthread_guard_fn(void *arg)
> +{
> +	struct tlob_kthread_guard *g = arg;
> +
> +	if (!g->task)
> +		return;
> +	if (g->unblock)
> +		complete(g->unblock);
> +	kthread_stop(g->task);
> +	put_task_struct(g->task);
> +}
> +
> +static struct tlob_kthread_guard *
> +tlob_guard_kthread(struct kunit *test, struct task_struct *task,
> +		   struct completion *unblock)
> +{
> +	struct tlob_kthread_guard *g;
> +
> +	g = kunit_kzalloc(test, sizeof(*g), GFP_KERNEL);
> +	if (!g)
> +		return NULL;
> +	g->task = task;
> +	g->unblock = unblock;
> +	if (kunit_add_action_or_reset(test, kthread_guard_fn, g))
> +		return NULL;
> +	return g;
> +}
> +
> +/* Suite 1: task API - lifecycle, error paths, violations. */
> +
> +/* Basic start/stop cycle */
> +static void tlob_start_stop_ok(struct kunit *test)
> +{
> +	int ret;
> +
> +	ret = tlob_start_task(current, 10000000ULL);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +	KUNIT_EXPECT_EQ(test, tlob_num_monitored_read(), 0);
> +}
> +
> +/* Double start must return -EALREADY; double stop must return -ESRCH. */
> +static void tlob_double_start(struct kunit *test)
> +{
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000ULL), 0);
> +	KUNIT_EXPECT_EQ(test, tlob_start_task(current, 10000000ULL), -EALREADY);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_num_monitored_read(), 0);
> +}
> +
> +/* Stop without start must return -ESRCH. */
> +static void tlob_stop_without_start(struct kunit *test)
> +{
> +	tlob_stop_task(current);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_num_monitored_read(), 0);
> +}
> +
> +/* threshold_us == 0 is invalid and must return -ERANGE. */
> +static void tlob_zero_threshold(struct kunit *test)
> +{
> +	KUNIT_EXPECT_EQ(test, tlob_start_task(current, 0), -ERANGE);
> +}
> +
> +/* 1 us budget: timer almost certainly fires before tlob_stop_task(). */
> +static void tlob_immediate_deadline(struct kunit *test)
> +{
> +	int ret = tlob_start_task(current, 1);
> +
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +	udelay(100);
> +	/* timer fired -> -EOVERFLOW; if we won the race, 0 is also valid */
> +	ret = tlob_stop_task(current);
> +	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -EOVERFLOW);
> +	KUNIT_EXPECT_EQ(test, tlob_num_monitored_read(), 0);
> +}
> +
> +/*
> + * kthreads provide distinct task_structs; fill to TLOB_MAX_MONITORED,
> + * then verify -ENOSPC.
> + */
> +struct tlob_waiter_ctx {
> +	struct completion start;
> +	struct completion done;
> +};
> +
> +static int tlob_waiter_fn(void *arg)
> +{
> +	struct tlob_waiter_ctx *ctx = arg;
> +
> +	wait_for_completion(&ctx->start);
> +	complete(&ctx->done);
> +	return 0;
> +}
> +
> +static void tlob_enospc(struct kunit *test)
> +{
> +	struct tlob_waiter_ctx *ctxs;
> +	struct task_struct **threads;
> +	int i, ret;
> +
> +	ctxs = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> +			     sizeof(*ctxs), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctxs);
> +
> +	threads = kunit_kcalloc(test, TLOB_MAX_MONITORED,
> +				sizeof(*threads), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, threads);
> +
> +	KUNIT_ASSERT_EQ(test, tlob_num_monitored_read(), 0);
> +
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		init_completion(&ctxs[i].start);
> +		init_completion(&ctxs[i].done);
> +
> +		threads[i] = kthread_run(tlob_waiter_fn, &ctxs[i],
> +					 "tlob_waiter_%d", i);
> +		if (IS_ERR(threads[i])) {
> +			KUNIT_FAIL(test, "kthread_run failed at i=%d", i);
> +			threads[i] = NULL;
> +			goto cleanup;
> +		}
> +		get_task_struct(threads[i]);
> +
> +		ret = tlob_start_task(threads[i], 10000000ULL);
> +		if (ret != 0) {
> +			KUNIT_FAIL(test, "tlob_start_task failed at i=%d: %d",
> +				   i, ret);
> +			put_task_struct(threads[i]);
> +			complete(&ctxs[i].start);
> +			threads[i] = NULL;
> +			goto cleanup;
> +		}
> +	}
> +
> +	ret = tlob_start_task(current, 10000000ULL);
> +	KUNIT_EXPECT_EQ(test, ret, -ENOSPC);
> +
> +cleanup:
> +	/* cancel monitoring and unblock first, then wait for full exit */
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		if (!threads[i])
> +			break;
> +		tlob_stop_task(threads[i]);
> +		complete(&ctxs[i].start);
> +	}
> +	for (i = 0; i < TLOB_MAX_MONITORED; i++) {
> +		if (!threads[i])
> +			break;
> +		kthread_stop(threads[i]);
> +		put_task_struct(threads[i]);
> +	}
> +}
> +
> +/*
> + * Holder kthread holds a mutex for 80 ms; arm a 10 ms budget, burn ~1 ms
> + * on-CPU, then block on the mutex; timer fires while sleeping -> -EOVERFLOW.
> + */
> +struct tlob_holder_ctx {
> +	struct mutex		lock;
> +	struct completion	ready;
> +	unsigned int		hold_ms;
> +};
> +
> +static int tlob_holder_fn(void *arg)
> +{
> +	struct tlob_holder_ctx *ctx = arg;
> +
> +	mutex_lock(&ctx->lock);
> +	complete(&ctx->ready);
> +	msleep(ctx->hold_ms);
> +	mutex_unlock(&ctx->lock);
> +	return 0;
> +}
> +
> +static void tlob_deadline_fires_sleeping(struct kunit *test)
> +{
> +	struct tlob_holder_ctx *ctx;
> +	struct tlob_kthread_guard *guard;
> +	struct task_struct *holder;
> +	ktime_t t0;
> +	int ret;
> +
> +	ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctx);
> +	ctx->hold_ms = 80;
> +	mutex_init(&ctx->lock);
> +	init_completion(&ctx->ready);
> +
> +	holder = kthread_run(tlob_holder_fn, ctx, "tlob_holder_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> +	get_task_struct(holder);
> +
> +	guard = tlob_guard_kthread(test, holder, NULL);
> +	KUNIT_ASSERT_NOT_NULL(test, guard);
> +
> +	wait_for_completion(&ctx->ready);
> +
> +	ret = tlob_start_task(current, 10000);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 1000)
> +		cpu_relax();
> +
> +	/* block on mutex: running->sleeping; timer fires while sleeping */
> +	mutex_lock(&ctx->lock);
> +	mutex_unlock(&ctx->lock);
> +
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -EOVERFLOW);
> +
> +	guard->task = NULL;
> +	kthread_stop(holder);
> +	put_task_struct(holder);
> +}
> +
> +/*
> + * yield() triggers a preempt sched_switch (prev_state==0): running->waiting.
> + * Busy-spin 50 ms so the 2 ms budget fires regardless of scheduler timing.
> + */
> +static void tlob_deadline_fires_waiting(struct kunit *test)
> +{
> +	ktime_t t0;
> +	int ret;
> +
> +	ret = tlob_start_task(current, 2000);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	yield();
> +
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 50000)
> +		cpu_relax();
> +
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -EOVERFLOW);
> +}
> +
> +/* Arm a 1 ms budget and busy-spin for 50 ms; timer fires in running state. */
> +static void tlob_deadline_fires_running(struct kunit *test)
> +{
> +	ktime_t t0;
> +	int ret;
> +
> +	ret = tlob_start_task(current, 1000);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 50000)
> +		cpu_relax();
> +
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -EOVERFLOW);
> +}
> +
> +/* Start three tasks, reinit monitor, verify all entries are gone. */
> +static int tlob_dummy_fn(void *arg)
> +{
> +	wait_for_completion((struct completion *)arg);
> +	return 0;
> +}
> +
> +static void tlob_reinit_clears_all(struct kunit *test)
> +{
> +	struct completion *done1, *done2;
> +	struct tlob_kthread_guard *guard1, *guard2;
> +	struct task_struct *t1, *t2;
> +	int ret;
> +
> +	done1 = kunit_kzalloc(test, sizeof(*done1), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, done1);
> +	done2 = kunit_kzalloc(test, sizeof(*done2), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, done2);
> +
> +	init_completion(done1);
> +	init_completion(done2);
> +
> +	t1 = kthread_run(tlob_dummy_fn, done1, "tlob_dummy1");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t1);
> +	get_task_struct(t1);
> +	guard1 = tlob_guard_kthread(test, t1, done1);
> +	KUNIT_ASSERT_NOT_NULL(test, guard1);
> +
> +	t2 = kthread_run(tlob_dummy_fn, done2, "tlob_dummy2");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, t2);
> +	get_task_struct(t2);
> +	guard2 = tlob_guard_kthread(test, t2, done2);
> +	KUNIT_ASSERT_NOT_NULL(test, guard2);
> +
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000ULL), 0);
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(t1, 10000000ULL), 0);
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(t2, 10000000ULL), 0);
> +
> +	tlob_destroy_monitor();
> +	ret = tlob_init_monitor();
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(t1), -ESRCH);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(t2), -ESRCH);
> +
> +	/* null guards before teardown to prevent double-stop */
> +	guard1->task = NULL;
> +	guard2->task = NULL;
> +	complete(done1);
> +	complete(done2);
> +	kthread_stop(t1);
> +	kthread_stop(t2);
> +	put_task_struct(t1);
> +	put_task_struct(t2);
> +}
> +
> +static int tlob_task_api_suite_init(struct kunit_suite *suite)
> +{
> +	rv_kunit_monitoring_on();
> +	return tlob_init_monitor();
> +}
> +
> +static void tlob_task_api_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_destroy_monitor();
> +	rv_kunit_monitoring_off();
> +}
> +
> +static void tlob_task_api_exit(struct kunit *test)
> +{
> +	/*
> +	 * tlob_stop_task() returns pool slots via call_rcu (da_pool_return_cb).
> +	 * Wait for all pending callbacks so each test starts with a full pool.
> +	 */
> +	rcu_barrier();
> +}
> +
> +static struct kunit_case tlob_task_api_cases[] = {
> +	KUNIT_CASE(tlob_start_stop_ok),
> +	KUNIT_CASE(tlob_double_start),
> +	KUNIT_CASE(tlob_stop_without_start),
> +	KUNIT_CASE(tlob_zero_threshold),
> +	KUNIT_CASE(tlob_immediate_deadline),
> +	KUNIT_CASE(tlob_enospc),
> +	KUNIT_CASE(tlob_deadline_fires_sleeping),
> +	KUNIT_CASE(tlob_deadline_fires_waiting),
> +	KUNIT_CASE(tlob_deadline_fires_running),
> +	KUNIT_CASE(tlob_reinit_clears_all),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_task_api_suite = {
> +	.name       = "tlob_task_api",
> +	.suite_init = tlob_task_api_suite_init,
> +	.suite_exit = tlob_task_api_suite_exit,
> +	.exit       = tlob_task_api_exit,
> +	.test_cases = tlob_task_api_cases,
> +};
> +
> +/* Suite 2: sched integration - per-state ns accounting. */
> +
> +struct tlob_ping_ctx {
> +	struct completion ping;
> +	struct completion pong;
> +};
> +
> +static int tlob_ping_fn(void *arg)
> +{
> +	struct tlob_ping_ctx *ctx = arg;
> +
> +	wait_for_completion(&ctx->ping);
> +	complete(&ctx->pong);
> +	return 0;
> +}
> +
> +/* Force two context switches and verify stop returns 0 (within budget). */
> +static void tlob_sched_switch_accounting(struct kunit *test)
> +{
> +	struct tlob_ping_ctx *ctx;
> +	struct tlob_kthread_guard *guard;
> +	struct task_struct *peer;
> +	int ret;
> +
> +	ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctx);
> +	init_completion(&ctx->ping);
> +	init_completion(&ctx->pong);
> +
> +	peer = kthread_run(tlob_ping_fn, ctx, "tlob_ping_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, peer);
> +	get_task_struct(peer);
> +
> +	guard = tlob_guard_kthread(test, peer, &ctx->ping);
> +	KUNIT_ASSERT_NOT_NULL(test, guard);
> +
> +	ret = tlob_start_task(current, 5000000ULL);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* complete(ping) -> peer runs, forcing a context switch out and back */
> +	complete(&ctx->ping);
> +	wait_for_completion(&ctx->pong);
> +
> +	ret = tlob_stop_task(current);
> +	KUNIT_EXPECT_EQ(test, ret, 0);
> +
> +	guard->task = NULL;
> +	kthread_stop(peer);
> +	put_task_struct(peer);
> +}
> +
> +/* start/stop monitoring a kthread other than current */
> +static int tlob_block_fn(void *arg)
> +{
> +	struct completion *done = arg;
> +
> +	msleep(20);
> +	complete(done);
> +	return 0;
> +}
> +
> +static void tlob_monitor_other_task(struct kunit *test)
> +{
> +	struct completion *done;
> +	struct tlob_kthread_guard *guard;
> +	struct task_struct *target;
> +	int ret;
> +
> +	done = kunit_kzalloc(test, sizeof(*done), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, done);
> +	init_completion(done);
> +
> +	target = kthread_run(tlob_block_fn, done, "tlob_target_kunit");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, target);
> +	get_task_struct(target);
> +
> +	guard = tlob_guard_kthread(test, target, NULL);
> +	KUNIT_ASSERT_NOT_NULL(test, guard);
> +
> +	ret = tlob_start_task(target, 5000000ULL);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	wait_for_completion(done);
> +
> +	/* 5 s budget won't fire in 20 ms; 0 or -EOVERFLOW are both valid */
> +	ret = tlob_stop_task(target);
> +	KUNIT_EXPECT_TRUE(test, ret == 0 || ret == -EOVERFLOW);
> +
> +	guard->task = NULL;
> +	kthread_stop(target);
> +	put_task_struct(target);
> +}
> +
> +static int tlob_sched_suite_init(struct kunit_suite *suite)
> +{
> +	rv_kunit_monitoring_on();
> +	return tlob_init_monitor();
> +}
> +
> +static void tlob_sched_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_destroy_monitor();
> +	rv_kunit_monitoring_off();
> +}
> +
> +static struct kunit_case tlob_sched_integration_cases[] = {
> +	KUNIT_CASE(tlob_sched_switch_accounting),
> +	KUNIT_CASE(tlob_monitor_other_task),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_sched_integration_suite = {
> +	.name       = "tlob_sched_integration",
> +	.suite_init = tlob_sched_suite_init,
> +	.suite_exit = tlob_sched_suite_exit,
> +	.test_cases = tlob_sched_integration_cases,
> +};
> +
> +/* Suite 3: uprobe binding format - add/remove acceptance and rejection. */
> +
> +static const char * const tlob_format_valid[] = {
> +	"p /usr/bin/myapp:4768 4848 threshold=5000",
> +	"p /usr/bin/myapp:0x12a0 0x12f0 threshold=10000",
> +	"p /opt/my:app/bin:0x100 0x200 threshold=1000",
> +};
> +
> +static const char * const tlob_format_invalid[] = {
> +	/* add: malformed */
> +	"p /usr/bin/myapp:0x100 0x200 threshold=0",
> +	"p :0x100 0x200 threshold=5000",
> +	"p /usr/bin/myapp:0x100 threshold=5000",
> +	"p /usr/bin/myapp:-1 0x200 threshold=5000",
> +	"p /usr/bin/myapp:0x100 0x200",
> +	"p /usr/bin/myapp:0x100 0x100 threshold=5000",
> +	/* remove: malformed */
> +	"-usr/bin/myapp:0x100",
> +	"-/usr/bin/myapp",
> +	"-/:0x100",
> +	"-/usr/bin/myapp:abc",
> +};
> +
> +/*
> + * Valid add lines return -ENOENT (path does not exist in the test environment)
> + * rather than 0; a non-(-EINVAL) return confirms the format was accepted.
> + */
> +static void tlob_format_accepted(struct kunit *test)
> +{
> +	char buf[128];
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(tlob_format_valid); i++) {
> +		strscpy(buf, tlob_format_valid[i], sizeof(buf));
> +		KUNIT_EXPECT_NE(test, tlob_create_or_delete_uprobe(buf), -EINVAL);
> +	}
> +}
> +
> +static void tlob_format_rejected(struct kunit *test)
> +{
> +	char buf[128];
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(tlob_format_invalid); i++) {
> +		strscpy(buf, tlob_format_invalid[i], sizeof(buf));
> +		KUNIT_EXPECT_EQ(test, tlob_create_or_delete_uprobe(buf), -EINVAL);
> +	}
> +}
> +
> +static struct kunit_case tlob_uprobe_format_cases[] = {
> +	KUNIT_CASE(tlob_format_accepted),
> +	KUNIT_CASE(tlob_format_rejected),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_uprobe_format_suite = {
> +	.name       = "tlob_uprobe_format",
> +	.test_cases = tlob_uprobe_format_cases,
> +};
> +
> +/* Suite 4: trace output - verify event_tlob and error_env_tlob field values. */
> +
> +static void tlob_trace_event_format(struct kunit *test)
> +{
> +	const struct tlob_captured_event *ev;
> +	int pid = current->pid;
> +	int ret;
> +
> +	tlob_event_count_reset();
> +	ret = tlob_start_task(current, 5000000ULL);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	/* sleep/wakeup/switch_in: running->sleeping->waiting->running */
> +	msleep(20);
> +
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +
> +	KUNIT_EXPECT_GE(test, tlob_event_count_read(), 3);
> +
> +	ev = tlob_last_event_read();
> +	KUNIT_EXPECT_EQ(test,    ev->id,          pid);
> +	KUNIT_EXPECT_STREQ(test, ev->state,       "waiting");
> +	KUNIT_EXPECT_STREQ(test, ev->event,       "switch_in");
> +	KUNIT_EXPECT_STREQ(test, ev->next_state,  "running");
> +	KUNIT_EXPECT_TRUE(test,  ev->final_state);
> +}
> +
> +static void tlob_trace_error_env_format(struct kunit *test)
> +{
> +	const struct tlob_captured_error_env *err;
> +	ktime_t t0;
> +	int pid = current->pid;
> +	int ret;
> +
> +	tlob_error_env_count_reset();
> +	ret = tlob_start_task(current, 1000);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 50000)
> +		cpu_relax();
> +
> +	tlob_stop_task(current);
> +
> +	KUNIT_ASSERT_GE(test, tlob_error_env_count_read(), 1);
> +
> +	err = tlob_last_error_env_read();
> +	KUNIT_EXPECT_EQ(test,    err->id,    pid);
> +	KUNIT_EXPECT_STREQ(test, err->state, "running");
> +	KUNIT_EXPECT_STREQ(test, err->event, "budget_exceeded");
> +	KUNIT_EXPECT_TRUE(test, strncmp(err->env, "clk_elapsed=", 12) == 0);
> +}
> +
> +static int tlob_trace_suite_init(struct kunit_suite *suite)
> +{
> +	int ret;
> +
> +	rv_kunit_monitoring_on();
> +	ret = tlob_init_monitor();
> +	if (ret)
> +		goto err_mon_off;
> +	ret = tlob_register_kunit_probes();
> +	if (ret)
> +		goto err_destroy;
> +	ret = tlob_enable_hooks();
> +	if (ret)
> +		goto err_probes;
> +	return 0;
> +
> +err_probes:
> +	tlob_unregister_kunit_probes();
> +err_destroy:
> +	tlob_destroy_monitor();
> +err_mon_off:
> +	rv_kunit_monitoring_off();
> +	return ret;
> +}
> +
> +static void tlob_trace_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_disable_hooks();
> +	tlob_unregister_kunit_probes();
> +	tlob_destroy_monitor();
> +	rv_kunit_monitoring_off();
> +}
> +
> +static struct kunit_case tlob_trace_output_cases[] = {
> +	KUNIT_CASE(tlob_trace_event_format),
> +	KUNIT_CASE(tlob_trace_error_env_format),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_trace_output_suite = {
> +	.name       = "tlob_trace_output",
> +	.suite_init = tlob_trace_suite_init,
> +	.suite_exit = tlob_trace_suite_exit,
> +	.test_cases = tlob_trace_output_cases,
> +};
> +
> +/*
> + * Suite 5: violation reaction - complement to Suite 4.
> + * Suite 4 checks trace field values; Suite 5 checks semantics:
> + * error count per budget expiry and per-state ns breakdown.
> + */
> +
> +/* generous budget; usleep forces state transitions; no error must fire */
> +static void tlob_no_error_within_budget(struct kunit *test)
> +{
> +	tlob_error_env_count_reset();
> +	tlob_event_count_reset();
> +
> +	KUNIT_ASSERT_EQ(test, tlob_start_task(current, 10000000ULL), 0);
> +	usleep_range(5000, 10000);
> +	KUNIT_EXPECT_EQ(test, tlob_stop_task(current), 0);
> +	KUNIT_EXPECT_EQ(test, tlob_error_env_count_read(), 0);
> +	KUNIT_EXPECT_GE(test, tlob_event_count_read(), 2);
> +}
> +
> +/* busy-spin 50 ms >> 1 ms budget; running_ns must dominate */
> +static void tlob_detail_running_dominates(struct kunit *test)
> +{
> +	const struct tlob_captured_detail *d;
> +	u64 total_ns;
> +	ktime_t t0;
> +	int ret;
> +
> +	tlob_error_env_count_reset();
> +
> +	ret = tlob_start_task(current, 1000);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	t0 = ktime_get();
> +	while (ktime_us_delta(ktime_get(), t0) < 50000)
> +		cpu_relax();
> +
> +	tlob_stop_task(current);
> +
> +	KUNIT_EXPECT_EQ(test, tlob_error_env_count_read(), 1);
> +	d = tlob_last_detail_read();
> +	KUNIT_EXPECT_EQ(test, d->pid, current->pid);
> +	KUNIT_EXPECT_EQ(test, d->threshold_us, 1000ULL);
> +	total_ns = d->running_ns + d->waiting_ns + d->sleeping_ns;
> +	KUNIT_EXPECT_GE(test, total_ns, 1000ULL * 1000);
> +	KUNIT_EXPECT_GT(test, d->running_ns, d->sleeping_ns + d->waiting_ns);
> +}
> +
> +struct tlob_hog_ctx {
> +	int spin_ms;
> +};
> +
> +static int tlob_hog_fn(void *arg)
> +{
> +	struct tlob_hog_ctx *ctx = arg;
> +	ktime_t t0 = ktime_get();
> +
> +	while (!kthread_should_stop() &&
> +	       ktime_ms_delta(ktime_get(), t0) < ctx->spin_ms)
> +		cpu_relax();
> +	return 0;
> +}
> +
> +/*
> + * SCHED_FIFO kthread bound to the same CPU preempts the monitored task
> + * (sched_switch prev_state == 0: running->waiting) and holds the CPU for
> + * 80 ms >> 10 ms budget, guaranteeing the timer fires in waiting state.
> + */
> +static void tlob_detail_waiting_dominates(struct kunit *test)
> +{
> +	struct tlob_hog_ctx *ctx;
> +	struct task_struct *hog;
> +	struct tlob_kthread_guard *guard;
> +	const struct tlob_captured_detail *d;
> +	struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
> +	int ret;
> +
> +	tlob_error_env_count_reset();
> +
> +	ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctx);
> +	ctx->spin_ms = 80;
> +
> +	hog = kthread_create(tlob_hog_fn, ctx, "tlob_s5_hog");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, hog);
> +	get_task_struct(hog);
> +
> +	kthread_bind(hog, smp_processor_id());
> +	sched_setscheduler_nocheck(hog, SCHED_FIFO, &param);
> +
> +	guard = tlob_guard_kthread(test, hog, NULL);
> +	KUNIT_ASSERT_NOT_NULL(test, guard);
> +
> +	ret = tlob_start_task(current, 10000); /* 10 ms budget */
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	wake_up_process(hog);
> +	yield(); /* sched_switch prev_state == 0: running->waiting */
> +
> +	tlob_stop_task(current);
> +
> +	KUNIT_EXPECT_EQ(test, tlob_error_env_count_read(), 1);
> +	d = tlob_last_detail_read();
> +	KUNIT_EXPECT_EQ(test, d->sleeping_ns, 0ULL);
> +	KUNIT_EXPECT_GT(test, d->waiting_ns, d->running_ns + d->sleeping_ns);
> +
> +	guard->task = NULL;
> +	kthread_stop(hog);
> +	put_task_struct(hog);
> +}
> +
> +/* block on mutex for 80 ms >> 10 ms budget; sleeping_ns must dominate */
> +static void tlob_detail_sleeping_dominates(struct kunit *test)
> +{
> +	struct tlob_holder_ctx *ctx;
> +	struct tlob_kthread_guard *guard;
> +	struct task_struct *holder;
> +	const struct tlob_captured_detail *d;
> +	int ret;
> +
> +	tlob_error_env_count_reset();
> +
> +	ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
> +	KUNIT_ASSERT_NOT_NULL(test, ctx);
> +	ctx->hold_ms = 80;
> +	mutex_init(&ctx->lock);
> +	init_completion(&ctx->ready);
> +
> +	holder = kthread_run(tlob_holder_fn, ctx, "tlob_s5_detail");
> +	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, holder);
> +	get_task_struct(holder);
> +
> +	guard = tlob_guard_kthread(test, holder, NULL);
> +	KUNIT_ASSERT_NOT_NULL(test, guard);
> +
> +	wait_for_completion(&ctx->ready);
> +
> +	ret = tlob_start_task(current, 10000);
> +	KUNIT_ASSERT_EQ(test, ret, 0);
> +
> +	mutex_lock(&ctx->lock);
> +	mutex_unlock(&ctx->lock);
> +
> +	tlob_stop_task(current);
> +
> +	KUNIT_EXPECT_EQ(test, tlob_error_env_count_read(), 1);
> +	d = tlob_last_detail_read();
> +	KUNIT_EXPECT_GT(test, d->sleeping_ns, d->running_ns + d->waiting_ns);
> +
> +	guard->task = NULL;
> +	kthread_stop(holder);
> +	put_task_struct(holder);
> +}
> +
> +static int tlob_violation_suite_init(struct kunit_suite *suite)
> +{
> +	int ret;
> +
> +	rv_kunit_monitoring_on();
> +	ret = tlob_init_monitor();
> +	if (ret)
> +		goto err_mon_off;
> +	ret = tlob_register_kunit_probes();
> +	if (ret)
> +		goto err_destroy;
> +	ret = tlob_enable_hooks();
> +	if (ret)
> +		goto err_probes;
> +	return 0;
> +
> +err_probes:
> +	tlob_unregister_kunit_probes();
> +err_destroy:
> +	tlob_destroy_monitor();
> +err_mon_off:
> +	rv_kunit_monitoring_off();
> +	return ret;
> +}
> +
> +static void tlob_violation_suite_exit(struct kunit_suite *suite)
> +{
> +	tlob_disable_hooks();
> +	tlob_unregister_kunit_probes();
> +	tlob_destroy_monitor();
> +	rv_kunit_monitoring_off();
> +}
> +
> +static struct kunit_case tlob_violation_react_cases[] = {
> +	KUNIT_CASE(tlob_no_error_within_budget),
> +	KUNIT_CASE(tlob_detail_running_dominates),
> +	KUNIT_CASE(tlob_detail_sleeping_dominates),
> +	KUNIT_CASE(tlob_detail_waiting_dominates),
> +	{}
> +};
> +
> +static struct kunit_suite tlob_violation_react_suite = {
> +	.name       = "tlob_violation_react",
> +	.suite_init = tlob_violation_suite_init,
> +	.suite_exit = tlob_violation_suite_exit,
> +	.test_cases = tlob_violation_react_cases,
> +};
> +
> +kunit_test_suites(&tlob_task_api_suite,
> +		  &tlob_sched_integration_suite,
> +		  &tlob_uprobe_format_suite,
> +		  &tlob_trace_output_suite,
> +		  &tlob_violation_react_suite);
> +
> +MODULE_DESCRIPTION("KUnit tests for the tlob RV monitor");
> +MODULE_LICENSE("GPL");



* Re: [PATCH v7 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Breno Leitao @ 2026-05-15 13:13 UTC
  To: Lance Yang
  Cc: linmiaohe, akpm, david, ljs, vbabka, rppt, surenb, mhocko, shuah,
	nao.horiguchi, rostedt, mhiramat, mathieu.desnoyers, corbet,
	skhan, liam, linux-mm, linux-kernel, linux-doc, linux-kselftest,
	linux-trace-kernel, kernel-team
In-Reply-To: <20260515070353.87244-1-lance.yang@linux.dev>

On Fri, May 15, 2026 at 03:03:53PM +0800, Lance Yang wrote:
> 
> On Thu, May 14, 2026 at 07:37:14AM -0700, Breno Leitao wrote:
> >On Thu, May 14, 2026 at 09:28:30PM +0800, Lance Yang wrote:
> >> 
> >> On Wed, May 13, 2026 at 08:39:33AM -0700, Breno Leitao wrote:
> >> >get_any_page() collapses three different failure modes into a single
> >> >-EIO return:
> >> >
> >> >  * the put_page race in the !count_increased path;
> >> >  * the HWPoisonHandlable() rejection that bounces out of
> >> >    __get_hwpoison_page() with -EBUSY and exhausts shake_page() retries;
> >> >  * the HWPoisonHandlable() rejection that goes through the
> >> >    count_increased / put_page / shake_page retry loop.
> >> >
> >> >The first is transient (the page is racing with the allocator).  The
> >> >second can be either transient (a userspace folio briefly off LRU
> >> >during migration/compaction) or stable (slab/vmalloc/page-table/
> >> >kernel-stack pages).  The third describes a stable kernel-owned page
> >> >that the count_increased=true caller already held a reference on.
> >> >
> >> >Distinguish them on the return path: keep -EIO for both the put_page
> >> >race and the -EBUSY-after-retries branch (shake_page() cannot drag a
> >> >folio back from active migration, so we cannot prove the page is
> >> >permanently kernel-owned from there), keep -EBUSY for the allocation
> >> >race (unchanged), and return -ENOTRECOVERABLE only from the
> >> >count_increased-true HWPoisonHandlable() rejection that exhausts its
> >> >retries -- the caller's reference is structural evidence that the
> >> >page is owned by the kernel.
> >> >
> >> >Extend the unhandlable-page pr_err() to fire for either errno and
> >> >update the get_hwpoison_page() kerneldoc.
> >> >
> >> >memory_failure() still folds every negative return into
> >> >MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> >> >this patch is a no-op for users of memory_failure() and only changes
> >> >the errno that soft_offline_page() can propagate to its callers.  A
> >> >follow-up wires the new return code through memory_failure() and
> >> >reports MF_MSG_KERNEL for the unrecoverable cases.
> >> >
> >> >Suggested-by: David Hildenbrand <david@kernel.org>
> >> >Signed-off-by: Breno Leitao <leitao@debian.org>
> >> >---
> >> > mm/memory-failure.c | 18 +++++++++++++++---
> >> > 1 file changed, 15 insertions(+), 3 deletions(-)
> >> >
> >> >diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> >index 49bcfbd04d213..bae883df3ccb2 100644
> >> >--- a/mm/memory-failure.c
> >> >+++ b/mm/memory-failure.c
> >> >@@ -1408,6 +1408,15 @@ static int get_any_page(struct page *p, unsigned long flags)
> >> > 				shake_page(p);
> >> > 				goto try_again;
> >> > 			}
> >> >+			/*
> >> >+			 * Return -EIO rather than -ENOTRECOVERABLE: this
> >> >+			 * branch is also reached for pages that are merely
> >> >+			 * off-LRU transiently (e.g. a folio in the middle
> >> >+			 * of migration or compaction), which shake_page()
> >> >+			 * cannot drag back.  The caller cannot prove the
> >> >+			 * page is permanently kernel-owned from here, so
> >> >+			 * keep it on the recoverable errno.
> >> >+			 */
> >> > 			ret = -EIO;
> >> > 			goto out;
> >> > 		}
> >> >@@ -1427,10 +1436,10 @@ static int get_any_page(struct page *p, unsigned long flags)
> >> > 			goto try_again;
> >> > 		}
> >> > 		put_page(p);
> >> >-		ret = -EIO;
> >> >+		ret = -ENOTRECOVERABLE;
> >> > 	}
> >> > out:
> >> >-	if (ret == -EIO)
> >> >+	if (ret == -EIO || ret == -ENOTRECOVERABLE)
> >> > 		pr_err("%#lx: unhandlable page.\n", page_to_pfn(p));
> >> > 
> >> > 	return ret;
> >> >@@ -1487,7 +1496,10 @@ static int __get_unpoison_page(struct page *page)
> >> >  *         -EIO for pages on which we can not handle memory errors,
> >> >  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
> >> >  *         operations like allocation and free,
> >> >- *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
> >> >+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy,
> >> >+ *         -ENOTRECOVERABLE for stable kernel-owned pages the handler
> >> >+ *         cannot recover (PG_reserved, slab, vmalloc, page tables,
> >> >+ *         kernel stacks, and similar non-LRU/non-buddy pages).
> >> 
> >> Did you test this patch series? I don't see how we ever get to
> >> -ENOTRECOVERABLE there ...
> >
> >Yes, I did. I am using the following test case:
> 
> Okay.
> 
> >https://github.com/leitao/linux/commit/cfebe84ddeab5ac34ed456331db980d57e7025dc
> >
> >	# RUN_DESTRUCTIVE=1 tools/testing/selftests/mm/hwpoison-panic.sh
> >	# enabling /proc/sys/vm/panic_on_unrecoverable_memory_failure
> >	# injecting hwpoison at phys 0x2a00000 (Kernel rodata)
> >	# expecting kernel panic: 'Memory failure: <pfn>: unrecoverable page'
> >	[  501.113256] Memory failure: 0x2a00: recovery action for reserved kernel page: Ignored
> >	[  501.113956] Kernel panic - not syncing: Memory failure: 0x2a00: unrecoverable page
> >
> >
> >> Even with MF_COUNT_INCREASED, the first pass does:
> >> 
> >> 	if (flags & MF_COUNT_INCREASED)
> >> 		count_increased = true;
> >> 
> >> 	[...]
> >> 
> >> 	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
> >> 		ret = 1;
> >> 	} else {
> >> 		if (pass++ < GET_PAGE_MAX_RETRY_NUM) { <-
> >> 			put_page(p);
> >> 			shake_page(p);
> >> 			count_increased = false;
> >> 			goto try_again; <-
> >> 		}
> >> 		put_page(p);
> >> 		ret = -ENOTRECOVERABLE;
> >> 	}
> >> 
> >> Then we come back with count_increased=false:
> >> 
> >> try_again:
> >> 	if (!count_increased) {
> >> 		ret = __get_hwpoison_page(p, flags); <-
> >> 		if (!ret) {
> >> 		[...]
> >> 		} else if (ret == -EBUSY) { <-
> >> 		[...]
> >> 			ret = -EIO;
> >> 			goto out; <-
> >> 		}
> >> 	}
> >> 
> >> For slab/vmalloc/page-table pages, __get_hwpoison_page() returns -EBUSY:
> >> 
> >> 	if (!HWPoisonHandlable(&folio->page, flags))
> >> 		return -EBUSY;
> >> 
> >> so they still seem to end up as -EIO ... Am I missing something?
> >
> >You are not, and thanks for catching this. I traced it again and the
> >-ENOTRECOVERABLE branch is unreachable for slab/vmalloc/page-table pages
> >exactly as you described. The __get_hwpoison_page() → -EBUSY → shake → retry
> >loop catches them first and they exit as -EIO.
> 
> Wonder if it would be simpler to just do a positive check near the top
> of get_any_page() instead. Something like:
> 
> static bool hwpoison_unrecoverable_kernel_page(struct page *page,
> 						unsigned long flags)

Ack. We probably want to call it something like HWPoisonKernelOwned(), to
follow the naming convention of the existing helpers such as HWPoisonHandlable().
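
For readers following the trace quoted above, here is a rough userspace model
of the retry flow. It is only a sketch: struct model_page,
model_get_hwpoison_page(), model_get_any_page() and MODEL_MAX_RETRY are
made-up names standing in for the kernel code, and the put_page race and
hugetlb handling are left out. Under those assumptions it shows an unhandlable
page leaving with -EIO even when the caller starts with count_increased set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ENOTRECOVERABLE
#define ENOTRECOVERABLE 131	/* fallback for libcs that lack it */
#endif

#define MODEL_MAX_RETRY 3	/* stand-in for GET_PAGE_MAX_RETRY_NUM */

/* Hypothetical page model: only the property the flow tests. */
struct model_page {
	bool handlable;		/* what HWPoisonHandlable() would report */
};

/* Models __get_hwpoison_page(): -EBUSY for unhandlable pages, else 1. */
static int model_get_hwpoison_page(const struct model_page *p)
{
	return p->handlable ? 1 : -EBUSY;
}

/* Models the retry structure of get_any_page() as quoted in the thread. */
static int model_get_any_page(const struct model_page *p, bool count_increased)
{
	int pass = 0;
	int ret;

try_again:
	if (!count_increased) {
		ret = model_get_hwpoison_page(p);
		if (ret == -EBUSY) {
			/* shake_page() would run here before retrying */
			if (pass++ < MODEL_MAX_RETRY)
				goto try_again;
			return -EIO;
		}
	}

	if (p->handlable)
		return 1;

	if (pass++ < MODEL_MAX_RETRY) {
		/* the caller's reference is dropped before the retry */
		count_increased = false;
		goto try_again;
	}
	return -ENOTRECOVERABLE;
}

/* Checks the behaviour described in the review for the modelled cases. */
static void model_self_check(void)
{
	struct model_page kernel_owned = { .handlable = false };
	struct model_page lru = { .handlable = true };

	/* The first retry clears count_increased, so -EBUSY wins. */
	assert(model_get_any_page(&kernel_owned, true) == -EIO);
	assert(model_get_any_page(&kernel_owned, false) == -EIO);
	assert(model_get_any_page(&lru, false) == 1);
}
```

Calling model_self_check() from a small main() exercises the three cases. In
this model the -ENOTRECOVERABLE return is never taken because pass starts at
zero, which is the motivation for a positive check near the top of
get_any_page() instead.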

By the way, I will add the selftest back to this patch series. If it turns
out not to be useful, we can simply not merge it.

Thanks for the review,
--breno


* Re: [RFC PATCH v2 10/10] selftests/verification: add tlob selftests
From: Gabriele Monaco @ 2026-05-15 13:23 UTC
  To: wen.yang; +Cc: linux-trace-kernel, linux-kernel, Steven Rostedt
In-Reply-To: <8148267505ef90175b6b69e1ffb3aa560ff42d35.1778522945.git.wen.yang@linux.dev>



On Tue, 2026-05-12 at 02:24 +0800, wen.yang@linux.dev wrote:
> From: Wen Yang <wen.yang@linux.dev>
> 
> Add selftest coverage for the tlob RV monitor in
> tools/testing/selftests/verification/.
> 
> Two helper binaries are built by tlob/Makefile: tlob_helper for the
> ioctl interface (/dev/rv) and tlob_uprobe_target for the uprobe tests.
> The top-level Makefile delegates to tlob/ via a generic MONITOR_SUBDIRS
> pattern so monitor-specific build details stay within each monitor's
> own subdirectory.
> 
> Eight test files cover the tracefs control interface (tracefs.tc), the
> ioctl self-instrumentation interface (ioctl.tc, 8 scenarios), and the
> uprobe external monitoring interface (uprobe_bind.tc, uprobe_violation.tc,
> uprobe_no_event.tc, uprobe_multi.tc, uprobe_detail_sleeping.tc,
> uprobe_detail_waiting.tc).
> 
> Tested on x86_64 with vng (virtme-ng):
> 
>   TAP version 13
>   1..12
>   ok 1 Test monitor enable/disable
>   ok 2 Test monitor reactor setting
>   ok 3 Check available monitors
>   ok 4 Test wwnr monitor with printk reactor
>   ok 5 Test tlob ioctl self-instrumentation (within/over-budget, error paths)
>   ok 6 Test tlob monitor tracefs interface (enable/disable and files)

This should be tested together with the other monitors (enable/disable). At
most we could extend those tests with check_requires, though that seems to be
meant for ftracetest's internals.

Let's focus on tlob-only features in this patch.

Thanks,
Gabriele

>   ok 7 uprobe binding: visible in monitor file, removable, duplicate offset rejected
>   ok 8 uprobe detail sleeping: sleeping_ns dominates when task blocks between probes
>   ok 9 uprobe detail waiting: waiting_ns dominates when task is preempted between probes
>   ok 10 Two bindings on same binary with different offsets and budgets fire independently
>   ok 11 Verify no spurious error_env_tlob events without an active uprobe binding
>   ok 12 uprobe violation: error_env_tlob and detail_env_tlob fire with correct fields
>   # Totals: pass:12 fail:0 xfail:0 xpass:0 skip:0 error:0
> 
> Suggested-by: Gabriele Monaco <gmonaco@redhat.com> 
> Signed-off-by: Wen Yang <wen.yang@linux.dev>
> ---
>  tools/testing/selftests/verification/Makefile |  21 +-
>  .../verification/test.d/tlob/ioctl.tc         |  36 +
>  .../verification/test.d/tlob/tracefs.tc       |  17 +
>  .../verification/test.d/tlob/uprobe_bind.tc   |  34 +
>  .../test.d/tlob/uprobe_detail_sleeping.tc     |  47 ++
>  .../test.d/tlob/uprobe_detail_waiting.tc      |  60 ++
>  .../verification/test.d/tlob/uprobe_multi.tc  |  60 ++
>  .../test.d/tlob/uprobe_no_event.tc            |  19 +
>  .../test.d/tlob/uprobe_violation.tc           |  60 ++
>  .../selftests/verification/tlob/Makefile      |  21 +
>  .../selftests/verification/tlob/tlob_ioctl.c  | 626 ++++++++++++++++++
>  .../selftests/verification/tlob/tlob_target.c | 138 ++++
>  12 files changed, 1138 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/ioctl.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/tracefs.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
>  create mode 100644 tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
>  create mode 100644 tools/testing/selftests/verification/tlob/Makefile
>  create mode 100644 tools/testing/selftests/verification/tlob/tlob_ioctl.c
>  create mode 100644 tools/testing/selftests/verification/tlob/tlob_target.c
> 
> diff --git a/tools/testing/selftests/verification/Makefile b/tools/testing/selftests/verification/Makefile
> index aa8790c22a71..b5584fd3762d 100644
> --- a/tools/testing/selftests/verification/Makefile
> +++ b/tools/testing/selftests/verification/Makefile
> @@ -1,8 +1,27 @@
>  # SPDX-License-Identifier: GPL-2.0
> -all:
>  
>  TEST_PROGS := verificationtest-ktap
>  TEST_FILES := test.d settings
>  EXTRA_CLEAN := $(OUTPUT)/logs/*
>  
> +# Subdirectories that provide helper binaries for the test runner.
> +# Each entry must contain a Makefile that accepts OUTDIR= and deposits
> +# its binaries there; verificationtest-ktap adds OUTDIR to PATH so
> +# the ftracetest require-checks resolve the binaries by name.
> +MONITOR_SUBDIRS := tlob
> +
>  include ../lib.mk
> +
> +# Build and clean each monitor subdirectory.
> +all: $(patsubst %,_build_%,$(MONITOR_SUBDIRS))
> +
> +clean: $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +.PHONY: $(patsubst %,_build_%,$(MONITOR_SUBDIRS)) \
> +        $(patsubst %,_clean_%,$(MONITOR_SUBDIRS))
> +
> +$(patsubst %,_build_%,$(MONITOR_SUBDIRS)): _build_%:
> +	$(MAKE) -C $* OUTDIR="$(OUTPUT)" TOOLS_INCLUDES="$(TOOLS_INCLUDES)"
> +
> +$(patsubst %,_clean_%,$(MONITOR_SUBDIRS)): _clean_%:
> +	$(MAKE) -C $* OUTDIR="$(OUTPUT)" clean
> diff --git a/tools/testing/selftests/verification/test.d/tlob/ioctl.tc b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> new file mode 100644
> index 000000000000..54ae249af9a6
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/ioctl.tc
> @@ -0,0 +1,36 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob ioctl self-instrumentation (within/over-budget, error paths)
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +
> +[ -c /dev/rv ] || exit_unsupported
> +
> +echo 1 > monitors/tlob/enable
> +
> +# within budget: 50 ms threshold, 10 ms workload
> +"$TLOB_HELPER" within_budget
> +
> +# over budget in running state: 1 ms threshold, 100 ms busy-spin
> +"$TLOB_HELPER" over_budget_running
> +
> +# over budget in sleeping state: 3 ms threshold, 50 ms sleep
> +"$TLOB_HELPER" over_budget_sleeping
> +
> +# over budget in waiting state: 1 us threshold, sched_yield
> +"$TLOB_HELPER" over_budget_waiting
> +
> +# error paths
> +"$TLOB_HELPER" double_start
> +"$TLOB_HELPER" stop_no_start
> +
> +# per-thread isolation
> +"$TLOB_HELPER" multi_thread
> +
> +# bind against disabled monitor must return ENODEV, not crash
> +echo 0 > monitors/tlob/enable
> +"$TLOB_HELPER" not_enabled
> +echo 1 > monitors/tlob/enable
> +
> +echo 0 > monitors/tlob/enable
> diff --git a/tools/testing/selftests/verification/test.d/tlob/tracefs.tc b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> new file mode 100644
> index 000000000000..5d1e7cc02498
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/tracefs.tc
> @@ -0,0 +1,17 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test tlob monitor tracefs interface (enable/disable and files)
> +# requires: tlob:monitor
> +
> +check_requires monitors/tlob/enable monitors/tlob/desc monitors/tlob/monitor
> +
> +# enable / disable via the enable file
> +echo 1 > monitors/tlob/enable
> +grep -q 1 monitors/tlob/enable
> +echo "tlob" >> enabled_monitors
> +grep -q tlob enabled_monitors
> +
> +echo 0 > monitors/tlob/enable
> +grep -q 0 monitors/tlob/enable
> +echo "!tlob" >> enabled_monitors
> +! grep -q "^tlob$" enabled_monitors
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> new file mode 100644
> index 000000000000..41e20d593855
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_bind.tc
> @@ -0,0 +1,34 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe binding (visible in monitor file, removable, duplicate rejected)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > monitors/tlob/enable
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=5000000" > "$TLOB_MONITOR"
> +
> +# Binding must appear in monitor file with canonical hex-offset format.
> +grep -qE "^p ${UPROBE_TARGET}:0x[0-9a-f]+ 0x[0-9a-f]+ threshold=[0-9]+$" "$TLOB_MONITOR"
> +grep -q "threshold=5000000" "$TLOB_MONITOR"
> +
> +# Duplicate offset_start must be rejected.
> +! echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=9999" > "$TLOB_MONITOR" 2>/dev/null
> +
> +# Remove the binding; it must no longer appear.
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR"
> +! grep -q "^p .*:0x${busy_offset#0x} " "$TLOB_MONITOR"
> +
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > monitors/tlob/enable
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> new file mode 100644
> index 000000000000..2b8656e0fef1
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_sleeping.tc
> @@ -0,0 +1,47 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail sleeping (sleeping_ns dominates when task blocks between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 5000 sleep &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 50 ms budget; task sleeps 200 ms per iteration -> sleeping_ns dominates.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=50000" > "$TLOB_MONITOR"
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$sleeping" -gt "$((running + waiting))" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> new file mode 100644
> index 000000000000..0705854f24df
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_detail_waiting.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe detail waiting (waiting_ns dominates when task is preempted between probes)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +command -v chrt    > /dev/null || exit_unsupported
> +command -v taskset > /dev/null || exit_unsupported
> +
> +start_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_preempt_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_preempt_work_done 2>/dev/null)
> +[ -n "$start_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ]  || exit_unsupported
> +
> +cpu=0
> +
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Register probe before the target starts so the start uprobe fires on the
> +# first entry to tlob_preempt_work. Budget: 500 ms.
> +echo "p ${UPROBE_TARGET}:${start_offset} ${stop_offset} threshold=500000" > "$TLOB_MONITOR"
> +
> +# Target starts; start probe fires on tlob_preempt_work entry.
> +taskset -c "$cpu" "$UPROBE_TARGET" 5000 preempt &
> +busy_pid=$!
> +sleep 0.05
> +
> +# RT hog on the same CPU preempts the target; target stays in waiting state
> +# (runnable, off-CPU) until the budget expires -> waiting_ns dominates.
> +chrt -f 99 taskset -c "$cpu" sh -c 'while true; do :; done' 2>/dev/null &
> +hog_pid=$!
> +
> +found=0; i=0
> +while [ "$i" -lt 30 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${start_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$hog_pid" 2>/dev/null; wait "$hog_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +waiting=$(echo "$line" | sed 's/.*waiting_ns=\([0-9]*\).*/\1/')
> +[ "$waiting" -gt "$((running + sleeping))" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> new file mode 100644
> index 000000000000..c4b8f7108ae9
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_multi.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test two uprobe bindings on same binary (different offsets fire independently)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
> +busy_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
> +sleep_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work 2>/dev/null)
> +sleep_stop=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_sleep_work_done 2>/dev/null)
> +[ -n "$busy_offset" ]  || exit_unsupported
> +[ -n "$busy_stop" ]    || exit_unsupported
> +[ -n "$sleep_offset" ] || exit_unsupported
> +[ -n "$sleep_stop" ]   || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &       # busy mode: tlob_busy_work fires every 200 ms
> +busy_pid=$!
> +"$UPROBE_TARGET" 30000 sleep & # sleep mode: tlob_sleep_work fires every 200 ms
> +sleep_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# Binding A: 5 s budget on the busy probe - must not fire in 200 ms loops.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${busy_stop} threshold=5000000" > "$TLOB_MONITOR"
> +# Binding B: 10 us budget on the sleep probe - fires on first invocation.
> +echo "p ${UPROBE_TARGET}:${sleep_offset} ${sleep_stop} threshold=10" > "$TLOB_MONITOR"
> +
> +# Wait up to 2 s for error_env_tlob from binding B.
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> +	sleep 0.1
> +	grep -q "error_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +echo "-${UPROBE_TARGET}:${sleep_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$sleep_pid" 2>/dev/null; wait "$sleep_pid" 2>/dev/null || true
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +
> +[ "$found" = "1" ]
> +# error_env_tlob payload: label and clock variable must be present.
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "budget_exceeded"
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "clk_elapsed="
> +# detail_env_tlob must appear alongside the error.
> +grep -q "detail_env_tlob" /sys/kernel/tracing/trace
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> new file mode 100644
> index 000000000000..4a74853346e3
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_no_event.tc
> @@ -0,0 +1,19 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test no spurious error_env_tlob events without an active uprobe binding
> +# requires: tlob:monitor tlob_ioctl:program
> +
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +sleep 0.5
> +
> +! grep -q "error_env_tlob" /sys/kernel/tracing/trace
> +
> +echo 0 > monitors/tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> new file mode 100644
> index 000000000000..624fdb950f6b
> --- /dev/null
> +++ b/tools/testing/selftests/verification/test.d/tlob/uprobe_violation.tc
> @@ -0,0 +1,60 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +# description: Test uprobe violation (error_env_tlob and detail_env_tlob fire with correct fields)
> +# requires: tlob:monitor tlob_ioctl:program tlob_target:program
> +
> +TLOB_HELPER=$(command -v tlob_ioctl)
> +UPROBE_TARGET=$(command -v tlob_target)
> +TLOB_MONITOR=monitors/tlob/monitor
> +
> +busy_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work 2>/dev/null)
> +stop_offset=$("$TLOB_HELPER" sym_offset "$UPROBE_TARGET" tlob_busy_work_done 2>/dev/null)
> +[ -n "$busy_offset" ] || exit_unsupported
> +[ -n "$stop_offset" ] || exit_unsupported
> +
> +"$UPROBE_TARGET" 30000 &
> +busy_pid=$!
> +sleep 0.05
> +
> +echo 1 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 1 > /sys/kernel/tracing/tracing_on
> +echo 1 > monitors/tlob/enable
> +echo > /sys/kernel/tracing/trace
> +
> +# 10 us budget - fires almost immediately; task is busy-spinning on-CPU.
> +echo "p ${UPROBE_TARGET}:${busy_offset} ${stop_offset} threshold=10" > "$TLOB_MONITOR"
> +
> +# wait up to 2 s for detail_env_tlob
> +found=0; i=0
> +while [ "$i" -lt 20 ]; do
> +	sleep 0.1
> +	grep -q "detail_env_tlob" /sys/kernel/tracing/trace && { found=1; break; }
> +	i=$((i+1))
> +done
> +
> +echo "-${UPROBE_TARGET}:${busy_offset}" > "$TLOB_MONITOR" 2>/dev/null
> +kill "$busy_pid" 2>/dev/null; wait "$busy_pid" 2>/dev/null || true
> +echo 0 > /sys/kernel/tracing/events/rv/error_env_tlob/enable
> +echo 0 > /sys/kernel/tracing/events/rv/detail_env_tlob/enable
> +echo 0 > monitors/tlob/enable
> +
> +[ "$found" = "1" ]
> +
> +# error_env_tlob event label must be budget_exceeded
> +grep "error_env_tlob" /sys/kernel/tracing/trace | head -n 1 | grep -q "budget_exceeded"
> +
> +# detail_env_tlob must have all five fields with the correct threshold
> +line=$(grep "detail_env_tlob" /sys/kernel/tracing/trace | head -n 1)
> +echo "$line" | grep -q "pid="
> +echo "$line" | grep -q "threshold_us=10"
> +echo "$line" | grep -q "running_ns="
> +echo "$line" | grep -q "waiting_ns="
> +echo "$line" | grep -q "sleeping_ns="
> +
> +# Busy-spin keeps the task on-CPU: running_ns must exceed sleeping_ns.
> +running=$(echo "$line" | sed 's/.*running_ns=\([0-9]*\).*/\1/')
> +sleeping=$(echo "$line" | sed 's/.*sleeping_ns=\([0-9]*\).*/\1/')
> +[ "$running" -gt "$sleeping" ]
> +
> +echo > /sys/kernel/tracing/trace
> diff --git a/tools/testing/selftests/verification/tlob/Makefile b/tools/testing/selftests/verification/tlob/Makefile
> new file mode 100644
> index 000000000000..1bedf946cb34
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/Makefile
> @@ -0,0 +1,21 @@
> +# SPDX-License-Identifier: GPL-2.0
> +# Builds tlob selftest helper binaries.
> +#
> +# Invoked by ../Makefile; pass OUTDIR to control the output directory
> +# and TOOLS_INCLUDES for the in-tree UAPI -isystem flag.
> +
> +OUTDIR ?= $(CURDIR)/..
> +CFLAGS += $(TOOLS_INCLUDES)
> +
> +.PHONY: all
> +all: $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> +
> +$(OUTDIR)/tlob_ioctl: tlob_ioctl.c
> +	$(CC) $(CFLAGS) -o $@ $< -lpthread
> +
> +$(OUTDIR)/tlob_target: tlob_target.c
> +	$(CC) $(CFLAGS) -o $@ $<
> +
> +.PHONY: clean
> +clean:
> +	$(RM) $(OUTDIR)/tlob_ioctl $(OUTDIR)/tlob_target
> diff --git a/tools/testing/selftests/verification/tlob/tlob_ioctl.c b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> new file mode 100644
> index 000000000000..abb4e2e80a2c
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_ioctl.c
> @@ -0,0 +1,626 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_ioctl.c - ioctl test driver and ELF utility for tlob selftests
> + *
> + * Usage: tlob_ioctl <subcommand> [args...]
> + *
> + *   not_enabled          - bind with monitor disabled -> ENODEV
> + *   within_budget        - sleep within budget -> 0
> + *   over_budget_running  - busy-spin past budget -> EOVERFLOW
> + *   over_budget_sleeping - sleep past budget -> EOVERFLOW
> + *   over_budget_waiting  - sched_yield into waiting state -> EOVERFLOW
> + *   double_start         - two starts without stop -> EALREADY
> + *   stop_no_start        - stop without start -> EINVAL
> + *   multi_thread         - two fds: thread A within budget, thread B over
> + *   bench                - TRACE_START/STOP latency (TAP output, always passes)
> + *   sym_offset <binary> <symbol> - print ELF file offset of symbol
> + *
> + * Exit: 0 = pass, 1 = fail, 2 = skip (device not available).
> + */
> +#define _GNU_SOURCE
> +#include <elf.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <pthread.h>
> +#include <sched.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <linux/rv.h>
> +
> +static int rv_fd = -1;
> +
> +static int open_rv(void)
> +{
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +
> +	rv_fd = open("/dev/rv", O_RDWR);
> +	if (rv_fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return -1;
> +	}
> +	if (ioctl(rv_fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> +		fprintf(stderr, "bind tlob: %s\n", strerror(errno));
> +		close(rv_fd);
> +		rv_fd = -1;
> +		return -1;
> +	}
> +	return 0;
> +}
> +
> +static void busy_spin_us(unsigned long us)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < us * 1000UL);
> +}
> +
> +static int trace_start(uint64_t threshold_us)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = threshold_us,
> +	};
> +
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +}
> +
> +static int trace_stop(void)
> +{
> +	return ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +}
> +
> +/* Synchronous TRACE_START / TRACE_STOP tests */
> +
> +/* Bind to a disabled monitor must return ENODEV without crashing */
> +static int test_not_enabled(void)
> +{
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +	int fd;
> +	int ret;
> +
> +	fd = open("/dev/rv", O_RDWR);
> +	if (fd < 0) {
> +		fprintf(stderr, "open /dev/rv: %s\n", strerror(errno));
> +		return 2; /* skip */
> +	}
> +
> +	ret = ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind);
> +	close(fd);
> +
> +	if (ret == 0) {
> +		fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got success\n");
> +		return 1;
> +	}
> +	if (errno != ENODEV) {
> +		fprintf(stderr, "RV_IOCTL_BIND_MONITOR: expected ENODEV, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_within_budget(void)
> +{
> +	int ret;
> +
> +	/* 50 ms budget */
> +	if (trace_start(50000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(10000); /* 10 ms */
> +	ret = trace_stop();
> +	if (ret != 0) {
> +		fprintf(stderr, "TRACE_STOP: expected 0, got %d errno=%s\n",
> +			ret, strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_running(void)
> +{
> +	int ret;
> +
> +	/* 1 ms budget */
> +	if (trace_start(1000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	busy_spin_us(100000); /* 100 ms */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_sleeping(void)
> +{
> +	int ret;
> +
> +	/* 3 ms budget */
> +	if (trace_start(3000) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	usleep(50000); /* 50 ms; sleeping time counts toward budget */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int test_over_budget_waiting(void)
> +{
> +	int ret;
> +
> +	/* 1 us budget */
> +	if (trace_start(1) < 0) {
> +		fprintf(stderr, "TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	sched_yield(); /* running -> waiting -> running */
> +	busy_spin_us(10); /* 10 us >> 1 us budget; hrtimer fires during spin */
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EOVERFLOW) {
> +		fprintf(stderr, "TRACE_STOP: expected EOVERFLOW, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Error-handling tests */
> +
> +static int test_double_start(void)
> +{
> +	int ret;
> +
> +	/* 10 s: large enough the hrtimer won't fire during the test */
> +	if (trace_start(10000000ULL) < 0) {
> +		fprintf(stderr, "first TRACE_START: %s\n", strerror(errno));
> +		return 1;
> +	}
> +	ret = trace_start(10000000);
> +	if (ret == 0) {
> +		fprintf(stderr, "second TRACE_START: expected EALREADY, got 0\n");
> +		trace_stop();
> +		return 1;
> +	}
> +	if (errno != EALREADY) {
> +		fprintf(stderr, "second TRACE_START: expected EALREADY, got %s\n",
> +			strerror(errno));
> +		trace_stop();
> +		return 1;
> +	}
> +	trace_stop();
> +	return 0;
> +}
> +
> +static int test_stop_no_start(void)
> +{
> +	int ret;
> +
> +	/* Ensure clean state: ignore error from a stale entry */
> +	trace_stop();
> +
> +	ret = trace_stop();
> +	if (ret == 0) {
> +		fprintf(stderr, "TRACE_STOP: expected EINVAL, got 0\n");
> +		return 1;
> +	}
> +	if (errno != EINVAL) {
> +		fprintf(stderr, "TRACE_STOP: expected EINVAL, got %s\n",
> +			strerror(errno));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Two threads, each with its own fd: A within budget, B over budget. */
> +
> +struct mt_thread_args {
> +	uint64_t      threshold_us;
> +	unsigned long workload_us;
> +	int           busy;
> +	int           expect_eoverflow;
> +	int           result;
> +};
> +
> +static void *mt_thread_fn(void *arg)
> +{
> +	struct mt_thread_args *a = arg;
> +	struct tlob_start_args args = { .threshold_us = a->threshold_us };
> +	struct rv_bind_args bind = { .monitor_name = "tlob" };
> +	int fd;
> +	int ret;
> +
> +	fd = open("/dev/rv", O_RDWR);
> +	if (fd < 0) {
> +		fprintf(stderr, "thread open /dev/rv: %s\n", strerror(errno));
> +		a->result = 1;
> +		return NULL;
> +	}
> +	if (ioctl(fd, RV_IOCTL_BIND_MONITOR, &bind) < 0) {
> +		fprintf(stderr, "thread bind tlob: %s\n", strerror(errno));
> +		close(fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	ret = ioctl(fd, TLOB_IOCTL_TRACE_START, &args);
> +	if (ret < 0) {
> +		fprintf(stderr, "thread TRACE_START: %s\n", strerror(errno));
> +		close(fd);
> +		a->result = 1;
> +		return NULL;
> +	}
> +
> +	if (a->busy)
> +		busy_spin_us(a->workload_us);
> +	else
> +		usleep(a->workload_us);
> +
> +	ret = ioctl(fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	if (a->expect_eoverflow) {
> +		if (ret == 0 || errno != EOVERFLOW) {
> +			fprintf(stderr, "thread: expected EOVERFLOW, got ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			close(fd);
> +			a->result = 1;
> +			return NULL;
> +		}
> +	} else {
> +		if (ret != 0) {
> +			fprintf(stderr, "thread: expected 0, got ret=%d errno=%s\n",
> +				ret, strerror(errno));
> +			close(fd);
> +			a->result = 1;
> +			return NULL;
> +		}
> +	}
> +	close(fd);
> +	a->result = 0;
> +	return NULL;
> +}
> +
> +static int test_multi_thread(void)
> +{
> +	pthread_t ta, tb;
> +	struct mt_thread_args a = {
> +		.threshold_us     = 20000,   /* 20 ms */
> +		.workload_us      = 5000,    /* 5 ms sleep -> within budget */
> +		.busy             = 0,
> +		.expect_eoverflow = 0,
> +	};
> +	struct mt_thread_args b = {
> +		.threshold_us     = 3000,    /* 3 ms */
> +		.workload_us      = 30000,   /* 30 ms spin -> over budget */
> +		.busy             = 1,
> +		.expect_eoverflow = 1,
> +	};
> +
> +	pthread_create(&ta, NULL, mt_thread_fn, &a);
> +	pthread_create(&tb, NULL, mt_thread_fn, &b);
> +	pthread_join(ta, NULL);
> +	pthread_join(tb, NULL);
> +
> +	return (a.result || b.result) ? 1 : 0;
> +}
> +
> +/*
> + * Benchmark TRACE_START, TRACE_STOP, and round-trip ioctls.
> + * Output uses TAP '#' prefix; always returns 0.
> + */
> +#define BENCH_WARMUP  32
> +#define BENCH_N      1000
> +
> +static long long timespec_diff_ns(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
> +		+ (b->tv_nsec - a->tv_nsec);
> +}
> +
> +static int test_bench(void)
> +{
> +	struct tlob_start_args args = {
> +		.threshold_us = 10000000ULL, /* 10 s */
> +	};
> +	struct timespec t0, t1;
> +	long long total_start_ns = 0, total_stop_ns = 0, total_rt_ns = 0;
> +	int i;
> +
> +	/* warm up */
> +	for (i = 0; i < BENCH_WARMUP; i++) {
> +		if (ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args) == 0)
> +			ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +
> +	/* start only */
> +	for (i = 0; i < BENCH_N; i++) {
> +		clock_gettime(CLOCK_MONOTONIC, &t0);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		clock_gettime(CLOCK_MONOTONIC, &t1);
> +		total_start_ns += timespec_diff_ns(&t0, &t1);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +
> +	/* stop only */
> +	for (i = 0; i < BENCH_N; i++) {
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		clock_gettime(CLOCK_MONOTONIC, &t0);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +		clock_gettime(CLOCK_MONOTONIC, &t1);
> +		total_stop_ns += timespec_diff_ns(&t0, &t1);
> +	}
> +
> +	/* round-trip */
> +	clock_gettime(CLOCK_MONOTONIC, &t0);
> +	for (i = 0; i < BENCH_N; i++) {
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_START, &args);
> +		ioctl(rv_fd, TLOB_IOCTL_TRACE_STOP, NULL);
> +	}
> +	clock_gettime(CLOCK_MONOTONIC, &t1);
> +	total_rt_ns = timespec_diff_ns(&t0, &t1);
> +
> +	printf("# start ioctl only:      %lld ns/iter (N=%d, includes syscall)\n",
> +	       total_start_ns / BENCH_N, BENCH_N);
> +	printf("# stop ioctl only:       %lld ns/iter (N=%d, includes syscall)\n",
> +	       total_stop_ns / BENCH_N, BENCH_N);
> +	printf("# start+stop roundtrip:  %lld ns/iter (N=%d, includes 2 syscalls)\n",
> +	       total_rt_ns / BENCH_N, BENCH_N);
> +	return 0;
> +}
> +
> +/*
> + * Print the ELF file offset of <symname> in <binary>.  Walks .symtab
> + * (falling back to .dynsym) and converts vaddr to file offset via PT_LOAD.
> + * Supports 32- and 64-bit ELF.
> + */
> +static int sym_offset(const char *binary, const char *symname)
> +{
> +	int fd;
> +	struct stat st;
> +	void *map;
> +	Elf64_Ehdr *ehdr;
> +	Elf32_Ehdr *ehdr32;
> +	int is64;
> +	uint64_t sym_vaddr = 0;
> +	int found = 0;
> +	uint64_t file_offset = 0;
> +
> +	fd = open(binary, O_RDONLY);
> +	if (fd < 0) {
> +		fprintf(stderr, "open %s: %s\n", binary, strerror(errno));
> +		return 1;
> +	}
> +	if (fstat(fd, &st) < 0) {
> +		close(fd);
> +		return 1;
> +	}
> +	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> +	close(fd);
> +	if (map == MAP_FAILED) {
> +		fprintf(stderr, "mmap: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	ehdr = (Elf64_Ehdr *)map;
> +	ehdr32 = (Elf32_Ehdr *)map;
> +	if (st.st_size < 4 ||
> +	    ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
> +	    ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
> +	    ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
> +	    ehdr->e_ident[EI_MAG3] != ELFMAG3) {
> +		fprintf(stderr, "%s: not an ELF file\n", binary);
> +		munmap(map, (size_t)st.st_size);
> +		return 1;
> +	}
> +	is64 = (ehdr->e_ident[EI_CLASS] == ELFCLASS64);
> +
> +	if (is64) {
> +		Elf64_Shdr *shdrs = (Elf64_Shdr *)((char *)map + ehdr->e_shoff);
> +		Elf64_Shdr *shstrtab_hdr = &shdrs[ehdr->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +
> +		/* prefer .symtab; fall back to .dynsym */
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr->e_shnum && !found; si++) {
> +				Elf64_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf64_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh->sh_offset;
> +				Elf64_Sym *syms = (Elf64_Sym *)((char *)map + sh->sh_offset);
> +				uint64_t nsyms = sh->sh_size / sizeof(Elf64_Sym);
> +				uint64_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> +						sym_vaddr = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		/* Convert vaddr to file offset via PT_LOAD segments */
> +		Elf64_Phdr *phdrs = (Elf64_Phdr *)((char *)map + ehdr->e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr->e_phnum; pi++) {
> +			Elf64_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr >= ph->p_vaddr &&
> +			    sym_vaddr < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr - ph->p_vaddr + ph->p_offset;
> +				break;
> +			}
> +		}
> +	} else {
> +		/* 32-bit ELF */
> +		Elf32_Shdr *shdrs = (Elf32_Shdr *)((char *)map + ehdr32->e_shoff);
> +		Elf32_Shdr *shstrtab_hdr = &shdrs[ehdr32->e_shstrndx];
> +		const char *shstrtab = (char *)map + shstrtab_hdr->sh_offset;
> +		int si;
> +		uint32_t sym_vaddr32 = 0;
> +
> +		for (int pass = 0; pass < 2 && !found; pass++) {
> +			const char *target = pass ? ".dynsym" : ".symtab";
> +
> +			for (si = 0; si < ehdr32->e_shnum && !found; si++) {
> +				Elf32_Shdr *sh = &shdrs[si];
> +				const char *name = shstrtab + sh->sh_name;
> +
> +				if (strcmp(name, target) != 0)
> +					continue;
> +
> +				Elf32_Shdr *strtab_sh = &shdrs[sh->sh_link];
> +				const char *strtab = (char *)map + strtab_sh->sh_offset;
> +				Elf32_Sym *syms = (Elf32_Sym *)((char *)map + sh->sh_offset);
> +				uint32_t nsyms = sh->sh_size / sizeof(Elf32_Sym);
> +				uint32_t j;
> +
> +				for (j = 0; j < nsyms; j++) {
> +					if (strcmp(strtab + syms[j].st_name, symname) == 0) {
> +						sym_vaddr32 = syms[j].st_value;
> +						found = 1;
> +						break;
> +					}
> +				}
> +			}
> +		}
> +
> +		if (!found) {
> +			fprintf(stderr, "symbol '%s' not found in %s\n", symname, binary);
> +			munmap(map, (size_t)st.st_size);
> +			return 1;
> +		}
> +
> +		Elf32_Phdr *phdrs = (Elf32_Phdr *)((char *)map + ehdr32->e_phoff);
> +		int pi;
> +
> +		for (pi = 0; pi < ehdr32->e_phnum; pi++) {
> +			Elf32_Phdr *ph = &phdrs[pi];
> +
> +			if (ph->p_type != PT_LOAD)
> +				continue;
> +			if (sym_vaddr32 >= ph->p_vaddr &&
> +			    sym_vaddr32 < ph->p_vaddr + ph->p_filesz) {
> +				file_offset = sym_vaddr32 - ph->p_vaddr + ph->p_offset;
> +				break;
> +			}
> +		}
> +		sym_vaddr = sym_vaddr32;
> +	}
> +
> +	munmap(map, (size_t)st.st_size);
> +
> +	if (!file_offset && sym_vaddr) {
> +		fprintf(stderr, "could not map vaddr 0x%lx to file offset\n",
> +			(unsigned long)sym_vaddr);
> +		return 1;
> +	}
> +
> +	printf("0x%lx\n", (unsigned long)file_offset);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int rc;
> +
> +	if (argc < 2) {
> +		fprintf(stderr, "Usage: %s <subcommand> [args...]\n", argv[0]);
> +		return 1;
> +	}
> +
> +	/* sym_offset does not need /dev/rv */
> +	if (strcmp(argv[1], "sym_offset") == 0) {
> +		if (argc < 4) {
> +			fprintf(stderr, "Usage: %s sym_offset <binary> <symbol>\n",
> +				argv[0]);
> +			return 1;
> +		}
> +		return sym_offset(argv[2], argv[3]);
> +	}
> +
> +	/* not_enabled: monitor is disabled; bind must return ENODEV without open_rv() */
> +	if (strcmp(argv[1], "not_enabled") == 0)
> +		return test_not_enabled();
> +
> +	if (open_rv() < 0)
> +		return 2; /* skip */
> +
> +	if (strcmp(argv[1], "bench") == 0)
> +		rc = test_bench();
> +	else if (strcmp(argv[1], "within_budget") == 0)
> +		rc = test_within_budget();
> +	else if (strcmp(argv[1], "over_budget_running") == 0)
> +		rc = test_over_budget_running();
> +	else if (strcmp(argv[1], "over_budget_sleeping") == 0)
> +		rc = test_over_budget_sleeping();
> +	else if (strcmp(argv[1], "over_budget_waiting") == 0)
> +		rc = test_over_budget_waiting();
> +	else if (strcmp(argv[1], "double_start") == 0)
> +		rc = test_double_start();
> +	else if (strcmp(argv[1], "stop_no_start") == 0)
> +		rc = test_stop_no_start();
> +	else if (strcmp(argv[1], "multi_thread") == 0)
> +		rc = test_multi_thread();
> +	else {
> +		fprintf(stderr, "Unknown test: %s\n", argv[1]);
> +		rc = 1;
> +	}
> +
> +	close(rv_fd);
> +	return rc;
> +}
> diff --git a/tools/testing/selftests/verification/tlob/tlob_target.c b/tools/testing/selftests/verification/tlob/tlob_target.c
> new file mode 100644
> index 000000000000..0fdbc575d71d
> --- /dev/null
> +++ b/tools/testing/selftests/verification/tlob/tlob_target.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * tlob_target.c - uprobe target binary for tlob selftests.
> + *
> + * Provides three start/stop probe pairs, each designed to exercise a
> + * different dominant component of the detail_env_tlob ns breakdown:
> + *
> + *   tlob_busy_work    / tlob_busy_work_done    - busy-spin: running_ns dominates
> + *   tlob_sleep_work   / tlob_sleep_work_done   - nanosleep: sleeping_ns dominates
> + *   tlob_preempt_work / tlob_preempt_work_done - busy-spin: waiting_ns dominates
> + *                                                (needs an RT competitor on the same CPU)
> + *
> + * Usage: tlob_target <duration_ms> [mode]
> + *
> + * mode is one of: busy (default), sleep, preempt.
> + * Loops in 200 ms iterations until <duration_ms> has elapsed
> + * (0 = run for ~24 hours).
> + */
> +#define _GNU_SOURCE
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +
> +#ifndef noinline
> +#define noinline __attribute__((noinline))
> +#endif
> +
> +static inline int timespec_before(const struct timespec *a,
> +				   const struct timespec *b)
> +{
> +	return a->tv_sec < b->tv_sec ||
> +	       (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec);
> +}
> +
> +static void timespec_add_ms(struct timespec *ts, unsigned long ms)
> +{
> +	ts->tv_sec  += ms / 1000;
> +	ts->tv_nsec += (long)(ms % 1000) * 1000000L;
> +	if (ts->tv_nsec >= 1000000000L) {
> +		ts->tv_sec++;
> +		ts->tv_nsec -= 1000000000L;
> +	}
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_busy_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; busy-spin so running_ns dominates */
> +noinline void tlob_busy_work(unsigned long duration_ns)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ns);
> +
> +	tlob_busy_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_sleep_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/* start probe; nanosleep so sleeping_ns dominates */
> +noinline void tlob_sleep_work(unsigned long duration_ms)
> +{
> +	struct timespec ts = {
> +		.tv_sec  = duration_ms / 1000,
> +		.tv_nsec = (long)(duration_ms % 1000) * 1000000L,
> +	};
> +	nanosleep(&ts, NULL);
> +	tlob_sleep_work_done();
> +}
> +
> +/* stop probe; noinline keeps the entry point visible to uprobes */
> +noinline void tlob_preempt_work_done(void)
> +{
> +	/* empty: uprobe fires on entry */
> +}
> +
> +/*
> + * start probe; busy-spin so an RT competitor on the same CPU drives
> + * waiting_ns (prev_state==0 -> preempt event, task stays runnable off-CPU).
> + */
> +noinline void tlob_preempt_work(unsigned long duration_ms)
> +{
> +	struct timespec start, now;
> +	unsigned long elapsed;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +	do {
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed = (unsigned long)(now.tv_sec - start.tv_sec)
> +			  * 1000000000UL
> +			+ (unsigned long)(now.tv_nsec - start.tv_nsec);
> +	} while (elapsed < duration_ms * 1000000UL);
> +
> +	tlob_preempt_work_done();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	unsigned long duration_ms = 0;
> +	const char *mode = "busy";
> +	struct timespec deadline, now;
> +
> +	if (argc >= 2)
> +		duration_ms = strtoul(argv[1], NULL, 10);
> +	if (argc >= 3)
> +		mode = argv[2];
> +
> +	clock_gettime(CLOCK_MONOTONIC, &deadline);
> +	timespec_add_ms(&deadline, duration_ms ? duration_ms : 86400000UL);
> +
> +	do {
> +		if (strcmp(mode, "sleep") == 0)
> +			tlob_sleep_work(200);
> +		else if (strcmp(mode, "preempt") == 0)
> +			tlob_preempt_work(200);
> +		else
> +			tlob_busy_work(200 * 1000000UL);
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +	} while (timespec_before(&now, &deadline));
> +
> +	return 0;
> +}



* [PATCH v3 01/11] io_uring: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, Steven Rostedt, linux-trace-kernel, Vineeth Pillai,
	Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 io_uring/io_uring.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e612a66ee80e..1b657b714373 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -312,7 +312,7 @@ static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 	}
 
 	if (trace_io_uring_complete_enabled())
-		trace_io_uring_complete(req->ctx, req, cqe);
+		trace_call__io_uring_complete(req->ctx, req, cqe);
 	return true;
 }
 
-- 
2.54.0



* [PATCH v3 02/11] net: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	Marcelo Ricardo Leitner, Xin Long, Jon Maloy
  Cc: netdev, bpf, dev, linux-sctp, tipc-discussion, Steven Rostedt,
	linux-trace-kernel, Vineeth Pillai, Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 net/core/dev.c             | 2 +-
 net/core/xdp.c             | 2 +-
 net/openvswitch/actions.c  | 2 +-
 net/openvswitch/datapath.c | 2 +-
 net/sctp/outqueue.c        | 2 +-
 net/tipc/node.c            | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8bfa8313ef62..12a583ce4d95 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6482,7 +6482,7 @@ void netif_receive_skb_list(struct list_head *head)
 		return;
 	if (trace_netif_receive_skb_list_entry_enabled()) {
 		list_for_each_entry(skb, head, list)
-			trace_netif_receive_skb_list_entry(skb);
+			trace_call__netif_receive_skb_list_entry(skb);
 	}
 	netif_receive_skb_list_internal(head);
 	trace_netif_receive_skb_list_exit(0);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9890a30584ba..3003e5c57419 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -362,7 +362,7 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 		xsk_pool_set_rxq_info(allocator, xdp_rxq);
 
 	if (trace_mem_connect_enabled() && xdp_alloc)
-		trace_mem_connect(xdp_alloc, xdp_rxq);
+		trace_call__mem_connect(xdp_alloc, xdp_rxq);
 	return 0;
 }
 
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 140388a18ae0..7b7c93c3bde4 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -1260,7 +1260,7 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 		int err = 0;
 
 		if (trace_ovs_do_execute_action_enabled())
-			trace_ovs_do_execute_action(dp, skb, key, a, rem);
+			trace_call__ovs_do_execute_action(dp, skb, key, a, rem);
 
 		/* Actions that rightfully have to consume the skb should do it
 		 * and return directly.
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index bbbde50fc649..f2b6688f18d6 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -335,7 +335,7 @@ int ovs_dp_upcall(struct datapath *dp, struct sk_buff *skb,
 	int err;
 
 	if (trace_ovs_dp_upcall_enabled())
-		trace_ovs_dp_upcall(dp, skb, key, upcall_info);
+		trace_call__ovs_dp_upcall(dp, skb, key, upcall_info);
 
 	if (upcall_info->portid == 0) {
 		err = -ENOTCONN;
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index f6b8c13dafa4..4025d863ffc8 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1267,7 +1267,7 @@ int sctp_outq_sack(struct sctp_outq *q, struct sctp_chunk *chunk)
 	/* SCTP path tracepoint for congestion control debugging. */
 	if (trace_sctp_probe_path_enabled()) {
 		list_for_each_entry(transport, transport_list, transports)
-			trace_sctp_probe_path(transport, asoc);
+			trace_call__sctp_probe_path(transport, asoc);
 	}
 
 	sack_ctsn = ntohl(sack->cum_tsn_ack);
diff --git a/net/tipc/node.c b/net/tipc/node.c
index 97aa970a0d83..6cfe4c40c82b 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1943,7 +1943,7 @@ static bool tipc_node_check_state(struct tipc_node *n, struct sk_buff *skb,
 
 	if (trace_tipc_node_check_state_enabled()) {
 		trace_tipc_skb_dump(skb, false, "skb for node state check");
-		trace_tipc_node_check_state(n, true, " ");
+		trace_call__tipc_node_check_state(n, true, " ");
 	}
 	l = n->links[bearer_id].link;
 	if (!l)
-- 
2.54.0



* [PATCH v3 03/11] accel/habanalabs: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Koby Elbaz, Konstantin Sinyuk, Oded Gabbay
  Cc: dri-devel, Steven Rostedt, linux-trace-kernel, Vineeth Pillai,
	Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.
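
The effect described above can be modelled in userspace. This is only a
sketch under stated assumptions: foo_key, key_checks, foo_hits and the
guarded_old()/guarded_new() helpers are hypothetical stand-ins, and a plain
bool replaces the kernel's static key. It is not the kernel's tracepoint
implementation; it only counts how often the "enabled" test runs at a
guarded call site before and after the conversion.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel macros. */
static bool foo_key = true;      /* models the tracepoint's static key */
static unsigned int key_checks;  /* number of times the key was tested */
static unsigned int foo_hits;    /* number of times the callbacks ran  */

/* Models trace_foo_enabled(): test the key and report the result. */
static bool trace_foo_enabled(void)
{
	key_checks++;
	return foo_key;
}

/* Models trace_call__foo(): run the callbacks without testing the key. */
static void trace_call__foo(int value)
{
	(void)value;
	foo_hits++;
}

/* Models trace_foo(): test the key, then run the callbacks. */
static void trace_foo(int value)
{
	if (trace_foo_enabled())
		trace_call__foo(value);
}

/* Call site before the conversion: the key is tested twice. */
static unsigned int guarded_old(int value)
{
	key_checks = 0;
	if (trace_foo_enabled())
		trace_foo(value);
	return key_checks;
}

/* Call site after the conversion: the key is tested once. */
static unsigned int guarded_new(int value)
{
	key_checks = 0;
	if (trace_foo_enabled())
		trace_call__foo(value);
	return key_checks;
}
```

With the model key enabled, guarded_old() reports two key tests and
guarded_new() reports one, while both run the callbacks exactly once. In
the kernel the test is a patched static branch rather than a memory load,
so the saving per call is small; the model only shows which test becomes
redundant.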

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 drivers/accel/habanalabs/common/device.c  | 9 +++++----
 drivers/accel/habanalabs/common/mmu/mmu.c | 3 ++-
 drivers/accel/habanalabs/common/pci/pci.c | 6 ++++--
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 09b27bac3a31..68c6f973d53f 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -132,8 +132,9 @@ static void *hl_dma_alloc_common(struct hl_device *hdev, size_t size, dma_addr_t
 	}
 
 	if (trace_habanalabs_dma_alloc_enabled() && !ZERO_OR_NULL_PTR(ptr))
-		trace_habanalabs_dma_alloc(&(hdev)->pdev->dev, (u64) (uintptr_t) ptr, *dma_handle,
-						size, caller);
+		trace_call__habanalabs_dma_alloc(&(hdev)->pdev->dev,
+						 (u64) (uintptr_t) ptr,
+						 *dma_handle, size, caller);
 
 	return ptr;
 }
@@ -2656,7 +2657,7 @@ inline u32 hl_rreg(struct hl_device *hdev, u32 reg)
 	u32 val = readl(hdev->rmmio + reg);
 
 	if (unlikely(trace_habanalabs_rreg32_enabled()))
-		trace_habanalabs_rreg32(&(hdev)->pdev->dev, reg, val);
+		trace_call__habanalabs_rreg32(&(hdev)->pdev->dev, reg, val);
 
 	return val;
 }
@@ -2674,7 +2675,7 @@ inline u32 hl_rreg(struct hl_device *hdev, u32 reg)
 inline void hl_wreg(struct hl_device *hdev, u32 reg, u32 val)
 {
 	if (unlikely(trace_habanalabs_wreg32_enabled()))
-		trace_habanalabs_wreg32(&(hdev)->pdev->dev, reg, val);
+		trace_call__habanalabs_wreg32(&(hdev)->pdev->dev, reg, val);
 
 	writel(val, hdev->rmmio + reg);
 }
diff --git a/drivers/accel/habanalabs/common/mmu/mmu.c b/drivers/accel/habanalabs/common/mmu/mmu.c
index 6c7c4ff8a8a9..dd8b6fb3aa1f 100644
--- a/drivers/accel/habanalabs/common/mmu/mmu.c
+++ b/drivers/accel/habanalabs/common/mmu/mmu.c
@@ -263,7 +263,8 @@ int hl_mmu_unmap_page(struct hl_ctx *ctx, u64 virt_addr, u32 page_size, bool flu
 		mmu_funcs->flush(ctx);
 
 	if (trace_habanalabs_mmu_unmap_enabled() && !rc)
-		trace_habanalabs_mmu_unmap(&hdev->pdev->dev, virt_addr, 0, page_size, flush_pte);
+		trace_call__habanalabs_mmu_unmap(&hdev->pdev->dev, virt_addr, 0,
+						 page_size, flush_pte);
 
 	return rc;
 }
diff --git a/drivers/accel/habanalabs/common/pci/pci.c b/drivers/accel/habanalabs/common/pci/pci.c
index 81cbd8697d4c..12663e8e12e0 100644
--- a/drivers/accel/habanalabs/common/pci/pci.c
+++ b/drivers/accel/habanalabs/common/pci/pci.c
@@ -123,7 +123,8 @@ int hl_pci_elbi_read(struct hl_device *hdev, u64 addr, u32 *data)
 		pci_read_config_dword(pdev, mmPCI_CONFIG_ELBI_DATA, data);
 
 		if (unlikely(trace_habanalabs_elbi_read_enabled()))
-			trace_habanalabs_elbi_read(&hdev->pdev->dev, (u32) addr, val);
+			trace_call__habanalabs_elbi_read(&hdev->pdev->dev,
+							 (u32) addr, val);
 
 		return 0;
 	}
@@ -186,7 +187,8 @@ static int hl_pci_elbi_write(struct hl_device *hdev, u64 addr, u32 data)
 
 	if ((val & PCI_CONFIG_ELBI_STS_MASK) == PCI_CONFIG_ELBI_STS_DONE) {
 		if (unlikely(trace_habanalabs_elbi_write_enabled()))
-			trace_habanalabs_elbi_write(&hdev->pdev->dev, (u32) addr, val);
+			trace_call__habanalabs_elbi_write(&hdev->pdev->dev,
+							  (u32) addr, val);
 		return 0;
 	}
 
-- 
2.54.0



* [PATCH v3 04/11] devfreq: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: MyungJoo Ham, Kyungmin Park, Chanwoo Choi
  Cc: linux-pm, Steven Rostedt, linux-trace-kernel, Vineeth Pillai,
	Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 drivers/devfreq/devfreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 82dd9a43dc62..9f71d9dc4a70 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -370,7 +370,7 @@ static int devfreq_set_target(struct devfreq *devfreq, unsigned long new_freq,
 	 * change order of between devfreq device and passive devfreq device.
 	 */
 	if (trace_devfreq_frequency_enabled() && new_freq != cur_freq)
-		trace_devfreq_frequency(devfreq, new_freq, cur_freq);
+		trace_call__devfreq_frequency(devfreq, new_freq, cur_freq);
 
 	freqs.new = new_freq;
 	devfreq_notify_transition(devfreq, &freqs, DEVFREQ_POSTCHANGE);
-- 
2.54.0



* [PATCH v3 05/11] dma-buf: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Sumit Semwal, Christian König
  Cc: linux-media, dri-devel, linaro-mm-sig, Steven Rostedt,
	linux-trace-kernel, Vineeth Pillai, Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 drivers/dma-buf/dma-fence.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index a2aa82f4eedd..a41cdd9c9343 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -553,7 +553,7 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	}
 	if (trace_dma_fence_wait_end_enabled()) {
 		rcu_read_lock();
-		trace_dma_fence_wait_end(fence);
+		trace_call__dma_fence_wait_end(fence);
 		rcu_read_unlock();
 	}
 	return ret;
-- 
2.54.0



* [PATCH v3 06/11] drm: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Harry Wentland, Leo Li, Matthew Brost, Danilo Krummrich,
	Philipp Stanner, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann
  Cc: amd-gfx, dri-devel, Steven Rostedt, linux-trace-kernel,
	Vineeth Pillai, Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.
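
Several call sites in this series place the guard around a loop. A rough
userspace model of that shape follows, assuming hypothetical names
(mapping_key, key_tests, events, trace_all_old(), trace_all_new()) and a
plain bool in place of the static key. It is not the kernel's tracepoint
code; it only counts key tests for a guarded loop over n items.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel macros. */
static bool mapping_key = true;  /* models the tracepoint's static key */
static unsigned int key_tests;   /* number of times the key was tested */
static unsigned int events;      /* number of times the callbacks ran  */

/* Models trace_mapping_enabled(): test the key. */
static bool trace_mapping_enabled(void)
{
	key_tests++;
	return mapping_key;
}

/* Models trace_call__mapping(): run the callbacks, no key test. */
static void trace_call__mapping(int item)
{
	(void)item;
	events++;
}

/* Models trace_mapping(): test the key, then run the callbacks. */
static void trace_mapping(int item)
{
	if (trace_mapping_enabled())
		trace_call__mapping(item);
}

/* Guarded loop before the conversion: one test for the guard, one per item. */
static unsigned int trace_all_old(const int *items, size_t n)
{
	key_tests = 0;
	if (trace_mapping_enabled()) {
		for (size_t i = 0; i < n; i++)
			trace_mapping(items[i]);
	}
	return key_tests;
}

/* Guarded loop after the conversion: only the guard tests the key. */
static unsigned int trace_all_new(const int *items, size_t n)
{
	key_tests = 0;
	if (trace_mapping_enabled()) {
		for (size_t i = 0; i < n; i++)
			trace_call__mapping(items[i]);
	}
	return key_tests;
}
```

In this model a loop over n items performs n + 1 key tests before the
conversion and a single test afterwards, with the same number of events
emitted in both cases.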

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c            |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c            |  4 ++--
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 +++++-----
 drivers/gpu/drm/scheduler/sched_entity.c          |  5 +++--
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index b24d5d21be5f..cb0b5cb07d57 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1004,7 +1004,7 @@ static void trace_amdgpu_cs_ibs(struct amdgpu_cs_parser *p)
 		struct amdgpu_job *job = p->jobs[i];
 
 		for (j = 0; j < job->num_ibs; ++j)
-			trace_amdgpu_cs(p, job, &job->ibs[j]);
+			trace_call__amdgpu_cs(p, job, &job->ibs[j]);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9ba9de16a27a..a36ae94c425f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1415,7 +1415,7 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va,
 
 	if (trace_amdgpu_vm_bo_mapping_enabled()) {
 		list_for_each_entry(mapping, &bo_va->valids, list)
-			trace_amdgpu_vm_bo_mapping(mapping);
+			trace_call__amdgpu_vm_bo_mapping(mapping);
 	}
 
 error_free:
@@ -2183,7 +2183,7 @@ void amdgpu_vm_bo_trace_cs(struct amdgpu_vm *vm, struct ww_acquire_ctx *ticket)
 				continue;
 		}
 
-		trace_amdgpu_vm_bo_cs(mapping);
+		trace_call__amdgpu_vm_bo_cs(mapping);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 5fc5d5608506..fbdc12cdd6bb 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5263,11 +5263,11 @@ static void amdgpu_dm_backlight_set_level(struct amdgpu_display_manager *dm,
 	}
 
 	if (trace_amdgpu_dm_brightness_enabled()) {
-		trace_amdgpu_dm_brightness(__builtin_return_address(0),
-					   user_brightness,
-					   brightness,
-					   caps->aux_support,
-					   power_supply_is_system_supplied() > 0);
+		trace_call__amdgpu_dm_brightness(__builtin_return_address(0),
+						 user_brightness,
+						 brightness,
+						 caps->aux_support,
+						 power_supply_is_system_supplied() > 0);
 	}
 
 	if (caps->aux_support) {
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index fe174a4857be..185a2636b599 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -429,7 +429,8 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity,
 
 	if (trace_drm_sched_job_unschedulable_enabled() &&
 	    !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &entity->dependency->flags))
-		trace_drm_sched_job_unschedulable(sched_job, entity->dependency);
+		trace_call__drm_sched_job_unschedulable(sched_job,
+							entity->dependency);
 
 	if (!dma_fence_add_callback(entity->dependency, &entity->cb,
 				    drm_sched_entity_wakeup))
@@ -586,7 +587,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		unsigned long index;
 
 		xa_for_each(&sched_job->dependencies, index, entry)
-			trace_drm_sched_job_add_dep(sched_job, entry);
+			trace_call__drm_sched_job_add_dep(sched_job, entry);
 	}
 	atomic_inc(entity->rq->sched->score);
 	WRITE_ONCE(entity->last_user, current->group_leader);
-- 
2.54.0



* [PATCH v3 07/11] HID: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Srinivas Pandruvada, Jiri Kosina, Benjamin Tissoires
  Cc: linux-input, Steven Rostedt, linux-trace-kernel, Vineeth Pillai,
	Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 drivers/hid/intel-ish-hid/ipc/pci-ish.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hid/intel-ish-hid/ipc/pci-ish.c b/drivers/hid/intel-ish-hid/ipc/pci-ish.c
index ed3405c05e73..8d36ae96a3ee 100644
--- a/drivers/hid/intel-ish-hid/ipc/pci-ish.c
+++ b/drivers/hid/intel-ish-hid/ipc/pci-ish.c
@@ -110,7 +110,7 @@ void ish_event_tracer(struct ishtp_device *dev, const char *format, ...)
 		vsnprintf(tmp_buf, sizeof(tmp_buf), format, args);
 		va_end(args);
 
-		trace_ishtp_dump(tmp_buf);
+		trace_call__ishtp_dump(tmp_buf);
 	}
 }
 
-- 
2.54.0



* [PATCH v3 08/11] scsi: ufs: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: linux-scsi, Steven Rostedt, linux-trace-kernel, Vineeth Pillai,
	Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 drivers/ufs/core/ufshcd.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index c3f08957d179..07f3126d2a94 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -421,8 +421,8 @@ static void ufshcd_add_cmd_upiu_trace(struct ufs_hba *hba,
 	else
 		header = &lrb->ucd_rsp_ptr->header;
 
-	trace_ufshcd_upiu(hba, str_t, header, &rq->sc.cdb,
-			  UFS_TSF_CDB);
+	trace_call__ufshcd_upiu(hba, str_t, header, &rq->sc.cdb,
+			       UFS_TSF_CDB);
 }
 
 static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
@@ -432,8 +432,8 @@ static void ufshcd_add_query_upiu_trace(struct ufs_hba *hba,
 	if (!trace_ufshcd_upiu_enabled())
 		return;
 
-	trace_ufshcd_upiu(hba, str_t, &rq_rsp->header,
-			  &rq_rsp->qr, UFS_TSF_OSF);
+	trace_call__ufshcd_upiu(hba, str_t, &rq_rsp->header,
+			       &rq_rsp->qr, UFS_TSF_OSF);
 }
 
 static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
@@ -445,15 +445,15 @@ static void ufshcd_add_tm_upiu_trace(struct ufs_hba *hba, unsigned int tag,
 		return;
 
 	if (str_t == UFS_TM_SEND)
-		trace_ufshcd_upiu(hba, str_t,
-				  &descp->upiu_req.req_header,
-				  &descp->upiu_req.input_param1,
-				  UFS_TSF_TM_INPUT);
+		trace_call__ufshcd_upiu(hba, str_t,
+					&descp->upiu_req.req_header,
+					&descp->upiu_req.input_param1,
+					UFS_TSF_TM_INPUT);
 	else
-		trace_ufshcd_upiu(hba, str_t,
-				  &descp->upiu_rsp.rsp_header,
-				  &descp->upiu_rsp.output_param1,
-				  UFS_TSF_TM_OUTPUT);
+		trace_call__ufshcd_upiu(hba, str_t,
+					&descp->upiu_rsp.rsp_header,
+					&descp->upiu_rsp.output_param1,
+					UFS_TSF_TM_OUTPUT);
 }
 
 static void ufshcd_add_uic_command_trace(struct ufs_hba *hba,
@@ -470,10 +470,10 @@ static void ufshcd_add_uic_command_trace(struct ufs_hba *hba,
 	else
 		cmd = ufshcd_readl(hba, REG_UIC_COMMAND);
 
-	trace_ufshcd_uic_command(hba, str_t, cmd,
-				 ufshcd_readl(hba, REG_UIC_COMMAND_ARG_1),
-				 ufshcd_readl(hba, REG_UIC_COMMAND_ARG_2),
-				 ufshcd_readl(hba, REG_UIC_COMMAND_ARG_3));
+	trace_call__ufshcd_uic_command(hba, str_t, cmd,
+				       ufshcd_readl(hba, REG_UIC_COMMAND_ARG_1),
+				       ufshcd_readl(hba, REG_UIC_COMMAND_ARG_2),
+				       ufshcd_readl(hba, REG_UIC_COMMAND_ARG_3));
 }
 
 static void ufshcd_add_command_trace(struct ufs_hba *hba, struct scsi_cmnd *cmd,
@@ -522,8 +522,9 @@ static void ufshcd_add_command_trace(struct ufs_hba *hba, struct scsi_cmnd *cmd,
 	} else {
 		doorbell = ufshcd_readl(hba, REG_UTP_TRANSFER_REQ_DOOR_BELL);
 	}
-	trace_ufshcd_command(cmd->device, hba, str_t, tag, doorbell, hwq_id,
-			     transfer_len, intr, lba, opcode, group_id);
+	trace_call__ufshcd_command(cmd->device, hba, str_t, tag, doorbell,
+				   hwq_id, transfer_len, intr, lba, opcode,
+				   group_id);
 }
 
 static void ufshcd_print_clk_freqs(struct ufs_hba *hba)
-- 
2.54.0



* [PATCH v3 09/11] net: devlink: Use trace_call__##name() at guarded tracepoint call sites
From: Vineeth Pillai (Google) @ 2026-05-15 13:59 UTC (permalink / raw)
  To: Jiri Pirko, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, Steven Rostedt, linux-trace-kernel, Vineeth Pillai,
	Peter Zijlstra

From: Vineeth Pillai <vineeth@bitbyteword.org>

Replace trace_foo() with the new trace_call__foo() at sites already
guarded by trace_foo_enabled(), avoiding a redundant
static_branch_unlikely() re-evaluation inside the tracepoint.
trace_call__foo() calls the tracepoint callbacks directly without
utilizing the static branch again.

Original v2 series:
https://lore.kernel.org/linux-trace-kernel/20260323160052.17528-1-vineeth@bitbyteword.org/

Parts of the original v2 series have already been merged in mainline.
This patch is being reposted as a follow-up cleanup for the remaining
unmerged pieces.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Assisted-by: Claude:claude-sonnet-4-6
---
 net/devlink/trap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/devlink/trap.c b/net/devlink/trap.c
index 8edb31654a68..d54276dcd62f 100644
--- a/net/devlink/trap.c
+++ b/net/devlink/trap.c
@@ -1497,7 +1497,7 @@ void devlink_trap_report(struct devlink *devlink, struct sk_buff *skb,
 
 		devlink_trap_report_metadata_set(&metadata, trap_item,
 						 in_devlink_port, fa_cookie);
-		trace_devlink_trap_report(devlink, skb, &metadata);
+		trace_call__devlink_trap_report(devlink, skb, &metadata);
 	}
 }
 EXPORT_SYMBOL_GPL(devlink_trap_report);
-- 
2.54.0


