Linux Trace Kernel


* [PATCH v10 0/5] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu (Google) @ 2026-03-17  9:36 UTC
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers

Hi,

Here is the 10th version of the patch series that makes persistent
ring buffers robust to failures.
The previous version is here:

https://lore.kernel.org/all/177319273059.130641.10882692460536780093.stgit@mhiramat.tok.corp.google.com/

In this version, I added a new patch to skip invalid pages in the
rewinding process [4/5], added an entry_bytes check to the test [5/5],
and stopped compiling the test code when
CONFIG_RING_BUFFER_PERSISTENT_SELFTEST=n [5/5].

Thank you,

---

Masami Hiramatsu (Google) (5):
      ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
      ring-buffer: Flush and stop persistent ring buffer on panic
      ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
      ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
      ring-buffer: Add persistent ring buffer selftest


 arch/alpha/include/asm/Kbuild        |    1 
 arch/arc/include/asm/Kbuild          |    1 
 arch/arm/include/asm/Kbuild          |    1 
 arch/arm64/include/asm/ring_buffer.h |   10 ++
 arch/csky/include/asm/Kbuild         |    1 
 arch/hexagon/include/asm/Kbuild      |    1 
 arch/loongarch/include/asm/Kbuild    |    1 
 arch/m68k/include/asm/Kbuild         |    1 
 arch/microblaze/include/asm/Kbuild   |    1 
 arch/mips/include/asm/Kbuild         |    1 
 arch/nios2/include/asm/Kbuild        |    1 
 arch/openrisc/include/asm/Kbuild     |    1 
 arch/parisc/include/asm/Kbuild       |    1 
 arch/powerpc/include/asm/Kbuild      |    1 
 arch/riscv/include/asm/Kbuild        |    1 
 arch/s390/include/asm/Kbuild         |    1 
 arch/sh/include/asm/Kbuild           |    1 
 arch/sparc/include/asm/Kbuild        |    1 
 arch/um/include/asm/Kbuild           |    1 
 arch/x86/include/asm/Kbuild          |    1 
 arch/xtensa/include/asm/Kbuild       |    1 
 include/asm-generic/ring_buffer.h    |   13 ++
 include/linux/ring_buffer.h          |    1 
 kernel/trace/Kconfig                 |   15 ++
 kernel/trace/ring_buffer.c           |  226 ++++++++++++++++++++++++++--------
 kernel/trace/trace.c                 |    4 +
 26 files changed, 233 insertions(+), 56 deletions(-)
 create mode 100644 arch/arm64/include/asm/ring_buffer.h
 create mode 100644 include/asm-generic/ring_buffer.h

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

* Re: [PATCH v3] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Marc Zyngier @ 2026-03-17  9:04 UTC
  To: Steven Rostedt
  Cc: Vincent Donnefort, arnd, nathan, linux-trace-kernel, kvmarm,
	kernel-team
In-Reply-To: <20260316100929.5402a335@gandalf.local.home>

On Mon, 16 Mar 2026 14:09:29 +0000,
Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Mon, 16 Mar 2026 09:28:45 +0000
> Vincent Donnefort <vdonnefort@google.com> wrote:
> 
> > Compiler and tooling-generated symbols are difficult to maintain
> > across all supported architectures. Make the allowlist more robust by
> > replacing the hardcoded list with a mechanism that automatically detects
> > these symbols.
> > 
> > This mechanism generates a C function designed to trigger common
> > compiler-inserted symbols.
> > 
> > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> > Reviewed-by: Nathan Chancellor <nathan@kernel.org>
> > Tested-by: Nathan Chancellor <nathan@kernel.org>
> 
> I take it that Marc will take this?

Yup, now merged.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH v3] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Marc Zyngier @ 2026-03-17  9:04 UTC
  To: Arnd Bergmann
  Cc: Vincent Donnefort, Steven Rostedt, Nathan Chancellor,
	linux-trace-kernel, kvmarm, kernel-team
In-Reply-To: <501b2810-db63-4aa2-ac22-d3f1a99e9bfa@app.fastmail.com>

On Mon, 16 Mar 2026 20:48:09 +0000,
"Arnd Bergmann" <arnd@arndb.de> wrote:
> 
> On Mon, Mar 16, 2026, at 21:47, Arnd Bergmann wrote:
> >
> > This needs "__kmsan" as well, for these symbols:
> >
> 
> "__msan" of course, not "__kmsan".

Folded that into the patch.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH v3] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Marc Zyngier @ 2026-03-17  9:03 UTC
  To: Vincent Donnefort
  Cc: rostedt, arnd, nathan, linux-trace-kernel, kvmarm, kernel-team
In-Reply-To: <20260316092845.3367411-1-vdonnefort@google.com>

On Mon, 16 Mar 2026 09:28:45 +0000, Vincent Donnefort wrote:
> Compiler and tooling-generated symbols are difficult to maintain
> across all supported architectures. Make the allowlist more robust by
> replacing the hardcoded list with a mechanism that automatically detects
> these symbols.
> 
> This mechanism generates a C function designed to trigger common
> compiler-inserted symbols.
> 
> [...]

Applied to next, thanks!

[1/1] tracing: Generate undef symbols allowlist for simple_ring_buffer
      commit: 1211907ac0b5f35e5720620c50b7ca3c72d81f7e

Cheers,

	M.
-- 
Without deviation from the norm, progress is not possible.



* [RFC v5 7/7] ext4: fast commit: export snapshot stats in fc_info
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.
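
The average/max bookkeeping described above can be sketched in userspace
C. This is only an illustration under stated assumptions: the struct and
function names (lock_window_stats, lock_window_record, and so on) are
hypothetical and do not appear in the patch, and the sketch ignores
concurrency, matching the "best-effort" nature of the counters.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the lock_updates_* counters. */
struct lock_window_stats {
	uint64_t ns_total;	/* sum of all locked windows, in ns */
	uint64_t ns_max;	/* longest single window, in ns */
	uint64_t samples;	/* number of windows recorded */
};

/* Record one updates-locked window of locked_ns nanoseconds. */
static void lock_window_record(struct lock_window_stats *s, uint64_t locked_ns)
{
	s->ns_total += locked_ns;
	s->samples++;
	if (locked_ns > s->ns_max)
		s->ns_max = locked_ns;
}

/* Average window in microseconds; 0 when nothing has been recorded. */
static uint64_t lock_window_avg_us(const struct lock_window_stats *s)
{
	if (!s->samples)
		return 0;
	return (s->ns_total / s->samples) / 1000;
}

/* Longest window in microseconds. */
static uint64_t lock_window_max_us(const struct lock_window_stats *s)
{
	return s->ns_max / 1000;
}
```

Keeping a running total and a sample count, instead of a running average,
avoids rounding drift: the division happens only when the value is read.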

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h        | 31 ++++++++++++++++++++++
 fs/ext4/fast_commit.c | 61 ++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/super.c       |  1 +
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b9e146f3dd9e4..8b7530f2e0706 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1566,6 +1566,36 @@ struct ext4_orphan_info {
 						 * file blocks */
 };
 
+/*
+ * Ext4 fast commit snapshot statistics.
+ *
+ * These are best-effort counters intended for debugging / performance
+ * introspection; they are not exact under concurrent updates.
+ */
+struct ext4_fc_snap_stats {
+	u64 lock_updates_ns_total;
+	u64 lock_updates_ns_max;
+	u64 lock_updates_samples;
+
+	u64 snap_inodes;
+	u64 snap_ranges;
+
+	u64 snap_fail_es_miss;
+	u64 snap_fail_es_delayed;
+	u64 snap_fail_es_other;
+
+	u64 snap_fail_inodes_cap;
+	u64 snap_fail_ranges_cap;
+	u64 snap_fail_nomem;
+	u64 snap_fail_inode_loc;
+
+	/*
+	 * Missing inode snapshots during log writing should never happen.
+	 * Keep this counter to help catch unexpected regressions.
+	 */
+	u64 snap_fail_no_snap;
+};
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1837,6 +1867,7 @@ struct ext4_sb_info {
 	struct mutex s_fc_lock;
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
+	struct ext4_fc_snap_stats s_fc_snap_stats;
 	tid_t s_fc_ineligible_tid;
 #ifdef CONFIG_EXT4_DEBUG
 	int s_fc_debug_max_replay;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 4929e2990b292..09ae8f52abdab 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -890,13 +890,17 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	int inode_len;
 	int ret;
 
-	if (!snap)
+	if (!snap) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	src = snap->inode_buf;
 	inode_len = snap->inode_len;
-	if (!src || inode_len == 0)
+	if (!src || inode_len == 0) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -931,8 +935,10 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 	struct ext4_extent *ex;
 	struct ext4_fc_range *range;
 
-	if (!snap)
+	if (!snap) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	list_for_each_entry(range, &snap->data_list, list) {
 		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
@@ -993,6 +999,8 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
 	unsigned int nr_ranges = 0;
 
@@ -1018,11 +1026,13 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		ext4_lblk_t len;
 
 		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			stats->snap_fail_es_miss++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
 		}
 
 		if (ext4_es_is_delayed(&es)) {
+			stats->snap_fail_es_delayed++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
@@ -1037,6 +1047,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		}
 
 		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			stats->snap_fail_ranges_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
@@ -1044,6 +1055,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range) {
+			stats->snap_fail_nomem++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
 		}
@@ -1071,6 +1083,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			stats->snap_fail_es_other++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
@@ -1091,6 +1104,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
@@ -1101,6 +1116,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
 	if (ret) {
+		stats->snap_fail_inode_loc++;
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
 	}
@@ -1112,6 +1128,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		stats->snap_fail_nomem++;
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
@@ -1136,6 +1153,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	stats->snap_inodes++;
+	stats->snap_ranges += nr_ranges;
 	if (nr_rangesp)
 		*nr_rangesp = nr_ranges;
 	return 0;
@@ -1245,6 +1264,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			sbi->s_fc_snap_stats.snap_fail_inodes_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1270,6 +1290,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			sbi->s_fc_snap_stats.snap_fail_inodes_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1313,6 +1334,7 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
@@ -1375,8 +1397,13 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 		return ret;
 
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
-	if (ret)
+	if (ret) {
+		if (ret == -E2BIG)
+			snap_stats->snap_fail_inodes_cap++;
+		else if (ret == -ENOMEM)
+			snap_stats->snap_fail_nomem++;
 		return ret;
+	}
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
@@ -1398,6 +1425,10 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
 	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	snap_stats->lock_updates_ns_total += locked_ns;
+	snap_stats->lock_updates_samples++;
+	if (locked_ns > snap_stats->lock_updates_ns_max)
+		snap_stats->lock_updates_ns_max = locked_ns;
 	trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns, snap_inodes,
 				   snap_ranges, ret, snap_err);
 	kvfree(inodes);
@@ -2694,11 +2725,17 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 {
 	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
 	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
+	u64 lock_avg_ns = 0;
 	int i;
 
 	if (v != SEQ_START_TOKEN)
 		return 0;
 
+	if (snap_stats->lock_updates_samples)
+		lock_avg_ns = div_u64(snap_stats->lock_updates_ns_total,
+				      snap_stats->lock_updates_samples);
+
 	seq_printf(seq,
 		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
 		   stats->fc_num_commits, stats->fc_ineligible_commits,
@@ -2709,6 +2746,22 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
 			stats->fc_ineligible_reason_count[i]);
 
+	seq_printf(seq,
+		   "Snapshot stats:\n%llu inodes\n%llu ranges\n%lluus lock_updates_avg\n%lluus lock_updates_max\n",
+		   snap_stats->snap_inodes, snap_stats->snap_ranges,
+		   div_u64(lock_avg_ns, 1000),
+		   div_u64(snap_stats->lock_updates_ns_max, 1000));
+	seq_printf(seq,
+		   "Snapshot failures:\n%llu es_miss\n%llu es_delayed\n%llu es_other\n%llu inodes_cap\n%llu ranges_cap\n%llu nomem\n%llu inode_loc\n%llu no_snap\n",
+		   snap_stats->snap_fail_es_miss,
+		   snap_stats->snap_fail_es_delayed,
+		   snap_stats->snap_fail_es_other,
+		   snap_stats->snap_fail_inodes_cap,
+		   snap_stats->snap_fail_ranges_cap,
+		   snap_stats->snap_fail_nomem,
+		   snap_stats->snap_fail_inode_loc,
+		   snap_stats->snap_fail_no_snap);
+
 	return 0;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4f5f0c21d436f..3afcaf9d80078 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4500,6 +4500,7 @@ static void ext4_fast_commit_init(struct super_block *sb)
 	sbi->s_fc_ineligible_tid = 0;
 	mutex_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	memset(&sbi->s_fc_snap_stats, 0, sizeof(sbi->s_fc_snap_stats));
 	sbi->s_fc_replay_state.fc_regions = NULL;
 	sbi->s_fc_replay_state.fc_regions_size = 0;
 	sbi->s_fc_replay_state.fc_regions_used = 0;
-- 
2.53.0


* [RFC v5 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-ext4, linux-kernel,
	linux-trace-kernel
  Cc: Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.
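
As a rough illustration of the "first failure reason wins" rule, here is
a standalone C sketch. The enum values and the names set_snap_err and
snap_err_name are hypothetical stand-ins chosen for this example; only
the idea (record a reason once, then never overwrite it) is taken from
the description above.

```c
#include <stddef.h>

/* Hypothetical reason codes; the numeric values are illustrative. */
enum snap_err {
	SNAP_ERR_NONE = 0,
	SNAP_ERR_ES_MISS = 1,
	SNAP_ERR_NOMEM = 2,
};

/* Store err only if no reason is recorded yet; a NULL slot is ignored. */
static void set_snap_err(int *slot, int err)
{
	if (slot && *slot == SNAP_ERR_NONE)
		*slot = err;
}

/* Map a reason code to a fixed name that tooling can match on. */
static const char *snap_err_name(int err)
{
	switch (err) {
	case SNAP_ERR_NONE:	return "NONE";
	case SNAP_ERR_ES_MISS:	return "ES_MISS";
	case SNAP_ERR_NOMEM:	return "NOMEM";
	default:		return "UNKNOWN";
	}
}
```

Because later calls are no-ops once a reason is set, nested helpers can
all report into the same slot without the outer caller having to decide
which failure was the root cause.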

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h              | 15 ++++++++
 fs/ext4/fast_commit.c       | 71 +++++++++++++++++++++++++++++--------
 include/trace/events/ext4.h | 61 +++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 68a64fa0be926..b9e146f3dd9e4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1037,6 +1037,21 @@ enum {
 
 struct ext4_fc_inode_snap;
 
+/*
+ * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
+ * Keep these stable for tooling.
+ */
+enum ext4_fc_snap_err {
+	EXT4_FC_SNAP_ERR_NONE		= 0,
+	EXT4_FC_SNAP_ERR_ES_MISS	= 1,
+	EXT4_FC_SNAP_ERR_ES_DELAYED	= 2,
+	EXT4_FC_SNAP_ERR_ES_OTHER	= 3,
+	EXT4_FC_SNAP_ERR_INODES_CAP	= 4,
+	EXT4_FC_SNAP_ERR_RANGES_CAP	= 5,
+	EXT4_FC_SNAP_ERR_NOMEM		= 6,
+	EXT4_FC_SNAP_ERR_INODE_LOC	= 7,
+};
+
 /*
  * fourth extended file system inode data in memory
  */
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index d1eefee609120..4929e2990b292 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -193,6 +193,12 @@ static struct kmem_cache *ext4_fc_range_cachep;
 #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
 #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
+static inline void ext4_fc_set_snap_err(int *snap_err, int err)
+{
+	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
+		*snap_err = err;
+}
+
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
 	BUFFER_TRACE(bh, "");
@@ -983,11 +989,12 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       struct list_head *ranges,
 				       unsigned int nr_ranges_total,
-				       unsigned int *nr_rangesp)
+				       unsigned int *nr_rangesp,
+				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	unsigned int nr_ranges = 0;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
@@ -1010,11 +1017,16 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		struct ext4_fc_range *range;
 		ext4_lblk_t len;
 
-		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
+		}
 
-		if (ext4_es_is_delayed(&es))
+		if (ext4_es_is_delayed(&es)) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
+		}
 
 		len = es.es_len - (cur_lblk - es.es_lblk);
 		if (len > end_lblk - cur_lblk + 1)
@@ -1024,12 +1036,17 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 			continue;
 		}
 
-		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
+		}
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
-		if (!range)
+		if (!range) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
+		}
 		nr_ranges++;
 
 		range->lblk = cur_lblk;
@@ -1054,6 +1071,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
 
@@ -1070,7 +1088,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int nr_ranges_total,
-				  unsigned int *nr_rangesp)
+				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
@@ -1082,8 +1100,10 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	int alloc_ctx;
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
-	if (ret)
+	if (ret) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
+	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
 		inode_len = EXT4_INODE_SIZE(inode->i_sb);
@@ -1092,6 +1112,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
 	}
@@ -1102,7 +1123,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	brelse(iloc.bh);
 
 	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
-					  &nr_ranges);
+					  &nr_ranges, snap_err);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1203,7 +1224,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
 					 unsigned int *nr_inodesp);
 
 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
-				   unsigned int inodes_size)
+				   unsigned int inodes_size,
+				   unsigned int *nr_inodesp,
+				   unsigned int *nr_rangesp,
+				   int *snap_err)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1221,6 +1245,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1244,6 +1270,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1268,16 +1296,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 		unsigned int inode_ranges = 0;
 
 		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
-					     &inode_ranges);
+					     &inode_ranges, snap_err);
 		if (ret)
 			break;
 		nr_ranges += inode_ranges;
 	}
 
+	if (nr_inodesp)
+		*nr_inodesp = i;
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return ret;
 }
 
-static int ext4_fc_perform_commit(journal_t *journal)
+static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1286,10 +1318,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct inode *inode;
 	struct inode **inodes;
 	unsigned int inodes_size;
+	unsigned int snap_inodes = 0;
+	unsigned int snap_ranges = 0;
+	int snap_err = EXT4_FC_SNAP_ERR_NONE;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
+	ktime_t lock_start;
+	u64 locked_ns;
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1337,13 +1374,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	if (ret)
 		return ret;
 
-
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
 	if (ret)
 		return ret;
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
+	lock_start = ktime_get();
 	/*
 	 * The journal is now locked. No more handles can start and all the
 	 * previous handles are now drained. Snapshotting happens in this
@@ -1357,8 +1394,12 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
+				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
+	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns, snap_inodes,
+				   snap_ranges, ret, snap_err);
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -1563,7 +1604,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
-	ret = ext4_fc_perform_commit(journal);
+	ret = ext4_fc_perform_commit(journal, commit_tid);
 	if (ret < 0) {
 		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
 			status = EXT4_FC_STATUS_INELIGIBLE;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index fd76d14c2776e..dc084f39b74ad 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -104,6 +104,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_INODE_JOURNAL_DATA);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_ENCRYPTED_FILENAME);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
+#undef EM
+#undef EMe
+#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+
+#define TRACE_SNAP_ERR						\
+	EM(NONE)						\
+	EM(ES_MISS)						\
+	EM(ES_DELAYED)						\
+	EM(ES_OTHER)						\
+	EM(INODES_CAP)						\
+	EM(RANGES_CAP)						\
+	EM(NOMEM)						\
+	EMe(INODE_LOC)
+
+TRACE_SNAP_ERR
+
+#undef EM
+#undef EMe
+
 #define show_fc_reason(reason)						\
 	__print_symbolic(reason,					\
 		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
@@ -2812,6 +2832,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
 		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 );
 
+#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
+#define EMe(a)	{ EXT4_FC_SNAP_ERR_##a, #a }
+
+TRACE_EVENT(ext4_fc_lock_updates,
+	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
+		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
+		     int snap_err),
+
+	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
+
+	TP_STRUCT__entry(/* entry */
+		__field(dev_t, dev)
+		__field(tid_t, tid)
+		__field(u64, locked_ns)
+		__field(unsigned int, nr_inodes)
+		__field(unsigned int, nr_ranges)
+		__field(int, err)
+		__field(int, snap_err)
+	),
+
+	TP_fast_assign(/* assign */
+		__entry->dev = sb->s_dev;
+		__entry->tid = commit_tid;
+		__entry->locked_ns = locked_ns;
+		__entry->nr_inodes = nr_inodes;
+		__entry->nr_ranges = nr_ranges;
+		__entry->err = err;
+		__entry->snap_err = snap_err;
+	),
+
+	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
+		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
+		  __entry->err, __print_symbolic(__entry->snap_err,
+						 TRACE_SNAP_ERR))
+);
+
+#undef EM
+#undef EMe
+#undef TRACE_SNAP_ERR
+
 #define FC_REASON_NAME_STAT(reason)					\
 	show_fc_reason(reason),						\
 	__entry->fc_ineligible_rc[reason]
-- 
2.53.0


* [RFC v5 5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.
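
The cache-only walk with a cap can be pictured with the following
userspace C sketch. It is not the patch's code: struct cached_extent,
cache_lookup, snapshot_ranges and the tiny MAX_RANGES value are all
hypothetical, and a flat array stands in for the extent status tree. It
only shows the control flow: a miss or an unstable entry returns -EAGAIN,
exceeding the cap returns -E2BIG, and the caller would treat either as
"fall back to full commit".

```c
#include <errno.h>
#include <stddef.h>

#define MAX_RANGES 4	/* illustrative cap, far smaller than a real one */

/* Hypothetical cached extent covering blocks [lblk, lblk + len). */
struct cached_extent {
	unsigned int lblk;
	unsigned int len;
	int delayed;	/* non-zero: mapping is not stable yet */
};

/* Find the cached extent covering lblk, or NULL on a cache miss. */
static const struct cached_extent *
cache_lookup(const struct cached_extent *cache, size_t n, unsigned int lblk)
{
	for (size_t i = 0; i < n; i++)
		if (lblk >= cache[i].lblk && lblk - cache[i].lblk < cache[i].len)
			return &cache[i];
	return NULL;
}

/*
 * Count the ranges needed to cover [start, end] using only the cache.
 * Returns the range count, -EAGAIN on a miss or an unstable extent, or
 * -E2BIG once the cap would be exceeded.
 */
static int snapshot_ranges(const struct cached_extent *cache, size_t n,
			   unsigned int start, unsigned int end)
{
	unsigned int cur = start;
	int nr = 0;

	while (cur <= end) {
		const struct cached_extent *es = cache_lookup(cache, n, cur);
		unsigned int avail;

		if (!es || es->delayed)
			return -EAGAIN;
		if (nr >= MAX_RANGES)
			return -E2BIG;
		nr++;

		/* Blocks left in this extent from cur onward (at least 1). */
		avail = es->len - (cur - es->lblk);
		if (avail - 1 >= end - cur)
			break;	/* this extent reaches the end of the range */
		cur += avail;
	}
	return nr;
}
```

Checking the cap before recording each range keeps the work inside the
locked window bounded regardless of how fragmented the file is.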

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c | 253 +++++++++++++++++++++++++++++-------------
 1 file changed, 177 insertions(+), 76 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 966211a3342a0..d1eefee609120 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -183,6 +183,15 @@
 
 #include <trace/events/ext4.h>
 static struct kmem_cache *ext4_fc_dentry_cachep;
+static struct kmem_cache *ext4_fc_range_cachep;
+
+/*
+ * Avoid spending unbounded time/memory snapshotting highly fragmented files
+ * under jbd2_journal_lock_updates(). If we exceed this limit, fall back to
+ * full commit.
+ */
+#define EXT4_FC_SNAPSHOT_MAX_INODES	1024
+#define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
@@ -954,7 +963,7 @@ static void ext4_fc_free_ranges(struct list_head *head)
 
 	list_for_each_entry_safe(range, range_n, head, list) {
 		list_del(&range->list);
-		kfree(range);
+		kmem_cache_free(ext4_fc_range_cachep, range);
 	}
 }
 
@@ -972,16 +981,19 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 }
 
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
-				       struct list_head *ranges)
+				       struct list_head *ranges,
+				       unsigned int nr_ranges_total,
+				       unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
-	struct ext4_map_blocks map;
-	int ret;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
 		spin_unlock(&ei->i_fc_lock);
+		if (nr_rangesp)
+			*nr_rangesp = 0;
 		return 0;
 	}
 	start_lblk = ei->i_fc_lblk_start;
@@ -994,61 +1006,78 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		   start_lblk, end_lblk, inode->i_ino);
 
 	while (cur_lblk <= end_lblk) {
+		struct extent_status es;
 		struct ext4_fc_range *range;
+		ext4_lblk_t len;
 
-		map.m_lblk = cur_lblk;
-		map.m_len = end_lblk - cur_lblk + 1;
-		ret = ext4_map_blocks(NULL, inode, &map,
-				      EXT4_GET_BLOCKS_IO_SUBMIT |
-				      EXT4_EX_NOCACHE);
-		if (ret < 0)
-			return -ECANCELED;
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+			return -EAGAIN;
+
+		if (ext4_es_is_delayed(&es))
+			return -EAGAIN;
 
-		if (map.m_len == 0) {
+		len = es.es_len - (cur_lblk - es.es_lblk);
+		if (len > end_lblk - cur_lblk + 1)
+			len = end_lblk - cur_lblk + 1;
+		if (len == 0) {
 			cur_lblk++;
 			continue;
 		}
 
-		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+			return -E2BIG;
+
+		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range)
 			return -ENOMEM;
+		nr_ranges++;
 
-		range->lblk = map.m_lblk;
-		range->len = map.m_len;
+		range->lblk = cur_lblk;
+		range->len = len;
 		range->pblk = 0;
 		range->unwritten = false;
 
-		if (ret == 0) {
+		if (ext4_es_is_hole(&es)) {
 			range->tag = EXT4_FC_TAG_DEL_RANGE;
-		} else {
-			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
-				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
-
-			/* Limit the number of blocks in one extent */
-			map.m_len = min(max, map.m_len);
+		} else if (ext4_es_is_written(&es) ||
+			   ext4_es_is_unwritten(&es)) {
+			unsigned int max;
 
 			range->tag = EXT4_FC_TAG_ADD_RANGE;
-			range->len = map.m_len;
-			range->pblk = map.m_pblk;
-			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
+			range->pblk = ext4_es_pblock(&es) +
+				      (cur_lblk - es.es_lblk);
+			range->unwritten = ext4_es_is_unwritten(&es);
+
+			max = range->unwritten ? EXT_UNWRITTEN_MAX_LEN :
+						 EXT_INIT_MAX_LEN;
+			if (range->len > max)
+				range->len = max;
+		} else {
+			kmem_cache_free(ext4_fc_range_cachep, range);
+			return -EAGAIN;
 		}
 
 		INIT_LIST_HEAD(&range->list);
 		list_add_tail(&range->list, ranges);
 
-		cur_lblk += map.m_len;
+		cur_lblk += range->len;
 	}
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-static int ext4_fc_snapshot_inode(struct inode *inode)
+static int ext4_fc_snapshot_inode(struct inode *inode,
+				  unsigned int nr_ranges_total,
+				  unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
 	LIST_HEAD(ranges);
+	unsigned int nr_ranges = 0;
 	int ret;
 	int alloc_ctx;
 
@@ -1072,7 +1101,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
 	brelse(iloc.bh);
 
-	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
+					  &nr_ranges);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1085,10 +1115,11 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
 {
@@ -1167,49 +1198,32 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
-static int ext4_fc_snapshot_inodes(journal_t *journal)
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp);
+
+static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
+				   unsigned int inodes_size)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_inode_info *iter;
 	struct ext4_fc_dentry_update *fc_dentry;
-	struct inode **inodes;
-	unsigned int nr_inodes = 0;
 	unsigned int i = 0;
+	unsigned int idx;
+	unsigned int nr_ranges = 0;
 	int ret = 0;
 	int alloc_ctx;
 
-	alloc_ctx = ext4_fc_lock(sb);
-	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
-		nr_inodes++;
-
-	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
-		struct ext4_inode_info *ei;
-
-		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
-			continue;
-		if (list_empty(&fc_dentry->fcd_dilist))
-			continue;
-
-		/* See the comment in ext4_fc_commit_dentry_updates(). */
-		ei = list_first_entry(&fc_dentry->fcd_dilist,
-				      struct ext4_inode_info, i_fc_dilist);
-		if (!list_empty(&ei->i_fc_list))
-			continue;
-
-		nr_inodes++;
-	}
-	ext4_fc_unlock(sb, alloc_ctx);
-
-	if (!nr_inodes)
+	if (!inodes_size)
 		return 0;
 
-	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
-	if (!inodes)
-		return -ENOMEM;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		inodes[i++] = &iter->vfs_inode;
 	}
 
@@ -1229,6 +1243,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		/*
 		 * Create-only inodes may only be referenced via fcd_dilist and
 		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
@@ -1240,15 +1258,22 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 		inodes[i++] = inode;
 	}
+unlock:
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+	if (ret)
+		return ret;
+
+	for (idx = 0; idx < i; idx++) {
+		unsigned int inode_ranges = 0;
+
+		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
+					     &inode_ranges);
 		if (ret)
 			break;
+		nr_ranges += inode_ranges;
 	}
 
-	kvfree(inodes);
 	return ret;
 }
 
@@ -1259,6 +1284,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
+	struct inode **inodes;
+	unsigned int inodes_size;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
@@ -1311,6 +1338,10 @@ static int ext4_fc_perform_commit(journal_t *journal)
 		return ret;
 
 
+	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
+	if (ret)
+		return ret;
+
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
 	/*
@@ -1326,8 +1357,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
 	jbd2_journal_unlock_updates(journal);
+	kvfree(inodes);
 	if (ret)
 		return ret;
 
@@ -1383,6 +1415,64 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	return ret;
 }
 
+static unsigned int ext4_fc_count_snapshot_inodes(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	unsigned int nr_inodes = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	return nr_inodes;
+}
+
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp)
+{
+	unsigned int nr_inodes = ext4_fc_count_snapshot_inodes(sb);
+	struct inode **inodes;
+
+	*inodesp = NULL;
+	*nr_inodesp = 0;
+
+	if (!nr_inodes)
+		return 0;
+
+	if (nr_inodes > EXT4_FC_SNAPSHOT_MAX_INODES)
+		return -E2BIG;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	*inodesp = inodes;
+	*nr_inodesp = nr_inodes;
+	return 0;
+}
+
 static void ext4_fc_update_stats(struct super_block *sb, int status,
 				 u64 commit_time, int nblks, tid_t commit_tid)
 {
@@ -1475,7 +1565,10 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
 	ret = ext4_fc_perform_commit(journal);
 	if (ret < 0) {
-		status = EXT4_FC_STATUS_FAILED;
+		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
+			status = EXT4_FC_STATUS_INELIGIBLE;
+		else
+			status = EXT4_FC_STATUS_FAILED;
 		goto fallback;
 	}
 	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
@@ -1559,34 +1652,35 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
 		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
-					     struct ext4_fc_dentry_update,
-					     fcd_list);
+						 struct ext4_fc_dentry_update,
+						 fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
 		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
-		    !list_empty(&fc_dentry->fcd_dilist)) {
+			!list_empty(&fc_dentry->fcd_dilist)) {
 			/* See the comment in ext4_fc_commit_dentry_updates(). */
 			ei = list_first_entry(&fc_dentry->fcd_dilist,
-					      struct ext4_inode_info,
-					      i_fc_dilist);
+						  struct ext4_inode_info,
+						  i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
 			spin_lock(&ei->i_fc_lock);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_REQUEUE);
+						   EXT4_STATE_FC_REQUEUE);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_COMMITTING);
+						   EXT4_STATE_FC_COMMITTING);
 			spin_unlock(&ei->i_fc_lock);
 			/*
 			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
-			 * visible before we send the wakeup. Pairs with implicit
-			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 * visible before we send the wakeup. Pairs with
+			 * implicit barrier in prepare_to_wait() in
+			 * ext4_fc_del().
 			 */
 			smp_mb();
 #if (BITS_PER_LONG < 64)
 			wake_up_bit(&ei->i_state_flags,
-				    EXT4_STATE_FC_COMMITTING);
+					EXT4_STATE_FC_COMMITTING);
 #else
 			wake_up_bit(&ei->i_flags,
-				    EXT4_STATE_FC_COMMITTING);
+					EXT4_STATE_FC_COMMITTING);
 #endif
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
@@ -2582,13 +2676,20 @@ int __init ext4_fc_init_dentry_cache(void)
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
 					   SLAB_RECLAIM_ACCOUNT);
 
-	if (ext4_fc_dentry_cachep == NULL)
+	if (!ext4_fc_dentry_cachep)
 		return -ENOMEM;
 
+	ext4_fc_range_cachep = KMEM_CACHE(ext4_fc_range, SLAB_RECLAIM_ACCOUNT);
+	if (!ext4_fc_range_cachep) {
+		kmem_cache_destroy(ext4_fc_dentry_cachep);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void ext4_fc_destroy_dentry_cache(void)
 {
+	kmem_cache_destroy(ext4_fc_range_cachep);
 	kmem_cache_destroy(ext4_fc_dentry_cachep);
 }
-- 
2.53.0
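
For readers following the ext4_fc_snapshot_inode_data() hunk above, the
length arithmetic for each range record can be modelled in userspace. This is
a minimal sketch, not the kernel code: struct es_model and fc_range_len() are
hypothetical names, and only the start and length of the extent status entry
are modelled. In the patch the max_len cap is applied only to written or
unwritten extents, not to holes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_status: start and length only. */
struct es_model {
	uint32_t es_lblk;	/* first logical block covered by the extent */
	uint32_t es_len;	/* number of blocks in the extent */
};

/*
 * Length of the next range record starting at cur_lblk: take the tail of
 * the extent from cur_lblk, clamp it to the tracked range
 * [cur_lblk, end_lblk], then cap it at max_len.
 */
static uint32_t fc_range_len(const struct es_model *es, uint32_t cur_lblk,
			     uint32_t end_lblk, uint32_t max_len)
{
	uint32_t len;

	/* Assumption: the lookup returned an extent containing cur_lblk. */
	assert(cur_lblk >= es->es_lblk && cur_lblk - es->es_lblk < es->es_len);
	assert(cur_lblk <= end_lblk);

	len = es->es_len - (cur_lblk - es->es_lblk);
	if (len > end_lblk - cur_lblk + 1)
		len = end_lblk - cur_lblk + 1;
	if (len > max_len)
		len = max_len;
	return len;
}
```

Because the loop advances cur_lblk by the recorded length, an extent longer
than max_len is simply split across several consecutive range records.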



* [RFC v5 4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. Now that ext4_fc_del() waits for
EXT4_STATE_FC_COMMITTING, the final iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread itself, which then
deadlocks waiting for its own fast commit to finish.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c | 47 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 809170d46167b..966211a3342a0 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1210,13 +1210,12 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
-		inodes[i] = igrab(&iter->vfs_inode);
-		if (inodes[i])
-			i++;
+		inodes[i++] = &iter->vfs_inode;
 	}
 
 	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
 		struct ext4_inode_info *ei;
+		struct inode *inode;
 
 		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
 			continue;
@@ -1226,12 +1225,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		/* See the comment in ext4_fc_commit_dentry_updates(). */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				      struct ext4_inode_info, i_fc_dilist);
+		inode = &ei->vfs_inode;
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
-		inodes[i] = igrab(&ei->vfs_inode);
-		if (inodes[i])
-			i++;
+		/*
+		 * Create-only inodes may only be referenced via fcd_dilist and
+		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
+		 * we are snapshotting, but inode eviction calls ext4_fc_del(),
+		 * which waits for FC_COMMITTING to clear. Mark them FC_COMMITTING
+		 * so the inode stays pinned and the snapshot stays valid until
+		 * ext4_fc_cleanup().
+		 */
+		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+		inodes[i++] = inode;
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
@@ -1241,10 +1248,6 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 			break;
 	}
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		if (inodes[nr_inodes])
-			iput(inodes[nr_inodes]);
-	}
 	kvfree(inodes);
 	return ret;
 }
@@ -1312,8 +1315,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	jbd2_journal_lock_updates(journal);
 	/*
 	 * The journal is now locked. No more handles can start and all the
-	 * previous handles are now drained. We now mark the inodes on the
-	 * commit queue as being committed.
+	 * previous handles are now drained. Snapshotting happens in this
+	 * window so log writing can consume only stable snapshots without
+	 * doing logical-to-physical mapping.
 	 */
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
@@ -1565,6 +1569,25 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					      struct ext4_inode_info,
 					      i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
+			spin_lock(&ei->i_fc_lock);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_REQUEUE);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_COMMITTING);
+			spin_unlock(&ei->i_fc_lock);
+			/*
+			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+			 * visible before we send the wakeup. Pairs with implicit
+			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 */
+			smp_mb();
+#if (BITS_PER_LONG < 64)
+			wake_up_bit(&ei->i_state_flags,
+				    EXT4_STATE_FC_COMMITTING);
+#else
+			wake_up_bit(&ei->i_flags,
+				    EXT4_STATE_FC_COMMITTING);
+#endif
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
-- 
2.53.0



* [RFC v5 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.
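
The requeue decision described above can be modelled in userspace as a rough
sketch. struct inode_model, track_update() and cleanup_requeue() are
hypothetical names used only for illustration; the two booleans stand in for
the EXT4_STATE_FC_COMMITTING and EXT4_STATE_FC_REQUEUE bits, and locking is
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* Wraparound-safe "x >= y" for transaction IDs, in the style of jbd2. */
static bool tid_geq(tid_t x, tid_t y)
{
	return (int32_t)(x - y) >= 0;
}

/* Hypothetical model of the per-inode fast commit state. */
struct inode_model {
	tid_t i_sync_tid;
	bool fc_committing;	/* models EXT4_STATE_FC_COMMITTING */
	bool fc_requeue;	/* models EXT4_STATE_FC_REQUEUE */
};

/* Tracking path: an update during a commit only marks the inode. */
static void track_update(struct inode_model *ei)
{
	assert(ei);
	if (ei->fc_committing)
		ei->fc_requeue = true;
}

/*
 * Cleanup path: returns true if the inode should be put back on the
 * staging queue. Full commits use the tid comparison; fast commits use
 * the requeue flag, since tid is not meaningful there.
 */
static bool cleanup_requeue(struct inode_model *ei, bool full, tid_t tid)
{
	bool requeue;

	assert(ei);
	if (full)
		requeue = !tid_geq(tid, ei->i_sync_tid);
	else
		requeue = ei->fc_requeue;

	ei->fc_requeue = false;
	ei->fc_committing = false;
	return requeue;
}
```

In this model an inode that was not touched during the fast commit is dropped
from the queue, which matches the nblks reduction reported in the testing
notes.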

Testing: watch the ext4:ext4_fc_commit_stop tracepoint across two fsyncs in
the same transaction; nblks is the number of journal blocks written for that
fast commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h        |   1 +
 fs/ext4/fast_commit.c | 111 ++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2e1681057196a..68a64fa0be926 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2004,6 +2004,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_FC_REQUEUE,		/* Inode modified during fast commit */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index d5c28304e8181..809170d46167b 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -61,9 +61,8 @@
  *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
  *     needed for log writing.
  * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
- *     starting of new handles. If new handles try to start an update on
- *     any of the inodes that are being committed, ext4_fc_track_inode()
- *     will block until those inodes have finished the fast commit.
+ *     starting of new handles. Updates to inodes being fast committed are
+ *     tracked for requeue rather than blocking.
  * [6] Commit all the directory entry updates in the fast commit space.
  * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
@@ -217,6 +216,7 @@ void ext4_fc_init_inode(struct inode *inode)
 
 	ext4_fc_reset_inode(inode);
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	ei->i_fc_snap = NULL;
@@ -251,22 +251,30 @@ void ext4_fc_del(struct inode *inode)
 	}
 
 	/*
-	 * Since ext4_fc_del is called from ext4_evict_inode while having a
-	 * handle open, there is no need for us to wait here even if a fast
-	 * commit is going on. That is because, if this inode is being
-	 * committed, ext4_mark_inode_dirty would have waited for inode commit
-	 * operation to finish before we come here. So, by the time we come
-	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
-	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
-	 * here.
-	 *
-	 * We may come here without any handles open in the "no_delete" case of
-	 * ext4_evict_inode as well. However, if that happens, we first mark the
-	 * file system as fast commit ineligible anyway. So, even in that case,
-	 * it is okay to remove the inode from the fc list.
+	 * Wait for ongoing fast commit to finish. We cannot remove the inode
+	 * from fast commit lists while it is being committed.
 	 */
-	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
-		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			ext4_fc_unlock(inode->i_sb, alloc_ctx);
+			schedule();
+			alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
 #if (BITS_PER_LONG < 64)
 		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
@@ -287,19 +295,22 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+
 	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
-	 * Since this inode is getting removed, let's also remove all FC
-	 * dentry create references, since it is not needed to log it anyways.
+	 * Since this inode is getting removed, let's also remove all FC dentry
+	 * create references, since it is not needed to log it anyways.
 	 */
 	if (list_empty(&ei->i_fc_dilist)) {
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
 
-	fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist);
+	fc_dentry = list_first_entry(&ei->i_fc_dilist,
+				     struct ext4_fc_dentry_update,
+				     fcd_dilist);
 	WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT);
 	list_del_init(&fc_dentry->fcd_list);
 	list_del_init(&fc_dentry->fcd_dilist);
@@ -371,6 +382,8 @@ static int ext4_fc_track_template(
 
 	tid = handle->h_transaction->t_tid;
 	spin_lock(&ei->i_fc_lock);
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -557,8 +570,6 @@ static int __track_inode(handle_t *handle, struct inode *inode, void *arg,
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	wait_queue_head_t *wq;
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -577,29 +588,11 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		return;
 
 	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
+	 * Fast commit snapshots inode state at commit time, so there's no need
+	 * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already
+	 * on the commit queue, ext4_fc_cleanup() will requeue it for the new
+	 * transaction once the current commit finishes.
 	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-#if (BITS_PER_LONG < 64)
-		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_state_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#else
-		DEFINE_WAIT_BIT(wait, &ei->i_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#endif
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
-			schedule();
-		finish_wait(wq, &wait.wq_entry);
-	}
 
 	/*
 	 * From this point on, this inode will not be committed either
@@ -1525,32 +1518,32 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) {
+		bool requeue;
+
 		ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN],
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
 		ext4_fc_free_inode_snap(&ei->vfs_inode);
+		spin_lock(&ei->i_fc_lock);
+		if (full)
+			requeue = !tid_geq(tid, ei->i_sync_tid);
+		else
+			requeue = ext4_test_inode_state(&ei->vfs_inode,
+							EXT4_STATE_FC_REQUEUE);
+		if (!requeue)
+			ext4_fc_reset_inode(&ei->vfs_inode);
+		ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
-		if (tid_geq(tid, ei->i_sync_tid)) {
-			ext4_fc_reset_inode(&ei->vfs_inode);
-		} else if (full) {
-			/*
-			 * We are called after a full commit, inode has been
-			 * modified while the commit was running. Re-enqueue
-			 * the inode into STAGING, which will then be splice
-			 * back into MAIN. This cannot happen during
-			 * fastcommit because the journal is locked all the
-			 * time in that case (and tid doesn't increase so
-			 * tid check above isn't reliable).
-			 */
+		spin_unlock(&ei->i_fc_lock);
+		if (requeue)
 			list_add_tail(&ei->i_fc_list,
 				      &sbi->s_fc_q[FC_Q_STAGING]);
-		}
 		/*
 		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
 		 * visible before we send the wakeup. Pairs with implicit
-		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
 #if (BITS_PER_LONG < 64)
-- 
2.53.0



* [RFC v5 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

Fast commit can hold s_fc_lock while writing journal blocks, and mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode's i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode's i_data_sem from a regular inode's i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.

Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.
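
The recycling problem can be illustrated with a toy one-slot object cache.
This is only a sketch under stated assumptions: struct inode_model,
model_alloc_inode() and model_free_inode() are hypothetical, and the integer
sem_subclass field stands in for the lockdep subclass attached to
i_data_sem. The enum values mirror the ones in the patch.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	I_DATA_SEM_NORMAL = 0,
	I_DATA_SEM_OTHER,
	I_DATA_SEM_QUOTA,
	I_DATA_SEM_EA,
	I_DATA_SEM_JOURNAL
};

/* Hypothetical model: the subclass survives in the cached object. */
struct inode_model {
	int sem_subclass;
	bool in_use;
};

/* One-slot cache: every allocation hands back the same object. */
static struct inode_model slot;

static struct inode_model *model_alloc_inode(void)
{
	assert(!slot.in_use);
	slot.in_use = true;
	/* Reset on every allocation so a stale subclass is not inherited. */
	slot.sem_subclass = I_DATA_SEM_NORMAL;
	return &slot;
}

static void model_free_inode(struct inode_model *ei)
{
	assert(ei->in_use);
	ei->in_use = false;	/* note: sem_subclass is left as it was */
}
```

Without the reset in model_alloc_inode(), an object last used as the journal
inode would come back still carrying I_DATA_SEM_JOURNAL.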

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bd30c24d4f948..2e1681057196a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1025,12 +1025,14 @@ do {										\
  *			  than the first
  *  I_DATA_SEM_QUOTA  - Used for quota inodes only
  *  I_DATA_SEM_EA     - Used for ea_inodes only
+ *  I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL = 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
 
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 79762c3e0dff3..4f5f0c21d436f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1423,6 +1423,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
 
@@ -5863,6 +5866,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		  journal_inode, journal_inode->i_size);
 	return journal_inode;
-- 
2.53.0



* [RFC v5 1/7] ext4: fast commit: snapshot inode state before writing log
From: Li Chen @ 2026-03-17  8:46 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260317084624.457185-1-me@linux.beauty>

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.
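
As a rough userspace illustration of the "one pointer to a variable-sized
snapshot" layout, the sketch below copies a raw inode image into a structure
with a flexible array member. struct snap_model and snap_create() are
hypothetical names; the range list that the real snapshot also carries is
omitted, and malloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of the snapshot: header plus inline inode copy. */
struct snap_model {
	unsigned int inode_len;
	uint8_t inode_buf[];	/* flexible array member, sized at allocation */
};

/*
 * Allocate a snapshot and copy inode_len bytes of the raw inode into it.
 * Returns NULL on allocation failure; the caller frees with free().
 */
static struct snap_model *snap_create(const uint8_t *raw,
				      unsigned int inode_len)
{
	struct snap_model *snap;

	assert(raw && inode_len > 0);

	snap = malloc(sizeof(*snap) + inode_len);
	if (!snap)
		return NULL;

	snap->inode_len = inode_len;
	memcpy(snap->inode_buf, raw, inode_len);
	return snap;
}
```

Since the copy is taken once, later changes to the source buffer do not
affect what a writer serializes from the snapshot.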

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h        |  22 ++-
 fs/ext4/fast_commit.c | 330 +++++++++++++++++++++++++++++++++++-------
 fs/ext4/inode.c       |  51 +++++++
 3 files changed, 351 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1524276aeac79..bd30c24d4f948 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1033,6 +1033,7 @@ enum {
 	I_DATA_SEM_EA
 };
 
+struct ext4_fc_inode_snap;
 
 /*
  * fourth extended file system inode data in memory
@@ -1089,6 +1090,22 @@ struct ext4_inode_info {
 	/* End of lblk range that needs to be committed in this fast commit */
 	ext4_lblk_t i_fc_lblk_len;
 
+	/*
+	 * Commit-time fast commit snapshots.
+	 *
+	 * i_fc_snap is installed and freed under sbi->s_fc_lock. The fast
+	 * commit log writing path reads the snapshot under sbi->s_fc_lock while
+	 * serializing fast commit TLVs.
+	 *
+	 * The snapshot lifetime is bounded by EXT4_STATE_FC_COMMITTING and the
+	 * corresponding cleanup / eviction paths.
+	 *
+	 * i_fc_snap points to per-inode snapshot data for fast commit:
+	 * - a raw inode snapshot for EXT4_FC_TAG_INODE
+	 * - data range records for EXT4_FC_TAG_{ADD,DEL}_RANGE
+	 */
+	struct ext4_fc_inode_snap *i_fc_snap;
+
 	spinlock_t i_raw_lock;	/* protects updates to the raw inode */
 
 	/* Fast commit wait queue for this inode */
@@ -3093,8 +3110,9 @@ extern int  ext4_file_getattr(struct mnt_idmap *, const struct path *,
 			      struct kstat *, u32, unsigned int);
 extern void ext4_dirty_inode(struct inode *, int);
 extern int ext4_change_inode_journal_flag(struct inode *, int);
-extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
-extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 5bd57d7f921b9..d5c28304e8181 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -55,21 +55,23 @@
  *     deleted while it is being flushed.
  * [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA"
  *     state.
- * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures that
- *     all the exsiting handles finish and no new handles can start.
- * [4] Mark all the fast commit eligible inodes as undergoing fast commit
- *     by setting "EXT4_STATE_FC_COMMITTING" state.
- * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This allows
+ * [3] Lock the journal by calling jbd2_journal_lock_updates(). This ensures
+ *     that all the existing handles finish and no new handles can start.
+ * [4] Mark all the fast commit eligible inodes as undergoing fast commit by
+ *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
+ *     needed for log writing.
+ * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
  *     starting of new handles. If new handles try to start an update on
  *     any of the inodes that are being committed, ext4_fc_track_inode()
  *     will block until those inodes have finished the fast commit.
  * [6] Commit all the directory entry updates in the fast commit space.
- * [7] Commit all the changed inodes in the fast commit space and clear
- *     "EXT4_STATE_FC_COMMITTING" for these inodes.
+ * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
  *     section for more details).
+ * [9] Clear "EXT4_STATE_FC_COMMITTING" and wake up waiters in
+ *     ext4_fc_cleanup().
  *
- * All the inode updates must be enclosed within jbd2_jounrnal_start()
+ * All the inode updates must be enclosed within jbd2_journal_start()
  * and jbd2_journal_stop() similar to JBD2 journaling.
  *
  * Fast Commit Ineligibility
@@ -199,6 +201,8 @@ static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 	unlock_buffer(bh);
 }
 
+static void ext4_fc_free_inode_snap(struct inode *inode);
+
 static inline void ext4_fc_reset_inode(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -215,6 +219,7 @@ void ext4_fc_init_inode(struct inode *inode)
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
+	ei->i_fc_snap = NULL;
 	init_waitqueue_head(&ei->i_fc_wait);
 }
 
@@ -240,6 +245,7 @@ void ext4_fc_del(struct inode *inode)
 
 	alloc_ctx = ext4_fc_lock(inode->i_sb);
 	if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) {
+		ext4_fc_free_inode_snap(inode);
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
@@ -281,6 +287,7 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
@@ -845,6 +852,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 	return true;
 }
 
+struct ext4_fc_range {
+	struct list_head list;
+	u16 tag;
+	ext4_lblk_t lblk;
+	ext4_lblk_t len;
+	ext4_fsblk_t pblk;
+	bool unwritten;
+};
+
+struct ext4_fc_inode_snap {
+	struct list_head data_list;
+	unsigned int inode_len;
+	u8 inode_buf[];
+};
+
 /*
  * Writes inode in the fast commit space under TLV with tag @tag.
  * Returns 0 on success, error on failure.
@@ -852,21 +874,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
-	int ret;
-	struct ext4_iloc iloc;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_inode fc_inode;
 	struct ext4_fc_tl tl;
 	u8 *dst;
+	u8 *src;
+	int inode_len;
+	int ret;
 
-	ret = ext4_get_inode_loc(inode, &iloc);
-	if (ret)
-		return ret;
+	if (!snap)
+		return -ECANCELED;
 
-	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
-		inode_len = EXT4_INODE_SIZE(inode->i_sb);
-	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
-		inode_len += ei->i_extra_isize;
+	src = snap->inode_buf;
+	inode_len = snap->inode_len;
+	if (!src || inode_len == 0)
+		return -ECANCELED;
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -882,10 +904,9 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	dst += EXT4_FC_TAG_BASE_LEN;
 	memcpy(dst, &fc_inode, sizeof(fc_inode));
 	dst += sizeof(fc_inode);
-	memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	memcpy(dst, src, inode_len);
 	ret = 0;
 err:
-	brelse(iloc.bh);
 	return ret;
 }
 
@@ -895,12 +916,74 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
  */
 static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 {
-	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_map_blocks map;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_add_range fc_ext;
 	struct ext4_fc_del_range lrange;
 	struct ext4_extent *ex;
+	struct ext4_fc_range *range;
+
+	if (!snap)
+		return -ECANCELED;
+
+	list_for_each_entry(range, &snap->data_list, list) {
+		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
+			lrange.fc_ino = cpu_to_le32(inode->i_ino);
+			lrange.fc_lblk = cpu_to_le32(range->lblk);
+			lrange.fc_len = cpu_to_le32(range->len);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+					     sizeof(lrange), (u8 *)&lrange, crc))
+				return -ENOSPC;
+			continue;
+		}
+
+		fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+		ex = (struct ext4_extent *)&fc_ext.fc_ex;
+		ex->ee_block = cpu_to_le32(range->lblk);
+		ex->ee_len = cpu_to_le16(range->len);
+		ext4_ext_store_pblock(ex, range->pblk);
+		if (range->unwritten)
+			ext4_ext_mark_unwritten(ex);
+		else
+			ext4_ext_mark_initialized(ex);
+
+		if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+				     sizeof(fc_ext), (u8 *)&fc_ext, crc))
+			return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static void ext4_fc_free_ranges(struct list_head *head)
+{
+	struct ext4_fc_range *range, *range_n;
+
+	list_for_each_entry_safe(range, range_n, head, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+static void ext4_fc_free_inode_snap(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+
+	if (!snap)
+		return;
+
+	ext4_fc_free_ranges(&snap->data_list);
+	kfree(snap);
+	ei->i_fc_snap = NULL;
+}
+
+static int ext4_fc_snapshot_inode_data(struct inode *inode,
+				       struct list_head *ranges)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	struct ext4_map_blocks map;
 	int ret;
 
 	spin_lock(&ei->i_fc_lock);
@@ -908,18 +991,20 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 		spin_unlock(&ei->i_fc_lock);
 		return 0;
 	}
-	old_blk_size = ei->i_fc_lblk_start;
-	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+	start_lblk = ei->i_fc_lblk_start;
+	end_lblk = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
 	ei->i_fc_lblk_len = 0;
 	spin_unlock(&ei->i_fc_lock);
 
-	cur_lblk_off = old_blk_size;
-	ext4_debug("will try writing %d to %d for inode %ld\n",
-		   cur_lblk_off, new_blk_size, inode->i_ino);
+	cur_lblk = start_lblk;
+	ext4_debug("snapshot data ranges %u-%u for inode %lu\n",
+		   start_lblk, end_lblk, inode->i_ino);
+
+	while (cur_lblk <= end_lblk) {
+		struct ext4_fc_range *range;
 
-	while (cur_lblk_off <= new_blk_size) {
-		map.m_lblk = cur_lblk_off;
-		map.m_len = new_blk_size - cur_lblk_off + 1;
+		map.m_lblk = cur_lblk;
+		map.m_len = end_lblk - cur_lblk + 1;
 		ret = ext4_map_blocks(NULL, inode, &map,
 				      EXT4_GET_BLOCKS_IO_SUBMIT |
 				      EXT4_EX_NOCACHE);
@@ -927,17 +1012,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			return -ECANCELED;
 
 		if (map.m_len == 0) {
-			cur_lblk_off++;
+			cur_lblk++;
 			continue;
 		}
 
+		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (!range)
+			return -ENOMEM;
+
+		range->lblk = map.m_lblk;
+		range->len = map.m_len;
+		range->pblk = 0;
+		range->unwritten = false;
+
 		if (ret == 0) {
-			lrange.fc_ino = cpu_to_le32(inode->i_ino);
-			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
-			lrange.fc_len = cpu_to_le32(map.m_len);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
-					    sizeof(lrange), (u8 *)&lrange, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_DEL_RANGE;
 		} else {
 			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
 				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
@@ -945,26 +1034,67 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			/* Limit the number of blocks in one extent */
 			map.m_len = min(max, map.m_len);
 
-			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
-			ex = (struct ext4_extent *)&fc_ext.fc_ex;
-			ex->ee_block = cpu_to_le32(map.m_lblk);
-			ex->ee_len = cpu_to_le16(map.m_len);
-			ext4_ext_store_pblock(ex, map.m_pblk);
-			if (map.m_flags & EXT4_MAP_UNWRITTEN)
-				ext4_ext_mark_unwritten(ex);
-			else
-				ext4_ext_mark_initialized(ex);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
-					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_ADD_RANGE;
+			range->len = map.m_len;
+			range->pblk = map.m_pblk;
+			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
 		}
 
-		cur_lblk_off += map.m_len;
+		INIT_LIST_HEAD(&range->list);
+		list_add_tail(&range->list, ranges);
+
+		cur_lblk += map.m_len;
 	}
 
 	return 0;
 }
 
+static int ext4_fc_snapshot_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap;
+	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+	struct ext4_iloc iloc;
+	LIST_HEAD(ranges);
+	int ret;
+	int alloc_ctx;
+
+	ret = ext4_get_inode_loc_noio(inode, &iloc);
+	if (ret)
+		return ret;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
+		inode_len = EXT4_INODE_SIZE(inode->i_sb);
+	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+		inode_len += ei->i_extra_isize;
+
+	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
+	if (!snap) {
+		brelse(iloc.bh);
+		return -ENOMEM;
+	}
+	INIT_LIST_HEAD(&snap->data_list);
+	snap->inode_len = inode_len;
+
+	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	brelse(iloc.bh);
+
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	if (ret) {
+		kfree(snap);
+		ext4_fc_free_ranges(&ranges);
+		return ret;
+	}
+
+	alloc_ctx = ext4_fc_lock(inode->i_sb);
+	ext4_fc_free_inode_snap(inode);
+	ei->i_fc_snap = snap;
+	list_splice_tail_init(&ranges, &snap->data_list);
+	ext4_fc_unlock(inode->i_sb, alloc_ctx);
+
+	return 0;
+}
+
 
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
@@ -1015,6 +1145,11 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 		 */
 		if (list_empty(&fc_dentry->fcd_dilist))
 			continue;
+		/*
+		 * For EXT4_FC_TAG_CREAT, fcd_dilist is linked on the created
+		 * inode's i_fc_dilist list (kept singular), so we can recover the
+		 * inode through it.
+		 */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				struct ext4_inode_info, i_fc_dilist);
 		inode = &ei->vfs_inode;
@@ -1039,6 +1174,88 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
+static int ext4_fc_snapshot_inodes(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct inode **inodes;
+	unsigned int nr_inodes = 0;
+	unsigned int i = 0;
+	int ret = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	if (!nr_inodes)
+		return 0;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		inodes[i] = igrab(&iter->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		inodes[i] = igrab(&ei->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+		if (ret)
+			break;
+	}
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		if (inodes[nr_inodes])
+			iput(inodes[nr_inodes]);
+	}
+	kvfree(inodes);
+	return ret;
+}
+
 static int ext4_fc_perform_commit(journal_t *journal)
 {
 	struct super_block *sb = journal->j_private;
@@ -1111,7 +1328,11 @@ static int ext4_fc_perform_commit(journal_t *journal)
 				     EXT4_STATE_FC_COMMITTING);
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
+
+	ret = ext4_fc_snapshot_inodes(journal);
 	jbd2_journal_unlock_updates(journal);
+	if (ret)
+		return ret;
 
 	/*
 	 * Step 5: If file system device is different from journal device,
@@ -1308,6 +1529,7 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
+		ext4_fc_free_inode_snap(&ei->vfs_inode);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (tid_geq(tid, ei->i_sync_tid)) {
@@ -1343,6 +1565,14 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					     struct ext4_fc_dentry_update,
 					     fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
+		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
+		    !list_empty(&fc_dentry->fcd_dilist)) {
+			/* See the comment in ext4_fc_commit_dentry_updates(). */
+			ei = list_first_entry(&fc_dentry->fcd_dilist,
+					      struct ext4_inode_info,
+					      i_fc_dilist);
+			ext4_fc_free_inode_snap(&ei->vfs_inode);
+		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
 		release_dentry_name_snapshot(&fc_dentry->fcd_name);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a1c81ffdca2b9..385ff112d405e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4969,6 +4969,57 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 	return ret;
 }
 
+/*
+ * ext4_get_inode_loc_noio() is a best-effort variant of ext4_get_inode_loc().
+ * It looks up the inode table block in the buffer cache and returns -EAGAIN if
+ * the block is not present or not uptodate, without starting any I/O.
+ */
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *bh;
+	ext4_fsblk_t block;
+	int inodes_per_block, inode_offset;
+	unsigned long ino = inode->i_ino;
+
+	iloc->bh = NULL;
+	if (ino < EXT4_ROOT_INO ||
+	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+		return -EFSCORRUPTED;
+
+	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/* Figure out the offset within the block group inode table. */
+	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	inode_offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb));
+	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+
+	block = ext4_inode_table(sb, gdp);
+	if (block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) ||
+	    block >= ext4_blocks_count(EXT4_SB(sb)->s_es)) {
+		ext4_error(sb,
+			   "Invalid inode table block %llu in block_group %u",
+			   block, iloc->block_group);
+		return -EFSCORRUPTED;
+	}
+	block += inode_offset / inodes_per_block;
+
+	bh = sb_find_get_block(sb, block);
+	if (!bh)
+		return -EAGAIN;
+	if (!ext4_buffer_uptodate(bh)) {
+		brelse(bh);
+		return -EAGAIN;
+	}
+
+	iloc->bh = bh;
+	return 0;
+}
+
 
 int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc)
-- 
2.53.0



* [RFC v5 0/7] ext4: fast commit: snapshot inode state for FC log
From: Li Chen @ 2026-03-17  8:46 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4,
	linux-trace-kernel, linux-kernel

Hi,

(This RFC v5 series is based on linux-next tag next-20260106 plus the
prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe":
https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)

In the RFC v3 review, Zhang Yi pointed out that postponing lockdep assertions
only masks the issue, and that sleeping in ext4_fc_track_inode() while holding
i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
i_data_sem while the inode is in FC_COMMITTING.

Zhang Yi suggested two possible directions to address the root cause:

1. "Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.

  Change ext4_fc_commit() to lookup logical to physical mapping using extent
  status tree. This would get rid of the need to call ext4_fc_track_inode()
  before acquiring i_data_sem. To do that we would need to ensure that
  modified extents from the extent status tree are not evicted from memory."

2. "Alternatively, recording the mapped range of tracking might also be
feasible."

This series takes a hybrid approach: it implements approach 2 by snapshotting
the inode image and mapped ranges at commit time, and consuming only snapshots
during log writing.

Approach 2 still needs a mapping source while building the snapshot
(logical-to-physical and unwritten/hole semantics). Calling ext4_map_blocks()
there would take i_data_sem and can block inside the
jbd2_journal_lock_updates() window, which risks deadlocks or unbounded stalls.
So the snapshot path uses approach 1's extent status lookups as a best-effort
mapping source to avoid ext4_map_blocks().

I did not fully implement approach 1 (making extent status lookups
authoritative by preventing reclaim of needed entries) because that would need
additional pinning/integration under memory pressure and a larger correctness
surface. Instead, the extent status tree is treated as a cache and the
snapshot path falls back to full commit on cache misses or unstable mappings
(e.g. delayed allocation).

Lock inversion / deadlock model (before):

CPU0 (metadata update)               CPU1 (fast commit)
--------------------               -----------------
... hold i_data_sem (A)             mutex_lock(s_fc_lock) (B)
    ext4_fc_track_inode()             ext4_fc_write_inode_data()
      mutex_lock(s_fc_lock) (B)         ext4_map_blocks()
      wait FC_COMMITTING (sleep)          down_read(i_data_sem) (A)

This creates i_data_sem (A) -> s_fc_lock (B) on update paths, and
s_fc_lock (B) -> i_data_sem (A) on commit paths. Once CPU0 sleeps while
holding (A), CPU1 can block on (A) while holding (B), completing the ABBA
cycle.

New model (this series):

CPU0 (metadata update)                  CPU1 (fast commit)
--------------------                    -----------------
... maybe hold i_data_sem (A)           jbd2_journal_lock_updates()
    ext4_fc_track_*()                   snapshot inode + ranges (no map_blocks)
      mutex_lock(s_fc_lock) (B)         jbd2_journal_unlock_updates()
      if FC_COMMITTING: set FC_REQUEUE  s_fc_lock (B)
      no sleep                          write FC log from snapshots only
                                        cleanup: clear COMMITTING, requeue if set

The commit path no longer takes i_data_sem while holding s_fc_lock, and
tracking no longer sleeps waiting for FC_COMMITTING. If an inode is updated
during a fast commit, EXT4_STATE_FC_REQUEUE records that fact and the inode
is moved to FC_Q_STAGING for the next commit.
The only remaining FC_COMMITTING waiter is ext4_fc_del(), which drops
s_fc_lock before sleeping.
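
As a rough illustration of the no-sleep tracking rule described above, the
following userspace C sketch models the two state bits on a toy inode. The
names toy_inode, toy_commit_begin(), toy_track() and toy_cleanup() are
hypothetical and are not functions from this series; locking is omitted and
only the flag transitions are shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for EXT4_STATE_FC_COMMITTING / EXT4_STATE_FC_REQUEUE. */
enum {
	TOY_FC_COMMITTING = 1u << 0,
	TOY_FC_REQUEUE    = 1u << 1,
};

struct toy_inode {
	unsigned int state;
	bool queued;	/* waiting on the queue for the next fast commit */
};

/* Commit side: take the inode off the queue and mark it as being committed. */
static void toy_commit_begin(struct toy_inode *ti)
{
	ti->state |= TOY_FC_COMMITTING;
	ti->queued = false;
}

/* Update side: never sleeps; a racing update is only recorded via REQUEUE. */
static void toy_track(struct toy_inode *ti)
{
	if (ti->state & TOY_FC_COMMITTING)
		ti->state |= TOY_FC_REQUEUE;
	else
		ti->queued = true;
}

/* Cleanup: clear COMMITTING and requeue if an update raced with the commit. */
static void toy_cleanup(struct toy_inode *ti)
{
	assert(ti->state & TOY_FC_COMMITTING);
	ti->state &= ~TOY_FC_COMMITTING;
	if (ti->state & TOY_FC_REQUEUE) {
		ti->state &= ~TOY_FC_REQUEUE;
		ti->queued = true;
	}
}
```

In this toy model an update that arrives between toy_commit_begin() and
toy_cleanup() is not lost: it leaves the inode queued for the next commit.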

This series snapshots the on-disk inode and tracked data ranges while journal
updates are locked and existing handles are drained. The log writing phase then
serializes only snapshots, so it no longer needs to call ext4_map_blocks() and
take i_data_sem under s_fc_lock. This is done in two steps: patch 1 drops
ext4_map_blocks() from log writing by introducing commit-time snapshots, and
patch 5 drops ext4_map_blocks() from the snapshot path by using the extent
status cache. The snapshot also records whether a mapped extent is unwritten,
so the ADD_RANGE records (and replay) preserve unwritten semantics.

Snapshotting runs under jbd2_journal_lock_updates(). Since a cache miss in
ext4_get_inode_loc() can start synchronous inode table I/O and stall handle
starts for milliseconds, patch 1 uses ext4_get_inode_loc_noio() and falls back
to full commit if the inode table block is not present or not uptodate.
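
The "look up, but never start I/O" policy can be sketched in userspace C as
below. This is only a model of the idea: toy_buf, toy_lookup_noio() and
toy_needs_full_commit() are hypothetical names, and the array stands in for
the buffer cache. The -EAGAIN return mirrors what the comment on
ext4_get_inode_loc_noio() in patch 1 describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy buffer cache entry; not a kernel structure. */
struct toy_buf {
	unsigned long block;
	bool uptodate;
};

/*
 * Look up @block in @cache without starting any I/O.
 * Returns 0 and sets *out on a hit, or -EAGAIN if the block is absent or
 * present but not uptodate.
 */
static int toy_lookup_noio(const struct toy_buf *cache, size_t n,
			   unsigned long block, const struct toy_buf **out)
{
	size_t i;

	assert(out);
	*out = NULL;
	for (i = 0; i < n; i++) {
		if (cache[i].block != block)
			continue;
		if (!cache[i].uptodate)
			return -EAGAIN;
		*out = &cache[i];
		return 0;
	}
	return -EAGAIN;
}

/* Caller policy in this sketch: any snapshot error means a full commit. */
static bool toy_needs_full_commit(int snap_err)
{
	return snap_err != 0;
}
```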

ext4_fc_track_inode() also stops waiting for FC_COMMITTING. Updates during an
ongoing fast commit are marked with EXT4_STATE_FC_REQUEUE and are replayed in
the next fast commit, while ext4_fc_del() waits for FC_COMMITTING so an inode
cannot be removed while the commit thread is still using it.

The extent status tree is a cache, not an authoritative source, so the snapshot
path falls back to full commit on cache misses or unstable mappings (e.g.
delayed allocation). This includes cases where extent status entries are not
present (or have been reclaimed) under memory pressure. The snapshot path does
not try to rebuild mappings by calling ext4_map_blocks(); instead it simply
marks the transaction fast commit ineligible.

To keep the updates-locked window bounded, the snapshot path caps the number of
snapshotted inodes and ranges per fast commit (currently 1024 inodes and 2048
ranges) and falls back to full commit when the cap is exceeded. The series also
handles the journal inode i_data_sem lockdep false positive via subclassing;
journal inode mapping may still take i_data_sem even when data inode mapping is
avoided.
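
The capping logic can be pictured with the small userspace C sketch below.
The two limits are the values quoted above; the macro names, the
toy_snap_budget structure, toy_snap_charge() and the choice of -E2BIG as the
"fall back to full commit" signal are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Caps quoted in the cover letter; the macro names are hypothetical. */
#define TOY_FC_MAX_SNAP_INODES 1024u
#define TOY_FC_MAX_SNAP_RANGES 2048u

/* Running totals for one fast commit. */
struct toy_snap_budget {
	unsigned int inodes;
	unsigned int ranges;
};

/*
 * Account for one more inode carrying @nr_ranges ranges.
 * Returns 0 while within both caps, -E2BIG once either would be exceeded.
 */
static int toy_snap_charge(struct toy_snap_budget *b, unsigned int nr_ranges)
{
	assert(b);
	if (b->inodes >= TOY_FC_MAX_SNAP_INODES)
		return -E2BIG;
	/* b->ranges never exceeds the cap, so the subtraction cannot wrap. */
	if (nr_ranges > TOY_FC_MAX_SNAP_RANGES - b->ranges)
		return -E2BIG;
	b->inodes++;
	b->ranges += nr_ranges;
	return 0;
}
```

Checking the budget before doing any work keeps the cost of a rejected
commit small, which is the point of bounding the updates-locked window.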

Patch 6 adds the ext4_fc_lock_updates tracepoint to quantify the updates-locked
window and snapshot fallback reasons. Patch 7 extends
/proc/fs/ext4/<sb_id>/fc_info with best-effort snapshot counters. If the /proc
interface is undesirable, I can drop patch 7 and keep only the tracepoint, or
even drop both.

Testing and measurement were done on a QEMU/KVM guest with virtio-pmem + dax
(ext4 -O fast_commit, mounted dax,noatime). The workload does python3 500x
{4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)}.
Over 3 cold boots, ext4_fc_lock_updates reported locked_ns p50 2.88-2.92 us,
p99 <= 6.71 us, and max <= 102.71 us, with snap_err always 0. Under stress-ng
memory pressure (stress-ng --vm 4 --vm-bytes 75% --timeout 60s), locked_ns p50
2.94 us, p99 <= 4.97 us, and max <= 20.07 us. The fc_info snapshot failure
counters stayed at 0.
These hold times are in the low microseconds range, and the caps keep the
worst case bounded.

Comments and guidance are very welcome. Please let me know if there are any
concerns about correctness, corner cases, or better approaches.

RFC v4 -> RFC v5:
- Patch 6: Make ext4_fc_lock_updates snap_err human readable via
  TRACE_DEFINE_ENUM() + __print_symbolic(), using a single TRACE_SNAP_ERR
  mapping while keeping the enum values stable for tooling.

RFC v3 -> RFC v4:
- Replace lockdep_assert movement with removing the wait in
  ext4_fc_track_inode() and using EXT4_STATE_FC_REQUEUE to capture updates
  during an ongoing fast commit.
- Replace dropping s_fc_lock around log writing with commit-time snapshots of
  inode image and mapped ranges (recording the mapped range of tracking as
  suggested by Zhang Yi) so log writing consumes only snapshots.
- Avoid inode table I/O under jbd2_journal_lock_updates() via
  ext4_get_inode_loc_noio() and fallback to full commit on cache misses.
- Use the extent status cache for snapshot mappings and fall back to full
  commit on cache misses or unstable mappings (e.g. delayed allocation).
- Add tracepoint and /proc snapshot stats to quantify the updates-locked window
  and snapshot fallback reasons.

RFC v2 -> RFC v3:
- rebase on top of
  https://lore.kernel.org/linux-ext4/20251223131342.287864-1-me@linux.beauty/T/#u

RFC v1 -> RFC v2:
- patch 1: move comments to correct place
- patch 2: add it to patchset.
- add missing RFC prefix

RFC v1: https://lore.kernel.org/linux-ext4/20251222032655.87056-1-me@linux.beauty/T/#u
RFC v2: https://lore.kernel.org/linux-ext4/20251222151906.24607-1-me@linux.beauty/T/#t
RFC v3: https://lore.kernel.org/linux-ext4/20251224032943.134063-1-me@linux.beauty/
RFC v4: https://lore.kernel.org/all/20260120112538.132774-1-me@linux.beauty/t/#m9a6c8f2391c6dc67471e918a0577b130e7633e49

Thanks,
Li Chen (7):
  ext4: fast commit: snapshot inode state before writing log
  ext4: lockdep: handle i_data_sem subclassing for special inodes
  ext4: fast commit: avoid waiting for FC_COMMITTING
  ext4: fast commit: avoid self-deadlock in inode snapshotting
  ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in
    snapshots
  ext4: fast commit: add lock_updates tracepoint
  ext4: fast commit: export snapshot stats in fc_info

 fs/ext4/ext4.h              |  73 +++-
 fs/ext4/fast_commit.c       | 703 +++++++++++++++++++++++++++++-------
 fs/ext4/inode.c             |  51 +++
 fs/ext4/super.c             |   9 +
 include/trace/events/ext4.h |  61 ++++
 5 files changed, 763 insertions(+), 134 deletions(-)

-- 
2.53.0


* Re: [PATCH v6 16/17] lib/bootconfig: fix sign-compare in xbc_node_compose_key_after()
From: Masami Hiramatsu @ 2026-03-17  7:55 UTC (permalink / raw)
  To: Josh Law; +Cc: Andrew Morton, Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260315122015.55965-17-objecting@objecting.org>

On Sun, 15 Mar 2026 12:20:14 +0000
Josh Law <objecting@objecting.org> wrote:

>   lib/bootconfig.c:322:25: warning: comparison of integer expressions
>   of different signedness: 'int' and 'size_t' [-Wsign-compare]
>   lib/bootconfig.c:325:30: warning: conversion to 'size_t' from 'int'
>   may change the sign of the result [-Wsign-conversion]
> 
> snprintf() returns int but size is size_t, so comparing ret >= size
> and subtracting size -= ret involve mixed-sign operations.  Cast ret
> at the comparison and subtraction sites; ret is known non-negative at
> this point because the ret < 0 early return has already been taken.
> 
> Signed-off-by: Josh Law <objecting@objecting.org>
> ---
>  lib/bootconfig.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index e318b236e728..68a72dbc38fa 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -319,10 +319,10 @@ int __init xbc_node_compose_key_after(struct xbc_node *root,
>  			       depth ? "." : "");
>  		if (ret < 0)
>  			return ret;
> -		if (ret >= size) {
> +		if (ret >= (int)size) {

nit:

	if ((size_t)ret >= size) {

because sizeof(size_t) > sizeof(int).

Thanks,

>  			size = 0;
>  		} else {
> -			size -= ret;
> +			size -= (size_t)ret;
>  			buf += ret;
>  		}
>  		total += ret;
> -- 
> 2.34.1
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
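
The cast direction suggested in the review above can be tried out in a small
userspace C sketch. toy_compose() is a hypothetical helper, not the kernel's
xbc_node_compose_key_after(); it only reproduces the "snprintf() into a
shrinking buffer" pattern. On the platforms Linux supports, a non-negative
int always fits in size_t, while a size_t larger than INT_MAX would not
survive conversion to int, which is why widening ret is the safer choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Append @nwords words to @buf, separated by '.'.
 * Returns the total length that would have been written (as snprintf() does),
 * or a negative value on an snprintf() error.
 */
static int toy_compose(char *buf, size_t size,
		       const char *const *words, int nwords)
{
	int i, ret, total = 0;

	assert(words);
	for (i = 0; i < nwords; i++) {
		ret = snprintf(buf, size, "%s%s", words[i],
			       i + 1 < nwords ? "." : "");
		if (ret < 0)
			return ret;
		/* ret is non-negative here, so widening it to size_t is safe. */
		if ((size_t)ret >= size) {
			size = 0;	/* truncated: later calls only count */
		} else {
			size -= (size_t)ret;
			buf += ret;
		}
		total += ret;
	}
	return total;
}
```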


* Re: [PATCH v6 11/17] tools/bootconfig: fix fd leak in load_xbc_file() on fstat failure
From: Josh Law @ 2026-03-17  7:34 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andrew Morton, Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260317163151.264b0617484d202daba85f0f@kernel.org>



On 17 March 2026 07:31:51 GMT, Masami Hiramatsu <mhiramat@kernel.org> wrote:
>On Sun, 15 Mar 2026 12:20:09 +0000
>Josh Law <objecting@objecting.org> wrote:
>
>> If fstat() fails after open() succeeds, load_xbc_file() returns
>> -errno without closing the file descriptor.  Add the missing close()
>> call on the error path.
>> 
>> Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
>> Signed-off-by: Josh Law <objecting@objecting.org>
>> ---
>>  tools/bootconfig/main.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
>> index 55d59ed507d5..8078fee0b75b 100644
>> --- a/tools/bootconfig/main.c
>> +++ b/tools/bootconfig/main.c
>> @@ -162,8 +162,10 @@ static int load_xbc_file(const char *path, char **buf)
>>  	if (fd < 0)
>>  		return -errno;
>>  	ret = fstat(fd, &stat);
>> -	if (ret < 0)
>> +	if (ret < 0) {
>> +		close(fd);
>>  		return -errno;
>
>Sashiko.dev[1] found that close() will overwrite errno. So please make it
>
>	ret = -errno;
>	close(fd);
>	return ret;
>
>[1] https://sashiko.dev/#/patchset/20260315122015.55965-1-objecting%40objecting.org
>
>Thanks,
>
>> +	}
>>  
>>  	ret = load_xbc_fd(fd, buf, stat.st_size);
>>  
>> -- 
>> 2.34.1
>> 
>
>


I'm at the computer now and cleaning everything up as requested; expect a new patch soon.

V/R

Josh Law


* Re: [PATCH v6 01/17] lib/bootconfig: add missing __init annotations to static helpers
From: Masami Hiramatsu @ 2026-03-17  7:33 UTC (permalink / raw)
  To: Josh Law; +Cc: Andrew Morton, Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260315122015.55965-2-objecting@objecting.org>

On Sun, 15 Mar 2026 12:19:59 +0000
Josh Law <objecting@objecting.org> wrote:

> skip_comment() and skip_spaces_until_newline() are static functions
> called exclusively from __init code paths but lack the __init
> annotation themselves. Add it so their memory can be reclaimed after
> init.
> 
> Signed-off-by: Josh Law <objecting@objecting.org>
> ---
>  lib/bootconfig.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index b0ef1e74e98a..51fd2299ec0f 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -509,7 +509,7 @@ static inline __init bool xbc_valid_keyword(char *key)
>  	return *key == '\0';
>  }
>  
> -static char *skip_comment(char *p)
> +static char __init *skip_comment(char *p)

static __init char *skip_comment()

__init attribute is not for char but the function itself.


>  {
>  	char *ret;
>  
> @@ -522,7 +522,7 @@ static char *skip_comment(char *p)
>  	return ret;
>  }
>  
> -static char *skip_spaces_until_newline(char *p)
> +static char __init *skip_spaces_until_newline(char *p)

Ditto.

>  {
>  	while (isspace(*p) && *p != '\n')
>  		p++;
> -- 
> 2.34.1
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
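
The attribute placement requested in the review above can be sketched in
userspace C. The toy_init macro is a hypothetical stand-in (the kernel's
__init expands to a section attribute among others), and the function body
is an assumption reconstructed for illustration, since the quoted hunk shows
only its first and last lines.

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-in for the kernel's __init; the name is hypothetical. */
#define toy_init __attribute__((__cold__))

/*
 * The attribute sits between "static" and the return type, so it reads as
 * a property of the function rather than of "char".
 */
static toy_init char *skip_comment(char *p)
{
	char *ret;

	assert(p);
	/* Jump past the end of the current line, or to the terminating NUL. */
	ret = strchr(p, '\n');
	if (!ret)
		ret = p + strlen(p);
	else
		ret++;

	return ret;
}
```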


* Re: [PATCH v6 11/17] tools/bootconfig: fix fd leak in load_xbc_file() on fstat failure
From: Masami Hiramatsu @ 2026-03-17  7:31 UTC (permalink / raw)
  To: Josh Law; +Cc: Andrew Morton, Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260315122015.55965-12-objecting@objecting.org>

On Sun, 15 Mar 2026 12:20:09 +0000
Josh Law <objecting@objecting.org> wrote:

> If fstat() fails after open() succeeds, load_xbc_file() returns
> -errno without closing the file descriptor.  Add the missing close()
> call on the error path.
> 
> Fixes: 950313ebf79c ("tools: bootconfig: Add bootconfig command")
> Signed-off-by: Josh Law <objecting@objecting.org>
> ---
>  tools/bootconfig/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/bootconfig/main.c b/tools/bootconfig/main.c
> index 55d59ed507d5..8078fee0b75b 100644
> --- a/tools/bootconfig/main.c
> +++ b/tools/bootconfig/main.c
> @@ -162,8 +162,10 @@ static int load_xbc_file(const char *path, char **buf)
>  	if (fd < 0)
>  		return -errno;
>  	ret = fstat(fd, &stat);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		close(fd);
>  		return -errno;

Sashiko.dev[1] found that close() will overwrite errno. So please make it

	ret = -errno;
	close(fd);
	return ret;

[1] https://sashiko.dev/#/patchset/20260315122015.55965-1-objecting%40objecting.org

Thanks,

> +	}
>  
>  	ret = load_xbc_fd(fd, buf, stat.st_size);
>  
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
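
The "save errno before close()" pattern requested above can be shown in a
self-contained userspace C sketch. toy_file_size() is a hypothetical helper,
not load_xbc_file() itself; it only demonstrates capturing errno into a local
before calling close(), which may modify errno.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store the size of @path in *size.
 * Returns 0 on success or a negative errno value on failure.
 */
static int toy_file_size(const char *path, off_t *size)
{
	struct stat st;
	int fd, ret;

	assert(path && size);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (fstat(fd, &st) < 0) {
		/* Capture errno first: close() may change it. */
		ret = -errno;
		close(fd);
		return ret;
	}

	*size = st.st_size;
	close(fd);
	return 0;
}
```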


* Re: [PATCHv3 bpf-next 24/24] selftests/bpf: Add tracing multi attach rollback tests
From: Leon Hwang @ 2026-03-17  3:20 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260316075138.465430-25-jolsa@kernel.org>

On 16/3/26 15:51, Jiri Olsa wrote:
> Adding tests for the rollback code when the tracing_multi
> link won't get attached, covering 2 reasons:
> 
>   - wrong btf id passed by user, where all previously allocated
>     trampolines will be released
>   - trampoline for requested function is fully attached (has already
>     maximum programs attached) and the link fails, the rollback code
>     needs to release all previously link-ed trampolines and release
>     them
> 
> We need the bpf_fentry_test* unattached for the tests to pass,
> so the rollback tests are serial.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/tracing_multi.c  | 181 ++++++++++++++++++
>  .../bpf/progs/tracing_multi_rollback.c        |  38 ++++
>  2 files changed, 219 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_rollback.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> index a0fcda51bb6c..10b8cc6b368b 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -10,6 +10,7 @@
>  #include "tracing_multi_session.skel.h"
>  #include "tracing_multi_fail.skel.h"
>  #include "tracing_multi_bench.skel.h"
> +#include "tracing_multi_rollback.skel.h"
>  #include "trace_helpers.h"
>  
>  static __u64 bpf_fentry_test_cookies[] = {
> @@ -649,6 +650,186 @@ void serial_test_tracing_multi_bench_attach(void)
>  	free(ids);
>  }
>  
> +static void tracing_multi_rollback_run(struct tracing_multi_rollback *skel)
> +{
> +	LIBBPF_OPTS(bpf_test_run_opts, topts);
> +	int err, prog_fd;
> +
> +	prog_fd = bpf_program__fd(skel->progs.test_fentry);
> +	err = bpf_prog_test_run_opts(prog_fd, &topts);
> +	ASSERT_OK(err, "test_run");
> +
> +	/* make sure the rollback code did not leave any program attached */
> +	ASSERT_EQ(skel->bss->test_result_fentry, 0, "test_result_fentry");
> +	ASSERT_EQ(skel->bss->test_result_fexit, 0, "test_result_fexit");
> +}
> +
> +static void test_rollback_put(void)
> +{
> +	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> +	struct tracing_multi_rollback *skel = NULL;
> +	size_t cnt = FUNCS_CNT;
> +	__u32 *ids = NULL;
> +	int err;
> +
> +	skel = tracing_multi_rollback__open();
> +	if (!ASSERT_OK_PTR(skel, "tracing_multi_rollback__open"))
> +		return;
> +
> +	bpf_program__set_autoload(skel->progs.test_fentry, true);
> +	bpf_program__set_autoload(skel->progs.test_fexit, true);
> +
> +	err = tracing_multi_rollback__load(skel);
> +	if (!ASSERT_OK(err, "tracing_multi_rollback__load"))
> +		goto cleanup;
> +
> +	ids = get_ids(bpf_fentry_test, cnt, NULL);
> +	if (!ASSERT_OK_PTR(ids, "get_ids"))
> +		goto cleanup;
> +
> +	/*
> +	 * Mangle last id to trigger rollback, which needs to do put
> +	 * on get-ed trampolines.
> +	 */
> +	ids[9] = 0;
> +
> +	opts.ids = ids;
> +	opts.cnt = cnt;
> +
> +	skel->bss->pid = getpid();
> +
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						NULL, &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	skel->links.test_fexit = bpf_program__attach_tracing_multi(skel->progs.test_fexit,
> +						NULL, &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fexit, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	/* We don't really attach any program, but let's make sure. */
> +	tracing_multi_rollback_run(skel);
> +
> +cleanup:
> +	tracing_multi_rollback__destroy(skel);
> +	free(ids);
> +}
> +
> +
> +static void fillers_cleanup(struct tracing_multi_rollback **skels, int cnt)
> +{
> +	int i;
> +
> +	for (i = 0; i < cnt; i++)
> +		tracing_multi_rollback__destroy(skels[i]);
> +
> +	free(skels);
> +}
> +
> +static struct tracing_multi_rollback **fillers_load_and_link(int max)
> +{
> +	struct tracing_multi_rollback **skels, *skel;
> +	int i, err;
> +
> +	skels = calloc(max + 1, sizeof(*skels));
> +	if (!ASSERT_OK_PTR(skels, "calloc"))
> +		return NULL;
> +
> +	for (i = 0; i < max; i++) {
> +		skel = skels[i] = tracing_multi_rollback__open();
> +		if (!ASSERT_OK_PTR(skels[i], "tracing_multi_rollback__open"))
> +			goto cleanup;
> +
> +		bpf_program__set_autoload(skel->progs.filler, true);
> +
> +		err = tracing_multi_rollback__load(skel);
> +		if (!ASSERT_OK(err, "tracing_multi_rollback__load"))
> +			goto cleanup;
> +
> +		skel->links.filler = bpf_program__attach_trace(skel->progs.filler);
> +		if (!ASSERT_OK_PTR(skels[i]->links.filler, "bpf_program__attach_trace"))
> +			goto cleanup;
> +	}
> +
> +	return skels;
> +
> +cleanup:
> +	fillers_cleanup(skels, i);
> +	return NULL;
> +}
> +
> +static void test_rollback_unlink(void)
> +{
> +	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> +	struct tracing_multi_rollback **fillers;
> +	struct tracing_multi_rollback *skel;
> +	size_t cnt = FUNCS_CNT;
> +	__u32 *ids = NULL;
> +	int err, max;
> +
> +	max = get_bpf_max_tramp_links();
> +	if (!ASSERT_GE(max, 1, "bpf_max_tramp_links"))
> +		return;
> +
> +	/* Attach maximum allowed programs to bpf_fentry_test10 */
> +	fillers = fillers_load_and_link(max);
> +	if (!ASSERT_OK_PTR(fillers, "fillers_load_and_link"))
> +		return;
> +
> +	skel = tracing_multi_rollback__open();
> +	if (!ASSERT_OK_PTR(skel, "tracing_multi_rollback__open"))
> +		goto cleanup;
> +
> +	bpf_program__set_autoload(skel->progs.test_fentry, true);
> +	bpf_program__set_autoload(skel->progs.test_fexit, true);
> +
> +	/*
> +	 * Attach tracing_multi link on bpf_fentry_test1-10, which will
> +	 * fail on bpf_fentry_test10 function, because it already has
> +	 * maximum allowed programs attached.
> +	 *
> +	 * The rollback needs to unlink already link-ed trampolines and
> +	 * put all of them.
> +	 */
> +	err = tracing_multi_rollback__load(skel);
> +	if (!ASSERT_OK(err, "tracing_multi_rollback__load"))
> +		goto cleanup;
> +
> +	ids = get_ids(bpf_fentry_test, cnt, NULL);
> +	if (!ASSERT_OK_PTR(ids, "get_ids"))
> +		goto cleanup;
> +
> +	opts.ids = ids;
> +	opts.cnt = cnt;
> +
> +	skel->bss->pid = getpid();
> +
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						NULL, &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	skel->links.test_fexit = bpf_program__attach_tracing_multi(skel->progs.test_fexit,
> +						NULL, &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fexit, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	tracing_multi_rollback_run(skel);
> +
> +cleanup:
	Is tracing_multi_rollback__destroy(skel); missing here? skel does
	not seem to be destroyed.
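
As a rough sketch of the shape this usually takes (not the selftest
itself): struct res, res_open() and res_destroy() below are hypothetical
stand-ins for the generated skeleton type and its open/destroy
functions. Every object acquired in the function is released under the
single cleanup label, and the destroy helper is NULL-safe so it can be
called unconditionally.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource standing in for a generated skeleton. */
struct res { int id; };

static int live_resources;	/* opened and not yet destroyed */

static struct res *res_open(int id)
{
	struct res *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->id = id;
	live_resources++;
	return r;
}

/* NULL-safe, so the cleanup label can call it unconditionally. */
static void res_destroy(struct res *r)
{
	if (!r)
		return;
	assert(live_resources > 0);
	live_resources--;
	free(r);
}

/* Returns 0 on success, -1 on allocation or simulated failure. */
static int run_test(int fail_midway)
{
	struct res *fillers = NULL, *skel = NULL;
	int ret = -1;

	fillers = res_open(1);
	if (!fillers)
		goto cleanup;

	skel = res_open(2);
	if (!skel)
		goto cleanup;

	if (fail_midway)
		goto cleanup;

	ret = 0;
cleanup:
	/* Everything acquired above is released on every path. */
	res_destroy(skel);
	res_destroy(fillers);
	return ret;
}
```

With this layout the counter drops back to zero on both the success and
the failure path, which is the property the missing destroy call breaks.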

Thanks,
Leon

> +	fillers_cleanup(fillers, max);
> +	free(ids);
> +}
> +
> +void serial_test_tracing_multi_attach_rollback(void)
> +{
> +	if (test__start_subtest("put"))
> +		test_rollback_put();
> +	if (test__start_subtest("unlink"))
> +		test_rollback_unlink();
> +}
> +
>  void test_tracing_multi_test(void)
>  {
>  #ifndef __x86_64__
> diff --git a/tools/testing/selftests/bpf/progs/tracing_multi_rollback.c b/tools/testing/selftests/bpf/progs/tracing_multi_rollback.c
> new file mode 100644
> index 000000000000..eb27869f551a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tracing_multi_rollback.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +int pid = 0;
> +
> +__u64 test_result_fentry = 0;
> +__u64 test_result_fexit = 0;
> +
> +SEC("?fentry.multi")
> +int BPF_PROG(test_fentry)
> +{
> +	if (bpf_get_current_pid_tgid() >> 32 != pid)
> +		return 0;
> +
> +	test_result_fentry++;
> +	return 0;
> +}
> +
> +SEC("?fexit.multi")
> +int BPF_PROG(test_fexit)
> +{
> +	if (bpf_get_current_pid_tgid() >> 32 != pid)
> +		return 0;
> +
> +	test_result_fexit++;
> +	return 0;
> +}
> +
> +SEC("?fentry/bpf_fentry_test10")
> +int BPF_PROG(filler)
> +{
> +	return 0;
> +}



* Re: [PATCHv3 bpf-next 23/24] selftests/bpf: Add tracing multi attach benchmark test
From: Leon Hwang @ 2026-03-17  3:09 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260316075138.465430-24-jolsa@kernel.org>

On 16/3/26 15:51, Jiri Olsa wrote:
> Adding benchmark test that attaches to (almost) all allowed tracing
> functions and display attach/detach times.
> 
>   # ./test_progs -t tracing_multi_bench_attach -v
>   bpf_testmod.ko is already unloaded.
>   Loading bpf_testmod.ko...
>   Successfully loaded bpf_testmod.ko.
>   serial_test_tracing_multi_bench_attach:PASS:btf__load_vmlinux_btf 0 nsec
>   serial_test_tracing_multi_bench_attach:PASS:tracing_multi_bench__open_and_load 0 nsec
>   serial_test_tracing_multi_bench_attach:PASS:get_syms 0 nsec
>   serial_test_tracing_multi_bench_attach:PASS:bpf_program__attach_tracing_multi 0 nsec
>   serial_test_tracing_multi_bench_attach: found 51186 functions
>   serial_test_tracing_multi_bench_attach: attached in   1.295s
>   serial_test_tracing_multi_bench_attach: detached in   0.243s
>   #507     tracing_multi_bench_attach:OK
>   Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>   Successfully unloaded bpf_testmod.ko.
> 
> Exporting skip_entry as is_unsafe_function and usign it in the test.
                                                 ^ using

> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/tracing_multi.c  | 97 +++++++++++++++++++
>  .../selftests/bpf/progs/tracing_multi_bench.c | 13 +++
>  tools/testing/selftests/bpf/trace_helpers.c   |  6 +-
>  tools/testing/selftests/bpf/trace_helpers.h   |  1 +
>  4 files changed, 114 insertions(+), 3 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_bench.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> index 9f4c5af88e21..a0fcda51bb6c 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -9,6 +9,7 @@
>  #include "tracing_multi_intersect.skel.h"
>  #include "tracing_multi_session.skel.h"
>  #include "tracing_multi_fail.skel.h"
> +#include "tracing_multi_bench.skel.h"
>  #include "trace_helpers.h"
>  
>  static __u64 bpf_fentry_test_cookies[] = {
> @@ -552,6 +553,102 @@ static void test_attach_api_fails(void)
>  	tracing_multi_fail__destroy(skel);
>  }
>  
> +void serial_test_tracing_multi_bench_attach(void)
> +{
> +	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> +	struct tracing_multi_bench *skel = NULL;
> +	long attach_start_ns, attach_end_ns;
> +	long detach_start_ns, detach_end_ns;
> +	double attach_delta, detach_delta;
> +	struct bpf_link *link = NULL;
> +	size_t i, cap = 0, cnt = 0;
> +	struct ksyms *ksyms = NULL;
> +	void *root = NULL;
> +	__u32 *ids = NULL;
> +	__u32 nr, type_id;
> +	struct btf *btf;
> +	int err;
> +
> +#ifndef __x86_64__
> +	test__skip();
> +	return;
> +#endif
> +
> +	btf = btf__load_vmlinux_btf();
> +	if (!ASSERT_OK_PTR(btf, "btf__load_vmlinux_btf"))
> +		return;
> +
> +	skel = tracing_multi_bench__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "tracing_multi_bench__open_and_load"))
> +		goto cleanup;
> +
> +	if (!ASSERT_OK(bpf_get_ksyms(&ksyms, true), "get_syms"))
> +		goto cleanup;
> +
> +	/* Get all ftrace 'safe' symbols.. */
> +	for (i = 0; i < ksyms->filtered_cnt; i++) {
> +		if (is_unsafe_function(ksyms->filtered_syms[i]))
> +			continue;
> +		tsearch(&ksyms->filtered_syms[i], &root, compare);
                ^ missing tdestroy() to free tree nodes?
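
A minimal standalone sketch of the tsearch()/tfind()/tdestroy() life
cycle, assuming glibc (tdestroy() is a GNU extension and needs
_GNU_SOURCE). lookup_and_destroy() and free_nothing() are made-up names;
compare() only mirrors the shape of the selftest's helper. The keys are
owned by the caller here, so the per-node callback has nothing to free.

```c
#define _GNU_SOURCE	/* tdestroy() is a GNU extension */
#include <assert.h>
#include <search.h>
#include <string.h>

/* Keys are pointers to strings, as in the selftest's compare(). */
static int compare(const void *ppa, const void *ppb)
{
	const char *pa = *(const char * const *)ppa;
	const char *pb = *(const char * const *)ppb;

	return strcmp(pa, pb);
}

/* The keys are owned elsewhere, so there is nothing to free per node. */
static void free_nothing(void *key)
{
	(void)key;
}

/*
 * Insert every name into a tree, look up 'wanted', then release the
 * tree nodes. Returns 1 if found, 0 if not, -1 on allocation failure.
 */
static int lookup_and_destroy(const char **names, size_t cnt,
			      const char *wanted)
{
	void *root = NULL;
	int found = -1;
	size_t i;

	assert(names && wanted);

	for (i = 0; i < cnt; i++) {
		if (!tsearch(&names[i], &root, compare))
			goto out;
	}
	found = tfind(&wanted, &root, compare) != NULL;
out:
	/* Frees only the nodes tsearch() allocated, not the keys. */
	tdestroy(root, free_nothing);
	return found;
}
```

In the benchmark test the equivalent call would presumably go under the
cleanup label, before free_kallsyms_local(ksyms), since the tree keys
point into ksyms.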

> +	}
> +
> +	/* ..and filter them through BTF and btf_type_is_traceable_func. */
> +	nr = btf__type_cnt(btf);
> +	for (type_id = 1; type_id < nr; type_id++) {
> +		const struct btf_type *type;
> +		const char *str;
> +
> +		type = btf__type_by_id(btf, type_id);
> +		if (!type)
> +			break;
> +
> +		if (BTF_INFO_KIND(type->info) != BTF_KIND_FUNC)
> +			continue;
> +
> +		str = btf__name_by_offset(btf, type->name_off);
> +		if (!str)
> +			break;
> +
> +		if (!tfind(&str, &root, compare))
> +			continue;
> +
> +		if (!btf_type_is_traceable_func(btf, type))
> +			continue;
> +
> +		err = libbpf_ensure_mem((void **) &ids, &cap, sizeof(*ids), cnt + 1);
> +		if (err)
> +			goto cleanup;
> +
> +		ids[cnt++] = type_id;
> +	}
> +
> +	opts.ids = ids;
> +	opts.cnt = cnt;
> +
> +	attach_start_ns = get_time_ns();
> +	link = bpf_program__attach_tracing_multi(skel->progs.bench, NULL, &opts);
> +	attach_end_ns = get_time_ns();
> +
> +	if (!ASSERT_OK_PTR(link, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	detach_start_ns = get_time_ns();
> +	bpf_link__destroy(link);
> +	detach_end_ns = get_time_ns();
> +
> +	attach_delta = (attach_end_ns - attach_start_ns) / 1000000000.0;
> +	detach_delta = (detach_end_ns - detach_start_ns) / 1000000000.0;
> +
> +	printf("%s: found %lu functions\n", __func__, cnt);
> +	printf("%s: attached in %7.3lfs\n", __func__, attach_delta);
> +	printf("%s: detached in %7.3lfs\n", __func__, detach_delta);
> +
> +cleanup:
> +	tracing_multi_bench__destroy(skel);
> +	free_kallsyms_local(ksyms);
> +	free(ids);
> +}
> +
>  void test_tracing_multi_test(void)
>  {
>  #ifndef __x86_64__
> diff --git a/tools/testing/selftests/bpf/progs/tracing_multi_bench.c b/tools/testing/selftests/bpf/progs/tracing_multi_bench.c
> new file mode 100644
> index 000000000000..067ba668489b
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tracing_multi_bench.c
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("fentry.multi")
> +int BPF_PROG(bench)
> +{
> +	return 0;
> +}
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
> index 0e63daf83ed5..3bf600f3271b 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -548,7 +548,7 @@ static const char * const trace_blacklist[] = {
>  	"bpf_get_numa_node_id",
>  };
>  
> -static bool skip_entry(char *name)
> +bool is_unsafe_function(char *name)
NIT:                       ^ should this be const char * ?
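
A small sketch of what the const-qualified signature allows. This is
not the real helper body: deny_list[] is a hypothetical table, and only
its last entry appears in the quoted hunk. With const char *, callers
can pass string literals and const data without a cast, and the
compiler rejects accidental writes through name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical deny list for illustration only. */
static const char * const deny_list[] = {
	"example_unsafe_func",	/* made-up entry */
	"bpf_get_numa_node_id",
};

/* Read-only lookup, so the argument can be const-qualified. */
static bool is_unsafe_function(const char *name)
{
	size_t i;

	assert(name);

	for (i = 0; i < sizeof(deny_list) / sizeof(deny_list[0]); i++) {
		if (!strcmp(name, deny_list[i]))
			return true;
	}
	return false;
}
```

Existing callers that pass a plain char * keep compiling unchanged,
because char * converts implicitly to const char *.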

Thanks,
Leon

>  {
>  	int i;
>  
> @@ -651,7 +651,7 @@ int bpf_get_ksyms(struct ksyms **ksymsp, bool kernel)
>  		free(name);
>  		if (sscanf(buf, "%ms$*[^\n]\n", &name) != 1)
>  			continue;
> -		if (skip_entry(name))
> +		if (is_unsafe_function(name))
>  			continue;
>  
>  		ks = search_kallsyms_custom_local(ksyms, name, search_kallsyms_compare);
> @@ -728,7 +728,7 @@ int bpf_get_addrs(unsigned long **addrsp, size_t *cntp, bool kernel)
>  		free(name);
>  		if (sscanf(buf, "%p %ms$*[^\n]\n", &addr, &name) != 2)
>  			continue;
> -		if (skip_entry(name))
> +		if (is_unsafe_function(name))
>  			continue;
>  
>  		if (cnt == max_cnt) {
> diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
> index d5bf1433675d..d93be322675d 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.h
> +++ b/tools/testing/selftests/bpf/trace_helpers.h
> @@ -63,4 +63,5 @@ int read_build_id(const char *path, char *build_id, size_t size);
>  int bpf_get_ksyms(struct ksyms **ksymsp, bool kernel);
>  int bpf_get_addrs(unsigned long **addrsp, size_t *cntp, bool kernel);
>  
> +bool is_unsafe_function(char *name);
>  #endif



* Re: [PATCHv3 bpf-next 22/24] selftests/bpf: Add tracing multi attach fails test
From: Leon Hwang @ 2026-03-17  3:06 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260316075138.465430-23-jolsa@kernel.org>

On 16/3/26 15:51, Jiri Olsa wrote:
> Adding tests for attach fails on tracing multi link.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/tracing_multi.c  | 74 +++++++++++++++++++
>  .../selftests/bpf/progs/tracing_multi_fail.c  | 19 +++++
>  2 files changed, 93 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_fail.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> index 04d83c37495b..9f4c5af88e21 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -8,6 +8,7 @@
>  #include "tracing_multi_module.skel.h"
>  #include "tracing_multi_intersect.skel.h"
>  #include "tracing_multi_session.skel.h"
> +#include "tracing_multi_fail.skel.h"
>  #include "trace_helpers.h"
>  
>  static __u64 bpf_fentry_test_cookies[] = {
> @@ -480,6 +481,77 @@ static void test_session(void)
>  	tracing_multi_session__destroy(skel);
>  }
>  
> +static void test_attach_api_fails(void)
> +{
> +	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> +	static const char * const func[] = {
> +		"bpf_fentry_test2",
> +	};
> +	struct tracing_multi_fail *skel = NULL;
> +	__u32 ids[2], *ids2;
> +	__u64 cookies[2];
> +
> +	skel = tracing_multi_fail__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "tracing_multi_fail__open_and_load"))
> +		return;
> +
> +	/* fail#1 pattern and opts NULL */
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						NULL, NULL);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	/* fail#2 pattern and ids */
> +	opts.ids = ids;
> +	opts.cnt = 2;
> +
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						"bpf_fentry_test*", &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	/* fail#3 pattern and cookies */
> +	opts.ids = NULL;
> +	opts.cnt = 2;
> +	opts.cookies = cookies;
> +
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						"bpf_fentry_test*", &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	/* fail#4 bogus pattern */
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						"bpf_not_really_a_function*", NULL);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	/* fail#5 abnormal cnt */
> +	opts.ids = ids;
> +	opts.cnt = INT_MAX;
> +
> +	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
> +						NULL, &opts);
> +	if (!ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> +		goto cleanup;
> +
> +	/* fail#6 attach sleepable program to not-allowed function */
> +	ids2 = get_ids(func, 1, NULL);
> +	if (!ASSERT_OK_PTR(ids, "get_ids"))
                           ^ ids2 ?

> +		goto cleanup;
> +
> +	opts.ids = ids2;
> +	opts.cnt = 1;
> +
> +	skel->links.test_fentry_s = bpf_program__attach_tracing_multi(skel->progs.test_fentry_s,
> +						NULL, &opts);
> +	ASSERT_ERR_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi");
                                   ^ test_fentry_s ?

Thanks,
Leon

> +	free(ids2);
> +
> +cleanup:
> +	tracing_multi_fail__destroy(skel);
> +}
> +
>  void test_tracing_multi_test(void)
>  {
>  #ifndef __x86_64__
> @@ -505,4 +577,6 @@ void test_tracing_multi_test(void)
>  		test_link_api_ids(true);
>  	if (test__start_subtest("session"))
>  		test_session();
> +	if (test__start_subtest("attach_api_fails"))
> +		test_attach_api_fails();
>  }
> diff --git a/tools/testing/selftests/bpf/progs/tracing_multi_fail.c b/tools/testing/selftests/bpf/progs/tracing_multi_fail.c
> new file mode 100644
> index 000000000000..8f769ddb9136
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tracing_multi_fail.c
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("fentry.multi")
> +int BPF_PROG(test_fentry)
> +{
> +	return 0;
> +}
> +
> +SEC("fentry.multi.s")
> +int BPF_PROG(test_fentry_s)
> +{
> +	return 0;
> +}



* Re: [PATCHv3 bpf-next 20/24] selftests/bpf: Add tracing multi cookies test
From: Leon Hwang @ 2026-03-17  3:06 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260316075138.465430-21-jolsa@kernel.org>

On 16/3/26 15:51, Jiri Olsa wrote:
> Adding tests for using cookies on tracing multi link.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  .../selftests/bpf/prog_tests/tracing_multi.c  | 23 +++++++++++++++++--
>  .../selftests/bpf/progs/tracing_multi_check.c | 15 +++++++++++-
>  2 files changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> index b7818f438d6e..f14a936a4667 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -9,6 +9,19 @@
>  #include "tracing_multi_intersect.skel.h"
>  #include "trace_helpers.h"
>  
> +static __u64 bpf_fentry_test_cookies[] = {
> +	8,  /* bpf_fentry_test1 */
> +	9,  /* bpf_fentry_test2 */
> +	7,  /* bpf_fentry_test3 */
> +	5,  /* bpf_fentry_test4 */
> +	4,  /* bpf_fentry_test5 */
> +	2,  /* bpf_fentry_test6 */
> +	3,  /* bpf_fentry_test7 */
> +	1,  /* bpf_fentry_test8 */
> +	10, /* bpf_fentry_test9 */
> +	6,  /* bpf_fentry_test10 */
> +};
> +
>  static const char * const bpf_fentry_test[] = {
>  	"bpf_fentry_test1",
>  	"bpf_fentry_test2",
> @@ -204,7 +217,7 @@ static void test_link_api_pattern(void)
>  	tracing_multi__destroy(skel);
>  }
>  
> -static void test_link_api_ids(void)
> +static void test_link_api_ids(bool test_cookies)
>  {
>  	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
>  	struct tracing_multi *skel;
> @@ -216,6 +229,7 @@ static void test_link_api_ids(void)
>  		return;
>  
>  	skel->bss->pid = getpid();
> +	skel->bss->test_cookies = test_cookies;
>  
>  	ids = get_ids(bpf_fentry_test, cnt, NULL);
>  	if (!ASSERT_OK_PTR(ids, "get_ids"))
> @@ -224,6 +238,9 @@ static void test_link_api_ids(void)
>  	opts.ids = ids;
>  	opts.cnt = cnt;
>  
> +	if (test_cookies)
> +		opts.cookies = bpf_fentry_test_cookies;
> +
>  	skel->links.test_fentry = bpf_program__attach_tracing_multi(skel->progs.test_fentry,
>  						NULL, &opts);
>  	if (!ASSERT_OK_PTR(skel->links.test_fentry, "bpf_program__attach_tracing_multi"))
> @@ -437,7 +454,7 @@ void test_tracing_multi_test(void)
>  	if (test__start_subtest("link_api_pattern"))
>  		test_link_api_pattern();
>  	if (test__start_subtest("link_api_ids"))
> -		test_link_api_ids();
> +		test_link_api_ids(false);
>  	if (test__start_subtest("module_skel_api"))
>  		test_module_skel_api();
>  	if (test__start_subtest("module_link_api_pattern"))
> @@ -446,4 +463,6 @@ void test_tracing_multi_test(void)
>  		test_module_link_api_ids();
>  	if (test__start_subtest("intersect"))
>  		test_intersect();
> +	if (test__start_subtest("cookies"))
> +		test_link_api_ids(true);
>  }
> diff --git a/tools/testing/selftests/bpf/progs/tracing_multi_check.c b/tools/testing/selftests/bpf/progs/tracing_multi_check.c
> index 0e3248312dd5..e6047d5a078a 100644
> --- a/tools/testing/selftests/bpf/progs/tracing_multi_check.c
> +++ b/tools/testing/selftests/bpf/progs/tracing_multi_check.c
> @@ -7,6 +7,7 @@
>  char _license[] SEC("license") = "GPL";
>  
>  int pid = 0;
> +bool test_cookies = false;
>  
>  extern const void bpf_fentry_test1 __ksym;
>  extern const void bpf_fentry_test2 __ksym;
> @@ -28,7 +29,7 @@ extern const void bpf_testmod_fentry_test11 __ksym;
>  void tracing_multi_arg_check(__u64 *ctx, __u64 *test_result, bool is_return)
>  {
>  	void *ip = (void *) bpf_get_func_ip(ctx);
> -	__u64 value = 0, ret = 0;
> +	__u64 value = 0, ret = 0, cookie = 0;
>  	long err = 0;
>  
>  	if (bpf_get_current_pid_tgid() >> 32 != pid)
> @@ -36,6 +37,8 @@ void tracing_multi_arg_check(__u64 *ctx, __u64 *test_result, bool is_return)
>  
>  	if (is_return)
>  		err |= bpf_get_func_ret(ctx, &ret);
> +	if (test_cookies)
> +		cookie = test_cookies ? bpf_get_attach_cookie(ctx) : 0;
                         ^ duplicate test_cookies check? The one in the ternary can be dropped.

Thanks,
Leon

[...]



* Re: [PATCHv3 bpf-next 19/24] selftests/bpf: Add tracing multi intersect tests
From: Leon Hwang @ 2026-03-17  3:05 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260316075138.465430-20-jolsa@kernel.org>

On 16/3/26 15:51, Jiri Olsa wrote:
> Adding tracing multi tests for intersecting attached functions.
> 
> Using bits from (from 1 to 16 values) to specify (up to 4) attached
> programs, and randomly choosing bpf_fentry_test* functions they are
> attached to.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/testing/selftests/bpf/Makefile          |  4 +-
>  .../selftests/bpf/prog_tests/tracing_multi.c  | 99 +++++++++++++++++++
>  .../progs/tracing_multi_intersect_attach.c    | 42 ++++++++
>  3 files changed, 144 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_intersect_attach.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index cf01a11d7803..e56e213441d8 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -486,7 +486,8 @@ LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h		\
>  		linked_vars.skel.h linked_maps.skel.h 			\
>  		test_subskeleton.skel.h test_subskeleton_lib.skel.h	\
>  		test_usdt.skel.h tracing_multi.skel.h			\
> -		tracing_multi_module.skel.h
> +		tracing_multi_module.skel.h				\
> +		tracing_multi_intersect.skel.h
>  
>  LSKELS := fexit_sleep.c trace_printk.c trace_vprintk.c map_ptr_kern.c 	\
>  	core_kern.c core_kern_overflow.c test_ringbuf.c			\
> @@ -514,6 +515,7 @@ xdp_hw_metadata.skel.h-deps := xdp_hw_metadata.bpf.o
>  xdp_features.skel.h-deps := xdp_features.bpf.o
>  tracing_multi.skel.h-deps := tracing_multi_attach.bpf.o tracing_multi_check.bpf.o
>  tracing_multi_module.skel.h-deps := tracing_multi_attach_module.bpf.o tracing_multi_check.bpf.o
> +tracing_multi_intersect.skel.h-deps := tracing_multi_intersect_attach.bpf.o tracing_multi_check.bpf.o
>  
>  LINKED_BPF_OBJS := $(foreach skel,$(LINKED_SKELS),$($(skel)-deps))
>  LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(LINKED_BPF_OBJS))
> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> index e9042d8d4760..b7818f438d6e 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -6,6 +6,7 @@
>  #include "bpf/libbpf_internal.h"
>  #include "tracing_multi.skel.h"
>  #include "tracing_multi_module.skel.h"
> +#include "tracing_multi_intersect.skel.h"
>  #include "trace_helpers.h"
>  
>  static const char * const bpf_fentry_test[] = {
> @@ -31,6 +32,20 @@ static const char * const bpf_testmod_fentry_test[] = {
>  
>  #define FUNCS_CNT (ARRAY_SIZE(bpf_fentry_test))
>  
> +static int get_random_funcs(const char **funcs)
> +{
> +	int i, cnt = 0;
> +
> +	for (i = 0; i < FUNCS_CNT; i++) {
> +		if (rand() % 2)
                    ^ srand() is missing for rand() ?
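
Without a prior srand() call, rand() behaves as if srand(1) had been
called, so every run would pick the same functions. One possible shape,
as a standalone sketch: seed_random_funcs() is a made-up helper that
logs the seed so a failing combination can be replayed, and
get_random_funcs() repeats the quoted selection logic.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define FUNCS_CNT 10

static const char * const bpf_fentry_test[FUNCS_CNT] = {
	"bpf_fentry_test1", "bpf_fentry_test2", "bpf_fentry_test3",
	"bpf_fentry_test4", "bpf_fentry_test5", "bpf_fentry_test6",
	"bpf_fentry_test7", "bpf_fentry_test8", "bpf_fentry_test9",
	"bpf_fentry_test10",
};

/* Hypothetical helper: seed rand() and log the seed for replay. */
static void seed_random_funcs(unsigned int seed)
{
	printf("get_random_funcs seed: %u\n", seed);
	srand(seed);
}

/* Same selection logic as the quoted get_random_funcs(). */
static int get_random_funcs(const char **funcs)
{
	int i, cnt = 0;

	for (i = 0; i < FUNCS_CNT; i++) {
		if (rand() % 2)
			funcs[cnt++] = bpf_fentry_test[i];
	}
	/* we always need at least one.. */
	if (!cnt)
		funcs[cnt++] = bpf_fentry_test[rand() % FUNCS_CNT];

	assert(cnt >= 1 && cnt <= FUNCS_CNT);
	return cnt;
}
```

The seed could come from time(NULL), set once before the loop in
test_intersect(). Reusing a logged seed reproduces the same selection.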

> +			funcs[cnt++] = bpf_fentry_test[i];
> +	}
> +	/* we always need at least one.. */
> +	if (!cnt)
> +		funcs[cnt++] = bpf_fentry_test[rand() % FUNCS_CNT];
> +	return cnt;
> +}
> +
>  static int compare(const void *ppa, const void *ppb)
>  {
>  	const char *pa = *(const char **) ppa;
> @@ -328,6 +343,88 @@ static void test_module_link_api_ids(void)
>  	free(ids);
>  }
>  
> +static bool is_set(__u32 mask, __u32 bit)
> +{
> +	return (1 << bit) & mask;
> +}
> +
> +static void __test_intersect(__u32 mask, const struct bpf_program *progs[4], __u64 *test_results[4])
> +{
> +	LIBBPF_OPTS(bpf_tracing_multi_opts, opts);
> +	LIBBPF_OPTS(bpf_test_run_opts, topts);
> +	struct bpf_link *links[4] = { NULL };
> +	const char *funcs[FUNCS_CNT];
> +	__u64 expected[4];
> +	__u32 *ids, i;
> +	int err, cnt;
> +
> +	/*
> +	 * We have 4 programs in progs and the mask bits pick which
> +	 * of them gets attached to randomly chosen functions.
> +	 */
> +	for (i = 0; i < 4; i++) {
> +		if (!is_set(mask, i))
> +			continue;
> +
> +		cnt = get_random_funcs(funcs);
> +		ids = get_ids(funcs, cnt, NULL);
> +		if (!ASSERT_OK_PTR(ids, "get_ids"))
> +			goto cleanup;
> +
> +		opts.ids = ids;
> +		opts.cnt = cnt;
> +		links[i] = bpf_program__attach_tracing_multi(progs[i], NULL, &opts);
> +		free(ids);
> +
> +		if (!ASSERT_OK_PTR(links[i], "bpf_program__attach_tracing_multi"))
> +			goto cleanup;
> +
> +		expected[i] = *test_results[i] + cnt;
> +	}
> +
> +	err = bpf_prog_test_run_opts(bpf_program__fd(progs[0]), &topts);
> +	ASSERT_OK(err, "test_run");
> +
> +	for (i = 0; i < 4; i++) {
> +		if (!is_set(mask, i))
> +			continue;
> +		ASSERT_EQ(*test_results[i], expected[i], "test_results");
> +	}
> +
> +cleanup:
> +	for (i = 0; i < 4; i++)
> +		bpf_link__destroy(links[i]);
> +}
> +
> +static void test_intersect(void)
> +{
> +	struct tracing_multi_intersect *skel;
> +	const struct bpf_program *progs[4];
> +	__u64 *test_results[4];
> +	__u32 i;
> +
> +	skel = tracing_multi_intersect__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "tracing_multi_intersect__open_and_load"))
> +		return;
> +
> +	skel->bss->pid = getpid();
> +
> +	progs[0] = skel->progs.fentry_1;
> +	progs[1] = skel->progs.fexit_1;
> +	progs[2] = skel->progs.fentry_2;
> +	progs[3] = skel->progs.fexit_2;
> +
> +	test_results[0] = &skel->bss->test_result_fentry_1;
> +	test_results[1] = &skel->bss->test_result_fexit_1;
> +	test_results[2] = &skel->bss->test_result_fentry_2;
> +	test_results[3] = &skel->bss->test_result_fexit_2;
> +
> +	for (i = 1; i < 16; i++)
> +		__test_intersect(i, progs, test_results);
> +
> +	tracing_multi_intersect__destroy(skel);
> +}
> +
>  void test_tracing_multi_test(void)
>  {
>  #ifndef __x86_64__
> @@ -347,4 +444,6 @@ void test_tracing_multi_test(void)
>  		test_module_link_api_pattern();
>  	if (test__start_subtest("module_link_api_ids"))
>  		test_module_link_api_ids();
> +	if (test__start_subtest("intersect"))
> +		test_intersect();
>  }
> diff --git a/tools/testing/selftests/bpf/progs/tracing_multi_intersect_attach.c b/tools/testing/selftests/bpf/progs/tracing_multi_intersect_attach.c
> new file mode 100644
> index 000000000000..b8aecbf44093
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tracing_multi_intersect_attach.c
> @@ -0,0 +1,42 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stdbool.h>
> +#include <linux/bpf.h>
NIT:         ^ vmlinux.h is better than stdbool.h + bpf.h.

Thanks,
Leon

> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +__hidden extern void tracing_multi_arg_check(__u64 *ctx, __u64 *test_result, bool is_return);
> +
> +__u64 test_result_fentry_1 = 0;
> +__u64 test_result_fentry_2 = 0;
> +__u64 test_result_fexit_1 = 0;
> +__u64 test_result_fexit_2 = 0;
> +
> +SEC("fentry.multi")
> +int BPF_PROG(fentry_1)
> +{
> +	tracing_multi_arg_check(ctx, &test_result_fentry_1, false);
> +	return 0;
> +}
> +
> +SEC("fentry.multi")
> +int BPF_PROG(fentry_2)
> +{
> +	tracing_multi_arg_check(ctx, &test_result_fentry_2, false);
> +	return 0;
> +}
> +
> +SEC("fexit.multi")
> +int BPF_PROG(fexit_1)
> +{
> +	tracing_multi_arg_check(ctx, &test_result_fexit_1, true);
> +	return 0;
> +}
> +
> +SEC("fexit.multi")
> +int BPF_PROG(fexit_2)
> +{
> +	tracing_multi_arg_check(ctx, &test_result_fexit_2, true);
> +	return 0;
> +}



* Re: [PATCHv3 bpf-next 17/24] selftests/bpf: Add tracing multi skel/pattern/ids attach tests
From: Leon Hwang @ 2026-03-17  3:04 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: bpf, linux-trace-kernel, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, Menglong Dong, Steven Rostedt
In-Reply-To: <20260316075138.465430-18-jolsa@kernel.org>

On 16/3/26 15:51, Jiri Olsa wrote:
> Adding tests for tracing_multi link attachment via all possible
> libbpf apis - skeleton, function pattern and btf ids.
> 
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  tools/testing/selftests/bpf/Makefile          |   3 +-
>  .../selftests/bpf/prog_tests/tracing_multi.c  | 245 ++++++++++++++++++
>  .../bpf/progs/tracing_multi_attach.c          |  40 +++
>  .../selftests/bpf/progs/tracing_multi_check.c | 150 +++++++++++
>  4 files changed, 437 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/tracing_multi.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_attach.c
>  create mode 100644 tools/testing/selftests/bpf/progs/tracing_multi_check.c
> 
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index 869b582b1d1f..e09beba5674e 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -485,7 +485,7 @@ SKEL_BLACKLIST := btf__% test_pinning_invalid.c test_sk_assign.c
>  LINKED_SKELS := test_static_linked.skel.h linked_funcs.skel.h		\
>  		linked_vars.skel.h linked_maps.skel.h 			\
>  		test_subskeleton.skel.h test_subskeleton_lib.skel.h	\
> -		test_usdt.skel.h
> +		test_usdt.skel.h tracing_multi.skel.h
>  
>  LSKELS := fexit_sleep.c trace_printk.c trace_vprintk.c map_ptr_kern.c 	\
>  	core_kern.c core_kern_overflow.c test_ringbuf.c			\
> @@ -511,6 +511,7 @@ test_usdt.skel.h-deps := test_usdt.bpf.o test_usdt_multispec.bpf.o
>  xsk_xdp_progs.skel.h-deps := xsk_xdp_progs.bpf.o
>  xdp_hw_metadata.skel.h-deps := xdp_hw_metadata.bpf.o
>  xdp_features.skel.h-deps := xdp_features.bpf.o
> +tracing_multi.skel.h-deps := tracing_multi_attach.bpf.o tracing_multi_check.bpf.o
>  
>  LINKED_BPF_OBJS := $(foreach skel,$(LINKED_SKELS),$($(skel)-deps))
>  LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(LINKED_BPF_OBJS))
> diff --git a/tools/testing/selftests/bpf/prog_tests/tracing_multi.c b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> new file mode 100644
> index 000000000000..cebf4eb68f18
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/tracing_multi.c
> @@ -0,0 +1,245 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <test_progs.h>
> +#include <bpf/btf.h>
> +#include <search.h>
> +#include "bpf/libbpf_internal.h"
> +#include "tracing_multi.skel.h"
> +#include "trace_helpers.h"
> +
> +static const char * const bpf_fentry_test[] = {
> +	"bpf_fentry_test1",
> +	"bpf_fentry_test2",
> +	"bpf_fentry_test3",
> +	"bpf_fentry_test4",
> +	"bpf_fentry_test5",
> +	"bpf_fentry_test6",
> +	"bpf_fentry_test7",
> +	"bpf_fentry_test8",
> +	"bpf_fentry_test9",
> +	"bpf_fentry_test10",
> +};
> +
> +#define FUNCS_CNT (ARRAY_SIZE(bpf_fentry_test))
> +
> +static int compare(const void *ppa, const void *ppb)
> +{
> +	const char *pa = *(const char **) ppa;
> +	const char *pb = *(const char **) ppb;
> +
> +	return strcmp(pa, pb);
> +}
> +
> +static __u32 *get_ids(const char * const funcs[], int funcs_cnt, const char *mod)
> +{
> +	struct btf *btf, *vmlinux_btf;
> +	__u32 nr, type_id, cnt = 0;
> +	void *root = NULL;
> +	__u32 *ids = NULL;
> +	int i, err = 0;
> +
> +	btf = btf__load_vmlinux_btf();
> +	if (!ASSERT_OK_PTR(btf, "btf__load_vmlinux_btf"))
> +		return NULL;
> +
> +	if (mod) {
> +		vmlinux_btf = btf;
> +		btf = btf__load_module_btf(mod, vmlinux_btf);
> +		if (!ASSERT_OK_PTR(btf, "btf__load_module_btf"))
> +			return NULL;
                        ^ vmlinux_btf is not released on this error path.
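
To illustrate the cleanup shape I have in mind, here is a minimal
sketch with stand-in handles. It is not libbpf code: struct handle,
handle_open(), handle_close() and use_handles() are all hypothetical
names, and live_handles only exists so the leak can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts outstanding handles so a leak is observable (illustration only). */
static int live_handles;

/* Hypothetical resource standing in for a loaded BTF object. */
struct handle {
	int id;
};

/* Hypothetical opener; 'fail' forces the error path. */
static struct handle *handle_open(int fail)
{
	struct handle *h;

	if (fail)
		return NULL;
	h = malloc(sizeof(*h));
	if (h)
		live_handles++;
	return h;
}

/* Hypothetical closer; accepts NULL so the exit path stays simple. */
static void handle_close(struct handle *h)
{
	if (!h)
		return;
	live_handles--;
	free(h);
}

/* Same shape as get_ids(): a base handle, then an optional dependent one. */
static int use_handles(int want_second, int second_fails)
{
	struct handle *base, *second = NULL;
	int err = 0;

	base = handle_open(0);
	if (!base)
		return -1;

	if (want_second) {
		second = handle_open(second_fails);
		if (!second) {
			err = -1;
			goto out;	/* a plain 'return' here would leak 'base' */
		}
	}

	/* ... work with the handles ... */
out:
	handle_close(second);
	handle_close(base);
	return err;
}
```

With a single exit label, the failing second open still releases the
first handle, so live_handles is back to zero after every call.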

> +	}
> +
> +	ids = calloc(funcs_cnt, sizeof(ids[0]));
> +	if (!ids)
> +		goto out;
> +
> +	/*
> +	 * We sort function names by name and search them
> +	 * below for each function.
> +	 */
> +	for (i = 0; i < funcs_cnt; i++)
> +		tsearch(&funcs[i], &root, compare);
                ^ Is a tdestroy() call missing to free the tree nodes?
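
For reference, a small standalone sketch of the tsearch()/tfind()
pattern with the tree released at the end. count_known() and
free_nothing() are hypothetical names used only for this example, and
tdestroy() is a glibc extension, hence the _GNU_SOURCE define.

```c
#define _GNU_SOURCE	/* tdestroy() is a glibc extension */
#include <assert.h>
#include <search.h>
#include <string.h>

/* Same shape as the comparator in the test: keys are 'const char *' slots. */
static int compare(const void *ppa, const void *ppb)
{
	const char *pa = *(const char * const *)ppa;
	const char *pb = *(const char * const *)ppb;

	return strcmp(pa, pb);
}

/* Keys point into the caller's array, so there is nothing to free per key. */
static void free_nothing(void *key)
{
	(void)key;
}

/*
 * Hypothetical helper: returns how many entries of 'wanted' are present
 * in 'funcs', or -1 if a tree node could not be allocated.
 */
static int count_known(const char * const funcs[], int funcs_cnt,
		       const char * const wanted[], int wanted_cnt)
{
	void *root = NULL;
	int i, found = 0;

	for (i = 0; i < funcs_cnt; i++) {
		if (!tsearch(&funcs[i], &root, compare)) {
			found = -1;
			goto out;
		}
	}

	for (i = 0; i < wanted_cnt; i++) {
		if (tfind(&wanted[i], &root, compare))
			found++;
	}
out:
	/* Release every node tsearch() allocated, on success and on error. */
	if (root)
		tdestroy(root, free_nothing);
	return found;
}
```

The tree only owns its nodes, not the strings, which is why the
per-key callback does nothing here.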

Thanks,
Leon

[...]


^ permalink raw reply

* Re: [PATCH v6 14/17] lib/bootconfig: narrow offset type in xbc_init_node()
From: Masami Hiramatsu @ 2026-03-17  0:55 UTC (permalink / raw)
  To: Josh Law; +Cc: Andrew Morton, Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260315122015.55965-15-objecting@objecting.org>

On Sun, 15 Mar 2026 12:20:12 +0000
Josh Law <objecting@objecting.org> wrote:

>   lib/bootconfig.c:415:32: warning: conversion to 'long unsigned int'
>   from 'long int' may change the sign of the result [-Wsign-conversion]
> 
> Pointer subtraction yields ptrdiff_t (signed long), which was stored in
> unsigned long.  The offset is immediately checked against XBC_DATA_MAX
> (32767) and then truncated to uint16_t, so unsigned int is sufficient.
> Add an explicit cast on the subtraction to suppress the sign-conversion
> warning.
> 
> Signed-off-by: Josh Law <objecting@objecting.org>
> ---
>  lib/bootconfig.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index 995c2ec94cbe..7296df003459 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -412,7 +412,7 @@ const char * __init xbc_node_find_next_key_value(struct xbc_node *root,
>  
>  static int __init xbc_init_node(struct xbc_node *node, char *data, uint16_t flag)
>  {
> -	unsigned long offset = data - xbc_data;
> +	unsigned int offset = (unsigned int)(data - xbc_data);
>  
>  	if (WARN_ON(offset >= XBC_DATA_MAX))

OK, then this can be changed to

	long offset = data - xbc_data;

	if (WARN_ON(offset < 0 || offset >= XBC_DATA_MAX))

The original code handles the data < xbc_data case (there the
offset wraps to a value above LONG_MAX, so offset >= XBC_DATA_MAX
is also true). Note that this check exists to catch a broken pointer,
i.e. a program bug (WARN_ON is used for such cases).
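
A minimal userspace sketch of the three variants, assuming
XBC_DATA_MAX is 32767 as stated in the commit message. The ok_*()
helpers are hypothetical and take the pointer difference as a plain
integer. The last one models the patch with a 64-bit difference and a
32-bit unsigned int, which is my assumption about the target.

```c
#include <assert.h>
#include <stdbool.h>

#define XBC_DATA_MAX 32767	/* value quoted in the commit message */

/* Original form: a negative difference wraps to a huge unsigned value. */
static bool ok_unsigned_long(long diff)
{
	unsigned long offset = (unsigned long)diff;

	return offset < XBC_DATA_MAX;
}

/* Suggested form: the negative case is rejected explicitly. */
static bool ok_signed_long(long diff)
{
	long offset = diff;

	return offset >= 0 && offset < XBC_DATA_MAX;
}

/*
 * Patch form, modelled with a 64-bit difference narrowed to a 32-bit
 * unsigned int: the upper bits are discarded before the range check.
 */
static bool ok_truncated(long long diff)
{
	unsigned int offset = (unsigned int)diff;

	return offset < XBC_DATA_MAX;
}
```

In this model, a difference of -1 is rejected by all three, but a
difference of 5 - 2^32 narrows to 5 and passes ok_truncated(). The
signed check does not depend on the wraparound at all.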

Thank you,

>  		return -EINVAL;
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v6 12/17] lib/bootconfig: fix signed comparison in xbc_node_get_data()
From: Masami Hiramatsu @ 2026-03-16 23:57 UTC (permalink / raw)
  To: Josh Law; +Cc: Andrew Morton, Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <20260315122015.55965-13-objecting@objecting.org>

On Sun, 15 Mar 2026 12:20:10 +0000
Josh Law <objecting@objecting.org> wrote:

>   lib/bootconfig.c:188:28: warning: comparison of integer expressions
>   of different signedness: 'int' and 'size_t' [-Wsign-compare]
> 
> The local variable 'offset' is declared as int, but xbc_data_size is
> size_t.  Using ~XBC_VALUE as the mask also involves integer promotion
> rules that obscure intent.
> 
> Change the type to unsigned int and mask with XBC_DATA_MAX (which is
> the 15-bit data mask) instead of ~XBC_VALUE, making the expression
> self-documenting and eliminating the signed/unsigned comparison.

Please follow the warning message and use size_t instead.
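
Roughly what I mean, as a standalone sketch rather than the actual
patch. XBC_VALUE is assumed to be bit 15 (the commit message calls
XBC_DATA_MAX the 15-bit data mask), and offset_is_valid() is a
hypothetical stand-in for the check in xbc_node_get_data().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XBC_VALUE (1 << 15)	/* assumed: flag bit above the 15-bit offset */

/*
 * Hypothetical stand-in for the bounds check: 'offset' has the same type
 * as the size it is compared with, so the comparison is unsigned on
 * both sides.
 */
static bool offset_is_valid(uint16_t data, size_t data_size)
{
	size_t offset = (size_t)(data & ~XBC_VALUE);

	return offset < data_size;
}
```

The ~XBC_VALUE mask stays as it is. Only the type of the local
variable changes to match xbc_data_size.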

Thanks,

> 
> Signed-off-by: Josh Law <objecting@objecting.org>
> ---
>  lib/bootconfig.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/bootconfig.c b/lib/bootconfig.c
> index 182d9d9bc5a6..806a8f038d24 100644
> --- a/lib/bootconfig.c
> +++ b/lib/bootconfig.c
> @@ -183,7 +183,7 @@ struct xbc_node * __init xbc_node_get_next(struct xbc_node *node)
>   */
>  const char * __init xbc_node_get_data(struct xbc_node *node)
>  {
> -	int offset = node->data & ~XBC_VALUE;
> +	unsigned int offset = node->data & XBC_DATA_MAX;
>  
>  	if (WARN_ON(offset >= xbc_data_size))
>  		return NULL;
> -- 
> 2.34.1
> 
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH v9 0/4] ring-buffer: Making persistent ring buffers robust
From: Masami Hiramatsu @ 2026-03-16 23:21 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers
In-Reply-To: <177319273059.130641.10882692460536780093.stgit@mhiramat.tok.corp.google.com>

On Wed, 11 Mar 2026 10:32:11 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> Hi,
> 
> Here is the 9th version of improvement patches for making persistent
> ring buffers robust to failures.
> The previous version is here:
> 
> https://lore.kernel.org/all/177303264034.767813.5345788067082238396.stgit@mhiramat.tok.corp.google.com/
> 
> In this version, I fixed bugs/typos in [2/4][3/4] and add a bugfix patch
> [1/4] and a test[4/4]. Also, add a meta->subbuf_size validation[3/4].

Hmm, the test case fails if rewinding happens, because the
data_page validation fails during rewinding and stops the rewind.
The test may need to be designed more carefully.
The others look good to me.

Thanks,

> 
> Thank you,
> 
> ---
> 
> Masami Hiramatsu (Google) (4):
>       ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
>       ring-buffer: Flush and stop persistent ring buffer on panic
>       ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
>       ring-buffer: Add persistent ring buffer selftest
> 
> 
>  arch/alpha/include/asm/Kbuild        |    1 
>  arch/arc/include/asm/Kbuild          |    1 
>  arch/arm/include/asm/Kbuild          |    1 
>  arch/arm64/include/asm/ring_buffer.h |   10 ++
>  arch/csky/include/asm/Kbuild         |    1 
>  arch/hexagon/include/asm/Kbuild      |    1 
>  arch/loongarch/include/asm/Kbuild    |    1 
>  arch/m68k/include/asm/Kbuild         |    1 
>  arch/microblaze/include/asm/Kbuild   |    1 
>  arch/mips/include/asm/Kbuild         |    1 
>  arch/nios2/include/asm/Kbuild        |    1 
>  arch/openrisc/include/asm/Kbuild     |    1 
>  arch/parisc/include/asm/Kbuild       |    1 
>  arch/powerpc/include/asm/Kbuild      |    1 
>  arch/riscv/include/asm/Kbuild        |    1 
>  arch/s390/include/asm/Kbuild         |    1 
>  arch/sh/include/asm/Kbuild           |    1 
>  arch/sparc/include/asm/Kbuild        |    1 
>  arch/um/include/asm/Kbuild           |    1 
>  arch/x86/include/asm/Kbuild          |    1 
>  arch/xtensa/include/asm/Kbuild       |    1 
>  include/asm-generic/ring_buffer.h    |   13 +++
>  include/linux/ring_buffer.h          |    1 
>  kernel/trace/Kconfig                 |   15 +++
>  kernel/trace/ring_buffer.c           |  169 ++++++++++++++++++++++++++--------
>  kernel/trace/trace.c                 |    4 +
>  26 files changed, 192 insertions(+), 40 deletions(-)
>  create mode 100644 arch/arm64/include/asm/ring_buffer.h
>  create mode 100644 include/asm-generic/ring_buffer.h
> 
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply
