[RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log

public inbox for linux-trace-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
@ 2026-01-20 11:25 Li Chen
  2026-01-20 11:25 ` [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint Li Chen
  2026-04-10  1:18 ` [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Theodore Tso
  0 siblings, 2 replies; 7+ messages in thread
From: Li Chen @ 2026-01-20 11:25 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4,
	linux-trace-kernel, linux-kernel

Hi,

(This RFC v4 series is based on linux-next tag next-20260106, plus the
prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe" posted at:
https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)

Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
masks the issue, and that sleeping in ext4_fc_track_inode() while holding
i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
i_data_sem while the inode is in FC_COMMITTING.

Zhang Yi suggested two possible directions to address the root cause:

1. "Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.

  Change ext4_fc_commit() to lookup logical to physical mapping using extent
  status tree. This would get rid of the need to call ext4_fc_track_inode()
  before acquiring i_data_sem. To do that we would need to ensure that
  modified extents from the extent status tree are not evicted from memory."

2. "Alternatively, recording the mapped range of tracking might also be
feasible."

This series implements a hybrid way: it implements approach 2 by snapshotting inode image
and mapped ranges at commit time, and consuming only snapshots during log
writing.

Approach 2 still needs a mapping source while building the snapshot
(logical-to-physical and unwritten/hole semantics). Calling ext4_map_blocks()
there would take i_data_sem and can block inside the
jbd2_journal_lock_updates() window, which risks deadlocks or unbounded stalls.
So the snapshot path uses approach 1's extent status lookups as a best-effort
mapping source to avoid ext4_map_blocks().

I did not fuly implement approach 1 (making extent status lookups
authoritative by preventing reclaim of needed entries) because that would need
additional pinning/integration under memory pressure and a larger correctness
surface. Instead, the extent status tree is treated as a cache and the
snapshot path falls back to full commit on cache misses or unstable mappings
(e.g. delayed allocation).

Lock inversion / deadlock model (before):

CPU0 (metadata update)               CPU1 (fast commit)
--------------------               -----------------
... hold i_data_sem (A)             mutex_lock(s_fc_lock) (B)
    ext4_fc_track_inode()             ext4_fc_write_inode_data()
      mutex_lock(s_fc_lock) (B)         ext4_map_blocks()
      wait FC_COMMITTING (sleep)          down_read(i_data_sem) (A)

This creates i_data_sem (A) -> s_fc_lock (B) on update paths, and
s_fc_lock (B) -> i_data_sem (A) on commit paths. Once CPU0 sleeps while
holding (A), CPU1 can block on (A) while holding (B), completing the ABBA
cycle.

New model (this series):

CPU0 (metadata update)               CPU1 (fast commit)
--------------------               -----------------
... maybe hold i_data_sem (A)        jbd2_journal_lock_updates()
    ext4_fc_track_*()                 snapshot inode + ranges (no map_blocks)
      mutex_lock(s_fc_lock) (B)       jbd2_journal_unlock_updates()
      if FC_COMMITTING: set FC_REQUEUE s_fc_lock (B)
      no sleep                         write FC log from snapshots only
                                    cleanup: clear COMMITTING, requeue if set

The commit path no longer takes i_data_sem while holding s_fc_lock, and
tracking no longer sleeps waiting for FC_COMMITTING. If an inode is updated
during a fast commit, EXT4_STATE_FC_REQUEUE records that fact and the inode
is moved to FC_Q_STAGING for the next commit.
The only remaning FC_COMMITTING waiter is ext4_fc_del(), which drops
s_fc_lock before sleeping.

This series snapshots the on-disk inode and tracked data ranges while journal
updates are locked and existing handles are drained. The log writing phase then
serializes only snapshots, so it no longer needs to call ext4_map_blocks() and
take i_data_sem under s_fc_lock. This is done in two steps: patch 1 drops
ext4_map_blocks() from log writing by introducing commit-time snapshots, and
patch 5 drops ext4_map_blocks() from the snapshot path by using the extent
status cache. The snapshot also records whether a mapped extent is unwritten,
so the ADD_RANGE records (and replay) preserve unwritten semantics.

Snapshotting runs under jbd2_journal_lock_updates(). Since a cache miss in
ext4_get_inode_loc() can start synchronous inode table I/O and stall handle
starts for milliseconds, patch 1 uses ext4_get_inode_loc_noio() and falls back
to full commit if the inode table block is not present or not uptodate.

ext4_fc_track_inode() also stops waiting for FC_COMMITTING. Updates during an
ongoing fast commit are marked with EXT4_STATE_FC_REQUEUE and are replayed in
the next fast commit, while ext4_fc_del() waits for FC_COMMITTING so an inode
cannot be removed while the commit thread is still using it.

The extent status tree is a cache, not an authoritative source, so the snapshot
path falls back to full commit on cache misses or unstable mappings (e.g.
delayed allocation). This includes cases where extent status entries are not
present (or have been reclaimed) under memory pressure. The snapshot path does
not try to rebuild mappings by calling ext4_map_blocks(); instead it simply
marks the transaction fast commit ineligible.

To keep the updates-locked window bounded, the snapshot path caps the number of
snapshotted inodes and ranges per fast commit (currently 1024 inodes and 2048
ranges) and falls back to full commit when the cap is exceeded. The series also
handles the journal inode i_data_sem lockdep false positive via subclassing;
journal inode mapping may still take i_data_sem even when data inode mapping is
avoided.

Patch 6 adds the ext4_fc_lock_updates tracepoint to quantify the updates-locked
window and snapshot fallback reasons. Patch 7 extends
/proc/fs/ext4/<sb_id>/fc_info with best-effort snapshot counters. If the /proc
interface is undesirable, I can drop patch 7 and keep the tracepoint only, or
drop even both.

Testing and mesurement were done on a QEMU/KVM guest with virtio-pmem + dax
(ext4 -O fast_commit, mounted dax,noatime). The workload does python3 500x
{4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)}.
Over 3 cold boots, ext4_fc_lock_updates reported locked_ns p50 2.88-2.92 us,
p99 <= 6.71 us, and max <= 102.71 us, with snap_err always 0. Under stress-ng
memory pressure (stress-ng --vm 4 --vm-bytes 75% --timeout 60s), locked_ns p50
2.94 us, p99 <= 4.97 us, and max <= 20.07 us. The fc_info snapshot failure
counters stayed at 0.
These hold times are in the low microseconds range, and the caps keep the
worst case bounded.

Comments and guidance are very welcome. Please let me know if there are any
concerns about correctness, corner cases, or better approaches.

RFC v3 -> RFC v4:
- Replace lockdep_assert movement with removing the wait in
  ext4_fc_track_inode() and using EXT4_STATE_FC_REQUEUE to capture updates
  during an ongoing fast commit.
- Replace dropping s_fc_lock around log writing with commit-time snapshots of
  inode image and mapped ranges (recording the mapped range of tracking as
  suggested by Zhang Yi) so log writing consumes only snapshots.
- Avoid inode table I/O under jbd2_journal_lock_updates() via
  ext4_get_inode_loc_noio() and fallback to full commit on cache misses.
- Use the extent status cache for snapshot mappings and fall back to full
  commit on cache misses or unstable mappings (e.g. delayed allocation).
- Add tracepoint and /proc snapshot stats to quantify the updates-locked window
  and snapshot fallback reasons.

RFC v2 -> RFC v3:
- rebase ontop of https://lore.kernel.org/linux-ext4/20251223131342.287864-1-me@linux.beauty/T/#u

RFC v1 -> RFC v2:
- patch 1: move comments to correct place
- patch 2: add it to patchset.
- add missing RFC prefix

RFC v1: https://lore.kernel.org/linux-ext4/20251222032655.87056-1-me@linux.beauty/T/#u
RFC v2: https://lore.kernel.org/linux-ext4/20251222151906.24607-1-me@linux.beauty/T/#t
RFC v3: https://lore.kernel.org/linux-ext4/20251224032943.134063-1-me@linux.beauty/

Thanks,

Li Chen (7):
  ext4: fast commit: snapshot inode state before writing log
  ext4: lockdep: handle i_data_sem subclassing for special inodes
  ext4: fast commit: avoid waiting for FC_COMMITTING
  ext4: fast commit: avoid self-deadlock in inode snapshotting
  ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in
    snapshots
  ext4: fast commit: add lock_updates tracepoint
  ext4: fast commit: export snapshot stats in fc_info

 fs/ext4/ext4.h              |  58 ++-
 fs/ext4/fast_commit.c       | 718 +++++++++++++++++++++++++++++-------
 fs/ext4/inode.c             |  51 +++
 fs/ext4/super.c             |   9 +
 include/trace/events/ext4.h |  33 ++
 5 files changed, 735 insertions(+), 134 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint
  2026-01-20 11:25 [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Li Chen
@ 2026-01-20 11:25 ` Li Chen
  2026-01-23 17:57   ` Steven Rostedt
  2026-04-10  1:18 ` [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Theodore Tso
  1 sibling, 1 reply; 7+ messages in thread
From: Li Chen @ 2026-01-20 11:25 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-ext4, linux-kernel,
	linux-trace-kernel
  Cc: Li Chen

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c       | 86 ++++++++++++++++++++++++++++++-------
 include/trace/events/ext4.h | 33 ++++++++++++++
 2 files changed, 104 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index d1eefee60912..d266eb2a4219 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -193,6 +193,27 @@ static struct kmem_cache *ext4_fc_range_cachep;
 #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
 #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
+/*
+ * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
+ * Keep these stable for tooling.
+ */
+enum ext4_fc_snap_err {
+	EXT4_FC_SNAP_ERR_NONE = 0,
+	EXT4_FC_SNAP_ERR_ES_MISS,
+	EXT4_FC_SNAP_ERR_ES_DELAYED,
+	EXT4_FC_SNAP_ERR_ES_OTHER,
+	EXT4_FC_SNAP_ERR_INODES_CAP,
+	EXT4_FC_SNAP_ERR_RANGES_CAP,
+	EXT4_FC_SNAP_ERR_NOMEM,
+	EXT4_FC_SNAP_ERR_INODE_LOC,
+};
+
+static inline void ext4_fc_set_snap_err(int *snap_err, int err)
+{
+	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
+		*snap_err = err;
+}
+
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
 	BUFFER_TRACE(bh, "");
@@ -983,11 +1004,12 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       struct list_head *ranges,
 				       unsigned int nr_ranges_total,
-				       unsigned int *nr_rangesp)
+				       unsigned int *nr_rangesp,
+				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	unsigned int nr_ranges = 0;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
@@ -1010,11 +1032,16 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		struct ext4_fc_range *range;
 		ext4_lblk_t len;
 
-		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
+		}
 
-		if (ext4_es_is_delayed(&es))
+		if (ext4_es_is_delayed(&es)) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
+		}
 
 		len = es.es_len - (cur_lblk - es.es_lblk);
 		if (len > end_lblk - cur_lblk + 1)
@@ -1024,12 +1051,17 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 			continue;
 		}
 
-		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
+		}
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
-		if (!range)
+		if (!range) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
+		}
 		nr_ranges++;
 
 		range->lblk = cur_lblk;
@@ -1054,6 +1086,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
 
@@ -1070,7 +1103,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int nr_ranges_total,
-				  unsigned int *nr_rangesp)
+				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
@@ -1082,8 +1115,10 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	int alloc_ctx;
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
-	if (ret)
+	if (ret) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
+	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
 		inode_len = EXT4_INODE_SIZE(inode->i_sb);
@@ -1092,6 +1127,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
 	}
@@ -1102,7 +1138,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	brelse(iloc.bh);
 
 	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
-					  &nr_ranges);
+					  &nr_ranges, snap_err);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1203,7 +1239,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
 					 unsigned int *nr_inodesp);
 
 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
-				   unsigned int inodes_size)
+				   unsigned int inodes_size,
+				   unsigned int *nr_inodesp,
+				   unsigned int *nr_rangesp,
+				   int *snap_err)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1221,6 +1260,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1244,6 +1285,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1268,16 +1311,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 		unsigned int inode_ranges = 0;
 
 		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
-					     &inode_ranges);
+					     &inode_ranges, snap_err);
 		if (ret)
 			break;
 		nr_ranges += inode_ranges;
 	}
 
+	if (nr_inodesp)
+		*nr_inodesp = i;
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return ret;
 }
 
-static int ext4_fc_perform_commit(journal_t *journal)
+static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1286,10 +1333,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct inode *inode;
 	struct inode **inodes;
 	unsigned int inodes_size;
+	unsigned int snap_inodes = 0;
+	unsigned int snap_ranges = 0;
+	int snap_err = EXT4_FC_SNAP_ERR_NONE;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
+	ktime_t lock_start;
+	u64 locked_ns;
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1337,13 +1389,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	if (ret)
 		return ret;
 
-
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
 	if (ret)
 		return ret;
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
+	lock_start = ktime_get();
 	/*
 	 * The journal is now locked. No more handles can start and all the
 	 * previous handles are now drained. Snapshotting happens in this
@@ -1357,8 +1409,12 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
+				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
+	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns, snap_inodes,
+				   snap_ranges, ret, snap_err);
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -1563,7 +1619,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
-	ret = ext4_fc_perform_commit(journal);
+	ret = ext4_fc_perform_commit(journal, commit_tid);
 	if (ret < 0) {
 		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
 			status = EXT4_FC_STATUS_INELIGIBLE;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index fd76d14c2776..a1493971821d 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2812,6 +2812,39 @@ TRACE_EVENT(ext4_fc_commit_stop,
 		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 );
 
+TRACE_EVENT(ext4_fc_lock_updates,
+	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
+		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
+		     int snap_err),
+
+	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
+
+	TP_STRUCT__entry(/* entry */
+		__field(dev_t, dev)
+		__field(tid_t, tid)
+		__field(u64, locked_ns)
+		__field(unsigned int, nr_inodes)
+		__field(unsigned int, nr_ranges)
+		__field(int, err)
+		__field(int, snap_err)
+	),
+
+	TP_fast_assign(/* assign */
+		__entry->dev = sb->s_dev;
+		__entry->tid = commit_tid;
+		__entry->locked_ns = locked_ns;
+		__entry->nr_inodes = nr_inodes;
+		__entry->nr_ranges = nr_ranges;
+		__entry->err = err;
+		__entry->snap_err = snap_err;
+	),
+
+	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
+		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
+		  __entry->err, __entry->snap_err)
+);
+
 #define FC_REASON_NAME_STAT(reason)					\
 	show_fc_reason(reason),						\
 	__entry->fc_ineligible_rc[reason]
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint
  2026-01-20 11:25 ` [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint Li Chen
@ 2026-01-23 17:57   ` Steven Rostedt
  2026-01-27 12:05     ` Li Chen
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2026-01-23 17:57 UTC (permalink / raw)
  To: Li Chen
  Cc: Zhang Yi, Theodore Ts'o, Andreas Dilger, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-kernel, linux-trace-kernel

On Tue, 20 Jan 2026 19:25:35 +0800
Li Chen <me@linux.beauty> wrote:

> Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
> so it is useful to quantify the time spent with updates locked and to
> understand why snapshotting can fail.
> 
> Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
> the updates-locked window along with the number of snapshotted inodes
> and ranges. Record the first snapshot failure reason in a stable snap_err
> field for tooling.
> 
> Signed-off-by: Li Chen <me@linux.beauty>
> ---
>  fs/ext4/fast_commit.c       | 86 ++++++++++++++++++++++++++++++-------
>  include/trace/events/ext4.h | 33 ++++++++++++++
>  2 files changed, 104 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index d1eefee60912..d266eb2a4219 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -193,6 +193,27 @@ static struct kmem_cache *ext4_fc_range_cachep;
>  #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
>  #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
>  
> +/*
> + * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
> + * Keep these stable for tooling.
> + */
> +enum ext4_fc_snap_err {
> +	EXT4_FC_SNAP_ERR_NONE = 0,
> +	EXT4_FC_SNAP_ERR_ES_MISS,
> +	EXT4_FC_SNAP_ERR_ES_DELAYED,
> +	EXT4_FC_SNAP_ERR_ES_OTHER,
> +	EXT4_FC_SNAP_ERR_INODES_CAP,
> +	EXT4_FC_SNAP_ERR_RANGES_CAP,
> +	EXT4_FC_SNAP_ERR_NOMEM,
> +	EXT4_FC_SNAP_ERR_INODE_LOC,
> +};
> +
> +static inline void ext4_fc_set_snap_err(int *snap_err, int err)
> +{
> +	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
> +		*snap_err = err;
> +}
> +



> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index fd76d14c2776..a1493971821d 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -2812,6 +2812,39 @@ TRACE_EVENT(ext4_fc_commit_stop,
>  		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
>  );
>  

Why not make the snap_err into a human readable format?

#define TRACE_SNAP_ERR		\
	EM(NONE)		\
	EM(ES_MISS)		\
	EM(ES_DELAYED)		\
	EM(ES_OTHER)		\
	EM(INODES_CAP)		\
	EM(RANGES_CAP)		\
	EM(NOMEM)		\
	EMe(INODE_LOC)		\

#undef EM
#undef EMe

#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);

TRACE_SNAP_ERR

#undef EM
#undef EMe

#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a }


> +TRACE_EVENT(ext4_fc_lock_updates,
> +	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
> +		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
> +		     int snap_err),
> +
> +	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
> +
> +	TP_STRUCT__entry(/* entry */
> +		__field(dev_t, dev)
> +		__field(tid_t, tid)
> +		__field(u64, locked_ns)
> +		__field(unsigned int, nr_inodes)
> +		__field(unsigned int, nr_ranges)
> +		__field(int, err)
> +		__field(int, snap_err)
> +	),
> +
> +	TP_fast_assign(/* assign */
> +		__entry->dev = sb->s_dev;
> +		__entry->tid = commit_tid;
> +		__entry->locked_ns = locked_ns;
> +		__entry->nr_inodes = nr_inodes;
> +		__entry->nr_ranges = nr_ranges;
> +		__entry->err = err;
> +		__entry->snap_err = snap_err;
> +	),
> +
> +	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
> +		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
> +		  __entry->err, __entry->snap_err)

And instead of having the raw value of __entry->snap_err, use:

		__entry->err, __print_symbolic(__entry->snap_err, TRACE_SNAP_ERR))

-- Steve


> +);
> +
>  #define FC_REASON_NAME_STAT(reason)					\
>  	show_fc_reason(reason),						\
>  	__entry->fc_ineligible_rc[reason]

1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint
  2026-01-23 17:57   ` Steven Rostedt
@ 2026-01-27 12:05     ` Li Chen
  0 siblings, 0 replies; 7+ messages in thread
From: Li Chen @ 2026-01-27 12:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Zhang Yi, Theodore Ts'o, Andreas Dilger, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-kernel, linux-trace-kernel

Hi Steven,

 > On Tue, 20 Jan 2026 19:25:35 +0800
 > Li Chen <me@linux.beauty> wrote:
 > 
 > > Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
 > > so it is useful to quantify the time spent with updates locked and to
 > > understand why snapshotting can fail.
 > > 
 > > Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
 > > the updates-locked window along with the number of snapshotted inodes
 > > and ranges. Record the first snapshot failure reason in a stable snap_err
 > > field for tooling.
 > > 
 > > Signed-off-by: Li Chen <me@linux.beauty>
 > > ---
 > >  fs/ext4/fast_commit.c       | 86 ++++++++++++++++++++++++++++++-------
 > >  include/trace/events/ext4.h | 33 ++++++++++++++
 > >  2 files changed, 104 insertions(+), 15 deletions(-)
 > > 
 > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
 > > index d1eefee60912..d266eb2a4219 100644
 > > --- a/fs/ext4/fast_commit.c
 > > +++ b/fs/ext4/fast_commit.c
 > > @@ -193,6 +193,27 @@ static struct kmem_cache *ext4_fc_range_cachep;
 > >  #define EXT4_FC_SNAPSHOT_MAX_INODES    1024
 > >  #define EXT4_FC_SNAPSHOT_MAX_RANGES    2048
 > >  
 > > +/*
 > > + * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
 > > + * Keep these stable for tooling.
 > > + */
 > > +enum ext4_fc_snap_err {
 > > +    EXT4_FC_SNAP_ERR_NONE = 0,
 > > +    EXT4_FC_SNAP_ERR_ES_MISS,
 > > +    EXT4_FC_SNAP_ERR_ES_DELAYED,
 > > +    EXT4_FC_SNAP_ERR_ES_OTHER,
 > > +    EXT4_FC_SNAP_ERR_INODES_CAP,
 > > +    EXT4_FC_SNAP_ERR_RANGES_CAP,
 > > +    EXT4_FC_SNAP_ERR_NOMEM,
 > > +    EXT4_FC_SNAP_ERR_INODE_LOC,
 > > +};
 > > +
 > > +static inline void ext4_fc_set_snap_err(int *snap_err, int err)
 > > +{
 > > +    if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
 > > +        *snap_err = err;
 > > +}
 > > +
 > 
 > 
 > 
 > > diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
 > > index fd76d14c2776..a1493971821d 100644
 > > --- a/include/trace/events/ext4.h
 > > +++ b/include/trace/events/ext4.h
 > > @@ -2812,6 +2812,39 @@ TRACE_EVENT(ext4_fc_commit_stop,
 > >            __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 > >  );
 > >  
 > 
 > Why not make the snap_err into a human readable format?
 > 
 > #define TRACE_SNAP_ERR        \
 >     EM(NONE)        \
 >     EM(ES_MISS)        \
 >     EM(ES_DELAYED)        \
 >     EM(ES_OTHER)        \
 >     EM(INODES_CAP)        \
 >     EM(RANGES_CAP)        \
 >     EM(NOMEM)        \
 >     EMe(INODE_LOC)        \
 > 
 > #undef EM
 > #undef EMe
 > 
 > #define EM(a)    TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
 > #define EMe(a)    TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
 > 
 > TRACE_SNAP_ERR
 > 
 > #undef EM
 > #undef EMe
 > 
 > #define EM(a)    { EXT4_FC_SNAP_ERR_##a, #a },
 > #define EM(a)    { EXT4_FC_SNAP_ERR_##a, #a }
 > 
 > 
 > > +TRACE_EVENT(ext4_fc_lock_updates,
 > > +        TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
 > > +             unsigned int nr_inodes, unsigned int nr_ranges, int err,
 > > +             int snap_err),
 > > +
 > > +    TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
 > > +
 > > +    TP_STRUCT__entry(/* entry */
 > > +        __field(dev_t, dev)
 > > +        __field(tid_t, tid)
 > > +        __field(u64, locked_ns)
 > > +        __field(unsigned int, nr_inodes)
 > > +        __field(unsigned int, nr_ranges)
 > > +        __field(int, err)
 > > +        __field(int, snap_err)
 > > +    ),
 > > +
 > > +    TP_fast_assign(/* assign */
 > > +        __entry->dev = sb->s_dev;
 > > +        __entry->tid = commit_tid;
 > > +        __entry->locked_ns = locked_ns;
 > > +        __entry->nr_inodes = nr_inodes;
 > > +        __entry->nr_ranges = nr_ranges;
 > > +        __entry->err = err;
 > > +        __entry->snap_err = snap_err;
 > > +    ),
 > > +
 > > +    TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %d",
 > > +          MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
 > > +          __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
 > > +          __entry->err, __entry->snap_err)
 > 
 > And instead of having the raw value of __entry->snap_err, use:
 > 
 >         __entry->err, __print_symbolic(__entry->snap_err, TRACE_SNAP_ERR))

Good point. I'll switch snap_err to __print_symbolic() in v5.

Regards,
Li


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
  2026-01-20 11:25 [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Li Chen
  2026-01-20 11:25 ` [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint Li Chen
@ 2026-04-10  1:18 ` Theodore Tso
  2026-04-13 13:01   ` Li Chen
  1 sibling, 1 reply; 7+ messages in thread
From: Theodore Tso @ 2026-04-10  1:18 UTC (permalink / raw)
  To: Li Chen
  Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel

On Tue, Jan 20, 2026 at 07:25:29PM +0800, Li Chen wrote:
> Hi,
> 
> (This RFC v4 series is based on linux-next tag next-20260106, plus the
> prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe" posted at:
> https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)

Can you take a look at the Sashiko reviews here:

    https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty

There seems to be at least one legitimate concern, which is the
potential cur_lblk overflow.  There are a couple of others which I
think is real; could you please look at their review comments?

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
  2026-04-10  1:18 ` [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Theodore Tso
@ 2026-04-13 13:01   ` Li Chen
  2026-04-13 13:12     ` Theodore Tso
  0 siblings, 1 reply; 7+ messages in thread
From: Li Chen @ 2026-04-13 13:01 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel

Hi Ted,

 ---- On Fri, 10 Apr 2026 09:18:43 +0800  Theodore Tso <tytso@mit.edu> wrote --- 
 > On Tue, Jan 20, 2026 at 07:25:29PM +0800, Li Chen wrote:
 > > Hi,
 > > 
 > > (This RFC v4 series is based on linux-next tag next-20260106, plus the
 > > prerequisite patch "ext4: fast commit: make s_fc_lock reclaim-safe" posted at:
 > > https://lore.kernel.org/all/20260106120621.440126-1-me@linux.beauty/)
 > 
 > Can you take a look at the Sashiko reviews here:
 > 
 >     https://sashiko.dev/#/patchset/20260408112020.716706-1-me%40linux.beauty
 > 
 > There seems to be at least one legitimate concern, which is the
 > potential cur_lblk overflow.  There are a couple of others which I
 > think is real; could you please look at their review comments?

Absolutely! It's great to learn about the Sashiko development site.
I will address the real issues in the next version.

Regards,
Li


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log
  2026-04-13 13:01   ` Li Chen
@ 2026-04-13 13:12     ` Theodore Tso
  0 siblings, 0 replies; 7+ messages in thread
From: Theodore Tso @ 2026-04-13 13:12 UTC (permalink / raw)
  To: Li Chen
  Cc: Zhang Yi, Andreas Dilger, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, linux-ext4, linux-trace-kernel, linux-kernel

On Mon, Apr 13, 2026 at 09:01:28PM +0800, Li Chen wrote:
> Absolutely! It's great to learn about the Sashiko development site.
> I will address the real issues in the next version.

Note that Sashiko will sometimes report a pre-existing issue as if it
were a problem with the commit.  If that happens, feel free to ignore
its complaint; what I consider best practice is to either (a) fix it
in the a subsequent patch or patch series, or (b) leave a TODO in the
code.

I've asked the Sashiko folks to add way for URI's for each issue that
are identified by Sashiko, so we can put a URL in the TODO comment for
someone who wants to fix it later, and to make it easier for Sashiko
to identified pre-existing issues so it doesn't comment on the same
issue across multiple commit reviews (and perhaps save on the some LLM
token budget :-).

In the next few days, for patches sent to linux-ext4, Sashiko will
start e-mailing its reviews to the patch submitter and to me as the
maintainer.  Once we can reduce the false positive rate, I'll ask that
the reviews be cc'ed to the linux-ext4 mailing list.  But it seems
good enough that to send e-mails to the patch submitter and the
maintainer --- but that's a decision that each subsystem maintainer
will be making on their own.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-04-13 13:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-20 11:25 [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Li Chen
2026-01-20 11:25 ` [RFC v4 6/7] ext4: fast commit: add lock_updates tracepoint Li Chen
2026-01-23 17:57   ` Steven Rostedt
2026-01-27 12:05     ` Li Chen
2026-04-10  1:18 ` [RFC v4 0/7] ext4: fast commit: snapshot inode state for FC log Theodore Tso
2026-04-13 13:01   ` Li Chen
2026-04-13 13:12     ` Theodore Tso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox