Linux EXT4 FS development
* Re: [PATCH v2 3/3] ext4: derive f_fsid from block device to avoid collisions
From: Theodore Tso @ 2026-04-07 14:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Anand Jain, Darrick J. Wong, linux-ext4, linux-btrfs, linux-xfs,
	Anand Jain
In-Reply-To: <adSUiB9L0sFAd04U@infradead.org>

On Mon, Apr 06, 2026 at 10:22:16PM -0700, Christoph Hellwig wrote:
> > Dilemma:
> > While statfs(2) [1] suggests f_fsid is "some random stuff," we know
> > userspace (NFS, systemd) often treats it as a persistent handle.
> > 
> > Do you prefer one of the names above, or is there a more idiomatic ext4
> > naming convention I should follow?
> > 
>
> My take is that anything that should persist should be an on-disk
> feature flag, not a mount option.  But I'm not in charge of ext4

My take is that f_fsid is random stuff, as documented by the
specification, so anyone who tries to depend on it needs to be kept in
a padded room where they can't hurt themselves or their users.

And as far as NFS is concerned, file handles should be based on
the super block UUID, not statfs's f_fsid, and anyone who wants to
mount a snapshot as an NFS exported file system at the same time that
the original file system is mounted _also_ should be gently coaxed
into a padded room where they can't hurt themselves or their users.

The solution that we've used for *years* for people who are cloning
block devices for things like cloud images has been to use
"tune2fs -U random /dev/sda1".  This works on a mounted file system,
and is (for example) built into various cloud images for Google Cloud
Engine.

If we want to change statfs's f_fsid, from one set of "Random stuff"
to another set of "Random stuff", I don't really mind, but I don't
think it's worth *either* a mount option, *or* a feature flag, as
either would be confusing for system administrators when some file
systems behave one way, and other file systems behave another.
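
For reference, a minimal sketch of how userspace reads the value under discussion. It assumes glibc, where the two words of f_fsid are reachable through the __val[] member; read_fsid() is a hypothetical helper name, not an existing API.

```c
#include <sys/vfs.h>

/*
 * Fetch the two words of f_fsid for the filesystem containing `path`.
 * Returns 0 on success, -1 on failure (errno is set by statfs()).
 * read_fsid() is a hypothetical helper used only for illustration.
 */
int read_fsid(const char *path, unsigned int *w0, unsigned int *w1)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;
	/* Assumption: glibc layout, fsid_t is struct { int __val[2]; }. */
	*w0 = (unsigned int)st.f_fsid.__val[0];
	*w1 = (unsigned int)st.f_fsid.__val[1];
	return 0;
}
```

Nothing in this call tells the caller how the filesystem derived the two words, which is why treating them as a persistent handle is fragile.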

	       	   	    	  - Ted


* Re: [RFC PATCH v1 0/6] provenance_time (ptime): a new settable timestamp for cross-filesystem provenance
From: Darrick J. Wong @ 2026-04-07 15:17 UTC (permalink / raw)
  To: Sean Smith
  Cc: tytso, linux-fsdevel, linux-ext4, linux-btrfs, dsterba, david,
	brauner, osandov, hirofumi, linkinjeon
In-Reply-To: <e5be4e66-5fb0-41d3-807f-d26f78949c3d@gmail.com>

On Tue, Apr 07, 2026 at 01:06:09AM -0500, Sean Smith wrote:
> 
> [written with AI assistance]
> 
> On 4/6/2026 20:42, Darrick J. Wong wrote:
> 
> > "Standard"... I was about to write a sardonic reply here, but then I
> > remembered that Linux finally *does* have a standard means to transfer
> > some of those newer file attributes: file_getattr/file_setattr.
> >
> > (Go Andrey!)
> >
> > So, I guess all you really need to do is extend struct file_attr and now
> > userspace has a fairly convenient means to propagate the provenance
> > time. 🙂
> 
> Thank you for pointing to file_getattr/file_setattr — this
> is a significantly better API path than our utimensat
> extension. The size-versioned struct file_attr eliminates
> the glibc times[2] limitation entirely, which was one of
> the main upstream concerns with the current approach.
> 
> We will investigate extending struct file_attr with ptime
> fields for v2. The on-disk storage across all 5 filesystems
> and the rename-over preservation are API-independent and
> would remain unchanged. The change is re-plumbing the
> userspace write path from utimensat to file_setattr.
> 
> Two design questions:
> 
> Would you recommend fa_ptime_sec (__u64) + fa_ptime_nsec
> (__u32) matching the statx timespec pattern, or a different
> representation?

That uses less space in the struct, so yes.

> Pali Rohar has announced plans for mask fields in file_attr.
> Should we coordinate with his mask work so ptime can be
> selectively set without read-modify-write?

fa_xflags already provides that coverage for the other fields in struct
file_attr, e.g. a getattr caller can ignore fa_extsize if
FS_XFLAG_EXTSIZE isn't set; and setattr will (iirc) reject nonzero
fa_extsize if FS_XFLAG_EXTSIZE isn't set.

Pali Rohar's work would make it possible to discover which fa_xflags
fields are settable or clearable for a given file.
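
A small userspace model of the flag-gated field pattern described above. The struct, the flag name, and the -EINVAL return are stand-ins chosen for illustration (the rejection behaviour itself was stated only as "iirc"); this is not the kernel's struct file_attr or its validation code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a flag that marks an optional field as valid. */
#define DEMO_XFLAG_EXTSIZE 0x00000800u

/* Hypothetical two-field miniature; not the real struct file_attr. */
struct demo_file_attr {
	uint32_t xflags;	/* which optional fields are meaningful */
	uint32_t extsize;	/* only valid when DEMO_XFLAG_EXTSIZE is set */
};

/* Setter-side check: reject a nonzero extsize that lacks its flag. */
int demo_setattr_check(const struct demo_file_attr *fa)
{
	if (fa->extsize != 0 && !(fa->xflags & DEMO_XFLAG_EXTSIZE))
		return -EINVAL;
	return 0;
}

/* Getter-side use: honour extsize only if the flag says it is valid. */
uint32_t demo_effective_extsize(const struct demo_file_attr *fa)
{
	return (fa->xflags & DEMO_XFLAG_EXTSIZE) ? fa->extsize : 0;
}
```

A ptime field gated the same way would need its own flag bit, so that callers could set it without a read-modify-write of the other fields.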

> > So does the provenance time cover just the file's contents, or the other
> > attributes and xattrs?
> 
> Content only. ptime records when the file's data first came
> into existence. Metadata changes (permissions, owner,
> xattrs) update ctime but leave ptime unchanged. This
> matches the semantics of Windows Date Created and macOS
> creation time.
> 
> > The reason I ask is, does the ptime get copied over for an FICLONE,
> > which maps all of one file's data blocks into another?
> 
> It should, conceptually — the content's provenance doesn't
> change when you clone it. Currently FICLONE shares data
> extents but does not copy inode metadata (timestamps,
> permissions), so ptime would not be automatically
> preserved. The calling tool (e.g., cp --reflink) handles
> timestamp copying separately via the write path.
> 
> The question is whether FICLONE should be enhanced to copy
> ptime from source to destination at the kernel level —
> similar to how rename-over preserves ptime. There is an
> argument for it: if the kernel handles provenance during
> clone, tools don't need to know. But FICLONE doesn't
> currently copy mtime either, so adding ptime alone would
> be inconsistent. Worth discussing.

FICLONE is currently treated like a write, which means that mtime is
updated on the destination file.  For filesystems supporting ptime you'd
probably want the kernel to copy the ptime from src to dest after the
remapping completes.

(Same for exchange-range)

> Btrfs subvolume snapshots are a different case — they do
> preserve ptime because the inodes are COW copies of the
> originals.
> 
> > And by extension, would it also need to be exchanged if you told
> > XFS_IOC_EXCHANGE_RANGE to exchange all contents between two files?
> 
> Yes — if the content moves, the provenance should move
> with it. If files A and B exchange data extents, their
> ptimes should swap. ptime follows the content, not the
> inode identity.
> 
> > (I know, I know, you said XFS was TBDHBD ;))
> 
> Worth considering for a future XFS implementation — and
> the file_attr route you suggested would give XFS a clean
> integration path for ptime alongside FICLONE and
> EXCHANGE_RANGE.
> 
> > Last question: Is the provenance time only useful if the file is
> > immutable?  Either directly via chattr +i, or by enabling fsverity?
> 
> No — ptime is useful regardless of mutability. It records
> when the document was born, the same way Windows Date
> Created works. Editing a document updates mtime but not
> the creation date. Both are independently valuable:
> 
>   ptime: "This file was first created March 15, 2019"
>   mtime: "It was last modified today"
>   btime: "This inode was created when I copied it here"
> 
> Immutable files (chattr +i, fsverity) are a special case
> where ptime has extra forensic strength — the content
> provably hasn't changed since the provenance date. But for
> the primary use case — preserving creation dates across
> cross-platform migrations — mutability doesn't diminish
> ptime's value. A document's creation date remains meaningful
> regardless of subsequent edits.

Got it.

--D

> Sean
> 
> 
> 


* Re: [PATCH 0/3] show orphan file inode detail info
From: Theodore Tso @ 2026-04-07 20:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ye Bin, adilger.kernel, linux-ext4, linux-fsdevel
In-Reply-To: <n4sccudy5avcgnkdhc27rzofzoprxqtwhfrlmsh3yyrj6vbc6d@mmu73gmtawkq>

On Tue, Apr 07, 2026 at 12:29:23PM +0200, Jan Kara wrote:
> I agree listing orphan inodes for a superblock is useful and the usefulness
> could actually go beyond ext4. I imagine the very same problem is there for
> XFS or btrfs so perhaps we could think for a while whether we can provide
> an interface that wouldn't be ext4 specific? Perhaps an ioctl
> (GET_ORPHAN_FILES) that would return an fd and reading from that fd would
> return entries for orphan inodes?

I'm really not a fan of ioctl's returning a fd, but that does seem to
be a thing these days, for better or for worse, and I agree that
having a portable solution that works across multiple file systems
would be a good thing.

> Also regarding information reported about orphan inodes - won't it be better
> interface to just return a list of file handles? Userspace can then do
> whatever it needs with them - open, statx, calling ioctl, etc - so we
> thwart feature creep with people asking us to add more information to the
> interface. This also offloads a lot of security questions about the
> interface to appropriate syscalls. So overall it looks like a win to me.

The problem with using a file handle is that the only way to get the
pathname is to open the file handle, and then call readlink on
/proc/self/fd/NN.  And inodes on the orphan inode list have been
unlinked, so we don't want to allow people to be able to open them.  I
suppose we could allow this via O_PATH, but I'm not sure that this is
guaranteed to work across all filesystems' file handles?

	      	   	      		   - Ted


* Re: [RFC PATCH v1 0/6] provenance_time (ptime): a new settable timestamp for cross-filesystem provenance
From: Theodore Tso @ 2026-04-07 23:36 UTC (permalink / raw)
  To: Sean Smith
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, dsterba, david, brauner,
	osandov, almaz, hirofumi, linkinjeon
In-Reply-To: <20260407000558.417-1-DefendTheDisabled@gmail.com>

On Mon, Apr 06, 2026 at 07:05:55PM -0500, Sean Smith wrote:
> The patches implement rename-over preservation in all 5
> filesystem rename handlers. When rename(source, target)
> replaces an existing file, and the source has ptime=0 (the
> default for any newly-created temp file) while the target
> has ptime != 0, the filesystem copies the target's ptime to
> the source before destroying the target's inode. This runs
> inside the rename transaction, atomic with the rename itself.

Yelch.   This is so *very* non-Unixy / non-POSIX / non-Linux.

I understand why it's convenient for your particular use case, but
rename(2) is fundamentally an operation which works on directory
entries.  We don't copy over extended attributes, or Posix ACL's, or
Unix permission mode bits, because (a) that would violate POSIX and
historical Unix behavior, and (b) because rename(2) is fundamentally a
directory entry operation.  This is despite the fact that it would be more
convenient if we didn't have to copy over things like extended
attributes, ACL's, permission mode bits, etc.  So you want to make an
exception for ptime?   Yelch.  Just Yelch.
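
The directory-entry semantics can be observed from userspace with a short experiment. This is a sketch that assumes a writable /tmp; the function name is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create two temp files, rename the first over the second, and report
 * whether the surviving name now refers to the source's inode with the
 * source's mode bits. Returns 1 if so, 0 if not, -1 on setup failure.
 */
int rename_keeps_source_inode(void)
{
	char src[] = "/tmp/rename-src-XXXXXX";
	char dst[] = "/tmp/rename-dst-XXXXXX";
	struct stat before, after;
	int sfd, dfd, ok;

	sfd = mkstemp(src);
	if (sfd < 0)
		return -1;
	dfd = mkstemp(dst);
	if (dfd < 0) {
		close(sfd);
		unlink(src);
		return -1;
	}

	/* Give the two inodes different modes so the result is observable. */
	if (fchmod(sfd, 0600) || fchmod(dfd, 0640) || fstat(sfd, &before) ||
	    rename(src, dst) || stat(dst, &after)) {
		ok = -1;
	} else {
		ok = after.st_ino == before.st_ino &&
		     (after.st_mode & 0777) == 0600;
	}

	close(sfd);
	close(dfd);
	unlink(src);	/* fails harmlessly if the rename succeeded */
	unlink(dst);
	return ok;
}
```

Nothing from the replaced inode (mode 0640 here) carries over to the surviving name; the proposed ptime behaviour would be the one exception to that.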

						- Ted


* Re: [PATCH v4 0/4] jbd2/ext4/ocfs2: lockless jinode dirty range
From: Li Chen @ 2026-04-08  2:12 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, Mark Fasheh, linux-ext4, ocfs2-devel
  Cc: Andreas Dilger, Joel Becker, Joseph Qi, linux-kernel
In-Reply-To: <20260306085643.465275-1-me@linux.beauty>

Hi,

Just a gentle ping on this series posted on March 6, 2026.

The current v4 has Reviewed-by tags from Jan on all four patches.
If there are no further concerns, could you please consider it for
merging?

I'm happy to resend if that would help.

Thanks,
Li


* Re: [PATCH v4 0/4] jbd2/ext4/ocfs2: lockless jinode dirty range
From: Li Chen @ 2026-04-08  2:26 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, Mark Fasheh, linux-ext4, ocfs2-devel
  Cc: Andreas Dilger, Joel Becker, Joseph Qi, linux-kernel
In-Reply-To: <20260306085643.465275-1-me@linux.beauty>

Hi,

One more note: if there are any remaining concerns or comments on the
series, please let me know and I'll address them promptly.

Thanks,
Li


* Re: [RFC PATCH v1 0/6] provenance_time (ptime): a new settable timestamp for cross-filesystem provenance
From: Sean Smith @ 2026-04-08  2:54 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, dsterba, david, brauner,
	osandov, almaz, hirofumi, linkinjeon
In-Reply-To: <20260407233618.GB12536@macsyma-wired.lan>



On 4/7/2026 18:36, Theodore Tso wrote:
> Yelch.   This is so *very* non-Unixy / non-POSIX / non-Linux.
> 
> I understand why it's convenient for your particular use case, but
> rename(2) is fundamentally an operation which works on directory
> entries.  We don't copy over extended attributes, or Posix ACL's, or
> Unix permission mode bits, because (a) that would violate POSIX and
> historical Unix behavior, and (b) because rename(2) is fundamentally a
> directory entry operation.  This is despite the fact that it would be more
> convenient if we didn't have to copy over things like extended
> attributes, ACL's, permission mode bits, etc.  So you want to make an
> exception for ptime?   Yelch.  Just Yelch.
> 
> 						- Ted

[written by Sean]

Finding an alternative to using rename() to transfer ptime between
inodes during atomic saves seems beyond the scope of what I can address
as someone who relies upon AI agents to review and modify code. From
what my agents have been able to explain to me, the options here are
using rename() to transfer ptime via the kernel, or using renameat2 to
require each application to manually preserve/transfer ptime. The latter
is, well, the phrase herding cats might be an understatement of the
difficulty involved.

As an individual, I don't see how I could convince every major
open-source and closed source application developer for Linux to code
their application to preserve and transfer ptime. That seems like
something only leadership in the community has any chance of doing, and
even then, that would be a very slow-moving process.

It also doesn't solve the immediate needs of the increasing number of users
who are trying to ditch Windows for Linux. Windows 11 has pushed one too
many people too far, and they, like me, have had enough.

Maybe senior developers can find an alternative to rename() that works.
I can cross my fingers and hope. Discussing matters with my agents we
couldn't find one. The need for ptime is very real, and the code in my
patch gets the job done, but a solution that respects conventions
requires a level of expertise I don't have. Perhaps in 4-8 months when
my AI harness is more mature and smarter models are available we can
solve this.

If the rename() code making atomic saves work prevents an upstream
merge, I understand. It would be messy adding ptime to the kernel only
to have it disappear anytime any unpatched application modified a file.
Avoiding such failure modes is why it seemed necessary to take a kernel
level approach.

Additionally, the reason an xattr ptime isn't a viable solution is
because applications do not reliably support xattr transfer. Thus it
would not seem likely ptime would receive better support.

I can patch every application I use which is open-source, or I can patch
the kernel. Rational analysis requires that I patch the kernel.

I'll continue using the rename-over approach in my own system, and
maintaining a github repo so that others who need it can patch their own
kernels. As the phrase goes, it's better than nothing. If there's a path
that respects conventions, I'm genuinely interested in guidance.

- Sean



* [syzbot] Monthly ext4 report (Apr 2026)
From: syzbot @ 2026-04-08  6:44 UTC (permalink / raw)
  To: linux-ext4, linux-kernel, syzkaller-bugs

Hello ext4 maintainers/developers,

This is a 31-day syzbot report for the ext4 subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/ext4

During the period, 0 new issues were detected and 2 were fixed.
In total, 49 issues are still open and 175 have already been fixed.

Some of the still happening issues:

Ref  Crashes Repro Title
<1>  7355    Yes   KASAN: out-of-bounds Read in ext4_xattr_set_entry
                   https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1
<2>  6598    Yes   WARNING in ext4_xattr_inode_update_ref (2)
                   https://syzkaller.appspot.com/bug?extid=76916a45d2294b551fd9
<3>  5264    Yes   possible deadlock in ext4_writepages (2)
                   https://syzkaller.appspot.com/bug?extid=eb5b4ef634a018917f3c
<4>  3018    Yes   kernel BUG in ext4_do_writepages
                   https://syzkaller.appspot.com/bug?extid=d1da16f03614058fdc48
<5>  2945    Yes   INFO: task hung in sync_inodes_sb (5)
                   https://syzkaller.appspot.com/bug?extid=30476ec1b6dc84471133
<6>  2495    Yes   possible deadlock in ext4_destroy_inline_data (2)
                   https://syzkaller.appspot.com/bug?extid=bb2455d02bda0b5701e3
<7>  2178    Yes   INFO: task hung in jbd2_journal_commit_transaction (5)
                   https://syzkaller.appspot.com/bug?extid=3071bdd0a9953bc0d177
<8>  994     Yes   WARNING in ext4_xattr_inode_lookup_create
                   https://syzkaller.appspot.com/bug?extid=fe42a669c87e4a980051
<9>  658     Yes   possible deadlock in do_writepages (2)
                   https://syzkaller.appspot.com/bug?extid=756f498a88797cda9299
<10> 464     Yes   INFO: task hung in do_get_write_access (3)
                   https://syzkaller.appspot.com/bug?extid=e7c786ece54bad9d1e43

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set <Ref> no-reminders

To change bug's subsystems, reply with:
#syz set <Ref> subsystems: new-subsystem

You may send multiple commands in a single email message.


* [RFC v6 0/7] ext4: fast commit: snapshot inode state for FC log
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4,
	linux-trace-kernel, linux-kernel

Hi,

(This RFC v6 series is rebased onto linux-next master as of 2026-04-08,
commit f3e6330d7fe4 ("Add linux-next specific files for 20260407").)

Zhang Yi in RFC v3 review pointed out that postponing lockdep assertions only
masks the issue, and that sleeping in ext4_fc_track_inode() while holding
i_data_sem can form a real ABBA deadlock if the fast commit writer also needs
i_data_sem while the inode is in FC_COMMITTING.

Zhang Yi suggested two possible directions to address the root cause:

1. "Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.

  Change ext4_fc_commit() to lookup logical to physical mapping using extent
  status tree. This would get rid of the need to call ext4_fc_track_inode()
  before acquiring i_data_sem. To do that we would need to ensure that
  modified extents from the extent status tree are not evicted from memory."

2. "Alternatively, recording the mapped range of tracking might also be
feasible."

This series takes a hybrid approach: it implements approach 2 by
snapshotting the inode image and mapped ranges at commit time, and consuming
only snapshots during log writing.

Approach 2 still needs a mapping source while building the snapshot
(logical-to-physical and unwritten/hole semantics). Calling ext4_map_blocks()
there would take i_data_sem and can block inside the
jbd2_journal_lock_updates() window, which risks deadlocks or unbounded stalls.
So the snapshot path uses approach 1's extent status lookups as a best-effort
mapping source to avoid ext4_map_blocks().

I did not fully implement approach 1 (making extent status lookups
authoritative by preventing reclaim of needed entries) because that would need
additional pinning/integration under memory pressure and a larger correctness
surface. Instead, the extent status tree is treated as a cache and the
snapshot path falls back to full commit on cache misses or unstable mappings
(e.g. delayed allocation).

Lock inversion / deadlock model (before):

CPU0 (metadata update)               CPU1 (fast commit)
----------------------               ------------------
... hold i_data_sem (A)              mutex_lock(s_fc_lock) (B)
    ext4_fc_track_inode()              ext4_fc_write_inode_data()
      mutex_lock(s_fc_lock) (B)          ext4_map_blocks()
      wait FC_COMMITTING (sleep)           down_read(i_data_sem) (A)

This creates i_data_sem (A) -> s_fc_lock (B) on update paths, and
s_fc_lock (B) -> i_data_sem (A) on commit paths. Once CPU0 sleeps while
holding (A), CPU1 can block on (A) while holding (B), completing the ABBA
cycle.
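
The inversion can be illustrated with a toy order tracker. This is a deliberately tiny model with hypothetical names; it is far simpler than lockdep and only records direct pairs.

```c
#include <stdbool.h>

/* Hypothetical identifiers for the two locks in the diagram above. */
enum demo_lock { LOCK_I_DATA_SEM, LOCK_S_FC_LOCK, LOCK_MAX };

/* seen[a][b] is true once some path took lock b while holding lock a. */
static bool seen[LOCK_MAX][LOCK_MAX];

/*
 * Record "acquired while holding held". Returns true if the opposite
 * order was already recorded, i.e. an ABBA inversion exists.
 */
bool demo_record_order(enum demo_lock held, enum demo_lock acquired)
{
	seen[held][acquired] = true;
	return seen[acquired][held];
}
```

Feeding it the update path (A then B) followed by the commit path (B then A) reports the inversion on the second call.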

New model (this series):

CPU0 (metadata update)                  CPU1 (fast commit)
----------------------                  ------------------
... maybe hold i_data_sem (A)           jbd2_journal_lock_updates()
    ext4_fc_track_*()                   snapshot inode + ranges (no map_blocks)
      mutex_lock(s_fc_lock) (B)         jbd2_journal_unlock_updates()
      if FC_COMMITTING: set FC_REQUEUE  s_fc_lock (B)
      no sleep                          write FC log from snapshots only
                                        cleanup: clear COMMITTING, requeue if set

The commit path no longer takes i_data_sem while holding s_fc_lock, and
tracking no longer sleeps waiting for FC_COMMITTING. If an inode is updated
during a fast commit, EXT4_STATE_FC_REQUEUE records that fact and the inode
is moved to FC_Q_STAGING for the next commit.
The only remaining FC_COMMITTING waiter is ext4_fc_del(), which drops
s_fc_lock before sleeping.
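
The requeue idea can be modelled in a few lines of userspace C. The flag values, struct, and function names below are hypothetical, and locking is omitted; the sketch only shows the state transitions.

```c
#include <stdbool.h>

/* Hypothetical flag bits modelling the inode states named above. */
#define DEMO_FC_COMMITTING 0x1u
#define DEMO_FC_REQUEUE    0x2u

struct demo_inode {
	unsigned int state;
	bool queued_for_next;	/* stands in for a place on FC_Q_STAGING */
};

/* Update path: never sleeps; just notes that the inode changed mid-commit. */
void demo_track(struct demo_inode *in)
{
	if (in->state & DEMO_FC_COMMITTING)
		in->state |= DEMO_FC_REQUEUE;
	else
		in->queued_for_next = true;
}

/* Commit cleanup: clear COMMITTING and requeue if an update raced in. */
void demo_cleanup(struct demo_inode *in)
{
	bool requeue = in->state & DEMO_FC_REQUEUE;

	in->state &= ~(DEMO_FC_COMMITTING | DEMO_FC_REQUEUE);
	in->queued_for_next = requeue;
}
```

Because demo_track() only sets a bit, the update path has nothing to wait on while it may be holding i_data_sem.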

This series snapshots the on-disk inode and tracked data ranges while journal
updates are locked and existing handles are drained. The log writing phase then
serializes only snapshots, so it no longer needs to call ext4_map_blocks() and
take i_data_sem under s_fc_lock. This is done in two steps: patch 1 drops
ext4_map_blocks() from log writing by introducing commit-time snapshots, and
patch 5 drops ext4_map_blocks() from the snapshot path by using the extent
status cache. The snapshot also records whether a mapped extent is unwritten,
so the ADD_RANGE records (and replay) preserve unwritten semantics.

Snapshotting runs under jbd2_journal_lock_updates(). Since a cache miss in
ext4_get_inode_loc() can start synchronous inode table I/O and stall handle
starts for milliseconds, patch 1 uses ext4_get_inode_loc_noio() and falls back
to full commit if the inode table block is not present or not uptodate.

ext4_fc_track_inode() also stops waiting for FC_COMMITTING. Updates during an
ongoing fast commit are marked with EXT4_STATE_FC_REQUEUE and are replayed in
the next fast commit, while ext4_fc_del() waits for FC_COMMITTING so an inode
cannot be removed while the commit thread is still using it.

The extent status tree is a cache, not an authoritative source, so the snapshot
path falls back to full commit on cache misses or unstable mappings (e.g.
delayed allocation). This includes cases where extent status entries are not
present (or have been reclaimed) under memory pressure. The snapshot path does
not try to rebuild mappings by calling ext4_map_blocks(); instead it simply
marks the transaction fast commit ineligible.
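
A minimal model of the "cache, not authority" lookup. The types and names are hypothetical, and the real extent status tree is not a flat array; the point is only that a miss or an unstable entry ends in a fallback rather than a blocking lookup.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached extent: logical start, length, physical start. */
struct demo_extent {
	uint32_t lblk;
	uint32_t len;
	uint64_t pblk;
	bool delayed;	/* delayed allocation: mapping is not stable yet */
};

enum demo_result { DEMO_SNAP_OK, DEMO_FALL_BACK_FULL_COMMIT };

/*
 * Look up lblk in a cache that may be missing entries. On a miss or an
 * unstable mapping, give up instead of asking for an authoritative answer.
 */
enum demo_result demo_snapshot_lookup(const struct demo_extent *cache,
				      size_t n, uint32_t lblk, uint64_t *pblk)
{
	for (size_t i = 0; i < n; i++) {
		const struct demo_extent *e = &cache[i];

		if (lblk < e->lblk || lblk - e->lblk >= e->len)
			continue;
		if (e->delayed)
			return DEMO_FALL_BACK_FULL_COMMIT;
		*pblk = e->pblk + (lblk - e->lblk);
		return DEMO_SNAP_OK;
	}
	return DEMO_FALL_BACK_FULL_COMMIT;	/* cache miss */
}
```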

To keep the updates-locked window bounded, the snapshot path caps the number of
snapshotted inodes and ranges per fast commit (currently 1024 inodes and 2048
ranges) and falls back to full commit when the cap is exceeded. The series also
handles the journal inode i_data_sem lockdep false positive via subclassing;
journal inode mapping may still take i_data_sem even when data inode mapping is
avoided.
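
The caps can be pictured as a simple budget. The two limits are the values quoted above; the struct and function names are hypothetical, and the per-inode accounting shown here is an assumption about how such a check might be structured.

```c
#include <stdbool.h>

/* Cap values as quoted in the text; the macro names are hypothetical. */
#define DEMO_MAX_SNAP_INODES 1024u
#define DEMO_MAX_SNAP_RANGES 2048u

struct demo_snap_budget {
	unsigned int inodes;
	unsigned int ranges;
};

/*
 * Charge one inode with nr_ranges ranges against the budget. Returns false
 * when either cap would be exceeded, meaning the caller should fall back
 * to a full commit rather than lengthen the locked window.
 */
bool demo_budget_charge(struct demo_snap_budget *b, unsigned int nr_ranges)
{
	if (b->inodes + 1 > DEMO_MAX_SNAP_INODES ||
	    nr_ranges > DEMO_MAX_SNAP_RANGES - b->ranges)
		return false;
	b->inodes += 1;
	b->ranges += nr_ranges;
	return true;
}
```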

Patch 6 adds the ext4_fc_lock_updates tracepoint to quantify the updates-locked
window and snapshot fallback reasons. Patch 7 extends
/proc/fs/ext4/<sb_id>/fc_info with best-effort snapshot counters. If the /proc
interface is undesirable, I can drop patch 7 and keep the tracepoint only, or
even drop both.

Testing and measurement were done on a QEMU/KVM guest with virtio-pmem + dax
(ext4 -O fast_commit, mounted dax,noatime). The workload does python3 500x
{4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)}.
Over 3 cold boots, ext4_fc_lock_updates reported locked_ns p50 2.88-2.92 us,
p99 <= 6.71 us, and max <= 102.71 us, with snap_err always 0. Under stress-ng
memory pressure (stress-ng --vm 4 --vm-bytes 75% --timeout 60s), locked_ns p50
2.94 us, p99 <= 4.97 us, and max <= 20.07 us. The fc_info snapshot failure
counters stayed at 0.
These hold times are in the low microseconds range, and the caps keep the
worst case bounded.
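
For anyone reproducing the numbers from the tracepoint output, percentiles over the locked_ns samples can be computed as below. The nearest-rank definition is an assumption; the text above does not say which percentile method was used, and demo_percentile() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint64_t values, ascending. */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile: sort the samples in place and return the value
 * at rank ceil(p/100 * n). Requires n > 0 and 0 < p <= 100.
 */
uint64_t demo_percentile(uint64_t *samples, size_t n, unsigned int p)
{
	size_t rank;

	qsort(samples, n, sizeof(*samples), cmp_u64);
	rank = (n * p + 99) / 100;	/* integer ceil(n * p / 100) */
	return samples[rank - 1];
}
```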

Comments and guidance are very welcome. Please let me know if there are any
concerns about correctness, corner cases, or better approaches.

RFC v5 -> RFC v6:
- Rebase onto linux-next master as of 2026-04-08.
- Address tracepoint review feedback by relying on enum auto-increment for
  snap_err values and by switching the guarded ext4_fc_lock_updates call site
  to trace_call__ext4_fc_lock_updates() to avoid the double static_branch. [1]
- Keep lock window accounting unconditional for fc_info while using the guarded
  direct tracepoint call.
- Fix the inode debug print format exposed by the rebase.

RFC v4 -> RFC v5:
- Patch 6: Make ext4_fc_lock_updates snap_err human readable via
  TRACE_DEFINE_ENUM() + __print_symbolic(), using a single TRACE_SNAP_ERR
  mapping while keeping the enum values stable for tooling.

RFC v3 -> RFC v4:
- Replace lockdep_assert movement with removing the wait in
  ext4_fc_track_inode() and using EXT4_STATE_FC_REQUEUE to capture updates
  during an ongoing fast commit.
- Replace dropping s_fc_lock around log writing with commit-time snapshots of
  inode image and mapped ranges (recording the mapped range of tracking as
  suggested by Zhang Yi) so log writing consumes only snapshots.
- Avoid inode table I/O under jbd2_journal_lock_updates() via
  ext4_get_inode_loc_noio() and fallback to full commit on cache misses.
- Use the extent status cache for snapshot mappings and fall back to full
  commit on cache misses or unstable mappings (e.g. delayed allocation).
- Add tracepoint and /proc snapshot stats to quantify the updates-locked window
  and snapshot fallback reasons.

RFC v2 -> RFC v3:
- rebase on top of
  https://lore.kernel.org/linux-ext4/20251223131342.287864-1-me@linux.beauty/T/#u

RFC v1 -> RFC v2:
- patch 1: move comments to correct place
- patch 2: add it to patchset.
- add missing RFC prefix

RFC v1: https://lore.kernel.org/linux-ext4/20251222032655.87056-1-me@linux.beauty/T/#u
RFC v2: https://lore.kernel.org/linux-ext4/20251222151906.24607-1-me@linux.beauty/T/#t
RFC v3: https://lore.kernel.org/linux-ext4/20251224032943.134063-1-me@linux.beauty/
RFC v4: https://lore.kernel.org/all/20260120112538.132774-1-me@linux.beauty/
RFC v5: https://lore.kernel.org/all/20260317084624.457185-1-me@linux.beauty/t/#u

[1]: https://lore.kernel.org/all/acZJl8QUYEq8voqQ@BLRRASHENOY1.amd.com/T/#u

Thanks,

Li Chen (7):
  ext4: fast commit: snapshot inode state before writing log
  ext4: lockdep: handle i_data_sem subclassing for special inodes
  ext4: fast commit: avoid waiting for FC_COMMITTING
  ext4: fast commit: avoid self-deadlock in inode snapshotting
  ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in
    snapshots
  ext4: fast commit: add lock_updates tracepoint
  ext4: fast commit: export snapshot stats in fc_info

 fs/ext4/ext4.h              |  73 +++-
 fs/ext4/fast_commit.c       | 706 +++++++++++++++++++++++++++++-------
 fs/ext4/inode.c             |  51 +++
 fs/ext4/super.c             |   9 +
 include/trace/events/ext4.h |  61 ++++
 5 files changed, 766 insertions(+), 134 deletions(-)

-- 
2.53.0


* [RFC v6 1/7] ext4: fast commit: snapshot inode state before writing log
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.
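
The two-phase shape described above (copy while updates are locked out, serialize from the copy afterwards) can be sketched in userspace. All names are hypothetical, locking is only indicated in comments, and the real snapshot holds a raw inode plus range records rather than two integers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a commit-time snapshot. */
struct demo_snap {
	uint64_t size;
	uint64_t mtime;
};

/* Hypothetical miniature of live inode state. */
struct demo_live_inode {
	uint64_t size;
	uint64_t mtime;
	struct demo_snap *snap;	/* one pointer, in the spirit of i_fc_snap */
};

/* Phase 1: meant to run while updates are locked out, so the copy is stable. */
int demo_take_snapshot(struct demo_live_inode *in)
{
	struct demo_snap *s = malloc(sizeof(*s));

	if (!s)
		return -1;
	s->size = in->size;
	s->mtime = in->mtime;
	in->snap = s;
	return 0;
}

/* Phase 2: runs after updates resume; reads only the snapshot. */
size_t demo_write_log(const struct demo_live_inode *in, unsigned char *out)
{
	memcpy(out, in->snap, sizeof(*in->snap));
	return sizeof(*in->snap);
}

/* Cleanup: drop the snapshot once the commit is finished. */
void demo_free_snapshot(struct demo_live_inode *in)
{
	free(in->snap);
	in->snap = NULL;
}
```

An update that lands between the two phases changes the live fields but not what gets written, which is the property the log writing path relies on.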

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Fix the inode debug print format after rebasing.

 fs/ext4/ext4.h        |  22 ++-
 fs/ext4/fast_commit.c | 331 +++++++++++++++++++++++++++++++++++-------
 fs/ext4/inode.c       |  51 +++++++
 3 files changed, 352 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0cf68f85dfd1..98857292c707 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1024,6 +1024,7 @@ enum {
 	I_DATA_SEM_EA
 };
 
+struct ext4_fc_inode_snap;
 
 /*
  * fourth extended file system inode data in memory
@@ -1080,6 +1081,22 @@ struct ext4_inode_info {
 	/* End of lblk range that needs to be committed in this fast commit */
 	ext4_lblk_t i_fc_lblk_len;
 
+	/*
+	 * Commit-time fast commit snapshots.
+	 *
+	 * i_fc_snap is installed and freed under sbi->s_fc_lock. The fast
+	 * commit log writing path reads the snapshot under sbi->s_fc_lock while
+	 * serializing fast commit TLVs.
+	 *
+	 * The snapshot lifetime is bounded by EXT4_STATE_FC_COMMITTING and the
+	 * corresponding cleanup / eviction paths.
+	 *
+	 * i_fc_snap points to per-inode snapshot data for fast commit:
+	 * - a raw inode snapshot for EXT4_FC_TAG_INODE
+	 * - data range records for EXT4_FC_TAG_{ADD,DEL}_RANGE
+	 */
+	struct ext4_fc_inode_snap *i_fc_snap;
+
 	spinlock_t i_raw_lock;	/* protects updates to the raw inode */
 
 	/* Fast commit wait queue for this inode */
@@ -3083,8 +3100,9 @@ extern int  ext4_file_getattr(struct mnt_idmap *, const struct path *,
 			      struct kstat *, u32, unsigned int);
 extern void ext4_dirty_inode(struct inode *, int);
 extern int ext4_change_inode_journal_flag(struct inode *, int);
-extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
-extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc);
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 390f25dee2b1..dc19795dacdd 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -55,21 +55,23 @@
  *     deleted while it is being flushed.
  * [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA"
  *     state.
- * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures that
- *     all the exsiting handles finish and no new handles can start.
- * [4] Mark all the fast commit eligible inodes as undergoing fast commit
- *     by setting "EXT4_STATE_FC_COMMITTING" state.
- * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This allows
+ * [3] Lock the journal by calling jbd2_journal_lock_updates(). This ensures
+ *     that all the existing handles finish and no new handles can start.
+ * [4] Mark all the fast commit eligible inodes as undergoing fast commit by
+ *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
+ *     needed for log writing.
+ * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
  *     starting of new handles. If new handles try to start an update on
  *     any of the inodes that are being committed, ext4_fc_track_inode()
  *     will block until those inodes have finished the fast commit.
  * [6] Commit all the directory entry updates in the fast commit space.
- * [7] Commit all the changed inodes in the fast commit space and clear
- *     "EXT4_STATE_FC_COMMITTING" for these inodes.
+ * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
  *     section for more details).
+ * [9] Clear "EXT4_STATE_FC_COMMITTING" and wake up waiters in
+ *     ext4_fc_cleanup().
  *
- * All the inode updates must be enclosed within jbd2_jounrnal_start()
+ * All the inode updates must be enclosed within jbd2_journal_start()
  * and jbd2_journal_stop() similar to JBD2 journaling.
  *
  * Fast Commit Ineligibility
@@ -199,6 +201,8 @@ static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 	unlock_buffer(bh);
 }
 
+static void ext4_fc_free_inode_snap(struct inode *inode);
+
 static inline void ext4_fc_reset_inode(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
@@ -215,6 +219,7 @@ void ext4_fc_init_inode(struct inode *inode)
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
+	ei->i_fc_snap = NULL;
 	init_waitqueue_head(&ei->i_fc_wait);
 }
 
@@ -240,6 +245,7 @@ void ext4_fc_del(struct inode *inode)
 
 	alloc_ctx = ext4_fc_lock(inode->i_sb);
 	if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) {
+		ext4_fc_free_inode_snap(inode);
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
@@ -281,6 +287,7 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
@@ -845,6 +852,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 	return true;
 }
 
+struct ext4_fc_range {
+	struct list_head list;
+	u16 tag;
+	ext4_lblk_t lblk;
+	ext4_lblk_t len;
+	ext4_fsblk_t pblk;
+	bool unwritten;
+};
+
+struct ext4_fc_inode_snap {
+	struct list_head data_list;
+	unsigned int inode_len;
+	u8 inode_buf[];
+};
+
 /*
  * Writes inode in the fast commit space under TLV with tag @tag.
  * Returns 0 on success, error on failure.
@@ -852,21 +874,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u32 *crc,
 static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
-	int ret;
-	struct ext4_iloc iloc;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_inode fc_inode;
 	struct ext4_fc_tl tl;
 	u8 *dst;
+	u8 *src;
+	int inode_len;
+	int ret;
 
-	ret = ext4_get_inode_loc(inode, &iloc);
-	if (ret)
-		return ret;
+	if (!snap)
+		return -ECANCELED;
 
-	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
-		inode_len = EXT4_INODE_SIZE(inode->i_sb);
-	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
-		inode_len += ei->i_extra_isize;
+	src = snap->inode_buf;
+	inode_len = snap->inode_len;
+	if (!src || inode_len == 0)
+		return -ECANCELED;
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -882,10 +904,9 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	dst += EXT4_FC_TAG_BASE_LEN;
 	memcpy(dst, &fc_inode, sizeof(fc_inode));
 	dst += sizeof(fc_inode);
-	memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	memcpy(dst, src, inode_len);
 	ret = 0;
 err:
-	brelse(iloc.bh);
 	return ret;
 }
 
@@ -895,12 +916,74 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
  */
 static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 {
-	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	struct ext4_map_blocks map;
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
 	struct ext4_fc_add_range fc_ext;
 	struct ext4_fc_del_range lrange;
 	struct ext4_extent *ex;
+	struct ext4_fc_range *range;
+
+	if (!snap)
+		return -ECANCELED;
+
+	list_for_each_entry(range, &snap->data_list, list) {
+		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
+			lrange.fc_ino = cpu_to_le32(inode->i_ino);
+			lrange.fc_lblk = cpu_to_le32(range->lblk);
+			lrange.fc_len = cpu_to_le32(range->len);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+					     sizeof(lrange), (u8 *)&lrange, crc))
+				return -ENOSPC;
+			continue;
+		}
+
+		fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+		ex = (struct ext4_extent *)&fc_ext.fc_ex;
+		ex->ee_block = cpu_to_le32(range->lblk);
+		ex->ee_len = cpu_to_le16(range->len);
+		ext4_ext_store_pblock(ex, range->pblk);
+		if (range->unwritten)
+			ext4_ext_mark_unwritten(ex);
+		else
+			ext4_ext_mark_initialized(ex);
+
+		if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+				     sizeof(fc_ext), (u8 *)&fc_ext, crc))
+			return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static void ext4_fc_free_ranges(struct list_head *head)
+{
+	struct ext4_fc_range *range, *range_n;
+
+	list_for_each_entry_safe(range, range_n, head, list) {
+		list_del(&range->list);
+		kfree(range);
+	}
+}
+
+static void ext4_fc_free_inode_snap(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap = ei->i_fc_snap;
+
+	if (!snap)
+		return;
+
+	ext4_fc_free_ranges(&snap->data_list);
+	kfree(snap);
+	ei->i_fc_snap = NULL;
+}
+
+static int ext4_fc_snapshot_inode_data(struct inode *inode,
+				       struct list_head *ranges)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	struct ext4_map_blocks map;
 	int ret;
 
 	spin_lock(&ei->i_fc_lock);
@@ -908,18 +991,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 		spin_unlock(&ei->i_fc_lock);
 		return 0;
 	}
-	old_blk_size = ei->i_fc_lblk_start;
-	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+	start_lblk = ei->i_fc_lblk_start;
+	end_lblk = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
 	ei->i_fc_lblk_len = 0;
 	spin_unlock(&ei->i_fc_lock);
 
-	cur_lblk_off = old_blk_size;
-	ext4_debug("will try writing %d to %d for inode %llu\n",
-		   cur_lblk_off, new_blk_size, inode->i_ino);
+	cur_lblk = start_lblk;
+	ext4_debug("snapshot data ranges %u-%u for inode %llu\n",
+		   start_lblk, end_lblk,
+		   (unsigned long long)inode->i_ino);
+
+	while (cur_lblk <= end_lblk) {
+		struct ext4_fc_range *range;
 
-	while (cur_lblk_off <= new_blk_size) {
-		map.m_lblk = cur_lblk_off;
-		map.m_len = new_blk_size - cur_lblk_off + 1;
+		map.m_lblk = cur_lblk;
+		map.m_len = end_lblk - cur_lblk + 1;
 		ret = ext4_map_blocks(NULL, inode, &map,
 				      EXT4_GET_BLOCKS_IO_SUBMIT |
 				      EXT4_EX_NOCACHE);
@@ -927,17 +1013,21 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			return -ECANCELED;
 
 		if (map.m_len == 0) {
-			cur_lblk_off++;
+			cur_lblk++;
 			continue;
 		}
 
+		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (!range)
+			return -ENOMEM;
+
+		range->lblk = map.m_lblk;
+		range->len = map.m_len;
+		range->pblk = 0;
+		range->unwritten = false;
+
 		if (ret == 0) {
-			lrange.fc_ino = cpu_to_le32(inode->i_ino);
-			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
-			lrange.fc_len = cpu_to_le32(map.m_len);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
-					    sizeof(lrange), (u8 *)&lrange, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_DEL_RANGE;
 		} else {
 			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
 				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
@@ -945,26 +1035,67 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 			/* Limit the number of blocks in one extent */
 			map.m_len = min(max, map.m_len);
 
-			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
-			ex = (struct ext4_extent *)&fc_ext.fc_ex;
-			ex->ee_block = cpu_to_le32(map.m_lblk);
-			ex->ee_len = cpu_to_le16(map.m_len);
-			ext4_ext_store_pblock(ex, map.m_pblk);
-			if (map.m_flags & EXT4_MAP_UNWRITTEN)
-				ext4_ext_mark_unwritten(ex);
-			else
-				ext4_ext_mark_initialized(ex);
-			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
-					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
-				return -ENOSPC;
+			range->tag = EXT4_FC_TAG_ADD_RANGE;
+			range->len = map.m_len;
+			range->pblk = map.m_pblk;
+			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
 		}
 
-		cur_lblk_off += map.m_len;
+		INIT_LIST_HEAD(&range->list);
+		list_add_tail(&range->list, ranges);
+
+		cur_lblk += map.m_len;
 	}
 
 	return 0;
 }
 
+static int ext4_fc_snapshot_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_inode_snap *snap;
+	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+	struct ext4_iloc iloc;
+	LIST_HEAD(ranges);
+	int ret;
+	int alloc_ctx;
+
+	ret = ext4_get_inode_loc_noio(inode, &iloc);
+	if (ret)
+		return ret;
+
+	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
+		inode_len = EXT4_INODE_SIZE(inode->i_sb);
+	else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+		inode_len += ei->i_extra_isize;
+
+	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
+	if (!snap) {
+		brelse(iloc.bh);
+		return -ENOMEM;
+	}
+	INIT_LIST_HEAD(&snap->data_list);
+	snap->inode_len = inode_len;
+
+	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
+	brelse(iloc.bh);
+
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	if (ret) {
+		kfree(snap);
+		ext4_fc_free_ranges(&ranges);
+		return ret;
+	}
+
+	alloc_ctx = ext4_fc_lock(inode->i_sb);
+	ext4_fc_free_inode_snap(inode);
+	ei->i_fc_snap = snap;
+	list_splice_tail_init(&ranges, &snap->data_list);
+	ext4_fc_unlock(inode->i_sb, alloc_ctx);
+
+	return 0;
+}
+
 
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
@@ -1015,6 +1146,11 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 		 */
 		if (list_empty(&fc_dentry->fcd_dilist))
 			continue;
+		/*
+		 * For EXT4_FC_TAG_CREAT, fcd_dilist is linked on the created
+		 * inode's i_fc_dilist list (kept singular), so we can recover the
+		 * inode through it.
+		 */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				struct ext4_inode_info, i_fc_dilist);
 		inode = &ei->vfs_inode;
@@ -1039,6 +1175,88 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
+static int ext4_fc_snapshot_inodes(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct inode **inodes;
+	unsigned int nr_inodes = 0;
+	unsigned int i = 0;
+	int ret = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	if (!nr_inodes)
+		return 0;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		inodes[i] = igrab(&iter->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		inodes[i] = igrab(&ei->vfs_inode);
+		if (inodes[i])
+			i++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+		if (ret)
+			break;
+	}
+
+	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
+		if (inodes[nr_inodes])
+			iput(inodes[nr_inodes]);
+	}
+	kvfree(inodes);
+	return ret;
+}
+
 static int ext4_fc_perform_commit(journal_t *journal)
 {
 	struct super_block *sb = journal->j_private;
@@ -1111,7 +1329,11 @@ static int ext4_fc_perform_commit(journal_t *journal)
 				     EXT4_STATE_FC_COMMITTING);
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
+
+	ret = ext4_fc_snapshot_inodes(journal);
 	jbd2_journal_unlock_updates(journal);
+	if (ret)
+		return ret;
 
 	/*
 	 * Step 5: If file system device is different from journal device,
@@ -1308,6 +1530,7 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
+		ext4_fc_free_inode_snap(&ei->vfs_inode);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
 		if (tid_geq(tid, ei->i_sync_tid)) {
@@ -1343,6 +1566,14 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					     struct ext4_fc_dentry_update,
 					     fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
+		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
+		    !list_empty(&fc_dentry->fcd_dilist)) {
+			/* See the comment in ext4_fc_commit_dentry_updates(). */
+			ei = list_first_entry(&fc_dentry->fcd_dilist,
+					      struct ext4_inode_info,
+					      i_fc_dilist);
+			ext4_fc_free_inode_snap(&ei->vfs_inode);
+		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
 		release_dentry_name_snapshot(&fc_dentry->fcd_name);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 64dba7679245..478c81e6d08b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4939,6 +4939,57 @@ int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 	return ret;
 }
 
+/*
+ * ext4_get_inode_loc_noio() is a best-effort variant of ext4_get_inode_loc().
+ * It looks up the inode table block in the buffer cache and returns -EAGAIN if
+ * the block is not present or not uptodate, without starting any I/O.
+ */
+int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc)
+{
+	struct super_block *sb = inode->i_sb;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *bh;
+	ext4_fsblk_t block;
+	int inodes_per_block, inode_offset;
+	unsigned long ino = inode->i_ino;
+
+	iloc->bh = NULL;
+	if (ino < EXT4_ROOT_INO ||
+	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+		return -EFSCORRUPTED;
+
+	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/* Figure out the offset within the block group inode table. */
+	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+	inode_offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb));
+	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+
+	block = ext4_inode_table(sb, gdp);
+	if (block <= le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) ||
+	    block >= ext4_blocks_count(EXT4_SB(sb)->s_es)) {
+		ext4_error(sb,
+			   "Invalid inode table block %llu in block_group %u",
+			   block, iloc->block_group);
+		return -EFSCORRUPTED;
+	}
+	block += inode_offset / inodes_per_block;
+
+	bh = sb_find_get_block(sb, block);
+	if (!bh)
+		return -EAGAIN;
+	if (!ext4_buffer_uptodate(bh)) {
+		brelse(bh);
+		return -EAGAIN;
+	}
+
+	iloc->bh = bh;
+	return 0;
+}
+
 
 int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
 			  struct ext4_iloc *iloc)
-- 
2.53.0
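
For readers following the index arithmetic in ext4_get_inode_loc_noio()
above, here is a minimal userspace sketch of the same calculation. The
struct names, field names and the inode_loc() helper are hypothetical
stand-ins chosen for illustration; only the arithmetic mirrors the patch,
and no buffer-cache lookup or validity checking is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry values the patch reads from sb. */
struct geom {
	uint32_t inodes_per_group;	/* EXT4_INODES_PER_GROUP(sb) */
	uint32_t inodes_per_block;	/* s_inodes_per_block */
	uint32_t inode_size;		/* EXT4_INODE_SIZE(sb) */
};

/* Hypothetical result type: where the raw inode lives. */
struct loc {
	uint32_t block_group;	/* group that owns the inode */
	uint32_t table_block;	/* block index inside that group's inode table */
	uint32_t offset;	/* byte offset inside that block */
};

/* Inode numbers are 1-based, so everything is computed from (ino - 1). */
static struct loc inode_loc(const struct geom *g, uint32_t ino)
{
	struct loc l;
	uint32_t idx;

	assert(ino >= 1 && g->inodes_per_group && g->inodes_per_block);

	/* Index of the inode within its block group. */
	idx = (ino - 1) % g->inodes_per_group;

	l.block_group = (ino - 1) / g->inodes_per_group;
	l.table_block = idx / g->inodes_per_block;
	l.offset = (idx % g->inodes_per_block) * g->inode_size;
	return l;
}
```

With 8192 inodes per group, 16 inodes per block and 256-byte inodes
(values picked only for the example), inode 8210 falls in group 1, in the
second block of that group's inode table, at byte offset 256.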

* [RFC v6 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
From: Li Chen @ 2026-04-08 11:20 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Fast commit can hold s_fc_lock while writing journal blocks, and mapping
the journal inode can take that inode's i_data_sem. Normal inode update
paths take a data inode's i_data_sem and then s_fc_lock, so lockdep reports
a circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.

Inode cache objects can be recycled, so also reset the i_data_sem lockdep
subclass to I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a
new inode may inherit a stale subclass (journal/quota/ea) and trigger
lockdep warnings.
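
As a rough illustration of why a separate subclass silences the report,
the sketch below models lock-order tracking in userspace. It is not
lockdep: it only remembers "B was taken while A was held" per key and
flags a direct inversion, whereas lockdep handles arbitrary cycles. The
key names, struct order_graph and record_order() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical keys; a subclass is modelled as its own key. */
enum {
	KEY_S_FC_LOCK,
	KEY_I_DATA_SEM_NORMAL,
	KEY_I_DATA_SEM_JOURNAL,
	NR_KEYS
};

struct order_graph {
	/* after[a][b] is true once b has been taken while a was held. */
	bool after[NR_KEYS][NR_KEYS];
};

static void graph_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Record "next acquired while held is held". Returns false when the
 * opposite order was recorded earlier, i.e. a direct inversion.
 */
static bool record_order(struct order_graph *g, int held, int next)
{
	assert(held >= 0 && held < NR_KEYS);
	assert(next >= 0 && next < NR_KEYS);

	if (g->after[next][held])
		return false;
	g->after[held][next] = true;
	return true;
}
```

If the journal inode shares KEY_I_DATA_SEM_NORMAL with data inodes,
recording i_data_sem -> s_fc_lock followed by s_fc_lock -> i_data_sem
returns false on the second call. Recording the second acquisition against
KEY_I_DATA_SEM_JOURNAL instead leaves the two orders unrelated.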

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
  without changing the subclassing logic.

 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 98857292c707..66de888ae411 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1016,12 +1016,14 @@ do {										\
  *			  than the first
  *  I_DATA_SEM_QUOTA  - Used for quota inodes only
  *  I_DATA_SEM_EA     - Used for ea_inodes only
+ *  I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL = 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
 
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 578508eb4f1a..286f05834900 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1425,6 +1425,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
 
@@ -5904,6 +5907,11 @@ static struct inode *ext4_get_journal_inode(struct super_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		  journal_inode, journal_inode->i_size);
 	return journal_inode;
-- 
2.53.0

* [RFC v6 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING
From: Li Chen @ 2026-04-08 11:20 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.
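
The requeue decision described above can be sketched in isolation as
follows. This is a userspace model, not the kernel code: should_requeue()
is a hypothetical helper, requeue_flag stands in for
EXT4_STATE_FC_REQUEUE, and tid_geq() is written in the style of the jbd2
helper of the same name (a wraparound-tolerant comparison on unsigned
transaction IDs).

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int tid_t;

/* Wraparound-tolerant "x >= y", in the style of jbd2's tid_geq(). */
static bool tid_geq(tid_t x, tid_t y)
{
	int difference = (int)(x - y);

	return difference >= 0;
}

/*
 * Model of the cleanup decision:
 * - after a full commit, requeue inodes whose i_sync_tid is newer than
 *   the committed tid;
 * - after a fast commit (tid is 0 there, so the comparison is not
 *   usable), requeue only inodes that were flagged as modified while
 *   the commit was running.
 */
static bool should_requeue(bool full, tid_t tid, tid_t i_sync_tid,
			   bool requeue_flag)
{
	if (full)
		return !tid_geq(tid, i_sync_tid);
	return requeue_flag;
}
```

For a fast commit, should_requeue(false, 0, 5, false) is false and
should_requeue(false, 0, 5, true) is true, so only inodes touched during
the commit stay queued regardless of how tid 0 compares with i_sync_tid.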

Testing: traced ext4:ext4_fc_commit_stop while issuing two fsyncs in the
same transaction. nblks is the number of journal blocks written for that
fast commit. Before this change, the second fsync still wrote almost the
same fast commit log (nblks 10->9), because tid == 0 in
jbd2_fc_end_commit() caused the tid-based requeue logic to keep all inodes
queued. After this change, only inodes modified during the commit are
requeued, and the second fsync wrote a nearly empty fast commit
(nblks 10->1).

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/ext4.h        |   1 +
 fs/ext4/fast_commit.c | 111 ++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 66de888ae411..13fe4fdf9bda 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1995,6 +1995,7 @@ enum {
 	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
 	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
+	EXT4_STATE_FC_REQUEUE,		/* Inode modified during fast commit */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index dc19795dacdd..5ed884cc4b5c 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -61,9 +61,8 @@
  *     setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode state
  *     needed for log writing.
  * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This allows
- *     starting of new handles. If new handles try to start an update on
- *     any of the inodes that are being committed, ext4_fc_track_inode()
- *     will block until those inodes have finished the fast commit.
+ *     starting of new handles. Updates to inodes being fast committed are
+ *     tracked for requeue rather than blocking.
  * [6] Commit all the directory entry updates in the fast commit space.
  * [7] Commit all the changed inodes in the fast commit space.
  * [8] Write tail tag (this tag ensures the atomicity, please read the following
@@ -217,6 +216,7 @@ void ext4_fc_init_inode(struct inode *inode)
 
 	ext4_fc_reset_inode(inode);
 	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	INIT_LIST_HEAD(&ei->i_fc_dilist);
 	ei->i_fc_snap = NULL;
@@ -251,22 +251,30 @@ void ext4_fc_del(struct inode *inode)
 	}
 
 	/*
-	 * Since ext4_fc_del is called from ext4_evict_inode while having a
-	 * handle open, there is no need for us to wait here even if a fast
-	 * commit is going on. That is because, if this inode is being
-	 * committed, ext4_mark_inode_dirty would have waited for inode commit
-	 * operation to finish before we come here. So, by the time we come
-	 * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So,
-	 * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode
-	 * here.
-	 *
-	 * We may come here without any handles open in the "no_delete" case of
-	 * ext4_evict_inode as well. However, if that happens, we first mark the
-	 * file system as fast commit ineligible anyway. So, even in that case,
-	 * it is okay to remove the inode from the fc list.
+	 * Wait for ongoing fast commit to finish. We cannot remove the inode
+	 * from fast commit lists while it is being committed.
 	 */
-	WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)
-		&& !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE));
+	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			ext4_fc_unlock(inode->i_sb, alloc_ctx);
+			schedule();
+			alloc_ctx = ext4_fc_lock(inode->i_sb);
+		}
+		finish_wait(wq, &wait.wq_entry);
+	}
+
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) {
 #if (BITS_PER_LONG < 64)
 		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
@@ -287,19 +295,22 @@ void ext4_fc_del(struct inode *inode)
 		}
 		finish_wait(wq, &wait.wq_entry);
 	}
+
 	ext4_fc_free_inode_snap(inode);
 	list_del_init(&ei->i_fc_list);
 
 	/*
-	 * Since this inode is getting removed, let's also remove all FC
-	 * dentry create references, since it is not needed to log it anyways.
+	 * Since this inode is getting removed, let's also remove all FC dentry
+	 * create references, since it is not needed to log it anyways.
 	 */
 	if (list_empty(&ei->i_fc_dilist)) {
 		ext4_fc_unlock(inode->i_sb, alloc_ctx);
 		return;
 	}
 
-	fc_dentry = list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_update, fcd_dilist);
+	fc_dentry = list_first_entry(&ei->i_fc_dilist,
+				     struct ext4_fc_dentry_update,
+				     fcd_dilist);
 	WARN_ON(fc_dentry->fcd_op != EXT4_FC_TAG_CREAT);
 	list_del_init(&fc_dentry->fcd_list);
 	list_del_init(&fc_dentry->fcd_dilist);
@@ -371,6 +382,8 @@ static int ext4_fc_track_template(
 
 	tid = handle->h_transaction->t_tid;
 	spin_lock(&ei->i_fc_lock);
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE);
 	if (tid == ei->i_sync_tid) {
 		update = true;
 	} else {
@@ -557,8 +570,6 @@ static int __track_inode(handle_t *handle, struct inode *inode, void *arg,
 
 void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 {
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	wait_queue_head_t *wq;
 	int ret;
 
 	if (S_ISDIR(inode->i_mode))
@@ -577,29 +588,11 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 		return;
 
 	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
+	 * Fast commit snapshots inode state at commit time, so there's no need
+	 * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already
+	 * on the commit queue, ext4_fc_cleanup() will requeue it for the new
+	 * transaction once the current commit finishes.
 	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
-	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
-#if (BITS_PER_LONG < 64)
-		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_state_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#else
-		DEFINE_WAIT_BIT(wait, &ei->i_flags,
-				EXT4_STATE_FC_COMMITTING);
-		wq = bit_waitqueue(&ei->i_flags,
-				   EXT4_STATE_FC_COMMITTING);
-#endif
-		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
-			schedule();
-		finish_wait(wq, &wait.wq_entry);
-	}
 
 	/*
 	 * From this point on, this inode will not be committed either
@@ -1526,32 +1519,32 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) {
+		bool requeue;
+
 		ei = list_first_entry(&sbi->s_fc_q[FC_Q_MAIN],
 					struct ext4_inode_info,
 					i_fc_list);
 		list_del_init(&ei->i_fc_list);
 		ext4_fc_free_inode_snap(&ei->vfs_inode);
+		spin_lock(&ei->i_fc_lock);
+		if (full)
+			requeue = !tid_geq(tid, ei->i_sync_tid);
+		else
+			requeue = ext4_test_inode_state(&ei->vfs_inode,
+							EXT4_STATE_FC_REQUEUE);
+		if (!requeue)
+			ext4_fc_reset_inode(&ei->vfs_inode);
+		ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE);
 		ext4_clear_inode_state(&ei->vfs_inode,
 				       EXT4_STATE_FC_COMMITTING);
-		if (tid_geq(tid, ei->i_sync_tid)) {
-			ext4_fc_reset_inode(&ei->vfs_inode);
-		} else if (full) {
-			/*
-			 * We are called after a full commit, inode has been
-			 * modified while the commit was running. Re-enqueue
-			 * the inode into STAGING, which will then be splice
-			 * back into MAIN. This cannot happen during
-			 * fastcommit because the journal is locked all the
-			 * time in that case (and tid doesn't increase so
-			 * tid check above isn't reliable).
-			 */
+		spin_unlock(&ei->i_fc_lock);
+		if (requeue)
 			list_add_tail(&ei->i_fc_list,
 				      &sbi->s_fc_q[FC_Q_STAGING]);
-		}
 		/*
 		 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
 		 * visible before we send the wakeup. Pairs with implicit
-		 * barrier in prepare_to_wait() in ext4_fc_track_inode().
+		 * barrier in prepare_to_wait() in ext4_fc_del().
 		 */
 		smp_mb();
 #if (BITS_PER_LONG < 64)
-- 
2.53.0


* [RFC v6 4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting
From: Li Chen @ 2026-04-08 11:20 UTC
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c | 47 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 5ed884cc4b5c..f28e732e9be7 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1211,13 +1211,12 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
-		inodes[i] = igrab(&iter->vfs_inode);
-		if (inodes[i])
-			i++;
+		inodes[i++] = &iter->vfs_inode;
 	}
 
 	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
 		struct ext4_inode_info *ei;
+		struct inode *inode;
 
 		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
 			continue;
@@ -1227,12 +1226,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		/* See the comment in ext4_fc_commit_dentry_updates(). */
 		ei = list_first_entry(&fc_dentry->fcd_dilist,
 				      struct ext4_inode_info, i_fc_dilist);
+		inode = &ei->vfs_inode;
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
-		inodes[i] = igrab(&ei->vfs_inode);
-		if (inodes[i])
-			i++;
+		/*
+		 * Create-only inodes may only be referenced via fcd_dilist and
+		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
+		 * we are snapshotting, but inode eviction calls ext4_fc_del(),
+		 * which waits for FC_COMMITTING to clear. Mark them FC_COMMITTING
+		 * so the inode stays pinned and the snapshot stays valid until
+		 * ext4_fc_cleanup().
+		 */
+		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+		inodes[i++] = inode;
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
@@ -1242,10 +1249,6 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 			break;
 	}
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		if (inodes[nr_inodes])
-			iput(inodes[nr_inodes]);
-	}
 	kvfree(inodes);
 	return ret;
 }
@@ -1313,8 +1316,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	jbd2_journal_lock_updates(journal);
 	/*
 	 * The journal is now locked. No more handles can start and all the
-	 * previous handles are now drained. We now mark the inodes on the
-	 * commit queue as being committed.
+	 * previous handles are now drained. Snapshotting happens in this
+	 * window so log writing can consume only stable snapshots without
+	 * doing logical-to-physical mapping.
 	 */
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
@@ -1566,6 +1570,25 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 					      struct ext4_inode_info,
 					      i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
+			spin_lock(&ei->i_fc_lock);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_REQUEUE);
+			ext4_clear_inode_state(&ei->vfs_inode,
+					       EXT4_STATE_FC_COMMITTING);
+			spin_unlock(&ei->i_fc_lock);
+			/*
+			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
+			 * visible before we send the wakeup. Pairs with implicit
+			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 */
+			smp_mb();
+#if (BITS_PER_LONG < 64)
+			wake_up_bit(&ei->i_state_flags,
+				    EXT4_STATE_FC_COMMITTING);
+#else
+			wake_up_bit(&ei->i_flags,
+				    EXT4_STATE_FC_COMMITTING);
+#endif
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
 
-- 
2.53.0



* [RFC v6 5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
with dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c | 253 +++++++++++++++++++++++++++++-------------
 1 file changed, 177 insertions(+), 76 deletions(-)
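
The fallback rule described in the commit message (use only what the
extent status cache already knows, and give up on a miss, a delayed
extent, or too many ranges) can be sketched in user space. This is a
simplified model, not the kernel code: es_entry, es_lookup(),
count_ranges() and the MAX_RANGES value of 4 are hypothetical names chosen
for illustration, and the linear search only stands in for the real cache
lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's extent status entries. */
enum es_kind { ES_WRITTEN, ES_UNWRITTEN, ES_HOLE, ES_DELAYED };

struct es_entry {
	unsigned int lblk;	/* first logical block covered */
	unsigned int len;	/* number of blocks covered */
	enum es_kind kind;
};

#define MAX_RANGES 4	/* illustrative cap, far smaller than the patch's 2048 */

/* Linear search standing in for a cache lookup; NULL models a cache miss. */
static const struct es_entry *es_lookup(const struct es_entry *cache, size_t n,
					unsigned int lblk)
{
	for (size_t i = 0; i < n; i++)
		if (lblk >= cache[i].lblk && lblk - cache[i].lblk < cache[i].len)
			return &cache[i];
	return NULL;
}

/*
 * Count the ranges needed to cover [start, end] from cached entries only.
 * Returns the range count, or a negative errno meaning "fall back".
 */
static int count_ranges(const struct es_entry *cache, size_t n,
			unsigned int start, unsigned int end)
{
	unsigned int cur = start;
	int nr = 0;

	assert(start <= end);
	while (cur <= end) {
		const struct es_entry *es = es_lookup(cache, n, cur);
		unsigned int len;

		if (!es)
			return -EAGAIN;		/* cache miss */
		if (es->kind == ES_DELAYED)
			return -EAGAIN;		/* mapping not stable yet */
		if (nr >= MAX_RANGES)
			return -E2BIG;		/* too fragmented */

		/* Blocks left in this entry, clamped to the requested range. */
		len = es->len - (cur - es->lblk);
		if (len > end - cur + 1)
			len = end - cur + 1;
		nr++;
		cur += len;
	}
	return nr;
}
```

With a cache holding a written extent at blocks 0-7 and a hole at 8-11,
count_ranges() over 2..10 yields two ranges, while 10..13 runs off the
cached area and returns -EAGAIN, which a caller would treat as "do a full
commit instead".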

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index f28e732e9be7..ab751b855afa 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -183,6 +183,15 @@
 
 #include <trace/events/ext4.h>
 static struct kmem_cache *ext4_fc_dentry_cachep;
+static struct kmem_cache *ext4_fc_range_cachep;
+
+/*
+ * Avoid spending unbounded time/memory snapshotting highly fragmented files
+ * under jbd2_journal_lock_updates(). If we exceed this limit, fall back to
+ * full commit.
+ */
+#define EXT4_FC_SNAPSHOT_MAX_INODES	1024
+#define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
@@ -954,7 +963,7 @@ static void ext4_fc_free_ranges(struct list_head *head)
 
 	list_for_each_entry_safe(range, range_n, head, list) {
 		list_del(&range->list);
-		kfree(range);
+		kmem_cache_free(ext4_fc_range_cachep, range);
 	}
 }
 
@@ -972,16 +981,19 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 }
 
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
-				       struct list_head *ranges)
+				       struct list_head *ranges,
+				       unsigned int nr_ranges_total,
+				       unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
-	struct ext4_map_blocks map;
-	int ret;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
 		spin_unlock(&ei->i_fc_lock);
+		if (nr_rangesp)
+			*nr_rangesp = 0;
 		return 0;
 	}
 	start_lblk = ei->i_fc_lblk_start;
@@ -995,61 +1007,78 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		   (unsigned long long)inode->i_ino);
 
 	while (cur_lblk <= end_lblk) {
+		struct extent_status es;
 		struct ext4_fc_range *range;
+		ext4_lblk_t len;
 
-		map.m_lblk = cur_lblk;
-		map.m_len = end_lblk - cur_lblk + 1;
-		ret = ext4_map_blocks(NULL, inode, &map,
-				      EXT4_GET_BLOCKS_IO_SUBMIT |
-				      EXT4_EX_NOCACHE);
-		if (ret < 0)
-			return -ECANCELED;
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+			return -EAGAIN;
+
+		if (ext4_es_is_delayed(&es))
+			return -EAGAIN;
 
-		if (map.m_len == 0) {
+		len = es.es_len - (cur_lblk - es.es_lblk);
+		if (len > end_lblk - cur_lblk + 1)
+			len = end_lblk - cur_lblk + 1;
+		if (len == 0) {
 			cur_lblk++;
 			continue;
 		}
 
-		range = kmalloc(sizeof(*range), GFP_NOFS);
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+			return -E2BIG;
+
+		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range)
 			return -ENOMEM;
+		nr_ranges++;
 
-		range->lblk = map.m_lblk;
-		range->len = map.m_len;
+		range->lblk = cur_lblk;
+		range->len = len;
 		range->pblk = 0;
 		range->unwritten = false;
 
-		if (ret == 0) {
+		if (ext4_es_is_hole(&es)) {
 			range->tag = EXT4_FC_TAG_DEL_RANGE;
-		} else {
-			unsigned int max = (map.m_flags & EXT4_MAP_UNWRITTEN) ?
-				EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN;
-
-			/* Limit the number of blocks in one extent */
-			map.m_len = min(max, map.m_len);
+		} else if (ext4_es_is_written(&es) ||
+			   ext4_es_is_unwritten(&es)) {
+			unsigned int max;
 
 			range->tag = EXT4_FC_TAG_ADD_RANGE;
-			range->len = map.m_len;
-			range->pblk = map.m_pblk;
-			range->unwritten = !!(map.m_flags & EXT4_MAP_UNWRITTEN);
+			range->pblk = ext4_es_pblock(&es) +
+				      (cur_lblk - es.es_lblk);
+			range->unwritten = ext4_es_is_unwritten(&es);
+
+			max = range->unwritten ? EXT_UNWRITTEN_MAX_LEN :
+						 EXT_INIT_MAX_LEN;
+			if (range->len > max)
+				range->len = max;
+		} else {
+			kmem_cache_free(ext4_fc_range_cachep, range);
+			return -EAGAIN;
 		}
 
 		INIT_LIST_HEAD(&range->list);
 		list_add_tail(&range->list, ranges);
 
-		cur_lblk += map.m_len;
+		cur_lblk += range->len;
 	}
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-static int ext4_fc_snapshot_inode(struct inode *inode)
+static int ext4_fc_snapshot_inode(struct inode *inode,
+				  unsigned int nr_ranges_total,
+				  unsigned int *nr_rangesp)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
 	LIST_HEAD(ranges);
+	unsigned int nr_ranges = 0;
 	int ret;
 	int alloc_ctx;
 
@@ -1073,7 +1102,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len);
 	brelse(iloc.bh);
 
-	ret = ext4_fc_snapshot_inode_data(inode, &ranges);
+	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
+					  &nr_ranges);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1086,10 +1116,11 @@ static int ext4_fc_snapshot_inode(struct inode *inode)
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return 0;
 }
 
-
 /* Flushes data of all the inodes in the commit queue. */
 static int ext4_fc_flush_data(journal_t *journal)
 {
@@ -1168,49 +1199,32 @@ static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
 	return 0;
 }
 
-static int ext4_fc_snapshot_inodes(journal_t *journal)
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp);
+
+static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
+				   unsigned int inodes_size)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_inode_info *iter;
 	struct ext4_fc_dentry_update *fc_dentry;
-	struct inode **inodes;
-	unsigned int nr_inodes = 0;
 	unsigned int i = 0;
+	unsigned int idx;
+	unsigned int nr_ranges = 0;
 	int ret = 0;
 	int alloc_ctx;
 
-	alloc_ctx = ext4_fc_lock(sb);
-	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
-		nr_inodes++;
-
-	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
-		struct ext4_inode_info *ei;
-
-		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
-			continue;
-		if (list_empty(&fc_dentry->fcd_dilist))
-			continue;
-
-		/* See the comment in ext4_fc_commit_dentry_updates(). */
-		ei = list_first_entry(&fc_dentry->fcd_dilist,
-				      struct ext4_inode_info, i_fc_dilist);
-		if (!list_empty(&ei->i_fc_list))
-			continue;
-
-		nr_inodes++;
-	}
-	ext4_fc_unlock(sb, alloc_ctx);
-
-	if (!nr_inodes)
+	if (!inodes_size)
 		return 0;
 
-	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
-	if (!inodes)
-		return -ENOMEM;
-
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		inodes[i++] = &iter->vfs_inode;
 	}
 
@@ -1230,6 +1244,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		if (!list_empty(&ei->i_fc_list))
 			continue;
 
+		if (i >= inodes_size) {
+			ret = -E2BIG;
+			goto unlock;
+		}
 		/*
 		 * Create-only inodes may only be referenced via fcd_dilist and
 		 * not appear on s_fc_q[MAIN]. They may hit the last iput while
@@ -1241,15 +1259,22 @@ static int ext4_fc_snapshot_inodes(journal_t *journal)
 		ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING);
 		inodes[i++] = inode;
 	}
+unlock:
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	for (nr_inodes = 0; nr_inodes < i; nr_inodes++) {
-		ret = ext4_fc_snapshot_inode(inodes[nr_inodes]);
+	if (ret)
+		return ret;
+
+	for (idx = 0; idx < i; idx++) {
+		unsigned int inode_ranges = 0;
+
+		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
+					     &inode_ranges);
 		if (ret)
 			break;
+		nr_ranges += inode_ranges;
 	}
 
-	kvfree(inodes);
 	return ret;
 }
 
@@ -1260,6 +1285,8 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
+	struct inode **inodes;
+	unsigned int inodes_size;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
@@ -1312,6 +1339,10 @@ static int ext4_fc_perform_commit(journal_t *journal)
 		return ret;
 
 
+	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
+	if (ret)
+		return ret;
+
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
 	/*
@@ -1327,8 +1358,9 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
 	jbd2_journal_unlock_updates(journal);
+	kvfree(inodes);
 	if (ret)
 		return ret;
 
@@ -1384,6 +1416,64 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	return ret;
 }
 
+static unsigned int ext4_fc_count_snapshot_inodes(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	unsigned int nr_inodes = 0;
+	int alloc_ctx;
+
+	alloc_ctx = ext4_fc_lock(sb);
+	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list)
+		nr_inodes++;
+
+	list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) {
+		struct ext4_inode_info *ei;
+
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT)
+			continue;
+		if (list_empty(&fc_dentry->fcd_dilist))
+			continue;
+
+		/* See the comment in ext4_fc_commit_dentry_updates(). */
+		ei = list_first_entry(&fc_dentry->fcd_dilist,
+				      struct ext4_inode_info, i_fc_dilist);
+		if (!list_empty(&ei->i_fc_list))
+			continue;
+
+		nr_inodes++;
+	}
+	ext4_fc_unlock(sb, alloc_ctx);
+
+	return nr_inodes;
+}
+
+static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
+					 struct inode ***inodesp,
+					 unsigned int *nr_inodesp)
+{
+	unsigned int nr_inodes = ext4_fc_count_snapshot_inodes(sb);
+	struct inode **inodes;
+
+	*inodesp = NULL;
+	*nr_inodesp = 0;
+
+	if (!nr_inodes)
+		return 0;
+
+	if (nr_inodes > EXT4_FC_SNAPSHOT_MAX_INODES)
+		return -E2BIG;
+
+	inodes = kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS);
+	if (!inodes)
+		return -ENOMEM;
+
+	*inodesp = inodes;
+	*nr_inodesp = nr_inodes;
+	return 0;
+}
+
 static void ext4_fc_update_stats(struct super_block *sb, int status,
 				 u64 commit_time, int nblks, tid_t commit_tid)
 {
@@ -1476,7 +1566,10 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
 	ret = ext4_fc_perform_commit(journal);
 	if (ret < 0) {
-		status = EXT4_FC_STATUS_FAILED;
+		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
+			status = EXT4_FC_STATUS_INELIGIBLE;
+		else
+			status = EXT4_FC_STATUS_FAILED;
 		goto fallback;
 	}
 	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
@@ -1560,34 +1653,35 @@ static void ext4_fc_cleanup(journal_t *journal, int full, tid_t tid)
 
 	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
 		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
-					     struct ext4_fc_dentry_update,
-					     fcd_list);
+						 struct ext4_fc_dentry_update,
+						 fcd_list);
 		list_del_init(&fc_dentry->fcd_list);
 		if (fc_dentry->fcd_op == EXT4_FC_TAG_CREAT &&
-		    !list_empty(&fc_dentry->fcd_dilist)) {
+			!list_empty(&fc_dentry->fcd_dilist)) {
 			/* See the comment in ext4_fc_commit_dentry_updates(). */
 			ei = list_first_entry(&fc_dentry->fcd_dilist,
-					      struct ext4_inode_info,
-					      i_fc_dilist);
+						  struct ext4_inode_info,
+						  i_fc_dilist);
 			ext4_fc_free_inode_snap(&ei->vfs_inode);
 			spin_lock(&ei->i_fc_lock);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_REQUEUE);
+						   EXT4_STATE_FC_REQUEUE);
 			ext4_clear_inode_state(&ei->vfs_inode,
-					       EXT4_STATE_FC_COMMITTING);
+						   EXT4_STATE_FC_COMMITTING);
 			spin_unlock(&ei->i_fc_lock);
 			/*
 			 * Make sure clearing of EXT4_STATE_FC_COMMITTING is
-			 * visible before we send the wakeup. Pairs with implicit
-			 * barrier in prepare_to_wait() in ext4_fc_del().
+			 * visible before we send the wakeup. Pairs with
+			 * implicit barrier in prepare_to_wait() in
+			 * ext4_fc_del().
 			 */
 			smp_mb();
 #if (BITS_PER_LONG < 64)
 			wake_up_bit(&ei->i_state_flags,
-				    EXT4_STATE_FC_COMMITTING);
+					EXT4_STATE_FC_COMMITTING);
 #else
 			wake_up_bit(&ei->i_flags,
-				    EXT4_STATE_FC_COMMITTING);
+					EXT4_STATE_FC_COMMITTING);
 #endif
 		}
 		list_del_init(&fc_dentry->fcd_dilist);
@@ -2589,13 +2683,20 @@ int __init ext4_fc_init_dentry_cache(void)
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
 					   SLAB_RECLAIM_ACCOUNT);
 
-	if (ext4_fc_dentry_cachep == NULL)
+	if (!ext4_fc_dentry_cachep)
 		return -ENOMEM;
 
+	ext4_fc_range_cachep = KMEM_CACHE(ext4_fc_range, SLAB_RECLAIM_ACCOUNT);
+	if (!ext4_fc_range_cachep) {
+		kmem_cache_destroy(ext4_fc_dentry_cachep);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 void ext4_fc_destroy_dentry_cache(void)
 {
+	kmem_cache_destroy(ext4_fc_range_cachep);
 	kmem_cache_destroy(ext4_fc_dentry_cachep);
 }
-- 
2.53.0



* [RFC v6 6/7] ext4: fast commit: add lock_updates tracepoint
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-ext4, linux-kernel,
	linux-trace-kernel
  Cc: Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Drop explicit ext4_fc_snap_err assignments and rely on enum
  auto-increment.
- Treat locked_ns as trace-only in this patch and calculate it only when
  ext4_fc_lock_updates is enabled, as suggested by Steven Rostedt.
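
For readers unfamiliar with keeping an enum and its printable names in
step, the idea of listing the reasons once and expanding the list twice
can be sketched in plain C. This is only an illustration of the pattern:
REASON_LIST, AS_ENUM, AS_NAME and reason_name() are hypothetical names,
and the kernel tracing macros used by the patch work differently in
detail.

```c
/* Hypothetical reason list; each X(name) line is one enum member. */
#define REASON_LIST(X)		\
	X(NONE)			\
	X(ES_MISS)		\
	X(ES_DELAYED)		\
	X(NOMEM)

/* Expand once into enum members; values auto-increment from 0. */
#define AS_ENUM(a)	R_##a,
enum reason { REASON_LIST(AS_ENUM) R_COUNT };

/* Expand again into a name table indexed by the same values. */
#define AS_NAME(a)	[R_##a] = #a,
static const char *const reason_names[] = { REASON_LIST(AS_NAME) };

/* Map a value back to its name; out-of-range values get a fixed string. */
static const char *reason_name(int err)
{
	if (err < 0 || err >= R_COUNT)
		return "UNKNOWN";
	return reason_names[err];
}
```

Because both expansions come from one list, adding a reason at the end
keeps the existing numeric values unchanged, which is what tooling that
stores the raw numbers would rely on.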

 fs/ext4/ext4.h              | 15 ++++++++
 fs/ext4/fast_commit.c       | 74 +++++++++++++++++++++++++++++--------
 include/trace/events/ext4.h | 61 ++++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+), 15 deletions(-)
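
The "record the first snapshot failure reason" behaviour from the commit
message amounts to a write-once slot. A minimal user-space sketch,
assuming 0 means "nothing recorded yet"; the enum values and
first_reason_demo() are hypothetical and exist only for this example:

```c
#include <stddef.h>

/* Hypothetical reason codes; 0 must mean "no error recorded yet". */
enum snap_err { SNAP_ERR_NONE = 0, SNAP_ERR_ES_MISS, SNAP_ERR_NOMEM };

/* Store err only if the slot is still empty; a NULL slot means "not tracking". */
static void set_snap_err(int *slot, int err)
{
	if (slot && *slot == SNAP_ERR_NONE)
		*slot = err;
}

/* Two failures in a row: only the first should be reported. */
static int first_reason_demo(void)
{
	int reason = SNAP_ERR_NONE;

	set_snap_err(&reason, SNAP_ERR_ES_MISS);
	set_snap_err(&reason, SNAP_ERR_NOMEM);	/* ignored, slot already set */
	set_snap_err(NULL, SNAP_ERR_NOMEM);	/* ignored, caller opted out */
	return reason;
}
```

Keeping the first reason rather than the last means a later, secondary
failure on the same path cannot overwrite the root cause.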

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 13fe4fdf9bda..1ff6ea1bde3e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1028,6 +1028,21 @@ enum {
 
 struct ext4_fc_inode_snap;
 
+/*
+ * Snapshot failure reasons for ext4_fc_lock_updates tracepoint.
+ * Keep these stable for tooling.
+ */
+enum ext4_fc_snap_err {
+	EXT4_FC_SNAP_ERR_NONE = 0,
+	EXT4_FC_SNAP_ERR_ES_MISS,
+	EXT4_FC_SNAP_ERR_ES_DELAYED,
+	EXT4_FC_SNAP_ERR_ES_OTHER,
+	EXT4_FC_SNAP_ERR_INODES_CAP,
+	EXT4_FC_SNAP_ERR_RANGES_CAP,
+	EXT4_FC_SNAP_ERR_NOMEM,
+	EXT4_FC_SNAP_ERR_INODE_LOC,
+};
+
 /*
  * fourth extended file system inode data in memory
  */
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index ab751b855afa..6ac6ebe79d7b 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -193,6 +193,12 @@ static struct kmem_cache *ext4_fc_range_cachep;
 #define EXT4_FC_SNAPSHOT_MAX_INODES	1024
 #define EXT4_FC_SNAPSHOT_MAX_RANGES	2048
 
+static inline void ext4_fc_set_snap_err(int *snap_err, int err)
+{
+	if (snap_err && *snap_err == EXT4_FC_SNAP_ERR_NONE)
+		*snap_err = err;
+}
+
 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
 {
 	BUFFER_TRACE(bh, "");
@@ -983,11 +989,12 @@ static void ext4_fc_free_inode_snap(struct inode *inode)
 static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       struct list_head *ranges,
 				       unsigned int nr_ranges_total,
-				       unsigned int *nr_rangesp)
+				       unsigned int *nr_rangesp,
+				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned int nr_ranges = 0;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
+	unsigned int nr_ranges = 0;
 
 	spin_lock(&ei->i_fc_lock);
 	if (ei->i_fc_lblk_len == 0) {
@@ -1011,11 +1018,16 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		struct ext4_fc_range *range;
 		ext4_lblk_t len;
 
-		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL))
+		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
+		}
 
-		if (ext4_es_is_delayed(&es))
+		if (ext4_es_is_delayed(&es)) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
+		}
 
 		len = es.es_len - (cur_lblk - es.es_lblk);
 		if (len > end_lblk - cur_lblk + 1)
@@ -1025,12 +1037,17 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 			continue;
 		}
 
-		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES)
+		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
+		}
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
-		if (!range)
+		if (!range) {
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
+		}
 		nr_ranges++;
 
 		range->lblk = cur_lblk;
@@ -1055,6 +1072,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
 
@@ -1071,7 +1089,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int nr_ranges_total,
-				  unsigned int *nr_rangesp)
+				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_fc_inode_snap *snap;
@@ -1083,8 +1101,10 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	int alloc_ctx;
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
-	if (ret)
+	if (ret) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
+	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA))
 		inode_len = EXT4_INODE_SIZE(inode->i_sb);
@@ -1093,6 +1113,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
 	}
@@ -1103,7 +1124,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	brelse(iloc.bh);
 
 	ret = ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total,
-					  &nr_ranges);
+					  &nr_ranges, snap_err);
 	if (ret) {
 		kfree(snap);
 		ext4_fc_free_ranges(&ranges);
@@ -1204,7 +1225,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb,
 					 unsigned int *nr_inodesp);
 
 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
-				   unsigned int inodes_size)
+				   unsigned int inodes_size,
+				   unsigned int *nr_inodesp,
+				   unsigned int *nr_rangesp,
+				   int *snap_err)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1222,6 +1246,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1245,6 +1271,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			ext4_fc_set_snap_err(snap_err,
+					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
 			goto unlock;
 		}
@@ -1269,16 +1297,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 		unsigned int inode_ranges = 0;
 
 		ret = ext4_fc_snapshot_inode(inodes[idx], nr_ranges,
-					     &inode_ranges);
+					     &inode_ranges, snap_err);
 		if (ret)
 			break;
 		nr_ranges += inode_ranges;
 	}
 
+	if (nr_inodesp)
+		*nr_inodesp = i;
+	if (nr_rangesp)
+		*nr_rangesp = nr_ranges;
 	return ret;
 }
 
-static int ext4_fc_perform_commit(journal_t *journal)
+static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -1287,10 +1319,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	struct inode *inode;
 	struct inode **inodes;
 	unsigned int inodes_size;
+	unsigned int snap_inodes = 0;
+	unsigned int snap_ranges = 0;
+	int snap_err = EXT4_FC_SNAP_ERR_NONE;
 	struct blk_plug plug;
 	int ret = 0;
 	u32 crc = 0;
 	int alloc_ctx;
+	ktime_t lock_start;
+	u64 locked_ns;
 
 	/*
 	 * Step 1: Mark all inodes on s_fc_q[MAIN] with
@@ -1338,13 +1375,13 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	if (ret)
 		return ret;
 
-
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
 	if (ret)
 		return ret;
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
+	lock_start = ktime_get();
 	/*
 	 * The journal is now locked. No more handles can start and all the
 	 * previous handles are now drained. Snapshotting happens in this
@@ -1358,8 +1395,15 @@ static int ext4_fc_perform_commit(journal_t *journal)
 	}
 	ext4_fc_unlock(sb, alloc_ctx);
 
-	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size);
+	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
+				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
+	if (trace_ext4_fc_lock_updates_enabled()) {
+		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+		trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+					   snap_inodes, snap_ranges, ret,
+					   snap_err);
+	}
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -1564,7 +1608,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
 		journal_ioprio = EXT4_DEF_JOURNAL_IOPRIO;
 	set_task_ioprio(current, journal_ioprio);
 	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
-	ret = ext4_fc_perform_commit(journal);
+	ret = ext4_fc_perform_commit(journal, commit_tid);
 	if (ret < 0) {
 		if (ret == -EAGAIN || ret == -E2BIG || ret == -ECANCELED)
 			status = EXT4_FC_STATUS_INELIGIBLE;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f493642cf121..7028a28316fa 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT);
 TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
+#undef EM
+#undef EMe
+#define EM(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+#define EMe(a)	TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a);
+
+#define TRACE_SNAP_ERR						\
+	EM(NONE)						\
+	EM(ES_MISS)						\
+	EM(ES_DELAYED)						\
+	EM(ES_OTHER)						\
+	EM(INODES_CAP)						\
+	EM(RANGES_CAP)						\
+	EM(NOMEM)						\
+	EMe(INODE_LOC)
+
+TRACE_SNAP_ERR
+
+#undef EM
+#undef EMe
+
 #define show_fc_reason(reason)						\
 	__print_symbolic(reason,					\
 		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
@@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop,
 		  __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid)
 );
 
+#define EM(a)	{ EXT4_FC_SNAP_ERR_##a, #a },
+#define EMe(a)	{ EXT4_FC_SNAP_ERR_##a, #a }
+
+TRACE_EVENT(ext4_fc_lock_updates,
+	    TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns,
+		     unsigned int nr_inodes, unsigned int nr_ranges, int err,
+		     int snap_err),
+
+	TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err),
+
+	TP_STRUCT__entry(/* entry */
+		__field(dev_t, dev)
+		__field(tid_t, tid)
+		__field(u64, locked_ns)
+		__field(unsigned int, nr_inodes)
+		__field(unsigned int, nr_ranges)
+		__field(int, err)
+		__field(int, snap_err)
+	),
+
+	TP_fast_assign(/* assign */
+		__entry->dev = sb->s_dev;
+		__entry->tid = commit_tid;
+		__entry->locked_ns = locked_ns;
+		__entry->nr_inodes = nr_inodes;
+		__entry->nr_ranges = nr_ranges;
+		__entry->err = err;
+		__entry->snap_err = snap_err;
+	),
+
+	TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err %d snap_err %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
+		  __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges,
+		  __entry->err, __print_symbolic(__entry->snap_err,
+						 TRACE_SNAP_ERR))
+);
+
+#undef EM
+#undef EMe
+#undef TRACE_SNAP_ERR
+
 #define FC_REASON_NAME_STAT(reason)					\
 	show_fc_reason(reason),						\
 	__entry->fc_ineligible_rc[reason]
-- 
2.53.0


* [RFC v6 7/7] ext4: fast commit: export snapshot stats in fc_info
From: Li Chen @ 2026-04-08 11:20 UTC (permalink / raw)
  To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi, linux-ext4,
	linux-kernel
  Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel, Li Chen
In-Reply-To: <20260408112020.716706-1-me@linux.beauty>

Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.

Signed-off-by: Li Chen <me@linux.beauty>
---
Changes in v6:
- Start consuming locked_ns in fc_info, so this patch intentionally moves
  lock_updates_ns_{total,max,samples} accounting here.
- Guard the tracepoint call with trace_ext4_fc_lock_updates_enabled() and
  use trace_call__ext4_fc_lock_updates() to avoid the double static_branch
  at the guarded call site.
- Keep the stats unconditional while avoiding extra tracepoint
  overhead when ext4_fc_lock_updates is disabled.

 fs/ext4/ext4.h        | 31 +++++++++++++++++++
 fs/ext4/fast_commit.c | 72 +++++++++++++++++++++++++++++++++++++------
 fs/ext4/super.c       |  1 +
 3 files changed, 94 insertions(+), 10 deletions(-)
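
One plausible way to derive the average and maximum locked time from the
three lock_updates_ns_* counters is sketched below. The fc_info
formatting code is not part of this excerpt, so this is an assumption
about how such counters are typically consumed; struct lock_stats,
lock_stats_add() and lock_stats_avg() are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical mirror of the three lock_updates_ns_* counters. */
struct lock_stats {
	uint64_t ns_total;	/* sum of all samples */
	uint64_t ns_max;	/* largest single sample */
	uint64_t samples;	/* number of samples recorded */
};

/* Fold one measured updates-locked interval into the counters. */
static void lock_stats_add(struct lock_stats *s, uint64_t locked_ns)
{
	s->ns_total += locked_ns;
	if (locked_ns > s->ns_max)
		s->ns_max = locked_ns;
	s->samples++;
}

/* Average in nanoseconds; 0 when nothing has been sampled yet. */
static uint64_t lock_stats_avg(const struct lock_stats *s)
{
	return s->samples ? s->ns_total / s->samples : 0;
}
```

Storing the total and the sample count, rather than a running average,
keeps the update path to two additions and a compare; the division only
happens when the statistics are read.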

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1ff6ea1bde3e..c9ed7ceca982 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1554,6 +1554,36 @@ struct ext4_orphan_info {
 						 * file blocks */
 };
 
+/*
+ * Ext4 fast commit snapshot statistics.
+ *
+ * These are best-effort counters intended for debugging / performance
+ * introspection; they are not exact under concurrent updates.
+ */
+struct ext4_fc_snap_stats {
+	u64 lock_updates_ns_total;
+	u64 lock_updates_ns_max;
+	u64 lock_updates_samples;
+
+	u64 snap_inodes;
+	u64 snap_ranges;
+
+	u64 snap_fail_es_miss;
+	u64 snap_fail_es_delayed;
+	u64 snap_fail_es_other;
+
+	u64 snap_fail_inodes_cap;
+	u64 snap_fail_ranges_cap;
+	u64 snap_fail_nomem;
+	u64 snap_fail_inode_loc;
+
+	/*
+	 * Missing inode snapshots during log writing should never happen.
+	 * Keep this counter to help catch unexpected regressions.
+	 */
+	u64 snap_fail_no_snap;
+};
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1828,6 +1858,7 @@ struct ext4_sb_info {
 	struct mutex s_fc_lock;
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
+	struct ext4_fc_snap_stats s_fc_snap_stats;
 	tid_t s_fc_ineligible_tid;
 #ifdef CONFIG_EXT4_DEBUG
 	int s_fc_debug_max_replay;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 6ac6ebe79d7b..3c6ace2b0b94 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -890,13 +890,17 @@ static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
 	int inode_len;
 	int ret;
 
-	if (!snap)
+	if (!snap) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	src = snap->inode_buf;
 	inode_len = snap->inode_len;
-	if (!src || inode_len == 0)
+	if (!src || inode_len == 0) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
 	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
@@ -931,8 +935,10 @@ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
 	struct ext4_extent *ex;
 	struct ext4_fc_range *range;
 
-	if (!snap)
+	if (!snap) {
+		EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++;
 		return -ECANCELED;
+	}
 
 	list_for_each_entry(range, &snap->data_list, list) {
 		if (range->tag == EXT4_FC_TAG_DEL_RANGE) {
@@ -993,6 +999,8 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				       int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	ext4_lblk_t start_lblk, end_lblk, cur_lblk;
 	unsigned int nr_ranges = 0;
 
@@ -1019,11 +1027,13 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		ext4_lblk_t len;
 
 		if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) {
+			stats->snap_fail_es_miss++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS);
 			return -EAGAIN;
 		}
 
 		if (ext4_es_is_delayed(&es)) {
+			stats->snap_fail_es_delayed++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_ES_DELAYED);
 			return -EAGAIN;
@@ -1038,6 +1048,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 		}
 
 		if (nr_ranges_total + nr_ranges >= EXT4_FC_SNAPSHOT_MAX_RANGES) {
+			stats->snap_fail_ranges_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_RANGES_CAP);
 			return -E2BIG;
@@ -1045,6 +1056,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 
 		range = kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS);
 		if (!range) {
+			stats->snap_fail_nomem++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 			return -ENOMEM;
 		}
@@ -1072,6 +1084,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *inode,
 				range->len = max;
 		} else {
 			kmem_cache_free(ext4_fc_range_cachep, range);
+			stats->snap_fail_es_other++;
 			ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER);
 			return -EAGAIN;
 		}
@@ -1092,6 +1105,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 				  unsigned int *nr_rangesp, int *snap_err)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_fc_snap_stats *stats =
+		&EXT4_SB(inode->i_sb)->s_fc_snap_stats;
 	struct ext4_fc_inode_snap *snap;
 	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
 	struct ext4_iloc iloc;
@@ -1102,6 +1117,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	ret = ext4_get_inode_loc_noio(inode, &iloc);
 	if (ret) {
+		stats->snap_fail_inode_loc++;
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC);
 		return ret;
 	}
@@ -1113,6 +1129,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 
 	snap = kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS);
 	if (!snap) {
+		stats->snap_fail_nomem++;
 		ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM);
 		brelse(iloc.bh);
 		return -ENOMEM;
@@ -1137,6 +1154,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode,
 	list_splice_tail_init(&ranges, &snap->data_list);
 	ext4_fc_unlock(inode->i_sb, alloc_ctx);
 
+	stats->snap_inodes++;
+	stats->snap_ranges += nr_ranges;
 	if (nr_rangesp)
 		*nr_rangesp = nr_ranges;
 	return 0;
@@ -1246,6 +1265,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 	alloc_ctx = ext4_fc_lock(sb);
 	list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
 		if (i >= inodes_size) {
+			sbi->s_fc_snap_stats.snap_fail_inodes_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1271,6 +1291,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inodes,
 			continue;
 
 		if (i >= inodes_size) {
+			sbi->s_fc_snap_stats.snap_fail_inodes_cap++;
 			ext4_fc_set_snap_err(snap_err,
 					     EXT4_FC_SNAP_ERR_INODES_CAP);
 			ret = -E2BIG;
@@ -1314,6 +1335,7 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 {
 	struct super_block *sb = journal->j_private;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
 	struct ext4_inode_info *iter;
 	struct ext4_fc_head head;
 	struct inode *inode;
@@ -1376,8 +1398,13 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 		return ret;
 
 	ret = ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size);
-	if (ret)
+	if (ret) {
+		if (ret == -E2BIG)
+			snap_stats->snap_fail_inodes_cap++;
+		else if (ret == -ENOMEM)
+			snap_stats->snap_fail_nomem++;
 		return ret;
+	}
 
 	/* Step 4: Mark all inodes as being committed. */
 	jbd2_journal_lock_updates(journal);
@@ -1398,12 +1425,15 @@ static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid)
 	ret = ext4_fc_snapshot_inodes(journal, inodes, inodes_size,
 				      &snap_inodes, &snap_ranges, &snap_err);
 	jbd2_journal_unlock_updates(journal);
-	if (trace_ext4_fc_lock_updates_enabled()) {
-		locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
-		trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns,
-					   snap_inodes, snap_ranges, ret,
-					   snap_err);
-	}
+	locked_ns = ktime_to_ns(ktime_sub(ktime_get(), lock_start));
+	snap_stats->lock_updates_ns_total += locked_ns;
+	snap_stats->lock_updates_samples++;
+	if (locked_ns > snap_stats->lock_updates_ns_max)
+		snap_stats->lock_updates_ns_max = locked_ns;
+	if (trace_ext4_fc_lock_updates_enabled())
+		trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns,
+						 snap_inodes, snap_ranges,
+						 ret, snap_err);
 	kvfree(inodes);
 	if (ret)
 		return ret;
@@ -2704,11 +2734,17 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 {
 	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
 	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	struct ext4_fc_snap_stats *snap_stats = &sbi->s_fc_snap_stats;
+	u64 lock_avg_ns = 0;
 	int i;
 
 	if (v != SEQ_START_TOKEN)
 		return 0;
 
+	if (snap_stats->lock_updates_samples)
+		lock_avg_ns = div_u64(snap_stats->lock_updates_ns_total,
+				      snap_stats->lock_updates_samples);
+
 	seq_printf(seq,
 		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
 		   stats->fc_num_commits, stats->fc_ineligible_commits,
@@ -2719,6 +2755,22 @@ int ext4_fc_info_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
 			stats->fc_ineligible_reason_count[i]);
 
+	seq_printf(seq,
+		   "Snapshot stats:\n%llu inodes\n%llu ranges\n%lluus lock_updates_avg\n%lluus lock_updates_max\n",
+		   snap_stats->snap_inodes, snap_stats->snap_ranges,
+		   div_u64(lock_avg_ns, 1000),
+		   div_u64(snap_stats->lock_updates_ns_max, 1000));
+	seq_printf(seq,
+		   "Snapshot failures:\n%llu es_miss\n%llu es_delayed\n%llu es_other\n%llu inodes_cap\n%llu ranges_cap\n%llu nomem\n%llu inode_loc\n%llu no_snap\n",
+		   snap_stats->snap_fail_es_miss,
+		   snap_stats->snap_fail_es_delayed,
+		   snap_stats->snap_fail_es_other,
+		   snap_stats->snap_fail_inodes_cap,
+		   snap_stats->snap_fail_ranges_cap,
+		   snap_stats->snap_fail_nomem,
+		   snap_stats->snap_fail_inode_loc,
+		   snap_stats->snap_fail_no_snap);
+
 	return 0;
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 286f05834900..9ae68a223ea6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4538,6 +4538,7 @@ static void ext4_fast_commit_init(struct super_block *sb)
 	sbi->s_fc_ineligible_tid = 0;
 	mutex_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	memset(&sbi->s_fc_snap_stats, 0, sizeof(sbi->s_fc_snap_stats));
 	sbi->s_fc_replay_state.fc_regions = NULL;
 	sbi->s_fc_replay_state.fc_regions_size = 0;
 	sbi->s_fc_replay_state.fc_regions_used = 0;
-- 
2.53.0
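
For readers following the accounting in the diff above, here is a minimal
user-space sketch of how the lock_updates_* counters are accumulated and how
the average is derived for the info file. The struct and function names
(lock_stats, lock_stats_record, lock_stats_avg_us) are hypothetical and exist
only for illustration; the patch itself updates fields of
struct ext4_fc_snap_stats directly and uses div_u64() rather than plain
division.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the lock_updates_* fields in the patch. */
struct lock_stats {
	uint64_t ns_total;	/* sum of all samples, in nanoseconds */
	uint64_t samples;	/* number of samples recorded */
	uint64_t ns_max;	/* largest single sample, in nanoseconds */
};

/* Record one lock-hold duration: bump the total, the count and the max. */
static void lock_stats_record(struct lock_stats *s, uint64_t locked_ns)
{
	assert(s != NULL);
	s->ns_total += locked_ns;
	s->samples++;
	if (locked_ns > s->ns_max)
		s->ns_max = locked_ns;
}

/*
 * Average hold time in microseconds. Returns 0 when nothing has been
 * recorded yet, which is the same divide-by-zero guard the patch applies
 * before computing lock_avg_ns.
 */
static uint64_t lock_stats_avg_us(const struct lock_stats *s)
{
	uint64_t avg_ns = 0;

	assert(s != NULL);
	if (s->samples)
		avg_ns = s->ns_total / s->samples;
	return avg_ns / 1000;
}
```

Recording 2000 ns and then 4000 ns gives two samples, a maximum of 4000 ns
and a reported average of 3 us.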

^ permalink raw reply related

* BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Russell King (Oracle) @ 2026-04-08 13:07 UTC (permalink / raw)
  To: netdev, linux-arm-kernel, linux-kernel, iommu, linux-ext4,
	Linus Torvalds
  Cc: Marek Szyprowski, Robin Murphy, Theodore Ts'o, Andreas Dilger

Hi,

Just a heads-up that current net-next (v7.0-rc6 based) fails to boot on
my nVidia Jetson Xavier platform. v7.0-rc5 and v6.14 based net-next both
boot fine. This is an arm64 platform.

The problem appears to be completely random in terms of its symptoms,
and looks like severe memory corruption - every boot seems to produce
a different problem. The common theme is, although the kernel gets to
userspace, it never gets anywhere close to a login prompt before
failing in some way.

The last net-next+ boot (which is currently v7.0-rc6 based) resulted
in:

tegra-mc 2c00000.memory-controller: xusb_hostw: secure write @0x00000003ffffff00: VPR violation ((null))
...
irq 91: nobody cared (try booting with the "irqpoll" option)
...
depmod: ERROR: could not open directory /lib/modules/7.0.0-rc6-net-next+: No such file or directory
...
Unable to handle kernel paging request at virtual address 0003201fd50320cf


A previous boot of the exact same kernel didn't oops, but was unable
to find the block device to mount for /mnt via block UUID.

A previous boot to that resulted in an oops.


The interesting thing is - the depmod error above is incorrect:

root@tegra-ubuntu:~# ls -ld /lib/modules/7.0.0-rc6-net-next+
drwxrwxr-x 3 root root 4096 Apr  8 10:23 /lib/modules/7.0.0-rc6-net-next+

The directory is definitely there, and is readable - checked after
booting back into net-next based on 7.0-rc5. In some of these boots,
stmmac hasn't probed yet, which rules out my changes.

Rootfs is ext4, and it seems there were a lot of ext4 commits merged
between rc5 and rc6, but nothing for rc7.

My current net-next head is dfecb0c5af3b. Merging rc7 on top also
fails, I suspect also randomly, with that I just got:

EXT4-fs (mmcblk0p1): VFS: Can't find ext4 filesystem
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/mmcblk0p1, missing codepage or helper program, or other error.
mount: /mnt/: can't find PARTUUID=741c0777-391a-4bce-a222-455e180ece2a.
Unable to handle kernel paging request at virtual address f9bf0011ac0fb893
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[f9bf0011ac0fb893] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1]  SMP
Modules linked in:
CPU: 1 UID: 0 PID: 936 Comm: mount Not tainted 7.0.0-rc7-net-next+ #649 PREEMPT
Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : refill_objects+0x298/0x5ec
lr : refill_objects+0x1f0/0x5ec

...

Call trace:
 refill_objects+0x298/0x5ec (P)
 __pcs_replace_empty_main+0x13c/0x3a8
 kmem_cache_alloc_noprof+0x324/0x3a0
 alloc_iova+0x3c/0x290
 alloc_iova_fast+0x168/0x2d4
 iommu_dma_alloc_iova+0x84/0x154
 iommu_dma_map_sg+0x2c4/0x538
 __dma_map_sg_attrs+0x124/0x2c0
 dma_map_sg_attrs+0x10/0x20
 sdhci_pre_dma_transfer+0xb8/0x164
 sdhci_pre_req+0x38/0x44
 mmc_blk_mq_issue_rq+0x3dc/0x920
 mmc_mq_queue_rq+0x104/0x2b0
 __blk_mq_issue_directly+0x38/0xb0
 blk_mq_request_issue_directly+0x54/0xb4
 blk_mq_issue_direct+0x84/0x180
 blk_mq_dispatch_queue_requests+0x1a8/0x2e0
 blk_mq_flush_plug_list+0x60/0x140
 __blk_flush_plug+0xe0/0x11c
 blk_finish_plug+0x38/0x4c
 read_pages+0x158/0x260
 page_cache_ra_unbounded+0x158/0x3e0
 force_page_cache_ra+0xb0/0xe4
 page_cache_sync_ra+0x88/0x480
 filemap_get_pages+0xd8/0x850
 filemap_read+0xdc/0x3d8
 blkdev_read_iter+0x84/0x198
 vfs_read+0x208/0x2d8
 ksys_read+0x58/0xf4
 __arm64_sys_read+0x1c/0x28
 invoke_syscall.constprop.0+0x50/0xe0
 do_el0_svc+0x40/0xc0
 el0_svc+0x48/0x2a0
 el0t_64_sync_handler+0xa0/0xe4
 el0t_64_sync+0x19c/0x1a0
Code: 54000189 f9000022 aa0203e4 b9402ae3 (f8634840)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
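
As an aid to reading the "Mem abort info" lines above, here is a small sketch
of how the ESR value splits into the fields the kernel prints. The bit
positions (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], and the
fault status code in ISS bits [5:0] for aborts) follow the Arm architecture
definition of ESR_ELx; the struct and function names are hypothetical and
are not taken from the kernel sources.

```c
#include <stdint.h>

/* Hypothetical container for the ESR_ELx fields shown in the oops. */
struct esr_fields {
	unsigned int ec;	/* exception class, ESR bits [31:26] */
	unsigned int il;	/* instruction length bit, ESR bit 25 */
	uint32_t iss;		/* instruction specific syndrome, bits [24:0] */
	unsigned int fsc;	/* fault status code, ISS bits [5:0] */
};

/* Split a raw ESR value into the fields above using shifts and masks. */
static struct esr_fields esr_decode(uint64_t esr)
{
	struct esr_fields f;

	f.ec = (unsigned int)((esr >> 26) & 0x3f);
	f.il = (unsigned int)((esr >> 25) & 0x1);
	f.iss = (uint32_t)(esr & 0x1ffffff);
	f.fsc = f.iss & 0x3f;
	return f;
}
```

Feeding it the 0x96000004 from the log yields EC = 0x25, IL = 1 (32 bits),
ISS = 0x4 and FSC = 0x04, matching the lines printed by the kernel.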

Looking at the changes between rc5 and rc6, there's one drivers/block
change for zram (which is used on this platform), one change in
drivers/base for regmap, nothing for drivers/mmc, but plenty for
fs/ext4. There are five DMA API changes.

Now building straight -rc7. If that also fails, my plan is to start
bisecting rc5..rc6, which will likely take most of the rest of the
day. So, in the meantime I'm sending this as a heads-up that rc6
and onwards has a problem.

I'll update when I have a potential commit located.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [RFC PATCH v1 0/6] provenance_time (ptime): a new settable timestamp for cross-filesystem provenance
From: Theodore Tso @ 2026-04-08 13:33 UTC (permalink / raw)
  To: Sean Smith
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, dsterba, david, brauner,
	osandov, almaz, hirofumi, linkinjeon
In-Reply-To: <92e61267-eb24-4f94-b9a1-e009b5e00d65@gmail.com>

On Tue, Apr 07, 2026 at 09:54:17PM -0500, Sean Smith wrote:
> Finding an alternative to using rename() to transfer ptime between
> inodes during atomic saves seems beyond the scope of what I can address
> as someone who relies upon AI agents to review and modify code....
>
> It also doesn't solve the immediate needs of increasing number of users
> who are trying to ditch Windows for Linux. Windows 11 has pushed one too
> many people too far, and they, like me, have had enough.

I understand that you have one goal which seems more important than
anything else (to you).  But consider what might happen if you ask an
AI, "How can we solve the paper clip shortage problem?", or "How do we
carry out regime change in Iran?"  AI models which don't consider
second-order effects (or which are used by people whose priorities
differ from everyone else's) can produce... suboptimal results.

And this is why it's important to have humans in the loop instead of
blindly trusting AI models.

I'd also suggest that you consider the value of patience.  Linux is 35
years old as of this writing.  One of the reasons why Linux has stuck
around this long is because we take the long view.  Sure, it might
take longer to shift the ecosystem to use some new interface or new
feature; but everyone will have to live with muddled interface
semantics *forever*.

> The need for ptime is very real....

It's important for *you*.  But the vast majority of Linux users are
not Windows refugees.  (Ask your AI models to explain the significance
of the phrase, "This is the year of the Linux Desktop".)  Even for
Windows users, it is not at all clear that Windows-style File Creation
time is that important for those users.  MacOS doesn't have this
Windows-style timestamp support, and it hasn't stopped many users from
switching from Windows to MacOS.

> ... and the code in my patch gets the job done...

But it doesn't.  Bastardizing the semantics of the rename interface
doesn't completely solve the problem you've articulated.  In
particular, all of the userspace programs which need to create new
files --- tar file extraction, unzip, file copying, etc., still need
to be changed.

This will require changing userspace applications.  So why not use
that approach to address your problem statement?

> I can patch every application I use which is open-source, or I can patch
> the kernel. Rational analysis requires that I patch the kernel.

That's certainly your prerogative, and that's the beauty of Open
Source.  You're free to patch the software that you use on your
system.

Cheers,

						- Ted


^ permalink raw reply

* Re: [PATCH 0/3] show orphan file inode detail info
From: Jan Kara @ 2026-04-08 13:36 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jan Kara, Ye Bin, adilger.kernel, linux-ext4, linux-fsdevel
In-Reply-To: <20260407202845.GA38246@macsyma-wired.lan>

On Tue 07-04-26 16:28:45, Theodore Tso wrote:
> On Tue, Apr 07, 2026 at 12:29:23PM +0200, Jan Kara wrote:
> > I agree listing orphan inodes for a superblock is useful and the usefulness
> > could actually go beyond ext4. I imagine the very same problem is there for
> > XFS or btrfs so perhaps we could think for a while whether we can provide
> > an interface that wouldn't be ext4 specific? Perhaps an ioctl
> > (GET_ORPHAN_FILES) that would return an fd and reading from that fd would
> > return entries for orphan inodes?
> 
> I'm really not a fan of ioctl's returning a fd, but that does seem to
> be a thing these days, for better or for worse, and I agree that
> having a portable solution that works across multiple file systems
> would be a good thing.

Yes, an ioctl returning an fd isn't great, but frankly a file in /proc
looks even worse to me...

> > Also regarding information reported about orphan inodes - wouldn't it be
> > a better interface to just return a list of file handles? Userspace can
> > then do whatever it needs with them - open, statx, calling ioctl, etc -
> > so we thwart feature creep with people asking us to add more information
> > to the interface. This also offloads a lot of security questions about
> > the interface to appropriate syscalls. So overall it looks like a win to
> > me.
> 
> The problem with using a file handle is that the only way to get the
> pathname is to open the file handle, and then call readlink on
> /proc/self/fd/NN.

Right, which is pretty standard I'd say.

> And inodes on the orphan inode list have been unlinked, so we don't want
> to allow people to be able to open them.

Why? You can reopen unlinked files using magic links in /proc or file
handles just fine today (I just tested this, unless I'm missing something
in the code). Only once the inode is really deleted can you no longer open
it using the handle.
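
A minimal sketch of the magic-link case, assuming a Linux system with /proc
mounted and a writable /tmp. The function name reopen_unlinked and the
temporary file name are hypothetical. The file-handle variant is not shown;
open_by_handle_at(2) needs CAP_DAC_READ_SEARCH, so it is less convenient for
a quick test.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it while keeping the fd open, then
 * reopen it through the /proc/self/fd magic link. Returns the number of
 * bytes read back through the reopened fd, or -1 on any failure.
 */
static ssize_t reopen_unlinked(const char *msg, char *out, size_t outlen)
{
	char path[] = "/tmp/orphan-demo-XXXXXX";	/* hypothetical name */
	char link[64];
	size_t len = strlen(msg);
	ssize_t n = -1;
	int fd, fd2;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (unlink(path) != 0)		/* link count drops to zero here */
		goto out;
	if (write(fd, msg, len) != (ssize_t)len)
		goto out;

	/* Opening the magic link gives a fresh open file description. */
	snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
	fd2 = open(link, O_RDONLY);
	if (fd2 < 0)
		goto out;
	n = read(fd2, out, outlen);	/* new description starts at offset 0 */
	close(fd2);
out:
	close(fd);
	return n;
}
```

The second open() succeeds even though the name is gone, because the inode
stays alive for as long as the first fd is held open.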

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Russell King (Oracle) @ 2026-04-08 13:59 UTC (permalink / raw)
  To: netdev, linux-arm-kernel, linux-kernel, iommu, linux-ext4,
	Linus Torvalds
  Cc: Marek Szyprowski, Robin Murphy, Theodore Ts'o, Andreas Dilger
In-Reply-To: <adZTGOjjJrVJOcT8@shell.armlinux.org.uk>

On Wed, Apr 08, 2026 at 02:07:36PM +0100, Russell King (Oracle) wrote:
> Hi,
> 
> Just a heads-up that current net-next (v7.0-rc6 based) fails to boot on
> my nVidia Jetson Xavier platform. v7.0-rc5 and v6.14 based net-next both
> boot fine. This is an arm64 platform.
> 
> The problem appears to be completely random in terms of its symptoms,
> and looks like severe memory corruption - every boot seems to produce
> a different problem. The common theme is, although the kernel gets to
> userspace, it never gets anywhere close to a login prompt before
> failing in some way.
> 
> The last net-next+ boot (which is currently v7.0-rc6 based) resulted
> in:
> 
> tegra-mc 2c00000.memory-controller: xusb_hostw: secure write @0x00000003ffffff00: VPR violation ((null))
> ...
> irq 91: nobody cared (try booting with the "irqpoll" option)
> ...
> depmod: ERROR: could not open directory /lib/modules/7.0.0-rc6-net-next+: No such file or directory
> ...
> Unable to handle kernel paging request at virtual address 0003201fd50320cf
> 
> 
> A previous boot of the exact same kernel didn't oops, but was unable
> to find the block device to mount for /mnt via block UUID.
> 
> A previous boot to that resulted in an oops.
> 
> 
> The interesting thing is - the depmod error above is incorrect:
> 
> root@tegra-ubuntu:~# ls -ld /lib/modules/7.0.0-rc6-net-next+
> drwxrwxr-x 3 root root 4096 Apr  8 10:23 /lib/modules/7.0.0-rc6-net-next+
> 
> The directory is definitely there, and is readable - checked after
> booting back into net-next based on 7.0-rc5. In some of these boots,
> stmmac hasn't probed yet, which rules out my changes.
> 
> Rootfs is ext4, and it seems there were a lot of ext4 commits merged
> between rc5 and rc6, but nothing for rc7.
> 
> My current net-next head is dfecb0c5af3b. Merging rc7 on top also
> fails, I suspect also randomly, with that I just got:
> 
> EXT4-fs (mmcblk0p1): VFS: Can't find ext4 filesystem
> mount: /mnt: wrong fs type, bad option, bad superblock on /dev/mmcblk0p1, missing codepage or helper program, or other error.
> mount: /mnt/: can't find PARTUUID=741c0777-391a-4bce-a222-455e180ece2a.
> Unable to handle kernel paging request at virtual address f9bf0011ac0fb893
> Mem abort info:
>   ESR = 0x0000000096000004
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
>   FSC = 0x04: level 0 translation fault
> Data abort info:
>   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [f9bf0011ac0fb893] address between user and kernel address ranges
> Internal error: Oops: 0000000096000004 [#1]  SMP
> Modules linked in:
> CPU: 1 UID: 0 PID: 936 Comm: mount Not tainted 7.0.0-rc7-net-next+ #649 PREEMPT
> Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
> pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : refill_objects+0x298/0x5ec
> lr : refill_objects+0x1f0/0x5ec
> 
> ...
> 
> Call trace:
>  refill_objects+0x298/0x5ec (P)
>  __pcs_replace_empty_main+0x13c/0x3a8
>  kmem_cache_alloc_noprof+0x324/0x3a0
>  alloc_iova+0x3c/0x290
>  alloc_iova_fast+0x168/0x2d4
>  iommu_dma_alloc_iova+0x84/0x154
>  iommu_dma_map_sg+0x2c4/0x538
>  __dma_map_sg_attrs+0x124/0x2c0
>  dma_map_sg_attrs+0x10/0x20
>  sdhci_pre_dma_transfer+0xb8/0x164
>  sdhci_pre_req+0x38/0x44
>  mmc_blk_mq_issue_rq+0x3dc/0x920
>  mmc_mq_queue_rq+0x104/0x2b0
>  __blk_mq_issue_directly+0x38/0xb0
>  blk_mq_request_issue_directly+0x54/0xb4
>  blk_mq_issue_direct+0x84/0x180
>  blk_mq_dispatch_queue_requests+0x1a8/0x2e0
>  blk_mq_flush_plug_list+0x60/0x140
>  __blk_flush_plug+0xe0/0x11c
>  blk_finish_plug+0x38/0x4c
>  read_pages+0x158/0x260
>  page_cache_ra_unbounded+0x158/0x3e0
>  force_page_cache_ra+0xb0/0xe4
>  page_cache_sync_ra+0x88/0x480
>  filemap_get_pages+0xd8/0x850
>  filemap_read+0xdc/0x3d8
>  blkdev_read_iter+0x84/0x198
>  vfs_read+0x208/0x2d8
>  ksys_read+0x58/0xf4
>  __arm64_sys_read+0x1c/0x28
>  invoke_syscall.constprop.0+0x50/0xe0
>  do_el0_svc+0x40/0xc0
>  el0_svc+0x48/0x2a0
>  el0t_64_sync_handler+0xa0/0xe4
>  el0t_64_sync+0x19c/0x1a0
> Code: 54000189 f9000022 aa0203e4 b9402ae3 (f8634840)
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Oops: Fatal exception
> 
> Looking at the changes between rc5 and rc6, there's one drivers/block
> change for zram (which is used on this platform), one change in
> drivers/base for regmap, nothing for drivers/mmc, but plenty for
> fs/ext4. There are five DMA API changes.
> 
> Now building straight -rc7. If that also fails, my plan is to start
> bisecting rc5..rc6, which will likely take most of the rest of the
> day. So, in the meantime I'm sending this as a heads-up that rc6
> and onwards has a problem.

Plain -rc7 fails (another random oops):

Root device found: PARTUUID=741c0777-391a-4bce-a222-455e180ece2a
depmod: ERROR: could not open directory /lib/modules/7.0.0-rc7-net-next+: No such file or directory
depmod: FATAL: could not search modules: No such file or directory
usb 2-3: new SuperSpeed Plus Gen 2x1 USB device number 2 using tegra-xusb
hub 2-3:1.0: USB hub found
hub 2-3:1.0: 4 ports detected
usb 1-3: new full-speed USB device number 3 using tegra-xusb
Unable to handle kernel paging request at virtual address 0003201fd50320cf
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
  CM = 0, WnR = 0, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[0003201fd50320cf] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1]  SMP
Modules linked in:
CPU: 1 UID: 0 PID: 917 Comm: mount Not tainted 7.0.0-rc7-net-next+ #649 PREEMPT
Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : refill_objects+0x298/0x5ec
lr : refill_objects+0x1f0/0x5ec
sp : ffff80008606b500
x29: ffff80008606b500 x28: 0000000000000001 x27: fffffdffc20e6200
x26: 0000000000000006 x25: 0000000000000000 x24: 000000000000003c
x23: ffff0000809e4840 x22: ffff0000809dba00 x21: ffff80008606b5a0
x20: ffff000081133820 x19: fffffdffc20e6220 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000100 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: ffff800081e5faa8
x11: ffff800082192c70 x10: ffff8000814074dc x9 : 0000000000000050
x8 : ffff80008606b490 x7 : ffff000083988b40 x6 : ffff80008606b4a0
x5 : 000000080015000f x4 : d503201fd503201f x3 : 00000000000000b0
x2 : d503201fd503201f x1 : ffff000081133828 x0 : d503201fd503201f
Call trace:
 refill_objects+0x298/0x5ec (P)
 __pcs_replace_empty_main+0x13c/0x3a8
 kmem_cache_alloc_noprof+0x324/0x3a0
 mempool_alloc_slab+0x1c/0x28
 mempool_alloc_noprof+0x98/0xe0
 bio_alloc_bioset+0x160/0x3e0
 do_mpage_readpage+0x3d0/0x618
 mpage_readahead+0xb8/0x144
 blkdev_readahead+0x18/0x24
 read_pages+0x58/0x260
 page_cache_ra_unbounded+0x158/0x3e0
 force_page_cache_ra+0xb0/0xe4
 page_cache_sync_ra+0x88/0x480
 filemap_get_pages+0xd8/0x850
 filemap_read+0xdc/0x3d8
 blkdev_read_iter+0x84/0x198
 vfs_read+0x208/0x2d8
 ksys_read+0x58/0xf4
 __arm64_sys_read+0x1c/0x28
 invoke_syscall.constprop.0+0x50/0xe0
 do_el0_svc+0x40/0xc0
 el0_svc+0x48/0x2a0
 el0t_64_sync_handler+0xa0/0xe4
 el0t_64_sync+0x19c/0x1a0
Code: 54000189 f9000022 aa0203e4 b9402ae3 (f8634840)
---[ end trace 0000000000000000 ]---

Now starting the bisect between 7.0-rc5 and 7.0-rc6.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Linus Torvalds @ 2026-04-08 15:22 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: netdev, linux-arm-kernel, linux-kernel, iommu, linux-ext4,
	Marek Szyprowski, Robin Murphy, Theodore Ts'o, Andreas Dilger
In-Reply-To: <adZfTi3R6jtsjXx-@shell.armlinux.org.uk>

On Wed, 8 Apr 2026 at 06:59, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> > Now building straight -rc7. If that also fails, my plan is to start
> > bisecting rc5..rc6, which will likely take most of the rest of the
> > day. So, in the meantime I'm sending this as a heads-up that rc6
> > and onwards has a problem.
>
> Plain -rc7 fails (another random oops):
>
> Now starting the bisect between 7.0-rc5 and 7.0-rc6.

Thanks. Not what I wanted to hear at this point, but a bisect should
get the culprit if this is at least sufficiently repeatable.

The exact symptoms and oops details may be random, but hopefully the
"something bad happens" is reliable enough to bisect.

              Linus

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Russell King (Oracle) @ 2026-04-08 16:08 UTC (permalink / raw)
  To: netdev, linux-arm-kernel, linux-kernel, iommu, linux-ext4,
	Linus Torvalds, dmaengine
  Cc: Marek Szyprowski, Robin Murphy, Theodore Ts'o, Andreas Dilger,
	Vinod Koul, Frank Li
In-Reply-To: <adZfTi3R6jtsjXx-@shell.armlinux.org.uk>

On Wed, Apr 08, 2026 at 02:59:42PM +0100, Russell King (Oracle) wrote:
> On Wed, Apr 08, 2026 at 02:07:36PM +0100, Russell King (Oracle) wrote:
> > Hi,
> > 
> > Just a heads-up that current net-next (v7.0-rc6 based) fails to boot on
> > my nVidia Jetson Xavier platform. v7.0-rc5 and v6.14 based net-next both
> > boot fine. This is an arm64 platform.
> > 
> > The problem appears to be completely random in terms of its symptoms,
> > and looks like severe memory corruption - every boot seems to produce
> > a different problem. The common theme is, although the kernel gets to
> > userspace, it never gets anywhere close to a login prompt before
> > failing in some way.
> > 
> > The last net-next+ boot (which is currently v7.0-rc6 based) resulted
> > in:
> > 
> > tegra-mc 2c00000.memory-controller: xusb_hostw: secure write @0x00000003ffffff00: VPR violation ((null))
> > ...
> > irq 91: nobody cared (try booting with the "irqpoll" option)
> > ...
> > depmod: ERROR: could not open directory /lib/modules/7.0.0-rc6-net-next+: No such file or directory
> > ...
> > Unable to handle kernel paging request at virtual address 0003201fd50320cf
> > 
> > 
> > A previous boot of the exact same kernel didn't oops, but was unable
> > to find the block device to mount for /mnt via block UUID.
> > 
> > A previous boot to that resulted in an oops.
> > 
> > 
> > The interesting thing is - the depmod error above is incorrect:
> > 
> > root@tegra-ubuntu:~# ls -ld /lib/modules/7.0.0-rc6-net-next+
> > drwxrwxr-x 3 root root 4096 Apr  8 10:23 /lib/modules/7.0.0-rc6-net-next+
> > 
> > The directory is definitely there, and is readable - checked after
> > booting back into net-next based on 7.0-rc5. In some of these boots,
> > stmmac hasn't probed yet, which rules out my changes.
> > 
> > Rootfs is ext4, and it seems there were a lot of ext4 commits merged
> > between rc5 and rc6, but nothing for rc7.
> > 
> > My current net-next head is dfecb0c5af3b. Merging rc7 on top also
> > fails, I suspect also randomly, with that I just got:
> > 
> > EXT4-fs (mmcblk0p1): VFS: Can't find ext4 filesystem
> > mount: /mnt: wrong fs type, bad option, bad superblock on /dev/mmcblk0p1, missing codepage or helper program, or other error.
> > mount: /mnt/: can't find PARTUUID=741c0777-391a-4bce-a222-455e180ece2a.
> > Unable to handle kernel paging request at virtual address f9bf0011ac0fb893
> > Mem abort info:
> >   ESR = 0x0000000096000004
> >   EC = 0x25: DABT (current EL), IL = 32 bits
> >   SET = 0, FnV = 0
> >   EA = 0, S1PTW = 0
> >   FSC = 0x04: level 0 translation fault
> > Data abort info:
> >   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> >   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> >   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [f9bf0011ac0fb893] address between user and kernel address ranges
> > Internal error: Oops: 0000000096000004 [#1]  SMP
> > Modules linked in:
> > CPU: 1 UID: 0 PID: 936 Comm: mount Not tainted 7.0.0-rc7-net-next+ #649 PREEMPT
> > Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
> > pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > pc : refill_objects+0x298/0x5ec
> > lr : refill_objects+0x1f0/0x5ec
> > 
> > ...
> > 
> > Call trace:
> >  refill_objects+0x298/0x5ec (P)
> >  __pcs_replace_empty_main+0x13c/0x3a8
> >  kmem_cache_alloc_noprof+0x324/0x3a0
> >  alloc_iova+0x3c/0x290
> >  alloc_iova_fast+0x168/0x2d4
> >  iommu_dma_alloc_iova+0x84/0x154
> >  iommu_dma_map_sg+0x2c4/0x538
> >  __dma_map_sg_attrs+0x124/0x2c0
> >  dma_map_sg_attrs+0x10/0x20
> >  sdhci_pre_dma_transfer+0xb8/0x164
> >  sdhci_pre_req+0x38/0x44
> >  mmc_blk_mq_issue_rq+0x3dc/0x920
> >  mmc_mq_queue_rq+0x104/0x2b0
> >  __blk_mq_issue_directly+0x38/0xb0
> >  blk_mq_request_issue_directly+0x54/0xb4
> >  blk_mq_issue_direct+0x84/0x180
> >  blk_mq_dispatch_queue_requests+0x1a8/0x2e0
> >  blk_mq_flush_plug_list+0x60/0x140
> >  __blk_flush_plug+0xe0/0x11c
> >  blk_finish_plug+0x38/0x4c
> >  read_pages+0x158/0x260
> >  page_cache_ra_unbounded+0x158/0x3e0
> >  force_page_cache_ra+0xb0/0xe4
> >  page_cache_sync_ra+0x88/0x480
> >  filemap_get_pages+0xd8/0x850
> >  filemap_read+0xdc/0x3d8
> >  blkdev_read_iter+0x84/0x198
> >  vfs_read+0x208/0x2d8
> >  ksys_read+0x58/0xf4
> >  __arm64_sys_read+0x1c/0x28
> >  invoke_syscall.constprop.0+0x50/0xe0
> >  do_el0_svc+0x40/0xc0
> >  el0_svc+0x48/0x2a0
> >  el0t_64_sync_handler+0xa0/0xe4
> >  el0t_64_sync+0x19c/0x1a0
> > Code: 54000189 f9000022 aa0203e4 b9402ae3 (f8634840)
> > ---[ end trace 0000000000000000 ]---
> > Kernel panic - not syncing: Oops: Fatal exception
> > 
> > Looking at the changes between rc5 and rc6, there's one drivers/block
> > change for zram (which is used on this platform), one change in
> > drivers/base for regmap, nothing for drivers/mmc, but plenty for
> > fs/ext4. There are five DMA API changes.
> > 
> > Now building straight -rc7. If that also fails, my plan is to start
> > bisecting rc5..rc6, which will likely take most of the rest of the
> > day. So, in the meantime I'm sending this as a heads-up that rc6
> > and onwards has a problem.
> 
> Plain -rc7 fails (another random oops):
> 
> Root device found: PARTUUID=741c0777-391a-4bce-a222-455e180ece2a
> depmod: ERROR: could not open directory /lib/modules/7.0.0-rc7-net-next+: No such file or directory
> depmod: FATAL: could not search modules: No such file or directory
> usb 2-3: new SuperSpeed Plus Gen 2x1 USB device number 2 using tegra-xusb
> hub 2-3:1.0: USB hub found
> hub 2-3:1.0: 4 ports detected
> usb 1-3: new full-speed USB device number 3 using tegra-xusb
> Unable to handle kernel paging request at virtual address 0003201fd50320cf
> Mem abort info:
>   ESR = 0x0000000096000004
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
>   FSC = 0x04: level 0 translation fault
> Data abort info:
>   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [0003201fd50320cf] address between user and kernel address ranges
> Internal error: Oops: 0000000096000004 [#1]  SMP
> Modules linked in:
> CPU: 1 UID: 0 PID: 917 Comm: mount Not tainted 7.0.0-rc7-net-next+ #649 PREEMPT
> Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
> pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : refill_objects+0x298/0x5ec
> lr : refill_objects+0x1f0/0x5ec
> sp : ffff80008606b500
> x29: ffff80008606b500 x28: 0000000000000001 x27: fffffdffc20e6200
> x26: 0000000000000006 x25: 0000000000000000 x24: 000000000000003c
> x23: ffff0000809e4840 x22: ffff0000809dba00 x21: ffff80008606b5a0
> x20: ffff000081133820 x19: fffffdffc20e6220 x18: 0000000000000000
> x17: 0000000000000000 x16: 0000000000000100 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: ffff800081e5faa8
> x11: ffff800082192c70 x10: ffff8000814074dc x9 : 0000000000000050
> x8 : ffff80008606b490 x7 : ffff000083988b40 x6 : ffff80008606b4a0
> x5 : 000000080015000f x4 : d503201fd503201f x3 : 00000000000000b0
> x2 : d503201fd503201f x1 : ffff000081133828 x0 : d503201fd503201f
> Call trace:
>  refill_objects+0x298/0x5ec (P)
>  __pcs_replace_empty_main+0x13c/0x3a8
>  kmem_cache_alloc_noprof+0x324/0x3a0
>  mempool_alloc_slab+0x1c/0x28
>  mempool_alloc_noprof+0x98/0xe0
>  bio_alloc_bioset+0x160/0x3e0
>  do_mpage_readpage+0x3d0/0x618
>  mpage_readahead+0xb8/0x144
>  blkdev_readahead+0x18/0x24
>  read_pages+0x58/0x260
>  page_cache_ra_unbounded+0x158/0x3e0
>  force_page_cache_ra+0xb0/0xe4
>  page_cache_sync_ra+0x88/0x480
>  filemap_get_pages+0xd8/0x850
>  filemap_read+0xdc/0x3d8
>  blkdev_read_iter+0x84/0x198
>  vfs_read+0x208/0x2d8
>  ksys_read+0x58/0xf4
>  __arm64_sys_read+0x1c/0x28
>  invoke_syscall.constprop.0+0x50/0xe0
>  do_el0_svc+0x40/0xc0
>  el0_svc+0x48/0x2a0
>  el0t_64_sync_handler+0xa0/0xe4
>  el0t_64_sync+0x19c/0x1a0
> Code: 54000189 f9000022 aa0203e4 b9402ae3 (f8634840)
> ---[ end trace 0000000000000000 ]---
> 
> Now starting the bisect between 7.0-rc5 and 7.0-rc6.

The bisect is still progressing, but it's landed on:

c7d812e33f3e dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction

and while this boots to a login prompt, it spat out a BUG():

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 56, name: kworker/u24:3
preempt_count: 0, expected: 0
RCU nest depth: 0, expected: 0
3 locks held by kworker/u24:3/56:
 #0: ffff000080042148 ((wq_completion)events_unbound#2){+.+.}-{0:0}, at: process_one_work+0x184/0x780
 #1: ffff80008299bdf8 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1ac/0x780
 #2: ffff0000808b48f8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x2c/0x188
irq event stamp: 10872
hardirqs last  enabled at (10871): [<ffff80008013a410>] ktime_get+0x130/0x180
hardirqs last disabled at (10872): [<ffff800080d61ac8>] _raw_spin_lock_irqsave+0x84/0x88
softirqs last  enabled at (9216): [<ffff80008002807c>] fpsimd_save_and_flush_current_state+0x3c/0x80
softirqs last disabled at (9214): [<ffff800080028098>] fpsimd_save_and_flush_current_state+0x58/0x80
CPU: 5 UID: 0 PID: 56 Comm: kworker/u24:3 Not tainted 7.0.0-rc1-bisect+ #654 PREEMPT
Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
Workqueue: events_unbound deferred_probe_work_func
Call trace:
 show_stack+0x18/0x30 (C)
 dump_stack_lvl+0x6c/0x94
 dump_stack+0x18/0x24
 __might_resched+0x154/0x220
 __might_sleep+0x48/0x80
 __mutex_lock+0x48/0x800
 mutex_lock_nested+0x24/0x30
 pinmux_disable_setting+0x9c/0x180
 pinctrl_commit_state+0x5c/0x260
 pinctrl_pm_select_idle_state+0x4c/0xa0
 tegra_i2c_runtime_suspend+0x2c/0x3c
 pm_generic_runtime_suspend+0x2c/0x44
 __rpm_callback+0x48/0x1ec
 rpm_callback+0x74/0x80
 rpm_suspend+0xec/0x630
 rpm_idle+0x2c0/0x420
 __pm_runtime_idle+0x44/0x160
 tegra_i2c_probe+0x2e4/0x640
 platform_probe+0x5c/0xa4
 really_probe+0xbc/0x2c0
 __driver_probe_device+0x78/0x120
 driver_probe_device+0x3c/0x160
 __device_attach_driver+0xbc/0x160
 bus_for_each_drv+0x70/0xb8
 __device_attach+0xa4/0x188
 device_initial_probe+0x50/0x54
 bus_probe_device+0x38/0xa4
 deferred_probe_work_func+0x90/0xcc
 process_one_work+0x204/0x780
 worker_thread+0x1c8/0x36c
 kthread+0x138/0x144
 ret_from_fork+0x10/0x20

This is reproducible.

Adding Vinod and Frank, and dmaengine mailing list.

Bisect continuing, assuming this is a "good" commit as it isn't
producing the boot failure with random memory corruption.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Russell King (Oracle) @ 2026-04-08 16:16 UTC (permalink / raw)
  To: netdev, linux-arm-kernel, linux-kernel, iommu, linux-ext4,
	Linus Torvalds, dmaengine
  Cc: Marek Szyprowski, Robin Murphy, Theodore Ts'o, Andreas Dilger,
	Vinod Koul, Frank Li
In-Reply-To: <adZ9grUg71f518Fg@shell.armlinux.org.uk>

On Wed, Apr 08, 2026 at 05:08:34PM +0100, Russell King (Oracle) wrote:
> The rebase is still progressing, but it's landed on:
> 
> c7d812e33f3e dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction
> 
> and while this boots to a login prompt, it spat out a BUG():
> 
> BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
> in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 56, name: kworker/u24:3
> preempt_count: 0, expected: 0
> RCU nest depth: 0, expected: 0
> 3 locks held by kworker/u24:3/56:
>  #0: ffff000080042148 ((wq_completion)events_unbound#2){+.+.}-{0:0}, at: process_one_work+0x184/0x780
>  #1: ffff80008299bdf8 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1ac/0x780
>  #2: ffff0000808b48f8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x2c/0x188
> irq event stamp: 10872
> hardirqs last  enabled at (10871): [<ffff80008013a410>] ktime_get+0x130/0x180
> hardirqs last disabled at (10872): [<ffff800080d61ac8>] _raw_spin_lock_irqsave+0x84/0x88
> softirqs last  enabled at (9216): [<ffff80008002807c>] fpsimd_save_and_flush_current_state+0x3c/0x80
> softirqs last disabled at (9214): [<ffff800080028098>] fpsimd_save_and_flush_current_state+0x58/0x80
> CPU: 5 UID: 0 PID: 56 Comm: kworker/u24:3 Not tainted 7.0.0-rc1-bisect+ #654 PREEMPT
> Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
> Workqueue: events_unbound deferred_probe_work_func
> Call trace:
>  show_stack+0x18/0x30 (C)
>  dump_stack_lvl+0x6c/0x94
>  dump_stack+0x18/0x24
>  __might_resched+0x154/0x220
>  __might_sleep+0x48/0x80
>  __mutex_lock+0x48/0x800
>  mutex_lock_nested+0x24/0x30
>  pinmux_disable_setting+0x9c/0x180
>  pinctrl_commit_state+0x5c/0x260
>  pinctrl_pm_select_idle_state+0x4c/0xa0
>  tegra_i2c_runtime_suspend+0x2c/0x3c
>  pm_generic_runtime_suspend+0x2c/0x44
>  __rpm_callback+0x48/0x1ec
>  rpm_callback+0x74/0x80
>  rpm_suspend+0xec/0x630
>  rpm_idle+0x2c0/0x420
>  __pm_runtime_idle+0x44/0x160
>  tegra_i2c_probe+0x2e4/0x640
>  platform_probe+0x5c/0xa4
>  really_probe+0xbc/0x2c0
>  __driver_probe_device+0x78/0x120
>  driver_probe_device+0x3c/0x160
>  __device_attach_driver+0xbc/0x160
>  bus_for_each_drv+0x70/0xb8
>  __device_attach+0xa4/0x188
>  device_initial_probe+0x50/0x54
>  bus_probe_device+0x38/0xa4
>  deferred_probe_work_func+0x90/0xcc
>  process_one_work+0x204/0x780
>  worker_thread+0x1c8/0x36c
>  kthread+0x138/0x144
>  ret_from_fork+0x10/0x20
> 
> This is reproducible.

I've just realised that it's the Tegra I2C bug that is already known
about, but took ages to be fixed in mainline - it's unrelated to the
memory corruption, so can be ignored. Sorry for the noise.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Linus Torvalds @ 2026-04-08 16:22 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: netdev, linux-arm-kernel, linux-kernel, iommu, linux-ext4,
	dmaengine, Marek Szyprowski, Robin Murphy, Theodore Ts'o,
	Andreas Dilger, Vinod Koul, Frank Li
In-Reply-To: <adZ9grUg71f518Fg@shell.armlinux.org.uk>

On Wed, 8 Apr 2026 at 09:08, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> The rebase is still progressing, but it's landed on:
>
> c7d812e33f3e dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction

Well, that commit looks completely bogus.

The explanation is just garbage: when subtracting two values that may
have random crud in the top bits, it's actually likely *better* to do
the masking *after* the subtraction.

The subtract of bogus upper bits will only affect upper bits. The
carry-chain only works upwards, not downwards.

So the old code that did

                       residue += (cdma_hw->control - cdma_hw->status) &
                                  chan->xdev->max_buffer_len;

would correctly mask out the upper bits, and the result of the
subtraction would be done "modulo max_buffer_len". Which is rather
reasonable.

The code was changed to

                       residue += (cdma_hw->control & chan->xdev->max_buffer_len) -
                                  (cdma_hw->status & chan->xdev->max_buffer_len);

and now it obviously still masks out the upper bits on each of the
values, but then the subtraction is done "modulo the arithmetic C
type" (which is 'u32').

In particular, if the status bits are bigger than the control bits,
that residue addition will now add a *huge* 32-bit number. It used to
add a number that was limited by the max_buffer_len mask.

So the "interference from those top bits" stated in the commit message
is simply NOT TRUE. It's just complete rambling garbage.

Instead, the commit purely changes the final modulus of the
subtraction - which has nothing to do with any upper bits, and
everything to do with what kind of answer you want.
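
To make the difference concrete, here is a minimal sketch of the two
forms as standalone helpers. The 26-bit MAX_BUFFER_LEN value and both
function names are assumptions made up for illustration; they stand in
for chan->xdev->max_buffer_len and are not taken from the driver.

```c
#include <stdint.h>

/* Hypothetical 26-bit length mask, standing in for max_buffer_len. */
#define MAX_BUFFER_LEN 0x03ffffffu

/* Old form: subtract first, then mask. The result is always within the
 * mask, i.e. the subtraction is effectively done modulo 2^26 here. */
static uint32_t residue_mask_after(uint32_t control, uint32_t status)
{
	return (control - status) & MAX_BUFFER_LEN;
}

/* New form: mask each operand, then subtract. The result wraps modulo
 * 2^32, so it can exceed the mask when status > control. */
static uint32_t residue_mask_before(uint32_t control, uint32_t status)
{
	return (control & MAX_BUFFER_LEN) - (status & MAX_BUFFER_LEN);
}
```

With crud in the top bits and control >= status in the low bits (say
control = 0xa4000100, status = 0x58000040), both helpers return 0xc0,
because borrows only propagate upwards. With control = 0x10 and
status = 0x20, the old form returns 0x03fffff0 while the new form
returns 0xfffffff0.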

I think that commit is just very very wrong. At least the commit
message is wrong. And see above why I think the changed arithmetic is
likely wrong too.

It's very possible that the 'residue' is now a random 32-bit number
with the high bits set, and you get DMA corruption.

That would explain why this happens on Jetson but I haven't seen other reports.

                    Linus

^ permalink raw reply

* Re: BUG: net-next (7.0-rc6 based and later) fails to boot on Jetson Xavier NX
From: Robin Murphy @ 2026-04-08 16:40 UTC (permalink / raw)
  To: Russell King (Oracle), netdev, linux-arm-kernel, linux-kernel,
	iommu, linux-ext4, Linus Torvalds, dmaengine
  Cc: Marek Szyprowski, Theodore Ts'o, Andreas Dilger, Vinod Koul,
	Frank Li
In-Reply-To: <adZ_ZmjcE8S22vR1@shell.armlinux.org.uk>

On 2026-04-08 5:16 pm, Russell King (Oracle) wrote:
> On Wed, Apr 08, 2026 at 05:08:34PM +0100, Russell King (Oracle) wrote:
>> The rebase is still progressing, but it's landed on:
>>
>> c7d812e33f3e dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction

FWIW I don't see a Tegra having the Xilinx IP in it anyway - judging by 
the DT it has their own tegra-gpcdma engine...

There's a fair chance this could be 90c5def10bea ("iommu: Do not call 
drivers for empty gathers"), which JonH also reported causing boot 
issues on Tegras - in short, SMMU TLB maintenance may not be completed 
properly which could lead to recycled DMA addresses causing exactly this 
kind of random memory corruption. I CC'd you on a patch:

https://lore.kernel.org/linux-iommu/20260408162846.GE3357077@nvidia.com/T/#t

Thanks,
Robin.

>>
>> and while this boots to a login prompt, it spat out a BUG():
>>
>> BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
>> in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 56, name: kworker/u24:3
>> preempt_count: 0, expected: 0
>> RCU nest depth: 0, expected: 0
>> 3 locks held by kworker/u24:3/56:
>>   #0: ffff000080042148 ((wq_completion)events_unbound#2){+.+.}-{0:0}, at: process_one_work+0x184/0x780
>>   #1: ffff80008299bdf8 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x1ac/0x780
>>   #2: ffff0000808b48f8 (&dev->mutex){....}-{4:4}, at: __device_attach+0x2c/0x188
>> irq event stamp: 10872
>> hardirqs last  enabled at (10871): [<ffff80008013a410>] ktime_get+0x130/0x180
>> hardirqs last disabled at (10872): [<ffff800080d61ac8>] _raw_spin_lock_irqsave+0x84/0x88
>> softirqs last  enabled at (9216): [<ffff80008002807c>] fpsimd_save_and_flush_current_state+0x3c/0x80
>> softirqs last disabled at (9214): [<ffff800080028098>] fpsimd_save_and_flush_current_state+0x58/0x80
>> CPU: 5 UID: 0 PID: 56 Comm: kworker/u24:3 Not tainted 7.0.0-rc1-bisect+ #654 PREEMPT
>> Hardware name: NVIDIA NVIDIA Jetson Xavier NX Developer Kit/Jetson, BIOS 6.0-37391689 08/28/2024
>> Workqueue: events_unbound deferred_probe_work_func
>> Call trace:
>>   show_stack+0x18/0x30 (C)
>>   dump_stack_lvl+0x6c/0x94
>>   dump_stack+0x18/0x24
>>   __might_resched+0x154/0x220
>>   __might_sleep+0x48/0x80
>>   __mutex_lock+0x48/0x800
>>   mutex_lock_nested+0x24/0x30
>>   pinmux_disable_setting+0x9c/0x180
>>   pinctrl_commit_state+0x5c/0x260
>>   pinctrl_pm_select_idle_state+0x4c/0xa0
>>   tegra_i2c_runtime_suspend+0x2c/0x3c
>>   pm_generic_runtime_suspend+0x2c/0x44
>>   __rpm_callback+0x48/0x1ec
>>   rpm_callback+0x74/0x80
>>   rpm_suspend+0xec/0x630
>>   rpm_idle+0x2c0/0x420
>>   __pm_runtime_idle+0x44/0x160
>>   tegra_i2c_probe+0x2e4/0x640
>>   platform_probe+0x5c/0xa4
>>   really_probe+0xbc/0x2c0
>>   __driver_probe_device+0x78/0x120
>>   driver_probe_device+0x3c/0x160
>>   __device_attach_driver+0xbc/0x160
>>   bus_for_each_drv+0x70/0xb8
>>   __device_attach+0xa4/0x188
>>   device_initial_probe+0x50/0x54
>>   bus_probe_device+0x38/0xa4
>>   deferred_probe_work_func+0x90/0xcc
>>   process_one_work+0x204/0x780
>>   worker_thread+0x1c8/0x36c
>>   kthread+0x138/0x144
>>   ret_from_fork+0x10/0x20
>>
>> This is reproducible.
> 
> I've just realised that it's the Tegra I2C bug that is already known
> about, but took ages to be fixed in mainline - it's unrelated to the
> memory corruption, so can be ignored. Sorry for the noise.
> 


^ permalink raw reply

