Linux EXT4 FS development
* Re: [PATCH 10/16] fs/buffer: Remove fs-layer decryption code
From: Christian Brauner @ 2026-06-25  7:01 UTC
  To: Eric Biggers
  Cc: linux-fscrypt, linux-fsdevel, linux-ext4, linux-f2fs-devel,
	linux-block, Christoph Hellwig, Theodore Ts'o, Andreas Dilger,
	Baokun Li, Jan Kara, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	Jaegeuk Kim, Chao Yu
In-Reply-To: <20260624050334.124606-11-ebiggers@kernel.org>

On 2026-06-23 22:03 -0700, Eric Biggers wrote:
> Now that fscrypt's file contents en/decryption is always implemented
> using blk-crypto when the filesystem is block-based, the fs-layer
> decryption code in fs/buffer.c is unused code.  Remove it.
> 
> Signed-off-by: Eric Biggers <ebiggers@kernel.org>
> ---

Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org>



* [PATCH v5] ext4: fix ABBA deadlock in ext4_xattr_inode_cache_find()
From: Aditya Srivastava @ 2026-06-25  6:50 UTC
  To: tytso, jack
  Cc: adilger.kernel, libaokun, ritesh.list, yi.zhang, linux-ext4,
	linux-kernel, Aditya Prakash Srivastava, Colin Ian King

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Syzbot/stress-ng reported an ABBA deadlock in ext4 when exercising
concurrent xattr workloads (using the ea_inode mount/format option).

The deadlock occurs between the running transaction and the eviction
thread:
- Task 1 (stress-ng): Holds a reference to a shared mbcache_entry (ce)
  and calls ext4_xattr_inode_cache_find() -> ext4_iget() to retrieve
  the corresponding EA inode. Since the EA inode is currently being
  evicted, ext4_iget() blocks in __wait_on_freeing_inode() waiting for
  eviction to complete.
- Task 2 (eviction thread): Currently evicting the same EA inode in
  ext4_evict_ea_inode(). It calls mb_cache_entry_wait_unused(oe) which
  blocks waiting for Task 1 to release the reference to the mbcache_entry.

To break this deadlock, implement a new ext4_iget() configuration flag
named EXT4_IGET_NOWAIT. When set, perform a non-blocking lookup of the
inode via VFS's find_inode_nowait() API.

If the inode is currently being evicted (marked with I_FREEING or
I_WILL_FREE) or created (I_CREATING), or if it is not present in the VFS
inode cache (cache miss), simply skip it (returning -ESTALE) rather than
waiting for eviction/creation to complete, breaking the ABBA cycle.

Since we return -ESTALE immediately on a cache miss, we never attempt to
allocate a new inode or call iget_locked(), completely eliminating any
TOCTOU race window.

If the returned inode is I_NEW, wait for its initialization to finish via
wait_on_new_inode(). If initialization failed and the inode has been
unhashed by the time wait_on_new_inode() returns (e.g., due to an I/O read
error in another thread), drop the reference and return -ESTALE. This
unhashed check runs unconditionally on all cache-hit paths to properly
handle concurrent initialization failures.

Finally, standard validation checks (including is_bad_inode,
EXT4_EA_INODE_FL, file_acl, and xattr flags) are executed as normal inside
check_igot_inode() to fully guarantee VFS-layer safety.

In ext4_xattr_inode_cache_find(), invoke ext4_iget() with the new
EXT4_IGET_NOWAIT flag to perform the non-blocking cache search.

Suggested-by: Jan Kara <jack@suse.cz>
Reported-by: Colin Ian King <colin.i.king@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219283
Fixes: 0a46ef234756 ("ext4: do not create EA inode under buffer lock")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
Changes in v5:
  - Address two critical issues flagged by the Sashiko AI bot in v4:
    1. Resolve the Time-Of-Check to Time-Of-Use (TOCTOU) race window between
       find_inode_nowait() and iget_locked() by returning -ESTALE immediately
       on a VFS cache miss. This completely bypasses fallback to iget_locked()
       and prevents potential ABBA deadlocks.
    2. Fix the improperly nested inode_unhashed() safety check by moving it
       outside the I_NEW condition block, ensuring it runs unconditionally
       on all cache-hit pathways to prevent false-positive filesystem
       corruption errors during concurrent initialization failures.

Changes in v4:
  - Check if the inode was unhashed during wait_on_new_inode() waking up
    to handle transient initialization failures (like I/O read errors)
    gracefully. Dropping the reference and returning -ESTALE prevents
    false filesystem corruption errors (__ext4_error), as found by the
    Sashiko AI bot.

Changes in v3:
  - Implement a new ext4_iget() configuration flag named EXT4_IGET_NOWAIT to
    fully contain the non-blocking lookup and VFS-level validations within
    inode.c, as requested by Jan Kara.
  - Skip inodes currently being created (I_CREATING), following Jan Kara's
    direct feedback.
  - Remove all open-coded match helpers and VFS state-checks from xattr.c.

Changes in v2:
  - Read inode state locklessly using inode_state_read_once() to resolve
    a lockdep assertion on cache hit.
  - Manually restore essential inode/ea_inode validations on the retrieved
    inode (is_bad_inode, EXT4_EA_INODE_FL, file_acl, and xattr checks) to
    match VFS safety guarantees and prevent using corrupted/failed inodes.

 fs/ext4/ext4.h  |  3 ++-
 fs/ext4/inode.c | 41 ++++++++++++++++++++++++++++++++++++++---
 fs/ext4/xattr.c |  2 +-
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b37c136ea3ab..c76dd0bdd3d8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3144,7 +3144,8 @@ typedef enum {
 	EXT4_IGET_SPECIAL =	0x0001, /* OK to iget a system inode */
 	EXT4_IGET_HANDLE = 	0x0002,	/* Inode # is from a handle */
 	EXT4_IGET_BAD =		0x0004, /* Allow to iget a bad inode */
-	EXT4_IGET_EA_INODE =	0x0008	/* Inode should contain an EA value */
+	EXT4_IGET_EA_INODE =	0x0008,	/* Inode should contain an EA value */
+	EXT4_IGET_NOWAIT =	0x0010	/* Non-blocking lookup (skip if freeing) */
 } ext4_iget_flags;
 
 extern struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce99807c5f5b..f6b681320358 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5270,6 +5270,24 @@ void ext4_set_inode_mapping_order(struct inode *inode)
 	mapping_set_folio_order_range(inode->i_mapping, min_order, max_order);
 }
 
+static int ext4_iget_match(struct inode *inode, u64 ino, void *data)
+{
+	bool *is_freeing = data;
+
+	if (inode->i_ino != ino)
+		return 0;
+	spin_lock(&inode->i_lock);
+	if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_CREATING)) {
+		if (is_freeing)
+			*is_freeing = true;
+		spin_unlock(&inode->i_lock);
+		return -1;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	return 1;
+}
+
 struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 			  ext4_iget_flags flags, const char *function,
 			  unsigned int line)
@@ -5298,9 +5316,26 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
-	inode = iget_locked(sb, ino);
-	if (!inode)
-		return ERR_PTR(-ENOMEM);
+	if (flags & EXT4_IGET_NOWAIT) {
+		bool is_freeing = false;
+
+		inode = find_inode_nowait(sb, ino, ext4_iget_match, &is_freeing);
+		if (is_freeing || !inode)
+			return ERR_PTR(-ESTALE);
+
+		if (inode_state_read_once(inode) & I_NEW)
+			wait_on_new_inode(inode);
+
+		if (unlikely(inode_unhashed(inode))) {
+			iput(inode);
+			return ERR_PTR(-ESTALE);
+		}
+	} else {
+		inode = iget_locked(sb, ino);
+		if (!inode)
+			return ERR_PTR(-ENOMEM);
+	}
+
 	if (!(inode_state_read_once(inode) & I_NEW)) {
 		ret = check_igot_inode(inode, flags, function, line);
 		if (ret) {
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..21b5670d8503 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1550,7 +1550,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
 
 	while (ce) {
 		ea_inode = ext4_iget(inode->i_sb, ce->e_value,
-				     EXT4_IGET_EA_INODE);
+				     EXT4_IGET_EA_INODE | EXT4_IGET_NOWAIT);
 		if (IS_ERR(ea_inode))
 			goto next_entry;
 		ext4_xattr_inode_set_class(ea_inode);
-- 
2.47.3
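
The lookup policy described in the v5 commit message can be modelled in userspace. The following is only a minimal sketch under stated assumptions: `toy_inode`, `toy_match()`, `toy_iget_nowait()` and the `TOY_*` state bits are hypothetical stand-ins, a flat array replaces the inode hash, and no locking is modelled. It shows the three return values of the match callback and how a cache miss and a busy inode both map to -ESTALE.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the VFS inode state bits (values are arbitrary). */
enum toy_state {
	TOY_NEW       = 1 << 0,
	TOY_FREEING   = 1 << 1,
	TOY_WILL_FREE = 1 << 2,
	TOY_CREATING  = 1 << 3,
};

struct toy_inode {
	unsigned long ino;
	unsigned int state;
	int refcount;
};

/*
 * Same shape as the match callback in the patch:
 * 0 = keep scanning, 1 = grab a reference and return this inode,
 * -1 = stop scanning and return nothing.
 */
static int toy_match(struct toy_inode *inode, unsigned long ino, bool *busy)
{
	if (inode->ino != ino)
		return 0;
	if (inode->state & (TOY_FREEING | TOY_WILL_FREE | TOY_CREATING)) {
		*busy = true;
		return -1;
	}
	inode->refcount++;
	return 1;
}

/*
 * Non-blocking lookup over a flat array standing in for the inode hash.
 * Returns 0 and sets *out on a usable hit; returns -ESTALE on a miss or
 * when the inode is being evicted or created. It never sleeps and never
 * allocates a new inode.
 */
static int toy_iget_nowait(struct toy_inode *cache, size_t n,
			   unsigned long ino, struct toy_inode **out)
{
	bool busy = false;

	assert(out != NULL);
	*out = NULL;
	for (size_t i = 0; i < n; i++) {
		int ret = toy_match(&cache[i], ino, &busy);

		if (ret == 0)
			continue;
		if (ret == 1)
			*out = &cache[i];
		break;
	}
	if (busy || !*out)
		return -ESTALE;
	return 0;
}
```

In this model the caller simply moves on to the next candidate when it sees -ESTALE, which corresponds to the `goto next_entry` path in ext4_xattr_inode_cache_find().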



* [PATCH v4] ext4: fix ABBA deadlock in ext4_xattr_inode_cache_find()
From: Aditya Srivastava @ 2026-06-25  6:03 UTC
  To: tytso, jack
  Cc: adilger.kernel, libaokun, ritesh.list, yi.zhang, linux-ext4,
	linux-kernel, Aditya Prakash Srivastava, Colin Ian King

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Syzbot/stress-ng reported an ABBA deadlock in ext4 when exercising
concurrent xattr workloads (using the ea_inode mount/format option).

The deadlock occurs between the running transaction and the eviction
thread:
- Task 1 (stress-ng): Holds a reference to a shared mbcache_entry (ce)
  and calls ext4_xattr_inode_cache_find() -> ext4_iget() to retrieve
  the corresponding EA inode. Since the EA inode is currently being
  evicted, ext4_iget() blocks in __wait_on_freeing_inode() waiting for
  eviction to complete.
- Task 2 (eviction thread): Currently evicting the same EA inode in
  ext4_evict_ea_inode(). It calls mb_cache_entry_wait_unused(oe) which
  blocks waiting for Task 1 to release the reference to the mbcache_entry.

To break this deadlock, implement a new ext4_iget() configuration flag
named EXT4_IGET_NOWAIT. When set, perform a non-blocking lookup of the
inode via VFS's find_inode_nowait() API.

If the inode is currently being evicted (marked with I_FREEING or
I_WILL_FREE) or created (I_CREATING), simply skip it (returning -ESTALE)
rather than waiting for eviction/creation to complete, breaking the ABBA
cycle. If the returned inode is I_NEW, wait for its initialization to
clear via wait_on_new_inode().

If initialization fails and the inode has been unhashed by the time
wait_on_new_inode() returns (e.g., due to an I/O read error in another
thread), drop the reference and return -ESTALE to bypass the xattr
cache entry. Finally, standard validation checks (including is_bad_inode,
EXT4_EA_INODE_FL, file_acl, and xattr flags) are executed as normal inside
check_igot_inode() to fully guarantee VFS-layer safety.

In ext4_xattr_inode_cache_find(), invoke ext4_iget() with the new
EXT4_IGET_NOWAIT flag to perform the non-blocking cache search.

Suggested-by: Jan Kara <jack@suse.cz>
Reported-by: Colin Ian King <colin.i.king@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219283
Fixes: 0a46ef234756 ("ext4: do not create EA inode under buffer lock")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
Changes in v4:
  - Check if the inode was unhashed during wait_on_new_inode() waking up
    to handle transient initialization failures (like I/O read errors)
    gracefully. Dropping the reference and returning -ESTALE prevents
    false filesystem corruption errors (__ext4_error), as found by the
    Sashiko AI bot.

Changes in v3:
  - Implement a new ext4_iget() configuration flag named EXT4_IGET_NOWAIT to
    fully contain the non-blocking lookup and VFS-level validations within
    inode.c, as requested by Jan Kara.
  - Skip inodes currently being created (I_CREATING), following Jan Kara's
    direct feedback.
  - Remove all open-coded match helpers and VFS state-checks from xattr.c.

Changes in v2:
  - Read inode state locklessly using inode_state_read_once() to resolve
    a lockdep assertion on cache hit.
  - Manually restore essential inode/ea_inode validations on the retrieved
    inode (is_bad_inode, EXT4_EA_INODE_FL, file_acl, and xattr checks) to
    match VFS safety guarantees and prevent using corrupted/failed inodes.

 fs/ext4/ext4.h  |  3 ++-
 fs/ext4/inode.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/xattr.c |  2 +-
 3 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b37c136ea3ab..c76dd0bdd3d8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3144,7 +3144,8 @@ typedef enum {
 	EXT4_IGET_SPECIAL =	0x0001, /* OK to iget a system inode */
 	EXT4_IGET_HANDLE = 	0x0002,	/* Inode # is from a handle */
 	EXT4_IGET_BAD =		0x0004, /* Allow to iget a bad inode */
-	EXT4_IGET_EA_INODE =	0x0008	/* Inode should contain an EA value */
+	EXT4_IGET_EA_INODE =	0x0008,	/* Inode should contain an EA value */
+	EXT4_IGET_NOWAIT =	0x0010	/* Non-blocking lookup (skip if freeing) */
 } ext4_iget_flags;
 
 extern struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce99807c5f5b..75ed467f5abf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5270,6 +5270,24 @@ void ext4_set_inode_mapping_order(struct inode *inode)
 	mapping_set_folio_order_range(inode->i_mapping, min_order, max_order);
 }
 
+static int ext4_iget_match(struct inode *inode, u64 ino, void *data)
+{
+	bool *is_freeing = data;
+
+	if (inode->i_ino != ino)
+		return 0;
+	spin_lock(&inode->i_lock);
+	if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_CREATING)) {
+		if (is_freeing)
+			*is_freeing = true;
+		spin_unlock(&inode->i_lock);
+		return -1;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	return 1;
+}
+
 struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 			  ext4_iget_flags flags, const char *function,
 			  unsigned int line)
@@ -5298,9 +5316,31 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
-	inode = iget_locked(sb, ino);
-	if (!inode)
-		return ERR_PTR(-ENOMEM);
+	if (flags & EXT4_IGET_NOWAIT) {
+		bool is_freeing = false;
+
+		inode = find_inode_nowait(sb, ino, ext4_iget_match, &is_freeing);
+		if (is_freeing)
+			return ERR_PTR(-ESTALE);
+		if (!inode) {
+			inode = iget_locked(sb, ino);
+			if (!inode)
+				return ERR_PTR(-ENOMEM);
+		} else {
+			if (inode_state_read_once(inode) & I_NEW) {
+				wait_on_new_inode(inode);
+				if (unlikely(inode_unhashed(inode))) {
+					iput(inode);
+					return ERR_PTR(-ESTALE);
+				}
+			}
+		}
+	} else {
+		inode = iget_locked(sb, ino);
+		if (!inode)
+			return ERR_PTR(-ENOMEM);
+	}
+
 	if (!(inode_state_read_once(inode) & I_NEW)) {
 		ret = check_igot_inode(inode, flags, function, line);
 		if (ret) {
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..21b5670d8503 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1550,7 +1550,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
 
 	while (ce) {
 		ea_inode = ext4_iget(inode->i_sb, ce->e_value,
-				     EXT4_IGET_EA_INODE);
+				     EXT4_IGET_EA_INODE | EXT4_IGET_NOWAIT);
 		if (IS_ERR(ea_inode))
 			goto next_entry;
 		ext4_xattr_inode_set_class(ea_inode);
-- 
2.47.3



* [PATCH v3] ext4: fix ABBA deadlock in ext4_xattr_inode_cache_find()
From: Aditya Srivastava @ 2026-06-25  4:09 UTC
  To: tytso, jack
  Cc: adilger.kernel, libaokun, ritesh.list, yi.zhang, linux-ext4,
	linux-kernel, Aditya Prakash Srivastava, Colin Ian King

From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Syzbot/stress-ng reported an ABBA deadlock in ext4 when exercising
concurrent xattr workloads (using the ea_inode mount/format option).

The deadlock occurs between the running transaction and the eviction
thread:
- Task 1 (stress-ng): Holds a reference to a shared mbcache_entry (ce)
  and calls ext4_xattr_inode_cache_find() -> ext4_iget() to retrieve
  the corresponding EA inode. Since the EA inode is currently being
  evicted, ext4_iget() blocks in __wait_on_freeing_inode() waiting for
  eviction to complete.
- Task 2 (eviction thread): Currently evicting the same EA inode in
  ext4_evict_ea_inode(). It calls mb_cache_entry_wait_unused(oe) which
  blocks waiting for Task 1 to release the reference to the mbcache_entry.

To break this deadlock, implement a new ext4_iget() configuration flag
named EXT4_IGET_NOWAIT. When set, perform a non-blocking lookup of the
inode via VFS's find_inode_nowait() API.

If the inode is currently being evicted (marked with I_FREEING or
I_WILL_FREE) or created (I_CREATING), simply skip it (returning -ESTALE)
rather than waiting for eviction/creation to complete, breaking the ABBA
cycle. If the returned inode is I_NEW, wait for its initialization to
clear via wait_on_new_inode(). Finally, standard validation checks
(including is_bad_inode, EXT4_EA_INODE_FL, file_acl, and xattr flags) are
executed as normal inside check_igot_inode() to fully guarantee VFS-layer
safety.

In ext4_xattr_inode_cache_find(), invoke ext4_iget() with the new
EXT4_IGET_NOWAIT flag to perform the non-blocking cache search.

Suggested-by: Jan Kara <jack@suse.cz>
Reported-by: Colin Ian King <colin.i.king@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219283
Fixes: 0a46ef234756 ("ext4: do not create EA inode under buffer lock")
Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
---
Changes in v3:
  - Implement a new ext4_iget() configuration flag named EXT4_IGET_NOWAIT to
    fully contain the non-blocking lookup and VFS-level validations within
    inode.c, as requested by Jan Kara.
  - Skip inodes currently being created (I_CREATING), following Jan Kara's
    direct feedback.
  - Remove all open-coded match helpers and VFS state-checks from xattr.c.

Changes in v2:
  - Read inode state locklessly using inode_state_read_once() to resolve
    a lockdep assertion on cache hit.
  - Manually restore essential inode/ea_inode validations on the retrieved
    inode (is_bad_inode, EXT4_EA_INODE_FL, file_acl, and xattr checks) to
    match VFS safety guarantees and prevent using corrupted/failed inodes.

 fs/ext4/ext4.h  |  3 ++-
 fs/ext4/inode.c | 41 ++++++++++++++++++++++++++++++++++++++---
 fs/ext4/xattr.c |  2 +-
 3 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b37c136ea3ab..c76dd0bdd3d8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3144,7 +3144,8 @@ typedef enum {
 	EXT4_IGET_SPECIAL =	0x0001, /* OK to iget a system inode */
 	EXT4_IGET_HANDLE = 	0x0002,	/* Inode # is from a handle */
 	EXT4_IGET_BAD =		0x0004, /* Allow to iget a bad inode */
-	EXT4_IGET_EA_INODE =	0x0008	/* Inode should contain an EA value */
+	EXT4_IGET_EA_INODE =	0x0008,	/* Inode should contain an EA value */
+	EXT4_IGET_NOWAIT =	0x0010	/* Non-blocking lookup (skip if freeing) */
 } ext4_iget_flags;
 
 extern struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce99807c5f5b..42a798f333d3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5270,6 +5270,24 @@ void ext4_set_inode_mapping_order(struct inode *inode)
 	mapping_set_folio_order_range(inode->i_mapping, min_order, max_order);
 }
 
+static int ext4_iget_match(struct inode *inode, u64 ino, void *data)
+{
+	bool *is_freeing = data;
+
+	if (inode->i_ino != ino)
+		return 0;
+	spin_lock(&inode->i_lock);
+	if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_CREATING)) {
+		if (is_freeing)
+			*is_freeing = true;
+		spin_unlock(&inode->i_lock);
+		return -1;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	return 1;
+}
+
 struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 			  ext4_iget_flags flags, const char *function,
 			  unsigned int line)
@@ -5298,9 +5316,26 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
 
-	inode = iget_locked(sb, ino);
-	if (!inode)
-		return ERR_PTR(-ENOMEM);
+	if (flags & EXT4_IGET_NOWAIT) {
+		bool is_freeing = false;
+
+		inode = find_inode_nowait(sb, ino, ext4_iget_match, &is_freeing);
+		if (is_freeing)
+			return ERR_PTR(-ESTALE);
+		if (!inode) {
+			inode = iget_locked(sb, ino);
+			if (!inode)
+				return ERR_PTR(-ENOMEM);
+		} else {
+			if (inode_state_read_once(inode) & I_NEW)
+				wait_on_new_inode(inode);
+		}
+	} else {
+		inode = iget_locked(sb, ino);
+		if (!inode)
+			return ERR_PTR(-ENOMEM);
+	}
+
 	if (!(inode_state_read_once(inode) & I_NEW)) {
 		ret = check_igot_inode(inode, flags, function, line);
 		if (ret) {
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..21b5670d8503 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1550,7 +1550,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
 
 	while (ce) {
 		ea_inode = ext4_iget(inode->i_sb, ce->e_value,
-				     EXT4_IGET_EA_INODE);
+				     EXT4_IGET_EA_INODE | EXT4_IGET_NOWAIT);
 		if (IS_ERR(ea_inode))
 			goto next_entry;
 		ext4_xattr_inode_set_class(ea_inode);
-- 
2.47.3



* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Zhang Yi @ 2026-06-25  3:33 UTC
  To: Jan Kara, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, ojaswin, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai
In-Reply-To: <i536qqwj5eyulec3r2ki2ycnelqdd4bkpat2drn7t72t6p622k@ktueynysgo3j>

On 6/25/2026 1:16 AM, Jan Kara wrote:
> On Mon 22-06-26 20:36:02, Zhang Yi wrote:
>> On 6/16/2026 7:47 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:29, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Add the iomap writeback path for ext4 buffered I/O. This introduces:
>>>>
>>>>   - ext4_iomap_writepages(): the main writeback entry point.
>>>>   - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
>>>>     block mapping and I/O submission.
>>>>   - A new end I/O worker for converting unwritten extents, updating file
>>>>     size, and handling DATA_ERR_ABORT after I/O completion.
>>>>
>>>> Core implementation details:
>>>>
>>>>   - ->writeback_range() callback
>>>>     Calls ext4_iomap_map_writeback_range() to query the longest range of
>>>>     existing mapped extents. For performance, when a block range is not
>>>>     yet allocated, it allocates based on the writeback length and delalloc
>>>>     extent length, rather than allocating for a single folio at a time.
>>>>     The folio is then added to an iomap_ioend instance.
>>>>
>>>>   - ->writeback_submit() callback
>>>>     Registers ext4_iomap_end_bio() as the end bio callback. This callback
>>>>     schedules a worker to handle:
>>>>     - Unwritten extent conversion.
>>>>     - i_disksize update after data is written back.
>>>>     - Journal abort on writeback I/O failure.
>>>>
>>>> Key changes and considerations:
>>>>
>>>> - Append write and unwritten extents
>>>>    Since data=ordered mode is not used to prevent stale data exposure
>>>>    during append writebacks, new blocks are always allocated as unwritten
>>>>    extents (i.e. always enable dioread_nolock), and i_disksize update is
>>>>    postponed until I/O completion. Additionally, the deadlock that the
>>>>    reserve handle was expected to resolve does not occur anymore.
>>>>    Therefore, the end I/O worker can start a normal journal handle
>>>>    instead of a reserve handle when converting unwritten extents.
>>>>
>>>> - Lock ordering
>>>>    The ->writeback_range() callback runs under the folio lock, requiring
>>>>    the journal handle to be started under that same lock. This reverses
>>>>    the order compared to the buffer_head writeback path. The lock ordering
>>>>    documentation in super.c has been updated accordingly.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>   fs/ext4/ext4.h        |   4 +
>>>>   fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
>>>>   fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
>>>>   fs/ext4/super.c       |   7 +-
>>>>   fs/iomap/ioend.c      |   3 +-
>>>>   include/linux/iomap.h |   1 +
>>>>   6 files changed, 346 insertions(+), 3 deletions(-)
>>>>

[...]

>> There are actually two reasons for this. First, we want to avoid
>> starting a journal handle in overwrite scenarios. Second, we want to
>> be able to query the extent locklessly without holding i_data_sem in
>> overwrite cases as well (note that ext4_es_lookup_extent() in
>> ext4_iomap_map_one_extent() is called with i_data_sem held).
>>
>> I ran a set of benchmark tests in my VM, performing the following FIO
>> overwrite test on a 500GB ramdisk:
>>
>> $fio -filename=/test_dir/foo -direct=0 -iodepth=8 -fsync=0 -rw=write \
>>            -numjobs=1 -bs=4k -ioengine=io_uring -size=20G -uncached=1 \
>>            -runtime=30 --ramp_time=5s -time_based -norandommap=0 \
>>            -fallocate=none -overwrite=1 \
>>            -group_reportin -name=test --output=/tmp/log
>>
>> The results are as follows:
>>
>> a: on a non-fragmented file
>> A: on a fragmented file [1]
>> b: no background metadata pressure
>> B: with background metadata pressure [2]
>>
>>      buffer_head | iomap pre-map w/o journal | iomap directly map
>> a+b:    680                 691                   690
>> a+B:    560                 568                   567
>> A+b:    637                 633                   579
>> A+B:    540                 571                   495
>>
>> [1] The file is pre-fragmented such that each block occupies a separate
>>     extent.
>> [2] A background fsstress process is running (only contains metadata
>>     ops):
>>     taskset -c 2 fsstress -c -d /test_dir -l 0 -n 1000 -f clonerange=0 \
>>             -f copyrange=0 -f awrite=0 -f aread=0 -f dread=0 \
>>             -f dwrite=0 -f mread=0 -f mwrite=0 -f readv=0 -f write=0 \
>>             -f writev=0 -f read=0 -f sync=0 -f afsync=0 -f fsync=0
>>
>> As can be seen, for large contiguous files, the performance impact is
>> minimal. However, in heavily fragmented scenarios or under other
>> metadata pressure, pre-querying the mapping brings noticeable gains.
>> However, this is testing the most extreme case — I'm not sure about
>> the real-world impact, so I don't have a strong preference either way.
>> But I suppose faster is better, at least not slower than the old
>> buffer_head path. :)
> 
> OK, thanks for the test! So for fragmented files the optimization of not
> starting a transaction seems indeed worth it. I still dislike the
> opencoding :) So given we have the reversed lock ordering now, why don't we
> teach ext4_map_blocks() to start a transaction (if not provided) just
> before it acquires i_data_sem for writing? This should be quite elegant. I
> know you have some concerns about possible races below so let's discuss
> that separately but at least in terms of performance and code complexity
> this would look ideal :).
> 

Yeah, I agree with you on this point. But let's discuss the below race
issue first.  :)

>>> then I'd
>>> probably prefer coming up with an ext4_get_blocks flag which tells it to
>>> start a transaction on its own if we need to allocate blocks... That would
>>> be much simpler than opencoding all this.
>>
>> Additionally, there is a key point here. The reason I open-coded
>> ext4_iomap_map_writeback_range() is that we must ensure extent query
>> and allocation are performed atomically under i_data_sem. Otherwise,
>> concurrent truncate could lead to quota leaks.
>>
>> Specifically, consider the following scenario: we call
>> ext4_map_blocks() to allocate blocks. Suppose there is a delalloc
>> extent covering blocks [0,3). While writeback is submitting block 0, a
>> concurrent truncate(block 1) occurs:
>>
>> wb                         truncate
>> ext4_es_lookup_extent()    ext4_truncate_down()
>>   //get [0,3)
>>                             truncate_inode_pages_range()
>>                                //clear page 1&2
>>                             ext4_truncate()
>>                              down_write(i_data_sem)
>>                               ext4_es_remove_extent()
>>                                //drop extent [1,3)
>>                                //i_reserved_data_blocks: 3->1
>>                               up_write(i_data_sem)
>> down_write(i_data_sem)
>> ext4_map_create_blocks()
>>   //alloc 3 blocks
>>  ext4_es_insert_extent()
>>   //only reclaim 1 block,stale 2 blocks
>> up_write(i_data_sem)
>>
>> Therefore, if we don't open-code this part, we would need to
>> significantly rework ext4_map_blocks(), which might have a larger
>> impact at this point. What do you think?
> 
> Hum, is something like this really possible? I mean iomap_writepages() will
> lookup and lock folio. Only then it calls ->iomap_begin to map it to
> underlying blocks. And folio lock synchronizes against
> truncate_inode_pages_range() so how would writeback end up trying to
> allocate something underlying pages 1 or 2?

I believe this scenario can indeed occur, and folio lock alone is
insufficient to protect against this concurrency issue. The main reason
is that the iomap writeback framework processes folios one by one. For
each folio, it follows the "lock -> map -> unlock" sequence. If each
iteration only mapped blocks covering no more than one folio in length,
performance would be severely degraded. Therefore, both XFS and the
current ext4 iomap implementation choose to map up to the minimum of the
writeback length and the delalloc extent length. This means that when
processing folio 0, if an extent of length 3 is found, the ranges
corresponding to the subsequent folios are also mapped and cached. As a
result, holding only the folio lock of folio 0 does not protect the
latter two folios against a concurrent truncate.

This issue does not occur in the original buffer_head writeback path,
ext4_do_writepages(). In that path, a batch of consecutive folios to be
mapped are locked upfront before the mapping operation. Therefore, the
blocks within the corresponding range are protected by folio locks
during mapping, making it impossible for truncation to race with
writeback.
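
The accounting in the interleaving quoted above can be reproduced with a small userspace model. This is only a sketch under stated assumptions: `toy_acct` and the `toy_*()` helpers are hypothetical, they track plain counters rather than ext4's real extent status tree or quota code, and the two orderings are run sequentially instead of with real threads. It contrasts an allocation that uses a stale extent length with one that looks up and allocates in a single step.

```c
#include <assert.h>

/* Hypothetical per-inode counters; not ext4's real data structures. */
struct toy_acct {
	int reserved;    /* delalloc blocks still holding a reservation */
	int allocated;   /* blocks actually allocated */
	int unaccounted; /* allocated blocks with no reservation left to claim */
};

/* Truncate drops the reservation for 'len' delalloc blocks. */
static void toy_truncate(struct toy_acct *a, int len)
{
	assert(len <= a->reserved);
	a->reserved -= len;
}

/* Allocate 'len' blocks, consuming reservations while any remain. */
static void toy_alloc(struct toy_acct *a, int len)
{
	int covered = len < a->reserved ? len : a->reserved;

	a->reserved -= covered;
	a->allocated += len;
	a->unaccounted += len - covered;
}

/*
 * Racy order: the extent length is sampled before truncate runs and is
 * then used for the allocation afterwards.
 */
static struct toy_acct toy_stale_lookup(void)
{
	struct toy_acct a = { .reserved = 3 };
	int len = a.reserved;      /* lookup sees [0,3) */

	toy_truncate(&a, 2);       /* truncate drops [1,3): 3 -> 1 */
	toy_alloc(&a, len);        /* allocates 3 using the stale length */
	return a;
}

/* Lookup and allocation happen together, after truncate has finished. */
static struct toy_acct toy_atomic_lookup(void)
{
	struct toy_acct a = { .reserved = 3 };

	toy_truncate(&a, 2);
	toy_alloc(&a, a.reserved); /* fresh lookup sees only [0,1) */
	return a;
}
```

In the stale case the model allocates three blocks but can claim only one reservation, leaving two blocks unaccounted, which matches the "only reclaim 1 block, stale 2 blocks" annotation in the diagram.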

[...]

>>>> +void ext4_iomap_end_bio(struct bio *bio)
>>>> +{
>>>> +	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
>>>> +	struct inode *inode = ioend->io_inode;
>>>> +	struct ext4_inode_info *ei = EXT4_I(inode);
>>>> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>>>> +	unsigned long flags;
>>>> +
>>>> +	/* Needs to convert unwritten extents or update the i_disksize. */
>>>> +	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
>>>> +	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
>>>> +		goto defer;
>>>> +
>>>> +	/* Needs to abort the journal on data_err=abort.  */
>>>> +	if (unlikely(ioend->io_bio.bi_status))
>>>> +		goto defer;
>>>> +
>>>> +	iomap_finish_ioend(ioend, 0);
>>>> +	return;
>>>> +defer:
>>>> +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
>>>> +	if (list_empty(&ei->i_iomap_ioend_list))
>>>> +		queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
>>>> +	list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
>>>> +	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>>>> +}
>>>
>>> For now, I'd prefer to do what XFS does and offload everything. Then you
>>> don't have to export iomap_finish_ioend() (which would need to be in a
>>> separate patch and acked by iomap maintainers) and the code is more
>>> standard. There's a patchset in the works which adds general ioend offloading
>>> infrastructure into iomap [1] and when that lands we should get all these
>>                       ^^^^^ block layer?
>>
>>> bells and whistles (even better ones with percpu work queues, batching,
>>> etc.) for free.
>>>
>>> [1] https://lore.kernel.org/all/20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu/
>>>
>>> 								Honza
>>
>> Ha, I've noticed this patchset, so I haven't implemented
>> uncached I/O handling for now. As a side note, I have a question:
>> if we convert all endio processing to worker threads, IIRC, pure
>> overwrite scenarios in my previous performance tests saw at least
>> a 20% degradation. Is that acceptable?
> 
> No, but the latest version of the patches exactly does IO completion in the
> interrupt unless the bio is flagged as needing IO completion from process
> context or unless end_io handler returns a particular error - which means
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please let me confirm whether my understanding is correct. The latest (v6)
modification to bio_endio() is as follows:

@@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
 	}
 #endif

-	if (bio->bi_end_io)
+	if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
+		__bio_complete_in_task(bio);
+	else if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }

Currently, __bio_complete_in_task() is only called in bio_endio() when
BIO_COMPLETE_IN_TASK is explicitly set on the bio. It is not called
again based on a specific error return from bio->bi_end_io(). So I
believe what you meant is that each filesystem's own ->bi_end_io()
should call it to convert the endio processing to process context.
Is my understanding correct?
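
To make the question concrete, here is a small user-space model of the
branch order in the hunk above. It is only a sketch: struct model_bio,
enum completion_path and model_bio_endio() are hypothetical stand-ins I
made up for illustration, not kernel code, and the in_atomic argument
stands in for whatever bio_in_atomic() reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; none of this is the kernel's struct bio. */
enum completion_path { COMPLETE_NONE, COMPLETE_INLINE, COMPLETE_DEFERRED };

struct model_bio {
	bool complete_in_task;	/* models BIO_COMPLETE_IN_TASK being set */
	bool has_end_io;	/* models bio->bi_end_io != NULL */
};

/* Mirrors the branch order of the quoted v6 bio_endio() hunk. */
static enum completion_path model_bio_endio(const struct model_bio *bio,
					    bool in_atomic)
{
	/* Offload only when the flag is set AND we are in atomic context. */
	if (bio->complete_in_task && in_atomic)
		return COMPLETE_DEFERRED;
	/* Otherwise the handler runs directly in the caller's context. */
	if (bio->has_end_io)
		return COMPLETE_INLINE;
	return COMPLETE_NONE;
}
```

In this model nothing looks at a return value from the handler, which
is the part I would like to confirm against your description.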

> IO completion wasn't actually done and needs offloading into process
> context instead.
> 
>> I understand why uncached I/O might need the entire completion path
>> in a worker, but can we complete the I/O in interrupt context for
>> pure overwrite and then release the page cache in a worker? Must
>> page cache invalidation and I/O completion be synchronous?
> 
> Strictly speaking no, we can first complete the IO and evict page cache
> independently later. But it would be quite tricky locking wise (folios and
> the mapping containing them can get evicted once folio writeback bit gets
> cleared) so the whole uncached writes handling would have to be reworked. I
> don't think it's worth it at this point.

Ah, I see. Thanks for the clarification. :)

>  
>> The reason I kept ext4_iomap_end_bio() handling I/O completion in
>> interrupt context is for overwrite performance. XFS also handles
>> overwrites in interrupt context (via ioend_writeback_end_bio()).
>> However, ext4 has the data_err=abort mount option; when this mode
>> is set and an I/O error occurs, we must abort the journal in a
>> worker. Since we cannot predict I/O errors at submission time, we
>> can't directly use ioend_writeback_end_bio() and must instead bind
>> our own ext4_iomap_end_bio(). At the same time, I want to avoid
>> spawning a worker for pure overwrites when no I/O error occurs, so I
>> exported iomap_finish_ioend(). What do you think?
> 
> So data_err=abort handling can use exactly the new generic framework: if
> we detect while processing the IO completion that we cannot actually do it
> in interrupt context (as in the error case), we just return appropriately
> from the handler and the generic code will handle offloading and call the
> ->end_io callback again.
> 
> 								Honza

If my understanding above is correct, what you are suggesting here is
that ext4_iomap_end_bio() should call __bio_complete_in_task() to
switch to process context when data_err=abort is set and the I/O
returns an error. This way, we could drop ext4's own
i_iomap_ioend_work / i_rsv_conversion_work handling. Is that right?

But even with that, it seems we would still need to export
iomap_finish_ioend(). If the I/O does not fail, ext4_iomap_end_bio()
would need to complete the I/O processing in interrupt context for a
pure overwrite and would not switch to task context, and only
iomap_finish_ioend() is safe to call in interrupt context. Am I
misunderstanding something?
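
For reference, the decision I am trying to keep in the end_io handler
can be sketched in user space as below. This is only a model of the
checks in the quoted ext4_iomap_end_bio() hunk, under my own
assumptions: struct model_ioend and model_needs_process_context() are
hypothetical names, and the disksize argument stands in for
READ_ONCE(ei->i_disksize).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct iomap_ioend used here. */
struct model_ioend {
	bool unwritten;		/* models IOMAP_IOEND_UNWRITTEN in io_flags */
	uint64_t offset;	/* models io_offset */
	uint64_t size;		/* models io_size */
	int status;		/* models io_bio.bi_status, 0 means success */
};

/* Returns true when completion has to be deferred to a worker. */
static bool model_needs_process_context(const struct model_ioend *ioend,
					uint64_t disksize)
{
	/* Unwritten extent conversion or an i_disksize update is needed. */
	if (ioend->unwritten || ioend->offset + ioend->size > disksize)
		return true;
	/* An I/O error may require aborting the journal (data_err=abort). */
	if (ioend->status)
		return true;
	/* Pure overwrite without error: finish in interrupt context. */
	return false;
}
```

Only the last case is the fast path that currently needs
iomap_finish_ioend() to be callable from ext4.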

Thanks,
Yi.



* Re: [PATCH v4] iomap: add simple read path for small direct I/O
From: Fengnan @ 2026-06-25  2:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: brauner, djwong, ojaswin, dgc, linux-xfs, linux-fsdevel,
	linux-ext4, linux-kernel, lidiangang
In-Reply-To: <ajv5pqNureiK80Eu@infradead.org>

On 2026/6/24 23:37, Christoph Hellwig wrote:
> Sorry for the delay in getting back to this, I'm a bit overloaded at
> the moment.
>
>> -static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
>> +static inline bool should_report_dio_fserror(int error)
> Can you split all the refactoring into prep patches?
Of course.

>
>> +/*
>> + * In the async simple read path, we need to prevent bio_endio() from
>> + * triggering iocb->ki_complete() before the submitter has returned
>> + * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
>> + *
>> + * We use a three-state rendezvous to synchronize the submitter and end_io:
>> + *
>> + * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
>> + *
>> + * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
>> + * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
>> + * ki_complete().
>> + *
>> + * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
>> + * submit path. end_io sets this state and does nothing else. The submitter
>> + * will see this state and handle the completion synchronously (bypassing
>> + * ki_complete() and returning the actual result).
>> + */
> I don't think we actually need any of this.  For the sync case we
> can just use submit_bio_wait, and for async just always complete
> from the end_io handler.  This will simplify the implementation a lot,
> and also avoid the atomic.

I was wrong before: in the simple read path, we don't use sr after
submitting the bio.
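
For the record, the three-state rendezvous being dropped can be
modelled in user space with C11 atomics, roughly as below. This is only
a sketch of the scheme described in the quoted comment: the SIMPLE_*
constants and both model_* helpers are hypothetical names, and
atomic_compare_exchange_strong() stands in for the kernel's
atomic_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mirror of the IOMAP_DIO_SIMPLE_* states. */
enum { SIMPLE_SUBMITTING = 0, SIMPLE_QUEUED, SIMPLE_DONE };

/* end_io side: returns true if end_io itself must call ki_complete(). */
static bool model_end_io_completes(atomic_int *state)
{
	int expected = SIMPLE_SUBMITTING;

	if (atomic_load(state) == SIMPLE_QUEUED)
		return true;
	/* end_io won the race: mark DONE and leave completion to the submitter. */
	if (atomic_compare_exchange_strong(state, &expected, SIMPLE_DONE))
		return false;
	/* The exchange failed; expected now holds the state actually seen. */
	return expected == SIMPLE_QUEUED;
}

/* Submitter side: returns true if it must complete synchronously. */
static bool model_submitter_completes(atomic_int *state)
{
	int expected = SIMPLE_SUBMITTING;

	/* Normal case: publish QUEUED and return -EIOCBQUEUED to the caller. */
	if (atomic_compare_exchange_strong(state, &expected, SIMPLE_QUEUED))
		return false;
	return expected == SIMPLE_DONE;
}
```

Whichever side runs second ends up owning the completion, so exactly
one of the two helpers returns true for a given I/O.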
>
>> +static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
> Btw, I'd drop the _read in the name.  Most of this would work as-is
> for trivial overwrites if we figure out when to use them.
OK, let's rename them to iomap_dio_simple_xxx.

>
>> +	if (dio_flags & IOMAP_DIO_BOUNCE)
>> +		nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
>> +	else
>> +		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
> Bounce buffering requires dops, so all this can be dropped.
Got it.
>
>> +	ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
>> +			       &iomi.iomap, &iomi.srcmap);
>> +	if (ret) {
>> +		inode_dio_end(inode);
>> +		return ret;
>> +	}
>> +
>> +	if (iomi.iomap.type != IOMAP_MAPPED ||
>> +	    iomi.iomap.offset > iomi.pos ||
> I don't think offset > pos can happen

>
>> +	    iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
>> +	    (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
>> +		ret = -ENOTBLK;
>> +		goto out_iomap_end;
>> +	}
> Given that we already have a fallback here, I'm not sure why this is
> limited to a single file system block.  Anything that:
>
>    a) fits into the iomap
>    b) fits into a single bio
>
> can be easily supported.  The first condition is trivial, and for
> the second we could just check if iter->nr_segs is larger than
> BIO_MAX_VECS.
The reason I only added a simple path for 4K reads is that, on current
NVMe devices, 4K random reads hit a significant bottleneck, whereas
8K reads and 4K writes do not.
Also, if a request of 8K or larger does not fit within a single bio,
two iomap_begin/iomap_end calls would be required, which adds overhead.
Even setting that scenario aside, larger requests would still see some
benefit; it is just not as significant.
How would you approach this?
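
If I read the two conditions correctly, the relaxed check could be
sketched in user space like this. It is only a model under my own
assumptions: struct model_mapping and model_simple_path_ok() are
hypothetical names, and MODEL_BIO_MAX_VECS assumes BIO_MAX_VECS is 256
as in current kernels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: BIO_MAX_VECS is 256 in current kernels. */
#define MODEL_BIO_MAX_VECS 256u

/* Hypothetical stand-in for the offset/length fields of struct iomap. */
struct model_mapping {
	uint64_t offset;
	uint64_t length;
};

static bool model_simple_path_ok(const struct model_mapping *map,
				 uint64_t pos, uint64_t count,
				 unsigned int nr_segs)
{
	/* a) the whole request must fit into the returned mapping */
	if (map->offset > pos)
		return false;
	if (map->offset + map->length < pos + count)
		return false;
	/* b) the request must fit into a single bio */
	return nr_segs <= MODEL_BIO_MAX_VECS;
}
```

Anything rejected here would still fall back to the regular path via
-ENOTBLK, as in the current patch.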

>
>> +	if (user_backed_iter(iter))
>> +		dio_flags |= IOMAP_DIO_USER_BACKED;
>> +	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
>> +	bio->bi_ioprio = iocb->ki_ioprio;
>> +	bio->bi_private = sr;
>> +	bio->bi_end_io = iomap_dio_simple_read_end_io;
>> +
>> +	if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
>> +	    !(dio_flags & IOMAP_DIO_BOUNCE))
>> +		bio_set_pages_dirty(bio);
>> +
>> +	if (iocb->ki_flags & IOCB_NOWAIT)
>> +		bio->bi_opf |= REQ_NOWAIT;
>> +	if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
>> +		bio->bi_opf |= REQ_POLLED;
>> +		bio_set_polled(bio, iocb);
>> +		WRITE_ONCE(iocb->private, bio);
>> +	}
> Can you check if some more of this can be factored into a shared
> helper?
I'll try.

>
> Below is a completely untested patch implementing my suggestion
> for the completion simplification.  It compiles, but that's about
> the guarantees I can give for it:
I'll apply it and run some tests.

>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 3cb179752612..c785512e5339 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -909,11 +909,7 @@ struct iomap_dio_simple_read {
>   	struct kiocb		*iocb;
>   	size_t			size;
>   	unsigned int		dio_flags;
> -	atomic_t		state;
> -	union {
> -		struct task_struct	*waiter;
> -		struct work_struct	work;
> -	};
> +	struct work_struct	work;
>   	/*
>   	 * Align @bio to a cacheline boundary so that, combined with the
>   	 * front_pad passed to bioset_init(), the bio sits at the start of
> @@ -926,35 +922,12 @@ struct iomap_dio_simple_read {
>   
>   static struct bio_set iomap_dio_simple_read_pool;
>   
> -/*
> - * In the async simple read path, we need to prevent bio_endio() from
> - * triggering iocb->ki_complete() before the submitter has returned
> - * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
> - *
> - * We use a three-state rendezvous to synchronize the submitter and end_io:
> - *
> - * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
> - *
> - * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
> - * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
> - * ki_complete().
> - *
> - * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
> - * submit path. end_io sets this state and does nothing else. The submitter
> - * will see this state and handle the completion synchronously (bypassing
> - * ki_complete() and returning the actual result).
> - */
> -enum {
> -	IOMAP_DIO_SIMPLE_SUBMITTING = 0,
> -	IOMAP_DIO_SIMPLE_QUEUED,
> -	IOMAP_DIO_SIMPLE_DONE,
> -};
> -
> -static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
> -		struct bio *bio, ssize_t ret)
> +static ssize_t iomap_dio_simple_read_complete(struct iomap_dio_simple_read *sr)
>   {
> +	struct bio *bio = &sr->bio;
> +	struct kiocb *iocb = sr->iocb;
>   	struct inode *inode = file_inode(iocb->ki_filp);
> -	struct iomap_dio_simple_read *sr = bio->bi_private;
> +	ssize_t ret = blk_status_to_errno(bio->bi_status);
>   
>   	if (likely(!ret)) {
>   		ret = sr->size;
> @@ -965,21 +938,6 @@ static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
>   	}
>   
>   	iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
> -
> -	return ret;
> -}
> -
> -static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
> -		struct bio *bio)
> -{
> -	struct inode *inode = file_inode(iocb->ki_filp);
> -	ssize_t ret;
> -
> -	WRITE_ONCE(iocb->private, NULL);
> -
> -	ret = iomap_dio_simple_read_finish(iocb, bio,
> -			blk_status_to_errno(bio->bi_status));
> -
>   	inode_dio_end(inode);
>   	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
>   	return ret;
> @@ -988,45 +946,26 @@ static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
>   static void iomap_dio_simple_read_complete_work(struct work_struct *work)
>   {
>   	struct iomap_dio_simple_read *sr =
> -		container_of(work, struct iomap_dio_simple_read, work);
> -	struct kiocb *iocb = sr->iocb;
> -	ssize_t ret;
> +			container_of(work, struct iomap_dio_simple_read, work);
>   
> -	ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
> -	iocb->ki_complete(iocb, ret);
> +	WRITE_ONCE(sr->iocb->private, NULL);
> +	sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
>   }
>   
> -static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
> +static void iomap_dio_simple_read_end_io(struct bio *bio)
>   {
> -	struct kiocb *iocb = sr->iocb;
> +	struct iomap_dio_simple_read *sr =
> +		container_of(bio, struct iomap_dio_simple_read, bio);
>   
>   	if (unlikely(sr->bio.bi_status)) {
> -		struct inode *inode = file_inode(iocb->ki_filp);
> +		struct inode *inode = file_inode(sr->iocb->ki_filp);
>   
>   		INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
>   		queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
>   		return;
>   	}
>   
> -	iomap_dio_simple_read_complete_work(&sr->work);
> -}
> -
> -static void iomap_dio_simple_read_end_io(struct bio *bio)
> -{
> -	struct iomap_dio_simple_read *sr = bio->bi_private;
> -
> -	if (sr->waiter) {
> -		struct task_struct *waiter = sr->waiter;
> -
> -		WRITE_ONCE(sr->waiter, NULL);
> -		blk_wake_io_task(waiter);
> -		return;
> -	}
> -
> -	if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
> -	    atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
> -			   IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
> -		iomap_dio_simple_read_async_done(sr);
> +	sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
>   }
>   
>   static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
> @@ -1046,11 +985,13 @@ static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
>   	 */
>   	if (count > inode->i_sb->s_blocksize)
>   		return false;
> -	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
> +	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL |
> +			 IOMAP_DIO_BOUNCE))
>   		return false;
>   	if (iocb->ki_pos + count > i_size_read(inode))
>   		return false;
>   
> +	// XXX: reject fscrypt
>   	return true;
>   }
>   
> @@ -1060,7 +1001,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
>   {
>   	struct inode *inode = file_inode(iocb->ki_filp);
>   	size_t count = iov_iter_count(iter);
> -	int nr_pages;
>   	struct iomap_dio_simple_read *sr;
>   	unsigned int alignment;
>   	struct iomap_iter iomi = {
> @@ -1074,11 +1014,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
>   	bool wait_for_completion = is_sync_kiocb(iocb);
>   	ssize_t ret;
>   
> -	if (dio_flags & IOMAP_DIO_BOUNCE)
> -		nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
> -	else
> -		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
> -
>   	if (iocb->ki_flags & IOCB_NOWAIT)
>   		iomi.flags |= IOMAP_NOWAIT;
>   
> @@ -1120,24 +1055,18 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
>   	if (user_backed_iter(iter))
>   		dio_flags |= IOMAP_DIO_USER_BACKED;
>   
> -	bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
> -			       REQ_OP_READ | REQ_SYNC | REQ_IDLE,
> -			       GFP_KERNEL, &iomap_dio_simple_read_pool);
> +	bio = bio_alloc_bioset(iomi.iomap.bdev,
> +			bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
> +			REQ_OP_READ | REQ_SYNC | REQ_IDLE,
> +			GFP_KERNEL, &iomap_dio_simple_read_pool);
>   	sr = container_of(bio, struct iomap_dio_simple_read, bio);
> -
> -	fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
>   	sr->iocb = iocb;
>   	sr->dio_flags = dio_flags;
>   
>   	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
>   	bio->bi_ioprio = iocb->ki_ioprio;
> -	bio->bi_private = sr;
> -	bio->bi_end_io = iomap_dio_simple_read_end_io;
>   
> -	if (dio_flags & IOMAP_DIO_BOUNCE)
> -		ret = bio_iov_iter_bounce(bio, iter, count);
> -	else
> -		ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
> +	ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
>   	if (unlikely(ret))
>   		goto out_bio_put;
>   
> @@ -1161,49 +1090,22 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
>   		WRITE_ONCE(iocb->private, bio);
>   	}
>   
> -	if (wait_for_completion) {
> -		sr->waiter = current;
> -		blk_crypto_submit_bio(bio);
> -	} else {
> -		atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
> -		sr->waiter = NULL;
> -		blk_crypto_submit_bio(bio);
> -		ret = -EIOCBQUEUED;
> -	}
> -
>   	if (ops->iomap_end)
>   		ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
>   			       &iomi.iomap);
>   
> -	if (wait_for_completion) {
> -		for (;;) {
> -			set_current_state(TASK_UNINTERRUPTIBLE);
> -			if (!READ_ONCE(sr->waiter))
> -				break;
> -			blk_io_schedule();
> -		}
> -		__set_current_state(TASK_RUNNING);
> -
> -		ret = iomap_dio_simple_read_finish(iocb, bio,
> -				blk_status_to_errno(bio->bi_status));
> -		inode_dio_end(inode);
> -		trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
> -					 ret > 0 ? ret : 0);
> -	} else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
> -				  IOMAP_DIO_SIMPLE_QUEUED) ==
> -		   IOMAP_DIO_SIMPLE_DONE) {
> -		ret = iomap_dio_simple_read_complete(iocb, bio);
> -	} else {
> +	if (!wait_for_completion) {
> +		bio->bi_end_io = iomap_dio_simple_read_end_io;
> +		submit_bio(bio);
>   		trace_iomap_dio_rw_queued(inode, iomi.pos, count);
> +		return -EIOCBQUEUED;
>   	}
>   
> -	return ret;
> +	submit_bio_wait(bio);
> +	return iomap_dio_simple_read_complete(sr);
>   
>   out_bio_release_pages:
> -	if (dio_flags & IOMAP_DIO_BOUNCE)
> -		bio_iov_iter_unbounce(bio, true, false);
> -	else
> -		bio_release_pages(bio, false);
> +	bio_release_pages(bio, false);
>   out_bio_put:
>   	bio_put(bio);
>   out_iomap_end:


* [syzbot ci] Re: Data in direntry (dirdata) feature
From: syzbot ci @ 2026-06-24 23:18 UTC (permalink / raw)
  To: ablagodarenko, adilger.kernel, adilger, adilger,
	artem.blagodarenko, linux-ext4, pravin.shelar, xiaowu.417
  Cc: syzbot, syzkaller-bugs
In-Reply-To: <20260624133642.18438-1-ablagodarenko@thelustrecollective.com>

syzbot ci has tested the following series

[v4] Data in direntry (dirdata) feature
https://lore.kernel.org/all/20260624133642.18438-1-ablagodarenko@thelustrecollective.com
* [PATCH v4 01/11] ext4: validate count against limit in ext4_dx_csum_verify/_set
* [PATCH v4 02/11] ext4: replace ext4_dir_entry with ext4_dir_entry_2
* [PATCH v4 03/11] ext4: add ext4_dir_entry_is_tail()
* [PATCH v4 04/11] ext4: refactor dx_root to support variable dirent sizes
* [PATCH v4 05/11] ext4: add dirdata format definitions and access helpers
* [PATCH v4 06/11] ext4: preserve dirdata bits in get_dtype()
* [PATCH v4 07/11] ext4: add ext4_dir_entry_len() and harden dirdata parsing
* [PATCH v4 08/11] ext4: rename ext4_dir_rec_len() and clarify dirdata usage
* [PATCH v4 09/11] ext4: dirdata feature
* [PATCH v4 10/11] ext4: add dirdata set/get helpers
* [PATCH v4 11/11] ext4: Add EXT4_IOC_SET_LUFID ioctl for setting LUFID on directory entries

and found the following issues:
* KASAN: slab-use-after-free Read in ext4_inlinedir_to_tree
* KASAN: use-after-free Read in ext4_inlinedir_to_tree

Full report is available here:
https://ci.syzbot.org/series/7075f9f8-5dad-4e13-83ee-2f76e1e06dcf

***

KASAN: slab-use-after-free Read in ext4_inlinedir_to_tree

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      840ef6c78e6a2f694b578ecb9063241c992aaa9e
arch:      amd64
compiler:  Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
config:    https://ci.syzbot.org/builds/c9c607fc-012f-4e4c-88e7-89d5bade9f75/config
syz repro: https://ci.syzbot.org/findings/3badb95c-16ec-4f87-adf6-da2aca94c39c/syz_repro

EXT4-fs (loop1): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: slab-use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4156 [inline]
BUG: KASAN: slab-use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4189 [inline]
BUG: KASAN: slab-use-after-free in ext4_inlinedir_to_tree+0x864/0x1030 fs/ext4/inline.c:1339
Read of size 1 at addr ffff888108ff7c19 by task syz.1.18/5891

CPU: 0 UID: 0 PID: 5891 Comm: syz.1.18 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dirent_get_data_len fs/ext4/ext4.h:4156 [inline]
 ext4_dir_entry_len fs/ext4/ext4.h:4189 [inline]
 ext4_inlinedir_to_tree+0x864/0x1030 fs/ext4/inline.c:1339
 ext4_htree_fill_tree+0x4b9/0x2140 fs/ext4/namei.c:1206
 ext4_dx_readdir fs/ext4/dir.c:600 [inline]
 ext4_readdir+0x2e2a/0x3720 fs/ext4/dir.c:146
 iterate_dir+0x2e2/0x4d0 fs/readdir.c:110
 ovl_dir_read+0x141/0x4a0 fs/overlayfs/readdir.c:388
 ovl_check_d_type_supported+0xc5/0x150 fs/overlayfs/readdir.c:1167
 ovl_make_workdir fs/overlayfs/super.c:695 [inline]
 ovl_get_workdir fs/overlayfs/super.c:836 [inline]
 ovl_fill_super_creds fs/overlayfs/super.c:1449 [inline]
 ovl_fill_super+0x3a43/0x5d40 fs/overlayfs/super.c:1560
 vfs_get_super fs/super.c:1267 [inline]
 get_tree_nodev+0xbb/0x150 fs/super.c:1286
 vfs_get_tree+0x92/0x2a0 fs/super.c:1694
 fc_mount fs/namespace.c:1198 [inline]
 do_new_mount_fc fs/namespace.c:3765 [inline]
 do_new_mount+0x319/0xdc0 fs/namespace.c:3841
 do_mount fs/namespace.c:4174 [inline]
 __do_sys_mount fs/namespace.c:4390 [inline]
 __se_sys_mount+0x31d/0x420 fs/namespace.c:4367
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f09e159ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f09e24a9028 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f09e1815fa0 RCX: 00007f09e159ce59
RDX: 0000200000000000 RSI: 0000200000000100 RDI: 0000000000000000
RBP: 00007f09e1632e6f R08: 00002000000000c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f09e1816038 R14: 00007f09e1815fa0 R15: 00007fff17222ac8
 </TASK>

Allocated by task 5642:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
 __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415
 kasan_kmalloc include/linux/kasan.h:263 [inline]
 __do_kmalloc_node mm/slub.c:5362 [inline]
 __kmalloc_node_track_caller_noprof+0x4c3/0x730 mm/slub.c:5497
 kmemdup_noprof+0x2b/0x70 mm/util.c:138
 kmemdup_noprof include/linux/fortify-string.h:715 [inline]
 xfrm6_net_sysctl_init net/ipv6/xfrm6_policy.c:206 [inline]
 xfrm6_net_init+0x86/0x180 net/ipv6/xfrm6_policy.c:261
 ops_init+0x35d/0x5d0 net/core/net_namespace.c:137
 setup_net+0x118/0x350 net/core/net_namespace.c:446
 copy_net_ns+0x4f9/0x720 net/core/net_namespace.c:579
 create_new_namespaces+0x3f0/0x6b0 kernel/nsproxy.c:132
 unshare_nsproxy_namespaces+0x149/0x190 kernel/nsproxy.c:234
 ksys_unshare+0x57d/0xa00 kernel/fork.c:3267
 __do_sys_unshare kernel/fork.c:3341 [inline]
 __se_sys_unshare kernel/fork.c:3339 [inline]
 __x64_sys_unshare+0x38/0x50 kernel/fork.c:3339
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 12:
 kasan_save_stack mm/kasan/common.c:57 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:78
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:584
 poison_slab_object mm/kasan/common.c:253 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:2705 [inline]
 slab_free mm/slub.c:6405 [inline]
 kfree+0x1c5/0x640 mm/slub.c:6720
 xfrm6_net_sysctl_exit net/ipv6/xfrm6_policy.c:238 [inline]
 xfrm6_net_exit+0x79/0xa0 net/ipv6/xfrm6_policy.c:270
 ops_exit_list net/core/net_namespace.c:199 [inline]
 ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252
 cleanup_net+0x572/0x810 net/core/net_namespace.c:702
 process_one_work kernel/workqueue.c:3322 [inline]
 process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3405
 worker_thread+0xa47/0xfb0 kernel/workqueue.c:3486
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff888108ff7c00
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 25 bytes inside of
 freed 64-byte region [ffff888108ff7c00, ffff888108ff7c40)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x108ff7
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000000 ffff8881000418c0 dead000000000100 dead000000000122
raw: 0000000000000000 0000000800200020 00000000f5000000 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xd2cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 14581267674, free_ts 0
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x1f9/0x250 mm/page_alloc.c:1859
 prep_new_page mm/page_alloc.c:1867 [inline]
 get_page_from_freelist+0x21fa/0x2270 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5304
 alloc_slab_page mm/slub.c:3294 [inline]
 allocate_slab+0x79/0x5e0 mm/slub.c:3408
 new_slab mm/slub.c:3454 [inline]
 refill_objects+0x2d5/0x350 mm/slub.c:7338
 refill_sheaf mm/slub.c:2832 [inline]
 __pcs_replace_empty_main+0x2bf/0x6b0 mm/slub.c:4703
 alloc_from_pcs mm/slub.c:4801 [inline]
 slab_alloc_node mm/slub.c:4933 [inline]
 __do_kmalloc_node mm/slub.c:5361 [inline]
 __kmalloc_noprof+0x485/0x720 mm/slub.c:5387
 _kmalloc_noprof include/linux/slab.h:973 [inline]
 _kzalloc_noprof include/linux/slab.h:1290 [inline]
 kobject_get_path+0xc5/0x2f0 lib/kobject.c:161
 kobject_uevent_env+0x29e/0x9e0 lib/kobject_uevent.c:548
 device_add+0x544/0xb80 drivers/base/core.c:3738
 device_create_groups_vargs drivers/base/core.c:4454 [inline]
 device_create+0x269/0x300 drivers/base/core.c:4493
 mon_bin_add+0xb6/0x130 drivers/usb/mon/mon_bin.c:1371
 mon_bus_init+0x162/0x2a0 drivers/usb/mon/mon_main.c:291
 mon_bus_add drivers/usb/mon/mon_main.c:188 [inline]
 mon_notify+0x10c/0x3f0 drivers/usb/mon/mon_main.c:219
 notifier_call_chain+0x1a5/0x3d0 kernel/notifier.c:85
 blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380
page_owner free stack trace missing

Memory state around the buggy address:
 ffff888108ff7b00: 00 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc
 ffff888108ff7b80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888108ff7c00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                            ^
 ffff888108ff7c80: 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc
 ffff888108ff7d00: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
==================================================================


***

KASAN: use-after-free Read in ext4_inlinedir_to_tree

tree:      torvalds
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base:      840ef6c78e6a2f694b578ecb9063241c992aaa9e
arch:      amd64
compiler:  Debian clang version 22.1.6 (++20260514074242+fc4aad7b5db3-1~exp1~20260514074407.73), Debian LLD 22.1.6
config:    https://ci.syzbot.org/builds/c9c607fc-012f-4e4c-88e7-89d5bade9f75/config
syz repro: https://ci.syzbot.org/findings/560b0247-7e29-4a4c-91b8-c73d275cb34f/syz_repro

loop0: lost filesystem error report for type 5 error -117
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none.
==================================================================
BUG: KASAN: use-after-free in ext4_dirent_get_data_len fs/ext4/ext4.h:4156 [inline]
BUG: KASAN: use-after-free in ext4_dir_entry_len fs/ext4/ext4.h:4189 [inline]
BUG: KASAN: use-after-free in ext4_inlinedir_to_tree+0x864/0x1030 fs/ext4/inline.c:1339
Read of size 1 at addr ffff888113752019 by task syz.0.17/5794

CPU: 0 UID: 0 PID: 5794 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 ext4_dirent_get_data_len fs/ext4/ext4.h:4156 [inline]
 ext4_dir_entry_len fs/ext4/ext4.h:4189 [inline]
 ext4_inlinedir_to_tree+0x864/0x1030 fs/ext4/inline.c:1339
 ext4_htree_fill_tree+0x4b9/0x2140 fs/ext4/namei.c:1206
 ext4_dx_readdir fs/ext4/dir.c:600 [inline]
 ext4_readdir+0x2e2a/0x3720 fs/ext4/dir.c:146
 iterate_dir+0x2e2/0x4d0 fs/readdir.c:110
 ovl_dir_read+0x141/0x4a0 fs/overlayfs/readdir.c:388
 ovl_check_d_type_supported+0xc5/0x150 fs/overlayfs/readdir.c:1167
 ovl_make_workdir fs/overlayfs/super.c:695 [inline]
 ovl_get_workdir fs/overlayfs/super.c:836 [inline]
 ovl_fill_super_creds fs/overlayfs/super.c:1449 [inline]
 ovl_fill_super+0x3a43/0x5d40 fs/overlayfs/super.c:1560
 vfs_get_super fs/super.c:1267 [inline]
 get_tree_nodev+0xbb/0x150 fs/super.c:1286
 vfs_get_tree+0x92/0x2a0 fs/super.c:1694
 fc_mount fs/namespace.c:1198 [inline]
 do_new_mount_fc fs/namespace.c:3765 [inline]
 do_new_mount+0x319/0xdc0 fs/namespace.c:3841
 do_mount fs/namespace.c:4174 [inline]
 __do_sys_mount fs/namespace.c:4390 [inline]
 __se_sys_mount+0x31d/0x420 fs/namespace.c:4367
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f427399ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffce99c0f38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007f4273c15fa0 RCX: 00007f427399ce59
RDX: 0000200000000000 RSI: 0000200000000100 RDI: 0000000000000000
RBP: 00007f4273a32e6f R08: 00002000000000c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f4273c15fac R14: 00007f4273c15fa0 R15: 00007f4273c15fa0
 </TASK>

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888113752e00 pfn:0x113752
flags: 0x17ff00000000000(node=0|zone=2|lastcpupid=0x7ff)
raw: 017ff00000000000 ffffea0004296d08 ffffea0004501b08 0000000000000000
raw: ffff888113752e00 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5462, tgid 5462 (rm), ts 46014589837, free_ts 77936825094
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x1f9/0x250 mm/page_alloc.c:1859
 prep_new_page mm/page_alloc.c:1867 [inline]
 get_page_from_freelist+0x21fa/0x2270 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5304
 alloc_slab_page mm/slub.c:3294 [inline]
 allocate_slab+0x79/0x5e0 mm/slub.c:3408
 new_slab mm/slub.c:3454 [inline]
 refill_objects+0x2d5/0x350 mm/slub.c:7338
 refill_sheaf mm/slub.c:2832 [inline]
 __prefill_sheaf_pfmemalloc mm/slub.c:5035 [inline]
 kmem_cache_prefill_sheaf+0x2fb/0x550 mm/slub.c:5123
 mt_get_sheaf lib/maple_tree.c:154 [inline]
 mas_alloc_nodes+0x1c2/0x350 lib/maple_tree.c:1119
 mas_preallocate+0x2cf/0x630 lib/maple_tree.c:4961
 vma_iter_prealloc mm/vma.h:577 [inline]
 __split_vma+0x318/0xa50 mm/vma.c:529
 vms_gather_munmap_vmas+0x322/0x1370 mm/vma.c:1427
 __mmap_setup mm/vma.c:2439 [inline]
 __mmap_region mm/vma.c:2756 [inline]
 mmap_region+0x8f9/0x2310 mm/vma.c:2860
 do_mmap+0xc3b/0x10c0 mm/mmap.c:560
 vm_mmap_pgoff+0x272/0x4e0 mm/util.c:581
 ksys_mmap_pgoff+0x4dc/0x760 mm/mmap.c:606
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5736 tgid 5736 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1406 [inline]
 __free_frozen_pages+0xc1e/0xd10 mm/page_alloc.c:2950
 __slab_free+0x274/0x2c0 mm/slub.c:5767
 qlink_free mm/kasan/quarantine.c:163 [inline]
 qlist_free_all+0x99/0x100 mm/kasan/quarantine.c:179
 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286
 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350
 kasan_slab_alloc include/linux/kasan.h:253 [inline]
 slab_post_alloc_hook mm/slub.c:4612 [inline]
 slab_alloc_node mm/slub.c:4945 [inline]
 __kmalloc_cache_noprof+0x2ab/0x660 mm/slub.c:5511
 _kmalloc_noprof include/linux/slab.h:969 [inline]
 _kzalloc_noprof include/linux/slab.h:1290 [inline]
 ref_tracker_alloc+0x15b/0x4b0 lib/ref_tracker.c:270
 __netdev_tracker_alloc include/linux/netdevice.h:4489 [inline]
 netdev_hold include/linux/netdevice.h:4518 [inline]
 rx_queue_add_kobject net/core/net-sysfs.c:1236 [inline]
 net_rx_queue_update_kobjects+0x1c4/0x780 net/core/net-sysfs.c:1301
 register_queue_kobjects net/core/net-sysfs.c:2093 [inline]
 netdev_register_kobject+0x21f/0x310 net/core/net-sysfs.c:2341
 register_netdevice+0x1433/0x1eb0 net/core/dev.c:11439
 ipvlan_link_new+0x3e3/0xa90 drivers/net/ipvlan/ipvlan_main.c:593
 rtnl_newlink_create+0x310/0xb00 net/core/rtnetlink.c:3905
 __rtnl_newlink net/core/rtnetlink.c:4036 [inline]
 rtnl_newlink+0x167f/0x1bd0 net/core/rtnetlink.c:4151
 rtnetlink_rcv_msg+0x802/0xc00 net/core/rtnetlink.c:7068
 netlink_rcv_skb+0x226/0x4a0 net/netlink/af_netlink.c:2556
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x7bb/0x940 net/netlink/af_netlink.c:1345

Memory state around the buggy address:
 ffff888113751f00: 00 00 00 00 00 00 00 04 fc fc fc fc fc fc fc fc
 ffff888113751f80: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff888113752000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                            ^
 ffff888113752080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff888113752100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

^ permalink raw reply

* Re: [PATCH RFC v2 17/18] fs: look up the superblock via the device table in user_get_super()
From: Gao Xiang @ 2026-06-24 22:48 UTC (permalink / raw)
  To: Darrick J. Wong, Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260624175417.GU6078@frogsfrogsfrogs>



On 2026/6/25 01:54, Darrick J. Wong wrote:
> On Tue, Jun 16, 2026 at 04:08:33PM +0200, Christian Brauner wrote:
>> user_get_super() still finds the superblock for a device number by
>> walking the global super_blocks list under sb_lock. Every superblock is
>> registered in the device table under its s_dev since sget_fc() inserts
>> it there, including superblocks on anonymous devices, so use the table
>> instead.
>>
>> The refcount-pinning cursor helpers super_dev_{get,first,next}() only
>> touch table state and do not depend on CONFIG_BLOCK, so drop the
>> CONFIG_BLOCK guard around them: their new caller serves anonymous
>> devices as well (ustat() on e.g. tmpfs) and is built without
>> CONFIG_BLOCK. The guard falls in this patch rather than separately
>> since without this caller the helpers would be unused without
>> CONFIG_BLOCK.
>>
>> The pinned entry holds a passive reference on the superblock so
>> super_lock() can be called directly; once the superblock is locked grab
>> a passive reference for the caller before dropping the pin.
>>
>> The device table contains more than the old walk could find: a
>> superblock is also registered for every additional device it claims
>> (the xfs log and realtime devices, btrfs member devices, the ext4
>> external journal, erofs blob devices). Don't filter those out:
>> specifying any device a filesystem uses now resolves to that
>> filesystem, so ustat() and quotactl() work on e.g. the xfs log device
>> or a btrfs member device (the latter used to fail outright as btrfs
>> superblocks carry an anonymous s_dev that never matches a member
>> device). When several superblocks share a device (erofs blob devices)
>> the first live superblock wins.
> 
> Does erofs have a means to find the other superblocks that share a
> device given a notification coming in on one of them?  
Nope, erofs currently has no way to find the other superblocks
(it doesn't maintain that relationship). My thinking so far has
been that, because erofs is a read-only filesystem, there is no
need to implement a shutdown or notification mechanism in erofs
itself: it is strictly immutable (no local writes or dirty
journals), and the block layer can return I/O errors on dead bdevs
directly even if the block device is shared.  But I may be wrong
if there is a reason why we should maintain the relationship.

Currently it only uses sb->s_type as the holder for bdev sharing,
I think Christian meant that.

Thanks,
Gao Xiang

^ permalink raw reply

* Re: [PATCH RFC v2 17/18] fs: look up the superblock via the device table in user_get_super()
From: Darrick J. Wong @ 2026-06-24 17:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-17-7df6b864028e@kernel.org>

On Tue, Jun 16, 2026 at 04:08:33PM +0200, Christian Brauner wrote:
> user_get_super() still finds the superblock for a device number by
> walking the global super_blocks list under sb_lock. Every superblock is
> registered in the device table under its s_dev since sget_fc() inserts
> it there, including superblocks on anonymous devices, so use the table
> instead.
> 
> The refcount-pinning cursor helpers super_dev_{get,first,next}() only
> touch table state and do not depend on CONFIG_BLOCK, so drop the
> CONFIG_BLOCK guard around them: their new caller serves anonymous
> devices as well (ustat() on e.g. tmpfs) and is built without
> CONFIG_BLOCK. The guard falls in this patch rather than separately
> since without this caller the helpers would be unused without
> CONFIG_BLOCK.
> 
> The pinned entry holds a passive reference on the superblock so
> super_lock() can be called directly; once the superblock is locked grab
> a passive reference for the caller before dropping the pin.
> 
> The device table contains more than the old walk could find: a
> superblock is also registered for every additional device it claims
> (the xfs log and realtime devices, btrfs member devices, the ext4
> external journal, erofs blob devices). Don't filter those out:
> specifying any device a filesystem uses now resolves to that
> filesystem, so ustat() and quotactl() work on e.g. the xfs log device
> or a btrfs member device (the latter used to fail outright as btrfs
> superblocks carry an anonymous s_dev that never matches a member
> device). When several superblocks share a device (erofs blob devices)
> the first live superblock wins.

Does erofs have a means to find the other superblocks that share a
device given a notification coming in on one of them?  As hch says, it
feels weird to have a lookup mechanism when there's also an upcall
mechanism.

<shrug> I've been on vacation for a while so maybe I missed that there's
another use for the bdev->sb lookup?  There are 1600 more emails for me
to go through... :P

--D

> 
> The cursor also keeps scanning past dying superblocks where the old
> walk gave up after the first s_dev match, so a mount racing with the
> unmount of the same device (or with the reuse of a recycled anonymous
> dev_t) finds the live superblock where the old walk could spuriously
> return NULL.
> 
> This removes the last s_dev-keyed walk of the super_blocks list and
> takes ustat() and quotactl()'s block device lookup off sb_lock
> entirely.
> 
> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
> ---
>  fs/super.c | 28 ++++++++--------------------
>  1 file changed, 8 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 2d0a07861bfc..93f24aea75c4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -501,7 +501,6 @@ static int super_dev_register(struct super_block *sb)
>  	return err;
>  }
>  
> -#ifdef CONFIG_BLOCK
>  static struct super_dev *super_dev_get(struct rhlist_head *pos)
>  {
>  	struct super_dev *sb_dev;
> @@ -535,7 +534,6 @@ static struct super_dev *super_dev_next(struct super_dev *prev)
>  	super_dev_put(prev);
>  	return sb_dev;
>  }
> -#endif
>  
>  static void kill_super_notify(struct super_block *sb)
>  {
> @@ -1044,29 +1042,19 @@ EXPORT_SYMBOL(iterate_supers_type);
>  
>  struct super_block *user_get_super(dev_t dev, bool excl)
>  {
> -	struct super_block *sb;
> -
> -	spin_lock(&sb_lock);
> -	list_for_each_entry(sb, &super_blocks, s_list) {
> -		bool locked;
> +	struct super_dev *sb_dev;
>  
> -		if (sb->s_dev != dev)
> -			continue;
> +	for (sb_dev = super_dev_first(dev); sb_dev; sb_dev = super_dev_next(sb_dev)) {
> +		struct super_block *sb = sb_dev->sd_sb;
>  
> -		if (!refcount_inc_not_zero(&sb->s_passive))
> +		if (!super_lock(sb, excl))
>  			continue;
>  
> -		spin_unlock(&sb_lock);
> -
> -		locked = super_lock(sb, excl);
> -		if (locked)
> -			return sb;
> -
> -		put_super(sb);
> -		spin_lock(&sb_lock);
> -		break;
> +		/* The pinned entry holds a passive reference, take our own. */
> +		refcount_inc(&sb->s_passive);
> +		super_dev_put(sb_dev);
> +		return sb;
>  	}
> -	spin_unlock(&sb_lock);
>  	return NULL;
>  }
>  
> 
> -- 
> 2.47.3
> 
> 
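
For readers following along, the "keeps scanning past dying
superblocks" behaviour described in the quoted commit message can be
modelled in user space.  This is only a sketch under assumptions:
struct sketch_sb and both lookup helpers are hypothetical names, the
table is a plain array rather than the real device table, and a
"dying" flag stands in for super_lock() failing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock registered under a dev_t. */
struct sketch_sb {
	unsigned int dev;
	bool dying;	/* models super_lock() failing on this entry */
};

/* Simplified old walk: give up after the first entry matching dev. */
static const struct sketch_sb *
sketch_lookup_first_match(const struct sketch_sb *tbl, size_t n,
			  unsigned int dev)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].dev != dev)
			continue;
		return tbl[i].dying ? NULL : &tbl[i];
	}
	return NULL;
}

/* Simplified new cursor: skip dying entries and keep scanning. */
static const struct sketch_sb *
sketch_lookup_skip_dying(const struct sketch_sb *tbl, size_t n,
			 unsigned int dev)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].dev != dev || tbl[i].dying)
			continue;
		return &tbl[i];
	}
	return NULL;
}
```

With a dying entry ahead of a live one for the same device, the first
helper returns NULL while the second returns the live entry.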

^ permalink raw reply

* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Jan Kara @ 2026-06-24 17:16 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, djwong,
	hch, yi.zhang, yangerkun, yukuai
In-Reply-To: <b0781809-4759-4e12-be17-71555b764f48@gmail.com>

On Mon 22-06-26 20:36:02, Zhang Yi wrote:
> On 6/16/2026 7:47 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:29, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > Add the iomap writeback path for ext4 buffered I/O. This introduces:
> > > 
> > >   - ext4_iomap_writepages(): the main writeback entry point.
> > >   - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
> > >     block mapping and I/O submission.
> > >   - A new end I/O worker for converting unwritten extents, updating file
> > >     size, and handling DATA_ERR_ABORT after I/O completion.
> > > 
> > > Core implementation details:
> > > 
> > >   - ->writeback_range() callback
> > >     Calls ext4_iomap_map_writeback_range() to query the longest range of
> > >     existing mapped extents. For performance, when a block range is not
> > >     yet allocated, it allocates based on the writeback length and delalloc
> > >     extent length, rather than allocating for a single folio at a time.
> > >     The folio is then added to an iomap_ioend instance.
> > > 
> > >   - ->writeback_submit() callback
> > >     Registers ext4_iomap_end_bio() as the end bio callback. This callback
> > >     schedules a worker to handle:
> > >     - Unwritten extent conversion.
> > >     - i_disksize update after data is written back.
> > >     - Journal abort on writeback I/O failure.
> > > 
> > > Key changes and considerations:
> > > 
> > > - Append write and unwritten extents
> > >    Since data=ordered mode is not used to prevent stale data exposure
> > >    during append writebacks, new blocks are always allocated as unwritten
> > >    extents (i.e. always enable dioread_nolock), and i_disksize update is
> > >    postponed until I/O completion. Additionally, the deadlock that the
> > >    reserve handle was expected to resolve does not occur anymore.
> > >    Therefore, the end I/O worker can start a normal journal handle
> > >    instead of a reserve handle when converting unwritten extents.
> > > 
> > > - Lock ordering
> > >    The ->writeback_range() callback runs under the folio lock, requiring
> > >    the journal handle to be started under that same lock. This reverses
> > >    the order compared to the buffer_head writeback path. The lock ordering
> > >    documentation in super.c has been updated accordingly.
> > > 
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > ---
> > >   fs/ext4/ext4.h        |   4 +
> > >   fs/ext4/inode.c       | 208 +++++++++++++++++++++++++++++++++++++++++-
> > >   fs/ext4/page-io.c     | 126 +++++++++++++++++++++++++
> > >   fs/ext4/super.c       |   7 +-
> > >   fs/iomap/ioend.c      |   3 +-
> > >   include/linux/iomap.h |   1 +
> > >   6 files changed, 346 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 4832e7f7db82..078feda47e36 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
> > >   	 */
> > >   	struct list_head i_rsv_conversion_list;
> > >   	struct work_struct i_rsv_conversion_work;
> > > +	struct list_head i_iomap_ioend_list;
> > > +	struct work_struct i_iomap_ioend_work;
> > 
> > Ugh, this adds 48 bytes to ext4 inode. That's pretty heavy. Cannot we reuse
> > i_rsv_conversion_list / work for this? For each inode only one of them
> > should be used AFAICS.
> 
> Thanks for your suggestion. I think we should be able to reuse
> i_rsv_conversion_list / work. We can choose the corresponding
> initialization function for i_rsv_conversion_work based on the buffered
> write path at initialization time, and then reinitialize the work
> handler when changing the path via the ioctl that sets the journal
> flag. That should be sufficient.

Great, thanks.

> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 1ae7d3f4a1c8..a80195bd6f20 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -44,6 +44,7 @@
> > >   #include <linux/iversion.h>
> > >   #include "ext4_jbd2.h"
> > > +#include "ext4_extents.h"
> > >   #include "xattr.h"
> > >   #include "acl.h"
> > >   #include "truncate.h"
> > > @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
> > >   	iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> > >   }
> > > +static int ext4_iomap_map_one_extent(struct inode *inode,
> > > +				     struct ext4_map_blocks *map)
> > > +{
> > > +	struct extent_status es;
> > > +	handle_t *handle = NULL;
> > > +	int credits, map_flags;
> > > +	int retval;
> > > +
> > > +	credits = ext4_chunk_trans_blocks(inode, map->m_len);
> > > +	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> > > +	if (IS_ERR(handle))
> > > +		return PTR_ERR(handle);
> > > +
> > > +	map->m_flags = 0;
> > > +	/*
> > > +	 * It is necessary to look up extent and map blocks under i_data_sem
> > > +	 * in write mode, otherwise, the delalloc extent may become stale
> > > +	 * during concurrent truncate operations.
> > > +	 */
> > > +	ext4_fc_track_inode(handle, inode);
> > > +	down_write(&EXT4_I(inode)->i_data_sem);
> > > +	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > > +		retval = es.es_len - (map->m_lblk - es.es_lblk);
> > > +		map->m_len = min_t(unsigned int, retval, map->m_len);
> > > +
> > > +		if (ext4_es_is_delayed(&es)) {
> > > +			map->m_flags |= EXT4_MAP_DELAYED;
> > > +			trace_ext4_da_write_pages_extent(inode, map);
> > > +			/*
> > > +			 * Call ext4_map_create_blocks() to allocate any
> > > +			 * delayed allocation blocks. It is possible that
> > > +			 * we're going to need more metadata blocks, however
> > > +			 * we must not fail because we're in writeback and
> > > +			 * there is nothing we can do so it might result in
> > > +			 * data loss. So use reserved blocks to allocate
> > > +			 * metadata if possible.
> > > +			 */
> > > +			map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> > > +				    EXT4_GET_BLOCKS_METADATA_NOFAIL |
> > > +				    EXT4_EX_NOCACHE;
> > > +
> > > +			retval = ext4_map_create_blocks(handle, inode, map,
> > > +							map_flags);
> > > +			if (retval > 0)
> > > +				ext4_fc_track_range(handle, inode, map->m_lblk,
> > > +						map->m_lblk + map->m_len - 1);
> > > +			goto out;
> > > +		} else if (unlikely(ext4_es_is_hole(&es)))
> > > +			goto out;
> > > +
> > > +		/* Found written or unwritten extent. */
> > > +		map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> > > +		map->m_flags = ext4_es_is_written(&es) ?
> > > +			       EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> > > +		goto out;
> > > +	}
> > > +
> > > +	retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> > > +out:
> > > +	up_write(&EXT4_I(inode)->i_data_sem);
> > > +	ext4_journal_stop(handle);
> > > +	return retval < 0 ? retval : 0;
> > > +}
> > > +
> > > +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> > > +					  loff_t offset, unsigned int dirty_len)
> > > +{
> > > +	struct inode *inode = wpc->inode;
> > > +	struct super_block *sb = inode->i_sb;
> > > +	struct journal_s *journal = EXT4_SB(sb)->s_journal;
> > > +	struct ext4_map_blocks map;
> > > +	unsigned int blkbits = inode->i_blkbits;
> > > +	unsigned int index = offset >> blkbits;
> > > +	unsigned int blk_end, blk_len;
> > > +	int ret;
> > > +
> > > +	ret = ext4_emergency_state(sb);
> > > +	if (unlikely(ret))
> > > +		return ret;
> > > +
> > > +	/* Check validity of the cached writeback mapping. */
> > > +	if (offset >= wpc->iomap.offset &&
> > > +	    offset < wpc->iomap.offset + wpc->iomap.length &&
> > > +	    ext4_iomap_valid(inode, &wpc->iomap))
> > > +		return 0;
> > > +
> > > +	blk_len = dirty_len >> blkbits;
> > > +	blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> > > +				      (UINT_MAX - 1));
> > > +	if (blk_end > index + blk_len)
> > > +		blk_len = blk_end - index + 1;
> > > +
> > > +retry:
> > > +	map.m_lblk = index;
> > > +	map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> > > +	ret = ext4_map_blocks(NULL, inode, &map,
> > > +			      EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
> > > +	if (ret < 0)
> > > +		return ret;
> > > +
> > > +	/*
> > > +	 * The map is not a delalloc extent, it must either be a hole
> > > +	 * or an extent which have already been allocated.
> > > +	 */
> > > +	if (!(map.m_flags & EXT4_MAP_DELAYED))
> > > +		goto out;
> > > +
> > > +	/* Map one delalloc extent. */
> > > +	ret = ext4_iomap_map_one_extent(inode, &map);
> > 
> > So it looks somewhat strange that here we call ext4_map_blocks() (which
> > consults extent status tree and then possibly on-disk extent tree) and then
> > we call ext4_iomap_map_one_extent() which manipulates with the extent
> > status tree and possibly extent tree as well. Is all this complexity to
> > avoid starting a jbd2 handle unless really needed? If yes, is that really
> > worth it? Given iomap code caches the extent we'd start the transaction
> > only once per mapped extent which shouldn't be that bad?
> > 
> > If you have some benchmark showing this is really worth it,
> 
> There are actually two reasons for this. First, we want to avoid
> starting a journal handle in overwrite scenarios. Second, we want to
> be able to query the extent locklessly without holding i_data_sem in
> overwrite cases as well (note that ext4_es_lookup_extent() in
> ext4_iomap_map_one_extent() is called with i_data_sem held).
> 
> I ran a set of benchmark tests in my VM, performing the following FIO
> overwrite test on a 500GB ramdisk:
> 
> $fio -filename=/test_dir/foo -direct=0 -iodepth=8 -fsync=0 -rw=write \
>            -numjobs=1 -bs=4k -ioengine=io_uring -size=20G -uncached=1 \
>            -runtime=30 --ramp_time=5s -time_based -norandommap=0 \
>            -fallocate=none -overwrite=1 \
>            -group_reporting -name=test --output=/tmp/log
> 
> The results are as follows:
> 
> a: on a non-fragmented file
> A: on a fragmented file [1]
> b: no background metadata pressure
> B: with background metadata pressure [2]
> 
>      buffer_head | iomap pre-map w/o journal | iomap directly map
> a+b:    680                 691                   690
> a+B:    560                 568                   567
> A+b:    637                 633                   579
> A+B:    540                 571                   495
> 
> [1] The file is pre-fragmented such that each block occupies a separate
>     extent.
> [2] A background fsstress process is running (only contains metadata
>     ops):
>     taskset -c 2 fsstress -c -d /test_dir -l 0 -n 1000 -f clonerange=0 \
>             -f copyrange=0 -f awrite=0 -f aread=0 -f dread=0 \
>             -f dwrite=0 -f mread=0 -f mwrite=0 -f readv=0 -f write=0 \
>             -f writev=0 -f read=0 -f sync=0 -f afsync=0 -f fsync=0
> 
> As can be seen, for large contiguous files, the performance impact is
> minimal. However, in heavily fragmented scenarios or under other
> metadata pressure, pre-querying the mapping brings noticeable gains.
> However, this is testing the most extreme case — I'm not sure about
> the real-world impact, so I don't have a strong preference either way.
> But I suppose faster is better, at least not slower than the old
> buffer_head path. :)

OK, thanks for the test! So for fragmented files the optimization of not
starting a transaction seems indeed worth it. I still dislike the
opencoding :) So given we have the reversed lock ordering now, why don't we
teach ext4_map_blocks() to start a transaction (if not provided) just
before it acquires i_data_sem for writing? This should be quite elegant. I
know you have some concerns about possible races below so let's discuss
that separately but at least in terms of performance and code complexity
this would look ideal :).

> > then I'd
> > probably prefer coming up with an ext4_get_blocks flag which tells it to
> > start a transaction on its own if we need to allocate blocks... That would
> > be much simpler than opencoding all this.
> 
> Additionally, there is a key point here. The reason I open-coded
> ext4_iomap_map_writeback_range() is that we must ensure extent query
> and allocation are performed atomically under i_data_sem. Otherwise,
> concurrent truncate could lead to quota leaks.
> 
> Specifically, consider the following scenario: we call
> ext4_map_blocks() to allocate blocks. Suppose there is a delalloc
> extent covering blocks [0,3). While writeback is submitting block 0, a
> concurrent truncate(block 1) occurs:
> 
> wb                         truncate
> ext4_es_lookup_extent()    ext4_truncate_down()
>   //get [0,3)
>                             truncate_inode_pages_range()
>                                //clear page 1&2
>                             ext4_truncate()
>                              down_write(i_data_sem)
>                               ext4_es_remove_extent()
>                                //drop extent [1,3)
>                                //i_reserved_data_blocks: 3->1
>                               up_write(i_data_sem)
> down_write(i_data_sem)
> ext4_map_create_blocks()
>   //alloc 3 blocks
>  ext4_es_insert_extent()
>   //only reclaim 1 block,stale 2 blocks
> up_write(i_data_sem)
> 
> Therefore, if we don't open-code this part, we would need to
> significantly rework ext4_map_blocks(), which might have a larger
> impact at this point. What do you think?

Hum, is something like this really possible? I mean iomap_writepages() will
look up and lock the folio. Only then does it call ->iomap_begin to map it
to the underlying blocks. And the folio lock synchronizes against
truncate_inode_pages_range(), so how would writeback end up trying to
allocate something underlying pages 1 or 2?
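
To make the accounting in the quoted interleaving concrete, here is a
small user-space model.  It is only a sketch: struct sketch_inode and
the sketch_*() helpers are hypothetical, there is no real locking, and
it just replays the arithmetic of the diagram.  It says nothing about
whether the folio lock already rules the interleaving out.

```c
/* Toy model of a delalloc extent [0, delalloc_len). */
struct sketch_inode {
	unsigned int delalloc_len;	/* blocks still delayed */
	unsigned int reserved;		/* models i_reserved_data_blocks */
	unsigned int allocated;		/* blocks really allocated */
};

/* Truncate to new_len blocks: drop the tail and its reservation. */
static void sketch_truncate(struct sketch_inode *in, unsigned int new_len)
{
	if (new_len < in->delalloc_len) {
		in->reserved -= in->delalloc_len - new_len;
		in->delalloc_len = new_len;
	}
}

/*
 * Allocate len blocks and reclaim whatever reservation still exists.
 * Returns the number of blocks allocated without a reservation.
 */
static unsigned int sketch_alloc(struct sketch_inode *in, unsigned int len)
{
	unsigned int reclaimed = len < in->reserved ? len : in->reserved;

	in->allocated += len;
	in->reserved -= reclaimed;
	in->delalloc_len -= len < in->delalloc_len ? len : in->delalloc_len;
	return len - reclaimed;
}

/* Lookup first, truncate races in, then allocate the stale length. */
static unsigned int sketch_stale_lookup(void)
{
	struct sketch_inode in = { .delalloc_len = 3, .reserved = 3 };
	unsigned int len = in.delalloc_len;	/* sees [0,3) */

	sketch_truncate(&in, 1);		/* reserved: 3 -> 1 */
	return sketch_alloc(&in, len);		/* 3 allocated, 1 reclaimed */
}

/* Lookup and allocation done as one step after the truncate. */
static unsigned int sketch_atomic_lookup(void)
{
	struct sketch_inode in = { .delalloc_len = 3, .reserved = 3 };

	sketch_truncate(&in, 1);
	return sketch_alloc(&in, in.delalloc_len);
}
```

In this model the stale lookup leaves two blocks unaccounted for, while
the combined lookup and allocation leaves none.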

> > > +	if (ret < 0) {
> > > +		if (ext4_emergency_state(sb))
> > > +			return ret;
> > > +
> > > +		/*
> > > +		 * Retry transient ENOSPC errors, if
> > > +		 * ext4_count_free_blocks() is non-zero, a commit
> > > +		 * should free up blocks.
> > > +		 */
> > > +		if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> > > +			jbd2_journal_force_commit_nested(journal);
> > > +			goto retry;
> > > +		}
> > > +
> > > +		ext4_msg(sb, KERN_CRIT,
> > > +			 "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> > > +			 inode->i_ino, (unsigned long long)map.m_lblk,
> > > +			 (unsigned int)map.m_len, -ret);
> > > +		ext4_msg(sb, KERN_CRIT,
> > > +			 "This should not happen!! Data will be lost\n");
> > > +		if (ret == -ENOSPC)
> > > +			ext4_print_free_blocks(inode);
> > > +		return ret;
> > > +	}
> > > +out:
> > > +	ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> > > +	return 0;
> > > +}
> > > +
> > 
> > ...
> > 
> > > +void ext4_iomap_end_bio(struct bio *bio)
> > > +{
> > > +	struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
> > > +	struct inode *inode = ioend->io_inode;
> > > +	struct ext4_inode_info *ei = EXT4_I(inode);
> > > +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > > +	unsigned long flags;
> > > +
> > > +	/* Needs to convert unwritten extents or update the i_disksize. */
> > > +	if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
> > > +	    ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
> > > +		goto defer;
> > > +
> > > +	/* Needs to abort the journal on data_err=abort.  */
> > > +	if (unlikely(ioend->io_bio.bi_status))
> > > +		goto defer;
> > > +
> > > +	iomap_finish_ioend(ioend, 0);
> > > +	return;
> > > +defer:
> > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > +	if (list_empty(&ei->i_iomap_ioend_list))
> > > +		queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
> > > +	list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
> > > +	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > +}
> > 
> > For now, I'd prefer to do what XFS does and offload everything. Then you
> > don't have to export iomap_finish_ioend() (which would need to be in a
> > separate patch and acked by iomap maintainers) and the code is more
> > standard. There's a patchset in the works which adds general ioend offloading
> > infrastructure into iomap [1] and when that lands we should get all these
>                       ^^^^^ block layer?
> 
> > bells and whistles (even better ones with percpu work queues, batching,
> > etc.) for free.
> > 
> > [1] https://lore.kernel.org/all/20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu/
> > 
> > 								Honza
> 
> Ha, I've noticed this patchset, so I haven't implemented
> uncached I/O handling for now. As a side note, I have a question:
> if we convert all endio processing to worker threads, IIRC, my
> recollection from previous performance tests is that pure overwrite
> scenarios would see at least a 20% degradation. Is that acceptable?

No, but the latest version of the patches does IO completion in interrupt
context unless the bio is flagged as needing IO completion from process
context, or unless the end_io handler returns a particular error, which
means IO completion wasn't actually done and needs offloading into process
context instead.

> I understand why uncached I/O might need the entire completion path
> in a worker, but can we complete the I/O in interrupt context for
> pure overwrite and then release the page cache in a worker? Must
> page cache invalidation and I/O completion be synchronous?

Strictly speaking no, we can first complete the IO and evict the page cache
independently later. But it would be quite tricky locking-wise (folios and
the mapping containing them can get evicted once the folio writeback bit
gets cleared), so the whole uncached writes handling would have to be
reworked. I don't think it's worth it at this point.
 
> The reason I kept ext4_iomap_end_bio() handling I/O completion in
> interrupt context is for overwrite performance. XFS also handles
> overwrites in interrupt context (via ioend_writeback_end_bio()).
> However, ext4 has the data_error=abort mount option — when this mode
> is set and an I/O error occurs, we must abort the journal in a
> worker. Since we cannot predict I/O errors at submission time, we
> can't directly use ioend_writeback_end_bio() and must instead bind
> our own ext4_iomap_end_bio(). At the same time, I want to avoid
> spawning a worker for pure overwrites when no I/O error occurs, so I
> exported iomap_finish_ioend(). What do you think?

So data_error=abort handling can use exactly the new generic framework: if,
while processing IO completion, we detect that we cannot actually do it in
interrupt context (as in the case of an error), we just return
appropriately from the handler and the generic code will handle offloading
and call the ->end_io callback again.
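
As a rough illustration of that flow (not the API of the patchset,
which I am not quoting here), a user-space sketch could look like the
following.  SKETCH_NEEDS_OFFLOAD, struct sketch_ioend and both
functions are hypothetical names, and the "offload" is a direct second
call instead of queued work.

```c
#include <stdbool.h>

/* Hypothetical "call me again from process context" return value. */
#define SKETCH_NEEDS_OFFLOAD	1

struct sketch_ioend {
	int status;		/* non-zero models a failed bio */
	bool unwritten;		/* needs unwritten extent conversion */
	int irq_calls;		/* handler invocations in "interrupt" */
	int task_calls;		/* handler invocations in "process" */
	bool completed;
};

/* Filesystem handler: bail out of interrupt context when it must. */
static int sketch_end_io(struct sketch_ioend *io, bool in_irq)
{
	if (in_irq) {
		io->irq_calls++;
		/* Errors and conversions need a journal handle. */
		if (io->status || io->unwritten)
			return SKETCH_NEEDS_OFFLOAD;
	} else {
		io->task_calls++;
	}
	io->completed = true;
	return 0;
}

/* Generic side: try in interrupt context, offload only on request. */
static void sketch_complete(struct sketch_ioend *io)
{
	if (sketch_end_io(io, true) == SKETCH_NEEDS_OFFLOAD)
		sketch_end_io(io, false);	/* real code would queue work */
}
```

A clean pure overwrite completes with a single interrupt-context call;
only the error or unwritten case pays for the second call.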

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4] iomap: add simple read path for small direct I/O
From: Christoph Hellwig @ 2026-06-24 15:37 UTC (permalink / raw)
  To: Fengnan Chang
  Cc: brauner, djwong, hch, ojaswin, dgc, linux-xfs, linux-fsdevel,
	linux-ext4, linux-kernel, lidiangang
In-Reply-To: <20260608073134.95964-1-changfengnan@bytedance.com>

Sorry for the delay in getting back to this, I'm a bit overloaded at
the moment.

> -static inline bool should_report_dio_fserror(const struct iomap_dio *dio)
> +static inline bool should_report_dio_fserror(int error)

Can you split all the refactoring into prep patches?

> +/*
> + * In the async simple read path, we need to prevent bio_endio() from
> + * triggering iocb->ki_complete() before the submitter has returned
> + * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
> + *
> + * We use a three-state rendezvous to synchronize the submitter and end_io:
> + *
> + * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
> + *
> + * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
> + * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
> + * ki_complete().
> + *
> + * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
> + * submit path. end_io sets this state and does nothing else. The submitter
> + * will see this state and handle the completion synchronously (bypassing
> + * ki_complete() and returning the actual result).
> + */

I don't think we actually need any of this.  For the sync case we
can just use submit_bio_wait, and for async just always complete
from the end_io handler.  This will simplify the implementation a lot,
and also avoid the atomic.

> +static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)

Btw, I'd drop the _read in the name.  Most of this would work as-is
for trivial overwrites if we figure out when to use them.

> +	if (dio_flags & IOMAP_DIO_BOUNCE)
> +		nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
> +	else
> +		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);

Bounce buffering requires dops, so all this can be dropped.

> +	ret = ops->iomap_begin(inode, iomi.pos, count, iomi.flags,
> +			       &iomi.iomap, &iomi.srcmap);
> +	if (ret) {
> +		inode_dio_end(inode);
> +		return ret;
> +	}
> +
> +	if (iomi.iomap.type != IOMAP_MAPPED ||
> +	    iomi.iomap.offset > iomi.pos ||

I don't think offset > pos can happen

> +	    iomi.iomap.offset + iomi.iomap.length < iomi.pos + count ||
> +	    (iomi.iomap.flags & IOMAP_F_INTEGRITY)) {
> +		ret = -ENOTBLK;
> +		goto out_iomap_end;
> +	}

Given that we already have a fallback here, I'm not sure why this is
limited to a single file system block.  Anything that:

  a) fits into the iomap
  b) fits into a single bio

can be easily supported.  The first condition is trivial, and for
the second we could just check if iter->nr_segs is larger than
BIO_MAX_VECS.
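
To make the two conditions concrete, here is a minimal userspace sketch
of the check.  It is only a model under stated assumptions: the
sketch_* names and the trimmed-down struct are hypothetical and not
part of the patch, and SKETCH_BIO_MAX_VECS merely stands in for the
kernel's BIO_MAX_VECS.  A real check may also have to account for
segments that span several pages.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's BIO_MAX_VECS; assumed value for this sketch. */
#define SKETCH_BIO_MAX_VECS 256

/* Hypothetical, trimmed-down model of the mapping from ->iomap_begin. */
struct sketch_iomap {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* length of the mapping in bytes */
};

/* a) the whole request [pos, pos + count) lies inside the mapping */
static bool sketch_fits_iomap(const struct sketch_iomap *map,
			      uint64_t pos, uint64_t count)
{
	/* Written with subtractions so that pos + count cannot overflow. */
	return pos >= map->offset &&
	       count <= map->length &&
	       pos - map->offset <= map->length - count;
}

/* b) the iterator has few enough segments for a single bio */
static bool sketch_fits_single_bio(size_t nr_segs)
{
	return nr_segs <= SKETCH_BIO_MAX_VECS;
}

/* Both conditions together; anything else would take the fallback path. */
static bool sketch_simple_dio_ok(const struct sketch_iomap *map,
				 uint64_t pos, uint64_t count, size_t nr_segs)
{
	return sketch_fits_iomap(map, pos, count) &&
	       sketch_fits_single_bio(nr_segs);
}
```

With a mapping covering [4096, 12288), a request for the full range in
one segment passes, while a request that starts before the mapping,
runs one byte past its end, or uses more segments than the limit is
rejected.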

> +	if (user_backed_iter(iter))
> +		dio_flags |= IOMAP_DIO_USER_BACKED;

> +	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
> +	bio->bi_ioprio = iocb->ki_ioprio;
> +	bio->bi_private = sr;
> +	bio->bi_end_io = iomap_dio_simple_read_end_io;
> +
> +	if ((dio_flags & IOMAP_DIO_USER_BACKED) &&
> +	    !(dio_flags & IOMAP_DIO_BOUNCE))
> +		bio_set_pages_dirty(bio);
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		bio->bi_opf |= REQ_NOWAIT;
> +	if ((iocb->ki_flags & IOCB_HIPRI) && !wait_for_completion) {
> +		bio->bi_opf |= REQ_POLLED;
> +		bio_set_polled(bio, iocb);
> +		WRITE_ONCE(iocb->private, bio);
> +	}

Can you check if some more of this can be factored into a shared
helper?

Below is a completely untested patch implementing my suggestion
for the completion simplification.  It compiles, but that's about
the only guarantee I can give for it:

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 3cb179752612..c785512e5339 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -909,11 +909,7 @@ struct iomap_dio_simple_read {
 	struct kiocb		*iocb;
 	size_t			size;
 	unsigned int		dio_flags;
-	atomic_t		state;
-	union {
-		struct task_struct	*waiter;
-		struct work_struct	work;
-	};
+	struct work_struct	work;
 	/*
 	 * Align @bio to a cacheline boundary so that, combined with the
 	 * front_pad passed to bioset_init(), the bio sits at the start of
@@ -926,35 +922,12 @@ struct iomap_dio_simple_read {
 
 static struct bio_set iomap_dio_simple_read_pool;
 
-/*
- * In the async simple read path, we need to prevent bio_endio() from
- * triggering iocb->ki_complete() before the submitter has returned
- * -EIOCBQUEUED. Otherwise, the caller might free the iocb concurrently.
- *
- * We use a three-state rendezvous to synchronize the submitter and end_io:
- *
- * IOMAP_DIO_SIMPLE_SUBMITTING: Initial state set before submitting the bio.
- *
- * IOMAP_DIO_SIMPLE_QUEUED: The submitter has safely queued the IO and will
- * return -EIOCBQUEUED. If end_io sees this state, it takes over and calls
- * ki_complete().
- *
- * IOMAP_DIO_SIMPLE_DONE: end_io fired before the submitter finished the
- * submit path. end_io sets this state and does nothing else. The submitter
- * will see this state and handle the completion synchronously (bypassing
- * ki_complete() and returning the actual result).
- */
-enum {
-	IOMAP_DIO_SIMPLE_SUBMITTING = 0,
-	IOMAP_DIO_SIMPLE_QUEUED,
-	IOMAP_DIO_SIMPLE_DONE,
-};
-
-static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
-		struct bio *bio, ssize_t ret)
+static ssize_t iomap_dio_simple_read_complete(struct iomap_dio_simple_read *sr)
 {
+	struct bio *bio = &sr->bio;
+	struct kiocb *iocb = sr->iocb;
 	struct inode *inode = file_inode(iocb->ki_filp);
-	struct iomap_dio_simple_read *sr = bio->bi_private;
+	ssize_t ret = blk_status_to_errno(bio->bi_status);
 
 	if (likely(!ret)) {
 		ret = sr->size;
@@ -965,21 +938,6 @@ static ssize_t iomap_dio_simple_read_finish(struct kiocb *iocb,
 	}
 
 	iomap_dio_bio_release_pages(bio, sr->dio_flags, ret < 0);
-
-	return ret;
-}
-
-static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
-		struct bio *bio)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	ssize_t ret;
-
-	WRITE_ONCE(iocb->private, NULL);
-
-	ret = iomap_dio_simple_read_finish(iocb, bio,
-			blk_status_to_errno(bio->bi_status));
-
 	inode_dio_end(inode);
 	trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0, ret > 0 ? ret : 0);
 	return ret;
@@ -988,45 +946,26 @@ static ssize_t iomap_dio_simple_read_complete(struct kiocb *iocb,
 static void iomap_dio_simple_read_complete_work(struct work_struct *work)
 {
 	struct iomap_dio_simple_read *sr =
-		container_of(work, struct iomap_dio_simple_read, work);
-	struct kiocb *iocb = sr->iocb;
-	ssize_t ret;
+			container_of(work, struct iomap_dio_simple_read, work);
 
-	ret = iomap_dio_simple_read_complete(iocb, &sr->bio);
-	iocb->ki_complete(iocb, ret);
+	WRITE_ONCE(sr->iocb->private, NULL);
+	sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
 }
 
-static void iomap_dio_simple_read_async_done(struct iomap_dio_simple_read *sr)
+static void iomap_dio_simple_read_end_io(struct bio *bio)
 {
-	struct kiocb *iocb = sr->iocb;
+	struct iomap_dio_simple_read *sr =
+		container_of(bio, struct iomap_dio_simple_read, bio);
 
 	if (unlikely(sr->bio.bi_status)) {
-		struct inode *inode = file_inode(iocb->ki_filp);
+		struct inode *inode = file_inode(sr->iocb->ki_filp);
 
 		INIT_WORK(&sr->work, iomap_dio_simple_read_complete_work);
 		queue_work(inode->i_sb->s_dio_done_wq, &sr->work);
 		return;
 	}
 
-	iomap_dio_simple_read_complete_work(&sr->work);
-}
-
-static void iomap_dio_simple_read_end_io(struct bio *bio)
-{
-	struct iomap_dio_simple_read *sr = bio->bi_private;
-
-	if (sr->waiter) {
-		struct task_struct *waiter = sr->waiter;
-
-		WRITE_ONCE(sr->waiter, NULL);
-		blk_wake_io_task(waiter);
-		return;
-	}
-
-	if (likely(atomic_read(&sr->state) == IOMAP_DIO_SIMPLE_QUEUED) ||
-	    atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
-			   IOMAP_DIO_SIMPLE_DONE) == IOMAP_DIO_SIMPLE_QUEUED)
-		iomap_dio_simple_read_async_done(sr);
+	sr->iocb->ki_complete(sr->iocb, iomap_dio_simple_read_complete(sr));
 }
 
 static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
@@ -1046,11 +985,13 @@ static inline bool iomap_dio_simple_read_supported(struct kiocb *iocb,
 	 */
 	if (count > inode->i_sb->s_blocksize)
 		return false;
-	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL))
+	if (dio_flags & (IOMAP_DIO_FORCE_WAIT | IOMAP_DIO_PARTIAL |
+			 IOMAP_DIO_BOUNCE))
 		return false;
 	if (iocb->ki_pos + count > i_size_read(inode))
 		return false;
 
+	// XXX: reject fscrypt
 	return true;
 }
 
@@ -1060,7 +1001,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	size_t count = iov_iter_count(iter);
-	int nr_pages;
 	struct iomap_dio_simple_read *sr;
 	unsigned int alignment;
 	struct iomap_iter iomi = {
@@ -1074,11 +1014,6 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
 	bool wait_for_completion = is_sync_kiocb(iocb);
 	ssize_t ret;
 
-	if (dio_flags & IOMAP_DIO_BOUNCE)
-		nr_pages = bio_iov_bounce_nr_vecs(iter, REQ_OP_READ);
-	else
-		nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS);
-
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
@@ -1120,24 +1055,18 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
 	if (user_backed_iter(iter))
 		dio_flags |= IOMAP_DIO_USER_BACKED;
 
-	bio = bio_alloc_bioset(iomi.iomap.bdev, nr_pages,
-			       REQ_OP_READ | REQ_SYNC | REQ_IDLE,
-			       GFP_KERNEL, &iomap_dio_simple_read_pool);
+	bio = bio_alloc_bioset(iomi.iomap.bdev,
+			bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
+			REQ_OP_READ | REQ_SYNC | REQ_IDLE,
+			GFP_KERNEL, &iomap_dio_simple_read_pool);
 	sr = container_of(bio, struct iomap_dio_simple_read, bio);
-
-	fscrypt_set_bio_crypt_ctx(bio, inode, iomi.pos, GFP_KERNEL);
 	sr->iocb = iocb;
 	sr->dio_flags = dio_flags;
 
 	bio->bi_iter.bi_sector = iomap_sector(&iomi.iomap, iomi.pos);
 	bio->bi_ioprio = iocb->ki_ioprio;
-	bio->bi_private = sr;
-	bio->bi_end_io = iomap_dio_simple_read_end_io;
 
-	if (dio_flags & IOMAP_DIO_BOUNCE)
-		ret = bio_iov_iter_bounce(bio, iter, count);
-	else
-		ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
+	ret = bio_iov_iter_get_pages(bio, iter, alignment - 1);
 	if (unlikely(ret))
 		goto out_bio_put;
 
@@ -1161,49 +1090,22 @@ static ssize_t iomap_dio_simple_read(struct kiocb *iocb,
 		WRITE_ONCE(iocb->private, bio);
 	}
 
-	if (wait_for_completion) {
-		sr->waiter = current;
-		blk_crypto_submit_bio(bio);
-	} else {
-		atomic_set(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING);
-		sr->waiter = NULL;
-		blk_crypto_submit_bio(bio);
-		ret = -EIOCBQUEUED;
-	}
-
 	if (ops->iomap_end)
 		ops->iomap_end(inode, iomi.pos, count, count, iomi.flags,
 			       &iomi.iomap);
 
-	if (wait_for_completion) {
-		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (!READ_ONCE(sr->waiter))
-				break;
-			blk_io_schedule();
-		}
-		__set_current_state(TASK_RUNNING);
-
-		ret = iomap_dio_simple_read_finish(iocb, bio,
-				blk_status_to_errno(bio->bi_status));
-		inode_dio_end(inode);
-		trace_iomap_dio_complete(iocb, ret < 0 ? ret : 0,
-					 ret > 0 ? ret : 0);
-	} else if (atomic_cmpxchg(&sr->state, IOMAP_DIO_SIMPLE_SUBMITTING,
-				  IOMAP_DIO_SIMPLE_QUEUED) ==
-		   IOMAP_DIO_SIMPLE_DONE) {
-		ret = iomap_dio_simple_read_complete(iocb, bio);
-	} else {
+	if (!wait_for_completion) {
+		bio->bi_end_io = iomap_dio_simple_read_end_io;
+		submit_bio(bio);
 		trace_iomap_dio_rw_queued(inode, iomi.pos, count);
+		return -EIOCBQUEUED;
 	}
 
-	return ret;
+	submit_bio_wait(bio);
+	return iomap_dio_simple_read_complete(sr);
 
 out_bio_release_pages:
-	if (dio_flags & IOMAP_DIO_BOUNCE)
-		bio_iov_iter_unbounce(bio, true, false);
-	else
-		bio_release_pages(bio, false);
+	bio_release_pages(bio, false);
 out_bio_put:
 	bio_put(bio);
 out_iomap_end:

^ permalink raw reply related

* Re: [PATCH RFC v2 17/18] fs: look up the superblock via the device table in user_get_super()
From: Christoph Hellwig @ 2026-06-24 14:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-17-7df6b864028e@kernel.org>

On Tue, Jun 16, 2026 at 04:08:33PM +0200, Christian Brauner wrote:
> user_get_super() still finds the superblock for a device number by
> walking the global super_blocks list under sb_lock. Every superblock is
> registered in the device table under its s_dev since sget_fc() inserts
> it there, including superblocks on anonymous devices, so use the table
> instead.

So what is the benefit of this?  It's not like any of these are heavily
used fast paths.


^ permalink raw reply

* Re: [PATCH RFC v2 08/18] fs: add dedicated block device open helpers for filesystems
From: Christoph Hellwig @ 2026-06-24 14:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christian Brauner, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <xlfnmwv2upjia6ozd4z5l5icaewor4a6cgkafnigulndzmt6r7@rhay3h3wablo>

On Mon, Jun 22, 2026 at 06:28:50PM +0200, Jan Kara wrote:
> > +static int fs_bdev_register(struct file *bdev_file, struct super_block *sb)
> > +{
> > +	struct super_dev *sb_dev __free(kfree) = NULL;
> 
> Frankly I find the use of __free on sb_dev more confusing than helping in
> this function. If you didn't use it, you could remove the somewhat
> confusing retain_and_null_ptr() calls below, remove this initialization and
> just put one kfree() into the error handling branch when super_dev_insert()
> fails...

It is.  __free is really annoying for anything but trivial
local-scope-only variables.
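
For illustration, a minimal userspace sketch of the two styles being
compared.  It assumes GCC or Clang's cleanup attribute as a stand-in
for the kernel's __free(kfree); every sketch_* name, including the
single-slot table, is hypothetical and not taken from the patch.

```c
#include <stdlib.h>

struct sketch_dev {
	int id;
};

/* Cleanup handler: receives a pointer to the annotated variable. */
static void sketch_free_dev(struct sketch_dev **p)
{
	free(*p);
}
#define sketch_auto_free __attribute__((cleanup(sketch_free_dev)))

/* Hypothetical table with a single slot; insertion fails when occupied. */
static struct sketch_dev *sketch_slot;

static int sketch_insert(struct sketch_dev *d)
{
	if (sketch_slot)
		return -1;
	sketch_slot = d;
	return 0;
}

/* Variant 1: scope-based cleanup; ownership must be handed off explicitly. */
static int sketch_register_auto(int id)
{
	sketch_auto_free struct sketch_dev *d = malloc(sizeof(*d));

	if (!d)
		return -1;
	d->id = id;
	if (sketch_insert(d))
		return -1;	/* d is freed by the cleanup handler */
	d = NULL;		/* the table owns it now; disarm the cleanup */
	return 0;
}

/* Variant 2: plain pointer with a single free() in the error branch. */
static int sketch_register_plain(int id)
{
	struct sketch_dev *d = malloc(sizeof(*d));

	if (!d)
		return -1;
	d->id = id;
	if (sketch_insert(d)) {
		free(d);
		return -1;
	}
	return 0;
}
```

Both variants behave the same.  The difference is that the first needs
an explicit disarm step once the pointer has escaped the local scope,
which is the role the retain_and_null_ptr() calls play in the quoted
function.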

^ permalink raw reply

* Re: [PATCH RFC v2 07/18] fs: maintain a global device-to-superblock table
From: Christoph Hellwig @ 2026-06-24 14:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-7-7df6b864028e@kernel.org>

FYI, I still think maintaining a dev_t to sb mapping vs having a holder
register a claim is a major step backwards architecturally.  I spent
a lot of effort to get us out of this.


^ permalink raw reply

* Re: [PATCH RFC v2 01/18] xfs: fix the error unwind in xfs_open_devices()
From: Christoph Hellwig @ 2026-06-24 14:05 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Christoph Hellwig, Jens Axboe, Alexander Viro,
	linux-block, linux-kernel, linux-fsdevel, Carlos Maiolino,
	linux-xfs, Chris Mason, David Sterba, linux-btrfs,
	Theodore Ts'o, linux-ext4, Gao Xiang, linux-erofs
In-Reply-To: <20260616-work-super-bdev_holder_global-v2-1-7df6b864028e@kernel.org>

On Tue, Jun 16, 2026 at 04:08:17PM +0200, Christian Brauner wrote:
> Since the rt and log block devices are closed in xfs_free_buftarg() the
> buftarg owns the device file. The error unwind does not respect that:
> when the log buftarg allocation fails, out_free_rtdev_targ frees the rt
> buftarg - releasing rtdev_file - and then falls through to
> out_close_rtdev and releases it a second time.
> 
> The unwind also leaves mp->m_rtdev_targp and mp->m_ddev_targp pointing
> to the freed buftargs. The failed mount continues into
> deactivate_locked_super() -> xfs_kill_sb() -> xfs_mount_free(), which
> frees them again.
> 
> Clear the buftarg pointers once the unwind freed them and clear
> rtdev_file once the rt buftarg owns it, so nothing is released twice.
> 
> Reachable when a buftarg allocation fails after the data buftarg was
> set up: an I/O error in sync_blockdev() or an allocation failure in
> xfs_init_buftarg() while mounting with external rt and log devices.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

I actually have a major rework of this area pending, but it probably
won't land for 7.2, so we might as well get this local fix in ASAP.


^ permalink raw reply

* [RFC PATCH v3 6/6] ext4/067: LUFID and encryption+casefold+dirdata
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko
In-Reply-To: <20260624134957.19209-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Test ext4 LUFID functionality in the complex combination of
encryption, casefold (case-insensitive), and dirdata features.

Verification uses 'debugfs ls -lD' to check for 'fid:' or 'hash:' markers.
Tests also verify that case-insensitive lookups work correctly and that
encrypted file content is preserved after setting LUFID.

This test validates that LUFID works correctly when encryption and
casefold features are enabled, ensuring feature interactions don't
break the LUFID functionality.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 tests/ext4/067     | 137 +++++++++++++++++++++++++++++++++++++++++++++
 tests/ext4/067.out |   4 ++
 2 files changed, 141 insertions(+)

diff --git a/tests/ext4/067 b/tests/ext4/067
new file mode 100755
index 00000000..acb49c40
--- /dev/null
+++ b/tests/ext4/067
@@ -0,0 +1,137 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 The Lustre Collective. All Rights Reserved.
+# Author: Artem Blagodarenko <ablagodarenko@thelustrecollective.com>
+#
+# FS QA Test ext4/067
+#
+# Test ext4 LUFID  with encryption and casefold.
+# EXT4_IOC_SET_LUFID is an ioctl that allows LUFID data to be set on a directory entry.
+# directory name hash is stored by EXT4 when casefold and encryption features are enabled.
+
+. ./common/preamble
+_begin_fstest auto quick encrypt casefold
+
+# Import common functions
+. ./common/filter
+. ./common/encrypt
+. ./common/casefold
+. ./common/attr
+. ./common/ext4
+
+_exclude_fs ext2
+_exclude_fs ext3
+
+_require_scratch_nocheck
+_require_scratch_encryption
+_require_scratch_casefold
+_require_command "$SET_LUFID_PROG"
+
+# Check if dirdata feature is supported (required for LUFID IOCTL)
+_require_scratch_dirdata()
+{
+	if test ! -f /sys/fs/ext4/features/dirdata ; then
+		_notrun "dirdata feature not supported by kernel (required for LUFID)"
+	fi
+
+	# Verify that mkfs supports dirdata
+	if ! $MKFS_EXT4_PROG -O dirdata -n $SCRATCH_DEV &>>$seqres.full ; then
+		_notrun "mkfs.ext4 does not support dirdata feature"
+	fi
+
+	# Verify kernel can mount filesystem with encrypt+casefold+dirdata
+	if ! _scratch_mkfs -O encrypt,casefold,dirdata &>>$seqres.full ; then
+		_notrun "failed to create filesystem with encrypt+casefold+dirdata"
+	fi
+	if ! _try_scratch_mount &>>$seqres.full ; then
+		_notrun "kernel cannot mount filesystem with encrypt+casefold+dirdata"
+	fi
+	_scratch_unmount
+}
+
+_require_scratch_dirdata
+
+# Helper to add a v2 encryption key and set policy on a directory
+_setup_encrypted_dir()
+{
+	local dir=$1
+	local raw_key=$(_generate_raw_encryption_key)
+	local keyspec=$(_add_enckey $SCRATCH_MNT "$raw_key" | awk '{print $NF}')
+	_set_encpolicy $dir $keyspec
+	_casefold_set_attr $dir
+	echo $keyspec
+}
+
+# Create a filesystem with encryption, casefold, and dirdata features
+_scratch_mkfs -O encrypt,casefold,dirdata &>>$seqres.full
+_scratch_mount
+
+# Test: Create file in encrypted+casefolded directory and set three 16-byte LUFIDs
+echo "Test: Set three 16-byte LUFIDs on file in encrypted+casefolded directory"
+mkdir $SCRATCH_MNT/encrypted_dir
+_setup_encrypted_dir $SCRATCH_MNT/encrypted_dir > /dev/null
+
+echo "encrypted content" > $SCRATCH_MNT/encrypted_dir/testfile.txt
+
+lufid_payload=$'\xde\xad\xbe\xef\x01\x02\x03\x04\x11\x12\x13\x14\x21\x22\x23\x24\xca\xfe\xba\xbe\x05\x06\x07\x08\x31\x32\x33\x34\x41\x42\x43\x44\xfe\xed\xfa\xce\x09\x0a\x0b\x0c\x51\x52\x53\x54\x61\x62\x63\x64'
+expected_fid='[0xdeadbeef01020304:0x11121314:0x21222324],[0xcafebabe05060708:0x31323334:0x41424344],[0xfeedface090a0b0c:0x51525354:0x61626364]'
+
+# Set three LUFIDs on the file at the same time (48 bytes total: three 16-byte FIDs)
+# First FID:  [part1 (8 bytes):part2 (4 bytes):part3 (4 bytes)]
+# Second FID: [part1 (8 bytes):part2 (4 bytes):part3 (4 bytes)]
+# Third FID:  [part1 (8 bytes):part2 (4 bytes):part3 (4 bytes)]
+set_lufid $SCRATCH_MNT/encrypted_dir testfile.txt "$lufid_payload" >>$seqres.full 2>&1
+if [ $? -ne 0 ]; then
+	echo "FAIL: Could not set three LUFIDs on encrypted+casefolded file"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Verify file is still accessible
+if [ ! -f $SCRATCH_MNT/encrypted_dir/testfile.txt ]; then
+	echo "FAIL: Encrypted file not accessible after setting LUFID"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Verify file content is preserved
+enc_content=$(cat $SCRATCH_MNT/encrypted_dir/testfile.txt 2>/dev/null)
+if [ "$enc_content" != "encrypted content" ]; then
+	echo "FAIL: Encrypted file content not preserved after setting LUFID"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Test case-insensitive lookup still works with LUFID
+if [ ! -f "$SCRATCH_MNT/encrypted_dir/TESTFILE.TXT" ]; then
+	echo "FAIL: Case-insensitive lookup doesn't work with LUFID"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+echo "Successfully set and verified three 16-byte LUFIDs on encrypted+casefolded file"
+
+# Dump directory structure to verify dirdata
+if ! _dump_dir_structure $SCRATCH_MNT/encrypted_dir testfile.txt "$expected_fid"; then
+	echo "FAIL: Stored LUFID does not match expected value"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Cleanup and verify filesystem
+_scratch_unmount
+_check_scratch_fs
+
+# success, all done
+status=0
+exit
diff --git a/tests/ext4/067.out b/tests/ext4/067.out
new file mode 100644
index 00000000..1c9a8126
--- /dev/null
+++ b/tests/ext4/067.out
@@ -0,0 +1,4 @@
+QA output created by 067
+Test: Set three 16-byte LUFIDs on file in encrypted+casefolded directory
+Successfully set and verified three 16-byte LUFIDs on encrypted+casefolded file
+  Directory structure of encrypted_dir: OK (dirdata verified)
-- 
2.43.7


^ permalink raw reply related

* [RFC PATCH v3 5/6] ext4/066: verify LUFID dirdata operations
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko
In-Reply-To: <20260624134957.19209-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Test ext4 LUFID set/get operations on dirdata fields. This test
verifies that the EXT4_IOC_SET_LUFID ioctl can be used
to attach LUFID data to a directory entry and that `debugfs ls -lD`
can read this data.

Verification uses `debugfs ls -lD` to check for `fid:` markers,
indicating the presence of LUFID data in directory entries.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 tests/ext4/066     | 158 +++++++++++++++++++++++++++++++++++++++++++++
 tests/ext4/066.out |   4 ++
 2 files changed, 162 insertions(+)

diff --git a/tests/ext4/066 b/tests/ext4/066
new file mode 100755
index 00000000..ae98fb45
--- /dev/null
+++ b/tests/ext4/066
@@ -0,0 +1,158 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 The Lustre Collective. All Rights Reserved.
+# Author: Artem Blagodarenko <ablagodarenko@thelustrecollective.com>
+#
+# FS QA Test ext4/066
+#
+# Test the ext4 dirdata feature with LUFID ioctl functionality.
+# LUFID is a 16-byte identifier that can be attached
+# to directory entries. It is used for quick access to file metadata.
+# EXT4_IOC_SET_LUFID is an ioctl that allows LUFID data to be set on a
+# directory entry.
+
+. ./common/preamble
+_begin_fstest auto quick
+
+# Import common functions
+. ./common/filter
+. ./common/ext4
+
+_exclude_fs ext2
+_exclude_fs ext3
+
+_require_scratch_nocheck
+_require_command "$SET_LUFID_PROG"
+
+# Check if dirdata feature is supported (required for LUFID IOCTL)
+_require_scratch_dirdata()
+{
+	if test ! -f /sys/fs/ext4/features/dirdata ; then
+		_notrun "dirdata feature not supported by kernel (required for LUFID)"
+	fi
+
+	# Verify that mkfs supports dirdata
+	if ! $MKFS_EXT4_PROG -O dirdata -n $SCRATCH_DEV &>>$seqres.full ; then
+		_notrun "mkfs.ext4 does not support dirdata feature"
+	fi
+
+	# Verify kernel can mount filesystem with dirdata
+	if ! _scratch_mkfs -O dirdata &>>$seqres.full ; then
+		_notrun "failed to create filesystem with dirdata"
+	fi
+	if ! _try_scratch_mount &>>$seqres.full ; then
+		_notrun "kernel cannot mount filesystem with dirdata"
+	fi
+	_scratch_unmount
+}
+
+_require_scratch_dirdata
+
+_u32_to_le_hex()
+{
+	local v=$1
+	local h
+
+	h=$(printf '%08x' "$((v & 0xffffffff))")
+	printf '%s%s%s%s' "${h:6:2}" "${h:4:2}" "${h:2:2}" "${h:0:2}"
+}
+
+_build_default_expected_fid()
+{
+	local path=$1
+	local inode
+	local version
+	local ino_hi ino_lo
+	local ver_hi ver_lo
+	local seq_hex oid_hex ver_hex
+
+	inode=$(stat -c '%i' "$path") || return 1
+	version=$(debugfs -R "stat <${inode}>" $SCRATCH_DEV 2>/dev/null | \
+		sed -n 's/.*Generation:[[:space:]]*\([0-9xa-fA-F]\+\).*/\1/p' | head -n 1)
+
+	if [ -z "$version" ]; then
+		return 1
+	fi
+
+	ino_hi=$(((inode >> 32) & 0xffffffff))
+	ino_lo=$((inode & 0xffffffff))
+	ver_lo=$((version & 0xffffffff))
+	ver_hi=$(((version >> 32) & 0xffffffff))
+
+	# Match lu_fid cast semantics: set_lufid stores u32 words in native memory
+	# order; debugfs reads lu_fid fields and prints f_seq/f_oid/f_ver.
+	seq_hex="$(_u32_to_le_hex "$ino_hi")$(_u32_to_le_hex "$ino_lo")"
+	oid_hex="$(_u32_to_le_hex "$ver_lo")"
+	ver_hex="$(_u32_to_le_hex "$ver_hi")"
+
+	printf '[0x%x:0x%x:0x%x]' "$((16#$seq_hex))" "$((16#$oid_hex))" \
+		"$((16#$ver_hex))"
+}
+
+# Create a filesystem with dirdata feature
+_scratch_mkfs -O dirdata &>>$seqres.full
+_scratch_mount
+
+# Test: Create file and set multiple 16-byte LUFIDs on the same file
+echo "Test: Set multiple 16-byte LUFIDs on the same file"
+mkdir -p $SCRATCH_MNT/lufid_test
+echo "test content" > $SCRATCH_MNT/lufid_test/testfile.txt
+
+# Set both LUFIDs on the file at the same time (32 bytes total: two 16-byte FIDs)
+# First FID:  [part1 (8 bytes):part2 (4 bytes):part3 (4 bytes)]
+# Second FID: [part1 (8 bytes):part2 (4 bytes):part3 (4 bytes)]
+set_lufid $SCRATCH_MNT/lufid_test testfile.txt >>$seqres.full
+if [ $? -ne 0 ]; then
+	echo "FAIL: Could not set both LUFIDs on testfile.txt"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Verify file is still accessible
+if [ ! -f $SCRATCH_MNT/lufid_test/testfile.txt ]; then
+	echo "FAIL: File not accessible after setting LUFIDs"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Verify file content is preserved
+content=$(cat $SCRATCH_MNT/lufid_test/testfile.txt 2>/dev/null)
+if [ "$content" != "test content" ]; then
+	echo "FAIL: File content not preserved after setting LUFIDs"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+expected_fid=$(_build_default_expected_fid $SCRATCH_MNT/lufid_test/testfile.txt)
+if [ -z "$expected_fid" ]; then
+	echo "FAIL: Could not calculate expected default LUFID"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+echo "Successfully set and verified both 16-byte LUFIDs on same file at the same time"
+
+# Dump directory structure to verify dirdata
+if ! _dump_dir_structure $SCRATCH_MNT/lufid_test testfile.txt "$expected_fid"; then
+	echo "FAIL: Stored LUFID does not match expected default value"
+	_scratch_unmount
+	_check_scratch_fs
+	status=1
+	exit
+fi
+
+# Cleanup and verify filesystem
+_scratch_unmount
+_check_scratch_fs
+
+# success, all done
+status=0
+exit
diff --git a/tests/ext4/066.out b/tests/ext4/066.out
new file mode 100644
index 00000000..4ec0fd6d
--- /dev/null
+++ b/tests/ext4/066.out
@@ -0,0 +1,4 @@
+QA output created by 066
+Test: Set multiple 16-byte LUFIDs on the same file
+Successfully set and verified both 16-byte LUFIDs on same file at the same time
+  Directory structure of lufid_test: OK (dirdata verified)
-- 
2.43.7


^ permalink raw reply related

* [RFC PATCH v3 4/6] ext4: add set_lufid utility
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko
In-Reply-To: <20260624134957.19209-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

EXT4 provides the EXT4_IOC_SET_LUFID ioctl, which allows setting
or replacing the LUFID dirdata field for an existing directory entry.

The set_lufid utility uses this ioctl and accepts a directory path,
directory entry name, and LUFID value as arguments.

This utility is used by subsequent dirdata-related tests.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 common/config   |   1 +
 src/Makefile    |   2 +-
 src/set_lufid.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+), 1 deletion(-)

diff --git a/common/config b/common/config
index d5299d5b..a19e1d7e 100644
--- a/common/config
+++ b/common/config
@@ -210,6 +210,7 @@ export LVM_PROG="$(type -P lvm)"
 export LSATTR_PROG="$(type -P lsattr)"
 export CHATTR_PROG="$(type -P chattr)"
 export DEBUGFS_PROG="$(type -P debugfs)"
+export SET_LUFID_PROG="$(type -P set_lufid || echo $here/src/set_lufid)"
 export UUIDGEN_PROG="$(type -P uuidgen)"
 export KEYCTL_PROG="$(type -P keyctl)"
 export XZ_PROG="$(type -P xz)"
diff --git a/src/Makefile b/src/Makefile
index 31ac43b2..a1f161b0 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -36,7 +36,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	fscrypt-crypt-util bulkstat_null_ocount splice-test chprojid_fail \
 	detached_mounts_propagation ext4_resize t_readdir_3 splice2pipe \
 	uuid_ioctl t_snapshot_deleted_subvolume fiemap-fault min_dio_alignment \
-	rw_hint fs-monitor
+	rw_hint fs-monitor set_lufid
 
 EXTRA_EXECS = dmerror fill2attr fill2fs fill2fs_check scaleread.sh \
 	      btrfs_crc32c_forged_name.py popdir.pl popattr.py \
diff --git a/src/set_lufid.c b/src/set_lufid.c
new file mode 100644
index 00000000..92af6c6d
--- /dev/null
+++ b/src/set_lufid.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * set_lufid.c --- Set LUFID on a directory entry using IOCTL
+ *
+ * Copyright (C) 2026 The Lustre Collective. All Rights Reserved.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <dirent.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <errno.h>
+#include <libgen.h>
+#include <stdint.h>
+#include <linux/fs.h>
+
+/* Structure for EXT4_IOC_SET_LUFID - should match kernel definition */
+struct ext4_dirdata_fid {
+	__u8 edf_data_len;
+	char edf_data[255];
+} __attribute__((packed));
+
+struct ext4_set_lufid {
+        __u8 esl_name_len;
+        char esl_name[255 + 1];
+        union {
+                char esl_data[1 + 255];
+                struct ext4_dirdata_fid esl_edf;
+        };
+} __attribute__((packed));
+
+#ifndef EXT4_IOC_SET_LUFID
+#define EXT4_IOC_SET_LUFID	_IOW('f', 47, struct ext4_set_lufid)
+#endif
+
+static void usage(const char *prog)
+{
+	fprintf(stderr, "Usage: %s DIRECTORY FILENAME [LUFID]\n", prog);
+	fprintf(stderr, "  DIRECTORY: path to the directory containing the FILENAME\n");
+	fprintf(stderr, "  FILENAME:  name of the file to set LUFID on\n");
+	fprintf(stderr, "  LUFID: data to attach (generated if not passed)\n");
+	exit(1);
+}
+
+static void dump_lufid_payload(const char *data, int len)
+{
+	int i;
+
+	printf("LUFID payload length: %d bytes\n", len);
+	printf("LUFID payload hex:");
+	for (i = 0; i < len; i++)
+		printf(" %02x", (unsigned char)data[i]);
+	printf("\n");
+}
+
+static int build_default_lufid(int dir_fd, const char *dir_path, const char *filename,
+			       uint32_t fid[4])
+{
+	int file_fd;
+	unsigned long ver = 0;
+	struct stat st;
+
+	/* Build an IGIF-style default payload from inode + version. */
+	file_fd = openat(dir_fd, filename, O_RDONLY | O_CLOEXEC);
+	if (file_fd < 0) {
+		fprintf(stderr, "Error opening %s/%s: %s\n",
+			dir_path, filename, strerror(errno));
+		return -1;
+	}
+
+	if (fstat(file_fd, &st) < 0) {
+		fprintf(stderr, "Error stating %s/%s: %s\n",
+			dir_path, filename, strerror(errno));
+		close(file_fd);
+		return -1;
+	}
+
+	fid[0] = (uint32_t)(st.st_ino >> 32);
+	fid[1] = (uint32_t)st.st_ino;
+
+	if (ioctl(file_fd, FS_IOC_GETVERSION, &ver) < 0) {
+		fprintf(stderr, "Error calling FS_IOC_GETVERSION for %s/%s: %s\n",
+			dir_path, filename, strerror(errno));
+		close(file_fd);
+		return -1;
+	}
+
+	fid[2] = (uint32_t)ver;
+	fid[3] = (uint32_t)(ver >> 32);
+
+	close(file_fd);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	const char *dir_path;
+	const char *filename;
+	const char *lufid_data;
+	DIR *dirp;
+	int fd;
+	int name_len, data_len;
+	struct ext4_set_lufid lufid_args;
+	struct stat st;
+	uint32_t fid[4];
+	int rc;
+
+	if (argc < 3) {
+		usage(argv[0]);
+	}
+
+	dir_path = argv[1];
+	filename = argv[2];
+	name_len = strlen(filename) + 1;	/* +1 for NUL terminator */
+
+	if (name_len == 0 || name_len > 255) {
+		fprintf(stderr, "Error: Invalid filename length: %d (must be 1-256 with NUL)\n",
+			name_len);
+		return 1;
+	}
+
+	/* Check if directory exists and is a directory */
+	if (stat(dir_path, &st) < 0) {
+		fprintf(stderr, "Error accessing %s: %s\n", dir_path, strerror(errno));
+		return 1;
+	}
+
+	if (!S_ISDIR(st.st_mode)) {
+		fprintf(stderr, "Error: %s is not a directory\n", dir_path);
+		return 1;
+	}
+
+	/* Open the directory */
+	dirp = opendir(dir_path);
+	if (!dirp) {
+		fprintf(stderr, "Error opening directory %s: %s\n", dir_path, strerror(errno));
+		return 1;
+	}
+
+	fd = dirfd(dirp);
+	if (fd < 0) {
+		fprintf(stderr, "Error getting directory fd: %s\n", strerror(errno));
+		closedir(dirp);
+		return 1;
+	}
+
+	if (argc > 3) {
+		lufid_data = argv[3];
+		data_len = strlen(lufid_data);
+	} else {
+		rc = build_default_lufid(fd, dir_path, filename, fid);
+		if (rc) {
+			fprintf(stderr, "Error getting lufid for %s/%s\n",
+				dir_path, filename);
+			closedir(dirp);
+			return 1;
+		}
+		lufid_data = (char *)fid;
+		data_len = sizeof(fid);
+	}
+
+	if (data_len == 0 || data_len > 255) {
+		fprintf(stderr, "Error: Invalid LUFID data length: %d (must be 1-255)\n",
+			data_len);
+		closedir(dirp);
+		return 1;
+	}
+
+	/* Prepare LUFID data */
+	memset(&lufid_args, 0, sizeof(lufid_args));
+	lufid_args.esl_name_len = name_len;
+	lufid_args.esl_edf.edf_data_len = data_len;
+	/* Ensure filename is properly NUL-terminated at the correct position */
+	strncpy(lufid_args.esl_name, filename, name_len - 1);
+	lufid_args.esl_name[name_len - 1] = '\0';
+	memcpy(lufid_args.esl_edf.edf_data, lufid_data, data_len);
+
+	/* Call the ioctl */
+	if (ioctl(fd, EXT4_IOC_SET_LUFID, &lufid_args)) {
+		fprintf(stderr, "Error calling EXT4_IOC_SET_LUFID for %s/%s: %s\n",
+			dir_path, filename, strerror(errno));
+		closedir(dirp);
+		return 1;
+	}
+
+	closedir(dirp);
+	printf("Successfully set LUFID on %s in directory %s\n", filename, dir_path);
+	dump_lufid_payload(lufid_args.esl_edf.edf_data, data_len);
+
+	return 0;
+}
-- 
2.43.7


^ permalink raw reply related

* [RFC PATCH v3 3/6] ext4/065 encryption + casefold + dirdata feature combination
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko
In-Reply-To: <20260624134957.19209-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Test ext4 encryption + casefold + dirdata feature combination.
This test verifies that files created in directories with encryption,
case-insensitive (casefold), and dirdata attributes work correctly.
See ext4/064 for the same test WITHOUT dirdata feature.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 tests/ext4/065     | 217 +++++++++++++++++++++++++++++++++++++++++++++
 tests/ext4/065.out |  26 ++++++
 2 files changed, 243 insertions(+)

diff --git a/tests/ext4/065 b/tests/ext4/065
new file mode 100755
index 00000000..0ad7a382
--- /dev/null
+++ b/tests/ext4/065
@@ -0,0 +1,217 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 The Lustre Collective.  All Rights Reserved.
+# Author: Artem Blagodarenko <ablagodarenko@thelustrecollective.com>
+#
+# FS QA Test ext4/065
+#
+# Test ext4 encryption + casefold + dirdata feature combination.
+# This test verifies that files created in directories with encryption,
+# case-insensitive (casefold), and dirdata attributes work correctly.
+# See ext4/064 for the same test WITHOUT dirdata feature.
+#
+. ./common/preamble
+_begin_fstest auto quick encrypt casefold
+
+# get standard environment and checks
+. ./common/filter
+. ./common/encrypt
+. ./common/casefold
+. ./common/attr
+. ./common/ext4
+
+_exclude_fs ext2
+_exclude_fs ext3
+
+# Check if dirdata feature is supported and can be used with encrypt+casefold
+_require_scratch_dirdata()
+{
+	if test ! -f /sys/fs/ext4/features/dirdata ; then
+		_notrun "dirdata feature not supported by kernel"
+	fi
+
+	# Debug: log e2fsprogs tool paths and versions
+	echo "=== _require_scratch_dirdata debug info ===" >> $seqres.full
+	echo "E2FSCK_PROG: $E2FSCK_PROG" >> $seqres.full
+	echo "E2FSCK_PROG resolved: $(type -P e2fsck)" >> $seqres.full
+	echo "MKFS_EXT4_PROG: $MKFS_EXT4_PROG" >> $seqres.full
+	echo "fsck -t ext4 resolves to: $(type -P fsck.ext4)" >> $seqres.full
+	$E2FSCK_PROG -V >> $seqres.full 2>&1
+	$MKFS_EXT4_PROG -V >> $seqres.full 2>&1
+	echo "=== end debug info ===" >> $seqres.full
+
+	# Also verify that mkfs supports dirdata
+	if ! $MKFS_EXT4_PROG -O dirdata -n $SCRATCH_DEV &>>$seqres.full ; then
+		_notrun "mkfs.ext4 does not support dirdata feature"
+	fi
+
+	# Verify kernel can mount filesystem with encrypt+casefold+dirdata
+	echo "Running: _scratch_mkfs -O encrypt,casefold,dirdata" >> $seqres.full
+	if ! _scratch_mkfs -O encrypt,casefold,dirdata &>>$seqres.full ; then
+		_notrun "failed to create filesystem with encrypt+casefold+dirdata"
+	fi
+	if ! _try_scratch_mount &>>$seqres.full ; then
+		_notrun "kernel cannot mount filesystem with encrypt+casefold+dirdata"
+	fi
+	_scratch_unmount
+}
+
+_require_scratch_nocheck
+_require_scratch_encryption
+_require_scratch_casefold
+_require_scratch_dirdata
+_require_xfs_io_command "set_encpolicy"
+_require_xfs_io_command "add_enckey"
+
+# Helper to add a v2 encryption key and set policy on a directory
+_setup_encrypted_casefold_dir()
+{
+	local dir=$1
+	local raw_key=$(_generate_raw_encryption_key)
+	local keyspec=$(_add_enckey $SCRATCH_MNT "$raw_key" | awk '{print $NF}')
+	_set_encpolicy $dir $keyspec
+	_casefold_set_attr $dir
+	echo $keyspec
+}
+
+# Create a filesystem with encrypt, casefold, and dirdata features
+# Debug: log e2fsprogs tool paths and versions
+echo "=== e2fsprogs debug info ===" >> $seqres.full
+echo "E2FSCK_PROG: $E2FSCK_PROG" >> $seqres.full
+echo "E2FSCK_PROG resolved: $(type -P e2fsck)" >> $seqres.full
+echo "MKFS_EXT4_PROG: $MKFS_EXT4_PROG" >> $seqres.full
+echo "FSCK_OPTIONS: $FSCK_OPTIONS" >> $seqres.full
+echo "fsck -t ext4 resolves to: $(type -P fsck.ext4)" >> $seqres.full
+$E2FSCK_PROG -V >> $seqres.full 2>&1
+$MKFS_EXT4_PROG -V >> $seqres.full 2>&1
+echo "=== end e2fsprogs debug info ===" >> $seqres.full
+
+_scratch_mkfs -O encrypt,casefold,dirdata &>>$seqres.full
+_scratch_mount
+
+# Test 1: Create an encrypted + casefolded directory and verify lookups work
+echo "Test 1: Basic encrypted casefold lookup with dirdata"
+mkdir $SCRATCH_MNT/test1
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test1 > /dev/null
+
+# Create file with lowercase, lookup with uppercase
+echo "hello" > $SCRATCH_MNT/test1/testfile.txt
+if [ -f "$SCRATCH_MNT/test1/TESTFILE.TXT" ]; then
+	echo "Case-insensitive lookup works in encrypted dir"
+else
+	echo "FAIL: Case-insensitive lookup failed in encrypted dir"
+fi
+
+# Verify the exact name on disk is preserved
+if _casefold_check_exact_name "$SCRATCH_MNT/test1" "testfile.txt"; then
+	echo "Original filename preserved"
+else
+	echo "FAIL: Original filename not preserved"
+fi
+_dump_dir_structure $SCRATCH_MNT/test1
+
+# Test 2: Create files with different case variations
+echo "Test 2: Conflicting names in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test2
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test2 > /dev/null
+
+echo "first" > $SCRATCH_MNT/test2/MyFile.txt
+# This should fail or overwrite since "MYFILE.TXT" is equivalent
+echo "second" > $SCRATCH_MNT/test2/MYFILE.TXT 2>/dev/null
+content=$(cat $SCRATCH_MNT/test2/myfile.txt)
+echo "Content after writes: $content"
+_dump_dir_structure $SCRATCH_MNT/test2
+
+# Test 3: Unicode normalization in encrypted casefold dir
+echo "Test 3: Unicode in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test3
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test3 > /dev/null
+
+# Test with UTF-8 characters
+fr_file1=$(echo -e "cafe\xcc\x81.txt")
+fr_file2=$(echo -e "caf\xc3\xa9.txt")
+echo "french" > "$SCRATCH_MNT/test3/$fr_file1"
+if [ -f "$SCRATCH_MNT/test3/$fr_file2" ]; then
+	echo "Unicode normalization works in encrypted dir"
+else
+	echo "FAIL: Unicode normalization failed in encrypted dir"
+fi
+_dump_dir_structure $SCRATCH_MNT/test3
+
+# Test 4: Directory operations in encrypted casefold dir
+echo "Test 4: Directory operations in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test4
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test4 > /dev/null
+
+mkdir $SCRATCH_MNT/test4/SubDir
+if [ -d "$SCRATCH_MNT/test4/SUBDIR" ]; then
+	echo "Directory case-insensitive lookup works"
+else
+	echo "FAIL: Directory case-insensitive lookup failed"
+fi
+_dump_dir_structure $SCRATCH_MNT/test4
+
+# Test 5: Verify inheritance of casefold+encryption in subdirectories
+echo "Test 5: Inheritance of attributes"
+mkdir $SCRATCH_MNT/test5
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test5 > /dev/null
+
+mkdir $SCRATCH_MNT/test5/child
+echo "data" > $SCRATCH_MNT/test5/child/file.txt
+if [ -f "$SCRATCH_MNT/test5/CHILD/FILE.TXT" ]; then
+	echo "Attributes inherited correctly"
+else
+	echo "FAIL: Attributes not inherited"
+fi
+_dump_dir_structure $SCRATCH_MNT/test5
+
+# Test 6: Remove and recreate with different case
+echo "Test 6: Remove and recreate with different case"
+mkdir $SCRATCH_MNT/test6
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test6 > /dev/null
+
+echo "original" > $SCRATCH_MNT/test6/RemoveMe.txt
+rm $SCRATCH_MNT/test6/REMOVEME.TXT
+echo "recreated" > $SCRATCH_MNT/test6/REMOVEME.TXT
+if _casefold_check_exact_name "$SCRATCH_MNT/test6" "REMOVEME.TXT"; then
+	echo "Recreated file has new case"
+else
+	echo "FAIL: Recreated file case incorrect"
+fi
+_dump_dir_structure $SCRATCH_MNT/test6
+
+# Test 7: Hard links in encrypted casefold dir
+echo "Test 7: Hard links in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test7
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test7 > /dev/null
+
+echo "linkdata" > $SCRATCH_MNT/test7/original.txt
+ln $SCRATCH_MNT/test7/original.txt $SCRATCH_MNT/test7/hardlink.txt
+if [ -f "$SCRATCH_MNT/test7/HARDLINK.TXT" ]; then
+	echo "Hard link case-insensitive lookup works"
+else
+	echo "FAIL: Hard link case-insensitive lookup failed"
+fi
+_dump_dir_structure $SCRATCH_MNT/test7
+
+# Cleanup and verify filesystem
+_scratch_unmount
+
+# Dirdata analysis summary
+echo ""
+echo "Dirdata analysis:"
+echo "=== e2fsprogs debug info (before _check_scratch_fs) ===" >> $seqres.full
+echo "E2FSCK_PROG: $E2FSCK_PROG" >> $seqres.full
+echo "E2FSCK_PROG resolved: $(type -P e2fsck)" >> $seqres.full
+echo "fsck -t ext4 resolves to: $(type -P fsck.ext4)" >> $seqres.full
+echo "FSCK_OPTIONS: $FSCK_OPTIONS" >> $seqres.full
+$E2FSCK_PROG -V >> $seqres.full 2>&1
+echo "=== end e2fsprogs debug info ===" >> $seqres.full
+
+_check_scratch_fs
+
+echo "Encrypted casefold tests with dirdata completed"
+
+# success, all done
+status=0
+exit
diff --git a/tests/ext4/065.out b/tests/ext4/065.out
new file mode 100644
index 00000000..c1316430
--- /dev/null
+++ b/tests/ext4/065.out
@@ -0,0 +1,26 @@
+QA output created by 065
+Test 1: Basic encrypted casefold lookup with dirdata
+Case-insensitive lookup works in encrypted dir
+Original filename preserved
+  Directory structure of test1: OK (dirdata verified)
+Test 2: Conflicting names in encrypted casefold dir
+Content after writes: second
+  Directory structure of test2: OK (dirdata verified)
+Test 3: Unicode in encrypted casefold dir
+Unicode normalization works in encrypted dir
+  Directory structure of test3: OK (dirdata verified)
+Test 4: Directory operations in encrypted casefold dir
+Directory case-insensitive lookup works
+  Directory structure of test4: OK (dirdata verified)
+Test 5: Inheritance of attributes
+Attributes inherited correctly
+  Directory structure of test5: OK (dirdata verified)
+Test 6: Remove and recreate with different case
+Recreated file has new case
+  Directory structure of test6: OK (dirdata verified)
+Test 7: Hard links in encrypted casefold dir
+Hard link case-insensitive lookup works
+  Directory structure of test7: OK (dirdata verified)
+
+Dirdata analysis:
+Encrypted casefold tests with dirdata completed
-- 
2.43.7


^ permalink raw reply related

* [RFC PATCH v3 2/6] ext4/064 encryption + casefold feature combination WITHOUT dirdata
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko
In-Reply-To: <20260624134957.19209-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

This test verifies that files created in directories with both
encryption and case-insensitive (casefold) attributes work correctly.
See ext4/065 for the same test WITH dirdata feature enabled.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 tests/ext4/064     | 153 +++++++++++++++++++++++++++++++++++++++++++++
 tests/ext4/064.out |  17 +++++
 2 files changed, 170 insertions(+)

diff --git a/tests/ext4/064 b/tests/ext4/064
new file mode 100755
index 00000000..53450927
--- /dev/null
+++ b/tests/ext4/064
@@ -0,0 +1,153 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2026 The Lustre Collective.  All Rights Reserved.
+# Author: Artem Blagodarenko <ablagodarenko@thelustrecollective.com>
+#
+# FS QA Test ext4/064
+#
+# Test ext4 encryption + casefold feature combination WITHOUT dirdata.
+# This test verifies that files created in directories with both
+# encryption and case-insensitive (casefold) attributes work correctly.
+# See ext4/065 for the same test WITH dirdata feature enabled.
+#
+. ./common/preamble
+_begin_fstest auto quick encrypt casefold
+
+# get standard environment and checks
+. ./common/filter
+. ./common/encrypt
+. ./common/casefold
+. ./common/attr
+
+_exclude_fs ext2
+_exclude_fs ext3
+
+_require_scratch_nocheck
+_require_scratch_encryption
+_require_scratch_casefold
+_require_xfs_io_command "set_encpolicy"
+_require_xfs_io_command "add_enckey"
+
+# Helper to add a v2 encryption key and set policy on a directory
+_setup_encrypted_casefold_dir()
+{
+	local dir=$1
+	local raw_key=$(_generate_raw_encryption_key)
+	local keyspec=$(_add_enckey $SCRATCH_MNT "$raw_key" | awk '{print $NF}')
+	_set_encpolicy $dir $keyspec
+	_casefold_set_attr $dir
+	echo $keyspec
+}
+
+# Create a filesystem with both encrypt and casefold features
+_scratch_mkfs -O encrypt,casefold &>>$seqres.full
+_scratch_mount
+
+# Test 1: Create an encrypted + casefolded directory and verify lookups work
+echo "Test 1: Basic encrypted casefold lookup"
+mkdir $SCRATCH_MNT/test1
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test1 > /dev/null
+
+# Create file with lowercase, lookup with uppercase
+echo "hello" > $SCRATCH_MNT/test1/testfile.txt
+if [ -f "$SCRATCH_MNT/test1/TESTFILE.TXT" ]; then
+	echo "Case-insensitive lookup works in encrypted dir"
+else
+	echo "FAIL: Case-insensitive lookup failed in encrypted dir"
+fi
+
+# Verify the exact name on disk is preserved
+if _casefold_check_exact_name "$SCRATCH_MNT/test1" "testfile.txt"; then
+	echo "Original filename preserved"
+else
+	echo "FAIL: Original filename not preserved"
+fi
+
+# Test 2: Create files with different case variations
+echo "Test 2: Conflicting names in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test2
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test2 > /dev/null
+
+echo "first" > $SCRATCH_MNT/test2/MyFile.txt
+# This should fail or overwrite since "MYFILE.TXT" is equivalent
+echo "second" > $SCRATCH_MNT/test2/MYFILE.TXT 2>/dev/null
+content=$(cat $SCRATCH_MNT/test2/myfile.txt)
+echo "Content after writes: $content"
+
+# Test 3: Unicode normalization in encrypted casefold dir
+echo "Test 3: Unicode in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test3
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test3 > /dev/null
+
+# Test with UTF-8 characters
+fr_file1=$(echo -e "cafe\xcc\x81.txt")
+fr_file2=$(echo -e "caf\xc3\xa9.txt")
+echo "french" > "$SCRATCH_MNT/test3/$fr_file1"
+if [ -f "$SCRATCH_MNT/test3/$fr_file2" ]; then
+	echo "Unicode normalization works in encrypted dir"
+else
+	echo "FAIL: Unicode normalization failed in encrypted dir"
+fi
+
+# Test 4: Directory operations in encrypted casefold dir
+echo "Test 4: Directory operations in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test4
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test4 > /dev/null
+
+mkdir $SCRATCH_MNT/test4/SubDir
+if [ -d "$SCRATCH_MNT/test4/SUBDIR" ]; then
+	echo "Directory case-insensitive lookup works"
+else
+	echo "FAIL: Directory case-insensitive lookup failed"
+fi
+
+# Test 5: Verify inheritance of casefold+encryption in subdirectories
+echo "Test 5: Inheritance of attributes"
+mkdir $SCRATCH_MNT/test5
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test5 > /dev/null
+
+mkdir $SCRATCH_MNT/test5/child
+echo "data" > $SCRATCH_MNT/test5/child/file.txt
+if [ -f "$SCRATCH_MNT/test5/CHILD/FILE.TXT" ]; then
+	echo "Attributes inherited correctly"
+else
+	echo "FAIL: Attributes not inherited"
+fi
+
+# Test 6: Remove and recreate with different case
+echo "Test 6: Remove and recreate with different case"
+mkdir $SCRATCH_MNT/test6
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test6 > /dev/null
+
+echo "original" > $SCRATCH_MNT/test6/RemoveMe.txt
+rm $SCRATCH_MNT/test6/REMOVEME.TXT
+echo "recreated" > $SCRATCH_MNT/test6/REMOVEME.TXT
+if _casefold_check_exact_name "$SCRATCH_MNT/test6" "REMOVEME.TXT"; then
+	echo "Recreated file has new case"
+else
+	echo "FAIL: Recreated file case incorrect"
+fi
+
+# Test 7: Hard links in encrypted casefold dir
+echo "Test 7: Hard links in encrypted casefold dir"
+mkdir $SCRATCH_MNT/test7
+_setup_encrypted_casefold_dir $SCRATCH_MNT/test7 > /dev/null
+
+echo "linkdata" > $SCRATCH_MNT/test7/original.txt
+ln $SCRATCH_MNT/test7/original.txt $SCRATCH_MNT/test7/hardlink.txt
+if [ -f "$SCRATCH_MNT/test7/HARDLINK.TXT" ]; then
+	echo "Hard link case-insensitive lookup works"
+else
+	echo "FAIL: Hard link case-insensitive lookup failed"
+fi
+
+# Cleanup and verify filesystem
+_scratch_unmount
+_check_scratch_fs
+
+echo "Encrypted casefold tests completed"
+
+# success, all done
+status=0
+exit
diff --git a/tests/ext4/064.out b/tests/ext4/064.out
new file mode 100644
index 00000000..0197e51e
--- /dev/null
+++ b/tests/ext4/064.out
@@ -0,0 +1,17 @@
+QA output created by 064
+Test 1: Basic encrypted casefold lookup
+Case-insensitive lookup works in encrypted dir
+Original filename preserved
+Test 2: Conflicting names in encrypted casefold dir
+Content after writes: second
+Test 3: Unicode in encrypted casefold dir
+Unicode normalization works in encrypted dir
+Test 4: Directory operations in encrypted casefold dir
+Directory case-insensitive lookup works
+Test 5: Inheritance of attributes
+Attributes inherited correctly
+Test 6: Remove and recreate with different case
+Recreated file has new case
+Test 7: Hard links in encrypted casefold dir
+Hard link case-insensitive lookup works
+Encrypted casefold tests completed
-- 
2.43.7


^ permalink raw reply related

* [RFC PATCH v3 1/6] ext4: add common helper to check whether dirdata is applied
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko
In-Reply-To: <20260624134957.19209-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Add a helper, _dump_dir_structure, that lists a directory with the
debugfs "ls -lD" command and checks whether any dirdata fields
(fid= or hash=) are present in the output.

This helper will be used by subsequent dirdata-related patches.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
---
 common/ext4 | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/common/ext4 b/common/ext4
index a2ce456d..47c31db9 100644
--- a/common/ext4
+++ b/common/ext4
@@ -242,3 +242,37 @@ _ext4_get_inum_iflags() {
 	debugfs -R "stat <${inumber}>" "${dev}" 2> /dev/null | \
 			sed -n 's/^.*Flags: \([0-9a-fx]*\).*$/\1/p'
 }
+
+# Helper to dump directory structure with hash info (requires dirdata feature)
+# This is useful for verifying that dirdata is storing hash information
+_dump_dir_structure()
+{
+	local dir=$1
+	local dir_name=$(basename $dir)
+	local expected=$3
+
+	local debugfs_output=$({
+		echo "cd $dir_name"
+		echo "ls -lD ."
+		echo "quit"
+	} | debugfs $SCRATCH_DEV 2>/dev/null)
+
+	# DEBUG: uncomment to see full debugfs output
+	# echo "  [DEBUG] debugfs output for $dir_name:"
+	# echo "$debugfs_output" | grep -v "^debugfs:" | sed 's/^/    /'
+
+	# Check if hash data is present (encryption+casefold+dirdata case)
+	# or if fid data is present (dirdata+encryption or dirdata only case)
+	if echo "$debugfs_output" | grep -q "fid="; then
+		local fid_value=$(echo "$debugfs_output" | grep -o "fid=[^ ]*" | head -1 | sed 's/^fid=//')
+		if [ "$fid_value" = "$expected" ]; then
+			echo "  Directory structure of $dir_name: OK (dirdata verified)"
+		else
+			echo "  Directory structure of $dir_name: FAILED (fid mismatch: got '$fid_value', expected '$expected')"
+		fi
+	elif echo "$debugfs_output" | grep -q "hash="; then
+		echo "  Directory structure of $dir_name: OK (dirdata verified)"
+	else
+		echo "  Directory structure of $dir_name: FAILED (no dirdata)"
+	fi
+}
-- 
2.43.7


^ permalink raw reply related

* [RFC PATCH v3 0/6] ext4: tests for the dirdata feature (encryption+casefold, LUFID)
From: Artem Blagodarenko @ 2026-06-24 13:49 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko

These tests cover the ext4 "dirdata" feature (storing extra metadata
in directory entries beyond the file name), sent separately from the
kernel and e2fsprogs dirdata patch series for reference and review.

ext4/064 and ext4/065 verify that encryption and case-insensitive
(casefold) directories continue to work both without and with
dirdata enabled. ext4/066 and ext4/067 exercise the LUFID (Locally
Unique File ID) use of dirdata via a new EXT4_IOC_SET_LUFID ioctl,
using a small set_lufid helper utility added in this series.

Changes in v2:
- Ted Ts'o pointed out that the v1 tests exercised the
  encryption+casefold/dirdata feature combination without actually
  validating that the encrypted hash was stored as a dirdata
  attribute (https://lore.kernel.org/all/20260418214359.GA58909@macsyma-wired.lan/).
  ext4/064 and ext4/065 now use the new _dump_dir_structure helper
  (debugfs-based) to dump and check the on-disk directory entry
  content, confirming the hash is actually present as dirdata rather
  than just exercising the feature combination.
- Zorro Lang asked about a confusingly-named helper
  (_require_encrypted_casefold vs. _require_scratch_casefold); the
  tests now consistently use _require_scratch_casefold.
- Added ext4/066 and ext4/067, plus a new common/ext4 helper and the
  src/set_lufid.c utility, to directly verify LUFID data is correctly
  stored in and retrieved from dirdata via EXT4_IOC_SET_LUFID, including
  in combination with encryption+casefold.

Changes in v3:
- Fixed two off-by-one bugs in src/set_lufid.c, found while testing
  the kernel-side EXT4_IOC_SET_LUFID fixes against this suite:
  - The default (binary fid array) payload path computed
    data_len = sizeof(fid) + 1, copying one byte of stack garbage past
    the 16-byte LUFID struct.
  - The explicit (argv[3]) payload path computed
    data_len = strlen(lufid_data) + 1, treating a NUL terminator as
    part of a binary payload that doesn't have one.
  Both surfaced as a phantom extra FID in ext4/066 and ext4/067's
  dirdata-dump verification once a kernel-side dirdata length
  accounting bug was itself fixed -- that kernel bug had been
  silently masking this test bug by truncating the same garbage byte
  on write.
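
The two length fixes above can be illustrated in isolation. The sketch
below is not taken from src/set_lufid.c: the helper names
default_payload_len() and string_payload_len() are hypothetical, and it
assumes only what this changelog states, namely that the default payload
is a 16-byte array of four 32-bit words and that an explicit payload is
the argument string without its NUL terminator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: length of the default binary payload.
 * sizeof() on the array already covers all four words (16 bytes);
 * adding 1 would copy one byte from beyond the array. */
static size_t default_payload_len(void)
{
	uint32_t fid[4];

	return sizeof(fid);
}

/* Hypothetical helper: length of an explicit payload passed as a C string.
 * strlen() excludes the NUL terminator, which is not part of the payload. */
static size_t string_payload_len(const char *lufid_data)
{
	return strlen(lufid_data);
}
```

With these lengths, a three-character argument yields a three-byte
payload and the default path yields exactly 16 bytes, so no extra byte
reaches the dirdata dump.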

Artem Blagodarenko (6):
  ext4: add common helper to check whether dirdata is applied
  ext4/064 encryption + casefold feature combination WITHOUT dirdata
  ext4/065 encryption + casefold + dirdata feature combination
  ext4: add set_lufid utility
  ext4/066: verify LUFID dirdata operations
  ext4/067: LUFID and encryption+casefold+dirdata

 common/config      |   1 +
 common/ext4        |  34 +++++++
 src/Makefile       |   2 +-
 src/set_lufid.c    | 196 ++++++++++++++++++++++++++++++++++++++++
 tests/ext4/064     | 153 ++++++++++++++++++++++++++++++++
 tests/ext4/064.out |  17 ++++
 tests/ext4/065     | 217 +++++++++++++++++++++++++++++++++++++++++++++
 tests/ext4/065.out |  26 ++++++
 tests/ext4/066     | 158 +++++++++++++++++++++++++++++++++
 tests/ext4/066.out |   4 +
 tests/ext4/067     | 137 ++++++++++++++++++++++++++++
 tests/ext4/067.out |   4 +
 12 files changed, 948 insertions(+), 1 deletion(-)
 create mode 100644 src/set_lufid.c
 create mode 100755 tests/ext4/064
 create mode 100644 tests/ext4/064.out
 create mode 100755 tests/ext4/065
 create mode 100644 tests/ext4/065.out
 create mode 100755 tests/ext4/066
 create mode 100644 tests/ext4/066.out
 create mode 100755 tests/ext4/067
 create mode 100644 tests/ext4/067.out

-- 
2.43.7


^ permalink raw reply

* [PATCH v4 11/11] ext4: Add EXT4_IOC_SET_LUFID ioctl for setting LUFID on directory entries
From: Artem Blagodarenko @ 2026-06-24 13:36 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko, Andreas Dilger
In-Reply-To: <20260624133642.18438-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Add a new ioctl command that allows setting LUFID (Locally Unique File ID)
data on existing directory entries. This includes:

- ext4_ioctl_set_lufid(): ioctl handler that validates parameters and
  calls the underlying implementation
- ext4_set_direntry_lufid(): Core function that performs the operation by:
  * Looking up the target directory entry
  * Retrieving the associated inode
  * Deleting the old entry and re-creating it with LUFID data attached

This implementation requires the dirdata feature to be enabled on the
filesystem and properly handles transactions and inode locking to ensure
consistency.
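
As a rough illustration of the parameter validation, the length checks
can be sketched in user-space C. This is only a sketch under stated
assumptions, not the kernel code: lufid_args_valid(), SKETCH_NAME_LEN
and SKETCH_HDR_LEN are hypothetical names, and the limits (a name of at
most 255 bytes including its NUL, a one-byte length header, and a
header-plus-payload total that must fit in an 8-bit field) are read off
the checks in the ioctl handler in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NAME_LEN	255	/* assumed to mirror EXT4_NAME_LEN */
#define SKETCH_HDR_LEN	1	/* assumed size of the dirdata length header */

/* Return true if the arguments would pass the length checks sketched here.
 * name_len counts the trailing NUL, as in the ioctl argument. */
static bool lufid_args_valid(const char *name, size_t name_len, size_t data_len)
{
	if (name_len == 0 || name_len > SKETCH_NAME_LEN)
		return false;
	/* header + payload must fit in an 8-bit length field */
	if (data_len == 0 || data_len > 255 - SKETCH_HDR_LEN)
		return false;
	/* the name must be NUL-terminated at the stated length */
	if (name[name_len - 1] != '\0')
		return false;
	/* "." and ".." cannot go through the delete + re-add path */
	if (!strcmp(name, ".") || !strcmp(name, ".."))
		return false;
	return true;
}
```

Under these assumptions the largest accepted payload is 254 bytes,
because the header byte is counted in the stored length.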

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 fs/ext4/ext4.h            |  15 ++++
 fs/ext4/ioctl.c           |  84 +++++++++++++++++++++
 fs/ext4/namei.c           | 155 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/ext4.h |  13 ++++
 4 files changed, 267 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5674a64f830f..252a5a529205 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1227,6 +1227,7 @@ struct ext4_inode_info {
 #ifdef CONFIG_FS_ENCRYPTION
 	struct fscrypt_inode_info *i_crypt_info;
 #endif
+	void *i_dirdata;
 };
 
 /*
@@ -2601,6 +2602,18 @@ struct ext4_dirent_hash {
 	struct ext4_dir_entry_hash	dh_hash;
 } __packed;
 
+static inline
+struct ext4_dirent_fid *ext4_dentry_get_fid(struct super_block *sb,
+					    struct ext4_dentry_param *p)
+{
+	if (!ext4_has_feature_dirdata(sb))
+		return NULL;
+	if (p && p->edp_magic == EXT4_LUFID_MAGIC)
+		return &p->edp_dfid;
+
+	return NULL;
+}
+
 #define EXT4_FT_DIR_CSUM	0xDE
 
 /*
@@ -3302,6 +3315,8 @@ static inline int ext4_init_new_dir(handle_t *handle, struct inode *dir,
 }
 extern int ext4_dirblock_csum_verify(struct inode *inode,
 				     struct buffer_head *bh);
+extern int ext4_dirdata_set_lufid(struct inode *dir, const char *filename,
+			   int namelen, struct ext4_dentry_param *edp);
 extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
 				__u32 start_minor_hash, __u32 *next_hash);
 extern int ext4_search_dir(struct buffer_head *bh,
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index c8387e6a2c6e..725e46e1e46d 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1535,6 +1535,87 @@ static int ext4_ioctl_set_tune_sb(struct file *filp,
 	return ret;
 }
 
+/*
+ * ext4_ioctl_set_lufid() - Set LUFID on a directory entry
+ * @filp:	file pointer (parent directory)
+ * @arg:	pointer to ext4_set_lufid structure with filename and LUFID data
+ *
+ * This ioctl allows setting LUFID data on an existing
+ * directory entry. It is called on the parent directory with a filename and
+ * LUFID data.
+ */
+static long ext4_ioctl_set_lufid(struct file *filp, unsigned long arg)
+{
+	struct inode *dir = file_inode(filp);
+	struct mnt_idmap *idmap = file_mnt_idmap(filp);
+	struct ext4_set_lufid lufid_args;
+	struct {
+		__u32 edp_magic;
+		struct ext4_dirent_data_header df_header;
+		char df_fid[255];
+	} edp;
+	int err;
+
+	/* Check if parent is a directory */
+	if (!S_ISDIR(dir->i_mode))
+		return -ENOTDIR;
+
+	/* This ioctl mutates directory entries; merely having the directory
+	 * open (which only ever requires read access) is not enough */
+	err = inode_permission(idmap, dir, MAY_WRITE);
+	if (err)
+		return err;
+
+	/* Copy arguments from user space */
+	if (copy_from_user(&lufid_args, (struct ext4_set_lufid __user *)arg,
+			   sizeof(lufid_args)))
+		return -EFAULT;
+
+	/* Validate parameters */
+	if (lufid_args.esl_name_len == 0 || lufid_args.esl_name_len > EXT4_NAME_LEN)
+		return -EINVAL;
+
+	/* ddh_length (esl_data_len + the header byte below) must itself fit
+	 * in the __u8 ddh_length field without wrapping */
+	if (lufid_args.esl_data_len == 0 ||
+	    lufid_args.esl_data_len > 255 - sizeof(edp.df_header))
+		return -EINVAL;
+
+	/* Ensure filename is NUL-terminated and unmodified */
+	if (lufid_args.esl_name[lufid_args.esl_name_len - 1] != '\0')
+		return -EINVAL;
+
+	/* '.' and '..' are not ordinary entries -- they must stay the first
+	 * two entries in the directory's first block, so they can't go
+	 * through the general delete+re-add path this ioctl uses */
+	if (!strcmp(lufid_args.esl_name, ".") || !strcmp(lufid_args.esl_name, ".."))
+		return -EINVAL;
+
+	/* Prepare the dentry param struct with LUFID data. ddh_length is
+	 * documented (see struct ext4_dirent_data_header) as the length of
+	 * the header plus the whole data blob -- include the header here so
+	 * every dirdata reader/writer that takes ddh_length at face value
+	 * (e.g. ext4_dirdata_set()'s memcpy) copies the full LUFID payload
+	 * instead of silently dropping its last byte. */
+	edp.edp_magic = EXT4_LUFID_MAGIC;
+	edp.df_header.ddh_length = lufid_args.esl_data_len +
+				    sizeof(edp.df_header);
+	memcpy(edp.df_fid, lufid_args.esl_data, lufid_args.esl_data_len);
+
+	/* Want write access */
+	err = mnt_want_write_file(filp);
+	if (err)
+		return err;
+
+	/* Call the helper function to do the actual work */
+	err = ext4_dirdata_set_lufid(dir, lufid_args.esl_name,
+				    lufid_args.esl_name_len - 1,
+				    (struct ext4_dentry_param *)&edp);
+
+	mnt_drop_write_file(filp);
+	return err;
+}
+
 static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -1921,6 +2002,8 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 					      (void __user *)arg);
 	case EXT4_IOC_SET_TUNE_SB_PARAM:
 		return ext4_ioctl_set_tune_sb(filp, (void __user *)arg);
+	case EXT4_IOC_SET_LUFID:
+		return ext4_ioctl_set_lufid(filp, arg);
 	default:
 		return -ENOTTY;
 	}
@@ -2000,6 +2083,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	case FS_IOC_SETFSLABEL:
 	case EXT4_IOC_GETFSUUID:
 	case EXT4_IOC_SETFSUUID:
+	case EXT4_IOC_SET_LUFID:
 		break;
 	default:
 		return -ENOIOCTLCMD;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 6fba1a7c0876..3aeea503f12d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2319,6 +2319,8 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
 	if (ext4_has_feature_metadata_csum(inode->i_sb))
 		csum_size = sizeof(struct ext4_dir_entry_tail);
 
+	dfid = ext4_dentry_get_fid(inode->i_sb,
+		(struct ext4_dentry_param *)EXT4_I(inode)->i_dirdata);
 	if (!de) {
 		if (dfid)
 			dlen = dfid->df_header.ddh_length;
@@ -2665,6 +2667,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
 {
 	struct inode *dir = d_inode(dentry->d_parent);
 
+	EXT4_I(inode)->i_dirdata = dentry->d_fsdata;
 	if (fscrypt_is_nokey_name(dentry))
 		return -ENOKEY;
 	return __ext4_add_entry(handle, dir, &dentry->d_name, inode);
@@ -4426,6 +4429,158 @@ static int ext4_rename2(struct mnt_idmap *idmap,
 	return ext4_rename(idmap, old_dir, old_dentry, new_dir, new_dentry, flags);
 }
 
+/*
+ * ext4_dirdata_set_lufid() - Set LUFID data on an existing directory entry
+ * @dir:        parent directory inode
+ * @filename:   name of the file in the directory
+ * @namelen:    length of filename
+ * @edp:        pointer to initialized dentry param with LUFID data
+ *
+ * This function finds an existing directory entry, deletes it, and re-creates it
+ * with LUFID data attached. Used by the EXT4_IOC_SET_LUFID ioctl.
+ *
+ * Returns 0 on success, negative error code on failure.
+ */
+int ext4_dirdata_set_lufid(struct inode *dir, const char *filename,
+			    int namelen, struct ext4_dentry_param *edp)
+{
+	struct super_block *sb = dir->i_sb;
+	struct ext4_filename fname;
+	struct ext4_dir_entry_2 *de = NULL;
+	struct buffer_head *bh = NULL;
+	struct inode *inode = NULL;
+	handle_t *handle = NULL;
+	struct qstr d_name;
+	void *old_dirdata = NULL;
+	int err = 0;
+
+	/* Check if dirdata feature is enabled */
+	if (!ext4_has_feature_dirdata(sb))
+		return -ENOTSUPP;
+
+	if (namelen > EXT4_NAME_LEN)
+		return -ENAMETOOLONG;
+	if (namelen != strnlen(filename, namelen + 1))
+		return -EINVAL;
+
+	/* Setup the filename for lookup */
+	d_name.name = filename;
+	d_name.len = namelen;
+
+	/* Lookup the filename in the directory */
+	err = ext4_fname_setup_filename(dir, &d_name, 0, &fname);
+	if (err)
+		goto out_free;
+
+	bh = ext4_find_entry(dir, &d_name, &de, NULL);
+	if (!bh) {
+		err = -ENOENT;
+		goto out_free;
+	}
+
+	/* Get the inode number from the directory entry */
+	inode = ext4_iget(sb, le32_to_cpu(de->inode), EXT4_IGET_NORMAL);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		inode = NULL;
+		goto out_brelse;
+	}
+
+	/* Start a transaction */
+	handle = ext4_journal_start(dir, EXT4_HT_DIR,
+				     2 * EXT4_DATA_TRANS_BLOCKS(sb) +
+				     EXT4_INDEX_EXTRA_TRANS_BLOCKS);
+	if (IS_ERR(handle)) {
+		err = PTR_ERR(handle);
+		handle = NULL;
+		goto out_iput;
+	}
+
+	inode_lock(dir);
+
+	/* EXT4_I(inode)->i_dirdata below is a *shared* per-inode field used
+	 * to smuggle the LUFID payload into ext4_add_entry(); locking only
+	 * dir does not stop a concurrent EXT4_IOC_SET_LUFID call targeting a
+	 * different hardlink of the same inode (different dir, same inode)
+	 * from clobbering it mid-call with its own stack-local pointer.
+	 * Lock the target inode too, consistently dir-then-inode, to
+	 * serialize the i_dirdata set/use/restore window below. */
+	if (inode != dir)
+		inode_lock_nested(inode, I_MUTEX_NONDIR2);
+
+	/* Delete the old entry */
+	err = ext4_delete_entry(handle, dir, de, bh);
+	if (err)
+		goto out_unlock;
+
+	brelse(bh);
+	bh = NULL;
+
+	/* Re-add the entry with LUFID data
+	 * We set i_dirdata before adding so the entry can include it
+	 */
+	old_dirdata = EXT4_I(inode)->i_dirdata;
+	EXT4_I(inode)->i_dirdata = edp;
+
+	/* Use ext4_add_entry() to properly handle hash table management
+	 * and block splitting, just like rename does. This ensures the entry
+	 * is placed in the correct hash block and avoids breaking dirhash.
+	 */
+	{
+		struct dentry parent_dentry = { .d_inode = dir };
+		struct dentry new_dentry = {
+			.d_name = d_name,
+			.d_parent = &parent_dentry,
+			.d_inode = inode,  /* Same inode (in-place update) */
+			.d_fsdata = edp,   /* required */
+		};
+		err = ext4_add_entry(handle, &new_dentry, inode);
+	}
+	EXT4_I(inode)->i_dirdata = old_dirdata;
+
+	if (err) {
+		/*
+		 * The original entry was already removed above and the
+		 * re-add with the new LUFID failed; try to restore the
+		 * original entry so the inode isn't left without any
+		 * directory entry pointing at it.
+		 */
+		struct dentry parent_dentry = { .d_inode = dir };
+		struct dentry orig_dentry = {
+			.d_name = d_name,
+			.d_parent = &parent_dentry,
+			.d_inode = inode,
+		};
+		int rollback_err = ext4_add_entry(handle, &orig_dentry, inode);
+
+		if (rollback_err)
+			EXT4_ERROR_INODE(dir,
+				"Failed to set LUFID on '%.*s' (err=%d) and failed to restore the original directory entry (err=%d); inode %llu may be orphaned",
+				namelen, filename, err, rollback_err,
+				inode->i_ino);
+		goto out_unlock;
+	}
+
+	/* Update inode times */
+	inode_set_ctime_current(dir);
+	inode_inc_iversion(dir);
+	ext4_mark_inode_dirty(handle, dir);
+
+out_unlock:
+	if (inode != dir)
+		inode_unlock(inode);
+	inode_unlock(dir);
+	ext4_journal_stop(handle);
+out_iput:
+	iput(inode);
+out_brelse:
+	brelse(bh);
+out_free:
+	ext4_fname_free_filename(&fname);
+
+	return err;
+}
+
 /*
  * directories can handle most operations...
  */
diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
index 9c683991c32f..9fab8978843b 100644
--- a/include/uapi/linux/ext4.h
+++ b/include/uapi/linux/ext4.h
@@ -35,6 +35,7 @@
 #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
 #define EXT4_IOC_GET_TUNE_SB_PARAM	_IOR('f', 45, struct ext4_tune_sb_params)
 #define EXT4_IOC_SET_TUNE_SB_PARAM	_IOW('f', 46, struct ext4_tune_sb_params)
+#define EXT4_IOC_SET_LUFID		_IOW('f', 47, struct ext4_set_lufid)
 
 #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
 
@@ -92,6 +93,18 @@ struct move_extent {
 	__u64 moved_len;	/* moved block length */
 };
 
+/*
+ * Structure for EXT4_IOC_SET_LUFID
+ * Sets LUFID on a directory entry
+ * Called on parent directory with filename and LUFID data as arguments
+ */
+struct ext4_set_lufid {
+	__u8 esl_name_len;	/* length of filename */
+	char  esl_name[255 + 1]; /* filename (NUL-terminated) */
+	__u8 esl_data_len;	/* length of LUFID data */
+	char  esl_data[255]; /* LUFID data (raw bytes) */
+};
+
 /*
  * Flags used by EXT4_IOC_SHUTDOWN
  */
-- 
2.43.7
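
For readers following the new ioctl, a minimal userspace sketch of filling the request may help. It assumes the struct ext4_set_lufid layout shown in the uapi hunk of this patch; the names demo_set_lufid and demo_fill_set_lufid are hypothetical, and uint8_t stands in for __u8 so the snippet builds without kernel headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirror of struct ext4_set_lufid from the patch (assumption: layout is
 * unchanged); uint8_t replaces __u8 for portability. */
struct demo_set_lufid {
	uint8_t esl_name_len;		/* length of filename */
	char    esl_name[255 + 1];	/* filename (NUL-terminated) */
	uint8_t esl_data_len;		/* length of LUFID data */
	char    esl_data[255];		/* LUFID data (raw bytes) */
};

/* Hypothetical helper: validate lengths and populate the request.
 * Returns 0 on success, -1 if the name or payload does not fit. */
static int demo_fill_set_lufid(struct demo_set_lufid *req, const char *name,
			       const void *data, size_t data_len)
{
	size_t name_len = strlen(name);

	if (name_len == 0 || name_len > 255 || data_len > 255)
		return -1;

	memset(req, 0, sizeof(*req));	/* also provides the NUL terminator */
	req->esl_name_len = (uint8_t)name_len;
	memcpy(req->esl_name, name, name_len);
	req->esl_data_len = (uint8_t)data_len;
	if (data_len)
		memcpy(req->esl_data, data, data_len);
	return 0;
}
```

Per the comment in the uapi hunk, the filled request would then be passed to ioctl() with EXT4_IOC_SET_LUFID on a file descriptor for the parent directory. Whether esl_data is expected to begin with the dirdata length header is not clear from this excerpt, so the sketch copies the payload verbatim.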



* [PATCH v4 10/11] ext4: add dirdata set/get helpers
From: Artem Blagodarenko @ 2026-06-24 13:36 UTC (permalink / raw)
  To: linux-ext4; +Cc: adilger.kernel, Artem Blagodarenko, Andreas Dilger
In-Reply-To: <20260624133642.18438-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

Add helpers to set and retrieve dirdata payload and hook them up at
the appropriate call sites.

Enable dirdata for casefold+encryption hashes and, for testing, for
storing a unique 128-bit file identifier in the directory entry.

Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 foofile.txt      |   0
 fs/ext4/ext4.h   |   4 +
 fs/ext4/inline.c |   6 +-
 fs/ext4/namei.c  | 227 +++++++++++++++++++++++++++++++++++++++++------
 4 files changed, 207 insertions(+), 30 deletions(-)

diff --git a/foofile.txt b/foofile.txt
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1e61ce13ed07..5674a64f830f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3874,6 +3874,10 @@ extern int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
 			 struct inode *inode, struct dentry *dentry);
 extern int __ext4_link(struct inode *dir, struct inode *inode,
 		       const struct qstr *d_name, struct dentry *dentry);
+extern unsigned char ext4_dirdata_get(struct ext4_dir_entry_2 *de,
+				      struct inode *dir,
+				      struct ext4_dirent_fid  *lufid,
+				      struct dx_hash_info *hinfo);
 
 #define S_SHIFT 12
 static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1fff4defd45b..32b4ff83d4df 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1350,10 +1350,8 @@ int ext4_inlinedir_to_tree(struct file *dir_file,
 			}
 		}
 
-		if (ext4_hash_in_dirent(dir)) {
-			hinfo->hash = EXT4_DIRENT_HASH(de);
-			hinfo->minor_hash = EXT4_DIRENT_MINOR_HASH(de);
-		} else {
+		if (!(ext4_dirdata_get(de, dir, NULL, hinfo) &
+							EXT4_DIRENT_CFHASH)) {
 			err = ext4fs_dirhash(dir, de->name, de->name_len, hinfo);
 			if (err) {
 				ret = err;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 91def9e0f84d..6fba1a7c0876 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1108,22 +1108,22 @@ static int htree_dirblock_to_tree(struct file *dir_file,
 			/* silently ignore the rest of the block */
 			break;
 		}
-		if (ext4_hash_in_dirent(dir)) {
-			if (de->name_len && de->inode) {
-				hinfo->hash = EXT4_DIRENT_HASH(de);
-				hinfo->minor_hash = EXT4_DIRENT_MINOR_HASH(de);
-			} else {
-				hinfo->hash = 0;
-				hinfo->minor_hash = 0;
-			}
+		if (de->name_len && de->inode) {
+			/* check for saved hash first, or generate it from name */
+			if (!(ext4_dirdata_get(de, dir, NULL, hinfo) &
+			      EXT4_DIRENT_CFHASH)) {
+				err = ext4fs_dirhash(dir, de->name,
+						     de->name_len, hinfo);
+				if (err < 0) {
+					count = err;
+					goto errout;
+				}
+			}
 		} else {
-			err = ext4fs_dirhash(dir, de->name,
-					     de->name_len, hinfo);
-			if (err < 0) {
-				count = err;
-				goto errout;
-			}
+			hinfo->hash = 0;
+			hinfo->minor_hash = 0;
 		}
+
 		if ((hinfo->hash < start_hash) ||
 		    ((hinfo->hash == start_hash) &&
 		     (hinfo->minor_hash < start_minor_hash)))
@@ -1301,9 +1301,191 @@ static inline int search_dirblock(struct buffer_head *bh,
  */
 
 /*
- * Create map of hash values, offsets, and sizes, stored at end of block.
- * Returns number of entries mapped.
+ * ext4_dirdata_get() - Read dirdata fields from a directory entry.
+ * @de:         directory entry
+ * @dir:        directory inode (used for fscrypt+casefold hash fallback)
+ * @dfid:      if non-NULL and EXT4_DIRENT_LUFID is set, LUFID data is copied
+ * 		here
+ * @hinfo:	if non-NULL, receives the casefold hash and minor hash
+ *
+ * Reads any dirdata stored in @de.  If the dirdata feature is not enabled,
+ * falls back to reading the hash stored inline after the filename (for
+ * compatibility with the older casefold+fscrypt format).
+ *
+ * Returns a bitmask of EXT4_DIRENT_* flags indicating which fields were read.
+ */
+unsigned char ext4_dirdata_get(struct ext4_dir_entry_2 *de, struct inode *dir,
+			       struct ext4_dirent_fid *dfid,
+			       struct dx_hash_info *hinfo)
+{
+	unsigned char ret = 0;
+	unsigned int data_offset = de->name_len + 1;
+	unsigned int rec_len = ext4_rec_len_from_disk(de->rec_len,
+						       dir->i_sb->s_blocksize);
+
+	/* data_offset is relative to de->name, which itself starts
+	 * EXT4_BASE_DIR_LEN bytes into the entry -- rec_len is relative to
+	 * the start of the entry, so add the header size before comparing,
+	 * or this lets reads run EXT4_BASE_DIR_LEN bytes past the entry. */
+	if (EXT4_BASE_DIR_LEN + data_offset > rec_len)
+		return ret;
+
+	/* compatibility: hash stored inline after filename (no dirdata) */
+	if (hinfo && !ext4_has_feature_dirdata(dir->i_sb) &&
+	    ext4_hash_in_dirent(dir)) {
+		hinfo->hash = EXT4_DIRENT_HASH(de);
+		hinfo->minor_hash = EXT4_DIRENT_MINOR_HASH(de);
+		ret |= EXT4_DIRENT_CFHASH;
+
+		return ret;
+	}
+
+	/* EXT4_DIRENT_* flags are not expected unless dirdata is set in i_sb */
+	if (de->file_type & EXT4_DIRENT_LUFID) {
+		struct ext4_dirent_fid *disk_fid =
+			(struct ext4_dirent_fid *)(de->name + data_offset);
+		unsigned int dlen;
+
+		if (EXT4_BASE_DIR_LEN + data_offset + sizeof(disk_fid->df_header) > rec_len)
+			return ret;
+
+		dlen = disk_fid->df_header.ddh_length;
+		if (dlen < sizeof(*disk_fid) ||
+		    EXT4_BASE_DIR_LEN + data_offset + dlen > rec_len)
+			return ret;
+
+		if (dfid) {
+			/* copy the whole record (header + fid), not just the fid
+			 * payload -- dlen already includes the header's length */
+			memcpy(dfid, disk_fid, dlen);
+			ret |= EXT4_DIRENT_LUFID;
+		}
+		data_offset += dlen;
+	}
+
+	/* Skip INO64 for now */
+	if (de->file_type & EXT4_DIRENT_INO64) {
+		struct ext4_dirent_data_header *ddh =
+		       (struct ext4_dirent_data_header *)(de->name + data_offset);
+		unsigned int dlen;
+
+		if (EXT4_BASE_DIR_LEN + data_offset + sizeof(*ddh) > rec_len)
+			return ret;
+
+		dlen = ddh->ddh_length;
+		if (dlen < sizeof(*ddh) ||
+		    EXT4_BASE_DIR_LEN + data_offset + dlen > rec_len)
+			return ret;
+
+		data_offset += dlen;
+	}
+
+	if (!hinfo)
+		return ret;
+
+	if (de->file_type & EXT4_DIRENT_CFHASH) {
+		struct ext4_dirent_hash *dh =
+			(struct ext4_dirent_hash *)(de->name + data_offset);
+		unsigned int dlen;
+
+		dlen = dh->dh_header.ddh_length;
+		if (dlen < sizeof(*dh) ||
+		    EXT4_BASE_DIR_LEN + data_offset + dlen > rec_len)
+			return ret;
+
+		hinfo->hash = le32_to_cpu(dh->dh_hash.hash);
+		hinfo->minor_hash = le32_to_cpu(dh->dh_hash.minor_hash);
+		ret |= EXT4_DIRENT_CFHASH;
+	}
+
+	return ret;
+}
+
+/*
+ * ext4_dirdata_set() - Write dirdata fields into a directory entry.
+ * @de:    directory entry (name must already be set)
+ * @dir:   directory inode
+ * @dfid:  LUFID data to store (or NULL)
+ * @fname: filename info carrying the casefold hash
+ *
+ * Writes any required dirdata into @de after the filename.  If the dirdata
+ * feature is not enabled, falls back to writing the hash inline after the
+ * filename (for compatibility with the older casefold+fscrypt format).
  */
+static void ext4_dirdata_set(struct ext4_dir_entry_2 *de, struct inode *dir,
+			     struct ext4_dirent_fid *dfid,
+			     struct ext4_filename *fname)
+{
+	struct dx_hash_info *hinfo = &fname->hinfo;
+	unsigned int data_offset = de->name_len + 1;
+	unsigned int rec_len = ext4_rec_len_from_disk(de->rec_len,
+						       dir->i_sb->s_blocksize);
+
+	/* de->name[] is declared with a fixed EXT4_NAME_LEN size, but the
+	 * real backing storage is this entry's rec_len-sized space in the
+	 * directory block; a max-length name (name_len == EXT4_NAME_LEN)
+	 * leaves no declared array slot for the NUL terminator below, which
+	 * FORTIFY_SOURCE treats as an out-of-bounds array write regardless
+	 * of how much real space the entry has. */
+	if (dfid && de->name_len >= EXT4_NAME_LEN) {
+		EXT4_ERROR_INODE(dir, "Can not insert FID: name_len too long");
+		return;
+	}
+
+	/* always clear the gap byte at de->name[de->name_len], even when no
+	 * FID is being appended -- otherwise it's never initialized before
+	 * dirdata is written right after it, leaking a byte of stale memory
+	 * to disk. Skip it for a max-length name: there's no declared array
+	 * slot for it, and no dirdata can be appended in that case anyway
+	 * (rejected above for dfid; data_offset would already be >= rec_len
+	 * for any other dirdata kind). */
+	if (de->name_len < EXT4_NAME_LEN)
+		de->name[de->name_len] = 0;
+
+	if (dfid) {
+		unsigned int dlen = dfid->df_header.ddh_length;
+
+		if (EXT4_BASE_DIR_LEN + data_offset + dlen > rec_len) {
+			EXT4_ERROR_INODE(dir, "Can not insert FID");
+			return;
+		}
+
+		memcpy(&de->name[de->name_len + 1], dfid,
+		       dlen);
+		de->file_type |= EXT4_DIRENT_LUFID;
+		data_offset += dlen;
+	}
+
+	if (ext4_hash_in_dirent(dir)) {
+		if (ext4_has_feature_dirdata(dir->i_sb)) {
+			struct ext4_dirent_hash *dh =
+			    (struct ext4_dirent_hash *)(de->name + data_offset);
+
+			if (EXT4_BASE_DIR_LEN + data_offset + sizeof(*dh) > rec_len) {
+				EXT4_ERROR_INODE(dir, "Can not insert dhash dirdata");
+				return;
+			}
+
+			dh->dh_header.ddh_length = sizeof(*dh);
+			dh->dh_hash.hash = cpu_to_le32(hinfo->hash);
+			dh->dh_hash.minor_hash = cpu_to_le32(hinfo->minor_hash);
+			de->file_type |= EXT4_DIRENT_CFHASH;
+		} else {
+			/* Compatibility: store hash inline after filename */
+			if (EXT4_BASE_DIR_LEN + data_offset +
+			    sizeof(struct ext4_dir_entry_hash) > rec_len) {
+				EXT4_ERROR_INODE(dir, "Can not insert dhash");
+				return;
+			}
+
+			EXT4_DIRENT_HASHES(de)->hash = cpu_to_le32(hinfo->hash);
+			EXT4_DIRENT_HASHES(de)->minor_hash =
+						cpu_to_le32(hinfo->minor_hash);
+		}
+	}
+}
+
+
 static int dx_make_map(struct inode *dir, struct buffer_head *bh,
 		       struct dx_hash_info *hinfo,
 		       struct dx_map_entry *map_tail)
@@ -1323,9 +1505,8 @@ static int dx_make_map(struct inode *dir, struct buffer_head *bh,
 					 ((char *)de) - base))
 			return -EFSCORRUPTED;
 		if (de->name_len && de->inode) {
-			if (ext4_hash_in_dirent(dir))
-				h.hash = EXT4_DIRENT_HASH(de);
-			else {
+			if (!(ext4_dirdata_get(de, dir, NULL, &h) &
+						EXT4_DIRENT_CFHASH)) {
 				int err = ext4fs_dirhash(dir, de->name,
 						     de->name_len, &h);
 				if (err < 0)
@@ -2113,13 +2294,7 @@ void ext4_insert_dentry_data(struct inode *dir, struct inode *inode,
 	ext4_set_de_type(inode->i_sb, de, inode->i_mode);
 	de->name_len = fname_len(fname);
 	memcpy(de->name, fname_name(fname), fname_len(fname));
-	if (ext4_hash_in_dirent(dir)) {
-		struct dx_hash_info *hinfo = &fname->hinfo;
-
-		EXT4_DIRENT_HASHES(de)->hash = cpu_to_le32(hinfo->hash);
-		EXT4_DIRENT_HASHES(de)->minor_hash =
-						cpu_to_le32(hinfo->minor_hash);
-	}
+	ext4_dirdata_set(de, dir, data, fname);
 }
 
 /*
-- 
2.43.7



* [PATCH v4 09/11] ext4: dirdata feature
From: Artem Blagodarenko @ 2026-06-24 13:36 UTC (permalink / raw)
  To: linux-ext4
  Cc: adilger.kernel, Artem Blagodarenko, Pravin Shelar, Andreas Dilger
In-Reply-To: <20260624133642.18438-1-ablagodarenko@thelustrecollective.com>

From: Artem Blagodarenko <artem.blagodarenko@gmail.com>

When fscrypt and casefold are enabled together for a directory,
all ext4_dir_entry[_2] in that directory store an 8-byte hash
of the filename after 'name' between 'name_len' and 'rec_len'.

However, there is no clear indication there is important data
stored in these bytes, which are only for padding and alignment
in other directory entries.  This adds complexity to code handling
the on-disk directory entries, and there is no provision for other
metadata to be stored in each dir entry after 'name'.

The dirdata feature adds a mechanism to store multiple metadata
entries in each dir entry after 'name' (including the cfhash).
The unused high 4 bits of 'file_type' are used to indicate whether
additional data fields are stored after 'name'.  If a bit is set,
the corresponding dirdata record is present, starting after a NUL
filename terminator.  If present, a record starts with a 1-byte
length (including the length byte itself) and the data immediately
follows the length byte without any alignment.

This allows up to four different dirdata records to be stored in
each entry, and allows unhandled record bytes to be skipped without
having to process the contents, providing forward compatibility.

If and when the fourth and last dirdata record is needed, it is
recommended to subdivide it into sub-records: the first byte holds
the total record length, a second byte holds the length of the first
sub-record, and so on, as long as the total record length is less
than 255 bytes.  This would not affect compatibility with the
current code, since the total record length allows the whole record
to be skipped without processing.
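
The record layout described above can be sketched as a small standalone parser. This is only an illustration, not the code from the series: demo_dirdata_find is a hypothetical name, the flag bits are assumed to be the four high bits of 'file_type' (0x10 through 0x80) with records stored in ascending bit order, and 'avail' is assumed to be the number of bytes in the entry counted from the start of 'name'.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Locate the dirdata record selected by the flag bit `want`.
 * `name` points at the first byte of the filename, `name_len` is its
 * length, and `avail` is the number of entry bytes available from `name`.
 * Returns the offset (relative to `name`) of the record's length byte,
 * or -1 if the record is absent or the entry looks corrupt.
 */
static int demo_dirdata_find(const uint8_t *name, size_t name_len,
			     size_t avail, uint8_t file_type, uint8_t want)
{
	size_t off = name_len + 1;	/* skip the name and its NUL terminator */
	uint8_t bit;

	/* bit wraps to 0 after 0x80, which ends the loop */
	for (bit = 0x10; bit != 0; bit <<= 1) {
		uint8_t len;

		if (!(file_type & bit))
			continue;	/* no record stored for this flag */
		if (off >= avail)
			return -1;	/* no room for the length byte */
		len = name[off];	/* length includes the length byte */
		if (len == 0 || off + len > avail)
			return -1;	/* zero or overlong record */
		if (bit == want)
			return (int)off;
		off += len;		/* skip a record without parsing it */
	}
	return -1;
}
```

Because each record carries its own length, a reader that does not understand a given flag can still step over its record, which is the forward-compatibility property the description relies on.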

Signed-off-by: Pravin Shelar <pravin.shelar@sun.com>
Signed-off-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 fs/ext4/ext4.h   | 27 +++++++++++++++++++++------
 fs/ext4/inline.c | 23 +++++++++++++++++++----
 fs/ext4/namei.c  | 45 +++++++++++++++++++++++----------------------
 fs/ext4/sysfs.c  |  2 ++
 4 files changed, 65 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2fc9fa6d3021..1e61ce13ed07 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2334,6 +2334,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(casefold,		CASEFOLD)
 					 EXT4_FEATURE_INCOMPAT_FLEX_BG| \
 					 EXT4_FEATURE_INCOMPAT_EA_INODE| \
 					 EXT4_FEATURE_INCOMPAT_MMP | \
+					 EXT4_FEATURE_INCOMPAT_DIRDATA | \
 					 EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
 					 EXT4_FEATURE_INCOMPAT_ENCRYPT | \
 					 EXT4_FEATURE_INCOMPAT_CASEFOLD | \
@@ -3035,10 +3036,18 @@ extern int ext4_find_dest_de(struct inode *dir, struct buffer_head *bh,
 			     struct ext4_filename *fname,
 			     struct ext4_dir_entry_2 **dest_de,
 			     int dlen);
-void ext4_insert_dentry(struct inode *dir, struct inode *inode,
-			struct ext4_dir_entry_2 *de,
-			int buf_size,
-			struct ext4_filename *fname);
+void ext4_insert_dentry_data(struct inode *dir, struct inode *inode,
+			     struct ext4_dir_entry_2 *de,
+			     int buf_size,
+			     struct ext4_filename *fname,
+			     void *data);
+static inline void ext4_insert_dentry(struct inode *dir, struct inode *inode,
+				      struct ext4_dir_entry_2 *de,
+				      int buf_size,
+				      struct ext4_filename *fname)
+{
+	ext4_insert_dentry_data(dir, inode, de, buf_size, fname, NULL);
+}
 static inline void ext4_update_dx_flag(struct inode *inode)
 {
 	if (!ext4_has_feature_dir_index(inode->i_sb) &&
@@ -3283,8 +3292,14 @@ extern int ext4_ext_migrate(struct inode *);
 extern int ext4_ind_migrate(struct inode *inode);
 
 /* namei.c */
-extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
-			     struct inode *inode);
+extern int ext4_init_new_dir_data(handle_t *handle, struct inode *dir,
+				  struct inode *inode,
+				  const void *data1, const void *data2);
+static inline int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+				    struct inode *inode)
+{
+	return ext4_init_new_dir_data(handle, dir, inode, NULL, NULL);
+}
 extern int ext4_dirblock_csum_verify(struct inode *inode,
 				     struct buffer_head *bh);
 extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 5b3faacdf143..1fff4defd45b 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -973,11 +973,16 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 				     struct ext4_iloc *iloc,
 				     void *inline_start, int inline_size)
 {
-	int		err;
+	int		err, dlen = 0;
 	struct ext4_dir_entry_2 *de;
+	unsigned char *data = NULL;
+
+	/* Deliver data in any appropriate way here. Now it is NULL */
+	if (data)
+		dlen = (*data) + 1;
 
 	err = ext4_find_dest_de(dir, iloc->bh, inline_start,
-				inline_size, fname, &de, 0);
+				inline_size, fname, &de, dlen);
 	if (err)
 		return err;
 
@@ -986,7 +991,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 					    EXT4_JTR_NONE);
 	if (err)
 		return err;
-	ext4_insert_dentry(dir, inode, de, inline_size, fname);
+	ext4_insert_dentry_data(dir, inode, de, inline_size, fname, NULL);
 
 	ext4_show_inline_dir(dir, iloc->bh, inline_start, inline_size);
 
@@ -1326,7 +1331,17 @@ int ext4_inlinedir_to_tree(struct file *dir_file,
 			pos = EXT4_INLINE_DOTDOT_SIZE;
 		} else {
 			de = (struct ext4_dir_entry_2 *)(dir_buf + pos);
-			pos += ext4_rec_len_from_disk(de->rec_len, inline_size);
+			/* Use ext4_dir_entry_len to account for dirdata extensions.
+			 * This buffer is the inline-data buffer (inline_size bytes),
+			 * not a full directory block -- pass the real buffer size so
+			 * a corrupted/sentinel on-disk rec_len doesn't get decoded as
+			 * a full block's worth of bytes. */
+			pos += ext4_dir_entry_len(de, inline_size, dir);
+			/* Validate pos doesn't exceed buffer to prevent an out-of-bounds read */
+			if (pos > inline_size) {
+				ret = count;
+				goto out;
+			}
 			if (ext4_check_dir_entry(inode, dir_file, de,
 					 iloc.bh, dir_buf,
 					 inline_size, pos)) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 73c8f1b399ef..91def9e0f84d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -401,23 +401,26 @@ static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
 {
 	struct ext4_dir_entry_2 *de;
 	struct dx_root_info *root;
-	int count_offset;
+	int count_offset, dotdot_rec_len;
 	int blocksize = EXT4_BLOCK_SIZE(inode->i_sb);
 	unsigned int rlen = ext4_rec_len_from_disk(dirent->rec_len, blocksize);
 
-	if (rlen == blocksize)
+	if (rlen == blocksize) {
 		count_offset = sizeof(struct dx_node);
-	else if (rlen == 12) {
-		de = (struct ext4_dir_entry_2 *)(((void *)dirent) + 12);
-		if (ext4_rec_len_from_disk(de->rec_len, blocksize) != blocksize - 12)
+	} else {
+		de = (struct ext4_dir_entry_2 *)(((char *)dirent) + rlen);
+		if (le16_to_cpu(de->rec_len) != (blocksize - rlen))
 			return NULL;
-		root = (struct dx_root_info *)(((void *)de + 12));
+		/* de->rec_len covers whole dx_root block, calculate actual length.
+		 * This is the '..' entry, which never carries the casefold+fscrypt
+		 * hash, so pass NULL for dir regardless of the directory's flags */
+		dotdot_rec_len = ext4_dir_entry_len(de, blocksize, NULL);
+		root = (struct dx_root_info *)(((char *)de + dotdot_rec_len));
 		if (root->reserved_zero ||
 		    root->info_length != sizeof(struct dx_root_info))
 			return NULL;
-		count_offset = 32;
-	} else
-		return NULL;
+		count_offset = root->info_length + rlen + dotdot_rec_len;
+	}
 
 	if (offset)
 		*offset = count_offset;
@@ -716,7 +719,7 @@ static struct stats dx_show_leaf(struct inode *dir,
 				       (unsigned) ((char *) de - base));
 #endif
 			}
-			space += ext4_dir_rec_len(de->name_len, dir);
+			space += ext4_dir_entry_len(de, size, dir);
 			names++;
 		}
 		de = ext4_next_entry(de, size);
@@ -2090,13 +2093,10 @@ int ext4_find_dest_de(struct inode *dir, struct buffer_head *bh,
 	return 0;
 }
 
-void ext4_insert_dentry(struct inode *dir,
-			struct inode *inode,
-			struct ext4_dir_entry_2 *de,
-			int buf_size,
-			struct ext4_filename *fname)
+void ext4_insert_dentry_data(struct inode *dir, struct inode *inode,
+			     struct ext4_dir_entry_2 *de, int buf_size,
+			     struct ext4_filename *fname, void *data)
 {
-
 	int nlen, rlen;
 
 	nlen = ext4_dir_entry_len(de, buf_size, dir);
@@ -2138,15 +2138,15 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
 	unsigned int	blocksize = dir->i_sb->s_blocksize;
 	int		csum_size = 0;
 	int		err, err2, dlen = 0;
-	unsigned char	*data = NULL;
+	struct ext4_dirent_fid *dfid = NULL;
 
 	/* Deliver data in any appropriate way here. Now it is NULL */
 	if (ext4_has_feature_metadata_csum(inode->i_sb))
 		csum_size = sizeof(struct ext4_dir_entry_tail);
 
 	if (!de) {
-		if (data)
-			dlen = (*data) + 1;
+		if (dfid)
+			dlen = dfid->df_header.ddh_length;
 		err = ext4_find_dest_de(dir, bh, bh->b_data,
 					blocksize - csum_size, fname, &de, dlen);
 		if (err)
@@ -2161,7 +2161,7 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
 	}
 
 	/* By now the buffer is marked for journaling */
-	ext4_insert_dentry(dir, inode, de, blocksize, fname);
+	ext4_insert_dentry_data(dir, inode, de, blocksize, fname, dfid);
 
 	/*
 	 * XXX shouldn't update any times until successful
@@ -3000,8 +3000,9 @@ int ext4_init_dirblock(handle_t *handle, struct inode *inode,
 	return ext4_handle_dirty_dirblock(handle, inode, bh);
 }
 
-int ext4_init_new_dir(handle_t *handle, struct inode *dir,
-			     struct inode *inode)
+int ext4_init_new_dir_data(handle_t *handle, struct inode *dir,
+			   struct inode *inode,
+			   const void *data1, const void *data2)
 {
 	struct buffer_head *dir_block = NULL;
 	ext4_lblk_t block = 0;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 923b375e017f..80074fb15ee9 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -362,6 +362,7 @@ EXT4_ATTR_FEATURE(verity);
 #endif
 EXT4_ATTR_FEATURE(metadata_csum_seed);
 EXT4_ATTR_FEATURE(fast_commit);
+EXT4_ATTR_FEATURE(dirdata);
 #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
 EXT4_ATTR_FEATURE(encrypted_casefold);
 #endif
@@ -385,6 +386,7 @@ static struct attribute *ext4_feat_attrs[] = {
 #endif
 	ATTR_LIST(metadata_csum_seed),
 	ATTR_LIST(fast_commit),
+	ATTR_LIST(dirdata),
 #if IS_ENABLED(CONFIG_UNICODE) && defined(CONFIG_FS_ENCRYPTION)
 	ATTR_LIST(encrypted_casefold),
 #endif
-- 
2.43.7



