Linux EXT4 FS development
 help / color / mirror / Atom feed
* [PATCHSET v6 3/4] fuse2fs: improve block and inode caching
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4
In-Reply-To: <20260625193302.GC6054@frogsfrogsfrogs>

Hi all,

This series ports the libext2fs inode cache to the new cache.c hashtable
code that was added for fuse4fs unlinked file support and improves on
the UNIX I/O manager's block cache by adding a new I/O manager that does
its own caching.  Now we no longer have statically sized buffer caching
for the two fuse servers.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-caching
---
Commits in this patchset:
 * libsupport: add caching IO manager
 * iocache: add the actual buffer cache
 * iocache: bump buffer mru priority every 50 accesses
 * fuse2fs: enable caching IO manager
 * fuse2fs: increase inode cache size
 * libext2fs: improve caching for inodes
---
 lib/ext2fs/ext2fsP.h     |   13 +
 lib/support/cache.h      |    1 
 lib/support/iocache.h    |   17 +
 debugfs/Makefile.in      |    8 
 e2fsck/Makefile.in       |   12 -
 fuse4fs/Makefile.in      |   10 -
 fuse4fs/fuse4fs.c        |   15 +
 lib/ext2fs/Makefile.in   |   69 ++--
 lib/ext2fs/inline_data.c |    4 
 lib/ext2fs/inode.c       |  215 ++++++++++---
 lib/ext2fs/io_manager.c  |    3 
 lib/support/Makefile.in  |    6 
 lib/support/cache.c      |   16 +
 lib/support/iocache.c    |  751 ++++++++++++++++++++++++++++++++++++++++++++++
 misc/Makefile.in         |   11 -
 misc/fuse2fs.c           |   11 +
 resize/Makefile.in       |   11 -
 tests/fuzz/Makefile.in   |    4 
 tests/progs/Makefile.in  |    4 
 19 files changed, 1057 insertions(+), 124 deletions(-)
 create mode 100644 lib/support/iocache.h
 create mode 100644 lib/support/iocache.c


^ permalink raw reply

* [PATCHSET v6 2/4] fuse4fs: run servers as a contained service
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-ext4
In-Reply-To: <20260625193302.GC6054@frogsfrogsfrogs>

Hi all,

This series packages the newly created fuse4fs server into a systemd
socket service.  This service can be used by the "mount.service" helper
in libfuse to implement untrusted unprivileged mounts.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-service-container
---
Commits in this patchset:
 * libext2fs: make it possible to extract the fd from an IO manager
 * libext2fs: fix checking for valid fds in mmp.c
 * unix_io: allow passing /dev/fd/XXX paths to the unixfd IO manager
 * libext2fs: fix MMP code to work with unixfd IO manager
 * libext2fs: bump libfuse API version to 3.19
 * fuse4fs: hoist some code out of fuse4fs_main
 * fuse4fs: enable safe service mode
 * fuse4fs: set proc title when in fuse service mode
 * fuse4fs: make MMP work correctly in safe service mode
 * debian: update packaging for fuse4fs service
---
 lib/ext2fs/ext2_io.h         |    4 
 lib/ext2fs/ext2fs.h          |    1 
 lib/ext2fs/ext2fsP.h         |    4 
 MCONFIG.in                   |    2 
 configure                    |  303 ++++++++++++++++++++++++++-
 configure.ac                 |  131 +++++++++++
 debian/e2fsprogs.install     |    7 +
 debian/fuse4fs.install       |    3 
 debian/libext2fs2t64.symbols |    1 
 debian/rules                 |    3 
 fuse4fs/Makefile.in          |   42 +++-
 fuse4fs/fuse4fs.c            |  479 ++++++++++++++++++++++++++++++++++++------
 fuse4fs/fuse4fs.socket.in    |   17 +
 fuse4fs/fuse4fs@.service.in  |  102 +++++++++
 lib/config.h.in              |   12 +
 lib/ext2fs/io_manager.c      |    8 +
 lib/ext2fs/mmp.c             |  101 +++++++++
 lib/ext2fs/openfs.c          |    1 
 lib/ext2fs/unix_io.c         |   50 ++++
 util/subst.conf.in           |    3 
 20 files changed, 1177 insertions(+), 97 deletions(-)
 mode change 100644 => 100755 debian/fuse4fs.install
 create mode 100644 fuse4fs/fuse4fs.socket.in
 create mode 100644 fuse4fs/fuse4fs@.service.in


^ permalink raw reply

* [PATCHSET 1/4] libext2fs: fix some missed fsync calls
From: Darrick J. Wong @ 2026-06-25 19:35 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4
In-Reply-To: <20260625193302.GC6054@frogsfrogsfrogs>

Hi all,

Fix a few places (like device closing) where we really ought to tell the
block device to flush whatever's dirty to disk, even if we've failed to
flush all our cached buffers out to disk.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=libext2fs-flushing-fixes
---
Commits in this patchset:
 * libext2fs: always fsync the device when flushing the cache
 * libext2fs: always fsync the device when closing the unix IO manager
 * libext2fs: only fsync the unix fd if we wrote to the device
---
 lib/ext2fs/unix_io.c |   83 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 69 insertions(+), 14 deletions(-)


^ permalink raw reply

* [PATCHBOMB v6] e2fsprogs: containerize ext4 for safer operation
From: Darrick J. Wong @ 2026-06-25 19:33 UTC (permalink / raw)
  To: linux-ext4; +Cc: Theodore Ts'o, Neal Gompa

Hi everyone,

This is the sole remaining part of the gigantic patchset to enable
mounting ext4 filesystems as a systemd-contained fuse server instead of
in the kernel.  The libfuse parts have now been merged upstream, which
means that fuse4fs can now run as a non-root user, with no privilege,
and no access to the network or hardware, etc.  The only connection to
the outside is an ephemeral AF_UNIX socket.  The mount helper program
the other end is a helper program that acquires resources and calls
fsmount().

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.  Runtime filesystem metadata parsing is no longer a
privileged (== risky) operation.

The consequences of a crashed driver is a dead mount, instead of a
crashed or corrupt OS kernel.

Note that contained fuse filesystem servers are no faster than regular
fuse.  The containerization code only requires changes to libfuse and is
ready to go today.

e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-service-container_2026-06-25

Note that I threw in a couple more patchsets to improve the caching
behavior of libext2fs for better performance; and the ability to watch
for memory pressure complaints from the kernel so that we can drop our
own cache in response to memory pressure.

e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim_2026-06-25

--Darrick

Unreviewed patches in this patchbomb:

[PATCHSET 1/4] libext2fs: fix some missed fsync calls
  [PATCH 1/3] libext2fs: always fsync the device when flushing the
  [PATCH 2/3] libext2fs: always fsync the device when closing the unix
  [PATCH 3/3] libext2fs: only fsync the unix fd if we wrote to the
[PATCHSET v6 2/4] fuse4fs: run servers as a contained service
  [PATCH 01/10] libext2fs: make it possible to extract the fd from an
  [PATCH 02/10] libext2fs: fix checking for valid fds in mmp.c
  [PATCH 03/10] unix_io: allow passing /dev/fd/XXX paths to the unixfd
  [PATCH 04/10] libext2fs: fix MMP code to work with unixfd IO manager
  [PATCH 05/10] libext2fs: bump libfuse API version to 3.19
  [PATCH 06/10] fuse4fs: hoist some code out of fuse4fs_main
  [PATCH 07/10] fuse4fs: enable safe service mode
  [PATCH 08/10] fuse4fs: set proc title when in fuse service mode
  [PATCH 09/10] fuse4fs: make MMP work correctly in safe service mode
  [PATCH 10/10] debian: update packaging for fuse4fs service
[PATCHSET v6 3/4] fuse2fs: improve block and inode caching
  [PATCH 1/6] libsupport: add caching IO manager
  [PATCH 2/6] iocache: add the actual buffer cache
  [PATCH 3/6] iocache: bump buffer mru priority every 50 accesses
  [PATCH 4/6] fuse2fs: enable caching IO manager
  [PATCH 5/6] fuse2fs: increase inode cache size
  [PATCH 6/6] libext2fs: improve caching for inodes
[PATCHSET v6 4/4] fuse4fs: reclaim buffer cache under memory pressure
  [PATCH 1/4] libsupport: add pressure stall monitor
  [PATCH 2/4] fuse2fs: only reclaim buffer cache when there is memory
  [PATCH 3/4] fuse4fs: enable memory pressure monitoring with service
  [PATCH 4/4] fuse2fs: flush dirty metadata periodically

^ permalink raw reply

* [PATCH RFC] ext4: enable scoped NOFS when starting a handle in nojournal mode
From: Theodore Ts'o @ 2026-06-25 17:22 UTC (permalink / raw)
  To: Ext4 Developers List, Matthew Wilcox; +Cc: Theodore Ts'o

The jbd2 layer enables NOFS mode using memalloc_nofs_{save,restore}()
while a handle is active.  We need to do the same in nojournal mode so
that it is safe to remove GFP_NOFS flags while a jbd2 handle is
active.

This will require that we actually allocate a real handle, but with an
h_invalid flag set, so there is a place to put the saved memalloc
context.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ext4_jbd2.c  | 35 ++++++++++++++++++++++-------------
 fs/ext4/ext4_jbd2.h  |  6 +-----
 include/linux/jbd2.h |  1 +
 3 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 9a8c225f2753..c2f09cf4b506 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -35,12 +35,21 @@ static handle_t *ext4_get_nojournal(void)
 	handle_t *handle = current->journal_info;
 	unsigned long ref_cnt = (unsigned long)handle;
 
-	BUG_ON(ref_cnt >= EXT4_NOJOURNAL_MAX_REF_COUNT);
-
-	ref_cnt++;
-	handle = (handle_t *)ref_cnt;
-
-	current->journal_info = handle;
+	BUG_ON(handle && !handle->h_invalid);
+
+	if (!handle) {
+		handle = jbd2_alloc_handle(GFP_NOFS);
+		if (!handle)
+			return ERR_PTR(-ENOMEM);
+		handle->h_invalid = 1;
+		/*
+		 * This is done by start_this_handle() if journalling
+		 * is enabled.
+		 */
+		handle->saved_alloc_context = memalloc_nofs_save();
+		current->journal_info = handle;
+	}
+	handle->h_ref++;
 	return handle;
 }
 
@@ -48,14 +57,14 @@ static handle_t *ext4_get_nojournal(void)
 /* Decrement the non-pointer handle value */
 static void ext4_put_nojournal(handle_t *handle)
 {
-	unsigned long ref_cnt = (unsigned long)handle;
+	BUG_ON(handle->h_ref == 0);
 
-	BUG_ON(ref_cnt == 0);
-
-	ref_cnt--;
-	handle = (handle_t *)ref_cnt;
-
-	current->journal_info = handle;
+	handle->h_ref--;
+	if (handle->h_ref == 0) {
+		memalloc_nofs_restore(handle->saved_alloc_context);
+		jbd2_free_handle(handle);
+		current->journal_info = NULL;
+	}
 }
 
 /*
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..75d4670d389c 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -182,15 +182,11 @@ handle_t *__ext4_journal_start_sb(struct inode *inode, struct super_block *sb,
 				  int rsv_blocks, int revoke_creds);
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
 
-#define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
-
 /* Note:  Do not use this for NULL handles.  This is only to determine if
  * a properly allocated handle is using a journal or not. */
 static inline int ext4_handle_valid(handle_t *handle)
 {
-	if ((unsigned long)handle < EXT4_NOJOURNAL_MAX_REF_COUNT)
-		return 0;
-	return 1;
+	return !handle->h_invalid;
 }
 
 static inline void ext4_handle_sync(handle_t *handle)
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index b68561187e90..7348fdadc810 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -513,6 +513,7 @@ struct jbd2_journal_handle
 	unsigned int	h_sync:		1;
 	unsigned int	h_reserved:	1;
 	unsigned int	h_aborted:	1;
+	unsigned int	h_invalid:	1;
 	unsigned int	h_type:		8;
 	unsigned int	h_line_no:	16;
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2] ext4: clear stale xarray tags on folios skipped during writeback
From: Gerald Yang @ 2026-06-25 16:01 UTC (permalink / raw)
  To: tytso, jack; +Cc: linux-ext4, gerald.yang.tw

In data=journal mode, the writeback thread can hit the
WARN_ON_ONCE(sb_rdonly(sb)) in ext4_journal_check_start() while the
superblock is being remounted read-only during reboot:

Workqueue: writeback wb_workfn (flush-253:0)
RIP: 0010:ext4_journal_check_start+0x8b/0xd0
Call Trace:
  __ext4_journal_start_sb+0x3c/0x1e0
  mpage_prepare_extent_to_map+0x4af/0x580
  ext4_do_writepages+0x3c0/0x1080
  ext4_writepages+0xc8/0x1a0
  do_writepages+0xc4/0x180
  __writeback_single_inode+0x45/0x2f0
  writeback_sb_inodes+0x26b/0x5d0
  __writeback_inodes_wb+0x54/0x100
  wb_writeback+0x1ac/0x320
  wb_workfn+0x394/0x470

And followed by the warning:
EXT4-fs warning (device vda1): ext4_evict_inode:195: inode #6263:
comm (sd-umount): data will be lost

This issue is not reproduced every time, but frequently.
The reproduction step is to create a VM with 8 CPUs, 16G memory and
setup data=journal:
sudo tune2fs -o journal_data /dev/vda1
Run fio:
rm -f fiotest
fio --name=fiotest --rw=randwrite --bs=4k --runtime=6 --ioengine=libaio
--iodepth=256 --numjobs=8 --filename=fiotest --filesize=30G
--group_reporting
Reboot the VM, and check the console output from:
virsh console testvm

But there is no dirty inode, folio_clear_dirty_for_io clears PG_dirty
but leaves tags PAGECACHE_TAG_DIRTY and PAGECACHE_TAG_TOWRITE set which
are only cleared by __folio_start_writeback.
In data=journal mode, jbd2 checkpoints the journalled data to its final
location and clears its own dirty flag without touching folio PG_dirty
or xarray dirty flags.
The commit f4a2b42e7891 ("ext4: fix stale xarray tags after writeback")
fixes when PG_dirty is still set but there is no dirty page.
Another case is PG_dirty is cleared, but PAGECACHE_TAG_DIRTY and
PAGECACHE_TAG_TOWRITE is still set. In this case, writeback thread
checks clean folio and skips it in mpage_prepare_extent_to_map:
if (!folio_test_dirty(folio) ||
    ...
        folio_unlcok(folio);
	continue

And never reaches ext4_bio_write_folio where the commit f4a2b42e7891
clears the stale xarray tags. Print debug logs after the filesystem
is remounted read-only:
writepages RDONLY nrpages=2048 dirtytag=1 wbtag=0 towrite=1 sync=0
And all folios are actually clean:
folio idx=3 dirty=0 wb=0 checked=0 dirtybuf=0 jbddirty=0 mapped=1
...

We need to clear the xarray stale tags for such clean folios by
cycling them through writeback in the skip path, the same way
f4a2b42e7891 does in ext4_bio_write_folio.

Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
---
Changes in v2:
Split the top level condition based on Jan's suggestion

 fs/ext4/inode.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce99807c5f5b..150f8789f0aa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2694,13 +2694,25 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
 			 * page is already under writeback and we are not doing
 			 * a data integrity writeback, skip the page
 			 */
-			if (!folio_test_dirty(folio) ||
-			    (folio_test_writeback(folio) &&
-			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
+			if ((folio_test_writeback(folio) &&
+			    mpd->wbc->sync_mode == WB_SYNC_NONE) ||
 			    unlikely(folio->mapping != mapping)) {
 				folio_unlock(folio);
 				continue;
 			}
+			/*
+			 * If the folio is clean, skip writing it back.
+			 * Cycle the folio through the writeback state
+			 * though, to clear stale xarray tags.
+			 */
+			if (!folio_test_dirty(folio)) {
+				if (!folio_test_writeback(folio)) {
+					__folio_start_writeback(folio, false);
+					folio_end_writeback(folio);
+				}
+				folio_unlock(folio);
+				continue;
+			}
 
 			folio_wait_writeback(folio);
 			BUG_ON(folio_test_writeback(folio));
-- 
2.43.0


^ permalink raw reply related

* [PATCH v10 4/5] ext4: remove ea_inode_array mechanism in favor of ext4_put_ea_inode()
From: Yun Zhou @ 2026-06-25 15:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner
  Cc: linux-ext4, linux-kernel, yun.zhou, linux-fsdevel
In-Reply-To: <20260625152941.24788-1-yun.zhou@windriver.com>

Now that ext4_put_ea_inode() handles deferred iput safely for all cases
(using iput_if_not_last + embedded llist_node), the ea_inode_array
mechanism for batching deferred iputs is redundant.

Remove:
- ext4_expand_inode_array() and ext4_xattr_inode_array_free()
- ext4_xattr_inode_array_free_deferred()
- struct ext4_xattr_inode_array and EIA_INCR/EIA_MASK defines
- ea_inode_array parameter from ext4_xattr_inode_dec_ref_all(),
  ext4_xattr_release_block(), and ext4_xattr_delete_inode()
- ea_inode_array variable from ext4_evict_inode()

Instead, ext4_xattr_inode_dec_ref_all() now calls ext4_put_ea_inode()
directly after processing each EA inode.  This simplifies the code
by eliminating multi-layer parameter threading and removes the need
for callers to manage array lifetime.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |  6 +---
 fs/ext4/xattr.c | 95 +++----------------------------------------------
 fs/ext4/xattr.h |  7 ----
 3 files changed, 6 insertions(+), 102 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0d131371ad3d..6f1b84e46a2e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -176,7 +176,6 @@ void ext4_evict_inode(struct inode *inode)
 	 * (xattr block freeing), bitmap, group descriptor (inode freeing)
 	 */
 	int extra_credits = 6;
-	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 	bool freeze_protected = false;
 
 	trace_ext4_evict_inode(inode);
@@ -282,8 +281,7 @@ void ext4_evict_inode(struct inode *inode)
 	}
 
 	/* Remove xattr references. */
-	err = ext4_xattr_delete_inode(handle, inode, &ea_inode_array,
-				      extra_credits);
+	err = ext4_xattr_delete_inode(handle, inode, extra_credits);
 	if (err) {
 		ext4_warning(inode->i_sb, "xattr delete (err %d)", err);
 stop_handle:
@@ -291,7 +289,6 @@ void ext4_evict_inode(struct inode *inode)
 		ext4_orphan_del(NULL, inode);
 		if (freeze_protected)
 			sb_end_intwrite(inode->i_sb);
-		ext4_xattr_inode_array_free(ea_inode_array);
 		goto no_delete;
 	}
 
@@ -321,7 +318,6 @@ void ext4_evict_inode(struct inode *inode)
 	ext4_journal_stop(handle);
 	if (freeze_protected)
 		sb_end_intwrite(inode->i_sb);
-	ext4_xattr_inode_array_free(ea_inode_array);
 	return;
 no_delete:
 	/*
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 90b693b78a45..7f334349bd4f 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -114,12 +114,6 @@ const struct xattr_handler * const ext4_xattr_handlers[] = {
 #define EA_INODE_CACHE(inode)	(((struct ext4_sb_info *) \
 				inode->i_sb->s_fs_info)->s_ea_inode_cache)
 
-static int
-ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
-			struct inode *inode);
-static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
-				struct ext4_xattr_inode_array *array);
-
 #ifdef CONFIG_LOCKDEP
 void ext4_xattr_inode_set_class(struct inode *ea_inode)
 {
@@ -1162,7 +1156,6 @@ static void
 ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 			     struct buffer_head *bh,
 			     struct ext4_xattr_entry *first, bool block_csum,
-			     struct ext4_xattr_inode_array **ea_inode_array,
 			     int extra_credits, bool skip_quota)
 {
 	struct inode *ea_inode;
@@ -1199,14 +1192,6 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		if (err)
 			continue;
 
-		err = ext4_expand_inode_array(ea_inode_array, ea_inode);
-		if (err) {
-			ext4_warning_inode(ea_inode,
-					   "Expand inode array err=%d", err);
-			ext4_put_ea_inode(parent->i_sb, ea_inode);
-			continue;
-		}
-
 		err = ext4_journal_ensure_credits_fn(handle, credits, credits,
 			ext4_free_metadata_revoke_credits(parent->i_sb, 1),
 			ext4_xattr_restart_fn(handle, parent, bh, block_csum,
@@ -1214,6 +1199,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		if (err < 0) {
 			ext4_warning_inode(ea_inode, "Ensure credits err=%d",
 					   err);
+			ext4_put_ea_inode(parent->i_sb, ea_inode);
 			continue;
 		}
 		if (err > 0) {
@@ -1223,6 +1209,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 				ext4_warning_inode(ea_inode,
 						"Re-get write access err=%d",
 						err);
+				ext4_put_ea_inode(parent->i_sb, ea_inode);
 				continue;
 			}
 		}
@@ -1231,6 +1218,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		if (err) {
 			ext4_warning_inode(ea_inode, "ea_inode dec ref err=%d",
 					   err);
+			ext4_put_ea_inode(parent->i_sb, ea_inode);
 			continue;
 		}
 
@@ -1247,6 +1235,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		entry->e_value_inum = 0;
 		entry->e_value_size = 0;
 
+		ext4_put_ea_inode(parent->i_sb, ea_inode);
 		dirty = true;
 	}
 
@@ -1273,7 +1262,6 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 static void
 ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 			 struct buffer_head *bh,
-			 struct ext4_xattr_inode_array **ea_inode_array,
 			 int extra_credits)
 {
 	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
@@ -1315,7 +1303,6 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_dec_ref_all(handle, inode, bh,
 						     BFIRST(bh),
 						     true /* block_csum */,
-						     ea_inode_array,
 						     extra_credits,
 						     true /* skip_quota */);
 		ext4_free_blocks(handle, inode, bh, 0, 1,
@@ -2184,13 +2171,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
-		struct ext4_xattr_inode_array *ea_inode_array = NULL;
-
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 &ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free_deferred(inode->i_sb,
-						     ea_inode_array);
 	}
 	error = 0;
 
@@ -2866,46 +2848,6 @@ int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 	return error;
 }
 
-#define EIA_INCR 16 /* must be 2^n */
-#define EIA_MASK (EIA_INCR - 1)
-
-/* Add the large xattr @inode into @ea_inode_array for deferred iput().
- * If @ea_inode_array is new or full it will be grown and the old
- * contents copied over.
- */
-static int
-ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
-			struct inode *inode)
-{
-	if (*ea_inode_array == NULL) {
-		/*
-		 * Start with 15 inodes, so it fits into a power-of-two size.
-		 */
-		(*ea_inode_array) = kmalloc_flex(**ea_inode_array, inodes,
-						 EIA_MASK, GFP_NOFS);
-		if (*ea_inode_array == NULL)
-			return -ENOMEM;
-		(*ea_inode_array)->count = 0;
-	} else if (((*ea_inode_array)->count & EIA_MASK) == EIA_MASK) {
-		/* expand the array once all 15 + n * 16 slots are full */
-		struct ext4_xattr_inode_array *new_array = NULL;
-
-		new_array = kmalloc_flex(**ea_inode_array, inodes,
-					 (*ea_inode_array)->count + EIA_INCR,
-					 GFP_NOFS);
-		if (new_array == NULL)
-			return -ENOMEM;
-		memcpy(new_array, *ea_inode_array,
-		       struct_size(*ea_inode_array, inodes,
-				   (*ea_inode_array)->count));
-		kfree(*ea_inode_array);
-		*ea_inode_array = new_array;
-	}
-	(*ea_inode_array)->count++;
-	(*ea_inode_array)->inodes[(*ea_inode_array)->count - 1] = inode;
-	return 0;
-}
-
 /*
  * ext4_xattr_delete_inode()
  *
@@ -2916,7 +2858,6 @@ ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
  * references on xattr block and xattr inodes.
  */
 int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
-			    struct ext4_xattr_inode_array **ea_inode_array,
 			    int extra_credits)
 {
 	struct buffer_head *bh = NULL;
@@ -2955,7 +2896,6 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_dec_ref_all(handle, inode, iloc.bh,
 						     IFIRST(header),
 						     false /* block_csum */,
-						     ea_inode_array,
 						     extra_credits,
 						     false /* skip_quota */);
 	}
@@ -2994,7 +2934,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 
 		}
 
-		ext4_xattr_release_block(handle, inode, bh, ea_inode_array,
+		ext4_xattr_release_block(handle, inode, bh,
 					 extra_credits);
 		/*
 		 * Update i_file_acl value in the same transaction that releases
@@ -3016,31 +2956,6 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	return error;
 }
 
-void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *ea_inode_array)
-{
-	int idx;
-
-	if (ea_inode_array == NULL)
-		return;
-
-	for (idx = 0; idx < ea_inode_array->count; ++idx)
-		iput(ea_inode_array->inodes[idx]);
-	kfree(ea_inode_array);
-}
-
-static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
-				struct ext4_xattr_inode_array *array)
-{
-	int idx;
-
-	if (array == NULL)
-		return;
-
-	for (idx = 0; idx < array->count; ++idx)
-		ext4_put_ea_inode(sb, array->inodes[idx]);
-	kfree(array);
-}
-
 /*
  * Worker function for deferred EA inode iput.  Processes all inodes queued
  * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 9883ba5569a1..8214a31fe001 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -131,11 +131,6 @@ struct ext4_xattr_ibody_find {
 	struct ext4_iloc iloc;
 };
 
-struct ext4_xattr_inode_array {
-	unsigned int count;
-	struct inode *inodes[] __counted_by(count);
-};
-
 extern const struct xattr_handler ext4_xattr_user_handler;
 extern const struct xattr_handler ext4_xattr_trusted_handler;
 extern const struct xattr_handler ext4_xattr_security_handler;
@@ -187,9 +182,7 @@ extern int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
 				bool is_create);
 
 extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
-				   struct ext4_xattr_inode_array **array,
 				   int extra_credits);
-extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
 extern void ext4_init_ea_inode_work(struct ext4_sb_info *sbi);
 extern void ext4_put_ea_inode(struct super_block *sb, struct inode *inode);
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v10 5/5] ext4: prevent deadlock from duplicate EA inode references on corrupted fs
From: Yun Zhou @ 2026-06-25 15:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner
  Cc: linux-ext4, linux-kernel, yun.zhou, linux-fsdevel
In-Reply-To: <20260625152941.24788-1-yun.zhou@windriver.com>

On a corrupted filesystem, multiple xattr entries may reference the same
EA inode.  When ext4_xattr_inode_dec_ref_all() processes such entries, it
can dec_ref the EA inode (setting nlink=0) and queue it for deferred iput.
If the deferred worker runs before the loop processes the duplicate entry,
the second iget() may block on I_FREEING while the worker's eviction waits
for the caller's transaction to commit -- classic ABBA deadlock.

Fix by tracking successfully processed EA inodes on a per-call llist
(reusing i_ea_iput_node) and skipping any ea_ino already in the list.
This covers both intra-block duplicates and cross ibody/block duplicates
in ext4_xattr_delete_inode().

The actual ext4_put_ea_inode() is deferred until after the processing
loop completes (ext4_put_ea_inode_llist), ensuring no EA inode is queued
for eviction while the loop is still iterating.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/xattr.c | 68 ++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 7f334349bd4f..5c929043e44a 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1152,11 +1152,41 @@ static int ext4_xattr_restart_fn(handle_t *handle, struct inode *inode,
 	return 0;
 }
 
+/* Check if an EA inode number is already in the processed llist. */
+static bool ext4_ea_ino_in_llist(unsigned int ea_ino,
+				 struct llist_head *processed)
+{
+	struct ext4_inode_info *ei;
+
+	llist_for_each_entry(ei, processed->first, i_ea_iput_node) {
+		if (ei->vfs_inode.i_ino == ea_ino)
+			return true;
+	}
+	return false;
+}
+
+/* Put all EA inodes on a processed llist via ext4_put_ea_inode. */
+static void ext4_put_ea_inode_llist(struct super_block *sb,
+				    struct llist_head *processed)
+{
+	struct llist_node *node = llist_del_all(processed);
+	struct llist_node *next;
+
+	while (node) {
+		struct ext4_inode_info *ei = container_of(node,
+				struct ext4_inode_info, i_ea_iput_node);
+		next = node->next;
+		ext4_put_ea_inode(sb, &ei->vfs_inode);
+		node = next;
+	}
+}
+
 static void
 ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 			     struct buffer_head *bh,
 			     struct ext4_xattr_entry *first, bool block_csum,
-			     int extra_credits, bool skip_quota)
+			     int extra_credits, bool skip_quota,
+			     struct llist_head *processed)
 {
 	struct inode *ea_inode;
 	struct ext4_xattr_entry *entry;
@@ -1186,6 +1216,11 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		if (!entry->e_value_inum)
 			continue;
 		ea_ino = le32_to_cpu(entry->e_value_inum);
+
+		/* Skip if already processed (duplicate on corrupted fs) */
+		if (ext4_ea_ino_in_llist(ea_ino, processed))
+			continue;
+
 		err = ext4_xattr_inode_iget(parent, ea_ino,
 					    le32_to_cpu(entry->e_hash),
 					    &ea_inode);
@@ -1235,7 +1270,12 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		entry->e_value_inum = 0;
 		entry->e_value_size = 0;
 
-		ext4_put_ea_inode(parent->i_sb, ea_inode);
+		/*
+		 * Collect processed EA inodes for dedup and deferred iput.
+		 * ext4_put_ea_inode_llist() handles the actual release
+		 * after the loop, preventing iget deadlocks on duplicates.
+		 */
+		llist_add(&EXT4_I(ea_inode)->i_ea_iput_node, processed);
 		dirty = true;
 	}
 
@@ -1262,7 +1302,8 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 static void
 ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 			 struct buffer_head *bh,
-			 int extra_credits)
+			 int extra_credits,
+			 struct llist_head *processed)
 {
 	struct mb_cache *ea_block_cache = EA_BLOCK_CACHE(inode);
 	u32 hash, ref;
@@ -1304,7 +1345,8 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 						     BFIRST(bh),
 						     true /* block_csum */,
 						     extra_credits,
-						     true /* skip_quota */);
+						     true /* skip_quota */,
+						     processed);
 		ext4_free_blocks(handle, inode, bh, 0, 1,
 				 EXT4_FREE_BLOCKS_METADATA |
 				 EXT4_FREE_BLOCKS_FORGET);
@@ -2171,8 +2213,12 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 	/* Drop the previous xattr block. */
 	if (bs->bh && bs->bh != new_bh) {
+		LLIST_HEAD(processed);
+
 		ext4_xattr_release_block(handle, inode, bs->bh,
-					 0 /* extra_credits */);
+					 0 /* extra_credits */,
+					 &processed);
+		ext4_put_ea_inode_llist(inode->i_sb, &processed);
 	}
 	error = 0;
 
@@ -2866,6 +2912,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_entry *entry;
 	struct inode *ea_inode;
 	int error;
+	LLIST_HEAD(processed);
 
 	error = ext4_journal_ensure_credits(handle, extra_credits,
 			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
@@ -2897,7 +2944,8 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 						     IFIRST(header),
 						     false /* block_csum */,
 						     extra_credits,
-						     false /* skip_quota */);
+						     false /* skip_quota */,
+						     &processed);
 	}
 
 	if (EXT4_I(inode)->i_file_acl) {
@@ -2921,6 +2969,11 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 			     entry = EXT4_XATTR_NEXT(entry)) {
 				if (!entry->e_value_inum)
 					continue;
+				/* Skip EA inodes already dec_ref'd from ibody */
+				if (ext4_ea_ino_in_llist(
+					    le32_to_cpu(entry->e_value_inum),
+					    &processed))
+					continue;
 				error = ext4_xattr_inode_iget(inode,
 					      le32_to_cpu(entry->e_value_inum),
 					      le32_to_cpu(entry->e_hash),
@@ -2935,7 +2988,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 		}
 
 		ext4_xattr_release_block(handle, inode, bh,
-					 extra_credits);
+					 extra_credits, &processed);
 		/*
 		 * Update i_file_acl value in the same transaction that releases
 		 * block.
@@ -2951,6 +3004,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	}
 	error = 0;
 cleanup:
+	ext4_put_ea_inode_llist(inode->i_sb, &processed);
 	brelse(iloc.bh);
 	brelse(bh);
 	return error;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v10 1/5] fs: add iput_if_not_last() helper
From: Yun Zhou @ 2026-06-25 15:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner
  Cc: linux-ext4, linux-kernel, yun.zhou, linux-fsdevel
In-Reply-To: <20260625152941.24788-1-yun.zhou@windriver.com>

Add a helper that drops an inode reference only if the caller does not
hold the last one.  Returns true if the reference was dropped, false
otherwise.

This is useful for filesystems that need to release inode references
in contexts where triggering final iput (and thus eviction) would be
unsafe due to lock ordering constraints.  The caller can check the
return value and defer the final iput to a safe context.

Unlike iput_not_last() which BUG_ON's if called with the last ref,
this variant is designed to be called speculatively.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6da44573ce45..4916a9d54347 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2418,6 +2418,19 @@ static inline void super_set_sysfs_name_generic(struct super_block *sb, const ch
 extern void ihold(struct inode * inode);
 extern void iput(struct inode *);
 void iput_not_last(struct inode *);
+
+/**
+ * iput_if_not_last - drop an inode reference only if it is not the last one
+ * @inode: inode to put
+ *
+ * Returns true if the reference was dropped, false if this was the last
+ * reference and the caller must arrange for final iput() in a safe context.
+ */
+static inline bool iput_if_not_last(struct inode *inode)
+{
+	return atomic_add_unless(&inode->i_count, -1, 1);
+}
+
 int inode_update_time(struct inode *inode, enum fs_update_time type,
 		unsigned int flags);
 int generic_update_time(struct inode *inode, enum fs_update_time type,
-- 
2.43.0


^ permalink raw reply related

* [PATCH v10 3/5] ext4: convert all EA inode iput() calls to ext4_put_ea_inode()
From: Yun Zhou @ 2026-06-25 15:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner
  Cc: linux-ext4, linux-kernel, yun.zhou, linux-fsdevel
In-Reply-To: <20260625152941.24788-1-yun.zhou@windriver.com>

Convert all iput() calls on EA inodes in xattr code paths to use
ext4_put_ea_inode().  This establishes a uniform rule: every EA inode
reference release in ext4 xattr code goes through ext4_put_ea_inode(),
eliminating the need to analyze each call site individually for lock
safety.

Converted sites:

- ext4_xattr_inode_get() read path
- ext4_xattr_inode_inc_ref_all() main loop and cleanup path
- ext4_xattr_inode_dec_ref_all() error paths
- ext4_xattr_inode_create() error path
- ext4_xattr_inode_cache_find() mismatch path
- ext4_xattr_inode_lookup_create() out_err
- ext4_xattr_set_entry() old_ea_inode
- ext4_xattr_block_set() new block path, cleanup, and tmp_inode
- ext4_xattr_ibody_set() error and success paths
- ext4_xattr_delete_inode() quota loop

For most of these, iput_if_not_last() will succeed (the EA inode has
other references) making the overhead a single atomic operation.

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
 fs/ext4/xattr.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index ecdad5920b14..90b693b78a45 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -569,7 +569,7 @@ ext4_xattr_inode_get(struct inode *inode, struct ext4_xattr_entry *entry,
 					ea_inode->i_ino, true /* reusable */);
 	}
 out:
-	iput(ea_inode);
+	ext4_put_ea_inode(inode->i_sb, ea_inode);
 	return err;
 }
 
@@ -1106,10 +1106,10 @@ static int ext4_xattr_inode_inc_ref_all(handle_t *handle, struct inode *parent,
 		err = ext4_xattr_inode_inc_ref(handle, ea_inode);
 		if (err) {
 			ext4_warning_inode(ea_inode, "inc ref error %d", err);
-			iput(ea_inode);
+			ext4_put_ea_inode(parent->i_sb, ea_inode);
 			goto cleanup;
 		}
-		iput(ea_inode);
+		ext4_put_ea_inode(parent->i_sb, ea_inode);
 	}
 	return 0;
 
@@ -1135,7 +1135,7 @@ static int ext4_xattr_inode_inc_ref_all(handle_t *handle, struct inode *parent,
 		if (err)
 			ext4_warning_inode(ea_inode, "cleanup dec ref error %d",
 					   err);
-		iput(ea_inode);
+		ext4_put_ea_inode(parent->i_sb, ea_inode);
 	}
 	return saved_err;
 }
@@ -1203,7 +1203,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		if (err) {
 			ext4_warning_inode(ea_inode,
 					   "Expand inode array err=%d", err);
-			iput(ea_inode);
+			ext4_put_ea_inode(parent->i_sb, ea_inode);
 			continue;
 		}
 
@@ -1507,7 +1507,7 @@ static struct inode *ext4_xattr_inode_create(handle_t *handle,
 			if (ext4_xattr_inode_dec_ref(handle, ea_inode))
 				ext4_warning_inode(ea_inode,
 					"cleanup dec ref error %d", err);
-			iput(ea_inode);
+			ext4_put_ea_inode(inode->i_sb, ea_inode);
 			return ERR_PTR(err);
 		}
 
@@ -1566,7 +1566,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
 			kvfree(ea_data);
 			return ea_inode;
 		}
-		iput(ea_inode);
+		ext4_put_ea_inode(inode->i_sb, ea_inode);
 	next_entry:
 		ce = mb_cache_entry_find_next(ea_inode_cache, ce);
 	}
@@ -1617,7 +1617,7 @@ static struct inode *ext4_xattr_inode_lookup_create(handle_t *handle,
 				      ea_inode->i_ino, true /* reusable */);
 	return ea_inode;
 out_err:
-	iput(ea_inode);
+	ext4_put_ea_inode(inode->i_sb, ea_inode);
 	ext4_xattr_inode_free_quota(inode, NULL, value_len);
 	return ERR_PTR(err);
 }
@@ -1850,7 +1850,7 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
 
 	ret = 0;
 out:
-	iput(old_ea_inode);
+	ext4_put_ea_inode(inode->i_sb, old_ea_inode);
 	return ret;
 }
 
@@ -2012,7 +2012,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				old_ea_inode_quota = le32_to_cpu(
 						s->here->e_value_size);
 			}
-			iput(tmp_inode);
+			ext4_put_ea_inode(inode->i_sb, tmp_inode);
 
 			s->here->e_value_inum = 0;
 			s->here->e_value_size = 0;
@@ -2152,7 +2152,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 					ext4_warning_inode(ea_inode,
 							   "dec ref error=%d",
 							   error);
-				iput(ea_inode);
+				ext4_put_ea_inode(inode->i_sb, ea_inode);
 				ea_inode = NULL;
 			}
 
@@ -2206,7 +2206,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
 		}
-		iput(ea_inode);
+		ext4_put_ea_inode(inode->i_sb, ea_inode);
 	}
 	if (ce)
 		mb_cache_entry_put(ea_block_cache, ce);
@@ -2288,7 +2288,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 
 			ext4_xattr_inode_free_quota(inode, ea_inode,
 						    i_size_read(ea_inode));
-			iput(ea_inode);
+			ext4_put_ea_inode(inode->i_sb, ea_inode);
 		}
 		return error;
 	}
@@ -2300,7 +2300,7 @@ int ext4_xattr_ibody_set(handle_t *handle, struct inode *inode,
 		header->h_magic = cpu_to_le32(0);
 		ext4_clear_inode_state(inode, EXT4_STATE_XATTR);
 	}
-	iput(ea_inode);
+	ext4_put_ea_inode(inode->i_sb, ea_inode);
 	return 0;
 }
 
@@ -2989,7 +2989,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 					continue;
 				ext4_xattr_inode_free_quota(inode, ea_inode,
 					      le32_to_cpu(entry->e_value_size));
-				iput(ea_inode);
+				ext4_put_ea_inode(inode->i_sb, ea_inode);
 			}
 
 		}
-- 
2.43.0


^ permalink raw reply related

* [PATCH v10 0/5] ext4: deferred iput framework for EA inodes
From: Yun Zhou @ 2026-06-25 15:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner
  Cc: linux-ext4, linux-kernel, yun.zhou, linux-fsdevel

This series introduces a deferred-iput framework for EA inodes to
eliminate a class of lock ordering issues in ext4 xattr code.

The problem: iput() on EA inodes while holding xattr_sem or a jbd2
handle can trigger eviction, which may acquire those same locks or
s_writepages_rwsem, creating circular dependencies.  The immediate
deadlock (during mount-time orphan cleanup) is fixed by two separate
patches already reviewed and posted:

  ext4: skip extra isize expansion during mount to prevent deadlock
  ext4: set EXT4_STATE_NO_EXPAND in ext4_evict_inode

This series provides the structural fix that makes the code safe
regardless of calling context:

Patch 1 adds a VFS helper iput_if_not_last() which drops an inode
reference only if it is not the last one, using atomic_add_unless().
This provides a proper VFS abstraction for filesystems that need to
conditionally defer final iput.

Patch 2 introduces ext4_put_ea_inode() using iput_if_not_last() as
a fast path (single atomic, zero overhead for the common case).  If
this is the last reference, the inode is linked onto a per-sb llist
(via i_ea_iput_node embedded in ext4_inode_info, union with xattr_sem
which is unused for EA inodes) and a delayed worker (1 jiffie) performs
the final iput() in a clean context.  No per-iput allocation needed.
Also moves init_rwsem(xattr_sem) from init_once to ext4_alloc_inode
to handle slab reuse after the union field has been overwritten.

Patch 3 converts all EA inode iput() calls in xattr code to use
ext4_put_ea_inode() uniformly -- no exceptions to reason about.

Patch 4 removes the now-redundant ea_inode_array mechanism (parameter
threading, struct, expand/free functions), replaced entirely by direct
ext4_put_ea_inode() calls.  This is a net code reduction.

Patch 5 prevents a potential ABBA deadlock on corrupted filesystems
where multiple xattr entries reference the same EA inode.  It tracks
processed EA inodes on a per-call llist (reusing i_ea_iput_node) and
skips duplicates before iget, deferring the actual ext4_put_ea_inode()
until after the loop completes.  This covers both intra-block and
cross ibody/block duplicates in ext4_xattr_delete_inode().

Link: https://syzkaller.appspot.com/bug?extid=5d19358d7eb30ffb0cc5

v10:
 - New patch 5: prevent deadlock from duplicate EA inode references
   on corrupted filesystems.  Track processed EA inodes on a per-call
   llist to skip duplicates before iget, and defer ext4_put_ea_inode()
   until after the loop to avoid queuing an inode for eviction while
   the same loop may still iget it.
 - Patch 2: move ext4_init_ea_inode_work() before ext4_multi_mount_protect()
   so that failed_mount3a drain does not hit an uninitialized delayed_work
   when MMP check fails.

v9:
 - Add iput_if_not_last() as proper VFS helper (per reviewer: don't
   let filesystems manipulate inode refcount without VFS abstraction).
 - Use iput_if_not_last() + llist_node embedded in ext4_inode_info
   (union with xattr_sem) to avoid per-iput allocation entirely.
 - Convert ALL EA inode iput() calls uniformly -- no exceptions.
 - Remove entire ea_inode_array mechanism.
 - Add WARN_ON_ONCE in ext4_put_ea_inode() to catch misuse on non-EA
   inodes (protects the xattr_sem union safety).
 - Fix worker re-arm: ext4_drain_ea_inode_work() loops to handle
   nested EA inode evictions re-scheduling work.
 - Move INIT_DELAYED_WORK before journal loading (fast commit replay
   may trigger evictions).
 - Drain before ext4_quotas_off() for correct quota accounting.
 - Add flush in failed_mount_wq and failed_mount3a error paths for
   journal replay case.
 - Move init_rwsem(xattr_sem) from init_once to ext4_alloc_inode to
   handle slab object reuse after union overwrite.
 - Encapsulate worker init into ext4_init_ea_inode_work(), making
   ext4_ea_inode_work() static to xattr.c.

Yun Zhou (5):
  fs: add iput_if_not_last() helper
  ext4: introduce ext4_put_ea_inode() for safe deferred iput
  ext4: convert all EA inode iput() calls to ext4_put_ea_inode()
  ext4: remove ea_inode_array mechanism in favor of ext4_put_ea_inode()
  ext4: prevent deadlock from duplicate EA inode references on corrupted
    fs

 fs/ext4/ext4.h     |  13 ++-
 fs/ext4/inode.c    |   6 +-
 fs/ext4/super.c    |  19 +++-
 fs/ext4/xattr.c    | 214 +++++++++++++++++++++++++++------------------
 fs/ext4/xattr.h    |  21 +++--
 include/linux/fs.h |  13 +++
 6 files changed, 185 insertions(+), 101 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH v10 2/5] ext4: introduce ext4_put_ea_inode() for safe deferred iput
From: Yun Zhou @ 2026-06-25 15:29 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, viro, brauner
  Cc: linux-ext4, linux-kernel, yun.zhou, linux-fsdevel
In-Reply-To: <20260625152941.24788-1-yun.zhou@windriver.com>

Calling iput() on EA inodes while holding xattr_sem or a jbd2 handle
can trigger write_inode_now() -> ext4_writepages() -> s_writepages_rwsem,
creating a lock ordering issue during mount (!SB_ACTIVE).

Add ext4_put_ea_inode() which uses iput_if_not_last() as a fast path.
If this is not the last reference, it is dropped immediately.  If this
is the last reference, the inode is linked onto a per-sb lock-free llist
via i_ea_iput_node (embedded in ext4_inode_info, sharing space with the
unused xattr_sem of EA inodes via a union) and a delayed worker
(1 jiffie) performs the final iput() in a clean context.  This avoids
per-iput memory allocation.

Convert the first call site: ext4_xattr_block_set()'s "Drop the
previous xattr block" path, which previously called
ext4_xattr_inode_array_free() under xattr_sem + jbd2 handle.

The worker is drained in ext4_put_super() before quota shutdown using
a loop to handle re-arming (evicting an EA inode may queue further EA
inodes).  Initialization is placed before journal loading since fast
commit replay may trigger evictions that call ext4_put_ea_inode().

Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 13 ++++++++-
 fs/ext4/super.c | 19 ++++++++++++-
 fs/ext4/xattr.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/xattr.h | 14 ++++++++++
 4 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b37c136ea3ab..b9b0ada7774b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1070,8 +1070,14 @@ struct ext4_inode_info {
 	 * between readers of EAs and writers of regular file data, so
 	 * instead we synchronize on xattr_sem when reading or changing
 	 * EAs.
+	 *
+	 * EA inodes (EXT4_EA_INODE_FL) do not use xattr_sem; they reuse
+	 * the space for deferred iput linkage.
 	 */
-	struct rw_semaphore xattr_sem;
+	union {
+		struct rw_semaphore xattr_sem;
+		struct llist_node i_ea_iput_node;
+	};
 
 	/*
 	 * Inodes with EXT4_STATE_ORPHAN_FILE use i_orphan_idx. Otherwise
@@ -1770,6 +1776,11 @@ struct ext4_sb_info {
 	struct ext4_es_stats s_es_stats;
 	struct mb_cache *s_ea_block_cache;
 	struct mb_cache *s_ea_inode_cache;
+
+	/* Deferred iput for EA inodes to avoid lock ordering issues */
+	struct llist_head s_ea_inode_to_free;
+	struct delayed_work s_ea_inode_work;
+
 	spinlock_t s_es_lock ____cacheline_aligned_in_smp;
 
 	/* Journal triggers for checksum computation */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 245f67d10ded..ed1d0cad2bc2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1303,6 +1303,8 @@ static void ext4_put_super(struct super_block *sb)
 			 &sb->s_uuid);
 
 	ext4_unregister_li_request(sb);
+	/* Drain deferred EA inode iputs while quota is still active. */
+	ext4_drain_ea_inode_work(sbi);
 	ext4_quotas_off(sb, EXT4_MAXQUOTAS);
 
 	destroy_workqueue(sbi->rsv_conversion_wq);
@@ -1423,6 +1425,13 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
 #endif
 	ei->jinode = NULL;
+	/*
+	 * Reinitialize xattr_sem every allocation because EA inodes
+	 * share this space with i_ea_iput_node (via union) which may
+	 * have overwritten the semaphore when the slab object was
+	 * previously used as an EA inode.
+	 */
+	init_rwsem(&ei->xattr_sem);
 	INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
 	spin_lock_init(&ei->i_completed_io_lock);
 	ei->i_sync_tid = 0;
@@ -1488,7 +1497,6 @@ static void init_once(void *foo)
 	struct ext4_inode_info *ei = foo;
 
 	INIT_LIST_HEAD(&ei->i_orphan);
-	init_rwsem(&ei->xattr_sem);
 	init_rwsem(&ei->i_data_sem);
 	inode_init_once(&ei->vfs_inode);
 	ext4_fc_init_inode(&ei->vfs_inode);
@@ -5497,6 +5505,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 			  ext4_has_feature_orphan_present(sb) ||
 			  ext4_has_feature_journal_needs_recovery(sb));
 
+	ext4_init_ea_inode_work(sbi);
+
 	if (ext4_has_feature_mmp(sb) && !sb_rdonly(sb)) {
 		err = ext4_multi_mount_protect(sb, le64_to_cpu(es->s_mmp_block));
 		if (err)
@@ -5508,6 +5518,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	 * The first inode we look at is the journal inode.  Don't try
 	 * root first: it may be modified in the journal!
 	 */
+
 	if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
 		err = ext4_load_and_init_journal(sb, es, ctx);
 		if (err)
@@ -5747,6 +5758,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	return 0;
 
 failed_mount9:
+	/* Drain deferred EA inode iputs before quota shutdown */
+	ext4_drain_ea_inode_work(sbi);
 	ext4_quotas_off(sb, EXT4_MAXQUOTAS);
 failed_mount8: __maybe_unused
 	ext4_release_orphan_info(sb);
@@ -5767,6 +5780,8 @@ failed_mount8: __maybe_unused
 	if (EXT4_SB(sb)->rsv_conversion_wq)
 		destroy_workqueue(EXT4_SB(sb)->rsv_conversion_wq);
 failed_mount_wq:
+	/* Drain deferred EA inode iputs before freeing structures */
+	ext4_drain_ea_inode_work(sbi);
 	ext4_xattr_destroy_cache(sbi->s_ea_inode_cache);
 	sbi->s_ea_inode_cache = NULL;
 
@@ -5777,6 +5792,8 @@ failed_mount8: __maybe_unused
 		ext4_journal_destroy(sbi, sbi->s_journal);
 	}
 failed_mount3a:
+	/* Drain deferred EA inode iputs from journal replay */
+	ext4_drain_ea_inode_work(sbi);
 	ext4_es_unregister_shrinker(sbi);
 failed_mount3:
 	/* flush s_sb_upd_work before sbi destroy */
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 982a1f831e22..ecdad5920b14 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -117,6 +117,8 @@ const struct xattr_handler * const ext4_xattr_handlers[] = {
 static int
 ext4_expand_inode_array(struct ext4_xattr_inode_array **ea_inode_array,
 			struct inode *inode);
+static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
+				struct ext4_xattr_inode_array *array);
 
 #ifdef CONFIG_LOCKDEP
 void ext4_xattr_inode_set_class(struct inode *ea_inode)
@@ -2187,7 +2189,8 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		ext4_xattr_release_block(handle, inode, bs->bh,
 					 &ea_inode_array,
 					 0 /* extra_credits */);
-		ext4_xattr_inode_array_free(ea_inode_array);
+		ext4_xattr_inode_array_free_deferred(inode->i_sb,
+						     ea_inode_array);
 	}
 	error = 0;
 
@@ -3025,6 +3028,74 @@ void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *ea_inode_array)
 	kfree(ea_inode_array);
 }
 
+static void ext4_xattr_inode_array_free_deferred(struct super_block *sb,
+				struct ext4_xattr_inode_array *array)
+{
+	int idx;
+
+	if (array == NULL)
+		return;
+
+	for (idx = 0; idx < array->count; ++idx)
+		ext4_put_ea_inode(sb, array->inodes[idx]);
+	kfree(array);
+}
+
+/*
+ * Worker function for deferred EA inode iput.  Processes all inodes queued
+ * on s_ea_inode_to_free in a context free of xattr_sem/jbd2 handle locks.
+ */
+static void ext4_ea_inode_work(struct work_struct *work)
+{
+	struct ext4_sb_info *sbi = container_of(to_delayed_work(work),
+						struct ext4_sb_info,
+						s_ea_inode_work);
+	struct llist_node *node = llist_del_all(&sbi->s_ea_inode_to_free);
+	struct llist_node *next;
+
+	while (node) {
+		struct ext4_inode_info *ei = container_of(node,
+					struct ext4_inode_info, i_ea_iput_node);
+		next = node->next;
+		iput(&ei->vfs_inode);
+		node = next;
+	}
+}
+
+/*
+ * Release a VFS reference on an EA inode.  Must be used instead of iput()
+ * in any context where xattr_sem or a jbd2 handle is held.
+ *
+ * If this is not the last reference, drops it immediately via
+ * iput_if_not_last() with no further action needed.
+ *
+ * If this is the last reference, the inode is linked onto a per-sb
+ * llist via i_ea_iput_node (embedded in ext4_inode_info, sharing space
+ * with the unused xattr_sem) and a delayed worker performs the final
+ * iput() in a clean context.
+ */
+void ext4_put_ea_inode(struct super_block *sb, struct inode *inode)
+{
+	if (!inode)
+		return;
+	WARN_ON_ONCE(!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL));
+	if (iput_if_not_last(inode))
+		return;
+	llist_add(&EXT4_I(inode)->i_ea_iput_node,
+		  &EXT4_SB(sb)->s_ea_inode_to_free);
+	/*
+	 * Use a short delay to allow multiple EA inodes to accumulate,
+	 * reducing workqueue wakeups when several are released together.
+	 */
+	schedule_delayed_work(&EXT4_SB(sb)->s_ea_inode_work, 1);
+}
+
+void ext4_init_ea_inode_work(struct ext4_sb_info *sbi)
+{
+	init_llist_head(&sbi->s_ea_inode_to_free);
+	INIT_DELAYED_WORK(&sbi->s_ea_inode_work, ext4_ea_inode_work);
+}
+
 /*
  * ext4_xattr_block_cache_insert()
  *
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 1fedf44d4fb6..9883ba5569a1 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -190,6 +190,20 @@ extern int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 				   struct ext4_xattr_inode_array **array,
 				   int extra_credits);
 extern void ext4_xattr_inode_array_free(struct ext4_xattr_inode_array *array);
+extern void ext4_init_ea_inode_work(struct ext4_sb_info *sbi);
+extern void ext4_put_ea_inode(struct super_block *sb, struct inode *inode);
+
+/*
+ * Drain all pending deferred EA inode iputs.  Must be called before
+ * freeing resources that eviction depends on (quota, block allocator).
+ * Loops because worker iput may trigger eviction that re-queues.
+ */
+static inline void ext4_drain_ea_inode_work(struct ext4_sb_info *sbi)
+{
+	while (flush_delayed_work(&sbi->s_ea_inode_work) ||
+	       !llist_empty(&sbi->s_ea_inode_to_free))
+		;
+}
 
 extern int ext4_expand_extra_isize_ea(struct inode *inode, int new_extra_isize,
 			    struct ext4_inode *raw_inode, handle_t *handle);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH] ext4: clear stale xarray tags on folios skipped during writeback
From: Gerald Yang @ 2026-06-25 13:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, linux-ext4, gerald.yang.tw
In-Reply-To: <asahstosrqlboeyjqyyzvwgne7qucw4y5mvoos346seqgrvwcd@wtuocifmom6c>

Thanks for the review, I will send a v2 based on the suggestion.


On Thu, Jun 25, 2026 at 7:33 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 23-06-26 16:02:38, Gerald Yang wrote:
> > In data=journal mode, the writeback thread can hit the
> > WARN_ON_ONCE(sb_rdonly(sb)) in ext4_journal_check_start() while the
> > superblock is being remounted read-only during reboot:
> >
> > Workqueue: writeback wb_workfn (flush-253:0)
> > RIP: 0010:ext4_journal_check_start+0x8b/0xd0
> > Call Trace:
> >   __ext4_journal_start_sb+0x3c/0x1e0
> >   mpage_prepare_extent_to_map+0x4af/0x580
> >   ext4_do_writepages+0x3c0/0x1080
> >   ext4_writepages+0xc8/0x1a0
> >   do_writepages+0xc4/0x180
> >   __writeback_single_inode+0x45/0x2f0
> >   writeback_sb_inodes+0x26b/0x5d0
> >   __writeback_inodes_wb+0x54/0x100
> >   wb_writeback+0x1ac/0x320
> >   wb_workfn+0x394/0x470
> >
> > And followed by the warning:
> > EXT4-fs warning (device vda1): ext4_evict_inode:195: inode #6263:
> > comm (sd-umount): data will be lost
> >
> > This issue is not reproduced every time, but frequently.
> > The reproduction step is to create a VM with 8 CPUs, 16G memory and
> > setup data=journal:
> > sudo tune2fs -o journal_data /dev/vda1
> > Run fio:
> > rm -f fiotest
> > fio --name=fiotest --rw=randwrite --bs=4k --runtime=6 --ioengine=libaio
> > --iodepth=256 --numjobs=8 --filename=fiotest --filesize=30G
> > --group_reporting
> > Reboot the VM, and check the console output from:
> > virsh console testvm
> >
> > But there is no dirty inode, folio_clear_dirty_for_io clears PG_dirty
> > but leaves tags PAGECACHE_TAG_DIRTY and PAGECACHE_TAG_TOWRITE set which
> > are only cleared by __folio_start_writeback.
> > In data=journal mode, jbd2 checkpoints the journalled data to its final
> > location and clears its own dirty flag without touching folio PG_dirty
> > or xarray dirty flags.
> > The commit f4a2b42e7891 ("ext4: fix stale xarray tags after writeback")
> > fixes when PG_dirty is still set but there is no dirty page.
> > Another case is PG_dirty is cleared, but PAGECACHE_TAG_DIRTY and
> > PAGECACHE_TAG_TOWRITE is still set. In this case, writeback thread
> > checks clean folio and skips it in mpage_prepare_extent_to_map:
> > if (!folio_test_dirty(folio) ||
> >     ...
> >         folio_unlcok(folio);
> >       continue
> >
> > And never reaches ext4_bio_write_folio where the commit f4a2b42e7891
> > clears the stale xarray tags. Print debug logs after the filesystem
> > is remounted read-only:
> > writepages RDONLY nrpages=2048 dirtytag=1 wbtag=0 towrite=1 sync=0
> > And all folios are actually clean:
> > folio idx=3 dirty=0 wb=0 checked=0 dirtybuf=0 jbddirty=0 mapped=1
> > ...
> >
> > We need to clear the xarray stale tags for such clean folios by
> > cycling them through writeback in the skip path, the same way
> > f4a2b42e7891 does in ext4_bio_write_folio.
> >
> > Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
> > Signed-off-by: Gerald Yang <gerald.yang@canonical.com>
>
> Good spotting. One comment to the patch:
>
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index ce99807c5f5b..40da611b0831 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2698,6 +2698,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
> >                           (folio_test_writeback(folio) &&
> >                            (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
> >                           unlikely(folio->mapping != mapping)) {
> > +                             /*
> > +                              * A clean folio (PG_dirty = 0) can still carry
> > +                              * stale xarray tags: PAGECACHE_TAG_DIRTY /
> > +                              * PAGECACHE_TAG_TOWRITE.
> > +                              * In data=journal mode, the data may have been
> > +                              * checkpointed to its final location by jbd2
> > +                              * but these tags are only cleared by
> > +                              * __folio_start_writeback() later.
> > +                              * These tags keep the inode on the writeback
> > +                              * list and make the writeback thread calls
> > +                              * ext4_journal_start on a read-only sb during
> > +                              * remounting fs to read-only, and return
> > +                              * -EROFS from ext4_journal_check_start.
> > +                              * Cycle the clean folio through writeback to
> > +                              * drop the stale xarray tags.
> > +                              */
> > +                             if (folio->mapping == mapping &&
> > +                                 !folio_test_dirty(folio) &&
> > +                                 !folio_test_writeback(folio)) {
> > +                                     __folio_start_writeback(folio, false);
> > +                                     folio_end_writeback(folio);
> > +                             }
>
> Instead of this check, just split the top level condition to make things
> more obvious:
>                                 if ((folio_test_writeback(folio) &&
>                                      mpd->wbc->sync_mode == WB_SYNC_NONE) ||
>                                     unlikely(folio->mapping != mapping)) {
>                                         folio_unlock(folio);
>                                         continue;
>                                 }
>                                 /*
>                                  * If the folio is clean, skip writing it back.
>                                  * Cycle the folio through the writeback state
>                                  * though, to clear stale xarray tags.
>                                  */
>                                 if (!folio_test_dirty(folio)) {
>                                         if (!folio_test_writeback(folio)) {
>                                                 __folio_start_writeback(folio,
>                                                                         false);
>                                                 folio_end_writeback(folio);
>                                         }
>                                         folio_unlock(folio);
>                                         continue;
>                                 }
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Jan Kara @ 2026-06-25 12:58 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, djwong,
	hch, yi.zhang, yangerkun, yukuai
In-Reply-To: <881a714b-12ad-4678-81d1-f92dcc126567@huaweicloud.com>

On Thu 25-06-26 11:33:36, Zhang Yi wrote:
> On 6/25/2026 1:16 AM, Jan Kara wrote:
> > On Mon 22-06-26 20:36:02, Zhang Yi wrote:
> >>> then I'd
> >>> probably prefer coming up with an ext4_get_blocks flag which tells it to
> >>> start a transaction on its own if we need to allocate blocks... That would
> >>> be much simpler than opencoding all this.
> >>
> >> Additionally, there is a key point here. The reason I open-coded
> >> ext4_iomap_map_writeback_range() is that we must ensure extent query
> >> and allocation are performed atomically under i_data_sem. Otherwise,
> >> concurrent truncate could lead to quota leaks.
> >>
> >> Specifically, consider the following scenario: we call
> >> ext4_map_blocks() to allocate blocks. Suppose there is a delalloc
> >> extent covering blocks [0,3). While writeback is submitting block 0, a
> >> concurrent truncate(block 1) occurs:
> >>
> >> wb                         truncate
> >> ext4_es_lookup_extent()    ext4_truncate_down()
> >>   //get [0,3)
> >>                             truncate_inode_pages_range()
> >>                                //clear page 1&2
> >>                             ext4_truncate()
> >>                              down_write(i_data_sem)
> >>                               ext4_es_remove_extent()
> >>                                //drop extent [1,3)
> >>                                //i_reserved_data_blocks: 3->1
> >>                               up_write(i_data_sem)
> >> down_write(i_data_sem)
> >> ext4_map_create_blocks()
> >>   //alloc 3 blocks
> >>  ext4_es_insert_extent()
> >>   //only reclaim 1 block,stale 2 blocks
> >> up_write(i_data_sem)
> >>
> >> Therefore, If we don't open-coding this part, we would need to
> >> significantly rework ext4_map_blocks(), which might have a larger
> >> impact at this point. What do you think?
> > 
> > Hum, is something like this really possible? I mean iomap_writepages() will
> > lookup and lock folio. Only then it calls ->iomap_begin to map it to
> > underlying blocks. And folio lock synchronizes against
> > truncate_inode_pages_range() so how would writeback end up trying to
> > allocate something underlying pages 1 or 2?
> 
> I believe this scenario can indeed occur, and folio lock alone is
> insufficient to protect against this concurrency issue. The main reason
> is that the iomap writeback framework processes folios one by one. For
> each folio, it follows the "lock -> map -> unlock" sequence. If each
> iteration only mapped blocks covering no more than one folio in length,
> performance would be severely degraded. Therefore, both XFS and the
> current ext4 iomap implementation choose to map up to the minimum of the
> writeback length and the delalloc extent length. This means that when
> processing folio 0, if an extent of length 3 is found, the ranges
> corresponding to the subsequent folios are also mapped and cached. As a
> result, holding only the folio lock of folio 0 is insufficient to
> protect against truncation concurrency with the latter two folios.

Yes, I know about the behavior that iomap_writepages() can cache extent
longer than what folio_lock covers. But after truncate_inode_pages_range()
completes in the truncate path, writeback_iter will never attempt to map
anything for folios beyond i_size. If we had anything cached from before
truncate, iomap_valid() check takes care it isn't used.

But I think you might be concerned about the following race:
In the writeback path iomap_writeback_folio() is called for folio 0.
iomap_writeback_handle_eof() still sees old i_size so end_pos is kept
large, we go down to ext4_map_blocks() which runs ext4_map_query_blocks()
and still sees larger delalloc extent. Then truncate to folio 1 comes. It
can fully run and complete because folio 0 is not affected by it. Then
ext4_map_blocks() resumes and calls ext4_map_create_blocks() with the (now
stale) long delalloc extent and allocates blocks under already truncated
area.

Frankly what I think needs to happen is to fix ext4_map_blocks() so that
after reacquiring i_data_sem, we recheck m_seq still matches i_seq and if
not, redo the extent status tree lookup which will as a side effect also
trim the length to map to appropriate size. What do you think?

> >>> For now, I'd prefer to do what XFS does and offload everything. Then you
> >>> don't have to export iomap_finish_ioend() (which would need to be in a
> >>> separate patch and acked by iomap maintainers) and the code is more
> >>> standard. There's a patchset in the works which adds general ioend offloading
> >>> infrastructure into iomap [1] and when that lands we should get all these
> >>                       ^^^^^ block layer?
> >>
> >>> bells and whistles (even better ones with percpu work queues, batching,
> >>> etc.) for free.
> >>>
> >>> [1] https://lore.kernel.org/all/20260514-blk-dontcache-v6-0-782e2fa7477b@columbia.edu/
> >>>
> >>> 								Honza
> >>
> >> Ha, I've noticed this patchset, so I haven't implemented
> >> uncached I/O handling for now. As a side note, I have a question:
> >> if we convert all endio processing to worker threads, IIRC, my
> >> recollection from previous performance tests is that pure overwrite
> >> scenarios would see at least a 20% degradation. Is that acceptable?
> > 
> > No, but the latest version of the patches exactly does IO completion in the
> > interrupt unless the bio is flagged as needing IO completion from process
> > context or unless end_io handler returns a particular error - which means
>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Please let me confirm whether my understanding is correct. The latest(v6)
> modification to bio_endio() is as follows:
> 
> @@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
>  	}
>  #endif
> 
> -	if (bio->bi_end_io)
> +	if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
> +		__bio_complete_in_task(bio);
> +	else if (bio->bi_end_io)
>  		bio->bi_end_io(bio);
>  }
> 
> Currently, __bio_complete_in_task() is only called in bio_endio() when
> BIO_COMPLETE_IN_TASK is explicitly set on the bio. It is not called
> again based on a specific error return from bio->bi_end_io(). So I
> believe what you meant is that each filesystem's own ->bi_end_io()
> should call it to convert the endio processing to process context.
> Is my understanding correct?

Right yes. I remembered wrong what we ended up with. But I was certain
the usecase of offloading from ->bi_end_io is handled. :)

> >> The reason I kept ext4_iomap_end_bio() handling I/O completion in
> >> interrupt context is for overwrite performance. XFS also handles
> >> overwrites in interrupt context (via ioend_writeback_end_bio()).
> >> However, ext4 has the data_error=abort mount option — when this mode
> >> is set and an I/O error occurs, we must abort the journal in a
> >> worker. Since we cannot predict I/O errors at submission time, we
> >> can't directly use ioend_writeback_end_bio() and must instead bind
> >> our own ext4_iomap_end_bio(). At the same time, I want to avoid
> >> spawning a worker for pure overwrites when no I/O error occurs, so I
> >> exported iomap_finish_ioend(). What do you think?
> > 
> > So data_error=abort handling can exactly use the new generic framework - if
> > we detect during processing IO completion we cannot actually do it in the
> > interrupt (like in case of error), we just return appropriately from the
> > handler and the generic code will handle offloading and call the ->end_io
> > callback again.
>
> If my understanding above is correct, what you are suggesting here is
> that in ext4_iomap_end_bio(), we should call __bio_complete_in_task()
> to switch to process context when data_error=abort is set and the I/O
> returns an error. This way, we could drop ext4's own
> i_iomap_ioend_work / i_rsv_conversion_work handling. Is that right?

Exactly.

> But even with that, it seems we would still need to export
> iomap_finish_ioend(). Because if the I/O does not fail,
> ext4_iomap_end_bio() would need to complete the IO processing in
> interrupt context for pure overwrite and would not switch to tasks
> context. Because only iomap_finish_ioend() is safe to call in interrupt
> context. Am I misunderstanding something?

Hmm, yes, something like that. Basically iomap ioend processing will need
to be updated to play well with the generic IO completion offloading
infrastructure. But that's a separate topic. Just keep things simple
(possibly not perfect in terms of performance) since we should transition
to the generic infrastructure sooner rather than later.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: clear stale xarray tags on folios skipped during writeback
From: Jan Kara @ 2026-06-25 11:33 UTC (permalink / raw)
  To: Gerald Yang; +Cc: tytso, jack, linux-ext4, gerald.yang.tw
In-Reply-To: <20260623080312.2713478-1-gerald.yang@canonical.com>

On Tue 23-06-26 16:02:38, Gerald Yang wrote:
> In data=journal mode, the writeback thread can hit the
> WARN_ON_ONCE(sb_rdonly(sb)) in ext4_journal_check_start() while the
> superblock is being remounted read-only during reboot:
> 
> Workqueue: writeback wb_workfn (flush-253:0)
> RIP: 0010:ext4_journal_check_start+0x8b/0xd0
> Call Trace:
>   __ext4_journal_start_sb+0x3c/0x1e0
>   mpage_prepare_extent_to_map+0x4af/0x580
>   ext4_do_writepages+0x3c0/0x1080
>   ext4_writepages+0xc8/0x1a0
>   do_writepages+0xc4/0x180
>   __writeback_single_inode+0x45/0x2f0
>   writeback_sb_inodes+0x26b/0x5d0
>   __writeback_inodes_wb+0x54/0x100
>   wb_writeback+0x1ac/0x320
>   wb_workfn+0x394/0x470
> 
> And followed by the warning:
> EXT4-fs warning (device vda1): ext4_evict_inode:195: inode #6263:
> comm (sd-umount): data will be lost
> 
> This issue is not reproduced every time, but frequently.
> The reproduction step is to create a VM with 8 CPUs, 16G memory and
> setup data=journal:
> sudo tune2fs -o journal_data /dev/vda1
> Run fio:
> rm -f fiotest
> fio --name=fiotest --rw=randwrite --bs=4k --runtime=6 --ioengine=libaio
> --iodepth=256 --numjobs=8 --filename=fiotest --filesize=30G
> --group_reporting
> Reboot the VM, and check the console output from:
> virsh console testvm
> 
> But there is no dirty inode, folio_clear_dirty_for_io clears PG_dirty
> but leaves tags PAGECACHE_TAG_DIRTY and PAGECACHE_TAG_TOWRITE set which
> are only cleared by __folio_start_writeback.
> In data=journal mode, jbd2 checkpoints the journalled data to its final
> location and clears its own dirty flag without touching folio PG_dirty
> or xarray dirty flags.
> The commit f4a2b42e7891 ("ext4: fix stale xarray tags after writeback")
> fixes when PG_dirty is still set but there is no dirty page.
> Another case is PG_dirty is cleared, but PAGECACHE_TAG_DIRTY and
> PAGECACHE_TAG_TOWRITE is still set. In this case, writeback thread
> checks clean folio and skips it in mpage_prepare_extent_to_map:
> if (!folio_test_dirty(folio) ||
>     ...
>         folio_unlcok(folio);
> 	continue
> 
> And never reaches ext4_bio_write_folio where the commit f4a2b42e7891
> clears the stale xarray tags. Print debug logs after the filesystem
> is remounted read-only:
> writepages RDONLY nrpages=2048 dirtytag=1 wbtag=0 towrite=1 sync=0
> And all folios are actually clean:
> folio idx=3 dirty=0 wb=0 checked=0 dirtybuf=0 jbddirty=0 mapped=1
> ...
> 
> We need to clear the xarray stale tags for such clean folios by
> cycling them through writeback in the skip path, the same way
> f4a2b42e7891 does in ext4_bio_write_folio.
> 
> Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
> Signed-off-by: Gerald Yang <gerald.yang@canonical.com>

Good spotting. One comment to the patch:

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ce99807c5f5b..40da611b0831 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2698,6 +2698,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
>  			    (folio_test_writeback(folio) &&
>  			     (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
>  			    unlikely(folio->mapping != mapping)) {
> +				/*
> +				 * A clean folio (PG_dirty = 0) can still carry
> +				 * stale xarray tags: PAGECACHE_TAG_DIRTY /
> +				 * PAGECACHE_TAG_TOWRITE.
> +				 * In data=journal mode, the data may have been
> +				 * checkpointed to its final location by jbd2
> +				 * but these tags are only cleared by
> +				 * __folio_start_writeback() later.
> +				 * These tags keep the inode on the writeback
> +				 * list and make the writeback thread calls
> +				 * ext4_journal_start on a read-only sb during
> +				 * remounting fs to read-only, and return
> +				 * -EROFS from ext4_journal_check_start.
> +				 * Cycle the clean folio through writeback to
> +				 * drop the stale xarray tags.
> +				 */
> +				if (folio->mapping == mapping &&
> +				    !folio_test_dirty(folio) &&
> +				    !folio_test_writeback(folio)) {
> +					__folio_start_writeback(folio, false);
> +					folio_end_writeback(folio);
> +				}

Instead of this check, just split the top level condition to make things
more obvious:
				if ((folio_test_writeback(folio) &&
				     mpd->wbc->sync_mode == WB_SYNC_NONE) ||
				    unlikely(folio->mapping != mapping)) {
					folio_unlock(folio);
					continue;
				}
				/*
				 * If the folio is clean, skip writing it back.
				 * Cycle the folio through the writeback state
				 * though, to clear stale xarray tags.
				 */
				if (!folio_test_dirty(folio)) {
					if (!folio_test_writeback(folio)) {
						__folio_start_writeback(folio,
									false);
						folio_end_writeback(folio);
					}
					folio_unlock(folio);
					continue;
				}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] ext4: cancel dirty accounting for folios without buffers
From: Jan Kara @ 2026-06-25 11:18 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhu Jia, Zhang Yi, tytso, adilger.kernel, libaokun,
	ojaswin, ritesh.list, linux-ext4, linux-kernel, stable
In-Reply-To: <7471b9cb-158a-4ed4-a1ce-95270ef38974@gmail.com>

On Wed 24-06-26 21:29:58, Zhang Yi wrote:
> On 6/24/2026 8:32 PM, Jan Kara wrote:
> > On Wed 24-06-26 17:52:06, Zhu Jia wrote:
> > > Hi Yi,
> > > 
> > > Thanks for taking a look.
> > > 
> > > Yes, clearing PAGECACHE_TAG_DIRTY/TOWRITE would make the page-cache state
> > > cleaner. I had a version that did this by adding a helper around
> > > folio_cancel_dirty() and clearing the xarray tags after confirming the
> > > folio was still the same clean page-cache entry.
> > > 
> > > It looked like this:
> > > 
> > > static void ext4_cancel_dirty_folio(struct address_space *mapping,
> > > 				    struct folio *folio)
> > > {
> > > 	XA_STATE(xas, &mapping->i_pages, folio->index);
> > > 	unsigned long flags;
> > > 
> > > 	folio_cancel_dirty(folio);
> > > 
> > > 	xas_lock_irqsave(&xas, flags);
> > > 	if (xas_load(&xas) == folio && !folio_test_dirty(folio)) {
> > > 		xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY);
> > > 		xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> > > 	}
> > > 	xas_unlock_irqrestore(&xas, flags);
> > > }
> > > 
> > > The reason I left the tags unchanged in this version is that I was not sure
> > > whether it is appropriate for ext4 to open-code xarray tag cleanup directly.
> > > 
> > > If you think this is the right direction, I can add the helper back and
> > > send a v2.
> > 
> > That was a good judgement! Playing with xarray tags like this in filesystem
> > code is certainly not a good thing. For now, I'd leave the xarray tags
> > dangling - they will be eventually synced with reality on next writeback
> > attempt. If this inconsistency of tags needs to be fixed, the fix belongs
> > to the generic code (so that it can be used in other places as well).
> > 
> > 								Honza
> 
> Yes, I agree. Directly clearing the tag via open code is not a good
> approach. However, I took a look at the !nr_to_submit branch in
> ext4_bio_write_folio(), and it seems to have a similar simple handling
> pattern—it directly calls __folio_start_writeback() and
> folio_end_writeback(), which appears to be an elegant way to clear them.
> Could we also call these two helpers just after folio_cancel_dirty()
> here?

Right, that would be actually doable there and would keep things more
consistent so I think that's a good idea! Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v5] ext4: fix ABBA deadlock in ext4_xattr_inode_cache_find()
From: Aditya Prakash Srivastava @ 2026-06-25 11:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: tytso, adilger.kernel, libaokun, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel, Colin Ian King
In-Reply-To: <emhyf7w5rac5e2bgaxnbphgo5tlbpud24utygg22dobqnhuacx@2t76kf3yooca>

Hi Jan,

My sincere apologies for the rapid succession of patches and the noise
on the list. You are completely right; I got over-excited trying to
quickly address bugs flagged by the Sashiko-AI bot in previous
iterations, and I should have thought them through more carefully. I
will be more careful going forward.

Regarding your feedback:

You are entirely correct that 'is_freeing' is redundant. Since
find_inode_nowait() already returns NULL if the match callback returns
-1, we can simply drop that tracking variable and check !inode.

I also agree that returning -ENOENT on cache misses and unhashed checks
is much more semantically accurate than -ESTALE.

Thank you again for your patience, invaluable reviews, and guidance.

Best regards,
Aditya


On Thu, Jun 25, 2026 at 3:53 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 25-06-26 06:50:09, Aditya Srivastava wrote:
> > From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> >
> > Syzbot/stress-ng reported an ABBA deadlock in ext4 when exercising
> > concurrent xattr workloads (using the ea_inode mount/format option).
> >
> > The deadlock occurs between the running transaction and the eviction
> > thread:
> > - Task 1 (stress-ng): Holds a reference to a shared mbcache_entry (ce)
> >   and calls ext4_xattr_inode_cache_find() -> ext4_iget() to retrieve
> >   the corresponding EA inode. Since the EA inode is currently being
> >   evicted, ext4_iget() blocks in __wait_on_freeing_inode() waiting for
> >   eviction to complete.
> > - Task 2 (eviction thread): Currently evicting the same EA inode in
> >   ext4_evict_ea_inode(). It calls mb_cache_entry_wait_unused(oe) which
> >   blocks waiting for Task 1 to release the reference to the mbcache_entry.
> >
> > To break this deadlock, implement a new ext4_iget() configuration flag
> > named EXT4_IGET_NOWAIT. When set, perform a non-blocking lookup of the
> > inode via VFS's find_inode_nowait() API.
> >
> > If the inode is currently being evicted (marked with I_FREEING or
> > I_WILL_FREE) or created (I_CREATING), or if it is not present in the VFS
> > inode cache (cache miss), simply skip it (returning -ESTALE) rather than
> > waiting for eviction/creation to complete, breaking the ABBA cycle.
> >
> > Since we return -ESTALE immediately on a cache miss, we never attempt to
> > allocate a new inode or call iget_locked(), completely eliminating any
> > TOCTOU race window.
> >
> > If the returned inode is I_NEW, wait for its initialization to clear via
> > wait_on_new_inode(). If initialization fails and the inode is unhashed
> > during wait_on_new_inode() waking up (e.g., due to an I/O read error in
> > another thread), safely drop the reference and return -ESTALE. This
> > unhashed check is executed unconditionally on all cache-hit pathways to
> > properly handle concurrent initialization failures.
> >
> > Finally, standard validation checks (including is_bad_inode,
> > EXT4_EA_INODE_FL, file_acl, and xattr flags) are executed as normal inside
> > check_igot_inode() to fully guarantee VFS-layer safety.
> >
> > In ext4_xattr_inode_cache_find(), invoke ext4_iget() with the new
> > EXT4_IGET_NOWAIT flag to perform the non-blocking cache search.
> >
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Reported-by: Colin Ian King <colin.i.king@gmail.com>
> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219283
> > Fixes: 0a46ef234756 ("ext4: do not create EA inode under buffer lock")
> > Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
>
> Three patch submissions in a day? I think you are overdoing it. Please test
> more, think more about what you do (i.e., really understand what the code
> does, not just whatever some AI agent suggested) and submit less.
>
> > +static int ext4_iget_match(struct inode *inode, u64 ino, void *data)
> > +{
> > +     bool *is_freeing = data;
> > +
> > +     if (inode->i_ino != ino)
> > +             return 0;
> > +     spin_lock(&inode->i_lock);
> > +     if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_CREATING)) {
> > +             if (is_freeing)
> > +                     *is_freeing = true;
> > +             spin_unlock(&inode->i_lock);
> > +             return -1;
> > +     }
> > +     __iget(inode);
> > +     spin_unlock(&inode->i_lock);
> > +     return 1;
> > +}
> > +
> >  struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> >                         ext4_iget_flags flags, const char *function,
> >                         unsigned int line)
> > @@ -5298,9 +5316,26 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> >               return ERR_PTR(-EFSCORRUPTED);
> >       }
> >
> > -     inode = iget_locked(sb, ino);
> > -     if (!inode)
> > -             return ERR_PTR(-ENOMEM);
> > +     if (flags & EXT4_IGET_NOWAIT) {
> > +             bool is_freeing = false;
> > +
> > +             inode = find_inode_nowait(sb, ino, ext4_iget_match, &is_freeing);
> > +             if (is_freeing || !inode)
>
> is_freeing is pointless. Just drop it and use !inode.
>
> > +                     return ERR_PTR(-ESTALE);
>
> I'd prefer -ENOENT here instead of -ESTALE. It's kind of more specific.
>
> > +             if (inode_state_read_once(inode) & I_NEW)
> > +                     wait_on_new_inode(inode);
> > +
> > +             if (unlikely(inode_unhashed(inode))) {
> > +                     iput(inode);
> > +                     return ERR_PTR(-ESTALE);
>
> Here -ENOENT as well please.
>
> Otherwise the patch looks good to me.
>
>                                                                 Honza
>
> > +             }
> > +     } else {
> > +             inode = iget_locked(sb, ino);
> > +             if (!inode)
> > +                     return ERR_PTR(-ENOMEM);
> > +     }
> > +
> >       if (!(inode_state_read_once(inode) & I_NEW)) {
> >               ret = check_igot_inode(inode, flags, function, line);
> >               if (ret) {
> > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> > index 982a1f831e22..21b5670d8503 100644
> > --- a/fs/ext4/xattr.c
> > +++ b/fs/ext4/xattr.c
> > @@ -1550,7 +1550,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
> >
> >       while (ce) {
> >               ea_inode = ext4_iget(inode->i_sb, ce->e_value,
> > -                                  EXT4_IGET_EA_INODE);
> > +                                  EXT4_IGET_EA_INODE | EXT4_IGET_NOWAIT);
> >               if (IS_ERR(ea_inode))
> >                       goto next_entry;
> >               ext4_xattr_inode_set_class(ea_inode);
> > --
> > 2.47.3
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 0/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Jani Nikula @ 2026-06-25 11:05 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>

On Thu, 25 Jun 2026, Steven Rostedt <rostedt@kernel.org> wrote:
> Remove trace_printk.h by creating a trace_controls.h for those places that
> need access to tracing prototypes like tracing_off() and for the places that
> need trace_printk() directly, to have it included directly.
>
> Changse since v3: https://lore.kernel.org/all/20260624081806.120105649@kernel.org/
>
> - Always include trace_controls.h in rcu.h (kernel test robot)
>
>   There are other configs that may include tracing_off() in rcu.h besides
>   the one that had the include of trace_controls.h. Just always include
>   it in that header to be safe.
>
> Steven Rostedt (2):
>       tracing: Move non-trace_printk prototypes into trace_controls.h
>       tracing: Remove trace_printk.h from kernel.h
>
> ----
>  arch/powerpc/kvm/book3s_xics.c         |  1 +
>  arch/powerpc/xmon/xmon.c               |  1 +
>  arch/s390/kernel/ipl.c                 |  1 +
>  arch/s390/kernel/machine_kexec.c       |  1 +
>  drivers/gpu/drm/i915/gt/intel_gtt.h    |  1 +
>  drivers/gpu/drm/i915/i915_gem.h        |  2 ++

For the i915 parts,

Acked-by: Jani Nikula <jani.nikula@intel.com>

for merging via whichever tree.

>  drivers/hwtracing/stm/dummy_stm.c      |  1 +
>  drivers/infiniband/hw/hfi1/trace_dbg.h |  1 +
>  drivers/tty/sysrq.c                    |  1 +
>  drivers/usb/early/xhci-dbc.c           |  1 +
>  fs/ext4/inline.c                       |  1 +
>  include/linux/ftrace.h                 |  2 ++
>  include/linux/kernel.h                 |  1 -
>  include/linux/sunrpc/debug.h           |  1 +
>  include/linux/trace_controls.h         | 54 ++++++++++++++++++++++++++++++++
>  include/linux/trace_printk.h           | 56 ++--------------------------------
>  kernel/debug/debug_core.c              |  1 +
>  kernel/panic.c                         |  1 +
>  kernel/rcu/rcu.h                       |  1 +
>  kernel/rcu/rcutorture.c                |  1 +
>  kernel/trace/ring_buffer_benchmark.c   |  1 +
>  kernel/trace/trace.h                   |  1 +
>  kernel/trace/trace_benchmark.c         |  1 +
>  lib/sys_info.c                         |  1 +
>  samples/fprobe/fprobe_example.c        |  1 +
>  samples/ftrace/ftrace-direct-too.c     |  1 -
>  samples/trace_printk/trace-printk.c    |  1 +
>  27 files changed, 82 insertions(+), 55 deletions(-)
>  create mode 100644 include/linux/trace_controls.h

-- 
Jani Nikula, Intel

^ permalink raw reply

* [PATCH v4 1/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

Remove the prototypes of the code that is not associated with
trace_printk() from trace_printk.h.

These control functions as well as ftrace_dump() and trace_dump_stack()
are used in cases where things go wrong.  The main use case is to do a
trace_dump_stack(); tracing_off(); ftrace_dump(); in a place that detected
that something went wrong, whereas, trace_printk() is added to normal code
during debugging and removed before committing upstream. The dump code is
fine to keep in production.

Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v3: https://patch.msgid.link/20260624081948.147764194@kernel.org

- Move include out of #if statement in rcu.h
  kernel test robot found other configs that could require the
  control functions in rcu.h. Just always include it in that file.

 arch/powerpc/xmon/xmon.c         |  1 +
 arch/s390/kernel/ipl.c           |  1 +
 arch/s390/kernel/machine_kexec.c |  1 +
 drivers/gpu/drm/i915/i915_gem.h  |  1 +
 drivers/tty/sysrq.c              |  1 +
 include/linux/trace_controls.h   | 54 ++++++++++++++++++++++++++++++++
 include/linux/trace_printk.h     | 51 ------------------------------
 kernel/debug/debug_core.c        |  1 +
 kernel/panic.c                   |  1 +
 kernel/rcu/rcu.h                 |  1 +
 kernel/rcu/rcutorture.c          |  1 +
 kernel/trace/trace.h             |  1 +
 kernel/trace/trace_benchmark.c   |  1 +
 lib/sys_info.c                   |  1 +
 14 files changed, 66 insertions(+), 51 deletions(-)
 create mode 100644 include/linux/trace_controls.h

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index cb3a3244ae6f..2135f319e0dd 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -27,6 +27,7 @@
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/debugfs.h>
+#include <linux/trace_controls.h>
 
 #include <asm/ptrace.h>
 #include <asm/smp.h>
diff --git a/arch/s390/kernel/ipl.c b/arch/s390/kernel/ipl.c
index 3c346b02ceb9..baac66cc4de4 100644
--- a/arch/s390/kernel/ipl.c
+++ b/arch/s390/kernel/ipl.c
@@ -22,6 +22,7 @@
 #include <linux/debug_locks.h>
 #include <linux/vmalloc.h>
 #include <linux/secure_boot.h>
+#include <linux/trace_controls.h>
 #include <asm/asm-extable.h>
 #include <asm/machine.h>
 #include <asm/diag.h>
diff --git a/arch/s390/kernel/machine_kexec.c b/arch/s390/kernel/machine_kexec.c
index baeb3dcfc1c8..33f9a89eb3ad 100644
--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -12,6 +12,7 @@
 #include <linux/delay.h>
 #include <linux/reboot.h>
 #include <linux/ftrace.h>
+#include <linux/trace_controls.h>
 #include <linux/debug_locks.h>
 #include <linux/cpufeature.h>
 #include <asm/guarded_storage.h>
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 20b3cb29cfff..1da8fb61c09e 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -116,6 +116,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
 #endif
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
+#include <linux/trace_controls.h>
 #define GEM_TRACE(...) trace_printk(__VA_ARGS__)
 #define GEM_TRACE_ERR(...) do {						\
 	pr_err(__VA_ARGS__);						\
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index c2e4b31b699a..d3f72dc430b8 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -324,6 +324,7 @@ static const struct sysrq_key_op sysrq_showstate_blocked_op = {
 };
 
 #ifdef CONFIG_TRACING
+#include <linux/trace_controls.h>
 #include <linux/ftrace.h>
 
 static void sysrq_ftrace_dump(u8 key)
diff --git a/include/linux/trace_controls.h b/include/linux/trace_controls.h
new file mode 100644
index 000000000000..995b97e963b4
--- /dev/null
+++ b/include/linux/trace_controls.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TRACE_CONTROLS_H
+#define _LINUX_TRACE_CONTROLS_H
+
+
+/*
+ * General tracing related utility functions - trace_printk(),
+ * tracing_on/tracing_off and tracing_start()/tracing_stop
+ *
+ * Use tracing_on/tracing_off when you want to quickly turn on or off
+ * tracing. It simply enables or disables the recording of the trace events.
+ * This also corresponds to the user space /sys/kernel/tracing/tracing_on
+ * file, which gives a means for the kernel and userspace to interact.
+ * Place a tracing_off() in the kernel where you want tracing to end.
+ * From user space, examine the trace, and then echo 1 > tracing_on
+ * to continue tracing.
+ *
+ * tracing_stop/tracing_start has slightly more overhead. It is used
+ * by things like suspend to ram where disabling the recording of the
+ * trace is not enough, but tracing must actually stop because things
+ * like calling smp_processor_id() may crash the system.
+ *
+ * Most likely, you want to use tracing_on/tracing_off.
+ */
+enum ftrace_dump_mode {
+	DUMP_NONE,
+	DUMP_ALL,
+	DUMP_ORIG,
+	DUMP_PARAM,
+};
+
+#ifdef CONFIG_TRACING
+void tracing_on(void);
+void tracing_off(void);
+int tracing_is_on(void);
+void tracing_snapshot(void);
+void tracing_snapshot_alloc(void);
+void tracing_start(void);
+void tracing_stop(void);
+void trace_dump_stack(int skip);
+void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
+#else
+static inline void tracing_start(void) { }
+static inline void tracing_stop(void) { }
+static inline void tracing_on(void) { }
+static inline void tracing_off(void) { }
+static inline int tracing_is_on(void) { return 0; }
+static inline void tracing_snapshot(void) { }
+static inline void tracing_snapshot_alloc(void) { }
+static inline void trace_dump_stack(int skip) { }
+static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
+#endif
+
+#endif /* _LINUX_TRACE_CONTROLS_H */
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 3d54f440dccf..a488ea9e9f85 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -7,43 +7,7 @@
 #include <linux/stddef.h>
 #include <linux/stringify.h>
 
-/*
- * General tracing related utility functions - trace_printk(),
- * tracing_on/tracing_off and tracing_start()/tracing_stop
- *
- * Use tracing_on/tracing_off when you want to quickly turn on or off
- * tracing. It simply enables or disables the recording of the trace events.
- * This also corresponds to the user space /sys/kernel/tracing/tracing_on
- * file, which gives a means for the kernel and userspace to interact.
- * Place a tracing_off() in the kernel where you want tracing to end.
- * From user space, examine the trace, and then echo 1 > tracing_on
- * to continue tracing.
- *
- * tracing_stop/tracing_start has slightly more overhead. It is used
- * by things like suspend to ram where disabling the recording of the
- * trace is not enough, but tracing must actually stop because things
- * like calling smp_processor_id() may crash the system.
- *
- * Most likely, you want to use tracing_on/tracing_off.
- */
-
-enum ftrace_dump_mode {
-	DUMP_NONE,
-	DUMP_ALL,
-	DUMP_ORIG,
-	DUMP_PARAM,
-};
-
 #ifdef CONFIG_TRACING
-void tracing_on(void);
-void tracing_off(void);
-int tracing_is_on(void);
-void tracing_snapshot(void);
-void tracing_snapshot_alloc(void);
-
-extern void tracing_start(void);
-extern void tracing_stop(void);
-
 static inline __printf(1, 2)
 void ____trace_printk_check_format(const char *fmt, ...)
 {
@@ -149,8 +113,6 @@ int __trace_printk(unsigned long ip, const char *fmt, ...);
 extern int __trace_bputs(unsigned long ip, const char *str);
 extern int __trace_puts(unsigned long ip, const char *str);
 
-extern void trace_dump_stack(int skip);
-
 /*
  * The double __builtin_constant_p is because gcc will give us an error
  * if we try to allocate the static variable to fmt if it is not a
@@ -173,19 +135,7 @@ __ftrace_vbprintk(unsigned long ip, const char *fmt, va_list ap);
 
 extern __printf(2, 0) int
 __ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap);
-
-extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
 #else
-static inline void tracing_start(void) { }
-static inline void tracing_stop(void) { }
-static inline void trace_dump_stack(int skip) { }
-
-static inline void tracing_on(void) { }
-static inline void tracing_off(void) { }
-static inline int tracing_is_on(void) { return 0; }
-static inline void tracing_snapshot(void) { }
-static inline void tracing_snapshot_alloc(void) { }
-
 static inline __printf(1, 2)
 int trace_printk(const char *fmt, ...)
 {
@@ -196,7 +146,6 @@ ftrace_vprintk(const char *fmt, va_list ap)
 {
 	return 0;
 }
-static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
 #endif /* CONFIG_TRACING */
 
 #endif
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index b276504c1c6b..f9c83a470c98 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -27,6 +27,7 @@
 
 #define pr_fmt(fmt) "KGDB: " fmt
 
+#include <linux/trace_controls.h>
 #include <linux/pid_namespace.h>
 #include <linux/clocksource.h>
 #include <linux/serial_core.h>
diff --git a/kernel/panic.c b/kernel/panic.c
index 213725b612aa..1415e910371d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -9,6 +9,7 @@
  * This function is used through-out the kernel (including mm and fs)
  * to indicate a major problem.
  */
+#include <linux/trace_controls.h>
 #include <linux/debug_locks.h>
 #include <linux/sched/debug.h>
 #include <linux/interrupt.h>
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index fa6d30ce73d1..735a80df0b30 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -12,6 +12,7 @@
 
 #include <linux/slab.h>
 #include <trace/events/rcu.h>
+#include <linux/trace_controls.h>
 
 /*
  * Grace-period counter management.
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 882a158ada7b..76bf0184b267 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -39,6 +39,7 @@
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/trace_clock.h>
+#include <linux/trace_controls.h>
 #include <asm/byteorder.h>
 #include <linux/torture.h>
 #include <linux/vmalloc.h>
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..2537c33ddd49 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -22,6 +22,7 @@
 #include <linux/ctype.h>
 #include <linux/once_lite.h>
 #include <linux/ftrace_regs.h>
+#include <linux/trace_controls.h>
 #include <linux/llist.h>
 
 #include "pid_list.h"
diff --git a/kernel/trace/trace_benchmark.c b/kernel/trace/trace_benchmark.c
index e19c32f2a938..69cc39008c36 100644
--- a/kernel/trace/trace_benchmark.c
+++ b/kernel/trace/trace_benchmark.c
@@ -3,6 +3,7 @@
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/trace_clock.h>
+#include <linux/trace_controls.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace_benchmark.h"
diff --git a/lib/sys_info.c b/lib/sys_info.c
index f32a06ec9ed4..e3c9ca05601b 100644
--- a/lib/sys_info.c
+++ b/lib/sys_info.c
@@ -8,6 +8,7 @@
 #include <linux/ftrace.h>
 #include <linux/nmi.h>
 #include <linux/sched/debug.h>
+#include <linux/trace_controls.h>
 #include <linux/string.h>
 #include <linux/sysctl.h>
 
-- 
2.53.0



^ permalink raw reply related

* [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>

From: Steven Rostedt <rostedt@goodmis.org>

There have been complaints about trace_printk.h causing more build time
for being in kernel.h if it changes. There is also an effort to clean up
kernel.h to have it not include unneeded header files. Move trace_printk.h
out of kernel.h and place it in the headers and C files that use it.

Link: https://lore.kernel.org/all/CAHk-=wikCBeVFjVXiY4o-oepdbjAoir5+TcAgtL12c4u1TpZLQ@mail.gmail.com/

Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/powerpc/kvm/book3s_xics.c         | 1 +
 drivers/gpu/drm/i915/gt/intel_gtt.h    | 1 +
 drivers/gpu/drm/i915/i915_gem.h        | 1 +
 drivers/hwtracing/stm/dummy_stm.c      | 1 +
 drivers/infiniband/hw/hfi1/trace_dbg.h | 1 +
 drivers/usb/early/xhci-dbc.c           | 1 +
 fs/ext4/inline.c                       | 1 +
 include/linux/ftrace.h                 | 2 ++
 include/linux/kernel.h                 | 1 -
 include/linux/sunrpc/debug.h           | 1 +
 include/linux/trace_printk.h           | 5 +++--
 kernel/trace/ring_buffer_benchmark.c   | 1 +
 samples/fprobe/fprobe_example.c        | 1 +
 samples/ftrace/ftrace-direct-too.c     | 1 -
 samples/trace_printk/trace-printk.c    | 1 +
 15 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 74a44fa702b0..ef5eb596a56e 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -26,6 +26,7 @@
 #if 1
 #define XICS_DBG(fmt...) do { } while (0)
 #else
+#include <linux/trace_printk.h>
 #define XICS_DBG(fmt...) trace_printk(fmt)
 #endif
 
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index b54ee4f25af1..f6f223090760 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -35,6 +35,7 @@
 #define I915_GFP_ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GTT)
+#include <linux/trace_printk.h>
 #define GTT_TRACE(...) trace_printk(__VA_ARGS__)
 #else
 #define GTT_TRACE(...)
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 1da8fb61c09e..f490052e8964 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -117,6 +117,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
 
 #if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
 #include <linux/trace_controls.h>
+#include <linux/trace_printk.h>
 #define GEM_TRACE(...) trace_printk(__VA_ARGS__)
 #define GEM_TRACE_ERR(...) do {						\
 	pr_err(__VA_ARGS__);						\
diff --git a/drivers/hwtracing/stm/dummy_stm.c b/drivers/hwtracing/stm/dummy_stm.c
index 38528ffdc0b3..7c5e48ebfb9f 100644
--- a/drivers/hwtracing/stm/dummy_stm.c
+++ b/drivers/hwtracing/stm/dummy_stm.c
@@ -8,6 +8,7 @@
  */
 
 #undef DEBUG
+#include <linux/trace_printk.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/slab.h>
diff --git a/drivers/infiniband/hw/hfi1/trace_dbg.h b/drivers/infiniband/hw/hfi1/trace_dbg.h
index 58304b91380f..30df5e246586 100644
--- a/drivers/infiniband/hw/hfi1/trace_dbg.h
+++ b/drivers/infiniband/hw/hfi1/trace_dbg.h
@@ -103,6 +103,7 @@ __hfi1_trace_def(IOCTL);
  */
 
 #ifdef HFI1_EARLY_DBG
+#include <linux/trace_printk.h>
 #define hfi1_dbg_early(fmt, ...) \
 	trace_printk(fmt, ##__VA_ARGS__)
 #else
diff --git a/drivers/usb/early/xhci-dbc.c b/drivers/usb/early/xhci-dbc.c
index 41118bba9197..955c73bd601f 100644
--- a/drivers/usb/early/xhci-dbc.c
+++ b/drivers/usb/early/xhci-dbc.c
@@ -30,6 +30,7 @@ static struct xdbc_state xdbc;
 static bool early_console_keep;
 
 #ifdef XDBC_TRACE
+#include <linux/trace_printk.h>
 #define	xdbc_trace	trace_printk
 #else
 static inline void xdbc_trace(const char *fmt, ...) { }
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..0eff4a0c6a6c 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -934,6 +934,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
 }
 
 #ifdef INLINE_DIR_DEBUG
+#include <linux/trace_printk.h>
 void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
 			  void *inline_start, int inline_size)
 {
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 02bc5027523a..b5336a81e619 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -8,6 +8,8 @@
 #define _LINUX_FTRACE_H
 
 #include <linux/trace_recursion.h>
+#include <linux/trace_controls.h>
+#include <linux/trace_printk.h>
 #include <linux/trace_clock.h>
 #include <linux/jump_label.h>
 #include <linux/kallsyms.h>
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e5570a16cbb1..e87a40fbd152 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -31,7 +31,6 @@
 #include <linux/build_bug.h>
 #include <linux/sprintf.h>
 #include <linux/static_call_types.h>
-#include <linux/trace_printk.h>
 #include <linux/util_macros.h>
 #include <linux/wordpart.h>
 
diff --git a/include/linux/sunrpc/debug.h b/include/linux/sunrpc/debug.h
index ab61bed2f7af..7524f5d82fba 100644
--- a/include/linux/sunrpc/debug.h
+++ b/include/linux/sunrpc/debug.h
@@ -29,6 +29,7 @@ extern unsigned int		nlm_debug;
 # define ifdebug(fac)		if (unlikely(rpc_debug & RPCDBG_##fac))
 
 # if IS_ENABLED(CONFIG_SUNRPC_DEBUG_TRACE)
+#  include <linux/trace_printk.h>
 #  define __sunrpc_printk(fmt, ...)	trace_printk(fmt, ##__VA_ARGS__)
 # else
 #  define __sunrpc_printk(fmt, ...)	printk(KERN_DEFAULT fmt, ##__VA_ARGS__)
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index a488ea9e9f85..74ce4f8995c4 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -1,11 +1,12 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_TRACE_PRINTK_H
 #define _LINUX_TRACE_PRINTK_H
+#if !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO)
 
-#include <linux/compiler_attributes.h>
 #include <linux/instruction_pointer.h>
 #include <linux/stddef.h>
 #include <linux/stringify.h>
+#include <linux/stdarg.h>
 
 #ifdef CONFIG_TRACING
 static inline __printf(1, 2)
@@ -147,5 +148,5 @@ ftrace_vprintk(const char *fmt, va_list ap)
 	return 0;
 }
 #endif /* CONFIG_TRACING */
-
+#endif /* !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO) */
 #endif
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 593e3b59e42e..2bb25caebb75 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>
  */
 #include <linux/ring_buffer.h>
+#include <linux/trace_printk.h>
 #include <linux/completion.h>
 #include <linux/kthread.h>
 #include <uapi/linux/sched/types.h>
diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c
index bfe98ce826f3..de81b9b4ca7d 100644
--- a/samples/fprobe/fprobe_example.c
+++ b/samples/fprobe/fprobe_example.c
@@ -12,6 +12,7 @@
 
 #define pr_fmt(fmt) "%s: " fmt, __func__
 
+#include <linux/trace_printk.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/fprobe.h>
diff --git a/samples/ftrace/ftrace-direct-too.c b/samples/ftrace/ftrace-direct-too.c
index bf2411aa6fd7..159190f4103f 100644
--- a/samples/ftrace/ftrace-direct-too.c
+++ b/samples/ftrace/ftrace-direct-too.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include <linux/module.h>
-
 #include <linux/mm.h> /* for handle_mm_fault() */
 #include <linux/ftrace.h>
 #if !defined(CONFIG_ARM64) && !defined(CONFIG_PPC32)
diff --git a/samples/trace_printk/trace-printk.c b/samples/trace_printk/trace-printk.c
index cfc159580263..ff37aeb8523e 100644
--- a/samples/trace_printk/trace-printk.c
+++ b/samples/trace_printk/trace-printk.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/irq_work.h>
-- 
2.53.0



^ permalink raw reply related

* [PATCH v4 0/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
  To: linux-kernel, linux-trace-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
	Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
	linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
	linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
	kvm, intel-gfx

Remove trace_printk.h by creating a trace_controls.h for those places that
need access to tracing prototypes like tracing_off() and for the places that
need trace_printk() directly, to have it included directly.

Changse since v3: https://lore.kernel.org/all/20260624081806.120105649@kernel.org/

- Always include trace_controls.h in rcu.h (kernel test robot)

  There are other configs that may include tracing_off() in rcu.h besides
  the one that had the include of trace_controls.h. Just always include
  it in that header to be safe.

Steven Rostedt (2):
      tracing: Move non-trace_printk prototypes into trace_controls.h
      tracing: Remove trace_printk.h from kernel.h

----
 arch/powerpc/kvm/book3s_xics.c         |  1 +
 arch/powerpc/xmon/xmon.c               |  1 +
 arch/s390/kernel/ipl.c                 |  1 +
 arch/s390/kernel/machine_kexec.c       |  1 +
 drivers/gpu/drm/i915/gt/intel_gtt.h    |  1 +
 drivers/gpu/drm/i915/i915_gem.h        |  2 ++
 drivers/hwtracing/stm/dummy_stm.c      |  1 +
 drivers/infiniband/hw/hfi1/trace_dbg.h |  1 +
 drivers/tty/sysrq.c                    |  1 +
 drivers/usb/early/xhci-dbc.c           |  1 +
 fs/ext4/inline.c                       |  1 +
 include/linux/ftrace.h                 |  2 ++
 include/linux/kernel.h                 |  1 -
 include/linux/sunrpc/debug.h           |  1 +
 include/linux/trace_controls.h         | 54 ++++++++++++++++++++++++++++++++
 include/linux/trace_printk.h           | 56 ++--------------------------------
 kernel/debug/debug_core.c              |  1 +
 kernel/panic.c                         |  1 +
 kernel/rcu/rcu.h                       |  1 +
 kernel/rcu/rcutorture.c                |  1 +
 kernel/trace/ring_buffer_benchmark.c   |  1 +
 kernel/trace/trace.h                   |  1 +
 kernel/trace/trace_benchmark.c         |  1 +
 lib/sys_info.c                         |  1 +
 samples/fprobe/fprobe_example.c        |  1 +
 samples/ftrace/ftrace-direct-too.c     |  1 -
 samples/trace_printk/trace-printk.c    |  1 +
 27 files changed, 82 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/trace_controls.h

^ permalink raw reply

* Re: [PATCH v5] ext4: fix ABBA deadlock in ext4_xattr_inode_cache_find()
From: Jan Kara @ 2026-06-25 10:23 UTC (permalink / raw)
  To: Aditya Srivastava
  Cc: tytso, jack, adilger.kernel, libaokun, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel, Colin Ian King
In-Reply-To: <20260625065010.2332-1-aditya.ansh182@gmail.com>

On Thu 25-06-26 06:50:09, Aditya Srivastava wrote:
> From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> 
> Syzbot/stress-ng reported an ABBA deadlock in ext4 when exercising
> concurrent xattr workloads (using the ea_inode mount/format option).
> 
> The deadlock occurs between the running transaction and the eviction
> thread:
> - Task 1 (stress-ng): Holds a reference to a shared mbcache_entry (ce)
>   and calls ext4_xattr_inode_cache_find() -> ext4_iget() to retrieve
>   the corresponding EA inode. Since the EA inode is currently being
>   evicted, ext4_iget() blocks in __wait_on_freeing_inode() waiting for
>   eviction to complete.
> - Task 2 (eviction thread): Currently evicting the same EA inode in
>   ext4_evict_ea_inode(). It calls mb_cache_entry_wait_unused(oe) which
>   blocks waiting for Task 1 to release the reference to the mbcache_entry.
> 
> To break this deadlock, implement a new ext4_iget() configuration flag
> named EXT4_IGET_NOWAIT. When set, perform a non-blocking lookup of the
> inode via VFS's find_inode_nowait() API.
> 
> If the inode is currently being evicted (marked with I_FREEING or
> I_WILL_FREE) or created (I_CREATING), or if it is not present in the VFS
> inode cache (cache miss), simply skip it (returning -ESTALE) rather than
> waiting for eviction/creation to complete, breaking the ABBA cycle.
> 
> Since we return -ESTALE immediately on a cache miss, we never attempt to
> allocate a new inode or call iget_locked(), completely eliminating any
> TOCTOU race window.
> 
> If the returned inode is I_NEW, wait for its initialization to clear via
> wait_on_new_inode(). If initialization fails and the inode is unhashed
> during wait_on_new_inode() waking up (e.g., due to an I/O read error in
> another thread), safely drop the reference and return -ESTALE. This
> unhashed check is executed unconditionally on all cache-hit pathways to
> properly handle concurrent initialization failures.
> 
> Finally, standard validation checks (including is_bad_inode,
> EXT4_EA_INODE_FL, file_acl, and xattr flags) are executed as normal inside
> check_igot_inode() to fully guarantee VFS-layer safety.
> 
> In ext4_xattr_inode_cache_find(), invoke ext4_iget() with the new
> EXT4_IGET_NOWAIT flag to perform the non-blocking cache search.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Reported-by: Colin Ian King <colin.i.king@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219283
> Fixes: 0a46ef234756 ("ext4: do not create EA inode under buffer lock")
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

Three patch submissions in a day? I think you are overdoing it. Please test
more, think more about what you do (i.e., really understand what the code
does, not just whatever some AI agent suggested) and submit less.

> +static int ext4_iget_match(struct inode *inode, u64 ino, void *data)
> +{
> +	bool *is_freeing = data;
> +
> +	if (inode->i_ino != ino)
> +		return 0;
> +	spin_lock(&inode->i_lock);
> +	if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_CREATING)) {
> +		if (is_freeing)
> +			*is_freeing = true;
> +		spin_unlock(&inode->i_lock);
> +		return -1;
> +	}
> +	__iget(inode);
> +	spin_unlock(&inode->i_lock);
> +	return 1;
> +}
> +
>  struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>  			  ext4_iget_flags flags, const char *function,
>  			  unsigned int line)
> @@ -5298,9 +5316,26 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>  		return ERR_PTR(-EFSCORRUPTED);
>  	}
>  
> -	inode = iget_locked(sb, ino);
> -	if (!inode)
> -		return ERR_PTR(-ENOMEM);
> +	if (flags & EXT4_IGET_NOWAIT) {
> +		bool is_freeing = false;
> +
> +		inode = find_inode_nowait(sb, ino, ext4_iget_match, &is_freeing);
> +		if (is_freeing || !inode)

is_freeing is pointless. Just drop it and use !inode.

> +			return ERR_PTR(-ESTALE);

I'd prefer -ENOENT here instead of -ESTALE. It's kind of more specific.

> +		if (inode_state_read_once(inode) & I_NEW)
> +			wait_on_new_inode(inode);
> +
> +		if (unlikely(inode_unhashed(inode))) {
> +			iput(inode);
> +			return ERR_PTR(-ESTALE);

Here -ENOENT as well please.

Otherwise the patch looks good to me.

								Honza

> +		}
> +	} else {
> +		inode = iget_locked(sb, ino);
> +		if (!inode)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
>  	if (!(inode_state_read_once(inode) & I_NEW)) {
>  		ret = check_igot_inode(inode, flags, function, line);
>  		if (ret) {
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 982a1f831e22..21b5670d8503 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1550,7 +1550,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
>  
>  	while (ce) {
>  		ea_inode = ext4_iget(inode->i_sb, ce->e_value,
> -				     EXT4_IGET_EA_INODE);
> +				     EXT4_IGET_EA_INODE | EXT4_IGET_NOWAIT);
>  		if (IS_ERR(ea_inode))
>  			goto next_entry;
>  		ext4_xattr_inode_set_class(ea_inode);
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v3] ext4: fix ABBA deadlock in ext4_xattr_inode_cache_find()
From: Jan Kara @ 2026-06-25 10:14 UTC (permalink / raw)
  To: Aditya Srivastava
  Cc: tytso, jack, adilger.kernel, libaokun, ritesh.list, yi.zhang,
	linux-ext4, linux-kernel, Colin Ian King
In-Reply-To: <20260625040935.2244-1-aditya.ansh182@gmail.com>

On Thu 25-06-26 04:09:35, Aditya Srivastava wrote:
> From: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>
> 
> Syzbot/stress-ng reported an ABBA deadlock in ext4 when exercising
> concurrent xattr workloads (using the ea_inode mount/format option).
> 
> The deadlock occurs between the running transaction and the eviction
> thread:
> - Task 1 (stress-ng): Holds a reference to a shared mbcache_entry (ce)
>   and calls ext4_xattr_inode_cache_find() -> ext4_iget() to retrieve
>   the corresponding EA inode. Since the EA inode is currently being
>   evicted, ext4_iget() blocks in __wait_on_freeing_inode() waiting for
>   eviction to complete.
> - Task 2 (eviction thread): Currently evicting the same EA inode in
>   ext4_evict_ea_inode(). It calls mb_cache_entry_wait_unused(oe) which
>   blocks waiting for Task 1 to release the reference to the mbcache_entry.
> 
> To break this deadlock, implement a new ext4_iget() configuration flag
> named EXT4_IGET_NOWAIT. When set, perform a non-blocking lookup of the
> inode via VFS's find_inode_nowait() API.
> 
> If the inode is currently being evicted (marked with I_FREEING or
> I_WILL_FREE) or created (I_CREATING), simply skip it (returning -ESTALE)
> rather than waiting for eviction/creation to complete, breaking the ABBA
> cycle. If the returned inode is I_NEW, wait for its initialization to
> clear via wait_on_new_inode(). Finally, standard validation checks
> (including is_bad_inode, EXT4_EA_INODE_FL, file_acl, and xattr flags) are
> executed as normal inside check_igot_inode() to fully guarantee VFS-layer
> safety.
> 
> In ext4_xattr_inode_cache_find(), invoke ext4_iget() with the new
> EXT4_IGET_NOWAIT flag to perform the non-blocking cache search.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Reported-by: Colin Ian King <colin.i.king@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219283
> Fixes: 0a46ef234756 ("ext4: do not create EA inode under buffer lock")
> Signed-off-by: Aditya Prakash Srivastava <aditya.ansh182@gmail.com>

...

> +static int ext4_iget_match(struct inode *inode, u64 ino, void *data)
> +{
> +	bool *is_freeing = data;
> +
> +	if (inode->i_ino != ino)
> +		return 0;
> +	spin_lock(&inode->i_lock);
> +	if (inode_state_read(inode) & (I_FREEING | I_WILL_FREE | I_CREATING)) {
> +		if (is_freeing)
> +			*is_freeing = true;
> +		spin_unlock(&inode->i_lock);
> +		return -1;
> +	}
> +	__iget(inode);
> +	spin_unlock(&inode->i_lock);
> +	return 1;
> +}
> +
>  struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>  			  ext4_iget_flags flags, const char *function,
>  			  unsigned int line)
> @@ -5298,9 +5316,26 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>  		return ERR_PTR(-EFSCORRUPTED);
>  	}
>  
> -	inode = iget_locked(sb, ino);
> -	if (!inode)
> -		return ERR_PTR(-ENOMEM);
> +	if (flags & EXT4_IGET_NOWAIT) {
> +		bool is_freeing = false;
> +
> +		inode = find_inode_nowait(sb, ino, ext4_iget_match, &is_freeing);
> +		if (is_freeing)
> +			return ERR_PTR(-ESTALE);

In case of EXT4_IGET_NOWAIT I just would not bother with returning errors
like this. Just if the inode was not found or cannot be grabbed, I'd return
say -ENOENT from __ext4_iget() and be done with it. Due to the nature of
EXT4_IGET_NOWAIT it is unreliable and the caller thus has to decide what to
do in case the inode cannot be grabbed.

> +		if (!inode) {
> +			inode = iget_locked(sb, ino);
> +			if (!inode)
> +				return ERR_PTR(-ENOMEM);

Hu, what's the point of this? It defeats the purpose of EXT4_IGET_NOWAIT as
it can block...

								Honza

> +		} else {
> +			if (inode_state_read_once(inode) & I_NEW)
> +				wait_on_new_inode(inode);
> +		}
> +	} else {
> +		inode = iget_locked(sb, ino);
> +		if (!inode)
> +			return ERR_PTR(-ENOMEM);
> +	}
> +
>  	if (!(inode_state_read_once(inode) & I_NEW)) {
>  		ret = check_igot_inode(inode, flags, function, line);
>  		if (ret) {
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 982a1f831e22..21b5670d8503 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1550,7 +1550,7 @@ ext4_xattr_inode_cache_find(struct inode *inode, const void *value,
>  
>  	while (ce) {
>  		ea_inode = ext4_iget(inode->i_sb, ce->e_value,
> -				     EXT4_IGET_EA_INODE);
> +				     EXT4_IGET_EA_INODE | EXT4_IGET_NOWAIT);
>  		if (IS_ERR(ea_inode))
>  			goto next_entry;
>  		ext4_xattr_inode_set_class(ea_inode);
> -- 
> 2.47.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Jan Kara @ 2026-06-25  9:42 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
	tytso, adilger.kernel, libaokun, ojaswin, ritesh.list, djwong,
	hch, yi.zhang, yangerkun, yukuai
In-Reply-To: <9ab9281d-c5dc-4183-8bc9-b53b5a4ac9c0@gmail.com>

On Mon 22-06-26 11:32:16, Zhang Yi wrote:
> On 6/18/2026 9:48 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:38, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > > 
> > > For append writes, wait for ordered I/O to complete before updating
> > > i_disksize. This ensures that zeroed data is flushed to disk before the
> > > metadata update, preventing stale data from being exposed during
> > > unaligned post-EOF append writes.
> > > 
> > > Suggested-by: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > 
> > Frankly, this all looks too complex to me. Plus your are adding 32-bytes to
> > struct ext4_inode_info which isn't great either. Why don't you just do
> > filemap_fdatawait() for the byte at old i_disksize and be done with it?
> > 
> > I believe we have to simplify this. All this complexity (and thus
> > maintenance burden) across several patches for the corner case of zeroing
> > tail block on extention is in my opinion difficult to justify.
> > 
> > 								Honza
> 
> Hi, Jan!
> 
> Thanks for the review. I understand the concern about complexity and the
> 32-byte increase to ext4_inode_info. I tried using
> filemap_fdatawait_range() as you suggested, but found two issues where
> this solution doesn't work.
> 
> 1. ioend worker deadlock
> 
> Since worker concurrency resources are limited, we cannot wait for
> another ioend worker to complete within one ioend worker with the same
> work_struct. If the worker calls
> filemap_fdatawait_range(byte_at_old_disksize) to wait for the zeroed
> block's folio writeback to complete, it sleeps holding the only worker
> slot. If the folio contains blocks requiring extent conversion, its
> writeback bit is cleared by iomap_finish_ioends() running inside
> another worker -- which can only run after the current worker finishes
> its batch.
> 
> Concretely:
>   - Worker W1 processes ioend A, calls filemap_fdatawait_range() on
>     the old EOF byte, sleeps.
>   - The zeroed data is in ioend B. bio_endio defers it to
>     i_iomap_ioend_list and calls queue_work().
>   - queue_work() on i_iomap_ioend_work is idempotent: it returns false
>     because the work is currently executing (even though sleeping).
>   - ioend B sits in the list, never gets processed.
>   - The folio writeback bit is only cleared by processing ioend B.
>   - W1 sleeps forever -> deadlock.

Yes, good point. We cannot wait for folio writeback completion from end_io
processing for another folio.

> Therefore, I think we have to put the wakeup logic in
> ext4_iomap_end_bio() that runs in interrupt context without consuming
> a worker thread. The ordered range tracking and wait queue are what
> make that possible.
> 
> 2. Truncate-up needs an accurate state query
> 
> In the follow-up patch 19, ext4_set_inode_size() must make a precise
> decision when updating i_disksize during truncate up.
> 
> This needs a state query: "is there ordered zero I/O in flight right
> now?" If yes, the i_disksize update is deferred to
> ext4_iomap_wb_update_disksize(is_ordered=true), which advances
> i_disksize to i_size when the ordered I/O completes. If no, we must
> advance i_disksize immediately, otherwise we will lose the updating
> forever.
> 
> Therefore, we need to track the state of the ordered range. Simply
> using filemap_fdatawait_range() doesn't work. i_ordered_len serves as
> a maintained state flag that both the ioend completion path and the
> setattr path can read atomically without sleeping.
> 
> Suggestions?

I see. Thanks for explanation. I went back to our discussion back from
February to remind myself about the constraints on the tail block zeroing
and the i_disksize update mechanism. And in the light of complexity of the
current mechanism, I think we've discarded the following possibility too
easily:

* On file extend / truncate up just zero tail folio in the page cache, mark
  it dirty, keep i_disksize at old value, update i_size to the new value,
  add inode to orphan list.
  If the i_disksize was block aligned (and so we skip zeroing), we just
  update i_disksize rightaway.

* In io end processing if the folio for which we end io has a block which
  straddles i_disksize, we update i_disksize to current i_size. We defer
  removing inode from orphan list e.g. to file close time (doing it from
  end_io processing is problematic locking wise as we need i_rwsem for it).

This is a very simple scheme with very good performace. It makes sure stale
data in the tail block cannot be seen on disk after a crash.

Potential problems with this:

* We need to do i_disksize update from io end processing which means starting a
  transaction. So this mechanism has to be restricted to buffered IO iomap
  path due to locking constraints. If the inode uses old ordered mode, it
  has to keep using it also for handling of the tail block zeroing.
  I think that's acceptable as we want to transition everything to iomap
  anyway so this duplicity will eventually go away.

* After a crash we can see i_disksize already updated but file content will
  show zeros. This is not breaking any guarantees, it just changes how ext4
  behaves after a crash. Again, I think this is acceptable tradeoff for the
  simplicity.

What do you think? Any problems I have missed? I'm sorry for poking into
this mechanism again and again but I want to keep the code as simple as
possible to make our life easier in the future...

> Regarding the bloat of ext4_inode_info, perhaps we can drop the
> wait_queue_head_t (24 bytes) and use wait_var_event()/ wake_up_var()
> instead. Would this be acceptable?

Yes, that would be a good way to reduce the bloat if we still need this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 10/16] fs/buffer: Remove fs-layer decryption code
From: Christian Brauner @ 2026-06-25  7:01 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-fscrypt, linux-fsdevel, linux-ext4, linux-f2fs-devel,
	linux-block, Christoph Hellwig, Theodore Ts'o, Andreas Dilger,
	Baokun Li, Jan Kara, Ojaswin Mujoo, Ritesh Harjani, Zhang Yi,
	Jaegeuk Kim, Chao Yu
In-Reply-To: <20260624050334.124606-11-ebiggers@kernel.org>

On 2026-06-23 22:03 -0700, Eric Biggers wrote:
> Now that fscrypt's file contents en/decryption is always implemented
> using blk-crypto when the filesystem is block-based, the fs-layer
> decryption code in fs/buffer.c is unused code.  Remove it.
> 
> Signed-off-by: Eric Biggers <ebiggers@kernel.org>
> ---

Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org>


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox