Linux EXT4 FS development
* [PATCH v10 05/22] fsverity: pass digest size and hash of the all-zeroes block to ->write
From: Andrey Albershteyn @ 2026-05-20 12:37 UTC
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260520123722.405752-1-aalbersh@kernel.org>

Let the filesystem iterate over the hashes in a block and check whether
they are hashes of zeroed data blocks. XFS will use this to decide whether
it wants to store a tree block full of these hashes.
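
As a rough illustration of the check described above, here is a minimal
userspace sketch. The helper name block_is_all_zero_digests() is
hypothetical and is not part of this patch; it only assumes that a Merkle
tree block is an array of digest_size-byte hashes, which is what the new
->write_merkle_tree_block() arguments allow a filesystem to test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): return true if every
 * digest-sized slot in @buf is equal to @zero_digest.
 */
static bool block_is_all_zero_digests(const unsigned char *buf, size_t size,
				      const unsigned char *zero_digest,
				      size_t digest_size)
{
	size_t off;

	/* Assumption: a tree block holds a whole number of digests. */
	assert(digest_size != 0 && size % digest_size == 0);

	for (off = 0; off < size; off += digest_size)
		if (memcmp(buf + off, zero_digest, digest_size) != 0)
			return false;
	return true;
}
```

A filesystem could presumably use such a test in its write callback to
skip writing the block, but how XFS does this is left to the later
patches in the series.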

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Eric Biggers <ebiggers@kernel.org>
---
 fs/btrfs/verity.c        | 6 +++++-
 fs/ext4/verity.c         | 4 +++-
 fs/f2fs/verity.c         | 4 +++-
 fs/verity/enable.c       | 4 +++-
 include/linux/fsverity.h | 6 +++++-
 5 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index 0062b3a55781..fd3696d3f4ce 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -773,11 +773,15 @@ static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
  * @buf:	Merkle tree block to write
  * @pos:	the position of the block in the Merkle tree (in bytes)
  * @size:	the Merkle tree block size (in bytes)
+ * @zero_digest:	the hash of the all-zeroes block
+ * @digest_size:	size of zero_digest, in bytes
  *
  * Returns 0 on success or negative error code on failure
  */
 static int btrfs_write_merkle_tree_block(struct file *file, const void *buf,
-					 u64 pos, unsigned int size)
+					 u64 pos, unsigned int size,
+					 const u8 *zero_digest,
+					 unsigned int digest_size)
 {
 	struct inode *inode = file_inode(file);
 	loff_t merkle_pos = merkle_file_pos(inode);
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index ca61da53f313..347945ac23a4 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -374,7 +374,9 @@ static void ext4_readahead_merkle_tree(struct inode *inode, pgoff_t index,
 }
 
 static int ext4_write_merkle_tree_block(struct file *file, const void *buf,
-					u64 pos, unsigned int size)
+					u64 pos, unsigned int size,
+					const u8 *zero_digest,
+					unsigned int digest_size)
 {
 	pos += ext4_verity_metadata_pos(file_inode(file));
 
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 92ebcc19cab0..b3b3e71604ac 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -270,7 +270,9 @@ static void f2fs_readahead_merkle_tree(struct inode *inode, pgoff_t index,
 }
 
 static int f2fs_write_merkle_tree_block(struct file *file, const void *buf,
-					u64 pos, unsigned int size)
+					u64 pos, unsigned int size,
+					const u8 *zero_digest,
+					unsigned int digest_size)
 {
 	pos += f2fs_verity_metadata_pos(file_inode(file));
 
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index 42dfed1ce0ce..ad4ff71d7dd9 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -50,7 +50,9 @@ static int write_merkle_tree_block(struct file *file, const u8 *buf,
 	int err;
 
 	err = inode->i_sb->s_vop->write_merkle_tree_block(file, buf, pos,
-							  params->block_size);
+							  params->block_size,
+							  params->zero_digest,
+							  params->digest_size);
 	if (err)
 		fsverity_err(inode, "Error %d writing Merkle tree block %lu",
 			     err, index);
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 3c3250f6f272..9e7d946676b9 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -124,6 +124,8 @@ struct fsverity_operations {
 	 * @buf: the Merkle tree block to write
 	 * @pos: the position of the block in the Merkle tree (in bytes)
 	 * @size: the Merkle tree block size (in bytes)
+	 * @zero_digest: the hash of the all-zeroes block
+	 * @digest_size: size of zero_digest, in bytes
 	 *
 	 * This is only called between ->begin_enable_verity() and
 	 * ->end_enable_verity().
@@ -131,7 +133,9 @@ struct fsverity_operations {
 	 * Return: 0 on success, -errno on failure
 	 */
 	int (*write_merkle_tree_block)(struct file *file, const void *buf,
-				       u64 pos, unsigned int size);
+				       u64 pos, unsigned int size,
+				       const u8 *zero_digest,
+				       unsigned int digest_size);
 };
 
 #ifdef CONFIG_FS_VERITY
-- 
2.51.2



* [PATCH v10 04/22] fsverity: generate and store zero-block hash
From: Andrey Albershteyn @ 2026-05-20 12:37 UTC
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260520123722.405752-1-aalbersh@kernel.org>

Compute the hash of one filesystem block's worth of zeros. A filesystem
implementation can decide to elide merkle tree blocks containing only
this hash and synthesize the contents at read time.

Let's pretend that there's a file containing 131 data blocks whose
merkle tree looks roughly like this:

root
 +--leaf0
 |   +--data0
 |   +--data1
 |   +--...
 |   `--data128
 `--leaf1
     +--data129
     +--data130
     `--data131

If data[0-128] are sparse holes, then leaf0 will contain a repeating
sequence of @zero_digest.  Therefore, leaf0 need not be written to disk
because its contents can be synthesized.
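
The example above can be sketched in userspace C. This is only an
illustration under stated assumptions: the function names are
hypothetical, the hole map is a plain bool array, and a leaf that is only
partially covered by data blocks is conservatively kept, since the unused
tail of such a leaf is not made of zero_digest entries. With 4096-byte
blocks and 32-byte SHA-256 digests a leaf holds 128 hashes, so the
diagram's block counts are approximate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of digests that fit in one Merkle tree block. */
static size_t hashes_per_block(size_t block_size, size_t digest_size)
{
	assert(digest_size != 0 && block_size % digest_size == 0);
	return block_size / digest_size;
}

/*
 * Hypothetical check: leaf block @leaf can be elided if it is fully
 * populated and every data block it covers is a hole.
 * @is_hole has one entry per data block.
 */
static bool leaf_can_be_elided(const bool *is_hole, size_t nr_data_blocks,
			       size_t leaf, size_t per_block)
{
	size_t first = leaf * per_block;
	size_t end = first + per_block;
	size_t i;

	/* Missing or partially filled leaf: keep it (conservative). */
	if (end > nr_data_blocks)
		return false;

	for (i = first; i < end; i++)
		if (!is_hole[i])
			return false;
	return true;
}
```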

A subsequent XFS patch will use this to reduce the size of the merkle
tree when dealing with sparse gold master disk images and the like.

Note that this works only on the first level (data holes). fsverity
doesn't store/generate zero_digest for any higher levels.

Add a helper to pre-fill a folio with hashes of empty blocks. This will be
used by iomap to synthesize blocks full of zero hashes on the fly.
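
For readers without a kernel tree at hand, the fill loop can be modelled
in userspace roughly as follows. This is a sketch, not the kernel helper:
fill_zero_digests() is a hypothetical name, a flat byte buffer stands in
for the folio, and memcpy() stands in for memcpy_to_folio().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of the helper: fill @len bytes of @buf, starting at
 * @offset, with back-to-back copies of @zero_digest.
 */
static void fill_zero_digests(unsigned char *buf, size_t offset, size_t len,
			      const unsigned char *zero_digest,
			      size_t digest_size)
{
	size_t off;

	/* Both the start and the length must be digest-aligned. */
	assert(digest_size != 0);
	assert(offset % digest_size == 0 && len % digest_size == 0);

	for (off = offset; off < offset + len; off += digest_size)
		memcpy(buf + off, zero_digest, digest_size);
}
```

The kernel version warns with WARN_ON_ONCE() instead of asserting when
the range is not aligned to the digest size.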

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/verity/fsverity_private.h |  3 +++
 fs/verity/measure.c          |  4 ++--
 fs/verity/open.c             |  3 +++
 fs/verity/pagecache.c        | 22 ++++++++++++++++++++++
 include/linux/fsverity.h     |  8 ++++++++
 5 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index 6e6854c19078..881d46f25e08 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -53,6 +53,9 @@ struct merkle_tree_params {
 	u64 tree_size;			/* Merkle tree size in bytes */
 	unsigned long tree_pages;	/* Merkle tree size in pages */
 
+	/* the hash of an all-zeroes block */
+	u8 zero_digest[FS_VERITY_MAX_DIGEST_SIZE];
+
 	/*
 	 * Starting block index for each tree level, ordered from leaf level (0)
 	 * to root level ('num_levels - 1')
diff --git a/fs/verity/measure.c b/fs/verity/measure.c
index 6a35623ebdf0..818083507885 100644
--- a/fs/verity/measure.c
+++ b/fs/verity/measure.c
@@ -68,8 +68,8 @@ EXPORT_SYMBOL_GPL(fsverity_ioctl_measure);
  * @alg: (out) the digest's algorithm, as a FS_VERITY_HASH_ALG_* value
  * @halg: (out) the digest's algorithm, as a HASH_ALGO_* value
  *
- * Retrieves the fsverity digest of the given file.  The file must have been
- * opened at least once since the inode was last loaded into the inode cache;
+ * Retrieves the fsverity digest of the given file. The
+ * fsverity_ensure_verity_info() must be called on the inode beforehand;
  * otherwise this function will not recognize when fsverity is enabled.
  *
  * The file's fsverity digest consists of @raw_digest in combination with either
diff --git a/fs/verity/open.c b/fs/verity/open.c
index d32d0899df25..875e8850ccba 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -153,6 +153,9 @@ int fsverity_init_merkle_tree_params(struct merkle_tree_params *params,
 		goto out_err;
 	}
 
+	fsverity_hash_block(params, page_address(ZERO_PAGE(0)),
+			    params->zero_digest);
+
 	params->tree_size = offset << log_blocksize;
 	params->tree_pages = PAGE_ALIGN(params->tree_size) >> PAGE_SHIFT;
 	return 0;
diff --git a/fs/verity/pagecache.c b/fs/verity/pagecache.c
index 1819314ecaa3..99f5f53eea98 100644
--- a/fs/verity/pagecache.c
+++ b/fs/verity/pagecache.c
@@ -2,6 +2,7 @@
 /*
  * Copyright 2019 Google LLC
  */
+#include "fsverity_private.h"
 
 #include <linux/export.h>
 #include <linux/fsverity.h>
@@ -56,3 +57,24 @@ void generic_readahead_merkle_tree(struct inode *inode, pgoff_t index,
 		folio_put(folio);
 }
 EXPORT_SYMBOL_GPL(generic_readahead_merkle_tree);
+
+/**
+ * fsverity_fill_zerohash() - fill folio with hashes of zero data block
+ * @folio:	folio to fill
+ * @offset:	offset in the folio to start
+ * @len:	length of the range to fill with hashes
+ * @vi:		fsverity info
+ */
+void fsverity_fill_zerohash(struct folio *folio, size_t offset, size_t len,
+			      struct fsverity_info *vi)
+{
+	size_t off = offset;
+
+	WARN_ON_ONCE(!IS_ALIGNED(offset, vi->tree_params.digest_size));
+	WARN_ON_ONCE(!IS_ALIGNED(len, vi->tree_params.digest_size));
+
+	for (; off < (offset + len); off += vi->tree_params.digest_size)
+		memcpy_to_folio(folio, off, vi->tree_params.zero_digest,
+				vi->tree_params.digest_size);
+}
+EXPORT_SYMBOL_GPL(fsverity_fill_zerohash);
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 5562271bd628..3c3250f6f272 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -201,6 +201,8 @@ bool fsverity_verify_blocks(struct fsverity_info *vi, struct folio *folio,
 			    size_t len, size_t offset);
 void fsverity_verify_bio(struct fsverity_info *vi, struct bio *bio);
 void fsverity_enqueue_verify_work(struct work_struct *work);
+void fsverity_fill_zerohash(struct folio *folio, size_t offset, size_t len,
+			    struct fsverity_info *vi);
 
 #else /* !CONFIG_FS_VERITY */
 
@@ -281,6 +283,12 @@ static inline void fsverity_enqueue_verify_work(struct work_struct *work)
 	WARN_ON_ONCE(1);
 }
 
+static inline void fsverity_fill_zerohash(struct folio *folio, size_t offset,
+		size_t len, struct fsverity_info *vi)
+{
+	WARN_ON_ONCE(1);
+}
+
 #endif	/* !CONFIG_FS_VERITY */
 
 static inline bool fsverity_verify_folio(struct fsverity_info *vi,
-- 
2.51.2



* [PATCH v10 03/22] ovl: use core fsverity ensure info interface
From: Andrey Albershteyn @ 2026-05-20 12:37 UTC
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, Amir Goldstein
In-Reply-To: <20260520123722.405752-1-aalbersh@kernel.org>

fsverity now exposes fsverity_ensure_verity_info(), which can be used
instead of opening the file to ensure that the fsverity info is loaded
and attached to the inode.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/util.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index b41f4788e4f0..1e783cab4fbf 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -16,6 +16,7 @@
 #include <linux/namei.h>
 #include <linux/ratelimit.h>
 #include <linux/overflow.h>
+#include <linux/fsverity.h>
 #include "overlayfs.h"
 
 /* Get write access to upper mnt - may fail if upper sb was remounted ro */
@@ -1352,18 +1353,9 @@ char *ovl_get_redirect_xattr(struct ovl_fs *ofs, const struct path *path, int pa
 int ovl_ensure_verity_loaded(const struct path *datapath)
 {
 	struct inode *inode = d_inode(datapath->dentry);
-	struct file *filp;
 
-	if (IS_VERITY(inode) && fsverity_get_info(inode) == NULL) {
-		/*
-		 * If this inode was not yet opened, the verity info hasn't been
-		 * loaded yet, so we need to do that here to force it into memory.
-		 */
-		filp = kernel_file_open(datapath, O_RDONLY, current_cred());
-		if (IS_ERR(filp))
-			return PTR_ERR(filp);
-		fput(filp);
-	}
+	if (fsverity_active(inode))
+		return fsverity_ensure_verity_info(inode);
 
 	return 0;
 }
-- 
2.51.2



* [PATCH v10 02/22] fsverity: expose ensure_fsverity_info()
From: Andrey Albershteyn @ 2026-05-20 12:37 UTC
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260520123722.405752-1-aalbersh@kernel.org>

This function will be used by XFS scrub to force fsverity activation
and thereby read the fsverity context.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/verity/open.c         | 22 ++++++++++++++++++++--
 include/linux/fsverity.h |  2 ++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/verity/open.c b/fs/verity/open.c
index dfa0d1afe0fe..d32d0899df25 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -344,7 +344,24 @@ int fsverity_get_descriptor(struct inode *inode,
 	return 0;
 }
 
-static int ensure_verity_info(struct inode *inode)
+/**
+ * fsverity_ensure_verity_info() - cache verity info if it's not already cached
+ * @inode: the inode for which verity info should be cached
+ *
+ * Ensure this inode has verity info attached to it, it's assumed the inode
+ * already has fsverity enabled. Read fsverity descriptor and creates verity
+ * based on that.
+ *
+ * This needs to be called at least once before any of the inode's data
+ * can be verified (and thus read at all) or the inode's fsverity digest
+ * retrieved.  fsverity_file_open() calls this already, which handles
+ * normal file accesses.  If a filesystem does any internal (i.e. not
+ * associated with a file descriptor) reads of the file's data or
+ * fsverity digest, it must call this explicitly before doing so.
+ *
+ * Return: 0 on success, -errno on failure
+ */
+int fsverity_ensure_verity_info(struct inode *inode)
 {
 	struct fsverity_info *vi = fsverity_get_info(inode), *found;
 	struct fsverity_descriptor *desc;
@@ -380,12 +397,13 @@ static int ensure_verity_info(struct inode *inode)
 	kfree(desc);
 	return err;
 }
+EXPORT_SYMBOL_GPL(fsverity_ensure_verity_info);
 
 int __fsverity_file_open(struct inode *inode, struct file *filp)
 {
 	if (filp->f_mode & FMODE_WRITE)
 		return -EPERM;
-	return ensure_verity_info(inode);
+	return fsverity_ensure_verity_info(inode);
 }
 EXPORT_SYMBOL_GPL(__fsverity_file_open);
 
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index a8f9aa75b792..5562271bd628 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -309,6 +309,8 @@ static inline int fsverity_file_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+int fsverity_ensure_verity_info(struct inode *inode);
+
 void fsverity_cleanup_inode(struct inode *inode);
 
 struct page *generic_read_merkle_tree_page(struct inode *inode, pgoff_t index);
-- 
2.51.2



* [PATCH v10 01/22] fsverity: report validation errors through fserror to fsnotify
From: Andrey Albershteyn @ 2026-05-20 12:36 UTC
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong
In-Reply-To: <20260520123722.405752-1-aalbersh@kernel.org>

Report verification errors to fsnotify through the recently added fserror
interface.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
---
 fs/verity/verify.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4004a1d42875..db8c350234bb 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -9,6 +9,7 @@
 
 #include <linux/bio.h>
 #include <linux/export.h>
+#include <linux/fserror.h>
 
 #define FS_VERITY_MAX_PENDING_BLOCKS 2
 
@@ -205,6 +206,8 @@ static bool verify_data_block(struct fsverity_info *vi,
 		if (memchr_inv(dblock->data, 0, params->block_size)) {
 			fsverity_err(inode,
 				     "FILE CORRUPTED!  Data past EOF is not zeroed");
+			fserror_report_data_lost(inode, data_pos,
+						 params->block_size, GFP_NOFS);
 			return false;
 		}
 		return true;
@@ -312,6 +315,7 @@ static bool verify_data_block(struct fsverity_info *vi,
 		data_pos, level - 1, params->hash_alg->name, hsize, want_hash,
 		params->hash_alg->name, hsize,
 		level == 0 ? dblock->real_hash : real_hash);
+	fserror_report_data_lost(inode, data_pos, params->block_size, GFP_NOFS);
 error:
 	for (; level > 0; level--) {
 		kunmap_local(hblocks[level - 1].addr);
-- 
2.51.2



* [PATCH v10 00/22] fs-verity support for XFS with post EOF merkle tree
From: Andrey Albershteyn @ 2026-05-20 12:36 UTC
  To: linux-xfs, fsverity, linux-fsdevel, ebiggers
  Cc: Andrey Albershteyn, hch, linux-ext4, linux-f2fs-devel,
	linux-btrfs, linux-unionfs, djwong, david

Hi all,

This patch series adds fs-verity support for XFS. This version stores
the merkle tree beyond the end of the file, the same way ext4 does. The
difference is that the verity descriptor is stored at the next
64k-aligned block after the last block of the merkle tree. This is done
because the merkle tree is sparse: it doesn't store hashes of zeroed
data blocks.
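
The layout described above can be sketched as simple offset arithmetic.
This is a sketch under assumptions, not the XFS implementation: the
function and macro names are hypothetical, the tree start follows the
"Store merkle tree at round_up(i_size, 64k)" note in the v4 changelog
below, and the descriptor is assumed to sit at the tree end rounded up to
64k. Whether XFS adds further spacing when the tree end is already
aligned is not stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the 64k alignment used by the layout. */
#define VERITY_ALIGN 65536ULL

/* Round @x up to a power-of-two @align. */
static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Assumed start of the merkle tree: first 64k boundary at or past EOF. */
static uint64_t merkle_tree_offset(uint64_t i_size)
{
	return round_up_u64(i_size, VERITY_ALIGN);
}

/* Assumed descriptor position: tree end rounded up to 64k. */
static uint64_t descriptor_offset(uint64_t i_size, uint64_t tree_size)
{
	return round_up_u64(merkle_tree_offset(i_size) + tree_size,
			    VERITY_ALIGN);
}
```

Because the descriptor position depends only on the full tree size and
not on which tree blocks are actually written, it stays fixed even when
blocks of zero hashes are left as holes.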

The patchset starts with a few fs-verity preparation patches. Then come
a few patches that allow iomap to work in the post-EOF region. The XFS
fs-verity implementation follows.

The tree is read by iomap into the page cache at the offset of the next
largest folio past the end of the file. The same offset is used on disk.

This patchset also synthesizes merkle tree blocks full of hashes of
zeroed data blocks. These merkle blocks are not stored on disk; they are
holes in the tree.

Testing: the -g verity group passes for 1k, 8k and 4k, with and without
quota, on 4k and 64k page size systems. The -g quick group was tested
with fsverity enabled and disabled, along with overlay/080 and
overlay/089 with XFS as the base. Compile-tested with FSVERITY=no/yes.

This series is based on v7.1-rc4.

kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/aalbersh/xfs-linux.git/log/?h=b4/fsverity

xfsprogs:
https://github.com/alberand/xfsprogs/tree/b4/fsverity

xfstests:
https://github.com/alberand/xfstests/tree/b4/fsverity

Cc: fsverity@lists.linux.dev
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org

Cc: david@fromorbit.com
Cc: djwong@kernel.org
Cc: ebiggers@kernel.org
Cc: hch@lst.de

1: https://lore.kernel.org/linux-xfs/20260223132021.292832-1-hch@lst.de/

---
Changes in v10:
- Rebase to v7.1-rc3 with relevant adjustments
- Initialize ioend->io_vi to NULL to not get write work onto verity wq
- Range diff below
Changes in v9:
- Fix fsverity_fill_zerohash() parameter names
- A few fixes found by sashiko.dev:
	- Replace ip->i_mount->m_attr_geo->blksize with m_sb.sb_blocksize
	- Don't call xfs_trans_cancel() after xfs_trans_commit() in
	  xfs_fsverity_end_enable()
	- Call xfs_fsverity_delete_metadata() if verity enable failed
	- Change start/end type from xfs_fileoff_t to loff_t
	- Return xfs_trans_commit() error from
	  xfs_fsverity_cancel_unwritten()
Changes in v8:
- Return fsverity_ensure_verity_info() errors from
  ovl_ensure_verity_loaded()
Changes in v7:
- Move kerneldoc to fsverity_ensure_verity_info() definition
- Drop patch adding XFS traces
- Fix overly long line in the comment
- Make order of fserror and fsverity_error consistent
- Add overlay patch converting to fsverity_ensure_verity_info()
Changes in v6:
- Removed stub for fsverity_ensure_verity_info() as it's optimized out
- Rename fsverity_folio_zero_hash() to fsverity_fill_zerohash()
- Merge patches 8 to 10 into one
- Merge patch generating zero_hash and fsverity_fill_zerohash() into one
- Add kerneldoc to fsverity_ensure_verity_info()
- Add comments to iomap_block_needs_zeroing()
Changes in v5:
- Add fserror_report_data_lost() for data blocks in page spanning EOF
- Issue fsverity metadata readahead in data readahead
- iomap_fsverity_write() return type fix
- Use of S_ISREG(mode)
- Make 65536 #define instead of open-coded
- Use transaction per unwritten extent removal
- Fetch fsverity_info for all fsverity metadata
- Revert fsverity_folio_zero_hash() stub as used in iomap
- Extend cancel_unwritten to whole file range to remove cow leftovers
- Drop delayed allocation on the COW fork on fsverity completion
Changes in v4:
- Use fserror interface in fsverity instead of fs callback
- Hoist pagecache_read from f2fs/ext4 to fsverity
- Refactor iomap code
- Fetch fsverity_info only for file data and merkle tree holes
- Do not disable preallocation, remove unwritten extents instead
- Offload fsverity hash I/O to fsverity workqueue in read path
- Store merkle tree at round_up(i_size, 64k)
- Add a spacing between merkle tree and fsverity descriptor as next 64k
  aligned block
- Squash helpers into first user commits
- Squash on-disk format changes into single commit
- Drop different offset for pagecache/on-disk
- Don't zero out pages in higher order folios in write path
- Link to v3: https://lore.kernel.org/fsverity/20260217231937.1183679-1-aalbersh@kernel.org/T/#t
Changes in v3:
- Different on-disk and pagecache offset
- Use read path ioends
- Switch to hashtable fsverity info
- Synthesize merkle tree blocks full of zeroes
- Other minor refactors
- Link to v2: https://lore.kernel.org/fsverity/20260114164210.GO15583@frogsfrogsfrogs/T/#t
Changes in v2:
- Move to VFS interface for merkle tree block reading
- Drop patchset for per filesystem workqueues
- Change how offsets of the descriptor and tree metadata are calculated
- Store fs-verity descriptor in data fork side by side with merkle tree
- Simplify iomap changes, remove interface for post eof read/write
- Get rid of extended attribute implementation
- Link to v1: https://lore.kernel.org/r/20250728-fsverity-v1-0-9e5443af0e34@kernel.org

-- >8 --

 1:  aabd8b112385 <  -:  ------------ fs-verity support for XFS with post EOF merkle tree
 -:  ------------ >  1:  ec15fda2f683 fs-verity support for XFS with post EOF merkle tree
 2:  f3a3cc5b6ab2 =  2:  97d22944e17f fsverity: report validation errors through fserror to fsnotify
 3:  c8155669bdff =  3:  f717e55c313c fsverity: expose ensure_fsverity_info()
 4:  83262640936a !  4:  38381b5ec63e ovl: use core fsverity ensure info interface
    @@ fs/overlayfs/util.c: char *ovl_get_redirect_xattr(struct ovl_fs *ofs, const stru
        struct inode *inode = d_inode(datapath->dentry);
     -  struct file *filp;

    --  if (IS_VERITY(inode) && fsverity_get_info(inode) == NULL) {
    +-  if (!fsverity_active(inode) && IS_VERITY(inode)) {
     -          /*
     -           * If this inode was not yet opened, the verity info hasn't been
     -           * loaded yet, so we need to do that here to force it into memory.
 5:  6d63f3ee7604 =  5:  b4b86f757eaa fsverity: generate and store zero-block hash
 6:  800753879b5c =  6:  f9d2eaf0fdcd fsverity: pass digest size and hash of the all-zeroes block to ->write
 7:  c4e06f9d9c08 =  7:  87d8f4495aa5 fsverity: hoist pagecache_read from f2fs/ext4 to fsverity
 8:  34be1a25ffd2 =  8:  2da572c7a3bf iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
 9:  728df1f13e31 !  9:  3669ecfb610e iomap: teach iomap to read files with fsverity
    @@ fs/iomap/buffered-io.c: void iomap_readahead(const struct iomap_ops *ops,
                iter.status = iomap_readahead_iter(&iter, ctx,
                                        &cur_bytes_submitted);

    - ## fs/iomap/ioend.c ##
    -@@ fs/iomap/ioend.c: struct iomap_ioend *iomap_init_ioend(struct inode *inode,
    -   ioend->io_offset = file_offset;
    -   ioend->io_size = bio->bi_iter.bi_size;
    -   ioend->io_sector = bio->bi_iter.bi_sector;
    -+  ioend->io_vi = NULL;
    -   ioend->io_private = NULL;
    -   return ioend;
    - }
    -
      ## include/linux/iomap.h ##
     @@ include/linux/iomap.h: struct iomap_ioend {
        loff_t                  io_offset;      /* offset in the file */
10:  9eb9dd92b762 = 10:  a8989fbd29f1 iomap: introduce iomap_fsverity_write() for writing fsverity metadata
11:  27602df28674 = 11:  04b23426b5cf xfs: introduce fsverity on-disk changes
12:  544c81d26e7c = 12:  ba69a8817d13 xfs: initialize fs-verity on file open
13:  6158f03b2ad4 = 13:  e2b59e6cd4da xfs: don't allow to enable DAX on fs-verity sealed inode
14:  03b1f44c53b2 = 14:  9e2628f17346 xfs: disable direct read path for fs-verity files
15:  96c9f90ade98 = 15:  3dea5f7e5481 xfs: handle fsverity I/O in write/read path
16:  e864ce49e5b3 = 16:  ff837a33e8f5 xfs: use read ioend for fsverity data verification
17:  b06bb3eefa38 = 17:  5c79a1e5f6ff xfs: add fs-verity support
18:  567d5190bfd9 = 18:  af3c36498d1c xfs: remove unwritten extents after preallocations in fsverity metadata
19:  b952307f139d = 19:  744cf7ec0842 xfs: add fs-verity ioctls
20:  3f5382888801 = 20:  d47aac1643a7 xfs: advertise fs-verity being available on filesystem
21:  b87668403694 = 21:  48e83ac7e6c2 xfs: check and repair the verity inode flag state
22:  eec854161680 = 22:  2a9f58d29909 xfs: introduce health state for corrupted fsverity metadata
23:  9ddc2cf10f52 = 23:  b5bf49f2e750 xfs: enable ro-compat fs-verity flag

Andrey Albershteyn (20):
  fsverity: report validation errors through fserror to fsnotify
  fsverity: expose ensure_fsverity_info()
  ovl: use core fsverity ensure info interface
  fsverity: generate and store zero-block hash
  fsverity: pass digest size and hash of the all-zeroes block to ->write
  fsverity: hoist pagecache_read from f2fs/ext4 to fsverity
  iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle
    fsverity
  iomap: teach iomap to read files with fsverity
  iomap: introduce iomap_fsverity_write() for writing fsverity metadata
  xfs: introduce fsverity on-disk changes
  xfs: initialize fs-verity on file open
  xfs: don't allow to enable DAX on fs-verity sealed inode
  xfs: disable direct read path for fs-verity files
  xfs: handle fsverity I/O in write/read path
  xfs: use read ioend for fsverity data verification
  xfs: add fs-verity support
  xfs: remove unwritten extents after preallocations in fsverity
    metadata
  xfs: add fs-verity ioctls
  xfs: introduce health state for corrupted fsverity metadata
  xfs: enable ro-compat fs-verity flag

Darrick J. Wong (2):
  xfs: advertise fs-verity being available on filesystem
  xfs: check and repair the verity inode flag state

 fs/btrfs/verity.c              |   6 +-
 fs/ext4/verity.c               |  36 +--
 fs/f2fs/verity.c               |  34 +--
 fs/iomap/buffered-io.c         | 109 +++++++-
 fs/iomap/ioend.c               |   1 +
 fs/iomap/trace.h               |   3 +-
 fs/overlayfs/util.c            |  14 +-
 fs/verity/enable.c             |   4 +-
 fs/verity/fsverity_private.h   |   3 +
 fs/verity/measure.c            |   4 +-
 fs/verity/open.c               |  25 +-
 fs/verity/pagecache.c          |  55 ++++
 fs/verity/verify.c             |   4 +
 fs/xfs/Makefile                |   1 +
 fs/xfs/libxfs/xfs_bmap.c       |   7 +
 fs/xfs/libxfs/xfs_format.h     |  35 ++-
 fs/xfs/libxfs/xfs_fs.h         |   2 +
 fs/xfs/libxfs/xfs_health.h     |   4 +-
 fs/xfs/libxfs/xfs_inode_buf.c  |   8 +
 fs/xfs/libxfs/xfs_inode_util.c |   2 +
 fs/xfs/libxfs/xfs_sb.c         |   4 +
 fs/xfs/scrub/attr.c            |   7 +
 fs/xfs/scrub/common.c          |  53 ++++
 fs/xfs/scrub/common.h          |   2 +
 fs/xfs/scrub/inode.c           |   7 +
 fs/xfs/scrub/inode_repair.c    |  36 +++
 fs/xfs/xfs_aops.c              |  62 ++++-
 fs/xfs/xfs_bmap_util.c         |   8 +
 fs/xfs/xfs_file.c              |  19 +-
 fs/xfs/xfs_fsverity.c          | 457 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsverity.h          |  28 ++
 fs/xfs/xfs_health.c            |   1 +
 fs/xfs/xfs_inode.h             |   6 +
 fs/xfs/xfs_ioctl.c             |  14 +
 fs/xfs/xfs_iomap.c             |  15 +-
 fs/xfs/xfs_iops.c              |   4 +
 fs/xfs/xfs_message.c           |   4 +
 fs/xfs/xfs_message.h           |   1 +
 fs/xfs/xfs_mount.h             |   4 +
 fs/xfs/xfs_super.c             |   7 +
 include/linux/fsverity.h       |  18 +-
 include/linux/iomap.h          |  13 +
 42 files changed, 1020 insertions(+), 107 deletions(-)
 create mode 100644 fs/xfs/xfs_fsverity.c
 create mode 100644 fs/xfs/xfs_fsverity.h

-- 
2.51.2



* Re: [PATCH] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Hongling Zeng @ 2026-05-20  9:33 UTC
  To: Zhang Yi, Hongling Zeng, tytso, adilger.kernel, libaokun, jack,
	ojaswin, ritesh.list, neil, brauner, jlayton
  Cc: linux-ext4, linux-kernel
In-Reply-To: <889382e7-69cc-4797-bf9a-3eada00bf1b3@huaweicloud.com>

   Hi,

   Good point! I've been systematically fixing this across filesystems.
   Several fixes have already been merged (9p, jfs, orangefs, cachefiles, ...).

   Still working on a few more filesystems.

   Thanks for the suggestion!

   Best regards,
   Hongling

On 2026-05-20 17:19, Zhang Yi wrote:
> On 5/20/2026 3:46 PM, Hongling Zeng wrote:
>> When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
>> It should return NULL instead for success and ERR_PTR() only with
>> negative error codes for failure.
> This point is indeed very easy to overlook. However, why not modify
> other file systems as well? Commit 88d5baf69082 made changes not only
> to ext4.
>
> Thanks,
> Yi.
>
>> Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
>> Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
>> ---
>>   fs/ext4/namei.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>> index 4a47fbd8dd30..8cadaeb15b2b 100644
>> --- a/fs/ext4/namei.c
>> +++ b/fs/ext4/namei.c
>> @@ -3054,7 +3054,7 @@ static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>>   out_retry:
>>   	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
>>   		goto retry;
>> -	return ERR_PTR(err);
>> +	return err ? ERR_PTR(err) : NULL;
>>   }
>>   
>>   /*



* Re: [PATCH] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Zhang Yi @ 2026-05-20  9:19 UTC
  To: Hongling Zeng, tytso, adilger.kernel, libaokun, jack, ojaswin,
	ritesh.list, neil, brauner, jlayton
  Cc: linux-ext4, linux-kernel, zhongling0719
In-Reply-To: <20260520074634.53656-1-zenghongling@kylinos.cn>

On 5/20/2026 3:46 PM, Hongling Zeng wrote:
> When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
> It should return NULL instead for success and ERR_PTR() only with
> negative error codes for failure.

This point is indeed very easy to overlook. However, why not modify
other file systems as well? Commit 88d5baf69082 made changes not only
to ext4.

Thanks,
Yi.

> 
> Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
> Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
> ---
>  fs/ext4/namei.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4a47fbd8dd30..8cadaeb15b2b 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3054,7 +3054,7 @@ static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>  out_retry:
>  	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
>  		goto retry;
> -	return ERR_PTR(err);
> +	return err ? ERR_PTR(err) : NULL;
>  }
>  
>  /*



* Re: [PATCH] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Jan Kara @ 2026-05-20  8:42 UTC
  To: Hongling Zeng
  Cc: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, neil, brauner, jlayton, linux-ext4, linux-kernel,
	zhongling0719
In-Reply-To: <20260520074634.53656-1-zenghongling@kylinos.cn>

On Wed 20-05-26 15:46:34, Hongling Zeng wrote:
> When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
> It should return NULL instead for success and ERR_PTR() only with
> negative error codes for failure.
> 
> Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
> Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>

You're right this is a bit sloppy programming although there's no actual
functional difference at this point. So feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/namei.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 4a47fbd8dd30..8cadaeb15b2b 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3054,7 +3054,7 @@ static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>  out_retry:
>  	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
>  		goto retry;
> -	return ERR_PTR(err);
> +	return err ? ERR_PTR(err) : NULL;
>  }
>  
>  /*
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply
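
A minimal userspace sketch of why the reply above sees no functional
difference today: with the usual definition of ERR_PTR() as a plain cast,
ERR_PTR(0) is already the NULL pointer and IS_ERR() rejects it. The
helpers err_ptr(), is_err() and mkdir_result() are hypothetical stand-ins
written for this illustration; they mirror include/linux/err.h but are not
the kernel code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same bound the kernel uses for error values encoded in pointers. */
#define MAX_ERRNO 4095

/* Hypothetical stand-in for ERR_PTR(): encode an errno as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Hypothetical stand-in for IS_ERR(): true only for the top 4095 values. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Mirrors the return statement proposed in the patch. */
static inline void *mkdir_result(int err)
{
	assert(err <= 0);	/* callers pass 0 or a negative errno */
	return err ? err_ptr(err) : NULL;
}
```

Under these assumptions err_ptr(0) and mkdir_result(0) yield the same
pointer value, so the patch makes the intent explicit without changing
what callers observe.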

* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
From: Zhang Yi @ 2026-05-20  8:18 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <agxmG0MhIywySPaA@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 5/19/2026 9:31 PM, Ojaswin Mujoo wrote:
> On Tue, May 19, 2026 at 04:11:30PM +0530, Ojaswin Mujoo wrote:
>> On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> The data=ordered mode introduces two fundamental conflicts with the
>>> iomap buffered write path, leading to potential deadlocks.
>>>
>>> 1) Lock ordering conflict
>>>    In the iomap writeback path, each folio is processed sequentially:
>>>    the folio lock is acquired first, followed by starting a transaction
>>>    to create block mappings. In data=ordered mode, writeback triggered
>>>    by the journal commit process may attempt to acquire a folio lock
>>>    that is already held by iomap. Meanwhile, iomap, under that same
>>>    folio lock, may start a new transaction and wait for the currently
>>>    committing transaction to finish, resulting in a deadlock.
>>
>> Right, makes sense.
>>>
>>> 2) Partial folio submission not supported
>>>    When block size is smaller than folio size, a folio may contain both
>>>    mapped and unmapped blocks. In data=ordered mode, if the journal
>>>    waits for such a folio to be written back while the regular writeback
>>>    process has already started committing it (with the writeback flag
>>>    set), mapping the remaining unmapped blocks can deadlock. This is
>>>    because the writeback flag is cleared only after the entire folio is
>>>    processed and committed.
>>
>> Okay so IIUC, if we do end up using iomap with ordered data, there are 2
>> codepaths with issues here:
>>
>> txn_commit
>>   ordered data writeback (say it goes via iomap)
>> 	  folio_lock
>> 		iomap_writeback_folio
>> 			folio_start_writeback
>> 			  iomap_writeback_range
>> 				  ext4_map_block
>> 					  txn_start
>> 						  wait for tnx commit - DEADLOCK
>>
>> Currently we avoid this by having ext4_normal_submit_inode_buffers()
>> pass can_map = 0 so the journal flush makes sure not to start any txn.
>>
>> Then we have
>>
>> txn_commit                          background writeback (via iomap)
>>
>>                                     folio_lock()
>>   ordered data writeback
>> 	  folio_lock
>> 			  
>>                                 		iomap_writeback_folio
>>                                 			folio_start_writeback
>>                                 			  iomap_writeback_range
>>                                 				  ext4_map_block
>>                                 					  txn_start
>> 																						  wait for txn commit - DEADLOCK
> 
> Sorry, I forgot to remove the tabs.
> 
> this is what I meant:
> 
> txn_commit
>   ordered data writeback (say it goes via iomap)
>     folio_lock
>     iomap_writeback_folio
>       folio_start_writeback
>         iomap_writeback_range
>           ext4_map_block
>             txn_start
>               wait for txn commit - DEADLOCK
> 
> Currently we avoid this by having ext4_normal_submit_inode_buffers()
> pass can_map = 0 so the journal flush makes sure not to start any txn.

Yeah, but we can also solve this problem by adding similar tags. This
is not the most difficult part.

> 
> Then we have
> 
> txn_commit                          background writeback (via iomap)
> 
>                                     folio_lock()
>   ordered data writeback
>     folio_lock
> 
>                                     iomap_writeback_folio
>                                       folio_start_writeback
>                                         iomap_writeback_range
>                                           ext4_map_block
>                                             txn_start
>                                               wait for txn commit - DEADLOCK
> 
> 
>> 	  
>> Currently, this is taken care of because we try to start the txn before
>> taking any folio locks/starting writeback, and hence we cannot deadlock.

Yeah. You are right! Actually, this deadlock scenario should essentially
belong to the first category: "Lock ordering conflict". This is not the
scenario I want to describe here. The problematic scenario is as
follows:

T0: Assume we have a folio that contains four blocks; from front to
    back, they are A, B, C, D. The last block D is written in delalloc
    mode (the block is not allocated yet).

T1: The writeback process starts to write back data, set writeback flag
    on the folio, allocates block D, and adds it to transaction N's
    order list of jbd2 in JI_WAIT_DATA mode.

T2: This folio completes the writeback and clears the writeback flag.

T3: Before transaction N commits, we punch blocks B and C, and overwrite
    A-C.

T4: Transaction N commit and folio writeback are running concurrently.

Transaction N commit        folio writeback(iomap)

                            iomap_writeback_folio()
                             folio_start_writeback()  -- set writeback

jbd2_journal_finish_inode_data_buffers()
 __filemap_fdatawait_range()
  -- wait writeback flag to clear
                               iomap_writeback_range()
                                ext4_map_block() -- map block B and C
                                 start handle
                                  wait for transaction N commit
                                   - DEADLOCK

IOMAP does not support submitting partial folios during writeback.
Therefore, the writeback flag is cleared only after the entire folio
has been submitted. As a result, the commit of transaction N would wait
forever for this flag to be cleared if we need to map some blocks in
this folio.

Currently, this is handled by ext4_bio_write_folio(), which supports
writing back partial folios. The writeback flag is only set after the
block has been mapped and before the bio is actually issued. There are
no other limitations that would block this flag from being cleared
after the I/O is completed.

>>
>> If the above description makes sense, do you think it'd be good to add
>> them to the commit message. The reason is that although these paths seem
>> obvious when we look at them a lot, it took me a good bit of time to
>> understand what deadlocks you are talking about here :p
>>
>> Having the code traces like above makes it very clear.

Indeed, these problematic cases are complicated and subtle. I also spent
some time recalling this scenario. I can add these code traces in my
next iteration.

Thanks,
Yi.

>>>
>>> To support data=ordered mode, the iomap core would need two invasive
>>> changes:
>>>  - Acquire the transaction handle before locking any folio for
>>>    writeback.
>>>  - Support partial folio submission.
>>>
>>> Both changes are complicated and risk performance regressions.
>>> Therefore, we must avoid using data=ordered mode when converting to the
>>> iomap path.
>>>
>>> Currently, data=ordered mode is used in three scenarios:
>>>  - Append write
>>>  - Post-EOF partial block truncate-up followed by append write
>>>  - Online defragmentation
>>>
>>> We can address the first two without data=ordered mode:
>>>  - For append write: always allocate unwritten blocks (i.e. always
>>>    enable dioread_nolock), preserving the behavior of current
>>>    extent-type inodes.
>>>  - For post-EOF truncate-up + append write: postpone updating i_disksize
>>>    until after the zeroed partial block has been written back.
>>
>> I'm still going through how we are addressing no data=ordered so will
>> get back on this in some time.
>>
>> Thanks,
>> Ojaswin
>>
>>>
>>> Online defragmentation does not yet support iomap; this can be resolved
>>> separately in the future.
>>>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> ---
>>>  fs/ext4/ext4_jbd2.h | 7 ++++++-
>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>> index 63d17c5201b5..26999f173870 100644
>>> --- a/fs/ext4/ext4_jbd2.h
>>> +++ b/fs/ext4/ext4_jbd2.h
>>> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>>>  
>>>  static inline int ext4_should_order_data(struct inode *inode)
>>>  {
>>> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
>>> +	/*
>>> +	 * inodes using the iomap buffered I/O path do not use the
>>> +	 * data=ordered mode.
>>> +	 */
>>> +	return !ext4_inode_buffered_iomap(inode) &&
>>> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
>>>  }
>>>  
>>>  static inline int ext4_should_writeback_data(struct inode *inode)
>>> -- 
>>> 2.52.0
>>>


^ permalink raw reply

* [PATCH] ext4: Fix ERR_PTR(0) in ext4_mkdir()
From: Hongling Zeng @ 2026-05-20  7:46 UTC (permalink / raw)
  To: tytso, adilger.kernel, libaokun, jack, ojaswin, ritesh.list,
	yi.zhang, neil, brauner, jlayton
  Cc: linux-ext4, linux-kernel, zhongling0719, Hongling Zeng

When mkdir succeeds, ext4_mkdir() returns ERR_PTR(0) which is incorrect.
It should return NULL instead for success and ERR_PTR() only with
negative error codes for failure.

Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
---
 fs/ext4/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 4a47fbd8dd30..8cadaeb15b2b 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3054,7 +3054,7 @@ static struct dentry *ext4_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 out_retry:
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
-	return ERR_PTR(err);
+	return err ? ERR_PTR(err) : NULL;
 }
 
 /*
-- 
2.25.1


^ permalink raw reply related

* Re: WARN_ON_ONCE in ext4_journal_check_start() triggered during graceful shutdown
From: Jan Kara @ 2026-05-20  6:54 UTC (permalink / raw)
  To: Sanman Pradhan
  Cc: linux-ext4@vger.kernel.org, tytso@mit.edu, jack@suse.cz,
	adilger.kernel@dilger.ca
In-Reply-To: <SJ0PR05MB870981E19060D579A839B4AFBA002@SJ0PR05MB8709.namprd05.prod.outlook.com>

Hi!

On Tue 19-05-26 22:08:03, Sanman Pradhan wrote:
> We're seeing the following warning on v6.12 during graceful system shutdown:
> 
>   WARNING: CPU: 3 PID: 215 at fs/ext4/ext4_jbd2.c:73 ext4_journal_check_start+0x57/0xa0
>   Workqueue: writeback wb_workfn (flush-259:0)
>   Call Trace:
>    __ext4_journal_start_sb+0x4b/0x190
>    mpage_prepare_extent_to_map+0x41d/0x4c0
>    ext4_do_writepages+0x26d/0xd40
>    ext4_writepages+0xad/0x170
>    do_writepages+0xe2/0x280
>    __writeback_single_inode+0x4a/0x380
>    writeback_sb_inodes+0x220/0x4f0
>    wb_writeback+0x1cb/0x350
>    wb_workfn+0x259/0x440
> 
> This appears to be a race: during remount-ro, sync_filesystem() completes
> and SB_RDONLY is set, but a writeback kworker that was already scheduled
> before the remount still calls ext4_do_writepages(), which attempts to
> start a journal transaction.
> 
> The WARN_ON_ONCE was added by commit e7fc2b31e04c ("ext4: warn on
> read-only filesystem in ext4_journal_check_start()") with the rationale
> that EXT4_FLAGS_SHUTDOWN should catch all cases first.  However, normal
> admin-initiated remount-ro (shutdown path) does not set
> EXT4_FLAGS_SHUTDOWN, so the sb_rdonly() check is still reachable via
> in-flight writeback.
> 
> The warning is not a functional issue — the -EROFS return correctly
> declines the journal start — but it produces a noisy stack trace on every
> affected shutdown.

Right. I think f4a2b42e7891 ("ext4: fix stale xarray tags after writeback")
should fix your issue. See the changelog for explanation.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v2 1/5] iomap: correct the range of a partial dirty clear
From: Zhang Yi @ 2026-05-20  3:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: linux-ext4, brauner, djwong, hch, joannelkoong, yi.zhang,
	yi.zhang, yizhang089, yangerkun, yukuai
In-Reply-To: <20260520030357.679687-1-yi.zhang@huaweicloud.com>

From: Zhang Yi <yi.zhang@huawei.com>

The block range calculation in ifs_clear_range_dirty() is incorrect when
partially clearing a range in a folio. We cannot clear the dirty bit of
the first block or the last block if the start or end offset is not
blocksize-aligned. This has not yet caused any issues since we always
clear a whole folio in iomap_writeback_folio().

Fix this by rounding up the first block to blocksize alignment and
calculating the last block by rounding down (using truncation). Correct
the nr_blks calculation accordingly.

Fixes: 4ce02c679722 ("iomap: Add per-block dirty state tracking to improve performance")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d7b648421a70..64351a448a8b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -176,13 +176,17 @@ static void ifs_clear_range_dirty(struct folio *folio,
 {
 	struct inode *inode = folio->mapping->host;
 	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
-	unsigned int first_blk = (off >> inode->i_blkbits);
-	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
-	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned int first_blk = round_up(off, i_blocksize(inode)) >>
+				 inode->i_blkbits;
+	unsigned int last_blk = (off + len) >> inode->i_blkbits;
 	unsigned long flags;
 
+	if (first_blk >= last_blk)
+		return;
+
 	spin_lock_irqsave(&ifs->state_lock, flags);
-	bitmap_clear(ifs->state, first_blk + blks_per_folio, nr_blks);
+	bitmap_clear(ifs->state, first_blk + blks_per_folio,
+		     last_blk - first_blk);
 	spin_unlock_irqrestore(&ifs->state_lock, flags);
 }
 
-- 
2.52.0


^ permalink raw reply related
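
The change in block range calculation described in the patch above can be
sketched in userspace as follows. This is an illustration only: struct
blk_range, clear_range_old() and clear_range_new() are hypothetical names,
and the block size is passed in as a shift instead of being read from an
inode.

```c
#include <stddef.h>

/* Hypothetical result type: first block index and number of blocks. */
struct blk_range {
	unsigned int first;
	unsigned int nr;	/* 0 means nothing to clear */
};

/* Pre-patch arithmetic: rounds outwards, assumes len > 0. */
static struct blk_range clear_range_old(size_t off, size_t len,
					unsigned int blkbits)
{
	unsigned int first = (unsigned int)(off >> blkbits);
	unsigned int last = (unsigned int)((off + len - 1) >> blkbits);

	return (struct blk_range){ first, last - first + 1 };
}

/* Patched arithmetic: rounds inwards, so only fully covered blocks count. */
static struct blk_range clear_range_new(size_t off, size_t len,
					unsigned int blkbits)
{
	size_t blksize = (size_t)1 << blkbits;
	/* round the start up to the next block boundary */
	unsigned int first = (unsigned int)((off + blksize - 1) >> blkbits);
	/* truncate the end down; this index is exclusive */
	unsigned int last = (unsigned int)((off + len) >> blkbits);

	if (first >= last)
		return (struct blk_range){ first, 0 };
	return (struct blk_range){ first, last - first };
}
```

With 1 KiB blocks (blkbits = 10) and the byte range [512, 2560), the old
arithmetic selects blocks 0 to 2, while the patched arithmetic selects only
block 1, the single block that the range covers completely.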

* [PATCH v2 4/5] iomap: fix out-of-bounds bitmap_set() with zero-length range
From: Zhang Yi @ 2026-05-20  3:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: linux-ext4, brauner, djwong, hch, joannelkoong, yi.zhang,
	yi.zhang, yizhang089, yangerkun, yukuai
In-Reply-To: <20260520030357.679687-1-yi.zhang@huaweicloud.com>

From: Zhang Yi <yi.zhang@huawei.com>

ifs_set_range_dirty() and ifs_set_range_uptodate() compute last_blk
as (off + len - 1) >> i_blkbits.  When off is 0 and len is 0, the
unsigned subtraction underflows to SIZE_MAX, producing a huge
last_blk and nr_blks value that causes bitmap_set() to write far
beyond the ifs->state allocation.

Regarding ifs_set_range_uptodate(), it is temporarily safe because len
cannot be passed in as 0. However, for ifs_set_range_dirty() this is
reachable from __iomap_write_end(): when copy_folio_from_iter_atomic()
returns 0 (e.g. user buffer fault) and the folio is already uptodate,
the guard at the top of __iomap_write_end() does not trigger because
!folio_test_uptodate() is false, and iomap_set_range_dirty() is called
with copied == 0.

Add a !len guard to both functions before the computation, so that a
zero-length range is a no-op.

Fixes: 4ce02c679722 ("iomap: Add per-block dirty state tracking to improve performance")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 27ab33edbdee..76f9a43e283c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -67,11 +67,13 @@ static bool ifs_set_range_uptodate(struct folio *folio,
 		struct iomap_folio_state *ifs, size_t off, size_t len)
 {
 	struct inode *inode = folio->mapping->host;
-	unsigned int first_blk = off >> inode->i_blkbits;
-	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
-	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned int first_blk, last_blk;
 
-	bitmap_set(ifs->state, first_blk, nr_blks);
+	if (len) {
+		first_blk = off >> inode->i_blkbits;
+		last_blk = (off + len - 1) >> inode->i_blkbits;
+		bitmap_set(ifs->state, first_blk, last_blk - first_blk + 1);
+	}
 	return ifs_is_fully_uptodate(folio, ifs);
 }
 
@@ -203,13 +205,17 @@ static void ifs_set_range_dirty(struct folio *folio,
 {
 	struct inode *inode = folio->mapping->host;
 	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
-	unsigned int first_blk = (off >> inode->i_blkbits);
-	unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
-	unsigned int nr_blks = last_blk - first_blk + 1;
+	unsigned int first_blk, last_blk;
 	unsigned long flags;
 
+	if (!len)
+		return;
+
+	first_blk = off >> inode->i_blkbits;
+	last_blk = (off + len - 1) >> inode->i_blkbits;
 	spin_lock_irqsave(&ifs->state_lock, flags);
-	bitmap_set(ifs->state, first_blk + blks_per_folio, nr_blks);
+	bitmap_set(ifs->state, first_blk + blks_per_folio,
+		   last_blk - first_blk + 1);
 	spin_unlock_irqrestore(&ifs->state_lock, flags);
 }
 
-- 
2.52.0


^ permalink raw reply related
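
A small userspace sketch of the arithmetic behind the guard added above.
It shows only two things: that off + len - 1 wraps to SIZE_MAX when both
values are zero, and that checking len first turns a zero-length range
into a no-op. The function names are hypothetical, and the sketch makes no
claim about what the resulting block count would be inside the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* The intermediate value that the unguarded code shifts right. */
static size_t last_byte_unguarded(size_t off, size_t len)
{
	return off + len - 1;	/* wraps to SIZE_MAX when off == len == 0 */
}

/*
 * Guarded count of blocks touched by [off, off + len), rounding outwards.
 * Returns 0 for an empty range, matching the !len check in the patch.
 */
static unsigned int nr_blks_touched(size_t off, size_t len,
				    unsigned int blkbits)
{
	unsigned int first, last;

	if (!len)
		return 0;

	first = (unsigned int)(off >> blkbits);
	last = (unsigned int)((off + len - 1) >> blkbits);
	return last - first + 1;
}
```

For example, with 4 KiB blocks (blkbits = 12), a 2-byte range starting at
offset 4095 touches two blocks, and a zero-length range touches none.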

* [PATCH v2 0/5] iomap: trivial fixes for ext4 conversion
From: Zhang Yi @ 2026-05-20  3:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: linux-ext4, brauner, djwong, hch, joannelkoong, yi.zhang,
	yi.zhang, yizhang089, yangerkun, yukuai

Changes since v1:
 - Add fix tags to patch 01 and 04.
 - In patch 04, change ifs_set_range_uptodate() to always fall through
   to ifs_is_fully_uptodate(), preventing a false-positive uptodate
   mask.
 - Add patch 05, add comments for ifs_clear/set_range_dirty().

v1: https://lore.kernel.org/linux-fsdevel/20260514062955.1183976-1-yi.zhang@huaweicloud.com/


Original Cover-letter:

This patch series contains a few trivial iomap-related fixes in
preparation for converting ext4 buffered I/O to use iomap. 

The first three patches are taken from my ext4 conversion series [1], as
suggested by Christoph. The last patch fixes a bug originally reported
by Sashiko during review of my series; although unrelated to the ext4
conversion, it is worth fixing on its own. Please see the following
patches for detail.

Thanks,
Yi.

[1] https://lore.kernel.org/linux-ext4/20260511072344.191271-1-yi.zhang@huaweicloud.com/

Zhang Yi (5):
  iomap: correct the range of a partial dirty clear
  iomap: support invalidating partial folios
  iomap: fix incorrect did_zero setting in iomap_zero_iter()
  iomap: fix out-of-bounds bitmap_set() with zero-length range
  iomap: add comments for ifs_clear/set_range_dirty()

 fs/iomap/buffered-io.c | 58 ++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 14 deletions(-)

-- 
2.52.0


^ permalink raw reply

* [PATCH v2 2/5] iomap: support invalidating partial folios
From: Zhang Yi @ 2026-05-20  3:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: linux-ext4, brauner, djwong, hch, joannelkoong, yi.zhang,
	yi.zhang, yizhang089, yangerkun, yukuai
In-Reply-To: <20260520030357.679687-1-yi.zhang@huaweicloud.com>

From: Zhang Yi <yi.zhang@huawei.com>

Current iomap_invalidate_folio() can only invalidate an entire folio. If
we truncate a partial folio on a filesystem where the block size is
smaller than the folio size, it will leave behind dirty bits for the
truncated or punched blocks. During the write-back process, it will
attempt to map the invalid hole range. Fortunately, this has not caused
any real problems so far because the ->writeback_range() function
corrects the length.

However, the implementation of FALLOC_FL_ZERO_RANGE in ext4 depends on
the support for invalidating partial folios. When ext4 partially zeroes
out a dirty and unwritten folio, it does not perform a flush first like
XFS. Therefore, if the dirty bits of the corresponding area cannot be
cleared, the zeroed area after writeback remains in the written state
rather than reverting to the unwritten state. Fix this by supporting
invalidation of partial folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 64351a448a8b..876c2f507f58 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -761,6 +761,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
 		ifs_free(folio);
+	} else {
+		iomap_clear_range_dirty(folio, offset, len);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
-- 
2.52.0


^ permalink raw reply related

* [PATCH v2 5/5] iomap: add comments for ifs_clear/set_range_dirty()
From: Zhang Yi @ 2026-05-20  3:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: linux-ext4, brauner, djwong, hch, joannelkoong, yi.zhang,
	yi.zhang, yizhang089, yangerkun, yukuai
In-Reply-To: <20260520030357.679687-1-yi.zhang@huaweicloud.com>

From: Zhang Yi <yi.zhang@huawei.com>

The range alignment strategy differs between ifs_clear_range_dirty() and
ifs_set_range_dirty(). The former rounds inwards to clear only
fully-covered blocks, while the latter rounds outwards to mark any
partially-touched block as dirty. Add comments to document this
asymmetry in block range calculation.

Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 76f9a43e283c..b1d917504d5d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -173,6 +173,13 @@ static unsigned iomap_find_dirty_range(struct folio *folio, u64 *range_start,
 	return range_end - *range_start;
 }
 
+/*
+ * Clear the per-block dirty bits for the range [@off, @off + @len) within a
+ * folio.  The range is rounded inwards so that only blocks fully covered by
+ * the range are cleared.  This is required for operations like folio
+ * invalidation, where we must ensure a block is fully clean before discarding
+ * it.
+ */
 static void ifs_clear_range_dirty(struct folio *folio,
 		struct iomap_folio_state *ifs, size_t off, size_t len)
 {
@@ -200,6 +207,13 @@ static void iomap_clear_range_dirty(struct folio *folio, size_t off, size_t len)
 		ifs_clear_range_dirty(folio, ifs, off, len);
 }
 
+/*
+ * Set the per-block dirty bits for the range [@off, @off + @len) within a
+ * folio.  The range is rounded outwards so that any block partially touched
+ * by the range is marked dirty.  This ensures blocks containing even a
+ * single dirty byte will be included in subsequent writeback, preventing
+ * data loss when partial blocks are written.
+ */
 static void ifs_set_range_dirty(struct folio *folio,
 		struct iomap_folio_state *ifs, size_t off, size_t len)
 {
-- 
2.52.0


^ permalink raw reply related

* [PATCH v2 3/5] iomap: fix incorrect did_zero setting in iomap_zero_iter()
From: Zhang Yi @ 2026-05-20  3:03 UTC (permalink / raw)
  To: linux-fsdevel, linux-xfs
  Cc: linux-ext4, brauner, djwong, hch, joannelkoong, yi.zhang,
	yi.zhang, yizhang089, yangerkun, yukuai
In-Reply-To: <20260520030357.679687-1-yi.zhang@huaweicloud.com>

From: Zhang Yi <yi.zhang@huawei.com>

The did_zero output parameter was unconditionally set after the loop,
which is incorrect. It should only be set when the zeroing operation
actually completes, not when IOMAP_F_STALE is set or when
IOMAP_F_FOLIO_BATCH is set but !folio causes the loop to break early,
or when iomap_iter_advance() returns an error.

This causes did_zero to be incorrectly set when zeroing a clean
unwritten extent because the loop exits early without actually zeroing
any data.

Fix it by using a local variable to track whether any folio was actually
zeroed, and only set did_zero after the loop if zeroing happened.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 876c2f507f58..27ab33edbdee 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1542,6 +1542,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		const struct iomap_write_ops *write_ops)
 {
 	u64 bytes = iomap_length(iter);
+	bool zeroed = false;
 	int status;
 
 	do {
@@ -1560,6 +1561,8 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		/* a NULL folio means we're done with a folio batch */
 		if (!folio) {
 			status = iomap_iter_advance_full(iter);
+			if (status)
+				return status;
 			break;
 		}
 
@@ -1570,6 +1573,7 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 				bytes);
 
 		folio_zero_range(folio, offset, bytes);
+		zeroed = true;
 		folio_mark_accessed(folio);
 
 		ret = iomap_write_end(iter, bytes, bytes, folio);
@@ -1579,10 +1583,10 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 
 		status = iomap_iter_advance(iter, bytes);
 		if (status)
-			break;
+			return status;
 	} while ((bytes = iomap_length(iter)) > 0);
 
-	if (did_zero)
+	if (did_zero && zeroed)
 		*did_zero = true;
 	return status;
 }
-- 
2.52.0


^ permalink raw reply related

* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
From: Zhang Yi @ 2026-05-20  2:49 UTC (permalink / raw)
  To: Ojaswin Mujoo, Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yangerkun,
	yukuai
In-Reply-To: <agyVb1U0US8PVgqo@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 5/20/2026 12:53 AM, Ojaswin Mujoo wrote:
> On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:
>> On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> Introduce initial support for iomap in the buffered I/O path for regular
>>>> files on ext4.
>>>>
>>>>    - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
>>>>      inode uses iomap instead of buffer_head for buffered I/O
>>>>    - Add helper ext4_inode_buffered_iomap() to check the flag
>>>>    - Add new address space operations ext4_iomap_aops with callbacks that
>>>>      will use generic iomap implementations
>>>>    - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
>>>>
>>>> The following callbacks(read_folio(), readahead(), writepages()) are
>>>> provided as placeholders and will be implemented in later patches.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>
>>> Hi Zhang, looks good to me. Just a question below:
>>
>> Hi, Ojaswin! Thank you for the review of this series.
>>
>>>> ---
>>>>   fs/ext4/ext4.h  |  7 +++++++
>>>>   fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
>>>>   2 files changed, 39 insertions(+)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 94283a991e5c..1e27d73d7427 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1972,6 +1972,7 @@ enum {
>>>>   	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>>>>   	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
>>>>   	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
>>>> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
>>>>   };
>>>>   
>>>>   #define EXT4_INODE_BIT_FNS(name, field, offset)				\
>>>> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
>>>>   		!list_empty(&EXT4_I(inode)->i_orphan);
>>>>   }
>>>>   
>>>> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
>>>> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
>>>> +{
>>>> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
>>>> +}
>>>> +
>>>>   /*
>>>>    * Codes for operating systems
>>>>    */
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index b1ef706987c3..178ac2be37b7 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
>>>>   	.iomap_begin = ext4_iomap_begin_report,
>>>>   };
>>>>   
>>>> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void ext4_iomap_readahead(struct readahead_control *rac)
>>>> +{
>>>> +
>>>> +}
>>>> +
>>>> +static int ext4_iomap_writepages(struct address_space *mapping,
>>>> +				 struct writeback_control *wbc)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +
>>>>   /*
>>>>    * For data=journal mode, folio should be marked dirty only when it was
>>>>    * writeably mapped. When that happens, it was already attached to the
>>>> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
>>>>   	.swap_activate		= ext4_iomap_swap_activate,
>>>>   };
>>>>   
>>>> +static const struct address_space_operations ext4_iomap_aops = {
>>>> +	.read_folio		= ext4_iomap_read_folio,
>>>> +	.readahead		= ext4_iomap_readahead,
>>>> +	.writepages		= ext4_iomap_writepages,
>>>> +	.dirty_folio		= iomap_dirty_folio,
>>>> +	.bmap			= ext4_bmap,
>>>> +	.invalidate_folio	= iomap_invalidate_folio,
>>>> +	.release_folio		= iomap_release_folio,
>>>> +	.migrate_folio		= filemap_migrate_folio,
>>>> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
>>>> +	.error_remove_folio	= generic_error_remove_folio,
>>>> +	.swap_activate		= ext4_iomap_swap_activate,
>>>> +};
>>>
>>> So one question, for ->release_folio() we are using
>>> iomap_release_folio() instead of ext4_release_folio() here which doesn't
>>> make the jbd2_journal_try_to_free_buffers() call. IIUC this function
>>> seems to be trying to clean up already checkpointed buffers.
>>>
>>> I wanted to check if ->release_folio() can be called for folios with
>>> ext4 metadata buffers? (from my limited understanding of
>>> shrink_folio_list() -> filemap_release_folio() it seems we can) And if
>>> it can be called, is it okay to skip the
>>> jbd2_journal_try_to_free_buffers call?
>>
>> Here, in ->release_folio(), folio->mapping points to inode->i_data (the
>> file's pagecache), not the block device's pagecache. ext4 metadata
>> resides in the block device's pagecache, which is at a different layer
>> than this release_folio callback. So we don't need to call
>> jbd2_journal_try_to_free_buffers() in the iomap path here.
> 
> Hi Yi,
> 
> Thanks for clarifying, and yes, that's what I was missing! So this
> ->release_folio() is only for data folios. So I guess the
> jbd2_journal_try_to_free_buffers() is mostly to handle the data=journal
> case?

Yes, that's my understanding as well. Meanwhile, the comment for the
jbd2_journal_try_to_free_buffers() function looks quite outdated and
needs to be updated.

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 4885903bbd10..239bcf88ed1c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2139,38 +2139,23 @@ static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
  }

  /**
- * jbd2_journal_try_to_free_buffers() - try to free page buffers.
+ * jbd2_journal_try_to_free_buffers() - try to free folio buffers.
   * @journal: journal for operation
   * @folio: Folio to detach data from.
   *
- * For all the buffers on this page,
- * if they are fully written out ordered data, move them onto BUF_CLEAN
- * so try_to_free_buffers() can reap them.
+ * For each buffer_head on @folio, if the buffer has a journal_head but
+ * is not attached to a running or committing transaction, try to remove
+ * it from the checkpoint list.  This is needed for data=journal mode
+ * where data buffers are journaled: once they are checkpointed, the
+ * journal_head can be detached and the buffer freed.  If any buffer is
+ * still attached to a transaction, the folio cannot be released and we
+ * bail out.  Otherwise we call try_to_free_buffers() to detach all
+ * buffer_heads from the folio.
   *
- * This function returns non-zero if we wish try_to_free_buffers()
- * to be called. We do this if the page is releasable by try_to_free_buffers().
- * We also do it if the page has locked or dirty buffers and the caller wants
- * us to perform sync or async writeout.
+ * For data=ordered and writeback modes, data buffers never have
+ * journal_heads, so this degenerates to a plain try_to_free_buffers().
   *
- * This complicates JBD locking somewhat.  We aren't protected by the
- * BKL here.  We wish to remove the buffer from its committing or
- * running transaction's ->t_datalist via __jbd2_journal_unfile_buffer.
- *
- * This may *change* the value of transaction_t->t_datalist, so anyone
- * who looks at t_datalist needs to lock against this function.
- *
- * Even worse, someone may be doing a jbd2_journal_dirty_data on this
- * buffer.  So we need to lock against that.  jbd2_journal_dirty_data()
- * will come out of the lock with the buffer dirty, which makes it
- * ineligible for release here.
- *
- * Who else is affected by this?  hmm...  Really the only contender
- * is do_get_write_access() - it could be looking at the buffer while
- * journal_try_to_free_buffer() is changing its state.  But that
- * cannot happen because we never reallocate freed data as metadata
- * while the data is part of a transaction.  Yes?
- *
- * Return false on failure, true on success
+ * Return: true if the folio's buffers were freed, false otherwise
   */
  bool jbd2_journal_try_to_free_buffers(journal_t *journal, struct folio *folio)
  {
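
For readers following along, the behaviour that the updated comment
describes can be modelled in userspace roughly as below. This is only a
sketch based on the comment text: struct model_buffer and
model_try_to_free_buffers() are hypothetical names, there is no locking,
and the real function works on buffer_heads and journal_heads.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head; not the kernel structure. */
struct model_buffer {
	bool has_journal_head;	/* buffer still carries a journal_head */
	bool in_transaction;	/* attached to a running/committing transaction */
	bool written_back;	/* checkpoint I/O for this buffer has completed */
};

/*
 * Returns true if every buffer on the (modelled) folio could be released,
 * false as soon as one buffer is still pinned by the journal.
 */
static bool model_try_to_free_buffers(struct model_buffer *bufs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct model_buffer *b = &bufs[i];

		/* data=ordered/writeback data buffers: nothing to detach */
		if (!b->has_journal_head)
			continue;
		/* still part of a transaction: the folio is busy, bail out */
		if (b->in_transaction)
			return false;
		/* not yet checkpointed: cannot drop the journal_head */
		if (!b->written_back)
			return false;
		/* checkpointed: detach the journal_head */
		b->has_journal_head = false;
	}
	/* the real code would call try_to_free_buffers() at this point */
	return true;
}
```

With no journal_heads at all (the ordered and writeback cases) the loop
does nothing and the model simply returns true, which corresponds to the
"degenerates to a plain try_to_free_buffers()" sentence in the comment.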

Thanks,
Yi.





* Re: [PATCH v2] mm: do not install PMD mappings when handling a COW fault
From: Zhang Yi @ 2026-05-20  1:01 UTC (permalink / raw)
  To: William Kucharski
  Cc: linux-mm, linux-ext4, linux-kernel, david, yi.zhang,
	karol.wachowski, wangkefeng.wang, yangerkun, liuyongqiang13
In-Reply-To: <2381B9B8-FD4D-48FE-BBF3-00D3455A8197@linux.dev>

On 5/19/2026 5:02 PM, William Kucharski wrote:
> 
> 
>> On May 19, 2026, at 02:36, Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>>
>> Gentle ping – could anyone take this patch?
>>
>> Thanks,
>> Yi.
> 
> Could you update the comment to clarify why you shouldn't install PMD mappings
> while doing CoW rather than just state it should never be done?

OK, will do.

Thanks,
Yi.

> 
>>
>> On 10/24/2025 6:22 PM, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> When pinning a page with FOLL_LONGTERM in a CoW VMA and a PMD-aligned
>>> (2MB on x86) large folio follow_page_mask() failed to obtain a valid
>>> anonymous page, resulting in an infinite loop issue. The specific
>>> triggering process is as follows:
>>>
>>> 1. User call mmap with a 2MB size in MAP_PRIVATE mode for a file that
>>>   has a 2MB large folio installed in the page cache.
>>>
>>>   addr = mmap(NULL, 2*1024*1024, PROT_READ, MAP_PRIVATE, file_fd, 0);
>>>
>>> 2. The kernel driver pass this mapped address to pin_user_pages_fast()
>>>   in FOLL_LONGTERM mode.
>>>
>>>   pin_user_pages_fast(addr, 512, FOLL_LONGTERM, pages);
>>>
>>>  ->  pin_user_pages_fast()
>>>  |   gup_fast_fallback()
>>>  |    __gup_longterm_locked()
>>>  |     __get_user_pages_locked()
>>>  |      __get_user_pages()
>>>  |       follow_page_mask()
>>>  |        follow_p4d_mask()
>>>  |         follow_pud_mask()
>>>  |          follow_pmd_mask() //pmd_leaf(pmdval) is true because the
>>>  |                            //huge PMD is installed. This is normal
>>>  |                            //in the first round, but it shouldn't
>>>  |                            //happen in the second round.
>>>  |           follow_huge_pmd() //require an anonymous page
>>>  |            return -EMLINK;
>>>  |   faultin_page()
>>>  |    handle_mm_fault()
>>>  |     wp_huge_pmd() //remove PMD and fall back to PTE
>>>  |     handle_pte_fault()
>>>  |      do_pte_missing()
>>>  |       do_fault()
>>>  |        do_read_fault() //FAULT_FLAG_WRITE is not set
>>>  |         finish_fault()
>>>  |          do_set_pmd() //install a huge PMD again, this is wrong!!!
>>>  |      do_wp_page() //create private anonymous pages
>>>  <-    goto retry;
>>>
>>> Due to an incorrectly large PMD set in do_read_fault(),
>>> follow_pmd_mask() always returns -EMLINK, causing an infinite loop.
>>>
>>> David pointed out that we can preallocate a page table and remap the PMD
>>> to be mapped by a PTE table in wp_huge_pmd() in the future. But now we
>>> can avoid this issue by not installing PMD mappings when handling a COW
>>> and unshare fault in do_set_pmd().
>>>
>>> Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
>>> Reported-by: Karol Wachowski <karol.wachowski@linux.intel.com>
>>> Closes: https://lore.kernel.org/linux-ext4/844e5cd4-462e-4b88-b3b5-816465a3b7e3@linux.intel.com/
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> mm/memory.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 0ba4f6b71847..0748a31367df 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5212,6 +5212,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio, struct page *pa
>>> if (!thp_vma_suitable_order(vma, haddr, PMD_ORDER))
>>> return ret;
>>>
>>> + /* We're about to trigger CoW, so never map it through a PMD. */
>>> + if (is_cow_mapping(vma->vm_flags) &&
>>> +    (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)))
>>> + return ret;
>>> +
>>> if (folio_order(folio) != HPAGE_PMD_ORDER)
>>> return ret;
>>> page = &folio->page;
>>
>>
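
The check added to do_set_pmd() in the patch above can be modelled in
userspace as follows. This is a sketch only: the flag values are
illustrative placeholders rather than the kernel's definitions,
sketch_is_cow_mapping() assumes that a CoW mapping is a private mapping
that may be written, and may_install_pmd() is a hypothetical name.

```c
#include <stdbool.h>

/* Illustrative values for this sketch only, not the kernel's headers. */
#define SKETCH_VM_SHARED		0x08UL
#define SKETCH_VM_MAYWRITE		0x20UL
#define SKETCH_FAULT_FLAG_WRITE		0x01U
#define SKETCH_FAULT_FLAG_UNSHARE	0x02U

/* Assumption: a CoW mapping is private (not shared) and may be written. */
static bool sketch_is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (SKETCH_VM_SHARED | SKETCH_VM_MAYWRITE)) ==
	       SKETCH_VM_MAYWRITE;
}

/*
 * Returns false when the fault is about to create a private anonymous
 * copy: installing a huge PMD for the file folio would then send the
 * GUP retry loop back to the same -EMLINK result.
 */
static bool may_install_pmd(unsigned long vm_flags, unsigned int fault_flags)
{
	if (sketch_is_cow_mapping(vm_flags) &&
	    (fault_flags & (SKETCH_FAULT_FLAG_WRITE | SKETCH_FAULT_FLAG_UNSHARE)))
		return false;
	return true;
}
```

A plain read fault on the same private mapping (no WRITE or UNSHARE flag)
still passes this check in the model, as does any fault on a shared
mapping.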



* [PATCHBOMB v9] fuse/libfuse/e2fsprogs: faster file IO for containerized ext4 servers
From: Darrick J. Wong @ 2026-05-19 22:22 UTC (permalink / raw)
  To: linux-fsdevel, linux-ext4, fuse-devel
  Cc: Miklos Szeredi, Bernd Schubert, Joanne Koong, Theodore Ts'o,
	Neal Gompa, Amir Goldstein, Christian Brauner, john

Hi everyone,

This is the ninth public draft of a prototype to connect the Linux
fuse driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices.  With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance.

This effort is now separate from the one to run fuse servers in a
constrained environment via systemd.  Putting fuse servers in a
container gets you all the blast-radius reduction advantages and provides
a pathway to removing less popular filesystem drivers to reduce
maintenance work in the kernel; now we want to trade relaxation of that
isolation for better performance.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands.  Pagecache
writeback is now a directio write.  The fuse server can upsert mappings
into the kernel for cached access (== zero upcalls for rereads and pure
overwrites!) and the iomap cache revalidation code works.

At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its streaming buffered IO
performance.  Random buffered IO is about 85% as fast as the kernel.
Random direct IO is about 80% as fast as the kernel; see the cover
letter for the fuse2fs iomap changes for more details.  Unwritten
extent conversions on random direct writes are especially painful for
fuse+iomap (~90% more overhead) because of the upcalls involved.  And
that's with (now dynamic) debugging turned on!

This series has been rebased to 7.1-rc4 since the eighth RFC, with
the following kernel changes:

1. The BPF stuff has been replaced with a filesystem striping mechanism.
   This is my first attempt ever to implement raid0.

2. Much tightening of the validation code based on Codex reviews so that
   we don't expose more "ABI" than we feel like getting yelled at for
   in 2031.

3. Refactored iomap writeback mapping so that you can use the standard
   iomap_begin functions for that.

4. Better userspace helpers so that fuse server authors don't have to
   know quite so much detail of the innards.

5. The libfuse changes are based off the WIP fuse-service-container
   branch.

There are some questions remaining:

a. fuse2fs doesn't support the ext4 journal.  Urk.

b. I've dropped everything but the kernel patches for basic plumbing and
   file IO paths because frankly they weren't getting looked at.

c. How on earth am I going to separate out the file_operations?
   Will it actually work to say that fuse-iomap only supports local
   filesystems initially?  How many of the "is_iomap?" predicates are
   actually for local filesystems and not the IO path???

I would like to get any part of this submission reviewed for 7.2 now that
this has been collecting comments and tweaks in non-rfc status for 6
months.

Kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-striping

libfuse:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/libfuse.git/log/?h=fuse-iomap-striping

e2fsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse4fs-memory-reclaim

fstests:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuse2fs

--Darrick


* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
From: Ojaswin Mujoo @ 2026-05-19 16:53 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <236657df-71f2-446d-b44b-39865219a850@huaweicloud.com>

On Tue, May 19, 2026 at 08:35:51PM +0800, Zhang Yi wrote:
> On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce initial support for iomap in the buffered I/O path for regular
> >> files on ext4.
> >>
> >>   - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
> >>     inode uses iomap instead of buffer_head for buffered I/O
> >>   - Add helper ext4_inode_buffered_iomap() to check the flag
> >>   - Add new address space operations ext4_iomap_aops with callbacks that
> >>     will use generic iomap implementations
> >>   - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
> >>
> >> The following callbacks(read_folio(), readahead(), writepages()) are
> >> provided as placeholders and will be implemented in later patches.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > Hi Zhang, looks good to me. Just a question below:
> 
> Hi, Ojaswin! Thank you for the review of this series.
> 
> >> ---
> >>  fs/ext4/ext4.h  |  7 +++++++
> >>  fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
> >>  2 files changed, 39 insertions(+)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 94283a991e5c..1e27d73d7427 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -1972,6 +1972,7 @@ enum {
> >>  	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
> >>  	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
> >>  	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
> >> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
> >>  };
> >>  
> >>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> >> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
> >>  		!list_empty(&EXT4_I(inode)->i_orphan);
> >>  }
> >>  
> >> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
> >> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
> >> +{
> >> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
> >> +}
> >> +
> >>  /*
> >>   * Codes for operating systems
> >>   */
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index b1ef706987c3..178ac2be37b7 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >>  	.iomap_begin = ext4_iomap_begin_report,
> >>  };
> >>  
> >> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static void ext4_iomap_readahead(struct readahead_control *rac)
> >> +{
> >> +
> >> +}
> >> +
> >> +static int ext4_iomap_writepages(struct address_space *mapping,
> >> +				 struct writeback_control *wbc)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >>  /*
> >>   * For data=journal mode, folio should be marked dirty only when it was
> >>   * writeably mapped. When that happens, it was already attached to the
> >> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
> >>  	.swap_activate		= ext4_iomap_swap_activate,
> >>  };
> >>  
> >> +static const struct address_space_operations ext4_iomap_aops = {
> >> +	.read_folio		= ext4_iomap_read_folio,
> >> +	.readahead		= ext4_iomap_readahead,
> >> +	.writepages		= ext4_iomap_writepages,
> >> +	.dirty_folio		= iomap_dirty_folio,
> >> +	.bmap			= ext4_bmap,
> >> +	.invalidate_folio	= iomap_invalidate_folio,
> >> +	.release_folio		= iomap_release_folio,
> >> +	.migrate_folio		= filemap_migrate_folio,
> >> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
> >> +	.error_remove_folio	= generic_error_remove_folio,
> >> +	.swap_activate		= ext4_iomap_swap_activate,
> >> +};
> > 
> > So one question: for ->release_folio() we are using
> > iomap_release_folio() instead of ext4_release_folio() here, which doesn't
> > make the jbd2_journal_try_to_free_buffers() call. IIUC this function
> > seems to be trying to clean up already checkpointed buffers.
> > 
> > I wanted to check if ->release_folio() can be called for folios with
> > ext4 metadata buffers? (from my limited understanding of
> > shrink_folio_list() -> filemap_release_folio() it seems we can) And if
> > it can be called, is it okay to skip the
> > jbd2_journal_try_to_free_buffers call?
> 
> Here, in ->release_folio(), folio->mapping points to inode->i_data (the
> file's pagecache), not the block device's pagecache. ext4 metadata
> resides in the block device's pagecache, which is at a different layer
> than this release_folio callback. So we don't need to call
> jbd2_journal_try_to_free_buffers() in the iomap path here.

Hi Yi,

Thanks for clarifying, and yes, that's what I was missing! So this
->release_folio() is only for data folios. So I guess
jbd2_journal_try_to_free_buffers() is mostly there to handle the
data=journal case?

Regardless, with that clarification feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> 
> Thanks,
> Yi.
> 
> > 
> > Regards,
> > ojaswin
> > 
> >> +
> >>  static const struct address_space_operations ext4_dax_aops = {
> >>  	.writepages		= ext4_dax_writepages,
> >>  	.dirty_folio		= noop_dirty_folio,
> >> @@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
> >>  	}
> >>  	if (IS_DAX(inode))
> >>  		inode->i_mapping->a_ops = &ext4_dax_aops;
> >> +	else if (ext4_inode_buffered_iomap(inode))
> >> +		inode->i_mapping->a_ops = &ext4_iomap_aops;
> >>  	else if (test_opt(inode->i_sb, DELALLOC))
> >>  		inode->i_mapping->a_ops = &ext4_da_aops;
> >>  	else
> >> -- 
> >> 2.52.0
> >>
> 


* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
From: Ojaswin Mujoo @ 2026-05-19 13:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <agw-Wt4c4Fwevezk@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On Tue, May 19, 2026 at 04:11:30PM +0530, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > The data=ordered mode introduces two fundamental conflicts with the
> > iomap buffered write path, leading to potential deadlocks.
> > 
> > 1) Lock ordering conflict
> >    In the iomap writeback path, each folio is processed sequentially:
> >    the folio lock is acquired first, followed by starting a transaction
> >    to create block mappings. In data=ordered mode, writeback triggered
> >    by the journal commit process may attempt to acquire a folio lock
> >    that is already held by iomap. Meanwhile, iomap, under that same
> >    folio lock, may start a new transaction and wait for the currently
> >    committing transaction to finish, resulting in a deadlock.
> 
> Right, makes sense.
> > 
> > 2) Partial folio submission not supported
> >    When block size is smaller than folio size, a folio may contain both
> >    mapped and unmapped blocks. In data=ordered mode, if the journal
> >    waits for such a folio to be written back while the regular writeback
> >    process has already started committing it (with the writeback flag
> >    set), mapping the remaining unmapped blocks can deadlock. This is
> >    because the writeback flag is cleared only after the entire folio is
> >    processed and committed.
> 
> Okay so IIUC, if we do end up using iomap with ordered data, there are 2
> codepaths with issues here:
> 
> txn_commit
>   ordered data writeback (say it goes via iomap)
> 	  folio_lock
> 		iomap_writeback_folio
> 			folio_start_writeback
> 			  iomap_writeback_range
> 				  ext4_map_block
> 					  txn_start
> 						  wait for tnx commit - DEADLOCK
> 
> Currently we avoid this by having ext4_normal_submit_inode_buffers()
> pass can_map = 0 so journal flush makese sure not to start any txn.
> 
> Then we have
> 
> txn_commit                          background writeback (via iomap)
> 
>                                     folio_lock()
>   ordered data writeback
> 	  folio_lock
> 			  
>                                 		iomap_writeback_folio
>                                 			folio_start_writeback
>                                 			  iomap_writeback_range
>                                 				  ext4_map_block
>                                 					  txn_start
> 																						  wait for txn commit - DEADLOCK

Sorry, I forgot to remove the tabs.

This is what I meant:

txn_commit
  ordered data writeback (say it goes via iomap)
    folio_lock
    iomap_writeback_folio
      folio_start_writeback
        iomap_writeback_range
          ext4_map_block
            txn_start
              wait for txn commit - DEADLOCK

Currently we avoid this by having ext4_normal_submit_inode_buffers()
pass can_map = 0, so the journal flush makes sure not to start any txn.

Then we have

txn_commit                          background writeback (via iomap)

                                    folio_lock()
  ordered data writeback
    folio_lock

                                    iomap_writeback_folio
                                      folio_start_writeback
                                        iomap_writeback_range
                                          ext4_map_block
                                            txn_start
                                              wait for txn commit - DEADLOCK
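
The second trace is a two-party wait cycle: the commit thread waits for
the folio lock held by background writeback, while writeback waits for
the commit to finish. A minimal userspace sketch of that reasoning,
using a hypothetical wait-for table (none of these names exist in ext4
or jbd2):

```c
#include <stdbool.h>

/* Hypothetical actors in the trace; NOBODY means "not blocked". */
enum actor { TXN_COMMIT, BG_WRITEBACK, NR_ACTORS, NOBODY = -1 };

/*
 * waits_for[a] names the actor that a is blocked on, or NOBODY.
 * Returns true if following the chain from start leads back to start.
 */
static bool wait_chain_deadlocks(const int waits_for[], int nr, int start)
{
	int cur = start;

	for (int steps = 0; steps < nr; steps++) {
		cur = waits_for[cur];
		if (cur == NOBODY)
			return false;	/* chain ends: someone can make progress */
		if (cur == start)
			return true;	/* came back to where we began: cycle */
	}
	return false;
}
```

In the ordered-data-over-iomap case both table entries point at each
other, so the chain loops. When the transaction is started before any
folio lock is taken, writeback never waits on the commit while holding
the lock, and the chain ends.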


> 	  
> Currently, this is taken care because we try to start the txn before
> taking any folio locks/starting writeback, and hence we cannot deadlock.
> 
> If the above description makes sense, do you think it'd be good to add
> them to the commit message. The reason is that although these paths seem
> obvious when we look at them a lot, it took me a good bit of time to
> understand what deadlocks you are talking about here :p
> 
> Having the code traces like above makes it very clear.
> > 
> > To support data=ordered mode, the iomap core would need two invasive
> > changes:
> >  - Acquire the transaction handle before locking any folio for
> >    writeback.
> >  - Support partial folio submission.
> > 
> > Both changes are complicated and risk performance regressions.
> > Therefore, we must avoid using data=ordered mode when converting to the
> > iomap path.
> > 
> > Currently, data=ordered mode is used in three scenarios:
> >  - Append write
> >  - Post-EOF partial block truncate-up followed by append write
> >  - Online defragmentation
> > 
> > We can address the first two without data=ordered mode:
> >  - For append write: always allocate unwritten blocks (i.e. always
> >    enable dioread_nolock), preserving the behavior of current
> >    extent-type inodes.
> >  - For post-EOF truncate-up + append write: postpone updating i_disksize
> >    until after the zeroed partial block has been written back.
> 
> I'm still going through how we are addressing no data=ordered so will
> get back on this in some time.
> 
> Thanks,
> Ojaswin
> 
> > 
> > Online defragmentation does not yet support iomap; this can be resolved
> > separately in the future.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  fs/ext4/ext4_jbd2.h | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 63d17c5201b5..26999f173870 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
> >  
> >  static inline int ext4_should_order_data(struct inode *inode)
> >  {
> > -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> > +	/*
> > +	 * inodes using the iomap buffered I/O path do not use the
> > +	 * data=ordered mode.
> > +	 */
> > +	return !ext4_inode_buffered_iomap(inode) &&
> > +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
> >  }
> >  
> >  static inline int ext4_should_writeback_data(struct inode *inode)
> > -- 
> > 2.52.0
> > 


* Re: [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O
From: Zhang Yi @ 2026-05-19 12:35 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <agwtJq5-B2E-t7zT@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 5/19/2026 5:28 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:24PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> Introduce initial support for iomap in the buffered I/O path for regular
>> files on ext4.
>>
>>   - Add a new inode state flag EXT4_STATE_BUFFERED_IOMAP to indicate the
>>     inode uses iomap instead of buffer_head for buffered I/O
>>   - Add helper ext4_inode_buffered_iomap() to check the flag
>>   - Add new address space operations ext4_iomap_aops with callbacks that
>>     will use generic iomap implementations
>>   - Add ext4_iomap_aops to ext4_set_aops() when the flag is set
>>
>> The following callbacks(read_folio(), readahead(), writepages()) are
>> provided as placeholders and will be implemented in later patches.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Hi Zhang, looks good to me. Just a question below:

Hi, Ojaswin! Thank you for the review of this series.

>> ---
>>  fs/ext4/ext4.h  |  7 +++++++
>>  fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
>>  2 files changed, 39 insertions(+)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 94283a991e5c..1e27d73d7427 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1972,6 +1972,7 @@ enum {
>>  	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>>  	EXT4_STATE_FC_FLUSHING_DATA,	/* Fast commit flushing data */
>>  	EXT4_STATE_ORPHAN_FILE,		/* Inode orphaned in orphan file */
>> +	EXT4_STATE_BUFFERED_IOMAP,	/* Inode use iomap for buffered IO */
>>  };
>>  
>>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
>> @@ -2040,6 +2041,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
>>  		!list_empty(&EXT4_I(inode)->i_orphan);
>>  }
>>  
>> +/* Whether the inode pass through the iomap infrastructure for buffered I/O */
>> +static inline bool ext4_inode_buffered_iomap(struct inode *inode)
>> +{
>> +	return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
>> +}
>> +
>>  /*
>>   * Codes for operating systems
>>   */
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index b1ef706987c3..178ac2be37b7 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3908,6 +3908,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
>>  	.iomap_begin = ext4_iomap_begin_report,
>>  };
>>  
>> +static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
>> +{
>> +	return 0;
>> +}
>> +
>> +static void ext4_iomap_readahead(struct readahead_control *rac)
>> +{
>> +
>> +}
>> +
>> +static int ext4_iomap_writepages(struct address_space *mapping,
>> +				 struct writeback_control *wbc)
>> +{
>> +	return 0;
>> +}
>> +
>>  /*
>>   * For data=journal mode, folio should be marked dirty only when it was
>>   * writeably mapped. When that happens, it was already attached to the
>> @@ -3994,6 +4010,20 @@ static const struct address_space_operations ext4_da_aops = {
>>  	.swap_activate		= ext4_iomap_swap_activate,
>>  };
>>  
>> +static const struct address_space_operations ext4_iomap_aops = {
>> +	.read_folio		= ext4_iomap_read_folio,
>> +	.readahead		= ext4_iomap_readahead,
>> +	.writepages		= ext4_iomap_writepages,
>> +	.dirty_folio		= iomap_dirty_folio,
>> +	.bmap			= ext4_bmap,
>> +	.invalidate_folio	= iomap_invalidate_folio,
>> +	.release_folio		= iomap_release_folio,
>> +	.migrate_folio		= filemap_migrate_folio,
>> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
>> +	.error_remove_folio	= generic_error_remove_folio,
>> +	.swap_activate		= ext4_iomap_swap_activate,
>> +};
> 
> So one question: for ->release_folio() we are using
> iomap_release_folio() instead of ext4_release_folio() here, which doesn't
> make the jbd2_journal_try_to_free_buffers() call. IIUC this function
> seems to be trying to clean up already checkpointed buffers.
> 
> I wanted to check if ->release_folio() can be called for folios with
> ext4 metadata buffers? (from my limited understanding of
> shrink_folio_list() -> filemap_release_folio() it seems we can) And if
> it can be called, is it okay to skip the
> jbd2_journal_try_to_free_buffers call?

Here, in ->release_folio(), folio->mapping points to inode->i_data (the
file's pagecache), not the block device's pagecache. ext4 metadata
resides in the block device's pagecache, which is at a different layer
than this release_folio callback. So we don't need to call
jbd2_journal_try_to_free_buffers() in the iomap path here.

Thanks,
Yi.

> 
> Regards,
> ojaswin
> 
>> +
>>  static const struct address_space_operations ext4_dax_aops = {
>>  	.writepages		= ext4_dax_writepages,
>>  	.dirty_folio		= noop_dirty_folio,
>> @@ -4015,6 +4045,8 @@ void ext4_set_aops(struct inode *inode)
>>  	}
>>  	if (IS_DAX(inode))
>>  		inode->i_mapping->a_ops = &ext4_dax_aops;
>> +	else if (ext4_inode_buffered_iomap(inode))
>> +		inode->i_mapping->a_ops = &ext4_iomap_aops;
>>  	else if (test_opt(inode->i_sb, DELALLOC))
>>  		inode->i_mapping->a_ops = &ext4_da_aops;
>>  	else
>> -- 
>> 2.52.0
>>



* Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path
From: Ojaswin Mujoo @ 2026-05-19 10:41 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
	libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
	yangerkun, yukuai
In-Reply-To: <20260511072344.191271-8-yi.zhang@huaweicloud.com>

On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The data=ordered mode introduces two fundamental conflicts with the
> iomap buffered write path, leading to potential deadlocks.
> 
> 1) Lock ordering conflict
>    In the iomap writeback path, each folio is processed sequentially:
>    the folio lock is acquired first, followed by starting a transaction
>    to create block mappings. In data=ordered mode, writeback triggered
>    by the journal commit process may attempt to acquire a folio lock
>    that is already held by iomap. Meanwhile, iomap, under that same
>    folio lock, may start a new transaction and wait for the currently
>    committing transaction to finish, resulting in a deadlock.

Right, makes sense.
> 
> 2) Partial folio submission not supported
>    When block size is smaller than folio size, a folio may contain both
>    mapped and unmapped blocks. In data=ordered mode, if the journal
>    waits for such a folio to be written back while the regular writeback
>    process has already started committing it (with the writeback flag
>    set), mapping the remaining unmapped blocks can deadlock. This is
>    because the writeback flag is cleared only after the entire folio is
>    processed and committed.

Okay, so IIUC, if we do end up using iomap with ordered data, there are 2
codepaths with issues here:

txn_commit
  ordered data writeback (say it goes via iomap)
    folio_lock
    iomap_writeback_folio
      folio_start_writeback
        iomap_writeback_range
          ext4_map_block
            txn_start
              wait for txn commit - DEADLOCK

Currently we avoid this by having ext4_normal_submit_inode_buffers()
pass can_map = 0, so the journal flush makes sure not to start any txn.

Then we have

txn_commit                          background writeback (via iomap)

                                    folio_lock()
  ordered data writeback
    folio_lock

                                    iomap_writeback_folio
                                      folio_start_writeback
                                        iomap_writeback_range
                                          ext4_map_block
                                            txn_start
                                              wait for txn commit - DEADLOCK

Currently, this is taken care of because we try to start the txn before
taking any folio locks/starting writeback, and hence we cannot deadlock.

If the above description makes sense, do you think it'd be good to add
these traces to the commit message? The reason is that although these
paths seem obvious when we look at them a lot, it took me a good bit of
time to understand what deadlocks you are talking about here :p

Having the code traces like above makes it very clear.
> 
> To support data=ordered mode, the iomap core would need two invasive
> changes:
>  - Acquire the transaction handle before locking any folio for
>    writeback.
>  - Support partial folio submission.
> 
> Both changes are complicated and risk performance regressions.
> Therefore, we must avoid using data=ordered mode when converting to the
> iomap path.
> 
> Currently, data=ordered mode is used in three scenarios:
>  - Append write
>  - Post-EOF partial block truncate-up followed by append write
>  - Online defragmentation
> 
> We can address the first two without data=ordered mode:
>  - For append write: always allocate unwritten blocks (i.e. always
>    enable dioread_nolock), preserving the behavior of current
>    extent-type inodes.
>  - For post-EOF truncate-up + append write: postpone updating i_disksize
>    until after the zeroed partial block has been written back.

I'm still going through how we are addressing no data=ordered so will
get back on this in some time.

Thanks,
Ojaswin

> 
> Online defragmentation does not yet support iomap; this can be resolved
> separately in the future.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/ext4_jbd2.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 63d17c5201b5..26999f173870 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
>  
>  static inline int ext4_should_order_data(struct inode *inode)
>  {
> -	return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> +	/*
> +	 * inodes using the iomap buffered I/O path do not use the
> +	 * data=ordered mode.
> +	 */
> +	return !ext4_inode_buffered_iomap(inode) &&
> +		(ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
>  }
>  
>  static inline int ext4_should_writeback_data(struct inode *inode)
> -- 
> 2.52.0
> 
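
The effect of the ext4_should_order_data() change quoted above can be
illustrated with a small userspace model. Everything here is a sketch:
struct sketch_inode, the SKETCH_* mode bits and their values, and
sketch_should_order_data() are hypothetical stand-ins for the ext4
definitions.

```c
#include <stdbool.h>

/* Illustrative mode bits for this sketch only. */
#define SKETCH_JOURNAL_DATA_MODE	0x01
#define SKETCH_ORDERED_DATA_MODE	0x02
#define SKETCH_WRITEBACK_DATA_MODE	0x04

/* Hypothetical stand-in for the inode state the predicate looks at. */
struct sketch_inode {
	int journal_mode;	/* one of the SKETCH_*_DATA_MODE bits */
	bool buffered_iomap;	/* models EXT4_STATE_BUFFERED_IOMAP */
};

/* Ordered mode applies only when the inode is not on the iomap path. */
static bool sketch_should_order_data(const struct sketch_inode *inode)
{
	return !inode->buffered_iomap &&
	       (inode->journal_mode & SKETCH_ORDERED_DATA_MODE);
}
```

In this model an inode on a data=ordered mount stops being treated as
ordered once the iomap flag is set, while inodes without the flag behave
as before.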


