* [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
@ 2026-02-03 6:25 Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 01/22] ext4: make ext4_block_zero_page_range() pass out did_zero Zhang Yi
` (22 more replies)
0 siblings, 23 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
From: Zhang Yi <yi.zhang@huaweicloud.com>
Changes since V1:
- Rebase this series on linux-next 20260122.
- Refactor partial block zero range, stop passing handle to
ext4_block_truncate_page() and ext4_zero_partial_blocks(), and move
partial block zeroing operation outside an active journal transaction
to prevent potential deadlocks because of the lock ordering of folio
and transaction start.
- Clarify the lock ordering of folio lock and transaction start, update
the comments accordingly.
- Fix some issues related to fast commit, pollute post-EOF folio.
- Some minor code and comments optimizations.
v1: https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
Original Cover (Updated):
This series adds the iomap buffered I/O path supports for regular files.
It implements the core iomap APIs on ext4 and introduces two mount
options called 'buffered_iomap' and "nobuffered_iomap" to enable and
disable the iomap buffered I/O path. This series supports the default
features, default mount options and bigalloc feature for ext4. We do not
yet support online defragmentation, inline data, fs_verify, fs_crypt,
non-extent, and data=journal mode, it will fall to buffered_head I/O
path automatically if these features and options are used.
Key notes on the iomap implementations in this series.
- Don't use ordered data mode to prevent exposing stale data when
performing append write and truncating down.
- Override dioread_nolock mount option, always allocate unwritten
extents for new blocks.
- When performing write back, don't use reserved journal handle and
postponing updating i_disksize until I/O is done.
- The lock ordering of the folio lock and start transaction is the
opposite of that in the buffer_head buffered write path.
Series details:
Patch 01-08: Refactor partial block zeroing operation, move it out of an
active running journal transaction, and handle post EOF
partial block zeroing properly.
Patch 09-21: Implement the core iomap buffered read, write path, dirty
folio write back path, mmap path and partial block zeroing
path for ext4 regular file.
Patch 22: Introduce 'buffered_iomap' and 'nobuffer_iomap' mount option
to enable and disable the iomap buffered I/O path.
Tests and Performance:
I tested this series using xfstests-bld with auto configurations, as
well as fast_commit and 64k configurations. No new regressions were
observed.
I used fio to test my virtual machine with a 150 GB memory disk and
found an improvement of approximately 30% to 50% in large I/O write
performance, while read performance showed no significant difference.
buffered write
==============
buffer_head:
bs write cache uncached write
1k 423 MiB/s 36.3 MiB/s
4k 1067 MiB/s 58.4 MiB/s
64k 4321 MiB/s 869 MiB/s
1M 4640 MiB/s 3158 MiB/s
iomap:
bs write cache uncached write
1k 403 MiB/s 57 MiB/s
4k 1093 MiB/s 61 MiB/s
64k 6488 MiB/s 1206 MiB/s
1M 7378 MiB/s 4818 MiB/s
buffered read
=============
buffer_head:
bs read hole read cache read data
1k 635 MiB/s 661 MiB/s 605 MiB/s
4k 1987 MiB/s 2128 MiB/s 1761 MiB/s
64k 6068 MiB/s 9472 MiB/s 4475 MiB/s
1M 5471 MiB/s 8657 MiB/s 4405 MiB/s
iomap:
bs read hole read cache read data
1k 643 MiB/s 653 MiB/s 602 MiB/s
4k 2075 MiB/s 2159 MiB/s 1716 MiB/s
64k 6267 MiB/s 9545MiB/s 4451 MiB/s
1M 6072 MiB/s 9191MiB/s 4467 MiB/s
Comments and suggestions are welcome!
Thanks,
Yi.
Zhang Yi (22):
ext4: make ext4_block_zero_page_range() pass out did_zero
ext4: make ext4_block_truncate_page() return zeroed length
ext4: only order data when partially block truncating down
ext4: factor out journalled block zeroing range
ext4: stop passing handle to ext4_journalled_block_zero_range()
ext4: don't zero partial block under an active handle when truncating
down
ext4: move ext4_block_zero_page_range() out of an active handle
ext4: zero post EOF partial block before appending write
ext4: add a new iomap aops for regular file's buffered IO path
ext4: implement buffered read iomap path
ext4: pass out extent seq counter when mapping da blocks
ext4: implement buffered write iomap path
ext4: implement writeback iomap path
ext4: implement mmap iomap path
iomap: correct the range of a partial dirty clear
iomap: support invalidating partial folios
ext4: implement partial block zero range iomap path
ext4: do not order data for inodes using buffered iomap path
ext4: add block mapping tracepoints for iomap buffered I/O path
ext4: disable online defrag when inode using iomap buffered I/O path
ext4: partially enable iomap for the buffered I/O path of regular
files
ext4: introduce a mount option for iomap buffered I/O path
fs/ext4/ext4.h | 21 +-
fs/ext4/ext4_jbd2.c | 1 +
fs/ext4/ext4_jbd2.h | 7 +-
fs/ext4/extents.c | 31 +-
fs/ext4/file.c | 40 +-
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 822 ++++++++++++++++++++++++++++++++----
fs/ext4/move_extent.c | 11 +
fs/ext4/page-io.c | 119 ++++++
fs/ext4/super.c | 32 +-
fs/iomap/buffered-io.c | 12 +-
include/trace/events/ext4.h | 45 ++
12 files changed, 1033 insertions(+), 109 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH -next v2 01/22] ext4: make ext4_block_zero_page_range() pass out did_zero
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 02/22] ext4: make ext4_block_truncate_page() return zeroed length Zhang Yi
` (21 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Modify ext4_block_zero_page_range() to pass out the did_zero parameter.
This parameter is set to true once a partial block has been zeroed out.
This change prepares for moving ordered data handling out of
__ext4_block_zero_page_range(), which is being adapted for the
conversion of the block zero range to the iomap infrastructure.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index da96db5f2345..759a2a031a9d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4030,7 +4030,8 @@ void ext4_set_aops(struct inode *inode)
* racing writeback can come later and flush the stale pagecache to disk.
*/
static int __ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length)
+ struct address_space *mapping, loff_t from, loff_t length,
+ bool *did_zero)
{
unsigned int offset, blocksize, pos;
ext4_lblk_t iblock;
@@ -4118,6 +4119,8 @@ static int __ext4_block_zero_page_range(handle_t *handle,
err = ext4_jbd2_inode_add_write(handle, inode, from,
length);
}
+ if (!err && did_zero)
+ *did_zero = true;
unlock:
folio_unlock(folio);
@@ -4133,7 +4136,8 @@ static int __ext4_block_zero_page_range(handle_t *handle,
* that corresponds to 'from'
*/
static int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length)
+ struct address_space *mapping, loff_t from, loff_t length,
+ bool *did_zero)
{
struct inode *inode = mapping->host;
unsigned blocksize = inode->i_sb->s_blocksize;
@@ -4147,10 +4151,11 @@ static int ext4_block_zero_page_range(handle_t *handle,
length = max;
if (IS_DAX(inode)) {
- return dax_zero_range(inode, from, length, NULL,
+ return dax_zero_range(inode, from, length, did_zero,
&ext4_iomap_ops);
}
- return __ext4_block_zero_page_range(handle, mapping, from, length);
+ return __ext4_block_zero_page_range(handle, mapping, from, length,
+ did_zero);
}
/*
@@ -4173,7 +4178,7 @@ static int ext4_block_truncate_page(handle_t *handle,
blocksize = i_blocksize(inode);
length = blocksize - (from & (blocksize - 1));
- return ext4_block_zero_page_range(handle, mapping, from, length);
+ return ext4_block_zero_page_range(handle, mapping, from, length, NULL);
}
int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
@@ -4196,13 +4201,13 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
if (start == end &&
(partial_start || (partial_end != sb->s_blocksize - 1))) {
err = ext4_block_zero_page_range(handle, mapping,
- lstart, length);
+ lstart, length, NULL);
return err;
}
/* Handle partial zero out on the start of the range */
if (partial_start) {
- err = ext4_block_zero_page_range(handle, mapping,
- lstart, sb->s_blocksize);
+ err = ext4_block_zero_page_range(handle, mapping, lstart,
+ sb->s_blocksize, NULL);
if (err)
return err;
}
@@ -4210,7 +4215,7 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
if (partial_end != sb->s_blocksize - 1)
err = ext4_block_zero_page_range(handle, mapping,
byte_end - partial_end,
- partial_end + 1);
+ partial_end + 1, NULL);
return err;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 02/22] ext4: make ext4_block_truncate_page() return zeroed length
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 01/22] ext4: make ext4_block_zero_page_range() pass out did_zero Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 03/22] ext4: only order data when partially block truncating down Zhang Yi
` (20 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Modify ext4_block_truncate_page() to return the zeroed length. This is
prepared for the move out of ordered data handling in
__ext4_block_zero_page_range(), which is prepared for the conversion of
block zero range to the iomap infrastructure.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 759a2a031a9d..f856ea015263 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,7 @@ static int ext4_block_zero_page_range(handle_t *handle,
* up to the end of the block which corresponds to `from'.
* This required during truncate. We need to physically zero the tail end
* of that block so it doesn't yield old data if the file is later grown.
+ * Return the zeroed length on success.
*/
static int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from)
@@ -4170,6 +4171,8 @@ static int ext4_block_truncate_page(handle_t *handle,
unsigned length;
unsigned blocksize;
struct inode *inode = mapping->host;
+ bool did_zero = false;
+ int err;
/* If we are processing an encrypted inode during orphan list handling */
if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
@@ -4178,7 +4181,12 @@ static int ext4_block_truncate_page(handle_t *handle,
blocksize = i_blocksize(inode);
length = blocksize - (from & (blocksize - 1));
- return ext4_block_zero_page_range(handle, mapping, from, length, NULL);
+ err = ext4_block_zero_page_range(handle, mapping, from, length,
+ &did_zero);
+ if (err)
+ return err;
+
+ return did_zero ? length : 0;
}
int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 01/22] ext4: make ext4_block_zero_page_range() pass out did_zero Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 02/22] ext4: make ext4_block_truncate_page() return zeroed length Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 9:59 ` Jan Kara
` (2 more replies)
2026-02-03 6:25 ` [PATCH -next v2 04/22] ext4: factor out journalled block zeroing range Zhang Yi
` (19 subsequent siblings)
22 siblings, 3 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Currently, __ext4_block_zero_page_range() is called in the following
four cases to zero out the data in partial blocks:
1. Truncate down.
2. Truncate up.
3. Perform block allocation (e.g., fallocate) or append writes across a
range extending beyond the end of the file (EOF).
4. Partial block punch hole.
If the default ordered data mode is used, __ext4_block_zero_page_range()
will write back the zeroed data to the disk through the order mode after
zeroing out.
Among the cases 1,2 and 3 described above, only case 1 actually requires
this ordered write. Assuming no one intentionally bypasses the file
system to write directly to the disk. When performing a truncate down
operation, ensuring that the data beyond the EOF is zeroed out before
updating i_disksize is sufficient to prevent old data from being exposed
when the file is later extended. In other words, as long as the on-disk
data in case 1 can be properly zeroed out, only the data in memory needs
to be zeroed out in cases 2 and 3, without requiring ordered data.
Case 4 does not require ordered data because the entire punch hole
operation does not provide atomicity guarantees. Therefore, it's safe to
move the ordered data operation from __ext4_block_zero_page_range() to
ext4_truncate().
It should be noted that after this change, we can only determine whether
to perform ordered data operations based on whether the target block has
been zeroed, rather than on the state of the buffer head. Consequently,
unnecessary ordered data operations may occur when truncating an
unwritten dirty block. However, this scenario is relatively rare, so the
overall impact is minimal.
This is prepared for the conversion to the iomap infrastructure since it
doesn't use ordered data mode and requires active writeback, which
reduces the complexity of the conversion.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f856ea015263..20b60abcf777 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
folio_zero_range(folio, offset, length);
BUFFER_TRACE(bh, "zeroed end of block");
- if (ext4_should_journal_data(inode)) {
+ if (ext4_should_journal_data(inode))
err = ext4_dirty_journalled_data(handle, bh);
- } else {
+ else
mark_buffer_dirty(bh);
- /*
- * Only the written block requires ordered data to prevent
- * exposing stale data.
- */
- if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
- ext4_should_order_data(inode))
- err = ext4_jbd2_inode_add_write(handle, inode, from,
- length);
- }
if (!err && did_zero)
*did_zero = true;
@@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
goto out_trace;
}
- if (inode->i_size & (inode->i_sb->s_blocksize - 1))
- ext4_block_truncate_page(handle, mapping, inode->i_size);
+ if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
+ unsigned int zero_len;
+
+ zero_len = ext4_block_truncate_page(handle, mapping,
+ inode->i_size);
+ if (zero_len < 0) {
+ err = zero_len;
+ goto out_stop;
+ }
+ if (zero_len && !IS_DAX(inode) &&
+ ext4_should_order_data(inode)) {
+ err = ext4_jbd2_inode_add_write(handle, inode,
+ inode->i_size, zero_len);
+ if (err)
+ goto out_stop;
+ }
+ }
/*
* We add the inode to the orphan list, so that if this
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 04/22] ext4: factor out journalled block zeroing range
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (2 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 03/22] ext4: only order data when partially block truncating down Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 05/22] ext4: stop passing handle to ext4_journalled_block_zero_range() Zhang Yi
` (18 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Refactor __ext4_block_zero_page_range() by separating the block zeroing
operations for ordered data mode and journal data mode into two distinct
functions. Additionally, extract a common helper,
ext4_block_get_zero_range(), to identify the buffer that requires
zeroing.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 84 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 63 insertions(+), 21 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 20b60abcf777..7990ad566e10 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4029,13 +4029,12 @@ void ext4_set_aops(struct inode *inode)
* ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
* racing writeback can come later and flush the stale pagecache to disk.
*/
-static int __ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length,
- bool *did_zero)
+static struct buffer_head *ext4_block_get_zero_range(struct inode *inode,
+ loff_t from, loff_t length)
{
unsigned int offset, blocksize, pos;
ext4_lblk_t iblock;
- struct inode *inode = mapping->host;
+ struct address_space *mapping = inode->i_mapping;
struct buffer_head *bh;
struct folio *folio;
int err = 0;
@@ -4044,7 +4043,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
mapping_gfp_constraint(mapping, ~__GFP_FS));
if (IS_ERR(folio))
- return PTR_ERR(folio);
+ return ERR_CAST(folio);
blocksize = inode->i_sb->s_blocksize;
@@ -4096,24 +4095,65 @@ static int __ext4_block_zero_page_range(handle_t *handle,
}
}
}
- if (ext4_should_journal_data(inode)) {
- BUFFER_TRACE(bh, "get write access");
- err = ext4_journal_get_write_access(handle, inode->i_sb, bh,
- EXT4_JTR_NONE);
- if (err)
- goto unlock;
- }
- folio_zero_range(folio, offset, length);
+ return bh;
+
+unlock:
+ folio_unlock(folio);
+ folio_put(folio);
+ return err ? ERR_PTR(err) : NULL;
+}
+
+static int ext4_block_zero_range(struct inode *inode, loff_t from,
+ loff_t length, bool *did_zero)
+{
+ struct buffer_head *bh;
+ struct folio *folio;
+
+ bh = ext4_block_get_zero_range(inode, from, length);
+ if (IS_ERR_OR_NULL(bh))
+ return PTR_ERR_OR_ZERO(bh);
+
+ folio = bh->b_folio;
+ folio_zero_range(folio, offset_in_folio(folio, from), length);
BUFFER_TRACE(bh, "zeroed end of block");
- if (ext4_should_journal_data(inode))
- err = ext4_dirty_journalled_data(handle, bh);
- else
- mark_buffer_dirty(bh);
- if (!err && did_zero)
+ mark_buffer_dirty(bh);
+ if (did_zero)
*did_zero = true;
-unlock:
+ folio_unlock(folio);
+ folio_put(folio);
+ return 0;
+}
+
+static int ext4_journalled_block_zero_range(handle_t *handle,
+ struct inode *inode, loff_t from, loff_t length, bool *did_zero)
+{
+ struct buffer_head *bh;
+ struct folio *folio;
+ int err;
+
+ bh = ext4_block_get_zero_range(inode, from, length);
+ if (IS_ERR_OR_NULL(bh))
+ return PTR_ERR_OR_ZERO(bh);
+ folio = bh->b_folio;
+
+ BUFFER_TRACE(bh, "get write access");
+ err = ext4_journal_get_write_access(handle, inode->i_sb, bh,
+ EXT4_JTR_NONE);
+ if (err)
+ goto out;
+
+ folio_zero_range(folio, offset_in_folio(folio, from), length);
+ BUFFER_TRACE(bh, "zeroed end of block");
+
+ err = ext4_dirty_journalled_data(handle, bh);
+ if (err)
+ goto out;
+
+ if (did_zero)
+ *did_zero = true;
+out:
folio_unlock(folio);
folio_put(folio);
return err;
@@ -4144,9 +4184,11 @@ static int ext4_block_zero_page_range(handle_t *handle,
if (IS_DAX(inode)) {
return dax_zero_range(inode, from, length, did_zero,
&ext4_iomap_ops);
+ } else if (ext4_should_journal_data(inode)) {
+ return ext4_journalled_block_zero_range(handle, inode, from,
+ length, did_zero);
}
- return __ext4_block_zero_page_range(handle, mapping, from, length,
- did_zero);
+ return ext4_block_zero_range(inode, from, length, did_zero);
}
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 05/22] ext4: stop passing handle to ext4_journalled_block_zero_range()
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (3 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 04/22] ext4: factor out journalled block zeroing range Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 06/22] ext4: don't zero partial block under an active handle when truncating down Zhang Yi
` (17 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
When zeroing partial blocks, only the journal data mode requires an
active journal handle. Therefore, stop passing the handle to
ext4_zero_partial_blocks() and related functions, and make
ext4_journalled_block_zero_range() start a handle independently.
Currently, this change has no practical impact because all calls occur
within the context of an active handle. This change prepares for moving
ext4_block_truncate_page() out of an active handle, which is a
prerequisite for converting block zero range operations to the iomap
infrastructure because it requires active writeback after truncate down.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 4 ++--
fs/ext4/extents.c | 6 +++---
fs/ext4/inode.c | 54 +++++++++++++++++++++++++----------------------
3 files changed, 34 insertions(+), 30 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e0ed273e2e8a..19d0b4917aea 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3103,8 +3103,8 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_chunk_trans_extent(struct inode *inode, int nrblocks);
extern int ext4_meta_trans_blocks(struct inode *inode, int lblocks,
int pextents);
-extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
- loff_t lstart, loff_t lend);
+extern int ext4_zero_partial_blocks(struct inode *inode,
+ loff_t lstart, loff_t lend);
extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3630b27e4fd7..953bf8945bda 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4627,8 +4627,8 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
inode_get_ctime(inode));
if (epos > old_size) {
pagecache_isize_extended(inode, old_size, epos);
- ext4_zero_partial_blocks(handle, inode,
- old_size, epos - old_size);
+ ext4_zero_partial_blocks(inode, old_size,
+ epos - old_size);
}
}
ret2 = ext4_mark_inode_dirty(handle, inode);
@@ -4746,7 +4746,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
}
/* Zero out partial block at the edges of the range */
- ret = ext4_zero_partial_blocks(handle, inode, offset, len);
+ ret = ext4_zero_partial_blocks(inode, offset, len);
if (ret)
goto out_handle;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7990ad566e10..c05b1c0a1b45 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1458,7 +1458,7 @@ static int ext4_write_end(const struct kiocb *iocb,
if (old_size < pos && !verity) {
pagecache_isize_extended(inode, old_size, pos);
- ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+ ext4_zero_partial_blocks(inode, old_size, pos - old_size);
}
/*
* Don't mark the inode dirty under folio lock. First, it unnecessarily
@@ -1576,7 +1576,7 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,
if (old_size < pos && !verity) {
pagecache_isize_extended(inode, old_size, pos);
- ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+ ext4_zero_partial_blocks(inode, old_size, pos - old_size);
}
if (size_changed) {
@@ -3252,7 +3252,7 @@ static int ext4_da_do_write_end(struct address_space *mapping,
if (IS_ERR(handle))
return PTR_ERR(handle);
if (zero_len)
- ext4_zero_partial_blocks(handle, inode, old_size, zero_len);
+ ext4_zero_partial_blocks(inode, old_size, zero_len);
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
@@ -4126,16 +4126,23 @@ static int ext4_block_zero_range(struct inode *inode, loff_t from,
return 0;
}
-static int ext4_journalled_block_zero_range(handle_t *handle,
- struct inode *inode, loff_t from, loff_t length, bool *did_zero)
+static int ext4_journalled_block_zero_range(struct inode *inode, loff_t from,
+ loff_t length, bool *did_zero)
{
struct buffer_head *bh;
struct folio *folio;
+ handle_t *handle;
int err;
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
bh = ext4_block_get_zero_range(inode, from, length);
- if (IS_ERR_OR_NULL(bh))
- return PTR_ERR_OR_ZERO(bh);
+ if (IS_ERR_OR_NULL(bh)) {
+ err = PTR_ERR_OR_ZERO(bh);
+ goto out_handle;
+ }
folio = bh->b_folio;
BUFFER_TRACE(bh, "get write access");
@@ -4156,6 +4163,8 @@ static int ext4_journalled_block_zero_range(handle_t *handle,
out:
folio_unlock(folio);
folio_put(folio);
+out_handle:
+ ext4_journal_stop(handle);
return err;
}
@@ -4166,9 +4175,9 @@ static int ext4_journalled_block_zero_range(handle_t *handle,
* the end of the block it will be shortened to end of the block
* that corresponds to 'from'
*/
-static int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length,
- bool *did_zero)
+static int ext4_block_zero_page_range(struct address_space *mapping,
+ loff_t from, loff_t length,
+ bool *did_zero)
{
struct inode *inode = mapping->host;
unsigned blocksize = inode->i_sb->s_blocksize;
@@ -4185,7 +4194,7 @@ static int ext4_block_zero_page_range(handle_t *handle,
return dax_zero_range(inode, from, length, did_zero,
&ext4_iomap_ops);
} else if (ext4_should_journal_data(inode)) {
- return ext4_journalled_block_zero_range(handle, inode, from,
+ return ext4_journalled_block_zero_range(inode, from,
length, did_zero);
}
return ext4_block_zero_range(inode, from, length, did_zero);
@@ -4198,8 +4207,7 @@ static int ext4_block_zero_page_range(handle_t *handle,
* of that block so it doesn't yield old data if the file is later grown.
* Return the zeroed length on success.
*/
-static int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from)
+static int ext4_block_truncate_page(struct address_space *mapping, loff_t from)
{
unsigned length;
unsigned blocksize;
@@ -4214,16 +4222,14 @@ static int ext4_block_truncate_page(handle_t *handle,
blocksize = i_blocksize(inode);
length = blocksize - (from & (blocksize - 1));
- err = ext4_block_zero_page_range(handle, mapping, from, length,
- &did_zero);
+ err = ext4_block_zero_page_range(mapping, from, length, &did_zero);
if (err)
return err;
return did_zero ? length : 0;
}
-int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
- loff_t lstart, loff_t length)
+int ext4_zero_partial_blocks(struct inode *inode, loff_t lstart, loff_t length)
{
struct super_block *sb = inode->i_sb;
struct address_space *mapping = inode->i_mapping;
@@ -4241,20 +4247,19 @@ int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
/* Handle partial zero within the single block */
if (start == end &&
(partial_start || (partial_end != sb->s_blocksize - 1))) {
- err = ext4_block_zero_page_range(handle, mapping,
- lstart, length, NULL);
+ err = ext4_block_zero_page_range(mapping, lstart, length, NULL);
return err;
}
/* Handle partial zero out on the start of the range */
if (partial_start) {
- err = ext4_block_zero_page_range(handle, mapping, lstart,
+ err = ext4_block_zero_page_range(mapping, lstart,
sb->s_blocksize, NULL);
if (err)
return err;
}
/* Handle partial zero out on the end of the range */
if (partial_end != sb->s_blocksize - 1)
- err = ext4_block_zero_page_range(handle, mapping,
+ err = ext4_block_zero_page_range(mapping,
byte_end - partial_end,
partial_end + 1, NULL);
return err;
@@ -4462,7 +4467,7 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
return ret;
}
- ret = ext4_zero_partial_blocks(handle, inode, offset, length);
+ ret = ext4_zero_partial_blocks(inode, offset, length);
if (ret)
goto out_handle;
@@ -4614,8 +4619,7 @@ int ext4_truncate(struct inode *inode)
if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
unsigned int zero_len;
- zero_len = ext4_block_truncate_page(handle, mapping,
- inode->i_size);
+ zero_len = ext4_block_truncate_page(mapping, inode->i_size);
if (zero_len < 0) {
err = zero_len;
goto out_stop;
@@ -5990,7 +5994,7 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
inode_set_mtime_to_ts(inode,
inode_set_ctime_current(inode));
if (oldsize & (inode->i_sb->s_blocksize - 1))
- ext4_block_truncate_page(handle,
+ ext4_block_truncate_page(
inode->i_mapping, oldsize);
}
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 06/22] ext4: don't zero partial block under an active handle when truncating down
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (4 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 05/22] ext4: stop passing handle to ext4_journalled_block_zero_range() Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 07/22] ext4: move ext4_block_zero_page_range() out of an active handle Zhang Yi
` (16 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
When truncating down, move the call to ext4_block_truncate_page() before
starting the handle. This change has no effect in non-journal data mode
because it doesn't require an active handle. However, in journal data
mode, it may cause the zeroing of partial blocks and the release of
subsequent full blocks to be distributed across different transactions.
This is safe as well because the transaction that zeroes the blocks will
always be committed first, and the entire truncate operation does not
require atomicity guarantee.
This change prepares for converting the block zero range to the iomap
infrastructure, which does not use ordered data mode and requires active
writeback to prevent exposing stale data. The writeback must be
completed before the transaction to remove the orphan is committed, and
it cannot be performed within an active handle.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c05b1c0a1b45..e67c750866a5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4570,7 +4570,7 @@ int ext4_inode_attach_jinode(struct inode *inode)
int ext4_truncate(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
- unsigned int credits;
+ unsigned int credits, zero_len = 0;
int err = 0, err2;
handle_t *handle;
struct address_space *mapping = inode->i_mapping;
@@ -4603,6 +4603,12 @@ int ext4_truncate(struct inode *inode)
err = ext4_inode_attach_jinode(inode);
if (err)
goto out_trace;
+
+ zero_len = ext4_block_truncate_page(mapping, inode->i_size);
+ if (zero_len < 0) {
+ err = zero_len;
+ goto out_trace;
+ }
}
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
@@ -4616,21 +4622,12 @@ int ext4_truncate(struct inode *inode)
goto out_trace;
}
- if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
- unsigned int zero_len;
-
- zero_len = ext4_block_truncate_page(mapping, inode->i_size);
- if (zero_len < 0) {
- err = zero_len;
+ /* Ordered zeroed data to prevent exposure of stale data. */
+ if (zero_len && !IS_DAX(inode) && ext4_should_order_data(inode)) {
+ err = ext4_jbd2_inode_add_write(handle, inode, inode->i_size,
+ zero_len);
+ if (err)
goto out_stop;
- }
- if (zero_len && !IS_DAX(inode) &&
- ext4_should_order_data(inode)) {
- err = ext4_jbd2_inode_add_write(handle, inode,
- inode->i_size, zero_len);
- if (err)
- goto out_stop;
- }
}
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 07/22] ext4: move ext4_block_zero_page_range() out of an active handle
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (5 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 06/22] ext4: don't zero partial block under an active handle when truncating down Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 08/22] ext4: zero post EOF partial block before appending write Zhang Yi
` (15 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
In the cases of truncating up and beyond EOF with fallocate, since
truncating down, buffered writeback, and DIO write operations have
guaranteed that the on-disk data has been zeroed, only the data in
memory needs to be zeroed out. Therefore, it is safe to move the call to
ext4_block_zero_page_range() outside the active handle.
In the case of a partial zero range and a partial punch hole, the entire
operation does not require atomicity guarantees. Therefore, it is also
safe to move the ext4_block_zero_page_range() call outside the active
handle.
This change prepares for converting the block zero range to the iomap
infrastructure. The folio lock will be held during the zeroing process.
Since the iomap iteration process always holds the folio lock before
starting a new handle, we need to ensure that the folio lock is not held
while an active handle is in use; otherwise, a potential deadlock may
occur.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/extents.c | 31 ++++++++++++-------------------
fs/ext4/inode.c | 33 +++++++++++++++++----------------
2 files changed, 29 insertions(+), 35 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 953bf8945bda..afe92e58ca8d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4625,11 +4625,6 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (ext4_update_inode_size(inode, epos) & 0x1)
inode_set_mtime_to_ts(inode,
inode_get_ctime(inode));
- if (epos > old_size) {
- pagecache_isize_extended(inode, old_size, epos);
- ext4_zero_partial_blocks(inode, old_size,
- epos - old_size);
- }
}
ret2 = ext4_mark_inode_dirty(handle, inode);
ext4_update_inode_fsync_trans(handle, inode, 1);
@@ -4638,6 +4633,11 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (unlikely(ret2))
break;
+ if (new_size && epos > old_size) {
+ pagecache_isize_extended(inode, old_size, epos);
+ ext4_zero_partial_blocks(inode, old_size,
+ epos - old_size);
+ }
if (alloc_zero &&
(map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))) {
ret2 = ext4_issue_zeroout(inode, map.m_lblk, map.m_pblk,
@@ -4673,7 +4673,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ext4_lblk_t start_lblk, end_lblk;
unsigned int blocksize = i_blocksize(inode);
unsigned int blkbits = inode->i_blkbits;
- int ret, flags, credits;
+ int ret, flags;
trace_ext4_zero_range(inode, offset, len, mode);
WARN_ON_ONCE(!inode_is_locked(inode));
@@ -4731,25 +4731,18 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (IS_ALIGNED(offset | end, blocksize))
return ret;
- /*
- * In worst case we have to writeout two nonadjacent unwritten
- * blocks and update the inode
- */
- credits = (2 * ext4_ext_index_trans_blocks(inode, 2)) + 1;
- if (ext4_should_journal_data(inode))
- credits += 2;
- handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
+ /* Zero out partial block at the edges of the range */
+ ret = ext4_zero_partial_blocks(inode, offset, len);
+ if (ret)
+ return ret;
+
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_std_error(inode->i_sb, ret);
return ret;
}
- /* Zero out partial block at the edges of the range */
- ret = ext4_zero_partial_blocks(inode, offset, len);
- if (ret)
- goto out_handle;
-
if (new_size)
ext4_update_inode_size(inode, new_size);
ret = ext4_mark_inode_dirty(handle, inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e67c750866a5..9c0e70256527 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4456,8 +4456,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
if (ret)
return ret;
+ ret = ext4_zero_partial_blocks(inode, offset, length);
+ if (ret)
+ return ret;
+
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- credits = ext4_chunk_trans_extent(inode, 2);
+ credits = ext4_chunk_trans_extent(inode, 0);
else
credits = ext4_blocks_for_truncate(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
@@ -4467,10 +4471,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
return ret;
}
- ret = ext4_zero_partial_blocks(inode, offset, length);
- if (ret)
- goto out_handle;
-
/* If there are blocks to remove, do it */
start_lblk = EXT4_B_TO_LBLK(inode, offset);
end_lblk = end >> inode->i_blkbits;
@@ -5973,15 +5973,6 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
goto out_mmap_sem;
}
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
- if (IS_ERR(handle)) {
- error = PTR_ERR(handle);
- goto out_mmap_sem;
- }
- if (ext4_handle_valid(handle) && shrink) {
- error = ext4_orphan_add(handle, inode);
- orphan = 1;
- }
/*
* Update c/mtime and tail zero the EOF folio on
* truncate up. ext4_truncate() handles the shrink case
@@ -5989,10 +5980,20 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
*/
if (!shrink) {
inode_set_mtime_to_ts(inode,
- inode_set_ctime_current(inode));
+ inode_set_ctime_current(inode));
if (oldsize & (inode->i_sb->s_blocksize - 1))
ext4_block_truncate_page(
- inode->i_mapping, oldsize);
+ inode->i_mapping, oldsize);
+ }
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
+ if (IS_ERR(handle)) {
+ error = PTR_ERR(handle);
+ goto out_mmap_sem;
+ }
+ if (ext4_handle_valid(handle) && shrink) {
+ error = ext4_orphan_add(handle, inode);
+ orphan = 1;
}
if (shrink)
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 08/22] ext4: zero post EOF partial block before appending write
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (6 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 07/22] ext4: move ext4_block_zero_page_range() out of an active handle Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 09/22] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
` (14 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
In cases of appending write beyond the end of file (EOF),
ext4_zero_partial_blocks() is called within ext4_*_write_end() to zero
out the partial block beyond the EOF. This prevents exposing stale data
that might be written through mmap. However, supporting only the regular
buffered write path is insufficient. It is also necessary to support the
DAX path as well as the upcoming iomap buffered write path. Therefore,
move this operation to ext4_write_checks(). In addition, the zero length
is limited within the EOF block to prevent ext4_zero_partial_blocks()
from attempting to zero out the extra end block (although it would not
do anything anyway).
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/file.c | 20 ++++++++++++++++++++
fs/ext4/inode.c | 21 +++++++--------------
2 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 4320ebff74f3..3ecc09f286e4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -271,6 +271,9 @@ static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ unsigned int blocksize = i_blocksize(inode);
+ loff_t old_size = i_size_read(inode);
ssize_t ret, count;
count = ext4_generic_write_checks(iocb, from);
@@ -280,6 +283,23 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
ret = file_modified(iocb->ki_filp);
if (ret)
return ret;
+
+ /*
+ * If the position is beyond the EOF, it is necessary to zero out the
+ * partial block that beyond the existing EOF, as it may contains
+ * stale data written through mmap.
+ */
+ if (iocb->ki_pos > old_size && (old_size & (blocksize - 1))) {
+ loff_t end = round_up(old_size, blocksize);
+
+ if (iocb->ki_pos < end)
+ end = iocb->ki_pos;
+
+ ret = ext4_zero_partial_blocks(inode, old_size, end - old_size);
+ if (ret)
+ return ret;
+ }
+
return count;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9c0e70256527..1ac93c39d21e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1456,10 +1456,9 @@ static int ext4_write_end(const struct kiocb *iocb,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos && !verity) {
+ if (old_size < pos && !verity)
pagecache_isize_extended(inode, old_size, pos);
- ext4_zero_partial_blocks(inode, old_size, pos - old_size);
- }
+
/*
* Don't mark the inode dirty under folio lock. First, it unnecessarily
* makes the holding time of folio lock longer. Second, it forces lock
@@ -1574,10 +1573,8 @@ static int ext4_journalled_write_end(const struct kiocb *iocb,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos && !verity) {
+ if (old_size < pos && !verity)
pagecache_isize_extended(inode, old_size, pos);
- ext4_zero_partial_blocks(inode, old_size, pos - old_size);
- }
if (size_changed) {
ret2 = ext4_mark_inode_dirty(handle, inode);
@@ -3196,7 +3193,7 @@ static int ext4_da_do_write_end(struct address_space *mapping,
struct inode *inode = mapping->host;
loff_t old_size = inode->i_size;
bool disksize_changed = false;
- loff_t new_i_size, zero_len = 0;
+ loff_t new_i_size;
handle_t *handle;
if (unlikely(!folio_buffers(folio))) {
@@ -3240,19 +3237,15 @@ static int ext4_da_do_write_end(struct address_space *mapping,
folio_unlock(folio);
folio_put(folio);
- if (pos > old_size) {
+ if (pos > old_size)
pagecache_isize_extended(inode, old_size, pos);
- zero_len = pos - old_size;
- }
- if (!disksize_changed && !zero_len)
+ if (!disksize_changed)
return copied;
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle))
return PTR_ERR(handle);
- if (zero_len)
- ext4_zero_partial_blocks(inode, old_size, zero_len);
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 09/22] ext4: add a new iomap aops for regular file's buffered IO path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (7 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 08/22] ext4: zero post EOF partial block before appending write Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 10/22] ext4: implement buffered read iomap path Zhang Yi
` (13 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Starts support for iomap in the buffered I/O path for regular files on
ext4.
- Introduces a new iomap address space operation, ext4_iomap_aops.
- Adds an inode state flag, EXT4_STATE_BUFFERED_IOMAP, which indicates
that the inode uses the iomap path instead of the original
buffer_head path for buffered I/O.
Most callbacks of ext4_iomap_aops can directly utilize generic iomap
implementations, the remaining callbacks: read_folio(), readahead(),
and writepages() will be implemented in later patches.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/inode.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19d0b4917aea..4930446cfec1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1978,6 +1978,7 @@ enum {
EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
EXT4_STATE_FC_FLUSHING_DATA, /* Fast commit flushing data */
EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */
+ EXT4_STATE_BUFFERED_IOMAP, /* Inode use iomap for buffered IO */
};
#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -2046,6 +2047,12 @@ static inline bool ext4_inode_orphan_tracked(struct inode *inode)
!list_empty(&EXT4_I(inode)->i_orphan);
}
+/* Whether the inode pass through the iomap infrastructure for buffered I/O */
+static inline bool ext4_inode_buffered_iomap(struct inode *inode)
+{
+ return ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+}
+
/*
* Codes for operating systems
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1ac93c39d21e..fb7e75de2065 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3903,6 +3903,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ return 0;
+}
+
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+
+}
+
+static int ext4_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return 0;
+}
+
/*
* For data=journal mode, folio should be marked dirty only when it was
* writeably mapped. When that happens, it was already attached to the
@@ -3989,6 +4005,20 @@ static const struct address_space_operations ext4_da_aops = {
.swap_activate = ext4_iomap_swap_activate,
};
+static const struct address_space_operations ext4_iomap_aops = {
+ .read_folio = ext4_iomap_read_folio,
+ .readahead = ext4_iomap_readahead,
+ .writepages = ext4_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .bmap = ext4_bmap,
+ .invalidate_folio = iomap_invalidate_folio,
+ .release_folio = iomap_release_folio,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+ .swap_activate = ext4_iomap_swap_activate,
+};
+
static const struct address_space_operations ext4_dax_aops = {
.writepages = ext4_dax_writepages,
.dirty_folio = noop_dirty_folio,
@@ -4010,6 +4040,8 @@ void ext4_set_aops(struct inode *inode)
}
if (IS_DAX(inode))
inode->i_mapping->a_ops = &ext4_dax_aops;
+ else if (ext4_inode_buffered_iomap(inode))
+ inode->i_mapping->a_ops = &ext4_iomap_aops;
else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 10/22] ext4: implement buffered read iomap path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (8 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 09/22] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 11/22] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
` (12 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Introduce a new iomap_ops instance, ext4_iomap_buffer_read_ops, to
implement the iomap read path for ext4, specifically the read_folio()
and readahead() callbacks of ext4_iomap_aops.
ext4_iomap_map_blocks() invokes ext4_map_blocks() to query the extent
mapping status of the read range and then converts the mapping
information to iomap type.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fb7e75de2065..25d9462d2da7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3903,14 +3903,57 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
+static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
+ loff_t length, struct ext4_map_blocks *map)
+{
+ u8 blkbits = inode->i_blkbits;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+
+ /* Calculate the first and last logical blocks respectively. */
+ map->m_lblk = offset >> blkbits;
+ map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
+
+ return ext4_map_blocks(NULL, inode, map, 0);
+}
+
+static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int flags, struct iomap *iomap,
+ struct iomap *srcmap)
+{
+ struct ext4_map_blocks map;
+ int ret;
+
+ if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+ return -EIO;
+
+ /* Inline data support is not yet available. */
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
+
+ ret = ext4_iomap_map_blocks(inode, offset, length, &map);
+ if (ret < 0)
+ return ret;
+
+ ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+ return 0;
+}
+
+const struct iomap_ops ext4_iomap_buffered_read_ops = {
+ .iomap_begin = ext4_iomap_buffered_read_begin,
+};
+
static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
{
+ iomap_bio_read_folio(folio, &ext4_iomap_buffered_read_ops);
return 0;
}
static void ext4_iomap_readahead(struct readahead_control *rac)
{
-
+ iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
}
static int ext4_iomap_writepages(struct address_space *mapping,
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 11/22] ext4: pass out extent seq counter when mapping da blocks
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (9 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 10/22] ext4: implement buffered read iomap path Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 12/22] ext4: implement buffered write iomap path Zhang Yi
` (11 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
The iomap buffered write path does not hold any locks between querying
the inode extent mapping information and performing buffered writes. It
relies on the sequence counter saved in the inode to determine whether
the mapping information is stale. Commit 07c440e8da8f ("ext4: pass out
extent seq counter when mapping blocks") passed out the sequence number
when mapping blocks, but missed two places where it would be used later
in the iomap buffered delayed write path; these have now been filled in.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 25d9462d2da7..c9489978358e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1903,7 +1903,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
ext4_check_map_extents_env(inode);
/* Lookup extent status tree firstly */
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
map->m_len = min_t(unsigned int, map->m_len,
es.es_len - (map->m_lblk - es.es_lblk));
@@ -1956,7 +1956,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
* is held in write mode, before inserting a new da entry in
* the extent status tree.
*/
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
map->m_len = min_t(unsigned int, map->m_len,
es.es_len - (map->m_lblk - es.es_lblk));
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 12/22] ext4: implement buffered write iomap path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (10 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 11/22] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 13/22] ext4: implement writeback " Zhang Yi
` (10 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Introduce two new iomap_ops instances, ext4_iomap_buffer_write_ops and
ext4_iomap_buffer_da_write_ops, to implement the iomap write paths for
ext4. ext4_iomap_buffer_da_write_begin() invokes ext4_da_map_blocks()
to map delayed allocation extents and ext4_iomap_buffer_write_begin()
invokes ext4_iomap_get_blocks() to directly allocate blocks in
non-delayed allocation mode. Additionally, add ext4_iomap_valid() to
check the validity of extents by iomap infrastructure.
Key notes:
- Since we don't use ordered data mode to prevent exposing stale data
in the non-delayed allocation path, we ignore the dioread_nolock
mount option and always allocate unwritten extents for new blocks.
- The iomap write path maps multiple blocks at a time in the
iomap_begin() callbacks, so we must remove the stale delayed
allocation range in case of short writes and write failures.
Otherwise, this could result in a range of delayed extents being
covered by a clean folio, which would lead to inaccurate space
reservation.
- The lock ordering of the folio lock and transaction start is the
opposite of that in the buffer_head buffered write path, update the
locking document as well.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 4 ++
fs/ext4/file.c | 20 +++++-
fs/ext4/inode.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/super.c | 10 ++-
4 files changed, 200 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4930446cfec1..89059b15ee5c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3062,6 +3062,7 @@ int ext4_walk_page_buffers(handle_t *handle,
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
void ext4_set_inode_mapping_order(struct inode *inode);
+int ext4_nonda_switch(struct super_block *sb);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2
@@ -3930,6 +3931,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
extern const struct iomap_ops ext4_iomap_ops;
extern const struct iomap_ops ext4_iomap_report_ops;
+extern const struct iomap_ops ext4_iomap_buffered_write_ops;
+extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
+extern const struct iomap_write_ops ext4_iomap_write_ops;
static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3ecc09f286e4..11fbc607d332 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -303,6 +303,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return count;
}
+static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ const struct iomap_ops *iomap_ops;
+
+ if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+ iomap_ops = &ext4_iomap_buffered_da_write_ops;
+ else
+ iomap_ops = &ext4_iomap_buffered_write_ops;
+
+ return iomap_file_buffered_write(iocb, from, iomap_ops,
+ &ext4_iomap_write_ops, NULL);
+}
+
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
@@ -317,7 +332,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
if (ret <= 0)
goto out;
- ret = generic_perform_write(iocb, from);
+ if (ext4_inode_buffered_iomap(inode))
+ ret = ext4_iomap_buffered_write(iocb, from);
+ else
+ ret = generic_perform_write(iocb, from);
out:
inode_unlock(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c9489978358e..da4fd62c6963 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3065,7 +3065,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
return ret;
}
-static int ext4_nonda_switch(struct super_block *sb)
+int ext4_nonda_switch(struct super_block *sb)
{
s64 free_clusters, dirty_clusters;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3462,6 +3462,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
}
+static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
+{
+ return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
+}
+
+const struct iomap_write_ops ext4_iomap_write_ops = {
+ .iomap_valid = ext4_iomap_valid,
+};
+
static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
struct ext4_map_blocks *map, loff_t offset,
loff_t length, unsigned int flags)
@@ -3496,6 +3505,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
iomap->flags |= IOMAP_F_MERGED;
+ iomap->validity_cookie = map->m_seq;
+
/*
* Flags passed to ext4_map_blocks() for direct I/O writes can result
* in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
@@ -3903,8 +3914,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
+/* Map blocks */
+typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
+
static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
- loff_t length, struct ext4_map_blocks *map)
+ loff_t length, ext4_get_blocks_t get_blocks,
+ struct ext4_map_blocks *map)
{
u8 blkbits = inode->i_blkbits;
@@ -3916,6 +3931,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
+ if (get_blocks)
+ return get_blocks(inode, map);
+
return ext4_map_blocks(NULL, inode, map, 0);
}
@@ -3933,7 +3951,91 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
- ret = ext4_iomap_map_blocks(inode, offset, length, &map);
+ ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
+ if (ret < 0)
+ return ret;
+
+ ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+ return 0;
+}
+
+static int ext4_iomap_get_blocks(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ loff_t i_size = i_size_read(inode);
+ handle_t *handle;
+ int ret, needed_blocks;
+
+ /*
+ * Check if the blocks have already been allocated, this could
+ * avoid initiating a new journal transaction and return the
+ * mapping information directly.
+ */
+ if ((map->m_lblk + map->m_len) <=
+ round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
+ ret = ext4_map_blocks(NULL, inode, map, 0);
+ if (ret < 0)
+ return ret;
+ if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
+ EXT4_MAP_DELAYED))
+ return 0;
+ }
+
+ /*
+ * Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason.
+ */
+ needed_blocks = ext4_chunk_trans_blocks(inode, map->m_len) + 1;
+ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map,
+ EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+ /*
+ * We have to stop handle here for two reasons.
+ *
+ * - One is a potential deadlock caused by the subsequent call to
+ * balance_dirty_pages(). It may wait for the dirty pages to be
+ * written back, which could initiate another handle and cause it
+ * to wait for the current one to complete.
+ *
+ * - Another one is that we cannot hole lock folio under an active
+ * handle because the lock order of iomap is always acquires the
+ * folio lock before starting a new handle; otherwise, this could
+ * cause a potential deadlock.
+ */
+ ext4_journal_stop(handle);
+
+ return ret;
+}
+
+static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap, bool delalloc)
+{
+ int ret, retries = 0;
+ struct ext4_map_blocks map;
+ ext4_get_blocks_t *get_blocks;
+
+ ret = ext4_emergency_state(inode->i_sb);
+ if (unlikely(ret))
+ return ret;
+
+ /* Inline data support is not yet available. */
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
+ if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
+ return -EINVAL;
+
+ if (delalloc)
+ get_blocks = ext4_da_map_blocks;
+ else
+ get_blocks = ext4_iomap_get_blocks;
+retry:
+ ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
if (ret < 0)
return ret;
@@ -3941,6 +4043,71 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
return 0;
}
+static int ext4_iomap_buffered_write_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
+ iomap, srcmap, false);
+}
+
+static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
+ iomap, srcmap, true);
+}
+
+/*
+ * Drop the staled delayed allocation range from the write failure,
+ * including both start and end blocks. If not, we could leave a range
+ * of delayed extents covered by a clean folio, it could lead to
+ * inaccurate space reservation.
+ */
+static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
+ loff_t length, struct iomap *iomap)
+{
+ down_write(&EXT4_I(inode)->i_data_sem);
+ ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
+ DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
+ up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
+ loff_t length, ssize_t written,
+ unsigned int flags,
+ struct iomap *iomap)
+{
+ loff_t start_byte, end_byte;
+
+ /* If we didn't reserve the blocks, we're not allowed to punch them. */
+ if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
+ return 0;
+
+ /* Nothing to do if we've written the entire delalloc extent */
+ start_byte = iomap_last_written_block(inode, offset, written);
+ end_byte = round_up(offset + length, i_blocksize(inode));
+ if (start_byte >= end_byte)
+ return 0;
+
+ filemap_invalidate_lock(inode->i_mapping);
+ iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
+ iomap, ext4_iomap_punch_delalloc);
+ filemap_invalidate_unlock(inode->i_mapping);
+ return 0;
+}
+
+
+const struct iomap_ops ext4_iomap_buffered_write_ops = {
+ .iomap_begin = ext4_iomap_buffered_write_begin,
+};
+
+const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
+ .iomap_begin = ext4_iomap_buffered_da_write_begin,
+ .iomap_end = ext4_iomap_buffered_da_write_end,
+};
+
const struct iomap_ops ext4_iomap_buffered_read_ops = {
.iomap_begin = ext4_iomap_buffered_read_begin,
};
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 69eb63dde983..b68509505558 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
* -> page lock -> i_data_sem (rw)
*
* buffered write path:
- * sb_start_write -> i_mutex -> mmap_lock
- * sb_start_write -> i_mutex -> transaction start -> page lock ->
- * i_data_sem (rw)
+ * sb_start_write -> i_rwsem (w) -> mmap_lock
+ * - buffer_head path:
+ * sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
+ * i_data_sem (rw)
+ * - iomap path:
+ * sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
+ * sb_start_write -> i_rwsem (w) -> folio lock
*
* truncate:
* sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 13/22] ext4: implement writeback iomap path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (11 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 12/22] ext4: implement buffered write iomap path Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 14/22] ext4: implement mmap " Zhang Yi
` (9 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Implement the iomap writeback path for ext4. It implement
ext4_iomap_writepages(), introduce a new iomap_writeback_ops instance,
ext4_writeback_ops, and create a new end I/O extent conversion worker to
convert unwritten extents after the I/O is completed.
In the ->writeback_range() callback, it first call
ext4_iomap_map_writeback_range() to query the longest range of existing
mapped extents. For performance considerations, if the block range has
not been allocated, it attempts to allocate a range of longest blocks
which is based on the writeback length and the delalloc extent length,
rather than allocating for a single folio length at a time. Then, add
the folio to the iomap_ioend instance.
In the ->writeback_submit() callback, it registers a special end bio
callback, ext4_iomap_end_bio(), which will start a worker if we need to
convert unwritten extents or need to update i_disksize after the data
has been written back, and if we need to abort the journal when the I/O
is failed to write back.
Key notes:
- Since we aim to allocate a range of blocks as long as possible within
the writeback length for each invocation of ->writeback_range()
callback, we may allocate a long range but write less in certain
corner cases. Therefore, we have to ignore the dioread_nolock mount
option and always allocate unwritten blocks. This is consistent with
the non-delayed buffer write process.
- Since ->writeback_range() is always executed under the folio lock,
this means we need to start the handle under the folio lock as well.
This is opposite to the order in the buffer_head writeback path.
Therefore, we cannot use the ordered data mode to write back data,
otherwise it would cause a deadlock. Fortunately, since we always
allocate unwritten extents when allocating blocks, the functionality
of the ordered data mode is already quite limited and can be replaced
by other methods.
- Since we don't use ordered data mode, the deadlock problem that was
expected to be resolved through the reserve handle does not exists
here. Therefore, we also do not need to use the reserve handle when
converting the unwritten extent in the end I/O worker, we can start a
normal journal handle instead.
- Since we always allocate unwritten blocks, we also delay updating the
i_disksize until the I/O is done, which could prevent the exposure of
zero data that may occur during a system crash while performing
buffer append writes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 4 +
fs/ext4/inode.c | 213 +++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/page-io.c | 119 ++++++++++++++++++++++++++
fs/ext4/super.c | 7 +-
4 files changed, 341 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 89059b15ee5c..520f6d5dcdab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1176,6 +1176,8 @@ struct ext4_inode_info {
*/
struct list_head i_rsv_conversion_list;
struct work_struct i_rsv_conversion_work;
+ struct list_head i_iomap_ioend_list;
+ struct work_struct i_iomap_ioend_work;
/*
* Transactions that contain inode's metadata needed to complete
@@ -3874,6 +3876,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
size_t len);
extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
+extern void ext4_iomap_end_io(struct work_struct *work);
+extern void ext4_iomap_end_bio(struct bio *bio);
/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index da4fd62c6963..4a7d18511c3f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -44,6 +44,7 @@
#include <linux/iversion.h>
#include "ext4_jbd2.h"
+#include "ext4_extents.h"
#include "xattr.h"
#include "acl.h"
#include "truncate.h"
@@ -4123,10 +4124,220 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
}
+struct ext4_writeback_ctx {
+ struct iomap_writepage_ctx ctx;
+ unsigned int data_seq;
+};
+
+static int ext4_iomap_map_one_extent(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ struct extent_status es;
+ handle_t *handle = NULL;
+ int credits, map_flags;
+ int retval;
+
+ credits = ext4_chunk_trans_blocks(inode, map->m_len);
+ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ map->m_flags = 0;
+ /*
+ * It is necessary to look up extent and map blocks under i_data_sem
+ * in write mode, otherwise, the delalloc extent may become stale
+ * during concurrent truncate operations.
+ */
+ ext4_fc_track_inode(handle, inode);
+ down_write(&EXT4_I(inode)->i_data_sem);
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
+ retval = es.es_len - (map->m_lblk - es.es_lblk);
+ map->m_len = min_t(unsigned int, retval, map->m_len);
+
+ if (ext4_es_is_delayed(&es)) {
+ map->m_flags |= EXT4_MAP_DELAYED;
+ trace_ext4_da_write_pages_extent(inode, map);
+ /*
+ * Call ext4_map_create_blocks() to allocate any
+ * delayed allocation blocks. It is possible that
+ * we're going to need more metadata blocks, however
+ * we must not fail because we're in writeback and
+ * there is nothing we can do so it might result in
+ * data loss. So use reserved blocks to allocate
+ * metadata if possible.
+ */
+ map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
+ EXT4_GET_BLOCKS_METADATA_NOFAIL |
+ EXT4_EX_NOCACHE;
+
+ retval = ext4_map_create_blocks(handle, inode, map,
+ map_flags);
+ if (retval > 0)
+ ext4_fc_track_range(handle, inode, map->m_lblk,
+ map->m_lblk + map->m_len - 1);
+ goto out;
+ } else if (unlikely(ext4_es_is_hole(&es)))
+ goto out;
+
+ /* Found written or unwritten extent. */
+ map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
+ map->m_flags = ext4_es_is_written(&es) ?
+ EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
+ goto out;
+ }
+
+ retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
+out:
+ up_write(&EXT4_I(inode)->i_data_sem);
+ ext4_journal_stop(handle);
+ return retval < 0 ? retval : 0;
+}
+
+static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
+ loff_t offset, unsigned int dirty_len)
+{
+ struct ext4_writeback_ctx *ewpc =
+ container_of(wpc, struct ext4_writeback_ctx, ctx);
+ struct inode *inode = wpc->inode;
+ struct super_block *sb = inode->i_sb;
+ struct journal_s *journal = EXT4_SB(sb)->s_journal;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_map_blocks map;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int index = offset >> blkbits;
+ unsigned int blk_end, blk_len;
+ int ret;
+
+ ret = ext4_emergency_state(sb);
+ if (unlikely(ret))
+ return ret;
+
+ /* Check validity of the cached writeback mapping. */
+ if (offset >= wpc->iomap.offset &&
+ offset < wpc->iomap.offset + wpc->iomap.length &&
+ ewpc->data_seq == READ_ONCE(ei->i_es_seq))
+ return 0;
+
+ blk_len = dirty_len >> blkbits;
+ blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
+ (UINT_MAX - 1));
+ if (blk_end > index + blk_len)
+ blk_len = blk_end - index + 1;
+
+retry:
+ map.m_lblk = index;
+ map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
+ ret = ext4_map_blocks(NULL, inode, &map,
+ EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * The map is not a delalloc extent, it must either be a hole
+ * or an extent which have already been allocated.
+ */
+ if (!(map.m_flags & EXT4_MAP_DELAYED))
+ goto out;
+
+ /* Map one delalloc extent. */
+ ret = ext4_iomap_map_one_extent(inode, &map);
+ if (ret < 0) {
+ if (ext4_emergency_state(sb))
+ return ret;
+
+ /*
+ * Retry transient ENOSPC errors, if
+ * ext4_count_free_blocks() is non-zero, a commit
+ * should free up blocks.
+ */
+ if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
+ jbd2_journal_force_commit_nested(journal);
+ goto retry;
+ }
+
+ ext4_msg(sb, KERN_CRIT,
+ "Delayed block allocation failed for inode %lu at logical offset %llu with max blocks %u with error %d",
+ inode->i_ino, (unsigned long long)map.m_lblk,
+ (unsigned int)map.m_len, -ret);
+ ext4_msg(sb, KERN_CRIT,
+ "This should not happen!! Data will be lost\n");
+ if (ret == -ENOSPC)
+ ext4_print_free_blocks(inode);
+ return ret;
+ }
+out:
+ ewpc->data_seq = map.m_seq;
+ ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
+ return 0;
+}
+
+static void ext4_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+ struct inode *inode = folio->mapping->host;
+ loff_t length = folio_pos(folio) + folio_size(folio) - pos;
+
+ ext4_iomap_punch_delalloc(inode, pos, length, NULL);
+}
+
+static ssize_t ext4_iomap_writeback_range(struct iomap_writepage_ctx *wpc,
+ struct folio *folio, u64 offset,
+ unsigned int len, u64 end_pos)
+{
+ ssize_t ret;
+
+ ret = ext4_iomap_map_writeback_range(wpc, offset, len);
+ if (!ret)
+ ret = iomap_add_to_ioend(wpc, folio, offset, end_pos, len);
+ if (ret < 0)
+ ext4_iomap_discard_folio(folio, offset);
+ return ret;
+}
+
+static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
+ int error)
+{
+ struct iomap_ioend *ioend = wpc->wb_ctx;
+ struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+
+ /* Need to convert unwritten extents when I/Os are completed. */
+ if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
+ ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
+ ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+
+ return iomap_ioend_writeback_submit(wpc, error);
+}
+
+static const struct iomap_writeback_ops ext4_writeback_ops = {
+ .writeback_range = ext4_iomap_writeback_range,
+ .writeback_submit = ext4_iomap_writeback_submit,
+};
+
static int ext4_iomap_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return 0;
+ struct inode *inode = mapping->host;
+ struct super_block *sb = inode->i_sb;
+ long nr = wbc->nr_to_write;
+ int alloc_ctx, ret;
+ struct ext4_writeback_ctx ewpc = {
+ .ctx = {
+ .inode = inode,
+ .wbc = wbc,
+ .ops = &ext4_writeback_ops,
+ },
+ };
+
+ ret = ext4_emergency_state(sb);
+ if (unlikely(ret))
+ return ret;
+
+ alloc_ctx = ext4_writepages_down_read(sb);
+ trace_ext4_writepages(inode, wbc);
+ ret = iomap_writepages(&ewpc.ctx);
+ trace_ext4_writepages_result(inode, wbc, ret, nr - wbc->nr_to_write);
+ ext4_writepages_up_read(sb, alloc_ctx);
+
+ return ret;
}
/*
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index a8c95eee91b7..d74aa430636f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -23,6 +23,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/kernel.h>
+#include <linux/iomap.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
@@ -592,3 +593,121 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
return 0;
}
+
+static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
+{
+ struct inode *inode = ioend->io_inode;
+ struct super_block *sb = inode->i_sb;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ loff_t pos = ioend->io_offset;
+ size_t size = ioend->io_size;
+ loff_t new_disksize;
+ handle_t *handle;
+ int credits;
+ int ret, err;
+
+ ret = blk_status_to_errno(ioend->io_bio.bi_status);
+ if (unlikely(ret)) {
+ if (test_opt(sb, DATA_ERR_ABORT))
+ jbd2_journal_abort(EXT4_SB(sb)->s_journal, ret);
+ goto out;
+ }
+
+ /* We may need to convert one extent and dirty the inode. */
+ credits = ext4_chunk_trans_blocks(inode,
+ EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
+ handle = ext4_journal_start(inode, EXT4_HT_EXT_CONVERT, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_err;
+ }
+
+ if (ioend->io_flags & IOMAP_IOEND_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(handle, inode, pos, size);
+ if (ret)
+ goto out_journal;
+ }
+
+ /*
+ * Update on-disk size after IO is completed. Races with
+ * truncate are avoided by checking i_size under i_data_sem.
+ */
+ new_disksize = pos + size;
+ if (new_disksize > READ_ONCE(ei->i_disksize)) {
+ down_write(&ei->i_data_sem);
+ new_disksize = min(new_disksize, i_size_read(inode));
+ if (new_disksize > ei->i_disksize)
+ ei->i_disksize = new_disksize;
+ up_write(&ei->i_data_sem);
+ ret = ext4_mark_inode_dirty(handle, inode);
+ if (ret)
+ EXT4_ERROR_INODE_ERR(inode, -ret,
+ "Failed to mark inode dirty");
+ }
+
+out_journal:
+ err = ext4_journal_stop(handle);
+ if (!ret)
+ ret = err;
+out_err:
+ if (ret < 0 && !ext4_emergency_state(sb)) {
+ ext4_msg(sb, KERN_EMERG,
+ "failed to convert unwritten extents to written extents or update inode size -- potential data loss! (inode %lu, error %d)",
+ inode->i_ino, ret);
+ }
+out:
+ iomap_finish_ioends(ioend, ret);
+}
+
+/*
+ * Work on buffered iomap completed IO, to convert unwritten extents to
+ * mapped extents
+ */
+void ext4_iomap_end_io(struct work_struct *work)
+{
+ struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
+ i_iomap_ioend_work);
+ struct iomap_ioend *ioend;
+ struct list_head ioend_list;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ list_replace_init(&ei->i_iomap_ioend_list, &ioend_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+
+ iomap_sort_ioends(&ioend_list);
+ while (!list_empty(&ioend_list)) {
+ ioend = list_entry(ioend_list.next, struct iomap_ioend, io_list);
+ list_del_init(&ioend->io_list);
+ iomap_ioend_try_merge(ioend, &ioend_list);
+ ext4_iomap_finish_ioend(ioend);
+ }
+}
+
+void ext4_iomap_end_bio(struct bio *bio)
+{
+ struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+ struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+ struct ext4_sb_info *sbi = EXT4_SB(ioend->io_inode->i_sb);
+ unsigned long flags;
+ int ret;
+
+ /* Needs to convert unwritten extents or update the i_disksize. */
+ if ((ioend->io_flags & IOMAP_IOEND_UNWRITTEN) ||
+ ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
+ goto defer;
+
+ /* Needs to abort the journal on data_err=abort. */
+ ret = blk_status_to_errno(ioend->io_bio.bi_status);
+ if (unlikely(ret) && test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
+ goto defer;
+
+ iomap_finish_ioends(ioend, ret);
+ return;
+defer:
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ if (list_empty(&ei->i_iomap_ioend_list))
+ queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
+ list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b68509505558..cffe63deba31 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -123,7 +123,10 @@ static const struct fs_parameter_spec ext4_param_specs[];
* sb_start_write -> i_mutex -> transaction start -> i_data_sem (rw)
*
* writepages:
- * transaction start -> page lock(s) -> i_data_sem (rw)
+ * - buffer_head path:
+ * transaction start -> folio lock(s) -> i_data_sem (rw)
+ * - iomap path:
+ * folio lock -> transaction start -> i_data_sem (rw)
*/
static const struct fs_context_operations ext4_context_ops = {
@@ -1426,10 +1429,12 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
#endif
ei->jinode = NULL;
INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
+ INIT_LIST_HEAD(&ei->i_iomap_ioend_list);
spin_lock_init(&ei->i_completed_io_lock);
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+ INIT_WORK(&ei->i_iomap_ioend_work, ext4_iomap_end_io);
ext4_fc_init_inode(&ei->vfs_inode);
spin_lock_init(&ei->i_fc_lock);
return &ei->vfs_inode;
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 14/22] ext4: implement mmap iomap path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (12 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 13/22] ext4: implement writeback " Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 15/22] iomap: correct the range of a partial dirty clear Zhang Yi
` (8 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Introduce ext4_iomap_page_mkwrite() to implement the mmap iomap path for
ext4. Most of this work is delegated to iomap_page_mkwrite(), which only
needs to be called with ext4_iomap_buffer_write_ops and
ext4_iomap_buffer_da_write_ops as arguments to allocate and map the
blocks. However, the lock ordering of the folio lock and transaction
start is the opposite of that in the buffer_head buffered write path,
update the locking document accordingly.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++++-
fs/ext4/super.c | 8 ++++++--
2 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4a7d18511c3f..0d2852159fa3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4026,7 +4026,7 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
/* Inline data support is not yet available. */
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
- if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
+ if (WARN_ON_ONCE(!(flags & (IOMAP_WRITE | IOMAP_FAULT))))
return -EINVAL;
if (delalloc)
@@ -4086,6 +4086,14 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
return 0;
+ /*
+ * iomap_page_mkwrite() will never fail in a way that requires delalloc
+ * extents that it allocated to be revoked. Hence never try to release
+ * them here.
+ */
+ if (flags & IOMAP_FAULT)
+ return 0;
+
/* Nothing to do if we've written the entire delalloc extent */
start_byte = iomap_last_written_block(inode, offset, written);
end_byte = round_up(offset + length, i_blocksize(inode));
@@ -7135,6 +7143,23 @@ static int ext4_block_page_mkwrite(struct inode *inode, struct folio *folio,
return ret;
}
+static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ const struct iomap_ops *iomap_ops;
+
+ /*
+ * ext4_nonda_switch() could writeback this folio, so have to
+ * call it before lock folio.
+ */
+ if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+ iomap_ops = &ext4_iomap_buffered_da_write_ops;
+ else
+ iomap_ops = &ext4_iomap_buffered_write_ops;
+
+ return iomap_page_mkwrite(vmf, iomap_ops, NULL);
+}
+
vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -7157,6 +7182,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
filemap_invalidate_lock_shared(mapping);
+ if (ext4_inode_buffered_iomap(inode)) {
+ ret = ext4_iomap_page_mkwrite(vmf);
+ goto out;
+ }
+
err = ext4_convert_inline_data(inode);
if (err)
goto out_ret;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index cffe63deba31..4bb77703ffe1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -100,8 +100,12 @@ static const struct fs_parameter_spec ext4_param_specs[];
* Lock ordering
*
* page fault path:
- * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) -> transaction start
- * -> page lock -> i_data_sem (rw)
+ * - buffer_head path:
+ * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
+ * transaction start -> folio lock -> i_data_sem (rw)
+ * - iomap path:
+ * mmap_lock -> sb_start_pagefault -> invalidate_lock (r) ->
+ * folio lock -> transaction start -> i_data_sem (rw)
*
* buffered write path:
* sb_start_write -> i_rwsem (w) -> mmap_lock
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 15/22] iomap: correct the range of a partial dirty clear
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (13 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 14/22] ext4: implement mmap " Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 16/22] iomap: support invalidating partial folios Zhang Yi
` (7 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
The block range calculation in ifs_clear_range_dirty() is incorrect when
partial clear a range in a folio. We can't clear the dirty bit of the
first block or the last block if the start or end offset is blocksize
unaligned, this has not yet caused any issue since we always clear a
whole folio in iomap_writeback_folio().
Fix this by round up the first block and round down the last block,
correct the calculation of nr_blks.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
This is modified from:
https://lore.kernel.org/linux-fsdevel/20240812121159.3775074-2-yi.zhang@huaweicloud.com/
Changes:
- Use round_up() instead of DIV_ROUND_UP() to prevent wasted integer
division.
fs/iomap/buffered-io.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 154456e39fe5..3c8e085e79cf 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -167,11 +167,15 @@ static void ifs_clear_range_dirty(struct folio *folio,
{
struct inode *inode = folio->mapping->host;
unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);
- unsigned int first_blk = (off >> inode->i_blkbits);
- unsigned int last_blk = (off + len - 1) >> inode->i_blkbits;
- unsigned int nr_blks = last_blk - first_blk + 1;
+ unsigned int first_blk = round_up(off, i_blocksize(inode)) >>
+ inode->i_blkbits;
+ unsigned int last_blk = (off + len) >> inode->i_blkbits;
+ unsigned int nr_blks = last_blk - first_blk;
unsigned long flags;
+ if (!nr_blks)
+ return;
+
spin_lock_irqsave(&ifs->state_lock, flags);
bitmap_clear(ifs->state, first_blk + blks_per_folio, nr_blks);
spin_unlock_irqrestore(&ifs->state_lock, flags);
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 16/22] iomap: support invalidating partial folios
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (14 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 15/22] iomap: correct the range of a partial dirty clear Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path Zhang Yi
` (6 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Current iomap_invalidate_folio() can only invalidate an entire folio. If
we truncate a partial folio on a filesystem where the block size is
smaller than the folio size, it will leave behind dirty bits from the
truncated or punched blocks. During the write-back process, it will
attempt to map the invalid hole range. Fortunately, this has not
caused any real problems so far because the ->writeback_range() function
corrects the length.
However, the implementation of FALLOC_FL_ZERO_RANGE in ext4 depends on
the support for invalidating partial folios. When ext4 partially zeroes
out a dirty and unwritten folio, it does not perform a flush first like
XFS. Therefore, if the dirty bits of the corresponding area cannot be
cleared, the zeroed area after writeback remains in the written state
rather than reverting to the unwritten state. Therefore, fix this by
supporting invalidating partial folios.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
This is cherry picked form:
https://lore.kernel.org/linux-fsdevel/20240812121159.3775074-3-yi.zhang@huaweicloud.com/
No code changes, only update the commit message to explain why Ext4
needs this.
fs/iomap/buffered-io.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3c8e085e79cf..d4dd1874a471 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -744,6 +744,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
WARN_ON_ONCE(folio_test_writeback(folio));
folio_cancel_dirty(folio);
ifs_free(folio);
+ } else {
+ iomap_clear_range_dirty(folio, offset, len);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (15 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 16/22] iomap: support invalidating partial folios Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-04 0:21 ` kernel test robot
2026-02-03 6:25 ` [PATCH -next v2 18/22] ext4: do not order data for inodes using buffered " Zhang Yi
` (5 subsequent siblings)
22 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
ext4_iomap_block_zero_range() to implement the iomap block zeroing range
for ext4. ext4_iomap_block_zero_range() invokes iomap_zero_range() and
passes ext4_iomap_zero_begin() to locate and zero out a mapped partial
block or a dirty, unwritten partial block.
Note that zeroing out under an active handle can cause deadlock since
the order of acquiring the folio lock and starting a handle is
inconsistent with the iomap iteration procedure. Therefore,
ext4_iomap_block_zero_range() cannot be called under an active handle.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0d2852159fa3..c59f3adba0f3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4107,6 +4107,50 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
return 0;
}
+static int ext4_iomap_zero_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+ unsigned int iomap_flags = 0;
+ int ret;
+
+ ret = ext4_emergency_state(inode->i_sb);
+ if (unlikely(ret))
+ return ret;
+
+ if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
+ return -EINVAL;
+
+ ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Look up dirty folios for unwritten mappings within EOF. Providing
+ * this bypasses the flush iomap uses to trigger extent conversion
+ * when unwritten mappings have dirty pagecache in need of zeroing.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ loff_t offset = ((loff_t)map.m_lblk) << blkbits;
+ loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
+
+ iomap_fill_dirty_folios(iter, &offset, end, &iomap_flags);
+ if ((offset >> blkbits) < map.m_lblk + map.m_len)
+ map.m_len = (offset >> blkbits) - map.m_lblk;
+ }
+
+ ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+ iomap->flags |= iomap_flags;
+
+ return 0;
+}
+
+const struct iomap_ops ext4_iomap_zero_ops = {
+ .iomap_begin = ext4_iomap_zero_begin,
+};
const struct iomap_ops ext4_iomap_buffered_write_ops = {
.iomap_begin = ext4_iomap_buffered_write_begin,
@@ -4622,6 +4666,32 @@ static int ext4_journalled_block_zero_range(struct inode *inode, loff_t from,
return err;
}
+static int ext4_iomap_block_zero_range(struct inode *inode, loff_t from,
+ loff_t length, bool *did_zero)
+{
+ /*
+ * Zeroing out under an active handle can cause deadlock since
+ * the order of acquiring the folio lock and starting a handle is
+ * inconsistent with the iomap writeback procedure.
+ */
+ if (WARN_ON_ONCE(ext4_handle_valid(journal_current_handle())))
+ return -EINVAL;
+
+ /* The zeroing scope should not extend across a block. */
+ if (WARN_ON_ONCE((from >> inode->i_blkbits) !=
+ ((from + length - 1) >> inode->i_blkbits)))
+ return -EINVAL;
+
+ if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_ORPHAN_FS) &&
+ !(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
+ WARN_ON_ONCE(!inode_is_locked(inode) &&
+ !rwsem_is_locked(&inode->i_mapping->invalidate_lock));
+
+ return iomap_zero_range(inode, from, length, did_zero,
+ &ext4_iomap_zero_ops,
+ &ext4_iomap_write_ops, NULL);
+}
+
/*
* ext4_block_zero_page_range() zeros out a mapping of length 'length'
* starting from file offset 'from'. The range to be zero'd must
@@ -4650,6 +4720,9 @@ static int ext4_block_zero_page_range(struct address_space *mapping,
} else if (ext4_should_journal_data(inode)) {
return ext4_journalled_block_zero_range(inode, from,
length, did_zero);
+ } else if (ext4_inode_buffered_iomap(inode)) {
+ return ext4_iomap_block_zero_range(inode, from, length,
+ did_zero);
}
return ext4_block_zero_range(inode, from, length, did_zero);
}
@@ -5063,6 +5136,18 @@ int ext4_truncate(struct inode *inode)
err = zero_len;
goto out_trace;
}
+ /*
+ * inodes using the iomap buffered I/O path do not use the
+ * ordered data mode, it is necessary to write out zeroed data
+ * before the updating i_disksize transaction is committed.
+ */
+ if (zero_len > 0 && ext4_inode_buffered_iomap(inode)) {
+ err = filemap_write_and_wait_range(mapping,
+ inode->i_size,
+ inode->i_size + zero_len - 1);
+ if (err)
+ return err;
+ }
}
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 18/22] ext4: do not order data for inodes using buffered iomap path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (16 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 19/22] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
` (4 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
In the iomap buffered I/O path, we always allocate unwritten blocks when
doing append write, and flush data when doing partial block truncate
down. Therefore, there is no risk of exposing stale data, disable
ordered data mode for the iomap buffered I/O path.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4_jbd2.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 63d17c5201b5..7061c7188053 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
static inline int ext4_should_order_data(struct inode *inode)
{
- return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
+ /*
+ * inodes using the iomap buffered I/O path do not use the
+ * ordered data mode.
+ */
+ return !ext4_inode_buffered_iomap(inode) &&
+ (ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
}
static inline int ext4_should_writeback_data(struct inode *inode)
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 19/22] ext4: add block mapping tracepoints for iomap buffered I/O path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (17 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 18/22] ext4: do not order data for inodes using buffered " Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 20/22] ext4: disable online defrag when inode using " Zhang Yi
` (3 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Add tracepoints for iomap buffered read, write, partial block zeroing
and writeback operations.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/inode.c | 6 +++++
include/trace/events/ext4.h | 45 +++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c59f3adba0f3..77dcca584153 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3956,6 +3956,8 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
if (ret < 0)
return ret;
+ trace_ext4_iomap_buffered_read_begin(inode, &map, offset, length,
+ flags);
ext4_set_iomap(inode, iomap, &map, offset, length, flags);
return 0;
}
@@ -4040,6 +4042,8 @@ static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
if (ret < 0)
return ret;
+ trace_ext4_iomap_buffered_write_begin(inode, &map, offset, length,
+ flags);
ext4_set_iomap(inode, iomap, &map, offset, length, flags);
return 0;
}
@@ -4142,6 +4146,7 @@ static int ext4_iomap_zero_begin(struct inode *inode,
map.m_len = (offset >> blkbits) - map.m_lblk;
}
+ trace_ext4_iomap_zero_begin(inode, &map, offset, length, flags);
ext4_set_iomap(inode, iomap, &map, offset, length, flags);
iomap->flags |= iomap_flags;
@@ -4319,6 +4324,7 @@ static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
}
out:
ewpc->data_seq = map.m_seq;
+ trace_ext4_iomap_map_writeback_range(inode, &map, offset, dirty_len, 0);
ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
return 0;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index a3e8fe414df8..1922df4190e7 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -3096,6 +3096,51 @@ TRACE_EVENT(ext4_move_extent_exit,
__entry->ret)
);
+DECLARE_EVENT_CLASS(ext4_set_iomap_class,
+ TP_PROTO(struct inode *inode, struct ext4_map_blocks *map,
+ loff_t offset, loff_t length, unsigned int flags),
+ TP_ARGS(inode, map, offset, length, flags),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(ext4_lblk_t, m_lblk)
+ __field(unsigned int, m_len)
+ __field(unsigned int, m_flags)
+ __field(u64, m_seq)
+ __field(loff_t, offset)
+ __field(loff_t, length)
+ __field(unsigned int, iomap_flags)
+ ),
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->m_lblk = map->m_lblk;
+ __entry->m_len = map->m_len;
+ __entry->m_flags = map->m_flags;
+ __entry->m_seq = map->m_seq;
+ __entry->offset = offset;
+ __entry->length = length;
+ __entry->iomap_flags = flags;
+
+ ),
+ TP_printk("dev %d:%d ino %lu m_lblk %u m_len %u m_flags %s m_seq %llu orig_off 0x%llx orig_len 0x%llx iomap_flags 0x%x",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino, __entry->m_lblk, __entry->m_len,
+ show_mflags(__entry->m_flags), __entry->m_seq,
+ __entry->offset, __entry->length, __entry->iomap_flags)
+)
+
+#define DEFINE_SET_IOMAP_EVENT(name) \
+DEFINE_EVENT(ext4_set_iomap_class, name, \
+ TP_PROTO(struct inode *inode, struct ext4_map_blocks *map, \
+ loff_t offset, loff_t length, unsigned int flags), \
+ TP_ARGS(inode, map, offset, length, flags))
+
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_read_begin);
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_buffered_write_begin);
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_map_writeback_range);
+DEFINE_SET_IOMAP_EVENT(ext4_iomap_zero_begin);
+
#endif /* _TRACE_EXT4_H */
/* This part must be outside protection */
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 20/22] ext4: disable online defrag when inode using iomap buffered I/O path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (18 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 19/22] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 21/22] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
` (2 subsequent siblings)
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Online defragmentation does not currently support inodes that using
iomap buffered I/O path, as it still relies on buffer_head for the
management of sub-folio blocks.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/move_extent.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index ce1f738dff93..fd8dabdfd962 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -476,6 +476,17 @@ static int mext_check_validity(struct inode *orig_inode,
return -EOPNOTSUPP;
}
+ /*
+ * TODO: support online defrag for inodes that using the buffered
+ * I/O iomap path.
+ */
+ if (ext4_inode_buffered_iomap(orig_inode) ||
+ ext4_inode_buffered_iomap(donor_inode)) {
+ ext4_msg(sb, KERN_ERR,
+ "Online defrag not supported for inode with iomap buffered IO path");
+ return -EOPNOTSUPP;
+ }
+
if (donor_inode->i_mode & (S_ISUID|S_ISGID)) {
ext4_debug("ext4 move extent: suid or sgid is set to donor file [ino:orig %lu, donor %lu]\n",
orig_inode->i_ino, donor_inode->i_ino);
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 21/22] ext4: partially enable iomap for the buffered I/O path of regular files
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (19 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 20/22] ext4: disable online defrag when inode using " Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 22/22] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
2026-02-03 6:43 ` [PATCH -next v2 00/22] ext4: use iomap for regular file's " Christoph Hellwig
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Partially enable iomap for the buffered I/O path of regular files. We
now support default filesystem features, mount options, and the bigalloc
feature. However, inline data, fs_verity, fs_crypt, online
defragmentation, and data=journal mode are not yet supported. Some of
these features are expected to be gradually supported in the future. The
filesystem will automatically fall back to the original buffered_head
path if these mount options or features are enabled.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/ext4_jbd2.c | 1 +
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 36 ++++++++++++++++++++++++++++++++++++
4 files changed, 39 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 520f6d5dcdab..259c6e780e65 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3064,6 +3064,7 @@ int ext4_walk_page_buffers(handle_t *handle,
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
void ext4_set_inode_mapping_order(struct inode *inode);
+void ext4_enable_buffered_iomap(struct inode *inode);
int ext4_nonda_switch(struct super_block *sb);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 05e5946ed9b3..f587bfbe8423 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -16,6 +16,7 @@ int ext4_inode_journal_mode(struct inode *inode)
ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
(ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
+ !ext4_inode_buffered_iomap(inode) &&
!test_opt(inode->i_sb, DELALLOC))) {
/* We do not support data journalling for encrypted data */
if (S_ISREG(inode->i_mode) && IS_ENCRYPTED(inode))
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b20a1bf866ab..dfa6f60f67b3 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1334,6 +1334,7 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
}
}
+ ext4_enable_buffered_iomap(inode);
ext4_set_inode_mapping_order(inode);
ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 77dcca584153..bbdd0bb3bc8b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -903,6 +903,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
if (ext4_has_inline_data(inode))
return -ERANGE;
+ /* inodes using the iomap buffered I/O path should not go here. */
+ if (WARN_ON_ONCE(ext4_inode_buffered_iomap(inode)))
+ return -EINVAL;
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;
@@ -2771,6 +2774,12 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
goto out_writepages;
+ /* inodes using the iomap buffered I/O path should not go here. */
+ if (WARN_ON_ONCE(ext4_inode_buffered_iomap(inode))) {
+ ret = -EINVAL;
+ goto out_writepages;
+ }
+
/*
* If the filesystem has aborted, it is read-only, so return
* right away instead of dumping stack traces later on that
@@ -5730,6 +5739,31 @@ static int check_igot_inode(struct inode *inode, ext4_iget_flags flags,
return -EFSCORRUPTED;
}
+void ext4_enable_buffered_iomap(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (!S_ISREG(inode->i_mode))
+ return;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
+ return;
+
+ /* Unsupported Features */
+ if (ext4_has_feature_inline_data(sb))
+ return;
+ if (ext4_has_feature_verity(sb))
+ return;
+ if (ext4_has_feature_encrypt(sb))
+ return;
+ if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
+ ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
+ return;
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+ return;
+
+ ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+}
+
void ext4_set_inode_mapping_order(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
@@ -6015,6 +6049,8 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (ret)
goto bad_inode;
+ ext4_enable_buffered_iomap(inode);
+
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH -next v2 22/22] ext4: introduce a mount option for iomap buffered I/O path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (20 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 21/22] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
@ 2026-02-03 6:25 ` Zhang Yi
2026-02-03 6:43 ` [PATCH -next v2 00/22] ext4: use iomap for regular file's " Christoph Hellwig
22 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 6:25 UTC (permalink / raw)
To: linux-ext4
Cc: linux-fsdevel, linux-kernel, tytso, adilger.kernel, jack, ojaswin,
ritesh.list, hch, djwong, yi.zhang, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Since the iomap buffered I/O path does not yet support all existing
features, it cannot be used by default. Introduce 'buffered_iomap' and
'nobuffered_iomap' mount options to enable and disable the iomap
buffered I/O path for regular files.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 2 ++
fs/ext4/super.c | 7 +++++++
3 files changed, 10 insertions(+)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 259c6e780e65..4e209c14dab9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1288,6 +1288,7 @@ struct ext4_inode_info {
* scanning in mballoc
*/
#define EXT4_MOUNT2_ABORT 0x00000100 /* Abort filesystem */
+#define EXT4_MOUNT2_BUFFERED_IOMAP 0x00000200 /* Use iomap for buffered I/O */
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bbdd0bb3bc8b..a3d7c98309bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5743,6 +5743,8 @@ void ext4_enable_buffered_iomap(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
+ if (!test_opt2(sb, BUFFERED_IOMAP))
+ return;
if (!S_ISREG(inode->i_mode))
return;
if (ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4bb77703ffe1..d967792c7cb1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1701,6 +1701,7 @@ enum {
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
+ Opt_buffered_iomap, Opt_nobuffered_iomap,
Opt_errors, Opt_data, Opt_data_err, Opt_jqfmt, Opt_dax_type,
#ifdef CONFIG_EXT4_DEBUG
Opt_fc_debug_max_replay, Opt_fc_debug_force
@@ -1839,6 +1840,8 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
fsparam_flag ("no_prefetch_block_bitmaps",
Opt_no_prefetch_block_bitmaps),
fsparam_s32 ("mb_optimize_scan", Opt_mb_optimize_scan),
+ fsparam_flag ("buffered_iomap", Opt_buffered_iomap),
+ fsparam_flag ("nobuffered_iomap", Opt_nobuffered_iomap),
fsparam_string ("check", Opt_removed), /* mount option from ext2/3 */
fsparam_flag ("nocheck", Opt_removed), /* mount option from ext2/3 */
fsparam_flag ("reservation", Opt_removed), /* mount option from ext2/3 */
@@ -1932,6 +1935,10 @@ static const struct mount_opts {
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
MOPT_SET},
+ {Opt_buffered_iomap, EXT4_MOUNT2_BUFFERED_IOMAP,
+ MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
+ {Opt_nobuffered_iomap, EXT4_MOUNT2_BUFFERED_IOMAP,
+ MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
#ifdef CONFIG_EXT4_DEBUG
{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
--
2.52.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
` (21 preceding siblings ...)
2026-02-03 6:25 ` [PATCH -next v2 22/22] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
@ 2026-02-03 6:43 ` Christoph Hellwig
2026-02-03 9:18 ` Zhang Yi
22 siblings, 1 reply; 56+ messages in thread
From: Christoph Hellwig @ 2026-02-03 6:43 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ojaswin, ritesh.list, hch, djwong, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
> Original Cover (Updated):
This really should always be first. The updates are rather minor
compared to the overview that the cover letter provides.
> Key notes on the iomap implementations in this series.
> - Don't use ordered data mode to prevent exposing stale data when
> performing append write and truncating down.
I can't parse this.
> - Override dioread_nolock mount option, always allocate unwritten
> extents for new blocks.
Why do you override it?
> - When performing write back, don't use reserved journal handle and
> postponing updating i_disksize until I/O is done.
Again missing the why and the implications.
> buffered write
> ==============
>
> buffer_head:
> bs write cache uncached write
> 1k 423 MiB/s 36.3 MiB/s
> 4k 1067 MiB/s 58.4 MiB/s
> 64k 4321 MiB/s 869 MiB/s
> 1M 4640 MiB/s 3158 MiB/s
>
> iomap:
> bs write cache uncached write
> 1k 403 MiB/s 57 MiB/s
> 4k 1093 MiB/s 61 MiB/s
> 64k 6488 MiB/s 1206 MiB/s
> 1M 7378 MiB/s 4818 MiB/s
This would read better if you actually compated buffered_head
vs iomap side by side.
What is the bs? The read unit size? I guess not the file system
block size as some of the values are too large for that.
Looks like iomap is faster, often much faster except for the
1k cached case, where it is slightly slower. Do you have
any idea why?
> buffered read
> =============
>
> buffer_head:
> bs read hole read cache read data
> 1k 635 MiB/s 661 MiB/s 605 MiB/s
> 4k 1987 MiB/s 2128 MiB/s 1761 MiB/s
> 64k 6068 MiB/s 9472 MiB/s 4475 MiB/s
> 1M 5471 MiB/s 8657 MiB/s 4405 MiB/s
>
> iomap:
> bs read hole read cache read data
> 1k 643 MiB/s 653 MiB/s 602 MiB/s
> 4k 2075 MiB/s 2159 MiB/s 1716 MiB/s
> 64k 6267 MiB/s 9545MiB/s 4451 MiB/s
> 1M 6072 MiB/s 9191MiB/s 4467 MiB/s
What is read cache vs read data here?
Otherwise same comments as for the write case.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-03 6:43 ` [PATCH -next v2 00/22] ext4: use iomap for regular file's " Christoph Hellwig
@ 2026-02-03 9:18 ` Zhang Yi
2026-02-03 13:14 ` Theodore Tso
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-03 9:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ojaswin, ritesh.list, djwong, Zhang Yi, yi.zhang,
yizhang089, libaokun1, yangerkun, yukuai
Hi, Christoph!
On 2/3/2026 2:43 PM, Christoph Hellwig wrote:
>> Original Cover (Updated):
>
> This really should always be first. The updates are rather minor
> compared to the overview that the cover letter provides.
>
>> Key notes on the iomap implementations in this series.
>> - Don't use ordered data mode to prevent exposing stale data when
>> performing append write and truncating down.
>
> I can't parse this.
Thank you for looking into this series, and sorry for the lack of
clarity. The reasons of these key notes have been described in
detail in patch 12-13.
This means that the ordered journal mode is no longer in ext4 used
under the iomap infrastructure. The main reason is that iomap
processes each folio one by one during writeback. It first holds the
folio lock and then starts a transaction to create the block mapping.
If we still use the ordered mode, we need to perform writeback in
the logging process, which may require initiating a new transaction,
potentially leading to deadlock issues. In addition, ordered journal
mode indeed has many synchronization dependencies, which increase
the risk of deadlocks, and I believe this is one of the reasons why
ext4_do_writepages() is implemented in such a complicated manner.
Therefore, I think we need to give up using the ordered data mode.
Currently, there are three scenarios where the ordered mode is used:
1) append write,
2) partial block truncate down, and
3) online defragmentation.
For append write, we can always allocate unwritten blocks to avoid
using the ordered journal mode. For partial block truncate down, we
can explicitly perform a write-back. The third case is the only one
that will be somewhat more complex. It needs to use the ordered mode
to ensure the atomicity of data copying and extents exchange when
exchanging extents and copying data between two files, preventing
data loss. Considering performance, we cannot explicitly perform a
writeback for each extent exchange. I have not yet thought of a
simple way to handle this. This will require consideration of other
solutions when supporting online defragmentation in the future.
>
>> - Override dioread_nolock mount option, always allocate unwritten
>> extents for new blocks.
>
> Why do you override it?
There are two reasons:
The first one is the previously mentioned reason of not using
ordered journal mode. To prevent exposing stale data during a power
failure that occurs while performing append writes, unwritten
extents are always requested for newly allocated blocks.
The second one is to consider performance during writeback. When
doing writeback, we should allocate blocks as long as possible when
first calling ->writeback_range() based on the writeback length,
rather than mapping each folio individually. Therefore, to avoid the
situation where more blocks are allocated than actually written
(which could cause fsck to complain), we cannot directly allocate
written blocks before performing writeback.
>
>> - When performing write back, don't use reserved journal handle and
>> postponing updating i_disksize until I/O is done.
>
> Again missing the why and the implications.
The reserved journal handle is used to solve deadlock issues in
transaction dependencies when writeback occurs in ordered journal
mode. This mechanism is no longer necessary if the ordered mode is
not used.
>
>> buffered write
>> ==============
>>
>> buffer_head:
>> bs write cache uncached write
>> 1k 423 MiB/s 36.3 MiB/s
>> 4k 1067 MiB/s 58.4 MiB/s
>> 64k 4321 MiB/s 869 MiB/s
>> 1M 4640 MiB/s 3158 MiB/s
>>
>> iomap:
>> bs write cache uncached write
>> 1k 403 MiB/s 57 MiB/s
>> 4k 1093 MiB/s 61 MiB/s
>> 64k 6488 MiB/s 1206 MiB/s
>> 1M 7378 MiB/s 4818 MiB/s
>
> This would read better if you actually compated buffered_head
> vs iomap side by side.
>
> What is the bs? The read unit size? I guess not the file system
> block size as some of the values are too large for that.
The 'bs' is the read/write unit size, and the fs block size is the
default 4KB.
>
> Looks like iomap is faster, often much faster except for the
> 1k cached case, where it is slightly slower. Do you have
> any idea why?
I observed the on-cpu flame graph. I think the main reason is the
buffer_head loop path detects the folio and buffer_head status.
It saves the uptodate flag in the buffer_head structure when the
first 1KB write for each 4KB folio, it doesn't need to get blocks
for the remaining three writes. However, the iomap infrastructure
always call ->iomap_begin() to acquire the mapping info for each
1KB write. Although the first call to ->iomap_begin() has already
allocated the block extent, there are still some overheads due to
synchronization operations such as locking when subsequent calls
are made. The smaller the unit size, the greater the impact, and
this will also have a greater impact on pure cache writes than on
uncached writes.
>
>> buffered read
>> =============
>>
>> buffer_head:
>> bs read hole read cache read data
>> 1k 635 MiB/s 661 MiB/s 605 MiB/s
>> 4k 1987 MiB/s 2128 MiB/s 1761 MiB/s
>> 64k 6068 MiB/s 9472 MiB/s 4475 MiB/s
>> 1M 5471 MiB/s 8657 MiB/s 4405 MiB/s
>>
>> iomap:
>> bs read hole read cache read data
>> 1k 643 MiB/s 653 MiB/s 602 MiB/s
>> 4k 2075 MiB/s 2159 MiB/s 1716 MiB/s
>> 64k 6267 MiB/s 9545MiB/s 4451 MiB/s
>> 1M 6072 MiB/s 9191MiB/s 4467 MiB/s
>
> What is read cache vs read data here?
>
The 'read cache' means that preread is set to 1 during fio tests,
causing it to read cached data. In contrast, the 'read data'
preread is set to 0, so it always reads data directly from the
disk.
Thanks,
Yi.
> Otherwise same comments as for the write case.
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-03 6:25 ` [PATCH -next v2 03/22] ext4: only order data when partially block truncating down Zhang Yi
@ 2026-02-03 9:59 ` Jan Kara
2026-02-04 6:42 ` Zhang Yi
2026-02-04 4:21 ` kernel test robot
2026-02-10 7:05 ` Ojaswin Mujoo
2 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-03 9:59 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ojaswin, ritesh.list, hch, djwong, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> Currently, __ext4_block_zero_page_range() is called in the following
> four cases to zero out the data in partial blocks:
>
> 1. Truncate down.
> 2. Truncate up.
> 3. Perform block allocation (e.g., fallocate) or append writes across a
> range extending beyond the end of the file (EOF).
> 4. Partial block punch hole.
>
> If the default ordered data mode is used, __ext4_block_zero_page_range()
> will write back the zeroed data to the disk through the order mode after
> zeroing out.
>
> Among the cases 1,2 and 3 described above, only case 1 actually requires
> this ordered write. Assuming no one intentionally bypasses the file
> system to write directly to the disk. When performing a truncate down
> operation, ensuring that the data beyond the EOF is zeroed out before
> updating i_disksize is sufficient to prevent old data from being exposed
> when the file is later extended. In other words, as long as the on-disk
> data in case 1 can be properly zeroed out, only the data in memory needs
> to be zeroed out in cases 2 and 3, without requiring ordered data.
Hum, I'm not sure this is correct. The tail block of the file is not
necessarily zeroed out beyond EOF (as mmap writes can race with page
writeback and modify the tail block contents beyond EOF before we really
submit it to the device). Thus after this commit if you truncate up, just
zero out the newly exposed contents in the page cache and dirty it, then
the transaction with the i_disksize update commits (I see nothing
preventing it) and then you crash, you can observe file with the new size
but non-zero content in the newly exposed area. Am I missing something?
> Case 4 does not require ordered data because the entire punch hole
> operation does not provide atomicity guarantees. Therefore, it's safe to
> move the ordered data operation from __ext4_block_zero_page_range() to
> ext4_truncate().
I agree hole punching can already expose intermediate results in case of
crash so there removing the ordered mode handling is safe.
Honza
> It should be noted that after this change, we can only determine whether
> to perform ordered data operations based on whether the target block has
> been zeroed, rather than on the state of the buffer head. Consequently,
> unnecessary ordered data operations may occur when truncating an
> unwritten dirty block. However, this scenario is relatively rare, so the
> overall impact is minimal.
>
> This is prepared for the conversion to the iomap infrastructure since it
> doesn't use ordered data mode and requires active writeback, which
> reduces the complexity of the conversion.
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f856ea015263..20b60abcf777 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> folio_zero_range(folio, offset, length);
> BUFFER_TRACE(bh, "zeroed end of block");
>
> - if (ext4_should_journal_data(inode)) {
> + if (ext4_should_journal_data(inode))
> err = ext4_dirty_journalled_data(handle, bh);
> - } else {
> + else
> mark_buffer_dirty(bh);
> - /*
> - * Only the written block requires ordered data to prevent
> - * exposing stale data.
> - */
> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
> - ext4_should_order_data(inode))
> - err = ext4_jbd2_inode_add_write(handle, inode, from,
> - length);
> - }
> if (!err && did_zero)
> *did_zero = true;
>
> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
> goto out_trace;
> }
>
> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
> - ext4_block_truncate_page(handle, mapping, inode->i_size);
> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
> + unsigned int zero_len;
> +
> + zero_len = ext4_block_truncate_page(handle, mapping,
> + inode->i_size);
> + if (zero_len < 0) {
> + err = zero_len;
> + goto out_stop;
> + }
> + if (zero_len && !IS_DAX(inode) &&
> + ext4_should_order_data(inode)) {
> + err = ext4_jbd2_inode_add_write(handle, inode,
> + inode->i_size, zero_len);
> + if (err)
> + goto out_stop;
> + }
> + }
>
> /*
> * We add the inode to the orphan list, so that if this
> --
> 2.52.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-03 9:18 ` Zhang Yi
@ 2026-02-03 13:14 ` Theodore Tso
2026-02-04 1:33 ` Zhang Yi
2026-02-04 1:59 ` Baokun Li
0 siblings, 2 replies; 56+ messages in thread
From: Theodore Tso @ 2026-02-03 13:14 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, linux-ext4, linux-fsdevel, linux-kernel,
adilger.kernel, jack, ojaswin, ritesh.list, djwong, Zhang Yi,
yizhang089, libaokun1, yangerkun, yukuai
On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> This means that the ordered journal mode is no longer in ext4 used
> under the iomap infrastructure. The main reason is that iomap
> processes each folio one by one during writeback. It first holds the
> folio lock and then starts a transaction to create the block mapping.
> If we still use the ordered mode, we need to perform writeback in
> the logging process, which may require initiating a new transaction,
> potentially leading to deadlock issues. In addition, ordered journal
> mode indeed has many synchronization dependencies, which increase
> the risk of deadlocks, and I believe this is one of the reasons why
> ext4_do_writepages() is implemented in such a complicated manner.
> Therefore, I think we need to give up using the ordered data mode.
>
> Currently, there are three scenarios where the ordered mode is used:
> 1) append write,
> 2) partial block truncate down, and
> 3) online defragmentation.
>
> For append write, we can always allocate unwritten blocks to avoid
> using the ordered journal mode.
This is going to be a pretty severe performance regression, since it
means that we will be doubling the journal load for append writes.
What we really need to do here is to first write out the data blocks,
and then only start the transaction handle to modify the data blocks
*after* the data blocks have been written (to heretofore, unused
blocks that were just allocated). It means inverting the order in
which we write data blocks for the append write case, and in fact it
will improve fsync() performance since we won't be gating writing the
commit block on the date blocks getting written out in the append
write case.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path
2026-02-03 6:25 ` [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path Zhang Yi
@ 2026-02-04 0:21 ` kernel test robot
0 siblings, 0 replies; 56+ messages in thread
From: kernel test robot @ 2026-02-04 0:21 UTC (permalink / raw)
To: Zhang Yi, linux-ext4
Cc: oe-kbuild-all, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ojaswin, ritesh.list, hch, djwong, yi.zhang, yi.zhang,
yizhang089, libaokun1, yangerkun, yukuai
Hi Zhang,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20260202]
url: https://github.com/intel-lab-lkp/linux/commits/Zhang-Yi/ext4-make-ext4_block_zero_page_range-pass-out-did_zero/20260203-144244
base: next-20260202
patch link: https://lore.kernel.org/r/20260203062523.3869120-18-yi.zhang%40huawei.com
patch subject: [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path
config: m68k-randconfig-r123-20260204 (https://download.01.org/0day-ci/archive/20260204/202602040854.1tmFmFGB-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 13.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260204/202602040854.1tmFmFGB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602040854.1tmFmFGB-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> fs/ext4/inode.c:4151:24: sparse: sparse: symbol 'ext4_iomap_zero_ops' was not declared. Should it be static?
fs/ext4/inode.c:4164:24: sparse: sparse: symbol 'ext4_iomap_buffered_read_ops' was not declared. Should it be static?
fs/ext4/inode.c:5135:32: sparse: sparse: unsigned value that used to be signed checked against zero?
fs/ext4/inode.c:5134:52: sparse: signed value source
vim +/ext4_iomap_zero_ops +4151 fs/ext4/inode.c
4150
> 4151 const struct iomap_ops ext4_iomap_zero_ops = {
4152 .iomap_begin = ext4_iomap_zero_begin,
4153 };
4154
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-03 13:14 ` Theodore Tso
@ 2026-02-04 1:33 ` Zhang Yi
2026-02-04 1:59 ` Baokun Li
1 sibling, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-04 1:33 UTC (permalink / raw)
To: Theodore Tso
Cc: Christoph Hellwig, linux-ext4, linux-fsdevel, linux-kernel,
adilger.kernel, jack, ojaswin, ritesh.list, djwong, Zhang Yi,
yizhang089, libaokun1, yangerkun, yukuai
Hi, Ted.
On 2/3/2026 9:14 PM, Theodore Tso wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> This means that the ordered journal mode is no longer in ext4 used
>> under the iomap infrastructure. The main reason is that iomap
>> processes each folio one by one during writeback. It first holds the
>> folio lock and then starts a transaction to create the block mapping.
>> If we still use the ordered mode, we need to perform writeback in
>> the logging process, which may require initiating a new transaction,
>> potentially leading to deadlock issues. In addition, ordered journal
>> mode indeed has many synchronization dependencies, which increase
>> the risk of deadlocks, and I believe this is one of the reasons why
>> ext4_do_writepages() is implemented in such a complicated manner.
>> Therefore, I think we need to give up using the ordered data mode.
>>
>> Currently, there are three scenarios where the ordered mode is used:
>> 1) append write,
>> 2) partial block truncate down, and
>> 3) online defragmentation.
>>
>> For append write, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
>
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.
Although this will double the journal load compared to directly
allocating written blocks, I think it will not result in significant
performance regression compared to the current append write process, as
this is consistent with the behavior after dioread_nolock is enabled by
default now.
> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the data blocks
> *after* the data blocks have been written (to heretofore, unused
> blocks that were just allocated). It means inverting the order in
> which we write data blocks for the append write case, and in fact it
> will improve fsync() performance since we won't be gating writing the
> commit block on the date blocks getting written out in the append
> write case.
>
Yeah, thank you for the suggestion. I agree with you. We are planning to
implement this next. Baokun is currently working to develop a POC. Our
current idea is to use inode PA (The benefit of using PA is that it can
avoid changes to disk metadata, and the pre-allocation operation can be
closed within the mb-allocater) to pre-allocate blocks before doing
writeback, and then map the actual written type extents after the data is
written, which would avoid this journal overhead of unwritten
allocations. At the same time, this could also lay the foundation for
future support of COW writes for reflinks.
> Cheers,
>
> - Ted
Thanks,
Yi.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-03 13:14 ` Theodore Tso
2026-02-04 1:33 ` Zhang Yi
@ 2026-02-04 1:59 ` Baokun Li
2026-02-04 14:23 ` Jan Kara
1 sibling, 1 reply; 56+ messages in thread
From: Baokun Li @ 2026-02-04 1:59 UTC (permalink / raw)
To: Theodore Tso, Zhang Yi
Cc: Christoph Hellwig, linux-ext4, linux-fsdevel, linux-kernel,
adilger.kernel, jack, ojaswin, ritesh.list, djwong, Zhang Yi,
yizhang089, yangerkun, yukuai, libaokun9, Baokun Li
On 2026-02-03 21:14, Theodore Tso wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> This means that the ordered journal mode is no longer in ext4 used
>> under the iomap infrastructure. The main reason is that iomap
>> processes each folio one by one during writeback. It first holds the
>> folio lock and then starts a transaction to create the block mapping.
>> If we still use the ordered mode, we need to perform writeback in
>> the logging process, which may require initiating a new transaction,
>> potentially leading to deadlock issues. In addition, ordered journal
>> mode indeed has many synchronization dependencies, which increase
>> the risk of deadlocks, and I believe this is one of the reasons why
>> ext4_do_writepages() is implemented in such a complicated manner.
>> Therefore, I think we need to give up using the ordered data mode.
>>
>> Currently, there are three scenarios where the ordered mode is used:
>> 1) append write,
>> 2) partial block truncate down, and
>> 3) online defragmentation.
>>
>> For append write, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.
> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the data blocks
> *after* the data blocks have been written (to heretofore, unused
> blocks that were just allocated). It means inverting the order in
> which we write data blocks for the append write case, and in fact it
> will improve fsync() performance since we won't be gating writing the
> commit block on the date blocks getting written out in the append
> write case.
I have some local demo patches doing something similar, and I think this
work could be decoupled from Yi's patch set.
Since inode preallocation (PA) maintains physical block occupancy with a
logical-to-physical mapping, and ensures on-disk data consistency after
power failure, it is an excellent location for recording temporary
occupancy. Furthermore, since inode PA often allocates more blocks than
requested, it can also help reduce file fragmentation.
The specific approach is as follows:
1. Allocate only the PA during block allocation without inserting it into
the extent status tree. Return the PA to the caller and increment its
refcount to prevent it from being discarded.
2. Issue IOs to the blocks within the inode PA. If IO fails, release the
refcount and return -EIO. If successful, proceed to the next step.
3. Start a handle upon successful IO completion to convert the inode PA to
extents. Release the refcount and update the extent tree.
4. If a corresponding extent already exists, we’ll need to punch holes to
release the old extent before inserting the new one.
This ensures data atomicity, while jbd2—being a COW-like implementation
itself—ensures metadata atomicity. By leveraging this "delay map"
mechanism, we can achieve several benefits:
* Lightweight, high-performance COW.
* High-performance software atomic writes (hardware-independent).
* Replacing dio_readnolock, which might otherwise read unexpected zeros.
* Replacing ordered data and data journal modes.
* Reduced handle hold time, as it's only held during extent tree updates.
* Paving the way for snapshot support.
Of course, COW itself can lead to severe file fragmentation, especially
in small-scale overwrite scenarios.
Perhaps I’ve overlooked something. What are your thoughts?
Regards,
Baokun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-03 6:25 ` [PATCH -next v2 03/22] ext4: only order data when partially block truncating down Zhang Yi
2026-02-03 9:59 ` Jan Kara
@ 2026-02-04 4:21 ` kernel test robot
2026-02-10 7:05 ` Ojaswin Mujoo
2 siblings, 0 replies; 56+ messages in thread
From: kernel test robot @ 2026-02-04 4:21 UTC (permalink / raw)
To: Zhang Yi, linux-ext4
Cc: oe-kbuild-all, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ojaswin, ritesh.list, hch, djwong, yi.zhang, yi.zhang,
yizhang089, libaokun1, yangerkun, yukuai
Hi Zhang,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20260202]
url: https://github.com/intel-lab-lkp/linux/commits/Zhang-Yi/ext4-make-ext4_block_zero_page_range-pass-out-did_zero/20260203-144244
base: next-20260202
patch link: https://lore.kernel.org/r/20260203062523.3869120-4-yi.zhang%40huawei.com
patch subject: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
config: riscv-randconfig-r071-20260204 (https://download.01.org/0day-ci/archive/20260204/202602041239.JFyNwVcg-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9b8addffa70cee5b2acc5454712d9cf78ce45710)
smatch version: v0.5.0-8994-gd50c5a4c
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202602041239.JFyNwVcg-lkp@intel.com/
New smatch warnings:
fs/ext4/inode.c:4577 ext4_truncate() warn: unsigned 'zero_len' is never less than zero.
Old smatch warnings:
fs/ext4/inode.c:2651 mpage_prepare_extent_to_map() warn: missing error code 'err'
fs/ext4/inode.c:5129 check_igot_inode() warn: missing unwind goto?
vim +/zero_len +4577 fs/ext4/inode.c
4494
4495 /*
4496 * ext4_truncate()
4497 *
4498 * We block out ext4_get_block() block instantiations across the entire
4499 * transaction, and VFS/VM ensures that ext4_truncate() cannot run
4500 * simultaneously on behalf of the same inode.
4501 *
4502 * As we work through the truncate and commit bits of it to the journal there
4503 * is one core, guiding principle: the file's tree must always be consistent on
4504 * disk. We must be able to restart the truncate after a crash.
4505 *
4506 * The file's tree may be transiently inconsistent in memory (although it
4507 * probably isn't), but whenever we close off and commit a journal transaction,
4508 * the contents of (the filesystem + the journal) must be consistent and
4509 * restartable. It's pretty simple, really: bottom up, right to left (although
4510 * left-to-right works OK too).
4511 *
4512 * Note that at recovery time, journal replay occurs *before* the restart of
4513 * truncate against the orphan inode list.
4514 *
4515 * The committed inode has the new, desired i_size (which is the same as
4516 * i_disksize in this case). After a crash, ext4_orphan_cleanup() will see
4517 * that this inode's truncate did not complete and it will again call
4518 * ext4_truncate() to have another go. So there will be instantiated blocks
4519 * to the right of the truncation point in a crashed ext4 filesystem. But
4520 * that's fine - as long as they are linked from the inode, the post-crash
4521 * ext4_truncate() run will find them and release them.
4522 */
4523 int ext4_truncate(struct inode *inode)
4524 {
4525 struct ext4_inode_info *ei = EXT4_I(inode);
4526 unsigned int credits;
4527 int err = 0, err2;
4528 handle_t *handle;
4529 struct address_space *mapping = inode->i_mapping;
4530
4531 /*
4532 * There is a possibility that we're either freeing the inode
4533 * or it's a completely new inode. In those cases we might not
4534 * have i_rwsem locked because it's not necessary.
4535 */
4536 if (!(inode_state_read_once(inode) & (I_NEW | I_FREEING)))
4537 WARN_ON(!inode_is_locked(inode));
4538 trace_ext4_truncate_enter(inode);
4539
4540 if (!ext4_can_truncate(inode))
4541 goto out_trace;
4542
4543 if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
4544 ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
4545
4546 if (ext4_has_inline_data(inode)) {
4547 int has_inline = 1;
4548
4549 err = ext4_inline_data_truncate(inode, &has_inline);
4550 if (err || has_inline)
4551 goto out_trace;
4552 }
4553
4554 /* If we zero-out tail of the page, we have to create jinode for jbd2 */
4555 if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
4556 err = ext4_inode_attach_jinode(inode);
4557 if (err)
4558 goto out_trace;
4559 }
4560
4561 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
4562 credits = ext4_chunk_trans_extent(inode, 1);
4563 else
4564 credits = ext4_blocks_for_truncate(inode);
4565
4566 handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
4567 if (IS_ERR(handle)) {
4568 err = PTR_ERR(handle);
4569 goto out_trace;
4570 }
4571
4572 if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
4573 unsigned int zero_len;
4574
4575 zero_len = ext4_block_truncate_page(handle, mapping,
4576 inode->i_size);
> 4577 if (zero_len < 0) {
4578 err = zero_len;
4579 goto out_stop;
4580 }
4581 if (zero_len && !IS_DAX(inode) &&
4582 ext4_should_order_data(inode)) {
4583 err = ext4_jbd2_inode_add_write(handle, inode,
4584 inode->i_size, zero_len);
4585 if (err)
4586 goto out_stop;
4587 }
4588 }
4589
4590 /*
4591 * We add the inode to the orphan list, so that if this
4592 * truncate spans multiple transactions, and we crash, we will
4593 * resume the truncate when the filesystem recovers. It also
4594 * marks the inode dirty, to catch the new size.
4595 *
4596 * Implication: the file must always be in a sane, consistent
4597 * truncatable state while each transaction commits.
4598 */
4599 err = ext4_orphan_add(handle, inode);
4600 if (err)
4601 goto out_stop;
4602
4603 ext4_fc_track_inode(handle, inode);
4604 ext4_check_map_extents_env(inode);
4605
4606 down_write(&EXT4_I(inode)->i_data_sem);
4607 ext4_discard_preallocations(inode);
4608
4609 if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
4610 err = ext4_ext_truncate(handle, inode);
4611 else
4612 ext4_ind_truncate(handle, inode);
4613
4614 up_write(&ei->i_data_sem);
4615 if (err)
4616 goto out_stop;
4617
4618 if (IS_SYNC(inode))
4619 ext4_handle_sync(handle);
4620
4621 out_stop:
4622 /*
4623 * If this was a simple ftruncate() and the file will remain alive,
4624 * then we need to clear up the orphan record which we created above.
4625 * However, if this was a real unlink then we were called by
4626 * ext4_evict_inode(), and we allow that function to clean up the
4627 * orphan info for us.
4628 */
4629 if (inode->i_nlink)
4630 ext4_orphan_del(handle, inode);
4631
4632 inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
4633 err2 = ext4_mark_inode_dirty(handle, inode);
4634 if (unlikely(err2 && !err))
4635 err = err2;
4636 ext4_journal_stop(handle);
4637
4638 out_trace:
4639 trace_ext4_truncate_exit(inode);
4640 return err;
4641 }
4642
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-03 9:59 ` Jan Kara
@ 2026-02-04 6:42 ` Zhang Yi
2026-02-04 14:18 ` Jan Kara
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-04 6:42 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, Zhang Yi, yi.zhang, yizhang089,
libaokun1, yangerkun, yukuai
Hi, Jan!
On 2/3/2026 5:59 PM, Jan Kara wrote:
> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>> Currently, __ext4_block_zero_page_range() is called in the following
>> four cases to zero out the data in partial blocks:
>>
>> 1. Truncate down.
>> 2. Truncate up.
>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>> range extending beyond the end of the file (EOF).
>> 4. Partial block punch hole.
>>
>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>> will write back the zeroed data to the disk through the order mode after
>> zeroing out.
>>
>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>> this ordered write. Assuming no one intentionally bypasses the file
>> system to write directly to the disk. When performing a truncate down
>> operation, ensuring that the data beyond the EOF is zeroed out before
>> updating i_disksize is sufficient to prevent old data from being exposed
>> when the file is later extended. In other words, as long as the on-disk
>> data in case 1 can be properly zeroed out, only the data in memory needs
>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>
> Hum, I'm not sure this is correct. The tail block of the file is not
> necessarily zeroed out beyond EOF (as mmap writes can race with page
> writeback and modify the tail block contents beyond EOF before we really
> submit it to the device). Thus after this commit if you truncate up, just
> zero out the newly exposed contents in the page cache and dirty it, then
> the transaction with the i_disksize update commits (I see nothing
> preventing it) and then you crash, you can observe file with the new size
> but non-zero content in the newly exposed area. Am I missing something?
>
Well, I think you are right! I missed the mmap write race condition that
happens during the writeback submitting I/O. Thank you a lot for pointing
this out. I thought of two possible solutions:
1. We also add explicit writeback operations to the truncate-up and
post-EOF append writes. This solution is the most straightforward but
may cause some performance overhead. However, since at most only one
block is written, the impact is likely limited. Additionally, I
observed that the implementation of the XFS file system also adopts a
similar approach in its truncate up and down operation. (But it is
somewhat strange that XFS also appears to have the same issue with
post-EOF append writes; it only zero out the partial block in
xfs_file_write_checks(), but it neither explicitly writeback zeroed
data nor employs any other mechanism to ensure that the zero data
writebacks before the metadata is written to disk.)
2. Resolve this race condition, ensure that there are no non-zero data
in the post-EOF partial blocks on the disk. I observed that after the
writeback holds the folio lock and calls folio_clear_dirty_for_io(),
mmap writes will re-trigger the page fault. Perhaps we can filter out
the EOF folio based on i_size in ext4_page_mkwrite(),
block_page_mkwrite() and iomap_page_mkwrite(), and then call
folio_wait_writeback() to wait for this partial folio writeback to
complete. This seems can break the race condition without introducing
too much overhead (no?).
What do you think? Any other suggestions are also welcome.
Thanks,
Yi.
>> Case 4 does not require ordered data because the entire punch hole
>> operation does not provide atomicity guarantees. Therefore, it's safe to
>> move the ordered data operation from __ext4_block_zero_page_range() to
>> ext4_truncate().
>
> I agree hole punching can already expose intermediate results in case of
> crash so there removing the ordered mode handling is safe.
>
> Honza
>
>> It should be noted that after this change, we can only determine whether
>> to perform ordered data operations based on whether the target block has
>> been zeroed, rather than on the state of the buffer head. Consequently,
>> unnecessary ordered data operations may occur when truncating an
>> unwritten dirty block. However, this scenario is relatively rare, so the
>> overall impact is minimal.
>>
>> This is prepared for the conversion to the iomap infrastructure since it
>> doesn't use ordered data mode and requires active writeback, which
>> reduces the complexity of the conversion.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>> 1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f856ea015263..20b60abcf777 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>> folio_zero_range(folio, offset, length);
>> BUFFER_TRACE(bh, "zeroed end of block");
>>
>> - if (ext4_should_journal_data(inode)) {
>> + if (ext4_should_journal_data(inode))
>> err = ext4_dirty_journalled_data(handle, bh);
>> - } else {
>> + else
>> mark_buffer_dirty(bh);
>> - /*
>> - * Only the written block requires ordered data to prevent
>> - * exposing stale data.
>> - */
>> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
>> - ext4_should_order_data(inode))
>> - err = ext4_jbd2_inode_add_write(handle, inode, from,
>> - length);
>> - }
>> if (!err && did_zero)
>> *did_zero = true;
>>
>> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>> goto out_trace;
>> }
>>
>> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
>> - ext4_block_truncate_page(handle, mapping, inode->i_size);
>> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
>> + unsigned int zero_len;
>> +
>> + zero_len = ext4_block_truncate_page(handle, mapping,
>> + inode->i_size);
>> + if (zero_len < 0) {
>> + err = zero_len;
>> + goto out_stop;
>> + }
>> + if (zero_len && !IS_DAX(inode) &&
>> + ext4_should_order_data(inode)) {
>> + err = ext4_jbd2_inode_add_write(handle, inode,
>> + inode->i_size, zero_len);
>> + if (err)
>> + goto out_stop;
>> + }
>> + }
>>
>> /*
>> * We add the inode to the orphan list, so that if this
>> --
>> 2.52.0
>>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-04 6:42 ` Zhang Yi
@ 2026-02-04 14:18 ` Jan Kara
2026-02-05 3:27 ` Baokun Li
2026-02-05 7:50 ` Zhang Yi
0 siblings, 2 replies; 56+ messages in thread
From: Jan Kara @ 2026-02-04 14:18 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ojaswin, ritesh.list, hch, djwong, Zhang Yi,
yizhang089, libaokun1, yangerkun, yukuai
Hi Zhang!
On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> On 2/3/2026 5:59 PM, Jan Kara wrote:
> > On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> >> Currently, __ext4_block_zero_page_range() is called in the following
> >> four cases to zero out the data in partial blocks:
> >>
> >> 1. Truncate down.
> >> 2. Truncate up.
> >> 3. Perform block allocation (e.g., fallocate) or append writes across a
> >> range extending beyond the end of the file (EOF).
> >> 4. Partial block punch hole.
> >>
> >> If the default ordered data mode is used, __ext4_block_zero_page_range()
> >> will write back the zeroed data to the disk through the order mode after
> >> zeroing out.
> >>
> >> Among the cases 1,2 and 3 described above, only case 1 actually requires
> >> this ordered write. Assuming no one intentionally bypasses the file
> >> system to write directly to the disk. When performing a truncate down
> >> operation, ensuring that the data beyond the EOF is zeroed out before
> >> updating i_disksize is sufficient to prevent old data from being exposed
> >> when the file is later extended. In other words, as long as the on-disk
> >> data in case 1 can be properly zeroed out, only the data in memory needs
> >> to be zeroed out in cases 2 and 3, without requiring ordered data.
> >
> > Hum, I'm not sure this is correct. The tail block of the file is not
> > necessarily zeroed out beyond EOF (as mmap writes can race with page
> > writeback and modify the tail block contents beyond EOF before we really
> > submit it to the device). Thus after this commit if you truncate up, just
> > zero out the newly exposed contents in the page cache and dirty it, then
> > the transaction with the i_disksize update commits (I see nothing
> > preventing it) and then you crash, you can observe file with the new size
> > but non-zero content in the newly exposed area. Am I missing something?
> >
>
> Well, I think you are right! I missed the mmap write race condition that
> happens during the writeback submitting I/O. Thank you a lot for pointing
> this out. I thought of two possible solutions:
>
> 1. We also add explicit writeback operations to the truncate-up and
> post-EOF append writes. This solution is the most straightforward but
> may cause some performance overhead. However, since at most only one
> block is written, the impact is likely limited. Additionally, I
> observed that the implementation of the XFS file system also adopts a
> similar approach in its truncate up and down operation. (But it is
> somewhat strange that XFS also appears to have the same issue with
> post-EOF append writes; it only zero out the partial block in
> xfs_file_write_checks(), but it neither explicitly writeback zeroed
> data nor employs any other mechanism to ensure that the zero data
> writebacks before the metadata is written to disk.)
>
> 2. Resolve this race condition, ensure that there are no non-zero data
> in the post-EOF partial blocks on the disk. I observed that after the
> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
> mmap writes will re-trigger the page fault. Perhaps we can filter out
> the EOF folio based on i_size in ext4_page_mkwrite(),
> block_page_mkwrite() and iomap_page_mkwrite(), and then call
> folio_wait_writeback() to wait for this partial folio writeback to
> complete. This seems can break the race condition without introducing
> too much overhead (no?).
>
> What do you think? Any other suggestions are also welcome.
Hum, I like the option 2 because IMO non-zero data beyond EOF is a
corner-case quirk which unnecessarily complicates rather common paths. But
I'm not sure we can easily get rid of it. It can happen for example when
you do appending write inside a block. The page is written back but before
the transaction with i_disksize update commits we crash. Then again we have
a non-zero content inside the block beyond EOF.
So the only realistic option I see is to ensure tail of the block gets
zeroed on disk before the transaction with i_disksize update commits in the
cases of truncate up or write beyond EOF. data=ordered mode machinery is an
asynchronous way how to achieve this. We could also just synchronously
writeback the block where needed but the latency hit of such operation is
going to be significant so I'm quite sure some workload somewhere will
notice although the truncate up / write beyond EOF operations triggering this
are not too common. So why do you need to get rid of these data=ordered
mode usages? I guess because with iomap keeping our transaction handle ->
folio lock ordering is complicated? Last time I looked it seemed still
possible to keep it though.
Another possibility would be to just *submit* the write synchronously and
use data=ordered mode machinery only to wait for IO to complete before the
transaction commits. That way it should be safe to start a transaction
while holding folio lock and thus the iomap conversion would be easier.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-04 1:59 ` Baokun Li
@ 2026-02-04 14:23 ` Jan Kara
2026-02-05 2:06 ` Zhang Yi
2026-02-05 2:55 ` Baokun Li
0 siblings, 2 replies; 56+ messages in thread
From: Jan Kara @ 2026-02-04 14:23 UTC (permalink / raw)
To: Baokun Li
Cc: Theodore Tso, Zhang Yi, Christoph Hellwig, linux-ext4,
linux-fsdevel, linux-kernel, adilger.kernel, jack, ojaswin,
ritesh.list, djwong, Zhang Yi, yizhang089, yangerkun, yukuai,
libaokun9
On Wed 04-02-26 09:59:36, Baokun Li wrote:
> On 2026-02-03 21:14, Theodore Tso wrote:
> > On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> >> This means that the ordered journal mode is no longer in ext4 used
> >> under the iomap infrastructure. The main reason is that iomap
> >> processes each folio one by one during writeback. It first holds the
> >> folio lock and then starts a transaction to create the block mapping.
> >> If we still use the ordered mode, we need to perform writeback in
> >> the logging process, which may require initiating a new transaction,
> >> potentially leading to deadlock issues. In addition, ordered journal
> >> mode indeed has many synchronization dependencies, which increase
> >> the risk of deadlocks, and I believe this is one of the reasons why
> >> ext4_do_writepages() is implemented in such a complicated manner.
> >> Therefore, I think we need to give up using the ordered data mode.
> >>
> >> Currently, there are three scenarios where the ordered mode is used:
> >> 1) append write,
> >> 2) partial block truncate down, and
> >> 3) online defragmentation.
> >>
> >> For append write, we can always allocate unwritten blocks to avoid
> >> using the ordered journal mode.
> > This is going to be a pretty severe performance regression, since it
> > means that we will be doubling the journal load for append writes.
> > What we really need to do here is to first write out the data blocks,
> > and then only start the transaction handle to modify the data blocks
> > *after* the data blocks have been written (to heretofore, unused
> > blocks that were just allocated). It means inverting the order in
> > which we write data blocks for the append write case, and in fact it
> > will improve fsync() performance since we won't be gating writing the
> > commit block on the date blocks getting written out in the append
> > write case.
>
> I have some local demo patches doing something similar, and I think this
> work could be decoupled from Yi's patch set.
>
> Since inode preallocation (PA) maintains physical block occupancy with a
> logical-to-physical mapping, and ensures on-disk data consistency after
> power failure, it is an excellent location for recording temporary
> occupancy. Furthermore, since inode PA often allocates more blocks than
> requested, it can also help reduce file fragmentation.
>
> The specific approach is as follows:
>
> 1. Allocate only the PA during block allocation without inserting it into
> the extent status tree. Return the PA to the caller and increment its
> refcount to prevent it from being discarded.
>
> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
> refcount and return -EIO. If successful, proceed to the next step.
>
> 3. Start a handle upon successful IO completion to convert the inode PA to
> extents. Release the refcount and update the extent tree.
>
> 4. If a corresponding extent already exists, we’ll need to punch holes to
> release the old extent before inserting the new one.
Sounds good. Just if I understand correctly case 4 would happen only if you
really try to do something like COW with this? Normally you'd just use the
already present blocks and write contents into them?
> This ensures data atomicity, while jbd2—being a COW-like implementation
> itself—ensures metadata atomicity. By leveraging this "delay map"
> mechanism, we can achieve several benefits:
>
> * Lightweight, high-performance COW.
> * High-performance software atomic writes (hardware-independent).
> * Replacing dio_readnolock, which might otherwise read unexpected zeros.
> * Replacing ordered data and data journal modes.
> * Reduced handle hold time, as it's only held during extent tree updates.
> * Paving the way for snapshot support.
>
> Of course, COW itself can lead to severe file fragmentation, especially
> in small-scale overwrite scenarios.
I agree the feature can provide very interesting benefits and we were
pondering about something like that for a long time, just never got to
implementing it. I'd say the immediate benefits are you can completely get
rid of dioread_nolock as well as the legacy dioread_lock modes so overall
code complexity should not increase much. We could also mostly get rid of
data=ordered mode use (although not completely - see my discussion with
Zhang over patch 3) which would be also welcome simplification. These
benefits alone are IMO a good enough reason to have the functionality :).
Even without COW, atomic writes and other fancy stuff.
I don't see how you want to get rid of data=journal mode - perhaps that's
related to the COW functionality?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-04 14:23 ` Jan Kara
@ 2026-02-05 2:06 ` Zhang Yi
2026-02-05 3:04 ` Baokun Li
2026-02-05 12:58 ` Jan Kara
2026-02-05 2:55 ` Baokun Li
1 sibling, 2 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-05 2:06 UTC (permalink / raw)
To: Jan Kara, Baokun Li
Cc: Theodore Tso, Christoph Hellwig, linux-ext4, linux-fsdevel,
linux-kernel, adilger.kernel, ojaswin, ritesh.list, djwong,
Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9
On 2/4/2026 10:23 PM, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> On 2026-02-03 21:14, Theodore Tso wrote:
>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>> This means that the ordered journal mode is no longer in ext4 used
>>>> under the iomap infrastructure. The main reason is that iomap
>>>> processes each folio one by one during writeback. It first holds the
>>>> folio lock and then starts a transaction to create the block mapping.
>>>> If we still use the ordered mode, we need to perform writeback in
>>>> the logging process, which may require initiating a new transaction,
>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>> mode indeed has many synchronization dependencies, which increase
>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>
>>>> Currently, there are three scenarios where the ordered mode is used:
>>>> 1) append write,
>>>> 2) partial block truncate down, and
>>>> 3) online defragmentation.
>>>>
>>>> For append write, we can always allocate unwritten blocks to avoid
>>>> using the ordered journal mode.
>>> This is going to be a pretty severe performance regression, since it
>>> means that we will be doubling the journal load for append writes.
>>> What we really need to do here is to first write out the data blocks,
>>> and then only start the transaction handle to modify the data blocks
>>> *after* the data blocks have been written (to heretofore, unused
>>> blocks that were just allocated). It means inverting the order in
>>> which we write data blocks for the append write case, and in fact it
>>> will improve fsync() performance since we won't be gating writing the
>>> commit block on the date blocks getting written out in the append
>>> write case.
>>
>> I have some local demo patches doing something similar, and I think this
>> work could be decoupled from Yi's patch set.
>>
>> Since inode preallocation (PA) maintains physical block occupancy with a
>> logical-to-physical mapping, and ensures on-disk data consistency after
>> power failure, it is an excellent location for recording temporary
>> occupancy. Furthermore, since inode PA often allocates more blocks than
>> requested, it can also help reduce file fragmentation.
>>
>> The specific approach is as follows:
>>
>> 1. Allocate only the PA during block allocation without inserting it into
>> the extent status tree. Return the PA to the caller and increment its
>> refcount to prevent it from being discarded.
>>
>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>> refcount and return -EIO. If successful, proceed to the next step.
>>
>> 3. Start a handle upon successful IO completion to convert the inode PA to
>> extents. Release the refcount and update the extent tree.
>>
>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>> release the old extent before inserting the new one.
>
> Sounds good. Just if I understand correctly case 4 would happen only if you
> really try to do something like COW with this? Normally you'd just use the
> already present blocks and write contents into them?
>
>> This ensures data atomicity, while jbd2—being a COW-like implementation
>> itself—ensures metadata atomicity. By leveraging this "delay map"
>> mechanism, we can achieve several benefits:
>>
>> * Lightweight, high-performance COW.
>> * High-performance software atomic writes (hardware-independent).
>> * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>> * Replacing ordered data and data journal modes.
>> * Reduced handle hold time, as it's only held during extent tree updates.
>> * Paving the way for snapshot support.
>>
>> Of course, COW itself can lead to severe file fragmentation, especially
>> in small-scale overwrite scenarios.
>
> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. I'd say the immediate benefits are you can completely get
> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> code complexity should not increase much. We could also mostly get rid of
> data=ordered mode use (although not completely - see my discussion with
> Zhang over patch 3) which would be also welcome simplification. These
I suppose this feature can also be used to get rid of the data=ordered mode
use in online defragmentation. With this feature, perhaps we can develop a
new method of online defragmentation that eliminates the need to pre-allocate
a donor file. Instead, we can attempt to allocate as many contiguous blocks
as possible through PA. If the allocated length is longer than the original
extent, we can perform the swap and copy the data. Once the copy is complete,
we can atomically construct a new extent, then releases the original blocks
synchronously or asynchronously, similar to a regular copy-on-write (COW)
operation. What does this sounds?
Regards,
Yi.
> benefits alone are IMO a good enough reason to have the functionality :).
> Even without COW, atomic writes and other fancy stuff.
>
> I don't see how you want to get rid of data=journal mode - perhaps that's
> related to the COW functionality?
>
> Honza
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-04 14:23 ` Jan Kara
2026-02-05 2:06 ` Zhang Yi
@ 2026-02-05 2:55 ` Baokun Li
2026-02-05 12:46 ` Jan Kara
1 sibling, 1 reply; 56+ messages in thread
From: Baokun Li @ 2026-02-05 2:55 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Zhang Yi, Christoph Hellwig, linux-ext4,
linux-fsdevel, linux-kernel, adilger.kernel, ojaswin, ritesh.list,
djwong, Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9,
Baokun Li
On 2026-02-04 22:23, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> On 2026-02-03 21:14, Theodore Tso wrote:
>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>> This means that the ordered journal mode is no longer in ext4 used
>>>> under the iomap infrastructure. The main reason is that iomap
>>>> processes each folio one by one during writeback. It first holds the
>>>> folio lock and then starts a transaction to create the block mapping.
>>>> If we still use the ordered mode, we need to perform writeback in
>>>> the logging process, which may require initiating a new transaction,
>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>> mode indeed has many synchronization dependencies, which increase
>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>
>>>> Currently, there are three scenarios where the ordered mode is used:
>>>> 1) append write,
>>>> 2) partial block truncate down, and
>>>> 3) online defragmentation.
>>>>
>>>> For append write, we can always allocate unwritten blocks to avoid
>>>> using the ordered journal mode.
>>> This is going to be a pretty severe performance regression, since it
>>> means that we will be doubling the journal load for append writes.
>>> What we really need to do here is to first write out the data blocks,
>>> and then only start the transaction handle to modify the data blocks
>>> *after* the data blocks have been written (to heretofore, unused
>>> blocks that were just allocated). It means inverting the order in
>>> which we write data blocks for the append write case, and in fact it
>>> will improve fsync() performance since we won't be gating writing the
>>> commit block on the date blocks getting written out in the append
>>> write case.
>> I have some local demo patches doing something similar, and I think this
>> work could be decoupled from Yi's patch set.
>>
>> Since inode preallocation (PA) maintains physical block occupancy with a
>> logical-to-physical mapping, and ensures on-disk data consistency after
>> power failure, it is an excellent location for recording temporary
>> occupancy. Furthermore, since inode PA often allocates more blocks than
>> requested, it can also help reduce file fragmentation.
>>
>> The specific approach is as follows:
>>
>> 1. Allocate only the PA during block allocation without inserting it into
>> the extent status tree. Return the PA to the caller and increment its
>> refcount to prevent it from being discarded.
>>
>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>> refcount and return -EIO. If successful, proceed to the next step.
>>
>> 3. Start a handle upon successful IO completion to convert the inode PA to
>> extents. Release the refcount and update the extent tree.
>>
>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>> release the old extent before inserting the new one.
> Sounds good. Just if I understand correctly case 4 would happen only if you
> really try to do something like COW with this? Normally you'd just use the
> already present blocks and write contents into them?
Yes, case 4 only needs to be considered when implementing COW.
>
>> This ensures data atomicity, while jbd2—being a COW-like implementation
>> itself—ensures metadata atomicity. By leveraging this "delay map"
>> mechanism, we can achieve several benefits:
>>
>> * Lightweight, high-performance COW.
>> * High-performance software atomic writes (hardware-independent).
>> * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>> * Replacing ordered data and data journal modes.
>> * Reduced handle hold time, as it's only held during extent tree updates.
>> * Paving the way for snapshot support.
>>
>> Of course, COW itself can lead to severe file fragmentation, especially
>> in small-scale overwrite scenarios.
> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. I'd say the immediate benefits are you can completely get
> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> code complexity should not increase much. We could also mostly get rid of
> data=ordered mode use (although not completely - see my discussion with
> Zhang over patch 3) which would be also welcome simplification. These
> benefits alone are IMO a good enough reason to have the functionality :).
> Even without COW, atomic writes and other fancy stuff.
Glad you liked the 'delay map' concept (naming suggestions are welcome!).
With delay-map in place, implementing COW only requires handling overwrite
scenarios, and software atomic writes can be achieved by enabling atomic
delay-maps across multiple PAs.
I expect to send out a minimal RFC version for discussion in a few weeks.
I will share some additional thoughts regarding EOF blocks and
data=ordered mode in patch 3.
Thanks for your feedback!
>
> I don't see how you want to get rid of data=journal mode - perhaps that's
> related to the COW functionality?
>
> Honza
Yes. The only real advantage of data=journal mode over data=ordered is
its guarantee of data atomicity for overwrites.
If we can achieve this through COW-based software atomic writes, we can
move away from the performance-heavy data=journal mode.
Cheers,
Baokun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-05 2:06 ` Zhang Yi
@ 2026-02-05 3:04 ` Baokun Li
2026-02-05 12:58 ` Jan Kara
1 sibling, 0 replies; 56+ messages in thread
From: Baokun Li @ 2026-02-05 3:04 UTC (permalink / raw)
To: Zhang Yi, Jan Kara
Cc: Theodore Tso, Christoph Hellwig, linux-ext4, linux-fsdevel,
linux-kernel, adilger.kernel, ojaswin, ritesh.list, djwong,
Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9, Baokun Li
On 2026-02-05 10:06, Zhang Yi wrote:
> On 2/4/2026 10:23 PM, Jan Kara wrote:
>> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>>> On 2026-02-03 21:14, Theodore Tso wrote:
>>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>>> This means that the ordered journal mode is no longer in ext4 used
>>>>> under the iomap infrastructure. The main reason is that iomap
>>>>> processes each folio one by one during writeback. It first holds the
>>>>> folio lock and then starts a transaction to create the block mapping.
>>>>> If we still use the ordered mode, we need to perform writeback in
>>>>> the logging process, which may require initiating a new transaction,
>>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>>> mode indeed has many synchronization dependencies, which increase
>>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>>
>>>>> Currently, there are three scenarios where the ordered mode is used:
>>>>> 1) append write,
>>>>> 2) partial block truncate down, and
>>>>> 3) online defragmentation.
>>>>>
>>>>> For append write, we can always allocate unwritten blocks to avoid
>>>>> using the ordered journal mode.
>>>> This is going to be a pretty severe performance regression, since it
>>>> means that we will be doubling the journal load for append writes.
>>>> What we really need to do here is to first write out the data blocks,
>>>> and then only start the transaction handle to modify the data blocks
>>>> *after* the data blocks have been written (to heretofore, unused
>>>> blocks that were just allocated). It means inverting the order in
>>>> which we write data blocks for the append write case, and in fact it
>>>> will improve fsync() performance since we won't be gating writing the
>>>> commit block on the date blocks getting written out in the append
>>>> write case.
>>> I have some local demo patches doing something similar, and I think this
>>> work could be decoupled from Yi's patch set.
>>>
>>> Since inode preallocation (PA) maintains physical block occupancy with a
>>> logical-to-physical mapping, and ensures on-disk data consistency after
>>> power failure, it is an excellent location for recording temporary
>>> occupancy. Furthermore, since inode PA often allocates more blocks than
>>> requested, it can also help reduce file fragmentation.
>>>
>>> The specific approach is as follows:
>>>
>>> 1. Allocate only the PA during block allocation without inserting it into
>>> the extent status tree. Return the PA to the caller and increment its
>>> refcount to prevent it from being discarded.
>>>
>>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>> refcount and return -EIO. If successful, proceed to the next step.
>>>
>>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>> extents. Release the refcount and update the extent tree.
>>>
>>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>> release the old extent before inserting the new one.
>> Sounds good. Just if I understand correctly case 4 would happen only if you
>> really try to do something like COW with this? Normally you'd just use the
>> already present blocks and write contents into them?
>>
>>> This ensures data atomicity, while jbd2—being a COW-like implementation
>>> itself—ensures metadata atomicity. By leveraging this "delay map"
>>> mechanism, we can achieve several benefits:
>>>
>>> * Lightweight, high-performance COW.
>>> * High-performance software atomic writes (hardware-independent).
>>> * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>>> * Replacing ordered data and data journal modes.
>>> * Reduced handle hold time, as it's only held during extent tree updates.
>>> * Paving the way for snapshot support.
>>>
>>> Of course, COW itself can lead to severe file fragmentation, especially
>>> in small-scale overwrite scenarios.
>> I agree the feature can provide very interesting benefits and we were
>> pondering about something like that for a long time, just never got to
>> implementing it. I'd say the immediate benefits are you can completely get
>> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
>> code complexity should not increase much. We could also mostly get rid of
>> data=ordered mode use (although not completely - see my discussion with
>> Zhang over patch 3) which would be also welcome simplification. These
> I suppose this feature can also be used to get rid of the data=ordered mode
> use in online defragmentation. With this feature, perhaps we can develop a
> new method of online defragmentation that eliminates the need to pre-allocate
> a donor file. Instead, we can attempt to allocate as many contiguous blocks
> as possible through PA. If the allocated length is longer than the original
> extent, we can perform the swap and copy the data. Once the copy is complete,
> we can atomically construct a new extent, then releases the original blocks
> synchronously or asynchronously, similar to a regular copy-on-write (COW)
> operation. What does this sounds?
>
> Regards,
> Yi.
Good idea! This is much more efficient than allocating files first and
then swapping them. While COW can exacerbate fragmentation, it can also
be leveraged for defragmentation.
We could monitor the average extent length of files within the kernel and
add those that fall below a certain threshold to a "pending defrag" list.
Defragmentation could then be triggered at an appropriate time. To ensure
the effectiveness of the defrag process, we could also set a minimum
length requirement for inode PAs.
Cheers,
Baokun
>> benefits alone are IMO a good enough reason to have the functionality :).
>> Even without COW, atomic writes and other fancy stuff.
>>
>> I don't see how you want to get rid of data=journal mode - perhaps that's
>> related to the COW functionality?
>>
>> Honza
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-04 14:18 ` Jan Kara
@ 2026-02-05 3:27 ` Baokun Li
2026-02-05 14:07 ` Jan Kara
2026-02-05 7:50 ` Zhang Yi
1 sibling, 1 reply; 56+ messages in thread
From: Baokun Li @ 2026-02-05 3:27 UTC (permalink / raw)
To: Jan Kara, Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, Zhang Yi, yizhang089,
yangerkun, yukuai, Baokun Li, libaokun9
On 2026-02-04 22:18, Jan Kara wrote:
> Hi Zhang!
>
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>> four cases to zero out the data in partial blocks:
>>>>
>>>> 1. Truncate down.
>>>> 2. Truncate up.
>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>> range extending beyond the end of the file (EOF).
>>>> 4. Partial block punch hole.
>>>>
>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>> will write back the zeroed data to the disk through the order mode after
>>>> zeroing out.
>>>>
>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>> system to write directly to the disk. When performing a truncate down
>>>> operation, ensuring that the data beyond the EOF is zeroed out before
>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>> when the file is later extended. In other words, as long as the on-disk
>>>> data in case 1 can be properly zeroed out, only the data in memory needs
>>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>> writeback and modify the tail block contents beyond EOF before we really
>>> submit it to the device). Thus after this commit if you truncate up, just
>>> zero out the newly exposed contents in the page cache and dirty it, then
>>> the transaction with the i_disksize update commits (I see nothing
>>> preventing it) and then you crash, you can observe file with the new size
>>> but non-zero content in the newly exposed area. Am I missing something?
>>>
>> Well, I think you are right! I missed the mmap write race condition that
>> happens during the writeback submitting I/O. Thank you a lot for pointing
>> this out. I thought of two possible solutions:
>>
>> 1. We also add explicit writeback operations to the truncate-up and
>> post-EOF append writes. This solution is the most straightforward but
>> may cause some performance overhead. However, since at most only one
>> block is written, the impact is likely limited. Additionally, I
>> observed that the implementation of the XFS file system also adopts a
>> similar approach in its truncate up and down operation. (But it is
>> somewhat strange that XFS also appears to have the same issue with
>> post-EOF append writes; it only zero out the partial block in
>> xfs_file_write_checks(), but it neither explicitly writeback zeroed
>> data nor employs any other mechanism to ensure that the zero data
>> writebacks before the metadata is written to disk.)
>>
>> 2. Resolve this race condition, ensure that there are no non-zero data
>> in the post-EOF partial blocks on the disk. I observed that after the
>> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>> mmap writes will re-trigger the page fault. Perhaps we can filter out
>> the EOF folio based on i_size in ext4_page_mkwrite(),
>> block_page_mkwrite() and iomap_page_mkwrite(), and then call
>> folio_wait_writeback() to wait for this partial folio writeback to
>> complete. This seems can break the race condition without introducing
>> too much overhead (no?).
>>
>> What do you think? Any other suggestions are also welcome.
> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do appending write inside a block. The page is written back but before
> the transaction with i_disksize update commits we crash. Then again we have
> a non-zero content inside the block beyond EOF.
>
> So the only realistic option I see is to ensure tail of the block gets
> zeroed on disk before the transaction with i_disksize update commits in the
> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> asynchronous way how to achieve this. We could also just synchronously
> writeback the block where needed but the latency hit of such operation is
> going to be significant so I'm quite sure some workload somewhere will
> notice although the truncate up / write beyond EOF operations triggering this
> are not too common. So why do you need to get rid of these data=ordered
> mode usages? I guess because with iomap keeping our transaction handle ->
> folio lock ordering is complicated? Last time I looked it seemed still
> possible to keep it though.
>
> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.
>
> Honza
Can we treat EOF blocks as metadata and update them in the same
transaction as i_disksize? Although this would introduce some
management and journaling overhead, it could avoid the deadlock
of "writeback -> start handle -> trigger writeback".
Regards,
Baokun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-04 14:18 ` Jan Kara
2026-02-05 3:27 ` Baokun Li
@ 2026-02-05 7:50 ` Zhang Yi
2026-02-05 15:05 ` Jan Kara
1 sibling, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-05 7:50 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, Zhang Yi, yizhang089,
libaokun1, yangerkun, yukuai
On 2/4/2026 10:18 PM, Jan Kara wrote:
> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>> four cases to zero out the data in partial blocks:
>>>>
>>>> 1. Truncate down.
>>>> 2. Truncate up.
>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>> range extending beyond the end of the file (EOF).
>>>> 4. Partial block punch hole.
>>>>
>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>> will write back the zeroed data to the disk through the order mode after
>>>> zeroing out.
>>>>
>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>> system to write directly to the disk. When performing a truncate down
>>>> operation, ensuring that the data beyond the EOF is zeroed out before
>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>> when the file is later extended. In other words, as long as the on-disk
>>>> data in case 1 can be properly zeroed out, only the data in memory needs
>>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>>
>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>> writeback and modify the tail block contents beyond EOF before we really
>>> submit it to the device). Thus after this commit if you truncate up, just
>>> zero out the newly exposed contents in the page cache and dirty it, then
>>> the transaction with the i_disksize update commits (I see nothing
>>> preventing it) and then you crash, you can observe file with the new size
>>> but non-zero content in the newly exposed area. Am I missing something?
>>>
>>
>> Well, I think you are right! I missed the mmap write race condition that
>> happens during the writeback submitting I/O. Thank you a lot for pointing
>> this out. I thought of two possible solutions:
>>
>> 1. We also add explicit writeback operations to the truncate-up and
>> post-EOF append writes. This solution is the most straightforward but
>> may cause some performance overhead. However, since at most only one
>> block is written, the impact is likely limited. Additionally, I
>> observed that the implementation of the XFS file system also adopts a
>> similar approach in its truncate up and down operation. (But it is
>> somewhat strange that XFS also appears to have the same issue with
>> post-EOF append writes; it only zero out the partial block in
>> xfs_file_write_checks(), but it neither explicitly writeback zeroed
>> data nor employs any other mechanism to ensure that the zero data
>> writebacks before the metadata is written to disk.)
>>
>> 2. Resolve this race condition, ensure that there are no non-zero data
>> in the post-EOF partial blocks on the disk. I observed that after the
>> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>> mmap writes will re-trigger the page fault. Perhaps we can filter out
>> the EOF folio based on i_size in ext4_page_mkwrite(),
>> block_page_mkwrite() and iomap_page_mkwrite(), and then call
>> folio_wait_writeback() to wait for this partial folio writeback to
>> complete. This seems can break the race condition without introducing
>> too much overhead (no?).
>>
>> What do you think? Any other suggestions are also welcome.
>
> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> corner-case quirk which unnecessarily complicates rather common paths. But
> I'm not sure we can easily get rid of it. It can happen for example when
> you do appending write inside a block. The page is written back but before
> the transaction with i_disksize update commits we crash. Then again we have
> a non-zero content inside the block beyond EOF.
Yes, indeed. From this perspective, it seems difficult to avoid non-zero
content within the block beyond the EOF.
>
> So the only realistic option I see is to ensure tail of the block gets
> zeroed on disk before the transaction with i_disksize update commits in the
> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> asynchronous way how to achieve this. We could also just synchronously
> writeback the block where needed but the latency hit of such operation is
> going to be significant so I'm quite sure some workload somewhere will
> notice although the truncate up / write beyond EOF operations triggering this
> are not too common.
Yes, I agree.
> So why do you need to get rid of these data=ordered
> mode usages? I guess because with iomap keeping our transaction handle ->
> folio lock ordering is complicated? Last time I looked it seemed still
> possible to keep it though.
>
Yes, that's one reason. There's another reason is that we also need to
implement partial folio submits for iomap.
When the journal process is waiting for a folio to be written back
(which contains an ordered block), and the folio also contains unmapped
blocks with a block size smaller than the folio size, if the regular
writeback process has already started committing this folio (and set the
writeback flag), then a deadlock may occur while mapping the remaining
unmapped blocks. This is because the writeback flag is cleared only
after the entire folio are processed and committed. If we want to support
partial folio submit for iomap, we need to be careful to prevent adding
additional performance overhead in the case of severe fragmentation.
Therefore, this aspect of the logic is complicated and subtle. As we
discussed in patch 0, if we can avoid using the data=ordered mode in
append write and online defrag, then this would be the only remaining
corner case. I'm not sure if it is worth implementing this and adjusting
the lock ordering.
> Another possibility would be to just *submit* the write synchronously and
> use data=ordered mode machinery only to wait for IO to complete before the
> transaction commits. That way it should be safe to start a transaction
> while holding folio lock and thus the iomap conversion would be easier.
>
> Honza
IIUC, this solution seems can avoid adjusting the lock ordering, but partial
folio submission still needs to be implemented, is my understanding right?
This is because although we have already submitted this zeroed partial EOF
block, when the journal process is waiting for this folio, this folio is
being written back, and there are other blocks in this folio that need to be
mapped.
Cheers,
Yi.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-05 2:55 ` Baokun Li
@ 2026-02-05 12:46 ` Jan Kara
0 siblings, 0 replies; 56+ messages in thread
From: Jan Kara @ 2026-02-05 12:46 UTC (permalink / raw)
To: Baokun Li
Cc: Jan Kara, Theodore Tso, Zhang Yi, Christoph Hellwig, linux-ext4,
linux-fsdevel, linux-kernel, adilger.kernel, ojaswin, ritesh.list,
djwong, Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9
On Thu 05-02-26 10:55:59, Baokun Li wrote:
> > I don't see how you want to get rid of data=journal mode - perhaps that's
> > related to the COW functionality?
>
> Yes. The only real advantage of data=journal mode over data=ordered is
> its guarantee of data atomicity for overwrites.
>
> If we can achieve this through COW-based software atomic writes, we can
> move away from the performance-heavy data=journal mode.
Hum, I don't think data=journal actually currently offers overwrite
atomicity - even in data=journal mode you can observe only part of the
write completed after a crash. But what it does offer and why people tend
to use it is that it offers strict linear ordering guarantees between all
data and metadata operations happening in the system. Thus if you can prove
that operation A completed before operation B started, then you are
guaranteed that even after crash you will not see B done and A not done.
This is a very strong consistency guarantee that makes life simpler for the
applications so people that can afford the performance cost and care a lot
about crash safety like it.
I think you are correct that with COW and a bit of care it could be
possible to achieve these guarantees even without journalling data. But I'd
need to think more about it.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-05 2:06 ` Zhang Yi
2026-02-05 3:04 ` Baokun Li
@ 2026-02-05 12:58 ` Jan Kara
2026-02-06 2:15 ` Zhang Yi
1 sibling, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-05 12:58 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, Baokun Li, Theodore Tso, Christoph Hellwig, linux-ext4,
linux-fsdevel, linux-kernel, adilger.kernel, ojaswin, ritesh.list,
djwong, Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9
On Thu 05-02-26 10:06:11, Zhang Yi wrote:
> On 2/4/2026 10:23 PM, Jan Kara wrote:
> > On Wed 04-02-26 09:59:36, Baokun Li wrote:
> >> On 2026-02-03 21:14, Theodore Tso wrote:
> >>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> >>>> This means that the ordered journal mode is no longer in ext4 used
> >>>> under the iomap infrastructure. The main reason is that iomap
> >>>> processes each folio one by one during writeback. It first holds the
> >>>> folio lock and then starts a transaction to create the block mapping.
> >>>> If we still use the ordered mode, we need to perform writeback in
> >>>> the logging process, which may require initiating a new transaction,
> >>>> potentially leading to deadlock issues. In addition, ordered journal
> >>>> mode indeed has many synchronization dependencies, which increase
> >>>> the risk of deadlocks, and I believe this is one of the reasons why
> >>>> ext4_do_writepages() is implemented in such a complicated manner.
> >>>> Therefore, I think we need to give up using the ordered data mode.
> >>>>
> >>>> Currently, there are three scenarios where the ordered mode is used:
> >>>> 1) append write,
> >>>> 2) partial block truncate down, and
> >>>> 3) online defragmentation.
> >>>>
> >>>> For append write, we can always allocate unwritten blocks to avoid
> >>>> using the ordered journal mode.
> >>> This is going to be a pretty severe performance regression, since it
> >>> means that we will be doubling the journal load for append writes.
> >>> What we really need to do here is to first write out the data blocks,
> >>> and then only start the transaction handle to modify the data blocks
> >>> *after* the data blocks have been written (to heretofore, unused
> >>> blocks that were just allocated). It means inverting the order in
> >>> which we write data blocks for the append write case, and in fact it
> >>> will improve fsync() performance since we won't be gating writing the
> >>> commit block on the date blocks getting written out in the append
> >>> write case.
> >>
> >> I have some local demo patches doing something similar, and I think this
> >> work could be decoupled from Yi's patch set.
> >>
> >> Since inode preallocation (PA) maintains physical block occupancy with a
> >> logical-to-physical mapping, and ensures on-disk data consistency after
> >> power failure, it is an excellent location for recording temporary
> >> occupancy. Furthermore, since inode PA often allocates more blocks than
> >> requested, it can also help reduce file fragmentation.
> >>
> >> The specific approach is as follows:
> >>
> >> 1. Allocate only the PA during block allocation without inserting it into
> >> the extent status tree. Return the PA to the caller and increment its
> >> refcount to prevent it from being discarded.
> >>
> >> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
> >> refcount and return -EIO. If successful, proceed to the next step.
> >>
> >> 3. Start a handle upon successful IO completion to convert the inode PA to
> >> extents. Release the refcount and update the extent tree.
> >>
> >> 4. If a corresponding extent already exists, we’ll need to punch holes to
> >> release the old extent before inserting the new one.
> >
> > Sounds good. Just if I understand correctly case 4 would happen only if you
> > really try to do something like COW with this? Normally you'd just use the
> > already present blocks and write contents into them?
> >
> >> This ensures data atomicity, while jbd2—being a COW-like implementation
> >> itself—ensures metadata atomicity. By leveraging this "delay map"
> >> mechanism, we can achieve several benefits:
> >>
> >> * Lightweight, high-performance COW.
> >> * High-performance software atomic writes (hardware-independent).
> >> * Replacing dio_readnolock, which might otherwise read unexpected zeros.
> >> * Replacing ordered data and data journal modes.
> >> * Reduced handle hold time, as it's only held during extent tree updates.
> >> * Paving the way for snapshot support.
> >>
> >> Of course, COW itself can lead to severe file fragmentation, especially
> >> in small-scale overwrite scenarios.
> >
> > I agree the feature can provide very interesting benefits and we were
> > pondering about something like that for a long time, just never got to
> > implementing it. I'd say the immediate benefits are you can completely get
> > rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> > code complexity should not increase much. We could also mostly get rid of
> > data=ordered mode use (although not completely - see my discussion with
> > Zhang over patch 3) which would be also welcome simplification. These
>
> I suppose this feature can also be used to get rid of the data=ordered mode
> use in online defragmentation. With this feature, perhaps we can develop a
> new method of online defragmentation that eliminates the need to pre-allocate
> a donor file. Instead, we can attempt to allocate as many contiguous blocks
> as possible through PA. If the allocated length is longer than the original
> extent, we can perform the swap and copy the data. Once the copy is complete,
> we can atomically construct a new extent, then releases the original blocks
> synchronously or asynchronously, similar to a regular copy-on-write (COW)
> operation. What does this sounds?
Well, the reason why defragmentation uses the donor file is that there can
be a lot of policy in where and how the file is exactly placed (e.g. you
might want to place multiple files together). It was decided it is too
complex to implement these policies in the kernel so we've offloaded the
decision where the file is placed to userspace. Back at those times we were
also considering adding interface to guide allocation of blocks for a file
so the userspace defragmenter could prepare donor file with desired blocks.
But then the interest in defragmentation dropped (particularly due to
advances in flash storage) and so these ideas never materialized.
We might rethink the online defragmentation interface but at this point
I'm not sure we are ready to completely replace the idea of guiding the
block placement using a donor file...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-05 3:27 ` Baokun Li
@ 2026-02-05 14:07 ` Jan Kara
2026-02-06 1:14 ` Baokun Li
0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-05 14:07 UTC (permalink / raw)
To: Baokun Li
Cc: Jan Kara, Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel,
tytso, adilger.kernel, ojaswin, ritesh.list, hch, djwong,
Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9
On Thu 05-02-26 11:27:09, Baokun Li wrote:
> On 2026-02-04 22:18, Jan Kara wrote:
> > Hi Zhang!
> >
> > On Wed 04-02-26 14:42:46, Zhang Yi wrote:
> >> On 2/3/2026 5:59 PM, Jan Kara wrote:
> >>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
> >>>> Currently, __ext4_block_zero_page_range() is called in the following
> >>>> four cases to zero out the data in partial blocks:
> >>>>
> >>>> 1. Truncate down.
> >>>> 2. Truncate up.
> >>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
> >>>> range extending beyond the end of the file (EOF).
> >>>> 4. Partial block punch hole.
> >>>>
> >>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
> >>>> will write back the zeroed data to the disk through the order mode after
> >>>> zeroing out.
> >>>>
> >>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
> >>>> this ordered write. Assuming no one intentionally bypasses the file
> >>>> system to write directly to the disk. When performing a truncate down
> >>>> operation, ensuring that the data beyond the EOF is zeroed out before
> >>>> updating i_disksize is sufficient to prevent old data from being exposed
> >>>> when the file is later extended. In other words, as long as the on-disk
> >>>> data in case 1 can be properly zeroed out, only the data in memory needs
> >>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
> >>> Hum, I'm not sure this is correct. The tail block of the file is not
> >>> necessarily zeroed out beyond EOF (as mmap writes can race with page
> >>> writeback and modify the tail block contents beyond EOF before we really
> >>> submit it to the device). Thus after this commit if you truncate up, just
> >>> zero out the newly exposed contents in the page cache and dirty it, then
> >>> the transaction with the i_disksize update commits (I see nothing
> >>> preventing it) and then you crash, you can observe file with the new size
> >>> but non-zero content in the newly exposed area. Am I missing something?
> >>>
> >> Well, I think you are right! I missed the mmap write race condition that
> >> happens during the writeback submitting I/O. Thank you a lot for pointing
> >> this out. I thought of two possible solutions:
> >>
> >> 1. We also add explicit writeback operations to the truncate-up and
> >> post-EOF append writes. This solution is the most straightforward but
> >> may cause some performance overhead. However, since at most only one
> >> block is written, the impact is likely limited. Additionally, I
> >> observed that the implementation of the XFS file system also adopts a
> >> similar approach in its truncate up and down operation. (But it is
> >> somewhat strange that XFS also appears to have the same issue with
> >> post-EOF append writes; it only zero out the partial block in
> >> xfs_file_write_checks(), but it neither explicitly writeback zeroed
> >> data nor employs any other mechanism to ensure that the zero data
> >> writebacks before the metadata is written to disk.)
> >>
> >> 2. Resolve this race condition, ensure that there are no non-zero data
> >> in the post-EOF partial blocks on the disk. I observed that after the
> >> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
> >> mmap writes will re-trigger the page fault. Perhaps we can filter out
> >> the EOF folio based on i_size in ext4_page_mkwrite(),
> >> block_page_mkwrite() and iomap_page_mkwrite(), and then call
> >> folio_wait_writeback() to wait for this partial folio writeback to
> >> complete. This seems can break the race condition without introducing
> >> too much overhead (no?).
> >>
> >> What do you think? Any other suggestions are also welcome.
> > Hum, I like the option 2 because IMO non-zero data beyond EOF is a
> > corner-case quirk which unnecessarily complicates rather common paths. But
> > I'm not sure we can easily get rid of it. It can happen for example when
> > you do appending write inside a block. The page is written back but before
> > the transaction with i_disksize update commits we crash. Then again we have
> > a non-zero content inside the block beyond EOF.
> >
> > So the only realistic option I see is to ensure tail of the block gets
> > zeroed on disk before the transaction with i_disksize update commits in the
> > cases of truncate up or write beyond EOF. data=ordered mode machinery is an
> > asynchronous way how to achieve this. We could also just synchronously
> > writeback the block where needed but the latency hit of such operation is
> > going to be significant so I'm quite sure some workload somewhere will
> > notice although the truncate up / write beyond EOF operations triggering this
> > are not too common. So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
> >
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
> > while holding folio lock and thus the iomap conversion would be easier.
> >
> > Honza
>
> Can we treat EOF blocks as metadata and update them in the same
> transaction as i_disksize? Although this would introduce some
> management and journaling overhead, it could avoid the deadlock
> of "writeback -> start handle -> trigger writeback".
No, IMHO that would get too difficult. Just look at the hoops data=journal
mode has to jump through to make page cache handling work with the
journalling machinery. And you'd now have that for all the inodes. So I
think some form of data=ordered machinery is much simpler to reason about.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-05 7:50 ` Zhang Yi
@ 2026-02-05 15:05 ` Jan Kara
2026-02-06 11:09 ` Zhang Yi
0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-05 15:05 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ojaswin, ritesh.list, hch, djwong, Zhang Yi,
yizhang089, libaokun1, yangerkun, yukuai
On Thu 05-02-26 15:50:38, Zhang Yi wrote:
> On 2/4/2026 10:18 PM, Jan Kara wrote:
> > So why do you need to get rid of these data=ordered
> > mode usages? I guess because with iomap keeping our transaction handle ->
> > folio lock ordering is complicated? Last time I looked it seemed still
> > possible to keep it though.
>
> Yes, that's one reason. There's another reason is that we also need to
> implement partial folio submits for iomap.
>
> When the journal process is waiting for a folio to be written back
> (which contains an ordered block), and the folio also contains unmapped
> blocks with a block size smaller than the folio size, if the regular
> writeback process has already started committing this folio (and set the
> writeback flag), then a deadlock may occur while mapping the remaining
> unmapped blocks. This is because the writeback flag is cleared only
> after the entire folio are processed and committed. If we want to support
> partial folio submit for iomap, we need to be careful to prevent adding
> additional performance overhead in the case of severe fragmentation.
Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
deadlocks are currently resolved by grabbing transaction handle before we
go and lock any page for writeback. But I agree that with iomap it may be
tricky to keep this scheme.
> Therefore, this aspect of the logic is complicated and subtle. As we
> discussed in patch 0, if we can avoid using the data=ordered mode in
> append write and online defrag, then this would be the only remaining
> corner case. I'm not sure if it is worth implementing this and adjusting
> the lock ordering.
>
> > Another possibility would be to just *submit* the write synchronously and
> > use data=ordered mode machinery only to wait for IO to complete before the
> > transaction commits. That way it should be safe to start a transaction
>
> IIUC, this solution seems can avoid adjusting the lock ordering, but partial
> folio submission still needs to be implemented, is my understanding right?
> This is because although we have already submitted this zeroed partial EOF
> block, when the journal process is waiting for this folio, this folio is
> being written back, and there are other blocks in this folio that need to be
> mapped.
That's a good question. If we submit the tail folio from truncation code,
we could just submit the full folio write and there's no need to restrict
ourselves only to mapped blocks. But you are correct that if this IO
completes but the folio had holes in it and the hole gets filled in by
write before the transaction with i_disksize update commits, jbd2 commit
could still race with flush worker writing this folio again and the
deadlock could happen. Hrm...
So how about the following: We expand our io_end processing with the
ability to journal i_disksize updates after page writeback completes. Then
when doing truncate up or appending writes, we keep i_disksize at the old
value and just zero folio tails in the page cache, mark the folio dirty and
update i_size. When submitting writeback for a folio beyond current
i_disksize we make sure writepages submits IO for all the folios from
current i_disksize upwards. When io_end processing happens after completed
folio writeback, we update i_disksize to min(i_size, end of IO). This
should take care of non-zero data exposure issues and with "delay map"
processing Baokun works on all the inode metadata updates will happen after
IO completion anyway so it will be nicely batched up in one transaction.
It's a big change but so far I think it should work. What do you think?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-05 14:07 ` Jan Kara
@ 2026-02-06 1:14 ` Baokun Li
0 siblings, 0 replies; 56+ messages in thread
From: Baokun Li @ 2026-02-06 1:14 UTC (permalink / raw)
To: Jan Kara
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ojaswin, ritesh.list, hch, djwong, Zhang Yi,
yizhang089, yangerkun, yukuai, libaokun9, libaokun9
On 2026-02-05 22:07, Jan Kara wrote:
> On Thu 05-02-26 11:27:09, Baokun Li wrote:
>> On 2026-02-04 22:18, Jan Kara wrote:
>>> Hi Zhang!
>>>
>>> On Wed 04-02-26 14:42:46, Zhang Yi wrote:
>>>> On 2/3/2026 5:59 PM, Jan Kara wrote:
>>>>> On Tue 03-02-26 14:25:03, Zhang Yi wrote:
>>>>>> Currently, __ext4_block_zero_page_range() is called in the following
>>>>>> four cases to zero out the data in partial blocks:
>>>>>>
>>>>>> 1. Truncate down.
>>>>>> 2. Truncate up.
>>>>>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>>>>>> range extending beyond the end of the file (EOF).
>>>>>> 4. Partial block punch hole.
>>>>>>
>>>>>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>>>>>> will write back the zeroed data to the disk through the order mode after
>>>>>> zeroing out.
>>>>>>
>>>>>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>>>>>> this ordered write. Assuming no one intentionally bypasses the file
>>>>>> system to write directly to the disk. When performing a truncate down
>>>>>> operation, ensuring that the data beyond the EOF is zeroed out before
>>>>>> updating i_disksize is sufficient to prevent old data from being exposed
>>>>>> when the file is later extended. In other words, as long as the on-disk
>>>>>> data in case 1 can be properly zeroed out, only the data in memory needs
>>>>>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>>>> Hum, I'm not sure this is correct. The tail block of the file is not
>>>>> necessarily zeroed out beyond EOF (as mmap writes can race with page
>>>>> writeback and modify the tail block contents beyond EOF before we really
>>>>> submit it to the device). Thus after this commit if you truncate up, just
>>>>> zero out the newly exposed contents in the page cache and dirty it, then
>>>>> the transaction with the i_disksize update commits (I see nothing
>>>>> preventing it) and then you crash, you can observe file with the new size
>>>>> but non-zero content in the newly exposed area. Am I missing something?
>>>>>
>>>> Well, I think you are right! I missed the mmap write race condition that
>>>> happens during the writeback submitting I/O. Thank you a lot for pointing
>>>> this out. I thought of two possible solutions:
>>>>
>>>> 1. We also add explicit writeback operations to the truncate-up and
>>>> post-EOF append writes. This solution is the most straightforward but
>>>> may cause some performance overhead. However, since at most only one
>>>> block is written, the impact is likely limited. Additionally, I
>>>> observed that the implementation of the XFS file system also adopts a
>>>> similar approach in its truncate up and down operation. (But it is
>>>> somewhat strange that XFS also appears to have the same issue with
>>>> post-EOF append writes; it only zero out the partial block in
>>>> xfs_file_write_checks(), but it neither explicitly writeback zeroed
>>>> data nor employs any other mechanism to ensure that the zero data
>>>> writebacks before the metadata is written to disk.)
>>>>
>>>> 2. Resolve this race condition, ensure that there are no non-zero data
>>>> in the post-EOF partial blocks on the disk. I observed that after the
>>>> writeback holds the folio lock and calls folio_clear_dirty_for_io(),
>>>> mmap writes will re-trigger the page fault. Perhaps we can filter out
>>>> the EOF folio based on i_size in ext4_page_mkwrite(),
>>>> block_page_mkwrite() and iomap_page_mkwrite(), and then call
>>>> folio_wait_writeback() to wait for this partial folio writeback to
>>>> complete. This seems can break the race condition without introducing
>>>> too much overhead (no?).
>>>>
>>>> What do you think? Any other suggestions are also welcome.
>>> Hum, I like the option 2 because IMO non-zero data beyond EOF is a
>>> corner-case quirk which unnecessarily complicates rather common paths. But
>>> I'm not sure we can easily get rid of it. It can happen for example when
>>> you do appending write inside a block. The page is written back but before
>>> the transaction with i_disksize update commits we crash. Then again we have
>>> a non-zero content inside the block beyond EOF.
>>>
>>> So the only realistic option I see is to ensure tail of the block gets
>>> zeroed on disk before the transaction with i_disksize update commits in the
>>> cases of truncate up or write beyond EOF. data=ordered mode machinery is an
>>> asynchronous way how to achieve this. We could also just synchronously
>>> writeback the block where needed but the latency hit of such operation is
>>> going to be significant so I'm quite sure some workload somewhere will
>>> notice although the truncate up / write beyond EOF operations triggering this
>>> are not too common. So why do you need to get rid of these data=ordered
>>> mode usages? I guess because with iomap keeping our transaction handle ->
>>> folio lock ordering is complicated? Last time I looked it seemed still
>>> possible to keep it though.
>>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use data=ordered mode machinery only to wait for IO to complete before the
>>> transaction commits. That way it should be safe to start a transaction
>>> while holding folio lock and thus the iomap conversion would be easier.
>>>
>>> Honza
>> Can we treat EOF blocks as metadata and update them in the same
>> transaction as i_disksize? Although this would introduce some
>> management and journaling overhead, it could avoid the deadlock
>> of "writeback -> start handle -> trigger writeback".
> No, IMHO that would get too difficult. Just look at the hoops data=journal
> mode has to jump through to make page cache handling work with the
> journalling machinery. And you'd now have that for all the inodes. So I
> think some form of data=ordered machinery is much simpler to reason about.
>
> Honza
Indeed, this is a bit tricky.
Regards,
Baokun
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path
2026-02-05 12:58 ` Jan Kara
@ 2026-02-06 2:15 ` Zhang Yi
0 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-06 2:15 UTC (permalink / raw)
To: Jan Kara
Cc: Baokun Li, Theodore Tso, Christoph Hellwig, linux-ext4,
linux-fsdevel, linux-kernel, adilger.kernel, ojaswin, ritesh.list,
djwong, Zhang Yi, yizhang089, yangerkun, yukuai, libaokun9
On 2/5/2026 8:58 PM, Jan Kara wrote:
> On Thu 05-02-26 10:06:11, Zhang Yi wrote:
>> On 2/4/2026 10:23 PM, Jan Kara wrote:
>>> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>>>> On 2026-02-03 21:14, Theodore Tso wrote:
>>>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>>>> This means that the ordered journal mode is no longer in ext4 used
>>>>>> under the iomap infrastructure. The main reason is that iomap
>>>>>> processes each folio one by one during writeback. It first holds the
>>>>>> folio lock and then starts a transaction to create the block mapping.
>>>>>> If we still use the ordered mode, we need to perform writeback in
>>>>>> the logging process, which may require initiating a new transaction,
>>>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>>>> mode indeed has many synchronization dependencies, which increase
>>>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>>>
>>>>>> Currently, there are three scenarios where the ordered mode is used:
>>>>>> 1) append write,
>>>>>> 2) partial block truncate down, and
>>>>>> 3) online defragmentation.
>>>>>>
>>>>>> For append write, we can always allocate unwritten blocks to avoid
>>>>>> using the ordered journal mode.
>>>>> This is going to be a pretty severe performance regression, since it
>>>>> means that we will be doubling the journal load for append writes.
>>>>> What we really need to do here is to first write out the data blocks,
>>>>> and then only start the transaction handle to modify the data blocks
>>>>> *after* the data blocks have been written (to heretofore, unused
>>>>> blocks that were just allocated). It means inverting the order in
>>>>> which we write data blocks for the append write case, and in fact it
>>>>> will improve fsync() performance since we won't be gating writing the
>>>>> commit block on the date blocks getting written out in the append
>>>>> write case.
>>>>
>>>> I have some local demo patches doing something similar, and I think this
>>>> work could be decoupled from Yi's patch set.
>>>>
>>>> Since inode preallocation (PA) maintains physical block occupancy with a
>>>> logical-to-physical mapping, and ensures on-disk data consistency after
>>>> power failure, it is an excellent location for recording temporary
>>>> occupancy. Furthermore, since inode PA often allocates more blocks than
>>>> requested, it can also help reduce file fragmentation.
>>>>
>>>> The specific approach is as follows:
>>>>
>>>> 1. Allocate only the PA during block allocation without inserting it into
>>>> the extent status tree. Return the PA to the caller and increment its
>>>> refcount to prevent it from being discarded.
>>>>
>>>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>>> refcount and return -EIO. If successful, proceed to the next step.
>>>>
>>>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>>> extents. Release the refcount and update the extent tree.
>>>>
>>>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>>> release the old extent before inserting the new one.
>>>
>>> Sounds good. Just if I understand correctly case 4 would happen only if you
>>> really try to do something like COW with this? Normally you'd just use the
>>> already present blocks and write contents into them?
>>>
>>>> This ensures data atomicity, while jbd2—being a COW-like implementation
>>>> itself—ensures metadata atomicity. By leveraging this "delay map"
>>>> mechanism, we can achieve several benefits:
>>>>
>>>> * Lightweight, high-performance COW.
>>>> * High-performance software atomic writes (hardware-independent).
>>>> * Replacing dio_readnolock, which might otherwise read unexpected zeros.
>>>> * Replacing ordered data and data journal modes.
>>>> * Reduced handle hold time, as it's only held during extent tree updates.
>>>> * Paving the way for snapshot support.
>>>>
>>>> Of course, COW itself can lead to severe file fragmentation, especially
>>>> in small-scale overwrite scenarios.
>>>
>>> I agree the feature can provide very interesting benefits and we were
>>> pondering about something like that for a long time, just never got to
>>> implementing it. I'd say the immediate benefits are you can completely get
>>> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
>>> code complexity should not increase much. We could also mostly get rid of
>>> data=ordered mode use (although not completely - see my discussion with
>>> Zhang over patch 3) which would be also welcome simplification. These
>>
>> I suppose this feature can also be used to get rid of the data=ordered mode
>> use in online defragmentation. With this feature, perhaps we can develop a
>> new method of online defragmentation that eliminates the need to pre-allocate
>> a donor file. Instead, we can attempt to allocate as many contiguous blocks
>> as possible through PA. If the allocated length is longer than the original
>> extent, we can perform the swap and copy the data. Once the copy is complete,
>> we can atomically construct a new extent, then releases the original blocks
>> synchronously or asynchronously, similar to a regular copy-on-write (COW)
>> operation. What does this sounds?
>
> Well, the reason why defragmentation uses the donor file is that there can
> be a lot of policy in where and how the file is exactly placed (e.g. you
> might want to place multiple files together). It was decided it is too
> complex to implement these policies in the kernel so we've offloaded the
> decision where the file is placed to userspace. Back at those times we were
> also considering adding interface to guide allocation of blocks for a file
> so the userspace defragmenter could prepare donor file with desired blocks.
Indeed, it is easier to implement different strategies through donor files.
> But then the interest in defragmentation dropped (particularly due to
> advances in flash storage) and so these ideas never materialized.
As I understand it, defragmentation offers two primary benefits:
1. It improves the contiguity of file blocks, thereby enhancing read/write
performance;
2. It reduces the overhead on the block allocator and the management of
metadata.
As for the first point, indeed, this role has gradually diminished with the
development of flash memory devices. However, I believe the second point is
still very useful. For example, some of our customers have scenarios
involving large-capacity storage, where data is continuously written in a
cyclic manner. This results in the disk space usage remaining at a high level
for a long time, with a large number of both big and small files. Over time,
as fragmentation increases, the CPU usage of the mb_allocater will
significantly rise. Although this issue can be alleviated to some extent
through optimizations of the mb_allocater algorithm and the use of other
pre-allocation techniques, we still find online defragmentation to be very
necessary.
>
> We might rethink the online defragmentation interface but at this point
> I'm not sure we are ready to completely replace the idea of guiding the
> block placement using a donor file...
>
> Honza
Yeah, we can rethink it when supporting online defragmentation for the iomap
path.
Cheers,
Yi.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-05 15:05 ` Jan Kara
@ 2026-02-06 11:09 ` Zhang Yi
2026-02-06 15:35 ` Jan Kara
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-06 11:09 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, Zhang Yi, yizhang089,
libaokun1, yangerkun, yukuai
On 2/5/2026 11:05 PM, Jan Kara wrote:
> On Thu 05-02-26 15:50:38, Zhang Yi wrote:
>> On 2/4/2026 10:18 PM, Jan Kara wrote:
>>> So why do you need to get rid of these data=ordered
>>> mode usages? I guess because with iomap keeping our transaction handle ->
>>> folio lock ordering is complicated? Last time I looked it seemed still
>>> possible to keep it though.
>>
>> Yes, that's one reason. There's another reason is that we also need to
>> implement partial folio submits for iomap.
>>
>> When the journal process is waiting for a folio to be written back
>> (which contains an ordered block), and the folio also contains unmapped
>> blocks with a block size smaller than the folio size, if the regular
>> writeback process has already started committing this folio (and set the
>> writeback flag), then a deadlock may occur while mapping the remaining
>> unmapped blocks. This is because the writeback flag is cleared only
>> after the entire folio are processed and committed. If we want to support
>> partial folio submit for iomap, we need to be careful to prevent adding
>> additional performance overhead in the case of severe fragmentation.
>
> Yeah, this logic is currently handled by ext4_bio_write_folio(). And the
> deadlocks are currently resolved by grabbing transaction handle before we
> go and lock any page for writeback. But I agree that with iomap it may be
> tricky to keep this scheme.
>
>> Therefore, this aspect of the logic is complicated and subtle. As we
>> discussed in patch 0, if we can avoid using the data=ordered mode in
>> append write and online defrag, then this would be the only remaining
>> corner case. I'm not sure if it is worth implementing this and adjusting
>> the lock ordering.
>>
>>> Another possibility would be to just *submit* the write synchronously and
>>> use data=ordered mode machinery only to wait for IO to complete before the
>>> transaction commits. That way it should be safe to start a transaction
>>
>> IIUC, this solution seems can avoid adjusting the lock ordering, but partial
>> folio submission still needs to be implemented, is my understanding right?
>> This is because although we have already submitted this zeroed partial EOF
>> block, when the journal process is waiting for this folio, this folio is
>> being written back, and there are other blocks in this folio that need to be
>> mapped.
>
> That's a good question. If we submit the tail folio from truncation code,
> we could just submit the full folio write and there's no need to restrict
> ourselves only to mapped blocks. But you are correct that if this IO
> completes but the folio had holes in it and the hole gets filled in by
> write before the transaction with i_disksize update commits, jbd2 commit
> could still race with flush worker writing this folio again and the
> deadlock could happen. Hrm...
>
Yes!
> So how about the following:
Let me see, please correct me if my understanding is wrong, ana there are
also some points I don't get.
> We expand our io_end processing with the
> ability to journal i_disksize updates after page writeback completes. Then
> when doing truncate up or appending writes, we keep i_disksize at the old
> value and just zero folio tails in the page cache, mark the folio dirty and
> update i_size.
I think we need to submit this zeroed folio here as well. Because,
1) In the case of truncate up, if we don't submit, the i_disksize may have to
wait a long time (until the folio writeback is complete, which takes about
30 seconds by default) before being updated, which is too long.
2) In the case of appending writes. Assume that the folio written beyond this
one is written back first, we have to wait this zeroed folio to be write
back and then update i_disksize, so we can't wait too long either.
Right?
> When submitting writeback for a folio beyond current
> i_disksize we make sure writepages submits IO for all the folios from
> current i_disksize upwards.
Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
> When io_end processing happens after completed
> folio writeback, we update i_disksize to min(i_size, end of IO).
Yeah, in the case of append write back. Assume we append write the folio 2
and folio 3,
old_idisksize new_isize
| |
[WWZZ][WWWW][WWWW]
1 | 2 3
A
Assume that folio 1 first completes the writeback, then we update i_disksize
to pos A when the writeback is complete. Assume that folio 2 or 3 completes
first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
something like) folio 1 to complete and then update i_disksize to new_isize.
But in the case of truncate up, We will only write back this zeroed folio. If
the new i_size exceeds the end of this folio, how should we update i_disksize
to the correct value?
For example, we truncate the file from old old_idisksize to new_isize, but we
only zero and writeback folio 1, in the end_io processing of folio 1, we can
only update the i_disksize to A, but we can never update it to new_isize. Am
I missing something ?
old_idisksize new_isize
| |
[WWZZ]...hole ...
1 |
A
> This
> should take care of non-zero data exposure issues and with "delay map"
> processing Baokun works on all the inode metadata updates will happen after
> IO completion anyway so it will be nicely batched up in one transaction.
Currently, my iomap convert implementation always enables dioread_nolock,
so I feel that this solution can be achieved even without the "delay map"
feature. After we have the "delay map", we can extend this to the
buffer_head path.
Thanks,
Yi.
> It's a big change but so far I think it should work. What do you think?
>
> Honza
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-06 11:09 ` Zhang Yi
@ 2026-02-06 15:35 ` Jan Kara
2026-02-09 8:28 ` Zhang Yi
0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-06 15:35 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ojaswin, ritesh.list, hch, djwong, Zhang Yi,
yizhang089, libaokun1, yangerkun, yukuai
On Fri 06-02-26 19:09:53, Zhang Yi wrote:
> On 2/5/2026 11:05 PM, Jan Kara wrote:
> > So how about the following:
>
> Let me see, please correct me if my understanding is wrong, ana there are
> also some points I don't get.
>
> > We expand our io_end processing with the
> > ability to journal i_disksize updates after page writeback completes. Then
> > when doing truncate up or appending writes, we keep i_disksize at the old
> > value and just zero folio tails in the page cache, mark the folio dirty and
> > update i_size.
>
> I think we need to submit this zeroed folio here as well. Because,
>
> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
> wait a long time (until the folio writeback is complete, which takes about
> 30 seconds by default) before being updated, which is too long.
Correct but I'm not sure it matters. Current delalloc writes behave in the
same way already. For simplicity I'd thus prefer to not treat truncate up
in a special way but if we decide this indeed desirable, we can either
submit the tail folio immediately, or schedule work with earlier writeback.
> 2) In the case of appending writes. Assume that the folio written beyond this
> one is written back first, we have to wait this zeroed folio to be write
> back and then update i_disksize, so we can't wait too long either.
Correct, update of i_disksize after writeback of folios beyond current
i_disksize is blocked by the writeback of the tail folio.
> > When submitting writeback for a folio beyond current
> > i_disksize we make sure writepages submits IO for all the folios from
> > current i_disksize upwards.
>
> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
I was worried about a case like:
We have 4k blocksize, file is i_disksize 2k. Now you do:
pwrite(file, buf, 1, 6k);
pwrite(file, buf, 1, 10k);
pwrite(file, buf, 1, 14k);
The pwrite at offset 6k needs to zero the tail of the folio with index 0,
pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
for us to safely advance i_disksize to 14k+1, I though all the folios (and
zeroed out tails) need to be written out. But that's actually not the case.
We need to make sure the zeroed tail is written out only if the underlying
block is already allocated and marked as written at the time of zeroing.
And the blocks underlying intermediate i_size values will never be allocated
and written without advancing i_disksize to them. So I think you're
correct, we always have at most one tail folio - the one surrounding
current i_disksize - which needs to be written out to safely advance
i_disksize and we don't care about folios inbetween.
> > When io_end processing happens after completed
> > folio writeback, we update i_disksize to min(i_size, end of IO).
>
> Yeah, in the case of append write back. Assume we append write the folio 2
> and folio 3,
>
> old_idisksize new_isize
> | |
> [WWZZ][WWWW][WWWW]
> 1 | 2 3
> A
>
> Assume that folio 1 first completes the writeback, then we update i_disksize
> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
> something like) folio 1 to complete and then update i_disksize to new_isize.
>
> But in the case of truncate up, We will only write back this zeroed folio. If
> the new i_size exceeds the end of this folio, how should we update i_disksize
> to the correct value?
>
> For example, we truncate the file from old old_idisksize to new_isize, but we
> only zero and writeback folio 1, in the end_io processing of folio 1, we can
> only update the i_disksize to A, but we can never update it to new_isize. Am
> I missing something ?
>
> old_idisksize new_isize
> | |
> [WWZZ]...hole ...
> 1 |
> A
Good question. Based on the analysis above one option would be to setup
writeback of page straddling current i_disksize to update i_disksize to
current i_size on completion. That would be simple but would have an
unpleasant side effect that in case of a crash after append write we could
see increased i_disksize but zeros instead of written data. Another option
would be to update i_disksize on completion to the beginning of the first
dirty folio behind the written back range or i_size of there's not such
folio. This would still be relatively simple and mostly deal with "zeros
instead of data" problem.
> > This
> > should take care of non-zero data exposure issues and with "delay map"
> > processing Baokun works on all the inode metadata updates will happen after
> > IO completion anyway so it will be nicely batched up in one transaction.
>
> Currently, my iomap convert implementation always enables dioread_nolock,
Yes, BTW I think you could remove no-dioread_nolock paths before doing the
conversion to simplify matters a bit. I don't think it's seriously used
anywhere anymore.
> so I feel that this solution can be achieved even without the "delay map"
> feature. After we have the "delay map", we can extend this to the
> buffer_head path.
I agree, delay map is not necessary for this to work. But it will make
things likely faster.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-06 15:35 ` Jan Kara
@ 2026-02-09 8:28 ` Zhang Yi
2026-02-10 12:02 ` Zhang Yi
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-09 8:28 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, Zhang Yi, yizhang089,
libaokun1, yangerkun, yukuai
On 2/6/2026 11:35 PM, Jan Kara wrote:
> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>> So how about the following:
>>
>> Let me see, please correct me if my understanding is wrong, ana there are
>> also some points I don't get.
>>
>>> We expand our io_end processing with the
>>> ability to journal i_disksize updates after page writeback completes. Then
>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>> update i_size.
>>
>> I think we need to submit this zeroed folio here as well. Because,
>>
>> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>> wait a long time (until the folio writeback is complete, which takes about
>> 30 seconds by default) before being updated, which is too long.
>
> Correct but I'm not sure it matters. Current delalloc writes behave in the
> same way already. For simplicity I'd thus prefer to not treat truncate up
> in a special way but if we decide this indeed desirable, we can either
> submit the tail folio immediately, or schedule work with earlier writeback.
>
>> 2) In the case of appending writes. Assume that the folio written beyond this
>> one is written back first, we have to wait this zeroed folio to be write
>> back and then update i_disksize, so we can't wait too long either.
>
> Correct, update of i_disksize after writeback of folios beyond current
> i_disksize is blocked by the writeback of the tail folio.
>
>>> When submitting writeback for a folio beyond current
>>> i_disksize we make sure writepages submits IO for all the folios from
>>> current i_disksize upwards.
>>
>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
>
> I was worried about a case like:
>
> We have 4k blocksize, file is i_disksize 2k. Now you do:
> pwrite(file, buf, 1, 6k);
> pwrite(file, buf, 1, 10k);
> pwrite(file, buf, 1, 14k);
>
> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
> for us to safely advance i_disksize to 14k+1, I though all the folios (and
> zeroed out tails) need to be written out. But that's actually not the case.
> We need to make sure the zeroed tail is written out only if the underlying
> block is already allocated and marked as written at the time of zeroing.
> And the blocks underlying intermediate i_size values will never be allocated
> and written without advancing i_disksize to them. So I think you're
> correct, we always have at most one tail folio - the one surrounding
> current i_disksize - which needs to be written out to safely advance
> i_disksize and we don't care about folios inbetween.
>
>>> When io_end processing happens after completed
>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>
>> Yeah, in the case of append write back. Assume we append write the folio 2
>> and folio 3,
>>
>> old_idisksize new_isize
>> | |
>> [WWZZ][WWWW][WWWW]
>> 1 | 2 3
>> A
>>
>> Assume that folio 1 first completes the writeback, then we update i_disksize
>> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
>> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
>> something like) folio 1 to complete and then update i_disksize to new_isize.
>>
>> But in the case of truncate up, We will only write back this zeroed folio. If
>> the new i_size exceeds the end of this folio, how should we update i_disksize
>> to the correct value?
>>
>> For example, we truncate the file from old old_idisksize to new_isize, but we
>> only zero and writeback folio 1, in the end_io processing of folio 1, we can
>> only update the i_disksize to A, but we can never update it to new_isize. Am
>> I missing something ?
>>
>> old_idisksize new_isize
>> | |
>> [WWZZ]...hole ...
>> 1 |
>> A
>
> Good question. Based on the analysis above one option would be to setup
> writeback of page straddling current i_disksize to update i_disksize to
> current i_size on completion. That would be simple but would have an
> unpleasant side effect that in case of a crash after append write we could
> see increased i_disksize but zeros instead of written data. Another option
> would be to update i_disksize on completion to the beginning of the first
> dirty folio behind the written back range or i_size of there's not such
> folio. This would still be relatively simple and mostly deal with "zeros
> instead of data" problem.
Ha, good idea! I think it should work. I will try the second option, thank
you a lot for this suggestion. :)
>
>>> This
>>> should take care of non-zero data exposure issues and with "delay map"
>>> processing Baokun works on all the inode metadata updates will happen after
>>> IO completion anyway so it will be nicely batched up in one transaction.
>>
>> Currently, my iomap convert implementation always enables dioread_nolock,
>
> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
> conversion to simplify matters a bit. I don't think it's seriously used
> anywhere anymore.
>
Sure. After removing the no-dioread_nolock paths, the behavior of the
buffer_head path (extents-based and no-journal data mode) and the iomap path
in append write and truncate operations can be made consistent.
Cheers,
Yi.
>> so I feel that this solution can be achieved even without the "delay map"
>> feature. After we have the "delay map", we can extend this to the
>> buffer_head path.
>
> I agree, delay map is not necessary for this to work. But it will make
> things likely faster.
>
> Honza
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-03 6:25 ` [PATCH -next v2 03/22] ext4: only order data when partially block truncating down Zhang Yi
2026-02-03 9:59 ` Jan Kara
2026-02-04 4:21 ` kernel test robot
@ 2026-02-10 7:05 ` Ojaswin Mujoo
2026-02-10 15:57 ` Zhang Yi
2 siblings, 1 reply; 56+ messages in thread
From: Ojaswin Mujoo @ 2026-02-10 7:05 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, yi.zhang, yizhang089, libaokun1,
yangerkun, yukuai
On Tue, Feb 03, 2026 at 02:25:03PM +0800, Zhang Yi wrote:
> Currently, __ext4_block_zero_page_range() is called in the following
> four cases to zero out the data in partial blocks:
>
> 1. Truncate down.
> 2. Truncate up.
> 3. Perform block allocation (e.g., fallocate) or append writes across a
> range extending beyond the end of the file (EOF).
> 4. Partial block punch hole.
>
> If the default ordered data mode is used, __ext4_block_zero_page_range()
> will write back the zeroed data to the disk through the order mode after
> zeroing out.
>
> Among the cases 1,2 and 3 described above, only case 1 actually requires
> this ordered write. Assuming no one intentionally bypasses the file
> system to write directly to the disk. When performing a truncate down
> operation, ensuring that the data beyond the EOF is zeroed out before
> updating i_disksize is sufficient to prevent old data from being exposed
> when the file is later extended. In other words, as long as the on-disk
> data in case 1 can be properly zeroed out, only the data in memory needs
> to be zeroed out in cases 2 and 3, without requiring ordered data.
>
> Case 4 does not require ordered data because the entire punch hole
> operation does not provide atomicity guarantees. Therefore, it's safe to
> move the ordered data operation from __ext4_block_zero_page_range() to
> ext4_truncate().
>
> It should be noted that after this change, we can only determine whether
> to perform ordered data operations based on whether the target block has
> been zeroed, rather than on the state of the buffer head. Consequently,
> unnecessary ordered data operations may occur when truncating an
> unwritten dirty block. However, this scenario is relatively rare, so the
> overall impact is minimal.
>
> This is prepared for the conversion to the iomap infrastructure since it
> doesn't use ordered data mode and requires active writeback, which
> reduces the complexity of the conversion.
Hi Yi,
Took me quite some time to understand what we are doing here, I'll
just add my understanding here to confirm/document :)
So your argument is that currently all paths that change the i_size take
care of zeroing the (newsize, eof block boundary) before i_size change
is seen by users:
- dio does it in iomap_dio_bio_iter if IOMAP_UNWRITTEN (true for first allocation)
- buffered IO/mmap write does it in ext4_da_write_begin() ->
ext4_block_write_begin() for buffer_new (true for first allocation)
- falloc doesn't zero the new eof block but it allocates an unwrit
extent so no stale data issue. When an allocation happens from the
above 2 methods then we anyways will zero it.
- truncate down also takes care of this via ext4_truncate() ->
ext4_block_truncate_page()
Now, parallely there are also codepaths that say grow the i_size but
then also zero the (old_size, block boundary) range before the i_size
commits. This is so that they want to be sure the newly visible range
doesn't expose stale data.
For example:
- truncate up from 2kb to 8kb will zero (2kb,4kb) via ext4_block_truncate_page()
- with i_size = 2kb, buffered IO at 6kb would zero 2kb,4kb in ext4_da_write_end()
- I'm unable to see if/where we do it via dio path.
You originally proposed that we can remove the logic to zeroout
(old_size, block_boundary) in data=ordered fashion, ie we don't need to
trigger the zeroout IO before the i_size change commits, we can just zero the
range in memory because we would have already zeroed them earlier when
we had allocated at old_isize, or truncated down to old_isize.
To this Jan pointed out that although we take care to zeroout (new_size,
block_boundary) its not enough because we could still end up with data
past eof:
1. race of buffered write vs mmap write past eof. i_size = 2kb,
we write (2kb, 3kb).
2. The write goes through but we crash before i_size=3kb txn can commit.
Again we have data past 2kb ie the eof block.
Now, Im still looking into this part but the reason we want to get rid of
this data=ordered IO is so that we don't trigger a writeback due to
journal commit which tries to acquire folio_lock of a folio already
locked by iomap. However we will now try an alternate way to get past
this.
Is my understanding correct?
Regards,
ojaswin
PS: -g auto tests are passing (no regressions) with 64k and 4k bs on
powerpc 64k pagesize box so thats nice :D
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f856ea015263..20b60abcf777 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
> folio_zero_range(folio, offset, length);
> BUFFER_TRACE(bh, "zeroed end of block");
>
> - if (ext4_should_journal_data(inode)) {
> + if (ext4_should_journal_data(inode))
> err = ext4_dirty_journalled_data(handle, bh);
> - } else {
> + else
> mark_buffer_dirty(bh);
> - /*
> - * Only the written block requires ordered data to prevent
> - * exposing stale data.
> - */
> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
> - ext4_should_order_data(inode))
> - err = ext4_jbd2_inode_add_write(handle, inode, from,
> - length);
> - }
> if (!err && did_zero)
> *did_zero = true;
>
> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
> goto out_trace;
> }
>
> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
> - ext4_block_truncate_page(handle, mapping, inode->i_size);
> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
> + unsigned int zero_len;
> +
> + zero_len = ext4_block_truncate_page(handle, mapping,
> + inode->i_size);
> + if (zero_len < 0) {
> + err = zero_len;
> + goto out_stop;
> + }
> + if (zero_len && !IS_DAX(inode) &&
> + ext4_should_order_data(inode)) {
> + err = ext4_jbd2_inode_add_write(handle, inode,
> + inode->i_size, zero_len);
> + if (err)
> + goto out_stop;
> + }
> + }
>
> /*
> * We add the inode to the orphan list, so that if this
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-09 8:28 ` Zhang Yi
@ 2026-02-10 12:02 ` Zhang Yi
2026-02-10 14:07 ` Jan Kara
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-10 12:02 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, yizhang089, libaokun1,
yangerkun, yukuai, Zhang Yi
On 2/9/2026 4:28 PM, Zhang Yi wrote:
> On 2/6/2026 11:35 PM, Jan Kara wrote:
>> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>>> So how about the following:
>>>
>>> Let me see, please correct me if my understanding is wrong, ana there are
>>> also some points I don't get.
>>>
>>>> We expand our io_end processing with the
>>>> ability to journal i_disksize updates after page writeback completes. Then
While I was extending the end_io path of buffered_head to support updating
i_disksize, I found another problem that requires discussion.
Supporting updates to i_disksize in end_io requires starting a handle, which
conflicts with the data=ordered mode because folios written back through the
journal process cannot initiate any handles; otherwise, this may lead to a
deadlock. This limitation does not affect the iomap path, as it does not use
the data=ordered mode at all. However, in the buffered_head path, online
defragmentation (if this change works, it should be the last user) still uses
the data=ordered mode.
Assume that during online defragmentation, after the EOF partial block is
copied and swapped, the transaction submitting process could be raced by a
concurrent truncate-up operation. Then, when the journal process commits this
block, the i_disksize needs to be updated after the I/O is complete. Finally,
it may trigger a deadlock issue when it starting a new transaction. Conversely,
if we do truncate up first, and then perform the EOF block swap operation just
after it, the same problem will also occur.
Even if we perform synchronous writeback for the EOF block in mext_move_extent(),
it still won't work. This is because swapped blocks that have entered the
ordered list could potentially become EOF blocks at any time before the
transaction is committed (e.g., a concurrent truncate down happens).
Therefore, I was thinking, perhaps currently we have to keep the buffer_head
path as it is, and only modify the truncate up and append write for the iomap
path. What do you think?
Cheers,
Yi.
>>>> when doing truncate up or appending writes, we keep i_disksize at the old
>>>> value and just zero folio tails in the page cache, mark the folio dirty and
>>>> update i_size.
>>>
>>> I think we need to submit this zeroed folio here as well. Because,
>>>
>>> 1) In the case of truncate up, if we don't submit, the i_disksize may have to
>>> wait a long time (until the folio writeback is complete, which takes about
>>> 30 seconds by default) before being updated, which is too long.
>>
>> Correct but I'm not sure it matters. Current delalloc writes behave in the
>> same way already. For simplicity I'd thus prefer to not treat truncate up
>> in a special way but if we decide this indeed desirable, we can either
>> submit the tail folio immediately, or schedule work with earlier writeback.
>>
>>> 2) In the case of appending writes. Assume that the folio written beyond this
>>> one is written back first, we have to wait this zeroed folio to be write
>>> back and then update i_disksize, so we can't wait too long either.
>>
>> Correct, update of i_disksize after writeback of folios beyond current
>> i_disksize is blocked by the writeback of the tail folio.
>>
>>>> When submitting writeback for a folio beyond current
>>>> i_disksize we make sure writepages submits IO for all the folios from
>>>> current i_disksize upwards.
>>>
>>> Why "all the folios"? IIUC, we only wait the zeroed EOF folio is sufficient.
>>
>> I was worried about a case like:
>>
>> We have 4k blocksize, file is i_disksize 2k. Now you do:
>> pwrite(file, buf, 1, 6k);
>> pwrite(file, buf, 1, 10k);
>> pwrite(file, buf, 1, 14k);
>>
>> The pwrite at offset 6k needs to zero the tail of the folio with index 0,
>> pwrite at 10k needs to zero the tail of the folio with index 1, etc. And
>> for us to safely advance i_disksize to 14k+1, I though all the folios (and
>> zeroed out tails) need to be written out. But that's actually not the case.
>> We need to make sure the zeroed tail is written out only if the underlying
>> block is already allocated and marked as written at the time of zeroing.
>> And the blocks underlying intermediate i_size values will never be allocated
>> and written without advancing i_disksize to them. So I think you're
>> correct, we always have at most one tail folio - the one surrounding
>> current i_disksize - which needs to be written out to safely advance
>> i_disksize and we don't care about folios inbetween.
>>
>>>> When io_end processing happens after completed
>>>> folio writeback, we update i_disksize to min(i_size, end of IO).
>>>
>>> Yeah, in the case of append write back. Assume we append write the folio 2
>>> and folio 3,
>>>
>>> old_idisksize new_isize
>>> | |
>>> [WWZZ][WWWW][WWWW]
>>> 1 | 2 3
>>> A
>>>
>>> Assume that folio 1 first completes the writeback, then we update i_disksize
>>> to pos A when the writeback is complete. Assume that folio 2 or 3 completes
>>> first, we should wait(e.g. call filemap_fdatawait_range_keep_errors() or
>>> something like) folio 1 to complete and then update i_disksize to new_isize.
>>>
>>> But in the case of truncate up, We will only write back this zeroed folio. If
>>> the new i_size exceeds the end of this folio, how should we update i_disksize
>>> to the correct value?
>>>
>>> For example, we truncate the file from old old_idisksize to new_isize, but we
>>> only zero and writeback folio 1, in the end_io processing of folio 1, we can
>>> only update the i_disksize to A, but we can never update it to new_isize. Am
>>> I missing something ?
>>>
>>> old_idisksize new_isize
>>> | |
>>> [WWZZ]...hole ...
>>> 1 |
>>> A
>>
>> Good question. Based on the analysis above one option would be to setup
>> writeback of page straddling current i_disksize to update i_disksize to
>> current i_size on completion. That would be simple but would have an
>> unpleasant side effect that in case of a crash after append write we could
>> see increased i_disksize but zeros instead of written data. Another option
>> would be to update i_disksize on completion to the beginning of the first
>> dirty folio behind the written back range or i_size of there's not such
>> folio. This would still be relatively simple and mostly deal with "zeros
>> instead of data" problem.
>
> Ha, good idea! I think it should work. I will try the second option, thank
> you a lot for this suggestion. :)
>
>>
>>>> This
>>>> should take care of non-zero data exposure issues and with "delay map"
>>>> processing Baokun works on all the inode metadata updates will happen after
>>>> IO completion anyway so it will be nicely batched up in one transaction.
>>>
>>> Currently, my iomap convert implementation always enables dioread_nolock,
>>
>> Yes, BTW I think you could remove no-dioread_nolock paths before doing the
>> conversion to simplify matters a bit. I don't think it's seriously used
>> anywhere anymore.
>>
>
> Sure. After removing the no-dioread_nolock paths, the behavior of the
> buffer_head path (extents-based and no-journal data mode) and the iomap path
> in append write and truncate operations can be made consistent.
>
> Cheers,
> Yi.
>
>>> so I feel that this solution can be achieved even without the "delay map"
>>> feature. After we have the "delay map", we can extend this to the
>>> buffer_head path.
>>
>> I agree, delay map is not necessary for this to work. But it will make
>> things likely faster.
>>
>> Honza
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-10 12:02 ` Zhang Yi
@ 2026-02-10 14:07 ` Jan Kara
2026-02-10 16:11 ` Zhang Yi
0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-10 14:07 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ojaswin, ritesh.list, hch, djwong, yizhang089,
libaokun1, yangerkun, yukuai
On Tue 10-02-26 20:02:51, Zhang Yi wrote:
> On 2/9/2026 4:28 PM, Zhang Yi wrote:
> > On 2/6/2026 11:35 PM, Jan Kara wrote:
> >> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
> >>> On 2/5/2026 11:05 PM, Jan Kara wrote:
> >>>> So how about the following:
> >>>
> >>> Let me see, please correct me if my understanding is wrong, ana there are
> >>> also some points I don't get.
> >>>
> >>>> We expand our io_end processing with the
> >>>> ability to journal i_disksize updates after page writeback completes. Then
>
> While I was extending the end_io path of buffered_head to support updating
> i_disksize, I found another problem that requires discussion.
>
> Supporting updates to i_disksize in end_io requires starting a handle, which
> conflicts with the data=ordered mode because folios written back through the
> journal process cannot initiate any handles; otherwise, this may lead to a
> deadlock. This limitation does not affect the iomap path, as it does not use
> the data=ordered mode at all. However, in the buffered_head path, online
> defragmentation (if this change works, it should be the last user) still uses
> the data=ordered mode.
Right and my intention was to use reserved handle for the i_disksize update
similarly as we currently use reserved handle for unwritten extent
conversion after page writeback is done.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-10 7:05 ` Ojaswin Mujoo
@ 2026-02-10 15:57 ` Zhang Yi
2026-02-11 15:23 ` Ojaswin Mujoo
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-10 15:57 UTC (permalink / raw)
To: Ojaswin Mujoo, Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
jack, ritesh.list, hch, djwong, yi.zhang, libaokun1, yangerkun,
yukuai
On 2/10/2026 3:05 PM, Ojaswin Mujoo wrote:
> On Tue, Feb 03, 2026 at 02:25:03PM +0800, Zhang Yi wrote:
>> Currently, __ext4_block_zero_page_range() is called in the following
>> four cases to zero out the data in partial blocks:
>>
>> 1. Truncate down.
>> 2. Truncate up.
>> 3. Perform block allocation (e.g., fallocate) or append writes across a
>> range extending beyond the end of the file (EOF).
>> 4. Partial block punch hole.
>>
>> If the default ordered data mode is used, __ext4_block_zero_page_range()
>> will write back the zeroed data to the disk through the order mode after
>> zeroing out.
>>
>> Among the cases 1,2 and 3 described above, only case 1 actually requires
>> this ordered write. Assuming no one intentionally bypasses the file
>> system to write directly to the disk. When performing a truncate down
>> operation, ensuring that the data beyond the EOF is zeroed out before
>> updating i_disksize is sufficient to prevent old data from being exposed
>> when the file is later extended. In other words, as long as the on-disk
>> data in case 1 can be properly zeroed out, only the data in memory needs
>> to be zeroed out in cases 2 and 3, without requiring ordered data.
>>
>> Case 4 does not require ordered data because the entire punch hole
>> operation does not provide atomicity guarantees. Therefore, it's safe to
>> move the ordered data operation from __ext4_block_zero_page_range() to
>> ext4_truncate().
>>
>> It should be noted that after this change, we can only determine whether
>> to perform ordered data operations based on whether the target block has
>> been zeroed, rather than on the state of the buffer head. Consequently,
>> unnecessary ordered data operations may occur when truncating an
>> unwritten dirty block. However, this scenario is relatively rare, so the
>> overall impact is minimal.
>>
>> This is prepared for the conversion to the iomap infrastructure since it
>> doesn't use ordered data mode and requires active writeback, which
>> reduces the complexity of the conversion.
>
> Hi Yi,
>
> Took me quite some time to understand what we are doing here, I'll
> just add my understanding here to confirm/document :)
Hi, Ojaswin!
Thank you for review and test this series.
>
> So your argument is that currently all paths that change the i_size take
> care of zeroing the (newsize, eof block boundary) before i_size change
> is seen by users:
> - dio does it in iomap_dio_bio_iter if IOMAP_UNWRITTEN (true for first allocation)
> - buffered IO/mmap write does it in ext4_da_write_begin() ->
> ext4_block_write_begin() for buffer_new (true for first allocation)
> - falloc doesn't zero the new eof block but it allocates an unwrit
> extent so no stale data issue. When an allocation happens from the
> above 2 methods then we anyways will zero it.
These two zeroing operations mentioned above are mainly used to
initialize newly allocated blocks, which is not the main focus of this
discussion.
The focus of this discussion is how to clear the portions of allocated
blocks that extend beyond the EOF.
> - truncate down also takes care of this via ext4_truncate() ->
> ext4_block_truncate_page()
>
> Now, parallely there are also codepaths that say grow the i_size but
> then also zero the (old_size, block boundary) range before the i_size
> commits. This is so that they want to be sure the newly visible range
> doesn't expose stale data.
> For example:
> - truncate up from 2kb to 8kb will zero (2kb,4kb) via ext4_block_truncate_page()
> - with i_size = 2kb, buffered IO at 6kb would zero 2kb,4kb in ext4_da_write_end()
Yes, you are right.
> - I'm unable to see if/where we do it via dio path.
I don't see it too, so I think this is also a problem.
>
> You originally proposed that we can remove the logic to zeroout
> (old_size, block_boundary) in data=ordered fashion, ie we don't need to
> trigger the zeroout IO before the i_size change commits, we can just zero the
> range in memory because we would have already zeroed them earlier when
> we had allocated at old_isize, or truncated down to old_isize.
Yes.
>
> To this Jan pointed out that although we take care to zeroout (new_size,
> block_boundary) its not enough because we could still end up with data
> past eof:
>
> 1. race of buffered write vs mmap write past eof. i_size = 2kb,
> we write (2kb, 3kb).
> 2. The write goes through but we crash before i_size=3kb txn can commit.
> Again we have data past 2kb ie the eof block.
>
Yes.
> Now, Im still looking into this part but the reason we want to get rid of
> this data=ordered IO is so that we don't trigger a writeback due to
> journal commit which tries to acquire folio_lock of a folio already
> locked by iomap.
Yes, and iomap will start a new transaction under the folio lock, which
may also wait the current committing transaction to finish.
> However we will now try an alternate way to get past
> this.
>
> Is my understanding correct?
Yes.
Cheers,
Yi.
>
> Regards,
> ojaswin
>
> PS: -g auto tests are passing (no regressions) with 64k and 4k bs on
> powerpc 64k pagesize box so thats nice :D
>
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/ext4/inode.c | 32 +++++++++++++++++++-------------
>> 1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index f856ea015263..20b60abcf777 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4106,19 +4106,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>> folio_zero_range(folio, offset, length);
>> BUFFER_TRACE(bh, "zeroed end of block");
>>
>> - if (ext4_should_journal_data(inode)) {
>> + if (ext4_should_journal_data(inode))
>> err = ext4_dirty_journalled_data(handle, bh);
>> - } else {
>> + else
>> mark_buffer_dirty(bh);
>> - /*
>> - * Only the written block requires ordered data to prevent
>> - * exposing stale data.
>> - */
>> - if (!buffer_unwritten(bh) && !buffer_delay(bh) &&
>> - ext4_should_order_data(inode))
>> - err = ext4_jbd2_inode_add_write(handle, inode, from,
>> - length);
>> - }
>> if (!err && did_zero)
>> *did_zero = true;
>>
>> @@ -4578,8 +4569,23 @@ int ext4_truncate(struct inode *inode)
>> goto out_trace;
>> }
>>
>> - if (inode->i_size & (inode->i_sb->s_blocksize - 1))
>> - ext4_block_truncate_page(handle, mapping, inode->i_size);
>> + if (inode->i_size & (inode->i_sb->s_blocksize - 1)) {
>> + unsigned int zero_len;
>> +
>> + zero_len = ext4_block_truncate_page(handle, mapping,
>> + inode->i_size);
>> + if (zero_len < 0) {
>> + err = zero_len;
>> + goto out_stop;
>> + }
>> + if (zero_len && !IS_DAX(inode) &&
>> + ext4_should_order_data(inode)) {
>> + err = ext4_jbd2_inode_add_write(handle, inode,
>> + inode->i_size, zero_len);
>> + if (err)
>> + goto out_stop;
>> + }
>> + }
>>
>> /*
>> * We add the inode to the orphan list, so that if this
>> --
>> 2.52.0
>>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-10 14:07 ` Jan Kara
@ 2026-02-10 16:11 ` Zhang Yi
2026-02-11 11:42 ` Jan Kara
0 siblings, 1 reply; 56+ messages in thread
From: Zhang Yi @ 2026-02-10 16:11 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, libaokun1, yangerkun, yukuai,
Zhang Yi
On 2/10/2026 10:07 PM, Jan Kara wrote:
> On Tue 10-02-26 20:02:51, Zhang Yi wrote:
>> On 2/9/2026 4:28 PM, Zhang Yi wrote:
>>> On 2/6/2026 11:35 PM, Jan Kara wrote:
>>>> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>>>>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>>>>> So how about the following:
>>>>>
>>>>> Let me see, please correct me if my understanding is wrong, ana there are
>>>>> also some points I don't get.
>>>>>
>>>>>> We expand our io_end processing with the
>>>>>> ability to journal i_disksize updates after page writeback completes. Then
>>
>> While I was extending the end_io path of buffered_head to support updating
>> i_disksize, I found another problem that requires discussion.
>>
>> Supporting updates to i_disksize in end_io requires starting a handle, which
>> conflicts with the data=ordered mode because folios written back through the
>> journal process cannot initiate any handles; otherwise, this may lead to a
>> deadlock. This limitation does not affect the iomap path, as it does not use
>> the data=ordered mode at all. However, in the buffered_head path, online
>> defragmentation (if this change works, it should be the last user) still uses
>> the data=ordered mode.
>
> Right and my intention was to use reserved handle for the i_disksize update
> similarly as we currently use reserved handle for unwritten extent
> conversion after page writeback is done.
>
> Honza
IIUC, reserved handle only works for ext4_jbd2_inode_add_wait(). It
doesn't work for ext4_jbd2_inode_add_write() because writebacks
triggered by the journaling process cannot initiate any handles,
including reserved handles. So, I guess you're suggesting that within
mext_move_extent(), we should proactively submit the blocks after
swapping, and then call ext4_jbd2_inode_add_wait() to replace the
existing ext4_jbd2_inode_add_write(). Is that correct?
Thanks,
Yi.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-10 16:11 ` Zhang Yi
@ 2026-02-11 11:42 ` Jan Kara
2026-02-11 13:38 ` Zhang Yi
0 siblings, 1 reply; 56+ messages in thread
From: Jan Kara @ 2026-02-11 11:42 UTC (permalink / raw)
To: Zhang Yi
Cc: Jan Kara, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, ojaswin, ritesh.list, hch, djwong, libaokun1,
yangerkun, yukuai, Zhang Yi
On Wed 11-02-26 00:11:51, Zhang Yi wrote:
> On 2/10/2026 10:07 PM, Jan Kara wrote:
> > On Tue 10-02-26 20:02:51, Zhang Yi wrote:
> > > On 2/9/2026 4:28 PM, Zhang Yi wrote:
> > > > On 2/6/2026 11:35 PM, Jan Kara wrote:
> > > > > On Fri 06-02-26 19:09:53, Zhang Yi wrote:
> > > > > > On 2/5/2026 11:05 PM, Jan Kara wrote:
> > > > > > > So how about the following:
> > > > > >
> > > > > > Let me see, please correct me if my understanding is wrong, ana there are
> > > > > > also some points I don't get.
> > > > > >
> > > > > > > We expand our io_end processing with the
> > > > > > > ability to journal i_disksize updates after page writeback completes. Then
> > >
> > > While I was extending the end_io path of buffered_head to support updating
> > > i_disksize, I found another problem that requires discussion.
> > >
> > > Supporting updates to i_disksize in end_io requires starting a handle, which
> > > conflicts with the data=ordered mode because folios written back through the
> > > journal process cannot initiate any handles; otherwise, this may lead to a
> > > deadlock. This limitation does not affect the iomap path, as it does not use
> > > the data=ordered mode at all. However, in the buffered_head path, online
> > > defragmentation (if this change works, it should be the last user) still uses
> > > the data=ordered mode.
> >
> > Right and my intention was to use reserved handle for the i_disksize update
> > similarly as we currently use reserved handle for unwritten extent
> > conversion after page writeback is done.
>
> IIUC, reserved handle only works for ext4_jbd2_inode_add_wait(). It doesn't
> work for ext4_jbd2_inode_add_write() because writebacks triggered by the
> journaling process cannot initiate any handles, including reserved handles.
Yes, we cannot start any new handles (reserved or not) from writeback
happening from jbd2 thread. I didn't think about that case so good catch.
So we can either do this once we have delay map and get rid of data=ordered
mode altogether or, as you write below, we have to submit the tail folios
proactively during truncate up / append write - but I don't like this
option too much because workloads appending to file by small chunks (say a
few bytes) will get a large performance hit from this.
> So, I guess you're suggesting that within mext_move_extent(), we should
> proactively submit the blocks after swapping, and then call
> ext4_jbd2_inode_add_wait() to replace the existing
> ext4_jbd2_inode_add_write(). Is that correct?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-11 11:42 ` Jan Kara
@ 2026-02-11 13:38 ` Zhang Yi
0 siblings, 0 replies; 56+ messages in thread
From: Zhang Yi @ 2026-02-11 13:38 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
ojaswin, ritesh.list, hch, djwong, libaokun1, yangerkun, yukuai,
Zhang Yi
On 2/11/2026 7:42 PM, Jan Kara wrote:
> On Wed 11-02-26 00:11:51, Zhang Yi wrote:
>> On 2/10/2026 10:07 PM, Jan Kara wrote:
>>> On Tue 10-02-26 20:02:51, Zhang Yi wrote:
>>>> On 2/9/2026 4:28 PM, Zhang Yi wrote:
>>>>> On 2/6/2026 11:35 PM, Jan Kara wrote:
>>>>>> On Fri 06-02-26 19:09:53, Zhang Yi wrote:
>>>>>>> On 2/5/2026 11:05 PM, Jan Kara wrote:
>>>>>>>> So how about the following:
>>>>>>>
>>>>>>> Let me see, please correct me if my understanding is wrong, ana there are
>>>>>>> also some points I don't get.
>>>>>>>
>>>>>>>> We expand our io_end processing with the
>>>>>>>> ability to journal i_disksize updates after page writeback completes. Then
>>>>
>>>> While I was extending the end_io path of buffered_head to support updating
>>>> i_disksize, I found another problem that requires discussion.
>>>>
>>>> Supporting updates to i_disksize in end_io requires starting a handle, which
>>>> conflicts with the data=ordered mode because folios written back through the
>>>> journal process cannot initiate any handles; otherwise, this may lead to a
>>>> deadlock. This limitation does not affect the iomap path, as it does not use
>>>> the data=ordered mode at all. However, in the buffered_head path, online
>>>> defragmentation (if this change works, it should be the last user) still uses
>>>> the data=ordered mode.
>>>
>>> Right and my intention was to use reserved handle for the i_disksize update
>>> similarly as we currently use reserved handle for unwritten extent
>>> conversion after page writeback is done.
>>
>> IIUC, reserved handle only works for ext4_jbd2_inode_add_wait(). It doesn't
>> work for ext4_jbd2_inode_add_write() because writebacks triggered by the
>> journaling process cannot initiate any handles, including reserved handles.
>
> Yes, we cannot start any new handles (reserved or not) from writeback
> happening from jbd2 thread. I didn't think about that case so good catch.
> So we can either do this once we have delay map and get rid of data=ordered
> mode altogether or, as you write below, we have to submit the tail folios
> proactively during truncate up / append write - but I don't like this
> option too much because workloads appending to file by small chunks (say a
> few bytes) will get a large performance hit from this.
>
Yeah, so let's keep the buffered_head path as it is now, and only modify
the iomap path to support the new post-EOF block zeroing solution for
truncate up and append write as discussed.
Cheers,
Yi.
>> So, I guess you're suggesting that within mext_move_extent(), we should
>> proactively submit the blocks after swapping, and then call
>> ext4_jbd2_inode_add_wait() to replace the existing
>> ext4_jbd2_inode_add_write(). Is that correct?
>
> Honza
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH -next v2 03/22] ext4: only order data when partially block truncating down
2026-02-10 15:57 ` Zhang Yi
@ 2026-02-11 15:23 ` Ojaswin Mujoo
0 siblings, 0 replies; 56+ messages in thread
From: Ojaswin Mujoo @ 2026-02-11 15:23 UTC (permalink / raw)
To: Zhang Yi
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, jack, ritesh.list, hch, djwong, yi.zhang,
libaokun1, yangerkun, yukuai
On Tue, Feb 10, 2026 at 11:57:03PM +0800, Zhang Yi wrote:
> On 2/10/2026 3:05 PM, Ojaswin Mujoo wrote:
> > On Tue, Feb 03, 2026 at 02:25:03PM +0800, Zhang Yi wrote:
> > > Currently, __ext4_block_zero_page_range() is called in the following
> > > four cases to zero out the data in partial blocks:
> > >
> > > 1. Truncate down.
> > > 2. Truncate up.
> > > 3. Perform block allocation (e.g., fallocate) or append writes across a
> > > range extending beyond the end of the file (EOF).
> > > 4. Partial block punch hole.
> > >
> > > If the default ordered data mode is used, __ext4_block_zero_page_range()
> > > will write back the zeroed data to the disk through the order mode after
> > > zeroing out.
> > >
> > > Among the cases 1,2 and 3 described above, only case 1 actually requires
> > > this ordered write. Assuming no one intentionally bypasses the file
> > > system to write directly to the disk. When performing a truncate down
> > > operation, ensuring that the data beyond the EOF is zeroed out before
> > > updating i_disksize is sufficient to prevent old data from being exposed
> > > when the file is later extended. In other words, as long as the on-disk
> > > data in case 1 can be properly zeroed out, only the data in memory needs
> > > to be zeroed out in cases 2 and 3, without requiring ordered data.
> > >
> > > Case 4 does not require ordered data because the entire punch hole
> > > operation does not provide atomicity guarantees. Therefore, it's safe to
> > > move the ordered data operation from __ext4_block_zero_page_range() to
> > > ext4_truncate().
> > >
> > > It should be noted that after this change, we can only determine whether
> > > to perform ordered data operations based on whether the target block has
> > > been zeroed, rather than on the state of the buffer head. Consequently,
> > > unnecessary ordered data operations may occur when truncating an
> > > unwritten dirty block. However, this scenario is relatively rare, so the
> > > overall impact is minimal.
> > >
> > > This is prepared for the conversion to the iomap infrastructure since it
> > > doesn't use ordered data mode and requires active writeback, which
> > > reduces the complexity of the conversion.
> >
> > Hi Yi,
> >
> > Took me quite some time to understand what we are doing here, I'll
> > just add my understanding here to confirm/document :)
>
> Hi, Ojaswin!
>
> Thank you for review and test this series.
>
> >
> > So your argument is that currently all paths that change the i_size take
> > care of zeroing the (newsize, eof block boundary) before i_size change
> > is seen by users:
> > - dio does it in iomap_dio_bio_iter if IOMAP_UNWRITTEN (true for first allocation)
> > - buffered IO/mmap write does it in ext4_da_write_begin() ->
> > ext4_block_write_begin() for buffer_new (true for first allocation)
> > - falloc doesn't zero the new eof block but it allocates an unwrit
> > extent so no stale data issue. When an allocation happens from the
> > above 2 methods then we anyways will zero it.
>
> These two zeroing operations mentioned above are mainly used to initialize
> newly allocated blocks, which is not the main focus of this discussion.
>
> The focus of this discussion is how to clear the portions of allocated
> blocks that extend beyond the EOF.
>
> > - truncate down also takes care of this via ext4_truncate() ->
> > ext4_block_truncate_page()
> >
> > Now, parallely there are also codepaths that say grow the i_size but
> > then also zero the (old_size, block boundary) range before the i_size
> > commits. This is so that they want to be sure the newly visible range
> > doesn't expose stale data.
> > For example:
> > - truncate up from 2kb to 8kb will zero (2kb,4kb) via ext4_block_truncate_page()
> > - with i_size = 2kb, buffered IO at 6kb would zero 2kb,4kb in ext4_da_write_end()
>
> Yes, you are right.
>
> > - I'm unable to see if/where we do it via dio path.
>
> I don't see it too, so I think this is also a problem.
>
> >
> > You originally proposed that we can remove the logic to zeroout
> > (old_size, block_boundary) in data=ordered fashion, ie we don't need to
> > trigger the zeroout IO before the i_size change commits, we can just zero the
> > range in memory because we would have already zeroed them earlier when
> > we had allocated at old_isize, or truncated down to old_isize.
>
> Yes.
>
> >
> > To this Jan pointed out that although we take care to zeroout (new_size,
> > block_boundary) its not enough because we could still end up with data
> > past eof:
> >
> > 1. race of buffered write vs mmap write past eof. i_size = 2kb,
> > we write (2kb, 3kb).
> > 2. The write goes through but we crash before i_size=3kb txn can commit.
> > Again we have data past 2kb ie the eof block.
> >
>
> Yes.
>
> > Now, Im still looking into this part but the reason we want to get rid of
> > this data=ordered IO is so that we don't trigger a writeback due to
> > journal commit which tries to acquire folio_lock of a folio already
> > locked by iomap.
>
> Yes, and iomap will start a new transaction under the folio lock, which may
> also wait the current committing transaction to finish.
Hi Yi,
Ahh okay got it, thanks for confirming.
Regards,
ojaswin
>
> > However we will now try an alternate way to get past
> > this.
> >
> > Is my understanding correct?
>
> Yes.
>
> Cheers,
> Yi.
>
> >
> > Regards,
> > ojaswin
> >
^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2026-02-11 15:23 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-03 6:25 [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 01/22] ext4: make ext4_block_zero_page_range() pass out did_zero Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 02/22] ext4: make ext4_block_truncate_page() return zeroed length Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 03/22] ext4: only order data when partially block truncating down Zhang Yi
2026-02-03 9:59 ` Jan Kara
2026-02-04 6:42 ` Zhang Yi
2026-02-04 14:18 ` Jan Kara
2026-02-05 3:27 ` Baokun Li
2026-02-05 14:07 ` Jan Kara
2026-02-06 1:14 ` Baokun Li
2026-02-05 7:50 ` Zhang Yi
2026-02-05 15:05 ` Jan Kara
2026-02-06 11:09 ` Zhang Yi
2026-02-06 15:35 ` Jan Kara
2026-02-09 8:28 ` Zhang Yi
2026-02-10 12:02 ` Zhang Yi
2026-02-10 14:07 ` Jan Kara
2026-02-10 16:11 ` Zhang Yi
2026-02-11 11:42 ` Jan Kara
2026-02-11 13:38 ` Zhang Yi
2026-02-04 4:21 ` kernel test robot
2026-02-10 7:05 ` Ojaswin Mujoo
2026-02-10 15:57 ` Zhang Yi
2026-02-11 15:23 ` Ojaswin Mujoo
2026-02-03 6:25 ` [PATCH -next v2 04/22] ext4: factor out journalled block zeroing range Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 05/22] ext4: stop passing handle to ext4_journalled_block_zero_range() Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 06/22] ext4: don't zero partial block under an active handle when truncating down Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 07/22] ext4: move ext4_block_zero_page_range() out of an active handle Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 08/22] ext4: zero post EOF partial block before appending write Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 09/22] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 10/22] ext4: implement buffered read iomap path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 11/22] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 12/22] ext4: implement buffered write iomap path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 13/22] ext4: implement writeback " Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 14/22] ext4: implement mmap " Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 15/22] iomap: correct the range of a partial dirty clear Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 16/22] iomap: support invalidating partial folios Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 17/22] ext4: implement partial block zero range iomap path Zhang Yi
2026-02-04 0:21 ` kernel test robot
2026-02-03 6:25 ` [PATCH -next v2 18/22] ext4: do not order data for inodes using buffered " Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 19/22] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 20/22] ext4: disable online defrag when inode using " Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 21/22] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2026-02-03 6:25 ` [PATCH -next v2 22/22] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
2026-02-03 6:43 ` [PATCH -next v2 00/22] ext4: use iomap for regular file's " Christoph Hellwig
2026-02-03 9:18 ` Zhang Yi
2026-02-03 13:14 ` Theodore Tso
2026-02-04 1:33 ` Zhang Yi
2026-02-04 1:59 ` Baokun Li
2026-02-04 14:23 ` Jan Kara
2026-02-05 2:06 ` Zhang Yi
2026-02-05 3:04 ` Baokun Li
2026-02-05 12:58 ` Jan Kara
2026-02-06 2:15 ` Zhang Yi
2026-02-05 2:55 ` Baokun Li
2026-02-05 12:46 ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox