* [PATCH RFC 1/8] fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
To: Christoph Hellwig, Jan Kara
Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
Gao Xiang, linux-erofs, Christian Brauner (Amutable)
In-Reply-To: <20260602-work-super-bdev_holder_global-v1-0-bb0fd82f3861@kernel.org>
blk_mode_t and fop_flags_t are both plain 'unsigned int __bitwise' flag
typedefs, exactly like the gfp_t, slab_flags_t and fmode_t that already
live in <linux/types.h>. Move them there so they are available
everywhere without having to drag in a subsystem header.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
include/linux/blkdev.h | 2 --
include/linux/fs.h | 2 --
include/linux/types.h | 2 ++
3 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..c8494d64a69d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -126,8 +126,6 @@ struct blk_integrity {
unsigned char pi_tuple_size;
};
-typedef unsigned int __bitwise blk_mode_t;
-
/* open for reading */
#define BLK_OPEN_READ ((__force blk_mode_t)(1 << 0))
/* open for writing */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11559c513dfb..e9346be8470f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1921,8 +1921,6 @@ struct dir_context {
struct io_uring_cmd;
struct offset_ctx;
-typedef unsigned int __bitwise fop_flags_t;
-
struct file_operations {
struct module *owner;
fop_flags_t fop_flags;
diff --git a/include/linux/types.h b/include/linux/types.h
index 608050dbca6a..ef026585420b 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -163,6 +163,8 @@ typedef u32 dma_addr_t;
typedef unsigned int __bitwise gfp_t;
typedef unsigned int __bitwise slab_flags_t;
typedef unsigned int __bitwise fmode_t;
+typedef unsigned int __bitwise blk_mode_t;
+typedef unsigned int __bitwise fop_flags_t;
#ifdef CONFIG_PHYS_ADDR_T_64BIT
typedef u64 phys_addr_t;
--
2.47.3
^ permalink raw reply related
* [PATCH RFC 0/8] fs: support freeze/thaw/mark_dead/sync with shared devices
From: Christian Brauner @ 2026-06-02 10:10 UTC (permalink / raw)
To: Christoph Hellwig, Jan Kara
Cc: Jens Axboe, Alexander Viro, linux-block, linux-kernel,
linux-fsdevel, Carlos Maiolino, linux-xfs, Chris Mason,
David Sterba, linux-btrfs, Theodore Ts'o, linux-ext4,
Gao Xiang, linux-erofs, Christian Brauner (Amutable)
Note, this is on the border between RFC/POC and so I haven't pushed this
through testing yet. But I don't want to waste more time on this before
showing it.
I surveyed various fs implementations because I want the ability to
extend userspace the ability to manage what devices can be onlined in a
centralized way without having to force every fs to care about this.
I realized that erofs allows sharing block devices with multiple
superblocks. Any freeze, thaw, removal, or sync on those devices will
not be communicated to the superblocks using it and our current
infrastructure is unable to deal with this.
This attempts to add the ability to go from device number to all the
superblock using that device, iterate through them one-by-one and
perform actions on them. For most fses this is a 1:1 mapping but for
erofs its a 1:many mapping.
This is not unreasonable infastructure to support in my opinion. I
played around with some ideas for this and I want to send out an RFC to
gather some early input.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---
Christian Brauner (8):
fs, block: move blk_mode_t and fop_flags_t into <linux/types.h>
fs: add a global device to super block hash table
fs: refuse to claim any frozen block device
xfs: port to fs_bdev_file_open_by_path()
btrfs: open via dedicated fs bdev helpers
ext4: open via dedicated fs bdev helpers
erofs: open via dedicated fs bdev helpers
super: make fs_holder_ops private
fs/btrfs/dev-replace.c | 6 +-
fs/btrfs/ioctl.c | 4 +-
fs/btrfs/volumes.c | 26 ++-
fs/erofs/data.c | 6 +
fs/erofs/internal.h | 10 ++
fs/erofs/super.c | 66 +++++--
fs/erofs/zdata.c | 10 +-
fs/ext4/super.c | 12 +-
fs/super.c | 452 ++++++++++++++++++++++++++++++++---------------
fs/xfs/xfs_buf.c | 2 +-
fs/xfs/xfs_super.c | 10 +-
include/linux/blkdev.h | 9 -
include/linux/fs.h | 2 -
include/linux/fs/super.h | 7 +
include/linux/types.h | 2 +
15 files changed, 433 insertions(+), 191 deletions(-)
---
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
change-id: 20260602-work-super-bdev_holder_global-8cba5e52bed5
^ permalink raw reply
* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo @ 2026-06-02 10:05 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <83e36f9c-aeb1-40d4-9265-fd22120a7fa9@huaweicloud.com>
On Fri, May 29, 2026 at 05:13:55PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
>
> On 5/27/2026 1:10 AM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce two new iomap_ops instances for ext4 buffered writes:
> >>
> >> - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> >> ext4_da_map_blocks() to map delalloc extents.
> >> - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> >> ext4_iomap_get_blocks() to directly allocate blocks.
> >>
> >> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> >> validity.
> >>
> >> Key changes and considerations:
> >>
> >> - Unwritten extents for new blocks (dioread_nolock always on)
> >> Since data=ordered mode is not used to prevent stale data exposure in
> >> the non-delayed allocation path, new blocks are always allocated as
> >> unwritten extents.
> >
> > Okay makes sense.
> >
> >>
> >> - Short write and write failure handling
> >> a. Delalloc path: On short write or failure, the stale delalloc range
> >> must be dropped and its space reservation released. Otherwise, a
> >> clean folio may cover leftover delalloc extents, causing
> >> inaccurate space reservation accounting.
> >
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> >
>
> In fact, in the normal buffer head path, writeback does not consume
> delalloc reservations. Instead, the reservations are retained until the
> inode is released or the area is written again using delalloc. This is
> because i_size is not updated during short writes. Therefore, when a
> zeroed dirty folio is written back, no block mapping is created for it.
> For details, please see the lblk >= blocks judgment in
> mpage_process_page_bufs().
Oh okay I see, I'm not very clear on the code path but what about a case
where i_size is beyond the short write range.
>
> This will not lead to duplicate space statistics, because
> ext4_da_map_blocks() only reserves space when inserting a new delalloc
> extent. Therefore, this does not pose a serious issue. However, It may
> cause some temporary and minor space leaks. Nevertheless, I think it
> would be better if delalloc extents could be released for the buffer
> head path when short writes occur.
Yes true, ideally it would be more intuitive if we cancelled the
reservations in short write.
Regards,
ojaswin
>
> > However in iomap we don't mark the range that we couldnt write as dirty
> > so we need to make sure we clear up the stale delalloc mappings. Is this
> > correct?
> >
> Yeah.
>
> Thanks,
> Yi.
>
> > Regards,
> > Ojaswin
> >
> >> b. Non-delalloc path: No cleanup of allocated blocks is needed on
> >> short write.
> >>
> >> - Lock ordering reversal
> >> The folio lock and transaction start ordering is reversed compared to
> >> the buffer_head buffered write path. To handle this, the journal
> >> handle must be stopped in iomap_begin() callbacks. The lock ordering
> >> documentation in super.c has been updated accordingly.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >> fs/ext4/ext4.h | 4 ++
> >> fs/ext4/file.c | 20 +++++-
> >> fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> >> fs/ext4/super.c | 10 ++-
> >> 4 files changed, 192 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >> index 1e27d73d7427..4832e7f7db82 100644
> >> --- a/fs/ext4/ext4.h
> >> +++ b/fs/ext4/ext4.h
> >> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> >> int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> >> struct buffer_head *bh);
> >> void ext4_set_inode_mapping_order(struct inode *inode);
> >> +int ext4_nonda_switch(struct super_block *sb);
> >> #define FALL_BACK_TO_NONDELALLOC 1
> >> #define CONVERT_INLINE_DATA 2
> >>
> >> @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> >>
> >> extern const struct iomap_ops ext4_iomap_ops;
> >> extern const struct iomap_ops ext4_iomap_report_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> >> +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> >> +extern const struct iomap_write_ops ext4_iomap_write_ops;
> >>
> >> static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> >> {
> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> >> index eb1a323962b1..7f9bfbbc4a4e 100644
> >> --- a/fs/ext4/file.c
> >> +++ b/fs/ext4/file.c
> >> @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >> return count;
> >> }
> >>
> >> +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> >> + struct iov_iter *from)
> >> +{
> >> + struct inode *inode = file_inode(iocb->ki_filp);
> >> + const struct iomap_ops *iomap_ops;
> >> +
> >> + if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> >> + iomap_ops = &ext4_iomap_buffered_da_write_ops;
> >> + else
> >> + iomap_ops = &ext4_iomap_buffered_write_ops;
> >> +
> >> + return iomap_file_buffered_write(iocb, from, iomap_ops,
> >> + &ext4_iomap_write_ops, NULL);
> >> +}
> >> +
> >> static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >> struct iov_iter *from)
> >> {
> >> @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >> if (ret <= 0)
> >> goto out;
> >>
> >> - ret = generic_perform_write(iocb, from);
> >> + if (ext4_inode_buffered_iomap(inode))
> >> + ret = ext4_iomap_buffered_write(iocb, from);
> >> + else
> >> + ret = generic_perform_write(iocb, from);
> >>
> >> out:
> >> inode_unlock(inode);
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 39577a6b65b9..1ae7d3f4a1c8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> >> return ret;
> >> }
> >>
> >> -static int ext4_nonda_switch(struct super_block *sb)
> >> +int ext4_nonda_switch(struct super_block *sb)
> >> {
> >> s64 free_clusters, dirty_clusters;
> >> struct ext4_sb_info *sbi = EXT4_SB(sb);
> >> @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> >> return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> >> }
> >>
> >> +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> >> +{
> >> + return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> >> +}
> >> +
> >> +const struct iomap_write_ops ext4_iomap_write_ops = {
> >> + .iomap_valid = ext4_iomap_valid,
> >> +};
> >> +
> >> static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >> struct ext4_map_blocks *map, loff_t offset,
> >> loff_t length, unsigned int flags)
> >> @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> >> !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >> iomap->flags |= IOMAP_F_MERGED;
> >>
> >> + iomap->validity_cookie = map->m_seq;
> >> +
> >> /*
> >> * Flags passed to ext4_map_blocks() for direct I/O writes can result
> >> * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> >> @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> >> .iomap_begin = ext4_iomap_begin_report,
> >> };
> >>
> >> +/* Map blocks */
> >> +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> >> +
> >> static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >> - loff_t length, struct ext4_map_blocks *map)
> >> + loff_t length, ext4_get_blocks_t get_blocks,
> >> + struct ext4_map_blocks *map)
> >> {
> >> u8 blkbits = inode->i_blkbits;
> >>
> >> @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> >> map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >> EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> >>
> >> + if (get_blocks)
> >> + return get_blocks(inode, map);
> >> +
> >> return ext4_map_blocks(NULL, inode, map, 0);
> >> }
> >>
> >> @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >> if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >> return -ERANGE;
> >>
> >> - ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> >> + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >> if (ret < 0)
> >> return ret;
> >>
> >> @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> >> return 0;
> >> }
> >>
> >> +static int ext4_iomap_get_blocks(struct inode *inode,
> >> + struct ext4_map_blocks *map)
> >> +{
> >> + loff_t i_size = i_size_read(inode);
> >> + handle_t *handle;
> >> + int ret;
> >> +
> >> + /*
> >> + * Check if the blocks have already been allocated, this could
> >> + * avoid initiating a new journal transaction and return the
> >> + * mapping information directly.
> >> + */
> >> + if ((map->m_lblk + map->m_len) <=
> >> + round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> >> + ret = ext4_map_blocks(NULL, inode, map, 0);
> >> + if (ret < 0)
> >> + return ret;
> >> + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> >> + EXT4_MAP_DELAYED))
> >> + return 0;
> >> + }
> >> +
> >> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> >> + ext4_chunk_trans_blocks(inode, map->m_len));
> >> + if (IS_ERR(handle))
> >> + return PTR_ERR(handle);
> >> +
> >> + ret = ext4_map_blocks(handle, inode, map,
> >> + EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> >> + /*
> >> + * Stop handle here following the lock ordering of the folio lock
> >> + * and the transaction start.
> >> + */
> >> + ext4_journal_stop(handle);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> >> +{
> >> + int ret, retries = 0;
> >> + struct ext4_map_blocks map;
> >> + ext4_get_blocks_t *get_blocks;
> >> +
> >> + ret = ext4_emergency_state(inode->i_sb);
> >> + if (unlikely(ret))
> >> + return ret;
> >> +
> >> + /* Inline data and non-extent are not supported. */
> >> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> >> + return -ERANGE;
> >> + if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> >> + return -EINVAL;
> >> + if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> >> + return -EINVAL;
> >> +
> >> + if (delalloc)
> >> + get_blocks = ext4_da_map_blocks;
> >> + else
> >> + get_blocks = ext4_iomap_get_blocks;
> >> +retry:
> >> + ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> >> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >> + goto retry;
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> >> + return 0;
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> + iomap, srcmap, false);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> >> + iomap, srcmap, true);
> >> +}
> >> +
> >> +/*
> >> + * On write failure, drop the stale delayed allocation range and release
> >> + * its reserved space for both start and end blocks. Otherwise, we may
> >> + * leave a range of delayed extents covered by a clean folio, which can
> >> + * result in inaccurate space reservation accounting.
> >> + */
> >> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> >> + loff_t length, struct iomap *iomap)
> >> +{
> >> + down_write(&EXT4_I(inode)->i_data_sem);
> >> + ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> >> + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> >> + up_write(&EXT4_I(inode)->i_data_sem);
> >> +}
> >> +
> >> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> + loff_t length, ssize_t written,
> >> + unsigned int flags,
> >> + struct iomap *iomap)
> >> +{
> >> + loff_t start_byte, end_byte;
> >> +
> >> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> >> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> >> + return 0;
> >> +
> >> + /* Nothing to do if we've written the entire delalloc extent */
> >> + start_byte = iomap_last_written_block(inode, offset, written);
> >> + end_byte = round_up(offset + length, i_blocksize(inode));
> >> + if (start_byte >= end_byte)
> >> + return 0;
> >> +
> >> + filemap_invalidate_lock(inode->i_mapping);
> >> + iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> >> + iomap, ext4_iomap_punch_delalloc);
> >> + filemap_invalidate_unlock(inode->i_mapping);
> >> + return 0;
> >> +}
> >> +
> >> +/*
> >> + * Since we always allocate unwritten extents, there is no need for
> >> + * iomap_end to clean up allocated blocks on a short write.
> >> + */
> >> +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> >> + .iomap_begin = ext4_iomap_buffered_write_begin,
> >> +};
> >> +
> >> +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> >> + .iomap_begin = ext4_iomap_buffered_da_write_begin,
> >> + .iomap_end = ext4_iomap_buffered_da_write_end,
> >> +};
> >> +
> >> const struct iomap_ops ext4_iomap_buffered_read_ops = {
> >> .iomap_begin = ext4_iomap_buffered_read_begin,
> >> };
> >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> >> index 6a77db4d3124..9bc294b769db 100644
> >> --- a/fs/ext4/super.c
> >> +++ b/fs/ext4/super.c
> >> @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> >> * -> page lock -> i_data_sem (rw)
> >> *
> >> * buffered write path:
> >> - * sb_start_write -> i_mutex -> mmap_lock
> >> - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> >> - * i_data_sem (rw)
> >> + * sb_start_write -> i_rwsem (w) -> mmap_lock
> >> + * - buffer_head path:
> >> + * sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> >> + * i_data_sem (rw)
> >> + * - iomap path:
> >> + * sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> >> + * sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> >> *
> >> * truncate:
> >> * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> >> --
> >> 2.52.0
> >>
>
^ permalink raw reply
* Re: [PATCH v2 0/10] fs: Fix missed inode write during fsync
From: Jan Kara @ 2026-06-02 7:22 UTC (permalink / raw)
To: linux-fsdevel
Cc: Christian Brauner, aivazian.tigran, Ted Tso, linux-ext4,
OGAWA Hirofumi, Jan Kara
In-Reply-To: <20260525085035.12891-1-jack@suse.cz>
Hello!
On Mon 25-05-26 10:58:06, Jan Kara wrote:
> here is v2 of the patch series which fixes the possibly missing inode write
> during fsync(2) for filesystems using generic metadata bh tracking. The
> inherent problem is that .write_inode methods clear inode dirty bit but they
> only copy inode contents into to the buffer cache. Because buffer carrying the
> inode is shared among multiple inodes, it cannot be tracked by the generic
> metadata bh tracking infrastructure and thus nothing is tracking that buffer
> needs to be written out to maintain fsync(2) guarantees. Normally, this gets
> taken care of by .write_inode checking for WB_SYNC_ALL writeback and submitting
> & waiting for the buffer in that case however if flush worker ends up writing
> the inode before data integrity writeback, this mechanism is broken.
>
> This patch series adds a way for filesystems to track metadata block number
> which contains the inode metadata and then uses this information to writeout
> the buffer on fsync.
FWIW I went through Sashiko review comments. Lot of them are hallucinated
but there are actually three good finds:
1) FAT implementation of inode tracking is broken when fsync races with
rename.
2) ext2 & minix inode tracking makes handling of dirsync even more broken
than it already is (current handling is already broken because we don't
flush any directory indirect blocks but my changes also stop flushing the
inode buffer).
3) mmb_sync() flushing of inode buffer is racy for multiple parallel
fsyncs.
So I'll be addressing these. Please don't waste time with the series as is.
Honza
>
> Changes since v1:
> * Fixed freeing for ext4 dynamically allocated mmb struct
> * Optimized tracking of block carrying the inode so that we don't flush it
> unnecessarily on fsync
> * Add forgotten check for reclaimed bh to mmb_sync() to avoid NULL ptr deref
> * Couple other smaller fixups pointed out by Sashiko
>
> Honza
> Previous versions:
> Link: http://lore.kernel.org/r/20260511115725.28441-1-jack@suse.cz # v1
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Ojaswin Mujoo @ 2026-06-02 7:02 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, libaokun, jack, ritesh.list, hch, yi.zhang,
yizhang089, yangerkun, yukuai
In-Reply-To: <20260528154059.GA6054@frogsfrogsfrogs>
On Thu, May 28, 2026 at 08:40:59AM -0700, Darrick J. Wong wrote:
> On Tue, May 26, 2026 at 10:40:30PM +0530, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@huawei.com>
> > >
> > > Introduce two new iomap_ops instances for ext4 buffered writes:
> > >
> > > - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
> > > ext4_da_map_blocks() to map delalloc extents.
> > > - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
> > > ext4_iomap_get_blocks() to directly allocate blocks.
> > >
> > > Also add ext4_iomap_valid() for the iomap infrastructure to check extent
> > > validity.
> > >
> > > Key changes and considerations:
> > >
> > > - Unwritten extents for new blocks (dioread_nolock always on)
> > > Since data=ordered mode is not used to prevent stale data exposure in
> > > the non-delayed allocation path, new blocks are always allocated as
> > > unwritten extents.
> >
> > Okay makes sense.
> >
> > >
> > > - Short write and write failure handling
> > > a. Delalloc path: On short write or failure, the stale delalloc range
> > > must be dropped and its space reservation released. Otherwise, a
> > > clean folio may cover leftover delalloc extents, causing
> > > inaccurate space reservation accounting.
> >
> > Hmm, okay so in the usual buffer head path, seems like during a short
> > write we still zero the new buffers we couldn't write and keep it dirty
> > (folio_zero_new_buffers()). This way they are still written back and
> > the delalloc reservations are used up.
> >
> > However in iomap we don't mark the range that we couldnt write as dirty
> > so we need to make sure we clear up the stale delalloc mappings. Is this
> > correct?
>
> Yes, that's true of iomap's pagecache handling.
Thanks for confirming Darrick.
Regards,
Ojaswin
>
> --D
>
> > Regards,
> > Ojaswin
> >
> > > b. Non-delalloc path: No cleanup of allocated blocks is needed on
> > > short write.
> > >
> > > - Lock ordering reversal
> > > The folio lock and transaction start ordering is reversed compared to
> > > the buffer_head buffered write path. To handle this, the journal
> > > handle must be stopped in iomap_begin() callbacks. The lock ordering
> > > documentation in super.c has been updated accordingly.
> > >
> > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > ---
> > > fs/ext4/ext4.h | 4 ++
> > > fs/ext4/file.c | 20 +++++-
> > > fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
> > > fs/ext4/super.c | 10 ++-
> > > 4 files changed, 192 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 1e27d73d7427..4832e7f7db82 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
> > > int do_journal_get_write_access(handle_t *handle, struct inode *inode,
> > > struct buffer_head *bh);
> > > void ext4_set_inode_mapping_order(struct inode *inode);
> > > +int ext4_nonda_switch(struct super_block *sb);
> > > #define FALL_BACK_TO_NONDELALLOC 1
> > > #define CONVERT_INLINE_DATA 2
> > >
> > > @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> > >
> > > extern const struct iomap_ops ext4_iomap_ops;
> > > extern const struct iomap_ops ext4_iomap_report_ops;
> > > +extern const struct iomap_ops ext4_iomap_buffered_write_ops;
> > > +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
> > > +extern const struct iomap_write_ops ext4_iomap_write_ops;
> > >
> > > static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> > > {
> > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > > index eb1a323962b1..7f9bfbbc4a4e 100644
> > > --- a/fs/ext4/file.c
> > > +++ b/fs/ext4/file.c
> > > @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > > return count;
> > > }
> > >
> > > +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
> > > + struct iov_iter *from)
> > > +{
> > > + struct inode *inode = file_inode(iocb->ki_filp);
> > > + const struct iomap_ops *iomap_ops;
> > > +
> > > + if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
> > > + iomap_ops = &ext4_iomap_buffered_da_write_ops;
> > > + else
> > > + iomap_ops = &ext4_iomap_buffered_write_ops;
> > > +
> > > + return iomap_file_buffered_write(iocb, from, iomap_ops,
> > > + &ext4_iomap_write_ops, NULL);
> > > +}
> > > +
> > > static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > > struct iov_iter *from)
> > > {
> > > @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > > if (ret <= 0)
> > > goto out;
> > >
> > > - ret = generic_perform_write(iocb, from);
> > > + if (ext4_inode_buffered_iomap(inode))
> > > + ret = ext4_iomap_buffered_write(iocb, from);
> > > + else
> > > + ret = generic_perform_write(iocb, from);
> > >
> > > out:
> > > inode_unlock(inode);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 39577a6b65b9..1ae7d3f4a1c8 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
> > > return ret;
> > > }
> > >
> > > -static int ext4_nonda_switch(struct super_block *sb)
> > > +int ext4_nonda_switch(struct super_block *sb)
> > > {
> > > s64 free_clusters, dirty_clusters;
> > > struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> > > return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
> > > }
> > >
> > > +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
> > > +{
> > > + return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
> > > +}
> > > +
> > > +const struct iomap_write_ops ext4_iomap_write_ops = {
> > > + .iomap_valid = ext4_iomap_valid,
> > > +};
> > > +
> > > static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > > struct ext4_map_blocks *map, loff_t offset,
> > > loff_t length, unsigned int flags)
> > > @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > > !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> > > iomap->flags |= IOMAP_F_MERGED;
> > >
> > > + iomap->validity_cookie = map->m_seq;
> > > +
> > > /*
> > > * Flags passed to ext4_map_blocks() for direct I/O writes can result
> > > * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
> > > @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
> > > .iomap_begin = ext4_iomap_begin_report,
> > > };
> > >
> > > +/* Map blocks */
> > > +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
> > > +
> > > static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > > - loff_t length, struct ext4_map_blocks *map)
> > > + loff_t length, ext4_get_blocks_t get_blocks,
> > > + struct ext4_map_blocks *map)
> > > {
> > > u8 blkbits = inode->i_blkbits;
> > >
> > > @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
> > > map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> > > EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
> > >
> > > + if (get_blocks)
> > > + return get_blocks(inode, map);
> > > +
> > > return ext4_map_blocks(NULL, inode, map, 0);
> > > }
> > >
> > > @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> > > if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > > return -ERANGE;
> > >
> > > - ret = ext4_iomap_map_blocks(inode, offset, length, &map);
> > > + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> > > if (ret < 0)
> > > return ret;
> > >
> > > @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
> > > return 0;
> > > }
> > >
> > > +static int ext4_iomap_get_blocks(struct inode *inode,
> > > + struct ext4_map_blocks *map)
> > > +{
> > > + loff_t i_size = i_size_read(inode);
> > > + handle_t *handle;
> > > + int ret;
> > > +
> > > + /*
> > > + * Check if the blocks have already been allocated, this could
> > > + * avoid initiating a new journal transaction and return the
> > > + * mapping information directly.
> > > + */
> > > + if ((map->m_lblk + map->m_len) <=
> > > + round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
> > > + ret = ext4_map_blocks(NULL, inode, map, 0);
> > > + if (ret < 0)
> > > + return ret;
> > > + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
> > > + EXT4_MAP_DELAYED))
> > > + return 0;
> > > + }
> > > +
> > > + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> > > + ext4_chunk_trans_blocks(inode, map->m_len));
> > > + if (IS_ERR(handle))
> > > + return PTR_ERR(handle);
> > > +
> > > + ret = ext4_map_blocks(handle, inode, map,
> > > + EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
> > > + /*
> > > + * Stop handle here following the lock ordering of the folio lock
> > > + * and the transaction start.
> > > + */
> > > + ext4_journal_stop(handle);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
> > > + loff_t offset, loff_t length, unsigned int flags,
> > > + struct iomap *iomap, struct iomap *srcmap, bool delalloc)
> > > +{
> > > + int ret, retries = 0;
> > > + struct ext4_map_blocks map;
> > > + ext4_get_blocks_t *get_blocks;
> > > +
> > > + ret = ext4_emergency_state(inode->i_sb);
> > > + if (unlikely(ret))
> > > + return ret;
> > > +
> > > + /* Inline data and non-extent are not supported. */
> > > + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> > > + return -ERANGE;
> > > + if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> > > + return -EINVAL;
> > > + if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
> > > + return -EINVAL;
> > > +
> > > + if (delalloc)
> > > + get_blocks = ext4_da_map_blocks;
> > > + else
> > > + get_blocks = ext4_iomap_get_blocks;
> > > +retry:
> > > + ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
> > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> > > + goto retry;
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> > > + return 0;
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_write_begin(struct inode *inode,
> > > + loff_t offset, loff_t length, unsigned int flags,
> > > + struct iomap *iomap, struct iomap *srcmap)
> > > +{
> > > + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > > + iomap, srcmap, false);
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
> > > + loff_t offset, loff_t length, unsigned int flags,
> > > + struct iomap *iomap, struct iomap *srcmap)
> > > +{
> > > + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
> > > + iomap, srcmap, true);
> > > +}
> > > +
> > > +/*
> > > + * On write failure, drop the stale delayed allocation range and release
> > > + * its reserved space for both start and end blocks. Otherwise, we may
> > > + * leave a range of delayed extents covered by a clean folio, which can
> > > + * result in inaccurate space reservation accounting.
> > > + */
> > > +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
> > > + loff_t length, struct iomap *iomap)
> > > +{
> > > + down_write(&EXT4_I(inode)->i_data_sem);
> > > + ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
> > > + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
> > > + up_write(&EXT4_I(inode)->i_data_sem);
> > > +}
> > > +
> > > +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> > > + loff_t length, ssize_t written,
> > > + unsigned int flags,
> > > + struct iomap *iomap)
> > > +{
> > > + loff_t start_byte, end_byte;
> > > +
> > > + /* If we didn't reserve the blocks, we're not allowed to punch them. */
> > > + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
> > > + return 0;
> > > +
> > > + /* Nothing to do if we've written the entire delalloc extent */
> > > + start_byte = iomap_last_written_block(inode, offset, written);
> > > + end_byte = round_up(offset + length, i_blocksize(inode));
> > > + if (start_byte >= end_byte)
> > > + return 0;
> > > +
> > > + filemap_invalidate_lock(inode->i_mapping);
> > > + iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
> > > + iomap, ext4_iomap_punch_delalloc);
> > > + filemap_invalidate_unlock(inode->i_mapping);
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Since we always allocate unwritten extents, there is no need for
> > > + * iomap_end to clean up allocated blocks on a short write.
> > > + */
> > > +const struct iomap_ops ext4_iomap_buffered_write_ops = {
> > > + .iomap_begin = ext4_iomap_buffered_write_begin,
> > > +};
> > > +
> > > +const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
> > > + .iomap_begin = ext4_iomap_buffered_da_write_begin,
> > > + .iomap_end = ext4_iomap_buffered_da_write_end,
> > > +};
> > > +
> > > const struct iomap_ops ext4_iomap_buffered_read_ops = {
> > > .iomap_begin = ext4_iomap_buffered_read_begin,
> > > };
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 6a77db4d3124..9bc294b769db 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
> > > * -> page lock -> i_data_sem (rw)
> > > *
> > > * buffered write path:
> > > - * sb_start_write -> i_mutex -> mmap_lock
> > > - * sb_start_write -> i_mutex -> transaction start -> page lock ->
> > > - * i_data_sem (rw)
> > > + * sb_start_write -> i_rwsem (w) -> mmap_lock
> > > + * - buffer_head path:
> > > + * sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
> > > + * i_data_sem (rw)
> > > + * - iomap path:
> > > + * sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
> > > + * sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle)
> > > *
> > > * truncate:
> > > * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
> > > --
> > > 2.52.0
> > >
> >
^ permalink raw reply
* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-02 5:56 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <b22baeed-3a20-47a4-8e7c-22f61a6eb49b@huaweicloud.com>
On Sat, May 30, 2026 at 05:32:54PM +0800, Zhang Yi wrote:
> On 5/28/2026 9:34 PM, Ojaswin Mujoo wrote:
> > On Wed, May 27, 2026 at 09:28:28PM +0530, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@suse.cz>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>> fs/ext4/ext4.h | 11 +++++++
> >>> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>> fs/ext4/super.c | 23 ++++++++++----
> >>> 4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> [...]
> >>> @@ -4746,8 +4771,10 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >>> loff_t from, loff_t end)
> >>> {
> >>> struct address_space *mapping = inode->i_mapping;
> >>> + struct ext4_inode_info *ei = EXT4_I(inode);
> >>> struct folio *folio;
> >>> bool do_submit = false;
> >>> + int ret;
> >>>
> >>> folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> >>> if (IS_ERR(folio))
> >>> @@ -4757,14 +4784,50 @@ static int ext4_iomap_submit_zero_block(struct inode *inode,
> >>> folio_wait_writeback(folio);
> >>> WARN_ON_ONCE(folio_test_writeback(folio));
> >>>
> >>> - if (likely(folio_test_dirty(folio)))
> >>> + /*
> >>> + * Mark the ordered range. It will be cleared upon I/O completion
> >>> + * in ext4_iomap_end_bio(). Any operation that extends i_disksize
> >>> + * (including append write end io past the zeroed boundary,
> >>> + * truncate up and append fallocate) must wait for this I/O to
> >>> + * complete before updating i_disksize.
> >>> + *
> >>> + * When multiple overlapping unaligned EOF writes are in flight, we
> >>> + * only need to track and wait for the first one. Subsequent writes
> >>> + * will zero the gap in memory and ensure that the zeroed data is
> >>> + * written out along with the valid data in the same block before
> >>> + * i_disksize is updated.
> >>> + */
> >>> + if (likely(folio_test_dirty(folio) &&
> >>> + READ_ONCE(ei->i_ordered_len) == 0)) {
> >>> + WRITE_ONCE(ei->i_ordered_lblk,
> >>> + from >> inode->i_blkbits);
> >>> + /*
> >>> + * Pairs with smp_rmb() in ext4_iomap_writeback_submit()
> >>> + * and ext4_iomap_wb_ordered_wait(). Ensure the updated
> >>> + * i_ordered_lblk is visible when i_ordered_len becomes
> >>> + * non-zero.
> >>> + */
> >>> + smp_store_release(&ei->i_ordered_len, 1);
> >>> do_submit = true;
> >>> + }
> >>> folio_unlock(folio);
> >>> folio_put(folio);
> >>>
> >>> /* Submit zeroed block. */
> >>> - if (do_submit)
> >>> - return filemap_fdatawrite_range(mapping, from, end - 1);
> >>> + if (do_submit) {
> >>> + ret = filemap_fdatawrite_range(mapping, from, end - 1);
> >>> + if (ret) {
> >>> + /*
> >>> + * Pairs with wait_event() in
> >>> + * ext4_iomap_wb_ordered_wait(). Ensure
> >>> + * i_ordered_len = 0 is visible before waking up
> >>> + * waiters.
> >>> + */
> >>> + smp_store_release(&ei->i_ordered_len, 0);
> >>> + wake_up_all(&ei->i_ordered_wq);
> >>> + return ret;
> >
> > Okay so even if the ordered IO fails we still let the i_disksize updates
> > go ahead?
>
> Yes when data_err=ignore, no when data_err=abort.
>
> > I think this is a deviation from the current behavior where we
> > abort the journal. If this is acceptable we should atleast add a comment
> > on why its okay.
> >
>
> I think this behavior is consistent with the current data=ordered mode.
> In the data_err=ignore mode, if an I/O write fails, ext4_end_io_end()
> does not abort the journal, so i_disksize is still updated normally.
> Conversely, in the data_err=abort mode, the journal is aborted, and
> since i_disksize is not updated, it cannot be updated afterwards. Am I
> missing something?
So I was thinking about various scenarios where
filemap_fdatawrite_range() might return an ERROR and yes it seems like
we do end up aborting the journal for almost all paths and ENOMEM is
already taken care of. So I think it should be okay.
>
> >>> + }
> >>> + }
> >>> return 0;
> >>> }
> >>>
> >>> @@ -4827,10 +4890,13 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >>> * data=ordered mode. We submit zeroed range directly here.
> >>> * Do not wait for I/O completion for performance.
> >>> *
> >>> - * TODO: Any operation that extends i_disksize (including
> >>> - * append write end io past the zeroed boundary, truncate up,
> >>> - * and append fallocate) must wait for the relevant I/O to
> >>> - * complete before updating i_disksize.
> >>> + * The end_io handler ext4_iomap_wb_ordered_wait() will wait
> >>> + * for I/O completion before updating i_disksize if the write
> >>> + * extends beyond the zeroed boundary.
> >>> + *
> >>> + * TODO: Any other operation that extends i_disksize
> >>> + * (including truncate up and append fallocate) must wait for
> >>> + * the relevant I/O to complete before updating i_disksize.
> >>> */
> >>> } else if (ext4_inode_buffered_iomap(inode)) {
> >>> err = ext4_iomap_submit_zero_block(inode, from, end);
> >>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> >>> index 3050c887329f..ad05ebb49bf6 100644
> >>> --- a/fs/ext4/page-io.c
> >>> +++ b/fs/ext4/page-io.c
> >>> @@ -613,6 +613,46 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,
> >>> return 0;
> >>> }
> >>>
> >>> +/*
> >>> + * If the old disk size is not block size aligned and the current
> >>> + * writeback range is entirely beyond the old EOF block, we should
> >>> + * wait for the zeroed data written in ext4_block_zero_eof() to be
> >>> + * written out, otherwise, it may expose stale data in that block.
> >>> + */
> >>> +static void ext4_iomap_wb_ordered_wait(struct inode *inode,
> >>> + loff_t pos, loff_t end)
> >>> +{
> >>> + struct ext4_inode_info *ei = EXT4_I(inode);
> >>> + unsigned int blocksize = i_blocksize(inode);
> >>> + loff_t disksize = READ_ONCE(ei->i_disksize);
> >>> + ext4_lblk_t order_lblk, order_len;
> >>> +
> >>> + /*
> >>> + * Waiting for ordered I/O is unnecessary when:
> >>> + * - The on-disk size is block-aligned (no stale data exists).
> >>> + * - The write start is within the block of the old EOF
> >>> + * (overwriting, or appending to a block that already contains
> >>> + * valid data).
> >>> + */
> >>> + if (!(disksize & (blocksize - 1)) ||
> >>> + pos < round_up(disksize, blocksize))
> >>> + return;
> >
> > Okay these checks are pretty confusing. I was intially thinking that
> > i_disksize's block would always be equal to i_ordered_lblk but seems
> > like that is not true because ext4_block_zero_eof() uses from=i_size.
>
> Yeah, this is the key point that I was a bit confused about as
> well.
>
> >
> > So we could have a sequence where
> >
> > 1. truncate 4k (i_disksize = i_size = 4k)
> > 2. write 8k,10k (i_disksize = 4k i_size = 10k, i_ordered_len = 0 (old isisze is block aligned))
> > 3. write 16k,18k (i_disksize = 4k i_size = 10k, i_ordered_len = 1, lblk=4)
> 18k lblk=2, (10k >> 12)
^^^ Yess correct, my bad.
> >
> > Here we issue ordered IO even though it' probably not needed. Now if
> > write 3 finishes first we see disksize as 4k so we don't wait for
> > ordered write. Which seems okay since we don't risk any stale data
> > exposure. However, this flow is pretty confuing.
>
> Indeed!
>
> >
> > Can't we somehow avoid having to issue/set ordered len/lblk in case it
> > is not really needed, like only issue it if i_disksize (and not i_size)
> > is unaligned. That can simplify some of our check and avoid extra IO
> > overhead.
> >
>
> I was also planning to explore optimizations on this point next.
> However, since the original logic in buffer_head already works this way,
> keeping the same logic in the iomap path will not introduce any
> additional side effects. To avoid unnecessary waiting, I simply added
> the disksize alignment check in ext4_iomap_wb_ordered_wait().
>
> Therefore, I do not plan to implement this optimization in this series.
> I can open a separate series later to address this optimization — perhaps
> by checking i_disksize in ext4_block_zero_eof() before issuing or adding
> ordered I/O, and the buffer_head path might also benefit from optimization.
> Meanwhile, to avoid confusion, I can add a TODO comment in this patch.
>
> What do you think?
Sure Zhang, such an optimization would make the code simpler but I'm
okay to do this in a different series.
Regards,
ojaswin
>
> Cheers,
> Yi.
>
^ permalink raw reply
* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-02 5:35 UTC (permalink / raw)
To: Zhang Yi
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
yangerkun, yukuai
In-Reply-To: <ad963861-4ae3-4753-9415-793bbac00e06@gmail.com>
On Tue, Jun 02, 2026 at 11:22:12AM +0800, Zhang Yi wrote:
> On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> > > On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > > > Hi, Ojaswin!
> > > >
> > > > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> > > > > On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> > > > > > From: Zhang Yi <yi.zhang@huawei.com>
> > > > > >
> > > > > > For append writes, wait for ordered I/O to complete before updating
> > > > > > i_disksize. This ensures that zeroed data is flushed to disk before the
> > > > > > metadata update, preventing stale data from being exposed during
> > > > > > unaligned post-EOF append writes.
> > > > > >
> > > > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > > > > ---
> > > > > > fs/ext4/ext4.h | 11 +++++++
> > > > > > fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> > > > > > fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> > > > > > fs/ext4/super.c | 23 ++++++++++----
> > > > > > 4 files changed, 161 insertions(+), 13 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > index 078feda47e36..9ce2128eea3e 100644
> > > > > > --- a/fs/ext4/ext4.h
> > > > > > +++ b/fs/ext4/ext4.h
> > > > > > @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> > > > > > #ifdef CONFIG_FS_ENCRYPTION
> > > > > > struct fscrypt_inode_info *i_crypt_info;
> > > > > > #endif
> > > > > > +
> > > > > > + /*
> > > > > > + * Track ordered zeroed data during post-EOF append writes, fallocate,
> > > > > > + * and truncate-up operations. These parameters are used only in the
> > > > > > + * iomap buffered I/O path.
> > > > > > + */
> > > > > > + ext4_lblk_t i_ordered_lblk;
> > > > > > + ext4_lblk_t i_ordered_len;
> > > > > > + wait_queue_head_t i_ordered_wq;
> > > > > > };
> > > > > > /*
> > > > > > @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> > > > > > __u64 len, __u64 *moved_len);
> > > > > > /* page-io.c */
> > > > > > +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
> > > > > > +
> > > > > > extern int __init ext4_init_pageio(void);
> > > > > > extern void ext4_exit_pageio(void);
> > > > > > extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > > index e013aeb03d7b..11fb369efeb1 100644
> > > > > > --- a/fs/ext4/inode.c
> > > > > > +++ b/fs/ext4/inode.c
> > > > > > @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > > {
> > > > > > struct iomap_ioend *ioend = wpc->wb_ctx;
> > > > > > struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> > > > > > + ext4_lblk_t start, end, order_lblk, order_len;
> > > > > > /*
> > > > > > * After I/O completion, a worker needs to be scheduled when:
> > > > > > @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> > > > > > test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> > > > > > ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > + /*
> > > > > > + * Mark the I/O as ordered. Ordered I/O requires separate endio
> > > > > > + * handling and must not be merged with regular I/O operations.
> > > > > > + */
> > > > > > + order_len = READ_ONCE(ei->i_ordered_len);
> > > > > > + if (order_len) {
> > > > > > + /*
> > > > > > + * Pair with smp_store_release() in ext4_block_zero_eof().
> > > > > > + * Ensure we see the updated i_ordered_lblk that was written
> > > > > > + * before the release store to i_ordered_len.
> > > > > > + */
> > > > > > + smp_rmb();
> > > > > > + order_lblk = READ_ONCE(ei->i_ordered_lblk);
> > > > > > + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> > > > > > + end = EXT4_B_TO_LBLK(ioend->io_inode,
> > > > > > + ioend->io_offset + ioend->io_size);
> > > > > > +
> > > > > > + if (start <= order_lblk && end >= order_lblk + order_len) {
> > > > >
> > > > > Hi Zhang,
> > > > >
> > > > > I guess this check is enough cause ordered_lblk and ordered_len will
> > > > > always be contained in a single block.
> > > >
> > > > Yeah.
> > > >
> > > > >
> > > > > > + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> > > > > > + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> > > > > > + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> > > > >
> > > > > FWIU, we are wanting the ordered IO to not be merged and submitted asap
> > > > > since we want to wake up the waiters. Is there any other reason?
> > > >
> > > > My original intention was to prevent the loss of the
> > > > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > > > completion, which could be caused by merging an ordered ioend with a
> > > > normal ioend. In patch 19, we need to determine the flag to update
> > > > i_disksize to the correct position.
> >
> > Ahh okay, we don't want the flag to be lost.
> >
> > > >
> > > > >
> > > > > Adding the boundary in ->writeback_submit() only affects
> > > > > iomap_ioend_can_merge() which happens after we have woken up the waiters
> > > > > and deferred the IO to the wq. We ideally want it affect
> > > > > iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> > > > > ->writeback_range().
> > > >
> > > > IIUC, merging into the same ioend during the submission stage doesn't
> > > > seem to cause any problems.
> >
> > Got it since the flag is set later. I was thinking we want to quickly
> > issue the ordered IO to wake up the waiters and not waste time trying to
> > merge it and hence we wanted to use that flag.
> >
> > > >
> > > > >
> > > > > Secondly, I don't think boundary is the right flag here. It ensures
> > > > > that everything before the ordered iomap gets submitted and the ordered
> > > > > iomap starts a new ioend. This can still keep getting merged with the
> > > > > newer ioends untils we decide to submit the IO, which can delay waking
> > > > > up the waiters. If we really want the "no merge" behavior, we'll have to
> > > > > do something like [1] (Check the 2 NOMERGE flag patches).
> > > >
> > > > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > > > still cannot prevent merging. I missed this, thank you for pointing this
> > > > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > > > to check the iomap_ioend->io_private. Something like:
> > > >
> > > > if (ioend->io_private || next->io_private)
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > ioend->io_private != next->io_private
> >
> > I guess if the purpose is to just not lose the flag, then boundary seems
> > to work for because we only lose the flag if ordered ioend backward
> > merges to a prev one. Flag is retained if we forward merge. Which
> > boundary seems to take care of.
> >
> Yes, IOMAP_IOEND_BOUNDARY is indeed worked currently as it prevents flag
> loss. However, from the perspective of the iomap infrastructure, I
> believe it is still necessary to add the ioend->io_private !=
> next->io_private check. Because ioends with different io_private values
> should not be merged, as this carries the risk of potentially losing
> io_private or even memory leaks. With this check in iomap, we would no
> longer need IOMAP_IOEND_BOUNDARY.
I agree that even outside this patchset it seems like a sane thing to
do.
>
> > However, if we want to avoid merges so we can quickly issue IO and wake
> > up the waiters then the above change looks good. Also, if this is the
> > reason we'd also want to have this during submission stage so the flag
> > setting will probs have to move to ->wirteback_range()
>
> Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
> the latency of sync file range. Suppose when we are syncing data beyond
> the ordered range, the background writeback process has already started
> committing and bundled the ordered range into a large ioend (up to
> IOEND_BATCH_SIZE folios), then this sync operation will indeed
> experience significant latency. However, for other non-sync scenarios,
> there should be little benefit.
Yes that's true.
>
> But I'm not sure if this is strictly necessary, because in the existing
> implementation, issuing ordered I/O via data=ordered mode works the same
> way — it also doesn't issue ordered I/O as soon as possible, and still
> has to wait when encountering concurrent background writeback. So I
> think we can keep the current implementation for now and see user
> feedback to decide whether further optimization is needed.
I agree!
Thanks,
ojaswin
>
> Cheers,
> Yi.
^ permalink raw reply
* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Zhang Yi @ 2026-06-02 3:36 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
yi.zhang, yangerkun, yukuai
In-Reply-To: <ah3MR1gYMm3nKe5f@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
On 6/2/2026 2:15 AM, Ojaswin Mujoo wrote:
> On Mon, Jun 01, 2026 at 08:22:09PM +0800, Zhang Yi wrote:
>> On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
>>> On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
>>>> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
>>>>> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> In the generic buffered_head I/O path, we rely on the data=order mode to
>>>>>> ensure that the zeroed EOF block data is written before updating
>>>>>> i_disksize, thus preventing stale data from being exposed.
>>>>>>
>>>>>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>>>>>> we issue the I/O immediately after performing the zero operation
>>>>>> (without synchronous waiting for performance). This can reduce the risk
>>>>>> of exposing stale data, but it does not guarantee that the zero data
>>>>>> will be flushed to disk before the metadata of i_disksize is updated.
>>>>>> The subsequent patches will wait for this I/O to complete before
>>>>>> updating i_disksize.
>>>>>>
>>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> I think we discussed that we may not need to do this [1] but I guess
>>>>> you've decided to make the tradeoff of issuing the IO to avoid having to
>>>>> wait for bg flush to complete the tail page zeroing
>>>>>
>>
>> I'm glad to hear that, thanks.
>>
>>>>
>>>> Yes. For truncate up and append fallocate, originally i_disksize would
>>>> be updated immediately, and the change would be persisted via the
>>>> journal within default 5 seconds. But now, if the tail page is not
>>>> committed immediately, the update to i_disksize will be delayed by about
>>>> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
>>>> not sure what impact this change might have — I just don't really want
>>>> to introduce it.
>>>
>>>> For normal append writes, the impact is minimal, unless we call
>>>> sync_range to sync the portion of data that extends beyond EOF.
>>>
>>> Hmm while trying to retain the behavior for falloc and truncate up,
>>> we end up changing it for append writes :) But anyways, I understand
>>> your reasoning and don't have any strong opinions against it. I'll let
>>> Jan pitch in since he had some comments around this.
>>>
>>>>
>>>> In addition, if the zeroed page is not issued here immediately, the
>>>> logic will become more complex because we need to more careful about the
>>>> order of write-back IOs to prevent deadlock issues caused by mutual
>>>> waiting.
>>>
>>> You mean an endio completion waiting for ordered IO to complete but
>>> ordered IO writeback is somehow waiting for this endio completion? Is that
>>> actually possible?
>>
>> Well, after thinking it over more carefully, it seems this is
>> impossible, I cannot think of a scenario that could trigger this kind of
>> issue. The generic writeback process always executes writeback in folio
>> index order, so there would be no situation where a later data I/O
>> depends on an earlier ordered I/O. Even if both kinds of IOs are placed
>> in the same ioend, there should be no problem. I was confused and
>> overthinking it.
>>
>> From this perspective, if we can accept that truncate up and fallocate
>> will have longer persistence time by default(I guess this is
>> acceptable), we can avoid writing back zeroed data immediately. To
>> achieve this, we only need to consider the case of sync file range. :-)
>
> Yeah, I think during writeback we will have to submit the ordered data
> if we are writing back data beyond the i_disksize.
>
> If this is straightforward enough to implement, I think this approach
> would be a safer choice cause we will avoid overheads due to small,
> random ordered IOs overworking the writeback layer.
Indeed, Let me investigate further to ensure there are no other side
effects.
Thanks,
Yi.
>
>>
>> Regards,
>> Yi.
>>>>
>>>>> However, I think one side effect might be many threads calling the
>>>>> writeback mechanism to issue zero IOs which might not scale well. I
>>>>> don't know if it'll be a huge problem though, I guess it's a sort of
>>>>> thing we will have to deal with in case we see it in real world
>>>>> workloads.
>>>>>
>>>>
>>>> I agree with you. However, I suspect that unless we run some specific
>>>> benchmark tests, it should be difficult to encounter a large number of
>>>> post-EOF append writes and truncate up operations in real-world usage
>>>> scenarios — and I haven't come across such scenarios yet. For
>>>> simplicity, I'd like to proceed with this implementation for now. If we
>>>> do run into actual problems later, we can consider not issuing I/O
>>>> directly here, but instead: 1) find the ordered block in
>>>> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
>>>> for normal background writeback as well — otherwise, there is a risk of
>>>> deadlock (mutual waiting). What do you think?
>>>
>>> Yes sounds good Yi, we can deal with performance tuning later.
>>>
>>> Regards,
>>> Ojaswin
>>>
>>>>
>>>> Cheers,
>>>> Yi.
>>>>
>>>>> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
>>>>>
>>>>>> ---
>>>>>> fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>>>>> 1 file changed, 55 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index 239d387ffaf2..e013aeb03d7b 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>>>>> zero_written);
>>>>>> }
>>>>>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>>>> + loff_t from, loff_t end)
>>>>>> +{
>>>>>> + struct address_space *mapping = inode->i_mapping;
>>>>>> + struct folio *folio;
>>>>>> + bool do_submit = false;
>>>>>> +
>>>>>> + folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>>>> + if (IS_ERR(folio))
>>>>>> + /* Already writeback and clear? */
>>>>>> + return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>>>>>> +
>>>>>> + folio_wait_writeback(folio);
>>>>>> + WARN_ON_ONCE(folio_test_writeback(folio));
>>>>>> +
>>>>>> + if (likely(folio_test_dirty(folio)))
>>>>>> + do_submit = true;
>>>>>> + folio_unlock(folio);
>>>>>> + folio_put(folio);
>>>>>> +
>>>>>> + /* Submit zeroed block. */
>>>>>> + if (do_submit)
>>>>>> + return filemap_fdatawrite_range(mapping, from, end - 1);
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>> * Zero out a mapping from file offset 'from' up to the end of the block
>>>>>> * which corresponds to 'from' or to the given 'end' inside this block.
>>>>>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>>> if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>>>>> return 0;
>>>>>> - if (length > blocksize - offset)
>>>>>> + if (length > blocksize - offset) {
>>>>>> length = blocksize - offset;
>>>>>> + end = from + length;
>>>>>> + }
>>>>>> err = ext4_block_zero_range(inode, from, length,
>>>>>> &did_zero, &zero_written);
>>>>>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>>> * TODO: In the iomap path, handle this by updating i_disksize to
>>>>>> * i_size after the zeroed data has been written back.
>>>>>> */
>>>>>> - if (ext4_should_order_data(inode) &&
>>>>>> - did_zero && zero_written && !IS_DAX(inode)) {
>>>>>> - handle_t *handle;
>>>>>> + if (did_zero && zero_written && !IS_DAX(inode)) {
>>>>>> + if (ext4_should_order_data(inode)) {
>>>>>> + handle_t *handle;
>>>>>> - handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>>>> - if (IS_ERR(handle))
>>>>>> - return PTR_ERR(handle);
>>>>>> + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>>>> + if (IS_ERR(handle))
>>>>>> + return PTR_ERR(handle);
>>>>>> - err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>>>>>> - ext4_journal_stop(handle);
>>>>>> - if (err)
>>>>>> - return err;
>>>>>> + err = ext4_jbd2_inode_add_write(handle, inode, from,
>>>>>> + length);
>>>>>> + ext4_journal_stop(handle);
>>>>>> + if (err)
>>>>>> + return err;
>>>>>> + /*
>>>>>> + * inodes using the iomap buffered I/O path do not use the
>>>>>> + * data=ordered mode. We submit zeroed range directly here.
>>>>>> + * Do not wait for I/O completion for performance.
>>>>>> + *
>>>>>> + * TODO: Any operation that extends i_disksize (including
>>>>>> + * append write end io past the zeroed boundary, truncate up,
>>>>>> + * and append fallocate) must wait for the relevant I/O to
>>>>>> + * complete before updating i_disksize.
>>>>>> + */
>>>>>> + } else if (ext4_inode_buffered_iomap(inode)) {
>>>>>> + err = ext4_iomap_submit_zero_block(inode, from, end);
>>>>>> + if (err)
>>>>>> + return err;
>>>>>> + }
>>>>>> }
>>>>>> return 0;
>>>>>> --
>>>>>> 2.52.0
>>>>>>
>>>>
>>
^ permalink raw reply
* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Zhang Yi @ 2026-06-02 3:22 UTC (permalink / raw)
To: Ojaswin Mujoo, Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yangerkun, yukuai
In-Reply-To: <ah3QlCt8V-3kVzW8@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
>> On 5/30/2026 3:22 PM, Zhang Yi wrote:
>>> Hi, Ojaswin!
>>>
>>> On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
>>>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> For append writes, wait for ordered I/O to complete before updating
>>>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>>>> metadata update, preventing stale data from being exposed during
>>>>> unaligned post-EOF append writes.
>>>>>
>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ---
>>>>> fs/ext4/ext4.h | 11 +++++++
>>>>> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>>>> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>>>> fs/ext4/super.c | 23 ++++++++++----
>>>>> 4 files changed, 161 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>> index 078feda47e36..9ce2128eea3e 100644
>>>>> --- a/fs/ext4/ext4.h
>>>>> +++ b/fs/ext4/ext4.h
>>>>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>>>> #ifdef CONFIG_FS_ENCRYPTION
>>>>> struct fscrypt_inode_info *i_crypt_info;
>>>>> #endif
>>>>> +
>>>>> + /*
>>>>> + * Track ordered zeroed data during post-EOF append writes, fallocate,
>>>>> + * and truncate-up operations. These parameters are used only in the
>>>>> + * iomap buffered I/O path.
>>>>> + */
>>>>> + ext4_lblk_t i_ordered_lblk;
>>>>> + ext4_lblk_t i_ordered_len;
>>>>> + wait_queue_head_t i_ordered_wq;
>>>>> };
>>>>>
>>>>> /*
>>>>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>>>> __u64 len, __u64 *moved_len);
>>>>>
>>>>> /* page-io.c */
>>>>> +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
>>>>> +
>>>>> extern int __init ext4_init_pageio(void);
>>>>> extern void ext4_exit_pageio(void);
>>>>> extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index e013aeb03d7b..11fb369efeb1 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>> {
>>>>> struct iomap_ioend *ioend = wpc->wb_ctx;
>>>>> struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>>>>> + ext4_lblk_t start, end, order_lblk, order_len;
>>>>>
>>>>> /*
>>>>> * After I/O completion, a worker needs to be scheduled when:
>>>>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>> test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>>>> ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>>
>>>>> + /*
>>>>> + * Mark the I/O as ordered. Ordered I/O requires separate endio
>>>>> + * handling and must not be merged with regular I/O operations.
>>>>> + */
>>>>> + order_len = READ_ONCE(ei->i_ordered_len);
>>>>> + if (order_len) {
>>>>> + /*
>>>>> + * Pair with smp_store_release() in ext4_block_zero_eof().
>>>>> + * Ensure we see the updated i_ordered_lblk that was written
>>>>> + * before the release store to i_ordered_len.
>>>>> + */
>>>>> + smp_rmb();
>>>>> + order_lblk = READ_ONCE(ei->i_ordered_lblk);
>>>>> + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>>>>> + end = EXT4_B_TO_LBLK(ioend->io_inode,
>>>>> + ioend->io_offset + ioend->io_size);
>>>>> +
>>>>> + if (start <= order_lblk && end >= order_lblk + order_len) {
>>>>
>>>> Hi Zhang,
>>>>
>>>> I guess this check is enough cause ordered_lblk and ordered_len will
>>>> always be contained in a single block.
>>>
>>> Yeah.
>>>
>>>>
>>>>> + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>> + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>>>>> + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
>>>>
>>>> FWIU, we are wanting the ordered IO to not be merged and submitted asap
>>>> since we want to wake up the waiters. Is there any other reason?
>>>
>>> My original intention was to prevent the loss of the
>>> EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
>>> completion, which could be caused by merging an ordered ioend with a
>>> normal ioend. In patch 19, we need to determine the flag to update
>>> i_disksize to the correct position.
>
> Ahh okay, we don't want the flag to be lost.
>
>>>
>>>>
>>>> Adding the boundary in ->writeback_submit() only affects
>>>> iomap_ioend_can_merge() which happens after we have woken up the waiters
>>>> and deferred the IO to the wq. We ideally want it affect
>>>> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
>>>> ->writeback_range().
>>>
>>> IIUC, merging into the same ioend during the submission stage doesn't
>>> seem to cause any problems.
>
> Got it since the flag is set later. I was thinking we want to quickly
> issue the ordered IO to wake up the waiters and not waste time trying to
> merge it and hence we wanted to use that flag.
>
>>>
>>>>
>>>> Secondly, I don't think boundary is the right flag here. It ensures
>>>> that everything before the ordered iomap gets submitted and the ordered
>>>> iomap starts a new ioend. This can still keep getting merged with the
>>>> newer ioends untils we decide to submit the IO, which can delay waking
>>>> up the waiters. If we really want the "no merge" behavior, we'll have to
>>>> do something like [1] (Check the 2 NOMERGE flag patches).
>>>
>>> Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
>>> still cannot prevent merging. I missed this, thank you for pointing this
>>> out. However, I think perhaps we should change iomap_ioend_can_merge()
>>> to check the iomap_ioend->io_private. Something like:
>>>
>>> if (ioend->io_private || next->io_private)
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> ioend->io_private != next->io_private
>
> I guess if the purpose is to just not lose the flag, then boundary seems
> to work for because we only lose the flag if ordered ioend backward
> merges to a prev one. Flag is retained if we forward merge. Which
> boundary seems to take care of.
>
Yes, IOMAP_IOEND_BOUNDARY is indeed worked currently as it prevents flag
loss. However, from the perspective of the iomap infrastructure, I
believe it is still necessary to add the ioend->io_private !=
next->io_private check. Because ioends with different io_private values
should not be merged, as this carries the risk of potentially losing
io_private or even memory leaks. With this check in iomap, we would no
longer need IOMAP_IOEND_BOUNDARY.
> However, if we want to avoid merges so we can quickly issue IO and wake
> up the waiters then the above change looks good. Also, if this is the
> reason we'd also want to have this during submission stage so the flag
> setting will probs have to move to ->wirteback_range()
Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
the latency of sync file range. Suppose when we are syncing data beyond
the ordered range, the background writeback process has already started
committing and bundled the ordered range into a large ioend (up to
IOEND_BATCH_SIZE folios), then this sync operation will indeed
experience significant latency. However, for other non-sync scenarios,
there should be little benefit.
But I'm not sure if this is strictly necessary, because in the existing
implementation, issuing ordered I/O via data=ordered mode works the same
way — it also doesn't issue ordered I/O as soon as possible, and still
has to wait when encountering concurrent background writeback. So I
think we can keep the current implementation for now and see user
feedback to decide whether further optimization is needed.
Cheers,
Yi.
^ permalink raw reply
* Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-01 18:33 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yizhang089, yangerkun,
yukuai
In-Reply-To: <81f4c0ec-1d80-4987-b31e-4e9ecd394c63@huaweicloud.com>
On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > Hi, Ojaswin!
> >
> > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@suse.cz>
> >>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>> ---
> >>> fs/ext4/ext4.h | 11 +++++++
> >>> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>> fs/ext4/super.c | 23 ++++++++++----
> >>> 4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>> index 078feda47e36..9ce2128eea3e 100644
> >>> --- a/fs/ext4/ext4.h
> >>> +++ b/fs/ext4/ext4.h
> >>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> >>> #ifdef CONFIG_FS_ENCRYPTION
> >>> struct fscrypt_inode_info *i_crypt_info;
> >>> #endif
> >>> +
> >>> + /*
> >>> + * Track ordered zeroed data during post-EOF append writes, fallocate,
> >>> + * and truncate-up operations. These parameters are used only in the
> >>> + * iomap buffered I/O path.
> >>> + */
> >>> + ext4_lblk_t i_ordered_lblk;
> >>> + ext4_lblk_t i_ordered_len;
> >>> + wait_queue_head_t i_ordered_wq;
> >>> };
> >>>
> >>> /*
> >>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> >>> __u64 len, __u64 *moved_len);
> >>>
> >>> /* page-io.c */
> >>> +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
> >>> +
> >>> extern int __init ext4_init_pageio(void);
> >>> extern void ext4_exit_pageio(void);
> >>> extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> >>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>> index e013aeb03d7b..11fb369efeb1 100644
> >>> --- a/fs/ext4/inode.c
> >>> +++ b/fs/ext4/inode.c
> >>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>> {
> >>> struct iomap_ioend *ioend = wpc->wb_ctx;
> >>> struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> >>> + ext4_lblk_t start, end, order_lblk, order_len;
> >>>
> >>> /*
> >>> * After I/O completion, a worker needs to be scheduled when:
> >>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>> test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> >>> ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>>
> >>> + /*
> >>> + * Mark the I/O as ordered. Ordered I/O requires separate endio
> >>> + * handling and must not be merged with regular I/O operations.
> >>> + */
> >>> + order_len = READ_ONCE(ei->i_ordered_len);
> >>> + if (order_len) {
> >>> + /*
> >>> + * Pair with smp_store_release() in ext4_block_zero_eof().
> >>> + * Ensure we see the updated i_ordered_lblk that was written
> >>> + * before the release store to i_ordered_len.
> >>> + */
> >>> + smp_rmb();
> >>> + order_lblk = READ_ONCE(ei->i_ordered_lblk);
> >>> + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> >>> + end = EXT4_B_TO_LBLK(ioend->io_inode,
> >>> + ioend->io_offset + ioend->io_size);
> >>> +
> >>> + if (start <= order_lblk && end >= order_lblk + order_len) {
> >>
> >> Hi Zhang,
> >>
> >> I guess this check is enough cause ordered_lblk and ordered_len will
> >> always be contained in a single block.
> >
> > Yeah.
> >
> >>
> >>> + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>> + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> >>> + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> >>
> >> FWIU, we are wanting the ordered IO to not be merged and submitted asap
> >> since we want to wake up the waiters. Is there any other reason?
> >
> > My original intention was to prevent the loss of the
> > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > completion, which could be caused by merging an ordered ioend with a
> > normal ioend. In patch 19, we need to determine the flag to update
> > i_disksize to the correct position.
Ahh okay, we don't want the flag to be lost.
> >
> >>
> >> Adding the boundary in ->writeback_submit() only affects
> >> iomap_ioend_can_merge() which happens after we have woken up the waiters
> >> and deferred the IO to the wq. We ideally want it affect
> >> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> >> ->writeback_range().
> >
> > IIUC, merging into the same ioend during the submission stage doesn't
> > seem to cause any problems.
Got it since the flag is set later. I was thinking we want to quickly
issue the ordered IO to wake up the waiters and not waste time trying to
merge it and hence we wanted to use that flag.
> >
> >>
> >> Secondly, I don't think boundary is the right flag here. It ensures
> >> that everything before the ordered iomap gets submitted and the ordered
> >> iomap starts a new ioend. This can still keep getting merged with the
> >> newer ioends untils we decide to submit the IO, which can delay waking
> >> up the waiters. If we really want the "no merge" behavior, we'll have to
> >> do something like [1] (Check the 2 NOMERGE flag patches).
> >
> > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > still cannot prevent merging. I missed this, thank you for pointing this
> > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > to check the iomap_ioend->io_private. Something like:
> >
> > if (ioend->io_private || next->io_private)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ioend->io_private != next->io_private
I guess if the purpose is to just not lose the flag, then boundary seems
to work for because we only lose the flag if ordered ioend backward
merges to a prev one. Flag is retained if we forward merge. Which
boundary seems to take care of.
However, if we want to avoid merges so we can quickly issue IO and wake
up the waiters then the above change looks good. Also, if this is the
reason we'd also want to have this during submission stage so the flag
setting will probs have to move to ->wirteback_range()
Regards,
Ojaswin
>
>
> > return false;
> >
> > What do you think?
> >
> > Thanks,
> > Yi.
>
^ permalink raw reply
* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-01 18:15 UTC (permalink / raw)
To: Zhang Yi
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
yi.zhang, yangerkun, yukuai
In-Reply-To: <cc458b0f-4637-4313-a2cd-45db4cdaa250@gmail.com>
On Mon, Jun 01, 2026 at 08:22:09PM +0800, Zhang Yi wrote:
> On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
> > On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
> > > On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
> > > > On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> > > > > From: Zhang Yi <yi.zhang@huawei.com>
> > > > >
> > > > > In the generic buffered_head I/O path, we rely on the data=order mode to
> > > > > ensure that the zeroed EOF block data is written before updating
> > > > > i_disksize, thus preventing stale data from being exposed.
> > > > >
> > > > > However, the iomap buffered I/O path cannot use this mechanism. Instead,
> > > > > we issue the I/O immediately after performing the zero operation
> > > > > (without synchronous waiting for performance). This can reduce the risk
> > > > > of exposing stale data, but it does not guarantee that the zero data
> > > > > will be flushed to disk before the metadata of i_disksize is updated.
> > > > > The subsequent patches will wait for this I/O to complete before
> > > > > updating i_disksize.
> > > > >
> > > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > > >
> > > > I think we discussed that we may not need to do this [1] but I guess
> > > > you've decided to make the tradeoff of issuing the IO to avoid having to
> > > > wait for bg flush to complete the tail page zeroing
> > > >
>
> I'm glad to hear that, thanks.
>
> > >
> > > Yes. For truncate up and append fallocate, originally i_disksize would
> > > be updated immediately, and the change would be persisted via the
> > > journal within default 5 seconds. But now, if the tail page is not
> > > committed immediately, the update to i_disksize will be delayed by about
> > > 30 seconds, and persistence will be postponed to around 35 seconds. I'm
> > > not sure what impact this change might have — I just don't really want
> > > to introduce it.
> >
> > > For normal append writes, the impact is minimal, unless we call
> > > sync_range to sync the portion of data that extends beyond EOF.
> >
> > Hmm while trying to retain the behavior for falloc and truncate up,
> > we end up changing it for append writes :) But anyways, I understand
> > your reasoning and don't have any strong opinions against it. I'll let
> > Jan pitch in since he had some comments around this.
> >
> > >
> > > In addition, if the zeroed page is not issued here immediately, the
> > > logic will become more complex because we need to more careful about the
> > > order of write-back IOs to prevent deadlock issues caused by mutual
> > > waiting.
> >
> > You mean an endio completion waiting for ordered IO to complete but
> > ordered IO writeback is somehow waiting for this endio completion? Is that
> > actually possible?
>
> Well, after thinking it over more carefully, it seems this is
> impossible, I cannot think of a scenario that could trigger this kind of
> issue. The generic writeback process always executes writeback in folio
> index order, so there would be no situation where a later data I/O
> depends on an earlier ordered I/O. Even if both kinds of IOs are placed
> in the same ioend, there should be no problem. I was confused and
> overthinking it.
>
> From this perspective, if we can accept that truncate up and fallocate
> will have longer persistence time by default(I guess this is
> acceptable), we can avoid writing back zeroed data immediately. To
> achieve this, we only need to consider the case of sync file range. :-)
Yeah, I think during writeback we will have to submit the ordered data
if we are writing back data beyond the i_disksize.
If this is straightforward enough to implement, I think this approach
would be a safer choice cause we will avoid overheads due to small,
random ordered IOs overworking the writeback layer.
>
> Regards,
> Yi.
> > >
> > > > However, I think one side effect might be many threads calling the
> > > > writeback mechanism to issue zero IOs which might not scale well. I
> > > > don't know if it'll be a huge problem though, I guess it's a sort of
> > > > thing we will have to deal with in case we see it in real world
> > > > workloads.
> > > >
> > >
> > > I agree with you. However, I suspect that unless we run some specific
> > > benchmark tests, it should be difficult to encounter a large number of
> > > post-EOF append writes and truncate up operations in real-world usage
> > > scenarios — and I haven't come across such scenarios yet. For
> > > simplicity, I'd like to proceed with this implementation for now. If we
> > > do run into actual problems later, we can consider not issuing I/O
> > > directly here, but instead: 1) find the ordered block in
> > > ext4_sync_file() and perform writeback; 2) ensure writeback ordering
> > > for normal background writeback as well — otherwise, there is a risk of
> > > deadlock (mutual waiting). What do you think?
> >
> > Yes sounds good Yi, we can deal with performance tuning later.
> >
> > Regards,
> > Ojaswin
> >
> > >
> > > Cheers,
> > > Yi.
> > >
> > > > [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> > > >
> > > > > ---
> > > > > fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
> > > > > 1 file changed, 55 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 239d387ffaf2..e013aeb03d7b 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
> > > > > zero_written);
> > > > > }
> > > > > +static int ext4_iomap_submit_zero_block(struct inode *inode,
> > > > > + loff_t from, loff_t end)
> > > > > +{
> > > > > + struct address_space *mapping = inode->i_mapping;
> > > > > + struct folio *folio;
> > > > > + bool do_submit = false;
> > > > > +
> > > > > + folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> > > > > + if (IS_ERR(folio))
> > > > > + /* Already writeback and clear? */
> > > > > + return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> > > > > +
> > > > > + folio_wait_writeback(folio);
> > > > > + WARN_ON_ONCE(folio_test_writeback(folio));
> > > > > +
> > > > > + if (likely(folio_test_dirty(folio)))
> > > > > + do_submit = true;
> > > > > + folio_unlock(folio);
> > > > > + folio_put(folio);
> > > > > +
> > > > > + /* Submit zeroed block. */
> > > > > + if (do_submit)
> > > > > + return filemap_fdatawrite_range(mapping, from, end - 1);
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Zero out a mapping from file offset 'from' up to the end of the block
> > > > > * which corresponds to 'from' or to the given 'end' inside this block.
> > > > > @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> > > > > if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> > > > > return 0;
> > > > > - if (length > blocksize - offset)
> > > > > + if (length > blocksize - offset) {
> > > > > length = blocksize - offset;
> > > > > + end = from + length;
> > > > > + }
> > > > > err = ext4_block_zero_range(inode, from, length,
> > > > > &did_zero, &zero_written);
> > > > > @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> > > > > * TODO: In the iomap path, handle this by updating i_disksize to
> > > > > * i_size after the zeroed data has been written back.
> > > > > */
> > > > > - if (ext4_should_order_data(inode) &&
> > > > > - did_zero && zero_written && !IS_DAX(inode)) {
> > > > > - handle_t *handle;
> > > > > + if (did_zero && zero_written && !IS_DAX(inode)) {
> > > > > + if (ext4_should_order_data(inode)) {
> > > > > + handle_t *handle;
> > > > > - handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> > > > > - if (IS_ERR(handle))
> > > > > - return PTR_ERR(handle);
> > > > > + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> > > > > + if (IS_ERR(handle))
> > > > > + return PTR_ERR(handle);
> > > > > - err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> > > > > - ext4_journal_stop(handle);
> > > > > - if (err)
> > > > > - return err;
> > > > > + err = ext4_jbd2_inode_add_write(handle, inode, from,
> > > > > + length);
> > > > > + ext4_journal_stop(handle);
> > > > > + if (err)
> > > > > + return err;
> > > > > + /*
> > > > > + * inodes using the iomap buffered I/O path do not use the
> > > > > + * data=ordered mode. We submit zeroed range directly here.
> > > > > + * Do not wait for I/O completion for performance.
> > > > > + *
> > > > > + * TODO: Any operation that extends i_disksize (including
> > > > > + * append write end io past the zeroed boundary, truncate up,
> > > > > + * and append fallocate) must wait for the relevant I/O to
> > > > > + * complete before updating i_disksize.
> > > > > + */
> > > > > + } else if (ext4_inode_buffered_iomap(inode)) {
> > > > > + err = ext4_iomap_submit_zero_block(inode, from, end);
> > > > > + if (err)
> > > > > + return err;
> > > > > + }
> > > > > }
> > > > > return 0;
> > > > > --
> > > > > 2.52.0
> > > > >
> > >
>
^ permalink raw reply
* Re: [PATCH] ext2: fix ignored return value of generic_write_sync()
From: Jan Kara @ 2026-06-01 14:31 UTC (permalink / raw)
To: Danila Chernetsov; +Cc: Jan Kara, linux-ext4, linux-kernel, lvc-project
In-Reply-To: <20260530122311.136803-1-listdansp@mail.ru>
On Sat 30-05-26 12:23:11, Danila Chernetsov wrote:
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Fix ext2_dio_write_iter() to propagate the error returned by
> generic_write_sync() instead of silently discarding it, which could
> cause write(2) to return success to userspace on O_SYNC/O_DSYNC files
> even when the sync failed.
>
> The correct pattern, already used in ext2_dax_write_iter() in the same
> file and in ext4, xfs, f2fs among others, is:
> if (ret > 0)
> ret = generic_write_sync(iocb, ret);
>
> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>
> Fixes: fb5de4358e1a ("ext2: Move direct-io to use iomap")
> Signed-off-by: Danila Chernetsov <listdansp@mail.ru>
Thanks for the patch. I think it makes sense although I also think we
should also reflect the return value of filemap_write_and_wait_range()
(when it fails, return the error from ext2_dio_write_iter()) to be
consistent with handling errors during flushing of direct IO fallback. I'll
update the patch on commit.
Honza
> ---
> fs/ext2/file.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index d9b1eb34694a..855a62e96c38 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -272,7 +272,7 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> pos >> PAGE_SHIFT,
> endbyte >> PAGE_SHIFT);
> if (ret > 0)
> - generic_write_sync(iocb, ret);
> + ret = generic_write_sync(iocb, ret);
> }
>
> out_unlock:
> --
> 2.25.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Zhang Yi @ 2026-06-01 12:22 UTC (permalink / raw)
To: Ojaswin Mujoo, Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yangerkun,
yukuai
In-Reply-To: <ah1MG1b89ELKM-Fl@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
>> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> In the generic buffered_head I/O path, we rely on the data=order mode to
>>>> ensure that the zeroed EOF block data is written before updating
>>>> i_disksize, thus preventing stale data from being exposed.
>>>>
>>>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>>>> we issue the I/O immediately after performing the zero operation
>>>> (without synchronous waiting for performance). This can reduce the risk
>>>> of exposing stale data, but it does not guarantee that the zero data
>>>> will be flushed to disk before the metadata of i_disksize is updated.
>>>> The subsequent patches will wait for this I/O to complete before
>>>> updating i_disksize.
>>>>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> I think we discussed that we may not need to do this [1] but I guess
>>> you've decided to make the tradeoff of issuing the IO to avoid having to
>>> wait for bg flush to complete the tail page zeroing
>>>
I'm glad to hear that, thanks.
>>
>> Yes. For truncate up and append fallocate, originally i_disksize would
>> be updated immediately, and the change would be persisted via the
>> journal within default 5 seconds. But now, if the tail page is not
>> committed immediately, the update to i_disksize will be delayed by about
>> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
>> not sure what impact this change might have — I just don't really want
>> to introduce it.
>
>> For normal append writes, the impact is minimal, unless we call
>> sync_range to sync the portion of data that extends beyond EOF.
>
> Hmm while trying to retain the behavior for falloc and truncate up,
> we end up changing it for append writes :) But anyways, I understand
> your reasoning and don't have any strong opinions against it. I'll let
> Jan pitch in since he had some comments around this.
>
>>
>> In addition, if the zeroed page is not issued here immediately, the
>> logic will become more complex because we need to more careful about the
>> order of write-back IOs to prevent deadlock issues caused by mutual
>> waiting.
>
> You mean an endio completion waiting for ordered IO to complete but
> ordered IO writeback is somehow waiting for this endio completion? Is that
> actually possible?
Well, after thinking it over more carefully, it seems this is
impossible, I cannot think of a scenario that could trigger this kind of
issue. The generic writeback process always executes writeback in folio
index order, so there would be no situation where a later data I/O
depends on an earlier ordered I/O. Even if both kinds of IOs are placed
in the same ioend, there should be no problem. I was confused and
overthinking it.
From this perspective, if we can accept that truncate up and fallocate
will have longer persistence time by default(I guess this is
acceptable), we can avoid writing back zeroed data immediately. To
achieve this, we only need to consider the case of sync file range. :-)
Regards,
Yi.
>>
>>> However, I think one side effect might be many threads calling the
>>> writeback mechanism to issue zero IOs which might not scale well. I
>>> don't know if it'll be a huge problem though, I guess it's a sort of
>>> thing we will have to deal with in case we see it in real world
>>> workloads.
>>>
>>
>> I agree with you. However, I suspect that unless we run some specific
>> benchmark tests, it should be difficult to encounter a large number of
>> post-EOF append writes and truncate up operations in real-world usage
>> scenarios — and I haven't come across such scenarios yet. For
>> simplicity, I'd like to proceed with this implementation for now. If we
>> do run into actual problems later, we can consider not issuing I/O
>> directly here, but instead: 1) find the ordered block in
>> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
>> for normal background writeback as well — otherwise, there is a risk of
>> deadlock (mutual waiting). What do you think?
>
> Yes sounds good Yi, we can deal with performance tuning later.
>
> Regards,
> Ojaswin
>
>>
>> Cheers,
>> Yi.
>>
>>> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
>>>
>>>> ---
>>>> fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>>> 1 file changed, 55 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 239d387ffaf2..e013aeb03d7b 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>>> zero_written);
>>>> }
>>>>
>>>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>> + loff_t from, loff_t end)
>>>> +{
>>>> + struct address_space *mapping = inode->i_mapping;
>>>> + struct folio *folio;
>>>> + bool do_submit = false;
>>>> +
>>>> + folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>> + if (IS_ERR(folio))
>>>> + /* Already writeback and clear? */
>>>> + return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>>>> +
>>>> + folio_wait_writeback(folio);
>>>> + WARN_ON_ONCE(folio_test_writeback(folio));
>>>> +
>>>> + if (likely(folio_test_dirty(folio)))
>>>> + do_submit = true;
>>>> + folio_unlock(folio);
>>>> + folio_put(folio);
>>>> +
>>>> + /* Submit zeroed block. */
>>>> + if (do_submit)
>>>> + return filemap_fdatawrite_range(mapping, from, end - 1);
>>>> + return 0;
>>>> +}
>>>> +
>>>> /*
>>>> * Zero out a mapping from file offset 'from' up to the end of the block
>>>> * which corresponds to 'from' or to the given 'end' inside this block.
>>>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>> if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>>> return 0;
>>>>
>>>> - if (length > blocksize - offset)
>>>> + if (length > blocksize - offset) {
>>>> length = blocksize - offset;
>>>> + end = from + length;
>>>> + }
>>>>
>>>> err = ext4_block_zero_range(inode, from, length,
>>>> &did_zero, &zero_written);
>>>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>> * TODO: In the iomap path, handle this by updating i_disksize to
>>>> * i_size after the zeroed data has been written back.
>>>> */
>>>> - if (ext4_should_order_data(inode) &&
>>>> - did_zero && zero_written && !IS_DAX(inode)) {
>>>> - handle_t *handle;
>>>> + if (did_zero && zero_written && !IS_DAX(inode)) {
>>>> + if (ext4_should_order_data(inode)) {
>>>> + handle_t *handle;
>>>>
>>>> - handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>> - if (IS_ERR(handle))
>>>> - return PTR_ERR(handle);
>>>> + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>> + if (IS_ERR(handle))
>>>> + return PTR_ERR(handle);
>>>>
>>>> - err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>>>> - ext4_journal_stop(handle);
>>>> - if (err)
>>>> - return err;
>>>> + err = ext4_jbd2_inode_add_write(handle, inode, from,
>>>> + length);
>>>> + ext4_journal_stop(handle);
>>>> + if (err)
>>>> + return err;
>>>> + /*
>>>> + * inodes using the iomap buffered I/O path do not use the
>>>> + * data=ordered mode. We submit zeroed range directly here.
>>>> + * Do not wait for I/O completion for performance.
>>>> + *
>>>> + * TODO: Any operation that extends i_disksize (including
>>>> + * append write end io past the zeroed boundary, truncate up,
>>>> + * and append fallocate) must wait for the relevant I/O to
>>>> + * complete before updating i_disksize.
>>>> + */
>>>> + } else if (ext4_inode_buffered_iomap(inode)) {
>>>> + err = ext4_iomap_submit_zero_block(inode, from, end);
>>>> + if (err)
>>>> + return err;
>>>> + }
>>>> }
>>>>
>>>> return 0;
>>>> --
>>>> 2.52.0
>>>>
>>
^ permalink raw reply
* Re: [PATCH v2] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
From: Baokun Li @ 2026-06-01 11:37 UTC (permalink / raw)
To: Junrui Luo
Cc: Theodore Ts'o, Jan Kara, Harshad Shirwadkar, linux-ext4,
linux-kernel, Yuhao Jiang, stable
In-Reply-To: <SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com>
On 2026/5/13 17:28, Junrui Luo wrote:
> jbd2_journal_initialize_fast_commit() validates journal capacity by
> checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
> Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
> j_last the subtraction wraps to a large value, bypassing the bounds
> check.
>
> The resulting underflow corrupts j_last, j_fc_first, and j_free,
> leading to journal abort.
>
> Fix by checking num_fc_blks against j_last before the subtraction,
> returning -EFSCORRUPTED.
>
> Fixes: 6866d7b3f2bb ("ext4 / jbd2: add fast commit initialization")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
The Fixes tag is not quite accurate, it should be:
Fixes: e029c5f27987 ("ext4: make num of fast commit blocks configurable")
Otherwise looks good to me:
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
(P.S. Resend due to malformed email. Sorry for the noise.)
> ---
> Changes in v2:
> - Return -EFSCORRUPTED instead of -ENOSPC
> - Link to v1: https://lore.kernel.org/all/SYBPR01MB78813DD23B28BD49B1AA1123AF392@SYBPR01MB7881.ausprd01.prod.outlook.com/
> ---
> fs/jbd2/journal.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index cb2c529a8f1b..0bb97459fbf0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -2263,6 +2263,8 @@ jbd2_journal_initialize_fast_commit(journal_t *journal)
> unsigned long long num_fc_blks;
>
> num_fc_blks = jbd2_journal_get_num_fc_blks(sb);
> + if (num_fc_blks > journal->j_last)
> + return -EFSCORRUPTED;
> if (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS)
> return -ENOSPC;
>
>
> ---
> base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
> change-id: 20260513-fixes-e6dcda3273d4
>
> Best regards,
^ permalink raw reply
* Re: [PATCH v2] jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
From: Baokun Li @ 2026-06-01 10:06 UTC (permalink / raw)
To: Junrui Luo
Cc: Theodore Ts'o, Jan Kara, Harshad Shirwadkar, linux-ext4,
linux-kernel, Yuhao Jiang, stable
In-Reply-To: <SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com>
On 2026/5/13 17:28, Junrui Luo wrote:
> jbd2_journal_initialize_fast_commit() validates journal capacity by
> checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
> Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
> j_last the subtraction wraps to a large value, bypassing the bounds
> check.
>
> The resulting underflow corrupts j_last, j_fc_first, and j_free,
> leading to journal abort.
>
> Fix by checking num_fc_blks against j_last before the subtraction,
> returning -EFSCORRUPTED.
>
> Fixes: 6866d7b3f2bb ("ext4 / jbd2: add fast commit initialization")
> Reported-by: Yuhao Jiang <danisjiang@gmail.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
The Fixes tag is not quite accurate, it should be:
Fixes: e029c5f279872 ("ext4: make num of fast commit blocks configurable")
Otherwise looks good to me:
Reviewed-by: Baokun Li libaokun@linux.alibaba.com
> ---
> Changes in v2:
> - Return -EFSCORRUPTED instead of -ENOSPC
> - Link to v1: https://lore.kernel.org/all/SYBPR01MB78813DD23B28BD49B1AA1123AF392@SYBPR01MB7881.ausprd01.prod.outlook.com/
> ---
> fs/jbd2/journal.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index cb2c529a8f1b..0bb97459fbf0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -2263,6 +2263,8 @@ jbd2_journal_initialize_fast_commit(journal_t *journal)
> unsigned long long num_fc_blks;
>
> num_fc_blks = jbd2_journal_get_num_fc_blks(sb);
> + if (num_fc_blks > journal->j_last)
> + return -EFSCORRUPTED;
> if (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS)
> return -ENOSPC;
>
>
> ---
> base-commit: 7aaa8047eafd0bd628065b15757d9b48c5f9c07d
> change-id: 20260513-fixes-e6dcda3273d4
>
> Best regards,
^ permalink raw reply
* Re: [BUG] ext4: delayed-free buddy load error reaches BUG_ON in ext4_process_freed_data
From: Jan Kara @ 2026-06-01 9:52 UTC (permalink / raw)
To: Yifei Chu
Cc: linux-ext4, Theodore Ts'o, Andreas Dilger, Baokun Li,
Jan Kara, Ojaswin Mujoo, Ritesh Harjani (IBM), Zhang Yi,
linux-kernel
In-Reply-To: <CAPJnbg+QS5z8D9kvjSukZDPnLLGv4P9jyUC3nx4enac75haecQ@mail.gmail.com>
Hello!
On Sun 24-05-26 11:14:43, Yifei Chu wrote:
> Short version: I am reporting an ext4 delayed-free error-path bug found
> with targeted fault injection. The injected -EIO is in
> ext4_mb_load_buddy()’s normal error-return domain, and the injection is
> placed at the helper return boundary for the delayed-free caller. With that
> rare lower-layer failure made deterministic, ext4_free_data_in_buddy()
> reaches BUG_ON(err != 0) and crashes the kernel.
So how did you inject the EIO error? ext4_mb_load_buddy() returns EIO only
if the folio is not uptodate. Did you manage to hit non-uptodate folio when
ext4_mb_load_buddy() is called from ext4_free_data_in_buddy()? Checking the
code it actually might be possible as I don't see where the buddy folio
would be actually pinned in memory these days but I'd like to to verify...
Honza
>
> Tested kernel:
>
> v7.1-rc4-640-g79bd2dded182
> 79bd2dded182b1d458b18e62684b7f82ffc682e5
> x86_64 QEMU, KASAN config
>
> The relevant code shape in fs/ext4/mballoc.c is:
>
> err = ext4_mb_load_buddy(sb, entry->efd_group, &e4b);
> /* we expect to find existing buddy because it’s pinned */
> BUG_ON(err != 0);
>
> The point of the injection is not to corrupt ext4 state. It only makes a
> plausible buddy/bitmap load failure deterministic at this caller, so the
> caller’s error handling can be tested. ext4_mb_load_buddy() already has
> normal negative error returns from metadata loading paths.
>
> Reproducer shape:
>
> 1. Mount a fresh ext4 filesystem.
> 2. Create and fsync a 256 KiB file.
> 3. Unlink the file.
> 4. Call sync() to force delayed-free processing.
> 5. The instrumentation forces ext4_mb_load_buddy() to return -EIO at the
> delayed-free callsite.
>
> Two fresh image runs reproduced the same crash:
>
> AGENT_INIT: unlink ret=0 errno=0 (Success)
> AGENT_INIT: calling sync to force delayed free processing
> EXT4-fs: AGENT_EXT4_FREE_DATA_BUDDY_BUGON: forcing ext4_mb_load_buddy EIO
> before BUG_ON
> kernel BUG at fs/ext4/mballoc.c:3990!
> Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
> RIP: 0010:ext4_process_freed_data+0x1fe/0x510
>
> I did a local duplicate sweep and found related older
> ext4_mb_load_buddy()/mballoc fixes, but I did not find a direct
> current-upstream fix for this delayed-free BUG_ON(err != 0) path.
>
> Expected behavior:
>
> A metadata load failure during delayed-free processing should go through
> ext4 error handling / transaction abort / filesystem error propagation,
> rather than treating the error as an impossible invariant and BUGing the
> kernel.
>
> The attached tarball includes README.md, repro_init.c,
> instrumentation.patch, and both full serial logs.
>
> Thanks,
> Chuyifei
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH v4] ext2: Remove deprecated DAX support
From: Jan Kara @ 2026-06-01 9:29 UTC (permalink / raw)
To: Ashwin Gundarapu
Cc: kernel test robot, jack, oe-kbuild-all, linux-ext4, linux-kernel
In-Reply-To: <19e6db0f597.25d55d43200914.4102954258751814923@zohomail.in>
On Thu 28-05-26 13:56:14, Ashwin Gundarapu wrote:
> This indentation warning is from pre-existing code (originally written in
> 2010) that my patch did not touch. I prefer to keep this patch focused
> solely on removing deprecated DAX support. A separate cleanup patch can
> be submitted later if desired.
Actually no, your patch indeed introduced several indentation problems (by
removing some conditions but not reindenting internal blocks). Also there
were some spurious empty lines added. No need to resend, I've fixed that up
on commit. Thanks for the patch.
Honza
> Regards,
> Ashwin Gundarapu
>
> From: kernel test robot <lkp@intel.com>
> To: "Ashwin Gundarapu"<linuxuser509@zohomail.in>, "jack"<jack@suse.com>
> Cc: <oe-kbuild-all@lists.linux.dev>, "linux-ext4"<linux-ext4@vger.kernel.org>, "linux-kernel"<linux-kernel@vger.kernel.org>
> Date: Thu, 28 May 2026 09:46:32 +0530
> Subject: Re: [PATCH v4] ext2: Remove deprecated DAX support
>
> > Hi Ashwin,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on jack-fs/for_next]
> > [also build test WARNING on linus/master v7.1-rc5 next-20260527]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Ashwin-Gundarapu/ext2-Remove-deprecated-DAX-support/20260524-233631
> > base: https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_next
> > patch link: https://lore.kernel.org/r/19e5aa07c9b.3a2e576d130187.5289857983023045470%40zohomail.in
> > patch subject: [PATCH v4] ext2: Remove deprecated DAX support
> > config: arm-randconfig-r071-20260528 (https://download.01.org/0day-ci/archive/20260528/202605281203.e91xvDyr-lkp@intel.com/config)
> > compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> > smatch: v0.5.0-9185-gbcc58b9c
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202605281203.e91xvDyr-lkp@intel.com/
> >
> > smatch warnings:
> > fs/ext2/inode.c:1251 ext2_setsize() warn: inconsistent indenting
> >
> > vim +1251 fs/ext2/inode.c
> >
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1236
> > 2c27c65ed0696f0 Christoph Hellwig 2010-06-04 1237 static int ext2_setsize(struct inode *inode, loff_t newsize)
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1238 {
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1239 int error;
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1240
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1241 if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1242 S_ISLNK(inode->i_mode)))
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1243 return -EINVAL;
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1244 if (ext2_inode_is_fast_symlink(inode))
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1245 return -EINVAL;
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1246 if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1247 return -EPERM;
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1248
> > 562c72aa57c36b1 Christoph Hellwig 2011-06-24 1249 inode_dio_wait(inode);
> > 562c72aa57c36b1 Christoph Hellwig 2011-06-24 1250
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 @1251 error = block_truncate_page(inode->i_mapping,
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1252 newsize, ext2_get_block);
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1253 if (error)
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1254 return error;
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1255
> > 70f3bad8c3154ba Jan Kara 2021-04-12 1256 filemap_invalidate_lock(inode->i_mapping);
> > 2c27c65ed0696f0 Christoph Hellwig 2010-06-04 1257 truncate_setsize(inode, newsize);
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1258 __ext2_truncate_blocks(inode, newsize);
> > 70f3bad8c3154ba Jan Kara 2021-04-12 1259 filemap_invalidate_unlock(inode->i_mapping);
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1260
> > 5cdc59fce617a2e Jeff Layton 2023-10-04 1261 inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
> > ^1da177e4c3f415 Linus Torvalds 2005-04-16 1262 if (inode_needs_sync(inode)) {
> > b0439bbc29f0201 Jan Kara 2026-03-26 1263 mmb_sync(&EXT2_I(inode)->i_metadata_bhs);
> > c37650161a53c01 Christoph Hellwig 2010-10-06 1264 sync_inode_metadata(inode, 1);
> > ^1da177e4c3f415 Linus Torvalds 2005-04-16 1265 } else {
> > ^1da177e4c3f415 Linus Torvalds 2005-04-16 1266 mark_inode_dirty(inode);
> > ^1da177e4c3f415 Linus Torvalds 2005-04-16 1267 }
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1268
> > 737f2e93b9724a3 Nicholas Piggin 2010-05-27 1269 return 0;
> > ^1da177e4c3f415 Linus Torvalds 2005-04-16 1270 }
> > ^1da177e4c3f415 Linus Torvalds 2005-04-16 1271
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
> >
> >
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
From: Ojaswin Mujoo @ 2026-06-01 9:08 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-ext4, linux-fsdevel, linux-kernel, tytso, adilger.kernel,
libaokun, jack, ritesh.list, djwong, hch, yi.zhang, yizhang089,
yangerkun, yukuai
In-Reply-To: <7c24dc47-0c3e-4ee9-b113-f9ca9f3a6ebf@huaweicloud.com>
On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> In the generic buffered_head I/O path, we rely on the data=order mode to
> >> ensure that the zeroed EOF block data is written before updating
> >> i_disksize, thus preventing stale data from being exposed.
> >>
> >> However, the iomap buffered I/O path cannot use this mechanism. Instead,
> >> we issue the I/O immediately after performing the zero operation
> >> (without synchronous waiting for performance). This can reduce the risk
> >> of exposing stale data, but it does not guarantee that the zero data
> >> will be flushed to disk before the metadata of i_disksize is updated.
> >> The subsequent patches will wait for this I/O to complete before
> >> updating i_disksize.
> >>
> >> Suggested-by: Jan Kara <jack@suse.cz>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >
> > I think we discussed that we may not need to do this [1] but I guess
> > you've decided to make the tradeoff of issuing the IO to avoid having to
> > wait for bg flush to complete the tail page zeroing
> >
>
> Yes. For truncate up and append fallocate, originally i_disksize would
> be updated immediately, and the change would be persisted via the
> journal within default 5 seconds. But now, if the tail page is not
> committed immediately, the update to i_disksize will be delayed by about
> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
> not sure what impact this change might have — I just don't really want
> to introduce it.
> For normal append writes, the impact is minimal, unless we call
> sync_range to sync the portion of data that extends beyond EOF.
Hmm while trying to retain the behavior for falloc and truncate up,
we end up changing it for append writes :) But anyways, I understand
your reasoning and don't have any strong opinions against it. I'll let
Jan pitch in since he had some comments around this.
>
> In addition, if the zeroed page is not issued here immediately, the
> logic will become more complex because we need to more careful about the
> order of write-back IOs to prevent deadlock issues caused by mutual
> waiting.
You mean an endio completion waiting for ordered IO to complete but
ordered IO writeback is somehow waiting for this endio completion? Is that
actually possible?
>
> > However, I think one side effect might be many threads calling the
> > writeback mechanism to issue zero IOs which might not scale well. I
> > don't know if it'll be a huge problem though, I guess it's a sort of
> > thing we will have to deal with in case we see it in real world
> > workloads.
> >
>
> I agree with you. However, I suspect that unless we run some specific
> benchmark tests, it should be difficult to encounter a large number of
> post-EOF append writes and truncate up operations in real-world usage
> scenarios — and I haven't come across such scenarios yet. For
> simplicity, I'd like to proceed with this implementation for now. If we
> do run into actual problems later, we can consider not issuing I/O
> directly here, but instead: 1) find the ordered block in
> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
> for normal background writeback as well — otherwise, there is a risk of
> deadlock (mutual waiting). What do you think?
Yes sounds good Yi, we can deal with performance tuning later.
Regards,
Ojaswin
>
> Cheers,
> Yi.
>
> > [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
> >
> >> ---
> >> fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
> >> 1 file changed, 55 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index 239d387ffaf2..e013aeb03d7b 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
> >> zero_written);
> >> }
> >>
> >> +static int ext4_iomap_submit_zero_block(struct inode *inode,
> >> + loff_t from, loff_t end)
> >> +{
> >> + struct address_space *mapping = inode->i_mapping;
> >> + struct folio *folio;
> >> + bool do_submit = false;
> >> +
> >> + folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
> >> + if (IS_ERR(folio))
> >> + /* Already writeback and clear? */
> >> + return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
> >> +
> >> + folio_wait_writeback(folio);
> >> + WARN_ON_ONCE(folio_test_writeback(folio));
> >> +
> >> + if (likely(folio_test_dirty(folio)))
> >> + do_submit = true;
> >> + folio_unlock(folio);
> >> + folio_put(folio);
> >> +
> >> + /* Submit zeroed block. */
> >> + if (do_submit)
> >> + return filemap_fdatawrite_range(mapping, from, end - 1);
> >> + return 0;
> >> +}
> >> +
> >> /*
> >> * Zero out a mapping from file offset 'from' up to the end of the block
> >> * which corresponds to 'from' or to the given 'end' inside this block.
> >> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >> if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
> >> return 0;
> >>
> >> - if (length > blocksize - offset)
> >> + if (length > blocksize - offset) {
> >> length = blocksize - offset;
> >> + end = from + length;
> >> + }
> >>
> >> err = ext4_block_zero_range(inode, from, length,
> >> &did_zero, &zero_written);
> >> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
> >> * TODO: In the iomap path, handle this by updating i_disksize to
> >> * i_size after the zeroed data has been written back.
> >> */
> >> - if (ext4_should_order_data(inode) &&
> >> - did_zero && zero_written && !IS_DAX(inode)) {
> >> - handle_t *handle;
> >> + if (did_zero && zero_written && !IS_DAX(inode)) {
> >> + if (ext4_should_order_data(inode)) {
> >> + handle_t *handle;
> >>
> >> - handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> >> - if (IS_ERR(handle))
> >> - return PTR_ERR(handle);
> >> + handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
> >> + if (IS_ERR(handle))
> >> + return PTR_ERR(handle);
> >>
> >> - err = ext4_jbd2_inode_add_write(handle, inode, from, length);
> >> - ext4_journal_stop(handle);
> >> - if (err)
> >> - return err;
> >> + err = ext4_jbd2_inode_add_write(handle, inode, from,
> >> + length);
> >> + ext4_journal_stop(handle);
> >> + if (err)
> >> + return err;
> >> + /*
> >> + * inodes using the iomap buffered I/O path do not use the
> >> + * data=ordered mode. We submit zeroed range directly here.
> >> + * Do not wait for I/O completion for performance.
> >> + *
> >> + * TODO: Any operation that extends i_disksize (including
> >> + * append write end io past the zeroed boundary, truncate up,
> >> + * and append fallocate) must wait for the relevant I/O to
> >> + * complete before updating i_disksize.
> >> + */
> >> + } else if (ext4_inode_buffered_iomap(inode)) {
> >> + err = ext4_iomap_submit_zero_block(inode, from, end);
> >> + if (err)
> >> + return err;
> >> + }
> >> }
> >>
> >> return 0;
> >> --
> >> 2.52.0
> >>
>
^ permalink raw reply
* Re: [PATCH 3/4] pipe: mark blocking pipe read and FIFO open sleeps as freezable
From: daijunbing @ 2026-06-01 7:00 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel, viro, brauner, tytso, jack, linux-ext4,
linux-kernel
In-Reply-To: <j44pojzowtbta2pxeb6nfwnw3mu4thlx5o73es3a5kivibwnvs@cqhwry46h2fy>
Hi,
Please drop this patch from vfs-7.2.misc.
After further review, I believe this change is not sufficiently justified.
In particular, I do not have a solid enough analysis to show that making
the anon_pipe_read() wait freezable is safe with respect to freeze-time
lock/resource constraints across all relevant call paths.
While this change has been running on our devices for about a year without
any reported issues, I am not confident that this is sufficient to
justify it for upstream use in the general case.
This concern only affects this patch; the rest of the series should be
unaffected.
Thanks,
在 2026/5/30 18:10, Jan Kara 写道:
> Making sure we don't hold any lock is rather non-trivial in some cases. For
> example in your case here, anon_pipe_read() is used in .read_iter method so
> you should make sure all places calling .read_iter don't hold some lock. If
> nothing else, I'm missing this analysis in the changelog and I actually
> strongly suspect it is not true in some cor
^ permalink raw reply
* Re: [PATCH v4 09/23] ext4: implement writeback path using iomap
From: Ojaswin Mujoo @ 2026-06-01 6:26 UTC (permalink / raw)
To: Zhang Yi
Cc: Zhang Yi, linux-ext4, linux-fsdevel, linux-kernel, tytso,
adilger.kernel, libaokun, jack, ritesh.list, djwong, hch,
yi.zhang, yangerkun, yukuai
In-Reply-To: <29370885-bd50-4032-939f-2c377330872d@huaweicloud.com>
On Sat, May 30, 2026 at 09:21:31AM +0800, Zhang Yi wrote:
> On 5/29/2026 11:32 PM, Ojaswin Mujoo wrote:
> > On Fri, May 29, 2026 at 10:12:12PM +0800, Zhang Yi wrote:
> >> Hi, Ojaswin!
> >>
> >> On 5/27/2026 8:49 PM, Ojaswin Mujoo wrote:
> >>> On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> >>>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>>
> >>>> Add the iomap writeback path for ext4 buffered I/O. This introduces:
> >>>>
> >>>> - ext4_iomap_writepages(): the main writeback entry point.
> >>>> - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
> >>>> block mapping and I/O submission.
> >>>> - A new end I/O worker for converting unwritten extents, updating file
> >>>> size, and handling DATA_ERR_ABORT after I/O completion.
> >>>>
> >>>> Core implementation details:
> >>>>
> >>>> - ->writeback_range() callback
> >>>> Calls ext4_iomap_map_writeback_range() to query the longest range of
> >>>> existing mapped extents. For performance, when a block range is not
> >>>> yet allocated, it allocates based on the writeback length and delalloc
> >>>> extent length, rather than allocating for a single folio at a time.
> >>>> The folio is then added to an iomap_ioend instance.
> >>>>
> >>>> - ->writeback_submit() callback
> >>>> Registers ext4_iomap_end_bio() as the end bio callback. This callback
> >>>> schedules a worker to handle:
> >>>> - Unwritten extent conversion.
> >>>> - i_disksize update after data is written back.
> >>>> - Journal abort on writeback I/O failure.
> >>>
> >>> Hi Zhang, the changes look good. I have a few comments below:
> >>>>
> >>>> Key changes and considerations:
> >>>>
> >>>> - Append write and unwritten extents
> >>>> Since data=ordered mode is not used to prevent stale data exposure
> >>>> during append writebacks, new blocks are always allocated as unwritten
> >>>> extents (i.e. always enable dioread_nolock), and i_disksize update is
> >>>> postponed until I/O completion.
> >>>
> >>> Makes sense.
> >>>
> >>>> Additionally, the deadlock that the
> >>>> reserve handle was expected to resolve does not occur anymore.
> >>>
> >>> I guess this is since we don't use ordered data so we can't block on
> >>> starting a txn in end io.
> >>
> >> Yep.
> >>
> >>>
> >>>> Therefore, the end I/O worker can start a normal journal handle
> >>>> instead of a reserve handle when converting unwritten extents.
> >>>>
> >>>> - Lock ordering
> >>>> The ->writeback_range() callback runs under the folio lock, requiring
> >>>> the journal handle to be started under that same lock. This reverses
> >>>> the order compared to the buffer_head writeback path. The lock ordering
> >>>> documentation in super.c has been updated accordingly.
> >>>>
> >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>>> ---
> >>>> fs/ext4/ext4.h | 4 +
> >>>> fs/ext4/inode.c | 208 +++++++++++++++++++++++++++++++++++++++++-
> >>>> fs/ext4/page-io.c | 126 +++++++++++++++++++++++++
> >>>> fs/ext4/super.c | 7 +-
> >>>> fs/iomap/ioend.c | 3 +-
> >>>> include/linux/iomap.h | 1 +
> >>>> 6 files changed, 346 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>> index 4832e7f7db82..078feda47e36 100644
> >>>> --- a/fs/ext4/ext4.h
> >>>> +++ b/fs/ext4/ext4.h
> >>>> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
> >>>> */
> >>>> struct list_head i_rsv_conversion_list;
> >>>> struct work_struct i_rsv_conversion_work;
> >>>> + struct list_head i_iomap_ioend_list;
> >>>> + struct work_struct i_iomap_ioend_work;
> >>>> /*
> >>>> * Transactions that contain inode's metadata needed to complete
> >>>> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
> >>>> size_t len);
> >>>> extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
> >>>> extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> >>>> +extern void ext4_iomap_end_io(struct work_struct *work);
> >>>> +extern void ext4_iomap_end_bio(struct bio *bio);
> >>>> /* mmp.c */
> >>>> extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>>> index 1ae7d3f4a1c8..a80195bd6f20 100644
> >>>> --- a/fs/ext4/inode.c
> >>>> +++ b/fs/ext4/inode.c
> >>>> @@ -44,6 +44,7 @@
> >>>> #include <linux/iversion.h>
> >>>> #include "ext4_jbd2.h"
> >>>> +#include "ext4_extents.h"
> >>>> #include "xattr.h"
> >>>> #include "acl.h"
> >>>> #include "truncate.h"
> >>>> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
> >>>> iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> >>>> }
> >>>> +static int ext4_iomap_map_one_extent(struct inode *inode,
> >>>> + struct ext4_map_blocks *map)
> >>>> +{
> >>>> + struct extent_status es;
> >>>> + handle_t *handle = NULL;
> >>>> + int credits, map_flags;
> >>>> + int retval;
> >>>> +
> >>>> + credits = ext4_chunk_trans_blocks(inode, map->m_len);
> >>>> + handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> >>>> + if (IS_ERR(handle))
> >>>> + return PTR_ERR(handle);
> >>>> +
> >>>> + map->m_flags = 0;
> >>>> + /*
> >>>> + * It is necessary to look up extent and map blocks under i_data_sem
> >>>> + * in write mode, otherwise, the delalloc extent may become stale
> >>>> + * during concurrent truncate operations.
> >>>> + */
> >>>> + ext4_fc_track_inode(handle, inode);
> >>>> + down_write(&EXT4_I(inode)->i_data_sem);
> >>>> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> >>>> + retval = es.es_len - (map->m_lblk - es.es_lblk);
> >>>> + map->m_len = min_t(unsigned int, retval, map->m_len);
> >>>> +
> >>>> + if (ext4_es_is_delayed(&es)) {
> >>>
> >>> I understand that it is okay for us to rely on extent status ==
> >>> delayed here because we never reclaim delayed es entries and hence we
> >>> are sure to not skip any delayed block allocations here.
> >>
> >> Yeah, right.
> >>
> >>>
> >>>> + map->m_flags |= EXT4_MAP_DELAYED;
> >>>> + trace_ext4_da_write_pages_extent(inode, map);
> >>>> + /*
> >>>> + * Call ext4_map_create_blocks() to allocate any
> >>>> + * delayed allocation blocks. It is possible that
> >>>> + * we're going to need more metadata blocks, however
> >>>> + * we must not fail because we're in writeback and
> >>>> + * there is nothing we can do so it might result in
> >>>> + * data loss. So use reserved blocks to allocate
> >>>> + * metadata if possible.
> >>>> + */
> >>>> + map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> >>>> + EXT4_GET_BLOCKS_METADATA_NOFAIL |
> >>>> + EXT4_EX_NOCACHE;
> >>>> +
> >>>> + retval = ext4_map_create_blocks(handle, inode, map,
> >>>> + map_flags);
> >>>> + if (retval > 0)
> >>>> + ext4_fc_track_range(handle, inode, map->m_lblk,
> >>>> + map->m_lblk + map->m_len - 1);
> >>>> + goto out;
> >>>> + } else if (unlikely(ext4_es_is_hole(&es)))
> >>>
> >>> Now that you've fixed the partial invalidate in iomap (patch 12/23)
> >>> can we still hit this hole case?
> >>
> >> Theoretically, no hole should be encountered; this is just defensive
> >> programming.
> >>
> >>>
> >>>> + goto out;
> >>>> +
> >>>> + /* Found written or unwritten extent. */
> >>>> + map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> >>>> + map->m_flags = ext4_es_is_written(&es) ?
> >>>> + EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> >>>> +out:
> >>>> + up_write(&EXT4_I(inode)->i_data_sem);
> >>>> + ext4_journal_stop(handle);
> >>>> + return retval < 0 ? retval : 0;
> >>>> +}
> >>>> +
> >>>> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> >>>> + loff_t offset, unsigned int dirty_len)
> >>>> +{
> >>>> + struct inode *inode = wpc->inode;
> >>>> + struct super_block *sb = inode->i_sb;
> >>>> + struct journal_s *journal = EXT4_SB(sb)->s_journal;
> >>>> + struct ext4_map_blocks map;
> >>>> + unsigned int blkbits = inode->i_blkbits;
> >>>> + unsigned int index = offset >> blkbits;
> >>>> + unsigned int blk_end, blk_len;
> >>>> + int ret;
> >>>> +
> >>>> + ret = ext4_emergency_state(sb);
> >>>> + if (unlikely(ret))
> >>>> + return ret;
> >>>> +
> >>>> + /* Check validity of the cached writeback mapping. */
> >>>> + if (offset >= wpc->iomap.offset &&
> >>>> + offset < wpc->iomap.offset + wpc->iomap.length &&
> >>>> + ext4_iomap_valid(inode, &wpc->iomap))
> >>>> + return 0;
> >>>> +
> >>>> + blk_len = dirty_len >> blkbits;
> >>>> + blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> >>>> + (UINT_MAX - 1));
> >>>
> >>> This is an interesting idea. I'm just a bit worried when we have
> >>> range_end == LLONG_MAX (bg flush) and we will always be trying to allocate
> >>> MAX_WRITEPAGES, incase of a slightly fragmented FS, we might keep
> >>> falling into slower mballoc criterias and might waste a lot of time
> >>> scanning the groups.
> >>
> >> Actually, MAX_WRITEPAGES is not allocated every time; the allocated
> >> length also depends on the length of data that has already been delayed
> >> for writing, and the smaller value is taken. If the user has indeed
> >
> > Hmm so we take the blk_end based on range_end (which is LLONG_MAX for bg
> > flusher) and then our blk_len will be set accordingly and would become a
> > large number as well. Then we will set map.m_len based on this blk_len
> > and MAX_WRITEPAGES. Am I missing something that clamps our m_len?
> >
> Please take a look at ext4_iomap_map_one_extent():
>
> + retval = es.es_len - (map->m_lblk - es.es_lblk);
> + map->m_len = min_t(unsigned int, retval, map->m_len);
>
> In this case, m_len is truncated to the length of the delalloc extent.
>
> Thanks,
> Yi.
Oh yes of course, thanks for the clarification. Looks good then.
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Thanks,
Ojaswin
>
^ permalink raw reply
* [PATCH v6 2/2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
From: Guan-Chun Wu @ 2026-05-31 8:00 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi
Cc: linux-ext4, linux-kernel, edward062254, visitorckw,
david.laight.linux, Guan-Chun Wu
In-Reply-To: <20260531080019.3794809-1-409411716@gms.tku.edu.tw>
The original byte-by-byte implementation with modulo checks is less
efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
to process input in explicit 4-byte chunks instead of using a
modulus-based loop to emit words byte by byte.
Additionally, the use of function pointers for selecting the appropriate
str2hashbuf implementation has been removed. Instead, the functions are
directly invoked based on the hash type, eliminating the overhead of
dynamic function calls.
Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
runs, using kernel module for testing):
len | orig_s | new_s | orig_u | new_u
----+--------+-------+--------+-------
1 | 70 | 71 | 63 | 63
8 | 68 | 64 | 64 | 62
32 | 75 | 70 | 75 | 63
64 | 96 | 71 | 100 | 68
255 | 192 | 108 | 187 | 84
This change improves performance, especially for larger input sizes.
Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
---
fs/ext4/hash.c | 64 +++++++++++++++++++++++++++++++++-----------------
1 file changed, 42 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index 72645bd92..978bd92da 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -9,6 +9,7 @@
#include <linux/unicode.h>
#include <linux/compiler.h>
#include <linux/bitops.h>
+#include <linux/unaligned.h>
#include "ext4.h"
#define DELTA 0x9E3779B9
@@ -141,21 +142,28 @@ static void str2hashbuf_signed(const char *msg, int len, __u32 *buf, int num)
pad = (__u32)len | ((__u32)len << 8);
pad |= pad << 16;
- val = pad;
if (len > num*4)
len = num * 4;
- for (i = 0; i < len; i++) {
- val = ((int) scp[i]) + (val << 8);
- if ((i % 4) == 3) {
- *buf++ = val;
- val = pad;
- num--;
- }
+
+ while (len >= 4) {
+ val = ((__u32)scp[0] << 24) + ((__u32)scp[1] << 16) + ((__u32)scp[2] << 8) + scp[3];
+ *buf++ = val;
+ scp += 4;
+ len -= 4;
+ num--;
}
+
+ val = pad;
+
+ for (i = 0; i < len; i++)
+ val = scp[i] + (val << 8);
+
if (--num >= 0)
*buf++ = val;
+
while (--num >= 0)
*buf++ = pad;
+
}
static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
@@ -167,21 +175,28 @@ static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
pad = (__u32)len | ((__u32)len << 8);
pad |= pad << 16;
- val = pad;
if (len > num*4)
len = num * 4;
- for (i = 0; i < len; i++) {
- val = ((int) ucp[i]) + (val << 8);
- if ((i % 4) == 3) {
- *buf++ = val;
- val = pad;
- num--;
- }
+
+ while (len >= 4) {
+ val = get_unaligned_be32(ucp);
+ *buf++ = val;
+ ucp += 4;
+ len -= 4;
+ num--;
}
+
+ val = pad;
+
+ for (i = 0; i < len; i++)
+ val = ucp[i] + (val << 8);
+
if (--num >= 0)
*buf++ = val;
+
while (--num >= 0)
*buf++ = pad;
+
}
/*
@@ -205,8 +220,7 @@ static int __ext4fs_dirhash(const struct inode *dir, const char *name, int len,
const char *p;
int i;
__u32 in[8], buf[4];
- void (*str2hashbuf)(const char *, int, __u32 *, int) =
- str2hashbuf_signed;
+ bool use_unsigned = false;
/* Initialize the default seed for the hash checksum functions */
buf[0] = 0x67452301;
@@ -232,12 +246,15 @@ static int __ext4fs_dirhash(const struct inode *dir, const char *name, int len,
hash = dx_hack_hash_signed(name, len);
break;
case DX_HASH_HALF_MD4_UNSIGNED:
- str2hashbuf = str2hashbuf_unsigned;
+ use_unsigned = true;
fallthrough;
case DX_HASH_HALF_MD4:
p = name;
while (len > 0) {
- (*str2hashbuf)(p, len, in, 8);
+ if (use_unsigned)
+ str2hashbuf_unsigned(p, len, in, 8);
+ else
+ str2hashbuf_signed(p, len, in, 8);
half_md4_transform(buf, in);
len -= 32;
p += 32;
@@ -246,12 +263,15 @@ static int __ext4fs_dirhash(const struct inode *dir, const char *name, int len,
hash = buf[1];
break;
case DX_HASH_TEA_UNSIGNED:
- str2hashbuf = str2hashbuf_unsigned;
+ use_unsigned = true;
fallthrough;
case DX_HASH_TEA:
p = name;
while (len > 0) {
- (*str2hashbuf)(p, len, in, 4);
+ if (use_unsigned)
+ str2hashbuf_unsigned(p, len, in, 4);
+ else
+ str2hashbuf_signed(p, len, in, 4);
TEA_transform(buf, in);
len -= 16;
p += 16;
--
2.34.1
^ permalink raw reply related
* [PATCH v6 1/2] ext4: add Kunit coverage for directory hash computation
From: Guan-Chun Wu @ 2026-05-31 8:00 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi
Cc: linux-ext4, linux-kernel, edward062254, visitorckw,
david.laight.linux, Guan-Chun Wu
In-Reply-To: <20260531080019.3794809-1-409411716@gms.tku.edu.tw>
Introduce Kunit tests for fs/ext4/hash.c to verify ext4fs_dirhash()
across the legacy, half-MD4, and TEA hash variants.
The tests cover empty, seeded hashing, and non-ASCII name handling.
They also verify error paths, including invalid hash versions and
SipHash without a configured key, and check that the signed and
unsigned hash variants differ on non-ASCII input as expected.
When CONFIG_UNICODE is enabled, the tests further verify casefolded-name
hashing and the fallback behavior for invalid input.
Co-developed-by: Chen Hao Yu <edward062254@gmail.com>
Signed-off-by: Chen Hao Yu <edward062254@gmail.com>
Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
---
fs/ext4/Makefile | 2 +-
fs/ext4/hash-test.c | 567 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/hash.c | 4 +
3 files changed, 572 insertions(+), 1 deletion(-)
create mode 100644 fs/ext4/hash-test.c
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 3baee4e7c..3f9fc0eb8 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -15,7 +15,7 @@ ext4-y := balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
ext4-test-objs += inode-test.o mballoc-test.o \
- extents-test.o
+ extents-test.o hash-test.o
obj-$(CONFIG_EXT4_KUNIT_TESTS) += ext4-test.o
ext4-$(CONFIG_FS_VERITY) += verity.o
ext4-$(CONFIG_FS_ENCRYPTION) += crypto.o
diff --git a/fs/ext4/hash-test.c b/fs/ext4/hash-test.c
new file mode 100644
index 000000000..49b0d874c
--- /dev/null
+++ b/fs/ext4/hash-test.c
@@ -0,0 +1,567 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * KUnit tests for ext4 directory hash computation.
+ */
+
+#include <kunit/test.h>
+#include <kunit/resource.h>
+#include <linux/fs.h>
+#include <linux/stddef.h>
+#include <linux/string.h>
+#include <linux/unicode.h>
+#include "ext4.h"
+
+static void ext4_hash_init_fake_dir(struct inode *dir, struct super_block *sb)
+{
+ memset(sb, 0, sizeof(*sb));
+ memset(dir, 0, sizeof(*dir));
+ dir->i_sb = sb;
+ strscpy(sb->s_id, "kunit-ext4", sizeof(sb->s_id));
+}
+
+static void ext4_hash_init_fake_dir_with_sbi(struct inode *dir,
+ struct super_block *sb,
+ struct ext4_sb_info *sbi)
+{
+ ext4_hash_init_fake_dir(dir, sb);
+ memset(sbi, 0, sizeof(*sbi));
+ sb->s_fs_info = sbi;
+ sbi->s_sb = sb;
+}
+
+#ifdef CONFIG_FS_ENCRYPTION
+static const struct fscrypt_operations ext4_hash_test_cryptops = {
+ .inode_info_offs =
+ (int)offsetof(struct ext4_inode_info, i_crypt_info) -
+ (int)offsetof(struct ext4_inode_info, vfs_inode),
+};
+#endif
+
+static void ext4_hash_init_fake_ext4_dir(struct ext4_inode_info *ei,
+ struct super_block *sb,
+ struct ext4_sb_info *sbi)
+{
+ struct inode *dir = &ei->vfs_inode;
+
+ memset(sb, 0, sizeof(*sb));
+ memset(ei, 0, sizeof(*ei));
+ memset(sbi, 0, sizeof(*sbi));
+
+ strscpy(sb->s_id, "kunit-ext4", sizeof(sb->s_id));
+ sb->s_fs_info = sbi;
+ sbi->s_sb = sb;
+
+ dir->i_sb = sb;
+ dir->i_mode = S_IFDIR;
+
+#ifdef CONFIG_FS_ENCRYPTION
+ fscrypt_set_ops(sb, &ext4_hash_test_cryptops);
+#endif
+}
+
+struct ext4_dirhash_test_case {
+ const char *name;
+ u32 hash_version;
+ const char *input;
+ int len;
+ u32 seed[4];
+ bool use_seed;
+ u32 expected_hash;
+ u32 expected_minor_hash;
+};
+
+static const struct ext4_dirhash_test_case ext4_dirhash_test_cases[] = {
+ {
+ .name = "legacy_abc",
+ .hash_version = DX_HASH_LEGACY,
+ .input = "abc",
+ .len = 3,
+ .use_seed = false,
+ .expected_hash = 0x75afd992,
+ .expected_minor_hash = 0x00000000,
+ },
+ {
+ .name = "legacy_unsigned_abc",
+ .hash_version = DX_HASH_LEGACY_UNSIGNED,
+ .input = "abc",
+ .len = 3,
+ .use_seed = false,
+ .expected_hash = 0x75afd992,
+ .expected_minor_hash = 0x00000000,
+ },
+ {
+ .name = "half_md4_abc",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "abc",
+ .len = 3,
+ .use_seed = false,
+ .expected_hash = 0xd196a868,
+ .expected_minor_hash = 0xc420eb28,
+ },
+ {
+ .name = "half_md4_unsigned_abc",
+ .hash_version = DX_HASH_HALF_MD4_UNSIGNED,
+ .input = "abc",
+ .len = 3,
+ .use_seed = false,
+ .expected_hash = 0xd196a868,
+ .expected_minor_hash = 0xc420eb28,
+ },
+ {
+ .name = "tea_abc",
+ .hash_version = DX_HASH_TEA,
+ .input = "abc",
+ .len = 3,
+ .use_seed = false,
+ .expected_hash = 0xb1435ec4,
+ .expected_minor_hash = 0x3f7eaa0e,
+ },
+ {
+ .name = "tea_unsigned_abc",
+ .hash_version = DX_HASH_TEA_UNSIGNED,
+ .input = "abc",
+ .len = 3,
+ .use_seed = false,
+ .expected_hash = 0xb1435ec4,
+ .expected_minor_hash = 0x3f7eaa0e,
+ },
+ {
+ .name = "empty_half_md4",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "",
+ .len = 0,
+ .use_seed = false,
+ .expected_hash = 0xefcdab88,
+ .expected_minor_hash = 0x98badcfe,
+ },
+ {
+ .name = "half_md4_31bytes",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "1234567890123456789012345678901",
+ .len = 31,
+ .use_seed = false,
+ .expected_hash = 0xc4db1f78,
+ .expected_minor_hash = 0xea23921b,
+ },
+ {
+ .name = "half_md4_32bytes",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "12345678901234567890123456789012",
+ .len = 32,
+ .use_seed = false,
+ .expected_hash = 0xfa6cc63e,
+ .expected_minor_hash = 0x2f77bd1c,
+ },
+ {
+ .name = "half_md4_33bytes",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "123456789012345678901234567890123",
+ .len = 33,
+ .use_seed = false,
+ .expected_hash = 0xdc0c2dec,
+ .expected_minor_hash = 0x5ca23365,
+ },
+ {
+ .name = "half_md4_unsigned_31bytes",
+ .hash_version = DX_HASH_HALF_MD4_UNSIGNED,
+ .input = "1234567890123456789012345678901",
+ .len = 31,
+ .use_seed = false,
+ .expected_hash = 0xc4db1f78,
+ .expected_minor_hash = 0xea23921b,
+ },
+ {
+ .name = "half_md4_unsigned_32bytes",
+ .hash_version = DX_HASH_HALF_MD4_UNSIGNED,
+ .input = "12345678901234567890123456789012",
+ .len = 32,
+ .use_seed = false,
+ .expected_hash = 0xfa6cc63e,
+ .expected_minor_hash = 0x2f77bd1c,
+ },
+ {
+ .name = "half_md4_unsigned_33bytes",
+ .hash_version = DX_HASH_HALF_MD4_UNSIGNED,
+ .input = "123456789012345678901234567890123",
+ .len = 33,
+ .use_seed = false,
+ .expected_hash = 0xdc0c2dec,
+ .expected_minor_hash = 0x5ca23365,
+ },
+ {
+ .name = "tea_15bytes",
+ .hash_version = DX_HASH_TEA,
+ .input = "123456789abcdef",
+ .len = 15,
+ .use_seed = false,
+ .expected_hash = 0xa562903a,
+ .expected_minor_hash = 0x6174a00f,
+ },
+ {
+ .name = "tea_16bytes",
+ .hash_version = DX_HASH_TEA,
+ .input = "1234567890abcdef",
+ .len = 16,
+ .use_seed = false,
+ .expected_hash = 0x8449f258,
+ .expected_minor_hash = 0x49a16d46,
+ },
+ {
+ .name = "tea_17bytes",
+ .hash_version = DX_HASH_TEA,
+ .input = "123456789abcdefgh",
+ .len = 17,
+ .use_seed = false,
+ .expected_hash = 0xf32ec10c,
+ .expected_minor_hash = 0x58ceae61,
+ },
+ {
+ .name = "half_md4_seeded",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "same-name",
+ .len = 9,
+ .seed = { 0x11111111, 0x22222222, 0x33333333, 0x44444444 },
+ .use_seed = true,
+ .expected_hash = 0x8aebf604,
+ .expected_minor_hash = 0x66ce48fe,
+ },
+ {
+ .name = "half_md4_non_ascii_signed",
+ .hash_version = DX_HASH_HALF_MD4,
+ .input = "\x80\x81\x82\x83\x84",
+ .len = 5,
+ .use_seed = false,
+ .expected_hash = 0x8bab0498,
+ .expected_minor_hash = 0xc326632d,
+ },
+ {
+ .name = "half_md4_non_ascii_unsigned",
+ .hash_version = DX_HASH_HALF_MD4_UNSIGNED,
+ .input = "\x80\x81\x82\x83\x84",
+ .len = 5,
+ .use_seed = false,
+ .expected_hash = 0xbc48596e,
+ .expected_minor_hash = 0xde0fad41,
+ },
+ {
+ .name = "tea_non_ascii_signed",
+ .hash_version = DX_HASH_TEA,
+ .input = "\x80\x81\x82\x83\x84",
+ .len = 5,
+ .use_seed = false,
+ .expected_hash = 0x21e3a154,
+ .expected_minor_hash = 0x90112c3d,
+ },
+ {
+ .name = "tea_non_ascii_unsigned",
+ .hash_version = DX_HASH_TEA_UNSIGNED,
+ .input = "\x80\x81\x82\x83\x84",
+ .len = 5,
+ .use_seed = false,
+ .expected_hash = 0x9b648616,
+ .expected_minor_hash = 0x011dd507,
+ },
+};
+
+static void test_ext4fs_dirhash_vectors(struct kunit *test)
+{
+ struct super_block *sb;
+ struct inode *dir;
+ int i;
+
+ sb = kunit_kzalloc(test, sizeof(*sb), GFP_KERNEL);
+ dir = kunit_kzalloc(test, sizeof(*dir), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb);
+ KUNIT_ASSERT_NOT_NULL(test, dir);
+
+ ext4_hash_init_fake_dir(dir, sb);
+
+ for (i = 0; i < ARRAY_SIZE(ext4_dirhash_test_cases); i++) {
+ const struct ext4_dirhash_test_case *tc =
+ &ext4_dirhash_test_cases[i];
+ struct dx_hash_info hinfo;
+ int ret;
+
+ memset(&hinfo, 0, sizeof(hinfo));
+ hinfo.hash_version = tc->hash_version;
+ hinfo.seed = tc->use_seed ? (u32 *)tc->seed : NULL;
+
+ ret = ext4fs_dirhash(dir, tc->input, tc->len, &hinfo);
+
+ KUNIT_ASSERT_EQ_MSG(test, ret, 0, "case=%s", tc->name);
+ KUNIT_EXPECT_EQ_MSG(test, hinfo.hash, tc->expected_hash,
+ "case=%s", tc->name);
+ KUNIT_EXPECT_EQ_MSG(test, hinfo.minor_hash,
+ tc->expected_minor_hash,
+ "case=%s", tc->name);
+ }
+}
+
+static void test_ext4fs_dirhash_seed_changes_result(struct kunit *test)
+{
+ struct super_block *sb;
+ struct inode *dir;
+ u32 seed[4] = { 0x11111111, 0x22222222, 0x33333333, 0x44444444 };
+ struct dx_hash_info plain = {
+ .hash_version = DX_HASH_HALF_MD4,
+ };
+ struct dx_hash_info seeded = {
+ .hash_version = DX_HASH_HALF_MD4,
+ .seed = seed,
+ };
+ int ret_plain, ret_seeded;
+
+ sb = kunit_kzalloc(test, sizeof(*sb), GFP_KERNEL);
+ dir = kunit_kzalloc(test, sizeof(*dir), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb);
+ KUNIT_ASSERT_NOT_NULL(test, dir);
+
+ ext4_hash_init_fake_dir(dir, sb);
+
+ ret_plain = ext4fs_dirhash(dir, "same-name", 9, &plain);
+ ret_seeded = ext4fs_dirhash(dir, "same-name", 9, &seeded);
+
+ KUNIT_ASSERT_EQ(test, ret_plain, 0);
+ KUNIT_ASSERT_EQ(test, ret_seeded, 0);
+
+ KUNIT_EXPECT_TRUE(test,
+ plain.hash != seeded.hash ||
+ plain.minor_hash != seeded.minor_hash);
+}
+
+static void test_ext4fs_dirhash_invalid_version_returns_einval(struct kunit *test)
+{
+ struct super_block *sb;
+ struct inode *dir;
+ struct ext4_sb_info *sbi;
+ struct dx_hash_info hinfo = {
+ .hash = 0xdeadbeef,
+ .minor_hash = 0xcafebabe,
+ .hash_version = DX_HASH_LAST + 1,
+ };
+ int ret;
+
+ sb = kunit_kzalloc(test, sizeof(*sb), GFP_KERNEL);
+ dir = kunit_kzalloc(test, sizeof(*dir), GFP_KERNEL);
+ sbi = kunit_kzalloc(test, sizeof(*sbi), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb);
+ KUNIT_ASSERT_NOT_NULL(test, dir);
+ KUNIT_ASSERT_NOT_NULL(test, sbi);
+
+ ext4_hash_init_fake_dir_with_sbi(dir, sb, sbi);
+
+ ret = ext4fs_dirhash(dir, "abc", 3, &hinfo);
+
+ KUNIT_EXPECT_EQ(test, ret, -EINVAL);
+ KUNIT_EXPECT_EQ(test, hinfo.hash, 0);
+ KUNIT_EXPECT_EQ(test, hinfo.minor_hash, 0);
+}
+
+static void test_ext4fs_dirhash_siphash_without_key_returns_einval(struct kunit *test)
+{
+ struct super_block *sb;
+ struct ext4_inode_info *ei;
+ struct inode *dir;
+ struct ext4_sb_info *sbi;
+ struct dx_hash_info hinfo = {
+ .hash_version = DX_HASH_SIPHASH,
+ };
+ int ret;
+
+ sb = kunit_kzalloc(test, sizeof(*sb), GFP_KERNEL);
+ ei = kunit_kzalloc(test, sizeof(*ei), GFP_KERNEL);
+ sbi = kunit_kzalloc(test, sizeof(*sbi), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb);
+ KUNIT_ASSERT_NOT_NULL(test, ei);
+ KUNIT_ASSERT_NOT_NULL(test, sbi);
+
+ ext4_hash_init_fake_ext4_dir(ei, sb, sbi);
+ dir = &ei->vfs_inode;
+
+ ret = ext4fs_dirhash(dir, "name", strlen("name"), &hinfo);
+
+ KUNIT_EXPECT_EQ(test, ret, -EINVAL);
+}
+
+static void test_ext4fs_dirhash_signed_unsigned_differ_on_nonascii(struct kunit *test)
+{
+ struct super_block *sb;
+ struct inode *dir;
+ static const char input[] = "\x80\xff\x81\xfe\101bc";
+ struct dx_hash_info legacy_signed = {
+ .hash_version = DX_HASH_LEGACY,
+ };
+ struct dx_hash_info legacy_unsigned = {
+ .hash_version = DX_HASH_LEGACY_UNSIGNED,
+ };
+ struct dx_hash_info md4_signed = {
+ .hash_version = DX_HASH_HALF_MD4,
+ };
+ struct dx_hash_info md4_unsigned = {
+ .hash_version = DX_HASH_HALF_MD4_UNSIGNED,
+ };
+ struct dx_hash_info tea_signed = {
+ .hash_version = DX_HASH_TEA,
+ };
+ struct dx_hash_info tea_unsigned = {
+ .hash_version = DX_HASH_TEA_UNSIGNED,
+ };
+ int ret;
+
+ sb = kunit_kzalloc(test, sizeof(*sb), GFP_KERNEL);
+ dir = kunit_kzalloc(test, sizeof(*dir), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb);
+ KUNIT_ASSERT_NOT_NULL(test, dir);
+
+ ext4_hash_init_fake_dir(dir, sb);
+
+ ret = ext4fs_dirhash(dir, input, sizeof(input) - 1, &legacy_signed);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ ret = ext4fs_dirhash(dir, input, sizeof(input) - 1, &legacy_unsigned);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ KUNIT_EXPECT_NE(test, legacy_signed.hash, legacy_unsigned.hash);
+
+ ret = ext4fs_dirhash(dir, input, sizeof(input) - 1, &md4_signed);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ ret = ext4fs_dirhash(dir, input, sizeof(input) - 1, &md4_unsigned);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ KUNIT_EXPECT_TRUE(test,
+ md4_signed.hash != md4_unsigned.hash ||
+ md4_signed.minor_hash != md4_unsigned.minor_hash);
+
+ ret = ext4fs_dirhash(dir, input, sizeof(input) - 1, &tea_signed);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ ret = ext4fs_dirhash(dir, input, sizeof(input) - 1, &tea_unsigned);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+ KUNIT_EXPECT_TRUE(test,
+ tea_signed.hash != tea_unsigned.hash ||
+ tea_signed.minor_hash != tea_unsigned.minor_hash);
+}
+
+#if IS_ENABLED(CONFIG_UNICODE)
+KUNIT_DEFINE_ACTION_WRAPPER(utf8_unload_action, utf8_unload,
+ struct unicode_map *);
+static void test_ext4fs_dirhash_casefolded_names_hash_consistently(struct kunit *test)
+{
+ struct super_block *sb;
+ struct ext4_inode_info *ei;
+ struct ext4_sb_info *sbi;
+ struct unicode_map *um;
+ struct dx_hash_info h1 = {
+ .hash_version = DX_HASH_HALF_MD4,
+ };
+ struct dx_hash_info h2 = {
+ .hash_version = DX_HASH_HALF_MD4,
+ };
+ int ret, ret1, ret2;
+
+ sb = kunit_kzalloc(test, sizeof(*sb), GFP_KERNEL);
+ ei = kunit_kzalloc(test, sizeof(*ei), GFP_KERNEL);
+ sbi = kunit_kzalloc(test, sizeof(*sbi), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb);
+ KUNIT_ASSERT_NOT_NULL(test, ei);
+ KUNIT_ASSERT_NOT_NULL(test, sbi);
+
+ um = utf8_load(UTF8_LATEST);
+ if (IS_ERR(um)) {
+ kunit_skip(test, "utf8_load(UTF8_LATEST) failed: %pe",
+ um);
+ return;
+ }
+
+ ret = kunit_add_action_or_reset(test, utf8_unload_action, um);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+
+ ext4_hash_init_fake_ext4_dir(ei, sb, sbi);
+ sb->s_encoding = um;
+ ei->vfs_inode.i_flags |= S_CASEFOLD;
+
+ KUNIT_ASSERT_TRUE(test, IS_CASEFOLDED(&ei->vfs_inode));
+
+ ret1 = ext4fs_dirhash(&ei->vfs_inode, "Alpha", 5, &h1);
+ ret2 = ext4fs_dirhash(&ei->vfs_inode, "aLPHa", 5, &h2);
+
+ KUNIT_ASSERT_EQ(test, ret1, 0);
+ KUNIT_ASSERT_EQ(test, ret2, 0);
+ KUNIT_EXPECT_EQ(test, h1.hash, h2.hash);
+ KUNIT_EXPECT_EQ(test, h1.minor_hash, h2.minor_hash);
+}
+
+static void test_ext4fs_dirhash_casefold_fallback(struct kunit *test)
+{
+ struct super_block *sb_cf, *sb_plain;
+ struct ext4_inode_info *ei;
+ struct ext4_sb_info *sbi;
+ struct inode *plain_dir;
+ struct unicode_map *um;
+ static const char invalid_utf8[] = "\xc3\x28";
+ struct dx_hash_info folded_dir = {
+ .hash_version = DX_HASH_HALF_MD4,
+ };
+ struct dx_hash_info plain = {
+ .hash_version = DX_HASH_HALF_MD4,
+ };
+ int ret, ret_cf, ret_plain;
+
+ sb_cf = kunit_kzalloc(test, sizeof(*sb_cf), GFP_KERNEL);
+ sb_plain = kunit_kzalloc(test, sizeof(*sb_plain), GFP_KERNEL);
+ ei = kunit_kzalloc(test, sizeof(*ei), GFP_KERNEL);
+ sbi = kunit_kzalloc(test, sizeof(*sbi), GFP_KERNEL);
+ plain_dir = kunit_kzalloc(test, sizeof(*plain_dir), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_NULL(test, sb_cf);
+ KUNIT_ASSERT_NOT_NULL(test, sb_plain);
+ KUNIT_ASSERT_NOT_NULL(test, ei);
+ KUNIT_ASSERT_NOT_NULL(test, sbi);
+ KUNIT_ASSERT_NOT_NULL(test, plain_dir);
+
+ um = utf8_load(UTF8_LATEST);
+ if (IS_ERR(um)) {
+ kunit_skip(test, "utf8_load(UTF8_LATEST) failed: %pe",
+ um);
+ return;
+ }
+
+ ret = kunit_add_action_or_reset(test, utf8_unload_action, um);
+ KUNIT_ASSERT_EQ(test, ret, 0);
+
+ ext4_hash_init_fake_ext4_dir(ei, sb_cf, sbi);
+ sb_cf->s_encoding = um;
+ ei->vfs_inode.i_flags |= S_CASEFOLD;
+
+ KUNIT_ASSERT_TRUE(test, IS_CASEFOLDED(&ei->vfs_inode));
+
+ ext4_hash_init_fake_dir(plain_dir, sb_plain);
+
+ ret_cf = ext4fs_dirhash(&ei->vfs_inode, invalid_utf8,
+ sizeof(invalid_utf8) - 1, &folded_dir);
+ ret_plain = ext4fs_dirhash(plain_dir, invalid_utf8,
+ sizeof(invalid_utf8) - 1, &plain);
+
+ KUNIT_ASSERT_EQ(test, ret_cf, 0);
+ KUNIT_ASSERT_EQ(test, ret_plain, 0);
+ KUNIT_EXPECT_EQ(test, folded_dir.hash, plain.hash);
+ KUNIT_EXPECT_EQ(test, folded_dir.minor_hash, plain.minor_hash);
+}
+#endif
+
+static struct kunit_case ext4_hash_test_cases[] = {
+ KUNIT_CASE(test_ext4fs_dirhash_vectors),
+ KUNIT_CASE(test_ext4fs_dirhash_seed_changes_result),
+ KUNIT_CASE(test_ext4fs_dirhash_invalid_version_returns_einval),
+ KUNIT_CASE(test_ext4fs_dirhash_siphash_without_key_returns_einval),
+ KUNIT_CASE(test_ext4fs_dirhash_signed_unsigned_differ_on_nonascii),
+#if IS_ENABLED(CONFIG_UNICODE)
+ KUNIT_CASE(test_ext4fs_dirhash_casefolded_names_hash_consistently),
+ KUNIT_CASE(test_ext4fs_dirhash_casefold_fallback),
+#endif
+ {}
+};
+
+static struct kunit_suite ext4_hash_test_suite = {
+ .name = "ext4_hash",
+ .test_cases = ext4_hash_test_cases,
+};
+
+kunit_test_suites(&ext4_hash_test_suite);
+
+MODULE_LICENSE("GPL");
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index 48483cd01..72645bd92 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -321,3 +321,7 @@ int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
#endif
return __ext4fs_dirhash(dir, name, len, hinfo);
}
+
+#if IS_ENABLED(CONFIG_EXT4_KUNIT_TESTS)
+EXPORT_SYMBOL_FOR_EXT4_TEST(ext4fs_dirhash);
+#endif
--
2.34.1
^ permalink raw reply related
* [PATCH v6 0/2] ext4: add hash Kunit tests and optimize str2hashbuf
From: Guan-Chun Wu @ 2026-05-31 8:00 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi
Cc: linux-ext4, linux-kernel, edward062254, visitorckw,
david.laight.linux, Guan-Chun Wu
This series adds Kunit tests for fs/ext4/hash.c and refactors
the str2hashbuf_{signed,unsigned}() helpers.
Patch 1 adds test coverage for ext4fs_dirhash(), including the main
hash variants and relevant edge cases.
Patch 2 simplifies the str2hashbuf helper implementation by processing
input in 4-byte chunks and removing function-pointer dispatch. This also
reduces overhead and shows roughly 2x improvement on longer inputs in
local testing.
Thanks,
Guan-Chun Wu
Link: https://lore.kernel.org/lkml/20260530155817.2311587-1-409411716@gms.tku.edu.tw/
---
v5 -> v6 :
- Fix a modpost undefined symbol error for ext4_cryptops
when building ext4-test.ko.
v4 -> v5 :
- Fix NULL pointer dereference and out-of-bounds read in SipHash tests.
- Use IS_ERR() instead of NULL check for utf8_load() error handling.
- Fix unicode_map memory leaks on assertion failures via kunit_add_action_or_reset().
- Avoid a UBSAN shift warning in str2hashbuf by casting signed char values
to __u32 before left-shifting them.
v3 -> v4 :
- Fix a modpost undefined symbol error for ext4fs_dirhash when building
ext4-test.ko.
v2 -> v3 :
- Added Kunit tests for fs/ext4/hash.c.
v1 -> v2:
- Drop redundant (int) casts.
- Replace indirect calls with simple conditionals.
- Use get_unaligned_be32() instead of manual byte extraction.
---
Guan-Chun Wu (2):
ext4: add Kunit coverage for directory hash computation
ext4: improve str2hashbuf by processing 4-byte chunks and removing
function pointers
fs/ext4/Makefile | 2 +-
fs/ext4/hash-test.c | 567 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/hash.c | 68 ++++--
3 files changed, 614 insertions(+), 23 deletions(-)
create mode 100644 fs/ext4/hash-test.c
--
2.34.1
^ permalink raw reply
* Re: [PATCH] jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
From: Deepanshu Kartikey @ 2026-05-31 1:49 UTC (permalink / raw)
To: tytso, jack; +Cc: linux-ext4, linux-kernel, syzbot+98f651460e558a21baae
In-Reply-To: <20260507050605.50081-1-kartikey406@gmail.com>
On Thu, May 7, 2026 at 10:36 AM Deepanshu Kartikey
<kartikey406@gmail.com> wrote:
>
> jbd2_journal_dirty_metadata() unconditionally dereferences
> handle->h_transaction at function entry to obtain the journal pointer:
>
> transaction_t *transaction = handle->h_transaction;
> journal_t *journal = transaction->t_journal;
>
> However, h_transaction may legitimately be NULL for an aborted handle.
> The is_handle_aborted() helper in include/linux/jbd2.h explicitly
> treats !h_transaction as one of the aborted states:
>
> if (handle->h_aborted || !handle->h_transaction)
> return 1;
>
> Every other entry point in fs/jbd2/transaction.c
> (jbd2_journal_get_{write,undo,create}_access, jbd2_journal_extend,
> jbd2_journal_restart, jbd2_journal_stop, etc.) guards against this
> with an is_handle_aborted() check before any dereference of
> h_transaction. jbd2_journal_dirty_metadata() was missing this guard.
>
> This is reachable from ocfs2's xattr code. ocfs2_xa_set() intentionally
> falls through to ocfs2_xa_journal_dirty() even after
> ocfs2_xa_prepare_entry() fails, on the assumption that the buffer
> needs to be journaled to record any partial modifications (see the
> comment above the out_dirty label in fs/ocfs2/xattr.c). If the failure
> was caused by the journal being aborted -- e.g. an underlying I/O
> error during a sub-operation such as __ocfs2_remove_xattr_range() --
> the handle's h_transaction has been cleared by the abort path, and
> the unconditional deref in jbd2_journal_dirty_metadata() becomes a
> NULL deref.
>
> Reproduced by syzbot with a crafted ocfs2 image where I/O against the
> loop device backing the mount is sabotaged via LOOP_SET_STATUS64
> between two setxattr() calls, causing the second setxattr (which
> truncates an external xattr value) to abort the journal mid-flight:
>
> Oops: general protection fault, probably for non-canonical
> address 0xdffffc0000000000
> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> RIP: jbd2_journal_dirty_metadata+0x4a/0xd30 fs/jbd2/transaction.c:1520
> Call Trace:
> ocfs2_journal_dirty+0x130/0x700 fs/ocfs2/journal.c:831
> ocfs2_xa_journal_dirty fs/ocfs2/xattr.c:1483 [inline]
> ocfs2_xa_set+0x15e3/0x2ec0 fs/ocfs2/xattr.c:2294
> ocfs2_xattr_block_set+0x3e0/0x33c0 fs/ocfs2/xattr.c:3016
> __ocfs2_xattr_set_handle+0x6b3/0xf50 fs/ocfs2/xattr.c:3418
> ocfs2_xattr_set+0xf3f/0x13e0 fs/ocfs2/xattr.c:3681
> __vfs_setxattr+0x43c/0x480 fs/xattr.c:218
> ...
>
> Fix by adding the standard is_handle_aborted() guard at the top of
> jbd2_journal_dirty_metadata() and returning -EROFS, matching the
> pattern used by every other entry point in this file.
> ocfs2_journal_dirty() already handles a non-zero return from
> jbd2_journal_dirty_metadata() correctly.
>
> Reported-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=98f651460e558a21baae
> Tested-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com
> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
> ---
> fs/jbd2/transaction.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 4885903bbd10..aa0be9e9c876 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1516,14 +1516,19 @@ void jbd2_buffer_abort_trigger(struct journal_head *jh,
> */
> int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
> {
> - transaction_t *transaction = handle->h_transaction;
> - journal_t *journal = transaction->t_journal;
> + transaction_t *transaction;
> + journal_t *journal;
> struct journal_head *jh;
> int ret = 0;
>
> + if (is_handle_aborted(handle))
> + return -EROFS;
> if (!buffer_jbd(bh))
> return -EUCLEAN;
>
> + transaction = handle->h_transaction;
> + journal = transaction->t_journal;
> +
> /*
> * We don't grab jh reference here since the buffer must be part
> * of the running transaction.
> --
> 2.43.0
>
Gentle Reminder. I want to know the status of this patch.
Let me know if anything is required from my side.
Thanks
Deepanshu
^ permalink raw reply
* [PATCH v5 2/2] ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
From: Guan-Chun Wu @ 2026-05-30 15:58 UTC (permalink / raw)
To: Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
Ojaswin Mujoo, Ritesh Harjani, Zhang Yi
Cc: linux-ext4, linux-kernel, edward062254, visitorckw,
david.laight.linux, Guan-Chun Wu
In-Reply-To: <20260530155817.2311587-1-409411716@gms.tku.edu.tw>
The original byte-by-byte implementation with modulo checks is less
efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
to process input in explicit 4-byte chunks instead of using a
modulus-based loop to emit words byte by byte.
Additionally, the use of function pointers for selecting the appropriate
str2hashbuf implementation has been removed. Instead, the functions are
directly invoked based on the hash type, eliminating the overhead of
dynamic function calls.
Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
runs, using kernel module for testing):
len | orig_s | new_s | orig_u | new_u
----+--------+-------+--------+-------
1 | 70 | 71 | 63 | 63
8 | 68 | 64 | 64 | 62
32 | 75 | 70 | 75 | 63
64 | 96 | 71 | 100 | 68
255 | 192 | 108 | 187 | 84
This change improves performance, especially for larger input sizes.
Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
---
fs/ext4/hash.c | 64 +++++++++++++++++++++++++++++++++-----------------
1 file changed, 42 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index 72645bd92..978bd92da 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -9,6 +9,7 @@
#include <linux/unicode.h>
#include <linux/compiler.h>
#include <linux/bitops.h>
+#include <linux/unaligned.h>
#include "ext4.h"
#define DELTA 0x9E3779B9
@@ -141,21 +142,28 @@ static void str2hashbuf_signed(const char *msg, int len, __u32 *buf, int num)
pad = (__u32)len | ((__u32)len << 8);
pad |= pad << 16;
- val = pad;
if (len > num*4)
len = num * 4;
- for (i = 0; i < len; i++) {
- val = ((int) scp[i]) + (val << 8);
- if ((i % 4) == 3) {
- *buf++ = val;
- val = pad;
- num--;
- }
+
+ while (len >= 4) {
+ val = ((__u32)scp[0] << 24) + ((__u32)scp[1] << 16) + ((__u32)scp[2] << 8) + scp[3];
+ *buf++ = val;
+ scp += 4;
+ len -= 4;
+ num--;
}
+
+ val = pad;
+
+ for (i = 0; i < len; i++)
+ val = scp[i] + (val << 8);
+
if (--num >= 0)
*buf++ = val;
+
while (--num >= 0)
*buf++ = pad;
+
}
static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
@@ -167,21 +175,28 @@ static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
pad = (__u32)len | ((__u32)len << 8);
pad |= pad << 16;
- val = pad;
if (len > num*4)
len = num * 4;
- for (i = 0; i < len; i++) {
- val = ((int) ucp[i]) + (val << 8);
- if ((i % 4) == 3) {
- *buf++ = val;
- val = pad;
- num--;
- }
+
+ while (len >= 4) {
+ val = get_unaligned_be32(ucp);
+ *buf++ = val;
+ ucp += 4;
+ len -= 4;
+ num--;
}
+
+ val = pad;
+
+ for (i = 0; i < len; i++)
+ val = ucp[i] + (val << 8);
+
if (--num >= 0)
*buf++ = val;
+
while (--num >= 0)
*buf++ = pad;
+
}
/*
@@ -205,8 +220,7 @@ static int __ext4fs_dirhash(const struct inode *dir, const char *name, int len,
const char *p;
int i;
__u32 in[8], buf[4];
- void (*str2hashbuf)(const char *, int, __u32 *, int) =
- str2hashbuf_signed;
+ bool use_unsigned = false;
/* Initialize the default seed for the hash checksum functions */
buf[0] = 0x67452301;
@@ -232,12 +246,15 @@ static int __ext4fs_dirhash(const struct inode *dir, const char *name, int len,
hash = dx_hack_hash_signed(name, len);
break;
case DX_HASH_HALF_MD4_UNSIGNED:
- str2hashbuf = str2hashbuf_unsigned;
+ use_unsigned = true;
fallthrough;
case DX_HASH_HALF_MD4:
p = name;
while (len > 0) {
- (*str2hashbuf)(p, len, in, 8);
+ if (use_unsigned)
+ str2hashbuf_unsigned(p, len, in, 8);
+ else
+ str2hashbuf_signed(p, len, in, 8);
half_md4_transform(buf, in);
len -= 32;
p += 32;
@@ -246,12 +263,15 @@ static int __ext4fs_dirhash(const struct inode *dir, const char *name, int len,
hash = buf[1];
break;
case DX_HASH_TEA_UNSIGNED:
- str2hashbuf = str2hashbuf_unsigned;
+ use_unsigned = true;
fallthrough;
case DX_HASH_TEA:
p = name;
while (len > 0) {
- (*str2hashbuf)(p, len, in, 4);
+ if (use_unsigned)
+ str2hashbuf_unsigned(p, len, in, 4);
+ else
+ str2hashbuf_signed(p, len, in, 4);
TEA_transform(buf, in);
len -= 16;
p += 16;
--
2.34.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox