From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2AB7C40313F; Thu, 28 May 2026 15:41:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779982863; cv=none; b=iBPUEwEaCw1KJiAN57wJ/DT5eNFCW4P4iM0SGNeuTAu5pOPD7r8yEaQ7R3U0lVXDEZmywBgM+LbnuhIuJHza4ssB2QDedZkrV0RSP7Qq0h0MnD+D9hF/rXVciNuiBNRcM+EsK+n3LbQPV8uAIQmCWKcSWhX2ikXSXrnTjnsKdNk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779982863; c=relaxed/simple; bh=fkBbUUdQjyF+ilFXIgXUvx3eJaWsh11P5nJj2Dx8NSw=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=tvFw5Npcx5ujFRSqsRQIHVHpj3C9MhDQ/Vc/Ez+haFrq7Nz0DLUrLfI5cQYQDVg0sbr98ezMUFWhMsEOhq/6QeoDNH+/485TY1JjCg+UjFQUySsXHNnn9ouQtsrD+9C99q2XD6/vDdqdLeEVX0C3OH1NBNVLkIGZM23hsuRCBR4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=dSuxVwo8; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="dSuxVwo8" Received: by smtp.kernel.org (Postfix) with UTF8SMTPSA id 850451F000E9; Thu, 28 May 2026 15:41:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779982860; bh=MhKqwRITx8IR6NPcV8TYbGDjahLMKlSiFf5808fdrM4=; h=Date:From:To:Cc:Subject:References:In-Reply-To; b=dSuxVwo8tV8R3DuizazWiU3IhGOPzax29nBskVUbmNCu3WE8dwBf/ZAdYV1O7AEkj wSYJKlorfIXhHQd80GQWauRr8Vc9GZuoc65FemEcMiiQNtgV2ERKIQLoRcy4AeKMEy pzJuInnHogLhpGqlBIat7b9cZLXzUtcLrxEGzy100JX0QFdQhUgGOYh4XG84QsQWcG oNPMr1CiLFsgKI6aqthXZYWvxyYjOjcgsgh2Sy4VjIZmmzAfw3h1e+c4KcS+qfhOjM 0FYLZnCqu7ipuXZ1l6EnH88/1UbblLefPxwJDjHeMtUj+7BES46nzc5Sp8z3lPkyjO BSOLRnreG1gbQ== Date: Thu, 28 May 2026 08:40:59 -0700 From: "Darrick J. Wong" To: Ojaswin Mujoo Cc: Zhang Yi , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, libaokun@linux.alibaba.com, jack@suse.cz, ritesh.list@gmail.com, hch@infradead.org, yi.zhang@huawei.com, yizhang089@gmail.com, yangerkun@huawei.com, yukuai@fnnas.com Subject: Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap Message-ID: <20260528154059.GA6054@frogsfrogsfrogs> References: <20260511072344.191271-1-yi.zhang@huaweicloud.com> <20260511072344.191271-9-yi.zhang@huaweicloud.com> Precedence: bulk X-Mailing-List: linux-ext4@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Tue, May 26, 2026 at 10:40:30PM +0530, Ojaswin Mujoo wrote: > On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote: > > From: Zhang Yi > > > > Introduce two new iomap_ops instances for ext4 buffered writes: > > > > - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using > > ext4_da_map_blocks() to map delalloc extents. > > - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using > > ext4_iomap_get_blocks() to directly allocate blocks. > > > > Also add ext4_iomap_valid() for the iomap infrastructure to check extent > > validity. > > > > Key changes and considerations: > > > > - Unwritten extents for new blocks (dioread_nolock always on) > > Since data=ordered mode is not used to prevent stale data exposure in > > the non-delayed allocation path, new blocks are always allocated as > > unwritten extents. > > Okay makes sense. > > > > > - Short write and write failure handling > > a. Delalloc path: On short write or failure, the stale delalloc range > > must be dropped and its space reservation released. Otherwise, a > > clean folio may cover leftover delalloc extents, causing > > inaccurate space reservation accounting. > > Hmm, okay so in the usual buffer head path, seems like during a short > write we still zero the new buffers we couldn't write and keep it dirty > (folio_zero_new_buffers()). This way they are still written back and > the delalloc reservations are used up. > > However in iomap we don't mark the range that we couldnt write as dirty > so we need to make sure we clear up the stale delalloc mappings. Is this > correct? Yes, that's true of iomap's pagecache handling. --D > Regards, > Ojaswin > > > b. Non-delalloc path: No cleanup of allocated blocks is needed on > > short write. > > > > - Lock ordering reversal > > The folio lock and transaction start ordering is reversed compared to > > the buffer_head buffered write path. To handle this, the journal > > handle must be stopped in iomap_begin() callbacks. The lock ordering > > documentation in super.c has been updated accordingly. > > > > Signed-off-by: Zhang Yi > > --- > > fs/ext4/ext4.h | 4 ++ > > fs/ext4/file.c | 20 +++++- > > fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++- > > fs/ext4/super.c | 10 ++- > > 4 files changed, 192 insertions(+), 7 deletions(-) > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > index 1e27d73d7427..4832e7f7db82 100644 > > --- a/fs/ext4/ext4.h > > +++ b/fs/ext4/ext4.h > > @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle, > > int do_journal_get_write_access(handle_t *handle, struct inode *inode, > > struct buffer_head *bh); > > void ext4_set_inode_mapping_order(struct inode *inode); > > +int ext4_nonda_switch(struct super_block *sb); > > #define FALL_BACK_TO_NONDELALLOC 1 > > #define CONVERT_INLINE_DATA 2 > > > > @@ -3926,6 +3927,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end) > > > > extern const struct iomap_ops ext4_iomap_ops; > > extern const struct iomap_ops ext4_iomap_report_ops; > > +extern const struct iomap_ops ext4_iomap_buffered_write_ops; > > +extern const struct iomap_ops ext4_iomap_buffered_da_write_ops; > > +extern const struct iomap_write_ops ext4_iomap_write_ops; > > > > static inline int ext4_buffer_uptodate(struct buffer_head *bh) > > { > > diff --git a/fs/ext4/file.c b/fs/ext4/file.c > > index eb1a323962b1..7f9bfbbc4a4e 100644 > > --- a/fs/ext4/file.c > > +++ b/fs/ext4/file.c > > @@ -299,6 +299,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from) > > return count; > > } > > > > +static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb, > > + struct iov_iter *from) > > +{ > > + struct inode *inode = file_inode(iocb->ki_filp); > > + const struct iomap_ops *iomap_ops; > > + > > + if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb)) > > + iomap_ops = &ext4_iomap_buffered_da_write_ops; > > + else > > + iomap_ops = &ext4_iomap_buffered_write_ops; > > + > > + return iomap_file_buffered_write(iocb, from, iomap_ops, > > + &ext4_iomap_write_ops, NULL); > > +} > > + > > static ssize_t ext4_buffered_write_iter(struct kiocb *iocb, > > struct iov_iter *from) > > { > > @@ -313,7 +328,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb, > > if (ret <= 0) > > goto out; > > > > - ret = generic_perform_write(iocb, from); > > + if (ext4_inode_buffered_iomap(inode)) > > + ret = ext4_iomap_buffered_write(iocb, from); > > + else > > + ret = generic_perform_write(iocb, from); > > > > out: > > inode_unlock(inode); > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 39577a6b65b9..1ae7d3f4a1c8 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -3097,7 +3097,7 @@ static int ext4_dax_writepages(struct address_space *mapping, > > return ret; > > } > > > > -static int ext4_nonda_switch(struct super_block *sb) > > +int ext4_nonda_switch(struct super_block *sb) > > { > > s64 free_clusters, dirty_clusters; > > struct ext4_sb_info *sbi = EXT4_SB(sb); > > @@ -3467,6 +3467,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode) > > return inode_state_read_once(inode) & I_DIRTY_DATASYNC; > > } > > > > +static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap) > > +{ > > + return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq); > > +} > > + > > +const struct iomap_write_ops ext4_iomap_write_ops = { > > + .iomap_valid = ext4_iomap_valid, > > +}; > > + > > static void ext4_set_iomap(struct inode *inode, struct iomap *iomap, > > struct ext4_map_blocks *map, loff_t offset, > > loff_t length, unsigned int flags) > > @@ -3501,6 +3510,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap, > > !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > > iomap->flags |= IOMAP_F_MERGED; > > > > + iomap->validity_cookie = map->m_seq; > > + > > /* > > * Flags passed to ext4_map_blocks() for direct I/O writes can result > > * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits > > @@ -3908,8 +3919,12 @@ const struct iomap_ops ext4_iomap_report_ops = { > > .iomap_begin = ext4_iomap_begin_report, > > }; > > > > +/* Map blocks */ > > +typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *); > > + > > static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset, > > - loff_t length, struct ext4_map_blocks *map) > > + loff_t length, ext4_get_blocks_t get_blocks, > > + struct ext4_map_blocks *map) > > { > > u8 blkbits = inode->i_blkbits; > > > > @@ -3921,6 +3936,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset, > > map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits, > > EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1; > > > > + if (get_blocks) > > + return get_blocks(inode, map); > > + > > return ext4_map_blocks(NULL, inode, map, 0); > > } > > > > @@ -3938,7 +3956,7 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset, > > if (WARN_ON_ONCE(ext4_has_inline_data(inode))) > > return -ERANGE; > > > > - ret = ext4_iomap_map_blocks(inode, offset, length, &map); > > + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map); > > if (ret < 0) > > return ret; > > > > @@ -3946,6 +3964,147 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset, > > return 0; > > } > > > > +static int ext4_iomap_get_blocks(struct inode *inode, > > + struct ext4_map_blocks *map) > > +{ > > + loff_t i_size = i_size_read(inode); > > + handle_t *handle; > > + int ret; > > + > > + /* > > + * Check if the blocks have already been allocated, this could > > + * avoid initiating a new journal transaction and return the > > + * mapping information directly. > > + */ > > + if ((map->m_lblk + map->m_len) <= > > + round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) { > > + ret = ext4_map_blocks(NULL, inode, map, 0); > > + if (ret < 0) > > + return ret; > > + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN | > > + EXT4_MAP_DELAYED)) > > + return 0; > > + } > > + > > + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, > > + ext4_chunk_trans_blocks(inode, map->m_len)); > > + if (IS_ERR(handle)) > > + return PTR_ERR(handle); > > + > > + ret = ext4_map_blocks(handle, inode, map, > > + EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT); > > + /* > > + * Stop handle here following the lock ordering of the folio lock > > + * and the transaction start. > > + */ > > + ext4_journal_stop(handle); > > + > > + return ret; > > +} > > + > > +static int ext4_iomap_buffered_do_write_begin(struct inode *inode, > > + loff_t offset, loff_t length, unsigned int flags, > > + struct iomap *iomap, struct iomap *srcmap, bool delalloc) > > +{ > > + int ret, retries = 0; > > + struct ext4_map_blocks map; > > + ext4_get_blocks_t *get_blocks; > > + > > + ret = ext4_emergency_state(inode->i_sb); > > + if (unlikely(ret)) > > + return ret; > > + > > + /* Inline data and non-extent are not supported. */ > > + if (WARN_ON_ONCE(ext4_has_inline_data(inode))) > > + return -ERANGE; > > + if (WARN_ON_ONCE(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) > > + return -EINVAL; > > + if (WARN_ON_ONCE(!(flags & IOMAP_WRITE))) > > + return -EINVAL; > > + > > + if (delalloc) > > + get_blocks = ext4_da_map_blocks; > > + else > > + get_blocks = ext4_iomap_get_blocks; > > +retry: > > + ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map); > > + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries)) > > + goto retry; > > + if (ret < 0) > > + return ret; > > + > > + ext4_set_iomap(inode, iomap, &map, offset, length, flags); > > + return 0; > > +} > > + > > +static int ext4_iomap_buffered_write_begin(struct inode *inode, > > + loff_t offset, loff_t length, unsigned int flags, > > + struct iomap *iomap, struct iomap *srcmap) > > +{ > > + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags, > > + iomap, srcmap, false); > > +} > > + > > +static int ext4_iomap_buffered_da_write_begin(struct inode *inode, > > + loff_t offset, loff_t length, unsigned int flags, > > + struct iomap *iomap, struct iomap *srcmap) > > +{ > > + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags, > > + iomap, srcmap, true); > > +} > > + > > +/* > > + * On write failure, drop the stale delayed allocation range and release > > + * its reserved space for both start and end blocks. Otherwise, we may > > + * leave a range of delayed extents covered by a clean folio, which can > > + * result in inaccurate space reservation accounting. > > + */ > > +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset, > > + loff_t length, struct iomap *iomap) > > +{ > > + down_write(&EXT4_I(inode)->i_data_sem); > > + ext4_es_remove_extent(inode, offset >> inode->i_blkbits, > > + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb))); > > + up_write(&EXT4_I(inode)->i_data_sem); > > +} > > + > > +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, > > + loff_t length, ssize_t written, > > + unsigned int flags, > > + struct iomap *iomap) > > +{ > > + loff_t start_byte, end_byte; > > + > > + /* If we didn't reserve the blocks, we're not allowed to punch them. */ > > + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW)) > > + return 0; > > + > > + /* Nothing to do if we've written the entire delalloc extent */ > > + start_byte = iomap_last_written_block(inode, offset, written); > > + end_byte = round_up(offset + length, i_blocksize(inode)); > > + if (start_byte >= end_byte) > > + return 0; > > + > > + filemap_invalidate_lock(inode->i_mapping); > > + iomap_write_delalloc_release(inode, start_byte, end_byte, flags, > > + iomap, ext4_iomap_punch_delalloc); > > + filemap_invalidate_unlock(inode->i_mapping); > > + return 0; > > +} > > + > > +/* > > + * Since we always allocate unwritten extents, there is no need for > > + * iomap_end to clean up allocated blocks on a short write. > > + */ > > +const struct iomap_ops ext4_iomap_buffered_write_ops = { > > + .iomap_begin = ext4_iomap_buffered_write_begin, > > +}; > > + > > +const struct iomap_ops ext4_iomap_buffered_da_write_ops = { > > + .iomap_begin = ext4_iomap_buffered_da_write_begin, > > + .iomap_end = ext4_iomap_buffered_da_write_end, > > +}; > > + > > const struct iomap_ops ext4_iomap_buffered_read_ops = { > > .iomap_begin = ext4_iomap_buffered_read_begin, > > }; > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > > index 6a77db4d3124..9bc294b769db 100644 > > --- a/fs/ext4/super.c > > +++ b/fs/ext4/super.c > > @@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[]; > > * -> page lock -> i_data_sem (rw) > > * > > * buffered write path: > > - * sb_start_write -> i_mutex -> mmap_lock > > - * sb_start_write -> i_mutex -> transaction start -> page lock -> > > - * i_data_sem (rw) > > + * sb_start_write -> i_rwsem (w) -> mmap_lock > > + * - buffer_head path: > > + * sb_start_write -> i_rwsem (w) -> transaction start -> folio lock -> > > + * i_data_sem (rw) > > + * - iomap path: > > + * sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw) > > + * sb_start_write -> i_rwsem (w) -> folio lock (not under an active handle) > > * > > * truncate: > > * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) -> > > -- > > 2.52.0 > > >