Linux filesystem development
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Zhang Yi <yi.zhang@huaweicloud.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, libaokun@linux.alibaba.com,
	ojaswin@linux.ibm.com, ritesh.list@gmail.com, djwong@kernel.org,
	hch@infradead.org, yi.zhang@huawei.com, yizhang089@gmail.com,
	yangerkun@huawei.com, yukuai@fnnas.com
Subject: Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap
Date: Wed, 17 Jun 2026 06:56:47 -0400	[thread overview]
Message-ID: <ajJ9b91fTbM6qApg@bfoster> (raw)
In-Reply-To: <16cccb83-cdad-4113-8182-e8ea9e3049a2@huaweicloud.com>

On Wed, Jun 17, 2026 at 04:14:40PM +0800, Zhang Yi wrote:
> On 6/16/2026 8:28 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:34, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> >> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> >> infrastructure for ext4.
> >>
> >> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> >> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> >> out either a mapped partial block or a dirty, unwritten partial block.
> >>
> >> Important constraints:
> >>
> >> Zeroing out under an active journal handle can cause deadlock, because
> >> the order of acquiring the folio lock and starting a handle is
> >> inconsistent with the iomap writeback path.
> >>
> >> Therefore, ext4_iomap_block_zero_range():
> >> - Must NOT be called under an active handle.
> >> - Cannot rely on data=ordered mode to ensure zeroed data persistence
> >>   before updating i_disksize (for the cases of post-EOF append write,
> >>   post-EOF fallocate, and truncate up). In subsequent patches, we will
> >>   address this by synchronizing commit I/O but doesn't waiting for
> >>   completion, and updating i_disksize to i_size only after the zeroed
> >>   data has been written back.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 92 insertions(+)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index c6fe42d012fc..e0dae2501292 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >>  	return 0;
> >>  }
> >>  
> >> +static int ext4_iomap_zero_begin(struct inode *inode,
> >> +		loff_t offset, loff_t length, unsigned int flags,
> >> +		struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> +	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> > 
> > This looks like a layering violation to me. I don't think you can safely
> > assume the iomap you're passed is a part of iomap_iter...
> > 
> >> +	struct ext4_map_blocks map;
> >> +	u8 blkbits = inode->i_blkbits;
> >> +	unsigned int iomap_flags = 0;
> >> +	int ret;
> >> +
> >> +	ret = ext4_emergency_state(inode->i_sb);
> >> +	if (unlikely(ret))
> >> +		return ret;
> >> +
> >> +	if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> >> +		return -EINVAL;
> >> +
> >> +	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * Look up dirty folios for unwritten mappings within EOF. Providing
> >> +	 * this bypasses the flush iomap uses to trigger extent conversion
> >> +	 * when unwritten mappings have dirty pagecache in need of zeroing.
> >> +	 */
> >> +	if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> >> +		loff_t start = ((loff_t)map.m_lblk) << blkbits;
> >> +		loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> >> +
> >> +		iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> >> +		if ((start >> blkbits) < map.m_lblk + map.m_len)
> >> +			map.m_len = (start >> blkbits) - map.m_lblk;
> >> +	}
> > 
> > ... and you need access to iter only for this which seems to be really a
> > hack that's trying to outsmart the iomap code. I have to admit I don't
> > fully understand what you are trying to achieve here. Are you trying to
> > avoid flushing of the range that will be zeroed out?
> 
> This logic is copied from the XFS and iomap infrastructure. Its primary
> purpose is to optimize the zeroing operations on dirty written extents.
> It was introduced by Brian in [1].
> 
> The history as I understand it: originally, the iomap infrastructure
> could not zero dirty unwritten extents during zero range processing,
> which led to stale data exposure. XFS had to flush dirty ranges itself
> before zeroing — a workaround that was not generic.
> 
> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
> extents"), Brian added an unconditional flush in the iomap
> infrastructure, ensuring that by the time zeroing runs the extent has
> already been converted to written so the zero can proceed correctly.
> However, this flush was too heavy and introduced noticeable performance
> overhead.
> 
> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
> conditional on unwritten mappings"), which restricts flushing to only
> dirty pagecache over unwritten or hole mappings.
> 
> Brian later proposed a different approach: rather than relying on flush
> to convert the extent type, find dirty folios ahead of the zero range
> and zero the dirty unwritten extents directly. In [1] he added this
> lookup logic. The filesystem now supplies a folio batch (a collection of
> dirty folios) via the iomap begin callback, and zero range iterates over
> these dirty folios to perform zeroing. Clean regions not covered by the
> batch are simply skipped. This entirely eliminates the need to flush.
> 
> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@redhat.com/
> 
> If I understand correctly, the current approach is a compromise, and
> Brian is still working on this. Perhaps ext4 and XFS could work together
> on improvements in the future?
> 

I think that about covers it!

I do agree wrt to the iomap_iter thing in that it doesn't seem like the
most elegant thing. I considered that a bit of a roadblock when first
hacking on the batch stuff, but IIRC somebody pointed out that there was
precedent already so I didn't think too hard about it after that. Indeed
if you poke around, other filesystems use a similar pattern to access
iter->private for whatever private context is carried around.

FWIW, one of the longer term thoughts for the dirty folio stuff was to
eventually lift it out of the callback and just have iomap do it for the
fs. That would eliminate this particular pattern and probably clean
things up a bit, but there were also some other caveats with that that
aren't top of mind atm (IIRC, things like dealing with map trimming,
etc., but I haven't had a chance to think about it in a while).

Also note that this isn't necessarily a hard requirement. It's an
optional optimization. iomap will flush and retry in the dirty
pagecache+unwritten extent case if the fs hasn't otherwised provided
folios to make sure it zeroes properly, it's just that performance of
that may or may not be acceptable for your use case.

Brian

> > 
> >> +	ret = iomap_zero_range(inode, from, length, did_zero,
> >> +			       &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> >> +			       NULL);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	/*
> >> +	 * TODO: The iomap does not distinguish between different types of
> >> +	 * zeroing and always sets zero_written if a zeroing operation is
> >> +	 * performed, which may result in unnecessary order operations.
> >> +	 */
> > 
> > Is this still true after your fix to did_zero handling?
> 
> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
> operation has occurred through did_zero parameter, but it cannot
> distinguish whether the zeroed range is a written extent that already
> exists on disk. That is, even if the zeroing is performed on a delalloc
> extent, did_zero will still return true.
> 
> Thanks,
> Yi.
> 
> > 
> >> +	if (did_zero && zero_written)
> >> +		*zero_written = *did_zero;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  /*
> >>   * Zeros out a mapping of length 'length' starting from file offset
> >>   * 'from'.  The range to be zero'd must be contained with in one block.
> > 
> > 								Honza
> 


  parent reply	other threads:[~2026-06-17 10:57 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
2026-05-19  5:24   ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
2026-05-19  6:05   ` Ojaswin Mujoo
2026-06-16  9:31   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
2026-05-19  6:08   ` Ojaswin Mujoo
2026-06-16  9:36   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O Zhang Yi
2026-05-19  9:28   ` Ojaswin Mujoo
2026-05-19 12:35     ` Zhang Yi
2026-05-19 16:53       ` Ojaswin Mujoo
2026-05-20  2:49         ` Zhang Yi
2026-05-26 17:11           ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 05/23] ext4: implement buffered read path using iomap Zhang Yi
2026-05-19 10:00   ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
2026-05-19 10:02   ` Ojaswin Mujoo
2026-06-16 10:04   ` Jan Kara
2026-06-16 12:37     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
2026-05-19 10:41   ` Ojaswin Mujoo
2026-05-19 13:31     ` Ojaswin Mujoo
2026-05-20  8:18       ` Zhang Yi
2026-05-20 13:17         ` Ojaswin Mujoo
2026-06-16 10:01   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
2026-05-26 17:10   ` Ojaswin Mujoo
2026-05-28 15:40     ` Darrick J. Wong
2026-06-02  7:02       ` Ojaswin Mujoo
2026-05-29  9:13     ` Zhang Yi
2026-06-02 10:05       ` Ojaswin Mujoo
2026-06-03  1:44         ` Zhang Yi
2026-06-02 10:26   ` Ojaswin Mujoo
2026-06-03  2:56     ` Zhang Yi
2026-06-03 11:08       ` Ojaswin Mujoo
2026-06-16 10:45   ` Jan Kara
2026-06-16 14:42     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
2026-05-27 12:49   ` Ojaswin Mujoo
2026-05-29 14:12     ` Zhang Yi
2026-05-29 15:32       ` Ojaswin Mujoo
2026-05-30  1:21         ` Zhang Yi
2026-06-01  6:26           ` Ojaswin Mujoo
2026-06-16 11:47   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
2026-05-27 12:56   ` Ojaswin Mujoo
2026-06-16 11:56   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 11/23] iomap: correct the range of a partial dirty clear Zhang Yi
2026-05-11  7:46   ` Christoph Hellwig
2026-05-11  8:57     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 12/23] iomap: support invalidating partial folios Zhang Yi
2026-05-11  7:23 ` [PATCH v4 13/23] iomap: fix incorrect did_zero setting in iomap_zero_iter() Zhang Yi
2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
2026-05-27 13:13   ` Ojaswin Mujoo
2026-06-16 12:28   ` Jan Kara
2026-06-17  8:14     ` Zhang Yi
2026-06-17 10:50       ` Jan Kara
2026-06-17 13:00         ` Zhang Yi
2026-06-17 10:56       ` Brian Foster [this message]
2026-06-17 13:22         ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
2026-05-27 13:14   ` Ojaswin Mujoo
2026-06-16 12:29   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
2026-05-27 13:14   ` Ojaswin Mujoo
2026-06-16 12:30   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the " Zhang Yi
2026-05-27 13:41   ` Ojaswin Mujoo
2026-05-30  2:53     ` Zhang Yi
2026-06-01  9:08       ` Ojaswin Mujoo
2026-06-01 12:22         ` Zhang Yi
2026-06-01 18:15           ` Ojaswin Mujoo
2026-06-02  3:36             ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 18/23] ext4: wait for ordered I/O " Zhang Yi
2026-05-27 15:58   ` Ojaswin Mujoo
2026-05-28 13:34     ` Ojaswin Mujoo
2026-05-30  9:32       ` Zhang Yi
2026-06-02  5:56         ` Ojaswin Mujoo
2026-05-30  7:22     ` Zhang Yi
2026-05-30  8:24       ` Zhang Yi
2026-06-01 18:33         ` Ojaswin Mujoo
2026-06-02  3:22           ` Zhang Yi
2026-06-02  5:35             ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 19/23] ext4: update i_disksize to i_size on ordered I/O completion Zhang Yi
2026-05-11  7:23 ` [PATCH v4 20/23] ext4: wait for ordered I/O to complete during insert and collapse range Zhang Yi
2026-05-11  7:23 ` [PATCH v4 21/23] ext4: add tracepoints for ordered I/O in the iomap buffered I/O path Zhang Yi
2026-05-11  7:23 ` [PATCH v4 22/23] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2026-05-11  7:23 ` [PATCH v4 23/23] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajJ9b91fTbM6qApg@bfoster \
    --to=bfoster@redhat.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=libaokun@linux.alibaba.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    --cc=yi.zhang@huaweicloud.com \
    --cc=yizhang089@gmail.com \
    --cc=yukuai@fnnas.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox