* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing [not found] ` <20250715052259.GO2672049@frogsfrogsfrogs> @ 2025-07-18 11:30 ` Zhang Yi 2025-07-18 13:48 ` Brian Foster 0 siblings, 1 reply; 13+ messages in thread From: Zhang Yi @ 2025-07-18 11:30 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/7/15 13:22, Darrick J. Wong wrote: > On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >> The only way zero range can currently process unwritten mappings >> with dirty pagecache is to check whether the range is dirty before >> mapping lookup and then flush when at least one underlying mapping >> is unwritten. This ordering is required to prevent iomap lookup from >> racing with folio writeback and reclaim. >> >> Since zero range can skip ranges of unwritten mappings that are >> clean in cache, this operation can be improved by allowing the >> filesystem to provide a set of dirty folios that require zeroing. In >> turn, rather than flush or iterate file offsets, zero range can >> iterate on folios in the batch and advance over clean or uncached >> ranges in between. >> >> Add a folio_batch in struct iomap and provide a helper for fs' to > > /me confused by the single quote; is this supposed to read: > > "...for the fs to populate..."? > > Either way the code changes look like a reasonable thing to do for the > pagecache (try to grab a bunch of dirty folios while XFS holds the > mapping lock) so > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > > --D > > >> populate the batch at lookup time. Update the folio lookup path to >> return the next folio in the batch, if provided, and advance the >> iter if the folio starts beyond the current offset. >> >> Signed-off-by: Brian Foster <bfoster@redhat.com> >> Reviewed-by: Christoph Hellwig <hch@lst.de> >> --- >> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >> fs/iomap/iter.c | 6 +++ >> include/linux/iomap.h | 4 ++ >> 3 files changed, 94 insertions(+), 5 deletions(-) >> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >> index 38da2fa6e6b0..194e3cc0857f 100644 >> --- a/fs/iomap/buffered-io.c >> +++ b/fs/iomap/buffered-io.c [...] >> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >> return status; >> } >> >> +loff_t >> +iomap_fill_dirty_folios( >> + struct iomap_iter *iter, >> + loff_t offset, >> + loff_t length) >> +{ >> + struct address_space *mapping = iter->inode->i_mapping; >> + pgoff_t start = offset >> PAGE_SHIFT; >> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >> + >> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >> + if (!iter->fbatch) Hi, Brian! I think ext4 needs to be aware of this failure after it converts to use iomap infrastructure. It is because if we fail to add dirty folios to the fbatch, iomap_zero_range() will flush those unwritten and dirty range. This could potentially lead to a deadlock, as most calls to ext4_block_zero_page_range() occur under an active journal handle. Writeback operations under an active journal handle may result in circular waiting within journal transactions. So please return this error code, and then ext4 can interrupt zero operations to prevent deadlock. Thanks, Yi. 
>> + return offset + length; >> + folio_batch_init(iter->fbatch); >> + >> + filemap_get_folios_dirty(mapping, &start, end, iter->fbatch); >> + return (start << PAGE_SHIFT); >> +} >> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios); ^ permalink raw reply [flat|nested] 13+ messages in thread
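For reference, a minimal sketch of the interface change being requested above -- purely illustrative, not part of the posted series -- would surface the allocation failure to the caller and hand the batch end offset back through a pointer, so a filesystem that cannot tolerate the flush fallback can abort the zeroing operation:

/*
 * Illustrative sketch only. The signature and error convention are
 * assumptions; the posted helper returns the end offset directly and
 * falls back silently on allocation failure.
 */
int
iomap_fill_dirty_folios(
	struct iomap_iter	*iter,
	loff_t			offset,
	loff_t			length,
	loff_t			*batch_end)
{
	struct address_space *mapping = iter->inode->i_mapping;
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;

	iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL);
	if (!iter->fbatch) {
		*batch_end = offset + length;
		return -ENOMEM;
	}
	folio_batch_init(iter->fbatch);

	filemap_get_folios_dirty(mapping, &start, end, iter->fbatch);
	*batch_end = (loff_t)start << PAGE_SHIFT;
	return 0;
}

With this shape, the filesystem's ->iomap_begin could propagate -ENOMEM and fail the higher-level operation rather than letting iomap_zero_range() fall back to a flush.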
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-18 11:30 ` [PATCH v3 3/7] iomap: optional zero range dirty folio processing Zhang Yi @ 2025-07-18 13:48 ` Brian Foster 2025-07-19 11:07 ` Zhang Yi 0 siblings, 1 reply; 13+ messages in thread From: Brian Foster @ 2025-07-18 13:48 UTC (permalink / raw) To: Zhang Yi Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: > On 2025/7/15 13:22, Darrick J. Wong wrote: > > On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: > >> The only way zero range can currently process unwritten mappings > >> with dirty pagecache is to check whether the range is dirty before > >> mapping lookup and then flush when at least one underlying mapping > >> is unwritten. This ordering is required to prevent iomap lookup from > >> racing with folio writeback and reclaim. > >> > >> Since zero range can skip ranges of unwritten mappings that are > >> clean in cache, this operation can be improved by allowing the > >> filesystem to provide a set of dirty folios that require zeroing. In > >> turn, rather than flush or iterate file offsets, zero range can > >> iterate on folios in the batch and advance over clean or uncached > >> ranges in between. > >> > >> Add a folio_batch in struct iomap and provide a helper for fs' to > > > > /me confused by the single quote; is this supposed to read: > > > > "...for the fs to populate..."? > > > > Either way the code changes look like a reasonable thing to do for the > > pagecache (try to grab a bunch of dirty folios while XFS holds the > > mapping lock) so > > > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > > > > --D > > > > > >> populate the batch at lookup time. Update the folio lookup path to > >> return the next folio in the batch, if provided, and advance the > >> iter if the folio starts beyond the current offset. > >> > >> Signed-off-by: Brian Foster <bfoster@redhat.com> > >> Reviewed-by: Christoph Hellwig <hch@lst.de> > >> --- > >> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- > >> fs/iomap/iter.c | 6 +++ > >> include/linux/iomap.h | 4 ++ > >> 3 files changed, 94 insertions(+), 5 deletions(-) > >> > >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >> index 38da2fa6e6b0..194e3cc0857f 100644 > >> --- a/fs/iomap/buffered-io.c > >> +++ b/fs/iomap/buffered-io.c > [...] > >> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > >> return status; > >> } > >> > >> +loff_t > >> +iomap_fill_dirty_folios( > >> + struct iomap_iter *iter, > >> + loff_t offset, > >> + loff_t length) > >> +{ > >> + struct address_space *mapping = iter->inode->i_mapping; > >> + pgoff_t start = offset >> PAGE_SHIFT; > >> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; > >> + > >> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); > >> + if (!iter->fbatch) > > Hi, Brian! > > I think ext4 needs to be aware of this failure after it converts to use > iomap infrastructure. It is because if we fail to add dirty folios to the > fbatch, iomap_zero_range() will flush those unwritten and dirty range. > This could potentially lead to a deadlock, as most calls to > ext4_block_zero_page_range() occur under an active journal handle. > Writeback operations under an active journal handle may result in circular > waiting within journal transactions. 
So please return this error code, and > then ext4 can interrupt zero operations to prevent deadlock. > Hi Yi, Thanks for looking at this. Huh.. so the reason for falling back like this here is just that this was considered an optional optimization, with the flush in iomap_zero_range() being default fallback behavior. IIUC, what you're saying means that the current zero range behavior without this series is problematic for ext4-on-iomap..? If so, have you observed issues you can share details about? FWIW, I think your suggestion is reasonable, but I'm also curious what the error handling would look like in ext4. Do you expect to the fail the higher level operation, for example? Cycle locks and retry, etc.? The reason I ask is because the folio_batch handling has come up through discussions on this series. My position so far has been to keep it as a separate allocation and to keep things simple since it is currently isolated to zero range, but that may change if the usage spills over to other operations (which seems expected at this point). I suspect that if a filesystem actually depends on this for correct behavior, that is another data point worth considering on that topic. So that has me wondering if it would be better/easier here to perhaps embed the batch in iomap_iter, or maybe as an incremental step put it on the stack in iomap_zero_range() and initialize the iomap_iter pointer there instead of doing the dynamic allocation (then the fill helper would set a flag to indicate the fs did pagecache lookup). Thoughts on something like that? Also IIUC ext4-on-iomap is still a WIP and review on this series seems to have mostly wound down. Any objection if the fix for that comes along as a followup patch rather than a rework of this series? Brian P.S., I'm heading on vacation so it will likely be a week or two before I follow up from here, JFYI. > Thanks, > Yi. > > >> + return offset + length; > >> + folio_batch_init(iter->fbatch); > >> + > >> + filemap_get_folios_dirty(mapping, &start, end, iter->fbatch); > >> + return (start << PAGE_SHIFT); > >> +} > >> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios); > ^ permalink raw reply [flat|nested] 13+ messages in thread
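A rough sketch of the "no dynamic allocation" idea floated above -- assuming a folio_batch embedded in struct iomap_iter (or placed on the stack in iomap_zero_range()) plus a hypothetical fbatch_filled flag; none of this is from the posted series:

/*
 * Sketch only: iter->fbatch is assumed to be an embedded struct folio_batch
 * and iter->fbatch_filled a flag telling the zeroing loop that the fs
 * performed the pagecache lookup. With no allocation, the helper cannot
 * fail, which sidesteps the fallback-flush concern entirely.
 */
loff_t
iomap_fill_dirty_folios(
	struct iomap_iter	*iter,
	loff_t			offset,
	loff_t			length)
{
	struct address_space *mapping = iter->inode->i_mapping;
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + length - 1) >> PAGE_SHIFT;

	folio_batch_init(&iter->fbatch);
	iter->fbatch_filled = true;

	filemap_get_folios_dirty(mapping, &start, end, &iter->fbatch);
	return (loff_t)start << PAGE_SHIFT;
}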
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-18 13:48 ` Brian Foster @ 2025-07-19 11:07 ` Zhang Yi 2025-07-21 8:47 ` Zhang Yi 2025-07-30 13:17 ` Brian Foster 0 siblings, 2 replies; 13+ messages in thread From: Zhang Yi @ 2025-07-19 11:07 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/7/18 21:48, Brian Foster wrote: > On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >> On 2025/7/15 13:22, Darrick J. Wong wrote: >>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>> The only way zero range can currently process unwritten mappings >>>> with dirty pagecache is to check whether the range is dirty before >>>> mapping lookup and then flush when at least one underlying mapping >>>> is unwritten. This ordering is required to prevent iomap lookup from >>>> racing with folio writeback and reclaim. >>>> >>>> Since zero range can skip ranges of unwritten mappings that are >>>> clean in cache, this operation can be improved by allowing the >>>> filesystem to provide a set of dirty folios that require zeroing. In >>>> turn, rather than flush or iterate file offsets, zero range can >>>> iterate on folios in the batch and advance over clean or uncached >>>> ranges in between. >>>> >>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>> >>> /me confused by the single quote; is this supposed to read: >>> >>> "...for the fs to populate..."? >>> >>> Either way the code changes look like a reasonable thing to do for the >>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>> mapping lock) so >>> >>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>> >>> --D >>> >>> >>>> populate the batch at lookup time. Update the folio lookup path to >>>> return the next folio in the batch, if provided, and advance the >>>> iter if the folio starts beyond the current offset. >>>> >>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>> --- >>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>> fs/iomap/iter.c | 6 +++ >>>> include/linux/iomap.h | 4 ++ >>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>> --- a/fs/iomap/buffered-io.c >>>> +++ b/fs/iomap/buffered-io.c >> [...] >>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>> return status; >>>> } >>>> >>>> +loff_t >>>> +iomap_fill_dirty_folios( >>>> + struct iomap_iter *iter, >>>> + loff_t offset, >>>> + loff_t length) >>>> +{ >>>> + struct address_space *mapping = iter->inode->i_mapping; >>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>> + >>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>> + if (!iter->fbatch) >> >> Hi, Brian! >> >> I think ext4 needs to be aware of this failure after it converts to use >> iomap infrastructure. It is because if we fail to add dirty folios to the >> fbatch, iomap_zero_range() will flush those unwritten and dirty range. >> This could potentially lead to a deadlock, as most calls to >> ext4_block_zero_page_range() occur under an active journal handle. >> Writeback operations under an active journal handle may result in circular >> waiting within journal transactions. 
So please return this error code, and >> then ext4 can interrupt zero operations to prevent deadlock. >> > > Hi Yi, > > Thanks for looking at this. > > Huh.. so the reason for falling back like this here is just that this > was considered an optional optimization, with the flush in > iomap_zero_range() being default fallback behavior. IIUC, what you're > saying means that the current zero range behavior without this series is > problematic for ext4-on-iomap..? Yes. > If so, have you observed issues you can share details about? Sure. Before delving into the specific details of this issue, I would like to provide some background information on the rule that ext4 cannot wait for writeback in an active journal handle. If you are aware of this background, please skip this paragraph. During ext4 writing back the page cache, it may start a new journal handle to allocate blocks, update the disksize, and convert unwritten extents after the I/O is completed. When starting this new journal handle, if the current running journal transaction is in the process of being submitted or if the journal space is insufficient, it must wait for the ongoing transaction to be completed, but the prerequisite for this is that all currently running handles must be terminated. However, if we flush the page cache under an active journal handle, we cannot stop it, which may lead to a deadlock. Now, the issue I have observed occurs when I attempt to use iomap_zero_range() within ext4_block_zero_page_range(). My current implementation are below(based on the latest fs-next). diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 28547663e4fd..1a21667f3f7c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, return 0; } +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, + loff_t length, unsigned int flags, struct iomap *iomap, + struct iomap *srcmap) +{ + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); + struct ext4_map_blocks map; + u8 blkbits = inode->i_blkbits; + int ret; + + ret = ext4_emergency_state(inode->i_sb); + if (unlikely(ret)) + return ret; + + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) + return -EINVAL; + + /* Calculate the first and last logical blocks respectively. */ + map.m_lblk = offset >> blkbits; + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; + + ret = ext4_map_blocks(NULL, inode, &map, 0); + if (ret < 0) + return ret; + + /* + * Look up dirty folios for unwritten mappings within EOF. Providing + * this bypasses the flush iomap uses to trigger extent conversion + * when unwritten mappings have dirty pagecache in need of zeroing. 
+ */ + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { + loff_t end; + + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, + map.m_len << blkbits); + if ((end >> blkbits) < map.m_lblk + map.m_len) + map.m_len = (end >> blkbits) - map.m_lblk; + } + + ext4_set_iomap(inode, iomap, &map, offset, length, flags); + return 0; +} + +const struct iomap_ops ext4_iomap_buffered_zero_ops = { + .iomap_begin = ext4_iomap_buffered_zero_begin, +}; const struct iomap_ops ext4_iomap_buffered_write_ops = { .iomap_begin = ext4_iomap_buffered_write_begin, @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, return err; } +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, + loff_t length) +{ + WARN_ON_ONCE(!inode_is_locked(inode) && + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); + + return iomap_zero_range(inode, from, length, NULL, + &ext4_iomap_buffered_zero_ops, + &ext4_iomap_write_ops, NULL); +} + /* * ext4_block_zero_page_range() zeros out a mapping of length 'length' * starting from file offset 'from'. The range to be zero'd must @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, if (IS_DAX(inode)) { return dax_zero_range(inode, from, length, NULL, &ext4_iomap_ops); + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { + return ext4_iomap_zero_range(inode, from, length); } return __ext4_block_zero_page_range(handle, mapping, from, length); } The problem is most calls to ext4_block_zero_page_range() occur under an active journal handle, so I can reproduce the deadlock issue easily without this series. > > FWIW, I think your suggestion is reasonable, but I'm also curious what > the error handling would look like in ext4. Do you expect to the fail > the higher level operation, for example? Cycle locks and retry, etc.? Originally, I wanted ext4_block_zero_page_range() to return a failure to the higher level operation. However, unfortunately, after my testing today, I discovered that even though we implement this, this series still cannot resolve the issue. The corner case is: Assume we have a dirty folio covers both hole and unwritten mappings. |- dirty folio -| [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten If we punch the range of the hole, ext4_punch_hole()-> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio since the target range is a hole. Finally, iomap_zero_range() will still flush this whole folio and lead to deadlock during writeback the latter half of the folio. > > The reason I ask is because the folio_batch handling has come up through > discussions on this series. My position so far has been to keep it as a > separate allocation and to keep things simple since it is currently > isolated to zero range, but that may change if the usage spills over to > other operations (which seems expected at this point). I suspect that if > a filesystem actually depends on this for correct behavior, that is > another data point worth considering on that topic. > > So that has me wondering if it would be better/easier here to perhaps > embed the batch in iomap_iter, or maybe as an incremental step put it on > the stack in iomap_zero_range() and initialize the iomap_iter pointer > there instead of doing the dynamic allocation (then the fill helper > would set a flag to indicate the fs did pagecache lookup). Thoughts on > something like that? 
> > Also IIUC ext4-on-iomap is still a WIP and review on this series seems > to have mostly wound down. Any objection if the fix for that comes along > as a followup patch rather than a rework of this series? It seems that we don't need to modify this series, we need to consider other solutions to resolve this deadlock issue. In my v1 ext4-on-iomap series [1], I resolved this issue by moving all instances of ext4_block_zero_page_range() out of the running journal handle(please see patch 19-21). But I don't think this is a good solution since it's complex and fragile. Besides, after commit c7fc0366c6562 ("ext4: partial zero eof block on unaligned inode size extension"), you added more invocations of ext4_zero_partial_blocks(), and the situation has become more complicated (Althrough I think the calls in the three write_end callbacks can be removed). Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios over unwritten mappings before zeroing partial blocks. This is because ext4 always zeroes the in-memory page cache before zeroing(e.g, in ext4_setattr() and ext4_punch_hole()), it means if the target range is still dirty and unwritten when calling ext4_block_zero_page_range(), it must has already been zeroed. Was I missing something? Therefore, I was wondering if there are any ways to prevent flushing in iomap_zero_range()? Any ideas? [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/ > > Brian > > P.S., I'm heading on vacation so it will likely be a week or two before > I follow up from here, JFYI. Wishing you a wonderful time! :-) Best regards, Yi. >> >>>> + return offset + length; >>>> + folio_batch_init(iter->fbatch); >>>> + >>>> + filemap_get_folios_dirty(mapping, &start, end, iter->fbatch); >>>> + return (start << PAGE_SHIFT); >>>> +} >>>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios); >> ^ permalink raw reply related [flat|nested] 13+ messages in thread
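To make the ordering constraint concrete, here is a minimal sketch of the problematic call pattern described in this message. The caller function is hypothetical, the ops names are taken from the ext4 diff above, and none of this is actual ext4 code:

/* Hypothetical caller -- illustrates the deadlock, not real ext4 code. */
static int zero_partial_under_handle(struct inode *inode, loff_t from,
				     loff_t length)
{
	handle_t *handle;
	int ret;

	handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/*
	 * If the range is dirty and unwritten, iomap_zero_range() may call
	 * filemap_write_and_wait_range(). Writeback needs its own journal
	 * handle; if jbd2 has to commit the running transaction first, it
	 * waits for every active handle to stop -- including the one held
	 * here -- so the flush never completes.
	 */
	ret = iomap_zero_range(inode, from, length, NULL,
			       &ext4_iomap_buffered_zero_ops,
			       &ext4_iomap_write_ops, NULL);

	ext4_journal_stop(handle);
	return ret;
}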
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-19 11:07 ` Zhang Yi @ 2025-07-21 8:47 ` Zhang Yi 2025-07-28 12:57 ` Zhang Yi 2025-07-30 13:17 ` Brian Foster 1 sibling, 1 reply; 13+ messages in thread From: Zhang Yi @ 2025-07-21 8:47 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/7/19 19:07, Zhang Yi wrote: > On 2025/7/18 21:48, Brian Foster wrote: >> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >>> On 2025/7/15 13:22, Darrick J. Wong wrote: >>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>>> The only way zero range can currently process unwritten mappings >>>>> with dirty pagecache is to check whether the range is dirty before >>>>> mapping lookup and then flush when at least one underlying mapping >>>>> is unwritten. This ordering is required to prevent iomap lookup from >>>>> racing with folio writeback and reclaim. >>>>> >>>>> Since zero range can skip ranges of unwritten mappings that are >>>>> clean in cache, this operation can be improved by allowing the >>>>> filesystem to provide a set of dirty folios that require zeroing. In >>>>> turn, rather than flush or iterate file offsets, zero range can >>>>> iterate on folios in the batch and advance over clean or uncached >>>>> ranges in between. >>>>> >>>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>>> >>>> /me confused by the single quote; is this supposed to read: >>>> >>>> "...for the fs to populate..."? >>>> >>>> Either way the code changes look like a reasonable thing to do for the >>>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>>> mapping lock) so >>>> >>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>>> >>>> --D >>>> >>>> >>>>> populate the batch at lookup time. Update the folio lookup path to >>>>> return the next folio in the batch, if provided, and advance the >>>>> iter if the folio starts beyond the current offset. >>>>> >>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>> --- >>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>>> fs/iomap/iter.c | 6 +++ >>>>> include/linux/iomap.h | 4 ++ >>>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>>> >>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>>> --- a/fs/iomap/buffered-io.c >>>>> +++ b/fs/iomap/buffered-io.c >>> [...] >>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>>> return status; >>>>> } >>>>> >>>>> +loff_t >>>>> +iomap_fill_dirty_folios( >>>>> + struct iomap_iter *iter, >>>>> + loff_t offset, >>>>> + loff_t length) >>>>> +{ >>>>> + struct address_space *mapping = iter->inode->i_mapping; >>>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>>> + >>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>>> + if (!iter->fbatch) >>> >>> Hi, Brian! >>> >>> I think ext4 needs to be aware of this failure after it converts to use >>> iomap infrastructure. It is because if we fail to add dirty folios to the >>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. >>> This could potentially lead to a deadlock, as most calls to >>> ext4_block_zero_page_range() occur under an active journal handle. 
>>> Writeback operations under an active journal handle may result in circular >>> waiting within journal transactions. So please return this error code, and >>> then ext4 can interrupt zero operations to prevent deadlock. >>> >> >> Hi Yi, >> >> Thanks for looking at this. >> >> Huh.. so the reason for falling back like this here is just that this >> was considered an optional optimization, with the flush in >> iomap_zero_range() being default fallback behavior. IIUC, what you're >> saying means that the current zero range behavior without this series is >> problematic for ext4-on-iomap..? > > Yes. > >> If so, have you observed issues you can share details about? > > Sure. > > Before delving into the specific details of this issue, I would like > to provide some background information on the rule that ext4 cannot > wait for writeback in an active journal handle. If you are aware of > this background, please skip this paragraph. During ext4 writing back > the page cache, it may start a new journal handle to allocate blocks, > update the disksize, and convert unwritten extents after the I/O is > completed. When starting this new journal handle, if the current > running journal transaction is in the process of being submitted or > if the journal space is insufficient, it must wait for the ongoing > transaction to be completed, but the prerequisite for this is that all > currently running handles must be terminated. However, if we flush the > page cache under an active journal handle, we cannot stop it, which > may lead to a deadlock. > > Now, the issue I have observed occurs when I attempt to use > iomap_zero_range() within ext4_block_zero_page_range(). My current > implementation are below(based on the latest fs-next). > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 28547663e4fd..1a21667f3f7c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, > return 0; > } > > +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, > + loff_t length, unsigned int flags, struct iomap *iomap, > + struct iomap *srcmap) > +{ > + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); > + struct ext4_map_blocks map; > + u8 blkbits = inode->i_blkbits; > + int ret; > + > + ret = ext4_emergency_state(inode->i_sb); > + if (unlikely(ret)) > + return ret; > + > + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) > + return -EINVAL; > + > + /* Calculate the first and last logical blocks respectively. */ > + map.m_lblk = offset >> blkbits; > + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, > + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; > + > + ret = ext4_map_blocks(NULL, inode, &map, 0); > + if (ret < 0) > + return ret; > + > + /* > + * Look up dirty folios for unwritten mappings within EOF. Providing > + * this bypasses the flush iomap uses to trigger extent conversion > + * when unwritten mappings have dirty pagecache in need of zeroing. 
> + */ > + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && > + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { > + loff_t end; > + > + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, > + map.m_len << blkbits); > + if ((end >> blkbits) < map.m_lblk + map.m_len) > + map.m_len = (end >> blkbits) - map.m_lblk; > + } > + > + ext4_set_iomap(inode, iomap, &map, offset, length, flags); > + return 0; > +} > + > +const struct iomap_ops ext4_iomap_buffered_zero_ops = { > + .iomap_begin = ext4_iomap_buffered_zero_begin, > +}; > > const struct iomap_ops ext4_iomap_buffered_write_ops = { > .iomap_begin = ext4_iomap_buffered_write_begin, > @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, > return err; > } > > +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, > + loff_t length) > +{ > + WARN_ON_ONCE(!inode_is_locked(inode) && > + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); > + > + return iomap_zero_range(inode, from, length, NULL, > + &ext4_iomap_buffered_zero_ops, > + &ext4_iomap_write_ops, NULL); > +} > + > /* > * ext4_block_zero_page_range() zeros out a mapping of length 'length' > * starting from file offset 'from'. The range to be zero'd must > @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, > if (IS_DAX(inode)) { > return dax_zero_range(inode, from, length, NULL, > &ext4_iomap_ops); > + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { > + return ext4_iomap_zero_range(inode, from, length); > } > return __ext4_block_zero_page_range(handle, mapping, from, length); > } > > The problem is most calls to ext4_block_zero_page_range() occur under > an active journal handle, so I can reproduce the deadlock issue easily > without this series. > >> >> FWIW, I think your suggestion is reasonable, but I'm also curious what >> the error handling would look like in ext4. Do you expect to the fail >> the higher level operation, for example? Cycle locks and retry, etc.? > > Originally, I wanted ext4_block_zero_page_range() to return a failure > to the higher level operation. However, unfortunately, after my testing > today, I discovered that even though we implement this, this series still > cannot resolve the issue. The corner case is: > > Assume we have a dirty folio covers both hole and unwritten mappings. > > |- dirty folio -| > [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten > > If we punch the range of the hole, ext4_punch_hole()-> > ext4_zero_partial_blocks() will zero out the first half of the dirty folio. > Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio > since the target range is a hole. Finally, iomap_zero_range() will still > flush this whole folio and lead to deadlock during writeback the latter > half of the folio. > >> >> The reason I ask is because the folio_batch handling has come up through >> discussions on this series. My position so far has been to keep it as a >> separate allocation and to keep things simple since it is currently >> isolated to zero range, but that may change if the usage spills over to >> other operations (which seems expected at this point). I suspect that if >> a filesystem actually depends on this for correct behavior, that is >> another data point worth considering on that topic. 
>> >> So that has me wondering if it would be better/easier here to perhaps >> embed the batch in iomap_iter, or maybe as an incremental step put it on >> the stack in iomap_zero_range() and initialize the iomap_iter pointer >> there instead of doing the dynamic allocation (then the fill helper >> would set a flag to indicate the fs did pagecache lookup). Thoughts on >> something like that? >> >> Also IIUC ext4-on-iomap is still a WIP and review on this series seems >> to have mostly wound down. Any objection if the fix for that comes along >> as a followup patch rather than a rework of this series? > > It seems that we don't need to modify this series, we need to consider > other solutions to resolve this deadlock issue. > > In my v1 ext4-on-iomap series [1], I resolved this issue by moving all > instances of ext4_block_zero_page_range() out of the running journal > handle(please see patch 19-21). But I don't think this is a good solution > since it's complex and fragile. Besides, after commit c7fc0366c6562 > ("ext4: partial zero eof block on unaligned inode size extension"), you > added more invocations of ext4_zero_partial_blocks(), and the situation > has become more complicated (Althrough I think the calls in the three > write_end callbacks can be removed). > > Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios > over unwritten mappings before zeroing partial blocks. This is because > ext4 always zeroes the in-memory page cache before zeroing(e.g, in > ext4_setattr() and ext4_punch_hole()), it means if the target range is > still dirty and unwritten when calling ext4_block_zero_page_range(), it > must has already been zeroed. Was I missing something? Therefore, I was > wondering if there are any ways to prevent flushing in > iomap_zero_range()? Any ideas? > The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on unwritten mappings") mentioned the following: iomap_zero_range() flushes pagecache to mitigate consistency problems with dirty pagecache and unwritten mappings. The flush is unconditional over the entire range because checking pagecache state after mapping lookup is racy with writeback and reclaim. There are ways around this using iomap's mapping revalidation mechanism, but this is not supported by all iomap based filesystems and so is not a generic solution. Does the revalidation mechanism here refer to verifying the validity of the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean that if the filesystem implement the iomap_valid() interface, we can always avoid the iomap_zero_range() from flushing dirty folios back? Something like below: diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 73772d34f502..ba71a6ed2f77 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, if (range_dirty) { range_dirty = false; - status = iomap_zero_iter_flush_and_stale(&iter); + if (write_ops->iomap_valid) + status = iomap_zero_iter(&iter, did_zero, write_ops); + else + status = iomap_zero_iter_flush_and_stale(&iter); } else { status = iomap_iter_advance_full(&iter); } Thanks, Yi. > [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/ > >> >> Brian >> >> P.S., I'm heading on vacation so it will likely be a week or two before >> I follow up from here, JFYI. > > Wishing you a wonderful time! :-) > > Best regards, > Yi. 
> >>> >>>>> + return offset + length; >>>>> + folio_batch_init(iter->fbatch); >>>>> + >>>>> + filemap_get_folios_dirty(mapping, &start, end, iter->fbatch); >>>>> + return (start << PAGE_SHIFT); >>>>> +} >>>>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios); >>> > > ^ permalink raw reply related [flat|nested] 13+ messages in thread
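For context, the revalidation hook referenced here follows the shape of xfs_iomap_valid(): ->iomap_begin stashes a per-inode sequence number in iomap->validity_cookie, and ->iomap_valid rechecks it under the folio lock so that a mapping made stale by racing writeback or reclaim is refetched rather than trusted. A hypothetical ext4 version might look like the following; the sequence counter and the i_iomap_seq field are made up for illustration and do not exist in ext4 today:

static u64 ext4_iomap_seq(struct inode *inode)
{
	/* hypothetical per-inode counter, bumped on extent tree changes */
	return (u64)atomic64_read(&EXT4_I(inode)->i_iomap_seq);
}

static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	return iomap->validity_cookie == ext4_iomap_seq(inode);
}

Wired into ext4's iomap_write_ops, something of this shape is what would let the write_ops->iomap_valid check in the diff above take the non-flushing path safely.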
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-21 8:47 ` Zhang Yi @ 2025-07-28 12:57 ` Zhang Yi 2025-07-30 13:19 ` Brian Foster 0 siblings, 1 reply; 13+ messages in thread From: Zhang Yi @ 2025-07-28 12:57 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/7/21 16:47, Zhang Yi wrote: > On 2025/7/19 19:07, Zhang Yi wrote: >> On 2025/7/18 21:48, Brian Foster wrote: >>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >>>> On 2025/7/15 13:22, Darrick J. Wong wrote: >>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>>>> The only way zero range can currently process unwritten mappings >>>>>> with dirty pagecache is to check whether the range is dirty before >>>>>> mapping lookup and then flush when at least one underlying mapping >>>>>> is unwritten. This ordering is required to prevent iomap lookup from >>>>>> racing with folio writeback and reclaim. >>>>>> >>>>>> Since zero range can skip ranges of unwritten mappings that are >>>>>> clean in cache, this operation can be improved by allowing the >>>>>> filesystem to provide a set of dirty folios that require zeroing. In >>>>>> turn, rather than flush or iterate file offsets, zero range can >>>>>> iterate on folios in the batch and advance over clean or uncached >>>>>> ranges in between. >>>>>> >>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>>>> >>>>> /me confused by the single quote; is this supposed to read: >>>>> >>>>> "...for the fs to populate..."? >>>>> >>>>> Either way the code changes look like a reasonable thing to do for the >>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>>>> mapping lock) so >>>>> >>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>>>> >>>>> --D >>>>> >>>>> >>>>>> populate the batch at lookup time. Update the folio lookup path to >>>>>> return the next folio in the batch, if provided, and advance the >>>>>> iter if the folio starts beyond the current offset. >>>>>> >>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>>> --- >>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>>>> fs/iomap/iter.c | 6 +++ >>>>>> include/linux/iomap.h | 4 ++ >>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>>>> >>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>>>> --- a/fs/iomap/buffered-io.c >>>>>> +++ b/fs/iomap/buffered-io.c >>>> [...] >>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>>>> return status; >>>>>> } >>>>>> >>>>>> +loff_t >>>>>> +iomap_fill_dirty_folios( >>>>>> + struct iomap_iter *iter, >>>>>> + loff_t offset, >>>>>> + loff_t length) >>>>>> +{ >>>>>> + struct address_space *mapping = iter->inode->i_mapping; >>>>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>>>> + >>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>>>> + if (!iter->fbatch) >>>> >>>> Hi, Brian! >>>> >>>> I think ext4 needs to be aware of this failure after it converts to use >>>> iomap infrastructure. It is because if we fail to add dirty folios to the >>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. 
>>>> This could potentially lead to a deadlock, as most calls to >>>> ext4_block_zero_page_range() occur under an active journal handle. >>>> Writeback operations under an active journal handle may result in circular >>>> waiting within journal transactions. So please return this error code, and >>>> then ext4 can interrupt zero operations to prevent deadlock. >>>> >>> >>> Hi Yi, >>> >>> Thanks for looking at this. >>> >>> Huh.. so the reason for falling back like this here is just that this >>> was considered an optional optimization, with the flush in >>> iomap_zero_range() being default fallback behavior. IIUC, what you're >>> saying means that the current zero range behavior without this series is >>> problematic for ext4-on-iomap..? >> >> Yes. >> >>> If so, have you observed issues you can share details about? >> >> Sure. >> >> Before delving into the specific details of this issue, I would like >> to provide some background information on the rule that ext4 cannot >> wait for writeback in an active journal handle. If you are aware of >> this background, please skip this paragraph. During ext4 writing back >> the page cache, it may start a new journal handle to allocate blocks, >> update the disksize, and convert unwritten extents after the I/O is >> completed. When starting this new journal handle, if the current >> running journal transaction is in the process of being submitted or >> if the journal space is insufficient, it must wait for the ongoing >> transaction to be completed, but the prerequisite for this is that all >> currently running handles must be terminated. However, if we flush the >> page cache under an active journal handle, we cannot stop it, which >> may lead to a deadlock. >> >> Now, the issue I have observed occurs when I attempt to use >> iomap_zero_range() within ext4_block_zero_page_range(). My current >> implementation are below(based on the latest fs-next). >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 28547663e4fd..1a21667f3f7c 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, >> return 0; >> } >> >> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, >> + loff_t length, unsigned int flags, struct iomap *iomap, >> + struct iomap *srcmap) >> +{ >> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); >> + struct ext4_map_blocks map; >> + u8 blkbits = inode->i_blkbits; >> + int ret; >> + >> + ret = ext4_emergency_state(inode->i_sb); >> + if (unlikely(ret)) >> + return ret; >> + >> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) >> + return -EINVAL; >> + >> + /* Calculate the first and last logical blocks respectively. */ >> + map.m_lblk = offset >> blkbits; >> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, >> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; >> + >> + ret = ext4_map_blocks(NULL, inode, &map, 0); >> + if (ret < 0) >> + return ret; >> + >> + /* >> + * Look up dirty folios for unwritten mappings within EOF. Providing >> + * this bypasses the flush iomap uses to trigger extent conversion >> + * when unwritten mappings have dirty pagecache in need of zeroing. 
>> + */ >> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && >> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { >> + loff_t end; >> + >> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, >> + map.m_len << blkbits); >> + if ((end >> blkbits) < map.m_lblk + map.m_len) >> + map.m_len = (end >> blkbits) - map.m_lblk; >> + } >> + >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); >> + return 0; >> +} >> + >> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { >> + .iomap_begin = ext4_iomap_buffered_zero_begin, >> +}; >> >> const struct iomap_ops ext4_iomap_buffered_write_ops = { >> .iomap_begin = ext4_iomap_buffered_write_begin, >> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, >> return err; >> } >> >> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, >> + loff_t length) >> +{ >> + WARN_ON_ONCE(!inode_is_locked(inode) && >> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); >> + >> + return iomap_zero_range(inode, from, length, NULL, >> + &ext4_iomap_buffered_zero_ops, >> + &ext4_iomap_write_ops, NULL); >> +} >> + >> /* >> * ext4_block_zero_page_range() zeros out a mapping of length 'length' >> * starting from file offset 'from'. The range to be zero'd must >> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, >> if (IS_DAX(inode)) { >> return dax_zero_range(inode, from, length, NULL, >> &ext4_iomap_ops); >> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { >> + return ext4_iomap_zero_range(inode, from, length); >> } >> return __ext4_block_zero_page_range(handle, mapping, from, length); >> } >> >> The problem is most calls to ext4_block_zero_page_range() occur under >> an active journal handle, so I can reproduce the deadlock issue easily >> without this series. >> >>> >>> FWIW, I think your suggestion is reasonable, but I'm also curious what >>> the error handling would look like in ext4. Do you expect to the fail >>> the higher level operation, for example? Cycle locks and retry, etc.? >> >> Originally, I wanted ext4_block_zero_page_range() to return a failure >> to the higher level operation. However, unfortunately, after my testing >> today, I discovered that even though we implement this, this series still >> cannot resolve the issue. The corner case is: >> >> Assume we have a dirty folio covers both hole and unwritten mappings. >> >> |- dirty folio -| >> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten >> >> If we punch the range of the hole, ext4_punch_hole()-> >> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. >> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio >> since the target range is a hole. Finally, iomap_zero_range() will still >> flush this whole folio and lead to deadlock during writeback the latter >> half of the folio. >> >>> >>> The reason I ask is because the folio_batch handling has come up through >>> discussions on this series. My position so far has been to keep it as a >>> separate allocation and to keep things simple since it is currently >>> isolated to zero range, but that may change if the usage spills over to >>> other operations (which seems expected at this point). I suspect that if >>> a filesystem actually depends on this for correct behavior, that is >>> another data point worth considering on that topic. 
>>> >>> So that has me wondering if it would be better/easier here to perhaps >>> embed the batch in iomap_iter, or maybe as an incremental step put it on >>> the stack in iomap_zero_range() and initialize the iomap_iter pointer >>> there instead of doing the dynamic allocation (then the fill helper >>> would set a flag to indicate the fs did pagecache lookup). Thoughts on >>> something like that? >>> >>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems >>> to have mostly wound down. Any objection if the fix for that comes along >>> as a followup patch rather than a rework of this series? >> >> It seems that we don't need to modify this series, we need to consider >> other solutions to resolve this deadlock issue. >> >> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all >> instances of ext4_block_zero_page_range() out of the running journal >> handle(please see patch 19-21). But I don't think this is a good solution >> since it's complex and fragile. Besides, after commit c7fc0366c6562 >> ("ext4: partial zero eof block on unaligned inode size extension"), you >> added more invocations of ext4_zero_partial_blocks(), and the situation >> has become more complicated (Althrough I think the calls in the three >> write_end callbacks can be removed). >> >> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios >> over unwritten mappings before zeroing partial blocks. This is because >> ext4 always zeroes the in-memory page cache before zeroing(e.g, in >> ext4_setattr() and ext4_punch_hole()), it means if the target range is >> still dirty and unwritten when calling ext4_block_zero_page_range(), it >> must has already been zeroed. Was I missing something? Therefore, I was >> wondering if there are any ways to prevent flushing in >> iomap_zero_range()? Any ideas? >> > > The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on > unwritten mappings") mentioned the following: > > iomap_zero_range() flushes pagecache to mitigate consistency > problems with dirty pagecache and unwritten mappings. The flush is > unconditional over the entire range because checking pagecache state > after mapping lookup is racy with writeback and reclaim. There are > ways around this using iomap's mapping revalidation mechanism, but > this is not supported by all iomap based filesystems and so is not a > generic solution. > > Does the revalidation mechanism here refer to verifying the validity of > the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean > that if the filesystem implement the iomap_valid() interface, we can > always avoid the iomap_zero_range() from flushing dirty folios back? > Something like below: > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 73772d34f502..ba71a6ed2f77 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, > > if (range_dirty) { > range_dirty = false; > - status = iomap_zero_iter_flush_and_stale(&iter); > + if (write_ops->iomap_valid) > + status = iomap_zero_iter(&iter, did_zero, write_ops); > + else > + status = iomap_zero_iter_flush_and_stale(&iter); > } else { > status = iomap_iter_advance_full(&iter); > } > The above diff will trigger WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size) in iomap_zero_iter() on XFS. 
I revised the 'diff' and ran xfstests with several main configs on both XFS and ext4 (with iomap infrastructure), and everything seems to be working fine so far. What do you think?

Thanks,
Yi.

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 73772d34f502..a75cdb22bab0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1444,9 +1444,6 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 			break;
 		}
 
-		/* warn about zeroing folios beyond eof that won't write back */
-		WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size);
-
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1515,22 +1512,44 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 			    srcmap->type != IOMAP_UNWRITTEN))
 			return -EIO;
 
-		if (!iter.fbatch &&
-		    (srcmap->type == IOMAP_HOLE ||
-		     srcmap->type == IOMAP_UNWRITTEN)) {
-			s64 status;
+		if (iter.fbatch || (srcmap->type != IOMAP_HOLE &&
+				    srcmap->type != IOMAP_UNWRITTEN)) {
+			iter.status = iomap_zero_iter(&iter, did_zero,
+						      write_ops);
+			continue;
+		}
 
-			if (range_dirty) {
-				range_dirty = false;
-				status = iomap_zero_iter_flush_and_stale(&iter);
-			} else {
-				status = iomap_iter_advance_full(&iter);
-			}
-			iter.status = status;
+		/*
+		 * No fbatch, and the target is either a hole or an unwritten
+		 * range; skip zeroing if the range is not dirty.
+		 */
+		if (!range_dirty) {
+			iter.status = iomap_iter_advance_full(&iter);
 			continue;
 		}
 
-		iter.status = iomap_zero_iter(&iter, did_zero, write_ops);
+		/*
+		 * The range is dirty. If the given filesystem does not specify
+		 * a revalidation mechanism, flush the entire range to prevent
+		 * mapping changes that could race with writeback and reclaim.
+		 */
+		if (!write_ops->iomap_valid) {
+			range_dirty = false;
+			iter.status = iomap_zero_iter_flush_and_stale(&iter);
+			continue;
+		}
+
+		/*
+		 * The filesystem specifies an iomap_valid() helper. It is safe
+		 * to zero out the current range if it is unwritten and dirty.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    filemap_range_needs_writeback(mapping, iter.pos,
+					iter.pos + iomap_length(&iter) - 1))
+			iter.status = iomap_zero_iter(&iter, did_zero,
+						      write_ops);
+		else
+			iter.status = iomap_iter_advance_full(&iter);
 	}
 	return ret;
 }
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-28 12:57 ` Zhang Yi @ 2025-07-30 13:19 ` Brian Foster 2025-08-02 7:26 ` Zhang Yi 0 siblings, 1 reply; 13+ messages in thread From: Brian Foster @ 2025-07-30 13:19 UTC (permalink / raw) To: Zhang Yi Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On Mon, Jul 28, 2025 at 08:57:28PM +0800, Zhang Yi wrote: > On 2025/7/21 16:47, Zhang Yi wrote: > > On 2025/7/19 19:07, Zhang Yi wrote: > >> On 2025/7/18 21:48, Brian Foster wrote: > >>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: > >>>> On 2025/7/15 13:22, Darrick J. Wong wrote: > >>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: > >>>>>> The only way zero range can currently process unwritten mappings > >>>>>> with dirty pagecache is to check whether the range is dirty before > >>>>>> mapping lookup and then flush when at least one underlying mapping > >>>>>> is unwritten. This ordering is required to prevent iomap lookup from > >>>>>> racing with folio writeback and reclaim. > >>>>>> > >>>>>> Since zero range can skip ranges of unwritten mappings that are > >>>>>> clean in cache, this operation can be improved by allowing the > >>>>>> filesystem to provide a set of dirty folios that require zeroing. In > >>>>>> turn, rather than flush or iterate file offsets, zero range can > >>>>>> iterate on folios in the batch and advance over clean or uncached > >>>>>> ranges in between. > >>>>>> > >>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to > >>>>> > >>>>> /me confused by the single quote; is this supposed to read: > >>>>> > >>>>> "...for the fs to populate..."? > >>>>> > >>>>> Either way the code changes look like a reasonable thing to do for the > >>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the > >>>>> mapping lock) so > >>>>> > >>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > >>>>> > >>>>> --D > >>>>> > >>>>> > >>>>>> populate the batch at lookup time. Update the folio lookup path to > >>>>>> return the next folio in the batch, if provided, and advance the > >>>>>> iter if the folio starts beyond the current offset. > >>>>>> > >>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> > >>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> > >>>>>> --- > >>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- > >>>>>> fs/iomap/iter.c | 6 +++ > >>>>>> include/linux/iomap.h | 4 ++ > >>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) > >>>>>> > >>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 > >>>>>> --- a/fs/iomap/buffered-io.c > >>>>>> +++ b/fs/iomap/buffered-io.c > >>>> [...] > >>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > >>>>>> return status; > >>>>>> } > >>>>>> > >>>>>> +loff_t > >>>>>> +iomap_fill_dirty_folios( > >>>>>> + struct iomap_iter *iter, > >>>>>> + loff_t offset, > >>>>>> + loff_t length) > >>>>>> +{ > >>>>>> + struct address_space *mapping = iter->inode->i_mapping; > >>>>>> + pgoff_t start = offset >> PAGE_SHIFT; > >>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; > >>>>>> + > >>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); > >>>>>> + if (!iter->fbatch) > >>>> > >>>> Hi, Brian! > >>>> > >>>> I think ext4 needs to be aware of this failure after it converts to use > >>>> iomap infrastructure. 
It is because if we fail to add dirty folios to the > >>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. > >>>> This could potentially lead to a deadlock, as most calls to > >>>> ext4_block_zero_page_range() occur under an active journal handle. > >>>> Writeback operations under an active journal handle may result in circular > >>>> waiting within journal transactions. So please return this error code, and > >>>> then ext4 can interrupt zero operations to prevent deadlock. > >>>> > >>> > >>> Hi Yi, > >>> > >>> Thanks for looking at this. > >>> > >>> Huh.. so the reason for falling back like this here is just that this > >>> was considered an optional optimization, with the flush in > >>> iomap_zero_range() being default fallback behavior. IIUC, what you're > >>> saying means that the current zero range behavior without this series is > >>> problematic for ext4-on-iomap..? > >> > >> Yes. > >> > >>> If so, have you observed issues you can share details about? > >> > >> Sure. > >> > >> Before delving into the specific details of this issue, I would like > >> to provide some background information on the rule that ext4 cannot > >> wait for writeback in an active journal handle. If you are aware of > >> this background, please skip this paragraph. During ext4 writing back > >> the page cache, it may start a new journal handle to allocate blocks, > >> update the disksize, and convert unwritten extents after the I/O is > >> completed. When starting this new journal handle, if the current > >> running journal transaction is in the process of being submitted or > >> if the journal space is insufficient, it must wait for the ongoing > >> transaction to be completed, but the prerequisite for this is that all > >> currently running handles must be terminated. However, if we flush the > >> page cache under an active journal handle, we cannot stop it, which > >> may lead to a deadlock. > >> > >> Now, the issue I have observed occurs when I attempt to use > >> iomap_zero_range() within ext4_block_zero_page_range(). My current > >> implementation are below(based on the latest fs-next). > >> > >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > >> index 28547663e4fd..1a21667f3f7c 100644 > >> --- a/fs/ext4/inode.c > >> +++ b/fs/ext4/inode.c > >> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, > >> return 0; > >> } > >> > >> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, > >> + loff_t length, unsigned int flags, struct iomap *iomap, > >> + struct iomap *srcmap) > >> +{ > >> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); > >> + struct ext4_map_blocks map; > >> + u8 blkbits = inode->i_blkbits; > >> + int ret; > >> + > >> + ret = ext4_emergency_state(inode->i_sb); > >> + if (unlikely(ret)) > >> + return ret; > >> + > >> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) > >> + return -EINVAL; > >> + > >> + /* Calculate the first and last logical blocks respectively. */ > >> + map.m_lblk = offset >> blkbits; > >> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, > >> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; > >> + > >> + ret = ext4_map_blocks(NULL, inode, &map, 0); > >> + if (ret < 0) > >> + return ret; > >> + > >> + /* > >> + * Look up dirty folios for unwritten mappings within EOF. Providing > >> + * this bypasses the flush iomap uses to trigger extent conversion > >> + * when unwritten mappings have dirty pagecache in need of zeroing. 
> >> + */ > >> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && > >> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { > >> + loff_t end; > >> + > >> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, > >> + map.m_len << blkbits); > >> + if ((end >> blkbits) < map.m_lblk + map.m_len) > >> + map.m_len = (end >> blkbits) - map.m_lblk; > >> + } > >> + > >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); > >> + return 0; > >> +} > >> + > >> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { > >> + .iomap_begin = ext4_iomap_buffered_zero_begin, > >> +}; > >> > >> const struct iomap_ops ext4_iomap_buffered_write_ops = { > >> .iomap_begin = ext4_iomap_buffered_write_begin, > >> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, > >> return err; > >> } > >> > >> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, > >> + loff_t length) > >> +{ > >> + WARN_ON_ONCE(!inode_is_locked(inode) && > >> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); > >> + > >> + return iomap_zero_range(inode, from, length, NULL, > >> + &ext4_iomap_buffered_zero_ops, > >> + &ext4_iomap_write_ops, NULL); > >> +} > >> + > >> /* > >> * ext4_block_zero_page_range() zeros out a mapping of length 'length' > >> * starting from file offset 'from'. The range to be zero'd must > >> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, > >> if (IS_DAX(inode)) { > >> return dax_zero_range(inode, from, length, NULL, > >> &ext4_iomap_ops); > >> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { > >> + return ext4_iomap_zero_range(inode, from, length); > >> } > >> return __ext4_block_zero_page_range(handle, mapping, from, length); > >> } > >> > >> The problem is most calls to ext4_block_zero_page_range() occur under > >> an active journal handle, so I can reproduce the deadlock issue easily > >> without this series. > >> > >>> > >>> FWIW, I think your suggestion is reasonable, but I'm also curious what > >>> the error handling would look like in ext4. Do you expect to the fail > >>> the higher level operation, for example? Cycle locks and retry, etc.? > >> > >> Originally, I wanted ext4_block_zero_page_range() to return a failure > >> to the higher level operation. However, unfortunately, after my testing > >> today, I discovered that even though we implement this, this series still > >> cannot resolve the issue. The corner case is: > >> > >> Assume we have a dirty folio covers both hole and unwritten mappings. > >> > >> |- dirty folio -| > >> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten > >> > >> If we punch the range of the hole, ext4_punch_hole()-> > >> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. > >> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio > >> since the target range is a hole. Finally, iomap_zero_range() will still > >> flush this whole folio and lead to deadlock during writeback the latter > >> half of the folio. > >> > >>> > >>> The reason I ask is because the folio_batch handling has come up through > >>> discussions on this series. My position so far has been to keep it as a > >>> separate allocation and to keep things simple since it is currently > >>> isolated to zero range, but that may change if the usage spills over to > >>> other operations (which seems expected at this point). I suspect that if > >>> a filesystem actually depends on this for correct behavior, that is > >>> another data point worth considering on that topic. 
> >>> > >>> So that has me wondering if it would be better/easier here to perhaps > >>> embed the batch in iomap_iter, or maybe as an incremental step put it on > >>> the stack in iomap_zero_range() and initialize the iomap_iter pointer > >>> there instead of doing the dynamic allocation (then the fill helper > >>> would set a flag to indicate the fs did pagecache lookup). Thoughts on > >>> something like that? > >>> > >>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems > >>> to have mostly wound down. Any objection if the fix for that comes along > >>> as a followup patch rather than a rework of this series? > >> > >> It seems that we don't need to modify this series, we need to consider > >> other solutions to resolve this deadlock issue. > >> > >> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all > >> instances of ext4_block_zero_page_range() out of the running journal > >> handle(please see patch 19-21). But I don't think this is a good solution > >> since it's complex and fragile. Besides, after commit c7fc0366c6562 > >> ("ext4: partial zero eof block on unaligned inode size extension"), you > >> added more invocations of ext4_zero_partial_blocks(), and the situation > >> has become more complicated (Althrough I think the calls in the three > >> write_end callbacks can be removed). > >> > >> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios > >> over unwritten mappings before zeroing partial blocks. This is because > >> ext4 always zeroes the in-memory page cache before zeroing(e.g, in > >> ext4_setattr() and ext4_punch_hole()), it means if the target range is > >> still dirty and unwritten when calling ext4_block_zero_page_range(), it > >> must has already been zeroed. Was I missing something? Therefore, I was > >> wondering if there are any ways to prevent flushing in > >> iomap_zero_range()? Any ideas? > >> > > > > The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on > > unwritten mappings") mentioned the following: > > > > iomap_zero_range() flushes pagecache to mitigate consistency > > problems with dirty pagecache and unwritten mappings. The flush is > > unconditional over the entire range because checking pagecache state > > after mapping lookup is racy with writeback and reclaim. There are > > ways around this using iomap's mapping revalidation mechanism, but > > this is not supported by all iomap based filesystems and so is not a > > generic solution. > > > > Does the revalidation mechanism here refer to verifying the validity of > > the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean > > that if the filesystem implement the iomap_valid() interface, we can > > always avoid the iomap_zero_range() from flushing dirty folios back? > > Something like below: > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index 73772d34f502..ba71a6ed2f77 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, > > > > if (range_dirty) { > > range_dirty = false; > > - status = iomap_zero_iter_flush_and_stale(&iter); > > + if (write_ops->iomap_valid) > > + status = iomap_zero_iter(&iter, did_zero, write_ops); > > + else > > + status = iomap_zero_iter_flush_and_stale(&iter); > > } else { > > status = iomap_iter_advance_full(&iter); > > } > > > > The above diff will trigger > WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size) in iomap_zero_iter() > on XFS. 
I revised the 'diff' and ran xfstests with several main configs > on both XFS and ext4(with iomap infrastructure), and everything seems to > be working fine so far. What do you think? > A couple things here.. First, I don't think it's quite enough to assume zeroing is safe just because a revalidation callback is defined. I have multiple (old) prototypes around fixing zero range, one of which was centered around using ->iomap_valid(), so I do believe it's technically possible with further changes. That's what the above "There are ways around this using the revalidation mechanism" text was referring to generally. That said, I don't think going down that path is the right solution here. I opted for the folio batch approach in favor of that one because it is more generic, for one. Based on a lot of the discussion around this series, it also seems to be more broadly useful. For example, there is potential to use for other operations, including buffered writes. If that pans out (no guarantees of course), then the fill thing becomes more of a generic iomap step vs. something called by the fs. That likely means the flush this is trying to work around can also go away entirely. I.e., the flush is really just a fallback mechanism for fs' that don't care or know any better, since there is no guarantee the callback fills the batch and a flush is better than silent data corruption. So I'd really like to avoid trying to reinvent things here if at all possible. I'm curious if the tweaks proposed in the previous reply would be sufficient for ext4. If so, I can dig into that after rolling the next version of this series.. Brian > Thanks, > Yi. > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 73772d34f502..a75cdb22bab0 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1444,9 +1444,6 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero, > break; > } > > - /* warn about zeroing folios beyond eof that won't write back */ > - WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size); > - > folio_zero_range(folio, offset, bytes); > folio_mark_accessed(folio); > > @@ -1515,22 +1512,44 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, > srcmap->type != IOMAP_UNWRITTEN)) > return -EIO; > > - if (!iter.fbatch && > - (srcmap->type == IOMAP_HOLE || > - srcmap->type == IOMAP_UNWRITTEN)) { > - s64 status; > + if (iter.fbatch || (srcmap->type != IOMAP_HOLE && > + srcmap->type != IOMAP_UNWRITTEN)) { > + iter.status = iomap_zero_iter(&iter, did_zero, > + write_ops); > + continue; > + } > > - if (range_dirty) { > - range_dirty = false; > - status = iomap_zero_iter_flush_and_stale(&iter); > - } else { > - status = iomap_iter_advance_full(&iter); > - } > - iter.status = status; > + /* > + * No fbatch, and the target is either a hole or an unwritten > + * range, skip zeroing if the range is not dirty. > + */ > + if (!range_dirty) { > + iter.status = iomap_iter_advance_full(&iter); > continue; > } > > - iter.status = iomap_zero_iter(&iter, did_zero, write_ops); > + /* > + * The range is dirty, if the given filesystm does not specify > + * a revalidation mechanism, flush the entire range to prevent > + * mapping changes that could race with writeback and reclaim. > + */ > + if (!write_ops->iomap_valid) { > + range_dirty = false; > + iter.status = iomap_zero_iter_flush_and_stale(&iter); > + continue; > + } > + > + /* > + * The filesystem specifies an iomap_valid() helper. 
It is safe > + * to zero out the current range if it is unwritten and dirty. > + */ > + if (srcmap->type == IOMAP_UNWRITTEN && > + filemap_range_needs_writeback(mapping, iter.pos, > + iomap_length(&iter))) > + iter.status = iomap_zero_iter(&iter, did_zero, > + write_ops); > + else > + iter.status = iomap_iter_advance_full(&iter); > } > return ret; > } > > ^ permalink raw reply [flat|nested] 13+ messages in thread
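For readers skimming the thread, here is a condensed, filesystem-neutral sketch of the caller-side pattern being discussed: the zero-range ->iomap_begin hands dirty folios to iomap via iomap_fill_dirty_folios() and trims its reported mapping to the range the batch actually covers. Only iomap_fill_dirty_folios() comes from the series under discussion; myfs_zero_iomap_begin() and myfs_map_blocks() are placeholder names, not real interfaces.

static int myfs_zero_iomap_begin(struct inode *inode, loff_t offset,
		loff_t length, unsigned int flags, struct iomap *iomap,
		struct iomap *srcmap)
{
	struct iomap_iter *iter =
			container_of(iomap, struct iomap_iter, iomap);
	loff_t end;

	/* placeholder: fill iomap->offset/length/type for this range */
	myfs_map_blocks(inode, offset, length, iomap);

	/*
	 * For unwritten mappings inside EOF, hand the dirty folios to
	 * iomap up front so it can zero them directly instead of taking
	 * the flush fallback. The helper returns how far the pagecache
	 * scan got, so trim the mapping to keep the two in sync.
	 */
	if (iomap->type == IOMAP_UNWRITTEN && offset < i_size_read(inode)) {
		end = iomap_fill_dirty_folios(iter, iomap->offset,
					      iomap->length);
		iomap->length = min_t(loff_t, iomap->length,
				      end - iomap->offset);
	}
	return 0;
}

The trim matters when the batch fills up before the scan reaches the end of the mapping: the remainder is then looked up (and batched) on a later iteration rather than being mistaken for clean cache.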
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-30 13:19 ` Brian Foster @ 2025-08-02 7:26 ` Zhang Yi 0 siblings, 0 replies; 13+ messages in thread From: Zhang Yi @ 2025-08-02 7:26 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/7/30 21:19, Brian Foster wrote: > On Mon, Jul 28, 2025 at 08:57:28PM +0800, Zhang Yi wrote: >> On 2025/7/21 16:47, Zhang Yi wrote: >>> On 2025/7/19 19:07, Zhang Yi wrote: >>>> On 2025/7/18 21:48, Brian Foster wrote: >>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote: >>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>>>>>> The only way zero range can currently process unwritten mappings >>>>>>>> with dirty pagecache is to check whether the range is dirty before >>>>>>>> mapping lookup and then flush when at least one underlying mapping >>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from >>>>>>>> racing with folio writeback and reclaim. >>>>>>>> >>>>>>>> Since zero range can skip ranges of unwritten mappings that are >>>>>>>> clean in cache, this operation can be improved by allowing the >>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In >>>>>>>> turn, rather than flush or iterate file offsets, zero range can >>>>>>>> iterate on folios in the batch and advance over clean or uncached >>>>>>>> ranges in between. >>>>>>>> >>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>>>>>> >>>>>>> /me confused by the single quote; is this supposed to read: >>>>>>> >>>>>>> "...for the fs to populate..."? >>>>>>> >>>>>>> Either way the code changes look like a reasonable thing to do for the >>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>>>>>> mapping lock) so >>>>>>> >>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>>>>>> >>>>>>> --D >>>>>>> >>>>>>> >>>>>>>> populate the batch at lookup time. Update the folio lookup path to >>>>>>>> return the next folio in the batch, if provided, and advance the >>>>>>>> iter if the folio starts beyond the current offset. >>>>>>>> >>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>>>>> --- >>>>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>>>>>> fs/iomap/iter.c | 6 +++ >>>>>>>> include/linux/iomap.h | 4 ++ >>>>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>>>>>> >>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>>>>>> --- a/fs/iomap/buffered-io.c >>>>>>>> +++ b/fs/iomap/buffered-io.c >>>>>> [...] >>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>>>>>> return status; >>>>>>>> } >>>>>>>> >>>>>>>> +loff_t >>>>>>>> +iomap_fill_dirty_folios( >>>>>>>> + struct iomap_iter *iter, >>>>>>>> + loff_t offset, >>>>>>>> + loff_t length) >>>>>>>> +{ >>>>>>>> + struct address_space *mapping = iter->inode->i_mapping; >>>>>>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>>>>>> + >>>>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>>>>>> + if (!iter->fbatch) >>>>>> >>>>>> Hi, Brian! >>>>>> >>>>>> I think ext4 needs to be aware of this failure after it converts to use >>>>>> iomap infrastructure. 
It is because if we fail to add dirty folios to the >>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. >>>>>> This could potentially lead to a deadlock, as most calls to >>>>>> ext4_block_zero_page_range() occur under an active journal handle. >>>>>> Writeback operations under an active journal handle may result in circular >>>>>> waiting within journal transactions. So please return this error code, and >>>>>> then ext4 can interrupt zero operations to prevent deadlock. >>>>>> >>>>> >>>>> Hi Yi, >>>>> >>>>> Thanks for looking at this. >>>>> >>>>> Huh.. so the reason for falling back like this here is just that this >>>>> was considered an optional optimization, with the flush in >>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're >>>>> saying means that the current zero range behavior without this series is >>>>> problematic for ext4-on-iomap..? >>>> >>>> Yes. >>>> >>>>> If so, have you observed issues you can share details about? >>>> >>>> Sure. >>>> >>>> Before delving into the specific details of this issue, I would like >>>> to provide some background information on the rule that ext4 cannot >>>> wait for writeback in an active journal handle. If you are aware of >>>> this background, please skip this paragraph. During ext4 writing back >>>> the page cache, it may start a new journal handle to allocate blocks, >>>> update the disksize, and convert unwritten extents after the I/O is >>>> completed. When starting this new journal handle, if the current >>>> running journal transaction is in the process of being submitted or >>>> if the journal space is insufficient, it must wait for the ongoing >>>> transaction to be completed, but the prerequisite for this is that all >>>> currently running handles must be terminated. However, if we flush the >>>> page cache under an active journal handle, we cannot stop it, which >>>> may lead to a deadlock. >>>> >>>> Now, the issue I have observed occurs when I attempt to use >>>> iomap_zero_range() within ext4_block_zero_page_range(). My current >>>> implementation are below(based on the latest fs-next). >>>> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>> index 28547663e4fd..1a21667f3f7c 100644 >>>> --- a/fs/ext4/inode.c >>>> +++ b/fs/ext4/inode.c >>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, >>>> return 0; >>>> } >>>> >>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, >>>> + loff_t length, unsigned int flags, struct iomap *iomap, >>>> + struct iomap *srcmap) >>>> +{ >>>> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); >>>> + struct ext4_map_blocks map; >>>> + u8 blkbits = inode->i_blkbits; >>>> + int ret; >>>> + >>>> + ret = ext4_emergency_state(inode->i_sb); >>>> + if (unlikely(ret)) >>>> + return ret; >>>> + >>>> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) >>>> + return -EINVAL; >>>> + >>>> + /* Calculate the first and last logical blocks respectively. */ >>>> + map.m_lblk = offset >> blkbits; >>>> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, >>>> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; >>>> + >>>> + ret = ext4_map_blocks(NULL, inode, &map, 0); >>>> + if (ret < 0) >>>> + return ret; >>>> + >>>> + /* >>>> + * Look up dirty folios for unwritten mappings within EOF. Providing >>>> + * this bypasses the flush iomap uses to trigger extent conversion >>>> + * when unwritten mappings have dirty pagecache in need of zeroing. 
>>>> + */ >>>> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && >>>> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { >>>> + loff_t end; >>>> + >>>> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, >>>> + map.m_len << blkbits); >>>> + if ((end >> blkbits) < map.m_lblk + map.m_len) >>>> + map.m_len = (end >> blkbits) - map.m_lblk; >>>> + } >>>> + >>>> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); >>>> + return 0; >>>> +} >>>> + >>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { >>>> + .iomap_begin = ext4_iomap_buffered_zero_begin, >>>> +}; >>>> >>>> const struct iomap_ops ext4_iomap_buffered_write_ops = { >>>> .iomap_begin = ext4_iomap_buffered_write_begin, >>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, >>>> return err; >>>> } >>>> >>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, >>>> + loff_t length) >>>> +{ >>>> + WARN_ON_ONCE(!inode_is_locked(inode) && >>>> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); >>>> + >>>> + return iomap_zero_range(inode, from, length, NULL, >>>> + &ext4_iomap_buffered_zero_ops, >>>> + &ext4_iomap_write_ops, NULL); >>>> +} >>>> + >>>> /* >>>> * ext4_block_zero_page_range() zeros out a mapping of length 'length' >>>> * starting from file offset 'from'. The range to be zero'd must >>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, >>>> if (IS_DAX(inode)) { >>>> return dax_zero_range(inode, from, length, NULL, >>>> &ext4_iomap_ops); >>>> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { >>>> + return ext4_iomap_zero_range(inode, from, length); >>>> } >>>> return __ext4_block_zero_page_range(handle, mapping, from, length); >>>> } >>>> >>>> The problem is most calls to ext4_block_zero_page_range() occur under >>>> an active journal handle, so I can reproduce the deadlock issue easily >>>> without this series. >>>> >>>>> >>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what >>>>> the error handling would look like in ext4. Do you expect to the fail >>>>> the higher level operation, for example? Cycle locks and retry, etc.? >>>> >>>> Originally, I wanted ext4_block_zero_page_range() to return a failure >>>> to the higher level operation. However, unfortunately, after my testing >>>> today, I discovered that even though we implement this, this series still >>>> cannot resolve the issue. The corner case is: >>>> >>>> Assume we have a dirty folio covers both hole and unwritten mappings. >>>> >>>> |- dirty folio -| >>>> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten >>>> >>>> If we punch the range of the hole, ext4_punch_hole()-> >>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. >>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio >>>> since the target range is a hole. Finally, iomap_zero_range() will still >>>> flush this whole folio and lead to deadlock during writeback the latter >>>> half of the folio. >>>> >>>>> >>>>> The reason I ask is because the folio_batch handling has come up through >>>>> discussions on this series. My position so far has been to keep it as a >>>>> separate allocation and to keep things simple since it is currently >>>>> isolated to zero range, but that may change if the usage spills over to >>>>> other operations (which seems expected at this point). I suspect that if >>>>> a filesystem actually depends on this for correct behavior, that is >>>>> another data point worth considering on that topic. 
>>>>> >>>>> So that has me wondering if it would be better/easier here to perhaps >>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on >>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer >>>>> there instead of doing the dynamic allocation (then the fill helper >>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on >>>>> something like that? >>>>> >>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems >>>>> to have mostly wound down. Any objection if the fix for that comes along >>>>> as a followup patch rather than a rework of this series? >>>> >>>> It seems that we don't need to modify this series, we need to consider >>>> other solutions to resolve this deadlock issue. >>>> >>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all >>>> instances of ext4_block_zero_page_range() out of the running journal >>>> handle(please see patch 19-21). But I don't think this is a good solution >>>> since it's complex and fragile. Besides, after commit c7fc0366c6562 >>>> ("ext4: partial zero eof block on unaligned inode size extension"), you >>>> added more invocations of ext4_zero_partial_blocks(), and the situation >>>> has become more complicated (Althrough I think the calls in the three >>>> write_end callbacks can be removed). >>>> >>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios >>>> over unwritten mappings before zeroing partial blocks. This is because >>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in >>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is >>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it >>>> must has already been zeroed. Was I missing something? Therefore, I was >>>> wondering if there are any ways to prevent flushing in >>>> iomap_zero_range()? Any ideas? >>>> >>> >>> The commit 7d9b474ee4cc ("iomap: make zero range flush conditional on >>> unwritten mappings") mentioned the following: >>> >>> iomap_zero_range() flushes pagecache to mitigate consistency >>> problems with dirty pagecache and unwritten mappings. The flush is >>> unconditional over the entire range because checking pagecache state >>> after mapping lookup is racy with writeback and reclaim. There are >>> ways around this using iomap's mapping revalidation mechanism, but >>> this is not supported by all iomap based filesystems and so is not a >>> generic solution. >>> >>> Does the revalidation mechanism here refer to verifying the validity of >>> the mapping through iomap_write_ops->iomap_valid()? IIUC, does this mean >>> that if the filesystem implement the iomap_valid() interface, we can >>> always avoid the iomap_zero_range() from flushing dirty folios back? 
>>> Something like below: >>> >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>> index 73772d34f502..ba71a6ed2f77 100644 >>> --- a/fs/iomap/buffered-io.c >>> +++ b/fs/iomap/buffered-io.c >>> @@ -1522,7 +1522,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, >>> >>> if (range_dirty) { >>> range_dirty = false; >>> - status = iomap_zero_iter_flush_and_stale(&iter); >>> + if (write_ops->iomap_valid) >>> + status = iomap_zero_iter(&iter, did_zero, write_ops); >>> + else >>> + status = iomap_zero_iter_flush_and_stale(&iter); >>> } else { >>> status = iomap_iter_advance_full(&iter); >>> } >>> >> >> The above diff will trigger >> WARN_ON_ONCE(folio_pos(folio) > iter->inode->i_size) in iomap_zero_iter() >> on XFS. I revised the 'diff' and ran xfstests with several main configs >> on both XFS and ext4(with iomap infrastructure), and everything seems to >> be working fine so far. What do you think? >> > > A couple things here.. First, I don't think it's quite enough to assume > zeroing is safe just because a revalidation callback is defined. Sorry, I can't think of any problems with this. Could you please provide a more detailed explanation along with an example? > I have > multiple (old) prototypes around fixing zero range, one of which was > centered around using ->iomap_valid(), so I do believe it's technically > possible with further changes. That's what the above "There are ways > around this using the revalidation mechanism" text was referring to > generally. > > That said, I don't think going down that path is the right solution > here. I opted for the folio batch approach in favor of that one because > it is more generic, for one. Based on a lot of the discussion around > this series, it also seems to be more broadly useful. For example, there > is potential to use for other operations, including buffered writes. Hmm, I didn't follow your discussion so I don't get this. How does buffered writes utilize this folio batch mechanism, and what are its benefits? Could you please give more infomation? > If > that pans out (no guarantees of course), then the fill thing becomes > more of a generic iomap step vs. something called by the fs. That likely > means the flush this is trying to work around can also go away entirely. It would be great if we could completely avoid this flush. I look forward to seeing it. > I.e., the flush is really just a fallback mechanism for fs' that don't > care or know any better, since there is no guarantee the callback fills > the batch and a flush is better than silent data corruption. So I'd> really like to avoid trying to reinvent things here if at all possible. Yes, this makes sense to me. > I'm curious if the tweaks proposed in the previous reply would be > sufficient for ext4. If so, I can dig into that after rolling the next > version of this series.. > I believe it is working now. Please see my reply for details. Thanks, Yi. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-19 11:07 ` Zhang Yi 2025-07-21 8:47 ` Zhang Yi @ 2025-07-30 13:17 ` Brian Foster 2025-08-02 7:19 ` Zhang Yi 1 sibling, 1 reply; 13+ messages in thread From: Brian Foster @ 2025-07-30 13:17 UTC (permalink / raw) To: Zhang Yi Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote: > On 2025/7/18 21:48, Brian Foster wrote: > > On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: > >> On 2025/7/15 13:22, Darrick J. Wong wrote: > >>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: > >>>> The only way zero range can currently process unwritten mappings > >>>> with dirty pagecache is to check whether the range is dirty before > >>>> mapping lookup and then flush when at least one underlying mapping > >>>> is unwritten. This ordering is required to prevent iomap lookup from > >>>> racing with folio writeback and reclaim. > >>>> > >>>> Since zero range can skip ranges of unwritten mappings that are > >>>> clean in cache, this operation can be improved by allowing the > >>>> filesystem to provide a set of dirty folios that require zeroing. In > >>>> turn, rather than flush or iterate file offsets, zero range can > >>>> iterate on folios in the batch and advance over clean or uncached > >>>> ranges in between. > >>>> > >>>> Add a folio_batch in struct iomap and provide a helper for fs' to > >>> > >>> /me confused by the single quote; is this supposed to read: > >>> > >>> "...for the fs to populate..."? > >>> > >>> Either way the code changes look like a reasonable thing to do for the > >>> pagecache (try to grab a bunch of dirty folios while XFS holds the > >>> mapping lock) so > >>> > >>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > >>> > >>> --D > >>> > >>> > >>>> populate the batch at lookup time. Update the folio lookup path to > >>>> return the next folio in the batch, if provided, and advance the > >>>> iter if the folio starts beyond the current offset. > >>>> > >>>> Signed-off-by: Brian Foster <bfoster@redhat.com> > >>>> Reviewed-by: Christoph Hellwig <hch@lst.de> > >>>> --- > >>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- > >>>> fs/iomap/iter.c | 6 +++ > >>>> include/linux/iomap.h | 4 ++ > >>>> 3 files changed, 94 insertions(+), 5 deletions(-) > >>>> > >>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >>>> index 38da2fa6e6b0..194e3cc0857f 100644 > >>>> --- a/fs/iomap/buffered-io.c > >>>> +++ b/fs/iomap/buffered-io.c > >> [...] > >>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > >>>> return status; > >>>> } > >>>> > >>>> +loff_t > >>>> +iomap_fill_dirty_folios( > >>>> + struct iomap_iter *iter, > >>>> + loff_t offset, > >>>> + loff_t length) > >>>> +{ > >>>> + struct address_space *mapping = iter->inode->i_mapping; > >>>> + pgoff_t start = offset >> PAGE_SHIFT; > >>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; > >>>> + > >>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); > >>>> + if (!iter->fbatch) > >> > >> Hi, Brian! > >> > >> I think ext4 needs to be aware of this failure after it converts to use > >> iomap infrastructure. It is because if we fail to add dirty folios to the > >> fbatch, iomap_zero_range() will flush those unwritten and dirty range. 
> >> This could potentially lead to a deadlock, as most calls to > >> ext4_block_zero_page_range() occur under an active journal handle. > >> Writeback operations under an active journal handle may result in circular > >> waiting within journal transactions. So please return this error code, and > >> then ext4 can interrupt zero operations to prevent deadlock. > >> > > > > Hi Yi, > > > > Thanks for looking at this. > > > > Huh.. so the reason for falling back like this here is just that this > > was considered an optional optimization, with the flush in > > iomap_zero_range() being default fallback behavior. IIUC, what you're > > saying means that the current zero range behavior without this series is > > problematic for ext4-on-iomap..? > > Yes. > > > If so, have you observed issues you can share details about? > > Sure. > > Before delving into the specific details of this issue, I would like > to provide some background information on the rule that ext4 cannot > wait for writeback in an active journal handle. If you are aware of > this background, please skip this paragraph. During ext4 writing back > the page cache, it may start a new journal handle to allocate blocks, > update the disksize, and convert unwritten extents after the I/O is > completed. When starting this new journal handle, if the current > running journal transaction is in the process of being submitted or > if the journal space is insufficient, it must wait for the ongoing > transaction to be completed, but the prerequisite for this is that all > currently running handles must be terminated. However, if we flush the > page cache under an active journal handle, we cannot stop it, which > may lead to a deadlock. > Ok, makes sense. > Now, the issue I have observed occurs when I attempt to use > iomap_zero_range() within ext4_block_zero_page_range(). My current > implementation are below(based on the latest fs-next). > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 28547663e4fd..1a21667f3f7c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, > return 0; > } > > +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, > + loff_t length, unsigned int flags, struct iomap *iomap, > + struct iomap *srcmap) > +{ > + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); > + struct ext4_map_blocks map; > + u8 blkbits = inode->i_blkbits; > + int ret; > + > + ret = ext4_emergency_state(inode->i_sb); > + if (unlikely(ret)) > + return ret; > + > + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) > + return -EINVAL; > + > + /* Calculate the first and last logical blocks respectively. */ > + map.m_lblk = offset >> blkbits; > + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, > + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; > + > + ret = ext4_map_blocks(NULL, inode, &map, 0); > + if (ret < 0) > + return ret; > + > + /* > + * Look up dirty folios for unwritten mappings within EOF. Providing > + * this bypasses the flush iomap uses to trigger extent conversion > + * when unwritten mappings have dirty pagecache in need of zeroing. 
> + */ > + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && > + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { > + loff_t end; > + > + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, > + map.m_len << blkbits); > + if ((end >> blkbits) < map.m_lblk + map.m_len) > + map.m_len = (end >> blkbits) - map.m_lblk; > + } > + > + ext4_set_iomap(inode, iomap, &map, offset, length, flags); > + return 0; > +} > + > +const struct iomap_ops ext4_iomap_buffered_zero_ops = { > + .iomap_begin = ext4_iomap_buffered_zero_begin, > +}; > > const struct iomap_ops ext4_iomap_buffered_write_ops = { > .iomap_begin = ext4_iomap_buffered_write_begin, > @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, > return err; > } > > +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, > + loff_t length) > +{ > + WARN_ON_ONCE(!inode_is_locked(inode) && > + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); > + > + return iomap_zero_range(inode, from, length, NULL, > + &ext4_iomap_buffered_zero_ops, > + &ext4_iomap_write_ops, NULL); > +} > + > /* > * ext4_block_zero_page_range() zeros out a mapping of length 'length' > * starting from file offset 'from'. The range to be zero'd must > @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, > if (IS_DAX(inode)) { > return dax_zero_range(inode, from, length, NULL, > &ext4_iomap_ops); > + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { > + return ext4_iomap_zero_range(inode, from, length); > } > return __ext4_block_zero_page_range(handle, mapping, from, length); > } > > The problem is most calls to ext4_block_zero_page_range() occur under > an active journal handle, so I can reproduce the deadlock issue easily > without this series. > > > > > FWIW, I think your suggestion is reasonable, but I'm also curious what > > the error handling would look like in ext4. Do you expect to the fail > > the higher level operation, for example? Cycle locks and retry, etc.? > > Originally, I wanted ext4_block_zero_page_range() to return a failure > to the higher level operation. However, unfortunately, after my testing > today, I discovered that even though we implement this, this series still > cannot resolve the issue. The corner case is: > > Assume we have a dirty folio covers both hole and unwritten mappings. > > |- dirty folio -| > [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten > > If we punch the range of the hole, ext4_punch_hole()-> > ext4_zero_partial_blocks() will zero out the first half of the dirty folio. > Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio > since the target range is a hole. Finally, iomap_zero_range() will still > flush this whole folio and lead to deadlock during writeback the latter > half of the folio. > Hmm.. Ok. So it seems there are at least a couple ways around this particular quirk. I suspect one is that you could just call the fill helper in the hole case as well, but that's kind of a hack and not really intended use. The other way goes back to the fact that the flush for the hole case was kind of a corner case hack in the first place. The original comment for that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap: make zero range flush conditional on unwritten mappings") for reference to the original intent. I'd have to go back and investigate if something regresses with that taken out, but my recollection is that was something that needed proper fixing eventually anyways. 
I'm particularly wondering if that is no longer an issue now that pagecache_isize_extended() handles the post-eof zeroing (the caveat being we might just need to call it in some additional size extension cases besides just setattr/truncate). > > > > The reason I ask is because the folio_batch handling has come up through > > discussions on this series. My position so far has been to keep it as a > > separate allocation and to keep things simple since it is currently > > isolated to zero range, but that may change if the usage spills over to > > other operations (which seems expected at this point). I suspect that if > > a filesystem actually depends on this for correct behavior, that is > > another data point worth considering on that topic. > > > > So that has me wondering if it would be better/easier here to perhaps > > embed the batch in iomap_iter, or maybe as an incremental step put it on > > the stack in iomap_zero_range() and initialize the iomap_iter pointer > > there instead of doing the dynamic allocation (then the fill helper > > would set a flag to indicate the fs did pagecache lookup). Thoughts on > > something like that? > > > > Also IIUC ext4-on-iomap is still a WIP and review on this series seems > > to have mostly wound down. Any objection if the fix for that comes along > > as a followup patch rather than a rework of this series? > > It seems that we don't need to modify this series, we need to consider > other solutions to resolve this deadlock issue. > > In my v1 ext4-on-iomap series [1], I resolved this issue by moving all > instances of ext4_block_zero_page_range() out of the running journal > handle(please see patch 19-21). But I don't think this is a good solution > since it's complex and fragile. Besides, after commit c7fc0366c6562 > ("ext4: partial zero eof block on unaligned inode size extension"), you > added more invocations of ext4_zero_partial_blocks(), and the situation > has become more complicated (Althrough I think the calls in the three > write_end callbacks can be removed). > > Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios > over unwritten mappings before zeroing partial blocks. This is because > ext4 always zeroes the in-memory page cache before zeroing(e.g, in > ext4_setattr() and ext4_punch_hole()), it means if the target range is > still dirty and unwritten when calling ext4_block_zero_page_range(), it > must has already been zeroed. Was I missing something? Therefore, I was > wondering if there are any ways to prevent flushing in > iomap_zero_range()? Any ideas? It's certainly possible that the quirk fixed by the flush the hole case was never a problem on ext4, if that's what you mean. Most of the testing for this was on XFS since ext4 hadn't used iomap for buffered writes. At the end of the day, the batch mechanism is intended to facilitate avoiding the flush entirely. I'm still paging things back in here.. but if we had two smallish changes to this code path to 1. eliminate the dynamic folio_batch allocation and 2. drop the flush on hole mapping case, would that address the issues with iomap zero range for ext4? > > [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-1-yi.zhang@huaweicloud.com/ > > > > > Brian > > > > P.S., I'm heading on vacation so it will likely be a week or two before > > I follow up from here, JFYI. > > Wishing you a wonderful time! :-) Thanks! Brian > > Best regards, > Yi. 
> > >> > >>>> + return offset + length; > >>>> + folio_batch_init(iter->fbatch); > >>>> + > >>>> + filemap_get_folios_dirty(mapping, &start, end, iter->fbatch); > >>>> + return (start << PAGE_SHIFT); > >>>> +} > >>>> +EXPORT_SYMBOL_GPL(iomap_fill_dirty_folios); > >> > ^ permalink raw reply [flat|nested] 13+ messages in thread
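A rough sketch of the "eliminate the dynamic folio_batch allocation" option floated above, assuming the batch lives in struct iomap_iter (or on iomap_zero_range()'s stack with the iter pointing at it) and a made-up fbatch_filled field records that the filesystem did the pagecache lookup; neither the embedded batch nor the flag exists in the posted series.

loff_t
iomap_fill_dirty_folios(
	struct iomap_iter	*iter,
	loff_t			offset,
	loff_t			length)
{
	struct address_space	*mapping = iter->inode->i_mapping;
	pgoff_t			start = offset >> PAGE_SHIFT;
	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;

	folio_batch_init(&iter->fbatch);	/* assumed embedded batch */
	iter->fbatch_filled = true;		/* hypothetical flag */

	filemap_get_folios_dirty(mapping, &start, end, &iter->fbatch);
	return (loff_t)start << PAGE_SHIFT;
}

With no allocation there is no failure path, so the -ENOMEM concern that started this subthread (the flush fallback kicking in under an active journal handle) would presumably not arise for filesystems that call the helper.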
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-07-30 13:17 ` Brian Foster @ 2025-08-02 7:19 ` Zhang Yi 2025-08-05 13:08 ` Brian Foster 0 siblings, 1 reply; 13+ messages in thread From: Zhang Yi @ 2025-08-02 7:19 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/7/30 21:17, Brian Foster wrote: > On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote: >> On 2025/7/18 21:48, Brian Foster wrote: >>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >>>> On 2025/7/15 13:22, Darrick J. Wong wrote: >>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>>>> The only way zero range can currently process unwritten mappings >>>>>> with dirty pagecache is to check whether the range is dirty before >>>>>> mapping lookup and then flush when at least one underlying mapping >>>>>> is unwritten. This ordering is required to prevent iomap lookup from >>>>>> racing with folio writeback and reclaim. >>>>>> >>>>>> Since zero range can skip ranges of unwritten mappings that are >>>>>> clean in cache, this operation can be improved by allowing the >>>>>> filesystem to provide a set of dirty folios that require zeroing. In >>>>>> turn, rather than flush or iterate file offsets, zero range can >>>>>> iterate on folios in the batch and advance over clean or uncached >>>>>> ranges in between. >>>>>> >>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>>>> >>>>> /me confused by the single quote; is this supposed to read: >>>>> >>>>> "...for the fs to populate..."? >>>>> >>>>> Either way the code changes look like a reasonable thing to do for the >>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>>>> mapping lock) so >>>>> >>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>>>> >>>>> --D >>>>> >>>>> >>>>>> populate the batch at lookup time. Update the folio lookup path to >>>>>> return the next folio in the batch, if provided, and advance the >>>>>> iter if the folio starts beyond the current offset. >>>>>> >>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>>> --- >>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>>>> fs/iomap/iter.c | 6 +++ >>>>>> include/linux/iomap.h | 4 ++ >>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>>>> >>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>>>> --- a/fs/iomap/buffered-io.c >>>>>> +++ b/fs/iomap/buffered-io.c >>>> [...] >>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>>>> return status; >>>>>> } >>>>>> >>>>>> +loff_t >>>>>> +iomap_fill_dirty_folios( >>>>>> + struct iomap_iter *iter, >>>>>> + loff_t offset, >>>>>> + loff_t length) >>>>>> +{ >>>>>> + struct address_space *mapping = iter->inode->i_mapping; >>>>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>>>> + >>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>>>> + if (!iter->fbatch) >>>> >>>> Hi, Brian! >>>> >>>> I think ext4 needs to be aware of this failure after it converts to use >>>> iomap infrastructure. It is because if we fail to add dirty folios to the >>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. 
>>>> This could potentially lead to a deadlock, as most calls to >>>> ext4_block_zero_page_range() occur under an active journal handle. >>>> Writeback operations under an active journal handle may result in circular >>>> waiting within journal transactions. So please return this error code, and >>>> then ext4 can interrupt zero operations to prevent deadlock. >>>> >>> >>> Hi Yi, >>> >>> Thanks for looking at this. >>> >>> Huh.. so the reason for falling back like this here is just that this >>> was considered an optional optimization, with the flush in >>> iomap_zero_range() being default fallback behavior. IIUC, what you're >>> saying means that the current zero range behavior without this series is >>> problematic for ext4-on-iomap..? >> >> Yes. >> >>> If so, have you observed issues you can share details about? >> >> Sure. >> >> Before delving into the specific details of this issue, I would like >> to provide some background information on the rule that ext4 cannot >> wait for writeback in an active journal handle. If you are aware of >> this background, please skip this paragraph. During ext4 writing back >> the page cache, it may start a new journal handle to allocate blocks, >> update the disksize, and convert unwritten extents after the I/O is >> completed. When starting this new journal handle, if the current >> running journal transaction is in the process of being submitted or >> if the journal space is insufficient, it must wait for the ongoing >> transaction to be completed, but the prerequisite for this is that all >> currently running handles must be terminated. However, if we flush the >> page cache under an active journal handle, we cannot stop it, which >> may lead to a deadlock. >> > > Ok, makes sense. > >> Now, the issue I have observed occurs when I attempt to use >> iomap_zero_range() within ext4_block_zero_page_range(). My current >> implementation are below(based on the latest fs-next). >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 28547663e4fd..1a21667f3f7c 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, >> return 0; >> } >> >> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, >> + loff_t length, unsigned int flags, struct iomap *iomap, >> + struct iomap *srcmap) >> +{ >> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); >> + struct ext4_map_blocks map; >> + u8 blkbits = inode->i_blkbits; >> + int ret; >> + >> + ret = ext4_emergency_state(inode->i_sb); >> + if (unlikely(ret)) >> + return ret; >> + >> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) >> + return -EINVAL; >> + >> + /* Calculate the first and last logical blocks respectively. */ >> + map.m_lblk = offset >> blkbits; >> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, >> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; >> + >> + ret = ext4_map_blocks(NULL, inode, &map, 0); >> + if (ret < 0) >> + return ret; >> + >> + /* >> + * Look up dirty folios for unwritten mappings within EOF. Providing >> + * this bypasses the flush iomap uses to trigger extent conversion >> + * when unwritten mappings have dirty pagecache in need of zeroing. 
>> + */ >> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && >> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { >> + loff_t end; >> + >> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, >> + map.m_len << blkbits); >> + if ((end >> blkbits) < map.m_lblk + map.m_len) >> + map.m_len = (end >> blkbits) - map.m_lblk; >> + } >> + >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); >> + return 0; >> +} >> + >> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { >> + .iomap_begin = ext4_iomap_buffered_zero_begin, >> +}; >> >> const struct iomap_ops ext4_iomap_buffered_write_ops = { >> .iomap_begin = ext4_iomap_buffered_write_begin, >> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, >> return err; >> } >> >> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, >> + loff_t length) >> +{ >> + WARN_ON_ONCE(!inode_is_locked(inode) && >> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); >> + >> + return iomap_zero_range(inode, from, length, NULL, >> + &ext4_iomap_buffered_zero_ops, >> + &ext4_iomap_write_ops, NULL); >> +} >> + >> /* >> * ext4_block_zero_page_range() zeros out a mapping of length 'length' >> * starting from file offset 'from'. The range to be zero'd must >> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, >> if (IS_DAX(inode)) { >> return dax_zero_range(inode, from, length, NULL, >> &ext4_iomap_ops); >> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { >> + return ext4_iomap_zero_range(inode, from, length); >> } >> return __ext4_block_zero_page_range(handle, mapping, from, length); >> } >> >> The problem is most calls to ext4_block_zero_page_range() occur under >> an active journal handle, so I can reproduce the deadlock issue easily >> without this series. >> >>> >>> FWIW, I think your suggestion is reasonable, but I'm also curious what >>> the error handling would look like in ext4. Do you expect to the fail >>> the higher level operation, for example? Cycle locks and retry, etc.? >> >> Originally, I wanted ext4_block_zero_page_range() to return a failure >> to the higher level operation. However, unfortunately, after my testing >> today, I discovered that even though we implement this, this series still >> cannot resolve the issue. The corner case is: >> >> Assume we have a dirty folio covers both hole and unwritten mappings. >> >> |- dirty folio -| >> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten >> >> If we punch the range of the hole, ext4_punch_hole()-> >> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. >> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio >> since the target range is a hole. Finally, iomap_zero_range() will still >> flush this whole folio and lead to deadlock during writeback the latter >> half of the folio. >> > > Hmm.. Ok. So it seems there are at least a couple ways around this > particular quirk. I suspect one is that you could just call the fill > helper in the hole case as well, but that's kind of a hack and not > really intended use. > > The other way goes back to the fact that the flush for the hole case was > kind of a corner case hack in the first place. The original comment for > that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap: > make zero range flush conditional on unwritten mappings") for reference > to the original intent. 
> > I'd have to go back and investigate if something regresses with that > taken out, but my recollection is that was something that needed proper > fixing eventually anyways. I'm particularly wondering if that is no > longer an issue now that pagecache_isize_extended() handles the post-eof > zeroing (the caveat being we might just need to call it in some > additional size extension cases besides just setattr/truncate). Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in pagecache_isize_extended() should work. > >>> >>> The reason I ask is because the folio_batch handling has come up through >>> discussions on this series. My position so far has been to keep it as a >>> separate allocation and to keep things simple since it is currently >>> isolated to zero range, but that may change if the usage spills over to >>> other operations (which seems expected at this point). I suspect that if >>> a filesystem actually depends on this for correct behavior, that is >>> another data point worth considering on that topic. >>> >>> So that has me wondering if it would be better/easier here to perhaps >>> embed the batch in iomap_iter, or maybe as an incremental step put it on >>> the stack in iomap_zero_range() and initialize the iomap_iter pointer >>> there instead of doing the dynamic allocation (then the fill helper >>> would set a flag to indicate the fs did pagecache lookup). Thoughts on >>> something like that? >>> >>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems >>> to have mostly wound down. Any objection if the fix for that comes along >>> as a followup patch rather than a rework of this series? >> >> It seems that we don't need to modify this series, we need to consider >> other solutions to resolve this deadlock issue. >> >> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all >> instances of ext4_block_zero_page_range() out of the running journal >> handle(please see patch 19-21). But I don't think this is a good solution >> since it's complex and fragile. Besides, after commit c7fc0366c6562 >> ("ext4: partial zero eof block on unaligned inode size extension"), you >> added more invocations of ext4_zero_partial_blocks(), and the situation >> has become more complicated (Althrough I think the calls in the three >> write_end callbacks can be removed). >> >> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios >> over unwritten mappings before zeroing partial blocks. This is because >> ext4 always zeroes the in-memory page cache before zeroing(e.g, in >> ext4_setattr() and ext4_punch_hole()), it means if the target range is >> still dirty and unwritten when calling ext4_block_zero_page_range(), it >> must has already been zeroed. Was I missing something? Therefore, I was >> wondering if there are any ways to prevent flushing in >> iomap_zero_range()? Any ideas? > > It's certainly possible that the quirk fixed by the flush the hole case > was never a problem on ext4, if that's what you mean. Most of the > testing for this was on XFS since ext4 hadn't used iomap for buffered > writes. > > At the end of the day, the batch mechanism is intended to facilitate > avoiding the flush entirely. I'm still paging things back in here.. but > if we had two smallish changes to this code path to 1. eliminate the > dynamic folio_batch allocation and 2. drop the flush on hole mapping > case, would that address the issues with iomap zero range for ext4? > Thank you for looking at this! 
I made a simple modification to the iomap_zero_range() function based on
the second solution you mentioned, then tested it with kvm-xfstests over
the past few days. This solution works fine on ext4 and I haven't found
any other risks so far. (Since my testing environment has sufficient
memory, I have not yet handled the case of memory allocation failure.)

--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		     srcmap->type == IOMAP_UNWRITTEN)) {
 			s64 status;
 
-			if (range_dirty) {
+			if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
 				range_dirty = false;
 				status = iomap_zero_iter_flush_and_stale(&iter);
 			} else {
 				status = iomap_iter_advance_full(&iter);

Another thing I want to mention (although there are no real issues at the
moment) is that there appears to be no consistency guarantee between the
mapping lookup and the folio_batch. For example, assume we have a file
that contains two dirty folios and two unwritten extents, one folio per
extent, and we zero out both folios:

| dirty folio 1 || dirty folio 2 |
[uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu]

In the first call to ->iomap_begin(), we get unwritten extent 1. At the
same time, another thread writes back folio 1 and cleans it, so this
folio will not be added to the folio_batch. Then iomap_zero_range() will
still flush those two folios. When flushing the second folio, there is
still a risk of deadlock due to changes in metadata. However, since ext4
currently uses this interface only to zero out partial blocks, this
situation will not happen, but if the usage changes in the future, we
should be very careful about this point. So in the future, I hope to have
a more reliable method to avoid flushing in iomap_zero_range().

Therefore, at the moment, I think solving the problem through these two
points is feasible (I hope I haven't missed anything :-) ), though it is
somewhat fragile. What do other ext4 developers think?

Thanks,
Yi.

^ permalink raw reply	[flat|nested] 13+ messages in thread
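For context, Yi's one-liner lands in the hole/unwritten fallback branch of iomap_zero_range() quoted earlier in the thread; with it applied the branch would read roughly as below (a sketch of the resulting flow, not a posted patch).

	if (!iter.fbatch &&
	    (srcmap->type == IOMAP_HOLE ||
	     srcmap->type == IOMAP_UNWRITTEN)) {
		s64 status;

		/*
		 * Only a dirty range over an unwritten mapping still takes
		 * the flush-and-stale fallback; a dirty range over a hole is
		 * advanced past instead of flushed, which is the case that
		 * deadlocked under an active ext4 journal handle.
		 */
		if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) {
			range_dirty = false;
			status = iomap_zero_iter_flush_and_stale(&iter);
		} else {
			status = iomap_iter_advance_full(&iter);
		}
		iter.status = status;
		continue;
	}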
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-08-02 7:19 ` Zhang Yi @ 2025-08-05 13:08 ` Brian Foster 2025-08-06 3:10 ` Zhang Yi 0 siblings, 1 reply; 13+ messages in thread From: Brian Foster @ 2025-08-05 13:08 UTC (permalink / raw) To: Zhang Yi Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote: > On 2025/7/30 21:17, Brian Foster wrote: > > On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote: > >> On 2025/7/18 21:48, Brian Foster wrote: > >>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: > >>>> On 2025/7/15 13:22, Darrick J. Wong wrote: > >>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: > >>>>>> The only way zero range can currently process unwritten mappings > >>>>>> with dirty pagecache is to check whether the range is dirty before > >>>>>> mapping lookup and then flush when at least one underlying mapping > >>>>>> is unwritten. This ordering is required to prevent iomap lookup from > >>>>>> racing with folio writeback and reclaim. > >>>>>> > >>>>>> Since zero range can skip ranges of unwritten mappings that are > >>>>>> clean in cache, this operation can be improved by allowing the > >>>>>> filesystem to provide a set of dirty folios that require zeroing. In > >>>>>> turn, rather than flush or iterate file offsets, zero range can > >>>>>> iterate on folios in the batch and advance over clean or uncached > >>>>>> ranges in between. > >>>>>> > >>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to > >>>>> > >>>>> /me confused by the single quote; is this supposed to read: > >>>>> > >>>>> "...for the fs to populate..."? > >>>>> > >>>>> Either way the code changes look like a reasonable thing to do for the > >>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the > >>>>> mapping lock) so > >>>>> > >>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > >>>>> > >>>>> --D > >>>>> > >>>>> > >>>>>> populate the batch at lookup time. Update the folio lookup path to > >>>>>> return the next folio in the batch, if provided, and advance the > >>>>>> iter if the folio starts beyond the current offset. > >>>>>> > >>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> > >>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> > >>>>>> --- > >>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- > >>>>>> fs/iomap/iter.c | 6 +++ > >>>>>> include/linux/iomap.h | 4 ++ > >>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) > >>>>>> > >>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 > >>>>>> --- a/fs/iomap/buffered-io.c > >>>>>> +++ b/fs/iomap/buffered-io.c > >>>> [...] > >>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > >>>>>> return status; > >>>>>> } > >>>>>> > >>>>>> +loff_t > >>>>>> +iomap_fill_dirty_folios( > >>>>>> + struct iomap_iter *iter, > >>>>>> + loff_t offset, > >>>>>> + loff_t length) > >>>>>> +{ > >>>>>> + struct address_space *mapping = iter->inode->i_mapping; > >>>>>> + pgoff_t start = offset >> PAGE_SHIFT; > >>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; > >>>>>> + > >>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); > >>>>>> + if (!iter->fbatch) > >>>> > >>>> Hi, Brian! > >>>> > >>>> I think ext4 needs to be aware of this failure after it converts to use > >>>> iomap infrastructure. 
It is because if we fail to add dirty folios to the > >>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. > >>>> This could potentially lead to a deadlock, as most calls to > >>>> ext4_block_zero_page_range() occur under an active journal handle. > >>>> Writeback operations under an active journal handle may result in circular > >>>> waiting within journal transactions. So please return this error code, and > >>>> then ext4 can interrupt zero operations to prevent deadlock. > >>>> > >>> > >>> Hi Yi, > >>> > >>> Thanks for looking at this. > >>> > >>> Huh.. so the reason for falling back like this here is just that this > >>> was considered an optional optimization, with the flush in > >>> iomap_zero_range() being default fallback behavior. IIUC, what you're > >>> saying means that the current zero range behavior without this series is > >>> problematic for ext4-on-iomap..? > >> > >> Yes. > >> > >>> If so, have you observed issues you can share details about? > >> > >> Sure. > >> > >> Before delving into the specific details of this issue, I would like > >> to provide some background information on the rule that ext4 cannot > >> wait for writeback in an active journal handle. If you are aware of > >> this background, please skip this paragraph. During ext4 writing back > >> the page cache, it may start a new journal handle to allocate blocks, > >> update the disksize, and convert unwritten extents after the I/O is > >> completed. When starting this new journal handle, if the current > >> running journal transaction is in the process of being submitted or > >> if the journal space is insufficient, it must wait for the ongoing > >> transaction to be completed, but the prerequisite for this is that all > >> currently running handles must be terminated. However, if we flush the > >> page cache under an active journal handle, we cannot stop it, which > >> may lead to a deadlock. > >> > > > > Ok, makes sense. > > > >> Now, the issue I have observed occurs when I attempt to use > >> iomap_zero_range() within ext4_block_zero_page_range(). My current > >> implementation are below(based on the latest fs-next). > >> > >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > >> index 28547663e4fd..1a21667f3f7c 100644 > >> --- a/fs/ext4/inode.c > >> +++ b/fs/ext4/inode.c > >> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, > >> return 0; > >> } > >> > >> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, > >> + loff_t length, unsigned int flags, struct iomap *iomap, > >> + struct iomap *srcmap) > >> +{ > >> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); > >> + struct ext4_map_blocks map; > >> + u8 blkbits = inode->i_blkbits; > >> + int ret; > >> + > >> + ret = ext4_emergency_state(inode->i_sb); > >> + if (unlikely(ret)) > >> + return ret; > >> + > >> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) > >> + return -EINVAL; > >> + > >> + /* Calculate the first and last logical blocks respectively. */ > >> + map.m_lblk = offset >> blkbits; > >> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, > >> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; > >> + > >> + ret = ext4_map_blocks(NULL, inode, &map, 0); > >> + if (ret < 0) > >> + return ret; > >> + > >> + /* > >> + * Look up dirty folios for unwritten mappings within EOF. 
Providing > >> + * this bypasses the flush iomap uses to trigger extent conversion > >> + * when unwritten mappings have dirty pagecache in need of zeroing. > >> + */ > >> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && > >> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { > >> + loff_t end; > >> + > >> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, > >> + map.m_len << blkbits); > >> + if ((end >> blkbits) < map.m_lblk + map.m_len) > >> + map.m_len = (end >> blkbits) - map.m_lblk; > >> + } > >> + > >> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); > >> + return 0; > >> +} > >> + > >> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { > >> + .iomap_begin = ext4_iomap_buffered_zero_begin, > >> +}; > >> > >> const struct iomap_ops ext4_iomap_buffered_write_ops = { > >> .iomap_begin = ext4_iomap_buffered_write_begin, > >> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, > >> return err; > >> } > >> > >> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, > >> + loff_t length) > >> +{ > >> + WARN_ON_ONCE(!inode_is_locked(inode) && > >> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); > >> + > >> + return iomap_zero_range(inode, from, length, NULL, > >> + &ext4_iomap_buffered_zero_ops, > >> + &ext4_iomap_write_ops, NULL); > >> +} > >> + > >> /* > >> * ext4_block_zero_page_range() zeros out a mapping of length 'length' > >> * starting from file offset 'from'. The range to be zero'd must > >> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, > >> if (IS_DAX(inode)) { > >> return dax_zero_range(inode, from, length, NULL, > >> &ext4_iomap_ops); > >> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { > >> + return ext4_iomap_zero_range(inode, from, length); > >> } > >> return __ext4_block_zero_page_range(handle, mapping, from, length); > >> } > >> > >> The problem is most calls to ext4_block_zero_page_range() occur under > >> an active journal handle, so I can reproduce the deadlock issue easily > >> without this series. > >> > >>> > >>> FWIW, I think your suggestion is reasonable, but I'm also curious what > >>> the error handling would look like in ext4. Do you expect to the fail > >>> the higher level operation, for example? Cycle locks and retry, etc.? > >> > >> Originally, I wanted ext4_block_zero_page_range() to return a failure > >> to the higher level operation. However, unfortunately, after my testing > >> today, I discovered that even though we implement this, this series still > >> cannot resolve the issue. The corner case is: > >> > >> Assume we have a dirty folio covers both hole and unwritten mappings. > >> > >> |- dirty folio -| > >> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten > >> > >> If we punch the range of the hole, ext4_punch_hole()-> > >> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. > >> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio > >> since the target range is a hole. Finally, iomap_zero_range() will still > >> flush this whole folio and lead to deadlock during writeback the latter > >> half of the folio. > >> > > > > Hmm.. Ok. So it seems there are at least a couple ways around this > > particular quirk. I suspect one is that you could just call the fill > > helper in the hole case as well, but that's kind of a hack and not > > really intended use. 
> > > > The other way goes back to the fact that the flush for the hole case was > > kind of a corner case hack in the first place. The original comment for > > that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap: > > make zero range flush conditional on unwritten mappings") for reference > > to the original intent. > > > > I'd have to go back and investigate if something regresses with that > > taken out, but my recollection is that was something that needed proper > > fixing eventually anyways. I'm particularly wondering if that is no > > longer an issue now that pagecache_isize_extended() handles the post-eof > > zeroing (the caveat being we might just need to call it in some > > additional size extension cases besides just setattr/truncate). > > Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in > pagecache_isize_extended() should work. > Ok.. > > > >>> > >>> The reason I ask is because the folio_batch handling has come up through > >>> discussions on this series. My position so far has been to keep it as a > >>> separate allocation and to keep things simple since it is currently > >>> isolated to zero range, but that may change if the usage spills over to > >>> other operations (which seems expected at this point). I suspect that if > >>> a filesystem actually depends on this for correct behavior, that is > >>> another data point worth considering on that topic. > >>> > >>> So that has me wondering if it would be better/easier here to perhaps > >>> embed the batch in iomap_iter, or maybe as an incremental step put it on > >>> the stack in iomap_zero_range() and initialize the iomap_iter pointer > >>> there instead of doing the dynamic allocation (then the fill helper > >>> would set a flag to indicate the fs did pagecache lookup). Thoughts on > >>> something like that? > >>> > >>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems > >>> to have mostly wound down. Any objection if the fix for that comes along > >>> as a followup patch rather than a rework of this series? > >> > >> It seems that we don't need to modify this series, we need to consider > >> other solutions to resolve this deadlock issue. > >> > >> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all > >> instances of ext4_block_zero_page_range() out of the running journal > >> handle(please see patch 19-21). But I don't think this is a good solution > >> since it's complex and fragile. Besides, after commit c7fc0366c6562 > >> ("ext4: partial zero eof block on unaligned inode size extension"), you > >> added more invocations of ext4_zero_partial_blocks(), and the situation > >> has become more complicated (Althrough I think the calls in the three > >> write_end callbacks can be removed). > >> > >> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios > >> over unwritten mappings before zeroing partial blocks. This is because > >> ext4 always zeroes the in-memory page cache before zeroing(e.g, in > >> ext4_setattr() and ext4_punch_hole()), it means if the target range is > >> still dirty and unwritten when calling ext4_block_zero_page_range(), it > >> must has already been zeroed. Was I missing something? Therefore, I was > >> wondering if there are any ways to prevent flushing in > >> iomap_zero_range()? Any ideas? > > > > It's certainly possible that the quirk fixed by the flush the hole case > > was never a problem on ext4, if that's what you mean. 
Most of the > > testing for this was on XFS since ext4 hadn't used iomap for buffered > > writes. > > > > At the end of the day, the batch mechanism is intended to facilitate > > avoiding the flush entirely. I'm still paging things back in here.. but > > if we had two smallish changes to this code path to 1. eliminate the > > dynamic folio_batch allocation and 2. drop the flush on hole mapping > > case, would that address the issues with iomap zero range for ext4? > > > > Thank you for looking at this! > > I made a simple modification to the iomap_zero_range() function based > on the second solution you mentioned, then tested it using kvm-xfstests > these days. This solution works fine on ext4 and I don't find any other > risks by now. (Since my testing environment has sufficient memory, I > have not yet handled the case of memory allocation failure). > Great, thanks for evaluating that. I've been playing around with the exact same change the past few days. As it turns out, this still breaks something with XFS. I've narrowed it down to an interaction between a large eof folio that fails to split on truncate due to being dirty, COW prealloc and insert range racing with writeback in such a way that this results in a mapped post-eof block on the file. Technically that isn't the end of the world so long as it is zeroed, but a subsequent zero range can warn if the folio is reclaimed and we now end up with a new one that starts beyond EOF, because those folios don't write back. I have a mix of hacks that seems to address the generic/363 failure, but it still needs further testing and analysis to unwind my mess of various experiments and whatnot. ;P > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, > srcmap->type == IOMAP_UNWRITTEN)) { > s64 status; > > - if (range_dirty) { > + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) { > range_dirty = false; > status = iomap_zero_iter_flush_and_stale(&iter); > } else { > > Another thing I want to mention (although there are no real issues at > the moment, I still want to mention it) is that there appears to be > no consistency guarantee between the lookup of the mapping and the > follo_batch. For example, assume we have a file which contains two > dirty folio and two unwritten extents, one folio corresponds to one > extent. We zero out these two folios. > > | dirty folio 1 || dirty folio 2 | > [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu] > > In the first call to ->iomap_begin(), we get the unwritten extent 1. > At the same time, another thread writes back folio 1 and clears this > folio, so this folio will not be added to the follo_batch. Then > iomap_zero_range() will still flush those two folios. When flushing > the second folio, there is still a risk of deadlock due to changes in > metadata. > Hmm.. not sure I follow the example. The folio batch should include the folio in any case other than where it can look up, lock it and confirm it is clean. If the folio is clean and thus not included, the iomap logic should still see the empty folio batch and skip over the mapping if unwritten. (I want to replace this with a flag to address your memory allocation concern, but that is orthogonal to this logic.) Of course this should happen under appropriate fs locks such that iomap either sees the folio in dirty/writeback state where the mapping is unwritten, or if the folio has been cleaned, the mapping is reported as written. 
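For illustration, a rough sketch of that flag-based, no-allocation variant (the field and flag names here, fbatch and fbatch_valid, are hypothetical and not the posted series; the zero-range caller would own and folio_batch_init() the embedded batch before iterating):

/* hypothetical: batch embedded in the iter, so no kmalloc() is needed */
struct iomap_iter {
	/* ... existing fields ... */
	struct folio_batch	fbatch;		/* filled by ->iomap_begin */
	bool			fbatch_valid;	/* fs did the pagecache lookup */
};

loff_t
iomap_fill_dirty_folios(
	struct iomap_iter	*iter,
	loff_t			offset,
	loff_t			length)
{
	struct address_space	*mapping = iter->inode->i_mapping;
	pgoff_t			start = offset >> PAGE_SHIFT;
	pgoff_t			end = (offset + length - 1) >> PAGE_SHIFT;

	/*
	 * The batch is caller-owned, so this helper can no longer fail
	 * and silently fall back to the flush path.
	 */
	iter->fbatch_valid = true;
	filemap_get_folios_dirty(mapping, &start, end, &iter->fbatch);
	return ((loff_t)start << PAGE_SHIFT);
}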
If I'm still missing something with your example above, can you elaborate a bit further? Thanks. Brian > However, since ext4 currently uses this interface only to zero out > partial block, so this situation will not happen, but if the usage > changes in the future, we should be very careful about this point. > So in the future, I hope to have a more reliable method to avoid > flushing in iomap_zero_range(). > > Therefore, at the moment, I think that solving the problem through > these two points is feasible (I hope I haven't missed anything :-) ), > though it is somewhat fragile. What do other ext4 developers think? > > Thanks, > Yi. > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-08-05 13:08 ` Brian Foster @ 2025-08-06 3:10 ` Zhang Yi 2025-08-06 13:25 ` Brian Foster 0 siblings, 1 reply; 13+ messages in thread From: Zhang Yi @ 2025-08-06 3:10 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/8/5 21:08, Brian Foster wrote: > On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote: >> On 2025/7/30 21:17, Brian Foster wrote: >>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote: >>>> On 2025/7/18 21:48, Brian Foster wrote: >>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote: >>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>>>>>> The only way zero range can currently process unwritten mappings >>>>>>>> with dirty pagecache is to check whether the range is dirty before >>>>>>>> mapping lookup and then flush when at least one underlying mapping >>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from >>>>>>>> racing with folio writeback and reclaim. >>>>>>>> >>>>>>>> Since zero range can skip ranges of unwritten mappings that are >>>>>>>> clean in cache, this operation can be improved by allowing the >>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In >>>>>>>> turn, rather than flush or iterate file offsets, zero range can >>>>>>>> iterate on folios in the batch and advance over clean or uncached >>>>>>>> ranges in between. >>>>>>>> >>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>>>>>> >>>>>>> /me confused by the single quote; is this supposed to read: >>>>>>> >>>>>>> "...for the fs to populate..."? >>>>>>> >>>>>>> Either way the code changes look like a reasonable thing to do for the >>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>>>>>> mapping lock) so >>>>>>> >>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>>>>>> >>>>>>> --D >>>>>>> >>>>>>> >>>>>>>> populate the batch at lookup time. Update the folio lookup path to >>>>>>>> return the next folio in the batch, if provided, and advance the >>>>>>>> iter if the folio starts beyond the current offset. >>>>>>>> >>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>>>>> --- >>>>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>>>>>> fs/iomap/iter.c | 6 +++ >>>>>>>> include/linux/iomap.h | 4 ++ >>>>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>>>>>> >>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>>>>>> --- a/fs/iomap/buffered-io.c >>>>>>>> +++ b/fs/iomap/buffered-io.c >>>>>> [...] >>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>>>>>> return status; >>>>>>>> } >>>>>>>> >>>>>>>> +loff_t >>>>>>>> +iomap_fill_dirty_folios( >>>>>>>> + struct iomap_iter *iter, >>>>>>>> + loff_t offset, >>>>>>>> + loff_t length) >>>>>>>> +{ >>>>>>>> + struct address_space *mapping = iter->inode->i_mapping; >>>>>>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>>>>>> + >>>>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>>>>>> + if (!iter->fbatch) >>>>>> >>>>>> Hi, Brian! 
>>>>>> >>>>>> I think ext4 needs to be aware of this failure after it converts to use >>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the >>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. >>>>>> This could potentially lead to a deadlock, as most calls to >>>>>> ext4_block_zero_page_range() occur under an active journal handle. >>>>>> Writeback operations under an active journal handle may result in circular >>>>>> waiting within journal transactions. So please return this error code, and >>>>>> then ext4 can interrupt zero operations to prevent deadlock. >>>>>> >>>>> >>>>> Hi Yi, >>>>> >>>>> Thanks for looking at this. >>>>> >>>>> Huh.. so the reason for falling back like this here is just that this >>>>> was considered an optional optimization, with the flush in >>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're >>>>> saying means that the current zero range behavior without this series is >>>>> problematic for ext4-on-iomap..? >>>> >>>> Yes. >>>> >>>>> If so, have you observed issues you can share details about? >>>> >>>> Sure. >>>> >>>> Before delving into the specific details of this issue, I would like >>>> to provide some background information on the rule that ext4 cannot >>>> wait for writeback in an active journal handle. If you are aware of >>>> this background, please skip this paragraph. During ext4 writing back >>>> the page cache, it may start a new journal handle to allocate blocks, >>>> update the disksize, and convert unwritten extents after the I/O is >>>> completed. When starting this new journal handle, if the current >>>> running journal transaction is in the process of being submitted or >>>> if the journal space is insufficient, it must wait for the ongoing >>>> transaction to be completed, but the prerequisite for this is that all >>>> currently running handles must be terminated. However, if we flush the >>>> page cache under an active journal handle, we cannot stop it, which >>>> may lead to a deadlock. >>>> >>> >>> Ok, makes sense. >>> >>>> Now, the issue I have observed occurs when I attempt to use >>>> iomap_zero_range() within ext4_block_zero_page_range(). My current >>>> implementation are below(based on the latest fs-next). >>>> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>> index 28547663e4fd..1a21667f3f7c 100644 >>>> --- a/fs/ext4/inode.c >>>> +++ b/fs/ext4/inode.c >>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, >>>> return 0; >>>> } >>>> >>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, >>>> + loff_t length, unsigned int flags, struct iomap *iomap, >>>> + struct iomap *srcmap) >>>> +{ >>>> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); >>>> + struct ext4_map_blocks map; >>>> + u8 blkbits = inode->i_blkbits; >>>> + int ret; >>>> + >>>> + ret = ext4_emergency_state(inode->i_sb); >>>> + if (unlikely(ret)) >>>> + return ret; >>>> + >>>> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) >>>> + return -EINVAL; >>>> + >>>> + /* Calculate the first and last logical blocks respectively. */ >>>> + map.m_lblk = offset >> blkbits; >>>> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, >>>> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; >>>> + >>>> + ret = ext4_map_blocks(NULL, inode, &map, 0); >>>> + if (ret < 0) >>>> + return ret; >>>> + >>>> + /* >>>> + * Look up dirty folios for unwritten mappings within EOF. 
Providing >>>> + * this bypasses the flush iomap uses to trigger extent conversion >>>> + * when unwritten mappings have dirty pagecache in need of zeroing. >>>> + */ >>>> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && >>>> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { >>>> + loff_t end; >>>> + >>>> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, >>>> + map.m_len << blkbits); >>>> + if ((end >> blkbits) < map.m_lblk + map.m_len) >>>> + map.m_len = (end >> blkbits) - map.m_lblk; >>>> + } >>>> + >>>> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); >>>> + return 0; >>>> +} >>>> + >>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { >>>> + .iomap_begin = ext4_iomap_buffered_zero_begin, >>>> +}; >>>> >>>> const struct iomap_ops ext4_iomap_buffered_write_ops = { >>>> .iomap_begin = ext4_iomap_buffered_write_begin, >>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, >>>> return err; >>>> } >>>> >>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, >>>> + loff_t length) >>>> +{ >>>> + WARN_ON_ONCE(!inode_is_locked(inode) && >>>> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); >>>> + >>>> + return iomap_zero_range(inode, from, length, NULL, >>>> + &ext4_iomap_buffered_zero_ops, >>>> + &ext4_iomap_write_ops, NULL); >>>> +} >>>> + >>>> /* >>>> * ext4_block_zero_page_range() zeros out a mapping of length 'length' >>>> * starting from file offset 'from'. The range to be zero'd must >>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, >>>> if (IS_DAX(inode)) { >>>> return dax_zero_range(inode, from, length, NULL, >>>> &ext4_iomap_ops); >>>> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { >>>> + return ext4_iomap_zero_range(inode, from, length); >>>> } >>>> return __ext4_block_zero_page_range(handle, mapping, from, length); >>>> } >>>> >>>> The problem is most calls to ext4_block_zero_page_range() occur under >>>> an active journal handle, so I can reproduce the deadlock issue easily >>>> without this series. >>>> >>>>> >>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what >>>>> the error handling would look like in ext4. Do you expect to the fail >>>>> the higher level operation, for example? Cycle locks and retry, etc.? >>>> >>>> Originally, I wanted ext4_block_zero_page_range() to return a failure >>>> to the higher level operation. However, unfortunately, after my testing >>>> today, I discovered that even though we implement this, this series still >>>> cannot resolve the issue. The corner case is: >>>> >>>> Assume we have a dirty folio covers both hole and unwritten mappings. >>>> >>>> |- dirty folio -| >>>> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten >>>> >>>> If we punch the range of the hole, ext4_punch_hole()-> >>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. >>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio >>>> since the target range is a hole. Finally, iomap_zero_range() will still >>>> flush this whole folio and lead to deadlock during writeback the latter >>>> half of the folio. >>>> >>> >>> Hmm.. Ok. So it seems there are at least a couple ways around this >>> particular quirk. I suspect one is that you could just call the fill >>> helper in the hole case as well, but that's kind of a hack and not >>> really intended use. 
>>> >>> The other way goes back to the fact that the flush for the hole case was >>> kind of a corner case hack in the first place. The original comment for >>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap: >>> make zero range flush conditional on unwritten mappings") for reference >>> to the original intent. >>> >>> I'd have to go back and investigate if something regresses with that >>> taken out, but my recollection is that was something that needed proper >>> fixing eventually anyways. I'm particularly wondering if that is no >>> longer an issue now that pagecache_isize_extended() handles the post-eof >>> zeroing (the caveat being we might just need to call it in some >>> additional size extension cases besides just setattr/truncate). >> >> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in >> pagecache_isize_extended() should work. >> > > Ok.. > >>> >>>>> >>>>> The reason I ask is because the folio_batch handling has come up through >>>>> discussions on this series. My position so far has been to keep it as a >>>>> separate allocation and to keep things simple since it is currently >>>>> isolated to zero range, but that may change if the usage spills over to >>>>> other operations (which seems expected at this point). I suspect that if >>>>> a filesystem actually depends on this for correct behavior, that is >>>>> another data point worth considering on that topic. >>>>> >>>>> So that has me wondering if it would be better/easier here to perhaps >>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on >>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer >>>>> there instead of doing the dynamic allocation (then the fill helper >>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on >>>>> something like that? >>>>> >>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems >>>>> to have mostly wound down. Any objection if the fix for that comes along >>>>> as a followup patch rather than a rework of this series? >>>> >>>> It seems that we don't need to modify this series, we need to consider >>>> other solutions to resolve this deadlock issue. >>>> >>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all >>>> instances of ext4_block_zero_page_range() out of the running journal >>>> handle(please see patch 19-21). But I don't think this is a good solution >>>> since it's complex and fragile. Besides, after commit c7fc0366c6562 >>>> ("ext4: partial zero eof block on unaligned inode size extension"), you >>>> added more invocations of ext4_zero_partial_blocks(), and the situation >>>> has become more complicated (Althrough I think the calls in the three >>>> write_end callbacks can be removed). >>>> >>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios >>>> over unwritten mappings before zeroing partial blocks. This is because >>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in >>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is >>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it >>>> must has already been zeroed. Was I missing something? Therefore, I was >>>> wondering if there are any ways to prevent flushing in >>>> iomap_zero_range()? Any ideas? >>> >>> It's certainly possible that the quirk fixed by the flush the hole case >>> was never a problem on ext4, if that's what you mean. 
Most of the >>> testing for this was on XFS since ext4 hadn't used iomap for buffered >>> writes. >>> >>> At the end of the day, the batch mechanism is intended to facilitate >>> avoiding the flush entirely. I'm still paging things back in here.. but >>> if we had two smallish changes to this code path to 1. eliminate the >>> dynamic folio_batch allocation and 2. drop the flush on hole mapping >>> case, would that address the issues with iomap zero range for ext4? >>> >> >> Thank you for looking at this! >> >> I made a simple modification to the iomap_zero_range() function based >> on the second solution you mentioned, then tested it using kvm-xfstests >> these days. This solution works fine on ext4 and I don't find any other >> risks by now. (Since my testing environment has sufficient memory, I >> have not yet handled the case of memory allocation failure). >> > > Great, thanks for evaluating that. I've been playing around with the > exact same change the past few days. As it turns out, this still breaks > something with XFS. I've narrowed it down to an interaction between a > large eof folio that fails to split on truncate due to being dirty, COW > prealloc and insert range racing with writeback in such a way that this > results in a mapped post-eof block on the file. Technically that isn't > the end of the world so long as it is zeroed, but a subsequent zero > range can warn if the folio is reclaimed and we now end up with a new > one that starts beyond EOF, because those folios don't write back. > > I have a mix of hacks that seems to address the generic/363 failure, but > it still needs further testing and analysis to unwind my mess of various > experiments and whatnot. ;P OK, thanks for debugging and analyzing this. > >> --- a/fs/iomap/buffered-io.c >> +++ b/fs/iomap/buffered-io.c >> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, >> srcmap->type == IOMAP_UNWRITTEN)) { >> s64 status; >> >> - if (range_dirty) { >> + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) { >> range_dirty = false; >> status = iomap_zero_iter_flush_and_stale(&iter); >> } else { >> >> Another thing I want to mention (although there are no real issues at >> the moment, I still want to mention it) is that there appears to be >> no consistency guarantee between the lookup of the mapping and the >> follo_batch. For example, assume we have a file which contains two >> dirty folio and two unwritten extents, one folio corresponds to one >> extent. We zero out these two folios. >> >> | dirty folio 1 || dirty folio 2 | >> [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu] >> >> In the first call to ->iomap_begin(), we get the unwritten extent 1. >> At the same time, another thread writes back folio 1 and clears this >> folio, so this folio will not be added to the follo_batch. Then >> iomap_zero_range() will still flush those two folios. When flushing >> the second folio, there is still a risk of deadlock due to changes in >> metadata. >> > > Hmm.. not sure I follow the example. The folio batch should include the > folio in any case other than where it can look up, lock it and confirm > it is clean. If the folio is clean and thus not included, the iomap > logic should still see the empty folio batch and skip over the mapping > if unwritten. (I want to replace this with a flag to address your memory > allocation concern, but that is orthogonal to this logic.) 
> > Of course this should happen under appropriate fs locks such that iomap > either sees the folio in dirty/writeback state where the mapping is > unwritten, or if the folio has been cleaned, the mapping is reported as > written. > > If I'm still missing something with your example above, can you > elaborate a bit further? Thanks. >

Sorry for not making things clear. The race condition is as follows:

zero range                                  sync_file_range
iomap_zero_range()  //folio 1+2
  range_dirty = filemap_range_needs_writeback()
  //range_dirty is set to 'true'
  iomap_iter()
    ext4_iomap_buffered_zero_begin()
      ext4_map_blocks()
      //get unwritten extent 1
                                            sync_file_range()  //folio 1
                                              iomap_writepages()
                                              ...
                                              iomap_finish_ioend()
                                                folio_end_writeback()
                                                //clear folio 1, and
                                                //extent 1 becomes written
      iomap_fill_dirty_folios()
      //do not add folio 1 to batch
  iomap_zero_iter_flush_and_stale()
  //!fbatch && IOMAP_UNWRITTEN && range_dirty
  //flush folio 1+2, folio 2 is still dirty, then deadlock

Besides, if the range of folio 1 is initially clean and unwritten (folio 2 is still dirty), the flush can also be triggered without a concurrent sync_file_range().

The problem is that we check `range_dirty` only once up front, for the entire zero range, instead of re-checking it on each iteration, and no fs lock can prevent a concurrent writeback. Perhaps we need to check `range_dirty` on each iteration (see the sketch after this message)?

Regards,
Yi.

^ permalink raw reply	[flat|nested] 13+ messages in thread
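A purely illustrative sketch of that per-iteration check, based only on the loop shape visible in the diff quoted earlier in the thread (not an actual patch; the advance/skip handling of the existing code is elided):

	/*
	 * Inside the iomap_zero_range() loop, for a HOLE/UNWRITTEN srcmap:
	 * re-evaluate writeback/dirty state against just the current
	 * mapping instead of latching range_dirty once for the whole
	 * range.  Whether this is sufficient still depends on the fs
	 * serializing its extent lookup against writeback completion.
	 */
	s64 status;
	loff_t end = iter.pos + iomap_length(&iter) - 1;

	if (!iter.fbatch && srcmap->type == IOMAP_UNWRITTEN &&
	    filemap_range_needs_writeback(mapping, iter.pos, end))
		status = iomap_zero_iter_flush_and_stale(&iter);
	/* else: skip/advance past the mapping exactly as the code does today */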
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-08-06 3:10 ` Zhang Yi @ 2025-08-06 13:25 ` Brian Foster 2025-08-07 4:58 ` Zhang Yi 0 siblings, 1 reply; 13+ messages in thread From: Brian Foster @ 2025-08-06 13:25 UTC (permalink / raw) To: Zhang Yi Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On Wed, Aug 06, 2025 at 11:10:30AM +0800, Zhang Yi wrote: > On 2025/8/5 21:08, Brian Foster wrote: > > On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote: > >> On 2025/7/30 21:17, Brian Foster wrote: > >>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote: > >>>> On 2025/7/18 21:48, Brian Foster wrote: > >>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: > >>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote: > >>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: > >>>>>>>> The only way zero range can currently process unwritten mappings > >>>>>>>> with dirty pagecache is to check whether the range is dirty before > >>>>>>>> mapping lookup and then flush when at least one underlying mapping > >>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from > >>>>>>>> racing with folio writeback and reclaim. > >>>>>>>> > >>>>>>>> Since zero range can skip ranges of unwritten mappings that are > >>>>>>>> clean in cache, this operation can be improved by allowing the > >>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In > >>>>>>>> turn, rather than flush or iterate file offsets, zero range can > >>>>>>>> iterate on folios in the batch and advance over clean or uncached > >>>>>>>> ranges in between. > >>>>>>>> > >>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to > >>>>>>> > >>>>>>> /me confused by the single quote; is this supposed to read: > >>>>>>> > >>>>>>> "...for the fs to populate..."? > >>>>>>> > >>>>>>> Either way the code changes look like a reasonable thing to do for the > >>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the > >>>>>>> mapping lock) so > >>>>>>> > >>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > >>>>>>> > >>>>>>> --D > >>>>>>> > >>>>>>> > >>>>>>>> populate the batch at lookup time. Update the folio lookup path to > >>>>>>>> return the next folio in the batch, if provided, and advance the > >>>>>>>> iter if the folio starts beyond the current offset. > >>>>>>>> > >>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> > >>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> > >>>>>>>> --- > >>>>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- > >>>>>>>> fs/iomap/iter.c | 6 +++ > >>>>>>>> include/linux/iomap.h | 4 ++ > >>>>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > >>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 > >>>>>>>> --- a/fs/iomap/buffered-io.c > >>>>>>>> +++ b/fs/iomap/buffered-io.c > >>>>>> [...] 
> >>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) > >>>>>>>> return status; > >>>>>>>> } > >>>>>>>> > >>>>>>>> +loff_t > >>>>>>>> +iomap_fill_dirty_folios( > >>>>>>>> + struct iomap_iter *iter, > >>>>>>>> + loff_t offset, > >>>>>>>> + loff_t length) > >>>>>>>> +{ > >>>>>>>> + struct address_space *mapping = iter->inode->i_mapping; > >>>>>>>> + pgoff_t start = offset >> PAGE_SHIFT; > >>>>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; > >>>>>>>> + > >>>>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); > >>>>>>>> + if (!iter->fbatch) > >>>>>> > >>>>>> Hi, Brian! > >>>>>> > >>>>>> I think ext4 needs to be aware of this failure after it converts to use > >>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the > >>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. > >>>>>> This could potentially lead to a deadlock, as most calls to > >>>>>> ext4_block_zero_page_range() occur under an active journal handle. > >>>>>> Writeback operations under an active journal handle may result in circular > >>>>>> waiting within journal transactions. So please return this error code, and > >>>>>> then ext4 can interrupt zero operations to prevent deadlock. > >>>>>> > >>>>> > >>>>> Hi Yi, > >>>>> > >>>>> Thanks for looking at this. > >>>>> > >>>>> Huh.. so the reason for falling back like this here is just that this > >>>>> was considered an optional optimization, with the flush in > >>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're > >>>>> saying means that the current zero range behavior without this series is > >>>>> problematic for ext4-on-iomap..? > >>>> > >>>> Yes. > >>>> > >>>>> If so, have you observed issues you can share details about? > >>>> > >>>> Sure. > >>>> > >>>> Before delving into the specific details of this issue, I would like > >>>> to provide some background information on the rule that ext4 cannot > >>>> wait for writeback in an active journal handle. If you are aware of > >>>> this background, please skip this paragraph. During ext4 writing back > >>>> the page cache, it may start a new journal handle to allocate blocks, > >>>> update the disksize, and convert unwritten extents after the I/O is > >>>> completed. When starting this new journal handle, if the current > >>>> running journal transaction is in the process of being submitted or > >>>> if the journal space is insufficient, it must wait for the ongoing > >>>> transaction to be completed, but the prerequisite for this is that all > >>>> currently running handles must be terminated. However, if we flush the > >>>> page cache under an active journal handle, we cannot stop it, which > >>>> may lead to a deadlock. > >>>> > >>> > >>> Ok, makes sense. > >>> > >>>> Now, the issue I have observed occurs when I attempt to use > >>>> iomap_zero_range() within ext4_block_zero_page_range(). My current > >>>> implementation are below(based on the latest fs-next). 
> >>>> > >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > >>>> index 28547663e4fd..1a21667f3f7c 100644 > >>>> --- a/fs/ext4/inode.c > >>>> +++ b/fs/ext4/inode.c > >>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, > >>>> return 0; > >>>> } > >>>> > >>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, > >>>> + loff_t length, unsigned int flags, struct iomap *iomap, > >>>> + struct iomap *srcmap) > >>>> +{ > >>>> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); > >>>> + struct ext4_map_blocks map; > >>>> + u8 blkbits = inode->i_blkbits; > >>>> + int ret; > >>>> + > >>>> + ret = ext4_emergency_state(inode->i_sb); > >>>> + if (unlikely(ret)) > >>>> + return ret; > >>>> + > >>>> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) > >>>> + return -EINVAL; > >>>> + > >>>> + /* Calculate the first and last logical blocks respectively. */ > >>>> + map.m_lblk = offset >> blkbits; > >>>> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, > >>>> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; > >>>> + > >>>> + ret = ext4_map_blocks(NULL, inode, &map, 0); > >>>> + if (ret < 0) > >>>> + return ret; > >>>> + > >>>> + /* > >>>> + * Look up dirty folios for unwritten mappings within EOF. Providing > >>>> + * this bypasses the flush iomap uses to trigger extent conversion > >>>> + * when unwritten mappings have dirty pagecache in need of zeroing. > >>>> + */ > >>>> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && > >>>> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { > >>>> + loff_t end; > >>>> + > >>>> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, > >>>> + map.m_len << blkbits); > >>>> + if ((end >> blkbits) < map.m_lblk + map.m_len) > >>>> + map.m_len = (end >> blkbits) - map.m_lblk; > >>>> + } > >>>> + > >>>> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); > >>>> + return 0; > >>>> +} > >>>> + > >>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { > >>>> + .iomap_begin = ext4_iomap_buffered_zero_begin, > >>>> +}; > >>>> > >>>> const struct iomap_ops ext4_iomap_buffered_write_ops = { > >>>> .iomap_begin = ext4_iomap_buffered_write_begin, > >>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, > >>>> return err; > >>>> } > >>>> > >>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, > >>>> + loff_t length) > >>>> +{ > >>>> + WARN_ON_ONCE(!inode_is_locked(inode) && > >>>> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); > >>>> + > >>>> + return iomap_zero_range(inode, from, length, NULL, > >>>> + &ext4_iomap_buffered_zero_ops, > >>>> + &ext4_iomap_write_ops, NULL); > >>>> +} > >>>> + > >>>> /* > >>>> * ext4_block_zero_page_range() zeros out a mapping of length 'length' > >>>> * starting from file offset 'from'. The range to be zero'd must > >>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, > >>>> if (IS_DAX(inode)) { > >>>> return dax_zero_range(inode, from, length, NULL, > >>>> &ext4_iomap_ops); > >>>> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { > >>>> + return ext4_iomap_zero_range(inode, from, length); > >>>> } > >>>> return __ext4_block_zero_page_range(handle, mapping, from, length); > >>>> } > >>>> > >>>> The problem is most calls to ext4_block_zero_page_range() occur under > >>>> an active journal handle, so I can reproduce the deadlock issue easily > >>>> without this series. 
> >>>> > >>>>> > >>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what > >>>>> the error handling would look like in ext4. Do you expect to the fail > >>>>> the higher level operation, for example? Cycle locks and retry, etc.? > >>>> > >>>> Originally, I wanted ext4_block_zero_page_range() to return a failure > >>>> to the higher level operation. However, unfortunately, after my testing > >>>> today, I discovered that even though we implement this, this series still > >>>> cannot resolve the issue. The corner case is: > >>>> > >>>> Assume we have a dirty folio covers both hole and unwritten mappings. > >>>> > >>>> |- dirty folio -| > >>>> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten > >>>> > >>>> If we punch the range of the hole, ext4_punch_hole()-> > >>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. > >>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio > >>>> since the target range is a hole. Finally, iomap_zero_range() will still > >>>> flush this whole folio and lead to deadlock during writeback the latter > >>>> half of the folio. > >>>> > >>> > >>> Hmm.. Ok. So it seems there are at least a couple ways around this > >>> particular quirk. I suspect one is that you could just call the fill > >>> helper in the hole case as well, but that's kind of a hack and not > >>> really intended use. > >>> > >>> The other way goes back to the fact that the flush for the hole case was > >>> kind of a corner case hack in the first place. The original comment for > >>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap: > >>> make zero range flush conditional on unwritten mappings") for reference > >>> to the original intent. > >>> > >>> I'd have to go back and investigate if something regresses with that > >>> taken out, but my recollection is that was something that needed proper > >>> fixing eventually anyways. I'm particularly wondering if that is no > >>> longer an issue now that pagecache_isize_extended() handles the post-eof > >>> zeroing (the caveat being we might just need to call it in some > >>> additional size extension cases besides just setattr/truncate). > >> > >> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in > >> pagecache_isize_extended() should work. > >> > > > > Ok.. > > > >>> > >>>>> > >>>>> The reason I ask is because the folio_batch handling has come up through > >>>>> discussions on this series. My position so far has been to keep it as a > >>>>> separate allocation and to keep things simple since it is currently > >>>>> isolated to zero range, but that may change if the usage spills over to > >>>>> other operations (which seems expected at this point). I suspect that if > >>>>> a filesystem actually depends on this for correct behavior, that is > >>>>> another data point worth considering on that topic. > >>>>> > >>>>> So that has me wondering if it would be better/easier here to perhaps > >>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on > >>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer > >>>>> there instead of doing the dynamic allocation (then the fill helper > >>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on > >>>>> something like that? > >>>>> > >>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems > >>>>> to have mostly wound down. Any objection if the fix for that comes along > >>>>> as a followup patch rather than a rework of this series? 
> >>>> > >>>> It seems that we don't need to modify this series, we need to consider > >>>> other solutions to resolve this deadlock issue. > >>>> > >>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all > >>>> instances of ext4_block_zero_page_range() out of the running journal > >>>> handle(please see patch 19-21). But I don't think this is a good solution > >>>> since it's complex and fragile. Besides, after commit c7fc0366c6562 > >>>> ("ext4: partial zero eof block on unaligned inode size extension"), you > >>>> added more invocations of ext4_zero_partial_blocks(), and the situation > >>>> has become more complicated (Althrough I think the calls in the three > >>>> write_end callbacks can be removed). > >>>> > >>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios > >>>> over unwritten mappings before zeroing partial blocks. This is because > >>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in > >>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is > >>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it > >>>> must has already been zeroed. Was I missing something? Therefore, I was > >>>> wondering if there are any ways to prevent flushing in > >>>> iomap_zero_range()? Any ideas? > >>> > >>> It's certainly possible that the quirk fixed by the flush the hole case > >>> was never a problem on ext4, if that's what you mean. Most of the > >>> testing for this was on XFS since ext4 hadn't used iomap for buffered > >>> writes. > >>> > >>> At the end of the day, the batch mechanism is intended to facilitate > >>> avoiding the flush entirely. I'm still paging things back in here.. but > >>> if we had two smallish changes to this code path to 1. eliminate the > >>> dynamic folio_batch allocation and 2. drop the flush on hole mapping > >>> case, would that address the issues with iomap zero range for ext4? > >>> > >> > >> Thank you for looking at this! > >> > >> I made a simple modification to the iomap_zero_range() function based > >> on the second solution you mentioned, then tested it using kvm-xfstests > >> these days. This solution works fine on ext4 and I don't find any other > >> risks by now. (Since my testing environment has sufficient memory, I > >> have not yet handled the case of memory allocation failure). > >> > > > > Great, thanks for evaluating that. I've been playing around with the > > exact same change the past few days. As it turns out, this still breaks > > something with XFS. I've narrowed it down to an interaction between a > > large eof folio that fails to split on truncate due to being dirty, COW > > prealloc and insert range racing with writeback in such a way that this > > results in a mapped post-eof block on the file. Technically that isn't > > the end of the world so long as it is zeroed, but a subsequent zero > > range can warn if the folio is reclaimed and we now end up with a new > > one that starts beyond EOF, because those folios don't write back. > > > > I have a mix of hacks that seems to address the generic/363 failure, but > > it still needs further testing and analysis to unwind my mess of various > > experiments and whatnot. ;P > > OK, thanks for debugging and analyzing this. 
> > > > >> --- a/fs/iomap/buffered-io.c > >> +++ b/fs/iomap/buffered-io.c > >> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, > >> srcmap->type == IOMAP_UNWRITTEN)) { > >> s64 status; > >> > >> - if (range_dirty) { > >> + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) { > >> range_dirty = false; > >> status = iomap_zero_iter_flush_and_stale(&iter); > >> } else { > >> > >> Another thing I want to mention (although there are no real issues at > >> the moment, I still want to mention it) is that there appears to be > >> no consistency guarantee between the lookup of the mapping and the > >> follo_batch. For example, assume we have a file which contains two > >> dirty folio and two unwritten extents, one folio corresponds to one > >> extent. We zero out these two folios. > >> > >> | dirty folio 1 || dirty folio 2 | > >> [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu] > >> > >> In the first call to ->iomap_begin(), we get the unwritten extent 1. > >> At the same time, another thread writes back folio 1 and clears this > >> folio, so this folio will not be added to the follo_batch. Then > >> iomap_zero_range() will still flush those two folios. When flushing > >> the second folio, there is still a risk of deadlock due to changes in > >> metadata. > >> > > > > Hmm.. not sure I follow the example. The folio batch should include the > > folio in any case other than where it can look up, lock it and confirm > > it is clean. If the folio is clean and thus not included, the iomap > > logic should still see the empty folio batch and skip over the mapping > > if unwritten. (I want to replace this with a flag to address your memory > > allocation concern, but that is orthogonal to this logic.) > > > > Of course this should happen under appropriate fs locks such that iomap > > either sees the folio in dirty/writeback state where the mapping is > > unwritten, or if the folio has been cleaned, the mapping is reported as > > written. > > > > If I'm still missing something with your example above, can you > > elaborate a bit further? Thanks. > > > > Sorry for not make things clear. The race condition is the following, > > zero range sync_file_range > iomap_zero_range() //folio 1+2 > range_dirty = filemap_range_needs_writeback() > //range_dirty is set to 'ture' > iomap_iter() > ext4_iomap_buffer_zero_begin() > ext4_map_blocks() > //get unwritten extent 1 > sync_file_range() //folio 1 > iomap_writepages() > ... > iomap_finish_ioend() > folio_end_writeback() > //clear folio 1, and > //extent 1 becomes written > iomap_fill_dirty_folios() > //do not add folio 1 to batch I think the issue here is that this needs serialization between the lookups and extent conversion. I.e., XFS handles this by doing the folio lookup under the same locks used for looking up the extent. So in this scenario, that ensures that the only way we don't add the folio to the batch is if the mapping has already been converted by writeback completion.. > iomap_zero_iter_flush_and_stale() > //!fbatch && IOMAP_UNWRITTEN && range_dirty > //flush folio 1+2, folio 2 is still dirty, then deadlock > I also think the failure characteristic here is different. In this case you'd see an empty fbatch because at least on the lookup side the mapping is presumed to be unwritten. So that means we still wouldn't flush, but the zeroing would probably race with writeback such that it skips zeroing the blocks when it shouldn't. 
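To make that ordering concrete, a hypothetical ->iomap_begin sketch (the fs_* lock and lookup helpers are made-up names; XFS does the equivalent under the lock it already holds for the extent lookup):

static int fs_zero_iomap_begin(struct inode *inode, loff_t pos, loff_t len,
		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
{
	struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
	struct fs_extent map;	/* hypothetical extent descriptor */
	int error;

	/*
	 * The same lock is taken by I/O completion when it converts
	 * unwritten extents: either the folio is still dirty here and
	 * lands in the batch, or the conversion has already happened and
	 * the mapping is reported back as written.
	 */
	fs_mapping_lock(inode);
	error = fs_extent_lookup(inode, pos, len, &map);
	if (!error && map.unwritten)
		iomap_fill_dirty_folios(iter, map.offset, map.length);
	fs_mapping_unlock(inode);
	if (error)
		return error;

	return fs_set_iomap(inode, iomap, &map, pos, len, flags);
}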
> Besides, if the range of folio 1 is initially clean and unwritten > (folio 2 is still dirty), the flush can also be triggered without the > concurrent sync_file_range. > > The problem is that we initially checked `range_dirty` only once, and > if was done for the entire zero range, instead of checking it each > iteration, and there is no fs lock can prevent a concurrent write-back. > Perhaps we need to check 'dirty_range' for each iteration? > I don't think this needs to prevent writeback, but is there not any locking to protect extent lookups vs. extent conversion (via writeback)? FWIW, I think it's a little hard to reason about some of these interactions because of the pending tweaks that aren't implemented yet. What I'd like to do is get another version of this posted (mostly as is), hopefully get it landed into -next or whatever, then get another series going with this handful of tweaks we're discussing to prepare for ext4 on iomap. From there we can work out any remaining details on things that might need to change on either side. Sound Ok? Brian > Regards, > Yi. > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3 3/7] iomap: optional zero range dirty folio processing 2025-08-06 13:25 ` Brian Foster @ 2025-08-07 4:58 ` Zhang Yi 0 siblings, 0 replies; 13+ messages in thread From: Zhang Yi @ 2025-08-07 4:58 UTC (permalink / raw) To: Brian Foster Cc: linux-fsdevel, linux-xfs, linux-mm, hch, willy, Darrick J. Wong, Ext4 Developers List On 2025/8/6 21:25, Brian Foster wrote: > On Wed, Aug 06, 2025 at 11:10:30AM +0800, Zhang Yi wrote: >> On 2025/8/5 21:08, Brian Foster wrote: >>> On Sat, Aug 02, 2025 at 03:19:54PM +0800, Zhang Yi wrote: >>>> On 2025/7/30 21:17, Brian Foster wrote: >>>>> On Sat, Jul 19, 2025 at 07:07:43PM +0800, Zhang Yi wrote: >>>>>> On 2025/7/18 21:48, Brian Foster wrote: >>>>>>> On Fri, Jul 18, 2025 at 07:30:10PM +0800, Zhang Yi wrote: >>>>>>>> On 2025/7/15 13:22, Darrick J. Wong wrote: >>>>>>>>> On Mon, Jul 14, 2025 at 04:41:18PM -0400, Brian Foster wrote: >>>>>>>>>> The only way zero range can currently process unwritten mappings >>>>>>>>>> with dirty pagecache is to check whether the range is dirty before >>>>>>>>>> mapping lookup and then flush when at least one underlying mapping >>>>>>>>>> is unwritten. This ordering is required to prevent iomap lookup from >>>>>>>>>> racing with folio writeback and reclaim. >>>>>>>>>> >>>>>>>>>> Since zero range can skip ranges of unwritten mappings that are >>>>>>>>>> clean in cache, this operation can be improved by allowing the >>>>>>>>>> filesystem to provide a set of dirty folios that require zeroing. In >>>>>>>>>> turn, rather than flush or iterate file offsets, zero range can >>>>>>>>>> iterate on folios in the batch and advance over clean or uncached >>>>>>>>>> ranges in between. >>>>>>>>>> >>>>>>>>>> Add a folio_batch in struct iomap and provide a helper for fs' to >>>>>>>>> >>>>>>>>> /me confused by the single quote; is this supposed to read: >>>>>>>>> >>>>>>>>> "...for the fs to populate..."? >>>>>>>>> >>>>>>>>> Either way the code changes look like a reasonable thing to do for the >>>>>>>>> pagecache (try to grab a bunch of dirty folios while XFS holds the >>>>>>>>> mapping lock) so >>>>>>>>> >>>>>>>>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >>>>>>>>> >>>>>>>>> --D >>>>>>>>> >>>>>>>>> >>>>>>>>>> populate the batch at lookup time. Update the folio lookup path to >>>>>>>>>> return the next folio in the batch, if provided, and advance the >>>>>>>>>> iter if the folio starts beyond the current offset. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com> >>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>>>>>>> --- >>>>>>>>>> fs/iomap/buffered-io.c | 89 +++++++++++++++++++++++++++++++++++++++--- >>>>>>>>>> fs/iomap/iter.c | 6 +++ >>>>>>>>>> include/linux/iomap.h | 4 ++ >>>>>>>>>> 3 files changed, 94 insertions(+), 5 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c >>>>>>>>>> index 38da2fa6e6b0..194e3cc0857f 100644 >>>>>>>>>> --- a/fs/iomap/buffered-io.c >>>>>>>>>> +++ b/fs/iomap/buffered-io.c >>>>>>>> [...] 
>>>>>>>>>> @@ -1398,6 +1452,26 @@ static int iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) >>>>>>>>>> return status; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> +loff_t >>>>>>>>>> +iomap_fill_dirty_folios( >>>>>>>>>> + struct iomap_iter *iter, >>>>>>>>>> + loff_t offset, >>>>>>>>>> + loff_t length) >>>>>>>>>> +{ >>>>>>>>>> + struct address_space *mapping = iter->inode->i_mapping; >>>>>>>>>> + pgoff_t start = offset >> PAGE_SHIFT; >>>>>>>>>> + pgoff_t end = (offset + length - 1) >> PAGE_SHIFT; >>>>>>>>>> + >>>>>>>>>> + iter->fbatch = kmalloc(sizeof(struct folio_batch), GFP_KERNEL); >>>>>>>>>> + if (!iter->fbatch) >>>>>>>> >>>>>>>> Hi, Brian! >>>>>>>> >>>>>>>> I think ext4 needs to be aware of this failure after it converts to use >>>>>>>> iomap infrastructure. It is because if we fail to add dirty folios to the >>>>>>>> fbatch, iomap_zero_range() will flush those unwritten and dirty range. >>>>>>>> This could potentially lead to a deadlock, as most calls to >>>>>>>> ext4_block_zero_page_range() occur under an active journal handle. >>>>>>>> Writeback operations under an active journal handle may result in circular >>>>>>>> waiting within journal transactions. So please return this error code, and >>>>>>>> then ext4 can interrupt zero operations to prevent deadlock. >>>>>>>> >>>>>>> >>>>>>> Hi Yi, >>>>>>> >>>>>>> Thanks for looking at this. >>>>>>> >>>>>>> Huh.. so the reason for falling back like this here is just that this >>>>>>> was considered an optional optimization, with the flush in >>>>>>> iomap_zero_range() being default fallback behavior. IIUC, what you're >>>>>>> saying means that the current zero range behavior without this series is >>>>>>> problematic for ext4-on-iomap..? >>>>>> >>>>>> Yes. >>>>>> >>>>>>> If so, have you observed issues you can share details about? >>>>>> >>>>>> Sure. >>>>>> >>>>>> Before delving into the specific details of this issue, I would like >>>>>> to provide some background information on the rule that ext4 cannot >>>>>> wait for writeback in an active journal handle. If you are aware of >>>>>> this background, please skip this paragraph. During ext4 writing back >>>>>> the page cache, it may start a new journal handle to allocate blocks, >>>>>> update the disksize, and convert unwritten extents after the I/O is >>>>>> completed. When starting this new journal handle, if the current >>>>>> running journal transaction is in the process of being submitted or >>>>>> if the journal space is insufficient, it must wait for the ongoing >>>>>> transaction to be completed, but the prerequisite for this is that all >>>>>> currently running handles must be terminated. However, if we flush the >>>>>> page cache under an active journal handle, we cannot stop it, which >>>>>> may lead to a deadlock. >>>>>> >>>>> >>>>> Ok, makes sense. >>>>> >>>>>> Now, the issue I have observed occurs when I attempt to use >>>>>> iomap_zero_range() within ext4_block_zero_page_range(). My current >>>>>> implementation are below(based on the latest fs-next). 
>>>>>> >>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>>>> index 28547663e4fd..1a21667f3f7c 100644 >>>>>> --- a/fs/ext4/inode.c >>>>>> +++ b/fs/ext4/inode.c >>>>>> @@ -4147,6 +4147,53 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset, >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +static int ext4_iomap_buffered_zero_begin(struct inode *inode, loff_t offset, >>>>>> + loff_t length, unsigned int flags, struct iomap *iomap, >>>>>> + struct iomap *srcmap) >>>>>> +{ >>>>>> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap); >>>>>> + struct ext4_map_blocks map; >>>>>> + u8 blkbits = inode->i_blkbits; >>>>>> + int ret; >>>>>> + >>>>>> + ret = ext4_emergency_state(inode->i_sb); >>>>>> + if (unlikely(ret)) >>>>>> + return ret; >>>>>> + >>>>>> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK) >>>>>> + return -EINVAL; >>>>>> + >>>>>> + /* Calculate the first and last logical blocks respectively. */ >>>>>> + map.m_lblk = offset >> blkbits; >>>>>> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits, >>>>>> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1; >>>>>> + >>>>>> + ret = ext4_map_blocks(NULL, inode, &map, 0); >>>>>> + if (ret < 0) >>>>>> + return ret; >>>>>> + >>>>>> + /* >>>>>> + * Look up dirty folios for unwritten mappings within EOF. Providing >>>>>> + * this bypasses the flush iomap uses to trigger extent conversion >>>>>> + * when unwritten mappings have dirty pagecache in need of zeroing. >>>>>> + */ >>>>>> + if ((map.m_flags & EXT4_MAP_UNWRITTEN) && >>>>>> + map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) { >>>>>> + loff_t end; >>>>>> + >>>>>> + end = iomap_fill_dirty_folios(iter, map.m_lblk << blkbits, >>>>>> + map.m_len << blkbits); >>>>>> + if ((end >> blkbits) < map.m_lblk + map.m_len) >>>>>> + map.m_len = (end >> blkbits) - map.m_lblk; >>>>>> + } >>>>>> + >>>>>> + ext4_set_iomap(inode, iomap, &map, offset, length, flags); >>>>>> + return 0; >>>>>> +} >>>>>> + >>>>>> +const struct iomap_ops ext4_iomap_buffered_zero_ops = { >>>>>> + .iomap_begin = ext4_iomap_buffered_zero_begin, >>>>>> +}; >>>>>> >>>>>> const struct iomap_ops ext4_iomap_buffered_write_ops = { >>>>>> .iomap_begin = ext4_iomap_buffered_write_begin, >>>>>> @@ -4611,6 +4658,17 @@ static int __ext4_block_zero_page_range(handle_t *handle, >>>>>> return err; >>>>>> } >>>>>> >>>>>> +static inline int ext4_iomap_zero_range(struct inode *inode, loff_t from, >>>>>> + loff_t length) >>>>>> +{ >>>>>> + WARN_ON_ONCE(!inode_is_locked(inode) && >>>>>> + !rwsem_is_locked(&inode->i_mapping->invalidate_lock)); >>>>>> + >>>>>> + return iomap_zero_range(inode, from, length, NULL, >>>>>> + &ext4_iomap_buffered_zero_ops, >>>>>> + &ext4_iomap_write_ops, NULL); >>>>>> +} >>>>>> + >>>>>> /* >>>>>> * ext4_block_zero_page_range() zeros out a mapping of length 'length' >>>>>> * starting from file offset 'from'. The range to be zero'd must >>>>>> @@ -4636,6 +4694,8 @@ static int ext4_block_zero_page_range(handle_t *handle, >>>>>> if (IS_DAX(inode)) { >>>>>> return dax_zero_range(inode, from, length, NULL, >>>>>> &ext4_iomap_ops); >>>>>> + } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) { >>>>>> + return ext4_iomap_zero_range(inode, from, length); >>>>>> } >>>>>> return __ext4_block_zero_page_range(handle, mapping, from, length); >>>>>> } >>>>>> >>>>>> The problem is most calls to ext4_block_zero_page_range() occur under >>>>>> an active journal handle, so I can reproduce the deadlock issue easily >>>>>> without this series. 
>>>>>> >>>>>>> >>>>>>> FWIW, I think your suggestion is reasonable, but I'm also curious what >>>>>>> the error handling would look like in ext4. Do you expect to the fail >>>>>>> the higher level operation, for example? Cycle locks and retry, etc.? >>>>>> >>>>>> Originally, I wanted ext4_block_zero_page_range() to return a failure >>>>>> to the higher level operation. However, unfortunately, after my testing >>>>>> today, I discovered that even though we implement this, this series still >>>>>> cannot resolve the issue. The corner case is: >>>>>> >>>>>> Assume we have a dirty folio covers both hole and unwritten mappings. >>>>>> >>>>>> |- dirty folio -| >>>>>> [hhhhhhhhuuuuuuuu] h:hole, u:unwrtten >>>>>> >>>>>> If we punch the range of the hole, ext4_punch_hole()-> >>>>>> ext4_zero_partial_blocks() will zero out the first half of the dirty folio. >>>>>> Then, ext4_iomap_buffered_zero_begin() will skip adding this dirty folio >>>>>> since the target range is a hole. Finally, iomap_zero_range() will still >>>>>> flush this whole folio and lead to deadlock during writeback the latter >>>>>> half of the folio. >>>>>> >>>>> >>>>> Hmm.. Ok. So it seems there are at least a couple ways around this >>>>> particular quirk. I suspect one is that you could just call the fill >>>>> helper in the hole case as well, but that's kind of a hack and not >>>>> really intended use. >>>>> >>>>> The other way goes back to the fact that the flush for the hole case was >>>>> kind of a corner case hack in the first place. The original comment for >>>>> that seems to have been dropped, but see commit 7d9b474ee4cc ("iomap: >>>>> make zero range flush conditional on unwritten mappings") for reference >>>>> to the original intent. >>>>> >>>>> I'd have to go back and investigate if something regresses with that >>>>> taken out, but my recollection is that was something that needed proper >>>>> fixing eventually anyways. I'm particularly wondering if that is no >>>>> longer an issue now that pagecache_isize_extended() handles the post-eof >>>>> zeroing (the caveat being we might just need to call it in some >>>>> additional size extension cases besides just setattr/truncate). >>>> >>>> Yeah, I agree with you. I suppose the post-EOF partial folio zeroing in >>>> pagecache_isize_extended() should work. >>>> >>> >>> Ok.. >>> >>>>> >>>>>>> >>>>>>> The reason I ask is because the folio_batch handling has come up through >>>>>>> discussions on this series. My position so far has been to keep it as a >>>>>>> separate allocation and to keep things simple since it is currently >>>>>>> isolated to zero range, but that may change if the usage spills over to >>>>>>> other operations (which seems expected at this point). I suspect that if >>>>>>> a filesystem actually depends on this for correct behavior, that is >>>>>>> another data point worth considering on that topic. >>>>>>> >>>>>>> So that has me wondering if it would be better/easier here to perhaps >>>>>>> embed the batch in iomap_iter, or maybe as an incremental step put it on >>>>>>> the stack in iomap_zero_range() and initialize the iomap_iter pointer >>>>>>> there instead of doing the dynamic allocation (then the fill helper >>>>>>> would set a flag to indicate the fs did pagecache lookup). Thoughts on >>>>>>> something like that? >>>>>>> >>>>>>> Also IIUC ext4-on-iomap is still a WIP and review on this series seems >>>>>>> to have mostly wound down. Any objection if the fix for that comes along >>>>>>> as a followup patch rather than a rework of this series? 
>>>>>> >>>>>> It seems that we don't need to modify this series, we need to consider >>>>>> other solutions to resolve this deadlock issue. >>>>>> >>>>>> In my v1 ext4-on-iomap series [1], I resolved this issue by moving all >>>>>> instances of ext4_block_zero_page_range() out of the running journal >>>>>> handle(please see patch 19-21). But I don't think this is a good solution >>>>>> since it's complex and fragile. Besides, after commit c7fc0366c6562 >>>>>> ("ext4: partial zero eof block on unaligned inode size extension"), you >>>>>> added more invocations of ext4_zero_partial_blocks(), and the situation >>>>>> has become more complicated (Althrough I think the calls in the three >>>>>> write_end callbacks can be removed). >>>>>> >>>>>> Besides, IIUC, it seems that ext4 doesn't need to flush dirty folios >>>>>> over unwritten mappings before zeroing partial blocks. This is because >>>>>> ext4 always zeroes the in-memory page cache before zeroing(e.g, in >>>>>> ext4_setattr() and ext4_punch_hole()), it means if the target range is >>>>>> still dirty and unwritten when calling ext4_block_zero_page_range(), it >>>>>> must has already been zeroed. Was I missing something? Therefore, I was >>>>>> wondering if there are any ways to prevent flushing in >>>>>> iomap_zero_range()? Any ideas? >>>>> >>>>> It's certainly possible that the quirk fixed by the flush the hole case >>>>> was never a problem on ext4, if that's what you mean. Most of the >>>>> testing for this was on XFS since ext4 hadn't used iomap for buffered >>>>> writes. >>>>> >>>>> At the end of the day, the batch mechanism is intended to facilitate >>>>> avoiding the flush entirely. I'm still paging things back in here.. but >>>>> if we had two smallish changes to this code path to 1. eliminate the >>>>> dynamic folio_batch allocation and 2. drop the flush on hole mapping >>>>> case, would that address the issues with iomap zero range for ext4? >>>>> >>>> >>>> Thank you for looking at this! >>>> >>>> I made a simple modification to the iomap_zero_range() function based >>>> on the second solution you mentioned, then tested it using kvm-xfstests >>>> these days. This solution works fine on ext4 and I don't find any other >>>> risks by now. (Since my testing environment has sufficient memory, I >>>> have not yet handled the case of memory allocation failure). >>>> >>> >>> Great, thanks for evaluating that. I've been playing around with the >>> exact same change the past few days. As it turns out, this still breaks >>> something with XFS. I've narrowed it down to an interaction between a >>> large eof folio that fails to split on truncate due to being dirty, COW >>> prealloc and insert range racing with writeback in such a way that this >>> results in a mapped post-eof block on the file. Technically that isn't >>> the end of the world so long as it is zeroed, but a subsequent zero >>> range can warn if the folio is reclaimed and we now end up with a new >>> one that starts beyond EOF, because those folios don't write back. >>> >>> I have a mix of hacks that seems to address the generic/363 failure, but >>> it still needs further testing and analysis to unwind my mess of various >>> experiments and whatnot. ;P >> >> OK, thanks for debugging and analyzing this. 
>> >>> >>>> --- a/fs/iomap/buffered-io.c >>>> +++ b/fs/iomap/buffered-io.c >>>> @@ -1520,7 +1520,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, >>>> srcmap->type == IOMAP_UNWRITTEN)) { >>>> s64 status; >>>> >>>> - if (range_dirty) { >>>> + if (range_dirty && srcmap->type == IOMAP_UNWRITTEN) { >>>> range_dirty = false; >>>> status = iomap_zero_iter_flush_and_stale(&iter); >>>> } else { >>>> >>>> Another thing I want to mention (although there are no real issues at >>>> the moment, I still want to mention it) is that there appears to be >>>> no consistency guarantee between the lookup of the mapping and the >>>> follo_batch. For example, assume we have a file which contains two >>>> dirty folio and two unwritten extents, one folio corresponds to one >>>> extent. We zero out these two folios. >>>> >>>> | dirty folio 1 || dirty folio 2 | >>>> [uuuuuuuuuuuuuuuu][uuuuuuuuuuuuuuuuu] >>>> >>>> In the first call to ->iomap_begin(), we get the unwritten extent 1. >>>> At the same time, another thread writes back folio 1 and clears this >>>> folio, so this folio will not be added to the follo_batch. Then >>>> iomap_zero_range() will still flush those two folios. When flushing >>>> the second folio, there is still a risk of deadlock due to changes in >>>> metadata. >>>> >>> >>> Hmm.. not sure I follow the example. The folio batch should include the >>> folio in any case other than where it can look up, lock it and confirm >>> it is clean. If the folio is clean and thus not included, the iomap >>> logic should still see the empty folio batch and skip over the mapping >>> if unwritten. (I want to replace this with a flag to address your memory >>> allocation concern, but that is orthogonal to this logic.) >>> >>> Of course this should happen under appropriate fs locks such that iomap >>> either sees the folio in dirty/writeback state where the mapping is >>> unwritten, or if the folio has been cleaned, the mapping is reported as >>> written. >>> >>> If I'm still missing something with your example above, can you >>> elaborate a bit further? Thanks. >>> >> >> Sorry for not make things clear. The race condition is the following, >> >> zero range sync_file_range >> iomap_zero_range() //folio 1+2 >> range_dirty = filemap_range_needs_writeback() >> //range_dirty is set to 'ture' >> iomap_iter() >> ext4_iomap_buffer_zero_begin() >> ext4_map_blocks() >> //get unwritten extent 1 >> sync_file_range() //folio 1 >> iomap_writepages() >> ... >> iomap_finish_ioend() >> folio_end_writeback() >> //clear folio 1, and >> //extent 1 becomes written >> iomap_fill_dirty_folios() >> //do not add folio 1 to batch > > I think the issue here is that this needs serialization between the > lookups and extent conversion. I.e., XFS handles this by doing the folio > lookup under the same locks used for looking up the extent. So in this > scenario, that ensures that the only way we don't add the folio to the > batch is if the mapping has already been converted by writeback > completion.. Yeah, XFS serializes this using ip->i_lock. It performs mapping and folio lookup under XFS_ILOCK_EXCL, and the writeback conversion should also take this lock. Therefore, this race condition cannot happen in XFS. However, ext4 currently does not have such a lock that can provide this serialization. Perhaps we could reuse EXT4_I(inode)->i_data_sem but we need further analysis. At the moment, this case should not occur and does not cause any real issues, so we don't need to address it now. 
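As a rough illustration of that idea only (untested; ext4_map_blocks_under_lock() is a made-up stand-in for a lookup helper that expects i_data_sem to already be held, and the journal/lock-ordering analysis mentioned above would still be required), the zero-range ->iomap_begin path could hold i_data_sem across both the block lookup and the batch fill:

	/*
	 * Hypothetical: hold i_data_sem in read mode so that writeback
	 * completion, which takes it for writing to convert unwritten
	 * extents, cannot race with the lookup + dirty folio scan.
	 */
	down_read(&EXT4_I(inode)->i_data_sem);
	ret = ext4_map_blocks_under_lock(inode, &map);	/* placeholder */
	if (ret < 0)
		goto out_unlock;

	if ((map.m_flags & EXT4_MAP_UNWRITTEN) &&
	    map.m_lblk < EXT4_B_TO_LBLK(inode, i_size_read(inode))) {
		loff_t end;

		end = iomap_fill_dirty_folios(iter, (loff_t)map.m_lblk << blkbits,
					      (loff_t)map.m_len << blkbits);
		if ((end >> blkbits) < map.m_lblk + map.m_len)
			map.m_len = (end >> blkbits) - map.m_lblk;
	}

	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
	ret = 0;
out_unlock:
	up_read(&EXT4_I(inode)->i_data_sem);
	return ret;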
We can consider solutions when it becomes truly necessary in the future. > >> iomap_zero_iter_flush_and_stale() >> //!fbatch && IOMAP_UNWRITTEN && range_dirty >> //flush folio 1+2, folio 2 is still dirty, then deadlock >> > > I also think the failure characteristic here is different. In this case > you'd see an empty fbatch because at least on the lookup side the > mapping is presumed to be unwritten. So that means we still wouldn't > flush, but the zeroing would probably race with writeback such that it > skips zeroing the blocks when it shouldn't. > >> Besides, if the range of folio 1 is initially clean and unwritten >> (folio 2 is still dirty), the flush can also be triggered without the >> concurrent sync_file_range. >> >> The problem is that we initially checked `range_dirty` only once, and >> it was done for the entire zero range, instead of checking it each >> iteration, and there is no fs lock that can prevent a concurrent write-back. >> Perhaps we need to check 'range_dirty' for each iteration? >> > > I don't think this needs to prevent writeback, but is there not any > locking to protect extent lookups vs. extent conversion (via writeback)? > > FWIW, I think it's a little hard to reason about some of these > interactions because of the pending tweaks that aren't implemented yet. > What I'd like to do is get another version of this posted (mostly as > is), hopefully get it landed into -next or whatever, then get another > series going with this handful of tweaks we're discussing to prepare for > ext4 on iomap. From there we can work out any remaining details on > things that might need to change on either side. Sound Ok? > Yeah, it looks good to me. We should expedite merging of this series as soon as possible to allow us to continue advancing the work on ext4 on iomap. Thanks, Yi. ^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages
[not found] <20250714204122.349582-1-bfoster@redhat.com>
[not found] ` <20250714204122.349582-4-bfoster@redhat.com>
[not found] ` <20250715052259.GO2672049@frogsfrogsfrogs>
2025-07-18 11:30 ` [PATCH v3 3/7] iomap: optional zero range dirty folio processing Zhang Yi
2025-07-18 13:48 ` Brian Foster
2025-07-19 11:07 ` Zhang Yi
2025-07-21 8:47 ` Zhang Yi
2025-07-28 12:57 ` Zhang Yi
2025-07-30 13:19 ` Brian Foster
2025-08-02 7:26 ` Zhang Yi
2025-07-30 13:17 ` Brian Foster
2025-08-02 7:19 ` Zhang Yi
2025-08-05 13:08 ` Brian Foster
2025-08-06 3:10 ` Zhang Yi
2025-08-06 13:25 ` Brian Foster
2025-08-07 4:58 ` Zhang Yi