* [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range @ 2024-08-22 14:59 Brian Foster 2024-08-22 14:59 ` [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster 2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster 0 siblings, 2 replies; 12+ messages in thread From: Brian Foster @ 2024-08-22 14:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david Hi all, This is the alternative flushing solution to the iomap zero range problem with dirty pagecache over unwritten mappings. This is done in two steps. Patch 1 lifts the XFS workaround into iomap, flushing the range unconditionally and providing an easily backportable fix for stable kernels. Patch 2 buries the flush further down into iomap, making it conditional on the combined presence of dirty cache and unwritten mappings in the target range. This may be reasonable backportable as well, but is only required if performance is a concern. I still have to look into the improved revalidation approach discussed in the RFC thread, but given that requires validation support and this is intended to be a generic fallback, I wanted to get this nailed down first. fstests coverage for this problem is posted here [1]. Thoughts, reviews, flames appreciated. Brian v1: - Alternative approach, flush instead of revalidate. rfc: https://lore.kernel.org/linux-fsdevel/20240718130212.23905-1-bfoster@redhat.com/ [1] https://lore.kernel.org/fstests/20240822144422.188462-1-bfoster@redhat.com/ Brian Foster (2): iomap: fix handling of dirty folios over unwritten extents iomap: make zero range flush conditional on unwritten mappings fs/iomap/buffered-io.c | 52 ++++++++++++++++++++++++++++++++++++++---- fs/xfs/xfs_iops.c | 10 -------- 2 files changed, 48 insertions(+), 14 deletions(-) -- 2.45.0 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents 2024-08-22 14:59 [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster @ 2024-08-22 14:59 ` Brian Foster 2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster 1 sibling, 0 replies; 12+ messages in thread From: Brian Foster @ 2024-08-22 14:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david The iomap zero range implementation doesn't properly handle dirty pagecache over unwritten mappings. It skips such mappings as if they were pre-zeroed. If some part of an unwritten mapping is dirty in pagecache from a previous write, the data in cache should be zeroed as well. Instead, the data is left in cache and creates a stale data exposure problem if writeback occurs sometime after the zero range. Most callers are unaffected by this because the higher level filesystem contexts that call zero range typically perform a filemap flush of the target range for other reasons. A couple contexts that don't otherwise need to flush are write file size extension and truncate in XFS. The former path is currently susceptible to the stale data exposure problem and the latter performs a flush specifically to work around it. This is clearly inconsistent and incomplete. As a first step toward correcting behavior, lift the XFS workaround to iomap_zero_range() and unconditionally flush the range before the zero range operation proceeds. While this appears to be a bit of a big hammer, most all users already do this from calling context save for the couple of exceptions noted above. Future patches will optimize or elide this flush while maintaining functional correctness. Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/iomap/buffered-io.c | 10 ++++++++++ fs/xfs/xfs_iops.c | 10 ---------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index f420c53d86ac..3e846f43ff48 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1451,6 +1451,16 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, }; int ret; + /* + * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but + * pagecache must be flushed to ensure stale data from previous + * buffered writes is not exposed. + */ + ret = filemap_write_and_wait_range(inode->i_mapping, + pos, pos + len - 1); + if (ret) + return ret; + while ((ret = iomap_iter(&iter, ops)) > 0) iter.processed = iomap_zero_iter(&iter, did_zero); return ret; diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 1cdc8034f54d..ddd3697e6ecd 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -870,16 +870,6 @@ xfs_setattr_size( error = xfs_zero_range(ip, oldsize, newsize - oldsize, &did_zeroing); } else { - /* - * iomap won't detect a dirty page over an unwritten block (or a - * cow block over a hole) and subsequently skips zeroing the - * newly post-EOF portion of the page. Flush the new EOF to - * convert the block before the pagecache truncate. - */ - error = filemap_write_and_wait_range(inode->i_mapping, newsize, - newsize); - if (error) - return error; error = xfs_truncate_page(ip, newsize, &did_zeroing); } -- 2.45.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-22 14:59 [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster 2024-08-22 14:59 ` [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster @ 2024-08-22 14:59 ` Brian Foster 2024-08-27 6:11 ` Christoph Hellwig 1 sibling, 1 reply; 12+ messages in thread From: Brian Foster @ 2024-08-22 14:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david iomap_zero_range() flushes pagecache to mitigate consistency problems with dirty pagecache and unwritten mappings. The flush is unconditional over the entire range because checking pagecache state after mapping lookup is racy with writeback and reclaim. There are ways around this using iomap's mapping revalidation mechanism, but this is not supported by all iomap based filesystems and so is not a generic solution. There is another way around this limitation that is good enough to filter the flush for most cases in practice. If we check for dirty pagecache over the target range (instead of unconditionally flush), we can keep track of whether the range was dirty before lookup and defer the flush until/unless we see a combination of dirty cache backed by an unwritten mapping. We don't necessarily know whether the dirty cache was backed by the unwritten maping or some other (written) part of the range, but the impliciation of a false positive here is a spurious flush and thus relatively harmless. Note that we also flush for hole mappings because iomap_zero_range() is used for partial folio zeroing in some cases. For example, if a folio straddles EOF on a sub-page FSB size fs, the post-eof portion is hole-backed and dirtied/written via mapped write, and then i_size increases before writeback can occur (which otherwise zeroes the post-eof portion of the EOF folio), then the folio becomes inconsistent with disk until reclaimed. A flush in this case executes partial zeroing from writeback, and iomap knows that there is otherwise no I/O to submit for hole backed mappings. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/iomap/buffered-io.c | 52 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 43 insertions(+), 9 deletions(-) diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 3e846f43ff48..841cd01d8194 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1393,16 +1393,42 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len, } EXPORT_SYMBOL_GPL(iomap_file_unshare); -static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero) +/* + * Flush the remaining range of the iter and mark the current mapping stale. + * This is used when zero range sees an unwritten mapping that may have had + * dirty pagecache over it. + */ +static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i) +{ + struct address_space *mapping = i->inode->i_mapping; + loff_t end = i->pos + i->len - 1; + + i->iomap.flags |= IOMAP_F_STALE; + return filemap_write_and_wait_range(mapping, i->pos, end); +} + +static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero, + bool *range_dirty) { const struct iomap *srcmap = iomap_iter_srcmap(iter); loff_t pos = iter->pos; loff_t length = iomap_length(iter); loff_t written = 0; - /* already zeroed? we're done. */ - if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) + /* + * We can skip pre-zeroed mappings so long as either the mapping was + * clean before we started or we've flushed at least once since. + * Otherwise we don't know whether the current mapping had dirty + * pagecache, so flush it now, stale the current mapping, and proceed + * from there. + */ + if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) { + if (*range_dirty) { + *range_dirty = false; + return iomap_zero_iter_flush_and_stale(iter); + } return length; + } do { struct folio *folio; @@ -1450,19 +1476,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero, .flags = IOMAP_ZERO, }; int ret; + bool range_dirty; /* * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but * pagecache must be flushed to ensure stale data from previous - * buffered writes is not exposed. + * buffered writes is not exposed. A flush is only required for certain + * types of mappings, but checking pagecache after mapping lookup is + * racy with writeback and reclaim. + * + * Therefore, check the entire range first and pass along whether any + * part of it is dirty. If so and an underlying mapping warrants it, + * flush the cache at that point. This trades off the occasional false + * positive (and spurious flush, if the dirty data and mapping don't + * happen to overlap) for simplicity in handling a relatively uncommon + * situation. */ - ret = filemap_write_and_wait_range(inode->i_mapping, - pos, pos + len - 1); - if (ret) - return ret; + range_dirty = filemap_range_needs_writeback(inode->i_mapping, + pos, pos + len - 1); while ((ret = iomap_iter(&iter, ops)) > 0) - iter.processed = iomap_zero_iter(&iter, did_zero); + iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty); return ret; } EXPORT_SYMBOL_GPL(iomap_zero_range); -- 2.45.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster @ 2024-08-27 6:11 ` Christoph Hellwig 2024-08-27 14:23 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2024-08-27 6:11 UTC (permalink / raw) To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david On Thu, Aug 22, 2024 at 10:59:10AM -0400, Brian Foster wrote: > Note that we also flush for hole mappings because iomap_zero_range() > is used for partial folio zeroing in some cases. For example, if a > folio straddles EOF on a sub-page FSB size fs, the post-eof portion > is hole-backed and dirtied/written via mapped write, and then i_size > increases before writeback can occur (which otherwise zeroes the > post-eof portion of the EOF folio), then the folio becomes > inconsistent with disk until reclaimed. Eww. I'm not sure iomap_zero_range is the right way to handle this even if that's what we have now and it kinda work. > + /* > + * We can skip pre-zeroed mappings so long as either the mapping was > + * clean before we started or we've flushed at least once since. > + * Otherwise we don't know whether the current mapping had dirty > + * pagecache, so flush it now, stale the current mapping, and proceed > + * from there. > + */ > + if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) { .. at very least the above needs to be documented here as a big fat reminder, though. Otherwise the series looks sensible to me. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-27 6:11 ` Christoph Hellwig @ 2024-08-27 14:23 ` Brian Foster 2024-08-28 4:32 ` Christoph Hellwig 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2024-08-27 14:23 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david On Mon, Aug 26, 2024 at 11:11:42PM -0700, Christoph Hellwig wrote: > On Thu, Aug 22, 2024 at 10:59:10AM -0400, Brian Foster wrote: > > Note that we also flush for hole mappings because iomap_zero_range() > > is used for partial folio zeroing in some cases. For example, if a > > folio straddles EOF on a sub-page FSB size fs, the post-eof portion > > is hole-backed and dirtied/written via mapped write, and then i_size > > increases before writeback can occur (which otherwise zeroes the > > post-eof portion of the EOF folio), then the folio becomes > > inconsistent with disk until reclaimed. > > Eww. I'm not sure iomap_zero_range is the right way to handle this > even if that's what we have now and it kinda work. > Yeah, I agree with that. That was one of the minor appeals (to me) of the prototype I posted a while back that customizes iomap_truncate_page() to do unconditional zeroing instead of being an almost pointless wrapper for iomap_zero_range(). I don't quite recall if that explicitly handled the hole case since it was quickly hacked together, but the general idea is that it could be more purpose built for these partial zeroing cases than zero range might be. There are some other caveats with that approach that would still require some variant of zero range, however, so I'm not immediately clear on what the ideal high level solution looks like quite yet. I'd like to get the existing mechanism working correctly, improve the test situation and go from there. > > + /* > > + * We can skip pre-zeroed mappings so long as either the mapping was > > + * clean before we started or we've flushed at least once since. > > + * Otherwise we don't know whether the current mapping had dirty > > + * pagecache, so flush it now, stale the current mapping, and proceed > > + * from there. > > + */ > > + if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) { > > .. at very least the above needs to be documented here as a big fat > reminder, though. > Sure, I'll include that explanation in the code comments in v2. Thanks. Brian > Otherwise the series looks sensible to me. > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-27 14:23 ` Brian Foster @ 2024-08-28 4:32 ` Christoph Hellwig 2024-08-28 12:35 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2024-08-28 4:32 UTC (permalink / raw) To: Brian Foster Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, djwong, josef, david On Tue, Aug 27, 2024 at 10:23:10AM -0400, Brian Foster wrote: > Yeah, I agree with that. That was one of the minor appeals (to me) of the > prototype I posted a while back that customizes iomap_truncate_page() to > do unconditional zeroing instead of being an almost pointless wrapper > for iomap_zero_range(). I only very vaguely remember that, you don't happen to have a pointer to that? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-28 4:32 ` Christoph Hellwig @ 2024-08-28 12:35 ` Brian Foster 2024-08-29 5:41 ` Christoph Hellwig 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2024-08-28 12:35 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david On Tue, Aug 27, 2024 at 09:32:35PM -0700, Christoph Hellwig wrote: > On Tue, Aug 27, 2024 at 10:23:10AM -0400, Brian Foster wrote: > > Yeah, I agree with that. That was one of the minor appeals (to me) of the > > prototype I posted a while back that customizes iomap_truncate_page() to > > do unconditional zeroing instead of being an almost pointless wrapper > > for iomap_zero_range(). > > I only very vaguely remember that, you don't happen to have a pointer > to that? > > Yeah, it was buried in a separate review around potentially killing off iomap_truncate_page(): https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/ The idea is pretty simple.. use the same kind of check this patch does for doing a flush, but instead open code and isolate it to iomap_truncate_page() so we can just default to doing the buffered write instead. Note that I don't think this replaces the need for patch 1, but it might arguably make further optimization of the flush kind of pointless because I'm not sure zero range would ever be called from somewhere that doesn't flush already. The tradeoffs I can think of are this might introduce some false positives where an EOF folio might be dirty but a sub-folio size block backing EOF might be clean, and again that callers like truncate and write extension would need to both truncate the eof page and zero the broader post-eof range. Neither of those seem all that significant to me, but just my .02. Brian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-28 12:35 ` Brian Foster @ 2024-08-29 5:41 ` Christoph Hellwig 2024-08-29 15:05 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Christoph Hellwig @ 2024-08-29 5:41 UTC (permalink / raw) To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote: > Yeah, it was buried in a separate review around potentially killing off > iomap_truncate_page(): > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/ > > The idea is pretty simple.. use the same kind of check this patch does > for doing a flush, but instead open code and isolate it to > iomap_truncate_page() so we can just default to doing the buffered write > instead. > > Note that I don't think this replaces the need for patch 1, but it might > arguably make further optimization of the flush kind of pointless > because I'm not sure zero range would ever be called from somewhere that > doesn't flush already. > > The tradeoffs I can think of are this might introduce some false > positives where an EOF folio might be dirty but a sub-folio size block > backing EOF might be clean, and again that callers like truncate and > write extension would need to both truncate the eof page and zero the > broader post-eof range. Neither of those seem all that significant to > me, but just my .02. Looking at that patch and your current series I kinda like not having to deal with the dirty caches in the loop, and in fact I'd also prefer to not do any writeback from the low-level zero helpers if we can. That is not doing your patch 1 but instead auditing the callers if any of them needs them and documenting the expectation. But please let Dave and Darrick chime in first before investing any work into this. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-29 5:41 ` Christoph Hellwig @ 2024-08-29 15:05 ` Brian Foster 2024-08-29 21:48 ` Darrick J. Wong 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2024-08-29 15:05 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote: > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote: > > Yeah, it was buried in a separate review around potentially killing off > > iomap_truncate_page(): > > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/ > > > > The idea is pretty simple.. use the same kind of check this patch does > > for doing a flush, but instead open code and isolate it to > > iomap_truncate_page() so we can just default to doing the buffered write > > instead. > > > > Note that I don't think this replaces the need for patch 1, but it might > > arguably make further optimization of the flush kind of pointless > > because I'm not sure zero range would ever be called from somewhere that > > doesn't flush already. > > > > The tradeoffs I can think of are this might introduce some false > > positives where an EOF folio might be dirty but a sub-folio size block > > backing EOF might be clean, and again that callers like truncate and > > write extension would need to both truncate the eof page and zero the > > broader post-eof range. Neither of those seem all that significant to > > me, but just my .02. > > Looking at that patch and your current series I kinda like not having > to deal with the dirty caches in the loop, and in fact I'd also prefer > to not do any writeback from the low-level zero helpers if we can. > That is not doing your patch 1 but instead auditing the callers if > any of them needs them and documenting the expectation. > I agree this seems better in some ways, but I don't like complicating or putting more responsibility on the callers. I think if we had a high level iomap function that wrapped a combination of this proposed variant of truncate_page() and zero_range() for general inode size changes, that might alleviate that concern. Otherwise IME even if we audited and fixed all callers today, over time we'll just reintroduce the same sorts of errors if the low level mechanisms aren't made to function correctly. > But please let Dave and Darrick chime in first before investing any > work into this. > > Based on the feedback to v2, it sounds like there's general consensus on the approach modulo some code factoring discussion. Unless there is objection, I think I'll stick with that for now for the sake of progress and keep this option in mind on the back burner. None of this is really that hard to change if we come up with something better. Brian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-29 15:05 ` Brian Foster @ 2024-08-29 21:48 ` Darrick J. Wong 2024-08-30 11:57 ` Brian Foster 0 siblings, 1 reply; 12+ messages in thread From: Darrick J. Wong @ 2024-08-29 21:48 UTC (permalink / raw) To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, josef, david On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote: > On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote: > > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote: > > > Yeah, it was buried in a separate review around potentially killing off > > > iomap_truncate_page(): > > > > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/ > > > > > > The idea is pretty simple.. use the same kind of check this patch does > > > for doing a flush, but instead open code and isolate it to > > > iomap_truncate_page() so we can just default to doing the buffered write > > > instead. > > > > > > Note that I don't think this replaces the need for patch 1, but it might > > > arguably make further optimization of the flush kind of pointless > > > because I'm not sure zero range would ever be called from somewhere that > > > doesn't flush already. > > > > > > The tradeoffs I can think of are this might introduce some false > > > positives where an EOF folio might be dirty but a sub-folio size block > > > backing EOF might be clean, and again that callers like truncate and > > > write extension would need to both truncate the eof page and zero the > > > broader post-eof range. Neither of those seem all that significant to > > > me, but just my .02. > > > > Looking at that patch and your current series I kinda like not having > > to deal with the dirty caches in the loop, and in fact I'd also prefer > > to not do any writeback from the low-level zero helpers if we can. > > That is not doing your patch 1 but instead auditing the callers if > > any of them needs them and documenting the expectation. I looked, and was pretty sure that XFS is the only one that has that expectation. > I agree this seems better in some ways, but I don't like complicating or > putting more responsibility on the callers. I think if we had a high > level iomap function that wrapped a combination of this proposed variant > of truncate_page() and zero_range() for general inode size changes, that > might alleviate that concern. > > Otherwise IME even if we audited and fixed all callers today, over time > we'll just reintroduce the same sorts of errors if the low level > mechanisms aren't made to function correctly. Yeah. What /are/ the criteria for needing the flush and wait? AFAICT, a filesystem only needs the flush if it's possible to have dirty pagecache backed either by a hole or an unwritten extent, right? I suppose we could amend the iomap ops so that filesystems could signal that they allow either of those things, and then we wouldn't have to query the mapping for filesystems that don't, right? IOWs, one can opt out of safety features if there's no risk of a garbage, right? (Also: does xfs allow dirty page cache backed by a hole? I didn't think that was possible.) > > But please let Dave and Darrick chime in first before investing any > > work into this. > > > > > > Based on the feedback to v2, it sounds like there's general consensus on > the approach modulo some code factoring discussion. Unless there is > objection, I think I'll stick with that for now for the sake of progress > and keep this option in mind on the back burner. None of this is really > that hard to change if we come up with something better. > > Brian > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-29 21:48 ` Darrick J. Wong @ 2024-08-30 11:57 ` Brian Foster 2024-08-30 23:02 ` Darrick J. Wong 0 siblings, 1 reply; 12+ messages in thread From: Brian Foster @ 2024-08-30 11:57 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, josef, david On Thu, Aug 29, 2024 at 02:48:00PM -0700, Darrick J. Wong wrote: > On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote: > > On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote: > > > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote: > > > > Yeah, it was buried in a separate review around potentially killing off > > > > iomap_truncate_page(): > > > > > > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/ > > > > > > > > The idea is pretty simple.. use the same kind of check this patch does > > > > for doing a flush, but instead open code and isolate it to > > > > iomap_truncate_page() so we can just default to doing the buffered write > > > > instead. > > > > > > > > Note that I don't think this replaces the need for patch 1, but it might > > > > arguably make further optimization of the flush kind of pointless > > > > because I'm not sure zero range would ever be called from somewhere that > > > > doesn't flush already. > > > > > > > > The tradeoffs I can think of are this might introduce some false > > > > positives where an EOF folio might be dirty but a sub-folio size block > > > > backing EOF might be clean, and again that callers like truncate and > > > > write extension would need to both truncate the eof page and zero the > > > > broader post-eof range. Neither of those seem all that significant to > > > > me, but just my .02. > > > > > > Looking at that patch and your current series I kinda like not having > > > to deal with the dirty caches in the loop, and in fact I'd also prefer > > > to not do any writeback from the low-level zero helpers if we can. > > > That is not doing your patch 1 but instead auditing the callers if > > > any of them needs them and documenting the expectation. > > I looked, and was pretty sure that XFS is the only one that has that > expectation. > > > I agree this seems better in some ways, but I don't like complicating or > > putting more responsibility on the callers. I think if we had a high > > level iomap function that wrapped a combination of this proposed variant > > of truncate_page() and zero_range() for general inode size changes, that > > might alleviate that concern. > > > > Otherwise IME even if we audited and fixed all callers today, over time > > we'll just reintroduce the same sorts of errors if the low level > > mechanisms aren't made to function correctly. > > Yeah. What /are/ the criteria for needing the flush and wait? AFAICT, > a filesystem only needs the flush if it's possible to have dirty > pagecache backed either by a hole or an unwritten extent, right? > Yeah, but this flush behavior shouldn't be a caller consideration at all. It's just an implementation detail. All the caller should care about is that zero range works As Expected (tm). The pre-iomap way of doing this in XFS was xfs_zero_eof() -> xfs_iozero(), which was an internally coded buffered write loop that wrote zeroes into pagecache. That was ultimately replaced with iomap_zero_range() with the same sort of usage expectations, but iomap_zero_range() just didn't work quite correctly in all cases. > I suppose we could amend the iomap ops so that filesystems could signal > that they allow either of those things, and then we wouldn't have to > query the mapping for filesystems that don't, right? IOWs, one can opt > out of safety features if there's no risk of a garbage, right? > Not sure I parse.. In general I think we could let ops signal whether they want certain checks. This is how I used the IOMAP_F_DIRTY_CACHE flag mentioned in the other thread. If the operation handler is interested in pagecache state, set an IOMAP_DIRTY_CACHE flag in ops to trigger a pre iomap_begin() check and then set the corresponding _F_DIRTY_CACHE flag on the mapping if dirty, but I'm not sure if that's the same concept you're alluding to here. > (Also: does xfs allow dirty page cache backed by a hole? I didn't think > that was possible.) > It's a corner case. A mapped write can write to any portion of a folio so long as it starts within eof. So if you have a mapped write that writes past EOF, there's no guarantee that range of the folio is mapped by blocks. That post-eof part of the folio would be zeroed at writeback time, but that assumes i_size doesn't change before writeback. If it does and the size change operation doesn't do the zeroing itself (enter zero range via write extension), then we end up with a dirty folio at least partially backed by a hole with non-zero data within EOF. There's nothing written back to disk in this hole backed example, but the pagecache is still inconsistent with what's on disk and therefore I suspect data corruption is possible if the folio is redirtied before reclaimed. Brian > > > But please let Dave and Darrick chime in first before investing any > > > work into this. > > > > > > > > > > Based on the feedback to v2, it sounds like there's general consensus on > > the approach modulo some code factoring discussion. Unless there is > > objection, I think I'll stick with that for now for the sake of progress > > and keep this option in mind on the back burner. None of this is really > > that hard to change if we come up with something better. > > > > Brian > > > > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings 2024-08-30 11:57 ` Brian Foster @ 2024-08-30 23:02 ` Darrick J. Wong 0 siblings, 0 replies; 12+ messages in thread From: Darrick J. Wong @ 2024-08-30 23:02 UTC (permalink / raw) To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, josef, david On Fri, Aug 30, 2024 at 07:57:41AM -0400, Brian Foster wrote: > On Thu, Aug 29, 2024 at 02:48:00PM -0700, Darrick J. Wong wrote: > > On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote: > > > On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote: > > > > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote: > > > > > Yeah, it was buried in a separate review around potentially killing off > > > > > iomap_truncate_page(): > > > > > > > > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/ > > > > > > > > > > The idea is pretty simple.. use the same kind of check this patch does > > > > > for doing a flush, but instead open code and isolate it to > > > > > iomap_truncate_page() so we can just default to doing the buffered write > > > > > instead. > > > > > > > > > > Note that I don't think this replaces the need for patch 1, but it might > > > > > arguably make further optimization of the flush kind of pointless > > > > > because I'm not sure zero range would ever be called from somewhere that > > > > > doesn't flush already. > > > > > > > > > > The tradeoffs I can think of are this might introduce some false > > > > > positives where an EOF folio might be dirty but a sub-folio size block > > > > > backing EOF might be clean, and again that callers like truncate and > > > > > write extension would need to both truncate the eof page and zero the > > > > > broader post-eof range. Neither of those seem all that significant to > > > > > me, but just my .02. > > > > > > > > Looking at that patch and your current series I kinda like not having > > > > to deal with the dirty caches in the loop, and in fact I'd also prefer > > > > to not do any writeback from the low-level zero helpers if we can. > > > > That is not doing your patch 1 but instead auditing the callers if > > > > any of them needs them and documenting the expectation. > > > > I looked, and was pretty sure that XFS is the only one that has that > > expectation. > > > > > I agree this seems better in some ways, but I don't like complicating or > > > putting more responsibility on the callers. I think if we had a high > > > level iomap function that wrapped a combination of this proposed variant > > > of truncate_page() and zero_range() for general inode size changes, that > > > might alleviate that concern. > > > > > > Otherwise IME even if we audited and fixed all callers today, over time > > > we'll just reintroduce the same sorts of errors if the low level > > > mechanisms aren't made to function correctly. > > > > Yeah. What /are/ the criteria for needing the flush and wait? AFAICT, > > a filesystem only needs the flush if it's possible to have dirty > > pagecache backed either by a hole or an unwritten extent, right? > > > > Yeah, but this flush behavior shouldn't be a caller consideration at > all. It's just an implementation detail. All the caller should care > about is that zero range works As Expected (tm). > > The pre-iomap way of doing this in XFS was xfs_zero_eof() -> > xfs_iozero(), which was an internally coded buffered write loop that > wrote zeroes into pagecache. That was ultimately replaced with > iomap_zero_range() with the same sort of usage expectations, but > iomap_zero_range() just didn't work quite correctly in all cases. > > > I suppose we could amend the iomap ops so that filesystems could signal > > that they allow either of those things, and then we wouldn't have to > > query the mapping for filesystems that don't, right? IOWs, one can opt > > out of safety features if there's no risk of a garbage, right? > > > > Not sure I parse.. In general I think we could let ops signal whether > they want certain checks. This is how I used the IOMAP_F_DIRTY_CACHE > flag mentioned in the other thread. If the operation handler is > interested in pagecache state, set an IOMAP_DIRTY_CACHE flag in ops to > trigger a pre iomap_begin() check and then set the corresponding > _F_DIRTY_CACHE flag on the mapping if dirty, but I'm not sure if that's > the same concept you're alluding to here. Nope. I was thinking about adding a field to iomap_ops so that filesystems could declare which types of mappings they could return: const struct iomap_ops xfs_buffered_write_iomap_ops = { .iomap_begin = xfs_buffered_write_iomap_begin, .iomap_end = xfs_buffered_write_iomap_end, /* xfs allows sparse holes and unwritten extents */ .iomap_types = (1U << IOMAP_UNWRITTEN) | (1U << IOMAP_HOLE), }; But given your statement below about dirtying a post-eof region for which the filesystem does not allocate a block, I suspect we just have to enable the flush thing for everyone and don't need the flags thing. > > (Also: does xfs allow dirty page cache backed by a hole? I didn't think > > that was possible.) > > > > It's a corner case. A mapped write can write to any portion of a folio > so long as it starts within eof. So if you have a mapped write that > writes past EOF, there's no guarantee that range of the folio is mapped > by blocks. > > That post-eof part of the folio would be zeroed at writeback time, but > that assumes i_size doesn't change before writeback. If it does and the > size change operation doesn't do the zeroing itself (enter zero range > via write extension), then we end up with a dirty folio at least > partially backed by a hole with non-zero data within EOF. There's > nothing written back to disk in this hole backed example, but the > pagecache is still inconsistent with what's on disk and therefore I > suspect data corruption is possible if the folio is redirtied before > reclaimed. Ah. Yikes. --D > Brian > > > > > But please let Dave and Darrick chime in first before investing any > > > > work into this. > > > > > > > > > > > > > > Based on the feedback to v2, it sounds like there's general consensus on > > > the approach modulo some code factoring discussion. Unless there is > > > objection, I think I'll stick with that for now for the sake of progress > > > and keep this option in mind on the back burner. None of this is really > > > that hard to change if we come up with something better. > > > > > > Brian > > > > > > > > > > ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-08-30 23:02 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-08-22 14:59 [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster 2024-08-22 14:59 ` [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster 2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster 2024-08-27 6:11 ` Christoph Hellwig 2024-08-27 14:23 ` Brian Foster 2024-08-28 4:32 ` Christoph Hellwig 2024-08-28 12:35 ` Brian Foster 2024-08-29 5:41 ` Christoph Hellwig 2024-08-29 15:05 ` Brian Foster 2024-08-29 21:48 ` Darrick J. Wong 2024-08-30 11:57 ` Brian Foster 2024-08-30 23:02 ` Darrick J. Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).