* [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range
@ 2024-08-22 14:59 Brian Foster
2024-08-22 14:59 ` [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
0 siblings, 2 replies; 12+ messages in thread
From: Brian Foster @ 2024-08-22 14:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david
Hi all,
This is the alternative flushing solution to the iomap zero range
problem with dirty pagecache over unwritten mappings. This is done in
two steps. Patch 1 lifts the XFS workaround into iomap, flushing the
range unconditionally and providing an easily backportable fix for
stable kernels. Patch 2 buries the flush further down into iomap, making
it conditional on the combined presence of dirty cache and unwritten
mappings in the target range. This may be reasonable backportable as
well, but is only required if performance is a concern.
I still have to look into the improved revalidation approach discussed
in the RFC thread, but given that requires validation support and this
is intended to be a generic fallback, I wanted to get this nailed down
first.
fstests coverage for this problem is posted here [1]. Thoughts, reviews,
flames appreciated.
Brian
v1:
- Alternative approach, flush instead of revalidate.
rfc: https://lore.kernel.org/linux-fsdevel/20240718130212.23905-1-bfoster@redhat.com/
[1] https://lore.kernel.org/fstests/20240822144422.188462-1-bfoster@redhat.com/
Brian Foster (2):
iomap: fix handling of dirty folios over unwritten extents
iomap: make zero range flush conditional on unwritten mappings
fs/iomap/buffered-io.c | 52 ++++++++++++++++++++++++++++++++++++++----
fs/xfs/xfs_iops.c | 10 --------
2 files changed, 48 insertions(+), 14 deletions(-)
--
2.45.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents
2024-08-22 14:59 [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
@ 2024-08-22 14:59 ` Brian Foster
2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2024-08-22 14:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david
The iomap zero range implementation doesn't properly handle dirty
pagecache over unwritten mappings. It skips such mappings as if they
were pre-zeroed. If some part of an unwritten mapping is dirty in
pagecache from a previous write, the data in cache should be zeroed
as well. Instead, the data is left in cache and creates a stale data
exposure problem if writeback occurs sometime after the zero range.
Most callers are unaffected by this because the higher level
filesystem contexts that call zero range typically perform a filemap
flush of the target range for other reasons. A couple contexts that
don't otherwise need to flush are write file size extension and
truncate in XFS. The former path is currently susceptible to the
stale data exposure problem and the latter performs a flush
specifically to work around it.
This is clearly inconsistent and incomplete. As a first step toward
correcting behavior, lift the XFS workaround to iomap_zero_range()
and unconditionally flush the range before the zero range operation
proceeds. While this appears to be a bit of a big hammer, most all
users already do this from calling context save for the couple of
exceptions noted above. Future patches will optimize or elide this
flush while maintaining functional correctness.
Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/iomap/buffered-io.c | 10 ++++++++++
fs/xfs/xfs_iops.c | 10 ----------
2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index f420c53d86ac..3e846f43ff48 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1451,6 +1451,16 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
};
int ret;
+ /*
+ * Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
+ * pagecache must be flushed to ensure stale data from previous
+ * buffered writes is not exposed.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping,
+ pos, pos + len - 1);
+ if (ret)
+ return ret;
+
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_zero_iter(&iter, did_zero);
return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1cdc8034f54d..ddd3697e6ecd 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -870,16 +870,6 @@ xfs_setattr_size(
error = xfs_zero_range(ip, oldsize, newsize - oldsize,
&did_zeroing);
} else {
- /*
- * iomap won't detect a dirty page over an unwritten block (or a
- * cow block over a hole) and subsequently skips zeroing the
- * newly post-EOF portion of the page. Flush the new EOF to
- * convert the block before the pagecache truncate.
- */
- error = filemap_write_and_wait_range(inode->i_mapping, newsize,
- newsize);
- if (error)
- return error;
error = xfs_truncate_page(ip, newsize, &did_zeroing);
}
--
2.45.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-22 14:59 [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
2024-08-22 14:59 ` [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
@ 2024-08-22 14:59 ` Brian Foster
2024-08-27 6:11 ` Christoph Hellwig
1 sibling, 1 reply; 12+ messages in thread
From: Brian Foster @ 2024-08-22 14:59 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-xfs, djwong, josef, david
iomap_zero_range() flushes pagecache to mitigate consistency
problems with dirty pagecache and unwritten mappings. The flush is
unconditional over the entire range because checking pagecache state
after mapping lookup is racy with writeback and reclaim. There are
ways around this using iomap's mapping revalidation mechanism, but
this is not supported by all iomap based filesystems and so is not a
generic solution.
There is another way around this limitation that is good enough to
filter the flush for most cases in practice. If we check for dirty
pagecache over the target range (instead of unconditionally flush),
we can keep track of whether the range was dirty before lookup and
defer the flush until/unless we see a combination of dirty cache
backed by an unwritten mapping. We don't necessarily know whether
the dirty cache was backed by the unwritten maping or some other
(written) part of the range, but the impliciation of a false
positive here is a spurious flush and thus relatively harmless.
Note that we also flush for hole mappings because iomap_zero_range()
is used for partial folio zeroing in some cases. For example, if a
folio straddles EOF on a sub-page FSB size fs, the post-eof portion
is hole-backed and dirtied/written via mapped write, and then i_size
increases before writeback can occur (which otherwise zeroes the
post-eof portion of the EOF folio), then the folio becomes
inconsistent with disk until reclaimed. A flush in this case
executes partial zeroing from writeback, and iomap knows that there
is otherwise no I/O to submit for hole backed mappings.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/iomap/buffered-io.c | 52 ++++++++++++++++++++++++++++++++++--------
1 file changed, 43 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3e846f43ff48..841cd01d8194 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1393,16 +1393,42 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
}
EXPORT_SYMBOL_GPL(iomap_file_unshare);
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+/*
+ * Flush the remaining range of the iter and mark the current mapping stale.
+ * This is used when zero range sees an unwritten mapping that may have had
+ * dirty pagecache over it.
+ */
+static inline int iomap_zero_iter_flush_and_stale(struct iomap_iter *i)
+{
+ struct address_space *mapping = i->inode->i_mapping;
+ loff_t end = i->pos + i->len - 1;
+
+ i->iomap.flags |= IOMAP_F_STALE;
+ return filemap_write_and_wait_range(mapping, i->pos, end);
+}
+
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
+ bool *range_dirty)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t pos = iter->pos;
loff_t length = iomap_length(iter);
loff_t written = 0;
- /* already zeroed? we're done. */
- if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+ /*
+ * We can skip pre-zeroed mappings so long as either the mapping was
+ * clean before we started or we've flushed at least once since.
+ * Otherwise we don't know whether the current mapping had dirty
+ * pagecache, so flush it now, stale the current mapping, and proceed
+ * from there.
+ */
+ if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
+ if (*range_dirty) {
+ *range_dirty = false;
+ return iomap_zero_iter_flush_and_stale(iter);
+ }
return length;
+ }
do {
struct folio *folio;
@@ -1450,19 +1476,27 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
.flags = IOMAP_ZERO,
};
int ret;
+ bool range_dirty;
/*
* Zero range wants to skip pre-zeroed (i.e. unwritten) mappings, but
* pagecache must be flushed to ensure stale data from previous
- * buffered writes is not exposed.
+ * buffered writes is not exposed. A flush is only required for certain
+ * types of mappings, but checking pagecache after mapping lookup is
+ * racy with writeback and reclaim.
+ *
+ * Therefore, check the entire range first and pass along whether any
+ * part of it is dirty. If so and an underlying mapping warrants it,
+ * flush the cache at that point. This trades off the occasional false
+ * positive (and spurious flush, if the dirty data and mapping don't
+ * happen to overlap) for simplicity in handling a relatively uncommon
+ * situation.
*/
- ret = filemap_write_and_wait_range(inode->i_mapping,
- pos, pos + len - 1);
- if (ret)
- return ret;
+ range_dirty = filemap_range_needs_writeback(inode->i_mapping,
+ pos, pos + len - 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.processed = iomap_zero_iter(&iter, did_zero);
+ iter.processed = iomap_zero_iter(&iter, did_zero, &range_dirty);
return ret;
}
EXPORT_SYMBOL_GPL(iomap_zero_range);
--
2.45.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
@ 2024-08-27 6:11 ` Christoph Hellwig
2024-08-27 14:23 ` Brian Foster
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2024-08-27 6:11 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david
On Thu, Aug 22, 2024 at 10:59:10AM -0400, Brian Foster wrote:
> Note that we also flush for hole mappings because iomap_zero_range()
> is used for partial folio zeroing in some cases. For example, if a
> folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> is hole-backed and dirtied/written via mapped write, and then i_size
> increases before writeback can occur (which otherwise zeroes the
> post-eof portion of the EOF folio), then the folio becomes
> inconsistent with disk until reclaimed.
Eww. I'm not sure iomap_zero_range is the right way to handle this
even if that's what we have now and it kinda work.
> + /*
> + * We can skip pre-zeroed mappings so long as either the mapping was
> + * clean before we started or we've flushed at least once since.
> + * Otherwise we don't know whether the current mapping had dirty
> + * pagecache, so flush it now, stale the current mapping, and proceed
> + * from there.
> + */
> + if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
.. at very least the above needs to be documented here as a big fat
reminder, though.
Otherwise the series looks sensible to me.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-27 6:11 ` Christoph Hellwig
@ 2024-08-27 14:23 ` Brian Foster
2024-08-28 4:32 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2024-08-27 14:23 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david
On Mon, Aug 26, 2024 at 11:11:42PM -0700, Christoph Hellwig wrote:
> On Thu, Aug 22, 2024 at 10:59:10AM -0400, Brian Foster wrote:
> > Note that we also flush for hole mappings because iomap_zero_range()
> > is used for partial folio zeroing in some cases. For example, if a
> > folio straddles EOF on a sub-page FSB size fs, the post-eof portion
> > is hole-backed and dirtied/written via mapped write, and then i_size
> > increases before writeback can occur (which otherwise zeroes the
> > post-eof portion of the EOF folio), then the folio becomes
> > inconsistent with disk until reclaimed.
>
> Eww. I'm not sure iomap_zero_range is the right way to handle this
> even if that's what we have now and it kinda work.
>
Yeah, I agree with that. That was one of the minor appeals (to me) of the
prototype I posted a while back that customizes iomap_truncate_page() to
do unconditional zeroing instead of being an almost pointless wrapper
for iomap_zero_range(). I don't quite recall if that explicitly handled
the hole case since it was quickly hacked together, but the general idea
is that it could be more purpose built for these partial zeroing cases
than zero range might be.
There are some other caveats with that approach that would still require
some variant of zero range, however, so I'm not immediately clear on
what the ideal high level solution looks like quite yet. I'd like to get
the existing mechanism working correctly, improve the test situation and
go from there.
> > + /*
> > + * We can skip pre-zeroed mappings so long as either the mapping was
> > + * clean before we started or we've flushed at least once since.
> > + * Otherwise we don't know whether the current mapping had dirty
> > + * pagecache, so flush it now, stale the current mapping, and proceed
> > + * from there.
> > + */
> > + if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
>
> .. at very least the above needs to be documented here as a big fat
> reminder, though.
>
Sure, I'll include that explanation in the code comments in v2. Thanks.
Brian
> Otherwise the series looks sensible to me.
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-27 14:23 ` Brian Foster
@ 2024-08-28 4:32 ` Christoph Hellwig
2024-08-28 12:35 ` Brian Foster
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2024-08-28 4:32 UTC (permalink / raw)
To: Brian Foster
Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, djwong, josef, david
On Tue, Aug 27, 2024 at 10:23:10AM -0400, Brian Foster wrote:
> Yeah, I agree with that. That was one of the minor appeals (to me) of the
> prototype I posted a while back that customizes iomap_truncate_page() to
> do unconditional zeroing instead of being an almost pointless wrapper
> for iomap_zero_range().
I only very vaguely remember that, you don't happen to have a pointer
to that?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-28 4:32 ` Christoph Hellwig
@ 2024-08-28 12:35 ` Brian Foster
2024-08-29 5:41 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2024-08-28 12:35 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david
On Tue, Aug 27, 2024 at 09:32:35PM -0700, Christoph Hellwig wrote:
> On Tue, Aug 27, 2024 at 10:23:10AM -0400, Brian Foster wrote:
> > Yeah, I agree with that. That was one of the minor appeals (to me) of the
> > prototype I posted a while back that customizes iomap_truncate_page() to
> > do unconditional zeroing instead of being an almost pointless wrapper
> > for iomap_zero_range().
>
> I only very vaguely remember that, you don't happen to have a pointer
> to that?
>
>
Yeah, it was buried in a separate review around potentially killing off
iomap_truncate_page():
https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
The idea is pretty simple.. use the same kind of check this patch does
for doing a flush, but instead open code and isolate it to
iomap_truncate_page() so we can just default to doing the buffered write
instead.
Note that I don't think this replaces the need for patch 1, but it might
arguably make further optimization of the flush kind of pointless
because I'm not sure zero range would ever be called from somewhere that
doesn't flush already.
The tradeoffs I can think of are this might introduce some false
positives where an EOF folio might be dirty but a sub-folio size block
backing EOF might be clean, and again that callers like truncate and
write extension would need to both truncate the eof page and zero the
broader post-eof range. Neither of those seem all that significant to
me, but just my .02.
Brian
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-28 12:35 ` Brian Foster
@ 2024-08-29 5:41 ` Christoph Hellwig
2024-08-29 15:05 ` Brian Foster
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2024-08-29 5:41 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david
On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote:
> Yeah, it was buried in a separate review around potentially killing off
> iomap_truncate_page():
>
> https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
>
> The idea is pretty simple.. use the same kind of check this patch does
> for doing a flush, but instead open code and isolate it to
> iomap_truncate_page() so we can just default to doing the buffered write
> instead.
>
> Note that I don't think this replaces the need for patch 1, but it might
> arguably make further optimization of the flush kind of pointless
> because I'm not sure zero range would ever be called from somewhere that
> doesn't flush already.
>
> The tradeoffs I can think of are this might introduce some false
> positives where an EOF folio might be dirty but a sub-folio size block
> backing EOF might be clean, and again that callers like truncate and
> write extension would need to both truncate the eof page and zero the
> broader post-eof range. Neither of those seem all that significant to
> me, but just my .02.
Looking at that patch and your current series I kinda like not having
to deal with the dirty caches in the loop, and in fact I'd also prefer
to not do any writeback from the low-level zero helpers if we can.
That is not doing your patch 1 but instead auditing the callers if
any of them needs them and documenting the expectation.
But please let Dave and Darrick chime in first before investing any
work into this.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-29 5:41 ` Christoph Hellwig
@ 2024-08-29 15:05 ` Brian Foster
2024-08-29 21:48 ` Darrick J. Wong
0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2024-08-29 15:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, linux-xfs, djwong, josef, david
On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote:
> On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote:
> > Yeah, it was buried in a separate review around potentially killing off
> > iomap_truncate_page():
> >
> > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
> >
> > The idea is pretty simple.. use the same kind of check this patch does
> > for doing a flush, but instead open code and isolate it to
> > iomap_truncate_page() so we can just default to doing the buffered write
> > instead.
> >
> > Note that I don't think this replaces the need for patch 1, but it might
> > arguably make further optimization of the flush kind of pointless
> > because I'm not sure zero range would ever be called from somewhere that
> > doesn't flush already.
> >
> > The tradeoffs I can think of are this might introduce some false
> > positives where an EOF folio might be dirty but a sub-folio size block
> > backing EOF might be clean, and again that callers like truncate and
> > write extension would need to both truncate the eof page and zero the
> > broader post-eof range. Neither of those seem all that significant to
> > me, but just my .02.
>
> Looking at that patch and your current series I kinda like not having
> to deal with the dirty caches in the loop, and in fact I'd also prefer
> to not do any writeback from the low-level zero helpers if we can.
> That is not doing your patch 1 but instead auditing the callers if
> any of them needs them and documenting the expectation.
>
I agree this seems better in some ways, but I don't like complicating or
putting more responsibility on the callers. I think if we had a high
level iomap function that wrapped a combination of this proposed variant
of truncate_page() and zero_range() for general inode size changes, that
might alleviate that concern.
Otherwise IME even if we audited and fixed all callers today, over time
we'll just reintroduce the same sorts of errors if the low level
mechanisms aren't made to function correctly.
> But please let Dave and Darrick chime in first before investing any
> work into this.
>
>
Based on the feedback to v2, it sounds like there's general consensus on
the approach modulo some code factoring discussion. Unless there is
objection, I think I'll stick with that for now for the sake of progress
and keep this option in mind on the back burner. None of this is really
that hard to change if we come up with something better.
Brian
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-29 15:05 ` Brian Foster
@ 2024-08-29 21:48 ` Darrick J. Wong
2024-08-30 11:57 ` Brian Foster
0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2024-08-29 21:48 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, josef, david
On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote:
> On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote:
> > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote:
> > > Yeah, it was buried in a separate review around potentially killing off
> > > iomap_truncate_page():
> > >
> > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
> > >
> > > The idea is pretty simple.. use the same kind of check this patch does
> > > for doing a flush, but instead open code and isolate it to
> > > iomap_truncate_page() so we can just default to doing the buffered write
> > > instead.
> > >
> > > Note that I don't think this replaces the need for patch 1, but it might
> > > arguably make further optimization of the flush kind of pointless
> > > because I'm not sure zero range would ever be called from somewhere that
> > > doesn't flush already.
> > >
> > > The tradeoffs I can think of are this might introduce some false
> > > positives where an EOF folio might be dirty but a sub-folio size block
> > > backing EOF might be clean, and again that callers like truncate and
> > > write extension would need to both truncate the eof page and zero the
> > > broader post-eof range. Neither of those seem all that significant to
> > > me, but just my .02.
> >
> > Looking at that patch and your current series I kinda like not having
> > to deal with the dirty caches in the loop, and in fact I'd also prefer
> > to not do any writeback from the low-level zero helpers if we can.
> > That is not doing your patch 1 but instead auditing the callers if
> > any of them needs them and documenting the expectation.
I looked, and was pretty sure that XFS is the only one that has that
expectation.
> I agree this seems better in some ways, but I don't like complicating or
> putting more responsibility on the callers. I think if we had a high
> level iomap function that wrapped a combination of this proposed variant
> of truncate_page() and zero_range() for general inode size changes, that
> might alleviate that concern.
>
> Otherwise IME even if we audited and fixed all callers today, over time
> we'll just reintroduce the same sorts of errors if the low level
> mechanisms aren't made to function correctly.
Yeah. What /are/ the criteria for needing the flush and wait? AFAICT,
a filesystem only needs the flush if it's possible to have dirty
pagecache backed either by a hole or an unwritten extent, right?
I suppose we could amend the iomap ops so that filesystems could signal
that they allow either of those things, and then we wouldn't have to
query the mapping for filesystems that don't, right? IOWs, one can opt
out of safety features if there's no risk of a garbage, right?
(Also: does xfs allow dirty page cache backed by a hole? I didn't think
that was possible.)
> > But please let Dave and Darrick chime in first before investing any
> > work into this.
> >
> >
>
> Based on the feedback to v2, it sounds like there's general consensus on
> the approach modulo some code factoring discussion. Unless there is
> objection, I think I'll stick with that for now for the sake of progress
> and keep this option in mind on the back burner. None of this is really
> that hard to change if we come up with something better.
>
> Brian
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-29 21:48 ` Darrick J. Wong
@ 2024-08-30 11:57 ` Brian Foster
2024-08-30 23:02 ` Darrick J. Wong
0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2024-08-30 11:57 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, josef, david
On Thu, Aug 29, 2024 at 02:48:00PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote:
> > On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote:
> > > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote:
> > > > Yeah, it was buried in a separate review around potentially killing off
> > > > iomap_truncate_page():
> > > >
> > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
> > > >
> > > > The idea is pretty simple.. use the same kind of check this patch does
> > > > for doing a flush, but instead open code and isolate it to
> > > > iomap_truncate_page() so we can just default to doing the buffered write
> > > > instead.
> > > >
> > > > Note that I don't think this replaces the need for patch 1, but it might
> > > > arguably make further optimization of the flush kind of pointless
> > > > because I'm not sure zero range would ever be called from somewhere that
> > > > doesn't flush already.
> > > >
> > > > The tradeoffs I can think of are this might introduce some false
> > > > positives where an EOF folio might be dirty but a sub-folio size block
> > > > backing EOF might be clean, and again that callers like truncate and
> > > > write extension would need to both truncate the eof page and zero the
> > > > broader post-eof range. Neither of those seem all that significant to
> > > > me, but just my .02.
> > >
> > > Looking at that patch and your current series I kinda like not having
> > > to deal with the dirty caches in the loop, and in fact I'd also prefer
> > > to not do any writeback from the low-level zero helpers if we can.
> > > That is not doing your patch 1 but instead auditing the callers if
> > > any of them needs them and documenting the expectation.
>
> I looked, and was pretty sure that XFS is the only one that has that
> expectation.
>
> > I agree this seems better in some ways, but I don't like complicating or
> > putting more responsibility on the callers. I think if we had a high
> > level iomap function that wrapped a combination of this proposed variant
> > of truncate_page() and zero_range() for general inode size changes, that
> > might alleviate that concern.
> >
> > Otherwise IME even if we audited and fixed all callers today, over time
> > we'll just reintroduce the same sorts of errors if the low level
> > mechanisms aren't made to function correctly.
>
> Yeah. What /are/ the criteria for needing the flush and wait? AFAICT,
> a filesystem only needs the flush if it's possible to have dirty
> pagecache backed either by a hole or an unwritten extent, right?
>
Yeah, but this flush behavior shouldn't be a caller consideration at
all. It's just an implementation detail. All the caller should care
about is that zero range works As Expected (tm).
The pre-iomap way of doing this in XFS was xfs_zero_eof() ->
xfs_iozero(), which was an internally coded buffered write loop that
wrote zeroes into pagecache. That was ultimately replaced with
iomap_zero_range() with the same sort of usage expectations, but
iomap_zero_range() just didn't work quite correctly in all cases.
> I suppose we could amend the iomap ops so that filesystems could signal
> that they allow either of those things, and then we wouldn't have to
> query the mapping for filesystems that don't, right? IOWs, one can opt
> out of safety features if there's no risk of a garbage, right?
>
Not sure I parse.. In general I think we could let ops signal whether
they want certain checks. This is how I used the IOMAP_F_DIRTY_CACHE
flag mentioned in the other thread. If the operation handler is
interested in pagecache state, set an IOMAP_DIRTY_CACHE flag in ops to
trigger a pre iomap_begin() check and then set the corresponding
_F_DIRTY_CACHE flag on the mapping if dirty, but I'm not sure if that's
the same concept you're alluding to here.
> (Also: does xfs allow dirty page cache backed by a hole? I didn't think
> that was possible.)
>
It's a corner case. A mapped write can write to any portion of a folio
so long as it starts within eof. So if you have a mapped write that
writes past EOF, there's no guarantee that range of the folio is mapped
by blocks.
That post-eof part of the folio would be zeroed at writeback time, but
that assumes i_size doesn't change before writeback. If it does and the
size change operation doesn't do the zeroing itself (enter zero range
via write extension), then we end up with a dirty folio at least
partially backed by a hole with non-zero data within EOF. There's
nothing written back to disk in this hole backed example, but the
pagecache is still inconsistent with what's on disk and therefore I
suspect data corruption is possible if the folio is redirtied before
reclaimed.
Brian
> > > But please let Dave and Darrick chime in first before investing any
> > > work into this.
> > >
> > >
> >
> > Based on the feedback to v2, it sounds like there's general consensus on
> > the approach modulo some code factoring discussion. Unless there is
> > objection, I think I'll stick with that for now for the sake of progress
> > and keep this option in mind on the back burner. None of this is really
> > that hard to change if we come up with something better.
> >
> > Brian
> >
> >
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings
2024-08-30 11:57 ` Brian Foster
@ 2024-08-30 23:02 ` Darrick J. Wong
0 siblings, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2024-08-30 23:02 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, linux-fsdevel, linux-xfs, josef, david
On Fri, Aug 30, 2024 at 07:57:41AM -0400, Brian Foster wrote:
> On Thu, Aug 29, 2024 at 02:48:00PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 29, 2024 at 11:05:59AM -0400, Brian Foster wrote:
> > > On Wed, Aug 28, 2024 at 10:41:56PM -0700, Christoph Hellwig wrote:
> > > > On Wed, Aug 28, 2024 at 08:35:47AM -0400, Brian Foster wrote:
> > > > > Yeah, it was buried in a separate review around potentially killing off
> > > > > iomap_truncate_page():
> > > > >
> > > > > https://lore.kernel.org/linux-fsdevel/ZlxUpYvb9dlOHFR3@bfoster/
> > > > >
> > > > > The idea is pretty simple.. use the same kind of check this patch does
> > > > > for doing a flush, but instead open code and isolate it to
> > > > > iomap_truncate_page() so we can just default to doing the buffered write
> > > > > instead.
> > > > >
> > > > > Note that I don't think this replaces the need for patch 1, but it might
> > > > > arguably make further optimization of the flush kind of pointless
> > > > > because I'm not sure zero range would ever be called from somewhere that
> > > > > doesn't flush already.
> > > > >
> > > > > The tradeoffs I can think of are this might introduce some false
> > > > > positives where an EOF folio might be dirty but a sub-folio size block
> > > > > backing EOF might be clean, and again that callers like truncate and
> > > > > write extension would need to both truncate the eof page and zero the
> > > > > broader post-eof range. Neither of those seem all that significant to
> > > > > me, but just my .02.
> > > >
> > > > Looking at that patch and your current series I kinda like not having
> > > > to deal with the dirty caches in the loop, and in fact I'd also prefer
> > > > to not do any writeback from the low-level zero helpers if we can.
> > > > That is not doing your patch 1 but instead auditing the callers if
> > > > any of them needs them and documenting the expectation.
> >
> > I looked, and was pretty sure that XFS is the only one that has that
> > expectation.
> >
> > > I agree this seems better in some ways, but I don't like complicating or
> > > putting more responsibility on the callers. I think if we had a high
> > > level iomap function that wrapped a combination of this proposed variant
> > > of truncate_page() and zero_range() for general inode size changes, that
> > > might alleviate that concern.
> > >
> > > Otherwise IME even if we audited and fixed all callers today, over time
> > > we'll just reintroduce the same sorts of errors if the low level
> > > mechanisms aren't made to function correctly.
> >
> > Yeah. What /are/ the criteria for needing the flush and wait? AFAICT,
> > a filesystem only needs the flush if it's possible to have dirty
> > pagecache backed either by a hole or an unwritten extent, right?
> >
>
> Yeah, but this flush behavior shouldn't be a caller consideration at
> all. It's just an implementation detail. All the caller should care
> about is that zero range works As Expected (tm).
>
> The pre-iomap way of doing this in XFS was xfs_zero_eof() ->
> xfs_iozero(), which was an internally coded buffered write loop that
> wrote zeroes into pagecache. That was ultimately replaced with
> iomap_zero_range() with the same sort of usage expectations, but
> iomap_zero_range() just didn't work quite correctly in all cases.
>
> > I suppose we could amend the iomap ops so that filesystems could signal
> > that they allow either of those things, and then we wouldn't have to
> > query the mapping for filesystems that don't, right? IOWs, one can opt
> > out of safety features if there's no risk of a garbage, right?
> >
>
> Not sure I parse.. In general I think we could let ops signal whether
> they want certain checks. This is how I used the IOMAP_F_DIRTY_CACHE
> flag mentioned in the other thread. If the operation handler is
> interested in pagecache state, set an IOMAP_DIRTY_CACHE flag in ops to
> trigger a pre iomap_begin() check and then set the corresponding
> _F_DIRTY_CACHE flag on the mapping if dirty, but I'm not sure if that's
> the same concept you're alluding to here.
Nope. I was thinking about adding a field to iomap_ops so that
filesystems could declare which types of mappings they could return:
const struct iomap_ops xfs_buffered_write_iomap_ops = {
.iomap_begin = xfs_buffered_write_iomap_begin,
.iomap_end = xfs_buffered_write_iomap_end,
/* xfs allows sparse holes and unwritten extents */
.iomap_types = (1U << IOMAP_UNWRITTEN) | (1U << IOMAP_HOLE),
};
But given your statement below about dirtying a post-eof region for
which the filesystem does not allocate a block, I suspect we just have
to enable the flush thing for everyone and don't need the flags thing.
> > (Also: does xfs allow dirty page cache backed by a hole? I didn't think
> > that was possible.)
> >
>
> It's a corner case. A mapped write can write to any portion of a folio
> so long as it starts within eof. So if you have a mapped write that
> writes past EOF, there's no guarantee that range of the folio is mapped
> by blocks.
>
> That post-eof part of the folio would be zeroed at writeback time, but
> that assumes i_size doesn't change before writeback. If it does and the
> size change operation doesn't do the zeroing itself (enter zero range
> via write extension), then we end up with a dirty folio at least
> partially backed by a hole with non-zero data within EOF. There's
> nothing written back to disk in this hole backed example, but the
> pagecache is still inconsistent with what's on disk and therefore I
> suspect data corruption is possible if the folio is redirtied before
> reclaimed.
Ah. Yikes.
--D
> Brian
>
> > > > But please let Dave and Darrick chime in first before investing any
> > > > work into this.
> > > >
> > > >
> > >
> > > Based on the feedback to v2, it sounds like there's general consensus on
> > > the approach modulo some code factoring discussion. Unless there is
> > > objection, I think I'll stick with that for now for the sake of progress
> > > and keep this option in mind on the back burner. None of this is really
> > > that hard to change if we come up with something better.
> > >
> > > Brian
> > >
> > >
> >
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2024-08-30 23:02 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-22 14:59 [PATCH 0/2] iomap: flush dirty cache over unwritten mappings on zero range Brian Foster
2024-08-22 14:59 ` [PATCH 1/2] iomap: fix handling of dirty folios over unwritten extents Brian Foster
2024-08-22 14:59 ` [PATCH 2/2] iomap: make zero range flush conditional on unwritten mappings Brian Foster
2024-08-27 6:11 ` Christoph Hellwig
2024-08-27 14:23 ` Brian Foster
2024-08-28 4:32 ` Christoph Hellwig
2024-08-28 12:35 ` Brian Foster
2024-08-29 5:41 ` Christoph Hellwig
2024-08-29 15:05 ` Brian Foster
2024-08-29 21:48 ` Darrick J. Wong
2024-08-30 11:57 ` Brian Foster
2024-08-30 23:02 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).