linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
@ 2024-07-18 13:02 Brian Foster
  2024-07-18 13:02 ` [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback Brian Foster
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-18 13:02 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

Hi all,

This is a stab at fixing the iomap zero range problem where it doesn't
correctly handle the case of an unwritten mapping with dirty pagecache.
The gist is that we scan the mapping for dirty cache, zero any
already-dirty folios via buffered writes as normal, but then otherwise
skip clean ranges once we have a chance to validate those ranges against
races with writeback or reclaim.
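
As a rough illustration, the per-mapping decision described above can be
modeled in userspace C (all names here are hypothetical, not the iomap API;
written mappings are handled as before, only the unwritten case changes):

```c
#include <assert.h>
#include <string.h>

/*
 * Coarse model of the zero-range decision: holes and clean unwritten
 * ranges already read back as zeroes, dirty cache over an unwritten
 * mapping must be zeroed in place, and clean unwritten ranges may only
 * be skipped once the mapping has been revalidated against racing
 * writeback or reclaim.
 */
static const char *zero_action(char map_type, int cache_dirty)
{
	if (map_type == 'H')
		return "skip";			/* hole: nothing to zero */
	if (map_type == 'U')
		return cache_dirty ? "zero dirty folio"
				   : "revalidate, then skip";
	return "zero folio";			/* written: zero via buffered write */
}
```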

This is somewhat simplistic in terms of how it scans, but that is
intentional based on the existing use cases for zero range. From poking
around a bit, my current sense is that there isn't any user of zero
range that would ever expect to see more than a single dirty folio. Most
callers either straddle the EOF folio or flush in higher level code for
presumably (fs) context specific reasons. If somebody has an example to
the contrary, please let me know because I'd love to be able to use it
for testing.

The caveat to this approach is that it only works for filesystems that
implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
doesn't use ->iomap_valid() and does call zero range, but AFAICT it
doesn't actually export unwritten mappings so I suspect this is not a
problem. My understanding is that ext4 iomap support is in progress, but
I've not yet dug into what that looks like (though I suspect similar to
XFS). The concern is mainly that this leaves a landmine for fs that
might grow support for unwritten mappings && zero range but not
->iomap_valid(). We'd likely never know zero range was broken for such
fs until stale data exposure problems start to materialize.

I considered adding a fallback to just add a flush at the top of
iomap_zero_range() so at least all future users would be correct, but I
wanted to gate that on the absence of ->iomap_valid() and folio_ops
isn't provided until iomap_begin() time. I suppose another way around
that could be to add a flags param to iomap_zero_range() where the
caller could explicitly opt out of a flush, but that's still kind of
ugly. I dunno, maybe better than nothing..?

So IMO, this raises the question of whether this is just unnecessarily
overcomplicated. The KISS principle implies that it would also be
perfectly fine to do a conditional "flush and stale" in zero range
whenever we see the combination of an unwritten mapping and dirty
pagecache (the latter checked before or during ->iomap_begin()). That's
simple to implement and AFAICT would work/perform adequately and
generically for all filesystems. I have one or two prototypes of this
sort of thing if folks want to see it as an alternative.

Otherwise, this series has so far seen some light regression and
targeted testing. Patches 1-2 are factoring dependencies, patch 3
implements the main fix, and patch 4 drops the flush from XFS truncate.
Thoughts, reviews, flames appreciated.

Brian

Brian Foster (4):
  filemap: return pos of first dirty folio from range_has_writeback
  iomap: refactor an iomap_revalidate() helper
  iomap: fix handling of dirty folios over unwritten extents
  xfs: remove unnecessary flush of eof page from truncate

 fs/iomap/buffered-io.c  | 81 ++++++++++++++++++++++++++++++++++-------
 fs/xfs/xfs_iops.c       | 10 -----
 include/linux/pagemap.h |  4 +-
 mm/filemap.c            |  8 ++--
 4 files changed, 75 insertions(+), 28 deletions(-)

-- 
2.45.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback
  2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
@ 2024-07-18 13:02 ` Brian Foster
  2024-07-18 15:09   ` Matthew Wilcox
  2024-07-18 13:02 ` [PATCH 2/4] iomap: refactor an iomap_revalidate() helper Brian Foster
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2024-07-18 13:02 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

The iomap layer has a use case that wants to locate the first dirty
or writeback folio in a particular range.
filemap_range_has_writeback() already implements this with the
exception that it only returns a boolean.

Since the _needs_writeback() wrapper is currently the only caller,
tweak has_writeback to take a pointer for the starting offset and
update it to the offset of the first dirty folio found.
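
The shape of the interface change can be sketched in plain userspace C (a
simplified model with a flag array standing in for the pagecache; names are
hypothetical). The kernel version additionally works on folio granularity,
so the reported position can precede the requested start:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the tweaked interface: scan [*start, end] for the first
 * "dirty" slot.  On a hit, report its position by advancing *start; on
 * a clean range, leave *start untouched and return false.
 */
static bool range_has_dirty(const bool *dirty, size_t *start, size_t end)
{
	for (size_t i = *start; i <= end; i++) {
		if (dirty[i]) {
			*start = i;	/* position of first dirty slot */
			return true;
		}
	}
	return false;
}
```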

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 include/linux/pagemap.h | 4 ++--
 mm/filemap.c            | 8 +++++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a0a026d2d244..a15131a3fa12 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1217,7 +1217,7 @@ int __filemap_add_folio(struct address_space *mapping, struct folio *folio,
 		pgoff_t index, gfp_t gfp, void **shadowp);
 
 bool filemap_range_has_writeback(struct address_space *mapping,
-				 loff_t start_byte, loff_t end_byte);
+				 loff_t *start_byte, loff_t end_byte);
 
 /**
  * filemap_range_needs_writeback - check if range potentially needs writeback
@@ -1242,7 +1242,7 @@ static inline bool filemap_range_needs_writeback(struct address_space *mapping,
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
 	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
 		return false;
-	return filemap_range_has_writeback(mapping, start_byte, end_byte);
+	return filemap_range_has_writeback(mapping, &start_byte, end_byte);
 }
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 657bcd887fdb..be0a219e8d9e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -636,13 +636,13 @@ static bool mapping_needs_writeback(struct address_space *mapping)
 }
 
 bool filemap_range_has_writeback(struct address_space *mapping,
-				 loff_t start_byte, loff_t end_byte)
+				 loff_t *start_byte, loff_t end_byte)
 {
-	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
+	XA_STATE(xas, &mapping->i_pages, *start_byte >> PAGE_SHIFT);
 	pgoff_t max = end_byte >> PAGE_SHIFT;
 	struct folio *folio;
 
-	if (end_byte < start_byte)
+	if (end_byte < *start_byte)
 		return false;
 
 	rcu_read_lock();
@@ -655,6 +655,8 @@ bool filemap_range_has_writeback(struct address_space *mapping,
 				folio_test_writeback(folio))
 			break;
 	}
+	if (folio)
+		*start_byte = folio_pos(folio);
 	rcu_read_unlock();
 	return folio != NULL;
 }
-- 
2.45.0



* [PATCH 2/4] iomap: refactor an iomap_revalidate() helper
  2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
  2024-07-18 13:02 ` [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback Brian Foster
@ 2024-07-18 13:02 ` Brian Foster
  2024-07-18 13:02 ` [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents Brian Foster
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-18 13:02 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

The buffered write path revalidates the mapping on each folio lock
cycle. If the mapping has changed since the last lookup, the mapping
is marked stale and the operation must break back out into the
lookup path to update the mapping.

The zero range path will need to do something similar for correct
handling of dirty folios over unwritten ranges. Factor out the
validation callback into a new helper that marks the mapping stale
on failure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d46558990279..a9425170df72 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -756,6 +756,22 @@ static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
 	}
 }
 
+/*
+ * Check whether the current mapping of the iter is still valid. If not, mark
+ * the mapping stale.
+ */
+static inline bool iomap_revalidate(struct iomap_iter *iter)
+{
+	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
+	bool iomap_valid = true;
+
+	if (folio_ops && folio_ops->iomap_valid)
+		iomap_valid = folio_ops->iomap_valid(iter->inode, &iter->iomap);
+	if (!iomap_valid)
+		iter->iomap.flags |= IOMAP_F_STALE;
+	return iomap_valid;
+}
+
 static int iomap_write_begin_inline(const struct iomap_iter *iter,
 		struct folio *folio)
 {
@@ -768,7 +784,6 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
 static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio **foliop)
 {
-	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	struct folio *folio;
 	int status = 0;
@@ -797,15 +812,8 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	 * could do the wrong thing here (zero a page range incorrectly or fail
 	 * to zero) and corrupt data.
 	 */
-	if (folio_ops && folio_ops->iomap_valid) {
-		bool iomap_valid = folio_ops->iomap_valid(iter->inode,
-							 &iter->iomap);
-		if (!iomap_valid) {
-			iter->iomap.flags |= IOMAP_F_STALE;
-			status = 0;
-			goto out_unlock;
-		}
-	}
+	if (!iomap_revalidate(iter))
+		goto out_unlock;
 
 	if (pos + len > folio_pos(folio) + folio_size(folio))
 		len = folio_pos(folio) + folio_size(folio) - pos;
-- 
2.45.0



* [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents
  2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
  2024-07-18 13:02 ` [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback Brian Foster
  2024-07-18 13:02 ` [PATCH 2/4] iomap: refactor an iomap_revalidate() helper Brian Foster
@ 2024-07-18 13:02 ` Brian Foster
  2024-07-19  0:25   ` Dave Chinner
  2024-07-18 13:02 ` [PATCH 4/4] xfs: remove unnecessary flush of eof page from truncate Brian Foster
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2024-07-18 13:02 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

iomap_zero_range() does not correctly handle unwritten mappings with
dirty folios in pagecache. It skips unwritten mappings
unconditionally as if they were already zeroed, and thus potentially
exposes stale data from a previous write if affected folios are not
written back before the zero range.

Most callers already flush the target range of the zero for
unrelated, context specific reasons, so this problem is not
necessarily prevalent. The known outliers (in XFS) are file
extension via buffered write and truncate. The truncate path issues
a flush to work around this iomap problem, but the file extension
path does not and thus can expose stale data if current EOF is
unaligned and has a dirty folio over an unwritten block.

This patch implements a mechanism for making zero range pagecache
aware for filesystems that support mapping validation (i.e.
folio_ops->iomap_valid()). Instead of just skipping unwritten
mappings, scan the corresponding pagecache range for dirty or
writeback folios. If found, explicitly zero them via buffered write.
Clean or uncached subranges of unwritten mappings are skipped, as
before.

The quirk with a post-iomap_begin() pagecache scan is that it is
racy with writeback and reclaim activity. Even if the higher level
code holds the invalidate lock, nothing prevents a dirty folio from
being written back, cleaned, and even reclaimed sometime after
iomap_begin() returns an unwritten map but before a pagecache scan
might find the dirty folio. To handle this situation, we can rely on
the fact that writeback completion converts unwritten extents in the
fs before writeback state is cleared on the folio.

This means that a pagecache scan followed by a mapping revalidate of
an unwritten mapping should either find a dirty folio if it exists,
or detect a mapping change if a dirty folio did exist and had been
cleaned sometime before the scan but after the unwritten mapping was
found. If the revalidation succeeds then we can safely assume
nothing has been written back and skip the range. If the
revalidation fails then we must assume any offset in the range could
have been modified by writeback. In other words, we must be
particularly careful to make sure that any uncached range we intend
to skip does not make it into iter.processed until the mapping is
revalidated.

Altogether, this allows zero range to handle dirty folios over
unwritten extents without needing to flush and wait for writeback
completion.
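
The length arithmetic this implies can be checked with a tiny userspace
model (hypothetical helper name; byte offsets stand in for the iter's
pos/length, and fpos is the first dirty folio position reported by the
pagecache scan):

```c
#include <assert.h>

/*
 * Given an unwritten mapping at [pos, pos + length) and the position
 * fpos of the first dirty folio, return the length of the clean prefix
 * that zeroing may skip once the mapping revalidates.  fpos can be
 * smaller than pos when the dirty folio straddles the start of the
 * range, so clamp with max().
 */
static long long clean_prefix(long long pos, long long length, long long fpos)
{
	long long first = fpos > pos ? fpos : pos;	/* max(fpos, pos) */

	if (first >= pos + length)
		return length;		/* dirty folio beyond this mapping */
	return first - pos;
}
```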

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a9425170df72..ea1d396ef445 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1385,6 +1385,23 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
+/*
+ * Scan an unwritten mapping for dirty pagecache and return the length of the
+ * clean or uncached range leading up to it. This is the range that zeroing may
+ * skip once the mapping is validated.
+ */
+static inline loff_t
+iomap_zero_iter_unwritten(struct iomap_iter *iter, loff_t pos, loff_t length)
+{
+	struct address_space *mapping = iter->inode->i_mapping;
+	loff_t fpos = pos;
+
+	if (!filemap_range_has_writeback(mapping, &fpos, length))
+		return length;
+	/* fpos can be smaller if the start folio is dirty */
+	return max(fpos, pos) - pos;
+}
+
 static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -1393,16 +1410,46 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
 
 	do {
 		struct folio *folio;
 		int status;
 		size_t offset;
-		size_t bytes = min_t(u64, SIZE_MAX, length);
+		size_t bytes;
+		loff_t pending = 0;
 		bool ret;
 
+		/*
+		 * Determine the range of the unwritten mapping that is clean in
+		 * pagecache. We can skip this range, but only if the mapping is
+		 * still valid after the pagecache scan. This is because
+		 * writeback may have cleaned folios after the mapping lookup
+		 * but before we were able to find them here. If that occurs,
+		 * then the mapping must now be stale and we must reprocess the
+		 * range.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN) {
+			pending = iomap_zero_iter_unwritten(iter, pos, length);
+			if (pending == length) {
+				/* no dirty cache, revalidate and bounce as we're
+				 * either done or the mapping is stale */
+				if (iomap_revalidate(iter))
+					written += pending;
+				break;
+			}
+
+			/*
+			 * Found a dirty folio. Update pos/length to point at
+			 * it. written is updated only after the mapping is
+			 * revalidated by iomap_write_begin().
+			 */
+			pos += pending;
+			length -= pending;
+		}
+
+		bytes = min_t(u64, SIZE_MAX, length);
 		status = iomap_write_begin(iter, pos, bytes, &folio);
 		if (status)
 			return status;
@@ -1422,7 +1469,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 
 		pos += bytes;
 		length -= bytes;
-		written += bytes;
+		written += bytes + pending;
 	} while (length > 0);
 
 	if (did_zero)
-- 
2.45.0



* [PATCH 4/4] xfs: remove unnecessary flush of eof page from truncate
  2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
                   ` (2 preceding siblings ...)
  2024-07-18 13:02 ` [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents Brian Foster
@ 2024-07-18 13:02 ` Brian Foster
  2024-07-18 15:36 ` [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Josef Bacik
  2024-07-19  1:10 ` Dave Chinner
  5 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-18 13:02 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-xfs, linux-mm

The EOF flush was originally added to work around broken
iomap_zero_range() handling of dirty cache over unwritten extents.
Now that iomap handles this situation correctly, the flush can be
removed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_iops.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..eb0b7a88776d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -862,16 +862,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-- 
2.45.0



* Re: [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback
  2024-07-18 13:02 ` [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback Brian Foster
@ 2024-07-18 15:09   ` Matthew Wilcox
  2024-07-18 16:03     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2024-07-18 15:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 09:02:09AM -0400, Brian Foster wrote:
> @@ -655,6 +655,8 @@ bool filemap_range_has_writeback(struct address_space *mapping,
>  				folio_test_writeback(folio))
>  			break;
>  	}
> +	if (folio)
> +		*start_byte = folio_pos(folio);
>  	rcu_read_unlock();
>  	return folio != NULL;
>  }

Distressingly, this is unsafe.

We have no reference on the folio at this point (not one that matters,
anyway).  We have the rcu read lock, yes, but that doesn't protect enough
to make folio_pos() safe.

Since we don't have folio_get() here, the folio can be freed, sent back to
the page allocator, and then reallocated to literally any purpose.  As I'm
reviewing patch 1/4, I have no idea if this is just a hint and you can
survive it being completely wrong, or if this is going to cause problems.


* Re: [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
  2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
                   ` (3 preceding siblings ...)
  2024-07-18 13:02 ` [PATCH 4/4] xfs: remove unnecessary flush of eof page from truncate Brian Foster
@ 2024-07-18 15:36 ` Josef Bacik
  2024-07-18 16:02   ` Darrick J. Wong
  2024-07-19  1:10 ` Dave Chinner
  5 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2024-07-18 15:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a stab at fixing the iomap zero range problem where it doesn't
> correctly handle the case of an unwritten mapping with dirty pagecache.
> The gist is that we scan the mapping for dirty cache, zero any
> already-dirty folios via buffered writes as normal, but then otherwise
> skip clean ranges once we have a chance to validate those ranges against
> races with writeback or reclaim.
> 
> This is somewhat simplistic in terms of how it scans, but that is
> intentional based on the existing use cases for zero range. From poking
> around a bit, my current sense is that there isn't any user of zero
> range that would ever expect to see more than a single dirty folio. Most
> callers either straddle the EOF folio or flush in higher level code for
> presumably (fs) context specific reasons. If somebody has an example to
> the contrary, please let me know because I'd love to be able to use it
> for testing.
> 
> The caveat to this approach is that it only works for filesystems that
> implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> doesn't actually export unwritten mappings so I suspect this is not a
> problem. My understanding is that ext4 iomap support is in progress, but
> I've not yet dug into what that looks like (though I suspect similar to
> XFS). The concern is mainly that this leaves a landmine for fs that
> might grow support for unwritten mappings && zero range but not
> ->iomap_valid(). We'd likely never know zero range was broken for such
> fs until stale data exposure problems start to materialize.
> 
> I considered adding a fallback to just add a flush at the top of
> iomap_zero_range() so at least all future users would be correct, but I
> wanted to gate that on the absence of ->iomap_valid() and folio_ops
> isn't provided until iomap_begin() time. I suppose another way around
> that could be to add a flags param to iomap_zero_range() where the
> caller could explicitly opt out of a flush, but that's still kind of
> ugly. I dunno, maybe better than nothing..?
> 
> So IMO, this raises the question of whether this is just unnecessarily
> overcomplicated. The KISS principle implies that it would also be
> perfectly fine to do a conditional "flush and stale" in zero range
> whenever we see the combination of an unwritten mapping and dirty
> pagecache (the latter checked before or during ->iomap_begin()). That's
> simple to implement and AFAICT would work/perform adequately and
> generically for all filesystems. I have one or two prototypes of this
> sort of thing if folks want to see it as an alternative.

I think this is the better approach; otherwise there's another behavior that's
gated behind having a callback that other filesystems may not know about, and
thus they'd have a gap.

Additionally, do you have a test for this stale data exposure?  I think no matter
what the solution it would be good to have a test for this so that we can make
sure we're all doing the correct thing with zero range.  Thanks,

Josef


* Re: [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
  2024-07-18 15:36 ` [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Josef Bacik
@ 2024-07-18 16:02   ` Darrick J. Wong
  2024-07-18 16:40     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2024-07-18 16:02 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Brian Foster, linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 11:36:13AM -0400, Josef Bacik wrote:
> On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is a stab at fixing the iomap zero range problem where it doesn't
> > correctly handle the case of an unwritten mapping with dirty pagecache.
> > The gist is that we scan the mapping for dirty cache, zero any
> > already-dirty folios via buffered writes as normal, but then otherwise
> > skip clean ranges once we have a chance to validate those ranges against
> > races with writeback or reclaim.
> > 
> > This is somewhat simplistic in terms of how it scans, but that is
> > intentional based on the existing use cases for zero range. From poking
> > around a bit, my current sense is that there isn't any user of zero
> > range that would ever expect to see more than a single dirty folio. Most
> > callers either straddle the EOF folio or flush in higher level code for
> > presumably (fs) context specific reasons. If somebody has an example to
> > the contrary, please let me know because I'd love to be able to use it
> > for testing.
> > 
> > The caveat to this approach is that it only works for filesystems that
> > implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> > doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> > doesn't actually export unwritten mappings so I suspect this is not a
> > problem. My understanding is that ext4 iomap support is in progress, but
> > I've not yet dug into what that looks like (though I suspect similar to
> > XFS). The concern is mainly that this leaves a landmine for fs that
> > might grow support for unwritten mappings && zero range but not
> > ->iomap_valid(). We'd likely never know zero range was broken for such
> > fs until stale data exposure problems start to materialize.
> > 
> > I considered adding a fallback to just add a flush at the top of
> > iomap_zero_range() so at least all future users would be correct, but I
> > wanted to gate that on the absence of ->iomap_valid() and folio_ops
> > isn't provided until iomap_begin() time. I suppose another way around
> > that could be to add a flags param to iomap_zero_range() where the
> > caller could explicitly opt out of a flush, but that's still kind of
> > ugly. I dunno, maybe better than nothing..?

Or move ->iomap_valid to the iomap ops structure.  It's a mapping
predicate, and has nothing to do with folios.

> > So IMO, this raises the question of whether this is just unnecessarily
> > overcomplicated. The KISS principle implies that it would also be
> > perfectly fine to do a conditional "flush and stale" in zero range
> > whenever we see the combination of an unwritten mapping and dirty
> > pagecache (the latter checked before or during ->iomap_begin()). That's
> > simple to implement and AFAICT would work/perform adequately and
> > generically for all filesystems. I have one or two prototypes of this
> > sort of thing if folks want to see it as an alternative.

I wouldn't mind seeing such a prototype.  Start by hoisting the
filemap_write_and_wait_range call to iomap, then adjust it only to do
that if there's dirty pagecache + unwritten mappings?  Then get more
complicated from there, and we can decide if we want the increasing
levels of trickiness.

> I think this is the better approach, otherwise there's another behavior that's
> gated behind having a callback that other filesystems may not know about and
> thus have a gap.

<nod> I think filesystems currently only need to supply an ->iomap_valid
function for pagecache operations because those are the only ones where
we have to maintain consistency between something that isn't locked when
we get the mapping, and the mapping not being locked when we lock that
first thing.  I suspect they also only need to supply it if they support
unwritten extents.

From what I can tell, the rest (e.g. directio/FIEMAP) don't care because
callers get to manage concurrency.

*But* in general it makes sense to me that any iomap operation ought to
be able to revalidate a mapping at any time.

> Additionally do you have a test for this stale data exposure?  I think no matter
> what the solution it would be good to have a test for this so that we can make
> sure we're all doing the correct thing with zero range.  Thanks,

I was also curious about this.  IIRC we have some tests for the
validity checking itself, but I don't recall if there's a specific
regression test for the eofblock clearing.

--D

> Josef
> 


* Re: [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback
  2024-07-18 15:09   ` Matthew Wilcox
@ 2024-07-18 16:03     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-18 16:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 04:09:03PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 18, 2024 at 09:02:09AM -0400, Brian Foster wrote:
> > @@ -655,6 +655,8 @@ bool filemap_range_has_writeback(struct address_space *mapping,
> >  				folio_test_writeback(folio))
> >  			break;
> >  	}
> > +	if (folio)
> > +		*start_byte = folio_pos(folio);
> >  	rcu_read_unlock();
> >  	return folio != NULL;
> >  }
> 
> Distressingly, this is unsafe.
> 
> We have no reference on the folio at this point (not one that matters,
> anyway).  We have the rcu read lock, yes, but that doesn't protect enough
> to make folio_pos() safe.
> 
> Since we don't have folio_get() here, the folio can be freed, sent back to
> the page allocator, and then reallocated to literally any purpose.  As I'm
> reviewing patch 1/4, I have no idea if this is just a hint and you can
> survive it being completely wrong, or if this is going to cause problems.
> 

Ah, thanks. I was unsure about this when I hacked it up but then got
more focused on patch 3. I think for this implementation I'd want it to
be an accurate pos of the first dirty/wb folio. I think this could
possibly use filemap_range_has_writeback() (without patch 1) as more of
a hint/optimization, but that might involve doing the FGP_NOCREAT thing
that the previous variant of this prototype had, and I was trying to
avoid that.

Do you think it would be reasonable to create a variant of this function
that did the relevant bits from __filemap_get_folio():

        if (!folio_try_get(folio))
                goto repeat;

        if (unlikely(folio != xas_reload(&xas))) {
                folio_put(folio);
                goto repeat;
        }
	/* check dirty/wb etc. */

... in order to either return a correct pos or maybe even a reference to
the folio itself? iomap_zero_iter() wants the locked folio anyways, but
that might be too ugly to pass through the iomap_folio_ops thing in
current form.
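
For what it's worth, the try-get-and-reload pattern quoted above can be
modeled in userspace with C11 atomics (hypothetical structures; the real
code uses folio refcounts and an XArray reload, and the freer clears the
slot before the final put):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_int refcount;	/* 0 means the object is being freed */
};

/* Speculative reference: fails if the count already hit zero. */
static bool obj_try_get(struct obj *o)
{
	int ref = atomic_load(&o->refcount);

	while (ref > 0) {
		if (atomic_compare_exchange_weak(&o->refcount, &ref, ref + 1))
			return true;
	}
	return false;
}

/*
 * Look up a stable, referenced object: take the reference first, then
 * re-check that the slot still points at the same object.  A failed
 * try-get or a changed slot just means "retry the lookup".
 */
static struct obj *lookup_stable(struct obj **slot)
{
	struct obj *o;

	for (;;) {
		o = *slot;
		if (!o)
			return NULL;
		if (!obj_try_get(o))
			continue;	/* being freed: reload the slot */
		if (o == *slot)
			return o;	/* referenced and still current */
		atomic_fetch_sub(&o->refcount, 1);	/* slot changed: retry */
	}
}
```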

If that doesn't work, then I might chalk this up as another reason to
just do the flush thing I was rambling about in the cover letter...

Brian



* Re: [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
  2024-07-18 16:02   ` Darrick J. Wong
@ 2024-07-18 16:40     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-18 16:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Josef Bacik, linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 09:02:02AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 18, 2024 at 11:36:13AM -0400, Josef Bacik wrote:
> > On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > This is a stab at fixing the iomap zero range problem where it doesn't
> > > correctly handle the case of an unwritten mapping with dirty pagecache.
> > > The gist is that we scan the mapping for dirty cache, zero any
> > > already-dirty folios via buffered writes as normal, but then otherwise
> > > skip clean ranges once we have a chance to validate those ranges against
> > > races with writeback or reclaim.
> > > 
> > > This is somewhat simplistic in terms of how it scans, but that is
> > > intentional based on the existing use cases for zero range. From poking
> > > around a bit, my current sense is that there isn't any user of zero
> > > range that would ever expect to see more than a single dirty folio. Most
> > > callers either straddle the EOF folio or flush in higher level code for
> > > presumably (fs) context specific reasons. If somebody has an example to
> > > the contrary, please let me know because I'd love to be able to use it
> > > for testing.
> > > 
> > > The caveat to this approach is that it only works for filesystems that
> > > implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> > > doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> > > doesn't actually export unwritten mappings so I suspect this is not a
> > > problem. My understanding is that ext4 iomap support is in progress, but
> > > I've not yet dug into what that looks like (though I suspect similar to
> > > XFS). The concern is mainly that this leaves a landmine for fs that
> > > might grow support for unwritten mappings && zero range but not
> > > ->iomap_valid(). We'd likely never know zero range was broken for such
> > > fs until stale data exposure problems start to materialize.
> > > 
> > > I considered adding a fallback to just add a flush at the top of
> > > iomap_zero_range() so at least all future users would be correct, but I
> > > wanted to gate that on the absence of ->iomap_valid() and folio_ops
> > > isn't provided until iomap_begin() time. I suppose another way around
> > > that could be to add a flags param to iomap_zero_range() where the
> > > caller could explicitly opt out of a flush, but that's still kind of
> > > ugly. I dunno, maybe better than nothing..?
> 
> Or move ->iomap_valid to the iomap ops structure.  It's a mapping
> predicate, and has nothing to do with folios.
> 

Good idea. That might be an option.

> > > So IMO, this raises the question of whether this is just unnecessarily
> > > overcomplicated. The KISS principle implies that it would also be
> > > perfectly fine to do a conditional "flush and stale" in zero range
> > > whenever we see the combination of an unwritten mapping and dirty
> > > pagecache (the latter checked before or during ->iomap_begin()). That's
> > > simple to implement and AFAICT would work/perform adequately and
> > > generically for all filesystems. I have one or two prototypes of this
> > > sort of thing if folks want to see it as an alternative.
> 
> I wouldn't mind seeing such a prototype.  Start by hoisting the
> filemap_write_and_wait_range call to iomap, then adjust it only to do
> that if there's dirty pagecache + unwritten mappings?  Then get more
> complicated from there, and we can decide if we want the increasing
> levels of trickiness.
> 

Yeah, exactly. Start with an unconditional flush at the top of
iomap_zero_range() (which perhaps also serves as a -stable fix), then
replace it with an unconditional dirty cache check and a conditional
flush/stale down in zero_iter() (for the dirty+unwritten case). With
that, false positives from the cache check are less of an issue because
the only consequence is basically just a spurious flush. From there, the
revalidation approach could be an optional further optimization to avoid
the flush entirely, but we'll have to see if it's worth the complexity.
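
For reference, the step-one variant would be tiny; something like the
following at the top of iomap_zero_range() (untested sketch, just to
show the shape of it):

	/*
	 * Sketch of the step-one fix (untested): flush the target range
	 * up front so any dirty folios over unwritten extents are
	 * written back and converted before the mapping lookup. The
	 * existing iteration below is unchanged.
	 */
	int
	iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
			bool *did_zero, const struct iomap_ops *ops)
	{
		struct address_space *mapping = inode->i_mapping;
		int ret;

		ret = filemap_write_and_wait_range(mapping, pos,
				pos + len - 1);
		if (ret)
			return ret;

		/* ... existing iomap_iter() loop ... */
	}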

I have various experimental patches around that pretty much do the
conditional flush thing. I just have to form it into a presentable
series.

> > I think this is the better approach, otherwise there's another behavior that's
> > gated behind having a callback that other filesystems may not know about and
> > thus have a gap.
> 
> <nod> I think filesystems currently only need to supply an ->iomap_valid
> function for pagecache operations because those are the only ones where
> we have to maintain consistency between something that isn't locked when
> we get the mapping, and the mapping not being locked when we lock that
> first thing.  I suspect they also only need to supply it if they support
> unwritten extents.
> 
> From what I can tell, the rest (e.g. directio/FIEMAP) don't care because
> callers get to manage concurrency.
> 
> *But* in general it makes sense to me that any iomap operation ought to
> be able to revalidate a mapping at any time.
> 
> > Additionally do you have a test for this stale data exposure?  I think no matter
> > what the solution it would be good to have a test for this so that we can make
> > sure we're all doing the correct thing with zero range.  Thanks,
> 
> I was also curious about this. IIRC we have some tests for the
> validity checking itself, but I don't recall if there's a specific
> regression test for the eofblock clearing.
> 

Err.. yeah. I have some random test sequences around that reproduce some
of these issues. I'll form them into an fstest to go along with this.

Thank you both for the feedback.

Brian

> --D
> 
> > Josef
> > 
> 



* Re: [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents
  2024-07-18 13:02 ` [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents Brian Foster
@ 2024-07-19  0:25   ` Dave Chinner
  2024-07-19 15:17     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2024-07-19  0:25 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 09:02:11AM -0400, Brian Foster wrote:
> iomap_zero_range() does not correctly handle unwritten mappings with
> dirty folios in pagecache. It skips unwritten mappings
> unconditionally as if they were already zeroed, and thus potentially
> exposes stale data from a previous write if affected folios are not
> written back before the zero range.
> 
> Most callers already flush the target range of the zero for
> unrelated, context specific reasons, so this problem is not
> necessarily prevalent. The known outliers (in XFS) are file
> extension via buffered write and truncate. The truncate path issues
> a flush to work around this iomap problem, but the file extension
> path does not and thus can expose stale data if current EOF is
> unaligned and has a dirty folio over an unwritten block.
> 
> This patch implements a mechanism for making zero range pagecache
> aware for filesystems that support mapping validation (i.e.
> folio_ops->iomap_valid()). Instead of just skipping unwritten
> mappings, scan the corresponding pagecache range for dirty or
> writeback folios. If found, explicitly zero them via buffered write.
> Clean or uncached subranges of unwritten mappings are skipped, as
> before.
> 
> The quirk with a post-iomap_begin() pagecache scan is that it is
> racy with writeback and reclaim activity. Even if the higher level
> code holds the invalidate lock, nothing prevents a dirty folio from
> being written back, cleaned, and even reclaimed sometime after
> iomap_begin() returns an unwritten map but before a pagecache scan
> might find the dirty folio. To handle this situation, we can rely on
> the fact that writeback completion converts unwritten extents in the
> fs before writeback state is cleared on the folio.
> 
> This means that a pagecache scan followed by a mapping revalidate of
> an unwritten mapping should either find a dirty folio if it exists,
> or detect a mapping change if a dirty folio did exist and had been
> cleaned sometime before the scan but after the unwritten mapping was
> found. If the revalidation succeeds then we can safely assume
> nothing has been written back and skip the range. If the
> revalidation fails then we must assume any offset in the range could
> have been modified by writeback. In other words, we must be
> particularly careful to make sure that any uncached range we intend
> to skip does not make it into iter.processed until the mapping is
> revalidated.
> 
> Altogether, this allows zero range to handle dirty folios over
> unwritten extents without needing to flush and wait for writeback
> completion.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index a9425170df72..ea1d396ef445 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1385,6 +1385,23 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> +/*
> + * Scan an unwritten mapping for dirty pagecache and return the length of the
> + * clean or uncached range leading up to it. This is the range that zeroing may
> + * skip once the mapping is validated.
> + */
> +static inline loff_t
> +iomap_zero_iter_unwritten(struct iomap_iter *iter, loff_t pos, loff_t length)
> +{
> +	struct address_space *mapping = iter->inode->i_mapping;
> +	loff_t fpos = pos;
> +
> +	if (!filemap_range_has_writeback(mapping, &fpos, length))
> +		return length;
> +	/* fpos can be smaller if the start folio is dirty */
> +	return max(fpos, pos) - pos;

I'm not sure this is safe. filemap_range_has_writeback() doesn't do
checks for invalidation races or that the folio is actually valid.
It also treats locked folios as dirty and a locked folio isn't
necessarily dirty. IOWs, this check assumes that we'll do all these
checks during the actual writeback operation that would follow this
check and so skip anything that might have given a false positive
here.

iomap_write_begin() doesn't do those sorts of checks. If there's no
folio in the cache, it will simply instantiate a new one and dirty
it. If there's an existing folio, it will simply dirty it.

Hence I don't think an "is there a folio in this range that is a
potential writeback candidate" check is correct here. I think we
need to be more robust in determining if a cached folio in the range
exists and needs zeroing. Ideas on that to follow...

>  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  {
>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> @@ -1393,16 +1410,46 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed?  we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if (srcmap->type == IOMAP_HOLE)
>  		return length;
>  
>  	do {
>  		struct folio *folio;
>  		int status;
>  		size_t offset;
> -		size_t bytes = min_t(u64, SIZE_MAX, length);
> +		size_t bytes;
> +		loff_t pending = 0;
>  		bool ret;
>  
> +		/*
> +		 * Determine the range of the unwritten mapping that is clean in
> +		 * pagecache. We can skip this range, but only if the mapping is
> +		 * still valid after the pagecache scan. This is because
> +		 * writeback may have cleaned folios after the mapping lookup
> +		 * but before we were able to find them here. If that occurs,
> +		 * then the mapping must now be stale and we must reprocess the
> +		 * range.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN) {
> +			pending = iomap_zero_iter_unwritten(iter, pos, length);
> +			if (pending == length) {
> +				/* no dirty cache, revalidate and bounce as we're
> +				 * either done or the mapping is stale */
> +				if (iomap_revalidate(iter))
> +					written += pending;

Ok, this isn't really how the revalidation was supposed to be used
(i.e. it's not stabilising the data and page cache state first),
but it looks like it works for this situation. We can use this. :)

> +				break;
> +			}
> +
> +			/*
> +			 * Found a dirty folio. Update pos/length to point at
> +			 * it. written is updated only after the mapping is
> +			 * revalidated by iomap_write_begin().
> +			 */
> +			pos += pending;
> +			length -= pending;
> +		}
> +
> +		bytes = min_t(u64, SIZE_MAX, length);
>  		status = iomap_write_begin(iter, pos, bytes, &folio);
>  		if (status)
>  			return status;

We don't hold any references to the page cache between the
iomap_zero_iter_unwritten() and then the actual folio lookup in
iomap_write_begin() where we reference and lock the folio at the
given offset. e.g. we get a "dirty" hit because of a locked folio,
and that folio is clean and contains zeroes (e.g. mmap read,
readahead, etc). We now write zeroes to that folio and dirty it
when we should actually be skipping it and leaving the range as
unwritten.

In previous patches that fixed this zeroing issue, this wasn't a
problem because it used the existing iomap page cache lookup
mechanisms from iomap_write_begin() to get referenced, locked folios
over the zeroing range. Instead of skipping potential page cache
holes, it prevented page cache instantiation over page cache holes
from occurring when zeroing unwritten extents and that skipped page
cache holes naturally.

https://lore.kernel.org/linux-xfs/20240529095206.2568162-2-yi.zhang@huaweicloud.com/

This means the iteration would skip holes but still safely
revalidate the iomap once it found and locked a folio in the given
range. It could also then check the folio is dirty to determine if
zeroing was necessary.  Yes, this means it iterated holes in the
range PAGE_SIZE by PAGE_SIZE to do lookups, so it was inefficient.

However, we could still use this filemap_range_has_writeback()
optimisation to skip unwritten extents that don't have any cached
pages over them completely, but I don't think it is really safe to
use it during the iteration to discover regions that don't need
zeroing. i.e. the fast path is this:

	if (srcmap->type == IOMAP_HOLE)
		return length;
	if (srcmap->type == IOMAP_UNWRITTEN &&
	    !filemap_range_has_writeback(mapping, pos, length)) {
		if (!iomap_revalidate(iter))
			return 0; /* stale mapping */
		return length;
	}

And the slow path does something similar to the above patch.
However, I'm still not sure that filemap_range_has_writeback() is
the right set of checks to use here.

I just thought of another option - after thinking about how you've
modified filemap_range_has_writeback() to increment the iomap
iterator position to skip empty ranges, I think we could actually
drive that inwards to the page cache lookup. We could use
filemap_get_folios() instead of __filemap_get_folio() for
unwritten extent zeroing iteration. This would allow the page cache
lookup to return the first folio in the range as a referenced folio
which we can then lock and validate similar to __filemap_get_folio().

We can then increment the iterator position based on the folio
position, the iomap_write_begin() code will do the revalidation of
the iomap, and the zeroing code can then determine if zeroing is
needed based on whether the folio is dirty/writeback or not. As this
all happens under the folio lock, it is largely safe from both page
cache data and state and extent state changes racing with the
zeroing operation.
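
i.e. something along these lines (handwavy and untested; the iomap
revalidation and the zeroing call itself are elided):

	struct folio_batch fbatch;
	pgoff_t index = pos >> PAGE_SHIFT;
	pgoff_t end = (pos + length - 1) >> PAGE_SHIFT;
	unsigned int i;

	folio_batch_init(&fbatch);
	while (filemap_get_folios(mapping, &index, end, &fbatch)) {
		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			struct folio *folio = fbatch.folios[i];

			folio_lock(folio);
			/* revalidate the iomap here, under the folio lock */
			if (folio_test_dirty(folio) ||
			    folio_test_writeback(folio)) {
				/* zero the portion of the folio in range */
			}
			folio_unlock(folio);
		}
		folio_batch_release(&fbatch);
		cond_resched();
	}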

Thoughts?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
  2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
                   ` (4 preceding siblings ...)
  2024-07-18 15:36 ` [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Josef Bacik
@ 2024-07-19  1:10 ` Dave Chinner
  2024-07-19 15:22   ` Brian Foster
  5 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2024-07-19  1:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a stab at fixing the iomap zero range problem where it doesn't
> correctly handle the case of an unwritten mapping with dirty pagecache.
> The gist is that we scan the mapping for dirty cache, zero any
> already-dirty folios via buffered writes as normal, but then otherwise
> skip clean ranges once we have a chance to validate those ranges against
> races with writeback or reclaim.
> 
> This is somewhat simplistic in terms of how it scans, but that is
> intentional based on the existing use cases for zero range. From poking
> around a bit, my current sense is that there isn't any user of zero
> range that would ever expect to see more than a single dirty folio.

The current code generally only zeroes a single filesystem block or
less because that's all we need to zero for partial writes.  This is
not going to be true for very much longer with XFS forcealign
functionality, and I suspect it's not true right now for large rt
extent sizes when doing sub-extent writes. In these cases, we are
going to have to zero multiple filesystem blocks during truncate,
hole punch, unaligned writes, etc.

So even if we don't do this now, I think this is something we will
almost certainly be doing in the next kernel release or two.

> Most
> callers either straddle the EOF folio or flush in higher level code for
> presumably (fs) context specific reasons. If somebody has an example to
> the contrary, please let me know because I'd love to be able to use it
> for testing.

Check the xfs_inode_has_bigrtalloc() and xfs_inode_alloc_unitsize()
cases. These are currently being worked on and expanded and factored
so eventually these cases will all fall under
xfs_inode_alloc_unitsize().

> The caveat to this approach is that it only works for filesystems that
> implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> doesn't actually export unwritten mappings so I suspect this is not a
> problem. My understanding is that ext4 iomap support is in progress, but
> I've not yet dug into what that looks like (though I suspect similar to
> XFS). The concern is mainly that this leaves a landmine for fs that
> might grow support for unwritten mappings && zero range but not
> ->iomap_valid(). We'd likely never know zero range was broken for such
> fs until stale data exposure problems start to materialize.
> 
> I considered adding a fallback to just add a flush at the top of
> iomap_zero_range() so at least all future users would be correct, but I
> wanted to gate that on the absence of ->iomap_valid() and folio_ops
> isn't provided until iomap_begin() time. I suppose another way around
> that could be to add a flags param to iomap_zero_range() where the
> caller could explicitly opt out of a flush, but that's still kind of
> ugly. I dunno, maybe better than nothing..?

We want to avoid the flush in this case if we can - what XFS does is
a workaround for iomap not handling dirty data over unwritten
extents. That first flush causes performance issues with certain
truncate heavy workloads, so we really want to avoid it in the
generic code if we can.

> So IMO, this raises the question of whether this is just unnecessarily
> overcomplicated. The KISS principle implies that it would also be
> perfectly fine to do a conditional "flush and stale" in zero range
> whenever we see the combination of an unwritten mapping and dirty
> pagecache (the latter checked before or during ->iomap_begin()). That's
> simple to implement and AFAICT would work/perform adequately and
> generically for all filesystems. I have one or two prototypes of this
> sort of thing if folks want to see it as an alternative.

If we are going to zero the range, and the range is already
unwritten, then why do we need to flush the data in the cache to
make it clean and written before running the zeroing? Why not just
invalidate the entire cache over the unwritten region, returning it
all to zeroes (i.e. it is unwritten!) without doing any IO?

Yes, if some of the range is under writeback, the invalidation will
have to wait for that to complete - invalidate_inode_pages2_range()
does this for us - but after the invalidation those regions will now
be written and iomap revalidation after page cache invalidation will
detect this.

So maybe the solution is simply to invalidate the cache over
unwritten extents and then revalidate the iomap? If the iomap is
still valid, then we can skip the unwritten extent completely. If
the invalidation returned -EBUSY or the iomap is stale, then remap
it and try again?
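
i.e. roughly this (untested; iomap_revalidate() assumed from this
series, everything else is existing kernel API):

	if (srcmap->type == IOMAP_UNWRITTEN) {
		error = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT,
				(pos + length - 1) >> PAGE_SHIFT);
		if (error || !iomap_revalidate(iter))
			return 0;	/* -EBUSY or stale: remap, retry */
		return length;		/* fully unwritten again, done */
	}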

If we don't have an iomap validation function, then we could check
filemap_range_needs_writeback() before calling
invalidate_inode_pages2_range() as that will tell us if there were
folios that might have been under writeback during the invalidation.
In that case, we can treat "needs writeback" the same as a failed
iomap revalidation.

So what am I missing? What does the flush actually accomplish that
simply calling invalidate_inode_pages2_range() to throw the data we
need to zero away doesn't?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents
  2024-07-19  0:25   ` Dave Chinner
@ 2024-07-19 15:17     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-19 15:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Fri, Jul 19, 2024 at 10:25:20AM +1000, Dave Chinner wrote:
> On Thu, Jul 18, 2024 at 09:02:11AM -0400, Brian Foster wrote:
> > iomap_zero_range() does not correctly handle unwritten mappings with
> > dirty folios in pagecache. It skips unwritten mappings
> > unconditionally as if they were already zeroed, and thus potentially
> > exposes stale data from a previous write if affected folios are not
> > written back before the zero range.
> > 
> > Most callers already flush the target range of the zero for
> > unrelated, context specific reasons, so this problem is not
> > necessarily prevalent. The known outliers (in XFS) are file
> > extension via buffered write and truncate. The truncate path issues
> > a flush to work around this iomap problem, but the file extension
> > path does not and thus can expose stale data if current EOF is
> > unaligned and has a dirty folio over an unwritten block.
> > 
> > This patch implements a mechanism for making zero range pagecache
> > aware for filesystems that support mapping validation (i.e.
> > folio_ops->iomap_valid()). Instead of just skipping unwritten
> > mappings, scan the corresponding pagecache range for dirty or
> > writeback folios. If found, explicitly zero them via buffered write.
> > Clean or uncached subranges of unwritten mappings are skipped, as
> > before.
> > 
> > The quirk with a post-iomap_begin() pagecache scan is that it is
> > racy with writeback and reclaim activity. Even if the higher level
> > code holds the invalidate lock, nothing prevents a dirty folio from
> > being written back, cleaned, and even reclaimed sometime after
> > iomap_begin() returns an unwritten map but before a pagecache scan
> > might find the dirty folio. To handle this situation, we can rely on
> > the fact that writeback completion converts unwritten extents in the
> > fs before writeback state is cleared on the folio.
> > 
> > This means that a pagecache scan followed by a mapping revalidate of
> > an unwritten mapping should either find a dirty folio if it exists,
> > or detect a mapping change if a dirty folio did exist and had been
> > cleaned sometime before the scan but after the unwritten mapping was
> > found. If the revalidation succeeds then we can safely assume
> > nothing has been written back and skip the range. If the
> > revalidation fails then we must assume any offset in the range could
> > have been modified by writeback. In other words, we must be
> > particularly careful to make sure that any uncached range we intend
> > to skip does not make it into iter.processed until the mapping is
> > revalidated.
> > 
> > Altogether, this allows zero range to handle dirty folios over
> > unwritten extents without needing to flush and wait for writeback
> > completion.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 50 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index a9425170df72..ea1d396ef445 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1385,6 +1385,23 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> >  }
> >  EXPORT_SYMBOL_GPL(iomap_file_unshare);
> >  
> > +/*
> > + * Scan an unwritten mapping for dirty pagecache and return the length of the
> > + * clean or uncached range leading up to it. This is the range that zeroing may
> > + * skip once the mapping is validated.
> > + */
> > +static inline loff_t
> > +iomap_zero_iter_unwritten(struct iomap_iter *iter, loff_t pos, loff_t length)
> > +{
> > +	struct address_space *mapping = iter->inode->i_mapping;
> > +	loff_t fpos = pos;
> > +
> > +	if (!filemap_range_has_writeback(mapping, &fpos, length))
> > +		return length;
> > +	/* fpos can be smaller if the start folio is dirty */
> > +	return max(fpos, pos) - pos;
> 
> I'm not sure this is safe. filemap_range_has_writeback() doesn't do
> checks for invalidation races or that the folio is actually valid.
> It also treats locked folios as dirty and a locked folio isn't
> necessarily dirty. IOWs, this check assumes that we'll do all these
> checks during the actual writeback operation that would follow this
> check and so skip anything that might have given a false positive
> here.
> 

I'm aware... I probably should have documented this somewhere, but this
prototype implies that false positives are acceptable. I'm not worried
about the occasional spurious unwritten block conversion, at least for a
first variant of a fix given the constraints mentioned in the cover
letter. If we come up with ideas to iterate and improve on that to
eliminate false positives, then great.

I thought about creating a separate lookup variant for this use case
without the lock check, but I'm still not totally convinced that isn't
actually racy and worth the tradeoff. That said, Willy has already
pointed out why patch 1 is wrong so if we end up recreating something
more bespoke for this purpose then we can probably make it work more
predictably.

> iomap_write_begin() doesn't do those sorts of checks. If there's no
> folio in the cache, it will simply instantiate a new one and dirty
> it. If there's an existing folio, it will simply dirty it.
> 
> Hence I don't think an "is there a folio in this range that is a
> potential writeback candidate" check is correct here. I think we
> need to be more robust in determining if a cached folio in the range
> exists and needs zeroing. Ideas on that to follow...
> 
> >  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  {
> >  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > @@ -1393,16 +1410,46 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  	loff_t written = 0;
> >  
> >  	/* already zeroed?  we're done. */
> > -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> > +	if (srcmap->type == IOMAP_HOLE)
> >  		return length;
> >  
> >  	do {
> >  		struct folio *folio;
> >  		int status;
> >  		size_t offset;
> > -		size_t bytes = min_t(u64, SIZE_MAX, length);
> > +		size_t bytes;
> > +		loff_t pending = 0;
> >  		bool ret;
> >  
> > +		/*
> > +		 * Determine the range of the unwritten mapping that is clean in
> > +		 * pagecache. We can skip this range, but only if the mapping is
> > +		 * still valid after the pagecache scan. This is because
> > +		 * writeback may have cleaned folios after the mapping lookup
> > +		 * but before we were able to find them here. If that occurs,
> > +		 * then the mapping must now be stale and we must reprocess the
> > +		 * range.
> > +		 */
> > +		if (srcmap->type == IOMAP_UNWRITTEN) {
> > +			pending = iomap_zero_iter_unwritten(iter, pos, length);
> > +			if (pending == length) {
> > +				/* no dirty cache, revalidate and bounce as we're
> > +				 * either done or the mapping is stale */
> > +				if (iomap_revalidate(iter))
> > +					written += pending;
> 
> Ok, this isn't really how the revalidation was supposed to be used
> (i.e. it's not stabilising the data and page cache state first),
> but it looks like it works for this situation. We can use this. :)
> 

Yeah.. the main concern I had with this was basically whether we need
some kind of barrier or something in the case where there is no folio to
lock.

At the end of the day we need to be able to handle the case where the
last part of a range is uncached and so no folio exists to lock, yet we
need to confirm that we haven't raced with writeback and reclaim before
the zero range operation can complete.

> > +				break;
> > +			}
> > +
> > +			/*
> > +			 * Found a dirty folio. Update pos/length to point at
> > +			 * it. written is updated only after the mapping is
> > +			 * revalidated by iomap_write_begin().
> > +			 */
> > +			pos += pending;
> > +			length -= pending;
> > +		}
> > +
> > +		bytes = min_t(u64, SIZE_MAX, length);
> >  		status = iomap_write_begin(iter, pos, bytes, &folio);
> >  		if (status)
> >  			return status;
> 
> We don't hold any references to the page cache between the
> iomap_zero_iter_unwritten() and then the actual folio lookup in
> iomap_write_begin() where we reference and lock the folio at the
> given offset. e.g. we get a "dirty" hit because of a locked folio,
> and that folio is clean and contains zeroes (e.g. mmap read,
> readahead, etc). We now write zeroes to that folio and dirty it
> when we should actually be skipping it and leaving the range as
> unwritten.
> 
> In previous patches that fixed this zeroing issue, this wasn't a
> problem because it used the existing iomap page cache lookup
> mechanisms from iomap_write_begin() to get referenced, locked folios
> over the zeroing range. Instead of skipping potential page cache
> holes, it prevented page cache instantiation over page cache holes
> from occurring when zeroing unwritten extents and that skipped page
> cache holes naturally.
> 
> https://lore.kernel.org/linux-xfs/20240529095206.2568162-2-yi.zhang@huaweicloud.com/
> 
> This means the iteration would skip holes but still safely
> revalidate the iomap once it found and locked a folio in the given
> range. It could also then check the folio is dirty to determine if
> zeroing was necessary.  Yes, this means it iterated holes in the
> range PAGE_SIZE by PAGE_SIZE to do lookups, so it was inefficient.
> 
> However, we could still use this filemap_range_has_writeback()
> optimisation to skip unwritten extents that don't have any cached
> pages over them completely, but I don't think it is really safe to
> use it during the iteration to discover regions that don't need
> zeroing. i.e. the fast path is this:
> 
> 	if (srcmap->type == IOMAP_HOLE)
> 		return length;
> 	if (srcmap->type == IOMAP_UNWRITTEN &&
> 	    !filemap_range_has_writeback(mapping, pos, length)) {
> 		if (!iomap_revalidate(iter))
> 			return 0; /* stale mapping */
> 		return length;
> 	}
> 

We might be able to do something like this, but this is really more of a
behavioral tradeoff than a functional issue. With the above, if you have
a large range with a dirty page at the start, you'll spend a significant
amount of time chugging through the rest of the range checking each
offset for folios (and then still ultimately may have to revalidate an
uncached range).

Last I played around with that it wasn't hard to reproduce a scan that
took minutes to complete. See my replies in the thread you've linked
above for an example (as well as another prototype variant for
iomap_truncate_page() that more explicitly implements the tradeoff of
writing over unwritten blocks).

> And the slow path does something similar to the above patch.
> However, I'm still not sure that filemap_range_has_writeback() is
> the right set of checks to use here.
> 

Re: above, the _has_writeback() thing was more of an attempt to keep
things simple and reuse existing code. I suspect we can roll our own
lookup mechanism if need be.

> I just thought of another option - after thinking about how you've
> modified filemap_range_has_writeback() to increment the iomap
> iterator position to skip empty ranges, I think we could actually
> drive that inwards to the page cache lookup. We could use
> filemap_get_folios() instead of __filemap_get_folio() for
> unwritten extent zeroing iteration. This would allow the page cache
> lookup to return the first folio in the range as a referenced folio
> which we can then lock and validate similar to __filemap_get_folio().
> 
> We can then increment the iterator position based on the folio
> position, the iomap_write_begin() code will do the revalidation of
> the iomap, and the zeroing code can then determine if zeroing is
> needed based on whether the folio is dirty/writeback or not. As this
> all happens under the folio lock, it is largely safe from both page
> cache data and state and extent state changes racing with the
> zeroing operation.
> 

I thought about something kind of similar.. do a batched folio lookup
either in iomap or in the fs under locks (via an iomap helper) and feed
that into the iomap operation path in place of the internal folio
lookups, etc. Where that kind of gets annoying is trimming the mappings
and whatnot for situations where the batch might be full. I don't
recall ruling that out, but it didn't strike me as worth the complexity
at the time. It occurs to me now there might be simpler ways around
that, such as just looping out on a full batch.

It sounds like what you describe above is to push that lookup down
into the existing iomap folio lookup path, and then essentially drive
the
range processing loop via walking the mapping xarray (via
iomap_write_begin() -> filemap_get_folios()) rather than it being
offset/length driven. If I follow that correctly, that sounds like a
potentially elegant solution to avoid the per-offset lookup walk and the
has_wb() scan.

I was _kind of_ hoping to be able to just lookup the next dirty folio,
but last I checked we couldn't easily lookup the next (dirty ||
writeback) folio in the same lookup operation. But TBH the more I think
about this, the batched lookup thing where we just check each present
folio directly for (dirty||wb) might be perfectly good enough. I need to
stare at the code and play around with it a bit. Thanks for the idea.

Brian

> Thoughts?
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
  2024-07-19  1:10 ` Dave Chinner
@ 2024-07-19 15:22   ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2024-07-19 15:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, linux-xfs, linux-mm

On Fri, Jul 19, 2024 at 11:10:28AM +1000, Dave Chinner wrote:
> On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is a stab at fixing the iomap zero range problem where it doesn't
> > correctly handle the case of an unwritten mapping with dirty pagecache.
> > The gist is that we scan the mapping for dirty cache, zero any
> > already-dirty folios via buffered writes as normal, but then otherwise
> > skip clean ranges once we have a chance to validate those ranges against
> > races with writeback or reclaim.
> > 
> > This is somewhat simplistic in terms of how it scans, but that is
> > intentional based on the existing use cases for zero range. From poking
> > around a bit, my current sense is that there isn't any user of zero
> > range that would ever expect to see more than a single dirty folio.
> 
> The current code generally only zeroes a single filesystem block or
> less because that's all we need to zero for partial writes.  This is
> not going to be true for very much longer with XFS forcealign
> functionality, and I suspect it's not true right now for large rt
> extent sizes when doing sub-extent writes. In these cases, we are
> going to have to zero multiple filesystem blocks during truncate,
> hole punch, unaligned writes, etc.
> 
> So even if we don't do this now, I think this is something we will
> almost certainly be doing in the next kernel release or two.
> 
> > Most
> > callers either straddle the EOF folio or flush in higher level code for
> > presumably (fs) context specific reasons. If somebody has an example to
> > the contrary, please let me know because I'd love to be able to use it
> > for testing.
> 
> Check the xfs_inode_has_bigrtalloc() and xfs_inode_alloc_unitsize()
> cases. These are currently being worked on and expanded and factored
> so eventually these cases will all fall under
> xfs_inode_alloc_unitsize().
> 
> > The caveat to this approach is that it only works for filesystems that
> > implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> > doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> > doesn't actually export unwritten mappings so I suspect this is not a
> > problem. My understanding is that ext4 iomap support is in progress, but
> > I've not yet dug into what that looks like (though I suspect similar to
> > XFS). The concern is mainly that this leaves a landmine for fs that
> > might grow support for unwritten mappings && zero range but not
> > ->iomap_valid(). We'd likely never know zero range was broken for such
> > fs until stale data exposure problems start to materialize.
> > 
> > I considered adding a fallback to just add a flush at the top of
> > iomap_zero_range() so at least all future users would be correct, but I
> > wanted to gate that on the absence of ->iomap_valid() and folio_ops
> > isn't provided until iomap_begin() time. I suppose another way around
> > that could be to add a flags param to iomap_zero_range() where the
> > caller could explicitly opt out of a flush, but that's still kind of
> > ugly. I dunno, maybe better than nothing..?
> 
> We want to avoid the flush in this case if we can - what XFS does is
> a workaround for iomap not handling dirty data over unwritten
> extents. That first flush causes performance issues with certain
> truncate heavy workloads, so we really want to avoid it in the
> generic code if we can.
> 

Sort of.. the only complaint I've heard about this was due to reliance
on a userspace program with a subtly dumb and repetitive truncate
workload. We worked around this problem by fixing the userspace tool and
I've not heard a complaint since.

Even without that userspace fix, a conditional flush in the kernel would
have been perfectly suitable for the same workload (as in, pretty much
unnoticeable). So the broader point here is just that this isn't so
black and white that flushing at all is necessarily a problem.

> > So IMO, this raises the question of whether this is just unnecessarily
> > overcomplicated. The KISS principle implies that it would also be
> > perfectly fine to do a conditional "flush and stale" in zero range
> > whenever we see the combination of an unwritten mapping and dirty
> > pagecache (the latter checked before or during ->iomap_begin()). That's
> > simple to implement and AFAICT would work/perform adequately and
> > generically for all filesystems. I have one or two prototypes of this
> > sort of thing if folks want to see it as an alternative.
> 
> If we are going to zero the range, and the range is already
> unwritten, then why do we need to flush the data in the cache to
> make it clean and written before running the zeroing? Why not just
> invalidate the entire cache over the unwritten region and so return it
> all to containing zeroes (i.e. is unwritten!) without doing any IO.
> 
> Yes, if some of the range is under writeback, the invalidation will
> have to wait for that to complete - invalidate_inode_pages2_range()
> does this for us - but after the invalidation those regions will now
> be written and iomap revalidation after page cache invalidation will
> detect this.
> 
> So maybe the solution is simply to invalidate the cache over
> unwritten extents and then revalidate the iomap? If the iomap is
> still valid, then we can skip the unwritten extent completely. If
> the invalidation returned -EBUSY or the iomap is stale, then remap
> it and try again?
> 
> If we don't have an iomap validation function, then we could check
> filemap_range_needs_writeback() before calling
> invalidate_inode_pages2_range() as that will tell us if there were
> folios that might have been under writeback during the invalidation.
> In that case, we can treat "needs writeback" the same as a failed
> iomap revalidation.
> 
> So what am I missing? What does the flush actually accomplish that
> simply calling invalidate_inode_pages2_range() to throw the data we
> need to zero away doesn't?
> 

Ugh.. when I look at invalidate I see various additional corner case
complexities to think about, for pretty much no additional value over a
conditional flush, especially when you start getting into some of the
writeback complexities you've already noted above. Even if you could
make it work technically (which I'm not sure of), it looks like it would
be increasingly more difficult to properly test and maintain. I don't
really see much reason to go down that path without explicit
justification and perhaps some proving out.
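
To be clear on what I'm evaluating, my reading of the proposal is
roughly this (pseudocode, not compile tested; iomap_revalidate() being
the helper from patch 2):

```c
	/*
	 * Throw away cached data over the unwritten range rather than
	 * writing it back; if invalidation hit writeback (-EBUSY) or
	 * the mapping changed underneath us, remap and retry.
	 */
	if (srcmap->type == IOMAP_UNWRITTEN) {
		ret = invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT,
				(pos + length - 1) >> PAGE_SHIFT);
		if (ret == -EBUSY || !iomap_revalidate(iter))
			return 0;	/* stale, remap and retry */
		return length;		/* range now reads as zeroes */
	}
```

... and it's the interactions hidden behind those two failure checks
(racing writeback, mmap re-dirtying, partial invalidation) that I'm
wary of.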

I think the right approach to this problem is precisely what was
discussed up thread with Josef and Darrick. Do the simple and
generically correct thing by lifting the unconditional flush from XFS,
optimize to make the flush conditional to dirty+unwritten while still
being fs generic, and then finally optimize away the flush entirely for
filesystems that provide the proper revalidation support like XFS.
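
For the middle step, the conditional flush could be as simple as
something like this at the top of the zeroing path (pseudocode, not
compile tested):

```c
	/*
	 * Only flush if there is actually dirty or writeback cache
	 * over an unwritten mapping; clean/uncached unwritten ranges
	 * need no zeroing at all. The mapping may be stale after the
	 * flush, so it must be remapped before zeroing proceeds.
	 */
	if (srcmap->type == IOMAP_UNWRITTEN &&
	    filemap_range_needs_writeback(mapping, pos,
					  pos + length - 1)) {
		ret = filemap_write_and_wait_range(mapping, pos,
						   pos + length - 1);
		if (ret)
			return ret;
	}
```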

A benefit of that is we can start with a tested and verifiable functional
base to iterate from and fall back on. I think the mapping iteration
thing from the patch 3 discussion might turn out elegant enough that
maybe we'll come up with a generic non-flushing solution based on that
(I have some vague thoughts that require investigation) once we have
some code to poke around at.

Brian

P.S., I'm off for a week or so after today. I'll think more about the
mapping iteration idea and then try to put something together once I'm
back.

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



end of thread [~2024-07-19 15:22 UTC | newest]

Thread overview: 14+ messages
2024-07-18 13:02 [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Brian Foster
2024-07-18 13:02 ` [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback Brian Foster
2024-07-18 15:09   ` Matthew Wilcox
2024-07-18 16:03     ` Brian Foster
2024-07-18 13:02 ` [PATCH 2/4] iomap: refactor an iomap_revalidate() helper Brian Foster
2024-07-18 13:02 ` [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents Brian Foster
2024-07-19  0:25   ` Dave Chinner
2024-07-19 15:17     ` Brian Foster
2024-07-18 13:02 ` [PATCH 4/4] xfs: remove unnecessary flush of eof page from truncate Brian Foster
2024-07-18 15:36 ` [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Josef Bacik
2024-07-18 16:02   ` Darrick J. Wong
2024-07-18 16:40     ` Brian Foster
2024-07-19  1:10 ` Dave Chinner
2024-07-19 15:22   ` Brian Foster
