* [PATCH v3 01/25] block: Add bio_add_folio()
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Jens Axboe, Christoph Hellwig
This is a thin wrapper around bio_add_page(). The main advantage here
is the documentation that folios larger than 2GiB are not supported.
It's not currently possible to allocate folios that large, but if it
ever becomes possible, this function will fail gracefully instead of
doing I/O to the wrong bytes.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
block/bio.c | 22 ++++++++++++++++++++++
include/linux/bio.h | 3 ++-
2 files changed, 24 insertions(+), 1 deletion(-)
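As a usage illustration only (not part of this patch), a caller that previously
looped bio_add_page() over each page of a compound page can now add the whole
folio in one call. The function below is a hypothetical sketch, not an existing
kernel caller:

#include <linux/bio.h>

/*
 * Hypothetical sketch: add an entire folio to a bio.  A false return means
 * the bio was already full (or the folio exceeded the documented limit), so
 * the caller submits this bio and retries with a new one.
 */
static bool example_add_folio(struct bio *bio, struct folio *folio)
{
        if (!bio_add_folio(bio, folio, folio_size(folio), 0)) {
                submit_bio(bio);        /* bio full: send what we have */
                return false;           /* caller allocates a new bio and retries */
        }
        return true;
}

Unlike a per-page loop, the single call either adds the whole folio or leaves
the bio untouched, so there is no partially-added folio to unwind.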
diff --git a/block/bio.c b/block/bio.c
index 15ab0d6d1c06..4b3087e20d51 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1033,6 +1033,28 @@ int bio_add_page(struct bio *bio, struct page *page,
}
EXPORT_SYMBOL(bio_add_page);
+/**
+ * bio_add_folio - Attempt to add part of a folio to a bio.
+ * @bio: BIO to add to.
+ * @folio: Folio to add.
+ * @len: How many bytes from the folio to add.
+ * @off: First byte in this folio to add.
+ *
+ * Filesystems that use folios can call this function instead of calling
+ * bio_add_page() for each page in the folio. If @off is bigger than
+ * PAGE_SIZE, this function can create a bio_vec that starts in a page
+ * after the bv_page. BIOs do not support folios that are 4GiB or larger.
+ *
+ * Return: Whether the addition was successful.
+ */
+bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
+ size_t off)
+{
+ if (len > UINT_MAX || off > UINT_MAX)
+ return 0;
+ return bio_add_page(bio, &folio->page, len, off) > 0;
+}
+
void __bio_release_pages(struct bio *bio, bool mark_dirty)
{
struct bvec_iter_all iter_all;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index fe6bdfbbef66..a783cac49978 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -409,7 +409,8 @@ extern void bio_uninit(struct bio *);
extern void bio_reset(struct bio *);
void bio_chain(struct bio *, struct bio *);
-extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
+int bio_add_page(struct bio *, struct page *, unsigned len, unsigned off);
+bool bio_add_folio(struct bio *, struct folio *, size_t len, size_t off);
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
int bio_add_zone_append_page(struct bio *bio, struct page *page,
--
2.33.0
* [PATCH v3 02/25] block: Add bio_for_each_folio_all()
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Jens Axboe, Christoph Hellwig
Allow callers to iterate over each folio instead of each page. The
bio need not have been constructed using folios originally.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
Documentation/core-api/kernel-api.rst | 1 +
include/linux/bio.h | 53 ++++++++++++++++++++++++++-
2 files changed, 53 insertions(+), 1 deletion(-)
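For illustration (not part of the patch), a bio completion handler can now walk
folio-sized segments instead of pages. This is a deliberately simplified,
hypothetical read-completion sketch; it assumes each folio only needed this one
bio to become uptodate:

#include <linux/bio.h>
#include <linux/pagemap.h>

/* Hypothetical end_io handler: finish the read on every folio in the bio. */
static void example_read_end_io(struct bio *bio)
{
        struct folio_iter fi;

        bio_for_each_folio_all(fi, bio) {
                /* fi.folio, fi.offset and fi.length describe one segment;
                 * a segment never crosses a folio boundary. */
                if (!bio->bi_status)
                        folio_mark_uptodate(fi.folio);
                folio_unlock(fi.folio);
        }
        bio_put(bio);
}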
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 2e7186805148..7f0cb604b6ab 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -279,6 +279,7 @@ Accounting Framework
Block Devices
=============
+.. kernel-doc:: include/linux/bio.h
.. kernel-doc:: block/blk-core.c
:export:
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a783cac49978..e3c9e8207f12 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -166,7 +166,7 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
*/
#define bio_for_each_bvec_all(bvl, bio, i) \
for (i = 0, bvl = bio_first_bvec_all(bio); \
- i < (bio)->bi_vcnt; i++, bvl++) \
+ i < (bio)->bi_vcnt; i++, bvl++)
#define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
@@ -260,6 +260,57 @@ static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
return &bio->bi_io_vec[bio->bi_vcnt - 1];
}
+/**
+ * struct folio_iter - State for iterating all folios in a bio.
+ * @folio: The current folio we're iterating. NULL after the last folio.
+ * @offset: The byte offset within the current folio.
+ * @length: The number of bytes in this iteration (will not cross folio
+ * boundary).
+ */
+struct folio_iter {
+ struct folio *folio;
+ size_t offset;
+ size_t length;
+ /* private: for use by the iterator */
+ size_t _seg_count;
+ int _i;
+};
+
+static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
+ int i)
+{
+ struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
+
+ fi->folio = page_folio(bvec->bv_page);
+ fi->offset = bvec->bv_offset +
+ PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
+ fi->_seg_count = bvec->bv_len;
+ fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
+ fi->_i = i;
+}
+
+static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
+{
+ fi->_seg_count -= fi->length;
+ if (fi->_seg_count) {
+ fi->folio = folio_next(fi->folio);
+ fi->offset = 0;
+ fi->length = min(folio_size(fi->folio), fi->_seg_count);
+ } else if (fi->_i + 1 < bio->bi_vcnt) {
+ bio_first_folio(fi, bio, fi->_i + 1);
+ } else {
+ fi->folio = NULL;
+ }
+}
+
+/**
+ * bio_for_each_folio_all - Iterate over each folio in a bio.
+ * @fi: struct folio_iter which is updated for each folio.
+ * @bio: struct bio to iterate over.
+ */
+#define bio_for_each_folio_all(fi, bio) \
+ for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio))
+
enum bip_flags {
BIP_BLOCK_INTEGRITY = 1 << 0, /* block layer owns integrity data */
BIP_MAPPED_INTEGRITY = 1 << 1, /* ref tag has been remapped */
--
2.33.0
* [PATCH v3 03/25] fs/buffer: Convert __block_write_begin_int() to take a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
There are no plans to convert buffer_head infrastructure to use large
folios, but __block_write_begin_int() is called from iomap, and it's
more convenient and less error-prone if we pass in a folio from iomap.
It also has a nice saving of almost 200 bytes of code from removing
repeated calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/buffer.c | 23 ++++++++++++-----------
fs/internal.h | 2 +-
fs/iomap/buffered-io.c | 7 +++++--
3 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 46bc589b7a03..8e112b6bd371 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1969,34 +1969,34 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
}
}
-int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
+int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
get_block_t *get_block, const struct iomap *iomap)
{
unsigned from = pos & (PAGE_SIZE - 1);
unsigned to = from + len;
- struct inode *inode = page->mapping->host;
+ struct inode *inode = folio->mapping->host;
unsigned block_start, block_end;
sector_t block;
int err = 0;
unsigned blocksize, bbits;
struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
- BUG_ON(!PageLocked(page));
+ BUG_ON(!folio_test_locked(folio));
BUG_ON(from > PAGE_SIZE);
BUG_ON(to > PAGE_SIZE);
BUG_ON(from > to);
- head = create_page_buffers(page, inode, 0);
+ head = create_page_buffers(&folio->page, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
- block = (sector_t)page->index << (PAGE_SHIFT - bbits);
+ block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
for(bh = head, block_start = 0; bh != head || !block_start;
block++, block_start=block_end, bh = bh->b_this_page) {
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
- if (PageUptodate(page)) {
+ if (folio_test_uptodate(folio)) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
}
@@ -2016,20 +2016,20 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
- if (PageUptodate(page)) {
+ if (folio_test_uptodate(folio)) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
continue;
}
if (block_end > to || block_start < from)
- zero_user_segments(page,
+ folio_zero_segments(folio,
to, block_end,
block_start, from);
continue;
}
}
- if (PageUptodate(page)) {
+ if (folio_test_uptodate(folio)) {
if (!buffer_uptodate(bh))
set_buffer_uptodate(bh);
continue;
@@ -2050,14 +2050,15 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
err = -EIO;
}
if (unlikely(err))
- page_zero_new_buffers(page, from, to);
+ page_zero_new_buffers(&folio->page, from, to);
return err;
}
int __block_write_begin(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block)
{
- return __block_write_begin_int(page, pos, len, get_block, NULL);
+ return __block_write_begin_int(page_folio(page), pos, len, get_block,
+ NULL);
}
EXPORT_SYMBOL(__block_write_begin);
diff --git a/fs/internal.h b/fs/internal.h
index 7979ff8d168c..8590c973c2f4 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -37,7 +37,7 @@ static inline int emergency_thaw_bdev(struct super_block *sb)
/*
* buffer.c
*/
-int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
+int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
get_block_t *get_block, const struct iomap *iomap);
/*
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 71a36ae120ee..ecb65167715b 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -603,6 +603,7 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
struct page *page;
+ struct folio *folio;
int status = 0;
BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
@@ -624,11 +625,12 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
status = -ENOMEM;
goto out_no_page;
}
+ folio = page_folio(page);
if (srcmap->type == IOMAP_INLINE)
status = iomap_write_begin_inline(iter, page);
else if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
- status = __block_write_begin_int(page, pos, len, NULL, srcmap);
+ status = __block_write_begin_int(folio, pos, len, NULL, srcmap);
else
status = __iomap_write_begin(iter, pos, len, page);
@@ -960,11 +962,12 @@ EXPORT_SYMBOL_GPL(iomap_truncate_page);
static loff_t iomap_page_mkwrite_iter(struct iomap_iter *iter,
struct page *page)
{
+ struct folio *folio = page_folio(page);
loff_t length = iomap_length(iter);
int ret;
if (iter->iomap.flags & IOMAP_F_BUFFER_HEAD) {
- ret = __block_write_begin_int(page, iter->pos, length, NULL,
+ ret = __block_write_begin_int(folio, iter->pos, length, NULL,
&iter->iomap);
if (ret)
return ret;
--
2.33.0
* [PATCH v3 04/25] iomap: Convert to_iomap_page to take a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
The big comment about only using a head page can go away now that
it takes a folio argument.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ecb65167715b..101d5b0754e9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -22,8 +22,8 @@
#include "../internal.h"
/*
- * Structure allocated for each page or THP when block size < page size
- * to track sub-page uptodate status and I/O completions.
+ * Structure allocated for each folio when block size < folio size
+ * to track sub-folio uptodate status and I/O completions.
*/
struct iomap_page {
atomic_t read_bytes_pending;
@@ -32,17 +32,10 @@ struct iomap_page {
unsigned long uptodate[];
};
-static inline struct iomap_page *to_iomap_page(struct page *page)
+static inline struct iomap_page *to_iomap_page(struct folio *folio)
{
- /*
- * per-block data is stored in the head page. Callers should
- * not be dealing with tail pages, and if they are, they can
- * call thp_head() first.
- */
- VM_BUG_ON_PGFLAGS(PageTail(page), page);
-
- if (page_has_private(page))
- return (struct iomap_page *)page_private(page);
+ if (folio_test_private(folio))
+ return folio_get_private(folio);
return NULL;
}
@@ -51,7 +44,8 @@ static struct bio_set iomap_ioend_bioset;
static struct iomap_page *
iomap_page_create(struct inode *inode, struct page *page)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
unsigned int nr_blocks = i_blocks_per_page(inode, page);
if (iop || nr_blocks <= 1)
@@ -144,7 +138,8 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
static void
iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
@@ -173,7 +168,8 @@ static void
iomap_read_page_end_io(struct bio_vec *bvec, int error)
{
struct page *page = bvec->bv_page;
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
if (unlikely(error)) {
ClearPageUptodate(page);
@@ -438,7 +434,8 @@ int
iomap_is_partially_uptodate(struct page *page, unsigned long from,
unsigned long count)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = page->mapping->host;
unsigned len, first, last;
unsigned i;
@@ -1012,7 +1009,8 @@ static void
iomap_finish_page_writeback(struct inode *inode, struct page *page,
int error, unsigned int len)
{
- struct iomap_page *iop = to_iomap_page(page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
if (error) {
SetPageError(page);
--
2.33.0
* [PATCH v3 05/25] iomap: Convert iomap_page_create to take a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
This function already assumed it was being passed a head page, so
just formalise that.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 101d5b0754e9..d7823610da5c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,11 +42,10 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
static struct bio_set iomap_ioend_bioset;
static struct iomap_page *
-iomap_page_create(struct inode *inode, struct page *page)
+iomap_page_create(struct inode *inode, struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
- unsigned int nr_blocks = i_blocks_per_page(inode, page);
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
if (iop || nr_blocks <= 1)
return iop;
@@ -54,9 +53,9 @@ iomap_page_create(struct inode *inode, struct page *page)
iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
GFP_NOFS | __GFP_NOFAIL);
spin_lock_init(&iop->uptodate_lock);
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
bitmap_fill(iop->uptodate, nr_blocks);
- attach_page_private(page, iop);
+ folio_attach_private(folio, iop);
return iop;
}
@@ -213,6 +212,7 @@ struct iomap_readpage_ctx {
static int iomap_read_inline_data(const struct iomap_iter *iter,
struct page *page)
{
+ struct folio *folio = page_folio(page);
const struct iomap *iomap = iomap_iter_srcmap(iter);
size_t size = i_size_read(iter->inode) - iomap->offset;
size_t poff = offset_in_page(iomap->offset);
@@ -229,7 +229,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
if (WARN_ON_ONCE(size > iomap->length))
return -EIO;
if (poff > 0)
- iomap_page_create(iter->inode, page);
+ iomap_page_create(iter->inode, folio);
addr = kmap_local_page(page) + poff;
memcpy(addr, iomap->inline_data, size);
@@ -256,6 +256,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
loff_t pos = iter->pos + offset;
loff_t length = iomap_length(iter) - offset;
struct page *page = ctx->cur_page;
+ struct folio *folio = page_folio(page);
struct iomap_page *iop;
loff_t orig_pos = pos;
unsigned poff, plen;
@@ -265,7 +266,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
return iomap_read_inline_data(iter, page);
/* zero post-eof blocks as the page may be mapped */
- iop = iomap_page_create(iter->inode, page);
+ iop = iomap_page_create(iter->inode, folio);
iomap_adjust_read_range(iter->inode, iop, &pos, length, &poff, &plen);
if (plen == 0)
goto done;
@@ -547,8 +548,9 @@ iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
unsigned len, struct page *page)
{
+ struct folio *folio = page_folio(page);
const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct iomap_page *iop = iomap_page_create(iter->inode, page);
+ struct iomap_page *iop = iomap_page_create(iter->inode, folio);
loff_t block_size = i_blocksize(iter->inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
@@ -1296,7 +1298,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
struct page *page, u64 end_offset)
{
- struct iomap_page *iop = iomap_page_create(inode, page);
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = iomap_page_create(inode, folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
u64 file_offset; /* file offset of page */
--
2.33.0
* [PATCH v3 06/25] iomap: Convert iomap_page_release to take a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
iomap_page_release() was also assuming that it was being passed a
head page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d7823610da5c..16604f605357 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -59,18 +59,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
return iop;
}
-static void
-iomap_page_release(struct page *page)
+static void iomap_page_release(struct folio *folio)
{
- struct iomap_page *iop = detach_page_private(page);
- unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
+ struct iomap_page *iop = folio_detach_private(folio);
+ struct inode *inode = folio->mapping->host;
+ unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
if (!iop)
return;
WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
- PageUptodate(page));
+ folio_test_uptodate(folio));
kfree(iop);
}
@@ -462,6 +462,8 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
int
iomap_releasepage(struct page *page, gfp_t gfp_mask)
{
+ struct folio *folio = page_folio(page);
+
trace_iomap_releasepage(page->mapping->host, page_offset(page),
PAGE_SIZE);
@@ -472,7 +474,7 @@ iomap_releasepage(struct page *page, gfp_t gfp_mask)
*/
if (PageDirty(page) || PageWriteback(page))
return 0;
- iomap_page_release(page);
+ iomap_page_release(folio);
return 1;
}
EXPORT_SYMBOL_GPL(iomap_releasepage);
@@ -480,6 +482,8 @@ EXPORT_SYMBOL_GPL(iomap_releasepage);
void
iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
{
+ struct folio *folio = page_folio(page);
+
trace_iomap_invalidatepage(page->mapping->host, offset, len);
/*
@@ -489,7 +493,7 @@ iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
if (offset == 0 && len == PAGE_SIZE) {
WARN_ON_ONCE(PageWriteback(page));
cancel_dirty_page(page);
- iomap_page_release(page);
+ iomap_page_release(folio);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidatepage);
--
2.33.0
* [PATCH v3 07/25] iomap: Convert iomap_releasepage to use a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
This is an address_space operation, so its argument must remain as a
struct page, but we can use a folio internally.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
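The shape of the change is worth spelling out: the prototype is fixed by struct
address_space_operations, so the page is converted to its folio on entry and
only the folio is used afterwards. A hypothetical sketch of the same pattern
(example_release_private() is an invented helper, not a real function):

/* Hypothetical aops entry point: page in the signature, folio in the body. */
static int example_releasepage(struct page *page, gfp_t gfp_mask)
{
        struct folio *folio = page_folio(page);

        if (folio_test_dirty(folio) || folio_test_writeback(folio))
                return 0;
        example_release_private(folio);         /* invented helper */
        return 1;
}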
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 16604f605357..b0192b148c9f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -464,15 +464,15 @@ iomap_releasepage(struct page *page, gfp_t gfp_mask)
{
struct folio *folio = page_folio(page);
- trace_iomap_releasepage(page->mapping->host, page_offset(page),
- PAGE_SIZE);
+ trace_iomap_releasepage(folio->mapping->host, folio_pos(folio),
+ folio_size(folio));
/*
* mm accommodates an old ext3 case where clean pages might not have had
* the dirty bit cleared. Thus, it can send actual dirty pages to
* ->releasepage() via shrink_active_list(); skip those here.
*/
- if (PageDirty(page) || PageWriteback(page))
+ if (folio_test_dirty(folio) || folio_test_writeback(folio))
return 0;
iomap_page_release(folio);
return 1;
--
2.33.0
* [PATCH v3 08/25] iomap: Add iomap_invalidate_folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Keep iomap_invalidatepage around as a wrapper for use in address_space
operations.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 20 ++++++++++++--------
include/linux/iomap.h | 1 +
2 files changed, 13 insertions(+), 8 deletions(-)
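For context (not part of the patch), this is roughly how a filesystem consumes
the pair: the address_space_operations table keeps pointing at the page-based
wrapper, while code that already has a folio can call iomap_invalidate_folio()
directly. example_aops is hypothetical wiring, along the lines of what iomap
users such as XFS do:

#include <linux/fs.h>
#include <linux/iomap.h>

/* Hypothetical filesystem wiring; these aops fields still take struct page. */
static const struct address_space_operations example_aops = {
        .releasepage            = iomap_releasepage,
        .invalidatepage         = iomap_invalidatepage,
};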
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b0192b148c9f..de7ce1909527 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -479,23 +479,27 @@ iomap_releasepage(struct page *page, gfp_t gfp_mask)
}
EXPORT_SYMBOL_GPL(iomap_releasepage);
-void
-iomap_invalidatepage(struct page *page, unsigned int offset, unsigned int len)
+void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
{
- struct folio *folio = page_folio(page);
-
- trace_iomap_invalidatepage(page->mapping->host, offset, len);
+ trace_iomap_invalidatepage(folio->mapping->host, offset, len);
/*
* If we're invalidating the entire page, clear the dirty state from it
* and release it to avoid unnecessary buildup of the LRU.
*/
- if (offset == 0 && len == PAGE_SIZE) {
- WARN_ON_ONCE(PageWriteback(page));
- cancel_dirty_page(page);
+ if (offset == 0 && len == folio_size(folio)) {
+ WARN_ON_ONCE(folio_test_writeback(folio));
+ folio_cancel_dirty(folio);
iomap_page_release(folio);
}
}
+EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
+
+void iomap_invalidatepage(struct page *page, unsigned int offset,
+ unsigned int len)
+{
+ iomap_invalidate_folio(page_folio(page), offset, len);
+}
EXPORT_SYMBOL_GPL(iomap_invalidatepage);
#ifdef CONFIG_MIGRATION
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6d1b08d0ae93..29491fb9c5ba 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -225,6 +225,7 @@ void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
int iomap_is_partially_uptodate(struct page *page, unsigned long from,
unsigned long count);
int iomap_releasepage(struct page *page, gfp_t gfp_mask);
+void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
void iomap_invalidatepage(struct page *page, unsigned int offset,
unsigned int len);
#ifdef CONFIG_MIGRATION
--
2.33.0
* [PATCH v3 09/25] iomap: Pass the iomap_page into iomap_set_range_uptodate
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:06 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
All but one caller already has the iomap_page, so we can avoid getting
it again.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index de7ce1909527..856ef62b319e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -134,11 +134,9 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
*lenp = plen;
}
-static void
-iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+static void iomap_iop_set_range_uptodate(struct page *page,
+ struct iomap_page *iop, unsigned off, unsigned len)
{
- struct folio *folio = page_folio(page);
- struct iomap_page *iop = to_iomap_page(folio);
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
@@ -151,14 +149,14 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
spin_unlock_irqrestore(&iop->uptodate_lock, flags);
}
-static void
-iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+static void iomap_set_range_uptodate(struct page *page,
+ struct iomap_page *iop, unsigned off, unsigned len)
{
if (PageError(page))
return;
- if (page_has_private(page))
- iomap_iop_set_range_uptodate(page, off, len);
+ if (iop)
+ iomap_iop_set_range_uptodate(page, iop, off, len);
else
SetPageUptodate(page);
}
@@ -174,7 +172,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
ClearPageUptodate(page);
SetPageError(page);
} else {
- iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
+ iomap_set_range_uptodate(page, iop, bvec->bv_offset,
+ bvec->bv_len);
}
if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
@@ -213,6 +212,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
struct page *page)
{
struct folio *folio = page_folio(page);
+ struct iomap_page *iop;
const struct iomap *iomap = iomap_iter_srcmap(iter);
size_t size = i_size_read(iter->inode) - iomap->offset;
size_t poff = offset_in_page(iomap->offset);
@@ -229,13 +229,15 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
if (WARN_ON_ONCE(size > iomap->length))
return -EIO;
if (poff > 0)
- iomap_page_create(iter->inode, folio);
+ iop = iomap_page_create(iter->inode, folio);
+ else
+ iop = to_iomap_page(folio);
addr = kmap_local_page(page) + poff;
memcpy(addr, iomap->inline_data, size);
memset(addr + size, 0, PAGE_SIZE - poff - size);
kunmap_local(addr);
- iomap_set_range_uptodate(page, poff, PAGE_SIZE - poff);
+ iomap_set_range_uptodate(page, iop, poff, PAGE_SIZE - poff);
return 0;
}
@@ -273,7 +275,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
if (iomap_block_needs_zeroing(iter, pos)) {
zero_user(page, poff, plen);
- iomap_set_range_uptodate(page, poff, plen);
+ iomap_set_range_uptodate(page, iop, poff, plen);
goto done;
}
@@ -589,7 +591,7 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (status)
return status;
}
- iomap_set_range_uptodate(page, poff, plen);
+ iomap_set_range_uptodate(page, iop, poff, plen);
} while ((block_start += plen) < block_end);
return 0;
@@ -661,6 +663,8 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
size_t copied, struct page *page)
{
+ struct folio *folio = page_folio(page);
+ struct iomap_page *iop = to_iomap_page(folio);
flush_dcache_page(page);
/*
@@ -676,7 +680,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
*/
if (unlikely(copied < len && !PageUptodate(page)))
return 0;
- iomap_set_range_uptodate(page, offset_in_page(pos), len);
+ iomap_set_range_uptodate(page, iop, offset_in_page(pos), len);
__set_page_dirty_nobuffers(page);
return copied;
}
--
2.33.0
* [PATCH v3 10/25] iomap: Convert bio completions to use folios
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Use bio_for_each_folio() to iterate over each folio in the bio
instead of iterating over each page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 50 ++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 29 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 856ef62b319e..5730a3317407 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -161,34 +161,29 @@ static void iomap_set_range_uptodate(struct page *page,
SetPageUptodate(page);
}
-static void
-iomap_read_page_end_io(struct bio_vec *bvec, int error)
+static void iomap_finish_folio_read(struct folio *folio, size_t offset,
+ size_t len, int error)
{
- struct page *page = bvec->bv_page;
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
if (unlikely(error)) {
- ClearPageUptodate(page);
- SetPageError(page);
+ folio_clear_uptodate(folio);
+ folio_set_error(folio);
} else {
- iomap_set_range_uptodate(page, iop, bvec->bv_offset,
- bvec->bv_len);
+ iomap_set_range_uptodate(&folio->page, iop, offset, len);
}
- if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
- unlock_page(page);
+ if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
+ folio_unlock(folio);
}
-static void
-iomap_read_end_io(struct bio *bio)
+static void iomap_read_end_io(struct bio *bio)
{
int error = blk_status_to_errno(bio->bi_status);
- struct bio_vec *bvec;
- struct bvec_iter_all iter_all;
+ struct folio_iter fi;
- bio_for_each_segment_all(bvec, bio, iter_all)
- iomap_read_page_end_io(bvec, error);
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
bio_put(bio);
}
@@ -1019,23 +1014,21 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
}
EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
-static void
-iomap_finish_page_writeback(struct inode *inode, struct page *page,
- int error, unsigned int len)
+static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
+ size_t len, int error)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
if (error) {
- SetPageError(page);
+ folio_set_error(folio);
mapping_set_error(inode->i_mapping, error);
}
- WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
+ WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !iop);
WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
- end_page_writeback(page);
+ folio_end_writeback(folio);
}
/*
@@ -1054,8 +1047,7 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
bool quiet = bio_flagged(bio, BIO_QUIET);
for (bio = &ioend->io_inline_bio; bio; bio = next) {
- struct bio_vec *bv;
- struct bvec_iter_all iter_all;
+ struct folio_iter fi;
/*
* For the last bio, bi_private points to the ioend, so we
@@ -1066,10 +1058,10 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
else
next = bio->bi_private;
- /* walk each page on bio, ending page IO on them */
- bio_for_each_segment_all(bv, bio, iter_all)
- iomap_finish_page_writeback(inode, bv->bv_page, error,
- bv->bv_len);
+ /* walk all folios in bio, ending page IO on them */
+ bio_for_each_folio_all(fi, bio)
+ iomap_finish_folio_write(inode, fi.folio, fi.length,
+ error);
bio_put(bio);
}
/* The ioend has been freed by bio_put() */
--
2.33.0
* [PATCH v3 11/25] iomap: Use folio offsets instead of page offsets
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Pass a folio around instead of the page, and make sure the offset
is relative to the start of the folio instead of the start of a page.
Also use size_t for offset & length to make it clear that these are byte
counts, and to support >2GB folios in the future.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 78 ++++++++++++++++++++++--------------------
1 file changed, 40 insertions(+), 38 deletions(-)
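A small worked example of why folio-relative offsets matter, assuming a
hypothetical 16KiB folio that starts at file position 0 on a 4KiB-page system:

        loff_t pos = 5000;

        /* Page-relative: 5000 & (PAGE_SIZE - 1) == 904, only meaningful
         * relative to the second page of the folio. */
        size_t in_page = offset_in_page(pos);

        /* Folio-relative: 5000 & (folio_size(folio) - 1) == 5000, usable
         * directly with folio_zero_range(), bio_add_folio() and friends. */
        size_t in_folio = offset_in_folio(folio, pos);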
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5730a3317407..06ff80c05340 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -75,18 +75,18 @@ static void iomap_page_release(struct folio *folio)
}
/*
- * Calculate the range inside the page that we actually need to read.
+ * Calculate the range inside the folio that we actually need to read.
*/
-static void
-iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
- loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
+static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
+ loff_t *pos, loff_t length, size_t *offp, size_t *lenp)
{
+ struct iomap_page *iop = to_iomap_page(folio);
loff_t orig_pos = *pos;
loff_t isize = i_size_read(inode);
unsigned block_bits = inode->i_blkbits;
unsigned block_size = (1 << block_bits);
- unsigned poff = offset_in_page(*pos);
- unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
+ size_t poff = offset_in_folio(folio, *pos);
+ size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
unsigned first = poff >> block_bits;
unsigned last = (poff + plen - 1) >> block_bits;
@@ -124,7 +124,7 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
* page cache for blocks that are entirely outside of i_size.
*/
if (orig_pos <= isize && orig_pos + length > isize) {
- unsigned end = offset_in_page(isize - 1) >> block_bits;
+ unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;
if (first <= end && last > end)
plen -= (last - end) * block_size;
@@ -134,31 +134,31 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
*lenp = plen;
}
-static void iomap_iop_set_range_uptodate(struct page *page,
- struct iomap_page *iop, unsigned off, unsigned len)
+static void iomap_iop_set_range_uptodate(struct folio *folio,
+ struct iomap_page *iop, size_t off, size_t len)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = folio->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
unsigned long flags;
spin_lock_irqsave(&iop->uptodate_lock, flags);
bitmap_set(iop->uptodate, first, last - first + 1);
- if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
- SetPageUptodate(page);
+ if (bitmap_full(iop->uptodate, i_blocks_per_folio(inode, folio)))
+ folio_mark_uptodate(folio);
spin_unlock_irqrestore(&iop->uptodate_lock, flags);
}
-static void iomap_set_range_uptodate(struct page *page,
- struct iomap_page *iop, unsigned off, unsigned len)
+static void iomap_set_range_uptodate(struct folio *folio,
+ struct iomap_page *iop, size_t off, size_t len)
{
- if (PageError(page))
+ if (folio_test_error(folio))
return;
if (iop)
- iomap_iop_set_range_uptodate(page, iop, off, len);
+ iomap_iop_set_range_uptodate(folio, iop, off, len);
else
- SetPageUptodate(page);
+ folio_mark_uptodate(folio);
}
static void iomap_finish_folio_read(struct folio *folio, size_t offset,
@@ -170,7 +170,7 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
folio_clear_uptodate(folio);
folio_set_error(folio);
} else {
- iomap_set_range_uptodate(&folio->page, iop, offset, len);
+ iomap_set_range_uptodate(folio, iop, offset, len);
}
if (!iop || atomic_sub_and_test(len, &iop->read_bytes_pending))
@@ -211,6 +211,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
const struct iomap *iomap = iomap_iter_srcmap(iter);
size_t size = i_size_read(iter->inode) - iomap->offset;
size_t poff = offset_in_page(iomap->offset);
+ size_t offset = offset_in_folio(folio, iomap->offset);
void *addr;
if (PageUptodate(page))
@@ -223,7 +224,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
return -EIO;
if (WARN_ON_ONCE(size > iomap->length))
return -EIO;
- if (poff > 0)
+ if (offset > 0)
iop = iomap_page_create(iter->inode, folio);
else
iop = to_iomap_page(folio);
@@ -232,7 +233,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
memcpy(addr, iomap->inline_data, size);
memset(addr + size, 0, PAGE_SIZE - poff - size);
kunmap_local(addr);
- iomap_set_range_uptodate(page, iop, poff, PAGE_SIZE - poff);
+ iomap_set_range_uptodate(folio, iop, offset, PAGE_SIZE - poff);
return 0;
}
@@ -256,7 +257,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
struct folio *folio = page_folio(page);
struct iomap_page *iop;
loff_t orig_pos = pos;
- unsigned poff, plen;
+ size_t poff, plen;
sector_t sector;
if (iomap->type == IOMAP_INLINE)
@@ -264,13 +265,13 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
/* zero post-eof blocks as the page may be mapped */
iop = iomap_page_create(iter->inode, folio);
- iomap_adjust_read_range(iter->inode, iop, &pos, length, &poff, &plen);
+ iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
if (plen == 0)
goto done;
if (iomap_block_needs_zeroing(iter, pos)) {
- zero_user(page, poff, plen);
- iomap_set_range_uptodate(page, iop, poff, plen);
+ folio_zero_range(folio, poff, plen);
+ iomap_set_range_uptodate(folio, iop, poff, plen);
goto done;
}
@@ -281,7 +282,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
sector = iomap_sector(iomap, pos);
if (!ctx->bio ||
bio_end_sector(ctx->bio) != sector ||
- bio_add_page(ctx->bio, page, plen, poff) != plen) {
+ !bio_add_folio(ctx->bio, folio, plen, poff)) {
gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
@@ -305,8 +306,9 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
ctx->bio->bi_iter.bi_sector = sector;
bio_set_dev(ctx->bio, iomap->bdev);
ctx->bio->bi_end_io = iomap_read_end_io;
- __bio_add_page(ctx->bio, page, plen, poff);
+ bio_add_folio(ctx->bio, folio, plen, poff);
}
+
done:
/*
* Move the caller beyond our range so that it keeps making progress.
@@ -535,9 +537,8 @@ iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
truncate_pagecache_range(inode, max(pos, i_size), pos + len);
}
-static int
-iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
- unsigned plen, const struct iomap *iomap)
+static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
+ size_t poff, size_t plen, const struct iomap *iomap)
{
struct bio_vec bvec;
struct bio bio;
@@ -546,7 +547,7 @@ iomap_read_page_sync(loff_t block_start, struct page *page, unsigned poff,
bio.bi_opf = REQ_OP_READ;
bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
bio_set_dev(&bio, iomap->bdev);
- __bio_add_page(&bio, page, plen, poff);
+ bio_add_folio(&bio, folio, plen, poff);
return submit_bio_wait(&bio);
}
@@ -559,14 +560,15 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
loff_t block_size = i_blocksize(iter->inode);
loff_t block_start = round_down(pos, block_size);
loff_t block_end = round_up(pos + len, block_size);
- unsigned from = offset_in_page(pos), to = from + len, poff, plen;
+ size_t from = offset_in_folio(folio, pos), to = from + len;
+ size_t poff, plen;
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
return 0;
- ClearPageError(page);
+ folio_clear_error(folio);
do {
- iomap_adjust_read_range(iter->inode, iop, &block_start,
+ iomap_adjust_read_range(iter->inode, folio, &block_start,
block_end - block_start, &poff, &plen);
if (plen == 0)
break;
@@ -579,14 +581,14 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (iomap_block_needs_zeroing(iter, block_start)) {
if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
return -EIO;
- zero_user_segments(page, poff, from, to, poff + plen);
+ folio_zero_segments(folio, poff, from, to, poff + plen);
} else {
- int status = iomap_read_page_sync(block_start, page,
+ int status = iomap_read_folio_sync(block_start, folio,
poff, plen, srcmap);
if (status)
return status;
}
- iomap_set_range_uptodate(page, iop, poff, plen);
+ iomap_set_range_uptodate(folio, iop, poff, plen);
} while ((block_start += plen) < block_end);
return 0;
@@ -675,7 +677,7 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
*/
if (unlikely(copied < len && !PageUptodate(page)))
return 0;
- iomap_set_range_uptodate(page, iop, offset_in_page(pos), len);
+ iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
__set_page_dirty_nobuffers(page);
return copied;
}
--
2.33.0
* [PATCH v3 12/25] iomap: Convert iomap_read_inline_data to take a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
We still only support up to a single page of inline data (at least,
per call to iomap_read_inline_data()), but it can now be written into
the middle of a folio in case we decide to allocate a 16KiB page for
a file that's 8.1KiB in size.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
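To make the "middle of a folio" case concrete, assume the hypothetical numbers
from the commit message: a 16KiB folio caching a file whose inline tail starts
at file offset 8192. The copy then targets the third page of the folio rather
than byte 0 of whichever page happened to be passed in:

        /* offset_in_folio() == 8192 here, so kmap_local_folio() maps the
         * page that is 8KiB into the folio and returns the address of
         * that byte. */
        size_t offset = offset_in_folio(folio, iomap->offset);
        void *addr = kmap_local_folio(folio, offset);

        memcpy(addr, iomap->inline_data, size);
        kunmap_local(addr);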
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 06ff80c05340..2ebea02780b8 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -197,16 +197,15 @@ struct iomap_readpage_ctx {
/**
* iomap_read_inline_data - copy inline data into the page cache
* @iter: iteration structure
- * @page: page to copy to
+ * @folio: folio to copy to
*
- * Copy the inline data in @iter into @page and zero out the rest of the page.
+ * Copy the inline data in @iter into @folio and zero out the rest of the folio.
* Only a single IOMAP_INLINE extent is allowed at the end of each file.
* Returns zero for success to complete the read, or the usual negative errno.
*/
static int iomap_read_inline_data(const struct iomap_iter *iter,
- struct page *page)
+ struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop;
const struct iomap *iomap = iomap_iter_srcmap(iter);
size_t size = i_size_read(iter->inode) - iomap->offset;
@@ -214,7 +213,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
size_t offset = offset_in_folio(folio, iomap->offset);
void *addr;
- if (PageUptodate(page))
+ if (folio_test_uptodate(folio))
return 0;
if (WARN_ON_ONCE(size > PAGE_SIZE - poff))
@@ -229,7 +228,7 @@ static int iomap_read_inline_data(const struct iomap_iter *iter,
else
iop = to_iomap_page(folio);
- addr = kmap_local_page(page) + poff;
+ addr = kmap_local_folio(folio, offset);
memcpy(addr, iomap->inline_data, size);
memset(addr + size, 0, PAGE_SIZE - poff - size);
kunmap_local(addr);
@@ -261,7 +260,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
sector_t sector;
if (iomap->type == IOMAP_INLINE)
- return iomap_read_inline_data(iter, page);
+ return iomap_read_inline_data(iter, folio);
/* zero post-eof blocks as the page may be mapped */
iop = iomap_page_create(iter->inode, folio);
@@ -597,10 +596,12 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
static int iomap_write_begin_inline(const struct iomap_iter *iter,
struct page *page)
{
+ struct folio *folio = page_folio(page);
+
/* needs more work for the tailpacking case; disable for now */
if (WARN_ON_ONCE(iomap_iter_srcmap(iter)->offset != 0))
return -EIO;
- return iomap_read_inline_data(iter, page);
+ return iomap_read_inline_data(iter, folio);
}
static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
--
2.33.0
* [PATCH v3 13/25] iomap: Convert readahead and readpage to use a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Handle folios of arbitrary size instead of working in PAGE_SIZE units.
readahead_folio() decreases the page refcount for you, so this is not
quite a mechanical change.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 53 +++++++++++++++++++++---------------------
1 file changed, 26 insertions(+), 27 deletions(-)
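The refcount change is the non-mechanical part, so here is a hypothetical
before/after sketch of the readahead loop:

        /* Before: readahead_page() returns a reference the caller must drop. */
        struct page *page = readahead_page(rac);
        /* ...queue the read... */
        put_page(page);

        /* After: readahead_folio() has already dropped that reference; the
         * page cache's own reference and the folio lock keep the folio
         * alive, so the caller must not call folio_put() here. */
        struct folio *folio = readahead_folio(rac);
        /* ...queue the read... */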
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2ebea02780b8..ad89c20cb741 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -188,8 +188,8 @@ static void iomap_read_end_io(struct bio *bio)
}
struct iomap_readpage_ctx {
- struct page *cur_page;
- bool cur_page_in_bio;
+ struct folio *cur_folio;
+ bool cur_folio_in_bio;
struct bio *bio;
struct readahead_control *rac;
};
@@ -252,8 +252,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
const struct iomap *iomap = &iter->iomap;
loff_t pos = iter->pos + offset;
loff_t length = iomap_length(iter) - offset;
- struct page *page = ctx->cur_page;
- struct folio *folio = page_folio(page);
+ struct folio *folio = ctx->cur_folio;
struct iomap_page *iop;
loff_t orig_pos = pos;
size_t poff, plen;
@@ -274,7 +273,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
goto done;
}
- ctx->cur_page_in_bio = true;
+ ctx->cur_folio_in_bio = true;
if (iop)
atomic_add(plen, &iop->read_bytes_pending);
@@ -282,7 +281,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
if (!ctx->bio ||
bio_end_sector(ctx->bio) != sector ||
!bio_add_folio(ctx->bio, folio, plen, poff)) {
- gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
+ gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
gfp_t orig_gfp = gfp;
unsigned int nr_vecs = DIV_ROUND_UP(length, PAGE_SIZE);
@@ -321,30 +320,31 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
int
iomap_readpage(struct page *page, const struct iomap_ops *ops)
{
+ struct folio *folio = page_folio(page);
struct iomap_iter iter = {
- .inode = page->mapping->host,
- .pos = page_offset(page),
- .len = PAGE_SIZE,
+ .inode = folio->mapping->host,
+ .pos = folio_pos(folio),
+ .len = folio_size(folio),
};
struct iomap_readpage_ctx ctx = {
- .cur_page = page,
+ .cur_folio = folio,
};
int ret;
- trace_iomap_readpage(page->mapping->host, 1);
+ trace_iomap_readpage(iter.inode, 1);
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_readpage_iter(&iter, &ctx, 0);
if (ret < 0)
- SetPageError(page);
+ folio_set_error(folio);
if (ctx.bio) {
submit_bio(ctx.bio);
- WARN_ON_ONCE(!ctx.cur_page_in_bio);
+ WARN_ON_ONCE(!ctx.cur_folio_in_bio);
} else {
- WARN_ON_ONCE(ctx.cur_page_in_bio);
- unlock_page(page);
+ WARN_ON_ONCE(ctx.cur_folio_in_bio);
+ folio_unlock(folio);
}
/*
@@ -363,15 +363,15 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
loff_t done, ret;
for (done = 0; done < length; done += ret) {
- if (ctx->cur_page && offset_in_page(iter->pos + done) == 0) {
- if (!ctx->cur_page_in_bio)
- unlock_page(ctx->cur_page);
- put_page(ctx->cur_page);
- ctx->cur_page = NULL;
+ if (ctx->cur_folio &&
+ offset_in_folio(ctx->cur_folio, iter->pos + done) == 0) {
+ if (!ctx->cur_folio_in_bio)
+ folio_unlock(ctx->cur_folio);
+ ctx->cur_folio = NULL;
}
- if (!ctx->cur_page) {
- ctx->cur_page = readahead_page(ctx->rac);
- ctx->cur_page_in_bio = false;
+ if (!ctx->cur_folio) {
+ ctx->cur_folio = readahead_folio(ctx->rac);
+ ctx->cur_folio_in_bio = false;
}
ret = iomap_readpage_iter(iter, ctx, done);
if (ret <= 0)
@@ -414,10 +414,9 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
if (ctx.bio)
submit_bio(ctx.bio);
- if (ctx.cur_page) {
- if (!ctx.cur_page_in_bio)
- unlock_page(ctx.cur_page);
- put_page(ctx.cur_page);
+ if (ctx.cur_folio) {
+ if (!ctx.cur_folio_in_bio)
+ folio_unlock(ctx.cur_folio);
}
}
EXPORT_SYMBOL_GPL(iomap_readahead);
--
2.33.0
* [PATCH v3 14/25] iomap: Convert iomap_page_mkwrite to use a folio
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
If we write to any page in a folio, we have to mark the entire
folio as dirty, and potentially COW the entire folio, because it'll
all get written back as one unit.
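A minimal sketch of the resulting shape (illustrative only; sketch_mkwrite is a
made-up name, not a function in this patch, and error handling is omitted):
#include <linux/mm.h>
#include <linux/pagemap.h>
/* Illustrative sketch: dirtying is tracked per-folio, not per-page. */
static vm_fault_t sketch_mkwrite(struct vm_fault *vmf)
{
	/* vmf->page may be any page of a multi-page folio */
	struct folio *folio = page_folio(vmf->page);
	folio_lock(folio);
	folio_mark_dirty(folio);	/* marks the whole folio dirty */
	folio_wait_stable(folio);	/* wait if stable pages are required */
	return VM_FAULT_LOCKED;		/* folio stays locked for the caller */
}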
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ad89c20cb741..8d7a67655b60 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -967,10 +967,9 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
}
EXPORT_SYMBOL_GPL(iomap_truncate_page);
-static loff_t iomap_page_mkwrite_iter(struct iomap_iter *iter,
- struct page *page)
+static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
+ struct folio *folio)
{
- struct folio *folio = page_folio(page);
loff_t length = iomap_length(iter);
int ret;
@@ -979,10 +978,10 @@ static loff_t iomap_page_mkwrite_iter(struct iomap_iter *iter,
&iter->iomap);
if (ret)
return ret;
- block_commit_write(page, 0, length);
+ block_commit_write(&folio->page, 0, length);
} else {
- WARN_ON_ONCE(!PageUptodate(page));
- set_page_dirty(page);
+ WARN_ON_ONCE(!folio_test_uptodate(folio));
+ folio_mark_dirty(folio);
}
return length;
@@ -994,24 +993,24 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
.inode = file_inode(vmf->vma->vm_file),
.flags = IOMAP_WRITE | IOMAP_FAULT,
};
- struct page *page = vmf->page;
+ struct folio *folio = page_folio(vmf->page);
ssize_t ret;
- lock_page(page);
- ret = page_mkwrite_check_truncate(page, iter.inode);
+ folio_lock(folio);
+ ret = folio_mkwrite_check_truncate(folio, iter.inode);
if (ret < 0)
goto out_unlock;
- iter.pos = page_offset(page);
+ iter.pos = folio_pos(folio);
iter.len = ret;
while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.processed = iomap_page_mkwrite_iter(&iter, page);
+ iter.processed = iomap_folio_mkwrite_iter(&iter, folio);
if (ret < 0)
goto out_unlock;
- wait_for_stable_page(page);
+ folio_wait_stable(folio);
return VM_FAULT_LOCKED;
out_unlock:
- unlock_page(page);
+ folio_unlock(folio);
return block_page_mkwrite_return(ret);
}
EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 15/25] iomap: Allow iomap_write_begin() to be called with the full length
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (13 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 14/25] iomap: Convert iomap_page_mkwrite " Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:43 ` Darrick J. Wong
2021-12-16 21:07 ` [PATCH v3 16/25] iomap: Convert __iomap_zero_iter to use a folio Matthew Wilcox (Oracle)
` (9 subsequent siblings)
24 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel
In the future, we want write_begin to know the entire length of the
write so that it can choose to allocate large folios. Pass the full
length in from __iomap_zero_iter() and limit it where necessary.
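A rough worked example of how the limiting plays out (numbers are illustrative,
not taken from the patch):
/*
 * Zeroing 1MiB starting at pos = 20KiB:
 *
 *   __iomap_zero_iter():   bytes = min_t(u64, UINT_MAX, 1MiB) = 1MiB
 *   iomap_write_begin():   if the mapping cannot hold large folios,
 *                          len = PAGE_SIZE - offset_in_page(20KiB) = 4KiB;
 *                          otherwise len is trimmed to the end of the
 *                          locked folio, e.g. 32KiB - 20KiB = 12KiB for
 *                          a 16KiB folio starting at offset 16KiB.
 *   back in the caller:    bytes is clamped to PAGE_SIZE - offset = 4KiB
 *                          before zero_user(), so the zeroing itself
 *                          still proceeds one page per iteration here.
 */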
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
fs/iomap/buffered-io.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8d7a67655b60..b1ded5204d1c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -619,6 +619,9 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
if (fatal_signal_pending(current))
return -EINTR;
+ if (!mapping_large_folio_support(iter->inode->i_mapping))
+ len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
+
if (page_ops && page_ops->page_prepare) {
status = page_ops->page_prepare(iter->inode, pos, len);
if (status)
@@ -632,6 +635,8 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
goto out_no_page;
}
folio = page_folio(page);
+ if (pos + len > folio_pos(folio) + folio_size(folio))
+ len = folio_pos(folio) + folio_size(folio) - pos;
if (srcmap->type == IOMAP_INLINE)
status = iomap_write_begin_inline(iter, page);
@@ -891,11 +896,13 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
struct page *page;
int status;
unsigned offset = offset_in_page(pos);
- unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
+ unsigned bytes = min_t(u64, UINT_MAX, length);
status = iomap_write_begin(iter, pos, bytes, &page);
if (status)
return status;
+ if (bytes > PAGE_SIZE - offset)
+ bytes = PAGE_SIZE - offset;
zero_user(page, offset, bytes);
mark_page_accessed(page);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH v3 15/25] iomap: Allow iomap_write_begin() to be called with the full length
2021-12-16 21:07 ` [PATCH v3 15/25] iomap: Allow iomap_write_begin() to be called with the full length Matthew Wilcox (Oracle)
@ 2021-12-16 21:43 ` Darrick J. Wong
0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2021-12-16 21:43 UTC (permalink / raw)
To: Matthew Wilcox (Oracle); +Cc: linux-xfs, linux-fsdevel, linux-kernel
On Thu, Dec 16, 2021 at 09:07:05PM +0000, Matthew Wilcox (Oracle) wrote:
> In the future, we want write_begin to know the entire length of the
> write so that it can choose to allocate large folios. Pass the full
> length in from __iomap_zero_iter() and limit it where necessary.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Seems reasonable,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
> fs/iomap/buffered-io.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 8d7a67655b60..b1ded5204d1c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -619,6 +619,9 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> if (fatal_signal_pending(current))
> return -EINTR;
>
> + if (!mapping_large_folio_support(iter->inode->i_mapping))
> + len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
> +
> if (page_ops && page_ops->page_prepare) {
> status = page_ops->page_prepare(iter->inode, pos, len);
> if (status)
> @@ -632,6 +635,8 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> goto out_no_page;
> }
> folio = page_folio(page);
> + if (pos + len > folio_pos(folio) + folio_size(folio))
> + len = folio_pos(folio) + folio_size(folio) - pos;
>
> if (srcmap->type == IOMAP_INLINE)
> status = iomap_write_begin_inline(iter, page);
> @@ -891,11 +896,13 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
> struct page *page;
> int status;
> unsigned offset = offset_in_page(pos);
> - unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
> + unsigned bytes = min_t(u64, UINT_MAX, length);
>
> status = iomap_write_begin(iter, pos, bytes, &page);
> if (status)
> return status;
> + if (bytes > PAGE_SIZE - offset)
> + bytes = PAGE_SIZE - offset;
>
> zero_user(page, offset, bytes);
> mark_page_accessed(page);
> --
> 2.33.0
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v3 16/25] iomap: Convert __iomap_zero_iter to use a folio
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (14 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 15/25] iomap: Allow iomap_write_begin() to be called with the full length Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-21 17:01 ` iomap-folio & nvdimm merge Matthew Wilcox
2021-12-16 21:07 ` [PATCH v3 17/25] iomap: Convert iomap_write_begin() and iomap_write_end() to folios Matthew Wilcox (Oracle)
` (8 subsequent siblings)
24 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
The zero iterator can work in folio-sized chunks instead of page-sized
chunks. This will save a lot of page cache lookups if the file is cached
in large folios.
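Back-of-the-envelope illustration of the saving (assuming 4KiB pages and the
range cached as a single 2MiB folio):
/*
 * Zeroing a 2MiB range:
 *
 *   page-sized chunks:   2MiB / 4KiB = 512 iomap_write_begin() calls,
 *                        i.e. 512 page cache lookups
 *   folio-sized chunks:  one call, one lookup, one folio_zero_range()
 */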
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b1ded5204d1c..47cf558244f4 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -893,19 +893,23 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
{
+ struct folio *folio;
struct page *page;
int status;
- unsigned offset = offset_in_page(pos);
+ size_t offset;
unsigned bytes = min_t(u64, UINT_MAX, length);
status = iomap_write_begin(iter, pos, bytes, &page);
if (status)
return status;
- if (bytes > PAGE_SIZE - offset)
- bytes = PAGE_SIZE - offset;
+ folio = page_folio(page);
+
+ offset = offset_in_folio(folio, pos);
+ if (bytes > folio_size(folio) - offset)
+ bytes = folio_size(folio) - offset;
- zero_user(page, offset, bytes);
- mark_page_accessed(page);
+ folio_zero_range(folio, offset, bytes);
+ folio_mark_accessed(folio);
return iomap_write_end(iter, pos, bytes, bytes, page);
}
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* iomap-folio & nvdimm merge
2021-12-16 21:07 ` [PATCH v3 16/25] iomap: Convert __iomap_zero_iter to use a folio Matthew Wilcox (Oracle)
@ 2021-12-21 17:01 ` Matthew Wilcox
2021-12-21 18:41 ` Darrick J. Wong
0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2021-12-21 17:01 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig,
Dan Williams, Stephen Rothwell
On Thu, Dec 16, 2021 at 09:07:06PM +0000, Matthew Wilcox (Oracle) wrote:
> The zero iterator can work in folio-sized chunks instead of page-sized
> chunks. This will save a lot of page cache lookups if the file is cached
> in large folios.
This patch (and a few others) end up conflicting with what Christoph did
that's now in the nvdimm tree. In an effort to make the merge cleaner,
I took the next-20211220 tag and did the following:
Revert de291b590286
Apply: https://lore.kernel.org/linux-xfs/20211221044450.517558-1-willy@infradead.org/
(these two things are likely to happen in the nvdimm tree imminently)
I then checked out iomap-folio-5.17e and added this patch:
iomap: Inline __iomap_zero_iter into its caller
To make the merge easier, replicate the inlining of __iomap_zero_iter()
into iomap_zero_iter() that is currently in the nvdimm tree.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ba80bedd9590..c6b3a148e898 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -895,27 +895,6 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
}
EXPORT_SYMBOL_GPL(iomap_file_unshare);
-static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
-{
- struct folio *folio;
- int status;
- size_t offset;
- size_t bytes = min_t(u64, SIZE_MAX, length);
-
- status = iomap_write_begin(iter, pos, bytes, &folio);
- if (status)
- return status;
-
- offset = offset_in_folio(folio, pos);
- if (bytes > folio_size(folio) - offset)
- bytes = folio_size(folio) - offset;
-
- folio_zero_range(folio, offset, bytes);
- folio_mark_accessed(folio);
-
- return iomap_write_end(iter, pos, bytes, bytes, folio);
-}
-
static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
{
struct iomap *iomap = &iter->iomap;
@@ -929,14 +908,34 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
return length;
do {
- s64 bytes;
+ struct folio *folio;
+ int status;
+ size_t offset;
+ size_t bytes = min_t(u64, SIZE_MAX, length);
+
+ if (IS_DAX(iter->inode)) {
+ s64 tmp = dax_iomap_zero(pos, bytes, iomap);
+ if (tmp < 0)
+ return tmp;
+ bytes = tmp;
+ goto good;
+ }
- if (IS_DAX(iter->inode))
- bytes = dax_iomap_zero(pos, length, iomap);
- else
- bytes = __iomap_zero_iter(iter, pos, length);
- if (bytes < 0)
- return bytes;
+ status = iomap_write_begin(iter, pos, bytes, &folio);
+ if (status)
+ return status;
+
+ offset = offset_in_folio(folio, pos);
+ if (bytes > folio_size(folio) - offset)
+ bytes = folio_size(folio) - offset;
+
+ folio_zero_range(folio, offset, bytes);
+ folio_mark_accessed(folio);
+
+ bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
+good:
+ if (WARN_ON_ONCE(bytes == 0))
+ return -EIO;
pos += bytes;
length -= bytes;
Then I did the merge, and the merge commit looks pretty sensible
afterwards:
Merge branch 'iomap-folio-5.17f' into fixup
diff --cc fs/iomap/buffered-io.c
index 955f51f94b3f,c6b3a148e898..c938bbad075e
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@@ -888,19 -908,32 +907,23 @@@ static loff_t iomap_zero_iter(struct io
return length;
do {
- unsigned offset = offset_in_page(pos);
- size_t bytes = min_t(u64, PAGE_SIZE - offset, length);
- struct page *page;
+ struct folio *folio;
int status;
+ size_t offset;
+ size_t bytes = min_t(u64, SIZE_MAX, length);
- status = iomap_write_begin(iter, pos, bytes, &page);
- if (IS_DAX(iter->inode)) {
- s64 tmp = dax_iomap_zero(pos, bytes, iomap);
- if (tmp < 0)
- return tmp;
- bytes = tmp;
- goto good;
- }
-
+ status = iomap_write_begin(iter, pos, bytes, &folio);
if (status)
return status;
- zero_user(page, offset, bytes);
- mark_page_accessed(page);
+ offset = offset_in_folio(folio, pos);
+ if (bytes > folio_size(folio) - offset)
+ bytes = folio_size(folio) - offset;
+
+ folio_zero_range(folio, offset, bytes);
+ folio_mark_accessed(folio);
- bytes = iomap_write_end(iter, pos, bytes, bytes, page);
+ bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
-good:
if (WARN_ON_ONCE(bytes == 0))
return -EIO;
Shall I push out a version of this patch series which includes the
"iomap: Inline __iomap_zero_iter into its caller" patch I pasted above?
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: iomap-folio & nvdimm merge
2021-12-21 17:01 ` iomap-folio & nvdimm merge Matthew Wilcox
@ 2021-12-21 18:41 ` Darrick J. Wong
2021-12-21 18:53 ` Matthew Wilcox
0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2021-12-21 18:41 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig,
Dan Williams, Stephen Rothwell
On Tue, Dec 21, 2021 at 05:01:34PM +0000, Matthew Wilcox wrote:
> On Thu, Dec 16, 2021 at 09:07:06PM +0000, Matthew Wilcox (Oracle) wrote:
> > The zero iterator can work in folio-sized chunks instead of page-sized
> > chunks. This will save a lot of page cache lookups if the file is cached
> > in large folios.
>
> This patch (and a few others) end up conflicting with what Christoph did
> that's now in the nvdimm tree. In an effort to make the merge cleaner,
> I took the next-20211220 tag and did the following:
>
> Revert de291b590286
> Apply: https://lore.kernel.org/linux-xfs/20211221044450.517558-1-willy@infradead.org/
> (these two things are likely to happen in the nvdimm tree imminently)
>
> I then checked out iomap-folio-5.17e and added this patch:
>
> iomap: Inline __iomap_zero_iter into its caller
>
> To make the merge easier, replicate the inlining of __iomap_zero_iter()
> into iomap_zero_iter() that is currently in the nvdimm tree.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Looks like a reasonable function promotion to me...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ba80bedd9590..c6b3a148e898 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -895,27 +895,6 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> }
> EXPORT_SYMBOL_GPL(iomap_file_unshare);
>
> -static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
> -{
> - struct folio *folio;
> - int status;
> - size_t offset;
> - size_t bytes = min_t(u64, SIZE_MAX, length);
> -
> - status = iomap_write_begin(iter, pos, bytes, &folio);
> - if (status)
> - return status;
> -
> - offset = offset_in_folio(folio, pos);
> - if (bytes > folio_size(folio) - offset)
> - bytes = folio_size(folio) - offset;
> -
> - folio_zero_range(folio, offset, bytes);
> - folio_mark_accessed(folio);
> -
> - return iomap_write_end(iter, pos, bytes, bytes, folio);
> -}
> -
> static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> {
> struct iomap *iomap = &iter->iomap;
> @@ -929,14 +908,34 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> return length;
>
> do {
> - s64 bytes;
> + struct folio *folio;
> + int status;
> + size_t offset;
> + size_t bytes = min_t(u64, SIZE_MAX, length);
> +
> + if (IS_DAX(iter->inode)) {
> + s64 tmp = dax_iomap_zero(pos, bytes, iomap);
> + if (tmp < 0)
> + return tmp;
> + bytes = tmp;
> + goto good;
> + }
>
> - if (IS_DAX(iter->inode))
> - bytes = dax_iomap_zero(pos, length, iomap);
> - else
> - bytes = __iomap_zero_iter(iter, pos, length);
> - if (bytes < 0)
> - return bytes;
> + status = iomap_write_begin(iter, pos, bytes, &folio);
> + if (status)
> + return status;
> +
> + offset = offset_in_folio(folio, pos);
> + if (bytes > folio_size(folio) - offset)
> + bytes = folio_size(folio) - offset;
> +
> + folio_zero_range(folio, offset, bytes);
> + folio_mark_accessed(folio);
> +
> + bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
> +good:
> + if (WARN_ON_ONCE(bytes == 0))
> + return -EIO;
>
> pos += bytes;
> length -= bytes;
>
>
>
> Then I did the merge, and the merge commit looks pretty sensible
> afterwards:
>
> Merge branch 'iomap-folio-5.17f' into fixup
>
> diff --cc fs/iomap/buffered-io.c
> index 955f51f94b3f,c6b3a148e898..c938bbad075e
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@@ -888,19 -908,32 +907,23 @@@ static loff_t iomap_zero_iter(struct io
> return length;
>
> do {
> - unsigned offset = offset_in_page(pos);
> - size_t bytes = min_t(u64, PAGE_SIZE - offset, length);
> - struct page *page;
> + struct folio *folio;
> int status;
> + size_t offset;
> + size_t bytes = min_t(u64, SIZE_MAX, length);
>
> - status = iomap_write_begin(iter, pos, bytes, &page);
> - if (IS_DAX(iter->inode)) {
> - s64 tmp = dax_iomap_zero(pos, bytes, iomap);
> - if (tmp < 0)
> - return tmp;
> - bytes = tmp;
> - goto good;
> - }
> -
> + status = iomap_write_begin(iter, pos, bytes, &folio);
> if (status)
> return status;
>
> - zero_user(page, offset, bytes);
> - mark_page_accessed(page);
> + offset = offset_in_folio(folio, pos);
> + if (bytes > folio_size(folio) - offset)
> + bytes = folio_size(folio) - offset;
> +
> + folio_zero_range(folio, offset, bytes);
> + folio_mark_accessed(folio);
>
> - bytes = iomap_write_end(iter, pos, bytes, bytes, page);
> + bytes = iomap_write_end(iter, pos, bytes, bytes, folio);
> -good:
Assuming I'm reading the metadiff properly, I think this merge
resolution looks correct given what both patchsets are trying to do.
> if (WARN_ON_ONCE(bytes == 0))
> return -EIO;
>
>
>
> Shall I push out a version of this patch series which includes the
> "iomap: Inline __iomap_zero_iter into its caller" patch I pasted above?
Yes.
I've been distracted for months with first a Huge Customer Escalation
and now a <embargoed>, which means that I've been and continue to be
very distracted. I /think/ there are no other iomap patches being
proposed for inclusion -- Andreas' patches were applied as fixes during
5.16-rc, Christoph's DAX refactoring is now in the nvdimm tree, and that
leaves Matthew's folios refactoring.
So seeing as (I think?) there are no other iomap patches for 5.17, if
Matthew wants to add his branch to for-next and push directly to Linus
(rather than pushing to me to push the exact same branch to Linus) I
think that would be ... better than letting it block on me. IIRC I've
RVB'd everything in the folios branch. :(
FWIW I ran the 5.17e branch through my fstests cloud and nothing fell
out, so I think it's in good enough shape to merge to for-next.
--D
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: iomap-folio & nvdimm merge
2021-12-21 18:41 ` Darrick J. Wong
@ 2021-12-21 18:53 ` Matthew Wilcox
2021-12-21 22:46 ` Stephen Rothwell
0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2021-12-21 18:53 UTC (permalink / raw)
To: Darrick J. Wong, Stephen Rothwell
Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig,
Dan Williams
On Tue, Dec 21, 2021 at 10:41:15AM -0800, Darrick J. Wong wrote:
> > iomap: Inline __iomap_zero_iter into its caller
> >
> > To make the merge easier, replicate the inlining of __iomap_zero_iter()
> > into iomap_zero_iter() that is currently in the nvdimm tree.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>
> Looks like a reasonable function promotion to me...
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Thanks, applied that to the commit.
> > Shall I push out a version of this patch series which includes the
> > "iomap: Inline __iomap_zero_iter into its caller" patch I pasted above?
>
> Yes.
>
> I've been distracted for months with first a Huge Customer Escalation
> and now a <embargoed>, which means that I've been and continue to be
> very distracted. I /think/ there are no other iomap patches being
> proposed for inclusion -- Andreas' patches were applied as fixes during
> 5.16-rc, Christoph's DAX refactoring is now in the nvdimm tree, and that
> leaves Matthew's folios refactoring.
>
> So seeing as (I think?) there are no other iomap patches for 5.17, if
> Matthew wants to add his branch to for-next and push directly to Linus
> (rather than pushing to me to push the exact same branch to Linus) I
> think that would be ... better than letting it block on me. IIRC I've
> RVB'd everything in the folios branch. :(
>
> FWIW I ran the 5.17e branch through my fstests cloud and nothing fell
> out, so I think it's in good enough shape to merge to for-next.
Glad to hear it passed that thorough testing. Stephen, please pick
up a new tree (hopefully just temporarily until Darrick can swim to
the surface):
git://git.infradead.org/users/willy/linux.git folio-iomap
Hopefully the previous message will give you enough context for
the merge conflict resolution.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: iomap-folio & nvdimm merge
2021-12-21 18:53 ` Matthew Wilcox
@ 2021-12-21 22:46 ` Stephen Rothwell
0 siblings, 0 replies; 51+ messages in thread
From: Stephen Rothwell @ 2021-12-21 22:46 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, Dan Williams
Hi Matthew,
On Tue, 21 Dec 2021 18:53:25 +0000 Matthew Wilcox <willy@infradead.org> wrote:
>
> Glad to hear it passed that thorough testing. Stephen, please pick
> up a new tree (hopefully just temporarily until Darrick can swim to
> the surface):
>
> git://git.infradead.org/users/willy/linux.git folio-iomap
>
> Hopefully the previous message will give you enough context for
> the merge conflict resolution.
I have added that after the folio tree today.
Thanks for adding your subsystem tree as a participant of linux-next. As
you may know, this is not a judgement of your code. The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window.
You will need to ensure that the patches/commits in your tree/series have
been:
* submitted under GPL v2 (or later) and include the Contributor's
Signed-off-by,
* posted to the relevant mailing list,
* reviewed by you (or another maintainer of your subsystem tree),
* successfully unit tested, and
* destined for the current or next Linux merge window.
Basically, this should be just what you would send to Linus (or ask him
to fetch). It is allowed to be rebased if you deem it necessary.
--
Cheers,
Stephen Rothwell
sfr@canb.auug.org.au
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH v3 17/25] iomap: Convert iomap_write_begin() and iomap_write_end() to folios
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (15 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 16/25] iomap: Convert __iomap_zero_iter to use a folio Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 18/25] iomap: Convert iomap_write_end_inline to take a folio Matthew Wilcox (Oracle)
` (7 subsequent siblings)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
These functions still only work in PAGE_SIZE chunks, but there are
fewer conversions from tail to head pages as a result of this patch.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 73 ++++++++++++++++++++----------------------
1 file changed, 34 insertions(+), 39 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 47cf558244f4..975a4a6aeca0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -550,9 +550,8 @@ static int iomap_read_folio_sync(loff_t block_start, struct folio *folio,
}
static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
- unsigned len, struct page *page)
+ size_t len, struct folio *folio)
{
- struct folio *folio = page_folio(page);
const struct iomap *srcmap = iomap_iter_srcmap(iter);
struct iomap_page *iop = iomap_page_create(iter->inode, folio);
loff_t block_size = i_blocksize(iter->inode);
@@ -593,10 +592,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
}
static int iomap_write_begin_inline(const struct iomap_iter *iter,
- struct page *page)
+ struct folio *folio)
{
- struct folio *folio = page_folio(page);
-
/* needs more work for the tailpacking case; disable for now */
if (WARN_ON_ONCE(iomap_iter_srcmap(iter)->offset != 0))
return -EIO;
@@ -604,12 +601,12 @@ static int iomap_write_begin_inline(const struct iomap_iter *iter,
}
static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
- unsigned len, struct page **pagep)
+ size_t len, struct folio **foliop)
{
const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
- struct page *page;
struct folio *folio;
+ unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
int status = 0;
BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
@@ -620,7 +617,7 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
return -EINTR;
if (!mapping_large_folio_support(iter->inode->i_mapping))
- len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
+ len = min(len, PAGE_SIZE - offset_in_page(pos));
if (page_ops && page_ops->page_prepare) {
status = page_ops->page_prepare(iter->inode, pos, len);
@@ -628,32 +625,31 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
return status;
}
- page = grab_cache_page_write_begin(iter->inode->i_mapping,
- pos >> PAGE_SHIFT, AOP_FLAG_NOFS);
- if (!page) {
+ folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
+ fgp, mapping_gfp_mask(iter->inode->i_mapping));
+ if (!folio) {
status = -ENOMEM;
goto out_no_page;
}
- folio = page_folio(page);
if (pos + len > folio_pos(folio) + folio_size(folio))
len = folio_pos(folio) + folio_size(folio) - pos;
if (srcmap->type == IOMAP_INLINE)
- status = iomap_write_begin_inline(iter, page);
+ status = iomap_write_begin_inline(iter, folio);
else if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
status = __block_write_begin_int(folio, pos, len, NULL, srcmap);
else
- status = __iomap_write_begin(iter, pos, len, page);
+ status = __iomap_write_begin(iter, pos, len, folio);
if (unlikely(status))
goto out_unlock;
- *pagep = page;
+ *foliop = folio;
return 0;
out_unlock:
- unlock_page(page);
- put_page(page);
+ folio_unlock(folio);
+ folio_put(folio);
iomap_write_failed(iter->inode, pos, len);
out_no_page:
@@ -663,11 +659,10 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
}
static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
- size_t copied, struct page *page)
+ size_t copied, struct folio *folio)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = to_iomap_page(folio);
- flush_dcache_page(page);
+ flush_dcache_folio(folio);
/*
* The blocks that were entirely written will now be uptodate, so we
@@ -680,10 +675,10 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
* non-uptodate page as a zero-length write, and force the caller to
* redo the whole thing.
*/
- if (unlikely(copied < len && !PageUptodate(page)))
+ if (unlikely(copied < len && !folio_test_uptodate(folio)))
return 0;
iomap_set_range_uptodate(folio, iop, offset_in_folio(folio, pos), len);
- __set_page_dirty_nobuffers(page);
+ filemap_dirty_folio(inode->i_mapping, folio);
return copied;
}
@@ -707,7 +702,7 @@ static size_t iomap_write_end_inline(const struct iomap_iter *iter,
/* Returns the number of bytes copied. May be 0. Cannot be an errno. */
static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
- size_t copied, struct page *page)
+ size_t copied, struct folio *folio)
{
const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
const struct iomap *srcmap = iomap_iter_srcmap(iter);
@@ -715,12 +710,12 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
size_t ret;
if (srcmap->type == IOMAP_INLINE) {
- ret = iomap_write_end_inline(iter, page, pos, copied);
+ ret = iomap_write_end_inline(iter, &folio->page, pos, copied);
} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
- copied, page, NULL);
+ copied, &folio->page, NULL);
} else {
- ret = __iomap_write_end(iter->inode, pos, len, copied, page);
+ ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
}
/*
@@ -732,13 +727,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
i_size_write(iter->inode, pos + ret);
iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
}
- unlock_page(page);
+ folio_unlock(folio);
if (old_size < pos)
pagecache_isize_extended(iter->inode, old_size, pos);
if (page_ops && page_ops->page_done)
- page_ops->page_done(iter->inode, pos, ret, page);
- put_page(page);
+ page_ops->page_done(iter->inode, pos, ret, &folio->page);
+ folio_put(folio);
if (ret < len)
iomap_write_failed(iter->inode, pos, len);
@@ -753,6 +748,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
long status = 0;
do {
+ struct folio *folio;
struct page *page;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
@@ -776,16 +772,17 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
break;
}
- status = iomap_write_begin(iter, pos, bytes, &page);
+ status = iomap_write_begin(iter, pos, bytes, &folio);
if (unlikely(status))
break;
+ page = folio_file_page(folio, pos >> PAGE_SHIFT);
if (mapping_writably_mapped(iter->inode->i_mapping))
flush_dcache_page(page);
copied = copy_page_from_iter_atomic(page, offset, bytes, i);
- status = iomap_write_end(iter, pos, bytes, copied, page);
+ status = iomap_write_end(iter, pos, bytes, copied, folio);
if (unlikely(copied != status))
iov_iter_revert(i, copied - status);
@@ -851,13 +848,13 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
do {
unsigned long offset = offset_in_page(pos);
unsigned long bytes = min_t(loff_t, PAGE_SIZE - offset, length);
- struct page *page;
+ struct folio *folio;
- status = iomap_write_begin(iter, pos, bytes, &page);
+ status = iomap_write_begin(iter, pos, bytes, &folio);
if (unlikely(status))
return status;
- status = iomap_write_end(iter, pos, bytes, bytes, page);
+ status = iomap_write_end(iter, pos, bytes, bytes, folio);
if (WARN_ON_ONCE(status == 0))
return -EIO;
@@ -894,15 +891,13 @@ EXPORT_SYMBOL_GPL(iomap_file_unshare);
static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
{
struct folio *folio;
- struct page *page;
int status;
size_t offset;
- unsigned bytes = min_t(u64, UINT_MAX, length);
+ size_t bytes = min_t(u64, SIZE_MAX, length);
- status = iomap_write_begin(iter, pos, bytes, &page);
+ status = iomap_write_begin(iter, pos, bytes, &folio);
if (status)
return status;
- folio = page_folio(page);
offset = offset_in_folio(folio, pos);
if (bytes > folio_size(folio) - offset)
@@ -911,7 +906,7 @@ static s64 __iomap_zero_iter(struct iomap_iter *iter, loff_t pos, u64 length)
folio_zero_range(folio, offset, bytes);
folio_mark_accessed(folio);
- return iomap_write_end(iter, pos, bytes, bytes, page);
+ return iomap_write_end(iter, pos, bytes, bytes, folio);
}
static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 18/25] iomap: Convert iomap_write_end_inline to take a folio
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (16 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 17/25] iomap: Convert iomap_write_begin() and iomap_write_end() to folios Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 19/25] iomap,xfs: Convert ->discard_page to ->discard_folio Matthew Wilcox (Oracle)
` (6 subsequent siblings)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
This conversion is only safe because iomap only supports writes to inline
data which starts at the beginning of the file.
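The subtle part is the switch from kmap_local_page(page) + pos to
kmap_local_folio(folio, pos); a sketch of the reasoning:
/*
 * kmap_local_folio(folio, offset) takes a byte offset within the
 * folio, while 'pos' here is a file position.  The two only agree
 * because inline data must start at file offset 0, so the folio being
 * written has folio_pos(folio) == 0 and therefore
 *
 *	offset_in_folio(folio, pos) == pos
 *
 * (iomap_write_begin_inline() already warns if the source mapping's
 * offset is not 0).
 */
addr = kmap_local_folio(folio, pos);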
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/iomap/buffered-io.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 975a4a6aeca0..b2bf7f337e75 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -683,16 +683,16 @@ static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
}
static size_t iomap_write_end_inline(const struct iomap_iter *iter,
- struct page *page, loff_t pos, size_t copied)
+ struct folio *folio, loff_t pos, size_t copied)
{
const struct iomap *iomap = &iter->iomap;
void *addr;
- WARN_ON_ONCE(!PageUptodate(page));
+ WARN_ON_ONCE(!folio_test_uptodate(folio));
BUG_ON(!iomap_inline_data_valid(iomap));
- flush_dcache_page(page);
- addr = kmap_local_page(page) + pos;
+ flush_dcache_folio(folio);
+ addr = kmap_local_folio(folio, pos);
memcpy(iomap_inline_data(iomap, pos), addr, copied);
kunmap_local(addr);
@@ -710,7 +710,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
size_t ret;
if (srcmap->type == IOMAP_INLINE) {
- ret = iomap_write_end_inline(iter, &folio->page, pos, copied);
+ ret = iomap_write_end_inline(iter, folio, pos, copied);
} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
copied, &folio->page, NULL);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 19/25] iomap,xfs: Convert ->discard_page to ->discard_folio
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (17 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 18/25] iomap: Convert iomap_write_end_inline to take a folio Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 20/25] iomap: Simplify iomap_writepage_map() Matthew Wilcox (Oracle)
` (5 subsequent siblings)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
XFS has the only implementation of ->discard_page today, so convert it
to use folios in the same patch as converting the API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 4 ++--
fs/xfs/xfs_aops.c | 24 ++++++++++++------------
include/linux/iomap.h | 2 +-
3 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b2bf7f337e75..ceb7fbe4e7c9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1360,8 +1360,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* won't be affected by I/O completion and we must unlock it
* now.
*/
- if (wpc->ops->discard_page)
- wpc->ops->discard_page(page, file_offset);
+ if (wpc->ops->discard_folio)
+ wpc->ops->discard_folio(folio, file_offset);
if (!count) {
ClearPageUptodate(page);
unlock_page(page);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index c8c15c3c3147..4098a9875c5b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -437,37 +437,37 @@ xfs_prepare_ioend(
* see a ENOSPC in writeback).
*/
static void
-xfs_discard_page(
- struct page *page,
- loff_t fileoff)
+xfs_discard_folio(
+ struct folio *folio,
+ loff_t pos)
{
- struct inode *inode = page->mapping->host;
+ struct inode *inode = folio->mapping->host;
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
- unsigned int pageoff = offset_in_page(fileoff);
- xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff);
- xfs_fileoff_t pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
+ size_t offset = offset_in_folio(folio, pos);
+ xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, pos);
+ xfs_fileoff_t pageoff_fsb = XFS_B_TO_FSBT(mp, offset);
int error;
if (xfs_is_shutdown(mp))
goto out_invalidate;
xfs_alert_ratelimited(mp,
- "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
- page, ip->i_ino, fileoff);
+ "page discard on page "PTR_FMT", inode 0x%llx, pos %llu.",
+ folio, ip->i_ino, pos);
error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
- i_blocks_per_page(inode, page) - pageoff_fsb);
+ i_blocks_per_folio(inode, folio) - pageoff_fsb);
if (error && !xfs_is_shutdown(mp))
xfs_alert(mp, "page discard unable to remove delalloc mapping.");
out_invalidate:
- iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
+ iomap_invalidate_folio(folio, offset, folio_size(folio) - offset);
}
static const struct iomap_writeback_ops xfs_writeback_ops = {
.map_blocks = xfs_map_blocks,
.prepare_ioend = xfs_prepare_ioend,
- .discard_page = xfs_discard_page,
+ .discard_folio = xfs_discard_folio,
};
STATIC int
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 29491fb9c5ba..5ef5088dbbd8 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -285,7 +285,7 @@ struct iomap_writeback_ops {
* Optional, allows the file system to discard state on a page where
* we failed to submit any I/O.
*/
- void (*discard_page)(struct page *page, loff_t fileoff);
+ void (*discard_folio)(struct folio *folio, loff_t pos);
};
struct iomap_writepage_ctx {
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 20/25] iomap: Simplify iomap_writepage_map()
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (18 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 19/25] iomap,xfs: Convert ->discard_page to ->discard_folio Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 21/25] iomap: Simplify iomap_do_writepage() Matthew Wilcox (Oracle)
` (4 subsequent siblings)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Rename end_offset to end_pos and file_offset to pos to match the rest
of the file. Simplify the loop by calculating nblocks up front instead
of each time around the loop.
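An example of the up-front calculation with illustrative numbers (1KiB
filesystem blocks, a 4KiB folio at pos 0, end_pos = 4KiB):
/*
 *   len     = i_blocksize(inode)               = 1KiB
 *   nblocks = i_blocks_per_folio(inode, folio) = 4
 *
 * so the loop visits blocks 0..3, still skipping any block whose
 * uptodate bit is clear in the iomap_page; only the loop bounds are
 * now computed once instead of on every iteration.
 */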
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ceb7fbe4e7c9..20e7087aa75c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1307,37 +1307,36 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
static int
iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
- struct page *page, u64 end_offset)
+ struct page *page, u64 end_pos)
{
struct folio *folio = page_folio(page);
struct iomap_page *iop = iomap_page_create(inode, folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
- u64 file_offset; /* file offset of page */
+ unsigned nblocks = i_blocks_per_folio(inode, folio);
+ u64 pos = folio_pos(folio);
int error = 0, count = 0, i;
LIST_HEAD(submit_list);
WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
/*
- * Walk through the page to find areas to write back. If we run off the
- * end of the current map or find the current map invalid, grab a new
- * one.
+ * Walk through the folio to find areas to write back. If we
+ * run off the end of the current map or find the current map
+ * invalid, grab a new one.
*/
- for (i = 0, file_offset = page_offset(page);
- i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
- i++, file_offset += len) {
+ for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
if (iop && !test_bit(i, iop->uptodate))
continue;
- error = wpc->ops->map_blocks(wpc, inode, file_offset);
+ error = wpc->ops->map_blocks(wpc, inode, pos);
if (error)
break;
if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
continue;
if (wpc->iomap.type == IOMAP_HOLE)
continue;
- iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
+ iomap_add_to_ioend(inode, pos, page, iop, wpc, wbc,
&submit_list);
count++;
}
@@ -1361,7 +1360,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* now.
*/
if (wpc->ops->discard_folio)
- wpc->ops->discard_folio(folio, file_offset);
+ wpc->ops->discard_folio(folio, pos);
if (!count) {
ClearPageUptodate(page);
unlock_page(page);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 21/25] iomap: Simplify iomap_do_writepage()
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (19 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 20/25] iomap: Simplify iomap_writepage_map() Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 22/25] iomap: Convert iomap_add_to_ioend() to take a folio Matthew Wilcox (Oracle)
` (3 subsequent siblings)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Rename end_offset to end_pos and offset_into_page to poff to match the
rest of the file. Simplify the handling of the last page straddling
i_size by doing the EOF check based on the byte granularity i_size
instead of converting to a pgoff prematurely.
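A worked example of the straddling case with illustrative numbers (4KiB pages,
i_size = 10KiB, writing back the page at index 2, i.e. bytes 8KiB..12KiB):
/*
 *   end_pos   = page_offset(page) + PAGE_SIZE = 12KiB  >  isize = 10KiB
 *   poff      = offset_in_page(isize)         = 2KiB
 *   end_index = isize >> PAGE_SHIFT           = 2
 *
 * index 2 == end_index and poff != 0, so the page is not skipped:
 * bytes 2KiB..4KiB of the page are zeroed and end_pos is clamped to
 * isize (10KiB) before calling iomap_writepage_map().  A page at
 * index 3, or index 2 with i_size exactly 8KiB, would be redirtied
 * and skipped instead.
 */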
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 20e7087aa75c..8bfdda8e5f1c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1408,9 +1408,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
{
struct iomap_writepage_ctx *wpc = data;
struct inode *inode = page->mapping->host;
- pgoff_t end_index;
- u64 end_offset;
- loff_t offset;
+ u64 end_pos, isize;
trace_iomap_writepage(inode, page_offset(page), PAGE_SIZE);
@@ -1441,11 +1439,9 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* | desired writeback range | see else |
* ---------------------------------^------------------|
*/
- offset = i_size_read(inode);
- end_index = offset >> PAGE_SHIFT;
- if (page->index < end_index)
- end_offset = (loff_t)(page->index + 1) << PAGE_SHIFT;
- else {
+ isize = i_size_read(inode);
+ end_pos = page_offset(page) + PAGE_SIZE;
+ if (end_pos > isize) {
/*
* Check whether the page to write out is beyond or straddles
* i_size or not.
@@ -1457,7 +1453,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* | | Straddles |
* ---------------------------------^-----------|--------|
*/
- unsigned offset_into_page = offset & (PAGE_SIZE - 1);
+ size_t poff = offset_in_page(isize);
+ pgoff_t end_index = isize >> PAGE_SHIFT;
/*
* Skip the page if it's fully outside i_size, e.g. due to a
@@ -1477,7 +1474,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* offset is just equal to the EOF.
*/
if (page->index > end_index ||
- (page->index == end_index && offset_into_page == 0))
+ (page->index == end_index && poff == 0))
goto redirty;
/*
@@ -1488,13 +1485,13 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* memory is zeroed when mapped, and writes to that region are
* not written out to the file."
*/
- zero_user_segment(page, offset_into_page, PAGE_SIZE);
+ zero_user_segment(page, poff, PAGE_SIZE);
/* Adjust the end_offset to the end of file */
- end_offset = offset;
+ end_pos = isize;
}
- return iomap_writepage_map(wpc, wbc, inode, page, end_offset);
+ return iomap_writepage_map(wpc, wbc, inode, page, end_pos);
redirty:
redirty_page_for_writepage(wbc, page);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 22/25] iomap: Convert iomap_add_to_ioend() to take a folio
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (20 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 21/25] iomap: Simplify iomap_do_writepage() Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 23/25] iomap: Convert iomap_migrate_page() to use folios Matthew Wilcox (Oracle)
` (2 subsequent siblings)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
We still iterate one block at a time, but now we call compound_head()
less often.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 70 ++++++++++++++++++++----------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8bfdda8e5f1c..a36695db6f9d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1263,29 +1263,29 @@ iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t offset,
* first; otherwise finish off the current ioend and start another.
*/
static void
-iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
+iomap_add_to_ioend(struct inode *inode, loff_t pos, struct folio *folio,
struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct list_head *iolist)
{
- sector_t sector = iomap_sector(&wpc->iomap, offset);
+ sector_t sector = iomap_sector(&wpc->iomap, pos);
unsigned len = i_blocksize(inode);
- unsigned poff = offset & (PAGE_SIZE - 1);
+ size_t poff = offset_in_folio(folio, pos);
- if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, offset, sector)) {
+ if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos, sector)) {
if (wpc->ioend)
list_add(&wpc->ioend->io_list, iolist);
- wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
+ wpc->ioend = iomap_alloc_ioend(inode, wpc, pos, sector, wbc);
}
- if (bio_add_page(wpc->ioend->io_bio, page, len, poff) != len) {
+ if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) {
wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio);
- __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+ bio_add_folio(wpc->ioend->io_bio, folio, len, poff);
}
if (iop)
atomic_add(len, &iop->write_bytes_pending);
wpc->ioend->io_size += len;
- wbc_account_cgroup_owner(wbc, page, len);
+ wbc_account_cgroup_owner(wbc, &folio->page, len);
}
/*
@@ -1307,9 +1307,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
static int
iomap_writepage_map(struct iomap_writepage_ctx *wpc,
struct writeback_control *wbc, struct inode *inode,
- struct page *page, u64 end_pos)
+ struct folio *folio, u64 end_pos)
{
- struct folio *folio = page_folio(page);
struct iomap_page *iop = iomap_page_create(inode, folio);
struct iomap_ioend *ioend, *next;
unsigned len = i_blocksize(inode);
@@ -1336,15 +1335,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
continue;
if (wpc->iomap.type == IOMAP_HOLE)
continue;
- iomap_add_to_ioend(inode, pos, page, iop, wpc, wbc,
+ iomap_add_to_ioend(inode, pos, folio, iop, wpc, wbc,
&submit_list);
count++;
}
WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
- WARN_ON_ONCE(!PageLocked(page));
- WARN_ON_ONCE(PageWriteback(page));
- WARN_ON_ONCE(PageDirty(page));
+ WARN_ON_ONCE(!folio_test_locked(folio));
+ WARN_ON_ONCE(folio_test_writeback(folio));
+ WARN_ON_ONCE(folio_test_dirty(folio));
/*
* We cannot cancel the ioend directly here on error. We may have
@@ -1362,14 +1361,14 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
if (wpc->ops->discard_folio)
wpc->ops->discard_folio(folio, pos);
if (!count) {
- ClearPageUptodate(page);
- unlock_page(page);
+ folio_clear_uptodate(folio);
+ folio_unlock(folio);
goto done;
}
}
- set_page_writeback(page);
- unlock_page(page);
+ folio_start_writeback(folio);
+ folio_unlock(folio);
/*
* Preserve the original error if there was one; catch
@@ -1390,9 +1389,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
* with a partial page truncate on a sub-page block sized filesystem.
*/
if (!count)
- end_page_writeback(page);
+ folio_end_writeback(folio);
done:
- mapping_set_error(page->mapping, error);
+ mapping_set_error(folio->mapping, error);
return error;
}
@@ -1406,14 +1405,15 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
static int
iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
{
+ struct folio *folio = page_folio(page);
struct iomap_writepage_ctx *wpc = data;
- struct inode *inode = page->mapping->host;
+ struct inode *inode = folio->mapping->host;
u64 end_pos, isize;
- trace_iomap_writepage(inode, page_offset(page), PAGE_SIZE);
+ trace_iomap_writepage(inode, folio_pos(folio), folio_size(folio));
/*
- * Refuse to write the page out if we're called from reclaim context.
+ * Refuse to write the folio out if we're called from reclaim context.
*
* This avoids stack overflows when called from deeply used stacks in
* random callers for direct reclaim or memcg reclaim. We explicitly
@@ -1427,10 +1427,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
goto redirty;
/*
- * Is this page beyond the end of the file?
+ * Is this folio beyond the end of the file?
*
- * The page index is less than the end_index, adjust the end_offset
- * to the highest offset that this page should represent.
+ * The folio index is less than the end_index, adjust the end_pos
+ * to the highest offset that this folio should represent.
* -----------------------------------------------------
* | file mapping | <EOF> |
* -----------------------------------------------------
@@ -1440,7 +1440,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* ---------------------------------^------------------|
*/
isize = i_size_read(inode);
- end_pos = page_offset(page) + PAGE_SIZE;
+ end_pos = folio_pos(folio) + folio_size(folio);
if (end_pos > isize) {
/*
* Check whether the page to write out is beyond or straddles
@@ -1453,7 +1453,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* | | Straddles |
* ---------------------------------^-----------|--------|
*/
- size_t poff = offset_in_page(isize);
+ size_t poff = offset_in_folio(folio, isize);
pgoff_t end_index = isize >> PAGE_SHIFT;
/*
@@ -1473,8 +1473,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* checking if the page is totally beyond i_size or if its
* offset is just equal to the EOF.
*/
- if (page->index > end_index ||
- (page->index == end_index && poff == 0))
+ if (folio->index > end_index ||
+ (folio->index == end_index && poff == 0))
goto redirty;
/*
@@ -1485,17 +1485,15 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
* memory is zeroed when mapped, and writes to that region are
* not written out to the file."
*/
- zero_user_segment(page, poff, PAGE_SIZE);
-
- /* Adjust the end_offset to the end of file */
+ folio_zero_segment(folio, poff, folio_size(folio));
end_pos = isize;
}
- return iomap_writepage_map(wpc, wbc, inode, page, end_pos);
+ return iomap_writepage_map(wpc, wbc, inode, folio, end_pos);
redirty:
- redirty_page_for_writepage(wbc, page);
- unlock_page(page);
+ folio_redirty_for_writepage(wbc, folio);
+ folio_unlock(folio);
return 0;
}
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 23/25] iomap: Convert iomap_migrate_page() to use folios
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (21 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 22/25] iomap: Convert iomap_add_to_ioend() to take a folio Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 24/25] iomap: Support large folios in invalidatepage Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 25/25] xfs: Support large folios Matthew Wilcox (Oracle)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
The arguments are still pages for now, but we can use folios internally
and cut out a lot of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a36695db6f9d..b2f6e5991eb0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -504,19 +504,21 @@ int
iomap_migrate_page(struct address_space *mapping, struct page *newpage,
struct page *page, enum migrate_mode mode)
{
+ struct folio *folio = page_folio(page);
+ struct folio *newfolio = page_folio(newpage);
int ret;
- ret = migrate_page_move_mapping(mapping, newpage, page, 0);
+ ret = folio_migrate_mapping(mapping, newfolio, folio, 0);
if (ret != MIGRATEPAGE_SUCCESS)
return ret;
- if (page_has_private(page))
- attach_page_private(newpage, detach_page_private(page));
+ if (folio_test_private(folio))
+ folio_attach_private(newfolio, folio_detach_private(folio));
if (mode != MIGRATE_SYNC_NO_COPY)
- migrate_page_copy(newpage, page);
+ folio_migrate_copy(newfolio, folio);
else
- migrate_page_states(newpage, page);
+ folio_migrate_flags(newfolio, folio);
return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL_GPL(iomap_migrate_page);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH v3 24/25] iomap: Support large folios in invalidatepage
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (22 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 23/25] iomap: Convert iomap_migrate_page() to use folios Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2021-12-16 21:07 ` [PATCH v3 25/25] xfs: Support large folios Matthew Wilcox (Oracle)
24 siblings, 0 replies; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
If we're punching a hole in a large folio, we need to remove the
per-folio iomap data as the folio is about to be split and each page will
need its own. If a dirty folio is only partially-uptodate, the iomap
data contains the information about which blocks cannot be written back,
so assert that a dirty folio is fully uptodate.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/iomap/buffered-io.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b2f6e5991eb0..36ccb903db92 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -481,13 +481,18 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
trace_iomap_invalidatepage(folio->mapping->host, offset, len);
/*
- * If we're invalidating the entire page, clear the dirty state from it
- * and release it to avoid unnecessary buildup of the LRU.
+ * If we're invalidating the entire folio, clear the dirty state
+ * from it and release it to avoid unnecessary buildup of the LRU.
*/
if (offset == 0 && len == folio_size(folio)) {
WARN_ON_ONCE(folio_test_writeback(folio));
folio_cancel_dirty(folio);
iomap_page_release(folio);
+ } else if (folio_test_large(folio)) {
+ /* Must release the iop so the page can be split */
+ WARN_ON_ONCE(!folio_test_uptodate(folio) &&
+ folio_test_dirty(folio));
+ iomap_page_release(folio);
}
}
EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread* [PATCH v3 25/25] xfs: Support large folios
2021-12-16 21:06 [PATCH v3 00/25] iomap/xfs folio patches Matthew Wilcox (Oracle)
` (23 preceding siblings ...)
2021-12-16 21:07 ` [PATCH v3 24/25] iomap: Support large folios in invalidatepage Matthew Wilcox (Oracle)
@ 2021-12-16 21:07 ` Matthew Wilcox (Oracle)
2022-06-23 0:42 ` Darrick J. Wong
24 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox (Oracle) @ 2021-12-16 21:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
Now that iomap has been converted, XFS is large folio safe.
Indicate to the VFS that it can now create large folios for XFS.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_icache.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index da4af2142a2b..cdc39f576ca1 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -87,6 +87,7 @@ xfs_inode_alloc(
/* VFS doesn't initialise i_mode or i_state! */
VFS_I(ip)->i_mode = 0;
VFS_I(ip)->i_state = 0;
+ mapping_set_large_folios(VFS_I(ip)->i_mapping);
XFS_STATS_INC(mp, vn_active);
ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -320,6 +321,7 @@ xfs_reinit_inode(
inode->i_rdev = dev;
inode->i_uid = uid;
inode->i_gid = gid;
+ mapping_set_large_folios(inode->i_mapping);
return error;
}
--
2.33.0
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH v3 25/25] xfs: Support large folios
2021-12-16 21:07 ` [PATCH v3 25/25] xfs: Support large folios Matthew Wilcox (Oracle)
@ 2022-06-23 0:42 ` Darrick J. Wong
2022-06-27 4:15 ` Darrick J. Wong
0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2022-06-23 0:42 UTC (permalink / raw)
To: Matthew Wilcox (Oracle)
Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig
[resend with shorter 522.out file to keep us under the 300k maximum]
On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> Now that iomap has been converted, XFS is large folio safe.
> Indicate to the VFS that it can now create large folios for XFS.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_icache.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index da4af2142a2b..cdc39f576ca1 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -87,6 +87,7 @@ xfs_inode_alloc(
> /* VFS doesn't initialise i_mode or i_state! */
> VFS_I(ip)->i_mode = 0;
> VFS_I(ip)->i_state = 0;
> + mapping_set_large_folios(VFS_I(ip)->i_mapping);
>
> XFS_STATS_INC(mp, vn_active);
> ASSERT(atomic_read(&ip->i_pincount) == 0);
> @@ -320,6 +321,7 @@ xfs_reinit_inode(
> inode->i_rdev = dev;
> inode->i_uid = uid;
> inode->i_gid = gid;
> + mapping_set_large_folios(inode->i_mapping);
Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
reports file corruption after 20 minutes of runtime. The corruption is
surprisingly reproducible (522.out.bad attached below) in that I ran it
three times and always got the same bad offset (0x6e000) and always the
same opcode (6213798(166 mod 256) MAPREAD).
I turned off multipage folios and now 522 has run for over an hour
without problems, so before I go do more debugging, does this ring a
bell to anyone?
[addendum: Apparently vger now has a 300K message size limit; if you
want the full output, see https://djwong.org/docs/522.out.txt ]
--D
QA output created by 522
READ BAD DATA: offset = 0x69e3e, size = 0x1c922, fname = /mnt/junk
OFFSET GOOD BAD RANGE
0x6e000 0x0000 0x9173 0x00000
operation# (mod 256) for the bad data may be 145
0x6e001 0x0000 0x7391 0x00001
operation# (mod 256) for the bad data may be 145
0x6e002 0x0000 0x9195 0x00002
operation# (mod 256) for the bad data may be 145
0x6e003 0x0000 0x9591 0x00003
operation# (mod 256) for the bad data may be 145
0x6e004 0x0000 0x91b5 0x00004
operation# (mod 256) for the bad data may be 145
0x6e005 0x0000 0xb591 0x00005
operation# (mod 256) for the bad data may be 145
0x6e006 0x0000 0x91e2 0x00006
operation# (mod 256) for the bad data may be 145
0x6e007 0x0000 0xe291 0x00007
operation# (mod 256) for the bad data may be 145
0x6e008 0x0000 0x919d 0x00008
operation# (mod 256) for the bad data may be 145
0x6e009 0x0000 0x9d91 0x00009
operation# (mod 256) for the bad data may be 145
0x6e00a 0x0000 0x91e8 0x0000a
operation# (mod 256) for the bad data may be 145
0x6e00b 0x0000 0xe891 0x0000b
operation# (mod 256) for the bad data may be 145
0x6e00c 0x0000 0x91c9 0x0000c
operation# (mod 256) for the bad data may be 145
0x6e00d 0x0000 0xc991 0x0000d
operation# (mod 256) for the bad data may be 145
0x6e00e 0x0000 0x9147 0x0000e
operation# (mod 256) for the bad data may be 145
0x6e00f 0x0000 0x4791 0x0000f
operation# (mod 256) for the bad data may be 145
LOG DUMP (6213798 total operations):
<snip>
6213732(100 mod 256): COLLAPSE 0x3b000 thru 0x4efff (0x14000 bytes)
6213733(101 mod 256): READ 0x1953d thru 0x29311 (0xfdd5 bytes)
6213734(102 mod 256): INSERT 0x14000 thru 0x2ffff (0x1c000 bytes)
6213735(103 mod 256): COPY 0x1d381 thru 0x36d38 (0x199b8 bytes) to 0x64491 thru 0x7de48 ******EEEE
6213736(104 mod 256): ZERO 0x74247 thru 0x927bf (0x1e579 bytes)
6213737(105 mod 256): INSERT 0x8000 thru 0x16fff (0xf000 bytes)
6213738(106 mod 256): READ 0x87aba thru 0x8ce48 (0x538f bytes)
6213739(107 mod 256): TRUNCATE DOWN from 0x8ce49 to 0x46571 ******WWWW
6213740(108 mod 256): SKIPPED (no operation)
6213741(109 mod 256): ZERO 0x55674 thru 0x70d41 (0x1b6ce bytes) ******ZZZZ
6213742(110 mod 256): PUNCH 0xc8b5 thru 0xe80d (0x1f59 bytes)
6213743(111 mod 256): TRUNCATE DOWN from 0x70d42 to 0x11ade ******WWWW
6213744(112 mod 256): COLLAPSE 0x6000 thru 0xffff (0xa000 bytes)
6213745(113 mod 256): SKIPPED (no operation)
6213746(114 mod 256): MAPREAD 0x2625 thru 0x7add (0x54b9 bytes)
6213747(115 mod 256): CLONE 0x2000 thru 0x6fff (0x5000 bytes) to 0x10000 thru 0x14fff
6213748(116 mod 256): SKIPPED (no operation)
6213749(117 mod 256): TRUNCATE UP from 0x15000 to 0x8d131 ******WWWW
6213750(118 mod 256): WRITE 0x82547 thru 0x88334 (0x5dee bytes)
6213751(119 mod 256): DEDUPE 0x7d000 thru 0x83fff (0x7000 bytes) to 0x22000 thru 0x28fff
6213752(120 mod 256): READ 0x11e69 thru 0x2864c (0x167e4 bytes)
6213753(121 mod 256): INSERT 0x41000 thru 0x45fff (0x5000 bytes)
6213754(122 mod 256): COPY 0x2ca4c thru 0x2ed9f (0x2354 bytes) to 0x2fef1 thru 0x32244
6213755(123 mod 256): MAPWRITE 0x70677 thru 0x8b993 (0x1b31d bytes)
6213756(124 mod 256): FALLOC 0x7229f thru 0x91158 (0x1eeb9 bytes) INTERIOR
6213757(125 mod 256): COLLAPSE 0x13000 thru 0x2bfff (0x19000 bytes)
6213758(126 mod 256): COPY 0x9271 thru 0xba34 (0x27c4 bytes) to 0x3227c thru 0x34a3f
6213759(127 mod 256): CLONE 0x23000 thru 0x2cfff (0xa000 bytes) to 0x6c000 thru 0x75fff ******JJJJ
6213760(128 mod 256): READ 0x44cff thru 0x4c4a1 (0x77a3 bytes)
6213761(129 mod 256): DEDUPE 0x60000 thru 0x73fff (0x14000 bytes) to 0x39000 thru 0x4cfff BBBB******
6213762(130 mod 256): COLLAPSE 0x39000 thru 0x3ffff (0x7000 bytes)
6213763(131 mod 256): WRITE 0x57565 thru 0x5e710 (0x71ac bytes)
6213764(132 mod 256): MAPREAD 0x39c49 thru 0x4accd (0x11085 bytes)
6213765(133 mod 256): ZERO 0x4faf5 thru 0x6a5cc (0x1aad8 bytes)
6213766(134 mod 256): MAPREAD 0x57f8 thru 0x8c98 (0x34a1 bytes)
6213767(135 mod 256): MAPREAD 0x5cbd8 thru 0x72130 (0x15559 bytes) ***RRRR***
6213768(136 mod 256): SKIPPED (no operation)
6213769(137 mod 256): INSERT 0x24000 thru 0x32fff (0xf000 bytes)
6213770(138 mod 256): COPY 0x32b0c thru 0x4d035 (0x1a52a bytes) to 0x4f97f thru 0x69ea8
6213771(139 mod 256): DEDUPE 0x3f000 thru 0x52fff (0x14000 bytes) to 0x23000 thru 0x36fff
6213772(140 mod 256): READ 0x6d9bf thru 0x81130 (0x13772 bytes) ***RRRR***
6213773(141 mod 256): TRUNCATE DOWN from 0x81131 to 0x569c0 ******WWWW
6213774(142 mod 256): MAPREAD 0x354d5 thru 0x44e7b (0xf9a7 bytes)
6213775(143 mod 256): MAPWRITE 0x547c4 thru 0x60a8e (0xc2cb bytes)
6213776(144 mod 256): SKIPPED (no operation)
6213777(145 mod 256): WRITE 0x28ada thru 0x4356c (0x1aa93 bytes)
6213778(146 mod 256): ZERO 0x74c28 thru 0x91fec (0x1d3c5 bytes)
6213779(147 mod 256): INSERT 0x12000 thru 0x1cfff (0xb000 bytes)
6213780(148 mod 256): ZERO 0x30834 thru 0x330f7 (0x28c4 bytes)
6213781(149 mod 256): PUNCH 0x36080 thru 0x42edc (0xce5d bytes)
6213782(150 mod 256): DEDUPE 0x14000 thru 0x19fff (0x6000 bytes) to 0x49000 thru 0x4efff
6213783(151 mod 256): DEDUPE 0x51000 thru 0x5efff (0xe000 bytes) to 0x2a000 thru 0x37fff
6213784(152 mod 256): WRITE 0x2448e thru 0x400f5 (0x1bc68 bytes)
6213785(153 mod 256): ZERO 0x87615 thru 0x927bf (0xb1ab bytes)
6213786(154 mod 256): READ 0x5afc thru 0xa32c (0x4831 bytes)
6213787(155 mod 256): SKIPPED (no operation)
6213788(156 mod 256): ZERO 0x7aab0 thru 0x7e2b3 (0x3804 bytes)
6213789(157 mod 256): INSERT 0x45000 thru 0x58fff (0x14000 bytes)
6213790(158 mod 256): FALLOC 0x1a80e thru 0x289a3 (0xe195 bytes) INTERIOR
6213791(159 mod 256): SKIPPED (no operation)
6213792(160 mod 256): SKIPPED (no operation)
6213793(161 mod 256): FALLOC 0x2aca thru 0x20562 (0x1da98 bytes) INTERIOR
6213794(162 mod 256): ZERO 0x72fb9 thru 0x75887 (0x28cf bytes)
6213795(163 mod 256): COPY 0xa62e thru 0x218d0 (0x172a3 bytes) to 0x28ab1 thru 0x3fd53
6213796(164 mod 256): SKIPPED (no operation)
6213797(165 mod 256): COPY 0xa666 thru 0xf6a1 (0x503c bytes) to 0x353f0 thru 0x3a42b
6213798(166 mod 256): MAPREAD 0x69e3e thru 0x8675f (0x1c922 bytes) ***RRRR***
Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
Silence is golden
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 25/25] xfs: Support large folios
2022-06-23 0:42 ` Darrick J. Wong
@ 2022-06-27 4:15 ` Darrick J. Wong
2022-06-27 14:10 ` Matthew Wilcox
2022-06-27 22:07 ` [PATCH v3 25/25] xfs: Support large folios Dave Chinner
0 siblings, 2 replies; 51+ messages in thread
From: Darrick J. Wong @ 2022-06-27 4:15 UTC (permalink / raw)
To: Matthew Wilcox (Oracle)
Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig
On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> [resend with shorter 522.out file to keep us under the 300k maximum]
>
> On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > Now that iomap has been converted, XFS is large folio safe.
> > Indicate to the VFS that it can now create large folios for XFS.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/xfs_icache.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index da4af2142a2b..cdc39f576ca1 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > /* VFS doesn't initialise i_mode or i_state! */
> > VFS_I(ip)->i_mode = 0;
> > VFS_I(ip)->i_state = 0;
> > + mapping_set_large_folios(VFS_I(ip)->i_mapping);
> >
> > XFS_STATS_INC(mp, vn_active);
> > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > inode->i_rdev = dev;
> > inode->i_uid = uid;
> > inode->i_gid = gid;
> > + mapping_set_large_folios(inode->i_mapping);
>
> Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> reports file corruption after 20 minutes of runtime. The corruption is
> surprisingly reproducible (522.out.bad attached below) in that I ran it
> three times and always got the same bad offset (0x6e000) and always the
> same opcode (6213798(166 mod 256) MAPREAD).
>
> I turned off multipage folios and now 522 has run for over an hour
> without problems, so before I go do more debugging, does this ring a
> bell to anyone?
I tried bisecting, but that didn't yield anything productive and
5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
run without problems for at least 3-4 days after reverting this patch
from -rc3.
So I guess I have a blunt force fix if we can't figure this one out
before 5.19 final, but I'd really rather not. Will keep trying this
week.
--D
> [addendum: Apparently vger now has a 300K message size limit; if you
> want the full output, see https://djwong.org/docs/522.out.txt ]
>
> --D
>
> QA output created by 522
> READ BAD DATA: offset = 0x69e3e, size = 0x1c922, fname = /mnt/junk
> OFFSET GOOD BAD RANGE
> 0x6e000 0x0000 0x9173 0x00000
> operation# (mod 256) for the bad data may be 145
> 0x6e001 0x0000 0x7391 0x00001
> operation# (mod 256) for the bad data may be 145
> 0x6e002 0x0000 0x9195 0x00002
> operation# (mod 256) for the bad data may be 145
> 0x6e003 0x0000 0x9591 0x00003
> operation# (mod 256) for the bad data may be 145
> 0x6e004 0x0000 0x91b5 0x00004
> operation# (mod 256) for the bad data may be 145
> 0x6e005 0x0000 0xb591 0x00005
> operation# (mod 256) for the bad data may be 145
> 0x6e006 0x0000 0x91e2 0x00006
> operation# (mod 256) for the bad data may be 145
> 0x6e007 0x0000 0xe291 0x00007
> operation# (mod 256) for the bad data may be 145
> 0x6e008 0x0000 0x919d 0x00008
> operation# (mod 256) for the bad data may be 145
> 0x6e009 0x0000 0x9d91 0x00009
> operation# (mod 256) for the bad data may be 145
> 0x6e00a 0x0000 0x91e8 0x0000a
> operation# (mod 256) for the bad data may be 145
> 0x6e00b 0x0000 0xe891 0x0000b
> operation# (mod 256) for the bad data may be 145
> 0x6e00c 0x0000 0x91c9 0x0000c
> operation# (mod 256) for the bad data may be 145
> 0x6e00d 0x0000 0xc991 0x0000d
> operation# (mod 256) for the bad data may be 145
> 0x6e00e 0x0000 0x9147 0x0000e
> operation# (mod 256) for the bad data may be 145
> 0x6e00f 0x0000 0x4791 0x0000f
> operation# (mod 256) for the bad data may be 145
> LOG DUMP (6213798 total operations):
>
> <snip>
>
> 6213732(100 mod 256): COLLAPSE 0x3b000 thru 0x4efff (0x14000 bytes)
> 6213733(101 mod 256): READ 0x1953d thru 0x29311 (0xfdd5 bytes)
> 6213734(102 mod 256): INSERT 0x14000 thru 0x2ffff (0x1c000 bytes)
> 6213735(103 mod 256): COPY 0x1d381 thru 0x36d38 (0x199b8 bytes) to 0x64491 thru 0x7de48 ******EEEE
> 6213736(104 mod 256): ZERO 0x74247 thru 0x927bf (0x1e579 bytes)
> 6213737(105 mod 256): INSERT 0x8000 thru 0x16fff (0xf000 bytes)
> 6213738(106 mod 256): READ 0x87aba thru 0x8ce48 (0x538f bytes)
> 6213739(107 mod 256): TRUNCATE DOWN from 0x8ce49 to 0x46571 ******WWWW
> 6213740(108 mod 256): SKIPPED (no operation)
> 6213741(109 mod 256): ZERO 0x55674 thru 0x70d41 (0x1b6ce bytes) ******ZZZZ
> 6213742(110 mod 256): PUNCH 0xc8b5 thru 0xe80d (0x1f59 bytes)
> 6213743(111 mod 256): TRUNCATE DOWN from 0x70d42 to 0x11ade ******WWWW
> 6213744(112 mod 256): COLLAPSE 0x6000 thru 0xffff (0xa000 bytes)
> 6213745(113 mod 256): SKIPPED (no operation)
> 6213746(114 mod 256): MAPREAD 0x2625 thru 0x7add (0x54b9 bytes)
> 6213747(115 mod 256): CLONE 0x2000 thru 0x6fff (0x5000 bytes) to 0x10000 thru 0x14fff
> 6213748(116 mod 256): SKIPPED (no operation)
> 6213749(117 mod 256): TRUNCATE UP from 0x15000 to 0x8d131 ******WWWW
> 6213750(118 mod 256): WRITE 0x82547 thru 0x88334 (0x5dee bytes)
> 6213751(119 mod 256): DEDUPE 0x7d000 thru 0x83fff (0x7000 bytes) to 0x22000 thru 0x28fff
> 6213752(120 mod 256): READ 0x11e69 thru 0x2864c (0x167e4 bytes)
> 6213753(121 mod 256): INSERT 0x41000 thru 0x45fff (0x5000 bytes)
> 6213754(122 mod 256): COPY 0x2ca4c thru 0x2ed9f (0x2354 bytes) to 0x2fef1 thru 0x32244
> 6213755(123 mod 256): MAPWRITE 0x70677 thru 0x8b993 (0x1b31d bytes)
> 6213756(124 mod 256): FALLOC 0x7229f thru 0x91158 (0x1eeb9 bytes) INTERIOR
> 6213757(125 mod 256): COLLAPSE 0x13000 thru 0x2bfff (0x19000 bytes)
> 6213758(126 mod 256): COPY 0x9271 thru 0xba34 (0x27c4 bytes) to 0x3227c thru 0x34a3f
> 6213759(127 mod 256): CLONE 0x23000 thru 0x2cfff (0xa000 bytes) to 0x6c000 thru 0x75fff ******JJJJ
> 6213760(128 mod 256): READ 0x44cff thru 0x4c4a1 (0x77a3 bytes)
> 6213761(129 mod 256): DEDUPE 0x60000 thru 0x73fff (0x14000 bytes) to 0x39000 thru 0x4cfff BBBB******
> 6213762(130 mod 256): COLLAPSE 0x39000 thru 0x3ffff (0x7000 bytes)
> 6213763(131 mod 256): WRITE 0x57565 thru 0x5e710 (0x71ac bytes)
> 6213764(132 mod 256): MAPREAD 0x39c49 thru 0x4accd (0x11085 bytes)
> 6213765(133 mod 256): ZERO 0x4faf5 thru 0x6a5cc (0x1aad8 bytes)
> 6213766(134 mod 256): MAPREAD 0x57f8 thru 0x8c98 (0x34a1 bytes)
> 6213767(135 mod 256): MAPREAD 0x5cbd8 thru 0x72130 (0x15559 bytes) ***RRRR***
> 6213768(136 mod 256): SKIPPED (no operation)
> 6213769(137 mod 256): INSERT 0x24000 thru 0x32fff (0xf000 bytes)
> 6213770(138 mod 256): COPY 0x32b0c thru 0x4d035 (0x1a52a bytes) to 0x4f97f thru 0x69ea8
> 6213771(139 mod 256): DEDUPE 0x3f000 thru 0x52fff (0x14000 bytes) to 0x23000 thru 0x36fff
> 6213772(140 mod 256): READ 0x6d9bf thru 0x81130 (0x13772 bytes) ***RRRR***
> 6213773(141 mod 256): TRUNCATE DOWN from 0x81131 to 0x569c0 ******WWWW
> 6213774(142 mod 256): MAPREAD 0x354d5 thru 0x44e7b (0xf9a7 bytes)
> 6213775(143 mod 256): MAPWRITE 0x547c4 thru 0x60a8e (0xc2cb bytes)
> 6213776(144 mod 256): SKIPPED (no operation)
> 6213777(145 mod 256): WRITE 0x28ada thru 0x4356c (0x1aa93 bytes)
> 6213778(146 mod 256): ZERO 0x74c28 thru 0x91fec (0x1d3c5 bytes)
> 6213779(147 mod 256): INSERT 0x12000 thru 0x1cfff (0xb000 bytes)
> 6213780(148 mod 256): ZERO 0x30834 thru 0x330f7 (0x28c4 bytes)
> 6213781(149 mod 256): PUNCH 0x36080 thru 0x42edc (0xce5d bytes)
> 6213782(150 mod 256): DEDUPE 0x14000 thru 0x19fff (0x6000 bytes) to 0x49000 thru 0x4efff
> 6213783(151 mod 256): DEDUPE 0x51000 thru 0x5efff (0xe000 bytes) to 0x2a000 thru 0x37fff
> 6213784(152 mod 256): WRITE 0x2448e thru 0x400f5 (0x1bc68 bytes)
> 6213785(153 mod 256): ZERO 0x87615 thru 0x927bf (0xb1ab bytes)
> 6213786(154 mod 256): READ 0x5afc thru 0xa32c (0x4831 bytes)
> 6213787(155 mod 256): SKIPPED (no operation)
> 6213788(156 mod 256): ZERO 0x7aab0 thru 0x7e2b3 (0x3804 bytes)
> 6213789(157 mod 256): INSERT 0x45000 thru 0x58fff (0x14000 bytes)
> 6213790(158 mod 256): FALLOC 0x1a80e thru 0x289a3 (0xe195 bytes) INTERIOR
> 6213791(159 mod 256): SKIPPED (no operation)
> 6213792(160 mod 256): SKIPPED (no operation)
> 6213793(161 mod 256): FALLOC 0x2aca thru 0x20562 (0x1da98 bytes) INTERIOR
> 6213794(162 mod 256): ZERO 0x72fb9 thru 0x75887 (0x28cf bytes)
> 6213795(163 mod 256): COPY 0xa62e thru 0x218d0 (0x172a3 bytes) to 0x28ab1 thru 0x3fd53
> 6213796(164 mod 256): SKIPPED (no operation)
> 6213797(165 mod 256): COPY 0xa666 thru 0xf6a1 (0x503c bytes) to 0x353f0 thru 0x3a42b
> 6213798(166 mod 256): MAPREAD 0x69e3e thru 0x8675f (0x1c922 bytes) ***RRRR***
> Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
> Correct content saved for comparison
> (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
> Silence is golden
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 25/25] xfs: Support large folios
2022-06-27 4:15 ` Darrick J. Wong
@ 2022-06-27 14:10 ` Matthew Wilcox
2022-06-27 22:16 ` Darrick J. Wong
2022-06-27 22:07 ` [PATCH v3 25/25] xfs: Support large folios Dave Chinner
1 sibling, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2022-06-27 14:10 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig
On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> > [resend with shorter 522.out file to keep us under the 300k maximum]
> >
> > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > > Now that iomap has been converted, XFS is large folio safe.
> > > Indicate to the VFS that it can now create large folios for XFS.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/xfs_icache.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index da4af2142a2b..cdc39f576ca1 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > > /* VFS doesn't initialise i_mode or i_state! */
> > > VFS_I(ip)->i_mode = 0;
> > > VFS_I(ip)->i_state = 0;
> > > + mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > >
> > > XFS_STATS_INC(mp, vn_active);
> > > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > > inode->i_rdev = dev;
> > > inode->i_uid = uid;
> > > inode->i_gid = gid;
> > > + mapping_set_large_folios(inode->i_mapping);
> >
> > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> > reports file corruption after 20 minutes of runtime. The corruption is
> > surprisingly reproducible (522.out.bad attached below) in that I ran it
> > three times and always got the same bad offset (0x6e000) and always the
> > same opcode (6213798(166 mod 256) MAPREAD).
> >
> > I turned off multipage folios and now 522 has run for over an hour
> > without problems, so before I go do more debugging, does this ring a
> > bell to anyone?
>
> I tried bisecting, but that didn't yield anything productive and
> 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
> run without problems for at least 3-4 days after reverting this patch
> from -rc3.
>
> So I guess I have a blunt force fix if we can't figure this one out
> before 5.19 final, but I'd really rather not. Will keep trying this
> week.
I'm on holiday for the next week, so I'm not going to be able to spend
any time on this until then. I have a suspicion that this may be the
same bug Zorro is seeing here:
https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/
At least I hope it is, and finding a folio that has been freed would
explain (apparent) file corruption.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 25/25] xfs: Support large folios
2022-06-27 14:10 ` Matthew Wilcox
@ 2022-06-27 22:16 ` Darrick J. Wong
2022-06-27 23:35 ` Dave Chinner
2022-06-28 7:31 ` Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios) Dave Chinner
0 siblings, 2 replies; 51+ messages in thread
From: Darrick J. Wong @ 2022-06-27 22:16 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-xfs, linux-fsdevel, linux-kernel, Christoph Hellwig
On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote:
> On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> > > [resend with shorter 522.out file to keep us under the 300k maximum]
> > >
> > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > > > Now that iomap has been converted, XFS is large folio safe.
> > > > Indicate to the VFS that it can now create large folios for XFS.
> > > >
> > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > > fs/xfs/xfs_icache.c | 2 ++
> > > > 1 file changed, 2 insertions(+)
> > > >
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index da4af2142a2b..cdc39f576ca1 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > > > /* VFS doesn't initialise i_mode or i_state! */
> > > > VFS_I(ip)->i_mode = 0;
> > > > VFS_I(ip)->i_state = 0;
> > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > >
> > > > XFS_STATS_INC(mp, vn_active);
> > > > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > > > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > > > inode->i_rdev = dev;
> > > > inode->i_uid = uid;
> > > > inode->i_gid = gid;
> > > > + mapping_set_large_folios(inode->i_mapping);
> > >
> > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> > > reports file corruption after 20 minutes of runtime. The corruption is
> > > surprisingly reproducible (522.out.bad attached below) in that I ran it
> > > three times and always got the same bad offset (0x6e000) and always the
> > > same opcode (6213798(166 mod 256) MAPREAD).
> > >
> > > I turned off multipage folios and now 522 has run for over an hour
> > > without problems, so before I go do more debugging, does this ring a
> > > bell to anyone?
> >
> > I tried bisecting, but that didn't yield anything productive and
> > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
> > run without problems for at least 3-4 days after reverting this patch
> > from -rc3.
> >
> > So I guess I have a blunt force fix if we can't figure this one out
> > before 5.19 final, but I'd really rather not. Will keep trying this
> > week.
>
> I'm on holiday for the next week, so I'm not going to be able to spend
> any time on this until then. I have a suspicion that this may be the
> same bug Zorro is seeing here:
>
> https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/
>
> At least I hope it is, and finding a folio that has been freed would
> explain (apparent) file corruption.
Hm. I suppose it /could/ be a lost folio getting into the works
somewhere.
Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8
minutes instead of 20-25). That has helped me to narrow things down a
little more:
- Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away,
but it does make reading the fsx log much easier.
- Turning off clone/dedupe (either via -J -B or formatting with -m
reflink=0) makes the problem go away completely. If you define
letting fsx run for 90 minutes as "completely".
- Neutering vfs_dedupe_file_range_compare by replacing it with an -EBADE
return doesn't affect the reproducibility, so it's not the comparison
function misbehaving.
- I modified fsx.c so that when there's file corruption, it'll report
both the first 16 bytes of corruption as well as every corruption that
happens on a page boundary.
- I also modified run_fsx() to diff the good and junk files, and
complain about any corruption happening on a page boundary. Now I see
things like this:
2153984( 0 mod 256): SKIPPED (no operation)
2153985( 1 mod 256): DEDUPE 0xf000 thru 0x23fff (0x15000 bytes) to 0x2a000 thru 0x3efff ******BBBB
2153986( 2 mod 256): COPY 0xe794 thru 0x2ae41 (0x1c6ae bytes) to 0x60ac4 thru 0x7d171
2153987( 3 mod 256): TRUNCATE DOWN from 0x7d172 to 0x535da
2153988( 4 mod 256): SKIPPED (no operation)
2153989( 5 mod 256): MAPREAD 0x40b93 thru 0x535d9 (0x12a47 bytes)
2153990( 6 mod 256): COPY 0x5edd thru 0x20282 (0x1a3a6 bytes) to 0x3a9aa thru 0x54d4f
2153991( 7 mod 256): SKIPPED (no operation)
2153992( 8 mod 256): SKIPPED (no operation)
2153993( 9 mod 256): ZERO 0x542d3 thru 0x67006 (0x12d34 bytes)
2153994( 10 mod 256): COPY 0x42cf6 thru 0x538a7 (0x10bb2 bytes) to 0x23fe7 thru 0x34b98 ******EEEE
2153995( 11 mod 256): MAPWRITE 0x5a1fc thru 0x6b067 (0x10e6c bytes)
2153996( 12 mod 256): SKIPPED (no operation)
2153997( 13 mod 256): CLONE 0x38000 thru 0x38fff (0x1000 bytes) to 0x77000 thru 0x77fff
2153998( 14 mod 256): FALLOC 0x49bdd thru 0x62a55 (0x18e78 bytes) INTERIOR
2153999( 15 mod 256): CLONE 0xf000 thru 0x1bfff (0xd000 bytes) to 0x2c000 thru 0x38fff ******JJJJ
Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
junk file 177
-02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f
-02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3
-030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34
+02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
+02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
+030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
When I remount the test filesystem, I see further corruption:
$ diff -Naurp <(od -tx1 -Ax -c $TEST_DIR/junk.fsxgood) <(od -tx1 -Ax -c $TEST_DIR/junk) | grep '^[+-]0..000'
-011000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f
-012000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3
-013000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34
+011000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
+012000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
+013000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
-02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f
-02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3
-030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34
+02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
+02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
+030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
This is really quite strange! The same corruption patterns we saw at
pages 0x2e - 0x30 now appear at 0x11-0x13 after the remount!
By comparison, the junk.fsxgood file only contains this 77/db/f1
sequence at:
$ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '77 db f1'
008530 db 34 db 77 db f1 db ba db 01 db d5 db 9c db 4d
03d000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
Curiously, the same byte trios at 0x2f000 and 0x30000 have similar
repetitions at similar looking offsets:
$ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep 'b3 d8 35'
009530 d8 bb d8 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b
03e000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
$ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '23 d8 c8'
00a530 d8 f5 d8 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e
03f000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
Though the only pattern that happens consistently is that garbage bytes
end up at the reflink dest, and later at the reflink source. I never
see any VM_BUG_ON_FOLIO asserts, nor does KASAN report anything.
I also added a debug function to dump the folios it finds in the
pagecache for the fsx junk file, but nothing looks odd:
522-5099 [001] 491.954659: console: [U] FSX FAILURE
xfs_io-5125 [002] 491.961232: console: XFS (sda): EXPERIMENTAL online scrub feature in use. Use at your own risk!
xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x0 pfn 0x515cc order 0
xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x1000 pfn 0x515cd order 0
xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x2000 pfn 0x515ce order 0
xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x3000 pfn 0x515cf order 0
xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x4000 pfn 0x50c48 order 0
xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x5000 pfn 0x50c49 order 0
xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x6000 pfn 0x50c4a order 0
xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x7000 pfn 0xc8a8 order 0
xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x8000 pfn 0x50988 order 2
xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0xc000 pfn 0x509e0 order 2
xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x10000 pfn 0x4db64 order 2
xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x14000 pfn 0x50c4c order 0
xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x15000 pfn 0x12485 order 0
xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x16000 pfn 0x50c4d order 0
xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x17000 pfn 0x50c4e order 0
xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x18000 pfn 0x4eef8 order 2
xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x1c000 pfn 0x4eefc order 2
xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x20000 pfn 0x4eef0 order 2
xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x24000 pfn 0x50f2c order 2
xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x28000 pfn 0x50f28 order 2
xfs_io-5125 [002] 491.961246: bprint: filemap_dump: ino 0xb1 pos 0x2c000 pfn 0x50f24 order 2
xfs_io-5125 [002] 491.961246: bprint: filemap_dump: ino 0xb1 pos 0x30000 pfn 0x50f20 order 2
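A minimal sketch of the kind of pagecache dump helper used above, assuming
a plain XArray walk (the function name, the trace format and the call site
are guesses here, since the actual debug patch wasn't posted), might be:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/xarray.h>

/* Walk every folio cached for @mapping and print its position and order. */
static void filemap_dump(struct address_space *mapping)
{
	XA_STATE(xas, &mapping->i_pages, 0);
	struct folio *folio;

	rcu_read_lock();
	xas_for_each(&xas, folio, ULONG_MAX) {
		if (xas_retry(&xas, folio))
			continue;
		/* Skip shadow/swap value entries; they are not folios. */
		if (xa_is_value(folio))
			continue;
		trace_printk("filemap_dump: ino 0x%lx pos 0x%llx pfn 0x%lx order %u\n",
			     mapping->host->i_ino,
			     (unsigned long long)folio_pos(folio),
			     folio_pfn(folio), folio_order(folio));
	}
	rcu_read_unlock();
}

An RCU walk like this is only a point-in-time snapshot, but that is all a
debug dump needs.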
I'll keep digging...
--D
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH v3 25/25] xfs: Support large folios
2022-06-27 22:16 ` Darrick J. Wong
@ 2022-06-27 23:35 ` Dave Chinner
2022-06-28 7:31 ` Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios) Dave Chinner
1 sibling, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2022-06-27 23:35 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
On Mon, Jun 27, 2022 at 03:16:19PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote:
> > On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> > > > [resend with shorter 522.out file to keep us under the 300k maximum]
> > > >
> > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > > > > Now that iomap has been converted, XFS is large folio safe.
> > > > > Indicate to the VFS that it can now create large folios for XFS.
> > > > >
> > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > > fs/xfs/xfs_icache.c | 2 ++
> > > > > 1 file changed, 2 insertions(+)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index da4af2142a2b..cdc39f576ca1 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > > > > /* VFS doesn't initialise i_mode or i_state! */
> > > > > VFS_I(ip)->i_mode = 0;
> > > > > VFS_I(ip)->i_state = 0;
> > > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > > >
> > > > > XFS_STATS_INC(mp, vn_active);
> > > > > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > > > > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > > > > inode->i_rdev = dev;
> > > > > inode->i_uid = uid;
> > > > > inode->i_gid = gid;
> > > > > + mapping_set_large_folios(inode->i_mapping);
> > > >
> > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> > > > reports file corruption after 20 minutes of runtime. The corruption is
> > > > surprisingly reproducible (522.out.bad attached below) in that I ran it
> > > > three times and always got the same bad offset (0x6e000) and always the
> > > > same opcode (6213798(166 mod 256) MAPREAD).
> > > >
> > > > I turned off multipage folios and now 522 has run for over an hour
> > > > without problems, so before I go do more debugging, does this ring a
> > > > bell to anyone?
> > >
> > > I tried bisecting, but that didn't yield anything productive and
> > > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
> > > run without problems for at least 3-4 days after reverting this patch
> > > from -rc3.
> > >
> > > So I guess I have a blunt force fix if we can't figure this one out
> > > before 5.19 final, but I'd really rather not. Will keep trying this
> > > week.
> >
> > I'm on holiday for the next week, so I'm not going to be able to spend
> > any time on this until then. I have a suspicion that this may be the
> > same bug Zorro is seeing here:
> >
> > https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/
> >
> > At least I hope it is, and finding a folio that has been freed would
> > explain (apparent) file corruption.
>
> Hm. I suppose it /could/ be a lost folio getting into the works
> somewhere.
>
> Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8
> minutes instead of 20-25). That has helped me to narrow things down a
> little more:
>
> - Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away,
> but it does make reading the fsx log much easier.
>
> - Turning off clone/dedupe (either via -J -B or formatting with -m
> reflink=0) makes the problem go away completely. If you define
> letting fsx run for 90 minutes as "completely".
>
> - Neutering vfs_dedupe_file_range_compare by replacing it with an -EBADE
> return doesn't affect the reproducibility, so it's not the comparison
> function misbehaving.
> - I modified fsx.c so that when there's file corruption, it'll report
> both the first 16 bytes of corruption as well as every corruption that
> happens on a page boundary.
>
> - I also modified run_fsx() to diff the good and junk files, and
> complain about any corruption happening on a page boundary. Now I see
> things like this:
>
> 2153984( 0 mod 256): SKIPPED (no operation)
> 2153985( 1 mod 256): DEDUPE 0xf000 thru 0x23fff (0x15000 bytes) to 0x2a000 thru 0x3efff ******BBBB
> 2153986( 2 mod 256): COPY 0xe794 thru 0x2ae41 (0x1c6ae bytes) to 0x60ac4 thru 0x7d171
> 2153987( 3 mod 256): TRUNCATE DOWN from 0x7d172 to 0x535da
> 2153988( 4 mod 256): SKIPPED (no operation)
> 2153989( 5 mod 256): MAPREAD 0x40b93 thru 0x535d9 (0x12a47 bytes)
> 2153990( 6 mod 256): COPY 0x5edd thru 0x20282 (0x1a3a6 bytes) to 0x3a9aa thru 0x54d4f
> 2153991( 7 mod 256): SKIPPED (no operation)
> 2153992( 8 mod 256): SKIPPED (no operation)
> 2153993( 9 mod 256): ZERO 0x542d3 thru 0x67006 (0x12d34 bytes)
> 2153994( 10 mod 256): COPY 0x42cf6 thru 0x538a7 (0x10bb2 bytes) to 0x23fe7 thru 0x34b98 ******EEEE
> 2153995( 11 mod 256): MAPWRITE 0x5a1fc thru 0x6b067 (0x10e6c bytes)
> 2153996( 12 mod 256): SKIPPED (no operation)
> 2153997( 13 mod 256): CLONE 0x38000 thru 0x38fff (0x1000 bytes) to 0x77000 thru 0x77fff
> 2153998( 14 mod 256): FALLOC 0x49bdd thru 0x62a55 (0x18e78 bytes) INTERIOR
> 2153999( 15 mod 256): CLONE 0xf000 thru 0x1bfff (0xd000 bytes) to 0x2c000 thru 0x38fff ******JJJJ
> Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
> Correct content saved for comparison
> (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
> junk file 177
> -02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f
> -02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3
> -030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34
> +02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
> +02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
> +030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
>
> When I remount the test filesystem, I see further corruption:
>
> $ diff -Naurp <(od -tx1 -Ax -c $TEST_DIR/junk.fsxgood) <(od -tx1 -Ax -c $TEST_DIR/junk) | grep '^[+-]0..000'
> -011000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f
> -012000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3
> -013000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34
> +011000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
> +012000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
> +013000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
> -02e000 ec 20 ec 5a ec 78 ec b2 ec e6 ec 1e ec 43 ec 0f
> -02f000 ec 30 ec 32 ec 4c ec ac ec 5c ec d2 ec 62 ec d3
> -030000 ec 73 ec ce ec 8c ec cb ec 94 ec 59 ec 81 ec 34
> +02e000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
> +02f000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
> +030000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
>
> This is really quite strange! The same corruption patterns we saw at
> pages 0x2e - 0x30 now appear at 0x11-0x13 after the remount!
Hmmmm - look at what the last operation before failure
does - it clones 0xf000-0x1bfff to 0x2c000-0x38fff. IOWs, those
ranges *should* be identical and the corruption is actually
occurring at 0x11000-0x13fff. It's not until that range gets cloned
to 0x2e000-0x30fff that the corruption is detected.
So we're looking in the wrong spot for the page cache corruption -
we need to be looking at operations over the range 0x11000-0x13fff
for misbehaviour, not where fsx detected the corrupt data.
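(Working the offsets through that last clone: the destination starts at
0x2c000 and the source at 0xf000, so the bad page reported at 0x2e000 maps
back to 0x2e000 - 0x2c000 + 0xf000 = 0x11000 in the source, and 0x30000
maps back to 0x13000 - exactly the 0x11000-0x13fff range above.)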
> By comparison, the junk.fsxgood file only contains this 77/db/f1
> sequence at:
>
> $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '77 db f1'
> 008530 db 34 db 77 db f1 db ba db 01 db d5 db 9c db 4d
> 03d000 77 db f1 db ba db 01 db d5 db 9c db 4d db de db
>
> Curiously, the same byte trios at 0x2f000 and 0x30000 have similar
> repetitions at similar looking offsets:
>
> $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep 'b3 d8 35'
> 009530 d8 bb d8 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b
> 03e000 b3 d8 35 d8 e2 d8 bb d8 a4 d8 c8 d8 5b d8 83 d8
> $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '23 d8 c8'
> 00a530 d8 f5 d8 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e
> 03f000 23 d8 c8 d8 22 d8 da d8 97 d8 e0 d8 7e d8 61 d8
>
> Though the only pattern that happens consistently is that garbage bytes
> end up at the reflink dest, and later at the reflink source. I never
> see any VM_BUG_ON_FOLIO asserts, nor does KASAN report anything.
Smells like the page cache over the clone source is not getting
marked dirty and/or flushed to disk correctly before the clone is
run. It then shares the extent with stale data to the new location
(the destination), which then fails the contents validation.
Do we have a case where we are only writing back the head page of
the multipage folio?
> I also added a debug function to dump the folios it finds in the
> pagecache for the fsx junk file, but nothing looks odd:
>
> 522-5099 [001] 491.954659: console: [U] FSX FAILURE
> xfs_io-5125 [002] 491.961232: console: XFS (sda): EXPERIMENTAL online scrub feature in use. Use at your own risk!
> xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x0 pfn 0x515cc order 0
> xfs_io-5125 [002] 491.961238: bprint: filemap_dump: ino 0xb1 pos 0x1000 pfn 0x515cd order 0
> xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x2000 pfn 0x515ce order 0
> xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x3000 pfn 0x515cf order 0
> xfs_io-5125 [002] 491.961239: bprint: filemap_dump: ino 0xb1 pos 0x4000 pfn 0x50c48 order 0
> xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x5000 pfn 0x50c49 order 0
> xfs_io-5125 [002] 491.961240: bprint: filemap_dump: ino 0xb1 pos 0x6000 pfn 0x50c4a order 0
> xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x7000 pfn 0xc8a8 order 0
> xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0x8000 pfn 0x50988 order 2
> xfs_io-5125 [002] 491.961241: bprint: filemap_dump: ino 0xb1 pos 0xc000 pfn 0x509e0 order 2
> xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x10000 pfn 0x4db64 order 2
So this is the folio that likely has the problem (the source)...
> xfs_io-5125 [002] 491.961242: bprint: filemap_dump: ino 0xb1 pos 0x14000 pfn 0x50c4c order 0
> xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x15000 pfn 0x12485 order 0
> xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x16000 pfn 0x50c4d order 0
> xfs_io-5125 [002] 491.961243: bprint: filemap_dump: ino 0xb1 pos 0x17000 pfn 0x50c4e order 0
> xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x18000 pfn 0x4eef8 order 2
> xfs_io-5125 [002] 491.961244: bprint: filemap_dump: ino 0xb1 pos 0x1c000 pfn 0x4eefc order 2
> xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x20000 pfn 0x4eef0 order 2
> xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x24000 pfn 0x50f2c order 2
> xfs_io-5125 [002] 491.961245: bprint: filemap_dump: ino 0xb1 pos 0x28000 pfn 0x50f28 order 2
> xfs_io-5125 [002] 491.961246: bprint: filemap_dump: ino 0xb1 pos 0x2c000 pfn 0x50f24 order 2
... not the one at the destination here.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-27 22:16 ` Darrick J. Wong
2022-06-27 23:35 ` Dave Chinner
@ 2022-06-28 7:31 ` Dave Chinner
2022-06-28 11:27 ` Matthew Wilcox
1 sibling, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2022-06-28 7:31 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
[cc linux-mm@kvack.org]
On Mon, Jun 27, 2022 at 03:16:19PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote:
> > On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> > > > [resend with shorter 522.out file to keep us under the 300k maximum]
> > > >
> > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > > > > Now that iomap has been converted, XFS is large folio safe.
> > > > > Indicate to the VFS that it can now create large folios for XFS.
> > > > >
> > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > > fs/xfs/xfs_icache.c | 2 ++
> > > > > 1 file changed, 2 insertions(+)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index da4af2142a2b..cdc39f576ca1 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > > > > /* VFS doesn't initialise i_mode or i_state! */
> > > > > VFS_I(ip)->i_mode = 0;
> > > > > VFS_I(ip)->i_state = 0;
> > > > > + mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > > >
> > > > > XFS_STATS_INC(mp, vn_active);
> > > > > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > > > > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > > > > inode->i_rdev = dev;
> > > > > inode->i_uid = uid;
> > > > > inode->i_gid = gid;
> > > > > + mapping_set_large_folios(inode->i_mapping);
> > > >
> > > > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> > > > reports file corruption after 20 minutes of runtime. The corruption is
> > > > surprisingly reproducible (522.out.bad attached below) in that I ran it
> > > > three times and always got the same bad offset (0x6e000) and always the
> > > > same opcode (6213798(166 mod 256) MAPREAD).
> > > >
> > > > I turned off multipage folios and now 522 has run for over an hour
> > > > without problems, so before I go do more debugging, does this ring a
> > > > bell to anyone?
> > >
> > > I tried bisecting, but that didn't yield anything productive and
> > > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
> > > run without problems for at least 3-4 days after reverting this patch
> > > from -rc3.
> > >
> > > So I guess I have a blunt force fix if we can't figure this one out
> > > before 5.19 final, but I'd really rather not. Will keep trying this
> > > week.
> >
> > I'm on holiday for the next week, so I'm not going to be able to spend
> > any time on this until then. I have a suspicion that this may be the
> > same bug Zorro is seeing here:
> >
> > https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/
> >
> > At least I hope it is, and finding a folio that has been freed would
> > explain (apparent) file corruption.
>
> Hm. I suppose it /could/ be a lost folio getting into the works
> somewhere.
>
> Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8
> minutes instead of 20-25). That has helped me to narrow things down a
> little more:
>
> - Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away,
> but it does make reading the fsx log much easier.
>
> - Turning off clone/dedupe (either via -J -B or formatting with -m
> reflink=0) makes the problem go away completely. If you define
> letting fsx run for 90 minutes as "completely".
So using this technique, I've discovered that there's a dirty page
accounting leak that eventually results in fsx hanging in
balance_dirty_pages().
[15300.670773] sysrq: Show Blocked State
[15300.672712] task:fsx state:D stack:11600 pid: 5607 ppid: 5601 flags:0x00004004
[15300.676785] Call Trace:
[15300.678061] <TASK>
[15300.679171] __schedule+0x310/0x9f0
[15300.681096] schedule+0x4b/0xb0
[15300.683385] schedule_timeout+0x88/0x160
[15300.685407] ? do_raw_spin_unlock+0x4b/0xa0
[15300.687567] ? timer_migration_handler+0x90/0x90
[15300.690277] io_schedule_timeout+0x4c/0x70
[15300.692346] balance_dirty_pages_ratelimited+0x259/0xb60
[15300.695205] fault_dirty_shared_page+0xef/0x100
[15300.697650] do_wp_page+0x414/0x760
[15300.699819] __handle_mm_fault+0xc59/0x1730
[15300.702226] ? do_mmap+0x348/0x540
[15300.704418] handle_mm_fault+0x7a/0x1c0
[15300.706658] exc_page_fault+0x1da/0x780
[15300.709020] asm_exc_page_fault+0x27/0x30
# cat /proc/meminfo
MemTotal: 16292348 kB
MemFree: 15298308 kB
MemAvailable: 15219180 kB
Buffers: 71336 kB
Cached: 209288 kB
SwapCached: 0 kB
Active: 141840 kB
Inactive: 254748 kB
Active(anon): 572 kB
Inactive(anon): 128000 kB
Active(file): 141268 kB
Inactive(file): 126748 kB
Unevictable: 12344 kB
Mlocked: 12344 kB
SwapTotal: 497976 kB
SwapFree: 497976 kB
Dirty: 2825676 kB <<<<<<<<<
Writeback: 0 kB
AnonPages: 125072 kB
Mapped: 65592 kB
Shmem: 2128 kB
KReclaimable: 49424 kB
Slab: 110992 kB
SReclaimable: 49424 kB
SUnreclaim: 61568 kB
KernelStack: 4720 kB
PageTables: 4000 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 8644148 kB
Committed_AS: 277864 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 46604 kB
VmallocChunk: 0 kB
Percpu: 5376 kB
AnonHugePages: 30720 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 61276 kB
DirectMap2M: 6230016 kB
DirectMap1G: 29360128 kB
As you can see, the system has only 1GB of its 16GB of memory in use,
but it has 2.8GB of dirty pages accounted, so any write that is
done will now hang on the dirty page throttle.
sysrq-m shows:
[15443.617242] sysrq: Show Memory
[15443.618465] Mem-Info:
[15443.619428] active_anon:143 inactive_anon:32000 isolated_anon:0
[15443.619428] active_file:35317 inactive_file:31687 isolated_file:0
[15443.619428] unevictable:3086 dirty:706420 writeback:0
[15443.619428] slab_reclaimable:12356 slab_unreclaimable:15392
[15443.619428] mapped:16398 shmem:532 pagetables:1000 bounce:0
[15443.619428] kernel_misc_reclaimable:0
[15443.619428] free:3825037 free_pcp:34496 free_cma:0
[15443.632592] Node 0 active_anon:152kB inactive_anon:17812kB active_file:27304kB inactive_file:19200kB unevictable:832kB isolated(anon):0kB isolated(file):0kB mapped:8820kB dirty:632kB writeback:0kB shmem:224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:1332kB pagetables:812kB all_unreclaimable? no
[15443.643111] Node 1 active_anon:168kB inactive_anon:54348kB active_file:52160kB inactive_file:19808kB unevictable:6932kB isolated(anon):0kB isolated(file):0kB mapped:29512kB dirty:0kB writeback:0kB shmem:596kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 14336kB writeback_tmp:0kB kernel_stack:836kB pagetables:1212kB all_unreclaimable? no
[15443.653719] Node 2 active_anon:152kB inactive_anon:31780kB active_file:32604kB inactive_file:62196kB unevictable:3324kB isolated(anon):0kB isolated(file):0kB mapped:11560kB dirty:0kB writeback:0kB shmem:216kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 8192kB writeback_tmp:0kB kernel_stack:1268kB pagetables:1084kB all_unreclaimable? no
[15443.664282] Node 3 active_anon:100kB inactive_anon:24060kB active_file:29200kB inactive_file:25544kB unevictable:1256kB isolated(anon):0kB isolated(file):0kB mapped:15700kB dirty:2825048kB writeback:0kB shmem:1092kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 10240kB writeback_tmp:0kB kernel_stack:1268kB pagetables:892kB all_unreclaimable? no
[15443.673605] Node 0 DMA free:15360kB boost:0kB min:124kB low:152kB high:180kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[15443.680976] lowmem_reserve[]: 0 2911 7902 7902 7902
[15443.682360] Node 0 DMA32 free:2981444kB boost:0kB min:24744kB low:30928kB high:37112kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129180kB managed:2986368kB mlocked:0kB bounce:0kB free_pcp:4916kB local_pcp:0kB free_cma:0kB
[15443.689280] lowmem_reserve[]: 0 0 4990 4990 4990
[15443.690450] Node 0 Normal free:4968904kB boost:0kB min:42412kB low:53012kB high:63612kB reserved_highatomic:0KB active_anon:152kB inactive_anon:17812kB active_file:27304kB inactive_file:19200kB unevictable:832kB writepending:632kB present:5242880kB managed:5110756kB mlocked:832kB bounce:0kB free_pcp:41008kB local_pcp:5944kB free_cma:0kB
[15443.697102] lowmem_reserve[]: 0 0 0 0 0
[15443.697986] Node 1 Normal free:5806356kB boost:0kB min:51384kB low:64228kB high:77072kB reserved_highatomic:0KB active_anon:168kB inactive_anon:54348kB active_file:52160kB inactive_file:19808kB unevictable:6932kB writepending:0kB present:6291456kB managed:6192112kB mlocked:6932kB bounce:0kB free_pcp:68644kB local_pcp:15328kB free_cma:0kB
[15443.704154] lowmem_reserve[]: 0 0 0 0 0
[15443.705005] Node 2 Normal free:714616kB boost:0kB min:8556kB low:10692kB high:12828kB reserved_highatomic:0KB active_anon:152kB inactive_anon:31780kB active_file:32604kB inactive_file:62196kB unevictable:3324kB writepending:0kB present:1048576kB managed:1031152kB mlocked:3324kB bounce:0kB free_pcp:10904kB local_pcp:0kB free_cma:0kB
[15443.712094] lowmem_reserve[]: 0 0 0 0 0
[15443.713270] Node 3 Normal free:813468kB boost:0kB min:7936kB low:9920kB high:11904kB reserved_highatomic:0KB active_anon:100kB inactive_anon:24060kB active_file:29200kB inactive_file:25544kB unevictable:1256kB writepending:2825212kB present:1048576kB managed:956600kB mlocked:1256kB bounce:0kB free_pcp:12508kB local_pcp:388kB free_cma:0kB
[15443.721013] lowmem_reserve[]: 0 0 0 0 0
[15443.721958] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
[15443.725048] Node 0 DMA32: 3*4kB (UM) 1*8kB (M) 1*16kB (M) 1*32kB (M) 2*64kB (M) 1*128kB (M) 3*256kB (UM) 3*512kB (UM) 3*1024kB (UM) 3*2048kB (UM) 725*4096kB (M) = 2981444kB
[15443.729512] Node 0 Normal: 1714*4kB (UME) 1504*8kB (UME) 988*16kB (UME) 562*32kB (UME) 268*64kB (UME) 96*128kB (UME) 27*256kB (UME) 5*512kB (UM) 1*1024kB (E) 5*2048kB (UME) 1188*4096kB (M) = 4968904kB
[15443.734684] Node 1 Normal: 977*4kB (UM) 1086*8kB (UME) 406*16kB (UME) 210*32kB (UM) 133*64kB (UME) 116*128kB (UME) 37*256kB (UM) 18*512kB (UM) 12*1024kB (UME) 2*2048kB (UM) 1397*4096kB (M) = 5806356kB
[15443.739997] Node 2 Normal: 456*4kB (UME) 307*8kB (UM) 136*16kB (UME) 138*32kB (UME) 44*64kB (UME) 10*128kB (UE) 21*256kB (UM) 12*512kB (UM) 6*1024kB (UM) 7*2048kB (UME) 163*4096kB (UM) = 714616kB
[15443.745261] Node 3 Normal: 463*4kB (UM) 940*8kB (UM) 392*16kB (UME) 210*32kB (UME) 121*64kB (UME) 40*128kB (UME) 2*256kB (ME) 1*512kB (U) 1*1024kB (E) 3*2048kB (UME) 188*4096kB (M) = 813468kB
[15443.750265] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[15443.752776] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[15443.755614] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[15443.758256] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[15443.761168] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[15443.763996] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[15443.766634] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[15443.769423] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[15443.772050] 70156 total pagecache pages
[15443.773364] 0 pages in swap cache
[15443.774535] Swap cache stats: add 0, delete 0, find 0/0
[15443.776213] Free swap = 497976kB
[15443.777394] Total swap = 497976kB
[15443.778606] 4194165 pages RAM
[15443.779551] 0 pages HighMem/MovableOnly
[15443.780705] 121078 pages reserved
It is node 3 that fsx has been running on and it's the only
node that has been leaking dirty pages. Node 3 only
has 1GB of pages present, so having 2.8GB of dirty pages on the node
is just a little wrong. NR_ZONE_WRITE_PENDING, NR_FILE_DIRTY and
NR_DIRTY all show the same dirty page leak.
The reproducer for 5.19-rc4 is running fsx from fstests like so:
<create 8GB pmem/ramdisk>
mkfs.xfs -f /dev/pmem0
mount /dev/pmem0 /mnt/test
src/fstests/ltp/fsx q -N 100000000 -p 1000000 -o 128000 -l 600000 -C -I -X /mnt/test/junk
And this will leak around 300-400kB of dirty pages every second.
I suspect this is the same problem as the data corruption that
Darrick is seeing - that smells of a dirty multi-page folio that
hasn't been fully written back. e.g. a multipage folio gets marked
dirty and so accounts multiple pages dirty, but only a single page of
the folio is written and marked clean, leaking both dirty page
accounting and stale data contents on disk that fsx eventually trips
over.
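For reference on the accounting granularity: dirtying a folio accounts
every page in it in one hit. Very roughly - this is a simplified sketch
of the accounting helpers, not the exact upstream code, so treat the
details as approximate:

/*
 * Sketch only: dirtying a folio bumps the dirty counters by the
 * number of pages in the folio, so anything that cleans or drops
 * part of a folio without the matching folio-sized decrement leaks
 * exactly this accounting.
 */
static void sketch_account_folio_dirtied(struct folio *folio)
{
	long nr = folio_nr_pages(folio);

	__lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr);
	__zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr);
}

which is consistent with NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING
leaking in lockstep as above.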
I can't find a likely cause through reading the code, so maybe
someone who knows the mm accounting code better than I do can spot
the problem....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 7:31 ` Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios) Dave Chinner
@ 2022-06-28 11:27 ` Matthew Wilcox
2022-06-28 11:31 ` Matthew Wilcox
0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2022-06-28 11:27 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> So using this technique, I've discovered that there's a dirty page
> accounting leak that eventually results in fsx hanging in
> balance_dirty_pages().
Alas, I think this is only an accounting error, and not related to
the problem(s) that Darrick & Zorro are seeing. I think what you're
seeing is dirty pages being dropped at truncation without the
appropriate accounting. ie this should be the fix:
+++ b/mm/huge_memory.c
@@ -2443,6 +2443,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
__delete_from_page_cache(head + i, NULL);
if (shmem_mapping(head->mapping))
shmem_uncharge(head->mapping->host, 1);
+ else
+ folio_account_cleaned(page_folio(head + i));
put_page(head + i);
} else if (!PageAnon(page)) {
__xa_store(&head->mapping->i_pages, head[i].index,
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 11:27 ` Matthew Wilcox
@ 2022-06-28 11:31 ` Matthew Wilcox
2022-06-28 13:18 ` Matthew Wilcox
0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2022-06-28 11:31 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > So using this technique, I've discovered that there's a dirty page
> > accounting leak that eventually results in fsx hanging in
> > balance_dirty_pages().
>
> Alas, I think this is only an accounting error, and not related to
> the problem(s) that Darrick & Zorro are seeing. I think what you're
> seeing is dirty pages being dropped at truncation without the
> appropriate accounting. ie this should be the fix:
Argh, try one that actually compiles.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f7248002dad9..0553ae90509f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/shrinker.h>
#include <linux/mm_inline.h>
#include <linux/swapops.h>
+#include <linux/backing-dev.h>
#include <linux/dax.h>
#include <linux/khugepaged.h>
#include <linux/freezer.h>
@@ -2443,6 +2444,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
__delete_from_page_cache(head + i, NULL);
if (shmem_mapping(head->mapping))
shmem_uncharge(head->mapping->host, 1);
+ else
+ folio_account_cleaned(page_folio(head + i),
+ inode_to_wb(folio->mapping->host));
put_page(head + i);
} else if (!PageAnon(page)) {
__xa_store(&head->mapping->i_pages, head[i].index,
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 11:31 ` Matthew Wilcox
@ 2022-06-28 13:18 ` Matthew Wilcox
2022-06-28 20:57 ` Darrick J. Wong
2022-06-28 22:17 ` Dave Chinner
0 siblings, 2 replies; 51+ messages in thread
From: Matthew Wilcox @ 2022-06-28 13:18 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > So using this technique, I've discovered that there's a dirty page
> > > accounting leak that eventually results in fsx hanging in
> > > balance_dirty_pages().
> >
> > Alas, I think this is only an accounting error, and not related to
> > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > seeing is dirty pages being dropped at truncation without the
> > appropriate accounting. ie this should be the fix:
>
> Argh, try one that actually compiles.
... that one's going to underflow the accounting. Maybe I shouldn't
be writing code at 6am?
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f7248002dad9..4eec6ee83e44 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/shrinker.h>
#include <linux/mm_inline.h>
#include <linux/swapops.h>
+#include <linux/backing-dev.h>
#include <linux/dax.h>
#include <linux/khugepaged.h>
#include <linux/freezer.h>
@@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond EOF: drop them from page cache */
if (head[i].index >= end) {
- ClearPageDirty(head + i);
- __delete_from_page_cache(head + i, NULL);
+ struct folio *tail = page_folio(head + i);
+
if (shmem_mapping(head->mapping))
shmem_uncharge(head->mapping->host, 1);
- put_page(head + i);
+ else if (folio_test_clear_dirty(tail))
+ folio_account_cleaned(tail,
+ inode_to_wb(folio->mapping->host));
+ __filemap_remove_folio(tail, NULL);
+ folio_put(tail);
} else if (!PageAnon(page)) {
__xa_store(&head->mapping->i_pages, head[i].index,
head + i, 0);
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 13:18 ` Matthew Wilcox
@ 2022-06-28 20:57 ` Darrick J. Wong
2022-06-28 22:17 ` Dave Chinner
1 sibling, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2022-06-28 20:57 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Dave Chinner, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > So using this technique, I've discovered that there's a dirty page
> > > > accounting leak that eventually results in fsx hanging in
> > > > balance_dirty_pages().
> > >
> > > Alas, I think this is only an accounting error, and not related to
> > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > seeing is dirty pages being dropped at truncation without the
> > > appropriate accounting. ie this should be the fix:
> >
> > Argh, try one that actually compiles.
>
> ... that one's going to underflow the accounting. Maybe I shouldn't
> be writing code at 6am?
I dunno, it's been running on my test VMs for 160 minutes (same debug
setup as yesterday) and 100 minutes (regular g/522, no -C/-I flags to
fsx, no debugging junk) and neither has reported corruptions.
$ grep Dirty /proc/meminfo
Dirty: 100 kB
So, pretty good for writing code at 6am while on holiday. I'll let this
run overnight, but I think you've fixed the problem, at least for me...
--D
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f7248002dad9..4eec6ee83e44 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -18,6 +18,7 @@
> #include <linux/shrinker.h>
> #include <linux/mm_inline.h>
> #include <linux/swapops.h>
> +#include <linux/backing-dev.h>
> #include <linux/dax.h>
> #include <linux/khugepaged.h>
> #include <linux/freezer.h>
> @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> __split_huge_page_tail(head, i, lruvec, list);
> /* Some pages can be beyond EOF: drop them from page cache */
> if (head[i].index >= end) {
> - ClearPageDirty(head + i);
> - __delete_from_page_cache(head + i, NULL);
> + struct folio *tail = page_folio(head + i);
> +
> if (shmem_mapping(head->mapping))
> shmem_uncharge(head->mapping->host, 1);
> - put_page(head + i);
> + else if (folio_test_clear_dirty(tail))
> + folio_account_cleaned(tail,
> + inode_to_wb(folio->mapping->host));
> + __filemap_remove_folio(tail, NULL);
> + folio_put(tail);
> } else if (!PageAnon(page)) {
> __xa_store(&head->mapping->i_pages, head[i].index,
> head + i, 0);
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 13:18 ` Matthew Wilcox
2022-06-28 20:57 ` Darrick J. Wong
@ 2022-06-28 22:17 ` Dave Chinner
2022-06-28 23:21 ` Darrick J. Wong
1 sibling, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2022-06-28 22:17 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > So using this technique, I've discovered that there's a dirty page
> > > > accounting leak that eventually results in fsx hanging in
> > > > balance_dirty_pages().
> > >
> > > Alas, I think this is only an accounting error, and not related to
> > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > seeing is dirty pages being dropped at truncation without the
> > > appropriate accounting. ie this should be the fix:
> >
> > Argh, try one that actually compiles.
>
> ... that one's going to underflow the accounting. Maybe I shouldn't
> be writing code at 6am?
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f7248002dad9..4eec6ee83e44 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -18,6 +18,7 @@
> #include <linux/shrinker.h>
> #include <linux/mm_inline.h>
> #include <linux/swapops.h>
> +#include <linux/backing-dev.h>
> #include <linux/dax.h>
> #include <linux/khugepaged.h>
> #include <linux/freezer.h>
> @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> __split_huge_page_tail(head, i, lruvec, list);
> /* Some pages can be beyond EOF: drop them from page cache */
> if (head[i].index >= end) {
> - ClearPageDirty(head + i);
> - __delete_from_page_cache(head + i, NULL);
> + struct folio *tail = page_folio(head + i);
> +
> if (shmem_mapping(head->mapping))
> shmem_uncharge(head->mapping->host, 1);
> - put_page(head + i);
> + else if (folio_test_clear_dirty(tail))
> + folio_account_cleaned(tail,
> + inode_to_wb(folio->mapping->host));
> + __filemap_remove_folio(tail, NULL);
> + folio_put(tail);
> } else if (!PageAnon(page)) {
> __xa_store(&head->mapping->i_pages, head[i].index,
> head + i, 0);
>
Yup, that fixes the leak.
Tested-by: Dave Chinner <dchinner@redhat.com>
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 22:17 ` Dave Chinner
@ 2022-06-28 23:21 ` Darrick J. Wong
2022-06-29 12:57 ` Brian Foster
0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2022-06-28 23:21 UTC (permalink / raw)
To: Dave Chinner
Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig, linux-mm
On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > So using this technique, I've discovered that there's a dirty page
> > > > > accounting leak that eventually results in fsx hanging in
> > > > > balance_dirty_pages().
> > > >
> > > > Alas, I think this is only an accounting error, and not related to
> > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > seeing is dirty pages being dropped at truncation without the
> > > > appropriate accounting. ie this should be the fix:
> > >
> > > Argh, try one that actually compiles.
> >
> > ... that one's going to underflow the accounting. Maybe I shouldn't
> > be writing code at 6am?
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index f7248002dad9..4eec6ee83e44 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -18,6 +18,7 @@
> > #include <linux/shrinker.h>
> > #include <linux/mm_inline.h>
> > #include <linux/swapops.h>
> > +#include <linux/backing-dev.h>
> > #include <linux/dax.h>
> > #include <linux/khugepaged.h>
> > #include <linux/freezer.h>
> > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > __split_huge_page_tail(head, i, lruvec, list);
> > /* Some pages can be beyond EOF: drop them from page cache */
> > if (head[i].index >= end) {
> > - ClearPageDirty(head + i);
> > - __delete_from_page_cache(head + i, NULL);
> > + struct folio *tail = page_folio(head + i);
> > +
> > if (shmem_mapping(head->mapping))
> > shmem_uncharge(head->mapping->host, 1);
> > - put_page(head + i);
> > + else if (folio_test_clear_dirty(tail))
> > + folio_account_cleaned(tail,
> > + inode_to_wb(folio->mapping->host));
> > + __filemap_remove_folio(tail, NULL);
> > + folio_put(tail);
> > } else if (!PageAnon(page)) {
> > __xa_store(&head->mapping->i_pages, head[i].index,
> > head + i, 0);
> >
>
> Yup, that fixes the leak.
>
> Tested-by: Dave Chinner <dchinner@redhat.com>
Four hours of generic/522 running is long enough to conclude that this
is likely the fix for my problem and migrate long soak testing to my
main g/522 rig and:
Tested-by: Darrick J. Wong <djwong@kernel.org>
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-28 23:21 ` Darrick J. Wong
@ 2022-06-29 12:57 ` Brian Foster
2022-06-29 20:22 ` Darrick J. Wong
2022-08-17 9:36 ` Dave Chinner
0 siblings, 2 replies; 51+ messages in thread
From: Brian Foster @ 2022-06-29 12:57 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > > So using this technique, I've discovered that there's a dirty page
> > > > > > accounting leak that eventually results in fsx hanging in
> > > > > > balance_dirty_pages().
> > > > >
> > > > > Alas, I think this is only an accounting error, and not related to
> > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > > seeing is dirty pages being dropped at truncation without the
> > > > > appropriate accounting. ie this should be the fix:
> > > >
> > > > Argh, try one that actually compiles.
> > >
> > > ... that one's going to underflow the accounting. Maybe I shouldn't
> > > be writing code at 6am?
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index f7248002dad9..4eec6ee83e44 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -18,6 +18,7 @@
> > > #include <linux/shrinker.h>
> > > #include <linux/mm_inline.h>
> > > #include <linux/swapops.h>
> > > +#include <linux/backing-dev.h>
> > > #include <linux/dax.h>
> > > #include <linux/khugepaged.h>
> > > #include <linux/freezer.h>
> > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > __split_huge_page_tail(head, i, lruvec, list);
> > > /* Some pages can be beyond EOF: drop them from page cache */
> > > if (head[i].index >= end) {
> > > - ClearPageDirty(head + i);
> > > - __delete_from_page_cache(head + i, NULL);
> > > + struct folio *tail = page_folio(head + i);
> > > +
> > > if (shmem_mapping(head->mapping))
> > > shmem_uncharge(head->mapping->host, 1);
> > > - put_page(head + i);
> > > + else if (folio_test_clear_dirty(tail))
> > > + folio_account_cleaned(tail,
> > > + inode_to_wb(folio->mapping->host));
> > > + __filemap_remove_folio(tail, NULL);
> > > + folio_put(tail);
> > > } else if (!PageAnon(page)) {
> > > __xa_store(&head->mapping->i_pages, head[i].index,
> > > head + i, 0);
> > >
> >
> > Yup, that fixes the leak.
> >
> > Tested-by: Dave Chinner <dchinner@redhat.com>
>
> Four hours of generic/522 running is long enough to conclude that this
> is likely the fix for my problem and migrate long soak testing to my
> main g/522 rig and:
>
> Tested-by: Darrick J. Wong <djwong@kernel.org>
>
Just based on Willy's earlier comment.. what I would probably be a
little careful/curious about here is whether the accounting fix leads to
an indirect behavior change that does impact reproducibility of the
corruption problem. For example, does artificially escalated dirty page
tracking lead to increased reclaim/writeback activity than might
otherwise occur, and thus contend with the fs workload? Clearly it has
some impact based on Dave's balance_dirty_pages() problem reproducer,
but I don't know if it extends beyond that off the top of my head. That
might make some sense if the workload is fsx, since that doesn't
typically stress cache/memory usage the way a large fsstress workload or
something might.
So for example, interesting questions might be... Do your corruption
events happen to correspond with dirty page accounting crossing some
threshold based on available memory in your test environment? Does
reducing available memory affect reproducibility? Etc.
Brian
> --D
>
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-29 12:57 ` Brian Foster
@ 2022-06-29 20:22 ` Darrick J. Wong
2022-07-01 16:03 ` Brian Foster
2022-08-17 9:36 ` Dave Chinner
1 sibling, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2022-06-29 20:22 UTC (permalink / raw)
To: Brian Foster
Cc: Dave Chinner, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote:
> On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > > > So using this technique, I've discovered that there's a dirty page
> > > > > > > accounting leak that eventually results in fsx hanging in
> > > > > > > balance_dirty_pages().
> > > > > >
> > > > > > Alas, I think this is only an accounting error, and not related to
> > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > > > seeing is dirty pages being dropped at truncation without the
> > > > > > appropriate accounting. ie this should be the fix:
> > > > >
> > > > > Argh, try one that actually compiles.
> > > >
> > > > ... that one's going to underflow the accounting. Maybe I shouldn't
> > > > be writing code at 6am?
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index f7248002dad9..4eec6ee83e44 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -18,6 +18,7 @@
> > > > #include <linux/shrinker.h>
> > > > #include <linux/mm_inline.h>
> > > > #include <linux/swapops.h>
> > > > +#include <linux/backing-dev.h>
> > > > #include <linux/dax.h>
> > > > #include <linux/khugepaged.h>
> > > > #include <linux/freezer.h>
> > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > > __split_huge_page_tail(head, i, lruvec, list);
> > > > /* Some pages can be beyond EOF: drop them from page cache */
> > > > if (head[i].index >= end) {
> > > > - ClearPageDirty(head + i);
> > > > - __delete_from_page_cache(head + i, NULL);
> > > > + struct folio *tail = page_folio(head + i);
> > > > +
> > > > if (shmem_mapping(head->mapping))
> > > > shmem_uncharge(head->mapping->host, 1);
> > > > - put_page(head + i);
> > > > + else if (folio_test_clear_dirty(tail))
> > > > + folio_account_cleaned(tail,
> > > > + inode_to_wb(folio->mapping->host));
> > > > + __filemap_remove_folio(tail, NULL);
> > > > + folio_put(tail);
> > > > } else if (!PageAnon(page)) {
> > > > __xa_store(&head->mapping->i_pages, head[i].index,
> > > > head + i, 0);
> > > >
> > >
> > > Yup, that fixes the leak.
> > >
> > > Tested-by: Dave Chinner <dchinner@redhat.com>
> >
> > Four hours of generic/522 running is long enough to conclude that this
> > is likely the fix for my problem and migrate long soak testing to my
> > main g/522 rig and:
> >
> > Tested-by: Darrick J. Wong <djwong@kernel.org>
> >
>
> Just based on Willy's earlier comment.. what I would probably be a
> little careful/curious about here is whether the accounting fix leads to
> an indirect behavior change that does impact reproducibility of the
> corruption problem. For example, does artificially escalated dirty page
> tracking lead to increased reclaim/writeback activity than might
> otherwise occur, and thus contend with the fs workload? Clearly it has
> some impact based on Dave's balance_dirty_pages() problem reproducer,
> but I don't know if it extends beyond that off the top of my head. That
> might make some sense if the workload is fsx, since that doesn't
> typically stress cache/memory usage the way a large fsstress workload or
> something might.
>
> So for example, interesting questions might be... Do your corruption
> events happen to correspond with dirty page accounting crossing some
> threshold based on available memory in your test environment? Does
> reducing available memory affect reproducibility? Etc.
Yeah, I wonder that too now. I managed to trace generic/522 a couple of
times before willy's patch dropped. From what I could tell, a large
folio X would get page P assigned to the fsx file's page cache to cover
range R, dirtied, and written to disk. At some point later, we'd
reflink into part of the file range adjacent to P, but not P itself.
I /think/ that should have caused the whole folio to get invalidated?
Then some more things happened (none of which dirtied R, according to
fsx) and then suddenly writeback would trigger on some page (don't know
which) that would write to the disk blocks backing R. I'm fairly sure
that's where the incorrect disk contents came from.
Next, we'd reflink part of the file range including R into a different
part of the file (call it R2). fsx would read R2, bringing a new page
into cache, and it wouldn't match the fsxgood buffer, leading to fsx
aborting.
After a umount/mount cycle, reading R and R2 would both reveal the
incorrect contents that had caused fsx to abort.
Unfortunately the second ftrace attempt ate some trace data, so I was
unable to figure out if the same thing happened again.
At this point I really need to get on reviewing patches for 5.20, so
I'll try to keep poking at this (examining the trace data requires a lot
of concentration which isn't really possible while sawzall construction
is going on at home) but at worst I can ask Linus to merge a patch for
5.19 final that makes setting mapping_set_large_folios a
Kconfig/CONFIG_XFS_DEBUG option.
--D
>
> Brian
>
> > --D
> >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> >
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-29 20:22 ` Darrick J. Wong
@ 2022-07-01 16:03 ` Brian Foster
2022-07-01 18:03 ` Darrick J. Wong
0 siblings, 1 reply; 51+ messages in thread
From: Brian Foster @ 2022-07-01 16:03 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Wed, Jun 29, 2022 at 01:22:06PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote:
> > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > > > > So using this technique, I've discovered that there's a dirty page
> > > > > > > > accounting leak that eventually results in fsx hanging in
> > > > > > > > balance_dirty_pages().
> > > > > > >
> > > > > > > Alas, I think this is only an accounting error, and not related to
> > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > > > > seeing is dirty pages being dropped at truncation without the
> > > > > > > appropriate accounting. ie this should be the fix:
> > > > > >
> > > > > > Argh, try one that actually compiles.
> > > > >
> > > > > ... that one's going to underflow the accounting. Maybe I shouldn't
> > > > > be writing code at 6am?
> > > > >
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index f7248002dad9..4eec6ee83e44 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -18,6 +18,7 @@
> > > > > #include <linux/shrinker.h>
> > > > > #include <linux/mm_inline.h>
> > > > > #include <linux/swapops.h>
> > > > > +#include <linux/backing-dev.h>
> > > > > #include <linux/dax.h>
> > > > > #include <linux/khugepaged.h>
> > > > > #include <linux/freezer.h>
> > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > > > __split_huge_page_tail(head, i, lruvec, list);
> > > > > /* Some pages can be beyond EOF: drop them from page cache */
> > > > > if (head[i].index >= end) {
> > > > > - ClearPageDirty(head + i);
> > > > > - __delete_from_page_cache(head + i, NULL);
> > > > > + struct folio *tail = page_folio(head + i);
> > > > > +
> > > > > if (shmem_mapping(head->mapping))
> > > > > shmem_uncharge(head->mapping->host, 1);
> > > > > - put_page(head + i);
> > > > > + else if (folio_test_clear_dirty(tail))
> > > > > + folio_account_cleaned(tail,
> > > > > + inode_to_wb(folio->mapping->host));
> > > > > + __filemap_remove_folio(tail, NULL);
> > > > > + folio_put(tail);
> > > > > } else if (!PageAnon(page)) {
> > > > > __xa_store(&head->mapping->i_pages, head[i].index,
> > > > > head + i, 0);
> > > > >
> > > >
> > > > Yup, that fixes the leak.
> > > >
> > > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > >
> > > Four hours of generic/522 running is long enough to conclude that this
> > > is likely the fix for my problem and migrate long soak testing to my
> > > main g/522 rig and:
> > >
> > > Tested-by: Darrick J. Wong <djwong@kernel.org>
> > >
> >
> > Just based on Willy's earlier comment.. what I would probably be a
> > little careful/curious about here is whether the accounting fix leads to
> > an indirect behavior change that does impact reproducibility of the
> > corruption problem. For example, does artificially escalated dirty page
> > tracking lead to increased reclaim/writeback activity than might
> > otherwise occur, and thus contend with the fs workload? Clearly it has
> > some impact based on Dave's balance_dirty_pages() problem reproducer,
> > but I don't know if it extends beyond that off the top of my head. That
> > might make some sense if the workload is fsx, since that doesn't
> > typically stress cache/memory usage the way a large fsstress workload or
> > something might.
> >
> > So for example, interesting questions might be... Do your corruption
> > events happen to correspond with dirty page accounting crossing some
> > threshold based on available memory in your test environment? Does
> > reducing available memory affect reproducibility? Etc.
>
> Yeah, I wonder that too now. I managed to trace generic/522 a couple of
> times before willy's patch dropped. From what I could tell, a large
> folio X would get page P assigned to the fsx file's page cache to cover
> range R, dirtied, and written to disk. At some point later, we'd
> reflink into part of the file range adjacent to P, but not P itself.
> I /think/ that should have caused the whole folio to get invalidated?
>
> Then some more things happened (none of which dirtied R, according to
> fsx) and then suddenly writeback would trigger on some page (don't know
> which) that would write to the disk blocks backing R. I'm fairly sure
> that's where the incorrect disk contents came from.
>
> Next, we'd reflink part of the file range including R into a different
> part of the file (call it R2). fsx would read R2, bringing a new page
> into cache, and it wouldn't match the fsxgood buffer, leading to fsx
> aborting.
>
> After a umount/mount cycle, reading R and R2 would both reveal the
> incorrect contents that had caused fsx to abort.
>
FWIW, I hadn't been able to reproduce this in my default environment to
this point. With the memory leak issue in the light, I was eventually
able to by reducing dirty_bytes to something the system would be more
likely to hit sooner (i.e. 16-32MB), but I also see stalling behavior
and whatnot due to the leak that requires backing off from the specified
dirty limit every so often.
If I apply the accounting patch to avoid the leak and set
dirty_background_bytes to something notably aggressive (1kB), the test
survived 100 iterations or so before I stopped it. If I then set
dirty_bytes to something similarly aggressive (1MB), I hit the failure
on the next iteration (assuming it's the same problem). It's spinning
again at ~25 or so iterations without a failure so far, so I'd have to
wait and see how reliable the reproducer really is. Though if it doesn't
reoccur soonish, perhaps I'll try reducing dirty_bytes a bit more...
My suspicion based on these characteristics would be that the blocking
limit triggers more aggressive reclaim/invalidation, and thus helps
detect the problem sooner. If reflink is involved purely as a cache
invalidation step (i.e. so a subsequent read will hit the disk and
detect a cache inconsistency), then it might be interesting to see if it
can still be reproduced without reflink operations enabled but instead
with some combination of the -f/-X fsx flags to perform more flush
invals and on-disk data checks..
Brian
> Unfortunately the second ftrace attempt ate some trace data, so I was
> unable to figure out if the same thing happened again.
>
> At this point I really need to get on reviewing patches for 5.20, so
> I'll try to keep poking at this (examining the trace data requires a lot
> of concentration which isn't really possible while sawzall construction
> is going on at home) but at worst I can ask Linus to merge a patch for
> 5.19 final that makes setting mapping_set_large_folios a
> Kconfig/CONFIG_XFS_DEBUG option.
>
> --D
>
> >
> > Brian
> >
> > > --D
> > >
> > > > Cheers,
> > > >
> > > > Dave.
> > > > --
> > > > Dave Chinner
> > > > david@fromorbit.com
> > >
> >
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-07-01 16:03 ` Brian Foster
@ 2022-07-01 18:03 ` Darrick J. Wong
0 siblings, 0 replies; 51+ messages in thread
From: Darrick J. Wong @ 2022-07-01 18:03 UTC (permalink / raw)
To: Brian Foster
Cc: Dave Chinner, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Fri, Jul 01, 2022 at 12:03:23PM -0400, Brian Foster wrote:
> On Wed, Jun 29, 2022 at 01:22:06PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote:
> > > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> > > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > > > > > So using this technique, I've discovered that there's a dirty page
> > > > > > > > > accounting leak that eventually results in fsx hanging in
> > > > > > > > > balance_dirty_pages().
> > > > > > > >
> > > > > > > > Alas, I think this is only an accounting error, and not related to
> > > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > > > > > seeing is dirty pages being dropped at truncation without the
> > > > > > > > appropriate accounting. ie this should be the fix:
> > > > > > >
> > > > > > > Argh, try one that actually compiles.
> > > > > >
> > > > > > ... that one's going to underflow the accounting. Maybe I shouldn't
> > > > > > be writing code at 6am?
> > > > > >
> > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > > index f7248002dad9..4eec6ee83e44 100644
> > > > > > --- a/mm/huge_memory.c
> > > > > > +++ b/mm/huge_memory.c
> > > > > > @@ -18,6 +18,7 @@
> > > > > > #include <linux/shrinker.h>
> > > > > > #include <linux/mm_inline.h>
> > > > > > #include <linux/swapops.h>
> > > > > > +#include <linux/backing-dev.h>
> > > > > > #include <linux/dax.h>
> > > > > > #include <linux/khugepaged.h>
> > > > > > #include <linux/freezer.h>
> > > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > > > > __split_huge_page_tail(head, i, lruvec, list);
> > > > > > /* Some pages can be beyond EOF: drop them from page cache */
> > > > > > if (head[i].index >= end) {
> > > > > > - ClearPageDirty(head + i);
> > > > > > - __delete_from_page_cache(head + i, NULL);
> > > > > > + struct folio *tail = page_folio(head + i);
> > > > > > +
> > > > > > if (shmem_mapping(head->mapping))
> > > > > > shmem_uncharge(head->mapping->host, 1);
> > > > > > - put_page(head + i);
> > > > > > + else if (folio_test_clear_dirty(tail))
> > > > > > + folio_account_cleaned(tail,
> > > > > > + inode_to_wb(folio->mapping->host));
> > > > > > + __filemap_remove_folio(tail, NULL);
> > > > > > + folio_put(tail);
> > > > > > } else if (!PageAnon(page)) {
> > > > > > __xa_store(&head->mapping->i_pages, head[i].index,
> > > > > > head + i, 0);
> > > > > >
> > > > >
> > > > > Yup, that fixes the leak.
> > > > >
> > > > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > > >
> > > > Four hours of generic/522 running is long enough to conclude that this
> > > > is likely the fix for my problem and migrate long soak testing to my
> > > > main g/522 rig and:
> > > >
> > > > Tested-by: Darrick J. Wong <djwong@kernel.org>
> > > >
> > >
> > > Just based on Willy's earlier comment.. what I would probably be a
> > > little careful/curious about here is whether the accounting fix leads to
> > > an indirect behavior change that does impact reproducibility of the
> > > corruption problem. For example, does artificially escalated dirty page
> > > tracking lead to increased reclaim/writeback activity than might
> > > otherwise occur, and thus contend with the fs workload? Clearly it has
> > > some impact based on Dave's balance_dirty_pages() problem reproducer,
> > > but I don't know if it extends beyond that off the top of my head. That
> > > might make some sense if the workload is fsx, since that doesn't
> > > typically stress cache/memory usage the way a large fsstress workload or
> > > something might.
> > >
> > > So for example, interesting questions might be... Do your corruption
> > > events happen to correspond with dirty page accounting crossing some
> > > threshold based on available memory in your test environment? Does
> > > reducing available memory affect reproducibility? Etc.
> >
> > Yeah, I wonder that too now. I managed to trace generic/522 a couple of
> > times before willy's patch dropped. From what I could tell, a large
> > folio X would get page P assigned to the fsx file's page cache to cover
> > range R, dirtied, and written to disk. At some point later, we'd
> > reflink into part of the file range adjacent to P, but not P itself.
> > I /think/ that should have caused the whole folio to get invalidated?
> >
> > Then some more things happened (none of which dirtied R, according to
> > fsx) and then suddenly writeback would trigger on some page (don't know
> > which) that would write to the disk blocks backing R. I'm fairly sure
> > that's where the incorrect disk contents came from.
> >
> > Next, we'd reflink part of the file range including R into a different
> > part of the file (call it R2). fsx would read R2, bringing a new page
> > into cache, and it wouldn't match the fsxgood buffer, leading to fsx
> > aborting.
> >
> > After a umount/mount cycle, reading R and R2 would both reveal the
> > incorrect contents that had caused fsx to abort.
> >
>
> FWIW, I hadn't been able to reproduce this in my default environment to
> this point. With the memory leak issue in the light, I was eventually
> able to by reducing dirty_bytes to something the system would be more
> likely to hit sooner (i.e. 16-32MB), but I also see stalling behavior
> and whatnot due to the leak that requires backing off from the specified
> dirty limit every so often.
>
> If I apply the accounting patch to avoid the leak and set
> dirty_background_bytes to something notably aggressive (1kB), the test
> survived 100 iterations or so before I stopped it. If I then set
> dirty_bytes to something similarly aggressive (1MB), I hit the failure
> on the next iteration (assuming it's the same problem). It's spinning
> again at ~25 or so iterations without a failure so far, so I'd have to
> wait and see how reliable the reproducer really is. Though if it doesn't
> reoccur soonish, perhaps I'll try reducing dirty_bytes a bit more...
>
> My suspicion based on these characteristics would be that the blocking
> limit triggers more aggressive reclaim/invalidation, and thus helps
> detect the problem sooner. If reflink is involved purely as a cache
> invalidation step (i.e. so a subsequent read will hit the disk and
> detect a cache inconsistency), then it might be interesting to see if it
> can still be reproduced without reflink operations enabled but instead
> with some combination of the -f/-X fsx flags to perform more flush
> invals and on-disk data checks..
Hm. I didn't try -f or lowering dirty_bytes, but with the reflink
operations disabled, g522 ran for 3 hours before I gave up and killed
it. I would've thought that the fallocate zero/collapse/insert range
functions (which use the same flush/unmap helper) would have sufficed to
make the problem happen, but ... it didn't.
I think I'll try changing dirty_bytes next, to see if I can reproduce
the problem that way. I'm not surprised that you had to set dirty_bytes
to 1MB, since 522 is only ever creating a 600K file anyway.
(Hopefully willy will be back next week to help us shed some light on
this.)
--D
> Brian
>
> > Unfortunately the second ftrace attempt ate some trace data, so I was
> > unable to figure out if the same thing happened again.
> >
> > At this point I really need to get on reviewing patches for 5.20, so
> > I'll try to keep poking at this (examining the trace data requires a lot
> > of concentration which isn't really possible while sawzall construction
> > is going on at home) but at worst I can ask Linus to merge a patch for
> > 5.19 final that makes setting mapping_set_large_folios a
> > Kconfig/CONFIG_XFS_DEBUG option.
> >
> > --D
> >
> > >
> > > Brian
> > >
> > > > --D
> > > >
> > > > > Cheers,
> > > > >
> > > > > Dave.
> > > > > --
> > > > > Dave Chinner
> > > > > david@fromorbit.com
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-06-29 12:57 ` Brian Foster
2022-06-29 20:22 ` Darrick J. Wong
@ 2022-08-17 9:36 ` Dave Chinner
2022-08-17 23:53 ` Darrick J. Wong
1 sibling, 1 reply; 51+ messages in thread
From: Dave Chinner @ 2022-08-17 9:36 UTC (permalink / raw)
To: Brian Foster
Cc: Darrick J. Wong, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote:
> On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > > > So using this technique, I've discovered that there's a dirty page
> > > > > > > accounting leak that eventually results in fsx hanging in
> > > > > > > balance_dirty_pages().
> > > > > >
> > > > > > Alas, I think this is only an accounting error, and not related to
> > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > > > seeing is dirty pages being dropped at truncation without the
> > > > > > appropriate accounting. ie this should be the fix:
> > > > >
> > > > > Argh, try one that actually compiles.
> > > >
> > > > ... that one's going to underflow the accounting. Maybe I shouldn't
> > > > be writing code at 6am?
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index f7248002dad9..4eec6ee83e44 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -18,6 +18,7 @@
> > > > #include <linux/shrinker.h>
> > > > #include <linux/mm_inline.h>
> > > > #include <linux/swapops.h>
> > > > +#include <linux/backing-dev.h>
> > > > #include <linux/dax.h>
> > > > #include <linux/khugepaged.h>
> > > > #include <linux/freezer.h>
> > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > > __split_huge_page_tail(head, i, lruvec, list);
> > > > /* Some pages can be beyond EOF: drop them from page cache */
> > > > if (head[i].index >= end) {
> > > > - ClearPageDirty(head + i);
> > > > - __delete_from_page_cache(head + i, NULL);
> > > > + struct folio *tail = page_folio(head + i);
> > > > +
> > > > if (shmem_mapping(head->mapping))
> > > > shmem_uncharge(head->mapping->host, 1);
> > > > - put_page(head + i);
> > > > + else if (folio_test_clear_dirty(tail))
> > > > + folio_account_cleaned(tail,
> > > > + inode_to_wb(folio->mapping->host));
> > > > + __filemap_remove_folio(tail, NULL);
> > > > + folio_put(tail);
> > > > } else if (!PageAnon(page)) {
> > > > __xa_store(&head->mapping->i_pages, head[i].index,
> > > > head + i, 0);
> > > >
> > >
> > > Yup, that fixes the leak.
> > >
> > > Tested-by: Dave Chinner <dchinner@redhat.com>
> >
> > Four hours of generic/522 running is long enough to conclude that this
> > is likely the fix for my problem and migrate long soak testing to my
> > main g/522 rig and:
> >
> > Tested-by: Darrick J. Wong <djwong@kernel.org>
> >
>
> Just based on Willy's earlier comment.. what I would probably be a
> little careful/curious about here is whether the accounting fix leads to
> an indirect behavior change that does impact reproducibility of the
> corruption problem. For example, does artificially escalated dirty page
> tracking lead to increased reclaim/writeback activity than might
> otherwise occur, and thus contend with the fs workload?
Yeah, I think you're right, Brian. There is an underlying problem
here, and the writeback/memory pressure bugs exposed it for everyone
to see. IOWs, we do indeed have a pre-existing problem with partial
writes, iomap, unwritten extents, writeback and memory reclaim.
(Please s/page/folio/ rather than complain I'm calling things pages
and not folios, I'm juggling between RHEL-8 code and upstream
here and it's just easier to refer to everything as pages as I write
it.)
(I'm assuming people are familiar with the iomap buffered IO code
at this point.)
Let's do a partial write into a page. Let's say the second half of
the page. __iomap_write_begin() drops into iomap_adjust_read_range()
to determine the range that needs zeroing or reading from disk. We
then call iomap_block_needs_zeroing() to determine if we zero -
which we do if it's a newly allocated or unwritten extent. Otherwise
we read it from disk.
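Roughly, that zero-vs-read decision boils down to something like the
sketch below - simplified from the upstream helper, and the details
differ between kernel versions, so take it as illustrative only:

/*
 * Sketch of the "zero or read from disk?" decision made against the
 * *cached* iomap. A stale cached mapping that still says
 * IOMAP_UNWRITTEN takes the zeroing path even if writeback has since
 * converted the extent and the real data is already on disk.
 */
static bool sketch_block_needs_zeroing(const struct iomap *iomap)
{
	return iomap->type == IOMAP_HOLE ||
	       iomap->type == IOMAP_UNWRITTEN ||
	       (iomap->flags & IOMAP_F_NEW);
}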
Let's put this partial write over a large unwritten extent, which
means we zero the first part of the page, and that writing it back
runs unwritten extent conversion at IO completion. This changes the
extent backing this file offset from unwritten to written.
Then, add extreme memory pressure, so writeback is already running
and soon after the page has been dirtied it gets marked for
cleaning. Therefore, we have the possibility of background IO
submission and memory reclaim occurring at the same time as new
writes are occurring to the file.
To that end, let's add a large racing write that *ends* on this same
page that we have already partially filled with data - it will fill
the start of the page that currently contains zeros with real data.
Let's start that write() just before writeback of the already
uptodate partial page starts. iomap_apply will map the entire
write (and well beyond) as a single unwritten extent. This extent is
the same extent that the original page in the page cache already
covers, and this write would end by filling the remaining part of
that page.
While that second write is running around in the page cache
copy-in loop for the unwritten extent that backs that range, IO
submission runs and completes and converts the original partially
filled page to a written extent, then marks it clean.
[ Ayup, the extent map of the file just changed to be different
to the iomap that the current write() thinks the file layout has. ]
Further, because the partially written page is now clean,
memory reclaim can snaffle it up and it gets removed from the page
cache.
The second write finally gets to the last page of its write and
doesn't find it in the page cache because memory reclaim removed it.
So it pulls a new page into the page cache in iomap_write_begin(),
and then in __iomap_write_begin() we see that we need to fill the
second half of the page with either zeros or data from disk.
We know that this data is on disk as the extent is now in written
state. *However*, the cached iomap that we are currently using for
the write() says the range is -unwritten-, and so at that point
iomap_block_needs_zeroing() says "page needs zeroing". Hence
__iomap_write_begin() zeroes the second half of the page instead of
reading it from disk, obliterating the data that the previous write
had already written there.
Ouchy, we have data corruption because the incoming write holds a
stale cached iomap......
At this point, I had a nasty feeling of deja vu.
Oh, yeah, we already have infrastructure in place to avoid using stale
cached iomaps in ->writepages....
That is, iomap_writepage_map() calls ->map_blocks for every filesystem
block, even though the iomap_writepage_ctx has a cached iomap in it.
That's so the filesystem can check that the cached iomap is still
valid whilst the page we are writing back is held locked. This
protects writeback against races with truncate, hole punch, etc.
XFS does this via a call to xfs_imap_valid() - we keep a generation
number in the XFS extent trees that is bumped on every change so
that we can check whether a cached iomap is still valid. If the
generation numbers don't match, then we generate a new iomap for the
writepage context....
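In rough terms - the names below are illustrative, not the real XFS
structures - that check amounts to:

/*
 * Illustrative sketch of a generation-based staleness check in the
 * spirit of xfs_imap_valid(): remember the extent tree generation
 * when the mapping was built, and compare it under the page lock
 * before trusting the cached mapping.
 */
struct wb_cached_map {
	struct iomap	iomap;		/* the cached mapping */
	unsigned int	data_seq;	/* fork generation when mapped */
};

static bool wb_cached_map_valid(unsigned int cur_data_seq,
		const struct wb_cached_map *wpc)
{
	/* extent tree changed since we built the map? then remap */
	return cur_data_seq == wpc->data_seq;
}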
What is clear to me now is that we have the same stale cached iomap
problem with iomap_apply() because writeback can change the extent
map state in a way that influences incoming writes and those updates
are not in any way serialised against the incoming write IO path.
Hence we have to validate that the cached iomap is not stale before
we decide what to do with a new page in the page cache (zero or read
from disk), and we have to do this iomap validation when the newly
instantiated page is locked and ready for initialisation.
I have not thought past this point - I'm only *just* smart enough to
be able to dig out the root cause of this problem. No scratch that,
I'm not smart, just stubborn enough to ignore the fact I've got the
analysis wrong at least a dozen times already.
That said, this stale mapping problem existed long before we started
using unwritten extents for delayed allocation. The change to use
unwritten extents (to fix other data corruption issues) exposed us
to in-memory zeroing in this race condition case rather than reading the
block of data from disk. We just hadn't armed the landmine....
I suspect we are going to need a new iomap validation operation, and
require the iomap actors to be able to handle iomap validation
failure. What that looks like, I don't know as my brain
turned to goo and dribbled out my ears about 4 hours ago....
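Just to sketch the shape of the thing I mean - this is hypothetical
only, the names and interface below are made up and not a proposal:

/*
 * Hypothetical sketch: once the target page is locked, give the
 * filesystem a chance to say "the iomap I handed out is now stale".
 * On failure the actor bails out and the core loop goes back to
 * ->iomap_begin for a fresh mapping before retrying.
 */
typedef bool (*iomap_valid_fn)(struct inode *inode,
			       const struct iomap *iomap);

static int check_cached_iomap(struct inode *inode,
		const struct iomap *iomap, iomap_valid_fn iomap_valid)
{
	if (iomap_valid && !iomap_valid(inode, iomap))
		return -ESTALE;		/* caller must remap and retry */
	return 0;
}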
BTW, all the credit for finding this goes to Frank Sorenson - he did
all the hard work of reproducing the issue and narrowing down the
scope of the problem to a handful of XFS commits combined with
writeback and memory reclaim interactions. Without his hard work,
we'd still have no idea what was going on.....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-08-17 9:36 ` Dave Chinner
@ 2022-08-17 23:53 ` Darrick J. Wong
2022-08-18 21:58 ` Dave Chinner
0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2022-08-17 23:53 UTC (permalink / raw)
To: Dave Chinner
Cc: Brian Foster, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Wed, Aug 17, 2022 at 07:36:27PM +1000, Dave Chinner wrote:
> On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote:
> > On Tue, Jun 28, 2022 at 04:21:55PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 29, 2022 at 08:17:57AM +1000, Dave Chinner wrote:
> > > > On Tue, Jun 28, 2022 at 02:18:24PM +0100, Matthew Wilcox wrote:
> > > > > On Tue, Jun 28, 2022 at 12:31:55PM +0100, Matthew Wilcox wrote:
> > > > > > On Tue, Jun 28, 2022 at 12:27:40PM +0100, Matthew Wilcox wrote:
> > > > > > > On Tue, Jun 28, 2022 at 05:31:20PM +1000, Dave Chinner wrote:
> > > > > > > > So using this technique, I've discovered that there's a dirty page
> > > > > > > > accounting leak that eventually results in fsx hanging in
> > > > > > > > balance_dirty_pages().
> > > > > > >
> > > > > > > Alas, I think this is only an accounting error, and not related to
> > > > > > > the problem(s) that Darrick & Zorro are seeing. I think what you're
> > > > > > > seeing is dirty pages being dropped at truncation without the
> > > > > > > appropriate accounting. ie this should be the fix:
> > > > > >
> > > > > > Argh, try one that actually compiles.
> > > > >
> > > > > ... that one's going to underflow the accounting. Maybe I shouldn't
> > > > > be writing code at 6am?
> > > > >
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index f7248002dad9..4eec6ee83e44 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -18,6 +18,7 @@
> > > > > #include <linux/shrinker.h>
> > > > > #include <linux/mm_inline.h>
> > > > > #include <linux/swapops.h>
> > > > > +#include <linux/backing-dev.h>
> > > > > #include <linux/dax.h>
> > > > > #include <linux/khugepaged.h>
> > > > > #include <linux/freezer.h>
> > > > > @@ -2439,11 +2440,15 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > > > __split_huge_page_tail(head, i, lruvec, list);
> > > > > /* Some pages can be beyond EOF: drop them from page cache */
> > > > > if (head[i].index >= end) {
> > > > > - ClearPageDirty(head + i);
> > > > > - __delete_from_page_cache(head + i, NULL);
> > > > > + struct folio *tail = page_folio(head + i);
> > > > > +
> > > > > if (shmem_mapping(head->mapping))
> > > > > shmem_uncharge(head->mapping->host, 1);
> > > > > - put_page(head + i);
> > > > > + else if (folio_test_clear_dirty(tail))
> > > > > + folio_account_cleaned(tail,
> > > > > + inode_to_wb(folio->mapping->host));
> > > > > + __filemap_remove_folio(tail, NULL);
> > > > > + folio_put(tail);
> > > > > } else if (!PageAnon(page)) {
> > > > > __xa_store(&head->mapping->i_pages, head[i].index,
> > > > > head + i, 0);
> > > > >
> > > >
> > > > Yup, that fixes the leak.
> > > >
> > > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > >
> > > Four hours of generic/522 running is long enough to conclude that this
> > > is likely the fix for my problem and migrate long soak testing to my
> > > main g/522 rig and:
> > >
> > > Tested-by: Darrick J. Wong <djwong@kernel.org>
> > >
> >
> > Just based on Willy's earlier comment.. what I would probably be a
> > little careful/curious about here is whether the accounting fix leads to
> > an indirect behavior change that does impact reproducibility of the
> > corruption problem. For example, does artificially escalated dirty page
> > tracking lead to increased reclaim/writeback activity than might
> > otherwise occur, and thus contend with the fs workload?
>
> Yeah, I think you're right, Brian. There is an underlying problem
> here, and the writeback/memory pressure bugs exposed it for everyone
> to see. IOWs, we do indeed have a pre-existing problem with partial
> writes, iomap, unwritten extents, writeback and memory reclaim.
>
> (Please s/page/folio/ rather than complain I'm calling things pages
> and not folios, I'm juggling between RHEL-8 code and upstream
> here and it's just easier to refer to everything as pages as I write
> it.)
>
> (I'm assuming people are familiar with the iomap buffered IO code
> at this point.)
>
> Let's do a partial write into a page. Let's say the second half of
> the page. __iomap_write_begin() drops into iomap_adjust_read_range()
> to determine the range that needs zeroing or reading from disk. We
> then call iomap_block_needs_zeroing() to determine if we zero -
> which we do if it's a newly allocated or unwritten extent. Otherwise
> we read it from disk.
>
> Let's put this partial write over a large unwritten extent, which
> means we zero the first part of the page, and that writing it back
> runs unwritten extent conversion at IO completion. This changes the
> extent backing this file offset from unwritten to written.
>
> Then, add extreme memory pressure, so writeback is already running
> and soon after the page has been dirtied it gets marked for
> cleaning. Therefore, we have the possibility of background IO
> submission and memory reclaim occurring at the same time as new
> writes are occurring to the file.
>
> To that end, let's add a large racing write that *ends* on this same
> page that we have already partially filled with data - it will fill
> the start of the page that currently contains zeros with real data.
>
> Let's start that write() just before writeback of the already
> uptodate partial page starts. iomap_apply will map the entire
> write (and well beyond) as a single unwritten extent. This extent is
> the same extent that the original page in the page cache already
> covers, and this write would end by filling the remaining part of
> that page.
>
> While that second write is running around in the page cache
> copy-in loop for the unwritten extent that backs that range, IO
> submission runs and completes and converts the original partially
> filled page to a written extent, then marks it clean.
So if I understand this correctly --
Thread 1 fallocates 1G of contiguous space and writes 38 bytes to the
end of that gigabyte. We instantiate a folio X for the region, zero the
parts we're not going to write, and copy the bytes we did want to write
into folio X. The folio is now uptodate and dirty. Thread 1 goes away.
Thread 2 starts a (1G-38) write of that fallocated (and mostly
unwritten) space. It proceeds very slowly, one page at a time; it'll be
a while before it gets to folio X.
ramsnake.exe fires up and pushes the system hard into memory reclaim.
Reclaim initiates dirty writeback of folio X, which writes it to disk
and does the unwritten extent conversion. Reclaim then removes X from
the page cache, and frees X.
Thread 2 finally gets to the end of its 1G write. It is still using the
cached 1G unwritten mapping from before, unaware that some of the
mapping has changed to written state. It grabs a folio for (1G - 38),
but it gets folio Y instead of folio X. Y is not uptodate, so thread 2
copies bytes from userspace and zeroes the last 38 bytes.
**WRONG**
We now have folio Y with zeroes in the last 38 bytes, instead of
whatever thread 1 wrote. Writeback pushes Y to disk, clobbering those
38 bytes forever.
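The step that makes this destructive is the zero-vs-read decision made
from the *cached* mapping.  Roughly, as a simplified sketch for
illustration (not the actual fs/iomap/buffered-io.c helper, which also
looks at i_size):

#include <linux/iomap.h>

/*
 * Decide whether a freshly created folio range can simply be zeroed or
 * must be read from disk, based only on the mapping that was cached
 * when the write started.  Nothing here can notice that writeback has
 * since converted the extent from unwritten to written, so thread 2
 * zeroes a range that already has real data on disk.
 */
static bool sketch_block_needs_zeroing(const struct iomap *srcmap)
{
	/* zero for newly allocated or unwritten extents, else read */
	return (srcmap->flags & IOMAP_F_NEW) ||
	       srcmap->type == IOMAP_UNWRITTEN;
}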
> [ Ayup, the extent map of the file just changed to be different
> to the iomap that the current write() thinks the file layout has. ]
>
> Further, because the partially written page is now clean,
> memory reclaim can snaffle it up and it gets removed from the page
> cache.
>
> The second write finally gets to the last page of its write and
> doesn't find it in the page cache because memory reclaim removed it.
> So it pulls a new page into the page cache in iomap_write_begin(),
> and then in __iomap_write_begin we see that we need to fill the
> second half of the page with either zeros or data from disk.
>
> We know that this data is on disk as the extent is now in written
> state. *However*, the cached iomap that we are currently using for
> the write() says the range is -unwritten-, and so at that point
> iomap_block_needs_zeroing() says "page needs zeroing". Hence
> __iomap_write_begin zeroes the second half of the page instead of
> reading it from disk, obliterating the data that the previous write
> had already written there.
Yup.
> Ouchy, we have data corruption because the incoming write holds a
> stale cached iomap......
>
> At this point, I had a nasty feeling of deja vu.
>
> Oh, yeah, we already have infrastructure in place to avoid using stale
> cached iomaps in ->writepages....
>
> That is, iomap_writepage_map() calls ->map_blocks for every filesystem
> block, even though the iomap_writepage_ctx has a cached iomap in it.
> That's so the filesystem can check that the cached iomap is still
> valid whilst the page we are writing back is held locked. This
> protects writeback against races with truncate, hole punch, etc.
>
> XFS does this via a call to xfs_imap_valid() - we keep a generation
> number in the XFS extent trees that is bumped on every change so
> that we can check whether a cached iomap is still valid. If the
> generation numbers don't match, then we generate a new iomap for the
> writepage context....
>
> What is clear to me now is that we have the same stale cached iomap
> problem with iomap_apply() because writeback can change the extent
> map state in a way that influences incoming writes and those updates
> are not in any way serialised against the incoming write IO path.
>
> Hence we have to validate that the cached iomap is not stale before
> we decide what to do with a new page in the page cache (zero or read
> from disk), and we have to do this iomap validation when the newly
> instantiated page is locked and ready for initialisation.
>
> I have not thought past this point - I'm only *just* smart enough to
> be able to dig out the root cause of this problem. No scratch that,
> I'm not smart, just stubborn enough to ignore the fact I've got the
> analysis wrong at least a dozen times already.
>
> That said, this stale mapping problem existed long before we started
> using unwritten extents for delayed allocation. The change to use
> unwritten extents (to fix other data corruption issues) exposed us
> to in-memory zeroing in this race condition case rather than reading the
> block of data from disk. We just hadn't armed the landmine....
>
> I suspect we are going to need a new iomap validation operation, and
> require the iomap actors to be able to handle iomap validation
> failure. What that looks like, I don't know as my brain
> turned to goo and dribbled out my ears about 4 hours ago....
Euughg. I suppose this means that ->iomap_begin has to be able to stuff
xfs_ifork.if_seq into a cookie value that we can set in struct iomap
somewhere? And then iomap has to be able to call back to XFS to
revalidate iomap.cookie? And if iomap.cookie != xfs_ifork.if_seq, then
I guess iomap will just ... unlock and put the folio, and return EAGAIN
(or something) so that its caller will return however many bytes have
been dirtied to iomap_apply so that _apply will go get a fresh mapping?
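A minimal sketch of what the XFS half of that could look like, assuming
struct iomap grows some opaque cookie field -- the field and function
names below are made up for illustration, not an existing interface:

/*
 * Imagine this living in fs/xfs/xfs_iomap.c with the usual includes.
 * ->iomap_begin would stash the data fork sequence number in a new
 * (hypothetical) iomap->validity_cookie; the iomap core would call
 * back here with the newly instantiated folio locked, before deciding
 * whether to zero or read.
 */
static bool xfs_iomap_valid_sketch(struct inode *inode,
				   const struct iomap *iomap)
{
	struct xfs_inode	*ip = XFS_I(inode);

	return iomap->validity_cookie == READ_ONCE(ip->i_df.if_seq);
}

A false return would mean unlocking and putting the folio, returning
however many bytes were copied so far, and letting _apply go get a
fresh mapping.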
> BTW, all the credit for finding this goes to Frank Sorenson - he did
> all the hard work of reproducing the issue and narrowing down the
> scope of the problem to a handful of XFS commits combined with
> writeback and memory reclaim interactions. Without his hard work,
> we'd still have no idea what was going on.....
Thanks a bunch to him!!!
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: Multi-page folio issues in 5.19-rc4 (was [PATCH v3 25/25] xfs: Support large folios)
2022-08-17 23:53 ` Darrick J. Wong
@ 2022-08-18 21:58 ` Dave Chinner
0 siblings, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2022-08-18 21:58 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Brian Foster, Matthew Wilcox, linux-xfs, linux-fsdevel,
linux-kernel, Christoph Hellwig, linux-mm
On Wed, Aug 17, 2022 at 04:53:31PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 17, 2022 at 07:36:27PM +1000, Dave Chinner wrote:
> > On Wed, Jun 29, 2022 at 08:57:30AM -0400, Brian Foster wrote:
> > > Just based on Willy's earlier comment.. what I would probably be a
> > > little careful/curious about here is whether the accounting fix leads to
> > > an indirect behavior change that does impact reproducibility of the
> > > corruption problem. For example, does artificially escalated dirty page
> > > tracking lead to more reclaim/writeback activity than might
> > > otherwise occur, and thus contend with the fs workload?
> >
> > Yeah, I think you're right, Brian. There is an underlying problem
> > here, and the writeback/memory pressure bugs exposed it for everyone
> > to see. IOWs, we do indeed have a pre-existing problem with partial
> > writes, iomap, unwritten extents, writeback and memory reclaim.
> >
> > (Please s/page/folio/ rather than complain I'm calling things pages
> > and not folios, I'm juggling between RHEL-8 code and upstream
> > here and it's just easier to refer to everything as pages as I write
> > it.)
> >
> > (I'm assuming people are familiar with the iomap buffered IO code
> > at this point.)
> >
> > Let's do a partial write into a page. Let's say the second half of
> > the page. __iomap_write_begin() drops into iomap_adjust_read_range()
> > to determine the range that needs zeroing or reading from disk. We
> > then call iomap_block_needs_zeroing() to determine if we zero -
> > which we do if it's a newly allocated or unwritten extent. Otherwise
> > we read it from disk.
> >
> > Let's put this partial write over a large unwritten extent, which
> > means we zero the first part of the page, and that writing it back
> > runs unwritten extent conversion at IO completion. This changes the
> > extent backing this file offset from unwritten to written.
> >
> > Then, add extreme memory pressure, so writeback is already running
> > and soon after the page has been dirtied it gets marked for
> > cleaning. Therefore, we have the possibility of background IO
> > submission and memory reclaim occurring at the same time as new
> > writes are occurring to the file.
> >
> > To that end, let's add a large racing write that *ends* on this same
> > page that we have already partially filled with data - it will fill
> > the start of the page that currently contains zeros with real data.
> >
> > Let's start that write() just before writeback of the already
> > uptodate partial page starts. iomap_apply will map the entire
> > write (and well beyond) as a single unwritten extent. This extent is
> > the same extent that the original page in the page cache already
> > covers, and this write would end by filling the remaining part of
> > that page.
> >
> > While that second write is running around in the page cache
> > copy-in loop for the unwritten extent that backs that range, IO
> > submission runs and completes and converts the original partially
> > filled page to a written extent, then marks it clean.
>
> So if I understand this correctly --
>
> Thread 1 fallocates 1G of contiguous space and writes 38 bytes to the
> end of that gigabyte. We instantiate a folio X for the region, zero the
> parts we're not going to write, and copy the bytes we did want to write
> into folio X. The folio is now uptodate and dirty. Thread 1 goes away.
>
> Thread 2 starts a (1G-38) write of that fallocated (and mostly
> unwritten) space. It proceeds very slowly, one page at a time; it'll be
> a while before it gets to folio X.
>
> ramsnake.exe fires up and pushes the system hard into memory reclaim.
> Reclaim initiates dirty writeback of folio X, which writes it to disk
> and does the unwritten extent conversion. Reclaim then removes X from
> the page cache, and frees X.
>
> Thread 2 finally gets to the end of its 1G write. It is still using the
> cached 1G unwritten mapping from before, unaware that some of the
> mapping has changed to written state. It grabs a folio for (1G - 38),
> but it gets folio Y instead of folio X. Y is not uptodate, so thread 2
> copies bytes from userspace and zeroes the last 38 bytes.
>
> **WRONG**
>
> We now have folio Y with zeroes in the last 38 bytes, instead of
> whatever thread 1 wrote. Writeback pushes Y to disk, clobbering those
> 38 bytes forever.
Yes, that's pretty much it.
> > [ Ayup, the extent map of the file just changed to be different
> > to the iomap that the current write() thinks the file layout has. ]
> >
> > Further, because the partially written page is now clean,
> > memory reclaim can snaffle it up and it gets removed from the page
> > cache.
> >
> > The second write finally gets to the last page of its write and
> > doesn't find it in the page cache because memory reclaim removed it.
> > So it pulls a new page into the page cache in iomap_write_begin(),
> > and then in __iomap_write_begin we see that we need to fill the
> > second half of the page with either zeros or data from disk.
> >
> > We know that this data is on disk as the extent is now in written
> > state. *However*, the cached iomap that we are currently using for
> > the write() says the range is -unwritten-, and so at that point
> > iomap_block_needs_zeroing() says "page needs zeroing". Hence
> > __iomap_write_begin zeroes the second half of the page instead of
> > reading it from disk, obliterating the data that the previous write
> > had already written there.
>
> Yup.
>
> > Ouchy, we have data corruption because the incoming write holds a
> > stale cached iomap......
> >
> > At this point, I had a nasty feeling of deja vu.
> >
> > Oh, yeah, we already have infrastructure in place to avoid using stale
> > cached iomaps in ->writepages....
> >
> > That is, iomap_writepage_map() calls ->map_blocks for every filesystem
> > block, even though the iomap_writepage_ctx has a cached iomap in it.
> > That's so the filesystem can check that the cached iomap is still
> > valid whilst the page we are writing back is held locked. This
> > protects writeback against races with truncate, hole punch, etc.
> >
> > XFS does this via a call to xfs_imap_valid() - we keep a generation
> > number in the XFS extent trees that is bumped on every change so
> > that we can check whether a cached iomap is still valid. If the
> > generation numbers don't match, then we generate a new iomap for the
> > writepage context....
> >
> > What is clear to me now is that we have the same stale cached iomap
> > problem with iomap_apply() because writeback can change the extent
> > map state in a way that influences incoming writes and those updates
> > are not in any way serialised against the incoming write IO path.
> >
> > Hence we have to validate that the cached iomap is not stale before
> > we decide what to do with a new page in the page cache (zero or read
> > from disk), and we have to do this iomap validation when the newly
> > instantiated page is locked and ready for initialisation.
> >
> > I have not thought past this point - I'm only *just* smart enough to
> > be able to dig out the root cause of this problem. No scratch that,
> > I'm not smart, just stubborn enough to ignore the fact I've got the
> > analysis wrong at least a dozen times already.
> >
> > That said, this stale mapping problem existed long before we started
> > using unwritten extents for delayed allocation. The change to use
> > unwritten extents (to fix other data corruption issues) exposed us
> > to in-memory zeroing in this race condition case rather than reading the
> > block of data from disk. We just hadn't armed the landmine....
> >
> > I suspect we are going to need a new iomap validation operation, and
> > require the iomap actors to be able to handle iomap validation
> > failure. What that looks like, I don't know as my brain
> > turned to goo and dribbled out my ears about 4 hours ago....
>
> Euughg. I suppose this means that ->iomap_begin has to be able to stuff
> xfs_ifork.if_seq into a cookie value that we can set in struct iomap
> somewhere? And then iomap has to be able to call back to XFS to
> revalidate iomap.cookie? And if iomap.cookie != xfs_ifork.if_seq, then
> I guess iomap will just ... unlock and put the folio, and return EAGAIN
> (or something) so that its caller will return however many bytes have
> been dirtied to iomap_apply so that _apply will go get a fresh mapping?
Yeah, something like that. I have not had time to think this through
any further than what you've just written. It does not look like it
will be trivial, though, and it points out that every filesystem
using iomap for multi-page extent maps will need to implement iomap
invalidation detection in some way.
We already have that for XFS with the sequence numbers, but I'm
betting that none of the other filesystems that use iomap have the
internal infrastructure to track racing extent map changes...
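As a rough illustration of the sort of infrastructure I mean - purely a
sketch, not code from any existing filesystem - it boils down to a
per-inode counter that is bumped on every extent map change and
compared against the value sampled when the mapping was handed out:

#include <linux/types.h>
#include <linux/compiler.h>

struct example_inode_ext {
	u32	ext_seq;	/* modified only under the fs's extent
				 * map lock */
};

/* call under the extent map lock on every change, including
 * unwritten->written conversion at IO completion */
static void example_bump_extent_seq(struct example_inode_ext *ei)
{
	WRITE_ONCE(ei->ext_seq, ei->ext_seq + 1);
}

/* sample at mapping time, recheck before the cached mapping is used */
static bool example_extent_seq_unchanged(struct example_inode_ext *ei,
					 u32 sampled)
{
	return READ_ONCE(ei->ext_seq) == sampled;
}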
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH v3 25/25] xfs: Support large folios
2022-06-27 4:15 ` Darrick J. Wong
2022-06-27 14:10 ` Matthew Wilcox
@ 2022-06-27 22:07 ` Dave Chinner
1 sibling, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2022-06-27 22:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Matthew Wilcox (Oracle), linux-xfs, linux-fsdevel, linux-kernel,
Christoph Hellwig
On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> > [resend with shorter 522.out file to keep us under the 300k maximum]
> >
> > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > > Now that iomap has been converted, XFS is large folio safe.
> > > Indicate to the VFS that it can now create large folios for XFS.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > fs/xfs/xfs_icache.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index da4af2142a2b..cdc39f576ca1 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > > /* VFS doesn't initialise i_mode or i_state! */
> > > VFS_I(ip)->i_mode = 0;
> > > VFS_I(ip)->i_state = 0;
> > > + mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > >
> > > XFS_STATS_INC(mp, vn_active);
> > > ASSERT(atomic_read(&ip->i_pincount) == 0);
> > > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > > inode->i_rdev = dev;
> > > inode->i_uid = uid;
> > > inode->i_gid = gid;
> > > + mapping_set_large_folios(inode->i_mapping);
> >
> > Hmm. Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> > reports file corruption after 20 minutes of runtime. The corruption is
> > surprisingly reproducible (522.out.bad attached below) in that I ran it
> > three times and always got the same bad offset (0x6e000) and always the
> > same opcode (6213798(166 mod 256) MAPREAD).
> >
> > I turned off multipage folios and now 522 has run for over an hour
> > without problems, so before I go do more debugging, does this ring a
> > bell to anyone?
>
> I tried bisecting, but that didn't yield anything productive and
> 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
> run without problems for at least 3-4 days after reverting this patch
> from -rc3.
Took 63 million ops and just over 3 hours before it failed here with
a similar 16 byte map read corruption on the first 16 bytes of a
page. Given the number of fallocate operations that led up to the
failure - 14 of the last 23, plus 3 clone, 2 copy, 2 map read, 1 skip
and the map write that the stale data appears to have come from - this
smells of an invalidation issue...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 51+ messages in thread