linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v4 0/9] Create large folios in iomap buffered write path
@ 2023-07-10 13:02 Matthew Wilcox (Oracle)
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
to improve worst-case latency.  Unfortunately, this had the effect of
limiting the performance of:

fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
        -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
        -numjobs=4 -directory=/mnt/test

The problem ends up being lock contention on the i_pages spinlock as we
clear the writeback bit on each folio (and propagate that up through
the tree).  By using larger folios, we decrease the number of folios
to be processed by a factor of 256 for this benchmark, eliminating the
lock contention.
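
A sketch of where the factor of 256 comes from, assuming 4kB pages and
the 1MiB write size used above (fgf_set_order() is added later in this
series):

	ilog2(1MiB) - PAGE_SHIFT = 20 - 12, ie. order-8 folios
	one order-8 folio covers 2^8 = 256 4kB pages

so each writeback completion touches the tree once instead of 256 times.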

It's also the right thing to do.  This is a project that has been on
the back burner for years, it just hasn't been important enough to do
before now.

I think it's probably best if this goes through the iomap tree since
the changes outside iomap are either to the page cache or they're
trivial.

v4:
 - Rebase on v6.5-rc1
 - Add documentation for fgf_set_order()
 - Improve documentation for iomap_get_folio()
 - Fix crash caused by shadow kaddr name in copy_page_from_iter_atomic()
 - Add copy_folio_from_iter_atomic()
 - Improve comment in iomap_release_folio()
 - Use __GFP_NORETRY | __GFP_NOWARN to allocate large folios
 - Change an 'unsigned long bytes' to 'size_t bytes' for consistency
 - Restructure iomap_write_iter() loop slightly

v3:
 - Fix the handling of compound highmem pages in copy_page_from_iter_atomic()
 - Rename fgp_t to fgf_t
 - Clarify some wording in the documentation

v2:
 - Fix misplaced semicolon
 - Rename fgp_order to fgp_set_order
 - Rename FGP_ORDER to FGP_GET_ORDER
 - Add fgp_t
 - Update the documentation for ->release_folio
 - Fix iomap_invalidate_folio()
 - Update iomap_release_folio()

Matthew Wilcox (Oracle) (9):
  iov_iter: Handle compound highmem pages in
    copy_page_from_iter_atomic()
  iov_iter: Add copy_folio_from_iter_atomic()
  iomap: Remove large folio handling in iomap_invalidate_folio()
  doc: Correct the description of ->release_folio
  iomap: Remove unnecessary test from iomap_release_folio()
  filemap: Add fgf_t typedef
  filemap: Allow __filemap_get_folio to allocate large folios
  iomap: Create large folios in the buffered write path
  iomap: Copy larger chunks from userspace

 Documentation/filesystems/locking.rst | 15 +++--
 fs/btrfs/file.c                       |  6 +-
 fs/f2fs/compress.c                    |  2 +-
 fs/f2fs/f2fs.h                        |  2 +-
 fs/gfs2/bmap.c                        |  2 +-
 fs/iomap/buffered-io.c                | 54 +++++++++---------
 include/linux/iomap.h                 |  2 +-
 include/linux/pagemap.h               | 82 +++++++++++++++++++++++----
 include/linux/uio.h                   |  9 ++-
 lib/iov_iter.c                        | 43 +++++++++-----
 mm/filemap.c                          | 65 +++++++++++----------
 mm/folio-compat.c                     |  2 +-
 mm/readahead.c                        | 13 -----
 13 files changed, 187 insertions(+), 110 deletions(-)

-- 
2.39.2



* [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-10 23:43   ` Luis Chamberlain
                     ` (3 more replies)
  2023-07-10 13:02 ` [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic() Matthew Wilcox (Oracle)
                   ` (8 subsequent siblings)
  9 siblings, 4 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

copy_page_from_iter_atomic() already handles !highmem compound
pages correctly, but if we are passed a highmem compound page,
each base page needs to be mapped & unmapped individually.
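
A minimal sketch of the highmem case, assuming "page", "offset" and
"bytes" are the function's arguments (the real loop is in the diff
below):

	while (bytes) {
		struct page *p = page + offset / PAGE_SIZE;
		size_t poff = offset % PAGE_SIZE;
		size_t n = min_t(size_t, bytes, PAGE_SIZE - poff);
		char *kaddr = kmap_atomic(p);

		/* copy n bytes from the iterator into kaddr + poff here */
		kunmap_atomic(kaddr);
		offset += n;
		bytes -= n;
	}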

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 lib/iov_iter.c | 43 ++++++++++++++++++++++++++++---------------
 1 file changed, 28 insertions(+), 15 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index b667b1e2f688..c728b6e4fb18 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -566,24 +566,37 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
-				  struct iov_iter *i)
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
+		size_t bytes, struct iov_iter *i)
 {
-	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-	if (!page_copy_sane(page, offset, bytes)) {
-		kunmap_atomic(kaddr);
+	size_t n, copied = 0;
+
+	if (!page_copy_sane(page, offset, bytes))
 		return 0;
-	}
-	if (WARN_ON_ONCE(!i->data_source)) {
-		kunmap_atomic(kaddr);
+	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
-	}
-	iterate_and_advance(i, bytes, base, len, off,
-		copyin(p + off, base, len),
-		memcpy_from_iter(i, p + off, base, len)
-	)
-	kunmap_atomic(kaddr);
-	return bytes;
+
+	do {
+		char *p;
+
+		n = bytes - copied;
+		if (PageHighMem(page)) {
+			page += offset / PAGE_SIZE;
+			offset %= PAGE_SIZE;
+			n = min_t(size_t, n, PAGE_SIZE - offset);
+		}
+
+		p = kmap_atomic(page) + offset;
+		iterate_and_advance(i, n, base, len, off,
+			copyin(p + off, base, len),
+			memcpy_from_iter(i, p + off, base, len)
+		)
+		kunmap_atomic(p);
+		copied += n;
+		offset += n;
+	} while (PageHighMem(page) && copied != bytes && n > 0);
+
+	return copied;
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-- 
2.39.2



* [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic()
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-11  6:31   ` Christoph Hellwig
  2023-07-24 15:56   ` Darrick J. Wong
  2023-07-10 13:02 ` [PATCH v4 3/9] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

Add a folio wrapper around copy_page_from_iter_atomic().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/uio.h | 9 ++++++++-
 lib/iov_iter.c      | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ff81e5ccaef2..42bce38a8e87 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -163,7 +163,7 @@ static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs)
 	return ret;
 }
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
+size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
 				  size_t bytes, struct iov_iter *i);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
@@ -184,6 +184,13 @@ static inline size_t copy_folio_to_iter(struct folio *folio, size_t offset,
 {
 	return copy_page_to_iter(&folio->page, offset, bytes, i);
 }
+
+static inline size_t copy_folio_from_iter_atomic(struct folio *folio,
+		size_t offset, size_t bytes, struct iov_iter *i)
+{
+	return copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
+}
+
 size_t copy_page_to_iter_nofault(struct page *page, unsigned offset,
 				 size_t bytes, struct iov_iter *i);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c728b6e4fb18..8da566a549ad 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -566,7 +566,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
+size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
 		size_t bytes, struct iov_iter *i)
 {
 	size_t n, copied = 0;
-- 
2.39.2



* [PATCH v4 3/9] iomap: Remove large folio handling in iomap_invalidate_folio()
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
  2023-07-10 13:02 ` [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic() Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-10 13:02 ` [PATCH v4 4/9] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

We do not need to release the iomap_page in iomap_invalidate_folio()
to allow the folio to be split.  The splitting code will call
->release_folio() if there is still per-fs private data attached to
the folio.  At that point, we will check if the folio is still dirty
and decline to release the iomap_page.  It is possible to trigger the
warning in perfectly legitimate circumstances (eg if a disk read fails,
we do a partial write to the folio, then we truncate the folio), which
will cause those writes to be lost.

Fixes: 60d8231089f0 ("iomap: Support large folios in invalidatepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index adb92cdb24b0..1cb905140528 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -508,11 +508,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		WARN_ON_ONCE(folio_test_writeback(folio));
 		folio_cancel_dirty(folio);
 		iomap_page_release(folio);
-	} else if (folio_test_large(folio)) {
-		/* Must release the iop so the page can be split */
-		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
-			     folio_test_dirty(folio));
-		iomap_page_release(folio);
 	}
 }
 EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
-- 
2.39.2



* [PATCH v4 4/9] doc: Correct the description of ->release_folio
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (2 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 3/9] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-13  4:43   ` Darrick J. Wong
  2023-07-10 13:02 ` [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

The filesystem ->release_folio method is called under more circumstances
now than when the documentation was written.  The second sentence
describing the interpretation of the return value is the wrong polarity
(false indicates failure, not success).  And the third sentence is also
wrong (the kernel calls try_to_free_buffers() instead).

So replace the entire paragraph with a detailed description of what the
state of the folio may be, the meaning of the gfp parameter, why the
method is being called and what the filesystem is expected to do.
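
A minimal sketch of an implementation following that contract, assuming
the private data is a single kmalloc()ed structure (iomap_release_folio()
later in this series is a real example):

	static bool example_release_folio(struct folio *folio, gfp_t gfp)
	{
		if (!folio_test_private(folio))
			return true;		/* nothing attached */
		if (folio_test_dirty(folio))
			return false;		/* private data still needed */
		kfree(folio_detach_private(folio));
		return true;
	}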

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/locking.rst | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index ed148919e11a..ca3d24becbf5 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -374,10 +374,17 @@ invalidate_lock before invalidating page cache in truncate / hole punch
 path (and thus calling into ->invalidate_folio) to block races between page
 cache invalidation and page cache filling functions (fault, read, ...).
 
-->release_folio() is called when the kernel is about to try to drop the
-buffers from the folio in preparation for freeing it.  It returns false to
-indicate that the buffers are (or may be) freeable.  If ->release_folio is
-NULL, the kernel assumes that the fs has no private interest in the buffers.
+->release_folio() is called when the MM wants to make a change to the
+folio that would invalidate the filesystem's private data.  For example,
+it may be about to be removed from the address_space or split.  The folio
+is locked and not under writeback.  It may be dirty.  The gfp parameter
+is not usually used for allocation, but rather to indicate what the
+filesystem may do to attempt to free the private data.  The filesystem may
+return false to indicate that the folio's private data cannot be freed.
+If it returns true, it should have already removed the private data from
+the folio.  If a filesystem does not provide a ->release_folio method,
+the pagecache will assume that private data is buffer_heads and call
+try_to_free_buffers().
 
 ->free_folio() is called when the kernel has dropped the folio
 from the page cache.
-- 
2.39.2



* [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio()
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (3 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 4/9] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-13  4:45   ` Darrick J. Wong
  2023-07-10 13:02 ` [PATCH v4 6/9] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

The check for the folio being under writeback is unnecessary; the caller
has checked this and the folio is locked, so the folio cannot be under
writeback at this point.

The comment is somewhat misleading in that it talks about one specific
situation in which we can see a dirty folio.  There are others, so change
the comment to explain why we can't release the iomap_page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1cb905140528..7aa3009f907f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -483,12 +483,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
 			folio_size(folio));
 
 	/*
-	 * mm accommodates an old ext3 case where clean folios might
-	 * not have had the dirty bit cleared.  Thus, it can send actual
-	 * dirty folios to ->release_folio() via shrink_active_list();
-	 * skip those here.
+	 * If the folio is dirty, we refuse to release our metadata because
+	 * it may be partially dirty.  Once we track per-block dirty state,
+	 * we can release the metadata if every block is dirty.
 	 */
-	if (folio_test_dirty(folio) || folio_test_writeback(folio))
+	if (folio_test_dirty(folio))
 		return false;
 	iomap_page_release(folio);
 	return true;
-- 
2.39.2



* [PATCH v4 6/9] filemap: Add fgf_t typedef
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (4 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-13  4:47   ` Darrick J. Wong
  2023-07-13  5:08   ` Kent Overstreet
  2023-07-10 13:02 ` [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

Similarly to gfp_t, define fgf_t as its own type to prevent various
misuses and confusion.  Leave the flags as FGP_* for now to reduce the
size of this patch; they will be converted to FGF_* later.  Move the
documentation to the definition of the type instead of burying it in the
__filemap_get_folio() documentation.
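
A small, hypothetical illustration of the misuse this lets sparse
(make C=1) flag, which a plain int or unsigned int parameter accepted
silently:

	fgf_t fgp = FGP_LOCK | FGP_CREAT;	/* fine: fgf_t end to end */
	unsigned int n = fgp;			/* now warns: fgf_t stored in a plain int */
	fgf_t bad = 0x2;			/* now warns: bare integer used as FGP flags */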

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/file.c         |  6 +++---
 fs/f2fs/compress.c      |  2 +-
 fs/f2fs/f2fs.h          |  2 +-
 fs/iomap/buffered-io.c  |  2 +-
 include/linux/pagemap.h | 48 +++++++++++++++++++++++++++++++----------
 mm/filemap.c            | 19 ++--------------
 mm/folio-compat.c       |  2 +-
 7 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fd03e689a6be..3876fae90fc3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -876,9 +876,9 @@ static int prepare_uptodate_page(struct inode *inode,
 	return 0;
 }
 
-static unsigned int get_prepare_fgp_flags(bool nowait)
+static fgf_t get_prepare_fgp_flags(bool nowait)
 {
-	unsigned int fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
+	fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
 
 	if (nowait)
 		fgp_flags |= FGP_NOWAIT;
@@ -910,7 +910,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
 	int i;
 	unsigned long index = pos >> PAGE_SHIFT;
 	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
-	unsigned int fgp_flags = get_prepare_fgp_flags(nowait);
+	fgf_t fgp_flags = get_prepare_fgp_flags(nowait);
 	int err = 0;
 	int faili;
 
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index 236d890f560b..0f7df9c11af3 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1045,7 +1045,7 @@ static int prepare_compress_overwrite(struct compress_ctx *cc,
 	struct address_space *mapping = cc->inode->i_mapping;
 	struct page *page;
 	sector_t last_block_in_bio;
-	unsigned fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
+	fgf_t fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
 	pgoff_t start_idx = start_idx_of_cluster(cc);
 	int i, ret;
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index c7cb2177b252..c275ff2753c2 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2736,7 +2736,7 @@ static inline struct page *f2fs_grab_cache_page(struct address_space *mapping,
 
 static inline struct page *f2fs_pagecache_get_page(
 				struct address_space *mapping, pgoff_t index,
-				int fgp_flags, gfp_t gfp_mask)
+				fgf_t fgp_flags, gfp_t gfp_mask)
 {
 	if (time_to_inject(F2FS_M_SB(mapping), FAULT_PAGE_GET))
 		return NULL;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7aa3009f907f..5e9380cc3e83 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -467,7 +467,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
 {
-	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716953ee1ebd..911201fc41fc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -501,22 +501,48 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
 pgoff_t page_cache_prev_miss(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
-#define FGP_ACCESSED		0x00000001
-#define FGP_LOCK		0x00000002
-#define FGP_CREAT		0x00000004
-#define FGP_WRITE		0x00000008
-#define FGP_NOFS		0x00000010
-#define FGP_NOWAIT		0x00000020
-#define FGP_FOR_MMAP		0x00000040
-#define FGP_STABLE		0x00000080
+/**
+ * typedef fgf_t - Flags for getting folios from the page cache.
+ *
+ * Most users of the page cache will not need to use these flags;
+ * there are convenience functions such as filemap_get_folio() and
+ * filemap_lock_folio().  For users which need more control over exactly
+ * what is done with the folios, these flags to __filemap_get_folio()
+ * are available.
+ *
+ * * %FGP_ACCESSED - The folio will be marked accessed.
+ * * %FGP_LOCK - The folio is returned locked.
+ * * %FGP_CREAT - If no folio is present then a new folio is allocated,
+ *   added to the page cache and the VM's LRU list.  The folio is
+ *   returned locked.
+ * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
+ *   folio is already in cache.  If the folio was allocated, unlock it
+ *   before returning so the caller can do the same dance.
+ * * %FGP_WRITE - The folio will be written to by the caller.
+ * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
+ * * %FGP_NOWAIT - Don't block on the folio lock.
+ * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
+ *   implementation.
+ */
+typedef unsigned int __bitwise fgf_t;
+
+#define FGP_ACCESSED		((__force fgf_t)0x00000001)
+#define FGP_LOCK		((__force fgf_t)0x00000002)
+#define FGP_CREAT		((__force fgf_t)0x00000004)
+#define FGP_WRITE		((__force fgf_t)0x00000008)
+#define FGP_NOFS		((__force fgf_t)0x00000010)
+#define FGP_NOWAIT		((__force fgf_t)0x00000020)
+#define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
+#define FGP_STABLE		((__force fgf_t)0x00000080)
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgf_t fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		fgf_t fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -590,7 +616,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, fgf_t fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 9e44a49bbd74..8a669fecfd1c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1855,30 +1855,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  *
  * Looks up the page cache entry at @mapping & @index.
  *
- * @fgp_flags can be zero or more of these flags:
- *
- * * %FGP_ACCESSED - The folio will be marked accessed.
- * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_CREAT - If no page is present then a new page is allocated using
- *   @gfp and added to the page cache and the VM's LRU list.
- *   The page is returned locked and with an increased refcount.
- * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
- *   page is already in cache.  If the page was allocated, unlock it before
- *   returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written to by the caller.
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
- * * %FGP_NOWAIT - Don't get blocked by page lock.
- * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
- *
  * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
  * if the %GFP flags specified for %FGP_CREAT are atomic.
  *
- * If there is a page cache page, it is returned with an increased refcount.
+ * If this function returns a folio, it is returned with an increased refcount.
  *
  * Return: The found folio or an ERR_PTR() otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..10c3247542cb 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
-- 
2.39.2



* [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (5 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 6/9] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-10 23:58   ` Luis Chamberlain
                     ` (2 more replies)
  2023-07-10 13:02 ` [PATCH v4 8/9] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

Allow callers of __filemap_get_folio() to specify a preferred folio
order in the FGP flags.  This is only honoured in the FGP_CREAT path;
if there is already a folio in the page cache that covers the index,
we will return it, no matter what its order is.  No create-around is
attempted; we will only create folios which start at the specified index.
Unmodified callers will continue to allocate order 0 folios.
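
A minimal caller sketch, with "mapping", "pos" and "len" assumed from
the caller's context (patch 8/9 adds essentially this to
iomap_get_folio()):

	fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);
	struct folio *folio;

	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
			mapping_gfp_mask(mapping));
	if (IS_ERR(folio))
		return folio;	/* -ENOMEM, -EAGAIN, ... as an ERR_PTR */
	/*
	 * The order is only a hint: the folio may be smaller than len, or
	 * larger if one already covered the index, so use folio_pos() and
	 * folio_size() rather than assuming len bytes from pos.
	 */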

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/pagemap.h | 34 ++++++++++++++++++++++++++++++
 mm/filemap.c            | 46 +++++++++++++++++++++++++++++------------
 mm/readahead.c          | 13 ------------
 3 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 911201fc41fc..d87840acbfb2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -470,6 +470,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -535,9 +548,30 @@ typedef unsigned int __bitwise fgf_t;
 #define FGP_NOWAIT		((__force fgf_t)0x00000020)
 #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
 #define FGP_STABLE		((__force fgf_t)0x00000080)
+#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+/**
+ * fgf_set_order - Encode a length in the fgf_t flags.
+ * @size: The suggested size of the folio to create.
+ *
+ * The caller of __filemap_get_folio() can use this to suggest a preferred
+ * size for the folio that is created.  If there is already a folio at
+ * the index, it will be returned, no matter what its size.  If a folio
+ * is freshly created, it may be of a different size than requested
+ * due to alignment constraints, memory pressure, or the presence of
+ * other folios at nearby indices.
+ */
+static inline fgf_t fgf_set_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a669fecfd1c..baafbf324c9f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1905,7 +1905,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGF_GET_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1914,26 +1916,44 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (!mapping_large_folio_support(mapping))
+			order = 0;
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			gfp_t alloc_gfp = gfp;
+
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			if (order > 0)
+				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+			folio = filemap_alloc_folio(alloc_gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/readahead.c b/mm/readahead.c
index a9c999aa19af..e815c114de21 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -461,19 +461,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
-- 
2.39.2



* [PATCH v4 8/9] iomap: Create large folios in the buffered write path
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (6 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-13  4:56   ` Darrick J. Wong
  2023-07-10 13:02 ` [PATCH v4 9/9] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
  2023-07-10 22:55 ` [PATCH v4 0/9] Create large folios in iomap buffered write path Luis Chamberlain
  9 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

Use the size of the write as a hint for the size of the folio to create.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/bmap.c         | 2 +-
 fs/iomap/buffered-io.c | 6 ++++--
 include/linux/iomap.h  | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 8d611fbcf0bd..521d1e00a0ff 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5e9380cc3e83..2d3e90f4d16e 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: Suggested size of folio to create.
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgf_set_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
@@ -597,7 +599,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
-- 
2.39.2



* [PATCH v4 9/9] iomap: Copy larger chunks from userspace
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (7 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 8/9] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
@ 2023-07-10 13:02 ` Matthew Wilcox (Oracle)
  2023-07-13  4:58   ` Darrick J. Wong
  2023-07-10 22:55 ` [PATCH v4 0/9] Create large folios in iomap buffered write path Luis Chamberlain
  9 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-07-10 13:02 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Matthew Wilcox (Oracle), linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
Start at the maximum page cache size and shrink by half every time we
hit the "we are short on memory" problem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2d3e90f4d16e..f21f1f641c4a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -769,6 +769,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 {
 	loff_t length = iomap_length(iter);
+	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
 	loff_t pos = iter->pos;
 	ssize_t written = 0;
 	long status = 0;
@@ -777,15 +778,12 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 
 	do {
 		struct folio *folio;
-		struct page *page;
-		unsigned long offset;	/* Offset into pagecache page */
-		unsigned long bytes;	/* Bytes to write to page */
+		size_t offset;		/* Offset into folio */
+		size_t bytes;		/* Bytes to write to folio */
 		size_t copied;		/* Bytes copied from user */
 
-		offset = offset_in_page(pos);
-		bytes = min_t(unsigned long, PAGE_SIZE - offset,
-						iov_iter_count(i));
-again:
+		offset = pos & (chunk - 1);
+		bytes = min(chunk - offset, iov_iter_count(i));
 		status = balance_dirty_pages_ratelimited_flags(mapping,
 							       bdp_flags);
 		if (unlikely(status))
@@ -815,12 +813,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
-		page = folio_file_page(folio, pos >> PAGE_SHIFT);
-		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+		offset = offset_in_folio(folio, pos);
+		if (bytes > folio_size(folio) - offset)
+			bytes = folio_size(folio) - offset;
 
-		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_folio(folio);
 
+		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
 		status = iomap_write_end(iter, pos, bytes, copied, folio);
 
 		if (unlikely(copied != status))
@@ -836,11 +836,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 			 */
 			if (copied)
 				bytes = copied;
-			goto again;
+			if (chunk > PAGE_SIZE)
+				chunk /= 2;
+		} else {
+			pos += status;
+			written += status;
+			length -= status;
 		}
-		pos += status;
-		written += status;
-		length -= status;
 	} while (iov_iter_count(i) && length);
 
 	if (status == -EAGAIN) {
-- 
2.39.2



* Re: [PATCH v4 0/9] Create large folios in iomap buffered write path
  2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
                   ` (8 preceding siblings ...)
  2023-07-10 13:02 ` [PATCH v4 9/9] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
@ 2023-07-10 22:55 ` Luis Chamberlain
  2023-07-10 23:53   ` Matthew Wilcox
  2023-07-11  0:01   ` Dave Chinner
  9 siblings, 2 replies; 41+ messages in thread
From: Luis Chamberlain @ 2023-07-10 22:55 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet, mcgrof

On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> to improve worst-case latency.  Unfortunately, this had the effect of
> limiting the performance of:
> 
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
>         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
>         -numjobs=4 -directory=/mnt/test

When you say performance, do you mean overall throughput / IOPS /
latency or all?

And who noticed it / reported it? The above incantation seems pretty
specific, so I'm curious who runs that test and what sort of workflow
it is trying to replicate.

> The problem ends up being lock contention on the i_pages spinlock as we
> clear the writeback bit on each folio (and propagate that up through
> the tree).  By using larger folios, we decrease the number of folios
> to be processed by a factor of 256 for this benchmark, eliminating the
> lock contention.

Implied here seems to be that the associated cost of searching for a
larger folio is pretty negligible compared to the gains of finding one.
That seems nice, but it gets me wondering if there are other
benchmarks under which there are any penalties instead.

Ie, is the above a microbenchmark where this yields good results?

> It's also the right thing to do.  This is a project that has been on
> the back burner for years, it just hasn't been important enough to do
> before now.

Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in
writeback") dates back just one year, and so it gets me wondering
how a project on the back burner for years now finds motivation in
just a one-year-old regression.

What was the original motivation of the older project dating this
effort back to its inception?

  Luis


* Re: [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
@ 2023-07-10 23:43   ` Luis Chamberlain
  2023-07-11  0:03     ` Matthew Wilcox
  2023-07-11  6:30   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Luis Chamberlain @ 2023-07-10 23:43 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 02:02:45PM +0100, Matthew Wilcox (Oracle) wrote:
> copy_page_from_iter_atomic() already handles !highmem compound
> pages correctly, but if we are passed a highmem compound page,
> each base page needs to be mapped & unmapped individually.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  lib/iov_iter.c | 43 ++++++++++++++++++++++++++++---------------
>  1 file changed, 28 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index b667b1e2f688..c728b6e4fb18 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -566,24 +566,37 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_zero);
>  
> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
> -				  struct iov_iter *i)
> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +		size_t bytes, struct iov_iter *i)
>  {
> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
> -	if (!page_copy_sane(page, offset, bytes)) {
> -		kunmap_atomic(kaddr);
> +	size_t n, copied = 0;
> +
> +	if (!page_copy_sane(page, offset, bytes))
>  		return 0;
> -	}
> -	if (WARN_ON_ONCE(!i->data_source)) {
> -		kunmap_atomic(kaddr);
> +	if (WARN_ON_ONCE(!i->data_source))
>  		return 0;

To make it easier to review, splitting out the deferral of the
kmap_atomic() and the removal of the error unwinding as separate
patches would be nice.

> -	}
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyin(p + off, base, len),
> -		memcpy_from_iter(i, p + off, base, len)
> -	)
> -	kunmap_atomic(kaddr);
> -	return bytes;
> +
> +	do {
> +		char *p;
> +
> +		n = bytes - copied;
> +		if (PageHighMem(page)) {
> +			page += offset / PAGE_SIZE;

I don't quite follow how, before, the page was not modified to get to
the first kmap_atomic(page), and now, if we're on a PageHighMem(page),
we immediately do some arithmetic on the page pointer to get the first
kmap_atomic(). The only thing I could think of is that there seems to
be an implicit assumption here that if it's a compound highmem page
then we always start off with an offset of 0, is that right? But that
doesn't seem correct either.

  Luis


* Re: [PATCH v4 0/9] Create large folios in iomap buffered write path
  2023-07-10 22:55 ` [PATCH v4 0/9] Create large folios in iomap buffered write path Luis Chamberlain
@ 2023-07-10 23:53   ` Matthew Wilcox
  2023-07-11  0:01   ` Dave Chinner
  1 sibling, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-10 23:53 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency.  Unfortunately, this had the effect of
> > limiting the performance of:
> > 
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >         -numjobs=4 -directory=/mnt/test
> 
> When you say performance, do you mean overall throughput / IOPS /
> latency or all?

This is buffered I/O, so when we run out of RAM, we block until the
dirty pages are written back.  I suppose that makes it throughput, but
it's throughput from the bottom of the page cache to storage, not the
usual app-to-page-cache bottleneck.

> And who noticed it / reported it? The above incantation seems pretty
> specific, so I'm curious who runs that test and what sort of workflow
> it is trying to replicate.

Wang Yugui, who is on the cc, reported it.
https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree).  By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
> 
> Implied here seems to be that the associated cost of searching for a
> larger folio is pretty negligible compared to the gains of finding one.
> That seems nice, but it gets me wondering if there are other
> benchmarks under which there are any penalties instead.
> 
> Ie, is the above a microbenchmark where this yields good results?

It happens to be constructed in such a way that it yields the optimum
outcome in this case, but that clearly wasn't deliberate.  And the
solution is the one I had in mind from before the bug report came in.

I don't think you'll be able to find a benchmark that regresses as
a result of this, but I welcome your attempt!

> > It's also the right thing to do.  This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
> 
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in
> writeback") dates back just one year, and so it gets me wondering
> how a project on the back burner for years now finds motivation in
> just a one-year-old regression.
> 
> What was the original motivation of the older project dating this
> effort back to its inception?

We should create larger folios on write.  It just wasn't important
enough to do before now.  But now there's an actual user who cares.


* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-10 13:02 ` [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
@ 2023-07-10 23:58   ` Luis Chamberlain
  2023-07-11  0:07     ` Matthew Wilcox
  2023-07-11  0:13     ` Kent Overstreet
  2023-07-13  4:50   ` Darrick J. Wong
  2023-07-13  5:04   ` Kent Overstreet
  2 siblings, 2 replies; 41+ messages in thread
From: Luis Chamberlain @ 2023-07-10 23:58 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:51PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers of __filemap_get_folio() to specify a preferred folio
> order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> if there is already a folio in the page cache that covers the index,
> we will return it, no matter what its order is.  No create-around is
> attempted; we will only create folios which start at the specified index.
> Unmodified callers will continue to allocate order 0 folios.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/pagemap.h | 34 ++++++++++++++++++++++++++++++
>  mm/filemap.c            | 46 +++++++++++++++++++++++++++++------------
>  mm/readahead.c          | 13 ------------
>  3 files changed, 67 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 911201fc41fc..d87840acbfb2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -470,6 +470,19 @@ static inline void *detach_page_private(struct page *page)
>  	return folio_detach_private(page_folio(page));
>  }
>  
> +/*
> + * There are some parts of the kernel which assume that PMD entries
> + * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> + * limit the maximum allocation order to PMD size.  I'm not aware of any
> + * assumptions about maximum order if THP are disabled, but 8 seems like
> + * a good order (that's 1MB if you're using 4kB pages)
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#else
> +#define MAX_PAGECACHE_ORDER	8
> +#endif
> +
>  #ifdef CONFIG_NUMA
>  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
>  #else
> @@ -535,9 +548,30 @@ typedef unsigned int __bitwise fgf_t;
>  #define FGP_NOWAIT		((__force fgf_t)0x00000020)
>  #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
>  #define FGP_STABLE		((__force fgf_t)0x00000080)
> +#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
>  
>  #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>  
> +/**
> + * fgf_set_order - Encode a length in the fgf_t flags.
> + * @size: The suggested size of the folio to create.
> + *
> + * The caller of __filemap_get_folio() can use this to suggest a preferred
> + * size for the folio that is created.  If there is already a folio at
> + * the index, it will be returned, no matter what its size.  If a folio
> + * is freshly created, it may be of a different size than requested
> + * due to alignment constraints, memory pressure, or the presence of
> + * other folios at nearby indices.
> + */
> +static inline fgf_t fgf_set_order(size_t size)
> +{
> +	unsigned int shift = ilog2(size);
> +
> +	if (shift <= PAGE_SHIFT)
> +		return 0;
> +	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
> +}
> +
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		fgf_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a669fecfd1c..baafbf324c9f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1905,7 +1905,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> +		unsigned order = FGF_GET_ORDER(fgp_flags);
>  		int err;
> +
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
>  		if (fgp_flags & FGP_NOFS)
> @@ -1914,26 +1916,44 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  			gfp &= ~GFP_KERNEL;
>  			gfp |= GFP_NOWAIT | __GFP_NOWARN;
>  		}
> -
> -		folio = filemap_alloc_folio(gfp, 0);
> -		if (!folio)
> -			return ERR_PTR(-ENOMEM);
> -
>  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
>  			fgp_flags |= FGP_LOCK;
>  
> -		/* Init accessed so avoid atomic mark_page_accessed later */
> -		if (fgp_flags & FGP_ACCESSED)
> -			__folio_set_referenced(folio);
> +		if (!mapping_large_folio_support(mapping))
> +			order = 0;
> +		if (order > MAX_PAGECACHE_ORDER)
> +			order = MAX_PAGECACHE_ORDER;

Curious how this ended up being the heuristic: shoot for
MAX_PAGECACHE_ORDER first, and then go down by half each time. I don't
see it explained in the commit log, but I'm sure there has to be some
reasonable rationale. From the cover letter, I could guess that the
gains of always using the largest folio possible imply latency savings
through other means, so the small latency spent looking is nowhere near
the savings from using it. But I'd rather understand the rationale a
bit more.

Are there situations where perhaps this initial maximum preferred folio
order should be smaller than MAX_PAGECACHE_ORDER? If not, how do we
know?

  Luis


* Re: [PATCH v4 0/9] Create large folios in iomap buffered write path
  2023-07-10 22:55 ` [PATCH v4 0/9] Create large folios in iomap buffered write path Luis Chamberlain
  2023-07-10 23:53   ` Matthew Wilcox
@ 2023-07-11  0:01   ` Dave Chinner
  1 sibling, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2023-07-11  0:01 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox (Oracle), linux-fsdevel, linux-xfs, Wang Yugui,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency.  Unfortunately, this had the effect of
> > limiting the performance of:
> > 
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >         -numjobs=4 -directory=/mnt/test
> 
> When you say performance, do you mean overall throughput / IOPS /
> latency or all?
> 
> And who noticed it / reported it?

https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/

> The above incantation seems pretty
> specific, so I'm curious who runs that test and what sort of workflow
> it is trying to replicate.

Not specific at all. It's just a basic concurrent sequential
buffered write performance test. It needs multiple jobs to max out
typical cheap pcie 4.0 NVMe SSD storage (i.e. 6-8GB/s) because
sequential non-async buffered writes are CPU bound at (typically)
2-3GB/s per file write.

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree).  By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
> 
> The implication here seems to be that the associated cost of searching
> for a larger folio is pretty negligible compared to the gains of finding
> one. That seems nice, but it gets me wondering if there are other
> benchmarks under which there are penalties instead.
> 
> Ie, is the above a microbenchmark where this yields good results?

No, the workload gains are general - they avoid the lock contention
problems involved with cycling, accounting and state changes for
millions of objects (order-0 folios) a second through a single
exclusive lock (mapping tree lock) by reducing the mapping tree lock
cycling by a couple of orders of magnitude.

> > It's also the right thing to do.  This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
> 
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in
> writeback") dates back just one year, and so it gets me wondering
> how a project on the back burner for years now finds motivation for
> just a one-year-old regression.

The change in ebb7fb1557b1 is just the straw that broke the camel's
back. It got rid of the massive IO batch processing we used to
minimise the inherent cross-process mapping tree contention in
buffered writes. i.e. the process doing write() calls, multiple
kswapds doing memory reclaim, writeback submission and writeback
completion all contend at the same time for the mapping tree lock.

We largely removed the IO submission and completion from the picture
with huge batch processing, but that then started causing latency
problems with IO completion processing. So we went back to smaller
chunks of IO submission and completion, and that means we went from
3 threads contending on the mapping tree lock to 5 threads. And that
drove the system into catastrophic lock breakdown on the mapping
tree lock.

And so -everything- then went really slow because each write() task
burns down at least 5 CPUs on the mapping tree lock....

This is not an XFS issue to solve - this is a general page cache
problem and we've always wanted to fix it in the page cache, either
by large folio support or by large block size support that required
aligned high-order pages in the page cache. Same solution - fewer
pages to iterate - but different methods...

> What was the original motivation for the older project that dates this
> effort back to its inception?

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

That was run on a 64kB page size machine (itanic), but there were
signs that the mapping tree lock would be an issue in future.
Indeed, when these large SSI supercomputers moved to x86-64 (down to
4kB page size) a couple of years later, the mapping tree lock popped
up to the top of the list of buffered write throughput limiting
factors.

i.e. the larger the NUMA distances between the workload doing the
write and the node the mapping tree lock is located on, the slower
buffered writes go and the more CPU we burnt on the mapping tree
lock. We carefully worked around that with cpusets and locality
control, and the same was then done in HPC environments on these
types of machines and hence it wasn't an immediate limiting
factor...

But we're talking about multi-million dollar supercomputers here,
and in most cases people just rewrote the apps to use direct IO and
so the problem just went away and the HPC apps could easily use all
the performance the storage provided....

IOWs, we've known about this problem for over 15 years, but the
difference is now that consumer hardware is capable of >10GB/s write
speeds (current PCIe 5.0 NVMe SSDs) for just a couple of hundred
dollars, rather than multi-million dollar machines in carefully
optimised environments that we first saw it on back in 2006...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-10 23:43   ` Luis Chamberlain
@ 2023-07-11  0:03     ` Matthew Wilcox
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-11  0:03 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 04:43:35PM -0700, Luis Chamberlain wrote:
> > -	}
> > -	iterate_and_advance(i, bytes, base, len, off,
> > -		copyin(p + off, base, len),
> > -		memcpy_from_iter(i, p + off, base, len)
> > -	)
> > -	kunmap_atomic(kaddr);
> > -	return bytes;
> > +
> > +	do {
> > +		char *p;
> > +
> > +		n = bytes - copied;
> > +		if (PageHighMem(page)) {
> > +			page += offset / PAGE_SIZE;
> 
> I don't quite follow how, before, the page was not modified to get
> to the first kmap_atomic(page), yet now, if we're on a
> PageHighMem(page), we immediately do some arithmetic on the page
> address to get the first kmap_atomic(). The only thing I could
> think of is that it seems like there's an implicit assumption here
> that if it's a compound highmem page then we always start off with
> an offset value of 0, is that right? But that seems not to be
> correct either.

The important thing to know is that it was buggy before.  If you called
copy_page_from_iter_atomic() with an offset larger than PAGE_SIZE, it
corrupted random memory!  I can only assume that nobody was doing that.
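
To make that concrete, here is a worked trace of the new loop (not from
the patch, just following the code quoted above with made-up numbers,
assuming an order-2 highmem compound page):

	/* offset = PAGE_SIZE + 100, bytes = 2 * PAGE_SIZE, copied = 0 */
	/* pass 1: page += 1, offset = 100, n = PAGE_SIZE - 100;
	 *         copied = PAGE_SIZE - 100, offset becomes PAGE_SIZE     */
	/* pass 2: page += 1 (base page 2), offset = 0, n = PAGE_SIZE;
	 *         copied = 2 * PAGE_SIZE - 100, offset becomes PAGE_SIZE */
	/* pass 3: page += 1 (base page 3), offset = 0, n = 100;
	 *         copied == bytes, so the loop terminates                */

So a non-zero starting offset is fine; each base page is mapped and
unmapped individually, which is exactly what the old code got wrong.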

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-10 23:58   ` Luis Chamberlain
@ 2023-07-11  0:07     ` Matthew Wilcox
  2023-07-11  0:21       ` Luis Chamberlain
  2023-07-11  0:13     ` Kent Overstreet
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-11  0:07 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

On Mon, Jul 10, 2023 at 04:58:55PM -0700, Luis Chamberlain wrote:
> > @@ -1914,26 +1916,44 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> >  			gfp &= ~GFP_KERNEL;
> >  			gfp |= GFP_NOWAIT | __GFP_NOWARN;
> >  		}
> > -
> > -		folio = filemap_alloc_folio(gfp, 0);
> > -		if (!folio)
> > -			return ERR_PTR(-ENOMEM);
> > -
> >  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
> >  			fgp_flags |= FGP_LOCK;
> >  
> > -		/* Init accessed so avoid atomic mark_page_accessed later */
> > -		if (fgp_flags & FGP_ACCESSED)
> > -			__folio_set_referenced(folio);
> > +		if (!mapping_large_folio_support(mapping))
> > +			order = 0;
> > +		if (order > MAX_PAGECACHE_ORDER)
> > +			order = MAX_PAGECACHE_ORDER;
> 
> Curious how this ended up being the heuristic used: shoot for the
> MAX_PAGECACHE_ORDER sky first, and then go down by half each time. I
> don't see it explained in the commit log but I'm sure there has to be
> some reasonable rationale. From the cover letter, I could guess that
> the gains of always using the largest folio possible imply latency
> savings through other means, so the small latency spent searching is
> nowhere near the savings from using it. But I'd rather understand the
> rationale a bit more.

You've completely misunderstood this patch.  The caller hints at the
folio size it wants, and this code you're highlighting limits it to be
less than MAX_PAGECACHE_ORDER.
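
For the avoidance of doubt, a minimal sketch of what a caller looks like
(this is essentially what patch 8 does in iomap_get_folio(); the names
here are only illustrative):

	fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);

	folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				    mapping_gfp_mask(mapping));

The filesystem suggests an order derived from the write length; the hunk
you quoted only clamps that suggestion to what the mapping and
MAX_PAGECACHE_ORDER allow.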


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-10 23:58   ` Luis Chamberlain
  2023-07-11  0:07     ` Matthew Wilcox
@ 2023-07-11  0:13     ` Kent Overstreet
  1 sibling, 0 replies; 41+ messages in thread
From: Kent Overstreet @ 2023-07-11  0:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox (Oracle), linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Darrick J . Wong,
	Christoph Hellwig

On Mon, Jul 10, 2023 at 04:58:55PM -0700, Luis Chamberlain wrote:
> Curious how this ended up being the heuristic used: shoot for the
> MAX_PAGECACHE_ORDER sky first, and then go down by half each time. I
> don't see it explained in the commit log but I'm sure there has to be
> some reasonable rationale. From the cover letter, I could guess that
> the gains of always using the largest folio possible imply latency
> savings through other means, so the small latency spent searching is
> nowhere near the savings from using it. But I'd rather understand the
> rationale a bit more.
> 
> Are there situations where perhaps limiting this initial max preferred
> high-order folio to something smaller than MAX_PAGECACHE_ORDER might be
> desirable? And if not, how do we know?

It's the exact same situation as when filesystems all switched from
blocks to extents 10-20 years ago. We _always_ want our allocations to
be as big as possible - it reduces metadata overhead, preserves
locality, reduces fragmentation - IOW, this is clearly the right thing
to do.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-11  0:07     ` Matthew Wilcox
@ 2023-07-11  0:21       ` Luis Chamberlain
  2023-07-11  0:42         ` Matthew Wilcox
  2023-07-11  0:47         ` Dave Chinner
  0 siblings, 2 replies; 41+ messages in thread
From: Luis Chamberlain @ 2023-07-11  0:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

On Tue, Jul 11, 2023 at 01:07:37AM +0100, Matthew Wilcox wrote:
> The caller hints at the
> folio size it wants, and this code you're highlighting limits it to be
> less than MAX_PAGECACHE_ORDER.

Yes sorry, if the write size is large we still max out at MAX_PAGECACHE_ORDER
naturally. I don't doubt the rationale that this is a good idea.

What I'm curious about is why we are OK to jump straight to MAX_PAGECACHE_ORDER
from the start, and whether there are situations where perhaps a
not-so-aggressive high order may be desirable. How do we know?

  Luis

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-11  0:21       ` Luis Chamberlain
@ 2023-07-11  0:42         ` Matthew Wilcox
  2023-07-11  0:47         ` Dave Chinner
  1 sibling, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-11  0:42 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

On Mon, Jul 10, 2023 at 05:21:04PM -0700, Luis Chamberlain wrote:
> On Tue, Jul 11, 2023 at 01:07:37AM +0100, Matthew Wilcox wrote:
> > The caller hints at the
> > folio size it wants, and this code you're highlighting limits it to be
> > less than MAX_PAGECACHE_ORDER.
> 
> Yes sorry, if the write size is large we still max out at MAX_PAGECACHE_ORDER
> naturally. I don't doubt the rationale that this is a good idea.
> 
> What I'm curious about is why we are OK to jump straight to MAX_PAGECACHE_ORDER
> from the start, and whether there are situations where perhaps a
> not-so-aggressive high order may be desirable. How do we know?

The page cache trusts the filesystem to make a good guess.  Dave had some
ideas about how XFS might make different guesses from the one in this
patchset, and I encourage people to experiment with different algorithms.

Intuitively "size of the write" is probably not a bad guess.  If userspace
is writing 400kB in a single write, it proabbly makes sense to writeback
& age and eventually reclaim that entire 400kB at the same time.  If we
guess wrong, the downside probably isn't too bad either.

But we need data!  And merging this patch set & gathering real world
experience with it is a good start.
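
As a rough illustration of what that guess turns into (assuming 4kB
pages; the numbers are only an example):

	/* a 400kB write */
	fgf_t fgp = fgf_set_order(400 * 1024);
	/* ilog2(400kB) == 18, so this encodes order 18 - PAGE_SHIFT == 6:
	 * __filemap_get_folio() will first try a 256kB folio and fall
	 * back to smaller orders on misalignment or allocation failure.
	 */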

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-11  0:21       ` Luis Chamberlain
  2023-07-11  0:42         ` Matthew Wilcox
@ 2023-07-11  0:47         ` Dave Chinner
  1 sibling, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2023-07-11  0:47 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Matthew Wilcox, linux-fsdevel, linux-xfs, Wang Yugui,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet,
	Christoph Hellwig

On Mon, Jul 10, 2023 at 05:21:04PM -0700, Luis Chamberlain wrote:
> On Tue, Jul 11, 2023 at 01:07:37AM +0100, Matthew Wilcox wrote:
> > The caller hints at the
> > folio size it wants, and this code you're highlighting limits it to be
> > less than MAX_PAGECACHE_ORDER.
> 
> Yes sorry, if the write size is large we still max out at MAX_PAGECACHE_ORDER
> naturally. I don't doubt the rationale that this is a good idea.
> 
> What I'm curious about is why we are OK to jump straight to MAX_PAGECACHE_ORDER
> from the start, and whether there are situations where perhaps a
> not-so-aggressive high order may be desirable. How do we know?

Decades of experience optimising IO and allocation on extent based
filesystems tell us this is the right thing to do. i.e. we should
always attempt to allocate the largest contiguous range possible for
the current operation being performed. This almost always results in
the minimum amount of metadata and overhead to index and manage the
data being stored.

There are relatively few situations where this is not the right
thing to do, but that is where per-inode behavioural control options
like extent size hints come into play....
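
(For anyone unfamiliar with extent size hints: they are a per-inode knob
set from userspace. A minimal sketch, assuming the standard
FS_IOC_FSSETXATTR interface from <linux/fs.h>; error handling trimmed:)

	#include <sys/ioctl.h>
	#include <linux/fs.h>

	static int set_extsize_hint(int fd, unsigned int bytes)
	{
		struct fsxattr fsx;

		if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
			return -1;
		fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;	/* enable the hint */
		fsx.fsx_extsize = bytes;		/* e.g. 1MiB */
		return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
	}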

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
  2023-07-10 23:43   ` Luis Chamberlain
@ 2023-07-11  6:30   ` Christoph Hellwig
  2023-07-11 20:05   ` Kent Overstreet
  2023-07-13  4:42   ` Darrick J. Wong
  3 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-07-11  6:30 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 02:02:45PM +0100, Matthew Wilcox (Oracle) wrote:
> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +		size_t bytes, struct iov_iter *i)
>  {
> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
> -	if (!page_copy_sane(page, offset, bytes)) {
> -		kunmap_atomic(kaddr);
> +	size_t n, copied = 0;
> +
> +	if (!page_copy_sane(page, offset, bytes))
>  		return 0;
> -	}
> -	if (WARN_ON_ONCE(!i->data_source)) {
> -		kunmap_atomic(kaddr);
> +	if (WARN_ON_ONCE(!i->data_source))
>  		return 0;
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyin(p + off, base, len),
> -		memcpy_from_iter(i, p + off, base, len)
> -	)
> -	kunmap_atomic(kaddr);

I have to agree with Luis that just moving the odd early kmap in
the existing code is probably worth its own patch just to keep
the thing readable.  I'm not going to ask for a resend just because
of that, but if we need a respin it would be nice to split it out.

> +
> +	do {
> +		char *p;
> +
> +		n = bytes - copied;
> +		if (PageHighMem(page)) {
> +			page += offset / PAGE_SIZE;
> +			offset %= PAGE_SIZE;
> +			n = min_t(size_t, n, PAGE_SIZE - offset);
> +		}
> +
> +		p = kmap_atomic(page) + offset;
> +		iterate_and_advance(i, n, base, len, off,
> +			copyin(p + off, base, len),
> +			memcpy_from_iter(i, p + off, base, len)
> +		)
> +		kunmap_atomic(p);
> +		copied += n;
> +		offset += n;
> +	} while (PageHighMem(page) && copied != bytes && n > 0);

This read a little odd to me at first, but doing arithmetic on the page
should never move it from highmem to non-highmem, so I think it's fine.
Would be a lot more readable if it was passed a folio, but I guess we
can do that later.

Otherwise looks good to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic()
  2023-07-10 13:02 ` [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic() Matthew Wilcox (Oracle)
@ 2023-07-11  6:31   ` Christoph Hellwig
  2023-07-11 20:46     ` Matthew Wilcox
  2023-07-24 15:56   ` Darrick J. Wong
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-07-11  6:31 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 02:02:46PM +0100, Matthew Wilcox (Oracle) wrote:
> Add a folio wrapper around copy_page_from_iter_atomic().

As mentioned in the previous patch it would probably be a lot more
obvious if the folio version was the base implementation and the
page variant the wrapper.  But I'm not going to delay the series for
it.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
  2023-07-10 23:43   ` Luis Chamberlain
  2023-07-11  6:30   ` Christoph Hellwig
@ 2023-07-11 20:05   ` Kent Overstreet
  2023-07-13  4:42   ` Darrick J. Wong
  3 siblings, 0 replies; 41+ messages in thread
From: Kent Overstreet @ 2023-07-11 20:05 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong

On Mon, Jul 10, 2023 at 02:02:45PM +0100, Matthew Wilcox (Oracle) wrote:
> copy_page_from_iter_atomic() already handles !highmem compound
> pages correctly, but if we are passed a highmem compound page,
> each base page needs to be mapped & unmapped individually.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

I just pulled this into the bcachefs tree; in a week or two if you
haven't heard anything you can add my Tested-by:

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic()
  2023-07-11  6:31   ` Christoph Hellwig
@ 2023-07-11 20:46     ` Matthew Wilcox
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-11 20:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Darrick J . Wong, Kent Overstreet

On Mon, Jul 10, 2023 at 11:31:16PM -0700, Christoph Hellwig wrote:
> On Mon, Jul 10, 2023 at 02:02:46PM +0100, Matthew Wilcox (Oracle) wrote:
> > Add a folio wrapper around copy_page_from_iter_atomic().
> 
> As mentioned in the previous patch it would probably be a lot more
> obvious if the folio version was the base implementation and the
> page variant the wrapper.  But I'm not going to delay the series for
> it.

My plan for that is:

 - Add copy_folio_from_iter() wrapper
 - Convert all users
 - Convert all users of copy_page_to_iter()
 - Convert copy_page_to_iter_nofault() to copy_folio_to_iter_nofault()
   (it has one user)
 - Convert page_copy_sane() to folio_copy_sane()

But this is pretty low priority compared to all the other page->folio
conversions that need to happen.  So many filesystems ...
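
For reference, the wrapper patch 2 adds is presumably along these lines
(a sketch, since the patch body isn't quoted in this subthread):

	static inline size_t copy_folio_from_iter_atomic(struct folio *folio,
			size_t offset, size_t bytes, struct iov_iter *i)
	{
		return copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
	}

which is why inverting it later (folio version as the base
implementation) is mechanical for the helper itself but still a
tree-wide conversion of the callers.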

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
                     ` (2 preceding siblings ...)
  2023-07-11 20:05   ` Kent Overstreet
@ 2023-07-13  4:42   ` Darrick J. Wong
  2023-07-13 13:46     ` Matthew Wilcox
  3 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:42 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet

On Mon, Jul 10, 2023 at 02:02:45PM +0100, Matthew Wilcox (Oracle) wrote:
> copy_page_from_iter_atomic() already handles !highmem compound
> pages correctly, but if we are passed a highmem compound page,
> each base page needs to be mapped & unmapped individually.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Yikes.  Does this want a fixes tag?

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  lib/iov_iter.c | 43 ++++++++++++++++++++++++++++---------------
>  1 file changed, 28 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index b667b1e2f688..c728b6e4fb18 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -566,24 +566,37 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_zero);
>  
> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
> -				  struct iov_iter *i)
> +size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +		size_t bytes, struct iov_iter *i)
>  {
> -	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
> -	if (!page_copy_sane(page, offset, bytes)) {
> -		kunmap_atomic(kaddr);
> +	size_t n, copied = 0;
> +
> +	if (!page_copy_sane(page, offset, bytes))
>  		return 0;
> -	}
> -	if (WARN_ON_ONCE(!i->data_source)) {
> -		kunmap_atomic(kaddr);
> +	if (WARN_ON_ONCE(!i->data_source))
>  		return 0;
> -	}
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyin(p + off, base, len),
> -		memcpy_from_iter(i, p + off, base, len)
> -	)
> -	kunmap_atomic(kaddr);
> -	return bytes;
> +
> +	do {
> +		char *p;
> +
> +		n = bytes - copied;
> +		if (PageHighMem(page)) {
> +			page += offset / PAGE_SIZE;
> +			offset %= PAGE_SIZE;
> +			n = min_t(size_t, n, PAGE_SIZE - offset);
> +		}
> +
> +		p = kmap_atomic(page) + offset;
> +		iterate_and_advance(i, n, base, len, off,
> +			copyin(p + off, base, len),
> +			memcpy_from_iter(i, p + off, base, len)
> +		)
> +		kunmap_atomic(p);
> +		copied += n;
> +		offset += n;
> +	} while (PageHighMem(page) && copied != bytes && n > 0);
> +
> +	return copied;
>  }
>  EXPORT_SYMBOL(copy_page_from_iter_atomic);
>  
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 4/9] doc: Correct the description of ->release_folio
  2023-07-10 13:02 ` [PATCH v4 4/9] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
@ 2023-07-13  4:43   ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:43 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:48PM +0100, Matthew Wilcox (Oracle) wrote:
> The filesystem ->release_folio method is called under more circumstances
> now than when the documentation was written.  The second sentence
> describing the interpretation of the return value is the wrong polarity
> (false indicates failure, not success).  And the third sentence is also
> wrong (the kernel calls try_to_free_buffers() instead).
> 
> So replace the entire paragraph with a detailed description of what the
> state of the folio may be, the meaning of the gfp parameter, why the
> method is being called and what the filesystem is expected to do.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thank you!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  Documentation/filesystems/locking.rst | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index ed148919e11a..ca3d24becbf5 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -374,10 +374,17 @@ invalidate_lock before invalidating page cache in truncate / hole punch
>  path (and thus calling into ->invalidate_folio) to block races between page
>  cache invalidation and page cache filling functions (fault, read, ...).
>  
> -->release_folio() is called when the kernel is about to try to drop the
> -buffers from the folio in preparation for freeing it.  It returns false to
> -indicate that the buffers are (or may be) freeable.  If ->release_folio is
> -NULL, the kernel assumes that the fs has no private interest in the buffers.
> +->release_folio() is called when the MM wants to make a change to the
> +folio that would invalidate the filesystem's private data.  For example,
> +it may be about to be removed from the address_space or split.  The folio
> +is locked and not under writeback.  It may be dirty.  The gfp parameter
> +is not usually used for allocation, but rather to indicate what the
> +filesystem may do to attempt to free the private data.  The filesystem may
> +return false to indicate that the folio's private data cannot be freed.
> +If it returns true, it should have already removed the private data from
> +the folio.  If a filesystem does not provide a ->release_folio method,
> +the pagecache will assume that private data is buffer_heads and call
> +try_to_free_buffers().
>  
>  ->free_folio() is called when the kernel has dropped the folio
>  from the page cache.
> -- 
> 2.39.2
> 
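
For readers following along, a minimal sketch of a ->release_folio that
obeys the contract described above (example_release_folio is a
hypothetical name, and it assumes the private data was kmalloc'ed;
iomap_release_folio() in patch 5 is the real thing):

	static bool example_release_folio(struct folio *folio, gfp_t gfp)
	{
		/* The folio is locked and not under writeback here. */
		if (folio_test_dirty(folio))
			return false;	/* can't free the private data yet */
		kfree(folio_detach_private(folio));
		return true;
	}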

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio()
  2023-07-10 13:02 ` [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
@ 2023-07-13  4:45   ` Darrick J. Wong
  2023-07-13  5:25     ` Ritesh Harjani
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:45 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Ritesh Harjani (IBM)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

[add ritesh]

On Mon, Jul 10, 2023 at 02:02:49PM +0100, Matthew Wilcox (Oracle) wrote:
> The check for the folio being under writeback is unnecessary; the caller
> has checked this and the folio is locked, so the folio cannot be under
> writeback at this point.
> 
> The comment is somewhat misleading in that it talks about one specific
> situation in which we can see a dirty folio.  There are others, so change
> the comment to explain why we can't release the iomap_page.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 1cb905140528..7aa3009f907f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -483,12 +483,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>  			folio_size(folio));
>  
>  	/*
> -	 * mm accommodates an old ext3 case where clean folios might
> -	 * not have had the dirty bit cleared.  Thus, it can send actual
> -	 * dirty folios to ->release_folio() via shrink_active_list();
> -	 * skip those here.
> +	 * If the folio is dirty, we refuse to release our metadata because
> +	 * it may be partially dirty.  Once we track per-block dirty state,
> +	 * we can release the metadata if every block is dirty.

Ritesh: I'm assuming that implementing this will be part of your v12 series?

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	 */
> -	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> +	if (folio_test_dirty(folio))
>  		return false;
>  	iomap_page_release(folio);
>  	return true;
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 6/9] filemap: Add fgf_t typedef
  2023-07-10 13:02 ` [PATCH v4 6/9] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
@ 2023-07-13  4:47   ` Darrick J. Wong
  2023-07-13  5:08   ` Kent Overstreet
  1 sibling, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:47 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:50PM +0100, Matthew Wilcox (Oracle) wrote:
> Similarly to gfp_t, define fgf_t as its own type to prevent various
> misuses and confusion.  Leave the flags as FGP_* for now to reduce the
> size of this patch; they will be converted to FGF_* later.  Move the
> documentation to the definition of the type instead of burying it in the
> __filemap_get_folio() documentation.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Yay!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/btrfs/file.c         |  6 +++---
>  fs/f2fs/compress.c      |  2 +-
>  fs/f2fs/f2fs.h          |  2 +-
>  fs/iomap/buffered-io.c  |  2 +-
>  include/linux/pagemap.h | 48 +++++++++++++++++++++++++++++++----------
>  mm/filemap.c            | 19 ++--------------
>  mm/folio-compat.c       |  2 +-
>  7 files changed, 46 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index fd03e689a6be..3876fae90fc3 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -876,9 +876,9 @@ static int prepare_uptodate_page(struct inode *inode,
>  	return 0;
>  }
>  
> -static unsigned int get_prepare_fgp_flags(bool nowait)
> +static fgf_t get_prepare_fgp_flags(bool nowait)
>  {
> -	unsigned int fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
> +	fgf_t fgp_flags = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
>  
>  	if (nowait)
>  		fgp_flags |= FGP_NOWAIT;
> @@ -910,7 +910,7 @@ static noinline int prepare_pages(struct inode *inode, struct page **pages,
>  	int i;
>  	unsigned long index = pos >> PAGE_SHIFT;
>  	gfp_t mask = get_prepare_gfp_flags(inode, nowait);
> -	unsigned int fgp_flags = get_prepare_fgp_flags(nowait);
> +	fgf_t fgp_flags = get_prepare_fgp_flags(nowait);
>  	int err = 0;
>  	int faili;
>  
> diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
> index 236d890f560b..0f7df9c11af3 100644
> --- a/fs/f2fs/compress.c
> +++ b/fs/f2fs/compress.c
> @@ -1045,7 +1045,7 @@ static int prepare_compress_overwrite(struct compress_ctx *cc,
>  	struct address_space *mapping = cc->inode->i_mapping;
>  	struct page *page;
>  	sector_t last_block_in_bio;
> -	unsigned fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
> +	fgf_t fgp_flag = FGP_LOCK | FGP_WRITE | FGP_CREAT;
>  	pgoff_t start_idx = start_idx_of_cluster(cc);
>  	int i, ret;
>  
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index c7cb2177b252..c275ff2753c2 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -2736,7 +2736,7 @@ static inline struct page *f2fs_grab_cache_page(struct address_space *mapping,
>  
>  static inline struct page *f2fs_pagecache_get_page(
>  				struct address_space *mapping, pgoff_t index,
> -				int fgp_flags, gfp_t gfp_mask)
> +				fgf_t fgp_flags, gfp_t gfp_mask)
>  {
>  	if (time_to_inject(F2FS_M_SB(mapping), FAULT_PAGE_GET))
>  		return NULL;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7aa3009f907f..5e9380cc3e83 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -467,7 +467,7 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>   */
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
>  {
> -	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> +	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 716953ee1ebd..911201fc41fc 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -501,22 +501,48 @@ pgoff_t page_cache_next_miss(struct address_space *mapping,
>  pgoff_t page_cache_prev_miss(struct address_space *mapping,
>  			     pgoff_t index, unsigned long max_scan);
>  
> -#define FGP_ACCESSED		0x00000001
> -#define FGP_LOCK		0x00000002
> -#define FGP_CREAT		0x00000004
> -#define FGP_WRITE		0x00000008
> -#define FGP_NOFS		0x00000010
> -#define FGP_NOWAIT		0x00000020
> -#define FGP_FOR_MMAP		0x00000040
> -#define FGP_STABLE		0x00000080
> +/**
> + * typedef fgf_t - Flags for getting folios from the page cache.
> + *
> + * Most users of the page cache will not need to use these flags;
> + * there are convenience functions such as filemap_get_folio() and
> + * filemap_lock_folio().  For users which need more control over exactly
> + * what is done with the folios, these flags to __filemap_get_folio()
> + * are available.
> + *
> + * * %FGP_ACCESSED - The folio will be marked accessed.
> + * * %FGP_LOCK - The folio is returned locked.
> + * * %FGP_CREAT - If no folio is present then a new folio is allocated,
> + *   added to the page cache and the VM's LRU list.  The folio is
> + *   returned locked.
> + * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> + *   folio is already in cache.  If the folio was allocated, unlock it
> + *   before returning so the caller can do the same dance.
> + * * %FGP_WRITE - The folio will be written to by the caller.
> + * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
> + * * %FGP_NOWAIT - Don't block on the folio lock.
> + * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
> + * * %FGP_WRITEBEGIN - The flags to use in a filesystem write_begin()
> + *   implementation.
> + */
> +typedef unsigned int __bitwise fgf_t;
> +
> +#define FGP_ACCESSED		((__force fgf_t)0x00000001)
> +#define FGP_LOCK		((__force fgf_t)0x00000002)
> +#define FGP_CREAT		((__force fgf_t)0x00000004)
> +#define FGP_WRITE		((__force fgf_t)0x00000008)
> +#define FGP_NOFS		((__force fgf_t)0x00000010)
> +#define FGP_NOWAIT		((__force fgf_t)0x00000020)
> +#define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
> +#define FGP_STABLE		((__force fgf_t)0x00000080)
>  
>  #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>  
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp);
> +		fgf_t fgp_flags, gfp_t gfp);
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp);
> +		fgf_t fgp_flags, gfp_t gfp);
>  
>  /**
>   * filemap_get_folio - Find and get a folio.
> @@ -590,7 +616,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
>  }
>  
>  static inline struct page *find_get_page_flags(struct address_space *mapping,
> -					pgoff_t offset, int fgp_flags)
> +					pgoff_t offset, fgf_t fgp_flags)
>  {
>  	return pagecache_get_page(mapping, offset, fgp_flags, 0);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9e44a49bbd74..8a669fecfd1c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1855,30 +1855,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
>   *
>   * Looks up the page cache entry at @mapping & @index.
>   *
> - * @fgp_flags can be zero or more of these flags:
> - *
> - * * %FGP_ACCESSED - The folio will be marked accessed.
> - * * %FGP_LOCK - The folio is returned locked.
> - * * %FGP_CREAT - If no page is present then a new page is allocated using
> - *   @gfp and added to the page cache and the VM's LRU list.
> - *   The page is returned locked and with an increased refcount.
> - * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
> - *   page is already in cache.  If the page was allocated, unlock it before
> - *   returning so the caller can do the same dance.
> - * * %FGP_WRITE - The page will be written to by the caller.
> - * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
> - * * %FGP_NOWAIT - Don't get blocked by page lock.
> - * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
> - *
>   * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
>   * if the %GFP flags specified for %FGP_CREAT are atomic.
>   *
> - * If there is a page cache page, it is returned with an increased refcount.
> + * If this function returns a folio, it is returned with an increased refcount.
>   *
>   * Return: The found folio or an ERR_PTR() otherwise.
>   */
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp)
> +		fgf_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
>  
> diff --git a/mm/folio-compat.c b/mm/folio-compat.c
> index c6f056c20503..10c3247542cb 100644
> --- a/mm/folio-compat.c
> +++ b/mm/folio-compat.c
> @@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
>  
>  noinline
>  struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
> -		int fgp_flags, gfp_t gfp)
> +		fgf_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
>  
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-10 13:02 ` [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
  2023-07-10 23:58   ` Luis Chamberlain
@ 2023-07-13  4:50   ` Darrick J. Wong
  2023-07-13  5:04   ` Kent Overstreet
  2 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:50 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:51PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers of __filemap_get_folio() to specify a preferred folio
> order in the FGP flags.  This is only honoured in the FGP_CREAT path;
> if there is already a folio in the page cache that covers the index,
> we will return it, no matter what its order is.  No create-around is
> attempted; we will only create folios which start at the specified index.
> Unmodified callers will continue to allocate order 0 folios.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/pagemap.h | 34 ++++++++++++++++++++++++++++++
>  mm/filemap.c            | 46 +++++++++++++++++++++++++++++------------
>  mm/readahead.c          | 13 ------------
>  3 files changed, 67 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 911201fc41fc..d87840acbfb2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -470,6 +470,19 @@ static inline void *detach_page_private(struct page *page)
>  	return folio_detach_private(page_folio(page));
>  }
>  
> +/*
> + * There are some parts of the kernel which assume that PMD entries
> + * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> + * limit the maximum allocation order to PMD size.  I'm not aware of any
> + * assumptions about maximum order if THP are disabled, but 8 seems like
> + * a good order (that's 1MB if you're using 4kB pages)
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#else
> +#define MAX_PAGECACHE_ORDER	8
> +#endif
> +
>  #ifdef CONFIG_NUMA
>  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
>  #else
> @@ -535,9 +548,30 @@ typedef unsigned int __bitwise fgf_t;
>  #define FGP_NOWAIT		((__force fgf_t)0x00000020)
>  #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
>  #define FGP_STABLE		((__force fgf_t)0x00000080)
> +#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */

/me is very excited for xfs to be able to ask for bigger folios for
big allocations.  Playing with 512M folios on arm64 is very exciting,
especially with Ritesh's patches to reduce write amplification.

(Surprisingly, so far the only fstest that's broken is xfs/559, and only
because it assumed that pagecache writes only create basepages.)

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  
>  #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>  
> +/**
> + * fgf_set_order - Encode a length in the fgf_t flags.
> + * @size: The suggested size of the folio to create.
> + *
> + * The caller of __filemap_get_folio() can use this to suggest a preferred
> + * size for the folio that is created.  If there is already a folio at
> + * the index, it will be returned, no matter what its size.  If a folio
> + * is freshly created, it may be of a different size than requested
> + * due to alignment constraints, memory pressure, or the presence of
> + * other folios at nearby indices.
> + */
> +static inline fgf_t fgf_set_order(size_t size)
> +{
> +	unsigned int shift = ilog2(size);
> +
> +	if (shift <= PAGE_SHIFT)
> +		return 0;
> +	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
> +}
> +
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		fgf_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a669fecfd1c..baafbf324c9f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1905,7 +1905,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> +		unsigned order = FGF_GET_ORDER(fgp_flags);
>  		int err;
> +
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
>  		if (fgp_flags & FGP_NOFS)
> @@ -1914,26 +1916,44 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  			gfp &= ~GFP_KERNEL;
>  			gfp |= GFP_NOWAIT | __GFP_NOWARN;
>  		}
> -
> -		folio = filemap_alloc_folio(gfp, 0);
> -		if (!folio)
> -			return ERR_PTR(-ENOMEM);
> -
>  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
>  			fgp_flags |= FGP_LOCK;
>  
> -		/* Init accessed so avoid atomic mark_page_accessed later */
> -		if (fgp_flags & FGP_ACCESSED)
> -			__folio_set_referenced(folio);
> +		if (!mapping_large_folio_support(mapping))
> +			order = 0;
> +		if (order > MAX_PAGECACHE_ORDER)
> +			order = MAX_PAGECACHE_ORDER;
> +		/* If we're not aligned, allocate a smaller folio */
> +		if (index & ((1UL << order) - 1))
> +			order = __ffs(index);
>  
> -		err = filemap_add_folio(mapping, folio, index, gfp);
> -		if (unlikely(err)) {
> +		do {
> +			gfp_t alloc_gfp = gfp;
> +
> +			err = -ENOMEM;
> +			if (order == 1)
> +				order = 0;
> +			if (order > 0)
> +				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +			folio = filemap_alloc_folio(alloc_gfp, order);
> +			if (!folio)
> +				continue;
> +
> +			/* Init accessed so avoid atomic mark_page_accessed later */
> +			if (fgp_flags & FGP_ACCESSED)
> +				__folio_set_referenced(folio);
> +
> +			err = filemap_add_folio(mapping, folio, index, gfp);
> +			if (!err)
> +				break;
>  			folio_put(folio);
>  			folio = NULL;
> -			if (err == -EEXIST)
> -				goto repeat;
> -		}
> +		} while (order-- > 0);
>  
> +		if (err == -EEXIST)
> +			goto repeat;
> +		if (err)
> +			return ERR_PTR(err);
>  		/*
>  		 * filemap_add_folio locks the page, and for mmap
>  		 * we expect an unlocked page.
> diff --git a/mm/readahead.c b/mm/readahead.c
> index a9c999aa19af..e815c114de21 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -461,19 +461,6 @@ static int try_context_readahead(struct address_space *mapping,
>  	return 1;
>  }
>  
> -/*
> - * There are some parts of the kernel which assume that PMD entries
> - * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> - * limit the maximum allocation order to PMD size.  I'm not aware of any
> - * assumptions about maximum order if THP are disabled, but 8 seems like
> - * a good order (that's 1MB if you're using 4kB pages)
> - */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> -#else
> -#define MAX_PAGECACHE_ORDER	8
> -#endif
> -
>  static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
>  		pgoff_t mark, unsigned int order, gfp_t gfp)
>  {
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 8/9] iomap: Create large folios in the buffered write path
  2023-07-10 13:02 ` [PATCH v4 8/9] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
@ 2023-07-13  4:56   ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:56 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:52PM +0100, Matthew Wilcox (Oracle) wrote:
> Use the size of the write as a hint for the size of the folio to create.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Very nice!!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/gfs2/bmap.c         | 2 +-
>  fs/iomap/buffered-io.c | 6 ++++--
>  include/linux/iomap.h  | 2 +-
>  3 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 8d611fbcf0bd..521d1e00a0ff 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
>  	if (status)
>  		return ERR_PTR(status);
>  
> -	folio = iomap_get_folio(iter, pos);
> +	folio = iomap_get_folio(iter, pos, len);
>  	if (IS_ERR(folio))
>  		gfs2_trans_end(sdp);
>  	return folio;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 5e9380cc3e83..2d3e90f4d16e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>   * iomap_get_folio - get a folio reference for writing
>   * @iter: iteration structure
>   * @pos: start offset of write
> + * @len: Suggested size of folio to create.
>   *
>   * Returns a locked reference to the folio at @pos, or an error pointer if the
>   * folio could not be obtained.
>   */
> -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> +	fgp |= fgf_set_order(len);
>  
>  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> @@ -597,7 +599,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
>  	if (folio_ops && folio_ops->get_folio)
>  		return folio_ops->get_folio(iter, pos, len);
>  	else
> -		return iomap_get_folio(iter, pos);
> +		return iomap_get_folio(iter, pos, len);
>  }
>  
>  static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e2b836c2e119..80facb9c9e5b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
>  int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
>  void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
> -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>  int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 9/9] iomap: Copy larger chunks from userspace
  2023-07-10 13:02 ` [PATCH v4 9/9] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
@ 2023-07-13  4:58   ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  4:58 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:53PM +0100, Matthew Wilcox (Oracle) wrote:
> If we have a large folio, we can copy in larger chunks than PAGE_SIZE.
> Start at the maximum page cache size and shrink by half every time we
> hit the "we are short on memory" problem.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Looks good!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/iomap/buffered-io.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2d3e90f4d16e..f21f1f641c4a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -769,6 +769,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
>  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  {
>  	loff_t length = iomap_length(iter);
> +	size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;
>  	loff_t pos = iter->pos;
>  	ssize_t written = 0;
>  	long status = 0;
> @@ -777,15 +778,12 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  
>  	do {
>  		struct folio *folio;
> -		struct page *page;
> -		unsigned long offset;	/* Offset into pagecache page */
> -		unsigned long bytes;	/* Bytes to write to page */
> +		size_t offset;		/* Offset into folio */
> +		size_t bytes;		/* Bytes to write to folio */
>  		size_t copied;		/* Bytes copied from user */
>  
> -		offset = offset_in_page(pos);
> -		bytes = min_t(unsigned long, PAGE_SIZE - offset,
> -						iov_iter_count(i));
> -again:
> +		offset = pos & (chunk - 1);
> +		bytes = min(chunk - offset, iov_iter_count(i));
>  		status = balance_dirty_pages_ratelimited_flags(mapping,
>  							       bdp_flags);
>  		if (unlikely(status))
> @@ -815,12 +813,14 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> -		page = folio_file_page(folio, pos >> PAGE_SHIFT);
> -		if (mapping_writably_mapped(mapping))
> -			flush_dcache_page(page);
> +		offset = offset_in_folio(folio, pos);
> +		if (bytes > folio_size(folio) - offset)
> +			bytes = folio_size(folio) - offset;
>  
> -		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
> +		if (mapping_writably_mapped(mapping))
> +			flush_dcache_folio(folio);
>  
> +		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
>  		status = iomap_write_end(iter, pos, bytes, copied, folio);
>  
>  		if (unlikely(copied != status))
> @@ -836,11 +836,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  			 */
>  			if (copied)
>  				bytes = copied;
> -			goto again;
> +			if (chunk > PAGE_SIZE)
> +				chunk /= 2;
> +		} else {
> +			pos += status;
> +			written += status;
> +			length -= status;
>  		}
> -		pos += status;
> -		written += status;
> -		length -= status;
>  	} while (iov_iter_count(i) && length);
>  
>  	if (status == -EAGAIN) {
> -- 
> 2.39.2
> 
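
A worked example of the chunking above (assuming 4kB pages with THP
enabled, so chunk starts at PAGE_SIZE << MAX_PAGECACHE_ORDER == 2MB;
the numbers are only illustrative):

	/* chunk = 2MB, pos = 5MB + 300 */
	offset = pos & (chunk - 1);			/* 1MB + 300    */
	bytes = min(chunk - offset, iov_iter_count(i));	/* <= 1MB - 300 */

If copy_folio_from_iter_atomic() comes up short (e.g. the source user
page isn't resident), chunk is halved - 1MB, 512kB, ... - but never
below PAGE_SIZE, so the loop degrades gracefully to the old
page-at-a-time behaviour.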

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-10 13:02 ` [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
  2023-07-10 23:58   ` Luis Chamberlain
  2023-07-13  4:50   ` Darrick J. Wong
@ 2023-07-13  5:04   ` Kent Overstreet
  2023-07-13 14:42     ` Matthew Wilcox
  2 siblings, 1 reply; 41+ messages in thread
From: Kent Overstreet @ 2023-07-13  5:04 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:51PM +0100, Matthew Wilcox (Oracle) wrote:
> Allow callers of __filemap_get_folio() to specify a preferred folio
> order in the FGP flags.  This is only honoured in the FGP_CREAT path;
> if there is already a folio in the page cache that covers the index,
> we will return it, no matter what its order is.  No create-around is
> attempted; we will only create folios which start at the specified index.
> Unmodified callers will continue to allocate order 0 folios.

Why not just add an end_index parameter to filemap_get_folio()?

> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  include/linux/pagemap.h | 34 ++++++++++++++++++++++++++++++
>  mm/filemap.c            | 46 +++++++++++++++++++++++++++++------------
>  mm/readahead.c          | 13 ------------
>  3 files changed, 67 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 911201fc41fc..d87840acbfb2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -470,6 +470,19 @@ static inline void *detach_page_private(struct page *page)
>  	return folio_detach_private(page_folio(page));
>  }
>  
> +/*
> + * There are some parts of the kernel which assume that PMD entries
> + * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> + * limit the maximum allocation order to PMD size.  I'm not aware of any
> + * assumptions about maximum order if THP are disabled, but 8 seems like
> + * a good order (that's 1MB if you're using 4kB pages)
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#else
> +#define MAX_PAGECACHE_ORDER	8
> +#endif
> +
>  #ifdef CONFIG_NUMA
>  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
>  #else
> @@ -535,9 +548,30 @@ typedef unsigned int __bitwise fgf_t;
>  #define FGP_NOWAIT		((__force fgf_t)0x00000020)
>  #define FGP_FOR_MMAP		((__force fgf_t)0x00000040)
>  #define FGP_STABLE		((__force fgf_t)0x00000080)
> +#define FGF_GET_ORDER(fgf)	(((__force unsigned)fgf) >> 26)	/* top 6 bits */
>  
>  #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
>  
> +/**
> + * fgf_set_order - Encode a length in the fgf_t flags.
> + * @size: The suggested size of the folio to create.
> + *
> + * The caller of __filemap_get_folio() can use this to suggest a preferred
> + * size for the folio that is created.  If there is already a folio at
> + * the index, it will be returned, no matter what its size.  If a folio
> + * is freshly created, it may be of a different size than requested
> + * due to alignment constraints, memory pressure, or the presence of
> + * other folios at nearby indices.
> + */
> +static inline fgf_t fgf_set_order(size_t size)
> +{
> +	unsigned int shift = ilog2(size);
> +
> +	if (shift <= PAGE_SHIFT)
> +		return 0;
> +	return (__force fgf_t)((shift - PAGE_SHIFT) << 26);
> +}
> +
>  void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		fgf_t fgp_flags, gfp_t gfp);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8a669fecfd1c..baafbf324c9f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1905,7 +1905,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> +		unsigned order = FGF_GET_ORDER(fgp_flags);
>  		int err;
> +
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
>  		if (fgp_flags & FGP_NOFS)
> @@ -1914,26 +1916,44 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  			gfp &= ~GFP_KERNEL;
>  			gfp |= GFP_NOWAIT | __GFP_NOWARN;
>  		}
> -
> -		folio = filemap_alloc_folio(gfp, 0);
> -		if (!folio)
> -			return ERR_PTR(-ENOMEM);
> -
>  		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
>  			fgp_flags |= FGP_LOCK;
>  
> -		/* Init accessed so avoid atomic mark_page_accessed later */
> -		if (fgp_flags & FGP_ACCESSED)
> -			__folio_set_referenced(folio);
> +		if (!mapping_large_folio_support(mapping))
> +			order = 0;
> +		if (order > MAX_PAGECACHE_ORDER)
> +			order = MAX_PAGECACHE_ORDER;
> +		/* If we're not aligned, allocate a smaller folio */
> +		if (index & ((1UL << order) - 1))
> +			order = __ffs(index);
>  
> -		err = filemap_add_folio(mapping, folio, index, gfp);
> -		if (unlikely(err)) {
> +		do {
> +			gfp_t alloc_gfp = gfp;
> +
> +			err = -ENOMEM;
> +			if (order == 1)
> +				order = 0;
> +			if (order > 0)
> +				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +			folio = filemap_alloc_folio(alloc_gfp, order);
> +			if (!folio)
> +				continue;
> +
> +			/* Init accessed so avoid atomic mark_page_accessed later */
> +			if (fgp_flags & FGP_ACCESSED)
> +				__folio_set_referenced(folio);
> +
> +			err = filemap_add_folio(mapping, folio, index, gfp);
> +			if (!err)
> +				break;
>  			folio_put(folio);
>  			folio = NULL;
> -			if (err == -EEXIST)
> -				goto repeat;
> -		}
> +		} while (order-- > 0);
>  
> +		if (err == -EEXIST)
> +			goto repeat;
> +		if (err)
> +			return ERR_PTR(err);
>  		/*
>  		 * filemap_add_folio locks the page, and for mmap
>  		 * we expect an unlocked page.
> diff --git a/mm/readahead.c b/mm/readahead.c
> index a9c999aa19af..e815c114de21 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -461,19 +461,6 @@ static int try_context_readahead(struct address_space *mapping,
>  	return 1;
>  }
>  
> -/*
> - * There are some parts of the kernel which assume that PMD entries
> - * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> - * limit the maximum allocation order to PMD size.  I'm not aware of any
> - * assumptions about maximum order if THP are disabled, but 8 seems like
> - * a good order (that's 1MB if you're using 4kB pages)
> - */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> -#else
> -#define MAX_PAGECACHE_ORDER	8
> -#endif
> -
>  static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
>  		pgoff_t mark, unsigned int order, gfp_t gfp)
>  {
> -- 
> 2.39.2
> 
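
For anyone reading along, a minimal sketch of how a buffered-write caller
ends up using the new hint (the helper name and arguments are illustrative,
not from this series):

	/*
	 * Ask the page cache for a folio large enough to cover the write.
	 * fgf_set_order() only encodes a hint; __filemap_get_folio()
	 * clamps it to MAX_PAGECACHE_ORDER, drops it to 0 if the mapping
	 * does not support large folios, and shrinks it further if the
	 * index is not aligned to the requested order.
	 */
	static struct folio *get_write_folio(struct address_space *mapping,
			loff_t pos, size_t len)
	{
		fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);

		return __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				mapping_gfp_mask(mapping));
	}

If the large allocation fails (it is attempted with __GFP_NORETRY |
__GFP_NOWARN), the loop above retries at successively smaller orders, so
callers do not need a fallback path of their own.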

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 6/9] filemap: Add fgf_t typedef
  2023-07-10 13:02 ` [PATCH v4 6/9] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
  2023-07-13  4:47   ` Darrick J. Wong
@ 2023-07-13  5:08   ` Kent Overstreet
  1 sibling, 0 replies; 41+ messages in thread
From: Kent Overstreet @ 2023-07-13  5:08 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Christoph Hellwig

On Mon, Jul 10, 2023 at 02:02:50PM +0100, Matthew Wilcox (Oracle) wrote:
> Similarly to gfp_t, define fgf_t as its own type to prevent various
> misuses and confusion.  Leave the flags as FGP_* for now to reduce the
> size of this patch; they will be converted to FGF_* later.  Move the
> documentation to the definition of the type instead of burying it in the
> __filemap_get_folio() documentation.

lack of type safety for enums is a real issue - this is A+
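
With the typedef in place, sparse also catches the classic swapped-argument
mix-up, e.g. (an illustrative call, not one from the series):

	folio = __filemap_get_folio(mapping, index,
			GFP_KERNEL,	/* sparse: expected fgf_t, got gfp_t */
			FGP_LOCK);	/* sparse: expected gfp_t, got fgf_t */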

Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio()
  2023-07-13  4:45   ` Darrick J. Wong
@ 2023-07-13  5:25     ` Ritesh Harjani
  2023-07-13  5:33       ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Ritesh Harjani @ 2023-07-13  5:25 UTC (permalink / raw)
  To: Darrick J. Wong, Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet, Christoph Hellwig

"Darrick J. Wong" <djwong@kernel.org> writes:

> [add ritesh]
>
> On Mon, Jul 10, 2023 at 02:02:49PM +0100, Matthew Wilcox (Oracle) wrote:
>> The check for the folio being under writeback is unnecessary; the caller
>> has checked this and the folio is locked, so the folio cannot be under
>> writeback at this point.
>> 
>> The comment is somewhat misleading in that it talks about one specific
>> situation in which we can see a dirty folio.  There are others, so change
>> the comment to explain why we can't release the iomap_page.
>> 
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> ---
>>  fs/iomap/buffered-io.c | 9 ++++-----
>>  1 file changed, 4 insertions(+), 5 deletions(-)
>> 
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 1cb905140528..7aa3009f907f 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -483,12 +483,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>>  			folio_size(folio));
>>  
>>  	/*
>> -	 * mm accommodates an old ext3 case where clean folios might
>> -	 * not have had the dirty bit cleared.  Thus, it can send actual
>> -	 * dirty folios to ->release_folio() via shrink_active_list();
>> -	 * skip those here.
>> +	 * If the folio is dirty, we refuse to release our metadata because
>> +	 * it may be partially dirty.  Once we track per-block dirty state,
>> +	 * we can release the metadata if every block is dirty.
>
> Ritesh: I'm assuming that implementing this will be part of your v12 series?

No, if it's any optimization, then I think we can take it up later too,
not in v12 please (I have been doing some extensive testing of current series).
Also let me understand it a bit more.

@willy,
Is this what you are suggesting? So this is mainly to free up some
memory for iomap_folio_state structure then right?
But then whenever we are doing a writeback, we anyway would be
allocating iomap_folio_state() and marking all the bits dirty. Isn't it
sub-optimal then?  

@@ -489,8 +489,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
         * it may be partially dirty.  Once we track per-block dirty state,
         * we can release the metadata if every block is dirty.
         */
-       if (folio_test_dirty(folio))
+       if (folio_test_dirty(folio)) {
+               if (ifs_is_fully_dirty(folio, ifs))
+                       iomap_page_release(folio);
                return false;
+       }
        iomap_page_release(folio);
        return true;
 }

(Ignore the old and new apis naming in above. It is just to get an idea)

-ritesh

>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D
>
>>  	 */
>> -	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>> +	if (folio_test_dirty(folio))
>>  		return false;
>>  	iomap_page_release(folio);
>>  	return true;
>> -- 
>> 2.39.2
>> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio()
  2023-07-13  5:25     ` Ritesh Harjani
@ 2023-07-13  5:33       ` Darrick J. Wong
  2023-07-13  5:51         ` Ritesh Harjani
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-13  5:33 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Matthew Wilcox (Oracle), linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Kent Overstreet,
	Christoph Hellwig

On Thu, Jul 13, 2023 at 10:55:20AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> 
> > [add ritesh]
> >
> > On Mon, Jul 10, 2023 at 02:02:49PM +0100, Matthew Wilcox (Oracle) wrote:
> >> The check for the folio being under writeback is unnecessary; the caller
> >> has checked this and the folio is locked, so the folio cannot be under
> >> writeback at this point.
> >> 
> >> The comment is somewhat misleading in that it talks about one specific
> >> situation in which we can see a dirty folio.  There are others, so change
> >> the comment to explain why we can't release the iomap_page.
> >> 
> >> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >> ---
> >>  fs/iomap/buffered-io.c | 9 ++++-----
> >>  1 file changed, 4 insertions(+), 5 deletions(-)
> >> 
> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >> index 1cb905140528..7aa3009f907f 100644
> >> --- a/fs/iomap/buffered-io.c
> >> +++ b/fs/iomap/buffered-io.c
> >> @@ -483,12 +483,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
> >>  			folio_size(folio));
> >>  
> >>  	/*
> >> -	 * mm accommodates an old ext3 case where clean folios might
> >> -	 * not have had the dirty bit cleared.  Thus, it can send actual
> >> -	 * dirty folios to ->release_folio() via shrink_active_list();
> >> -	 * skip those here.
> >> +	 * If the folio is dirty, we refuse to release our metadata because
> >> +	 * it may be partially dirty.  Once we track per-block dirty state,
> >> +	 * we can release the metadata if every block is dirty.
> >
> > Ritesh: I'm assuming that implementing this will be part of your v12 series?
> 
> No, if it's any optimization, then I think we can take it up later too,

<nod>

> not in v12 please (I have been doing some extensive testing of current series).
> Also let me understand it a bit more.
> 
> @willy,
> Is this what you are suggesting? So this is mainly to free up some
> memory for iomap_folio_state structure then right?

I think it's also to break up compound folios to free base pages, or for
other reasons.

https://lore.kernel.org/linux-xfs/20230713044326.GI108251@frogsfrogsfrogs/T/#mc83fe929d57e9aa3c1834232389cad0d62b66e7b

> But then whenever we are doing a writeback, we anyway would be
> allocating iomap_folio_state() and marking all the bits dirty. Isn't it
> sub-optimal then?  
> 
> @@ -489,8 +489,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>          * it may be partially dirty.  Once we track per-block dirty state,
>          * we can release the metadata if every block is dirty.
>          */
> -       if (folio_test_dirty(folio))
> +       if (folio_test_dirty(folio)) {
> +               if (ifs_is_fully_dirty(folio, ifs))
> +                       iomap_page_release(folio);
>                 return false;
> +       }

I think it's more that we *don't* break up partially dirty folios:

	/*
	 * Folio is partially dirty, do not throw away the state or
	 * split the folio.
	 */
	if (folio_test_dirty(folio) && !ifs_is_fully_dirty(folio, ifs))
		return false;

	/* No more private state tracking, ok to split folio. */
	iomap_page_release(folio);
	return true;

But breaking up fully dirty folios is now possible, since the mm can
mark all the basepages dirty.
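
The helper itself could be something like this (a sketch only; the layout
is an assumption based on the per-block dirty tracking series, where the
dirty bits follow the uptodate bits in the ifs state bitmap):

	static bool ifs_is_fully_dirty(struct folio *folio,
			struct iomap_folio_state *ifs)
	{
		unsigned int blks = i_blocks_per_folio(folio->mapping->host, folio);

		/* Dirty bits assumed to occupy [blks, 2 * blks) of ifs->state. */
		return find_next_zero_bit(ifs->state, 2 * blks, blks) == 2 * blks;
	}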

--D

>         iomap_page_release(folio);
>         return true;
>  }
> 
> (Ignore the old and new apis naming in above. It is just to get an idea)
> 
> -ritesh
> 
> >
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >
> > --D
> >
> >>  	 */
> >> -	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >> +	if (folio_test_dirty(folio))
> >>  		return false;
> >>  	iomap_page_release(folio);
> >>  	return true;
> >> -- 
> >> 2.39.2
> >> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio()
  2023-07-13  5:33       ` Darrick J. Wong
@ 2023-07-13  5:51         ` Ritesh Harjani
  0 siblings, 0 replies; 41+ messages in thread
From: Ritesh Harjani @ 2023-07-13  5:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox (Oracle), linux-fsdevel, linux-xfs, Wang Yugui,
	Dave Chinner, Christoph Hellwig, Kent Overstreet,
	Christoph Hellwig

"Darrick J. Wong" <djwong@kernel.org> writes:

> On Thu, Jul 13, 2023 at 10:55:20AM +0530, Ritesh Harjani wrote:
>> "Darrick J. Wong" <djwong@kernel.org> writes:
>> 
>> > [add ritesh]
>> >
>> > On Mon, Jul 10, 2023 at 02:02:49PM +0100, Matthew Wilcox (Oracle) wrote:
>> >> The check for the folio being under writeback is unnecessary; the caller
>> >> has checked this and the folio is locked, so the folio cannot be under
>> >> writeback at this point.
>> >> 
>> >> The comment is somewhat misleading in that it talks about one specific
>> >> situation in which we can see a dirty folio.  There are others, so change
>> >> the comment to explain why we can't release the iomap_page.
>> >> 
>> >> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> >> ---
>> >>  fs/iomap/buffered-io.c | 9 ++++-----
>> >>  1 file changed, 4 insertions(+), 5 deletions(-)
>> >> 
>> >> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> >> index 1cb905140528..7aa3009f907f 100644
>> >> --- a/fs/iomap/buffered-io.c
>> >> +++ b/fs/iomap/buffered-io.c
>> >> @@ -483,12 +483,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>> >>  			folio_size(folio));
>> >>  
>> >>  	/*
>> >> -	 * mm accommodates an old ext3 case where clean folios might
>> >> -	 * not have had the dirty bit cleared.  Thus, it can send actual
>> >> -	 * dirty folios to ->release_folio() via shrink_active_list();
>> >> -	 * skip those here.
>> >> +	 * If the folio is dirty, we refuse to release our metadata because
>> >> +	 * it may be partially dirty.  Once we track per-block dirty state,
>> >> +	 * we can release the metadata if every block is dirty.
>> >
>> > Ritesh: I'm assuming that implementing this will be part of your v12 series?
>> 
>> No, if it's any optimization, then I think we can take it up later too,
>
> <nod>

Thanks! 

>
>> not in v12 please (I have been doing some extensive testing of current series).
>> Also let me understand it a bit more.
>> 
>> @willy,
>> Is this what you are suggesting? So this is mainly to free up some
>> memory for iomap_folio_state structure then right?
>
> I think it's also to break up compound folios to free base pages, or for
> other reasons.
>
> https://lore.kernel.org/linux-xfs/20230713044326.GI108251@frogsfrogsfrogs/T/#mc83fe929d57e9aa3c1834232389cad0d62b66e7b
>
>> But then whenever we are doing a writeback, we anyway would be
>> allocating iomap_folio_state() and marking all the bits dirty. Isn't it
>> sub-optimal then?  
>> 
>> @@ -489,8 +489,11 @@ bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags)
>>          * it may be partially dirty.  Once we track per-block dirty state,
>>          * we can release the metadata if every block is dirty.
>>          */
>> -       if (folio_test_dirty(folio))
>> +       if (folio_test_dirty(folio)) {
>> +               if (ifs_is_fully_dirty(folio, ifs))
>> +                       iomap_page_release(folio);
>>                 return false;
>> +       }
>
> I think it's more that we *don't* break up partially dirty folios:
>
> 	/*
> 	 * Folio is partially dirty, do not throw away the state or
> 	 * split the folio.
> 	 */
> 	if (folio_test_dirty(folio) && !ifs_is_fully_dirty(folio, ifs))
> 		return false;
>
> 	/* No more private state tracking, ok to split folio. */
> 	iomap_page_release(folio);
> 	return true;
>

Aah got it. If the folio is dirty and all its blocks are also dirty,
then we can release the metadata and return true. This will allow the MM
to split the folio, right?

Let me test it then. Currently ifs_is_fully_dirty() will walk and test
all the bits for dirtiness. However, splitting the folio might not be
the fast path, so I am assuming it shouldn't have any performance
implication.


> But breaking up fully dirty folios is now possible, since the mm can
> mark all the basepages dirty.

Thanks. Got it.

-ritesh

>
> --D
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()
  2023-07-13  4:42   ` Darrick J. Wong
@ 2023-07-13 13:46     ` Matthew Wilcox
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-13 13:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet

On Wed, Jul 12, 2023 at 09:42:07PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 10, 2023 at 02:02:45PM +0100, Matthew Wilcox (Oracle) wrote:
> > copy_page_from_iter_atomic() already handles !highmem compound
> > pages correctly, but if we are passed a highmem compound page,
> > each base page needs to be mapped & unmapped individually.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> Yikes.  Does this want a fixes tag?

I was wondering the same thing.  I think the bug is merely latent,
and all existing users limit themselves to being within the page
that they're passing in.
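
For reference, the fix boils down to something like the sketch below (a
conceptual sketch, not the patch itself; the real helper also has to
preserve the no-fault semantics of the _atomic variant, which this glosses
over):

	/*
	 * A highmem compound page has no single contiguous kernel mapping,
	 * so map and copy one base page at a time.
	 */
	static size_t copy_highmem_page_from_iter(struct page *page,
			size_t offset, size_t bytes, struct iov_iter *i)
	{
		size_t copied = 0, chunk;

		do {
			void *to = kmap_local_page(page + offset / PAGE_SIZE);

			chunk = min_t(size_t, bytes, PAGE_SIZE - offset % PAGE_SIZE);
			chunk = copy_from_iter(to + offset % PAGE_SIZE, chunk, i);
			kunmap_local(to);
			copied += chunk;
			offset += chunk;
			bytes -= chunk;
		} while (bytes && chunk);

		return copied;
	}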

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-13  5:04   ` Kent Overstreet
@ 2023-07-13 14:42     ` Matthew Wilcox
  2023-07-13 15:19       ` Kent Overstreet
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-07-13 14:42 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Christoph Hellwig

On Thu, Jul 13, 2023 at 01:04:39AM -0400, Kent Overstreet wrote:
> On Mon, Jul 10, 2023 at 02:02:51PM +0100, Matthew Wilcox (Oracle) wrote:
> > Allow callers of __filemap_get_folio() to specify a preferred folio
> > order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> > if there is already a folio in the page cache that covers the index,
> > we will return it, no matter what its order is.  No create-around is
> > attempted; we will only create folios which start at the specified index.
> > Unmodified callers will continue to allocate order 0 folios.
> 
> Why not just add an end_index parameter to filemap_get_folio()?

I'm reluctant to add more parameters.  Aside from the churn, every extra
parameter makes the function that little bit harder to use.  I like this
encoding; users who don't know/care about it get the current default
behaviour, and it's a simple addition to the users who do want to care.
end_index is particularly tricky ... what if it's lower than index?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios
  2023-07-13 14:42     ` Matthew Wilcox
@ 2023-07-13 15:19       ` Kent Overstreet
  0 siblings, 0 replies; 41+ messages in thread
From: Kent Overstreet @ 2023-07-13 15:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Darrick J . Wong, Christoph Hellwig

On Thu, Jul 13, 2023 at 03:42:18PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 13, 2023 at 01:04:39AM -0400, Kent Overstreet wrote:
> > On Mon, Jul 10, 2023 at 02:02:51PM +0100, Matthew Wilcox (Oracle) wrote:
> > > Allow callers of __filemap_get_folio() to specify a preferred folio
> > > order in the FGP flags.  This is only honoured in the FGP_CREATE path;
> > > if there is already a folio in the page cache that covers the index,
> > > we will return it, no matter what its order is.  No create-around is
> > > attempted; we will only create folios which start at the specified index.
> > > Unmodified callers will continue to allocate order 0 folios.
> > 
> > Why not just add an end_index parameter to filemap_get_folio()?
> 
> I'm reluctant to add more parameters.  Aside from the churn, every extra
> parameter makes the function that little bit harder to use.  I like this
> encoding; users who don't know/care about it get the current default
> behaviour, and it's a simple addition to the users who do want to care.
> end_index is particularly tricky ... what if it's lower than index?

But we're refactoring all this code (and changing our thinking) to
extents/ranges, not blocks - I'd say end_index is more natural in the
long run.

Plus, it lets us put the logic for "how big of a folio do we want to
allocate" in one place.

(end_index < index is just a BUG_ON()).

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic()
  2023-07-10 13:02 ` [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic() Matthew Wilcox (Oracle)
  2023-07-11  6:31   ` Christoph Hellwig
@ 2023-07-24 15:56   ` Darrick J. Wong
  1 sibling, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2023-07-24 15:56 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: linux-fsdevel, linux-xfs, Wang Yugui, Dave Chinner,
	Christoph Hellwig, Kent Overstreet

On Mon, Jul 10, 2023 at 02:02:46PM +0100, Matthew Wilcox (Oracle) wrote:
> Add a folio wrapper around copy_page_from_iter_atomic().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Wellllll crap.  I got all ready to push this to for-next, but then my
maintainer checkpatch interlock scripts pointed out that this commit
doesn't have /any/ RVB attached to it.  Apparently I forgot to tag this
one when I went through all this.

Matthew, can you please add:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

and redo the whole branch-push and pull-request dance, please?

(Also could you put a sob tag on the tag message so that the merge
commit can have full authorship details?)

--D

> ---
>  include/linux/uio.h | 9 ++++++++-
>  lib/iov_iter.c      | 2 +-
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index ff81e5ccaef2..42bce38a8e87 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -163,7 +163,7 @@ static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs)
>  	return ret;
>  }
>  
> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
>  				  size_t bytes, struct iov_iter *i);
>  void iov_iter_advance(struct iov_iter *i, size_t bytes);
>  void iov_iter_revert(struct iov_iter *i, size_t bytes);
> @@ -184,6 +184,13 @@ static inline size_t copy_folio_to_iter(struct folio *folio, size_t offset,
>  {
>  	return copy_page_to_iter(&folio->page, offset, bytes, i);
>  }
> +
> +static inline size_t copy_folio_from_iter_atomic(struct folio *folio,
> +		size_t offset, size_t bytes, struct iov_iter *i)
> +{
> +	return copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
> +}
> +
>  size_t copy_page_to_iter_nofault(struct page *page, unsigned offset,
>  				 size_t bytes, struct iov_iter *i);
>  
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index c728b6e4fb18..8da566a549ad 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -566,7 +566,7 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_zero);
>  
> -size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> +size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
>  		size_t bytes, struct iov_iter *i)
>  {
>  	size_t n, copied = 0;
> -- 
> 2.39.2
> 
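
For context, with the wrapper in place a buffered-write copy step can move
up to a whole large folio's worth of data in one call. A rough sketch, with
illustrative names:

	size_t offset = offset_in_folio(folio, pos);
	size_t bytes = min(folio_size(folio) - offset, iov_iter_count(i));
	size_t copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

The offset is folio-relative and may exceed PAGE_SIZE, which is why the
underlying copy_page_from_iter_atomic() needed the compound highmem fix in
patch 1.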

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-07-24 15:56 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-10 13:02 [PATCH v4 0/9] Create large folios in iomap buffered write path Matthew Wilcox (Oracle)
2023-07-10 13:02 ` [PATCH v4 1/9] iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic() Matthew Wilcox (Oracle)
2023-07-10 23:43   ` Luis Chamberlain
2023-07-11  0:03     ` Matthew Wilcox
2023-07-11  6:30   ` Christoph Hellwig
2023-07-11 20:05   ` Kent Overstreet
2023-07-13  4:42   ` Darrick J. Wong
2023-07-13 13:46     ` Matthew Wilcox
2023-07-10 13:02 ` [PATCH v4 2/9] iov_iter: Add copy_folio_from_iter_atomic() Matthew Wilcox (Oracle)
2023-07-11  6:31   ` Christoph Hellwig
2023-07-11 20:46     ` Matthew Wilcox
2023-07-24 15:56   ` Darrick J. Wong
2023-07-10 13:02 ` [PATCH v4 3/9] iomap: Remove large folio handling in iomap_invalidate_folio() Matthew Wilcox (Oracle)
2023-07-10 13:02 ` [PATCH v4 4/9] doc: Correct the description of ->release_folio Matthew Wilcox (Oracle)
2023-07-13  4:43   ` Darrick J. Wong
2023-07-10 13:02 ` [PATCH v4 5/9] iomap: Remove unnecessary test from iomap_release_folio() Matthew Wilcox (Oracle)
2023-07-13  4:45   ` Darrick J. Wong
2023-07-13  5:25     ` Ritesh Harjani
2023-07-13  5:33       ` Darrick J. Wong
2023-07-13  5:51         ` Ritesh Harjani
2023-07-10 13:02 ` [PATCH v4 6/9] filemap: Add fgf_t typedef Matthew Wilcox (Oracle)
2023-07-13  4:47   ` Darrick J. Wong
2023-07-13  5:08   ` Kent Overstreet
2023-07-10 13:02 ` [PATCH v4 7/9] filemap: Allow __filemap_get_folio to allocate large folios Matthew Wilcox (Oracle)
2023-07-10 23:58   ` Luis Chamberlain
2023-07-11  0:07     ` Matthew Wilcox
2023-07-11  0:21       ` Luis Chamberlain
2023-07-11  0:42         ` Matthew Wilcox
2023-07-11  0:47         ` Dave Chinner
2023-07-11  0:13     ` Kent Overstreet
2023-07-13  4:50   ` Darrick J. Wong
2023-07-13  5:04   ` Kent Overstreet
2023-07-13 14:42     ` Matthew Wilcox
2023-07-13 15:19       ` Kent Overstreet
2023-07-10 13:02 ` [PATCH v4 8/9] iomap: Create large folios in the buffered write path Matthew Wilcox (Oracle)
2023-07-13  4:56   ` Darrick J. Wong
2023-07-10 13:02 ` [PATCH v4 9/9] iomap: Copy larger chunks from userspace Matthew Wilcox (Oracle)
2023-07-13  4:58   ` Darrick J. Wong
2023-07-10 22:55 ` [PATCH v4 0/9] Create large folios in iomap buffered write path Luis Chamberlain
2023-07-10 23:53   ` Matthew Wilcox
2023-07-11  0:01   ` Dave Chinner
