* [PATCH v2 00/12] fuse: support large folios
@ 2024-11-25 22:05 Joanne Koong
2024-11-25 22:05 ` [PATCH v2 01/12] fuse: support copying " Joanne Koong
` (12 more replies)
0 siblings, 13 replies; 28+ messages in thread
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
This patchset adds support for folios larger than one page size in FUSE.
This patchset is rebased on top of the (unmerged) patchset that removes temp
folios in writeback [1]. (There is also a version of this patchset that is
independent of that change, but it carries two additional patches to account
for temp folios and temp folio copying; those patches add generic (non-FUSE)
helpers, so getting their API right may take some debate. For simplicity's
sake, I sent out this patchset version rebased on top of the patchset that
removes temp pages.)
This patchset was tested by running it through fstests on passthrough_hp.
Benchmarks show roughly a ~45% improvement in read throughput.
Benchmark setup:
-- Set up server --
./libfuse/build/example/passthrough_hp --bypass-rw=1 ~/libfuse
~/mounts/fuse/ --nopassthrough
(using libfuse patched with https://github.com/libfuse/libfuse/pull/807)
-- Run fio --
fio --name=read --ioengine=sync --rw=read --bs=1M --size=1G
--numjobs=2 --ramp_time=30 --group_reporting=1
--directory=mounts/fuse/
Machine 1:
No large folios: ~4400 MiB/s
Large folios: ~7100 MiB/s
Machine 2:
No large folios: ~3700 MiB/s
Large folios: ~6400 MiB/s
Writes still effectively use one-page folios. Benchmarks showed that trying to
get the largest folios possible from __filemap_get_folio() is an
over-optimization and ends up being significantly more expensive. Fine-tuning
the folio order passed to the __filemap_get_folio() calls can be done in a
future patchset.
[1] https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
Changelog:
v1: https://lore.kernel.org/linux-fsdevel/20241109001258.2216604-1-joannelkoong@gmail.com/
v1 -> v2:
* Change naming from "non-writeback write" to "writethrough write"
* Fix deadlock for writethrough writes by calling fault_in_iov_iter_readable()
before __filemap_get_folio() (Josef)
* For readahead, retain original folio_size() for descs.length (Josef)
* Use folio_zero_range() api in fuse_copy_folio() (Josef)
* Add Josef's reviewed-bys
Joanne Koong (12):
fuse: support copying large folios
fuse: support large folios for retrieves
fuse: refactor fuse_fill_write_pages()
fuse: support large folios for writethrough writes
fuse: support large folios for folio reads
fuse: support large folios for symlinks
fuse: support large folios for stores
fuse: support large folios for queued writes
fuse: support large folios for readahead
fuse: support large folios for direct io
fuse: support large folios for writeback
fuse: enable large folios
fs/fuse/dev.c | 128 ++++++++++++++++++++++++-------------------------
fs/fuse/dir.c | 8 ++--
fs/fuse/file.c | 126 +++++++++++++++++++++++++++++++-----------------
3 files changed, 149 insertions(+), 113 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 01/12] fuse: support copying large folios
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Currently, all folios associated with fuse are one page size. As part of
the work to enable large folios, this commit adds support for copying
to/from folios larger than one page size.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/dev.c | 86 ++++++++++++++++++++++-----------------------------
1 file changed, 37 insertions(+), 49 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 29fc61a072ba..4a09c41910d7 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -703,7 +703,7 @@ struct fuse_copy_state {
struct page *pg;
unsigned len;
unsigned offset;
- unsigned move_pages:1;
+ unsigned move_folios:1;
};
static void fuse_copy_init(struct fuse_copy_state *cs, int write,
@@ -836,10 +836,10 @@ static int fuse_check_folio(struct folio *folio)
return 0;
}
-static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
+static int fuse_try_move_folio(struct fuse_copy_state *cs, struct folio **foliop)
{
int err;
- struct folio *oldfolio = page_folio(*pagep);
+ struct folio *oldfolio = *foliop;
struct folio *newfolio;
struct pipe_buffer *buf = cs->pipebufs;
@@ -860,7 +860,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
cs->pipebufs++;
cs->nr_segs--;
- if (cs->len != PAGE_SIZE)
+ if (cs->len != folio_size(oldfolio))
goto out_fallback;
if (!pipe_buf_try_steal(cs->pipe, buf))
@@ -906,7 +906,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
if (test_bit(FR_ABORTED, &cs->req->flags))
err = -ENOENT;
else
- *pagep = &newfolio->page;
+ *foliop = newfolio;
spin_unlock(&cs->req->waitq.lock);
if (err) {
@@ -939,8 +939,8 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, struct page **pagep)
goto out_put_old;
}
-static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
- unsigned offset, unsigned count)
+static int fuse_ref_folio(struct fuse_copy_state *cs, struct folio *folio,
+ unsigned offset, unsigned count)
{
struct pipe_buffer *buf;
int err;
@@ -948,17 +948,17 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
if (cs->nr_segs >= cs->pipe->max_usage)
return -EIO;
- get_page(page);
+ folio_get(folio);
err = unlock_request(cs->req);
if (err) {
- put_page(page);
+ folio_put(folio);
return err;
}
fuse_copy_finish(cs);
buf = cs->pipebufs;
- buf->page = page;
+ buf->page = &folio->page;
buf->offset = offset;
buf->len = count;
@@ -970,20 +970,21 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
}
/*
- * Copy a page in the request to/from the userspace buffer. Must be
+ * Copy a folio in the request to/from the userspace buffer. Must be
* done atomically
*/
-static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
- unsigned offset, unsigned count, int zeroing)
+static int fuse_copy_folio(struct fuse_copy_state *cs, struct folio **foliop,
+ unsigned offset, unsigned count, int zeroing)
{
int err;
- struct page *page = *pagep;
+ struct folio *folio = *foliop;
+ size_t size = folio_size(folio);
- if (page && zeroing && count < PAGE_SIZE)
- clear_highpage(page);
+ if (folio && zeroing && count < size)
+ folio_zero_range(folio, 0, size);
while (count) {
- if (cs->write && cs->pipebufs && page) {
+ if (cs->write && cs->pipebufs && folio) {
/*
* Can't control lifetime of pipe buffers, so always
* copy user pages.
@@ -993,12 +994,12 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
if (err)
return err;
} else {
- return fuse_ref_page(cs, page, offset, count);
+ return fuse_ref_folio(cs, folio, offset, count);
}
} else if (!cs->len) {
- if (cs->move_pages && page &&
- offset == 0 && count == PAGE_SIZE) {
- err = fuse_try_move_page(cs, pagep);
+ if (cs->move_folios && folio &&
+ offset == 0 && count == folio_size(folio)) {
+ err = fuse_try_move_folio(cs, foliop);
if (err <= 0)
return err;
} else {
@@ -1007,22 +1008,22 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
return err;
}
}
- if (page) {
- void *mapaddr = kmap_local_page(page);
- void *buf = mapaddr + offset;
+ if (folio) {
+ void *mapaddr = kmap_local_folio(folio, offset);
+ void *buf = mapaddr;
offset += fuse_copy_do(cs, &buf, &count);
kunmap_local(mapaddr);
} else
offset += fuse_copy_do(cs, NULL, &count);
}
- if (page && !cs->write)
- flush_dcache_page(page);
+ if (folio && !cs->write)
+ flush_dcache_folio(folio);
return 0;
}
-/* Copy pages in the request to/from userspace buffer */
-static int fuse_copy_pages(struct fuse_copy_state *cs, unsigned nbytes,
- int zeroing)
+/* Copy folios in the request to/from userspace buffer */
+static int fuse_copy_folios(struct fuse_copy_state *cs, unsigned nbytes,
+ int zeroing)
{
unsigned i;
struct fuse_req *req = cs->req;
@@ -1032,23 +1033,12 @@ static int fuse_copy_pages(struct fuse_copy_state *cs, unsigned nbytes,
int err;
unsigned int offset = ap->descs[i].offset;
unsigned int count = min(nbytes, ap->descs[i].length);
- struct page *orig, *pagep;
-
- orig = pagep = &ap->folios[i]->page;
- err = fuse_copy_page(cs, &pagep, offset, count, zeroing);
+ err = fuse_copy_folio(cs, &ap->folios[i], offset, count, zeroing);
if (err)
return err;
nbytes -= count;
-
- /*
- * fuse_copy_page may have moved a page from a pipe instead of
- * copying into our given page, so update the folios if it was
- * replaced.
- */
- if (pagep != orig)
- ap->folios[i] = page_folio(pagep);
}
return 0;
}
@@ -1078,7 +1068,7 @@ static int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
for (i = 0; !err && i < numargs; i++) {
struct fuse_arg *arg = &args[i];
if (i == numargs - 1 && argpages)
- err = fuse_copy_pages(cs, arg->size, zeroing);
+ err = fuse_copy_folios(cs, arg->size, zeroing);
else
err = fuse_copy_one(cs, arg->value, arg->size);
}
@@ -1665,7 +1655,6 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
num = outarg.size;
while (num) {
struct folio *folio;
- struct page *page;
unsigned int this_num;
folio = filemap_grab_folio(mapping, index);
@@ -1673,9 +1662,8 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
if (IS_ERR(folio))
goto out_iput;
- page = &folio->page;
this_num = min_t(unsigned, num, folio_size(folio) - offset);
- err = fuse_copy_page(cs, &page, offset, this_num, 0);
+ err = fuse_copy_folio(cs, &folio, offset, this_num, 0);
if (!folio_test_uptodate(folio) && !err && offset == 0 &&
(this_num == folio_size(folio) || file_size == end)) {
folio_zero_segment(folio, this_num, folio_size(folio));
@@ -1902,8 +1890,8 @@ static int fuse_notify_resend(struct fuse_conn *fc)
static int fuse_notify(struct fuse_conn *fc, enum fuse_notify_code code,
unsigned int size, struct fuse_copy_state *cs)
{
- /* Don't try to move pages (yet) */
- cs->move_pages = 0;
+ /* Don't try to move folios (yet) */
+ cs->move_folios = 0;
switch (code) {
case FUSE_NOTIFY_POLL:
@@ -2044,7 +2032,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
spin_unlock(&fpq->lock);
cs->req = req;
if (!req->args->page_replace)
- cs->move_pages = 0;
+ cs->move_folios = 0;
if (oh.error)
err = nbytes != sizeof(oh) ? -EINVAL : 0;
@@ -2163,7 +2151,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
cs.pipe = pipe;
if (flags & SPLICE_F_MOVE)
- cs.move_pages = 1;
+ cs.move_folios = 1;
ret = fuse_dev_do_write(fud, &cs, len);
--
2.43.5
* [PATCH v2 02/12] fuse: support large folios for retrieves
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for retrieves.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/dev.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 4a09c41910d7..8d6418972fe5 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1716,7 +1716,7 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
unsigned int num;
unsigned int offset;
size_t total_len = 0;
- unsigned int num_pages, cur_pages = 0;
+ unsigned int num_pages;
struct fuse_conn *fc = fm->fc;
struct fuse_retrieve_args *ra;
size_t args_size = sizeof(*ra);
@@ -1734,6 +1734,7 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
num_pages = min(num_pages, fc->max_pages);
+ num = min(num, num_pages << PAGE_SHIFT);
args_size += num_pages * (sizeof(ap->folios[0]) + sizeof(ap->descs[0]));
@@ -1754,25 +1755,29 @@ static int fuse_retrieve(struct fuse_mount *fm, struct inode *inode,
index = outarg->offset >> PAGE_SHIFT;
- while (num && cur_pages < num_pages) {
+ while (num) {
struct folio *folio;
- unsigned int this_num;
+ unsigned int folio_offset;
+ unsigned int nr_bytes;
+ unsigned int nr_pages;
folio = filemap_get_folio(mapping, index);
if (IS_ERR(folio))
break;
- this_num = min_t(unsigned, num, PAGE_SIZE - offset);
+ folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
+ nr_bytes = min(folio_size(folio) - folio_offset, num);
+ nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
ap->folios[ap->num_folios] = folio;
- ap->descs[ap->num_folios].offset = offset;
- ap->descs[ap->num_folios].length = this_num;
+ ap->descs[ap->num_folios].offset = folio_offset;
+ ap->descs[ap->num_folios].length = nr_bytes;
ap->num_folios++;
- cur_pages++;
offset = 0;
- num -= this_num;
- total_len += this_num;
- index++;
+ num -= nr_bytes;
+ total_len += nr_bytes;
+ index += nr_pages;
}
ra->inarg.offset = outarg->offset;
ra->inarg.size = total_len;
--
2.43.5
* [PATCH v2 03/12] fuse: refactor fuse_fill_write_pages()
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Refactor the logic in fuse_fill_write_pages() for copying out write
data. This will make the future change for supporting large folios for
writes easier. No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/file.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f8719d8c56ca..a89fdc55a40b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1138,21 +1138,21 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
struct fuse_args_pages *ap = &ia->ap;
struct fuse_conn *fc = get_fuse_conn(mapping->host);
unsigned offset = pos & (PAGE_SIZE - 1);
- unsigned int nr_pages = 0;
size_t count = 0;
+ unsigned int num;
int err;
+ num = min(iov_iter_count(ii), fc->max_write);
+ num = min(num, max_pages << PAGE_SHIFT);
+
ap->args.in_pages = true;
ap->descs[0].offset = offset;
- do {
+ while (num) {
size_t tmp;
struct folio *folio;
pgoff_t index = pos >> PAGE_SHIFT;
- size_t bytes = min_t(size_t, PAGE_SIZE - offset,
- iov_iter_count(ii));
-
- bytes = min_t(size_t, bytes, fc->max_write - count);
+ unsigned int bytes = min(PAGE_SIZE - offset, num);
again:
err = -EFAULT;
@@ -1182,10 +1182,10 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
ap->folios[ap->num_folios] = folio;
ap->descs[ap->num_folios].length = tmp;
ap->num_folios++;
- nr_pages++;
count += tmp;
pos += tmp;
+ num -= tmp;
offset += tmp;
if (offset == PAGE_SIZE)
offset = 0;
@@ -1202,8 +1202,9 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
}
if (!fc->big_writes)
break;
- } while (iov_iter_count(ii) && count < fc->max_write &&
- nr_pages < max_pages && offset == 0);
+ if (offset != 0)
+ break;
+ }
return count > 0 ? count : err;
}
--
2.43.5
* [PATCH v2 04/12] fuse: support large folios for writethrough writes
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for writethrough
writes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/file.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a89fdc55a40b..fab7cfa8700b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1135,6 +1135,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
struct iov_iter *ii, loff_t pos,
unsigned int max_pages)
{
+ size_t max_folio_size = mapping_max_folio_size(mapping);
struct fuse_args_pages *ap = &ia->ap;
struct fuse_conn *fc = get_fuse_conn(mapping->host);
unsigned offset = pos & (PAGE_SIZE - 1);
@@ -1146,17 +1147,17 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
num = min(num, max_pages << PAGE_SHIFT);
ap->args.in_pages = true;
- ap->descs[0].offset = offset;
while (num) {
size_t tmp;
struct folio *folio;
pgoff_t index = pos >> PAGE_SHIFT;
- unsigned int bytes = min(PAGE_SIZE - offset, num);
+ unsigned int bytes;
+ unsigned int folio_offset;
again:
err = -EFAULT;
- if (fault_in_iov_iter_readable(ii, bytes))
+ if (fault_in_iov_iter_readable(ii, max_folio_size) == max_folio_size)
break;
folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
@@ -1169,7 +1170,10 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
if (mapping_writably_mapped(mapping))
flush_dcache_folio(folio);
- tmp = copy_folio_from_iter_atomic(folio, offset, bytes, ii);
+ folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
+ bytes = min(folio_size(folio) - folio_offset, num);
+
+ tmp = copy_folio_from_iter_atomic(folio, folio_offset, bytes, ii);
flush_dcache_folio(folio);
if (!tmp) {
@@ -1180,6 +1184,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
err = 0;
ap->folios[ap->num_folios] = folio;
+ ap->descs[ap->num_folios].offset = folio_offset;
ap->descs[ap->num_folios].length = tmp;
ap->num_folios++;
@@ -1187,11 +1192,11 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
pos += tmp;
num -= tmp;
offset += tmp;
- if (offset == PAGE_SIZE)
+ if (offset == folio_size(folio))
offset = 0;
- /* If we copied full page, mark it uptodate */
- if (tmp == PAGE_SIZE)
+ /* If we copied full folio, mark it uptodate */
+ if (tmp == folio_size(folio))
folio_mark_uptodate(folio);
if (folio_test_uptodate(folio)) {
--
2.43.5
* [PATCH v2 05/12] fuse: support large folios for folio reads
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for folio reads into
the page cache.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index fab7cfa8700b..b12b3cb96450 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -797,7 +797,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
struct inode *inode = folio->mapping->host;
struct fuse_mount *fm = get_fuse_mount(inode);
loff_t pos = folio_pos(folio);
- struct fuse_folio_desc desc = { .length = PAGE_SIZE };
+ struct fuse_folio_desc desc = { .length = folio_size(folio) };
struct fuse_io_args ia = {
.ap.args.page_zeroing = true,
.ap.args.out_pages = true,
--
2.43.5
* [PATCH v2 06/12] fuse: support large folios for symlinks
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Support large folios for symlinks and rename fuse_readlink_page() to
fuse_readlink_folio().
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/dir.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b8a4608e31af..37c1e194909b 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1585,10 +1585,10 @@ static int fuse_permission(struct mnt_idmap *idmap,
return err;
}
-static int fuse_readlink_page(struct inode *inode, struct folio *folio)
+static int fuse_readlink_folio(struct inode *inode, struct folio *folio)
{
struct fuse_mount *fm = get_fuse_mount(inode);
- struct fuse_folio_desc desc = { .length = PAGE_SIZE - 1 };
+ struct fuse_folio_desc desc = { .length = folio_size(folio) - 1 };
struct fuse_args_pages ap = {
.num_folios = 1,
.folios = &folio,
@@ -1643,7 +1643,7 @@ static const char *fuse_get_link(struct dentry *dentry, struct inode *inode,
if (!folio)
goto out_err;
- err = fuse_readlink_page(inode, folio);
+ err = fuse_readlink_folio(inode, folio);
if (err) {
folio_put(folio);
goto out_err;
@@ -2231,7 +2231,7 @@ void fuse_init_dir(struct inode *inode)
static int fuse_symlink_read_folio(struct file *null, struct folio *folio)
{
- int err = fuse_readlink_page(folio->mapping->host, folio);
+ int err = fuse_readlink_folio(folio->mapping->host, folio);
if (!err)
folio_mark_uptodate(folio);
--
2.43.5
* [PATCH v2 07/12] fuse: support large folios for stores
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for stores.
Also change variable naming from "this_num" to "nr_bytes".
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/dev.c | 19 ++++++++++++-------
1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8d6418972fe5..bbb8a8a3cf8b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1655,18 +1655,23 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
num = outarg.size;
while (num) {
struct folio *folio;
- unsigned int this_num;
+ unsigned int folio_offset;
+ unsigned int nr_bytes;
+ unsigned int nr_pages;
folio = filemap_grab_folio(mapping, index);
err = PTR_ERR(folio);
if (IS_ERR(folio))
goto out_iput;
- this_num = min_t(unsigned, num, folio_size(folio) - offset);
- err = fuse_copy_folio(cs, &folio, offset, this_num, 0);
+ folio_offset = ((index - folio->index) << PAGE_SHIFT) + offset;
+ nr_bytes = min_t(unsigned, num, folio_size(folio) - folio_offset);
+ nr_pages = (offset + nr_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+ err = fuse_copy_folio(cs, &folio, folio_offset, nr_bytes, 0);
if (!folio_test_uptodate(folio) && !err && offset == 0 &&
- (this_num == folio_size(folio) || file_size == end)) {
- folio_zero_segment(folio, this_num, folio_size(folio));
+ (nr_bytes == folio_size(folio) || file_size == end)) {
+ folio_zero_segment(folio, nr_bytes, folio_size(folio));
folio_mark_uptodate(folio);
}
folio_unlock(folio);
@@ -1675,9 +1680,9 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
if (err)
goto out_iput;
- num -= this_num;
+ num -= nr_bytes;
offset = 0;
- index++;
+ index += nr_pages;
}
err = 0;
--
2.43.5
* [PATCH v2 08/12] fuse: support large folios for queued writes
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for queued writes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/file.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b12b3cb96450..1cf11ba556f9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1791,11 +1791,14 @@ __releases(fi->lock)
__acquires(fi->lock)
{
struct fuse_inode *fi = get_fuse_inode(wpa->inode);
+ struct fuse_args_pages *ap = &wpa->ia.ap;
struct fuse_write_in *inarg = &wpa->ia.write.in;
- struct fuse_args *args = &wpa->ia.ap.args;
- /* Currently, all folios in FUSE are one page */
- __u64 data_size = wpa->ia.ap.num_folios * PAGE_SIZE;
- int err;
+ struct fuse_args *args = &ap->args;
+ __u64 data_size = 0;
+ int err, i;
+
+ for (i = 0; i < ap->num_folios; i++)
+ data_size += ap->descs[i].length;
fi->writectr++;
if (inarg->offset + data_size <= size) {
--
2.43.5
* [PATCH v2 09/12] fuse: support large folios for readahead
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for readahead.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
fs/fuse/file.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1cf11ba556f9..590a3f2fa310 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -885,14 +885,13 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
fuse_io_free(ia);
}
-static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
+static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file,
+ unsigned int count)
{
struct fuse_file *ff = file->private_data;
struct fuse_mount *fm = ff->fm;
struct fuse_args_pages *ap = &ia->ap;
loff_t pos = folio_pos(ap->folios[0]);
- /* Currently, all folios in FUSE are one page */
- size_t count = ap->num_folios << PAGE_SHIFT;
ssize_t res;
int err;
@@ -929,6 +928,7 @@ static void fuse_readahead(struct readahead_control *rac)
unsigned int max_pages, nr_pages;
loff_t first = readahead_pos(rac);
loff_t last = first + readahead_length(rac) - 1;
+ struct folio *folio = NULL;
if (fuse_is_bad(inode))
return;
@@ -952,8 +952,8 @@ static void fuse_readahead(struct readahead_control *rac)
while (nr_pages) {
struct fuse_io_args *ia;
struct fuse_args_pages *ap;
- struct folio *folio;
unsigned cur_pages = min(max_pages, nr_pages);
+ unsigned int pages = 0;
if (fc->num_background >= fc->congestion_threshold &&
rac->ra->async_size >= readahead_count(rac))
@@ -968,14 +968,24 @@ static void fuse_readahead(struct readahead_control *rac)
return;
ap = &ia->ap;
- while (ap->num_folios < cur_pages) {
- folio = readahead_folio(rac);
+ while (pages < cur_pages) {
+ unsigned int folio_pages;
+
+ if (!folio)
+ folio = readahead_folio(rac);
+
+ folio_pages = folio_nr_pages(folio);
+ if (folio_pages > cur_pages - pages)
+ break;
+
ap->folios[ap->num_folios] = folio;
ap->descs[ap->num_folios].length = folio_size(folio);
ap->num_folios++;
+ pages += folio_pages;
+ folio = NULL;
}
- fuse_send_readpages(ia, rac->file);
- nr_pages -= cur_pages;
+ fuse_send_readpages(ia, rac->file, pages << PAGE_SHIFT);
+ nr_pages -= pages;
}
}
--
2.43.5
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v2 10/12] fuse: support large folios for direct io
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for direct io.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/file.c | 34 ++++++++++++++++++++++------------
1 file changed, 22 insertions(+), 12 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 590a3f2fa310..a907848f387a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1482,7 +1482,8 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
return -ENOMEM;
while (nbytes < *nbytesp && nr_pages < max_pages) {
- unsigned nfolios, i;
+ unsigned npages;
+ unsigned i = 0;
size_t start;
ret = iov_iter_extract_pages(ii, &pages,
@@ -1494,19 +1495,28 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
nbytes += ret;
- ret += start;
- /* Currently, all folios in FUSE are one page */
- nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
+ npages = DIV_ROUND_UP(ret + start, PAGE_SIZE);
- ap->descs[ap->num_folios].offset = start;
- fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
- for (i = 0; i < nfolios; i++)
- ap->folios[i + ap->num_folios] = page_folio(pages[i]);
+ while (ret && i < npages) {
+ struct folio *folio;
+ unsigned int folio_offset;
+ unsigned int len;
- ap->num_folios += nfolios;
- ap->descs[ap->num_folios - 1].length -=
- (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
- nr_pages += nfolios;
+ folio = page_folio(pages[i]);
+ folio_offset = ((size_t)folio_page_idx(folio, pages[i]) <<
+ PAGE_SHIFT) + start;
+ len = min_t(ssize_t, ret, folio_size(folio) - folio_offset);
+
+ ap->folios[ap->num_folios] = folio;
+ ap->descs[ap->num_folios].offset = folio_offset;
+ ap->descs[ap->num_folios].length = len;
+ ap->num_folios++;
+
+ ret -= len;
+ i += DIV_ROUND_UP(start + len, PAGE_SIZE);
+ start = 0;
+ }
+ nr_pages += npages;
}
kfree(pages);
--
2.43.5
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v2 11/12] fuse: support large folios for writeback
2024-11-25 22:05 [PATCH v2 00/12] fuse: support large folios Joanne Koong
` (9 preceding siblings ...)
2024-11-25 22:05 ` [PATCH v2 10/12] fuse: support large folios for direct io Joanne Koong
@ 2024-11-25 22:05 ` Joanne Koong
2024-11-25 22:05 ` [PATCH v2 12/12] fuse: enable large folios Joanne Koong
2024-12-06 9:50 ` [PATCH v2 00/12] fuse: support " Jingbo Xu
12 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Add support for folios larger than one page size for writeback.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/file.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a907848f387a..487e68b59e1a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1991,7 +1991,7 @@ static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struc
folio_get(folio);
ap->folios[folio_index] = folio;
ap->descs[folio_index].offset = 0;
- ap->descs[folio_index].length = PAGE_SIZE;
+ ap->descs[folio_index].length = folio_size(folio);
inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
node_stat_add_folio(folio, NR_WRITEBACK);
@@ -2066,6 +2066,7 @@ struct fuse_fill_wb_data {
struct fuse_file *ff;
struct inode *inode;
unsigned int max_folios;
+ unsigned int nr_pages;
};
static bool fuse_pages_realloc(struct fuse_fill_wb_data *data)
@@ -2113,15 +2114,15 @@ static bool fuse_writepage_need_send(struct fuse_conn *fc, struct folio *folio,
WARN_ON(!ap->num_folios);
/* Reached max pages */
- if (ap->num_folios == fc->max_pages)
+ if (data->nr_pages + folio_nr_pages(folio) > fc->max_pages)
return true;
/* Reached max write bytes */
- if ((ap->num_folios + 1) * PAGE_SIZE > fc->max_write)
+ if ((data->nr_pages * PAGE_SIZE) + folio_size(folio) > fc->max_write)
return true;
/* Discontinuity */
- if (ap->folios[ap->num_folios - 1]->index + 1 != folio_index(folio))
+ if (folio_next_index(ap->folios[ap->num_folios - 1]) != folio_index(folio))
return true;
/* Need to grow the pages array? If so, did the expansion fail? */
@@ -2152,6 +2153,7 @@ static int fuse_writepages_fill(struct folio *folio,
if (wpa && fuse_writepage_need_send(fc, folio, ap, data)) {
fuse_writepages_send(data);
data->wpa = NULL;
+ data->nr_pages = 0;
}
if (data->wpa == NULL) {
@@ -2166,6 +2168,7 @@ static int fuse_writepages_fill(struct folio *folio,
folio_start_writeback(folio);
fuse_writepage_args_page_fill(wpa, folio, ap->num_folios);
+ data->nr_pages += folio_nr_pages(folio);
err = 0;
ap->num_folios++;
@@ -2196,6 +2199,7 @@ static int fuse_writepages(struct address_space *mapping,
data.inode = inode;
data.wpa = NULL;
data.ff = NULL;
+ data.nr_pages = 0;
err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
if (data.wpa) {
--
2.43.5
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v2 12/12] fuse: enable large folios
2024-11-25 22:05 [PATCH v2 00/12] fuse: support large folios Joanne Koong
` (10 preceding siblings ...)
2024-11-25 22:05 ` [PATCH v2 11/12] fuse: support large folios for writeback Joanne Koong
@ 2024-11-25 22:05 ` Joanne Koong
2024-12-06 9:50 ` [PATCH v2 00/12] fuse: support " Jingbo Xu
12 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2024-11-25 22:05 UTC (permalink / raw)
To: miklos, linux-fsdevel
Cc: josef, bernd.schubert, jefflexu, willy, shakeel.butt, kernel-team
Enable folios larger than one page size.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
fs/fuse/file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 487e68b59e1a..b4c4f3575c42 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3164,12 +3164,17 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
{
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_conn *fc = get_fuse_conn(inode);
+ unsigned int max_pages, max_order;
inode->i_fop = &fuse_file_operations;
inode->i_data.a_ops = &fuse_file_aops;
if (fc->writeback_cache)
mapping_set_writeback_may_block(&inode->i_data);
+ max_pages = min(fc->max_write >> PAGE_SHIFT, fc->max_pages);
+ max_order = ilog2(max_pages);
+ mapping_set_folio_order_range(inode->i_mapping, 0, max_order);
+
INIT_LIST_HEAD(&fi->write_files);
INIT_LIST_HEAD(&fi->queued_writes);
fi->writectr = 0;
--
2.43.5
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-11-25 22:05 [PATCH v2 00/12] fuse: support large folios Joanne Koong
` (11 preceding siblings ...)
2024-11-25 22:05 ` [PATCH v2 12/12] fuse: enable large folios Joanne Koong
@ 2024-12-06 9:50 ` Jingbo Xu
2024-12-06 17:41 ` Joanne Koong
12 siblings, 1 reply; 28+ messages in thread
From: Jingbo Xu @ 2024-12-06 9:50 UTC (permalink / raw)
To: Joanne Koong, miklos, linux-fsdevel
Cc: josef, bernd.schubert, willy, shakeel.butt, kernel-team
Hi Joanne,
Have not checked the whole series yet, but I just spent some time on
testing, attempting to gather some statistics on the performance improvement.
At least we need:
@@ -2212,7 +2213,7 @@ static int fuse_write_begin(struct file *file,
struct address_space *mapping,
WARN_ON(!fc->writeback_cache);
- folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
+ folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN |
fgf_set_order(len),
Otherwise the large folio is not enabled on the buffer write path.
Besides, when applying the above diff, the large folio is indeed enabled
but it suffers severe performance regression:
fio 1 job buffer write:
2GB/s BW w/o large folio, and 200MB/s BW w/ large folio
Have not figured it out yet.
On 11/26/24 6:05 AM, Joanne Koong wrote:
> This patchset adds support for folios larger than one page size in FUSE.
>
> This patchset is rebased on top of the (unmerged) patchset that removes temp
> folios in writeback [1]. (There is also a version of this patchset that is
> independent from that change, but that version has two additional patches
> needed to account for temp folios and temp folio copying, which may require
> some debate to get the API right, as these two patches add generic
> (non-FUSE) helpers. For simplicity's sake for now, I sent out this patchset
> version rebased on top of the patchset that removes temp pages)
>
> This patchset was tested by running it through fstests on passthrough_hp.
>
> Benchmarks show roughly a ~45% improvement in read throughput.
>
> Benchmark setup:
>
> -- Set up server --
> ./libfuse/build/example/passthrough_hp --bypass-rw=1 ~/libfuse
> ~/mounts/fuse/ --nopassthrough
> (using libfuse patched with https://github.com/libfuse/libfuse/pull/807)
>
> -- Run fio --
> fio --name=read --ioengine=sync --rw=read --bs=1M --size=1G
> --numjobs=2 --ramp_time=30 --group_reporting=1
> --directory=mounts/fuse/
>
> Machine 1:
> No large folios: ~4400 MiB/s
> Large folios: ~7100 MiB/s
>
> Machine 2:
> No large folios: ~3700 MiB/s
> Large folios: ~6400 MiB/s
>
> Writes are still effectively one page size. Benchmarks showed that trying to get
> the largest folios possible from __filemap_get_folio() is an over-optimization
> and ends up being significantly more expensive. Fine-tuning for the optimal
> order size for the __filemap_get_folio() calls can be done in a future patchset.
>
> [1] https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
>
> Changelog:
> v1: https://lore.kernel.org/linux-fsdevel/20241109001258.2216604-1-joannelkoong@gmail.com/
> v1 -> v2:
> * Change naming from "non-writeback write" to "writethrough write"
> * Fix deadlock for writethrough writes by calling fault_in_iov_iter_readable() first
> before __filemap_get_folio() (Josef)
> * For readahead, retain original folio_size() for descs.length (Josef)
> * Use folio_zero_range() api in fuse_copy_folio() (Josef)
> * Add Josef's reviewed-bys
>
> Joanne Koong (12):
> fuse: support copying large folios
> fuse: support large folios for retrieves
> fuse: refactor fuse_fill_write_pages()
> fuse: support large folios for writethrough writes
> fuse: support large folios for folio reads
> fuse: support large folios for symlinks
> fuse: support large folios for stores
> fuse: support large folios for queued writes
> fuse: support large folios for readahead
> fuse: support large folios for direct io
> fuse: support large folios for writeback
> fuse: enable large folios
>
> fs/fuse/dev.c | 128 ++++++++++++++++++++++++-------------------------
> fs/fuse/dir.c | 8 ++--
> fs/fuse/file.c | 126 +++++++++++++++++++++++++++++++-----------------
> 3 files changed, 149 insertions(+), 113 deletions(-)
>
--
Thanks,
Jingbo
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 9:50 ` [PATCH v2 00/12] fuse: support " Jingbo Xu
@ 2024-12-06 17:41 ` Joanne Koong
2024-12-06 20:36 ` Shakeel Butt
2024-12-06 22:25 ` Matthew Wilcox
0 siblings, 2 replies; 28+ messages in thread
From: Joanne Koong @ 2024-12-06 17:41 UTC (permalink / raw)
To: Jingbo Xu
Cc: miklos, linux-fsdevel, josef, bernd.schubert, willy, shakeel.butt,
kernel-team
On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Joanne,
>
> Have not checked the whole series yet, but I just spent some time on
> testing, attempting to gather some statistics on the performance improvement.
>
> At least we need:
>
> @@ -2212,7 +2213,7 @@ static int fuse_write_begin(struct file *file,
> struct address_space *mapping,
>
> WARN_ON(!fc->writeback_cache);
>
> - folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> + folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN |
> fgf_set_order(len),
>
> Otherwise the large folio is not enabled on the buffer write path.
>
>
> Besides, when applying the above diff, the large folio is indeed enabled
> but it suffers severe performance regression:
>
> fio 1 job buffer write:
> 2GB/s BW w/o large folio, and 200MB/s BW w/ large folio
>
> Have not figured it out yet.
>
Hi Jingbo,
Thanks for running some benchmarks / tests on your end.
>
> On 11/26/24 6:05 AM, Joanne Koong wrote:
> > This patchset adds support for folios larger than one page size in FUSE.
> >
> > This patchset is rebased on top of the (unmerged) patchset that removes temp
> > folios in writeback [1]. (There is also a version of this patchset that is
> > independent from that change, but that version has two additional patches
> > needed to account for temp folios and temp folio copying, which may require
> > some debate to get the API right, as these two patches add generic
> > (non-FUSE) helpers. For simplicity's sake for now, I sent out this patchset
> > version rebased on top of the patchset that removes temp pages)
> >
> > This patchset was tested by running it through fstests on passthrough_hp.
> >
> > Benchmarks show roughly a ~45% improvement in read throughput.
> >
> > Benchmark setup:
> >
> > -- Set up server --
> > ./libfuse/build/example/passthrough_hp --bypass-rw=1 ~/libfuse
> > ~/mounts/fuse/ --nopassthrough
> > (using libfuse patched with https://github.com/libfuse/libfuse/pull/807)
> >
> > -- Run fio --
> > fio --name=read --ioengine=sync --rw=read --bs=1M --size=1G
> > --numjobs=2 --ramp_time=30 --group_reporting=1
> > --directory=mounts/fuse/
> >
> > Machine 1:
> > No large folios: ~4400 MiB/s
> > Large folios: ~7100 MiB/s
> >
> > Machine 2:
> > No large folios: ~3700 MiB/s
> > Large folios: ~6400 MiB/s
> >
> > Writes are still effectively one page size. Benchmarks showed that trying to get
> > the largest folios possible from __filemap_get_folio() is an over-optimization
> > and ends up being significantly more expensive. Fine-tuning for the optimal
> > order size for the __filemap_get_folio() calls can be done in a future patchset.
This is the behavior I noticed as well when running some benchmarks on
v1 [1]. I think it's because when we call into __filemap_get_folio(),
we hit the FGP_CREAT path and if the order we set is too high, the
internal call to filemap_alloc_folio() will repeatedly fail until it
finds an order it's able to allocate (eg the do { ... } while (order-- > min_order) loop).
Thanks,
Joanne
[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1aPVwNmv2uxYLwtdwGqe=QUROUXmZc8BiLAV=uqrnCrrw@mail.gmail.com/
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20241107235614.3637221-1-joannelkoong@gmail.com/
> >
> > Changelog:
> > v1: https://lore.kernel.org/linux-fsdevel/20241109001258.2216604-1-joannelkoong@gmail.com/
> > v1 -> v2:
> > * Change naming from "non-writeback write" to "writethrough write"
> > * Fix deadlock for writethrough writes by calling fault_in_iov_iter_readable() first
> > before __filemap_get_folio() (Josef)
> > * For readahead, retain original folio_size() for descs.length (Josef)
> > * Use folio_zero_range() api in fuse_copy_folio() (Josef)
> > * Add Josef's reviewed-bys
> >
> > Joanne Koong (12):
> > fuse: support copying large folios
> > fuse: support large folios for retrieves
> > fuse: refactor fuse_fill_write_pages()
> > fuse: support large folios for writethrough writes
> > fuse: support large folios for folio reads
> > fuse: support large folios for symlinks
> > fuse: support large folios for stores
> > fuse: support large folios for queued writes
> > fuse: support large folios for readahead
> > fuse: support large folios for direct io
> > fuse: support large folios for writeback
> > fuse: enable large folios
> >
> > fs/fuse/dev.c | 128 ++++++++++++++++++++++++-------------------------
> > fs/fuse/dir.c | 8 ++--
> > fs/fuse/file.c | 126 +++++++++++++++++++++++++++++++-----------------
> > 3 files changed, 149 insertions(+), 113 deletions(-)
> >
>
> --
> Thanks,
> Jingbo
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 17:41 ` Joanne Koong
@ 2024-12-06 20:36 ` Shakeel Butt
2024-12-06 22:11 ` Joanne Koong
2024-12-06 22:25 ` Matthew Wilcox
1 sibling, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2024-12-06 20:36 UTC (permalink / raw)
To: Joanne Koong
Cc: Jingbo Xu, miklos, linux-fsdevel, josef, bernd.schubert, willy,
kernel-team
On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
[...]
> >
> > >
> > > Writes are still effectively one page size. Benchmarks showed that trying to get
> > > the largest folios possible from __filemap_get_folio() is an over-optimization
> > > and ends up being significantly more expensive. Fine-tuning for the optimal
> > > order size for the __filemap_get_folio() calls can be done in a future patchset.
>
> This is the behavior I noticed as well when running some benchmarks on
> v1 [1]. I think it's because when we call into __filemap_get_folio(),
> we hit the FGP_CREAT path and if the order we set is too high, the
> internal call to filemap_alloc_folio() will repeatedly fail until it
> finds an order it's able to allocate (eg the do { ... } while (order--
> > min_order) loop).
>
What is the mapping_min_folio_order(mapping) for fuse? One thing we can
do is decide for which range of orders we want a cheap failure (i.e. without
__GFP_DIRECT_RECLAIM) and for which range we are fine with some effort
and work. I see __GFP_NORETRY is being used for orders larger than
min_order; please note that this flag still allows one iteration of
reclaim and compaction, so it is not necessarily cheap.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 20:36 ` Shakeel Butt
@ 2024-12-06 22:11 ` Joanne Koong
2024-12-06 22:27 ` Shakeel Butt
0 siblings, 1 reply; 28+ messages in thread
From: Joanne Koong @ 2024-12-06 22:11 UTC (permalink / raw)
To: Shakeel Butt
Cc: Jingbo Xu, miklos, linux-fsdevel, josef, bernd.schubert, willy,
kernel-team
On Fri, Dec 6, 2024 at 12:36 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> > On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> [...]
> > >
> > > >
> > > > Writes are still effectively one page size. Benchmarks showed that trying to get
> > > > the largest folios possible from __filemap_get_folio() is an over-optimization
> > > > and ends up being significantly more expensive. Fine-tuning for the optimal
> > > > order size for the __filemap_get_folio() calls can be done in a future patchset.
> >
> > This is the behavior I noticed as well when running some benchmarks on
> > v1 [1]. I think it's because when we call into __filemap_get_folio(),
> > we hit the FGP_CREAT path and if the order we set is too high, the
> > internal call to filemap_alloc_folio() will repeatedly fail until it
> > finds an order it's able to allocate (eg the do { ... } while (order--
> > > min_order) loop).
> >
>
> What is the mapping_min_folio_order(mapping) for fuse? One thing we can
The mapping_min_folio_order used is 0. The folio order range gets set here [1]
[1] https://lore.kernel.org/linux-fsdevel/20241125220537.3663725-13-joannelkoong@gmail.com/
> do is decide for which range of orders we want a cheap failure i.e. without
> __GFP_DIRECT_RECLAIM and then the range where we are fine with some
> effort and work. I see __GFP_NORETRY is being used for orders larger
The gfp flags we pass into __filemap_get_folio() are the gfp flags of
the mapping, and that gets set in inode_init_always_gfp() to
GFP_HIGHUSER_MOVABLE, which does include __GFP_RECLAIM.
If __GFP_RECLAIM is set and the filemap_alloc_folio() call can't find
enough space, does this automatically trigger a round of reclaim and
compaction as well?
> than min_order, please note that this flag still allows one iteration of
> reclaim and compaction, so not necessarily cheap.
Thanks,
Joanne
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 17:41 ` Joanne Koong
2024-12-06 20:36 ` Shakeel Butt
@ 2024-12-06 22:25 ` Matthew Wilcox
2024-12-10 0:31 ` Joanne Koong
1 sibling, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2024-12-06 22:25 UTC (permalink / raw)
To: Joanne Koong
Cc: Jingbo Xu, miklos, linux-fsdevel, josef, bernd.schubert,
shakeel.butt, kernel-team
On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > - folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> > + folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN |
> > fgf_set_order(len),
> >
> > Otherwise the large folio is not enabled on the buffer write path.
> >
> >
> > Besides, when applying the above diff, the large folio is indeed enabled
> > but it suffers severe performance regression:
> >
> > fio 1 job buffer write:
> > 2GB/s BW w/o large folio, and 200MB/s BW w/ large folio
>
> This is the behavior I noticed as well when running some benchmarks on
> v1 [1]. I think it's because when we call into __filemap_get_folio(),
> we hit the FGP_CREAT path and if the order we set is too high, the
> internal call to filemap_alloc_folio() will repeatedly fail until it
> finds an order it's able to allocate (eg the do { ... } while (order--
> > min_order) loop).
But this is very different from what other filesystems have measured
when allocating large folios during writes. eg:
https://lore.kernel.org/linux-fsdevel/20240527163616.1135968-1-hch@lst.de/
So we need to understand what's different about fuse. My suspicion is
that it's disabling some other optimisation that is only done on
order 0 folios, but that's just wild speculation. Needs someone to
dig into it and look at profiles to see what's really going on.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 22:11 ` Joanne Koong
@ 2024-12-06 22:27 ` Shakeel Butt
2024-12-06 22:33 ` Matthew Wilcox
0 siblings, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2024-12-06 22:27 UTC (permalink / raw)
To: Joanne Koong
Cc: Jingbo Xu, miklos, linux-fsdevel, josef, bernd.schubert, willy,
kernel-team
On Fri, Dec 06, 2024 at 02:11:26PM -0800, Joanne Koong wrote:
> On Fri, Dec 6, 2024 at 12:36 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> > > On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > [...]
> > > >
> > > > >
> > > > > Writes are still effectively one page size. Benchmarks showed that trying to get
> > > > > the largest folios possible from __filemap_get_folio() is an over-optimization
> > > > > and ends up being significantly more expensive. Fine-tuning for the optimal
> > > > > order size for the __filemap_get_folio() calls can be done in a future patchset.
> > >
> > > This is the behavior I noticed as well when running some benchmarks on
> > > v1 [1]. I think it's because when we call into __filemap_get_folio(),
> > > we hit the FGP_CREAT path and if the order we set is too high, the
> > > internal call to filemap_alloc_folio() will repeatedly fail until it
> > > finds an order it's able to allocate (eg the do { ... } while (order--
> > > > min_order) loop).
> > >
> >
> > What is the mapping_min_folio_order(mapping) for fuse? One thing we can
>
> The mapping_min_folio_order used is 0. The folio order range gets set here [1]
>
> [1] https://lore.kernel.org/linux-fsdevel/20241125220537.3663725-13-joannelkoong@gmail.com/
>
> > do is decide for which range of orders we want a cheap failure i.e. without
> > __GFP_DIRECT_RECLAIM and then the range where we are fine with some
> > effort and work. I see __GFP_NORETRY is being used for orders larger
>
> The gfp flags we pass into __filemap_get_folio() are the gfp flags of
> the mapping, and that gets set in inode_init_always_gfp() to
> GFP_HIGHUSER_MOVABLE, which does include __GFP_RECLAIM.
>
> If __GFP_RECLAIM is set and the filemap_alloc_folio() call can't find
> enough space, does this automatically trigger a round of reclaim and
> compaction as well?
Yes, it will trigger reclaim/compaction rounds, and depending on the order
size (order <= PAGE_ALLOC_COSTLY_ORDER), it can be very aggressive. The
__GFP_NORETRY flag limits this to one iteration, but a single iteration can
still be expensive depending on the system's condition.
For anon memory, or specifically THP allocation, we can tune it through
sysctls to be less aggressive, but there is infrastructure like
khugepaged which in the background converts small pages into THPs. I can
imagine that we might want something similar for filesystems as well.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 22:27 ` Shakeel Butt
@ 2024-12-06 22:33 ` Matthew Wilcox
2024-12-09 17:23 ` Shakeel Butt
0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2024-12-06 22:33 UTC (permalink / raw)
To: Shakeel Butt
Cc: Joanne Koong, Jingbo Xu, miklos, linux-fsdevel, josef,
bernd.schubert, kernel-team
On Fri, Dec 06, 2024 at 02:27:36PM -0800, Shakeel Butt wrote:
> For anon memory, or specifically THP allocation, we can tune it through
> sysctls to be less aggressive, but there is infrastructure like
> khugepaged which in the background converts small pages into THPs. I can
> imagine that we might want something similar for filesystems as well.
khugepaged also works on files. Where Somebody Needs To Do Something
is that right now it gives up if it sees large folios (because it was
written when there were two sizes of folio -- order 0 and PMD_ORDER).
Nobody has yet taken on making it turn order-N folios into PMD_ORDER
folios. Perhaps you'd like to do that?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 10/12] fuse: support large folios for direct io
2024-11-25 22:05 ` [PATCH v2 10/12] fuse: support large folios for direct io Joanne Koong
@ 2024-12-09 15:50 ` Josef Bacik
2024-12-09 15:54 ` Matthew Wilcox
0 siblings, 1 reply; 28+ messages in thread
From: Josef Bacik @ 2024-12-09 15:50 UTC (permalink / raw)
To: Joanne Koong
Cc: miklos, linux-fsdevel, bernd.schubert, jefflexu, willy,
shakeel.butt, kernel-team
On Mon, Nov 25, 2024 at 02:05:35PM -0800, Joanne Koong wrote:
> Add support for folios larger than one page size for direct io.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/fuse/file.c | 34 ++++++++++++++++++++++------------
> 1 file changed, 22 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 590a3f2fa310..a907848f387a 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1482,7 +1482,8 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> return -ENOMEM;
>
> while (nbytes < *nbytesp && nr_pages < max_pages) {
> - unsigned nfolios, i;
> + unsigned npages;
> + unsigned i = 0;
> size_t start;
>
> ret = iov_iter_extract_pages(ii, &pages,
> @@ -1494,19 +1495,28 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>
> nbytes += ret;
>
> - ret += start;
> - /* Currently, all folios in FUSE are one page */
> - nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> + npages = DIV_ROUND_UP(ret + start, PAGE_SIZE);
>
> - ap->descs[ap->num_folios].offset = start;
> - fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> - for (i = 0; i < nfolios; i++)
> - ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> + while (ret && i < npages) {
> + struct folio *folio;
> + unsigned int folio_offset;
> + unsigned int len;
>
> - ap->num_folios += nfolios;
> - ap->descs[ap->num_folios - 1].length -=
> - (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
> - nr_pages += nfolios;
> + folio = page_folio(pages[i]);
> + folio_offset = ((size_t)folio_page_idx(folio, pages[i]) <<
> + PAGE_SHIFT) + start;
> + len = min_t(ssize_t, ret, folio_size(folio) - folio_offset);
> +
> + ap->folios[ap->num_folios] = folio;
> + ap->descs[ap->num_folios].offset = folio_offset;
> + ap->descs[ap->num_folios].length = len;
> + ap->num_folios++;
> +
> + ret -= len;
> + i += DIV_ROUND_UP(start + len, PAGE_SIZE);
> + start = 0;
As we've noticed in the upstream bug report for your initial work here, this
isn't quite correct, as we could have gotten a large folio in from userspace. I
think the better thing here is to do the page extraction, and then keep track of
the last folio we saw, and simply skip any folios that are the same for the
pages we have. This way we can handle large folios correctly. Thanks,
Josef
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 10/12] fuse: support large folios for direct io
2024-12-09 15:50 ` Josef Bacik
@ 2024-12-09 15:54 ` Matthew Wilcox
2024-12-11 21:04 ` Joanne Koong
0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2024-12-09 15:54 UTC (permalink / raw)
To: Josef Bacik
Cc: Joanne Koong, miklos, linux-fsdevel, bernd.schubert, jefflexu,
shakeel.butt, kernel-team
On Mon, Dec 09, 2024 at 10:50:42AM -0500, Josef Bacik wrote:
> As we've noticed in the upstream bug report for your initial work here, this
> isn't quite correct, as we could have gotten a large folio in from userspace. I
> think the better thing here is to do the page extraction, and then keep track of
> the last folio we saw, and simply skip any folios that are the same for the
> pages we have. This way we can handle large folios correctly. Thanks,
Some people have in the past thought that they could skip subsequent
page lookup if the folio they get back is large. This is an incorrect
optimisation. Userspace may mmap() a file PROT_WRITE, MAP_PRIVATE.
If they store to the middle of a large folio (the file that is mmaped
may be on a filesystem that does support large folios, rather than
fuse), then we'll have, eg:
folio A page 0
folio A page 1
folio B page 0
folio A page 3
where folio A belongs to the file and folio B is an anonymous COW page.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 22:33 ` Matthew Wilcox
@ 2024-12-09 17:23 ` Shakeel Butt
0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2024-12-09 17:23 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Joanne Koong, Jingbo Xu, miklos, linux-fsdevel, josef,
bernd.schubert, kernel-team
On Fri, Dec 06, 2024 at 10:33:56PM +0000, Matthew Wilcox wrote:
> On Fri, Dec 06, 2024 at 02:27:36PM -0800, Shakeel Butt wrote:
> > For anon memory, or specifically THP allocation, we can tune it through
> > sysctls to be less aggressive, but there is infrastructure like
> > khugepaged which in the background converts small pages into THPs. I can
> > imagine that we might want something similar for filesystems as well.
>
> khugepaged also works on files. Where Somebody Needs To Do Something
> is that right now it gives up if it sees large folios (because it was
> written when there were two sizes of folio -- order 0 and PMD_ORDER).
> Nobody has yet taken on making it turn order-N folios into PMD_ORDER
> folios. Perhaps you'd like to do that?
I was hoping the mTHP work for khugepaged (haven't checked the latest
there yet) would add the necessary building blocks so that adding file
support would be trivial. I will check the latest there and see what we
can do sometime next year.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-06 22:25 ` Matthew Wilcox
@ 2024-12-10 0:31 ` Joanne Koong
2025-01-08 21:03 ` Joanne Koong
0 siblings, 1 reply; 28+ messages in thread
From: Joanne Koong @ 2024-12-10 0:31 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jingbo Xu, miklos, linux-fsdevel, josef, bernd.schubert,
shakeel.butt, kernel-team
On Fri, Dec 6, 2024 at 2:25 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> > On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > - folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> > > + folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN |
> > > fgf_set_order(len),
> > >
> > > Otherwise the large folio is not enabled on the buffer write path.
> > >
> > >
> > > Besides, when applying the above diff, the large folio is indeed enabled
> > > but it suffers severe performance regression:
> > >
> > > fio 1 job buffer write:
> > > 2GB/s BW w/o large folio, and 200MB/s BW w/ large folio
> >
> > This is the behavior I noticed as well when running some benchmarks on
> > v1 [1]. I think it's because when we call into __filemap_get_folio(),
> > we hit the FGP_CREAT path and if the order we set is too high, the
> > internal call to filemap_alloc_folio() will repeatedly fail until it
> > finds an order it's able to allocate (eg the do { ... } while (order--
> > > min_order) loop).
>
> But this is very different frrom what other filesystems have measured
> when allocating large folios during writes. eg:
>
> https://lore.kernel.org/linux-fsdevel/20240527163616.1135968-1-hch@lst.de/
Ok, this seems like something particular to FUSE then, if all the
other filesystems are seeing 2x throughput improvements for buffered
writes. If someone doesn't get to this before me, I'll look deeper
into this.
Thanks,
Joanne
>
> So we need to understand what's different about fuse. My suspicion is
> that it's disabling some other optimisation that is only done on
> order 0 folios, but that's just wild speculation. Needs someone to
> dig into it and look at profiles to see what's really going on.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 10/12] fuse: support large folios for direct io
2024-12-09 15:54 ` Matthew Wilcox
@ 2024-12-11 21:04 ` Joanne Koong
2024-12-11 21:11 ` Matthew Wilcox
0 siblings, 1 reply; 28+ messages in thread
From: Joanne Koong @ 2024-12-11 21:04 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Josef Bacik, miklos, linux-fsdevel, bernd.schubert, jefflexu,
shakeel.butt, kernel-team
On Mon, Dec 9, 2024 at 7:54 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Dec 09, 2024 at 10:50:42AM -0500, Josef Bacik wrote:
> > As we've noticed in the upstream bug report for your initial work here, this
> > isn't quite correct, as we could have gotten a large folio in from userspace. I
> > think the better thing here is to do the page extraction, and then keep track of
> > the last folio we saw, and simply skip any folios that are the same for the
> > pages we have. This way we can handle large folios correctly. Thanks,
>
> Some people have in the past thought that they could skip subsequent
> page lookup if the folio they get back is large. This is an incorrect
> optimisation. Userspace may mmap() a file PROT_WRITE, MAP_PRIVATE.
> If they store to the middle of a large folio (the file that is mmaped
> may be on a filesystem that does support large folios, rather than
> fuse), then we'll have, eg:
>
> folio A page 0
> folio A page 1
> folio B page 0
> folio A page 3
>
> where folio A belongs to the file and folio B is an anonymous COW page.
Sounds good, I'll fix this up in v3. Thanks.
* Re: [PATCH v2 10/12] fuse: support large folios for direct io
2024-12-11 21:04 ` Joanne Koong
@ 2024-12-11 21:11 ` Matthew Wilcox
2024-12-11 21:35 ` Joanne Koong
0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2024-12-11 21:11 UTC (permalink / raw)
To: Joanne Koong
Cc: Josef Bacik, miklos, linux-fsdevel, bernd.schubert, jefflexu,
shakeel.butt, kernel-team
On Wed, Dec 11, 2024 at 01:04:45PM -0800, Joanne Koong wrote:
> On Mon, Dec 9, 2024 at 7:54 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Dec 09, 2024 at 10:50:42AM -0500, Josef Bacik wrote:
> > > As we've noticed in the upstream bug report for your initial work here, this
> > > isn't quite correct, as we could have gotten a large folio in from userspace. I
> > > think the better thing here is to do the page extraction, and then keep track of
> > > the last folio we saw, and simply skip any folios that are the same for the
> > > pages we have. This way we can handle large folios correctly. Thanks,
> >
> > Some people have in the past thought that they could skip subsequent
> > page lookup if the folio they get back is large. This is an incorrect
> > optimisation. Userspace may mmap() a file PROT_WRITE, MAP_PRIVATE.
> > If they store to the middle of a large folio (the file that is mmaped
> > may be on a filesystem that does support large folios, rather than
> > fuse), then we'll have, eg:
> >
> > folio A page 0
> > folio A page 1
> > folio B page 0
> > folio A page 3
> >
> > where folio A belongs to the file and folio B is an anonymous COW page.
>
> Sounds good, I'll fix this up in v3. Thanks.
Hm? I didn't notice this bug in your code, just mentioning something
I've seen other people do and wanted to make sure you didn't. Did I
miss a bug in your code?
* Re: [PATCH v2 10/12] fuse: support large folios for direct io
2024-12-11 21:11 ` Matthew Wilcox
@ 2024-12-11 21:35 ` Joanne Koong
0 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2024-12-11 21:35 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Josef Bacik, miklos, linux-fsdevel, bernd.schubert, jefflexu,
shakeel.butt, kernel-team
On Wed, Dec 11, 2024 at 1:11 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Dec 11, 2024 at 01:04:45PM -0800, Joanne Koong wrote:
> > On Mon, Dec 9, 2024 at 7:54 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Dec 09, 2024 at 10:50:42AM -0500, Josef Bacik wrote:
> > > > As we've noticed in the upstream bug report for your initial work here, this
> > > > isn't quite correct, as we could have gotten a large folio in from userspace. I
> > > > think the better thing here is to do the page extraction, and then keep track of
> > > > the last folio we saw, and simply skip any folios that are the same for the
> > > > pages we have. This way we can handle large folios correctly. Thanks,
> > >
> > > Some people have in the past thought that they could skip subsequent
> > > page lookup if the folio they get back is large. This is an incorrect
> > > optimisation. Userspace may mmap() a file PROT_WRITE, MAP_PRIVATE.
> > > If they store to the middle of a large folio (the file that is mmaped
> > > may be on a filesystem that does support large folios, rather than
> > > fuse), then we'll have, eg:
> > >
> > > folio A page 0
> > > folio A page 1
> > > folio B page 0
> > > folio A page 3
> > >
> > > where folio A belongs to the file and folio B is an anonymous COW page.
> >
> > Sounds good, I'll fix this up in v3. Thanks.
>
> Hm? I didn't notice this bug in your code, just mentioning something
> I've seen other people do and wanted to make sure you didn't. Did I
> miss a bug in your code?
Hi Matthew,
I believe I'm doing this too in this patchset with these two lines:
len = min_t(ssize_t, ret, folio_size(folio) - folio_offset);
...
i += DIV_ROUND_UP(start + len, PAGE_SIZE);
(where i is the index into the array of extracted pages)
where I incorrectly assume the entire folio is contiguously
represented in the next set of extracted pages, so I just skip over
those.
Whereas what I need to do is check every page that was extracted to
see whether it actually belongs to the same folio as the previous page
and adjust the length calculations accordingly.
Thanks for flagging this.
* Re: [PATCH v2 00/12] fuse: support large folios
2024-12-10 0:31 ` Joanne Koong
@ 2025-01-08 21:03 ` Joanne Koong
0 siblings, 0 replies; 28+ messages in thread
From: Joanne Koong @ 2025-01-08 21:03 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jingbo Xu, miklos, linux-fsdevel, josef, bernd.schubert,
shakeel.butt, kernel-team
On Mon, Dec 9, 2024 at 4:31 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Dec 6, 2024 at 2:25 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:41:25AM -0800, Joanne Koong wrote:
> > > On Fri, Dec 6, 2024 at 1:50 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > > - folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
> > > > + folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN |
> > > > fgf_set_order(len),
> > > >
> > > > Otherwise the large folio is not enabled on the buffer write path.
> > > >
> > > >
> > > > Besides, when applying the above diff, the large folio is indeed enabled
> > > > but it suffers severe performance regression:
> > > >
> > > > fio 1 job buffer write:
> > > > 2GB/s BW w/o large folio, and 200MB/s BW w/ large folio
> > >
> > > This is the behavior I noticed as well when running some benchmarks on
> > > v1 [1]. I think it's because when we call into __filemap_get_folio(),
> > > we hit the FGP_CREAT path and if the order we set is too high, the
> > > internal call to filemap_alloc_folio() will repeatedly fail until it
> > > finds an order it's able to allocate (eg the do { ... } while (order--
> > > > min_order) loop).
> >
> > But this is very different from what other filesystems have measured
> > when allocating large folios during writes. eg:
> >
> > https://lore.kernel.org/linux-fsdevel/20240527163616.1135968-1-hch@lst.de/
>
> Ok, this seems like something particular to FUSE then, if all the
> other filesystems are seeing 2x throughput improvements for buffered
> writes. If someone doesn't get to this before me, I'll look deeper
> into this.
>
>
> Thanks,
> Joanne
> >
> > So we need to understand what's different about fuse. My suspicion is
> > that it's disabling some other optimisation that is only done on
> > order 0 folios, but that's just wild speculation. Needs someone to
> > dig into it and look at profiles to see what's really going on.
I got a chance to look more into this. This is happening because with
large folios, a large number of pages are dirtied per write, and when
the kernel balances dirty pages, it uses "HZ * pages_dirtied /
task_ratelimit" to determine whether an io timeout needs to be
scheduled while the writeback happens in the background. For large
folios, where lots of pages are dirtied at once, this usually results
in an io timeout, while small folios skirt this because they
incrementally balance / write back pages. The io wait is what's
incurring the extra cost for large folios.
The entry point into this is in generic_perform_write() where fuse
writeback caching calls into this through
fuse_cache_write_iter()
generic_file_write_iter()
__generic_file_write_iter()
generic_perform_write()
In generic_perform_write(), balance_dirty_pages_ratelimited() is
called per folio that's written. If we're doing a 1GB write where the
block size is 1MB, for small folios we write 1 page, call
balance_dirty_pages_ratelimited(), write the next page, call
balance_dirty_pages_ratelimited(), etc. In
balance_dirty_pages_ratelimited(), we only actually write back the
pages if the number of dirty pages exceeds ratelimit (on my running
system that's 16 pages), so effectively for small folios the number of
accumulated dirty pages is the ratelimit. Whereas with large folios,
we write 256 pages at a time, call
balance_dirty_pages_ratelimited(), exceed the ratelimit, go on to
actually balance pages with balance_dirty_pages(), and then have to
schedule an io wait. Small folios avoid scheduling this thanks to the
"if (pause < min_pause) { ... break; }" check in balance_dirty_pages().
Without the io wait, I'm seeing a significant improvement in large
folio performance, eg running fio with bs=1M size=1G:
small folios: ~1300 MB/s
large folios (w/ io waits) : ~300 MB/s
large folios (w/out io waits): ~2400 MB/s
Also fwiw, nfs also calls into generic_perform_write() for handling
writeback writes (eg nfs_file_write()). Running nfs on my localhost, I
see a perf drop for size=1G bs=1M writes (~430 MB/s with large folios
and ~550 MB/s with small folios), though it's nowhere near as large as
the perf drop for fuse.
Matthew, what are your thoughts on the best way to address this? Do
you think we should increase the min_pause threshold?
Thanks,
Joanne
2024-11-25 22:05 [PATCH v2 00/12] fuse: support large folios Joanne Koong
2024-11-25 22:05 ` [PATCH v2 01/12] fuse: support copying " Joanne Koong
2024-11-25 22:05 ` [PATCH v2 02/12] fuse: support large folios for retrieves Joanne Koong
2024-11-25 22:05 ` [PATCH v2 03/12] fuse: refactor fuse_fill_write_pages() Joanne Koong
2024-11-25 22:05 ` [PATCH v2 04/12] fuse: support large folios for writethrough writes Joanne Koong
2024-11-25 22:05 ` [PATCH v2 05/12] fuse: support large folios for folio reads Joanne Koong
2024-11-25 22:05 ` [PATCH v2 06/12] fuse: support large folios for symlinks Joanne Koong
2024-11-25 22:05 ` [PATCH v2 07/12] fuse: support large folios for stores Joanne Koong
2024-11-25 22:05 ` [PATCH v2 08/12] fuse: support large folios for queued writes Joanne Koong
2024-11-25 22:05 ` [PATCH v2 09/12] fuse: support large folios for readahead Joanne Koong
2024-11-25 22:05 ` [PATCH v2 10/12] fuse: support large folios for direct io Joanne Koong
2024-12-09 15:50 ` Josef Bacik
2024-12-09 15:54 ` Matthew Wilcox
2024-12-11 21:04 ` Joanne Koong
2024-12-11 21:11 ` Matthew Wilcox
2024-12-11 21:35 ` Joanne Koong
2024-11-25 22:05 ` [PATCH v2 11/12] fuse: support large folios for writeback Joanne Koong
2024-11-25 22:05 ` [PATCH v2 12/12] fuse: enable large folios Joanne Koong
2024-12-06 9:50 ` [PATCH v2 00/12] fuse: support " Jingbo Xu
2024-12-06 17:41 ` Joanne Koong
2024-12-06 20:36 ` Shakeel Butt
2024-12-06 22:11 ` Joanne Koong
2024-12-06 22:27 ` Shakeel Butt
2024-12-06 22:33 ` Matthew Wilcox
2024-12-09 17:23 ` Shakeel Butt
2024-12-06 22:25 ` Matthew Wilcox
2024-12-10 0:31 ` Joanne Koong
2025-01-08 21:03 ` Joanne Koong