public inbox for linux-kernel@vger.kernel.org
* [PATCH v2] fuse: back uncached readdir buffers with pages
@ 2026-04-28 23:29 Matthew R. Ochs
  2026-04-29  7:27 ` Miklos Szeredi
  2026-04-29  9:29 ` Bernd Schubert
  0 siblings, 2 replies; 6+ messages in thread
From: Matthew R. Ochs @ 2026-04-28 23:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Bernd Schubert, linux-fsdevel, linux-kernel

Commit dabb90391028 ("fuse: increase readdir buffer size") changed
fuse_readdir_uncached() to size its temporary buffer from ctx->count.
That is useful for overlayfs and other in-kernel callers that use
INT_MAX to indicate an unlimited directory read.

The buffer is capped by fc->max_pages converted to bytes with PAGE_SIZE.
However, fc->max_pages is a page-count limit, while fc->max_write is the
negotiated byte-sized payload limit. Using only fc->max_pages can produce
a READDIR request larger than the server is prepared to handle, especially
when the server and client use different page sizes.

The larger buffer is also currently supplied as a kvec output argument.
For virtiofs, kvec arguments are copied through req->argbuf, which is
allocated with kmalloc(..., GFP_ATOMIC). A large readdir buffer can
therefore require a multi-megabyte contiguous atomic allocation and fail
with -ENOMEM.

This was observed with a 64K-page guest on a 4K-page host, using an
overlayfs mount whose lower directory is on virtiofs. Reading a merged
directory through overlayfs failed with:

  ls: reading directory '<path>': Cannot allocate memory

Avoid the oversized request and the large bounce-buffer allocation by
capping the requested byte size by both fc->max_pages and fc->max_write,
then backing the uncached readdir output with pages and setting out_pages.
The virtiofs transport can then pass the pages as scatter-gather entries
instead of copying the output through argbuf.

Map the pages with vm_map_ram() only while parsing the returned dirents,
so the existing parser can continue to operate on a linear kernel mapping.

Fixes: dabb90391028 ("fuse: increase readdir buffer size")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
---
v2:
- Reworked uncached readdir to use output pages and out_pages, per Miklos.
- Cap the requested byte size by both fc->max_pages and fc->max_write.
- Map pages with vm_map_ram() only while parsing returned dirents.
- Verified with --overlay-rwdir across 4K/64K host and guest page sizes.
- Link to v1: https://lore.kernel.org/all/20260428021304.2338592-1-mochs@nvidia.com/

 fs/fuse/readdir.c | 67 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index db5ae8ec1030..27162084a683 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -12,6 +12,7 @@
 #include <linux/posix_acl.h>
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
+#include <linux/vmalloc.h>
 
 static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
 {
@@ -343,17 +344,45 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
 	struct fuse_mount *fm = get_fuse_mount(inode);
 	struct fuse_conn *fc = fm->fc;
 	struct fuse_io_args ia = {};
-	struct fuse_args *args = &ia.ap.args;
+	struct fuse_args_pages *ap = &ia.ap;
+	struct fuse_args *args = &ap->args;
+	struct page **pages;
 	void *buf;
-	size_t bufsize = clamp((unsigned int) ctx->count, PAGE_SIZE, fc->max_pages << PAGE_SHIFT);
+	size_t max_bufsize = min_t(size_t, (size_t)fc->max_pages << PAGE_SHIFT,
+				   fc->max_write);
+	size_t count = ctx->count > 0 ? ctx->count : PAGE_SIZE;
+	size_t bufsize = min_t(size_t, max_t(size_t, count, PAGE_SIZE),
+			       max_bufsize);
+	unsigned int nr_pages = DIV_ROUND_UP(bufsize, PAGE_SIZE);
 	u64 attr_version = 0, evict_ctr = 0;
 	bool locked;
+	unsigned int nr_alloc = 0;
+	unsigned int i;
 
-	buf = kvmalloc(bufsize, GFP_KERNEL);
-	if (!buf)
+	pages = kvcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
 		return -ENOMEM;
 
-	args->out_args[0].value = buf;
+	while (nr_alloc < nr_pages) {
+		unsigned int last = nr_alloc;
+
+		nr_alloc = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
+		if (nr_alloc == last)
+			goto nomem;
+	}
+
+	ap->folios = fuse_folios_alloc(nr_pages, GFP_KERNEL, &ap->descs);
+	if (!ap->folios)
+		goto nomem;
+
+	for (i = 0; i < nr_pages; i++) {
+		ap->folios[i] = page_folio(pages[i]);
+		ap->descs[i].length = min_t(size_t,
+					    bufsize - (size_t)i * PAGE_SIZE,
+					    PAGE_SIZE);
+	}
+	ap->num_folios = nr_pages;
+	args->out_pages = true;
 
 	plus = fuse_use_readdirplus(inode, ctx);
 	if (plus) {
@@ -372,17 +401,35 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
 
 			if (ff->open_flags & FOPEN_CACHE_DIR)
 				fuse_readdir_cache_end(file, ctx->pos);
-		} else if (plus) {
-			res = parse_dirplusfile(buf, res, file, ctx, attr_version,
-						evict_ctr);
 		} else {
-			res = parse_dirfile(buf, res, file, ctx);
+			buf = vm_map_ram(pages, nr_pages, -1);
+			if (!buf) {
+				res = -ENOMEM;
+			} else {
+				if (plus)
+					res = parse_dirplusfile(buf, res, file, ctx,
+								attr_version,
+								evict_ctr);
+				else
+					res = parse_dirfile(buf, res, file, ctx);
+
+				vm_unmap_ram(buf, nr_pages);
+			}
 		}
 	}
 
-	kvfree(buf);
 	fuse_invalidate_atime(inode);
+
+out:
+	kfree(ap->folios);
+	for (i = 0; i < nr_alloc; i++)
+		__free_page(pages[i]);
+	kvfree(pages);
 	return res;
+
+nomem:
+	res = -ENOMEM;
+	goto out;
 }
 
 enum fuse_parse_result {
-- 
2.50.1



* Re: [PATCH v2] fuse: back uncached readdir buffers with pages
  2026-04-28 23:29 [PATCH v2] fuse: back uncached readdir buffers with pages Matthew R. Ochs
@ 2026-04-29  7:27 ` Miklos Szeredi
  2026-04-30 19:24   ` Matt Ochs
  2026-04-29  9:29 ` Bernd Schubert
  1 sibling, 1 reply; 6+ messages in thread
From: Miklos Szeredi @ 2026-04-29  7:27 UTC (permalink / raw)
  To: Matthew R. Ochs; +Cc: Bernd Schubert, linux-fsdevel, linux-kernel

On Wed, 29 Apr 2026 at 01:30, Matthew R. Ochs <mochs@nvidia.com> wrote:

> The larger buffer is also currently supplied as a kvec output argument.
> For virtiofs, kvec arguments are copied through req->argbuf, which is
> allocated with kmalloc(..., GFP_ATOMIC). A large readdir buffer can
> therefore require a multi-megabyte contiguous atomic allocation and fail
> with -ENOMEM.

Shouldn't this be max_read?  Here "read" and "write" refer to
direction of I/O on the filesystem, not on the fuse device (see
fuse/file.c)

> @@ -343,17 +344,45 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>         struct fuse_mount *fm = get_fuse_mount(inode);
>         struct fuse_conn *fc = fm->fc;
>         struct fuse_io_args ia = {};
> -       struct fuse_args *args = &ia.ap.args;
> +       struct fuse_args_pages *ap = &ia.ap;
> +       struct fuse_args *args = &ap->args;
> +       struct page **pages;
>         void *buf;
> -       size_t bufsize = clamp((unsigned int) ctx->count, PAGE_SIZE, fc->max_pages << PAGE_SHIFT);
> +       size_t max_bufsize = min_t(size_t, (size_t)fc->max_pages << PAGE_SHIFT,

No need to cast if using the min_t variant.

> +                                  fc->max_write);
> +       size_t count = ctx->count > 0 ? ctx->count : PAGE_SIZE;

This is open coding the max_t(size_t, count, PAGE_SIZE) in the next
line.  Just delete.

> +       size_t bufsize = min_t(size_t, max_t(size_t, count, PAGE_SIZE),
> +                              max_bufsize);

What's wrong with the clamp() construct used originally?

> +       unsigned int nr_pages = DIV_ROUND_UP(bufsize, PAGE_SIZE);
>         u64 attr_version = 0, evict_ctr = 0;
>         bool locked;
> +       unsigned int nr_alloc = 0;
> +       unsigned int i;
>
> -       buf = kvmalloc(bufsize, GFP_KERNEL);
> -       if (!buf)
> +       pages = kvcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);

 struct page **pages __free(kvfree) = kvcalloc(nr_pages,
sizeof(*pages), GFP_KERNEL);


> +       if (!pages)
>                 return -ENOMEM;
>
> -       args->out_args[0].value = buf;
> +       while (nr_alloc < nr_pages) {
> +               unsigned int last = nr_alloc;
> +
> +               nr_alloc = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
> +               if (nr_alloc == last)
> +                       goto nomem;
> +       }

I'd try this without the loop for less complexity.  Falling back to a
shorter read shouldn't be a problem, as long as this doesn't happen
very often.

> +
> +       ap->folios = fuse_folios_alloc(nr_pages, GFP_KERNEL, &ap->descs);
> +       if (!ap->folios)
> +               goto nomem;
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               ap->folios[i] = page_folio(pages[i]);
> +               ap->descs[i].length = min_t(size_t,
> +                                           bufsize - (size_t)i * PAGE_SIZE,
> +                                           PAGE_SIZE);
> +       }
> +       ap->num_folios = nr_pages;
> +       args->out_pages = true;
>
>         plus = fuse_use_readdirplus(inode, ctx);
>         if (plus) {
> @@ -372,17 +401,35 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>
>                         if (ff->open_flags & FOPEN_CACHE_DIR)
>                                 fuse_readdir_cache_end(file, ctx->pos);
> -               } else if (plus) {
> -                       res = parse_dirplusfile(buf, res, file, ctx, attr_version,
> -                                               evict_ctr);
>                 } else {
> -                       res = parse_dirfile(buf, res, file, ctx);
> +                       buf = vm_map_ram(pages, nr_pages, -1);
> +                       if (!buf) {
> +                               res = -ENOMEM;
> +                       } else {
> +                               if (plus)
> +                                       res = parse_dirplusfile(buf, res, file, ctx,
> +                                                               attr_version,
> +                                                               evict_ctr);
> +                               else
> +                                       res = parse_dirfile(buf, res, file, ctx);
> +
> +                               vm_unmap_ram(buf, nr_pages);
> +                       }
>                 }
>         }
>
> -       kvfree(buf);
>         fuse_invalidate_atime(inode);
> +
> +out:
> +       kfree(ap->folios);
> +       for (i = 0; i < nr_alloc; i++)
> +               __free_page(pages[i]);

release_pages()

> +       kvfree(pages);
>         return res;
> +
> +nomem:
> +       res = -ENOMEM;
> +       goto out;

Usual pattern is to just do res = -ENOMEM before each goto out (or
just the first if nothing else modifies res), so no double jump unless
absolutely necessary.

Thanks,
Miklos


* Re: [PATCH v2] fuse: back uncached readdir buffers with pages
  2026-04-28 23:29 [PATCH v2] fuse: back uncached readdir buffers with pages Matthew R. Ochs
  2026-04-29  7:27 ` Miklos Szeredi
@ 2026-04-29  9:29 ` Bernd Schubert
  2026-04-29 10:38   ` Miklos Szeredi
  1 sibling, 1 reply; 6+ messages in thread
From: Bernd Schubert @ 2026-04-29  9:29 UTC (permalink / raw)
  To: Matthew R. Ochs, Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, Joanne Koong

[-- Attachment #1: Type: text/plain, Size: 6896 bytes --]



On 4/29/26 01:29, Matthew R. Ochs wrote:
> [You don't often get email from mochs@nvidia.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> Commit dabb90391028 ("fuse: increase readdir buffer size") changed
> fuse_readdir_uncached() to size its temporary buffer from ctx->count.
> That is useful for overlayfs and other in-kernel callers that use
> INT_MAX to indicate an unlimited directory read.
> 
> The buffer is capped by fc->max_pages converted to bytes with PAGE_SIZE.
> However, fc->max_pages is a page-count limit, while fc->max_write is the
> negotiated byte-sized payload limit. Using only fc->max_pages can produce
> a READDIR request larger than the server is prepared to handle, especially
> when the server and client use different page sizes.
> 
> The larger buffer is also currently supplied as a kvec output argument.
> For virtiofs, kvec arguments are copied through req->argbuf, which is
> allocated with kmalloc(..., GFP_ATOMIC). A large readdir buffer can
> therefore require a multi-megabyte contiguous atomic allocation and fail
> with -ENOMEM.
> 
> This was observed with a 64K-page guest on a 4K-page host, using an
> overlayfs mount whose lower directory is on virtiofs. Reading a merged
> directory through overlayfs failed with:
> 
>   ls: reading directory '<path>': Cannot allocate memory
> 
> Avoid the oversized request and the large bounce-buffer allocation by
> capping the requested byte size by both fc->max_pages and fc->max_write,
> then backing the uncached readdir output with pages and setting out_pages.
> The virtiofs transport can then pass the pages as scatter-gather entries
> instead of copying the output through argbuf.
> 
> Map the pages with vm_map_ram() only while parsing the returned dirents,
> so the existing parser can continue to operate on a linear kernel mapping.
> 
> Fixes: dabb90391028 ("fuse: increase readdir buffer size")
> Cc: stable@vger.kernel.org
> Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>

Josef and Joanne spent quite some time enabling the use of large folios
- maybe we should make use of that? Attached is a totally untested patch
that ignores all of Miklos' comments for now. It also cannot be that
easily backported.

Thanks,
Bernd



[-- Attachment #2: readir-use-large-folios.patch --]
[-- Type: text/x-patch, Size: 8038 bytes --]

fuse: refactor readdir to support large folios

From: Bernd Schubert <bernd@bsbernd.com>

The current implementation in fuse_readdir_uncached() allocates individual
pages and converts them to folios, which prevents the use of large folios
and makes the code less efficient, especially on systems with large base
page sizes.

This patch refactors the allocation to directly allocate folios with
opportunistic large folio support, similar to netfs. The key changes:

1. New helper function fuse_readdir_alloc_folios() that allocates folios
   directly instead of allocating pages and converting them
2. Tries to allocate large folios (up to MAX_PAGECACHE_ORDER) first and
   falls back to order-0 folios on failure
3. Calculates descriptor lengths based on actual folio sizes rather than
   hardcoded PAGE_SIZE
4. Updates vm_map_ram() handling to extract pages from variable-sized
   folios

Benefits include:
- Reduced TLB pressure through use of large folios when possible
- Better memory efficiency and cache locality
- Improved cross-platform support (e.g., 64K page systems)
- No regression: falls back to order-0 folios under memory pressure
- Future-proof architecture aligned with kernel direction

The implementation is similar to the approach used in netfs
(fs/netfs/misc.c) and ensures backward compatibility by gracefully
falling back to single-page folios when large allocations fail.

Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
---
 fs/fuse/readdir.c |  187 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 133 insertions(+), 54 deletions(-)

diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index 6870bfe73da1..ad5551365cdf 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -14,6 +14,13 @@
 #include <linux/highmem.h>
 #include <linux/vmalloc.h>
 
+/* Forward declarations */
+static int parse_dirfile(char *buf, size_t nbytes, struct file *file,
+			 struct dir_context *ctx);
+static int parse_dirplusfile(char *buf, size_t nbytes, struct file *file,
+			     struct dir_context *ctx, u64 attr_version,
+			     u64 evict_ctr);
+
 static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
 {
 	struct fuse_conn *fc = get_fuse_conn(dir);
@@ -332,6 +339,114 @@ static int parse_dirplusfile(char *buf, size_t nbytes, struct file *file,
 	return 0;
 }
 
+/*
+ * Allocate folios for fuse_args_pages, preferring large folios when possible.
+ *
+ * Note: These are temporary folios for buffering I/O data from the server,
+ * not pagecache folios. They are allocated with folio_alloc() and freed
+ * immediately after use. This is similar to how netfs allocates temporary
+ * buffers (see fs/netfs/misc.c).
+ */
+static int fuse_ap_alloc_folios(struct fuse_args_pages *ap, size_t bufsize,
+				unsigned int max_folios)
+{
+	size_t remaining = bufsize;
+	unsigned int num_folios = 0;
+
+	ap->folios = fuse_folios_alloc(max_folios, GFP_KERNEL, &ap->descs);
+	if (!ap->folios)
+		return -ENOMEM;
+
+	while (remaining > 0 && num_folios < max_folios) {
+		struct folio *folio;
+		unsigned int order = 0;
+		size_t folio_bytes;
+
+		/*
+		 * Try to allocate larger folios for better efficiency.
+		 * Calculate order based on remaining size.
+		 */
+		if (remaining > PAGE_SIZE) {
+			order = min_t(unsigned int,
+				      ilog2(remaining) - PAGE_SHIFT,
+				      MAX_PAGECACHE_ORDER);
+		}
+
+		/* Try with desired order, fall back to order-0 on failure */
+		folio = folio_alloc(GFP_KERNEL, order);
+		if (!folio && order > 0)
+			folio = folio_alloc(GFP_KERNEL, 0);
+		if (!folio)
+			goto nomem;
+
+		folio_bytes = folio_size(folio);
+		ap->folios[num_folios] = folio;
+		ap->descs[num_folios].offset = 0;
+		ap->descs[num_folios].length =
+			min_t(size_t, remaining, folio_bytes);
+
+		remaining -= ap->descs[num_folios].length;
+		num_folios++;
+	}
+
+	ap->num_folios = num_folios;
+	return 0;
+
+nomem:
+	/* Free any folios we managed to allocate */
+	while (num_folios > 0)
+		folio_put(ap->folios[--num_folios]);
+	kfree(ap->folios);
+	ap->folios = NULL;
+	return -ENOMEM;
+}
+
+static int fuse_parse_readdir(struct fuse_args_pages *ap, size_t nbytes,
+			      struct file *file, struct dir_context *ctx,
+			      bool plus, u64 attr_version, u64 evict_ctr)
+{
+	struct page **pages = NULL;
+	unsigned int nr_pages = 0;
+	unsigned int i;
+	void *buf;
+	int res;
+
+	/*
+	 * Build page array from folios for vm_map_ram().
+	 * TODO: Once vmap_range_folios() or similar lands in the kernel,
+	 * we can map folios directly without extracting pages.
+	 */
+	for (i = 0; i < ap->num_folios; i++)
+		nr_pages += folio_nr_pages(ap->folios[i]);
+
+	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	nr_pages = 0;
+	for (i = 0; i < ap->num_folios; i++) {
+		unsigned int nr = folio_nr_pages(ap->folios[i]);
+		unsigned int j;
+
+		for (j = 0; j < nr; j++)
+			pages[nr_pages++] = folio_page(ap->folios[i], j);
+	}
+
+	buf = vm_map_ram(pages, nr_pages, -1);
+	kvfree(pages);
+	if (!buf)
+		return -ENOMEM;
+
+	if (plus)
+		res = parse_dirplusfile(buf, nbytes, file, ctx, attr_version,
+					evict_ctr);
+	else
+		res = parse_dirfile(buf, nbytes, file, ctx);
+
+	vm_unmap_ram(buf, nr_pages);
+	return res;
+}
+
 static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
 {
 	int plus;
@@ -342,42 +457,20 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
 	struct fuse_io_args ia = {};
 	struct fuse_args_pages *ap = &ia.ap;
 	struct fuse_args *args = &ap->args;
-	struct page **pages;
-	void *buf;
 	size_t max_bufsize = min_t(size_t, (size_t)fc->max_pages << PAGE_SHIFT,
 				   fc->max_write);
 	size_t count = ctx->count > 0 ? ctx->count : PAGE_SIZE;
 	size_t bufsize = min_t(size_t, max_t(size_t, count, PAGE_SIZE),
 			       max_bufsize);
-	unsigned int nr_pages = DIV_ROUND_UP(bufsize, PAGE_SIZE);
+	unsigned int max_folios = DIV_ROUND_UP(bufsize, PAGE_SIZE);
 	u64 attr_version = 0, evict_ctr = 0;
 	bool locked;
-	unsigned int nr_alloc = 0;
 	unsigned int i;
 
-	pages = kvcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
-	if (!pages)
-		return -ENOMEM;
+	res = fuse_ap_alloc_folios(ap, bufsize, max_folios);
+	if (res)
+		return res;
 
-	while (nr_alloc < nr_pages) {
-		unsigned int last = nr_alloc;
-
-		nr_alloc = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
-		if (nr_alloc == last)
-			goto nomem;
-	}
-
-	ap->folios = fuse_folios_alloc(nr_pages, GFP_KERNEL, &ap->descs);
-	if (!ap->folios)
-		goto nomem;
-
-	for (i = 0; i < nr_pages; i++) {
-		ap->folios[i] = page_folio(pages[i]);
-		ap->descs[i].length = min_t(size_t,
-					    bufsize - (size_t)i * PAGE_SIZE,
-					    PAGE_SIZE);
-	}
-	ap->num_folios = nr_pages;
 	args->out_pages = true;
 
 	plus = fuse_use_readdirplus(inode, ctx);
@@ -391,41 +484,27 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
 	locked = fuse_lock_inode(inode);
 	res = fuse_simple_request(fm, args);
 	fuse_unlock_inode(inode, locked);
-	if (res >= 0) {
-		if (!res) {
-			struct fuse_file *ff = file->private_data;
-
-			if (ff->open_flags & FOPEN_CACHE_DIR)
-				fuse_readdir_cache_end(file, ctx->pos);
-		} else {
-			buf = vm_map_ram(pages, nr_pages, -1);
-			if (!buf) {
-				res = -ENOMEM;
-			} else {
-				if (plus)
-					res = parse_dirplusfile(buf, res, file, ctx,
-								attr_version,
-								evict_ctr);
-				else
-					res = parse_dirfile(buf, res, file, ctx);
-
-				vm_unmap_ram(buf, nr_pages);
-			}
-		}
+	if (res < 0)
+		goto out;
+
+	if (!res) {
+		struct fuse_file *ff = file->private_data;
+
+		if (ff->open_flags & FOPEN_CACHE_DIR)
+			fuse_readdir_cache_end(file, ctx->pos);
+	} else {
+		res = fuse_parse_readdir(ap, res, file, ctx, plus,
+					 attr_version, evict_ctr);
 	}
 
 	fuse_invalidate_atime(inode);
 
 out:
+
+	for (i = 0; i < ap->num_folios; i++)
+		folio_put(ap->folios[i]);
 	kfree(ap->folios);
-	for (i = 0; i < nr_alloc; i++)
-		__free_page(pages[i]);
-	kvfree(pages);
 	return res;
-
-nomem:
-	res = -ENOMEM;
-	goto out;
 }
 
 enum fuse_parse_result {


* Re: [PATCH v2] fuse: back uncached readdir buffers with pages
  2026-04-29  9:29 ` Bernd Schubert
@ 2026-04-29 10:38   ` Miklos Szeredi
  2026-04-29 10:47     ` Bernd Schubert
  0 siblings, 1 reply; 6+ messages in thread
From: Miklos Szeredi @ 2026-04-29 10:38 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Matthew R. Ochs, linux-fsdevel, linux-kernel, Joanne Koong

On Wed, 29 Apr 2026 at 11:29, Bernd Schubert <bschubert@ddn.com> wrote:

> Josef and Joanne spent quite some time enabling the use of large folios
> - maybe we should make use of that? Attached is a totally untested patch
> that ignores all of Miklos' comments for now. It also cannot be that
> easily backported.

I'd be happier if the VM infrastructure for folio arrays was available
first, then used in fuse.  Not the other way round.

Thanks,
Miklos


* Re: [PATCH v2] fuse: back uncached readdir buffers with pages
  2026-04-29 10:38   ` Miklos Szeredi
@ 2026-04-29 10:47     ` Bernd Schubert
  0 siblings, 0 replies; 6+ messages in thread
From: Bernd Schubert @ 2026-04-29 10:47 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert
  Cc: Matthew R. Ochs, linux-fsdevel, linux-kernel, Joanne Koong



On 4/29/26 12:38, Miklos Szeredi wrote:
> On Wed, 29 Apr 2026 at 11:29, Bernd Schubert <bschubert@ddn.com> wrote:
> 
>> Josef and Joanne spent quite some time enabling the use of large folios
>> - maybe we should make use of that? Attached is a totally untested patch
>> that ignores all of Miklos' comments for now. It also cannot be that
>> easily backported.
> 
> I'd be happier if the VM infrastructure for folio arrays was available
> first, then used in fuse.  Not the other way round.

Is it only the missing vm_map part for folios? I had found a patch
https://lists.freedesktop.org/archives/dri-devel/2025-March/497993.html

and added the comment therefore. Maybe we can bring it up with Matthew
or someone else from mm next week.

It would be a bit of a pity if there is generic support for large folios
but fuse internals still use single pages for random reasons.

Thanks,
Bernd


* Re: [PATCH v2] fuse: back uncached readdir buffers with pages
  2026-04-29  7:27 ` Miklos Szeredi
@ 2026-04-30 19:24   ` Matt Ochs
  0 siblings, 0 replies; 6+ messages in thread
From: Matt Ochs @ 2026-04-30 19:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org


> On Apr 29, 2026, at 02:27, Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> On Wed, 29 Apr 2026 at 01:30, Matthew R. Ochs <mochs@nvidia.com> wrote:
> 
>> The larger buffer is also currently supplied as a kvec output argument.
>> For virtiofs, kvec arguments are copied through req->argbuf, which is
>> allocated with kmalloc(..., GFP_ATOMIC). A large readdir buffer can
>> therefore require a multi-megabyte contiguous atomic allocation and fail
>> with -ENOMEM.
> 
> Shouldn't this be max_read?  Here "read" and "write" refer to
> direction of I/O on the filesystem, not on the fuse device (see
> fuse/file.c)

Thanks, the read/write direction point makes sense.

I tested changing the cap to fc->max_read only, but that reproduces the
original virtiofs failure on the 4K-host/64K-guest setup. The runtime
values for the failing READDIR are:

PAGE_SIZE=65536
fc->max_pages=124
fc->max_read=4294967295
fc->max_write=1048576
max_bufsize=8126464
nr_pages=124

So for this virtiofs mount, fc->max_read is effectively unlimited, while
virtiofsd advertises its 1 MiB MAX_BUFFER_SIZE through max_write and
rejects READDIR sizes above that limit.

Do you prefer handling this locally in fuse_readdir_uncached(), for
example by capping the request with all available limits:

min3((size_t)fc->max_pages << PAGE_SHIFT, (size_t)fc->max_read, (size_t)fc->max_write)

Or should virtiofs/FUSE instead make fc->max_read reflect this byte-sized
buffer limit before readdir uses it?

I will address the other cleanup comments in v3: drop the cast, keep the
clamp-style sizing, use release_pages(), and remove the nomem double jump.


-matt


end of thread, other threads:[~2026-04-30 19:24 UTC | newest]

Thread overview: 6+ messages
2026-04-28 23:29 [PATCH v2] fuse: back uncached readdir buffers with pages Matthew R. Ochs
2026-04-29  7:27 ` Miklos Szeredi
2026-04-30 19:24   ` Matt Ochs
2026-04-29  9:29 ` Bernd Schubert
2026-04-29 10:38   ` Miklos Szeredi
2026-04-29 10:47     ` Bernd Schubert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox