* [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO
@ 2024-08-31 9:37 Hou Tao
2024-08-31 9:37 ` [PATCH v4 1/2] virtiofs: use pages instead of pointer " Hou Tao
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Hou Tao @ 2024-08-31 9:37 UTC (permalink / raw)
To: linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
Jingbo Xu, linux-kernel, virtualization, houtao1
From: Hou Tao <houtao1@huawei.com>
Hi,
The patch set aims to fix the warning related to an abnormal size
parameter of kmalloc() in virtiofs. Patch #1 fixes it by introducing
use_pages_for_kvec_io option in fuse_conn and enabling it in virtiofs.
Besides the abnormal size parameter for kmalloc(), the gfp parameter is
also questionable: GFP_ATOMIC is used even when the allocation occurs
in a kworker context. Patch #2 fixes it by using GFP_NOFS when the
allocation is initiated by the kworker. For more details, please check
the individual patches.
As usual, comments are always welcome.
Change Log:
v4:
* patch 1: add the missed {flush|invalidate}_kernel_vmap_range() and
update commit message accordingly
* patch 2: update commit message to explain why GFP_ATOMIC is
reasonable for the first invocation of
virtio_fs_enqueue_req().
v3: https://lore.kernel.org/linux-fsdevel/20240426143903.1305919-1-houtao@huaweicloud.com/
* introduce use_pages_for_kvec_io for virtiofs. When the option is
enabled, fuse will use iov_iter_extract_pages() to construct a page
array and pass the page array instead of a pointer to virtiofs.
The benefit is twofold: the length of the data passed to virtiofs is
limited by max_pages, and there is no memory copy compared with v2.
v2: https://lore.kernel.org/linux-fsdevel/20240228144126.2864064-1-houtao@huaweicloud.com/
* limit the length of ITER_KVEC dio by max_pages instead of the
newly-introduced max_nopage_rw. Using max_pages makes the ITER_KVEC
dio consistent with other rw operations.
* replace the kmalloc-allocated bounce buffer with a bounce buffer
backed by scattered pages when the length of the bounce buffer for
KVEC_ITER dio is larger than PAGE_SIZE, so the KVEC_ITER dio can be
handled normally by virtiofs even on hosts with fragmented memory.
(Bernd Schubert)
* merge the GFP_NOFS patch [1] into this patch-set and use
memalloc_nofs_{save|restore}+GFP_KERNEL instead of GFP_NOFS
(Benjamin Coddington)
v1: https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-houtao@huaweicloud.com/
[1]: https://lore.kernel.org/linux-fsdevel/20240105105305.4052672-1-houtao@huaweicloud.com/
Hou Tao (2):
virtiofs: use pages instead of pointer for kernel direct IO
virtiofs: use GFP_NOFS when enqueuing request through kworker
fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++--------------
fs/fuse/fuse_i.h | 6 +++++
fs/fuse/virtio_fs.c | 25 +++++++++++-------
3 files changed, 65 insertions(+), 28 deletions(-)
--
2.29.2
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v4 1/2] virtiofs: use pages instead of pointer for kernel direct IO
2024-08-31 9:37 [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO Hou Tao
@ 2024-08-31 9:37 ` Hou Tao
2024-09-03 8:44 ` Jingbo Xu
2024-08-31 9:37 ` [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker Hou Tao
2024-10-08 13:39 ` [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO Miklos Szeredi
2 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2024-08-31 9:37 UTC (permalink / raw)
To: linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
Jingbo Xu, linux-kernel, virtualization, houtao1
From: Hou Tao <houtao1@huawei.com>
When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
Modules linked in:
CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
RIP: 0010:__alloc_pages+0x2bf/0x380
......
Call Trace:
<TASK>
? __warn+0x8e/0x150
? __alloc_pages+0x2bf/0x380
__kmalloc_large_node+0x86/0x160
__kmalloc+0x33c/0x480
virtio_fs_enqueue_req+0x240/0x6d0
virtio_fs_wake_pending_and_unlock+0x7f/0x190
queue_request_and_unlock+0x55/0x60
fuse_simple_request+0x152/0x2b0
fuse_direct_io+0x5d2/0x8c0
fuse_file_read_iter+0x121/0x160
__kernel_read+0x151/0x2d0
kernel_read+0x45/0x50
kernel_read_file+0x1a9/0x2a0
init_module_from_file+0x6a/0xe0
idempotent_init_module+0x175/0x230
__x64_sys_finit_module+0x5d/0xb0
x64_sys_call+0x1c3/0x9e0
do_syscall_64+0x3d/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
......
</TASK>
---[ end trace 0000000000000000 ]---
The warning is triggered as follows:
1) syscall finit_module() handles the module insertion and it invokes
kernel_read_file() to read the content of the module first.
2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
passes it to kernel_read(). kernel_read() constructs a kvec iter by
using iov_iter_kvec() and passes it to fuse_file_read_iter().
3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
fuse_direct_io(). As of now, the maximal read size for a kvec iter is
only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
the size of the 10MB-sized buffer in out_args[0] of a fuse request and
passes the fuse request to virtio_fs_wake_pending_and_unlock().
4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
queue the request. Because virtiofs needs DMA-able addresses,
virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
all fuse args, copies these args into the bounce buffer and passes the
physical address of the bounce buffer to virtiofsd. The total length of
these fuse args for the passed fuse request is about 10MB, so
copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter and
it triggers the warning in __alloc_pages():
if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
return NULL;
5) virtio_fs_enqueue_req() will retry the memory allocation in a
kworker, but it won't help, because kmalloc() will always return NULL
due to the abnormal size and finit_module() will hang forever.
A feasible solution is to limit the value of max_read for virtio-fs, so
the length passed to kmalloc() will be limited. However, it would also
affect the maximal read size for normal reads. Writes initiated from the
kernel have a similar problem, but there is currently no way to limit
fc->max_write in the kernel.
So instead of limiting both the values of max_read and max_write in the
kernel, introduce use_pages_for_kvec_io in fuse_conn and set it to true
in virtiofs. When use_pages_for_kvec_io is enabled, fuse will use pages
instead of a pointer to pass the KVEC_IO data.
After switching to pages for KVEC_IO data, these pages will be used for
DMA through virtio-fs. If these pages are backed by vmalloc(),
{flush|invalidate}_kernel_vmap_range() are necessary to flush or
invalidate the cache before the DMA operation. So add two new fields in
fuse_args_pages to record the base address of vmalloc area and the
condition indicating whether invalidation is needed. Perform the flush
in fuse_get_user_pages() for write operations and the invalidation in
fuse_release_user_pages() for read operations.
It may seem necessary to introduce another field in fuse_conn to
indicate that these KVEC_IO pages are used for DMA. However, considering
that virtio-fs is currently the only user of use_pages_for_kvec_io, just
reuse use_pages_for_kvec_io to indicate that these pages will be used
for DMA.
Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++--------------
fs/fuse/fuse_i.h | 6 +++++
fs/fuse/virtio_fs.c | 1 +
3 files changed, 50 insertions(+), 19 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f39456c65ed7..331208d3e4d1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -645,7 +645,7 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
args->out_args[0].size = count;
}
-static void fuse_release_user_pages(struct fuse_args_pages *ap,
+static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
bool should_dirty)
{
unsigned int i;
@@ -656,6 +656,9 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
if (ap->args.is_pinned)
unpin_user_page(ap->pages[i]);
}
+
+ if (nres > 0 && ap->args.invalidate_vmap)
+ invalidate_kernel_vmap_range(ap->args.vmap_base, nres);
}
static void fuse_io_release(struct kref *kref)
@@ -754,25 +757,29 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
struct fuse_io_args *ia = container_of(args, typeof(*ia), ap.args);
struct fuse_io_priv *io = ia->io;
ssize_t pos = -1;
-
- fuse_release_user_pages(&ia->ap, io->should_dirty);
+ size_t nres;
if (err) {
/* Nothing */
} else if (io->write) {
if (ia->write.out.size > ia->write.in.size) {
err = -EIO;
- } else if (ia->write.in.size != ia->write.out.size) {
- pos = ia->write.in.offset - io->offset +
- ia->write.out.size;
+ } else {
+ nres = ia->write.out.size;
+ if (ia->write.in.size != ia->write.out.size)
+ pos = ia->write.in.offset - io->offset +
+ ia->write.out.size;
}
} else {
u32 outsize = args->out_args[0].size;
+ nres = outsize;
if (ia->read.in.size != outsize)
pos = ia->read.in.offset - io->offset + outsize;
}
+ fuse_release_user_pages(&ia->ap, err ?: nres, io->should_dirty);
+
fuse_aio_complete(io, err, pos);
fuse_io_free(ia);
}
@@ -1467,24 +1474,37 @@ static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
size_t *nbytesp, int write,
- unsigned int max_pages)
+ unsigned int max_pages,
+ bool use_pages_for_kvec_io)
{
+ bool flush_or_invalidate = false;
size_t nbytes = 0; /* # bytes already packed in req */
ssize_t ret = 0;
- /* Special case for kernel I/O: can copy directly into the buffer */
+ /* Special case for kernel I/O: can copy directly into the buffer.
+ * However if the implementation of fuse_conn requires pages instead of
+ * pointer (e.g., virtio-fs), use iov_iter_extract_pages() instead.
+ */
if (iov_iter_is_kvec(ii)) {
- unsigned long user_addr = fuse_get_user_addr(ii);
- size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
+ void *user_addr = (void *)fuse_get_user_addr(ii);
- if (write)
- ap->args.in_args[1].value = (void *) user_addr;
- else
- ap->args.out_args[0].value = (void *) user_addr;
+ if (!use_pages_for_kvec_io) {
+ size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
- iov_iter_advance(ii, frag_size);
- *nbytesp = frag_size;
- return 0;
+ if (write)
+ ap->args.in_args[1].value = user_addr;
+ else
+ ap->args.out_args[0].value = user_addr;
+
+ iov_iter_advance(ii, frag_size);
+ *nbytesp = frag_size;
+ return 0;
+ }
+
+ if (is_vmalloc_addr(user_addr)) {
+ ap->args.vmap_base = user_addr;
+ flush_or_invalidate = true;
+ }
}
while (nbytes < *nbytesp && ap->num_pages < max_pages) {
@@ -1513,6 +1533,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
(PAGE_SIZE - ret) & (PAGE_SIZE - 1);
}
+ if (write && flush_or_invalidate)
+ flush_kernel_vmap_range(ap->args.vmap_base, nbytes);
+
+ ap->args.invalidate_vmap = !write && flush_or_invalidate;
ap->args.is_pinned = iov_iter_extract_will_pin(ii);
ap->args.user_pages = true;
if (write)
@@ -1581,7 +1605,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
size_t nbytes = min(count, nmax);
err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
- max_pages);
+ max_pages, fc->use_pages_for_kvec_io);
if (err && !nbytes)
break;
@@ -1595,7 +1619,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
}
if (!io->async || nres < 0) {
- fuse_release_user_pages(&ia->ap, io->should_dirty);
+ fuse_release_user_pages(&ia->ap, nres, io->should_dirty);
fuse_io_free(ia);
}
ia = NULL;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f23919610313..79add14c363f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -309,9 +309,12 @@ struct fuse_args {
bool may_block:1;
bool is_ext:1;
bool is_pinned:1;
+ bool invalidate_vmap:1;
struct fuse_in_arg in_args[3];
struct fuse_arg out_args[2];
void (*end)(struct fuse_mount *fm, struct fuse_args *args, int error);
+ /* Used for kvec iter backed by vmalloc address */
+ void *vmap_base;
};
struct fuse_args_pages {
@@ -860,6 +863,9 @@ struct fuse_conn {
/** Passthrough support for read/write IO */
unsigned int passthrough:1;
+ /* Use pages instead of pointer for kernel I/O */
+ unsigned int use_pages_for_kvec_io:1;
+
/** Maximum stack depth for passthrough backing files */
int max_stack_depth;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index dd5260141615..43d66ab5e891 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1568,6 +1568,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
fc->delete_stale = true;
fc->auto_submounts = true;
fc->sync_fs = true;
+ fc->use_pages_for_kvec_io = true;
/* Tell FUSE to split requests that exceed the virtqueue's size */
fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
--
2.29.2
* [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker
2024-08-31 9:37 [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO Hou Tao
2024-08-31 9:37 ` [PATCH v4 1/2] virtiofs: use pages instead of pointer " Hou Tao
@ 2024-08-31 9:37 ` Hou Tao
2024-09-03 9:34 ` Jingbo Xu
2024-10-08 13:39 ` [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO Miklos Szeredi
2 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2024-08-31 9:37 UTC (permalink / raw)
To: linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
Jingbo Xu, linux-kernel, virtualization, houtao1
From: Hou Tao <houtao1@huawei.com>
When invoking virtio_fs_enqueue_req() through a kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering that the size of the sg array may be greater than PAGE_SIZE,
use GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
allocation failure and to avoid unnecessarily depleting the atomic
reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly;
instead, GFP_KERNEL and the memalloc_nofs_{save|restore} helpers are
used.
It may seem OK to pass GFP_NOFS to virtio_fs_enqueue_req() as well when
queuing the request for the first time, but this is not the case. The
reason is that fuse_request_queue_background() may call
->queue_request_and_unlock() while holding fc->bg_lock, which is a
spin-lock. Therefore, still use GFP_ATOMIC for it.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
fs/fuse/virtio_fs.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 43d66ab5e891..9bc48b3ca384 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -95,7 +95,8 @@ struct virtio_fs_req_work {
};
static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
- struct fuse_req *req, bool in_flight);
+ struct fuse_req *req, bool in_flight,
+ gfp_t gfp);
static const struct constant_table dax_param_enums[] = {
{"always", FUSE_DAX_ALWAYS },
@@ -439,6 +440,8 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
/* Dispatch pending requests */
while (1) {
+ unsigned int flags;
+
spin_lock(&fsvq->lock);
req = list_first_entry_or_null(&fsvq->queued_reqs,
struct fuse_req, list);
@@ -449,7 +452,9 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
list_del_init(&req->list);
spin_unlock(&fsvq->lock);
- ret = virtio_fs_enqueue_req(fsvq, req, true);
+ flags = memalloc_nofs_save();
+ ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);
+ memalloc_nofs_restore(flags);
if (ret < 0) {
if (ret == -ENOSPC) {
spin_lock(&fsvq->lock);
@@ -550,7 +555,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
}
/* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
{
struct fuse_args *args = req->args;
unsigned int offset = 0;
@@ -564,7 +569,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
fuse_len_args(num_out, args->out_args);
- req->argbuf = kmalloc(len, GFP_ATOMIC);
+ req->argbuf = kmalloc(len, gfp);
if (!req->argbuf)
return -ENOMEM;
@@ -1239,7 +1244,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
/* Add a request to a virtqueue and kick the device */
static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
- struct fuse_req *req, bool in_flight)
+ struct fuse_req *req, bool in_flight,
+ gfp_t gfp)
{
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1260,8 +1266,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
/* Does the sglist fit on the stack? */
total_sgs = sg_count_fuse_req(req);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
- sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
- sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+ sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+ sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1269,7 +1275,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
}
/* Use a bounce buffer since stack args cannot be mapped */
- ret = copy_args_to_argbuf(req);
+ ret = copy_args_to_argbuf(req, gfp);
if (ret < 0)
goto out;
@@ -1367,7 +1373,7 @@ __releases(fiq->lock)
queue_id);
fsvq = &fs->vqs[queue_id];
- ret = virtio_fs_enqueue_req(fsvq, req, false);
+ ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
if (ret < 0) {
if (ret == -ENOSPC) {
/*
--
2.29.2
* Re: [PATCH v4 1/2] virtiofs: use pages instead of pointer for kernel direct IO
2024-08-31 9:37 ` [PATCH v4 1/2] virtiofs: use pages instead of pointer " Hou Tao
@ 2024-09-03 8:44 ` Jingbo Xu
2024-09-04 3:50 ` Hou Tao
0 siblings, 1 reply; 9+ messages in thread
From: Jingbo Xu @ 2024-09-03 8:44 UTC (permalink / raw)
To: Hou Tao, linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
linux-kernel, virtualization, houtao1
On 8/31/24 5:37 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When trying to insert a 10MB kernel module kept in a virtio-fs with cache
> disabled, the following warning was reported:
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
> Modules linked in:
> CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
> RIP: 0010:__alloc_pages+0x2bf/0x380
> ......
> Call Trace:
> <TASK>
> ? __warn+0x8e/0x150
> ? __alloc_pages+0x2bf/0x380
> __kmalloc_large_node+0x86/0x160
> __kmalloc+0x33c/0x480
> virtio_fs_enqueue_req+0x240/0x6d0
> virtio_fs_wake_pending_and_unlock+0x7f/0x190
> queue_request_and_unlock+0x55/0x60
> fuse_simple_request+0x152/0x2b0
> fuse_direct_io+0x5d2/0x8c0
> fuse_file_read_iter+0x121/0x160
> __kernel_read+0x151/0x2d0
> kernel_read+0x45/0x50
> kernel_read_file+0x1a9/0x2a0
> init_module_from_file+0x6a/0xe0
> idempotent_init_module+0x175/0x230
> __x64_sys_finit_module+0x5d/0xb0
> x64_sys_call+0x1c3/0x9e0
> do_syscall_64+0x3d/0xc0
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> ......
> </TASK>
> ---[ end trace 0000000000000000 ]---
>
> The warning is triggered as follows:
>
> 1) syscall finit_module() handles the module insertion and it invokes
> kernel_read_file() to read the content of the module first.
>
> 2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
> passes it to kernel_read(). kernel_read() constructs a kvec iter by
> using iov_iter_kvec() and passes it to fuse_file_read_iter().
>
> 3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
> fuse_direct_io(). As for now, the maximal read size for kvec iter is
> only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
> fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
> the size of the 10MB-sized buffer in out_args[0] of a fuse request and
> passes the fuse request to virtio_fs_wake_pending_and_unlock().
>
> 4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
> queue the request. Because virtiofs needs DMA-able addresses,
> virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
> all fuse args, copies these args into the bounce buffer and passes the
> physical address of the bounce buffer to virtiofsd. The total length of
> these fuse args for the passed fuse request is about 10MB, so
> copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter and
> it triggers the warning in __alloc_pages():
>
> if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> return NULL;
>
> 5) virtio_fs_enqueue_req() will retry the memory allocation in a
> kworker, but it won't help, because kmalloc() will always return NULL
> due to the abnormal size and finit_module() will hang forever.
>
> A feasible solution is to limit the value of max_read for virtio-fs, so
> the length passed to kmalloc() will be limited. However, it would also
> affect the maximal read size for normal reads. Writes initiated from the
> kernel have a similar problem, but there is currently no way to limit
> fc->max_write in the kernel.
>
> So instead of limiting both the values of max_read and max_write in the
> kernel, introduce use_pages_for_kvec_io in fuse_conn and set it to true
> in virtiofs. When use_pages_for_kvec_io is enabled, fuse will use pages
> instead of a pointer to pass the KVEC_IO data.
>
> After switching to pages for KVEC_IO data, these pages will be used for
> DMA through virtio-fs. If these pages are backed by vmalloc(),
> {flush|invalidate}_kernel_vmap_range() are necessary to flush or
> invalidate the cache before the DMA operation. So add two new fields in
> fuse_args_pages to record the base address of vmalloc area and the
> condition indicating whether invalidation is needed. Perform the flush
> in fuse_get_user_pages() for write operations and the invalidation in
> fuse_release_user_pages() for read operations.
>
> It may seem necessary to introduce another field in fuse_conn to
> indicate that these KVEC_IO pages are used for DMA. However, considering
> that virtio-fs is currently the only user of use_pages_for_kvec_io, just
> reuse use_pages_for_kvec_io to indicate that these pages will be used
> for DMA.
>
> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
> Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> ---
> fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++--------------
> fs/fuse/fuse_i.h | 6 +++++
> fs/fuse/virtio_fs.c | 1 +
> 3 files changed, 50 insertions(+), 19 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index f39456c65ed7..331208d3e4d1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -645,7 +645,7 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
> args->out_args[0].size = count;
> }
>
> -static void fuse_release_user_pages(struct fuse_args_pages *ap,
> +static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
> bool should_dirty)
> {
> unsigned int i;
> @@ -656,6 +656,9 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
> if (ap->args.is_pinned)
> unpin_user_page(ap->pages[i]);
> }
> +
> + if (nres > 0 && ap->args.invalidate_vmap)
> + invalidate_kernel_vmap_range(ap->args.vmap_base, nres);
> }
>
> static void fuse_io_release(struct kref *kref)
> @@ -754,25 +757,29 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
> struct fuse_io_args *ia = container_of(args, typeof(*ia), ap.args);
> struct fuse_io_priv *io = ia->io;
> ssize_t pos = -1;
> -
> - fuse_release_user_pages(&ia->ap, io->should_dirty);
> + size_t nres;
>
> if (err) {
> /* Nothing */
> } else if (io->write) {
> if (ia->write.out.size > ia->write.in.size) {
> err = -EIO;
> - } else if (ia->write.in.size != ia->write.out.size) {
> - pos = ia->write.in.offset - io->offset +
> - ia->write.out.size;
> + } else {
> + nres = ia->write.out.size;
> + if (ia->write.in.size != ia->write.out.size)
> + pos = ia->write.in.offset - io->offset +
> + ia->write.out.size;
> }
> } else {
> u32 outsize = args->out_args[0].size;
>
> + nres = outsize;
> if (ia->read.in.size != outsize)
> pos = ia->read.in.offset - io->offset + outsize;
> }
>
> + fuse_release_user_pages(&ia->ap, err ?: nres, io->should_dirty);
> +
> fuse_aio_complete(io, err, pos);
> fuse_io_free(ia);
> }
> @@ -1467,24 +1474,37 @@ static inline size_t fuse_get_frag_size(const struct iov_iter *ii,
>
> static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> size_t *nbytesp, int write,
> - unsigned int max_pages)
> + unsigned int max_pages,
> + bool use_pages_for_kvec_io)
> {
> + bool flush_or_invalidate = false;
> size_t nbytes = 0; /* # bytes already packed in req */
> ssize_t ret = 0;
>
> - /* Special case for kernel I/O: can copy directly into the buffer */
> + /* Special case for kernel I/O: can copy directly into the buffer.
> + * However if the implementation of fuse_conn requires pages instead of
> + * pointer (e.g., virtio-fs), use iov_iter_extract_pages() instead.
> + */
> if (iov_iter_is_kvec(ii)) {
> - unsigned long user_addr = fuse_get_user_addr(ii);
> - size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
> + void *user_addr = (void *)fuse_get_user_addr(ii);
>
> - if (write)
> - ap->args.in_args[1].value = (void *) user_addr;
> - else
> - ap->args.out_args[0].value = (void *) user_addr;
> + if (!use_pages_for_kvec_io) {
> + size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
>
> - iov_iter_advance(ii, frag_size);
> - *nbytesp = frag_size;
> - return 0;
> + if (write)
> + ap->args.in_args[1].value = user_addr;
> + else
> + ap->args.out_args[0].value = user_addr;
> +
> + iov_iter_advance(ii, frag_size);
> + *nbytesp = frag_size;
> + return 0;
> + }
> +
> + if (is_vmalloc_addr(user_addr)) {
> + ap->args.vmap_base = user_addr;
> + flush_or_invalidate = true;
Could we move flush_kernel_vmap_range() up here, so that
flush_or_invalidate is not needed anymore and the code looks cleaner?
> + }
> }
>
> while (nbytes < *nbytesp && ap->num_pages < max_pages) {
> @@ -1513,6 +1533,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
> }
>
> + if (write && flush_or_invalidate)
> + flush_kernel_vmap_range(ap->args.vmap_base, nbytes);
> +
> + ap->args.invalidate_vmap = !write && flush_or_invalidate;
How about initializing vmap_base only when the data buffer is vmalloced
and it's a read request? In this case invalidate_vmap is no longer needed.
> ap->args.is_pinned = iov_iter_extract_will_pin(ii);
> ap->args.user_pages = true;
> if (write)
> @@ -1581,7 +1605,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
> size_t nbytes = min(count, nmax);
>
> err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
> - max_pages);
> + max_pages, fc->use_pages_for_kvec_io);
> if (err && !nbytes)
> break;
>
> @@ -1595,7 +1619,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
> }
>
> if (!io->async || nres < 0) {
> - fuse_release_user_pages(&ia->ap, io->should_dirty);
> + fuse_release_user_pages(&ia->ap, nres, io->should_dirty);
> fuse_io_free(ia);
> }
> ia = NULL;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index f23919610313..79add14c363f 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -309,9 +309,12 @@ struct fuse_args {
> bool may_block:1;
> bool is_ext:1;
> bool is_pinned:1;
> + bool invalidate_vmap:1;
> struct fuse_in_arg in_args[3];
> struct fuse_arg out_args[2];
> void (*end)(struct fuse_mount *fm, struct fuse_args *args, int error);
> + /* Used for kvec iter backed by vmalloc address */
> + void *vmap_base;
> };
>
> struct fuse_args_pages {
> @@ -860,6 +863,9 @@ struct fuse_conn {
> /** Passthrough support for read/write IO */
> unsigned int passthrough:1;
>
> + /* Use pages instead of pointer for kernel I/O */
> + unsigned int use_pages_for_kvec_io:1;
Maybe we need a better (actually shorter) name for this flag. kvec_pages?
> +
> /** Maximum stack depth for passthrough backing files */
> int max_stack_depth;
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index dd5260141615..43d66ab5e891 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1568,6 +1568,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
> fc->delete_stale = true;
> fc->auto_submounts = true;
> fc->sync_fs = true;
> + fc->use_pages_for_kvec_io = true;
>
> /* Tell FUSE to split requests that exceed the virtqueue's size */
> fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
--
Thanks,
Jingbo
* Re: [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker
2024-08-31 9:37 ` [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker Hou Tao
@ 2024-09-03 9:34 ` Jingbo Xu
2024-09-04 3:53 ` Hou Tao
0 siblings, 1 reply; 9+ messages in thread
From: Jingbo Xu @ 2024-09-03 9:34 UTC (permalink / raw)
To: Hou Tao, linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
linux-kernel, virtualization, houtao1
On 8/31/24 5:37 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> When invoking virtio_fs_enqueue_req() through a kworker, both the
> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> Considering that the size of the sg array may be greater than PAGE_SIZE,
> use GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
> allocation failure and to avoid unnecessarily depleting the atomic
> reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly;
> instead, GFP_KERNEL and the memalloc_nofs_{save|restore} helpers are used.
>
> It may seem OK to pass GFP_NOFS to virtio_fs_enqueue_req() as well when
> queuing the request for the first time, but this is not the case. The
> reason is that fuse_request_queue_background() may call
> ->queue_request_and_unlock() while holding fc->bg_lock, which is a
> spin-lock. Therefore, still use GFP_ATOMIC for it.
Actually, .wake_pending_and_unlock() is called under fiq->lock, so
GFP_ATOMIC is required there.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> fs/fuse/virtio_fs.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 43d66ab5e891..9bc48b3ca384 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -95,7 +95,8 @@ struct virtio_fs_req_work {
> };
>
> static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> - struct fuse_req *req, bool in_flight);
> + struct fuse_req *req, bool in_flight,
> + gfp_t gfp);
>
> static const struct constant_table dax_param_enums[] = {
> {"always", FUSE_DAX_ALWAYS },
> @@ -439,6 +440,8 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>
> /* Dispatch pending requests */
> while (1) {
> + unsigned int flags;
> +
> spin_lock(&fsvq->lock);
> req = list_first_entry_or_null(&fsvq->queued_reqs,
> struct fuse_req, list);
> @@ -449,7 +452,9 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
> list_del_init(&req->list);
> spin_unlock(&fsvq->lock);
>
> - ret = virtio_fs_enqueue_req(fsvq, req, true);
> + flags = memalloc_nofs_save();
> + ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);
> + memalloc_nofs_restore(flags);
> if (ret < 0) {
> if (ret == -ENOSPC) {
> spin_lock(&fsvq->lock);
> @@ -550,7 +555,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
> }
>
> /* Allocate and copy args into req->argbuf */
> -static int copy_args_to_argbuf(struct fuse_req *req)
> +static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
> {
> struct fuse_args *args = req->args;
> unsigned int offset = 0;
> @@ -564,7 +569,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
> len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
> fuse_len_args(num_out, args->out_args);
>
> - req->argbuf = kmalloc(len, GFP_ATOMIC);
> + req->argbuf = kmalloc(len, gfp);
> if (!req->argbuf)
> return -ENOMEM;
>
> @@ -1239,7 +1244,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
>
> /* Add a request to a virtqueue and kick the device */
> static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> - struct fuse_req *req, bool in_flight)
> + struct fuse_req *req, bool in_flight,
> + gfp_t gfp)
> {
> /* requests need at least 4 elements */
> struct scatterlist *stack_sgs[6];
> @@ -1260,8 +1266,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> /* Does the sglist fit on the stack? */
> total_sgs = sg_count_fuse_req(req);
> if (total_sgs > ARRAY_SIZE(stack_sgs)) {
> - sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
> - sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
> + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
> + sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
> if (!sgs || !sg) {
> ret = -ENOMEM;
> goto out;
> @@ -1269,7 +1275,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> }
>
> /* Use a bounce buffer since stack args cannot be mapped */
> - ret = copy_args_to_argbuf(req);
> + ret = copy_args_to_argbuf(req, gfp);
> if (ret < 0)
> goto out;
>
> @@ -1367,7 +1373,7 @@ __releases(fiq->lock)
> queue_id);
>
> fsvq = &fs->vqs[queue_id];
> - ret = virtio_fs_enqueue_req(fsvq, req, false);
> + ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
> if (ret < 0) {
> if (ret == -ENOSPC) {
> /*
LGTM.
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
--
Thanks,
Jingbo
* Re: [PATCH v4 1/2] virtiofs: use pages instead of pointer for kernel direct IO
2024-09-03 8:44 ` Jingbo Xu
@ 2024-09-04 3:50 ` Hou Tao
0 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2024-09-04 3:50 UTC (permalink / raw)
To: Jingbo Xu, linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
linux-kernel, virtualization, houtao1
Hi,
On 9/3/2024 4:44 PM, Jingbo Xu wrote:
>
> On 8/31/24 5:37 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When trying to insert a 10MB kernel module kept in a virtio-fs with cache
>> disabled, the following warning was reported:
>>
SNIP
>>
>> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> Tested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Thanks for the test.
>
>
>> ---
>> fs/fuse/file.c | 62 +++++++++++++++++++++++++++++++--------------
>> fs/fuse/fuse_i.h | 6 +++++
>> fs/fuse/virtio_fs.c | 1 +
>> 3 files changed, 50 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index f39456c65ed7..331208d3e4d1 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -645,7 +645,7 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
>> args->out_args[0].size = count;
>> }
>>
>> -
SNIP
>> static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>> size_t *nbytesp, int write,
>> - unsigned int max_pages)
>> + unsigned int max_pages,
>> + bool use_pages_for_kvec_io)
>> {
>> + bool flush_or_invalidate = false;
>> size_t nbytes = 0; /* # bytes already packed in req */
>> ssize_t ret = 0;
>>
>> - /* Special case for kernel I/O: can copy directly into the buffer */
>> + /* Special case for kernel I/O: can copy directly into the buffer.
>> + * However if the implementation of fuse_conn requires pages instead of
>> + * pointer (e.g., virtio-fs), use iov_iter_extract_pages() instead.
>> + */
>> if (iov_iter_is_kvec(ii)) {
>> - unsigned long user_addr = fuse_get_user_addr(ii);
>> - size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
>> + void *user_addr = (void *)fuse_get_user_addr(ii);
>>
>> - if (write)
>> - ap->args.in_args[1].value = (void *) user_addr;
>> - else
>> - ap->args.out_args[0].value = (void *) user_addr;
>> + if (!use_pages_for_kvec_io) {
>> + size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
>>
>> - iov_iter_advance(ii, frag_size);
>> - *nbytesp = frag_size;
>> - return 0;
>> + if (write)
>> + ap->args.in_args[1].value = user_addr;
>> + else
>> + ap->args.out_args[0].value = user_addr;
>> +
>> + iov_iter_advance(ii, frag_size);
>> + *nbytesp = frag_size;
>> + return 0;
>> + }
>> +
>> + if (is_vmalloc_addr(user_addr)) {
>> + ap->args.vmap_base = user_addr;
>> + flush_or_invalidate = true;
> Could we move flush_kernel_vmap_range() upon here, so that
> flush_or_invalidate is not needed anymore and the code looks cleaner?
flush_kernel_vmap_range() needs to know the length of the flushed area;
if it were moved up there, the length would not be known yet.
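Roughly, the ordering constraint looks like this (simplified sketch of fuse_get_user_pages(), not the literal patch):

```c
if (is_vmalloc_addr(user_addr)) {
	ap->args.vmap_base = user_addr;
	flush_or_invalidate = true;
	/* the flushed length is not known yet, so we cannot flush here */
}

while (nbytes < *nbytesp && ap->num_pages < max_pages) {
	/* iov_iter_extract_pages(...); nbytes accumulates here */
}

if (write && flush_or_invalidate)
	flush_kernel_vmap_range(ap->args.vmap_base, nbytes);
```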
>
>> + }
>> }
>>
>> while (nbytes < *nbytesp && ap->num_pages < max_pages) {
>> @@ -1513,6 +1533,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>> (PAGE_SIZE - ret) & (PAGE_SIZE - 1);
>> }
>>
>> + if (write && flush_or_invalidate)
>> + flush_kernel_vmap_range(ap->args.vmap_base, nbytes);
>> +
>> + ap->args.invalidate_vmap = !write && flush_or_invalidate;
> How about initializing vmap_base only when the data buffer is vmalloced
> and it's a read request? In this case invalidate_vmap is no longer needed.
You mean using the value of vmap_base to indicate whether invalidation
is needed, right? I prefer to keep it: the extra invalidate_vmap flag
makes the required action on the vmap area explicit, and it doesn't
increase the size of fuse_args.
>
>> ap->args.is_pinned = iov_iter_extract_will_pin(ii);
>> ap->args.user_pages = true;
>> if (write)
>> @@ -1581,7 +1605,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>> size_t nbytes = min(count, nmax);
>>
>> err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
>> - max_pages);
>> + max_pages, fc->use_pages_for_kvec_io);
>> if (err && !nbytes)
>> break;
>>
>> @@ -1595,7 +1619,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>> }
>>
>> if (!io->async || nres < 0) {
>> - fuse_release_user_pages(&ia->ap, io->should_dirty);
>> + fuse_release_user_pages(&ia->ap, nres, io->should_dirty);
>> fuse_io_free(ia);
>> }
>> ia = NULL;
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index f23919610313..79add14c363f 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -309,9 +309,12 @@ struct fuse_args {
>> bool may_block:1;
>> bool is_ext:1;
>> bool is_pinned:1;
>> + bool invalidate_vmap:1;
>> struct fuse_in_arg in_args[3];
>> struct fuse_arg out_args[2];
>> void (*end)(struct fuse_mount *fm, struct fuse_args *args, int error);
>> + /* Used for kvec iter backed by vmalloc address */
>> + void *vmap_base;
>> };
>>
>> struct fuse_args_pages {
>> @@ -860,6 +863,9 @@ struct fuse_conn {
>> /** Passthrough support for read/write IO */
>> unsigned int passthrough:1;
>>
>> + /* Use pages instead of pointer for kernel I/O */
>> + unsigned int use_pages_for_kvec_io:1;
> Maybe we need a better (actually shorter) name for this flag. kvec_pages?
Naming is hard. The name "use_pages_for_kvec_io" is verbose indeed.
kvec_pages is better. Will update it in the next spin.
>
>> +
>> /** Maximum stack depth for passthrough backing files */
>> int max_stack_depth;
>>
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index dd5260141615..43d66ab5e891 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -1568,6 +1568,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
>> fc->delete_stale = true;
>> fc->auto_submounts = true;
>> fc->sync_fs = true;
>> + fc->use_pages_for_kvec_io = true;
>>
>> /* Tell FUSE to split requests that exceed the virtqueue's size */
>> fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
* Re: [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker
2024-09-03 9:34 ` Jingbo Xu
@ 2024-09-04 3:53 ` Hou Tao
2024-09-04 12:12 ` Jingbo Xu
0 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2024-09-04 3:53 UTC (permalink / raw)
To: Jingbo Xu, linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
linux-kernel, virtualization, houtao1
On 9/3/2024 5:34 PM, Jingbo Xu wrote:
>
> On 8/31/24 5:37 PM, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When invoking virtio_fs_enqueue_req() through kworker, both the
>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>> Considering the size of the sg array may be greater than PAGE_SIZE, use
>> GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
>> allocation failure and to avoid unnecessarily depleting the atomic
>> reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly,
>> GFP_KERNEL and memalloc_nofs_{save|restore} helpers are used instead.
>>
>> It may seem OK to pass GFP_NOFS to virtio_fs_enqueue_req() as well when
>> queuing the request for the first time, but this is not the case. The
>> reason is that fuse_request_queue_background() may call
>> ->queue_request_and_unlock() while holding fc->bg_lock, which is a
>> spin-lock. Therefore, still use GFP_ATOMIC for it.
> Actually, .wake_pending_and_unlock() is called under fiq->lock and
> GFP_ATOMIC is requisite.
Er, but virtio_fs_wake_pending_and_unlock() unlocks fiq->lock before
queuing the request.
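For reference, the flow under discussion is roughly (simplified sketch, field and variable names abbreviated):

```c
static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->lock)
{
	struct fuse_req *req = /* first pending request */;

	list_del_init(&req->list);
	spin_unlock(&fiq->lock);	/* fiq->lock is dropped here ... */

	/* ... before the request is queued, so fiq->lock itself does not
	 * force GFP_ATOMIC; the fc->bg_lock held across
	 * ->queue_request_and_unlock() in fuse_request_queue_background()
	 * is the reason GFP_ATOMIC is kept on this path. */
	virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
}
```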
>
>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> fs/fuse/virtio_fs.c | 24 +++++++++++++++---------
>> 1 file changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 43d66ab5e891..9bc48b3ca384 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -95,7 +95,8 @@ struct virtio_fs_req_work {
>> };
>>
>> static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>> - struct fuse_req *req, bool in_flight);
>> + struct fuse_req *req, bool in_flight,
>> + gfp_t gfp);
>>
>> static const struct constant_table dax_param_enums[] = {
>> {"always", FUSE_DAX_ALWAYS },
>> @@ -439,6 +440,8 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>>
>> /* Dispatch pending requests */
>> while (1) {
>> + unsigned int flags;
>> +
>> spin_lock(&fsvq->lock);
>> req = list_first_entry_or_null(&fsvq->queued_reqs,
>> struct fuse_req, list);
>> @@ -449,7 +452,9 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>> list_del_init(&req->list);
>> spin_unlock(&fsvq->lock);
>>
>> - ret = virtio_fs_enqueue_req(fsvq, req, true);
>> + flags = memalloc_nofs_save();
>> + ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_KERNEL);
>> + memalloc_nofs_restore(flags);
>> if (ret < 0) {
>> if (ret == -ENOSPC) {
>> spin_lock(&fsvq->lock);
>> @@ -550,7 +555,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
>> }
>>
>> /* Allocate and copy args into req->argbuf */
>> -static int copy_args_to_argbuf(struct fuse_req *req)
>> +static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
>> {
>> struct fuse_args *args = req->args;
>> unsigned int offset = 0;
>> @@ -564,7 +569,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
>> len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
>> fuse_len_args(num_out, args->out_args);
>>
>> - req->argbuf = kmalloc(len, GFP_ATOMIC);
>> + req->argbuf = kmalloc(len, gfp);
>> if (!req->argbuf)
>> return -ENOMEM;
>>
>> @@ -1239,7 +1244,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
>>
>> /* Add a request to a virtqueue and kick the device */
>> static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>> - struct fuse_req *req, bool in_flight)
>> + struct fuse_req *req, bool in_flight,
>> + gfp_t gfp)
>> {
>> /* requests need at least 4 elements */
>> struct scatterlist *stack_sgs[6];
>> @@ -1260,8 +1266,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>> /* Does the sglist fit on the stack? */
>> total_sgs = sg_count_fuse_req(req);
>> if (total_sgs > ARRAY_SIZE(stack_sgs)) {
>> - sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
>> - sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
>> + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
>> + sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
>> if (!sgs || !sg) {
>> ret = -ENOMEM;
>> goto out;
>> @@ -1269,7 +1275,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>> }
>>
>> /* Use a bounce buffer since stack args cannot be mapped */
>> - ret = copy_args_to_argbuf(req);
>> + ret = copy_args_to_argbuf(req, gfp);
>> if (ret < 0)
>> goto out;
>>
>> @@ -1367,7 +1373,7 @@ __releases(fiq->lock)
>> queue_id);
>>
>> fsvq = &fs->vqs[queue_id];
>> - ret = virtio_fs_enqueue_req(fsvq, req, false);
>> + ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
>> if (ret < 0) {
>> if (ret == -ENOSPC) {
>> /*
> LGTM.
>
> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Thanks for the review.
>
>
* Re: [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker
2024-09-04 3:53 ` Hou Tao
@ 2024-09-04 12:12 ` Jingbo Xu
0 siblings, 0 replies; 9+ messages in thread
From: Jingbo Xu @ 2024-09-04 12:12 UTC (permalink / raw)
To: Hou Tao, linux-fsdevel
Cc: Miklos Szeredi, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
linux-kernel, virtualization, houtao1
On 9/4/24 11:53 AM, Hou Tao wrote:
>
>
> On 9/3/2024 5:34 PM, Jingbo Xu wrote:
>>
>> On 8/31/24 5:37 PM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> When invoking virtio_fs_enqueue_req() through kworker, both the
>>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>>> Considering the size of the sg array may be greater than PAGE_SIZE, use
>>> GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory
>>> allocation failure and to avoid unnecessarily depleting the atomic
>>> reserves. GFP_NOFS is not passed to virtio_fs_enqueue_req() directly,
>>> GFP_KERNEL and memalloc_nofs_{save|restore} helpers are used instead.
>>>
>>> It may seem OK to pass GFP_NOFS to virtio_fs_enqueue_req() as well when
>>> queuing the request for the first time, but this is not the case. The
>>> reason is that fuse_request_queue_background() may call
>>> ->queue_request_and_unlock() while holding fc->bg_lock, which is a
>>> spin-lock. Therefore, still use GFP_ATOMIC for it.
>> Actually, .wake_pending_and_unlock() is called under fiq->lock and
>> GFP_ATOMIC is requisite.
>
> Er, but virtio_fs_wake_pending_and_unlock() unlocks fiq->lock before
> queuing the request.
Alright, I missed that :(
--
Thanks,
Jingbo
* Re: [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO
2024-08-31 9:37 [PATCH v4 0/2] virtiofs: fix the warning for kernel direct IO Hou Tao
2024-08-31 9:37 ` [PATCH v4 1/2] virtiofs: use pages instead of pointer " Hou Tao
2024-08-31 9:37 ` [PATCH v4 2/2] virtiofs: use GFP_NOFS when enqueuing request through kworker Hou Tao
@ 2024-10-08 13:39 ` Miklos Szeredi
2 siblings, 0 replies; 9+ messages in thread
From: Miklos Szeredi @ 2024-10-08 13:39 UTC (permalink / raw)
To: Hou Tao
Cc: linux-fsdevel, Vivek Goyal, Stefan Hajnoczi, Bernd Schubert,
Michael S . Tsirkin, Matthew Wilcox, Benjamin Coddington,
Jingbo Xu, linux-kernel, virtualization, houtao1
On Sat, 31 Aug 2024 at 11:38, Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Hi,
>
> The patch set aims to fix the warning related to an abnormal size
> parameter of kmalloc() in virtiofs. Patch #1 fixes it by introducing
> use_pages_for_kvec_io option in fuse_conn and enabling it in virtiofs.
> Beside the abnormal size parameter for kmalloc, the gfp parameter is
> also questionable: GFP_ATOMIC is used even when the allocation occurs
> in a kworker context. Patch #2 fixes it by using GFP_NOFS when the
> allocation is initiated by the kworker. For more details, please check
> the individual patches.
Applied, thanks.
Miklos