linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2] fuse: use splice for reading user pages on servers that enable it
@ 2025-04-22 23:56 Joanne Koong
  2025-05-02 13:16 ` Bernd Schubert
  2025-05-07 14:45 ` Miklos Szeredi
  0 siblings, 2 replies; 13+ messages in thread
From: Joanne Koong @ 2025-04-22 23:56 UTC (permalink / raw)
  To: miklos; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

For direct io writes, splice is disabled when forwarding pages from the
client to the server. This is because the pages in the pipe buffer are
user pages, which are controlled by the client. Thus if a server replies
to the request and then keeps accessing the pages afterwards, there is
the possibility that the client may have modified the contents of the
pages in the meantime. More context on this can be found in commit
0c4bcfdecb1a ("fuse: fix pipe buffer lifetime for direct_io").

For servers that do not need to access pages after answering the
request, splice gives a non-trivial improvement in performance.
Benchmarks show roughly a 40% speedup.

Allow servers with CAP_SYS_ADMIN privileges to enable zero-copy splice
for servicing client direct io writes. By enabling this, the server
acknowledges that it should not continue accessing the pipe buffer
after completing the request, or it may see incorrect data if it does so.
Only servers with CAP_SYS_ADMIN may enable this, since having access to
the underlying user pages may allow servers to snoop or modify the
user pages after completing the request.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>

---

Changes from v1 -> v2:
* Gate this behind CAP_SYS_ADMIN (Bernd)
v1: https://lore.kernel.org/linux-fsdevel/20250419000614.3795331-1-joannelkoong@gmail.com/

---
 fs/fuse/dev.c             | 18 ++++++++++--------
 fs/fuse/dev_uring.c       |  4 ++--
 fs/fuse/fuse_dev_i.h      |  5 +++--
 fs/fuse/fuse_i.h          |  3 +++
 fs/fuse/inode.c           |  5 ++++-
 include/uapi/linux/fuse.h |  9 +++++++++
 6 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 67d07b4c778a..e4949c379eaf 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -816,12 +816,13 @@ static int unlock_request(struct fuse_req *req)
 	return err;
 }
 
-void fuse_copy_init(struct fuse_copy_state *cs, bool write,
-		    struct iov_iter *iter)
+void fuse_copy_init(struct fuse_copy_state *cs, struct fuse_conn *fc,
+		    bool write, struct iov_iter *iter)
 {
 	memset(cs, 0, sizeof(*cs));
 	cs->write = write;
 	cs->iter = iter;
+	cs->splice_read_user_pages = fc->splice_read_user_pages;
 }
 
 /* Unmap and put previous page of userspace buffer */
@@ -1105,9 +1106,10 @@ static int fuse_copy_page(struct fuse_copy_state *cs, struct page **pagep,
 		if (cs->write && cs->pipebufs && page) {
 			/*
 			 * Can't control lifetime of pipe buffers, so always
-			 * copy user pages.
+			 * copy user pages if server does not support reading
+			 * user pages through splice.
 			 */
-			if (cs->req->args->user_pages) {
+			if (cs->req->args->user_pages && !cs->splice_read_user_pages) {
 				err = fuse_copy_fill(cs);
 				if (err)
 					return err;
@@ -1538,7 +1540,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to)
 	if (!user_backed_iter(to))
 		return -EINVAL;
 
-	fuse_copy_init(&cs, true, to);
+	fuse_copy_init(&cs, fud->fc, true, to);
 
 	return fuse_dev_do_read(fud, file, &cs, iov_iter_count(to));
 }
@@ -1561,7 +1563,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos,
 	if (!bufs)
 		return -ENOMEM;
 
-	fuse_copy_init(&cs, true, NULL);
+	fuse_copy_init(&cs, fud->fc, true, NULL);
 	cs.pipebufs = bufs;
 	cs.pipe = pipe;
 	ret = fuse_dev_do_read(fud, in, &cs, len);
@@ -2227,7 +2229,7 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
 	if (!user_backed_iter(from))
 		return -EINVAL;
 
-	fuse_copy_init(&cs, false, from);
+	fuse_copy_init(&cs, fud->fc, false, from);
 
 	return fuse_dev_do_write(fud, &cs, iov_iter_count(from));
 }
@@ -2301,7 +2303,7 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 	}
 	pipe_unlock(pipe);
 
-	fuse_copy_init(&cs, false, NULL);
+	fuse_copy_init(&cs, fud->fc, false, NULL);
 	cs.pipebufs = bufs;
 	cs.nr_segs = nbuf;
 	cs.pipe = pipe;
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index ef470c4a9261..52b883a6a79d 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -593,7 +593,7 @@ static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
 	if (err)
 		return err;
 
-	fuse_copy_init(&cs, false, &iter);
+	fuse_copy_init(&cs, ring->fc, false, &iter);
 	cs.is_uring = true;
 	cs.req = req;
 
@@ -623,7 +623,7 @@ static int fuse_uring_args_to_ring(struct fuse_ring *ring, struct fuse_req *req,
 		return err;
 	}
 
-	fuse_copy_init(&cs, true, &iter);
+	fuse_copy_init(&cs, ring->fc, true, &iter);
 	cs.is_uring = true;
 	cs.req = req;
 
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index db136e045925..25e593e64c67 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -32,6 +32,7 @@ struct fuse_copy_state {
 	bool write:1;
 	bool move_pages:1;
 	bool is_uring:1;
+	bool splice_read_user_pages:1;
 	struct {
 		unsigned int copied_sz; /* copied size into the user buffer */
 	} ring;
@@ -51,8 +52,8 @@ struct fuse_req *fuse_request_find(struct fuse_pqueue *fpq, u64 unique);
 
 void fuse_dev_end_requests(struct list_head *head);
 
-void fuse_copy_init(struct fuse_copy_state *cs, bool write,
-			   struct iov_iter *iter);
+void fuse_copy_init(struct fuse_copy_state *cs, struct fuse_conn *conn,
+		    bool write, struct iov_iter *iter);
 int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
 		   unsigned int argpages, struct fuse_arg *args,
 		   int zeroing);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3d5289cb82a5..e21875f16220 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -898,6 +898,9 @@ struct fuse_conn {
 	/* Use io_uring for communication */
 	bool io_uring:1;
 
+	/* Allow splice for reading user pages */
+	bool splice_read_user_pages:1;
+
 	/** Maximum stack depth for passthrough backing files */
 	int max_stack_depth;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 43b6643635ee..8b78dacf6c97 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1439,6 +1439,9 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 
 			if (flags & FUSE_REQUEST_TIMEOUT)
 				timeout = arg->request_timeout;
+
+			if (flags & FUSE_SPLICE_READ_USER_PAGES && capable(CAP_SYS_ADMIN))
+				fc->splice_read_user_pages = true;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = true;
@@ -1489,7 +1492,7 @@ void fuse_send_init(struct fuse_mount *fm)
 		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
 		FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
 		FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_ALLOW_IDMAP |
-		FUSE_REQUEST_TIMEOUT;
+		FUSE_REQUEST_TIMEOUT | FUSE_SPLICE_READ_USER_PAGES;
 #ifdef CONFIG_FUSE_DAX
 	if (fm->fc->dax)
 		flags |= FUSE_MAP_ALIGNMENT;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 122d6586e8d4..fecb06921da9 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -235,6 +235,9 @@
  *
  *  7.44
  *  - add FUSE_NOTIFY_INC_EPOCH
+ *
+ *  7.45
+ *  - add FUSE_SPLICE_READ_USER_PAGES
  */
 
 #ifndef _LINUX_FUSE_H
@@ -443,6 +446,11 @@ struct fuse_file_lock {
  * FUSE_OVER_IO_URING: Indicate that client supports io-uring
  * FUSE_REQUEST_TIMEOUT: kernel supports timing out requests.
  *			 init_out.request_timeout contains the timeout (in secs)
+ * FUSE_SPLICE_READ_USER_PAGES: kernel supports splice on the device for reading
+ *				user pages. If the server enables this, then the
+ *				server should not access the pipe buffer after
+ *				completing the request. Only servers with
+ *				CAP_SYS_ADMIN privileges can enable this.
  */
 #define FUSE_ASYNC_READ		(1 << 0)
 #define FUSE_POSIX_LOCKS	(1 << 1)
@@ -490,6 +498,7 @@ struct fuse_file_lock {
 #define FUSE_ALLOW_IDMAP	(1ULL << 40)
 #define FUSE_OVER_IO_URING	(1ULL << 41)
 #define FUSE_REQUEST_TIMEOUT	(1ULL << 42)
+#define FUSE_SPLICE_READ_USER_PAGES (1ULL << 43)
 
 /**
  * CUSE INIT request/reply flags
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-04-22 23:56 [PATCH v2] fuse: use splice for reading user pages on servers that enable it Joanne Koong
@ 2025-05-02 13:16 ` Bernd Schubert
  2025-05-07 14:45 ` Miklos Szeredi
  1 sibling, 0 replies; 13+ messages in thread
From: Bernd Schubert @ 2025-05-02 13:16 UTC (permalink / raw)
  To: Joanne Koong, miklos; +Cc: linux-fsdevel, jlayton, kernel-team



On 4/23/25 01:56, Joanne Koong wrote:
> For direct io writes, splice is disabled when forwarding pages from the
> client to the server. This is because the pages in the pipe buffer are
> user pages, which are controlled by the client. Thus if a server replies
> to the request and then keeps accessing the pages afterwards, there is
> the possibility that the client may have modified the contents of the
> pages in the meantime. More context on this can be found in commit
> 0c4bcfdecb1a ("fuse: fix pipe buffer lifetime for direct_io").
> 
> For servers that do not need to access pages after answering the
> request, splice gives a non-trivial improvement in performance.
> Benchmarks show roughly a 40% speedup.
> 
> Allow servers with CAP_SYS_ADMIN privileges to enable zero-copy splice
> for servicing client direct io writes. By enabling this, the server
> acknowledges that it should not continue accessing the pipe buffer
> after completing the request, or it may see incorrect data if it does so.
> Only servers with CAP_SYS_ADMIN may enable this, since having access to
> the underlying user pages may allow servers to snoop or modify the
> user pages after completing the request.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>

Reviewed-by: Bernd Schubert <bschubert@ddn.com>

> [...]



* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-04-22 23:56 [PATCH v2] fuse: use splice for reading user pages on servers that enable it Joanne Koong
  2025-05-02 13:16 ` Bernd Schubert
@ 2025-05-07 14:45 ` Miklos Szeredi
  2025-05-12 19:03   ` Joanne Koong
  1 sibling, 1 reply; 13+ messages in thread
From: Miklos Szeredi @ 2025-05-07 14:45 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Wed, 23 Apr 2025 at 01:56, Joanne Koong <joannelkoong@gmail.com> wrote:

> For servers that do not need to access pages after answering the
> request, splice gives a non-trivial improvement in performance.
> Benchmarks show roughly a 40% speedup.

Hmm, have you looked at where this speedup comes from?

Is this a real zero-copy scenario where the server just forwards the
pages to a driver which does DMA, so that the CPU never actually
touches the page contents?

Thanks,
Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-07 14:45 ` Miklos Szeredi
@ 2025-05-12 19:03   ` Joanne Koong
  2025-05-13  5:46     ` Miklos Szeredi
  0 siblings, 1 reply; 13+ messages in thread
From: Joanne Koong @ 2025-05-12 19:03 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Wed, May 7, 2025 at 7:45 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Wed, 23 Apr 2025 at 01:56, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > For servers that do not need to access pages after answering the
> > request, splice gives a non-trivial improvement in performance.
> > Benchmarks show roughly a 40% speedup.
>
> Hmm, have you looked at where this speedup comes from?
>
> Is this a real zero-copy scenario where the server just forwards the
> pages to a driver which does DMA, so that the CPU never actually
> touches the page contents?

I ran the benchmarks last month on the passthrough_ll server (from the
libfuse examples) with the actual copying out / buffer processing
removed (e.g. the .write_buf handler immediately returns
"fuse_reply_write(req, fuse_buf_size(in_buf));").

Thanks,
Joanne

>
> Thanks,
> Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-12 19:03   ` Joanne Koong
@ 2025-05-13  5:46     ` Miklos Szeredi
  2025-05-13 21:29       ` Joanne Koong
  0 siblings, 1 reply; 13+ messages in thread
From: Miklos Szeredi @ 2025-05-13  5:46 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Mon, 12 May 2025 at 21:03, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, May 7, 2025 at 7:45 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Wed, 23 Apr 2025 at 01:56, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > > For servers that do not need to access pages after answering the
> > > request, splice gives a non-trivial improvement in performance.
> > > Benchmarks show roughly a 40% speedup.
> >
> > Hmm, have you looked at where this speedup comes from?
> >
> > Is this a real zero-copy scenario where the server just forwards the
> > pages to a driver which does DMA, so that the CPU never actually
> > touches the page contents?
>
> I ran the benchmarks last month on the passthrough_ll server (from the
> libfuse examples) with the actual copying out / buffer processing
> removed (e.g. the .write_buf handler immediately returns
> "fuse_reply_write(req, fuse_buf_size(in_buf));").

Ah, ok.

It would be good to see results in a more realistic scenario than that
before deciding to do this.

Thanks,
Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-13  5:46     ` Miklos Szeredi
@ 2025-05-13 21:29       ` Joanne Koong
  2025-05-14 11:56         ` Miklos Szeredi
  0 siblings, 1 reply; 13+ messages in thread
From: Joanne Koong @ 2025-05-13 21:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Mon, May 12, 2025 at 10:46 PM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 12 May 2025 at 21:03, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, May 7, 2025 at 7:45 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > On Wed, 23 Apr 2025 at 01:56, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > > For servers that do not need to access pages after answering the
> > > > request, splice gives a non-trivial improvement in performance.
> > > > Benchmarks show roughly a 40% speedup.
> > >
> > > Hmm, have you looked at where this speedup comes from?
> > >
> > > Is this a real zero-copy scenario where the server just forwards the
> > > pages to a driver which does DMA, so that the CPU never actually
> > > touches the page contents?
> >
> > I ran the benchmarks last month on the passthrough_ll server (from the
> > libfuse examples) with the actual copying out / buffer processing
> > removed (e.g. the .write_buf handler immediately returns
> > "fuse_reply_write(req, fuse_buf_size(in_buf));").
>
> Ah, ok.
>
> It would be good to see results in a more realistic scenario than that
> before deciding to do this.

The results vary depending on how IO-intensive the server-side
processing logic is (e.g. ones that are not as intensive would show a
bigger relative performance speedup than ones where a lot of time is
spent on server-side processing). I can include the results from
benchmarks on our internal fuse server, which forwards the data in the
write buffer to a remote server over the network. For that, we saw
roughly a 5% improvement in throughput for 5 GB writes with 16 MB
chunk sizes, and a 2.45% improvement in throughput for 12 parallel
writes of 16 GB files with 64 MB chunk sizes.


Thanks,
Joanne

>
> Thanks,
> Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-13 21:29       ` Joanne Koong
@ 2025-05-14 11:56         ` Miklos Szeredi
  2025-05-14 23:17           ` Joanne Koong
  0 siblings, 1 reply; 13+ messages in thread
From: Miklos Szeredi @ 2025-05-14 11:56 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Tue, 13 May 2025 at 23:29, Joanne Koong <joannelkoong@gmail.com> wrote:

> The results vary depending on how IO-intensive the server-side
> processing logic is (eg ones that are not as intensive would show a
> bigger relative performance speedup than ones where a lot of time is
> spent on server-side processing). I can include the results from
> benchmarks on our internal fuse server, which forwards the data in the
> write buffer to a remote server over the network. For that, we saw
> roughly a 5% improvement in throughput for 5 GB writes with 16 MB
> chunk sizes, and a 2.45% improvement in throughput for 12 parallel
> writes of 16 GB files with 64 MB chunk sizes.

Okay, those are much saner numbers.

Does the server use MSG_ZEROCOPY?

Can you please include these numbers and the details on how the server
takes advantage of splice in the patch header?

Thanks,
Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-14 11:56         ` Miklos Szeredi
@ 2025-05-14 23:17           ` Joanne Koong
  2025-05-15  8:38             ` Miklos Szeredi
  0 siblings, 1 reply; 13+ messages in thread
From: Joanne Koong @ 2025-05-14 23:17 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Wed, May 14, 2025 at 4:57 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, 13 May 2025 at 23:29, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > The results vary depending on how IO-intensive the server-side
> > processing logic is (eg ones that are not as intensive would show a
> > bigger relative performance speedup than ones where a lot of time is
> > spent on server-side processing). I can include the results from
> > benchmarks on our internal fuse server, which forwards the data in the
> > write buffer to a remote server over the network. For that, we saw
> > roughly a 5% improvement in throughput for 5 GB writes with 16 MB
> > chunk sizes, and a 2.45% improvement in throughput for 12 parallel
> > writes of 16 GB files with 64 MB chunk sizes.
>
> Okay, those are much saner numbers.

Sorry, I wasn't trying to intentionally inflate the numbers. I thought
isolating it to the kernel speedup was the most accurate, since otherwise
the numbers depend on the individual server implementation.

>
> Does the server use MSG_ZEROCOPY?

No. The server copies the buffer to another buffer (for later
processing) so that the server can immediately reply to the request
and not hold up work on that libfuse thread. Splice here helps because
it gets rid of one copy, e.g. instead of copying the data into the libfuse
buffer and then from the libfuse buffer into this other buffer, we can now
just do a read() from the pipe file descriptor used for splice, straight
into the other buffer.

>
> Can you please include these numbers and the details on how the server
> takes advantage of splice in the patch header?

I will resubmit this as v3 with the numbers and details included in
the patch header underneath the commit message.

Thanks,
Joanne
>
> Thanks,
> Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-14 23:17           ` Joanne Koong
@ 2025-05-15  8:38             ` Miklos Szeredi
  2025-05-15 19:16               ` Joanne Koong
  0 siblings, 1 reply; 13+ messages in thread
From: Miklos Szeredi @ 2025-05-15  8:38 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Thu, 15 May 2025 at 01:17, Joanne Koong <joannelkoong@gmail.com> wrote:

> No. The server copies the buffer to another buffer (for later
> processing) so that the server can immediately reply to the request
> and not hold up work on that libfuse thread. Splice here helps because
> it gets rid of 1 copy, eg instead of copying the data to the libfuse
> buffer and then from libfuse buffer to this other buffer, we can now
> just do a read() on the file descriptor returned from splice into the
> other buffer.

Yeah, splice is neat, but that pesky thing about the buffer lifetimes
makes it not all that desirable.

So I'm wondering if the planned zero copy uring api is perhaps a
better solution?

In theory there's nothing preventing us from doing it with the plain
/dev/fuse interface (i.e. read the FUSE_WRITE header and pass a
virtual offset to the libfuse write callback, which can read the
payload from the given offset), but perhaps the uring one is more
elegant.

Thanks,
Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-15  8:38             ` Miklos Szeredi
@ 2025-05-15 19:16               ` Joanne Koong
  2025-05-16  7:58                 ` Miklos Szeredi
  0 siblings, 1 reply; 13+ messages in thread
From: Joanne Koong @ 2025-05-15 19:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Thu, May 15, 2025 at 1:38 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 15 May 2025 at 01:17, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > No. The server copies the buffer to another buffer (for later
> > processing) so that the server can immediately reply to the request
> > and not hold up work on that libfuse thread. Splice here helps because
> > it gets rid of 1 copy, eg instead of copying the data to the libfuse
> > buffer and then from libfuse buffer to this other buffer, we can now
> > just do a read() on the file descriptor returned from splice into the
> > other buffer.
>
> Yeah, splice is neat, but that pesky thing about the buffer lifetimes
> makes it not all that desirable.
>
> So I'm wondering if the planned zero copy uring api is perhaps a
> better solution?
>
> In theory there's nothing preventing us from doing it with the plain
> /dev/fuse interface (i.e. read the FUSE_WRITE header and pass a
> virtual offset to the libfuse write callback, which can read the
> payload from the given offset), but perhaps the uring one is more
> elegant.

As I understand it, the zero copy uring api (I think the one you're
talking about is the one discussed here [1]?) requires client-side
changes in order to utilize it.

[1] https://lore.kernel.org/linux-fsdevel/dc3a5c7d-b254-48ea-9749-2c464bfd3931@davidwei.uk/

Thanks,
Joanne
>
> Thanks,
> Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-15 19:16               ` Joanne Koong
@ 2025-05-16  7:58                 ` Miklos Szeredi
  2025-05-16 18:15                   ` Bernd Schubert
  0 siblings, 1 reply; 13+ messages in thread
From: Miklos Szeredi @ 2025-05-16  7:58 UTC (permalink / raw)
  To: Joanne Koong; +Cc: linux-fsdevel, bernd.schubert, jlayton, kernel-team

On Thu, 15 May 2025 at 21:16, Joanne Koong <joannelkoong@gmail.com> wrote:

> As I understand it, the zero copy uring api (I think the one you're
> talking about is the one discussed here [1]?) requires client-side
> changes in order to utilize it.
>
> [1] https://lore.kernel.org/linux-fsdevel/dc3a5c7d-b254-48ea-9749-2c464bfd3931@davidwei.uk/

No, that's not what I was thinking.  That sort of thing is out of
scope for fuse, I think.

Hmm, so you actually need "single copy" direct write.

 - there's the buffer that write(2) gets from application
 - it's copied into server's own buffer, at which point the write(2) can return
 - at some point this buffer is sent to the network and freed/reused

Currently this is not possible:

 - there's the buffer that write(2) gets from application
 - it's copied into libfuse's buffer, which is passed to the write callback
 - the server's write callback copies this to its own buffer, ...

What's preventing libfuse from allowing the server to keep the buffer?  It
seems just a logistic problem, not some fundamental API issue.  Adding
a fuse_buf_clone() that just transfers ownership of the underlying
buffer is all that's needed on the API side.  As for the
implementation: libfuse would then need to handle the case of a buffer
that has been transferred.

Does this make sense?

Thanks,
Miklos


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-16  7:58                 ` Miklos Szeredi
@ 2025-05-16 18:15                   ` Bernd Schubert
  2025-05-16 23:38                     ` Joanne Koong
  0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2025-05-16 18:15 UTC (permalink / raw)
  To: Miklos Szeredi, Joanne Koong
  Cc: linux-fsdevel, jlayton, kernel-team, Keith Busch



On 5/16/25 09:58, Miklos Szeredi wrote:
> On Thu, 15 May 2025 at 21:16, Joanne Koong <joannelkoong@gmail.com> wrote:
> 
>> As I understand it, the zero copy uring api (I think the one you're
>> talking about is the one discussed here [1]?) requires client-side
>> changes in order to utilize it.
>>
>> [1] https://lore.kernel.org/linux-fsdevel/dc3a5c7d-b254-48ea-9749-2c464bfd3931@davidwei.uk/
> 
> No, that's not what I was thinking.  That sort of thing is out of
> scope for fuse, I think.

Yeah, I don't think that is what Keith did for ublk either, nor what is
planned for fuse. Adding Keith in Cc.

> 
> Hmm, so you actually need "single copy" direct write.
> 
>   - there's the buffer that write(2) gets from application
>   - it's copied into server's own buffer, at which point the write(2) can return
>   - at some point this buffer is sent to the network and freed/reused
> 
> Currently this is not possible:
> 
>   - there's the buffer that write(2) gets from application
>   - it's copied into libfuse's buffer, which is passed to the write callback
>   - the server's write callback copies this to its own buffer, ...
> 
> What's preventing libfuse from allowing the server to keep the buffer?
> It seems like just a logistical problem, not some fundamental API
> issue.  Adding a fuse_buf_clone() that just transfers ownership of the
> underlying buffer is all that's needed on the API side.  As for the
> implementation: libfuse would then need to handle the case of a buffer
> that has been transferred.

With io-uring the buffer belongs to the request and not to the
thread anymore - no need to copy from libfuse to the server.

AFAIK, mergerfs uses a modified libfuse that allows the server to keep
the buffer even without io-uring. I just don't see much of a point in
working on these things when we have io-uring now.


Thanks,
Bernd


* Re: [PATCH v2] fuse: use splice for reading user pages on servers that enable it
  2025-05-16 18:15                   ` Bernd Schubert
@ 2025-05-16 23:38                     ` Joanne Koong
  0 siblings, 0 replies; 13+ messages in thread
From: Joanne Koong @ 2025-05-16 23:38 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel, jlayton, kernel-team, Keith Busch

On Fri, May 16, 2025 at 11:15 AM Bernd Schubert
<bernd.schubert@fastmail.fm> wrote:
>
> On 5/16/25 09:58, Miklos Szeredi wrote:
> > On Thu, 15 May 2025 at 21:16, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> >> As I understand it, the zero copy uring api (I think the one you're
> >> talking about is the one discussed here [1]?) requires client-side
> >> changes in order to utilize it.
> >>
> >> [1] https://lore.kernel.org/linux-fsdevel/dc3a5c7d-b254-48ea-9749-2c464bfd3931@davidwei.uk/
> >
> > No, that's not what I was thinking.  That sort of thing is out of
> > scope for fuse, I think.
>
> Yeah, I don't think that is what Keith had done for ublk either and what is
> planned for fuse. Added in Keith.
>
> >
> > Hmm, so you actually need "single copy" direct write.
> >
> >   - there's the buffer that write(2) gets from application
> >   - it's copied into server's own buffer, at which point the write(2) can return
> >   - at some point this buffer is sent to the network and freed/reused
> >
> > Currently this is not possible:
> >
> >   - there's the buffer that write(2) gets from application
> >   - it's copied into libfuse's buffer, which is passed to the write callback
> >   - the server's write callback copies this to its own buffer, ...
> >
> > What's preventing libfuse from allowing the server to keep the
> > buffer?  It seems like just a logistical problem, not some
> > fundamental API issue.  Adding a fuse_buf_clone() that just
> > transfers ownership of the underlying buffer is all that's needed
> > on the API side.  As for the implementation: libfuse would then
> > need to handle the case of a buffer that has been transferred.

The issue is that there's no good way to transfer ownership of the
buffer. The buffer belongs to the "struct fuse_buf" it's encapsulated
inside. Libfuse passes the server's write handler a pointer to the
write payload inside the buffer. If the handler steals ownership of
the buffer, the struct fuse_buf that buffer is in needs to set its
->mem to NULL. The only way I can think of to do this, which is
hacky, is to add some "bool buffer_stolen : 1;" field to "struct
fuse_file_info", since we pass "struct fuse_file_info *fi" to the
write handler as an arg, and then in fuse_session_process_buf(), set
"fuse_buf->mem = NULL" if fi->buffer_stolen was set by the handler.
This changes the semantics of the fuse_session_process_buf() public
API, which I don't believe is backwards compatible. The caller may
supply its own buffer to fuse_session_{receive,process}_buf() and then
do whatever it'd like with that buffer afterwards (e.g. reuse it for
other data or dereference into it), but that would no longer hold.
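
That hacky scheme, sketched with simplified stand-ins (the
buffer_stolen bit and both struct layouts here model the proposal
above, they are not existing libfuse code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for struct fuse_file_info and struct
 * fuse_buf as discussed above. */
struct fi_sketch {
	bool buffer_stolen : 1;	/* proposed flag, set by the handler */
};

struct buf_sketch {
	size_t size;
	void *mem;
};

/* Server's write handler: keeps the payload pointer and flags
 * the steal instead of copying the data out. */
static const void *write_handler(const void *payload,
				 struct fi_sketch *fi)
{
	fi->buffer_stolen = true;
	return payload;		/* server now owns this memory */
}

/* Epilogue that fuse_session_process_buf() would need: if the
 * handler stole the buffer, drop the library's reference so the
 * memory is not reused or freed behind the server's back. */
static void process_buf_epilogue(struct buf_sketch *buf,
				 const struct fi_sketch *fi)
{
	if (fi->buffer_stolen)
		buf->mem = NULL;
}
```

The epilogue is exactly where the compatibility problem bites:
existing callers assume they still own the buffer they passed in once
fuse_session_process_buf() returns.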

>
> With io-uring the buffer belongs to the request and not to the
> thread anymore - no need to copy from libfuse to the server.

I don't see how io-uring helps here. With io-uring, the ring entry's
payload is now the buffer, no? If a thread still needs that payload to
process the data, it can't keep using the pointer to that payload
while allowing that entry to be reused for another request, no?


Thanks,
Joanne
>
> AFAIK, mergerfs uses a modified libfuse that allows the server to keep
> the buffer even without io-uring. I just don't see much of a point in
> working on these things when we have io-uring now.
>
>
> Thanks,
> Bernd


end of thread, other threads:[~2025-05-16 23:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-22 23:56 [PATCH v2] fuse: use splice for reading user pages on servers that enable it Joanne Koong
2025-05-02 13:16 ` Bernd Schubert
2025-05-07 14:45 ` Miklos Szeredi
2025-05-12 19:03   ` Joanne Koong
2025-05-13  5:46     ` Miklos Szeredi
2025-05-13 21:29       ` Joanne Koong
2025-05-14 11:56         ` Miklos Szeredi
2025-05-14 23:17           ` Joanne Koong
2025-05-15  8:38             ` Miklos Szeredi
2025-05-15 19:16               ` Joanne Koong
2025-05-16  7:58                 ` Miklos Szeredi
2025-05-16 18:15                   ` Bernd Schubert
2025-05-16 23:38                     ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).