[PATCH RFC v2 00/19] fuse: fuse-over-io-uring

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
@ 2024-05-29 18:00 Bernd Schubert
  2024-05-29 18:00 ` [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
                   ` (22 more replies)
  0 siblings, 23 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring

From: Bernd Schubert <bschubert@ddn.com>

This adds support for uring communication between kernel and
userspace daemon using opcode the IORING_OP_URING_CMD. The basic
appraoch was taken from ublk.  The patches are in RFC state,
some major changes are still to be expected.

Motivation for these patches is all to increase fuse performance.
In fuse-over-io-uring requests avoid core switching (application
on core X, processing of fuse server on random core Y) and use
shared memory between kernel and userspace to transfer data.
Similar approaches have been taken by ZUFS and FUSE2, though
not over io-uring, but through ioctl IOs

https://lwn.net/Articles/756625/
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2

Avoiding cache line bouncing / numa systems was discussed
between Amir and Miklos before and Miklos had posted
part of the private discussion here
https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/

This cache line bouncing should be addressed by these patches
as well.

I had also noticed waitq wake-up latencies in fuse before
https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/

This spinning approach helped with performance (>40% improvement
for file creates), but due to random server side thread/core utilization
spinning cannot be well controlled in /dev/fuse mode.
With fuse-over-io-uring requests are handled on the same core
(sync requests) or on core+1 (large async requests) and performance
improvements are achieved without spinning.

Splice/zero-copy is not supported yet, Ming Lei is working
on io-uring support for ublk_drv, but I think so far there
is no final agreement on the approach to be taken yet.
Fuse-over-io-uring runs significantly faster than reads/writes
over /dev/fuse, even with splice enabled, so missing zc
should not be a blocking issue.

The patches have been tested with multiple xfstest runs in a VM
(32 cores) with a kernel that has several debug options
enabled (like KASAN and MSAN).
For some tests xfstests reports that O_DIRECT is not supported,
I need to investigate that. Interesting part is that exactly
these tests fail in plain /dev/fuse posix mode. I had to disabled
generic/650, which is enabling/disabling cpu cores - given ring
threads are bound to cores issues with that are no totally
unexpected, but then there (scheduler) kernel messages that
core binding for these threads is removed - this needs
to be further investigates.
Nice effect in io-uring mode is that tests run faster (like
generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still
slow as this is with ASAN/leak-detection/etc.

The corresponding libfuse patches are on my uring branch,
but need cleanup for submission - will happen during the next
days.
https://github.com/bsbernd/libfuse/tree/uring

If it should make review easier, patches posted here are on
this branch
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2

TODO list for next RFC versions
- Let the ring configure ioctl return information, like mmap/queue-buf size
- Request kernel side address and len for a request - avoid calculation in userspace?
- multiple IO sizes per queue (avoiding a calculation in userspace is probably even
  more important)
- FUSE_INTERRUPT handling?
- Logging (adds fields in the ioctl and also ring-request),
  any mismatch between client and server is currently very hard to understand
  through error codes

Future work
- notifications, probably on their own ring
- zero copy

I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
which, resulted in some tuning patches (at the end of the
patch series).

Some benchmark results
======================

System used for the benchmark is a 32 core (HyperThreading enabled)
Xeon E5-2650 system. I don't have local disks attached that could do
>5GB/s IOs, for paged and dio results a patched version of passthrough-hp
was used that bypasses final reads/writes.

paged reads
-----------
            128K IO size                      1024K IO size
jobs   /dev/fuse     uring    gain     /dev/fuse    uring   gain
 1        1117        1921    1.72        1902       1942   1.02
 2        2502        3527    1.41        3066       3260   1.06
 4        5052        6125    1.21        5994       6097   1.02
 8        6273       10855    1.73        7101      10491   1.48
16        6373       11320    1.78        7660      11419   1.49
24        6111        9015    1.48        7600       9029   1.19
32        5725        7968    1.39        6986       7961   1.14

dio reads (1024K)
-----------------

jobs   /dev/fuse  uring   gain
1	    2023   3998	  2.42
2	    3375   7950   2.83
4	    3823   15022  3.58
8	    7796   22591  2.77
16	    8520   27864  3.27
24	    8361   20617  2.55
32	    8717   12971  1.55

mmap reads (4K)
---------------
(sequential, I probably should have made it random, sequential exposes
a rather interesting/weird 'optimized' memcpy issue - sequential becomes
reversed order 4K read)
https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/

jobs  /dev/fuse     uring    gain
1       130          323     2.49
2       219          538     2.46
4       503         1040     2.07
8       1472        2039     1.38
16      2191        3518     1.61
24      2453        4561     1.86
32      2178        5628     2.58

(Results on request, setting MAP_HUGETLB much improves performance
for both, io-uring mode then has a slight advantage only.)

creates/s
----------
threads /dev/fuse     uring   gain
1          3944       10121   2.57
2          8580       24524   2.86
4         16628       44426   2.67
8         46746       56716   1.21
16        79740      102966   1.29
20        80284      119502   1.49

(the gain drop with >=8 cores needs to be investigated)

Remaining TODO list for RFCv3:
--------------------------------
1) Let the ring configure ioctl return information,
like mmap/queue-buf size

Right now libfuse and kernel have lots of duplicated setup code
and any kind of pointer/offset mismatch results in a non-working
ring that is hard to debug - probably better when the kernel does
the calculations and returns that to server side

2) In combination with 1, ring requests should retrieve their
userspace address and length from kernel side instead of
calculating it through the mmaped queue buffer on their own.
(Introduction of FUSE_URING_BUF_ADDR_FETCH)

3) Add log buffer into the ioctl and ring-request

This is to provide better error messages (instead of just
errno)

3) Multiple IO sizes per queue

Small IOs and metadata requests do not need large buffer sizes,
we need multiple IO sizes per queue.

4) FUSE_INTERRUPT handling

These are not handled yet, kernel side is probably not difficult
anymore as ring entries take fuse requests through lists.

Long term TODO:
--------------
Notifications through io-uring, maybe with a separated ring,
but I'm not sure yet.

Changes since RFCv1
-------------------
- No need to hold the task of the server side anymore.  Also no
  ioctls/threads waiting for shutdown anymore.  Shutdown now more
  works like the traditional fuse way.
- Each queue clones the fuse and device release makes an  exception
  for io-uring. Reason is that queued IORING_OP_URING_CMD
  (through .uring_cmd) prevent a device release. I.e. a killed
  server side typically triggers fuse_abort_conn(). This was the
  reason for the async stop-monitor in v1 and reference on the daemon
  task. However it was very racy and annotated immediately by Miklos.
- In v1 the offset parameter to mmap was identifying the QID, in v2
  server side is expected to send mmap from a core bound ring thread
  in numa mode and numa node is taken through the core of that thread.
  Kernel side of the mmap buffer is stored in an rbtree and assigned
  to the right qid through an additional queue ioctl.
- Release of IORING_OP_URING_CMD is done through lists now, instead
  of iterating over the entire array of queues/entries and does not
  depend on the entry state anymore (a bit of the state is still left
  for sanity check).
- Finding free ring queue entries is done through lists and not through
  a bitmap anymore
- Many other code changes and bug fixes
- Performance tunings

---
Bernd Schubert (19):
      fuse: rename to fuse_dev_end_requests and make non-static
      fuse: Move fuse_get_dev to header file
      fuse: Move request bits
      fuse: Add fuse-io-uring design documentation
      fuse: Add a uring config ioctl
      Add a vmalloc_node_user function
      fuse uring: Add an mmap method
      fuse: Add the queue configuration ioctl
      fuse: {uring} Add a dev_release exception for fuse-over-io-uring
      fuse: {uring} Handle SQEs - register commands
      fuse: Add support to copy from/to the ring buffer
      fuse: {uring} Add uring sqe commit and fetch support
      fuse: {uring} Handle uring shutdown
      fuse: {uring} Allow to queue to the ring
      export __wake_on_current_cpu
      fuse: {uring} Wake requests on the the current cpu
      fuse: {uring} Send async requests to qid of core + 1
      fuse: {uring} Set a min cpu offset io-size for reads/writes
      fuse: {uring} Optimize async sends

 Documentation/filesystems/fuse-io-uring.rst |  167 ++++
 fs/fuse/Kconfig                             |   12 +
 fs/fuse/Makefile                            |    1 +
 fs/fuse/dev.c                               |  310 +++++--
 fs/fuse/dev_uring.c                         | 1232 +++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h                       |  395 +++++++++
 fs/fuse/file.c                              |   15 +-
 fs/fuse/fuse_dev_i.h                        |   67 ++
 fs/fuse/fuse_i.h                            |    9 +
 fs/fuse/inode.c                             |    3 +
 include/linux/vmalloc.h                     |    1 +
 include/uapi/linux/fuse.h                   |  135 +++
 kernel/sched/wait.c                         |    1 +
 mm/nommu.c                                  |    6 +
 mm/vmalloc.c                                |   41 +-
 15 files changed, 2330 insertions(+), 65 deletions(-)
---
base-commit: dd5a440a31fae6e459c0d6271dddd62825505361
change-id: 20240529-fuse-uring-for-6-9-rfc2-out-f0a009005fdf

Best regards,
-- 
Bernd Schubert <bschubert@ddn.com>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 21:09   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 02/19] fuse: Move fuse_get_dev to header file Bernd Schubert
                   ` (21 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This function is needed by fuse_uring.c to clean ring queues,
so make it non static. Especially in non-static mode the function
name 'end_requests' should be prefixed with fuse_

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c        |  7 ++++---
 fs/fuse/fuse_dev_i.h | 15 +++++++++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3ec8bb5e68ff..5cd456e55d80 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -7,6 +7,7 @@
 */
 
 #include "fuse_i.h"
+#include "fuse_dev_i.h"
 
 #include <linux/init.h>
 #include <linux/module.h>
@@ -2135,7 +2136,7 @@ static __poll_t fuse_dev_poll(struct file *file, poll_table *wait)
 }
 
 /* Abort all requests on the given list (pending or processing) */
-static void end_requests(struct list_head *head)
+void fuse_dev_end_requests(struct list_head *head)
 {
 	while (!list_empty(head)) {
 		struct fuse_req *req;
@@ -2238,7 +2239,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		wake_up_all(&fc->blocked_waitq);
 		spin_unlock(&fc->lock);
 
-		end_requests(&to_end);
+		fuse_dev_end_requests(&to_end);
 	} else {
 		spin_unlock(&fc->lock);
 	}
@@ -2268,7 +2269,7 @@ int fuse_dev_release(struct inode *inode, struct file *file)
 			list_splice_init(&fpq->processing[i], &to_end);
 		spin_unlock(&fpq->lock);
 
-		end_requests(&to_end);
+		fuse_dev_end_requests(&to_end);
 
 		/* Are we the last open device? */
 		if (atomic_dec_and_test(&fc->dev_count)) {
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
new file mode 100644
index 000000000000..5a1b8a2775d8
--- /dev/null
+++ b/fs/fuse/fuse_dev_i.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2001-2008  Miklos Szeredi <miklos@szeredi.hu>
+ */
+#ifndef _FS_FUSE_DEV_I_H
+#define _FS_FUSE_DEV_I_H
+
+#include <linux/types.h>
+
+void fuse_dev_end_requests(struct list_head *head);
+
+#endif
+
+

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static
  2024-05-29 18:00 ` [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
@ 2024-05-29 21:09   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-29 21:09 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:36PM +0200, Bernd Schubert wrote:
> This function is needed by fuse_uring.c to clean ring queues,
> so make it non static. Especially in non-static mode the function
> name 'end_requests' should be prefixed with fuse_
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 02/19] fuse: Move fuse_get_dev to header file
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
  2024-05-29 18:00 ` [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 21:09   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 03/19] fuse: Move request bits Bernd Schubert
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

Another preparation patch, as this function will be needed by
fuse/dev.c and fuse/dev_uring.c.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c        | 9 ---------
 fs/fuse/fuse_dev_i.h | 9 +++++++++
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5cd456e55d80..3317942b211c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -32,15 +32,6 @@ MODULE_ALIAS("devname:fuse");
 
 static struct kmem_cache *fuse_req_cachep;
 
-static struct fuse_dev *fuse_get_dev(struct file *file)
-{
-	/*
-	 * Lockless access is OK, because file->private data is set
-	 * once during mount and is valid until the file is released.
-	 */
-	return READ_ONCE(file->private_data);
-}
-
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
 {
 	INIT_LIST_HEAD(&req->list);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 5a1b8a2775d8..b38e67b3f889 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -8,6 +8,15 @@
 
 #include <linux/types.h>
 
+static inline struct fuse_dev *fuse_get_dev(struct file *file)
+{
+	/*
+	 * Lockless access is OK, because file->private data is set
+	 * once during mount and is valid until the file is released.
+	 */
+	return READ_ONCE(file->private_data);
+}
+
 void fuse_dev_end_requests(struct list_head *head);
 
 #endif

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 02/19] fuse: Move fuse_get_dev to header file
  2024-05-29 18:00 ` [PATCH RFC v2 02/19] fuse: Move fuse_get_dev to header file Bernd Schubert
@ 2024-05-29 21:09   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-29 21:09 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:37PM +0200, Bernd Schubert wrote:
> Another preparation patch, as this function will be needed by
> fuse/dev.c and fuse/dev_uring.c.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 03/19] fuse: Move request bits
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
  2024-05-29 18:00 ` [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
  2024-05-29 18:00 ` [PATCH RFC v2 02/19] fuse: Move fuse_get_dev to header file Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 21:10   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation Bernd Schubert
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

These are needed by dev_uring functions as well

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c        | 4 ----
 fs/fuse/fuse_dev_i.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3317942b211c..b98ecb197a28 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -26,10 +26,6 @@
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
 
-/* Ordinary requests have even IDs, while interrupts IDs are odd */
-#define FUSE_INT_REQ_BIT (1ULL << 0)
-#define FUSE_REQ_ID_STEP (1ULL << 1)
-
 static struct kmem_cache *fuse_req_cachep;
 
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index b38e67b3f889..6c506f040d5f 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -8,6 +8,10 @@
 
 #include <linux/types.h>
 
+/* Ordinary requests have even IDs, while interrupts IDs are odd */
+#define FUSE_INT_REQ_BIT (1ULL << 0)
+#define FUSE_REQ_ID_STEP (1ULL << 1)
+
 static inline struct fuse_dev *fuse_get_dev(struct file *file)
 {
 	/*

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 03/19] fuse: Move request bits
  2024-05-29 18:00 ` [PATCH RFC v2 03/19] fuse: Move request bits Bernd Schubert
@ 2024-05-29 21:10   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-29 21:10 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:38PM +0200, Bernd Schubert wrote:
> These are needed by dev_uring functions as well
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (2 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 03/19] fuse: Move request bits Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 21:17   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 05/19] fuse: Add a uring config ioctl Bernd Schubert
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 Documentation/filesystems/fuse-io-uring.rst | 167 ++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)

diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
new file mode 100644
index 000000000000..4aa168e3b229
--- /dev/null
+++ b/Documentation/filesystems/fuse-io-uring.rst
@@ -0,0 +1,167 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+FUSE Uring design documentation
+==============================
+
+This documentation covers basic details how the fuse
+kernel/userspace communication through uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now not all requests types are supported through uring, userspace
+side is required to also handle requests through /dev/fuse after
+uring setup is complete. These are especially notifications (initiated
+from daemon side), interrupts and forgets.
+Interrupts are probably not working at all when uring is used. At least
+current state of libfuse will not be able to handle those for requests
+on ring queues.
+All these limitation will be addressed later.
+
+Fuse uring configuration
+========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface - until uring setup is complete.
+
+In order to set up fuse-over-io-uring userspace has to send ioctls,
+mmap requests in the right order
+
+1) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_RING_CFG
+
+First the basic kernel data structure has to be set up, using
+FUSE_DEV_IOC_URING with subcommand FUSE_URING_IOCTL_CMD_RING_CFG.
+
+Example (from libfuse)
+
+static int fuse_uring_setup_kernel_ring(int session_fd,
+					int nr_queues, int sync_qdepth,
+					int async_qdepth, int req_arg_len,
+					int req_alloc_sz)
+{
+	int rc;
+
+	struct fuse_ring_config rconf = {
+		.nr_queues		    = nr_queues,
+		.sync_queue_depth	= sync_qdepth,
+		.async_queue_depth	= async_qdepth,
+		.req_arg_len		= req_arg_len,
+		.user_req_buf_sz	= req_alloc_sz,
+		.numa_aware		    = nr_queues > 1,
+	};
+
+	struct fuse_uring_cfg ioc_cfg = {
+		.flags = 0,
+		.cmd = FUSE_URING_IOCTL_CMD_RING_CFG,
+		.rconf = rconf,
+	};
+
+	rc = ioctl(session_fd, FUSE_DEV_IOC_URING, &ioc_cfg);
+	if (rc)
+		rc = -errno;
+
+	return rc;
+}
+
+2) MMAP
+
+For shared memory communication between kernel and userspace
+each queue has to allocate and map memory buffer.
+For numa awares kernel side verifies if the allocating thread
+is bound to a single core - in general kernel side has expectations
+that only a single thread accesses a queue and for numa aware
+memory alloation the core of the thread sending the mmap request
+is used to identify the numa node.
+
+The offsset parameter has to be FUSE_URING_MMAP_OFF to identify
+it is a request concerning fuse-over-io-uring.
+
+3) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_QUEUE_CFG
+
+This ioctl has to be send for every queue and takes the queue-id (qid)
+and memory address obtained by mmap to set up queue data structures.
+
+Kernel - userspace interface using uring
+========================================
+
+After queue ioctl setup and memory mapping userspace submits
+SQEs (opcode = IORING_OP_URING_CMD) in order to fetch
+fuse requests. Initial submit is with the sub command
+FUSE_URING_REQ_FETCH, which will just register entries
+to be available on the kernel side - it sets the according
+entry state and marks the entry as available in the queue bitmap.
+
+Once all entries for all queues are submitted kernel side starts
+to enqueue to ring queue(s). The request is copied into the shared
+memory queue entry buffer and submitted as CQE to the userspace
+side.
+Userspace side handles the CQE and submits the result as subcommand
+FUSE_URING_REQ_COMMIT_AND_FETCH - kernel side does completes the requests
+and also marks the queue entry as available again. If there are
+pending requests waiting the request will be immediately submitted
+to userspace again.
+
+Initial SQE
+-----------
+
+ |                                    |  FUSE filesystem daemon
+ |                                    |
+ |                                    |  >io_uring_submit()
+ |                                    |   IORING_OP_URING_CMD /
+ |                                    |   FUSE_URING_REQ_FETCH
+ |                                    |  [wait cqe]
+ |                                    |   >io_uring_wait_cqe() or
+ |                                    |   >io_uring_submit_and_wait()
+ |                                    |
+ |  >fuse_uring_cmd()                 |
+ |   >fuse_uring_fetch()              |
+ |    >fuse_uring_ent_release()       |
+
+
+Sending requests with CQEs
+--------------------------
+
+ |                                         |  FUSE filesystem daemon
+ |                                         |  [waiting for CQEs]
+ |  "rm /mnt/fuse/file"                    |
+ |                                         |
+ |  >sys_unlink()                          |
+ |    >fuse_unlink()                       |
+ |      [allocate request]                 |
+ |      >__fuse_request_send()             |
+ |        ...                              |
+ |       >fuse_uring_queue_fuse_req        |
+ |        [queue request on fg or          |
+ |          bg queue]                      |
+ |         >fuse_uring_assign_ring_entry() |
+ |         >fuse_uring_send_to_ring()      |
+ |          >fuse_uring_copy_to_ring()     |
+ |          >io_uring_cmd_done()           |
+ |          >request_wait_answer()         |
+ |           [sleep on req->waitq]         |
+ |                                         |  [receives and handles CQE]
+ |                                         |  [submit result and fetch next]
+ |                                         |  >io_uring_submit()
+ |                                         |   IORING_OP_URING_CMD/
+ |                                         |   FUSE_URING_REQ_COMMIT_AND_FETCH
+ |  >fuse_uring_cmd()                      |
+ |   >fuse_uring_commit_and_release()      |
+ |    >fuse_uring_copy_from_ring()         |
+ |     [ copy the result to the fuse req]  |
+ |     >fuse_uring_req_end_and_get_next()  |
+ |      >fuse_request_end()                |
+ |       [wake up req->waitq]              |
+ |      >fuse_uring_ent_release_and_fetch()|
+ |       [wait or handle next req]         |
+ |                                         |
+ |                                         |
+ |       [req->waitq woken up]             |
+ |    <fuse_unlink()                       |
+ |  <sys_unlink()                          |
+
+
+

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation
  2024-05-29 18:00 ` [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation Bernd Schubert
@ 2024-05-29 21:17   ` Josef Bacik
  2024-05-30 12:50     ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-29 21:17 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:39PM +0200, Bernd Schubert wrote:
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  Documentation/filesystems/fuse-io-uring.rst | 167 ++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
> 
> diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
> new file mode 100644
> index 000000000000..4aa168e3b229
> --- /dev/null
> +++ b/Documentation/filesystems/fuse-io-uring.rst
> @@ -0,0 +1,167 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +FUSE Uring design documentation
> +==============================
> +
> +This documentation covers basic details how the fuse
> +kernel/userspace communication through uring is configured
> +and works. For generic details about FUSE see fuse.rst.
> +
> +This document also covers the current interface, which is
> +still in development and might change.
> +
> +Limitations
> +===========
> +As of now not all requests types are supported through uring, userspace

s/userspace side/userspace/

> +side is required to also handle requests through /dev/fuse after
> +uring setup is complete. These are especially notifications (initiated

especially is an awkward word choice here, I'm not quite sure what you're trying
say here, perhaps

"Specifically notifications (initiated from the daemon side), interrupts and
forgets"

?

> +from daemon side), interrupts and forgets.
> +Interrupts are probably not working at all when uring is used. At least
> +current state of libfuse will not be able to handle those for requests
> +on ring queues.
> +All these limitation will be addressed later.
> +
> +Fuse uring configuration
> +========================
> +
> +Fuse kernel requests are queued through the classical /dev/fuse
> +read/write interface - until uring setup is complete.
> +
> +In order to set up fuse-over-io-uring userspace has to send ioctls,
> +mmap requests in the right order
> +
> +1) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_RING_CFG
> +
> +First the basic kernel data structure has to be set up, using
> +FUSE_DEV_IOC_URING with subcommand FUSE_URING_IOCTL_CMD_RING_CFG.
> +
> +Example (from libfuse)
> +
> +static int fuse_uring_setup_kernel_ring(int session_fd,
> +					int nr_queues, int sync_qdepth,
> +					int async_qdepth, int req_arg_len,
> +					int req_alloc_sz)
> +{
> +	int rc;
> +
> +	struct fuse_ring_config rconf = {
> +		.nr_queues		    = nr_queues,
> +		.sync_queue_depth	= sync_qdepth,
> +		.async_queue_depth	= async_qdepth,
> +		.req_arg_len		= req_arg_len,
> +		.user_req_buf_sz	= req_alloc_sz,
> +		.numa_aware		    = nr_queues > 1,
> +	};
> +
> +	struct fuse_uring_cfg ioc_cfg = {
> +		.flags = 0,
> +		.cmd = FUSE_URING_IOCTL_CMD_RING_CFG,
> +		.rconf = rconf,
> +	};
> +
> +	rc = ioctl(session_fd, FUSE_DEV_IOC_URING, &ioc_cfg);
> +	if (rc)
> +		rc = -errno;
> +
> +	return rc;
> +}
> +
> +2) MMAP
> +
> +For shared memory communication between kernel and userspace
> +each queue has to allocate and map memory buffer.
> +For numa awares kernel side verifies if the allocating thread

This bit is awkwardly worded and there's some spelling mistakes.  Perhaps
something like this?

"For numa aware kernels, the kernel verifies that the allocating thread is bound
to a single core, as the kernel has the expectation that only a single thread
accesses a queue, and for numa aware memory allocation the core of the thread
sending the mmap request is used to identify the numa node"

> +is bound to a single core - in general kernel side has expectations
> +that only a single thread accesses a queue and for numa aware
> +memory alloation the core of the thread sending the mmap request
> +is used to identify the numa node.
> +
> +The offsset parameter has to be FUSE_URING_MMAP_OFF to identify
       ^^^^ "offset"

> +it is a request concerning fuse-over-io-uring.
> +
> +3) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_QUEUE_CFG
> +
> +This ioctl has to be send for every queue and takes the queue-id (qid)
                        ^^^^ "sent"

> +and memory address obtained by mmap to set up queue data structures.
> +
> +Kernel - userspace interface using uring
> +========================================
> +
> +After queue ioctl setup and memory mapping userspace submits

This needs a comma, so

"After queue ioctl setup and memory mapping, userspace submites"

> +SQEs (opcode = IORING_OP_URING_CMD) in order to fetch
> +fuse requests. Initial submit is with the sub command
> +FUSE_URING_REQ_FETCH, which will just register entries
> +to be available on the kernel side - it sets the according

s/according/associated/ maybe?

> +entry state and marks the entry as available in the queue bitmap.
> +
> +Once all entries for all queues are submitted kernel side starts
> +to enqueue to ring queue(s). The request is copied into the shared
> +memory queue entry buffer and submitted as CQE to the userspace
> +side.
> +Userspace side handles the CQE and submits the result as subcommand
> +FUSE_URING_REQ_COMMIT_AND_FETCH - kernel side does completes the requests

"the kernel completes the request"

Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation
  2024-05-29 21:17   ` Josef Bacik
@ 2024-05-30 12:50     ` Bernd Schubert
  2024-05-30 14:59       ` Josef Bacik
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 12:50 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert; +Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/29/24 23:17, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:39PM +0200, Bernd Schubert wrote:
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  Documentation/filesystems/fuse-io-uring.rst | 167 ++++++++++++++++++++++++++++
>>  1 file changed, 167 insertions(+)
>>
>> diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
>> new file mode 100644
>> index 000000000000..4aa168e3b229
>> --- /dev/null
>> +++ b/Documentation/filesystems/fuse-io-uring.rst
>> @@ -0,0 +1,167 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +===============================
>> +FUSE Uring design documentation
>> +==============================
>> +
>> +This documentation covers basic details how the fuse
>> +kernel/userspace communication through uring is configured
>> +and works. For generic details about FUSE see fuse.rst.
>> +
>> +This document also covers the current interface, which is
>> +still in development and might change.
>> +
>> +Limitations
>> +===========
>> +As of now not all requests types are supported through uring, userspace
> 
> s/userspace side/userspace/
> 
>> +side is required to also handle requests through /dev/fuse after
>> +uring setup is complete. These are especially notifications (initiated
> 
> especially is an awkward word choice here, I'm not quite sure what you're trying
> say here, perhaps
> 
> "Specifically notifications (initiated from the daemon side), interrupts and
> forgets"

Yep, thanks a lot! I removed forgets", these should be working over the ring 
in the mean time.

> 
> ?
> 
>> +from daemon side), interrupts and forgets.
>> +Interrupts are probably not working at all when uring is used. At least
>> +current state of libfuse will not be able to handle those for requests
>> +on ring queues.
>> +All these limitation will be addressed later.
>> +
>> +Fuse uring configuration
>> +========================
>> +
>> +Fuse kernel requests are queued through the classical /dev/fuse
>> +read/write interface - until uring setup is complete.
>> +
>> +In order to set up fuse-over-io-uring userspace has to send ioctls,
>> +mmap requests in the right order
>> +
>> +1) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_RING_CFG
>> +
>> +First the basic kernel data structure has to be set up, using
>> +FUSE_DEV_IOC_URING with subcommand FUSE_URING_IOCTL_CMD_RING_CFG.
>> +
>> +Example (from libfuse)
>> +
>> +static int fuse_uring_setup_kernel_ring(int session_fd,
>> +					int nr_queues, int sync_qdepth,
>> +					int async_qdepth, int req_arg_len,
>> +					int req_alloc_sz)
>> +{
>> +	int rc;
>> +
>> +	struct fuse_ring_config rconf = {
>> +		.nr_queues		    = nr_queues,
>> +		.sync_queue_depth	= sync_qdepth,
>> +		.async_queue_depth	= async_qdepth,
>> +		.req_arg_len		= req_arg_len,
>> +		.user_req_buf_sz	= req_alloc_sz,
>> +		.numa_aware		    = nr_queues > 1,
>> +	};
>> +
>> +	struct fuse_uring_cfg ioc_cfg = {
>> +		.flags = 0,
>> +		.cmd = FUSE_URING_IOCTL_CMD_RING_CFG,
>> +		.rconf = rconf,
>> +	};
>> +
>> +	rc = ioctl(session_fd, FUSE_DEV_IOC_URING, &ioc_cfg);
>> +	if (rc)
>> +		rc = -errno;
>> +
>> +	return rc;
>> +}
>> +
>> +2) MMAP
>> +
>> +For shared memory communication between kernel and userspace
>> +each queue has to allocate and map memory buffer.
>> +For numa awares kernel side verifies if the allocating thread
> 
> This bit is awkwardly worded and there's some spelling mistakes.  Perhaps
> something like this?
> 
> "For numa aware kernels, the kernel verifies that the allocating thread is bound
> to a single core, as the kernel has the expectation that only a single thread
> accesses a queue, and for numa aware memory allocation the core of the thread
> sending the mmap request is used to identify the numa node"

Thank you, updated. I actually consider to reduce this to a warning (will try 
to add an async FUSE_WARN request type for this and others). Issue is that
systems cannot set up fuse-uring when a core is disabled. 

> 
>> +is bound to a single core - in general kernel side has expectations
>> +that only a single thread accesses a queue and for numa aware
>> +memory alloation the core of the thread sending the mmap request
>> +is used to identify the numa node.
>> +
>> +The offsset parameter has to be FUSE_URING_MMAP_OFF to identify
>        ^^^^ "offset"


Fixed.

> 
>> +it is a request concerning fuse-over-io-uring.
>> +
>> +3) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_QUEUE_CFG
>> +
>> +This ioctl has to be send for every queue and takes the queue-id (qid)
>                         ^^^^ "sent"
> 
>> +and memory address obtained by mmap to set up queue data structures.
>> +
>> +Kernel - userspace interface using uring
>> +========================================
>> +
>> +After queue ioctl setup and memory mapping userspace submits
> 
> This needs a comma, so
> 
> "After queue ioctl setup and memory mapping, userspace submites"
> 
>> +SQEs (opcode = IORING_OP_URING_CMD) in order to fetch
>> +fuse requests. Initial submit is with the sub command
>> +FUSE_URING_REQ_FETCH, which will just register entries
>> +to be available on the kernel side - it sets the according
> 
> s/according/associated/ maybe?
> 
>> +entry state and marks the entry as available in the queue bitmap.

Or maybe like this?

Initial submit is with the sub command FUSE_URING_REQ_FETCH, which 
will just register entries to be available in the kernel.


>> +
>> +Once all entries for all queues are submitted kernel side starts
>> +to enqueue to ring queue(s). The request is copied into the shared
>> +memory queue entry buffer and submitted as CQE to the userspace
>> +side.
>> +Userspace side handles the CQE and submits the result as subcommand
>> +FUSE_URING_REQ_COMMIT_AND_FETCH - kernel side does completes the requests
> 
> "the kernel completes the request"

Yeah, now I see the bad grammar myself. Updated to


Once all entries for all queues are submitted, kernel starts
to enqueue to ring queues. The request is copied into the shared
memory buffer and submitted as CQE to the daemon.
Userspace handles the CQE/fuse-request and submits the result as
subcommand FUSE_URING_REQ_COMMIT_AND_FETCH - kernel completes
the requests and also marks the entry available again. If there are
pending requests waiting the request will be immediately submitted
to the daemon again.



Thank you very much for your help to phrase this better!



Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation
  2024-05-30 12:50     ` Bernd Schubert
@ 2024-05-30 14:59       ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 14:59 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel

On Thu, May 30, 2024 at 02:50:30PM +0200, Bernd Schubert wrote:
> 
> 
> On 5/29/24 23:17, Josef Bacik wrote:
> > On Wed, May 29, 2024 at 08:00:39PM +0200, Bernd Schubert wrote:
> >> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> >> ---
> >>  Documentation/filesystems/fuse-io-uring.rst | 167 ++++++++++++++++++++++++++++
> >>  1 file changed, 167 insertions(+)
> >>
> >> diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
> >> new file mode 100644
> >> index 000000000000..4aa168e3b229
> >> --- /dev/null
> >> +++ b/Documentation/filesystems/fuse-io-uring.rst
> >> @@ -0,0 +1,167 @@
> >> +.. SPDX-License-Identifier: GPL-2.0
> >> +
> >> +===============================
> >> +FUSE Uring design documentation
> >> +==============================
> >> +
> >> +This documentation covers basic details how the fuse
> >> +kernel/userspace communication through uring is configured
> >> +and works. For generic details about FUSE see fuse.rst.
> >> +
> >> +This document also covers the current interface, which is
> >> +still in development and might change.
> >> +
> >> +Limitations
> >> +===========
> >> +As of now not all requests types are supported through uring, userspace
> > 
> > s/userspace side/userspace/
> > 
> >> +side is required to also handle requests through /dev/fuse after
> >> +uring setup is complete. These are especially notifications (initiated
> > 
> > especially is an awkward word choice here, I'm not quite sure what you're trying
> > say here, perhaps
> > 
> > "Specifically notifications (initiated from the daemon side), interrupts and
> > forgets"
> 
> Yep, thanks a lot! I removed forgets", these should be working over the ring 
> in the mean time.
> 
> > 
> > ?
> > 
> >> +from daemon side), interrupts and forgets.
> >> +Interrupts are probably not working at all when uring is used. At least
> >> +current state of libfuse will not be able to handle those for requests
> >> +on ring queues.
> >> +All these limitation will be addressed later.
> >> +
> >> +Fuse uring configuration
> >> +========================
> >> +
> >> +Fuse kernel requests are queued through the classical /dev/fuse
> >> +read/write interface - until uring setup is complete.
> >> +
> >> +In order to set up fuse-over-io-uring userspace has to send ioctls,
> >> +mmap requests in the right order
> >> +
> >> +1) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_RING_CFG
> >> +
> >> +First the basic kernel data structure has to be set up, using
> >> +FUSE_DEV_IOC_URING with subcommand FUSE_URING_IOCTL_CMD_RING_CFG.
> >> +
> >> +Example (from libfuse)
> >> +
> >> +static int fuse_uring_setup_kernel_ring(int session_fd,
> >> +					int nr_queues, int sync_qdepth,
> >> +					int async_qdepth, int req_arg_len,
> >> +					int req_alloc_sz)
> >> +{
> >> +	int rc;
> >> +
> >> +	struct fuse_ring_config rconf = {
> >> +		.nr_queues		    = nr_queues,
> >> +		.sync_queue_depth	= sync_qdepth,
> >> +		.async_queue_depth	= async_qdepth,
> >> +		.req_arg_len		= req_arg_len,
> >> +		.user_req_buf_sz	= req_alloc_sz,
> >> +		.numa_aware		    = nr_queues > 1,
> >> +	};
> >> +
> >> +	struct fuse_uring_cfg ioc_cfg = {
> >> +		.flags = 0,
> >> +		.cmd = FUSE_URING_IOCTL_CMD_RING_CFG,
> >> +		.rconf = rconf,
> >> +	};
> >> +
> >> +	rc = ioctl(session_fd, FUSE_DEV_IOC_URING, &ioc_cfg);
> >> +	if (rc)
> >> +		rc = -errno;
> >> +
> >> +	return rc;
> >> +}
> >> +
> >> +2) MMAP
> >> +
> >> +For shared memory communication between kernel and userspace
> >> +each queue has to allocate and map memory buffer.
> >> +For numa awares kernel side verifies if the allocating thread
> > 
> > This bit is awkwardly worded and there's some spelling mistakes.  Perhaps
> > something like this?
> > 
> > "For numa aware kernels, the kernel verifies that the allocating thread is bound
> > to a single core, as the kernel has the expectation that only a single thread
> > accesses a queue, and for numa aware memory allocation the core of the thread
> > sending the mmap request is used to identify the numa node"
> 
> Thank you, updated. I actually consider to reduce this to a warning (will try 
> to add an async FUSE_WARN request type for this and others). Issue is that
> systems cannot set up fuse-uring when a core is disabled. 
> 
> > 
> >> +is bound to a single core - in general kernel side has expectations
> >> +that only a single thread accesses a queue and for numa aware
> >> +memory alloation the core of the thread sending the mmap request
> >> +is used to identify the numa node.
> >> +
> >> +The offsset parameter has to be FUSE_URING_MMAP_OFF to identify
> >        ^^^^ "offset"
> 
> 
> Fixed.
> 
> > 
> >> +it is a request concerning fuse-over-io-uring.
> >> +
> >> +3) FUSE_DEV_IOC_URING ioctl with FUSE_URING_IOCTL_CMD_QUEUE_CFG
> >> +
> >> +This ioctl has to be send for every queue and takes the queue-id (qid)
> >                         ^^^^ "sent"
> > 
> >> +and memory address obtained by mmap to set up queue data structures.
> >> +
> >> +Kernel - userspace interface using uring
> >> +========================================
> >> +
> >> +After queue ioctl setup and memory mapping userspace submits
> > 
> > This needs a comma, so
> > 
> > "After queue ioctl setup and memory mapping, userspace submites"
> > 
> >> +SQEs (opcode = IORING_OP_URING_CMD) in order to fetch
> >> +fuse requests. Initial submit is with the sub command
> >> +FUSE_URING_REQ_FETCH, which will just register entries
> >> +to be available on the kernel side - it sets the according
> > 
> > s/according/associated/ maybe?
> > 
> >> +entry state and marks the entry as available in the queue bitmap.
> 
> Or maybe like this?
> 
> Initial submit is with the sub command FUSE_URING_REQ_FETCH, which 
> will just register entries to be available in the kernel.
> 
> 
> >> +
> >> +Once all entries for all queues are submitted kernel side starts
> >> +to enqueue to ring queue(s). The request is copied into the shared
> >> +memory queue entry buffer and submitted as CQE to the userspace
> >> +side.
> >> +Userspace side handles the CQE and submits the result as subcommand
> >> +FUSE_URING_REQ_COMMIT_AND_FETCH - kernel side does completes the requests
> > 
> > "the kernel completes the request"
> 
> Yeah, now I see the bad grammar myself. Updated to
> 
> 
> Once all entries for all queues are submitted, kernel starts
> to enqueue to ring queues. The request is copied into the shared
> memory buffer and submitted as CQE to the daemon.
> Userspace handles the CQE/fuse-request and submits the result as
> subcommand FUSE_URING_REQ_COMMIT_AND_FETCH - kernel completes
> the requests and also marks the entry available again. If there are
> pending requests waiting the request will be immediately submitted
> to the daemon again.
> 
> 
> 
> Thank you very much for your help to phrase this better!
> 

This all looks great, thanks!

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 05/19] fuse: Add a uring config ioctl
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (3 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 21:24   ` Josef Bacik
  2024-06-03 13:03   ` Miklos Szeredi
  2024-05-29 18:00 ` [PATCH RFC v2 06/19] Add a vmalloc_node_user function Bernd Schubert
                   ` (17 subsequent siblings)
  22 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This only adds the initial ioctl for basic fuse-uring initialization.
More ioctl types will be added later to initialize queues.

This also adds data structures needed or initialized by the ioctl
command and that will be used later.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/Kconfig           |  12 +++
 fs/fuse/Makefile          |   1 +
 fs/fuse/dev.c             |  91 ++++++++++++++++--
 fs/fuse/dev_uring.c       | 122 +++++++++++++++++++++++
 fs/fuse/dev_uring_i.h     | 239 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_dev_i.h      |   1 +
 fs/fuse/fuse_i.h          |   5 +
 fs/fuse/inode.c           |   3 +
 include/uapi/linux/fuse.h |  73 ++++++++++++++
 9 files changed, 538 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 8674dbfbe59d..11f37cefc94b 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -63,3 +63,15 @@ config FUSE_PASSTHROUGH
 	  to be performed directly on a backing file.
 
 	  If you want to allow passthrough operations, answer Y.
+
+config FUSE_IO_URING
+	bool "FUSE communication over io-uring"
+	default y
+	depends on FUSE_FS
+	depends on IO_URING
+	help
+	  This allows sending FUSE requests over the IO uring interface and
+          also adds request core affinity.
+
+	  If you want to allow fuse server/client communication through io-uring,
+	  answer Y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 6e0228c6d0cb..7193a14374fd 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -11,5 +11,6 @@ fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
 fuse-y += iomode.o
 fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
+fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
 
 virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index b98ecb197a28..bc77413932cf 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -8,6 +8,7 @@
 
 #include "fuse_i.h"
 #include "fuse_dev_i.h"
+#include "dev_uring_i.h"
 
 #include <linux/init.h>
 #include <linux/module.h>
@@ -26,6 +27,13 @@
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
 
+#if IS_ENABLED(CONFIG_FUSE_IO_URING)
+static bool __read_mostly enable_uring;
+module_param(enable_uring, bool, 0644);
+MODULE_PARM_DESC(enable_uring,
+		 "Enable uring userspace communication through uring.");
+#endif
+
 static struct kmem_cache *fuse_req_cachep;
 
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
@@ -2297,16 +2305,12 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
 	return 0;
 }
 
-static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
+static long _fuse_dev_ioctl_clone(struct file *file, int oldfd)
 {
 	int res;
-	int oldfd;
 	struct fuse_dev *fud = NULL;
 	struct fd f;
 
-	if (get_user(oldfd, argp))
-		return -EFAULT;
-
 	f = fdget(oldfd);
 	if (!f.file)
 		return -EINVAL;
@@ -2329,6 +2333,16 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
 	return res;
 }
 
+static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
+{
+	int oldfd;
+
+	if (get_user(oldfd, argp))
+		return -EFAULT;
+
+	return _fuse_dev_ioctl_clone(file, oldfd);
+}
+
 static long fuse_dev_ioctl_backing_open(struct file *file,
 					struct fuse_backing_map __user *argp)
 {
@@ -2364,8 +2378,65 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
 	return fuse_backing_close(fud->fc, backing_id);
 }
 
-static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
-			   unsigned long arg)
+/**
+ * Configure the queue for the given qid. First call will also initialize
+ * the ring for this connection.
+ */
+static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
+{
+#if IS_ENABLED(CONFIG_FUSE_IO_URING)
+	int res;
+	struct fuse_uring_cfg cfg;
+	struct fuse_dev *fud;
+	struct fuse_conn *fc;
+	struct fuse_ring *ring;
+
+	res = copy_from_user(&cfg, (void *)argp, sizeof(cfg));
+	if (res != 0)
+		return -EFAULT;
+
+	fud = fuse_get_dev(file);
+	if (fud == NULL)
+		return -ENODEV;
+	fc = fud->fc;
+
+	switch (cfg.cmd) {
+	case FUSE_URING_IOCTL_CMD_RING_CFG:
+		if (READ_ONCE(fc->ring) == NULL)
+			ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL);
+
+		spin_lock(&fc->lock);
+		if (fc->ring == NULL) {
+			fc->ring = ring;
+			fuse_uring_conn_init(fc->ring, fc);
+		} else {
+			kfree(ring);
+		}
+
+		spin_unlock(&fc->lock);
+		if (fc->ring == NULL)
+			return -ENOMEM;
+
+		mutex_lock(&fc->ring->start_stop_lock);
+		res = fuse_uring_conn_cfg(fc->ring, &cfg.rconf);
+		mutex_unlock(&fc->ring->start_stop_lock);
+
+		if (res != 0)
+			return res;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+		return res;
+#else
+	return -ENOTTY;
+#endif
+}
+
+static long
+fuse_dev_ioctl(struct file *file, unsigned int cmd,
+	       unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
 
@@ -2379,8 +2450,10 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 	case FUSE_DEV_IOC_BACKING_CLOSE:
 		return fuse_dev_ioctl_backing_close(file, argp);
 
-	default:
-		return -ENOTTY;
+	case FUSE_DEV_IOC_URING:
+		return fuse_uring_ioctl(file, argp);
+
+	default: return -ENOTTY;
 	}
 }
 
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
new file mode 100644
index 000000000000..702a994cf192
--- /dev/null
+++ b/fs/fuse/dev_uring.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FUSE: Filesystem in Userspace
+ * Copyright (c) 2023-2024 DataDirect Networks.
+ */
+
+#include "fuse_i.h"
+#include "fuse_dev_i.h"
+#include "dev_uring_i.h"
+
+#include "linux/compiler_types.h"
+#include "linux/spinlock.h"
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/sched/signal.h>
+#include <linux/uio.h>
+#include <linux/miscdevice.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/swap.h>
+#include <linux/splice.h>
+#include <linux/sched.h>
+#include <linux/io_uring.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/io_uring.h>
+#include <linux/io_uring/cmd.h>
+#include <linux/topology.h>
+#include <linux/io_uring/cmd.h>
+
+/*
+ * Basic ring setup for this connection based on the provided configuration
+ */
+int fuse_uring_conn_cfg(struct fuse_ring *ring, struct fuse_ring_config *rcfg)
+{
+	size_t queue_sz;
+
+	if (ring->configured) {
+		pr_info("The ring is already configured.\n");
+		return -EALREADY;
+	}
+
+	if (rcfg->nr_queues == 0) {
+		pr_info("zero number of queues is invalid.\n");
+		return -EINVAL;
+	}
+
+	if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
+		pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
+			rcfg->nr_queues, num_present_cpus());
+		return -EINVAL;
+	}
+
+	if (rcfg->req_arg_len < FUSE_RING_MIN_IN_OUT_ARG_SIZE) {
+		pr_info("Per req buffer size too small (%d), min: %d\n",
+			rcfg->req_arg_len, FUSE_RING_MIN_IN_OUT_ARG_SIZE);
+		return -EINVAL;
+	}
+
+	if (WARN_ON(ring->queues))
+		return -EINVAL;
+
+	ring->numa_aware = rcfg->numa_aware;
+	ring->nr_queues = rcfg->nr_queues;
+	ring->per_core_queue = rcfg->nr_queues > 1;
+
+	ring->max_nr_sync = rcfg->sync_queue_depth;
+	ring->max_nr_async = rcfg->async_queue_depth;
+	ring->queue_depth = ring->max_nr_sync + ring->max_nr_async;
+
+	ring->req_arg_len = rcfg->req_arg_len;
+	ring->req_buf_sz = rcfg->user_req_buf_sz;
+
+	ring->queue_buf_size = ring->req_buf_sz * ring->queue_depth;
+
+	queue_sz = sizeof(*ring->queues) +
+		   ring->queue_depth * sizeof(struct fuse_ring_ent);
+	ring->queues = kcalloc(rcfg->nr_queues, queue_sz, GFP_KERNEL);
+	if (!ring->queues)
+		return -ENOMEM;
+	ring->queue_size = queue_sz;
+	ring->configured = 1;
+
+	atomic_set(&ring->queue_refs, 0);
+
+	return 0;
+}
+
+void fuse_uring_ring_destruct(struct fuse_ring *ring)
+{
+	unsigned int qid;
+	struct rb_node *rbn;
+
+	for (qid = 0; qid < ring->nr_queues; qid++) {
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+		vfree(queue->queue_req_buf);
+	}
+
+	kfree(ring->queues);
+	ring->queues = NULL;
+	ring->nr_queues_ioctl_init = 0;
+	ring->queue_depth = 0;
+	ring->nr_queues = 0;
+
+	rbn = rb_first(&ring->mem_buf_map);
+	while (rbn) {
+		struct rb_node *next = rb_next(rbn);
+		struct fuse_uring_mbuf *entry =
+			rb_entry(rbn, struct fuse_uring_mbuf, rb_node);
+
+		rb_erase(rbn, &ring->mem_buf_map);
+		kfree(entry);
+
+		rbn = next;
+	}
+
+	mutex_destroy(&ring->start_stop_lock);
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
new file mode 100644
index 000000000000..58ab4671deff
--- /dev/null
+++ b/fs/fuse/dev_uring_i.h
@@ -0,0 +1,239 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * FUSE: Filesystem in Userspace
+ * Copyright (c) 2023-2024 DataDirect Networks.
+ */
+
+#ifndef _FS_FUSE_DEV_URING_I_H
+#define _FS_FUSE_DEV_URING_I_H
+
+#include "fuse_i.h"
+#include "linux/compiler_types.h"
+#include "linux/rbtree_types.h"
+
+#if IS_ENABLED(CONFIG_FUSE_IO_URING)
+
+/* IORING_MAX_ENTRIES */
+#define FUSE_URING_MAX_QUEUE_DEPTH 32768
+
+struct fuse_uring_mbuf {
+	struct rb_node rb_node;
+	void *kbuf; /* kernel allocated ring request buffer */
+	void *ubuf; /* mmaped address */
+};
+
+/** A fuse ring entry, part of the ring queue */
+struct fuse_ring_ent {
+	/*
+	 * pointer to kernel request buffer, userspace side has direct access
+	 * to it through the mmaped buffer
+	 */
+	struct fuse_ring_req *rreq;
+
+	/* the ring queue that owns the request */
+	struct fuse_ring_queue *queue;
+
+	struct io_uring_cmd *cmd;
+
+	struct list_head list;
+
+	/*
+	 * state the request is currently in
+	 * (enum fuse_ring_req_state)
+	 */
+	unsigned long state;
+
+	/* array index in the ring-queue */
+	int tag;
+
+	/* is this an async or sync entry */
+	unsigned int async : 1;
+
+	struct fuse_req *fuse_req; /* when a list request is handled */
+};
+
+struct fuse_ring_queue {
+	/* task belonging to the current queue */
+	struct task_struct *server_task;
+
+	/*
+	 * back pointer to the main fuse uring structure that holds this
+	 * queue
+	 */
+	struct fuse_ring *ring;
+
+	/* issue flags when running in io-uring task context */
+	unsigned int uring_cmd_issue_flags;
+
+	int qid;
+
+	/*
+	 * available number of sync requests,
+	 * loosely bound to fuse foreground requests
+	 */
+	int nr_req_sync;
+
+	/*
+	 * available number of async requests
+	 * loosely bound to fuse background requests
+	 */
+	int nr_req_async;
+
+	/* queue lock, taken when any value in the queue changes _and_ also
+	 * a ring entry state changes.
+	 */
+	spinlock_t lock;
+
+	/* per queue memory buffer that is divided per request */
+	char *queue_req_buf;
+
+	/* fuse fg/bg request types */
+	struct list_head async_fuse_req_queue;
+	struct list_head sync_fuse_req_queue;
+
+	/* available ring entries (struct fuse_ring_ent) */
+	struct list_head async_ent_avail_queue;
+	struct list_head sync_ent_avail_queue;
+
+	struct list_head ent_in_userspace;
+
+	unsigned int configured : 1;
+	unsigned int stopped : 1;
+
+	/* size depends on queue depth */
+	struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
+};
+
+/**
+ * Describes if uring is for communication and holds alls the data needed
+ * for uring communication
+ */
+struct fuse_ring {
+	/* back pointer to fuse_conn */
+	struct fuse_conn *fc;
+
+	/* number of ring queues */
+	size_t nr_queues;
+
+	/* number of entries per queue */
+	size_t queue_depth;
+
+	/* max arg size for a request */
+	size_t req_arg_len;
+
+	/* req_arg_len + sizeof(struct fuse_req) */
+	size_t req_buf_sz;
+
+	/* max number of background requests per queue */
+	size_t max_nr_async;
+
+	/* max number of foreground requests */
+	size_t max_nr_sync;
+
+	/* size of struct fuse_ring_queue + queue-depth * entry-size */
+	size_t queue_size;
+
+	/* buffer size per queue, that is used per queue entry */
+	size_t queue_buf_size;
+
+	/* Used to release the ring on stop */
+	atomic_t queue_refs;
+
+	/* Hold ring requests */
+	struct fuse_ring_queue *queues;
+
+	/* number of initialized queues with the ioctl */
+	int nr_queues_ioctl_init;
+
+	/* number of SQEs initialized */
+	atomic_t nr_sqe_init;
+
+	/* one queue per core or a single queue only ? */
+	unsigned int per_core_queue : 1;
+
+	/* Is the ring completely iocl configured */
+	unsigned int configured : 1;
+
+	/* numa aware memory allocation */
+	unsigned int numa_aware : 1;
+
+	/* Is the ring read to take requests */
+	unsigned int ready : 1;
+
+	/*
+	 * Log ring entry states onces on stop when entries cannot be
+	 * released
+	 */
+	unsigned int stop_debug_log : 1;
+
+	struct mutex start_stop_lock;
+
+	wait_queue_head_t stop_waitq;
+
+	/* mmaped ring entry memory buffers, mmaped values is the key,
+	 * kernel pointer is the value
+	 */
+	struct rb_root mem_buf_map;
+
+	struct delayed_work stop_work;
+	unsigned long stop_time;
+};
+
+void fuse_uring_abort_end_requests(struct fuse_ring *ring);
+int fuse_uring_conn_cfg(struct fuse_ring *ring, struct fuse_ring_config *rcfg);
+int fuse_uring_queue_cfg(struct fuse_ring *ring,
+			 struct fuse_ring_queue_config *qcfg);
+void fuse_uring_ring_destruct(struct fuse_ring *ring);
+
+static inline void fuse_uring_conn_init(struct fuse_ring *ring,
+					struct fuse_conn *fc)
+{
+	/* no reference on fc as ring and fc have to be destructed together */
+	ring->fc = fc;
+	init_waitqueue_head(&ring->stop_waitq);
+	mutex_init(&ring->start_stop_lock);
+	ring->mem_buf_map = RB_ROOT;
+}
+
+static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
+{
+	struct fuse_ring *ring = fc->ring;
+
+	if (ring == NULL)
+		return;
+
+	fuse_uring_ring_destruct(ring);
+
+	WRITE_ONCE(fc->ring, NULL);
+	kfree(ring);
+}
+
+static inline struct fuse_ring_queue *
+fuse_uring_get_queue(struct fuse_ring *ring, int qid)
+{
+	char *ptr = (char *)ring->queues;
+
+	if (unlikely(qid > ring->nr_queues)) {
+		WARN_ON(1);
+		qid = 0;
+	}
+
+	return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
+}
+
+#else /* CONFIG_FUSE_IO_URING */
+
+struct fuse_ring;
+
+static inline void fuse_uring_conn_init(struct fuse_ring *ring,
+					struct fuse_conn *fc)
+{
+}
+
+static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
+{
+}
+
+#endif /* CONFIG_FUSE_IO_URING */
+
+#endif /* _FS_FUSE_DEV_URING_I_H */
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 6c506f040d5f..e6289bafb788 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -7,6 +7,7 @@
 #define _FS_FUSE_DEV_I_H
 
 #include <linux/types.h>
+#include <linux/fs.h>
 
 /* Ordinary requests have even IDs, while interrupts IDs are odd */
 #define FUSE_INT_REQ_BIT (1ULL << 0)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f23919610313..d2b058ccb677 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -917,6 +917,11 @@ struct fuse_conn {
 	/** IDR for backing files ids */
 	struct idr backing_files_map;
 #endif
+
+#if IS_ENABLED(CONFIG_FUSE_IO_URING)
+	/**  uring connection information*/
+	struct fuse_ring *ring;
+#endif
 };
 
 /*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 99e44ea7d875..33a080b24d65 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -7,6 +7,7 @@
 */
 
 #include "fuse_i.h"
+#include "dev_uring_i.h"
 
 #include <linux/pagemap.h>
 #include <linux/slab.h>
@@ -947,6 +948,8 @@ static void delayed_release(struct rcu_head *p)
 {
 	struct fuse_conn *fc = container_of(p, struct fuse_conn, rcu);
 
+	fuse_uring_conn_destruct(fc);
+
 	put_user_ns(fc->user_ns);
 	fc->release(fc);
 }
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d08b99d60f6f..0449640f2501 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1079,12 +1079,79 @@ struct fuse_backing_map {
 	uint64_t	padding;
 };
 
+enum fuse_uring_ioctl_cmd {
+	/* not correctly initialized when set */
+	FUSE_URING_IOCTL_CMD_INVALID    = 0,
+
+	/* Ioctl to prepare communucation with io-uring */
+	FUSE_URING_IOCTL_CMD_RING_CFG   = 1,
+
+	/* Ring queue configuration ioctl */
+	FUSE_URING_IOCTL_CMD_QUEUE_CFG  = 2,
+};
+
+enum fuse_uring_cfg_flags {
+	/* server/daemon side requests numa awareness */
+	FUSE_URING_WANT_NUMA = 1ul << 0,
+};
+
+struct fuse_uring_cfg {
+	/* struct flags */
+	uint64_t flags;
+
+	/* configuration command */
+	uint8_t cmd;
+
+	uint8_t padding[7];
+
+	union {
+		struct fuse_ring_config {
+			/* number of queues */
+			uint32_t nr_queues;
+
+			/* number of foreground entries per queue */
+			uint32_t sync_queue_depth;
+
+			/* number of background entries per queue */
+			uint32_t async_queue_depth;
+
+			/* argument (max data length) of a request */
+			uint32_t req_arg_len;
+
+			/*
+			 * buffer size userspace allocated per request buffer
+			 * from the mmaped queue buffer
+			 */
+			uint32_t user_req_buf_sz;
+
+			/* ring config flags */
+			uint32_t numa_aware:1;
+		} rconf;
+
+		struct fuse_ring_queue_config {
+			/* mmaped buffser address */
+			uint64_t uaddr;
+
+			/* qid the command is for */
+			uint32_t qid;
+
+			/* /dev/fuse fd that initiated the mount. */
+			uint32_t control_fd;
+		} qconf;
+
+		/* space for future additions */
+		uint8_t union_size[128];
+	};
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 #define FUSE_DEV_IOC_BACKING_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, \
 					     struct fuse_backing_map)
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_URING		_IOR(FUSE_DEV_IOC_MAGIC, 3, \
+					     struct fuse_uring_cfg)
 
 struct fuse_lseek_in {
 	uint64_t	fh;
@@ -1186,4 +1253,10 @@ struct fuse_supp_groups {
 	uint32_t	groups[];
 };
 
+/**
+ * Size of the ring buffer header
+ */
+#define FUSE_RING_HEADER_BUF_SIZE 4096
+#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
+
 #endif /* _LINUX_FUSE_H */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 05/19] fuse: Add a uring config ioctl
  2024-05-29 18:00 ` [PATCH RFC v2 05/19] fuse: Add a uring config ioctl Bernd Schubert
@ 2024-05-29 21:24   ` Josef Bacik
  2024-05-30 12:51     ` Bernd Schubert
  2024-06-03 13:03   ` Miklos Szeredi
  1 sibling, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-29 21:24 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:40PM +0200, Bernd Schubert wrote:
> This only adds the initial ioctl for basic fuse-uring initialization.
> More ioctl types will be added later to initialize queues.
> 
> This also adds data structures needed or initialized by the ioctl
> command and that will be used later.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/Kconfig           |  12 +++
>  fs/fuse/Makefile          |   1 +
>  fs/fuse/dev.c             |  91 ++++++++++++++++--
>  fs/fuse/dev_uring.c       | 122 +++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h     | 239 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/fuse_dev_i.h      |   1 +
>  fs/fuse/fuse_i.h          |   5 +
>  fs/fuse/inode.c           |   3 +
>  include/uapi/linux/fuse.h |  73 ++++++++++++++
>  9 files changed, 538 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 8674dbfbe59d..11f37cefc94b 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -63,3 +63,15 @@ config FUSE_PASSTHROUGH
>  	  to be performed directly on a backing file.
>  
>  	  If you want to allow passthrough operations, answer Y.
> +
> +config FUSE_IO_URING
> +	bool "FUSE communication over io-uring"
> +	default y
> +	depends on FUSE_FS
> +	depends on IO_URING
> +	help
> +	  This allows sending FUSE requests over the IO uring interface and
> +          also adds request core affinity.
> +
> +	  If you want to allow fuse server/client communication through io-uring,
> +	  answer Y
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 6e0228c6d0cb..7193a14374fd 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -11,5 +11,6 @@ fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
>  fuse-y += iomode.o
>  fuse-$(CONFIG_FUSE_DAX) += dax.o
>  fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> +fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
>  
>  virtiofs-y := virtio_fs.o
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index b98ecb197a28..bc77413932cf 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -8,6 +8,7 @@
>  
>  #include "fuse_i.h"
>  #include "fuse_dev_i.h"
> +#include "dev_uring_i.h"
>  
>  #include <linux/init.h>
>  #include <linux/module.h>
> @@ -26,6 +27,13 @@
>  MODULE_ALIAS_MISCDEV(FUSE_MINOR);
>  MODULE_ALIAS("devname:fuse");
>  
> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)
> +static bool __read_mostly enable_uring;
> +module_param(enable_uring, bool, 0644);
> +MODULE_PARM_DESC(enable_uring,
> +		 "Enable uring userspace communication through uring.");
> +#endif
> +
>  static struct kmem_cache *fuse_req_cachep;
>  
>  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> @@ -2297,16 +2305,12 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
>  	return 0;
>  }
>  
> -static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
> +static long _fuse_dev_ioctl_clone(struct file *file, int oldfd)
>  {
>  	int res;
> -	int oldfd;
>  	struct fuse_dev *fud = NULL;
>  	struct fd f;
>  
> -	if (get_user(oldfd, argp))
> -		return -EFAULT;
> -
>  	f = fdget(oldfd);
>  	if (!f.file)
>  		return -EINVAL;
> @@ -2329,6 +2333,16 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>  	return res;
>  }
>  
> +static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
> +{
> +	int oldfd;
> +
> +	if (get_user(oldfd, argp))
> +		return -EFAULT;
> +
> +	return _fuse_dev_ioctl_clone(file, oldfd);
> +}
> +
>  static long fuse_dev_ioctl_backing_open(struct file *file,
>  					struct fuse_backing_map __user *argp)
>  {
> @@ -2364,8 +2378,65 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
>  	return fuse_backing_close(fud->fc, backing_id);
>  }
>  
> -static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> -			   unsigned long arg)
> +/**
> + * Configure the queue for the given qid. First call will also initialize
> + * the ring for this connection.
> + */
> +static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
> +{
> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)
> +	int res;
> +	struct fuse_uring_cfg cfg;
> +	struct fuse_dev *fud;
> +	struct fuse_conn *fc;
> +	struct fuse_ring *ring;
> +
> +	res = copy_from_user(&cfg, (void *)argp, sizeof(cfg));
> +	if (res != 0)
> +		return -EFAULT;
> +
> +	fud = fuse_get_dev(file);
> +	if (fud == NULL)
> +		return -ENODEV;
> +	fc = fud->fc;
> +
> +	switch (cfg.cmd) {
> +	case FUSE_URING_IOCTL_CMD_RING_CFG:
> +		if (READ_ONCE(fc->ring) == NULL)
> +			ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL);
> +
> +		spin_lock(&fc->lock);
> +		if (fc->ring == NULL) {
> +			fc->ring = ring;

Need to have error handling here in case the kzalloc failed.

> +			fuse_uring_conn_init(fc->ring, fc);
> +		} else {
> +			kfree(ring);
> +		}
> +
> +		spin_unlock(&fc->lock);
> +		if (fc->ring == NULL)
> +			return -ENOMEM;
> +
> +		mutex_lock(&fc->ring->start_stop_lock);
> +		res = fuse_uring_conn_cfg(fc->ring, &cfg.rconf);
> +		mutex_unlock(&fc->ring->start_stop_lock);
> +
> +		if (res != 0)
> +			return res;
> +		break;
> +	default:
> +		res = -EINVAL;
> +	}
> +
> +		return res;
> +#else
> +	return -ENOTTY;
> +#endif
> +}
> +
> +static long
> +fuse_dev_ioctl(struct file *file, unsigned int cmd,
> +	       unsigned long arg)
>  {
>  	void __user *argp = (void __user *)arg;
>  
> @@ -2379,8 +2450,10 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>  	case FUSE_DEV_IOC_BACKING_CLOSE:
>  		return fuse_dev_ioctl_backing_close(file, argp);
>  
> -	default:
> -		return -ENOTTY;
> +	case FUSE_DEV_IOC_URING:
> +		return fuse_uring_ioctl(file, argp);
> +

Instead just wrap the above in 

#ifdef CONFIG_FUSE_IO_URING
	case FUSE_DEV_IOC_URING:
		return fuse_uring_ioctl(file, argp);
#endif

instead of wrapping the entire function above in the check.
	
> +	default: return -ENOTTY;
>  	}
>  }
>  
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> new file mode 100644
> index 000000000000..702a994cf192
> --- /dev/null
> +++ b/fs/fuse/dev_uring.c
> @@ -0,0 +1,122 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * FUSE: Filesystem in Userspace
> + * Copyright (c) 2023-2024 DataDirect Networks.
> + */
> +
> +#include "fuse_i.h"
> +#include "fuse_dev_i.h"
> +#include "dev_uring_i.h"
> +
> +#include "linux/compiler_types.h"
> +#include "linux/spinlock.h"
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/sched/signal.h>
> +#include <linux/uio.h>
> +#include <linux/miscdevice.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/slab.h>
> +#include <linux/pipe_fs_i.h>
> +#include <linux/swap.h>
> +#include <linux/splice.h>
> +#include <linux/sched.h>
> +#include <linux/io_uring.h>
> +#include <linux/mm.h>
> +#include <linux/io.h>
> +#include <linux/io_uring.h>
> +#include <linux/io_uring/cmd.h>
> +#include <linux/topology.h>
> +#include <linux/io_uring/cmd.h>
> +
> +/*
> + * Basic ring setup for this connection based on the provided configuration
> + */
> +int fuse_uring_conn_cfg(struct fuse_ring *ring, struct fuse_ring_config *rcfg)
> +{
> +	size_t queue_sz;
> +
> +	if (ring->configured) {
> +		pr_info("The ring is already configured.\n");
> +		return -EALREADY;
> +	}
> +
> +	if (rcfg->nr_queues == 0) {
> +		pr_info("zero number of queues is invalid.\n");
> +		return -EINVAL;
> +	}
> +
> +	if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
> +		pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
> +			rcfg->nr_queues, num_present_cpus());
> +		return -EINVAL;
> +	}
> +
> +	if (rcfg->req_arg_len < FUSE_RING_MIN_IN_OUT_ARG_SIZE) {
> +		pr_info("Per req buffer size too small (%d), min: %d\n",
> +			rcfg->req_arg_len, FUSE_RING_MIN_IN_OUT_ARG_SIZE);
> +		return -EINVAL;
> +	}
> +
> +	if (WARN_ON(ring->queues))
> +		return -EINVAL;
> +
> +	ring->numa_aware = rcfg->numa_aware;
> +	ring->nr_queues = rcfg->nr_queues;
> +	ring->per_core_queue = rcfg->nr_queues > 1;
> +
> +	ring->max_nr_sync = rcfg->sync_queue_depth;
> +	ring->max_nr_async = rcfg->async_queue_depth;
> +	ring->queue_depth = ring->max_nr_sync + ring->max_nr_async;
> +
> +	ring->req_arg_len = rcfg->req_arg_len;
> +	ring->req_buf_sz = rcfg->user_req_buf_sz;
> +
> +	ring->queue_buf_size = ring->req_buf_sz * ring->queue_depth;
> +
> +	queue_sz = sizeof(*ring->queues) +
> +		   ring->queue_depth * sizeof(struct fuse_ring_ent);
> +	ring->queues = kcalloc(rcfg->nr_queues, queue_sz, GFP_KERNEL);
> +	if (!ring->queues)
> +		return -ENOMEM;
> +	ring->queue_size = queue_sz;
> +	ring->configured = 1;
> +
> +	atomic_set(&ring->queue_refs, 0);
> +
> +	return 0;
> +}
> +
> +void fuse_uring_ring_destruct(struct fuse_ring *ring)
> +{
> +	unsigned int qid;
> +	struct rb_node *rbn;
> +
> +	for (qid = 0; qid < ring->nr_queues; qid++) {
> +		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
> +
> +		vfree(queue->queue_req_buf);
> +	}
> +
> +	kfree(ring->queues);
> +	ring->queues = NULL;
> +	ring->nr_queues_ioctl_init = 0;
> +	ring->queue_depth = 0;
> +	ring->nr_queues = 0;
> +
> +	rbn = rb_first(&ring->mem_buf_map);
> +	while (rbn) {
> +		struct rb_node *next = rb_next(rbn);
> +		struct fuse_uring_mbuf *entry =
> +			rb_entry(rbn, struct fuse_uring_mbuf, rb_node);
> +
> +		rb_erase(rbn, &ring->mem_buf_map);
> +		kfree(entry);
> +
> +		rbn = next;
> +	}
> +
> +	mutex_destroy(&ring->start_stop_lock);
> +}
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> new file mode 100644
> index 000000000000..58ab4671deff
> --- /dev/null
> +++ b/fs/fuse/dev_uring_i.h
> @@ -0,0 +1,239 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * FUSE: Filesystem in Userspace
> + * Copyright (c) 2023-2024 DataDirect Networks.
> + */
> +
> +#ifndef _FS_FUSE_DEV_URING_I_H
> +#define _FS_FUSE_DEV_URING_I_H
> +
> +#include "fuse_i.h"
> +#include "linux/compiler_types.h"
> +#include "linux/rbtree_types.h"
> +
> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)
> +
> +/* IORING_MAX_ENTRIES */
> +#define FUSE_URING_MAX_QUEUE_DEPTH 32768
> +
> +struct fuse_uring_mbuf {
> +	struct rb_node rb_node;
> +	void *kbuf; /* kernel allocated ring request buffer */
> +	void *ubuf; /* mmaped address */
> +};
> +
> +/** A fuse ring entry, part of the ring queue */
> +struct fuse_ring_ent {
> +	/*
> +	 * pointer to kernel request buffer, userspace side has direct access
> +	 * to it through the mmaped buffer
> +	 */
> +	struct fuse_ring_req *rreq;
> +
> +	/* the ring queue that owns the request */
> +	struct fuse_ring_queue *queue;
> +
> +	struct io_uring_cmd *cmd;
> +
> +	struct list_head list;
> +
> +	/*
> +	 * state the request is currently in
> +	 * (enum fuse_ring_req_state)
> +	 */
> +	unsigned long state;
> +
> +	/* array index in the ring-queue */
> +	int tag;
> +
> +	/* is this an async or sync entry */
> +	unsigned int async : 1;
> +
> +	struct fuse_req *fuse_req; /* when a list request is handled */
> +};
> +
> +struct fuse_ring_queue {
> +	/* task belonging to the current queue */
> +	struct task_struct *server_task;
> +
> +	/*
> +	 * back pointer to the main fuse uring structure that holds this
> +	 * queue
> +	 */
> +	struct fuse_ring *ring;
> +
> +	/* issue flags when running in io-uring task context */
> +	unsigned int uring_cmd_issue_flags;
> +
> +	int qid;
> +
> +	/*
> +	 * available number of sync requests,
> +	 * loosely bound to fuse foreground requests
> +	 */
> +	int nr_req_sync;
> +
> +	/*
> +	 * available number of async requests
> +	 * loosely bound to fuse background requests
> +	 */
> +	int nr_req_async;
> +
> +	/* queue lock, taken when any value in the queue changes _and_ also
> +	 * a ring entry state changes.
> +	 */
> +	spinlock_t lock;
> +
> +	/* per queue memory buffer that is divided per request */
> +	char *queue_req_buf;
> +
> +	/* fuse fg/bg request types */
> +	struct list_head async_fuse_req_queue;
> +	struct list_head sync_fuse_req_queue;
> +
> +	/* available ring entries (struct fuse_ring_ent) */
> +	struct list_head async_ent_avail_queue;
> +	struct list_head sync_ent_avail_queue;
> +
> +	struct list_head ent_in_userspace;
> +
> +	unsigned int configured : 1;
> +	unsigned int stopped : 1;
> +
> +	/* size depends on queue depth */
> +	struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
> +};
> +
> +/**
> + * Describes if uring is for communication and holds alls the data needed
> + * for uring communication
> + */
> +struct fuse_ring {
> +	/* back pointer to fuse_conn */
> +	struct fuse_conn *fc;
> +
> +	/* number of ring queues */
> +	size_t nr_queues;
> +
> +	/* number of entries per queue */
> +	size_t queue_depth;
> +
> +	/* max arg size for a request */
> +	size_t req_arg_len;
> +
> +	/* req_arg_len + sizeof(struct fuse_req) */
> +	size_t req_buf_sz;
> +
> +	/* max number of background requests per queue */
> +	size_t max_nr_async;
> +
> +	/* max number of foreground requests */
> +	size_t max_nr_sync;
> +
> +	/* size of struct fuse_ring_queue + queue-depth * entry-size */
> +	size_t queue_size;
> +
> +	/* buffer size per queue, that is used per queue entry */
> +	size_t queue_buf_size;
> +
> +	/* Used to release the ring on stop */
> +	atomic_t queue_refs;
> +
> +	/* Hold ring requests */
> +	struct fuse_ring_queue *queues;
> +
> +	/* number of initialized queues with the ioctl */
> +	int nr_queues_ioctl_init;
> +
> +	/* number of SQEs initialized */
> +	atomic_t nr_sqe_init;
> +
> +	/* one queue per core or a single queue only ? */
> +	unsigned int per_core_queue : 1;
> +
> +	/* Is the ring completely iocl configured */
> +	unsigned int configured : 1;
> +
> +	/* numa aware memory allocation */
> +	unsigned int numa_aware : 1;
> +
> +	/* Is the ring read to take requests */
> +	unsigned int ready : 1;
> +
> +	/*
> +	 * Log ring entry states onces on stop when entries cannot be
> +	 * released
> +	 */
> +	unsigned int stop_debug_log : 1;
> +
> +	struct mutex start_stop_lock;
> +
> +	wait_queue_head_t stop_waitq;
> +
> +	/* mmaped ring entry memory buffers, mmaped values is the key,
> +	 * kernel pointer is the value
> +	 */
> +	struct rb_root mem_buf_map;
> +
> +	struct delayed_work stop_work;
> +	unsigned long stop_time;
> +};

This is mostly a preference thing, but you've added a huge amount of code that
isn't used in this patch, so it makes it hard for me to review without knowing
how the things are to be used.

Generally it's easier on reviewers if you're adding the structs as you need them
so you can clearly follow what the purpose of everything is.  Here I just have
to go look at the end result and figure out what everything does and if it makes
sense.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 05/19] fuse: Add a uring config ioctl
  2024-05-29 21:24   ` Josef Bacik
@ 2024-05-30 12:51     ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 12:51 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert; +Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/29/24 23:24, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:40PM +0200, Bernd Schubert wrote:
>> This only adds the initial ioctl for basic fuse-uring initialization.
>> More ioctl types will be added later to initialize queues.
>>
>> This also adds data structures needed or initialized by the ioctl
>> command and that will be used later.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/Kconfig           |  12 +++
>>  fs/fuse/Makefile          |   1 +
>>  fs/fuse/dev.c             |  91 ++++++++++++++++--
>>  fs/fuse/dev_uring.c       | 122 +++++++++++++++++++++++
>>  fs/fuse/dev_uring_i.h     | 239 ++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/fuse/fuse_dev_i.h      |   1 +
>>  fs/fuse/fuse_i.h          |   5 +
>>  fs/fuse/inode.c           |   3 +
>>  include/uapi/linux/fuse.h |  73 ++++++++++++++
>>  9 files changed, 538 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 8674dbfbe59d..11f37cefc94b 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -63,3 +63,15 @@ config FUSE_PASSTHROUGH
>>  	  to be performed directly on a backing file.
>>  
>>  	  If you want to allow passthrough operations, answer Y.
>> +
>> +config FUSE_IO_URING
>> +	bool "FUSE communication over io-uring"
>> +	default y
>> +	depends on FUSE_FS
>> +	depends on IO_URING
>> +	help
>> +	  This allows sending FUSE requests over the IO uring interface and
>> +          also adds request core affinity.
>> +
>> +	  If you want to allow fuse server/client communication through io-uring,
>> +	  answer Y
>> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
>> index 6e0228c6d0cb..7193a14374fd 100644
>> --- a/fs/fuse/Makefile
>> +++ b/fs/fuse/Makefile
>> @@ -11,5 +11,6 @@ fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
>>  fuse-y += iomode.o
>>  fuse-$(CONFIG_FUSE_DAX) += dax.o
>>  fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
>> +fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
>>  
>>  virtiofs-y := virtio_fs.o
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index b98ecb197a28..bc77413932cf 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -8,6 +8,7 @@
>>  
>>  #include "fuse_i.h"
>>  #include "fuse_dev_i.h"
>> +#include "dev_uring_i.h"
>>  
>>  #include <linux/init.h>
>>  #include <linux/module.h>
>> @@ -26,6 +27,13 @@
>>  MODULE_ALIAS_MISCDEV(FUSE_MINOR);
>>  MODULE_ALIAS("devname:fuse");
>>  
>> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)
>> +static bool __read_mostly enable_uring;
>> +module_param(enable_uring, bool, 0644);
>> +MODULE_PARM_DESC(enable_uring,
>> +		 "Enable uring userspace communication through uring.");
>> +#endif
>> +
>>  static struct kmem_cache *fuse_req_cachep;
>>  
>>  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
>> @@ -2297,16 +2305,12 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
>>  	return 0;
>>  }
>>  
>> -static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>> +static long _fuse_dev_ioctl_clone(struct file *file, int oldfd)
>>  {
>>  	int res;
>> -	int oldfd;
>>  	struct fuse_dev *fud = NULL;
>>  	struct fd f;
>>  
>> -	if (get_user(oldfd, argp))
>> -		return -EFAULT;
>> -
>>  	f = fdget(oldfd);
>>  	if (!f.file)
>>  		return -EINVAL;
>> @@ -2329,6 +2333,16 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>>  	return res;
>>  }
>>  
>> +static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>> +{
>> +	int oldfd;
>> +
>> +	if (get_user(oldfd, argp))
>> +		return -EFAULT;
>> +
>> +	return _fuse_dev_ioctl_clone(file, oldfd);
>> +}
>> +
>>  static long fuse_dev_ioctl_backing_open(struct file *file,
>>  					struct fuse_backing_map __user *argp)
>>  {
>> @@ -2364,8 +2378,65 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
>>  	return fuse_backing_close(fud->fc, backing_id);
>>  }
>>  
>> -static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>> -			   unsigned long arg)
>> +/**
>> + * Configure the queue for the given qid. First call will also initialize
>> + * the ring for this connection.
>> + */
>> +static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>> +{
>> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)
>> +	int res;
>> +	struct fuse_uring_cfg cfg;
>> +	struct fuse_dev *fud;
>> +	struct fuse_conn *fc;
>> +	struct fuse_ring *ring;
>> +
>> +	res = copy_from_user(&cfg, (void *)argp, sizeof(cfg));
>> +	if (res != 0)
>> +		return -EFAULT;
>> +
>> +	fud = fuse_get_dev(file);
>> +	if (fud == NULL)
>> +		return -ENODEV;
>> +	fc = fud->fc;
>> +
>> +	switch (cfg.cmd) {
>> +	case FUSE_URING_IOCTL_CMD_RING_CFG:
>> +		if (READ_ONCE(fc->ring) == NULL)
>> +			ring = kzalloc(sizeof(*fc->ring), GFP_KERNEL);
>> +
>> +		spin_lock(&fc->lock);
>> +		if (fc->ring == NULL) {
>> +			fc->ring = ring;
> 
> Need to have error handling here in case the kzalloc failed.
> 
>> +			fuse_uring_conn_init(fc->ring, fc);
>> +		} else {
>> +			kfree(ring);
>> +		}
>> +
>> +		spin_unlock(&fc->lock);
>> +		if (fc->ring == NULL)
>> +			return -ENOMEM;
>> +
>> +		mutex_lock(&fc->ring->start_stop_lock);
>> +		res = fuse_uring_conn_cfg(fc->ring, &cfg.rconf);
>> +		mutex_unlock(&fc->ring->start_stop_lock);
>> +
>> +		if (res != 0)
>> +			return res;
>> +		break;
>> +	default:
>> +		res = -EINVAL;
>> +	}
>> +
>> +		return res;
>> +#else
>> +	return -ENOTTY;
>> +#endif
>> +}
>> +
>> +static long
>> +fuse_dev_ioctl(struct file *file, unsigned int cmd,
>> +	       unsigned long arg)
>>  {
>>  	void __user *argp = (void __user *)arg;
>>  
>> @@ -2379,8 +2450,10 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>>  	case FUSE_DEV_IOC_BACKING_CLOSE:
>>  		return fuse_dev_ioctl_backing_close(file, argp);
>>  
>> -	default:
>> -		return -ENOTTY;
>> +	case FUSE_DEV_IOC_URING:
>> +		return fuse_uring_ioctl(file, argp);
>> +
> 
> Instead just wrap the above in 
> 
> #ifdef CONFIG_FUSE_IO_URING
> 	case FUSE_DEV_IOC_URING:
> 		return fuse_uring_ioctl(file, argp);
> #endif
> 
> instead of wrapping the entire function above in the check.
> 	
>> +	default: return -ENOTTY;
>>  	}
>>  }
>>  
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> new file mode 100644
>> index 000000000000..702a994cf192
>> --- /dev/null
>> +++ b/fs/fuse/dev_uring.c
>> @@ -0,0 +1,122 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * FUSE: Filesystem in Userspace
>> + * Copyright (c) 2023-2024 DataDirect Networks.
>> + */
>> +
>> +#include "fuse_i.h"
>> +#include "fuse_dev_i.h"
>> +#include "dev_uring_i.h"
>> +
>> +#include "linux/compiler_types.h"
>> +#include "linux/spinlock.h"
>> +#include <linux/init.h>
>> +#include <linux/module.h>
>> +#include <linux/poll.h>
>> +#include <linux/sched/signal.h>
>> +#include <linux/uio.h>
>> +#include <linux/miscdevice.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/file.h>
>> +#include <linux/slab.h>
>> +#include <linux/pipe_fs_i.h>
>> +#include <linux/swap.h>
>> +#include <linux/splice.h>
>> +#include <linux/sched.h>
>> +#include <linux/io_uring.h>
>> +#include <linux/mm.h>
>> +#include <linux/io.h>
>> +#include <linux/io_uring.h>
>> +#include <linux/io_uring/cmd.h>
>> +#include <linux/topology.h>
>> +#include <linux/io_uring/cmd.h>
>> +
>> +/*
>> + * Basic ring setup for this connection based on the provided configuration
>> + */
>> +int fuse_uring_conn_cfg(struct fuse_ring *ring, struct fuse_ring_config *rcfg)
>> +{
>> +	size_t queue_sz;
>> +
>> +	if (ring->configured) {
>> +		pr_info("The ring is already configured.\n");
>> +		return -EALREADY;
>> +	}
>> +
>> +	if (rcfg->nr_queues == 0) {
>> +		pr_info("zero number of queues is invalid.\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
>> +		pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
>> +			rcfg->nr_queues, num_present_cpus());
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (rcfg->req_arg_len < FUSE_RING_MIN_IN_OUT_ARG_SIZE) {
>> +		pr_info("Per req buffer size too small (%d), min: %d\n",
>> +			rcfg->req_arg_len, FUSE_RING_MIN_IN_OUT_ARG_SIZE);
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (WARN_ON(ring->queues))
>> +		return -EINVAL;
>> +
>> +	ring->numa_aware = rcfg->numa_aware;
>> +	ring->nr_queues = rcfg->nr_queues;
>> +	ring->per_core_queue = rcfg->nr_queues > 1;
>> +
>> +	ring->max_nr_sync = rcfg->sync_queue_depth;
>> +	ring->max_nr_async = rcfg->async_queue_depth;
>> +	ring->queue_depth = ring->max_nr_sync + ring->max_nr_async;
>> +
>> +	ring->req_arg_len = rcfg->req_arg_len;
>> +	ring->req_buf_sz = rcfg->user_req_buf_sz;
>> +
>> +	ring->queue_buf_size = ring->req_buf_sz * ring->queue_depth;
>> +
>> +	queue_sz = sizeof(*ring->queues) +
>> +		   ring->queue_depth * sizeof(struct fuse_ring_ent);
>> +	ring->queues = kcalloc(rcfg->nr_queues, queue_sz, GFP_KERNEL);
>> +	if (!ring->queues)
>> +		return -ENOMEM;
>> +	ring->queue_size = queue_sz;
>> +	ring->configured = 1;
>> +
>> +	atomic_set(&ring->queue_refs, 0);
>> +
>> +	return 0;
>> +}
>> +
>> +void fuse_uring_ring_destruct(struct fuse_ring *ring)
>> +{
>> +	unsigned int qid;
>> +	struct rb_node *rbn;
>> +
>> +	for (qid = 0; qid < ring->nr_queues; qid++) {
>> +		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
>> +
>> +		vfree(queue->queue_req_buf);
>> +	}
>> +
>> +	kfree(ring->queues);
>> +	ring->queues = NULL;
>> +	ring->nr_queues_ioctl_init = 0;
>> +	ring->queue_depth = 0;
>> +	ring->nr_queues = 0;
>> +
>> +	rbn = rb_first(&ring->mem_buf_map);
>> +	while (rbn) {
>> +		struct rb_node *next = rb_next(rbn);
>> +		struct fuse_uring_mbuf *entry =
>> +			rb_entry(rbn, struct fuse_uring_mbuf, rb_node);
>> +
>> +		rb_erase(rbn, &ring->mem_buf_map);
>> +		kfree(entry);
>> +
>> +		rbn = next;
>> +	}
>> +
>> +	mutex_destroy(&ring->start_stop_lock);
>> +}
>> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
>> new file mode 100644
>> index 000000000000..58ab4671deff
>> --- /dev/null
>> +++ b/fs/fuse/dev_uring_i.h
>> @@ -0,0 +1,239 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + *
>> + * FUSE: Filesystem in Userspace
>> + * Copyright (c) 2023-2024 DataDirect Networks.
>> + */
>> +
>> +#ifndef _FS_FUSE_DEV_URING_I_H
>> +#define _FS_FUSE_DEV_URING_I_H
>> +
>> +#include "fuse_i.h"
>> +#include "linux/compiler_types.h"
>> +#include "linux/rbtree_types.h"
>> +
>> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)
>> +
>> +/* IORING_MAX_ENTRIES */
>> +#define FUSE_URING_MAX_QUEUE_DEPTH 32768
>> +
>> +struct fuse_uring_mbuf {
>> +	struct rb_node rb_node;
>> +	void *kbuf; /* kernel allocated ring request buffer */
>> +	void *ubuf; /* mmaped address */
>> +};
>> +
>> +/** A fuse ring entry, part of the ring queue */
>> +struct fuse_ring_ent {
>> +	/*
>> +	 * pointer to kernel request buffer, userspace side has direct access
>> +	 * to it through the mmaped buffer
>> +	 */
>> +	struct fuse_ring_req *rreq;
>> +
>> +	/* the ring queue that owns the request */
>> +	struct fuse_ring_queue *queue;
>> +
>> +	struct io_uring_cmd *cmd;
>> +
>> +	struct list_head list;
>> +
>> +	/*
>> +	 * state the request is currently in
>> +	 * (enum fuse_ring_req_state)
>> +	 */
>> +	unsigned long state;
>> +
>> +	/* array index in the ring-queue */
>> +	int tag;
>> +
>> +	/* is this an async or sync entry */
>> +	unsigned int async : 1;
>> +
>> +	struct fuse_req *fuse_req; /* when a list request is handled */
>> +};
>> +
>> +struct fuse_ring_queue {
>> +	/* task belonging to the current queue */
>> +	struct task_struct *server_task;
>> +
>> +	/*
>> +	 * back pointer to the main fuse uring structure that holds this
>> +	 * queue
>> +	 */
>> +	struct fuse_ring *ring;
>> +
>> +	/* issue flags when running in io-uring task context */
>> +	unsigned int uring_cmd_issue_flags;
>> +
>> +	int qid;
>> +
>> +	/*
>> +	 * available number of sync requests,
>> +	 * loosely bound to fuse foreground requests
>> +	 */
>> +	int nr_req_sync;
>> +
>> +	/*
>> +	 * available number of async requests
>> +	 * loosely bound to fuse background requests
>> +	 */
>> +	int nr_req_async;
>> +
>> +	/* queue lock, taken when any value in the queue changes _and_ also
>> +	 * a ring entry state changes.
>> +	 */
>> +	spinlock_t lock;
>> +
>> +	/* per queue memory buffer that is divided per request */
>> +	char *queue_req_buf;
>> +
>> +	/* fuse fg/bg request types */
>> +	struct list_head async_fuse_req_queue;
>> +	struct list_head sync_fuse_req_queue;
>> +
>> +	/* available ring entries (struct fuse_ring_ent) */
>> +	struct list_head async_ent_avail_queue;
>> +	struct list_head sync_ent_avail_queue;
>> +
>> +	struct list_head ent_in_userspace;
>> +
>> +	unsigned int configured : 1;
>> +	unsigned int stopped : 1;
>> +
>> +	/* size depends on queue depth */
>> +	struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
>> +};
>> +
>> +/**
>> + * Describes if uring is for communication and holds alls the data needed
>> + * for uring communication
>> + */
>> +struct fuse_ring {
>> +	/* back pointer to fuse_conn */
>> +	struct fuse_conn *fc;
>> +
>> +	/* number of ring queues */
>> +	size_t nr_queues;
>> +
>> +	/* number of entries per queue */
>> +	size_t queue_depth;
>> +
>> +	/* max arg size for a request */
>> +	size_t req_arg_len;
>> +
>> +	/* req_arg_len + sizeof(struct fuse_req) */
>> +	size_t req_buf_sz;
>> +
>> +	/* max number of background requests per queue */
>> +	size_t max_nr_async;
>> +
>> +	/* max number of foreground requests */
>> +	size_t max_nr_sync;
>> +
>> +	/* size of struct fuse_ring_queue + queue-depth * entry-size */
>> +	size_t queue_size;
>> +
>> +	/* buffer size per queue, that is used per queue entry */
>> +	size_t queue_buf_size;
>> +
>> +	/* Used to release the ring on stop */
>> +	atomic_t queue_refs;
>> +
>> +	/* Hold ring requests */
>> +	struct fuse_ring_queue *queues;
>> +
>> +	/* number of initialized queues with the ioctl */
>> +	int nr_queues_ioctl_init;
>> +
>> +	/* number of SQEs initialized */
>> +	atomic_t nr_sqe_init;
>> +
>> +	/* one queue per core or a single queue only ? */
>> +	unsigned int per_core_queue : 1;
>> +
>> +	/* Is the ring completely iocl configured */
>> +	unsigned int configured : 1;
>> +
>> +	/* numa aware memory allocation */
>> +	unsigned int numa_aware : 1;
>> +
>> +	/* Is the ring read to take requests */
>> +	unsigned int ready : 1;
>> +
>> +	/*
>> +	 * Log ring entry states onces on stop when entries cannot be
>> +	 * released
>> +	 */
>> +	unsigned int stop_debug_log : 1;
>> +
>> +	struct mutex start_stop_lock;
>> +
>> +	wait_queue_head_t stop_waitq;
>> +
>> +	/* mmaped ring entry memory buffers, mmaped values is the key,
>> +	 * kernel pointer is the value
>> +	 */
>> +	struct rb_root mem_buf_map;
>> +
>> +	struct delayed_work stop_work;
>> +	unsigned long stop_time;
>> +};
> 
> This is mostly a preference thing, but you've added a huge amount of code that
> isn't used in this patch, so it makes it hard for me to review without knowing
> how the things are to be used.
> 
> Generally it's easier on reviewers if you're adding the structs as you need them
> so you can clearly follow what the purpose of everything is.  Here I just have
> to go look at the end result and figure out what everything does and if it makes
> sense.  Thanks,
> 
> Josef

Yeah, entirely agreed. Will improve that in the next patch version.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 05/19] fuse: Add a uring config ioctl
  2024-05-29 18:00 ` [PATCH RFC v2 05/19] fuse: Add a uring config ioctl Bernd Schubert
  2024-05-29 21:24   ` Josef Bacik
@ 2024-06-03 13:03   ` Miklos Szeredi
  2024-06-03 13:48     ` Bernd Schubert
  1 sibling, 1 reply; 113+ messages in thread
From: Miklos Szeredi @ 2024-06-03 13:03 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote:

> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1079,12 +1079,79 @@ struct fuse_backing_map {
>         uint64_t        padding;
>  };
>
> +enum fuse_uring_ioctl_cmd {
> +       /* not correctly initialized when set */
> +       FUSE_URING_IOCTL_CMD_INVALID    = 0,
> +
> +       /* Ioctl to prepare communucation with io-uring */
> +       FUSE_URING_IOCTL_CMD_RING_CFG   = 1,
> +
> +       /* Ring queue configuration ioctl */
> +       FUSE_URING_IOCTL_CMD_QUEUE_CFG  = 2,
> +};

Is there a reason why these cannot be separate ioctl commands?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 05/19] fuse: Add a uring config ioctl
  2024-06-03 13:03   ` Miklos Szeredi
@ 2024-06-03 13:48     ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-03 13:48 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert; +Cc: Amir Goldstein, linux-fsdevel



On 6/3/24 15:03, Miklos Szeredi wrote:
> On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote:
> 
>> --- a/include/uapi/linux/fuse.h
>> +++ b/include/uapi/linux/fuse.h
>> @@ -1079,12 +1079,79 @@ struct fuse_backing_map {
>>         uint64_t        padding;
>>  };
>>
>> +enum fuse_uring_ioctl_cmd {
>> +       /* not correctly initialized when set */
>> +       FUSE_URING_IOCTL_CMD_INVALID    = 0,
>> +
>> +       /* Ioctl to prepare communucation with io-uring */
>> +       FUSE_URING_IOCTL_CMD_RING_CFG   = 1,
>> +
>> +       /* Ring queue configuration ioctl */
>> +       FUSE_URING_IOCTL_CMD_QUEUE_CFG  = 2,
>> +};
> 
> Is there a reason why these cannot be separate ioctl commands?


I just personally didn't like the idea to have multiple ioctl commands
for the same feature. Initially there were also more ioctls. Easy to
change if you prefer that.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (4 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 05/19] fuse: Add a uring config ioctl Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 15:10   ` Josef Bacik
  2024-05-31 13:56   ` Christoph Hellwig
  2024-05-29 18:00 ` [PATCH RFC v2 07/19] fuse uring: Add an mmap method Bernd Schubert
                   ` (16 subsequent siblings)
  22 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert
  Cc: Andrew Morton, linux-mm

This is to have a numa aware vmalloc function for memory exposed to
userspace. Fuse uring will allocate queue memory using this
new function.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: linux-mm@kvack.org
Acked-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/vmalloc.h |  1 +
 mm/nommu.c              |  6 ++++++
 mm/vmalloc.c            | 41 +++++++++++++++++++++++++++++++++++++----
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 98ea90e90439..e7645702074e 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -141,6 +141,7 @@ static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 extern void *vmalloc(unsigned long size) __alloc_size(1);
 extern void *vzalloc(unsigned long size) __alloc_size(1);
 extern void *vmalloc_user(unsigned long size) __alloc_size(1);
+extern void *vmalloc_node_user(unsigned long size, int node) __alloc_size(1);
 extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vmalloc_32(unsigned long size) __alloc_size(1);
diff --git a/mm/nommu.c b/mm/nommu.c
index 5ec8f44e7ce9..207ddf639aa9 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -185,6 +185,12 @@ void *vmalloc_user(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc_user);
 
+void *vmalloc_node_user(unsigned long size, int node)
+{
+	return __vmalloc_user_flags(size, GFP_KERNEL | __GFP_ZERO);
+}
+EXPORT_SYMBOL(vmalloc_node_user);
+
 struct page *vmalloc_to_page(const void *addr)
 {
 	return virt_to_page(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 68fa001648cc..0ac2f44b2b1f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3958,6 +3958,25 @@ void *vzalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vzalloc);
 
+/**
+ * _vmalloc_node_user - allocate zeroed virtually contiguous memory for userspace
+ * on the given numa node
+ * @size: allocation size
+ * @node: numa node
+ *
+ * The resulting memory area is zeroed so it can be mapped to userspace
+ * without leaking data.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+static void *_vmalloc_node_user(unsigned long size, int node)
+{
+	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
+				    VM_USERMAP, node,
+				    __builtin_return_address(0));
+}
+
 /**
  * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
  * @size: allocation size
@@ -3969,13 +3988,27 @@ EXPORT_SYMBOL(vzalloc);
  */
 void *vmalloc_user(unsigned long size)
 {
-	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
-				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
-				    VM_USERMAP, NUMA_NO_NODE,
-				    __builtin_return_address(0));
+	return _vmalloc_node_user(size, NUMA_NO_NODE);
 }
 EXPORT_SYMBOL(vmalloc_user);
 
+/**
+ * vmalloc_user - allocate zeroed virtually contiguous memory for userspace on
+ *                a numa node
+ * @size: allocation size
+ * @node: numa node
+ *
+ * The resulting memory area is zeroed so it can be mapped to userspace
+ * without leaking data.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_node_user(unsigned long size, int node)
+{
+	return _vmalloc_node_user(size, node);
+}
+EXPORT_SYMBOL(vmalloc_node_user);
+
 /**
  * vmalloc_node - allocate memory on a specific node
  * @size:	  allocation size

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-05-29 18:00 ` [PATCH RFC v2 06/19] Add a vmalloc_node_user function Bernd Schubert
@ 2024-05-30 15:10   ` Josef Bacik
  2024-05-30 16:13     ` Bernd Schubert
  2024-05-31 13:56   ` Christoph Hellwig
  1 sibling, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 15:10 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert,
	Andrew Morton, linux-mm

On Wed, May 29, 2024 at 08:00:41PM +0200, Bernd Schubert wrote:
> This is to have a numa aware vmalloc function for memory exposed to
> userspace. Fuse uring will allocate queue memory using this
> new function.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: linux-mm@kvack.org
> Acked-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/vmalloc.h |  1 +
>  mm/nommu.c              |  6 ++++++
>  mm/vmalloc.c            | 41 +++++++++++++++++++++++++++++++++++++----
>  3 files changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 98ea90e90439..e7645702074e 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -141,6 +141,7 @@ static inline unsigned long vmalloc_nr_pages(void) { return 0; }
>  extern void *vmalloc(unsigned long size) __alloc_size(1);
>  extern void *vzalloc(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_user(unsigned long size) __alloc_size(1);
> +extern void *vmalloc_node_user(unsigned long size, int node) __alloc_size(1);
>  extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vmalloc_32(unsigned long size) __alloc_size(1);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 5ec8f44e7ce9..207ddf639aa9 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -185,6 +185,12 @@ void *vmalloc_user(unsigned long size)
>  }
>  EXPORT_SYMBOL(vmalloc_user);
>  
> +void *vmalloc_node_user(unsigned long size, int node)
> +{
> +	return __vmalloc_user_flags(size, GFP_KERNEL | __GFP_ZERO);
> +}
> +EXPORT_SYMBOL(vmalloc_node_user);
> +
>  struct page *vmalloc_to_page(const void *addr)
>  {
>  	return virt_to_page(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 68fa001648cc..0ac2f44b2b1f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3958,6 +3958,25 @@ void *vzalloc(unsigned long size)
>  }
>  EXPORT_SYMBOL(vzalloc);
>  
> +/**
> + * _vmalloc_node_user - allocate zeroed virtually contiguous memory for userspace
> + * on the given numa node
> + * @size: allocation size
> + * @node: numa node
> + *
> + * The resulting memory area is zeroed so it can be mapped to userspace
> + * without leaking data.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +static void *_vmalloc_node_user(unsigned long size, int node)
> +{
> +	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
> +				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
> +				    VM_USERMAP, node,
> +				    __builtin_return_address(0));
> +}
> +

Looking at the rest of vmalloc it seems like adding an extra variant to do the
special thing is overkill, I think it would be fine to just have

void *vmalloc_nod_user(unsigned long size, int node)
{
	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
				    VM_USERMAP, node,
				    __builtin_return_address(0));
}

instead of creating a _vmalloc_node_user().

Also as an aside, this is definitely being used by this series, but I think it
would be good to go ahead and send this by itself with just the explanation that
it's going to be used by the fuse iouring stuff later, that way you can get this
merged and continue working on the iouring part.

This also goes for the other prep patches earlier this this series, but since
those are fuse related it's probably fine to just keep shipping them with this
series.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-05-30 15:10   ` Josef Bacik
@ 2024-05-30 16:13     ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 16:13 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm



On 5/30/24 17:10, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:41PM +0200, Bernd Schubert wrote:
>> This is to have a numa aware vmalloc function for memory exposed to
>> userspace. Fuse uring will allocate queue memory using this
>> new function.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> cc: Andrew Morton <akpm@linux-foundation.org>
>> cc: linux-mm@kvack.org
>> Acked-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  include/linux/vmalloc.h |  1 +
>>  mm/nommu.c              |  6 ++++++
>>  mm/vmalloc.c            | 41 +++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 44 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 98ea90e90439..e7645702074e 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -141,6 +141,7 @@ static inline unsigned long vmalloc_nr_pages(void) { return 0; }
>>  extern void *vmalloc(unsigned long size) __alloc_size(1);
>>  extern void *vzalloc(unsigned long size) __alloc_size(1);
>>  extern void *vmalloc_user(unsigned long size) __alloc_size(1);
>> +extern void *vmalloc_node_user(unsigned long size, int node) __alloc_size(1);
>>  extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>>  extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
>>  extern void *vmalloc_32(unsigned long size) __alloc_size(1);
>> diff --git a/mm/nommu.c b/mm/nommu.c
>> index 5ec8f44e7ce9..207ddf639aa9 100644
>> --- a/mm/nommu.c
>> +++ b/mm/nommu.c
>> @@ -185,6 +185,12 @@ void *vmalloc_user(unsigned long size)
>>  }
>>  EXPORT_SYMBOL(vmalloc_user);
>>  
>> +void *vmalloc_node_user(unsigned long size, int node)
>> +{
>> +	return __vmalloc_user_flags(size, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +EXPORT_SYMBOL(vmalloc_node_user);
>> +
>>  struct page *vmalloc_to_page(const void *addr)
>>  {
>>  	return virt_to_page(addr);
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 68fa001648cc..0ac2f44b2b1f 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -3958,6 +3958,25 @@ void *vzalloc(unsigned long size)
>>  }
>>  EXPORT_SYMBOL(vzalloc);
>>  
>> +/**
>> + * _vmalloc_node_user - allocate zeroed virtually contiguous memory for userspace
>> + * on the given numa node
>> + * @size: allocation size
>> + * @node: numa node
>> + *
>> + * The resulting memory area is zeroed so it can be mapped to userspace
>> + * without leaking data.
>> + *
>> + * Return: pointer to the allocated memory or %NULL on error
>> + */
>> +static void *_vmalloc_node_user(unsigned long size, int node)
>> +{
>> +	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
>> +				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
>> +				    VM_USERMAP, node,
>> +				    __builtin_return_address(0));
>> +}
>> +
> 
> Looking at the rest of vmalloc it seems like adding an extra variant to do the
> special thing is overkill, I think it would be fine to just have
> 
> void *vmalloc_nod_user(unsigned long size, int node)
> {
> 	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
> 				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
> 				    VM_USERMAP, node,
> 				    __builtin_return_address(0));
> }
> 
> instead of creating a _vmalloc_node_user().

No issue with me either. I had done it like this as there are basically
two caller wit the same flags - vmalloc_user(size, NUMA_NO_NODE) and the new
vmalloc_node_user(size, node).

> 
> Also as an aside, this is definitely being used by this series, but I think it
> would be good to go ahead and send this by itself with just the explanation that
> it's going to be used by the fuse iouring stuff later, that way you can get this
> merged and continue working on the iouring part.

Thanks for your advise, will submit it separately. If the for now used export is
acceptable it would also help me, as we have back ports of these patches.

> 
> This also goes for the other prep patches earlier this this series, but since
> those are fuse related it's probably fine to just keep shipping them with this
> series.  Thanks,


Thanks again for your help and reviews!

Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-05-29 18:00 ` [PATCH RFC v2 06/19] Add a vmalloc_node_user function Bernd Schubert
  2024-05-30 15:10   ` Josef Bacik
@ 2024-05-31 13:56   ` Christoph Hellwig
  2024-06-03 15:59     ` Kent Overstreet
  1 sibling, 1 reply; 113+ messages in thread
From: Christoph Hellwig @ 2024-05-31 13:56 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert,
	Andrew Morton, linux-mm

On Wed, May 29, 2024 at 08:00:41PM +0200, Bernd Schubert wrote:
> This is to have a numa aware vmalloc function for memory exposed to
> userspace. Fuse uring will allocate queue memory using this
> new function.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: linux-mm@kvack.org
> Acked-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/vmalloc.h |  1 +
>  mm/nommu.c              |  6 ++++++
>  mm/vmalloc.c            | 41 +++++++++++++++++++++++++++++++++++++----
>  3 files changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 98ea90e90439..e7645702074e 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -141,6 +141,7 @@ static inline unsigned long vmalloc_nr_pages(void) { return 0; }
>  extern void *vmalloc(unsigned long size) __alloc_size(1);
>  extern void *vzalloc(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_user(unsigned long size) __alloc_size(1);
> +extern void *vmalloc_node_user(unsigned long size, int node) __alloc_size(1);
>  extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vmalloc_32(unsigned long size) __alloc_size(1);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 5ec8f44e7ce9..207ddf639aa9 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -185,6 +185,12 @@ void *vmalloc_user(unsigned long size)
>  }
>  EXPORT_SYMBOL(vmalloc_user);
>  
> +void *vmalloc_node_user(unsigned long size, int node)
> +{
> +	return __vmalloc_user_flags(size, GFP_KERNEL | __GFP_ZERO);
> +}
> +EXPORT_SYMBOL(vmalloc_node_user);
> +
>  struct page *vmalloc_to_page(const void *addr)
>  {
>  	return virt_to_page(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 68fa001648cc..0ac2f44b2b1f 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3958,6 +3958,25 @@ void *vzalloc(unsigned long size)
>  }
>  EXPORT_SYMBOL(vzalloc);
>  
> +/**
> + * _vmalloc_node_user - allocate zeroed virtually contiguous memory for userspace

Please avoid the overly long line.

> + * on the given numa node
> + * @size: allocation size
> + * @node: numa node
> + *
> + * The resulting memory area is zeroed so it can be mapped to userspace
> + * without leaking data.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +static void *_vmalloc_node_user(unsigned long size, int node)

Although for static functions kerneldoc comments are pretty silly
to start with.

>  void *vmalloc_user(unsigned long size)
>  {
> -	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
> -				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
> -				    VM_USERMAP, NUMA_NO_NODE,
> -				    __builtin_return_address(0));
> +	return _vmalloc_node_user(size, NUMA_NO_NODE);

But I suspect simply adding a gfp_t argument to vmalloc_node might be
a much easier to use interface here, even if it would need a sanity
check to only allow for actually useful to vmalloc flags.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-05-31 13:56   ` Christoph Hellwig
@ 2024-06-03 15:59     ` Kent Overstreet
  2024-06-03 19:24       ` Bernd Schubert
  2024-06-04  4:08       ` Christoph Hellwig
  0 siblings, 2 replies; 113+ messages in thread
From: Kent Overstreet @ 2024-06-03 15:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	bernd.schubert, Andrew Morton, linux-mm

On Fri, May 31, 2024 at 06:56:01AM -0700, Christoph Hellwig wrote:
> >  void *vmalloc_user(unsigned long size)
> >  {
> > -	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
> > -				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
> > -				    VM_USERMAP, NUMA_NO_NODE,
> > -				    __builtin_return_address(0));
> > +	return _vmalloc_node_user(size, NUMA_NO_NODE);
> 
> But I suspect simply adding a gfp_t argument to vmalloc_node might be
> a much easier to use interface here, even if it would need a sanity
> check to only allow for actually useful to vmalloc flags.

vmalloc doesn't properly support gfp flags due to page table allocation

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-06-03 15:59     ` Kent Overstreet
@ 2024-06-03 19:24       ` Bernd Schubert
  2024-06-04  4:20         ` Christoph Hellwig
  2024-06-04  4:08       ` Christoph Hellwig
  1 sibling, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-06-03 19:24 UTC (permalink / raw)
  To: Kent Overstreet, Christoph Hellwig
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	bernd.schubert@fastmail.fm, Andrew Morton, linux-mm@kvack.org

On 6/3/24 17:59, Kent Overstreet wrote:
> On Fri, May 31, 2024 at 06:56:01AM -0700, Christoph Hellwig wrote:
>>>  void *vmalloc_user(unsigned long size)
>>>  {
>>> -	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
>>> -				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
>>> -				    VM_USERMAP, NUMA_NO_NODE,
>>> -				    __builtin_return_address(0));
>>> +	return _vmalloc_node_user(size, NUMA_NO_NODE);
>>
>> But I suspect simply adding a gfp_t argument to vmalloc_node might be
>> a much easier to use interface here, even if it would need a sanity
>> check to only allow for actually useful to vmalloc flags.
> 
> vmalloc doesn't properly support gfp flags due to page table allocation

Thanks Kent, I had actually totally misunderstood what Christoph meant. 


I might miss something, but vmalloc_node looks quite different to
vmalloc_user / vmalloc_node_user


 void *vmalloc_user(unsigned long size)
 {
       return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
                                   GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
                                   VM_USERMAP, NUMA_NO_NODE,
                                   __builtin_return_address(0));
 }



vs


void *__vmalloc_node(unsigned long size, unsigned long align,
                            gfp_t gfp_mask, int node, const void *caller)
{
        return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
                                gfp_mask, PAGE_KERNEL, 0, node, caller);
}


void *vmalloc_node(unsigned long size, int node)
{
        return __vmalloc_node(size, 1, GFP_KERNEL, node,
                        __builtin_return_address(0));
}




If we wanted to avoid another export, shouldn't we better rename
vmalloc_user to vmalloc_node_user, add the node argument and change
all callers?

Anyway, I will send the current patch separately to linux-mm and will ask
if it can get merged before the fuse patches.


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-06-03 19:24       ` Bernd Schubert
@ 2024-06-04  4:20         ` Christoph Hellwig
  2024-06-07  2:30           ` Dave Chinner
  0 siblings, 1 reply; 113+ messages in thread
From: Christoph Hellwig @ 2024-06-04  4:20 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Kent Overstreet, Christoph Hellwig, Miklos Szeredi,
	Amir Goldstein, linux-fsdevel@vger.kernel.org,
	bernd.schubert@fastmail.fm, Andrew Morton, linux-mm@kvack.org

On Mon, Jun 03, 2024 at 07:24:03PM +0000, Bernd Schubert wrote:
> void *vmalloc_node(unsigned long size, int node)
> {
>         return __vmalloc_node(size, 1, GFP_KERNEL, node,
>                         __builtin_return_address(0));
> }
> 
> 
> 
> 
> If we wanted to avoid another export, shouldn't we better rename
> vmalloc_user to vmalloc_node_user, add the node argument and change
> all callers?
> 
> Anyway, I will send the current patch separately to linux-mm and will ask
> if it can get merged before the fuse patches.

Well, the GFP flags exist to avoid needing a gazillion of variants of
everything build around the page allocator.  For vmalloc we can't, as
Kent rightly said, support GFP_NOFS and GFP_NOIO and need to use the
scopes instead, and we should warn about that (which __vmalloc doesn't
and could use some fixes for).


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-06-04  4:20         ` Christoph Hellwig
@ 2024-06-07  2:30           ` Dave Chinner
  2024-06-07  4:49             ` Christoph Hellwig
  0 siblings, 1 reply; 113+ messages in thread
From: Dave Chinner @ 2024-06-07  2:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bernd Schubert, Kent Overstreet, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, bernd.schubert@fastmail.fm,
	Andrew Morton, linux-mm@kvack.org

On Mon, Jun 03, 2024 at 09:20:10PM -0700, Christoph Hellwig wrote:
> On Mon, Jun 03, 2024 at 07:24:03PM +0000, Bernd Schubert wrote:
> > void *vmalloc_node(unsigned long size, int node)
> > {
> >         return __vmalloc_node(size, 1, GFP_KERNEL, node,
> >                         __builtin_return_address(0));
> > }
> > 
> > 
> > 
> > 
> > If we wanted to avoid another export, shouldn't we better rename
> > vmalloc_user to vmalloc_node_user, add the node argument and change
> > all callers?
> > 
> > Anyway, I will send the current patch separately to linux-mm and will ask
> > if it can get merged before the fuse patches.
> 
> Well, the GFP flags exist to avoid needing a gazillion of variants of
> everything build around the page allocator.  For vmalloc we can't, as
> Kent rightly said, support GFP_NOFS and GFP_NOIO and need to use the
> scopes instead, and we should warn about that (which __vmalloc doesn't
> and could use some fixes for).

Perhaps before going any further here, we should refresh our
memories on what the vmalloc code actually does these days?
__vmalloc_area_node() does this when mapping the pages:


	/*
         * page tables allocations ignore external gfp mask, enforce it
         * by the scope API
         */
        if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
                flags = memalloc_nofs_save();
        else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
                flags = memalloc_noio_save();

        do {
                ret = vmap_pages_range(addr, addr + size, prot, area->pages,
                        page_shift);
                if (nofail && (ret < 0))
                        schedule_timeout_uninterruptible(1);
        } while (nofail && (ret < 0));

        if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
                memalloc_nofs_restore(flags);
        else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
                memalloc_noio_restore(flags);


IOWs, vmalloc() has obeyed GFP_NOFS/GFP_NOIO constraints properly
for since early 2022 and there isn't a need to wrap it with scopes
just to do a single constrained allocation:

commit 451769ebb7e792c3404db53b3c2a422990de654e
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Jan 14 14:06:57 2022 -0800

    mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc
    
    Patch series "extend vmalloc support for constrained allocations", v2.
    
    Based on a recent discussion with Dave and Neil [1] I have tried to
    implement NOFS, NOIO, NOFAIL support for the vmalloc to make life of
    kvmalloc users easier.

.....

    Add support for GFP_NOFS and GFP_NOIO to vmalloc directly.  All internal
    allocations already comply with the given gfp_mask.  The only current
    exception is vmap_pages_range which maps kernel page tables.  Infer the
    proper scope API based on the given gfp mask.
.....

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-06-07  2:30           ` Dave Chinner
@ 2024-06-07  4:49             ` Christoph Hellwig
  0 siblings, 0 replies; 113+ messages in thread
From: Christoph Hellwig @ 2024-06-07  4:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Bernd Schubert, Kent Overstreet,
	Miklos Szeredi, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	bernd.schubert@fastmail.fm, Andrew Morton, linux-mm@kvack.org

On Fri, Jun 07, 2024 at 12:30:27PM +1000, Dave Chinner wrote:
> IOWs, vmalloc() has obeyed GFP_NOFS/GFP_NOIO constraints properly
> for since early 2022 and there isn't a need to wrap it with scopes
> just to do a single constrained allocation:

Perfect.  Doesn't change that we still need some amount of filterting,
e.g. GFP_COMP and vmalloc won't mix too well.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 06/19] Add a vmalloc_node_user function
  2024-06-03 15:59     ` Kent Overstreet
  2024-06-03 19:24       ` Bernd Schubert
@ 2024-06-04  4:08       ` Christoph Hellwig
  1 sibling, 0 replies; 113+ messages in thread
From: Christoph Hellwig @ 2024-06-04  4:08 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, bernd.schubert, Andrew Morton, linux-mm

On Mon, Jun 03, 2024 at 11:59:01AM -0400, Kent Overstreet wrote:
> On Fri, May 31, 2024 at 06:56:01AM -0700, Christoph Hellwig wrote:
> > >  void *vmalloc_user(unsigned long size)
> > >  {
> > > -	return __vmalloc_node_range(size, SHMLBA,  VMALLOC_START, VMALLOC_END,
> > > -				    GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
> > > -				    VM_USERMAP, NUMA_NO_NODE,
> > > -				    __builtin_return_address(0));
> > > +	return _vmalloc_node_user(size, NUMA_NO_NODE);
> > 
> > But I suspect simply adding a gfp_t argument to vmalloc_node might be
> > a much easier to use interface here, even if it would need a sanity
> > check to only allow for actually useful to vmalloc flags.
> 
> vmalloc doesn't properly support gfp flags due to page table allocation

Which I tried to cover by the above "to only allow for actually useful
to vmalloc flags".  I.e. the __GFP_ZERO used here is useful, as would be
a GFP_USER which we'd probably actually want here as well.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 07/19] fuse uring: Add an mmap method
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (5 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 06/19] Add a vmalloc_node_user function Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 15:37   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl Bernd Schubert
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c             |   3 ++
 fs/fuse/dev_uring.c       | 114 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h     |  22 +++++++++
 include/uapi/linux/fuse.h |   3 ++
 4 files changed, 142 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index bc77413932cf..349c1d16b0df 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2470,6 +2470,9 @@ const struct file_operations fuse_dev_operations = {
 	.fasync		= fuse_dev_fasync,
 	.unlocked_ioctl = fuse_dev_ioctl,
 	.compat_ioctl   = compat_ptr_ioctl,
+#if IS_ENABLED(CONFIG_FUSE_IO_URING)
+	.mmap		= fuse_uring_mmap,
+#endif
 };
 EXPORT_SYMBOL_GPL(fuse_dev_operations);
 
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 702a994cf192..9491bdaa5716 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -120,3 +120,117 @@ void fuse_uring_ring_destruct(struct fuse_ring *ring)
 
 	mutex_destroy(&ring->start_stop_lock);
 }
+
+static inline int fuse_uring_current_nodeid(void)
+{
+	int cpu;
+	const struct cpumask *proc_mask = current->cpus_ptr;
+
+	cpu = cpumask_first(proc_mask);
+
+	return cpu_to_node(cpu);
+}
+
+static char *fuse_uring_alloc_queue_buf(int size, int node)
+{
+	char *buf;
+
+	if (size <= 0) {
+		pr_info("Invalid queue buf size: %d.\n", size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	buf = vmalloc_node_user(size, node);
+	return buf ? buf : ERR_PTR(-ENOMEM);
+}
+
+/**
+ * fuse uring mmap, per ring qeuue.
+ * Userpsace maps a kernel allocated ring/queue buffer. For numa awareness,
+ * userspace needs to run the do the mapping from a core bound thread.
+ */
+int
+fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct fuse_dev *fud = fuse_get_dev(filp);
+	struct fuse_conn *fc;
+	struct fuse_ring *ring;
+	size_t sz = vma->vm_end - vma->vm_start;
+	int ret;
+	struct fuse_uring_mbuf *new_node = NULL;
+	void *buf = NULL;
+	int nodeid;
+
+	if (vma->vm_pgoff << PAGE_SHIFT != FUSE_URING_MMAP_OFF) {
+		pr_debug("Invalid offset, expected %llu got %lu\n",
+			 FUSE_URING_MMAP_OFF, vma->vm_pgoff << PAGE_SHIFT);
+		return -EINVAL;
+	}
+
+	if (!fud)
+		return -ENODEV;
+	fc = fud->fc;
+	ring = fc->ring;
+	if (!ring)
+		return -ENODEV;
+
+	nodeid = ring->numa_aware ? fuse_uring_current_nodeid() : NUMA_NO_NODE;
+
+	/* check if uring is configured and if the requested size matches */
+	if (ring->nr_queues == 0 || ring->queue_depth == 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (sz != ring->queue_buf_size) {
+		ret = -EINVAL;
+		pr_devel("mmap size mismatch, expected %zu got %zu\n",
+			 ring->queue_buf_size, sz);
+		goto out;
+	}
+
+	if (current->nr_cpus_allowed != 1 && ring->numa_aware) {
+		ret = -EINVAL;
+		pr_debug(
+			"Numa awareness, but thread has more than allowed cpu.\n");
+		goto out;
+	}
+
+	buf = fuse_uring_alloc_queue_buf(ring->queue_buf_size, nodeid);
+	if (IS_ERR(buf)) {
+		ret = PTR_ERR(buf);
+		goto out;
+	}
+
+	new_node = kmalloc(sizeof(*new_node), GFP_USER);
+	if (unlikely(new_node == NULL)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = remap_vmalloc_range(vma, buf, 0);
+	if (ret)
+		goto out;
+
+	mutex_lock(&ring->start_stop_lock);
+	/*
+	 * In this function we do not know the queue the buffer belongs to.
+	 * Later server side will pass the mmaped address, the kernel address
+	 * will be found through the map.
+	 */
+	new_node->kbuf = buf;
+	new_node->ubuf = (void *)vma->vm_start;
+	rb_add(&new_node->rb_node, &ring->mem_buf_map,
+	       fuse_uring_rb_tree_buf_less);
+	mutex_unlock(&ring->start_stop_lock);
+out:
+	if (ret) {
+		kfree(new_node);
+		vfree(buf);
+	}
+
+	pr_devel("%s: pid %d addr: %p sz: %zu  ret: %d\n", __func__,
+		 current->pid, (char *)vma->vm_start, sz, ret);
+
+	return ret;
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 58ab4671deff..c455ae0e729a 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -181,6 +181,7 @@ struct fuse_ring {
 
 void fuse_uring_abort_end_requests(struct fuse_ring *ring);
 int fuse_uring_conn_cfg(struct fuse_ring *ring, struct fuse_ring_config *rcfg);
+int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
 int fuse_uring_queue_cfg(struct fuse_ring *ring,
 			 struct fuse_ring_queue_config *qcfg);
 void fuse_uring_ring_destruct(struct fuse_ring *ring);
@@ -208,6 +209,27 @@ static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
 	kfree(ring);
 }
 
+static inline int fuse_uring_rb_tree_buf_cmp(const void *key,
+					     const struct rb_node *node)
+{
+	const struct fuse_uring_mbuf *entry =
+		rb_entry(node, struct fuse_uring_mbuf, rb_node);
+
+	if (key == entry->ubuf)
+		return 0;
+
+	return (unsigned long)key < (unsigned long)entry->ubuf ? -1 : 1;
+}
+
+static inline bool fuse_uring_rb_tree_buf_less(struct rb_node *node1,
+					       const struct rb_node *node2)
+{
+	const struct fuse_uring_mbuf *entry1 =
+		rb_entry(node1, struct fuse_uring_mbuf, rb_node);
+
+	return fuse_uring_rb_tree_buf_cmp(entry1->ubuf, node2) < 0;
+}
+
 static inline struct fuse_ring_queue *
 fuse_uring_get_queue(struct fuse_ring *ring, int qid)
 {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 0449640f2501..00d0154ec2da 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1259,4 +1259,7 @@ struct fuse_supp_groups {
 #define FUSE_RING_HEADER_BUF_SIZE 4096
 #define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
 
+/* The offset parameter is used to identify the request type */
+#define FUSE_URING_MMAP_OFF 0xf8000000ULL
+
 #endif /* _LINUX_FUSE_H */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 07/19] fuse uring: Add an mmap method
  2024-05-29 18:00 ` [PATCH RFC v2 07/19] fuse uring: Add an mmap method Bernd Schubert
@ 2024-05-30 15:37   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 15:37 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:42PM +0200, Bernd Schubert wrote:
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c             |   3 ++
>  fs/fuse/dev_uring.c       | 114 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h     |  22 +++++++++
>  include/uapi/linux/fuse.h |   3 ++
>  4 files changed, 142 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index bc77413932cf..349c1d16b0df 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2470,6 +2470,9 @@ const struct file_operations fuse_dev_operations = {
>  	.fasync		= fuse_dev_fasync,
>  	.unlocked_ioctl = fuse_dev_ioctl,
>  	.compat_ioctl   = compat_ptr_ioctl,
> +#if IS_ENABLED(CONFIG_FUSE_IO_URING)

I'm loathe to use

#if IS_ENABLED()

when we can use

#ifdef CONFIG_FUSE_IO_URING

which is more standard across the kernel.

> +	.mmap		= fuse_uring_mmap,
> +#endif
>  };
>  EXPORT_SYMBOL_GPL(fuse_dev_operations);
>  
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 702a994cf192..9491bdaa5716 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -120,3 +120,117 @@ void fuse_uring_ring_destruct(struct fuse_ring *ring)
>  
>  	mutex_destroy(&ring->start_stop_lock);
>  }
> +
> +static inline int fuse_uring_current_nodeid(void)
> +{
> +	int cpu;
> +	const struct cpumask *proc_mask = current->cpus_ptr;
> +
> +	cpu = cpumask_first(proc_mask);
> +
> +	return cpu_to_node(cpu);

You don't need this, just use numa_node_id();

> +}
> +
> +static char *fuse_uring_alloc_queue_buf(int size, int node)
> +{
> +	char *buf;
> +
> +	if (size <= 0) {
> +		pr_info("Invalid queue buf size: %d.\n", size);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	buf = vmalloc_node_user(size, node);
> +	return buf ? buf : ERR_PTR(-ENOMEM);
> +}

This is excessive, we base size off of ring->queue_buf_size, or the
fuse_uring_mmap() size we get from the vma, which I don't think can ever be 0 or
negative.  I think we just validate that ->queue_buf_size is always correct, and
if we're really worried about it in fuse_uring_mmap we validate that sz is
correct there, and then we just use vmalloc_node_user() directly instead of
having this helper.

> +
> +/**
> + * fuse uring mmap, per ring qeuue.
> + * Userpsace maps a kernel allocated ring/queue buffer. For numa awareness,
> + * userspace needs to run the do the mapping from a core bound thread.
> + */
> +int
> +fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)

I'm not seeing anywhere else in the fuse code that has this style, I'd prefer we
keep it consistent with the rest of the kernel and have

int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)

additionally you're using the docstyle strings without the actual docstyle
formatting, which is pissing off my git hooks that run checkpatch.  Not a big
deal, but if you're going to provide docstyle comments then please do it
formatted properly, or just do a normal

/*
 * fuse uring mmap....
 */

> +{
> +	struct fuse_dev *fud = fuse_get_dev(filp);
> +	struct fuse_conn *fc;
> +	struct fuse_ring *ring;
> +	size_t sz = vma->vm_end - vma->vm_start;
> +	int ret;
> +	struct fuse_uring_mbuf *new_node = NULL;
> +	void *buf = NULL;
> +	int nodeid;
> +
> +	if (vma->vm_pgoff << PAGE_SHIFT != FUSE_URING_MMAP_OFF) {
> +		pr_debug("Invalid offset, expected %llu got %lu\n",
> +			 FUSE_URING_MMAP_OFF, vma->vm_pgoff << PAGE_SHIFT);
> +		return -EINVAL;
> +	}
> +
> +	if (!fud)
> +		return -ENODEV;
> +	fc = fud->fc;
> +	ring = fc->ring;
> +	if (!ring)
> +		return -ENODEV;
> +
> +	nodeid = ring->numa_aware ? fuse_uring_current_nodeid() : NUMA_NO_NODE;

nodeid = ring->numa_awayre ? numa_node_id() : NUMA_NO_NODE;

> +
> +	/* check if uring is configured and if the requested size matches */
> +	if (ring->nr_queues == 0 || ring->queue_depth == 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (sz != ring->queue_buf_size) {
> +		ret = -EINVAL;
> +		pr_devel("mmap size mismatch, expected %zu got %zu\n",
> +			 ring->queue_buf_size, sz);
> +		goto out;
> +	}
> +
> +	if (current->nr_cpus_allowed != 1 && ring->numa_aware) {
> +		ret = -EINVAL;
> +		pr_debug(
> +			"Numa awareness, but thread has more than allowed cpu.\n");
> +		goto out;
> +	}
> +
> +	buf = fuse_uring_alloc_queue_buf(ring->queue_buf_size, nodeid);
> +	if (IS_ERR(buf)) {
> +		ret = PTR_ERR(buf);
> +		goto out;
> +	}

All of the above you can just return ret, you don't have to jump to out.

> +
> +	new_node = kmalloc(sizeof(*new_node), GFP_USER);
> +	if (unlikely(new_node == NULL)) {
> +		ret = -ENOMEM;
> +		goto out;

Here I would just

if (unlikely(new_node == NULL)) {
	vfree(buf);
	return -ENOMEM;
}

> +	}
> +
> +	ret = remap_vmalloc_range(vma, buf, 0);
> +	if (ret)
> +		goto out;

And since this is the only place we can fail with both things allocated I'd just

if (ret) {
	vfree(buf);
	kfree(new_node);
	return ret;
}

and then drop the bit below where you free the buffers if there's an error.

> +
> +	mutex_lock(&ring->start_stop_lock);
> +	/*
> +	 * In this function we do not know the queue the buffer belongs to.
> +	 * Later server side will pass the mmaped address, the kernel address
> +	 * will be found through the map.
> +	 */
> +	new_node->kbuf = buf;
> +	new_node->ubuf = (void *)vma->vm_start;
> +	rb_add(&new_node->rb_node, &ring->mem_buf_map,
> +	       fuse_uring_rb_tree_buf_less);
> +	mutex_unlock(&ring->start_stop_lock);
> +out:
> +	if (ret) {
> +		kfree(new_node);
> +		vfree(buf);
> +	}
> +
> +	pr_devel("%s: pid %d addr: %p sz: %zu  ret: %d\n", __func__,
> +		 current->pid, (char *)vma->vm_start, sz, ret);
> +
> +	return ret;
> +}
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 58ab4671deff..c455ae0e729a 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -181,6 +181,7 @@ struct fuse_ring {
>  
>  void fuse_uring_abort_end_requests(struct fuse_ring *ring);
>  int fuse_uring_conn_cfg(struct fuse_ring *ring, struct fuse_ring_config *rcfg);
> +int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
>  int fuse_uring_queue_cfg(struct fuse_ring *ring,
>  			 struct fuse_ring_queue_config *qcfg);
>  void fuse_uring_ring_destruct(struct fuse_ring *ring);
> @@ -208,6 +209,27 @@ static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
>  	kfree(ring);
>  }
>  
> +static inline int fuse_uring_rb_tree_buf_cmp(const void *key,
> +					     const struct rb_node *node)
> +{
> +	const struct fuse_uring_mbuf *entry =
> +		rb_entry(node, struct fuse_uring_mbuf, rb_node);
> +
> +	if (key == entry->ubuf)
> +		return 0;
> +
> +	return (unsigned long)key < (unsigned long)entry->ubuf ? -1 : 1;
> +}
> +
> +static inline bool fuse_uring_rb_tree_buf_less(struct rb_node *node1,
> +					       const struct rb_node *node2)
> +{
> +	const struct fuse_uring_mbuf *entry1 =
> +		rb_entry(node1, struct fuse_uring_mbuf, rb_node);
> +
> +	return fuse_uring_rb_tree_buf_cmp(entry1->ubuf, node2) < 0;
> +}
> +

These are only used in dev_uring.c, just put them in there instead of the header
file.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (6 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 07/19] fuse uring: Add an mmap method Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 15:54   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 09/19] fuse: {uring} Add a dev_release exception for fuse-over-io-uring Bernd Schubert
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c             | 10 +++++
 fs/fuse/dev_uring.c       | 95 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h     | 18 +++++++++
 fs/fuse/fuse_i.h          |  3 ++
 include/uapi/linux/fuse.h | 26 +++++++++++++
 5 files changed, 152 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 349c1d16b0df..78c05516da7f 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2395,6 +2395,12 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
 	if (res != 0)
 		return -EFAULT;
 
+	if (cfg.cmd == FUSE_URING_IOCTL_CMD_QUEUE_CFG) {
+		res = _fuse_dev_ioctl_clone(file, cfg.qconf.control_fd);
+		if (res != 0)
+			return res;
+	}
+
 	fud = fuse_get_dev(file);
 	if (fud == NULL)
 		return -ENODEV;
@@ -2424,6 +2430,10 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
 		if (res != 0)
 			return res;
 		break;
+		case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
+			fud->uring_dev = 1;
+			res = fuse_uring_queue_cfg(fc->ring, &cfg.qconf);
+		break;
 	default:
 		res = -EINVAL;
 	}
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 9491bdaa5716..2c0ccb378908 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -144,6 +144,39 @@ static char *fuse_uring_alloc_queue_buf(int size, int node)
 	return buf ? buf : ERR_PTR(-ENOMEM);
 }
 
+/*
+ * mmaped allocated buffers, but does not know which queue that is for
+ * This ioctl uses the userspace address as key to identify the kernel address
+ * and assign it to the kernel side of the queue.
+ */
+static int fuse_uring_ioctl_mem_reg(struct fuse_ring *ring,
+				    struct fuse_ring_queue *queue,
+				    uint64_t uaddr)
+{
+	struct rb_node *node;
+	struct fuse_uring_mbuf *entry;
+	int tag;
+
+	node = rb_find((const void *)uaddr, &ring->mem_buf_map,
+		       fuse_uring_rb_tree_buf_cmp);
+	if (!node)
+		return -ENOENT;
+	entry = rb_entry(node, struct fuse_uring_mbuf, rb_node);
+
+	rb_erase(node, &ring->mem_buf_map);
+
+	queue->queue_req_buf = entry->kbuf;
+
+	for (tag = 0; tag < ring->queue_depth; tag++) {
+		struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+		ent->rreq = entry->kbuf + tag * ring->req_buf_sz;
+	}
+
+	kfree(node);
+	return 0;
+}
+
 /**
  * fuse uring mmap, per ring qeuue.
  * Userpsace maps a kernel allocated ring/queue buffer. For numa awareness,
@@ -234,3 +267,65 @@ fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)
 
 	return ret;
 }
+
+int fuse_uring_queue_cfg(struct fuse_ring *ring,
+			 struct fuse_ring_queue_config *qcfg)
+{
+	int tag;
+	struct fuse_ring_queue *queue;
+
+	if (qcfg->qid >= ring->nr_queues) {
+		pr_info("fuse ring queue config: qid=%u >= nr-queues=%zu\n",
+			qcfg->qid, ring->nr_queues);
+		return -EINVAL;
+	}
+	queue = fuse_uring_get_queue(ring, qcfg->qid);
+
+	if (queue->configured) {
+		pr_info("fuse ring qid=%u already configured!\n", queue->qid);
+		return -EALREADY;
+	}
+
+	mutex_lock(&ring->start_stop_lock);
+	fuse_uring_ioctl_mem_reg(ring, queue, qcfg->uaddr);
+	mutex_unlock(&ring->start_stop_lock);
+
+	queue->qid = qcfg->qid;
+	queue->ring = ring;
+	spin_lock_init(&queue->lock);
+	INIT_LIST_HEAD(&queue->sync_fuse_req_queue);
+	INIT_LIST_HEAD(&queue->async_fuse_req_queue);
+
+	INIT_LIST_HEAD(&queue->sync_ent_avail_queue);
+	INIT_LIST_HEAD(&queue->async_ent_avail_queue);
+
+	INIT_LIST_HEAD(&queue->ent_in_userspace);
+
+	for (tag = 0; tag < ring->queue_depth; tag++) {
+		struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+		ent->queue = queue;
+		ent->tag = tag;
+		ent->fuse_req = NULL;
+
+		pr_devel("initialize qid=%d tag=%d queue=%p req=%p", qcfg->qid,
+			 tag, queue, ent);
+
+		ent->rreq->flags = 0;
+
+		ent->state = 0;
+		set_bit(FRRS_INIT, &ent->state);
+
+		INIT_LIST_HEAD(&ent->list);
+	}
+
+	queue->configured = 1;
+	ring->nr_queues_ioctl_init++;
+	if (ring->nr_queues_ioctl_init == ring->nr_queues) {
+		pr_devel("ring=%p nr-queues=%zu depth=%zu ioctl ready\n", ring,
+			 ring->nr_queues, ring->queue_depth);
+	}
+
+	return 0;
+}
+
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index c455ae0e729a..7a2f540d3ea5 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -16,6 +16,24 @@
 /* IORING_MAX_ENTRIES */
 #define FUSE_URING_MAX_QUEUE_DEPTH 32768
 
+enum fuse_ring_req_state {
+
+	/* request is basially initialized */
+	FRRS_INIT,
+
+	/* The ring request waits for a new fuse request */
+	FRRS_WAIT,
+
+	/* The ring req got assigned a fuse req */
+	FRRS_FUSE_REQ,
+
+	/* request is in or on the way to user space */
+	FRRS_USERSPACE,
+
+	/* request is released */
+	FRRS_FREED,
+};
+
 struct fuse_uring_mbuf {
 	struct rb_node rb_node;
 	void *kbuf; /* kernel allocated ring request buffer */
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d2b058ccb677..fadc51a22bb9 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -540,6 +540,9 @@ struct fuse_dev {
 
 	/** list entry on fc->devices */
 	struct list_head entry;
+
+	/** Is the device used for fuse-over-io-uring? */
+	unsigned int uring_dev : 1;
 };
 
 enum fuse_dax_mode {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 00d0154ec2da..88d4078c4171 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1262,4 +1262,30 @@ struct fuse_supp_groups {
 /* The offset parameter is used to identify the request type */
 #define FUSE_URING_MMAP_OFF 0xf8000000ULL
 
+/**
+ * This structure mapped onto the
+ */
+struct fuse_ring_req {
+	union {
+		/* The first 4K are command data */
+		char ring_header[FUSE_RING_HEADER_BUF_SIZE];
+
+		struct {
+			uint64_t flags;
+
+			/* enum fuse_ring_buf_cmd */
+			uint32_t in_out_arg_len;
+			uint32_t padding;
+
+			/* kernel fills in, reads out */
+			union {
+				struct fuse_in_header in;
+				struct fuse_out_header out;
+			};
+		};
+	};
+
+	char in_out_arg[];
+};
+
 #endif /* _LINUX_FUSE_H */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl
  2024-05-29 18:00 ` [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl Bernd Schubert
@ 2024-05-30 15:54   ` Josef Bacik
  2024-05-30 17:49     ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 15:54 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:43PM +0200, Bernd Schubert wrote:
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c             | 10 +++++
>  fs/fuse/dev_uring.c       | 95 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h     | 18 +++++++++
>  fs/fuse/fuse_i.h          |  3 ++
>  include/uapi/linux/fuse.h | 26 +++++++++++++
>  5 files changed, 152 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 349c1d16b0df..78c05516da7f 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2395,6 +2395,12 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>  	if (res != 0)
>  		return -EFAULT;
>  
> +	if (cfg.cmd == FUSE_URING_IOCTL_CMD_QUEUE_CFG) {
> +		res = _fuse_dev_ioctl_clone(file, cfg.qconf.control_fd);
> +		if (res != 0)
> +			return res;
> +	}
> +
>  	fud = fuse_get_dev(file);
>  	if (fud == NULL)
>  		return -ENODEV;
> @@ -2424,6 +2430,10 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>  		if (res != 0)
>  			return res;
>  		break;
> +		case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
> +			fud->uring_dev = 1;
> +			res = fuse_uring_queue_cfg(fc->ring, &cfg.qconf);
> +		break;
>  	default:
>  		res = -EINVAL;
>  	}
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 9491bdaa5716..2c0ccb378908 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -144,6 +144,39 @@ static char *fuse_uring_alloc_queue_buf(int size, int node)
>  	return buf ? buf : ERR_PTR(-ENOMEM);
>  }
>  
> +/*
> + * mmaped allocated buffers, but does not know which queue that is for
> + * This ioctl uses the userspace address as key to identify the kernel address
> + * and assign it to the kernel side of the queue.
> + */
> +static int fuse_uring_ioctl_mem_reg(struct fuse_ring *ring,
> +				    struct fuse_ring_queue *queue,
> +				    uint64_t uaddr)
> +{
> +	struct rb_node *node;
> +	struct fuse_uring_mbuf *entry;
> +	int tag;
> +
> +	node = rb_find((const void *)uaddr, &ring->mem_buf_map,
> +		       fuse_uring_rb_tree_buf_cmp);
> +	if (!node)
> +		return -ENOENT;
> +	entry = rb_entry(node, struct fuse_uring_mbuf, rb_node);
> +
> +	rb_erase(node, &ring->mem_buf_map);
> +
> +	queue->queue_req_buf = entry->kbuf;
> +
> +	for (tag = 0; tag < ring->queue_depth; tag++) {
> +		struct fuse_ring_ent *ent = &queue->ring_ent[tag];
> +
> +		ent->rreq = entry->kbuf + tag * ring->req_buf_sz;
> +	}
> +
> +	kfree(node);
> +	return 0;
> +}
> +
>  /**
>   * fuse uring mmap, per ring qeuue.
>   * Userpsace maps a kernel allocated ring/queue buffer. For numa awareness,
> @@ -234,3 +267,65 @@ fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)
>  
>  	return ret;
>  }
> +
> +int fuse_uring_queue_cfg(struct fuse_ring *ring,
> +			 struct fuse_ring_queue_config *qcfg)
> +{
> +	int tag;
> +	struct fuse_ring_queue *queue;
> +
> +	if (qcfg->qid >= ring->nr_queues) {
> +		pr_info("fuse ring queue config: qid=%u >= nr-queues=%zu\n",
> +			qcfg->qid, ring->nr_queues);
> +		return -EINVAL;
> +	}
> +	queue = fuse_uring_get_queue(ring, qcfg->qid);
> +
> +	if (queue->configured) {
> +		pr_info("fuse ring qid=%u already configured!\n", queue->qid);
> +		return -EALREADY;
> +	}
> +
> +	mutex_lock(&ring->start_stop_lock);
> +	fuse_uring_ioctl_mem_reg(ring, queue, qcfg->uaddr);
> +	mutex_unlock(&ring->start_stop_lock);

You're not handling the error here.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl
  2024-05-30 15:54   ` Josef Bacik
@ 2024-05-30 17:49     ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 17:49 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert; +Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/30/24 17:54, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:43PM +0200, Bernd Schubert wrote:
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev.c             | 10 +++++
>>  fs/fuse/dev_uring.c       | 95 +++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/fuse/dev_uring_i.h     | 18 +++++++++
>>  fs/fuse/fuse_i.h          |  3 ++
>>  include/uapi/linux/fuse.h | 26 +++++++++++++
>>  5 files changed, 152 insertions(+)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 349c1d16b0df..78c05516da7f 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -2395,6 +2395,12 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>>  	if (res != 0)
>>  		return -EFAULT;
>>  
>> +	if (cfg.cmd == FUSE_URING_IOCTL_CMD_QUEUE_CFG) {
>> +		res = _fuse_dev_ioctl_clone(file, cfg.qconf.control_fd);
>> +		if (res != 0)
>> +			return res;
>> +	}
>> +
>>  	fud = fuse_get_dev(file);
>>  	if (fud == NULL)
>>  		return -ENODEV;
>> @@ -2424,6 +2430,10 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>>  		if (res != 0)
>>  			return res;
>>  		break;
>> +		case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
>> +			fud->uring_dev = 1;
>> +			res = fuse_uring_queue_cfg(fc->ring, &cfg.qconf);
>> +		break;
>>  	default:
>>  		res = -EINVAL;
>>  	}
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index 9491bdaa5716..2c0ccb378908 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -144,6 +144,39 @@ static char *fuse_uring_alloc_queue_buf(int size, int node)
>>  	return buf ? buf : ERR_PTR(-ENOMEM);
>>  }
>>  
>> +/*
>> + * mmaped allocated buffers, but does not know which queue that is for
>> + * This ioctl uses the userspace address as key to identify the kernel address
>> + * and assign it to the kernel side of the queue.
>> + */
>> +static int fuse_uring_ioctl_mem_reg(struct fuse_ring *ring,
>> +				    struct fuse_ring_queue *queue,
>> +				    uint64_t uaddr)
>> +{
>> +	struct rb_node *node;
>> +	struct fuse_uring_mbuf *entry;
>> +	int tag;
>> +
>> +	node = rb_find((const void *)uaddr, &ring->mem_buf_map,
>> +		       fuse_uring_rb_tree_buf_cmp);
>> +	if (!node)
>> +		return -ENOENT;
>> +	entry = rb_entry(node, struct fuse_uring_mbuf, rb_node);
>> +
>> +	rb_erase(node, &ring->mem_buf_map);
>> +
>> +	queue->queue_req_buf = entry->kbuf;
>> +
>> +	for (tag = 0; tag < ring->queue_depth; tag++) {
>> +		struct fuse_ring_ent *ent = &queue->ring_ent[tag];
>> +
>> +		ent->rreq = entry->kbuf + tag * ring->req_buf_sz;
>> +	}
>> +
>> +	kfree(node);
>> +	return 0;
>> +}
>> +
>>  /**
>>   * fuse uring mmap, per ring qeuue.
>>   * Userpsace maps a kernel allocated ring/queue buffer. For numa awareness,
>> @@ -234,3 +267,65 @@ fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma)
>>  
>>  	return ret;
>>  }
>> +
>> +int fuse_uring_queue_cfg(struct fuse_ring *ring,
>> +			 struct fuse_ring_queue_config *qcfg)
>> +{
>> +	int tag;
>> +	struct fuse_ring_queue *queue;
>> +
>> +	if (qcfg->qid >= ring->nr_queues) {
>> +		pr_info("fuse ring queue config: qid=%u >= nr-queues=%zu\n",
>> +			qcfg->qid, ring->nr_queues);
>> +		return -EINVAL;
>> +	}
>> +	queue = fuse_uring_get_queue(ring, qcfg->qid);
>> +
>> +	if (queue->configured) {
>> +		pr_info("fuse ring qid=%u already configured!\n", queue->qid);
>> +		return -EALREADY;
>> +	}
>> +
>> +	mutex_lock(&ring->start_stop_lock);
>> +	fuse_uring_ioctl_mem_reg(ring, queue, qcfg->uaddr);
>> +	mutex_unlock(&ring->start_stop_lock);
> 
> You're not handling the error here.  Thanks,


Thanks again for all your reviews! All fixed up to here, except
vmalloc_node_user(), as you suggested, I will try to decouple it from
this series.

And d'oh! I didn't find the simple numa_node_id() function. Thanks so
much for pointing that out.

New branch is here:
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc3


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 09/19] fuse: {uring} Add a dev_release exception for fuse-over-io-uring
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (7 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 19:00   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 10/19] fuse: {uring} Handle SQEs - register commands Bernd Schubert
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

fuse-over-io-uring needs an implicit device clone, which is done per
queue to avoid hanging "umount" when daemon side is already terminated.
Reason is that fuse_dev_release() is not called when there are queued
(waiting) io_uring commands.
Solution is the implicit device clone and an exception in fuse_dev_release
for uring devices to abort the connection when only uring device
are left.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c         | 32 ++++++++++++++++++++++++++++++--
 fs/fuse/dev_uring_i.h | 13 +++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 78c05516da7f..cd5dc6ae9272 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2257,6 +2257,8 @@ int fuse_dev_release(struct inode *inode, struct file *file)
 		struct fuse_pqueue *fpq = &fud->pq;
 		LIST_HEAD(to_end);
 		unsigned int i;
+		int dev_cnt;
+		bool abort_conn = false;
 
 		spin_lock(&fpq->lock);
 		WARN_ON(!list_empty(&fpq->io));
@@ -2266,8 +2268,34 @@ int fuse_dev_release(struct inode *inode, struct file *file)
 
 		fuse_dev_end_requests(&to_end);
 
-		/* Are we the last open device? */
-		if (atomic_dec_and_test(&fc->dev_count)) {
+		/* Are we the last open device?  */
+		dev_cnt = atomic_dec_return(&fc->dev_count);
+		if (dev_cnt == 0)
+			abort_conn = true;
+
+		/*
+		 * Or is this with io_uring and only ring devices left?
+		 * These devices will not receive a ->release() as long as
+		 * there are io_uring_cmd's waiting and not completed
+		 * with io_uring_cmd_done yet
+		 */
+		if (fuse_uring_configured(fc)) {
+			struct fuse_dev *list_dev;
+			bool all_uring = true;
+
+			spin_lock(&fc->lock);
+			list_for_each_entry(list_dev, &fc->devices, entry) {
+				if (list_dev == fud)
+					continue;
+				if (!list_dev->uring_dev)
+					all_uring = false;
+			}
+			spin_unlock(&fc->lock);
+			if (all_uring)
+				abort_conn = true;
+		}
+
+		if (abort_conn) {
 			WARN_ON(fc->iq.fasync != NULL);
 			fuse_abort_conn(fc);
 		}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 7a2f540d3ea5..114e9c008013 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -261,6 +261,14 @@ fuse_uring_get_queue(struct fuse_ring *ring, int qid)
 	return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
 }
 
+static inline bool fuse_uring_configured(struct fuse_conn *fc)
+{
+	if (READ_ONCE(fc->ring) != NULL && fc->ring->configured)
+		return true;
+
+	return false;
+}
+
 #else /* CONFIG_FUSE_IO_URING */
 
 struct fuse_ring;
@@ -274,6 +282,11 @@ static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
 {
 }
 
+static inline bool fuse_uring_configured(struct fuse_conn *fc)
+{
+	return false;
+}
+
 #endif /* CONFIG_FUSE_IO_URING */
 
 #endif /* _FS_FUSE_DEV_URING_I_H */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 09/19] fuse: {uring} Add a dev_release exception for fuse-over-io-uring
  2024-05-29 18:00 ` [PATCH RFC v2 09/19] fuse: {uring} Add a dev_release exception for fuse-over-io-uring Bernd Schubert
@ 2024-05-30 19:00   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 19:00 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:44PM +0200, Bernd Schubert wrote:
> fuse-over-io-uring needs an implicit device clone, which is done per
> queue to avoid hanging "umount" when daemon side is already terminated.
> Reason is that fuse_dev_release() is not called when there are queued
> (waiting) io_uring commands.
> Solution is the implicit device clone and an exception in fuse_dev_release
> for uring devices to abort the connection when only uring device
> are left.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c         | 32 ++++++++++++++++++++++++++++++--
>  fs/fuse/dev_uring_i.h | 13 +++++++++++++
>  2 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 78c05516da7f..cd5dc6ae9272 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2257,6 +2257,8 @@ int fuse_dev_release(struct inode *inode, struct file *file)
>  		struct fuse_pqueue *fpq = &fud->pq;
>  		LIST_HEAD(to_end);
>  		unsigned int i;
> +		int dev_cnt;
> +		bool abort_conn = false;
>  
>  		spin_lock(&fpq->lock);
>  		WARN_ON(!list_empty(&fpq->io));
> @@ -2266,8 +2268,34 @@ int fuse_dev_release(struct inode *inode, struct file *file)
>  
>  		fuse_dev_end_requests(&to_end);
>  
> -		/* Are we the last open device? */
> -		if (atomic_dec_and_test(&fc->dev_count)) {
> +		/* Are we the last open device?  */
> +		dev_cnt = atomic_dec_return(&fc->dev_count);
> +		if (dev_cnt == 0)
> +			abort_conn = true;

You can just do

if (atomic_dec_and_test(&fc->dev_count))
	abort_conn = true;
else if (fuse_uring_configured(fc))
	abort_conn = fuse_uring_empty(fc);

and have fuse_uring_empty() do the work below to find if we're able to abort the
connection, so it's in it's own little helper.

> +
> +		/*
> +		 * Or is this with io_uring and only ring devices left?
> +		 * These devices will not receive a ->release() as long as
> +		 * there are io_uring_cmd's waiting and not completed
> +		 * with io_uring_cmd_done yet
> +		 */
> +		if (fuse_uring_configured(fc)) {
> +			struct fuse_dev *list_dev;
> +			bool all_uring = true;
> +
> +			spin_lock(&fc->lock);
> +			list_for_each_entry(list_dev, &fc->devices, entry) {
> +				if (list_dev == fud)
> +					continue;
> +				if (!list_dev->uring_dev)
> +					all_uring = false;
> +			}
> +			spin_unlock(&fc->lock);
> +			if (all_uring)
> +				abort_conn = true;
> +		}
> +
> +		if (abort_conn) {
>  			WARN_ON(fc->iq.fasync != NULL);
>  			fuse_abort_conn(fc);
>  		}
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index 7a2f540d3ea5..114e9c008013 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -261,6 +261,14 @@ fuse_uring_get_queue(struct fuse_ring *ring, int qid)
>  	return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
>  }
>  
> +static inline bool fuse_uring_configured(struct fuse_conn *fc)
> +{
> +	if (READ_ONCE(fc->ring) != NULL && fc->ring->configured)
> +		return true;

I see what you're trying to do here, and it is safe because you won't drop
fc->ring at this point, but it gives the illusion that it'll work if we race
with somebody who is freeing fc->ring, which isn't the case because you
immediately de-reference it again afterwards.

Using READ_ONCE/WRITE_ONCE for pointer access isn't actually safe unless you're
documenting it specifically, don't use it unless you really need lockless access
to the thing.

If we know that having fc means that fc->ring will be valid at all times then
the READ_ONCE is redundant and unnecessary, if we don't know that then this
needs more protection to make sure we don't suddenly lose fc->ring between the
two statements.

AFAICT if we have fc then ->ring will either be NULL or it won't be (once the
connection is established and running), so it's fine to just delete the
READ_ONCE/WRITE_ONCE things.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 10/19] fuse: {uring} Handle SQEs - register commands
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (8 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 09/19] fuse: {uring} Add a dev_release exception for fuse-over-io-uring Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 19:55   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer Bernd Schubert
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This adds basic support for ring SQEs (with opcode=IORING_OP_URING_CMD).
For now only FUSE_URING_REQ_FETCH is handled to register queue entries.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c             |   1 +
 fs/fuse/dev_uring.c       | 267 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h     |  12 +++
 include/uapi/linux/fuse.h |  33 ++++++
 4 files changed, 313 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd5dc6ae9272..05a87731b5c3 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2510,6 +2510,7 @@ const struct file_operations fuse_dev_operations = {
 	.compat_ioctl   = compat_ptr_ioctl,
 #if IS_ENABLED(CONFIG_FUSE_IO_URING)
 	.mmap		= fuse_uring_mmap,
+	.uring_cmd	= fuse_uring_cmd,
 #endif
 };
 EXPORT_SYMBOL_GPL(fuse_dev_operations);
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 2c0ccb378908..48b1118b64f4 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -31,6 +31,27 @@
 #include <linux/topology.h>
 #include <linux/io_uring/cmd.h>
 
+static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
+{
+	clear_bit(FRRS_USERSPACE, &ent->state);
+	list_del_init(&ent->list);
+}
+
+/* Update conn limits according to ring values */
+static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
+{
+	struct fuse_conn *fc = ring->fc;
+
+	WRITE_ONCE(fc->max_pages, min_t(unsigned int, fc->max_pages,
+					ring->req_arg_len / PAGE_SIZE));
+
+	/* This not ideal, as multiplication with nr_queue assumes the limit
+	 * gets reached when all queues are used, but a single threaded
+	 * application might already do that.
+	 */
+	WRITE_ONCE(fc->max_background, ring->nr_queues * ring->max_nr_async);
+}
+
 /*
  * Basic ring setup for this connection based on the provided configuration
  */
@@ -329,3 +350,249 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
 	return 0;
 }
 
+/*
+ * Put a ring request onto hold, it is no longer used for now.
+ */
+static void fuse_uring_ent_avail(struct fuse_ring_ent *ring_ent,
+				 struct fuse_ring_queue *queue)
+	__must_hold(&queue->lock)
+{
+	struct fuse_ring *ring = queue->ring;
+
+	/* unsets all previous flags - basically resets */
+	pr_devel("%s ring=%p qid=%d tag=%d state=%lu async=%d\n", __func__,
+		 ring, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+		 ring_ent->async);
+
+	if (WARN_ON(test_bit(FRRS_USERSPACE, &ring_ent->state))) {
+		pr_warn("%s qid=%d tag=%d state=%lu async=%d\n", __func__,
+			ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+			ring_ent->async);
+		return;
+	}
+
+	WARN_ON_ONCE(!list_empty(&ring_ent->list));
+
+	if (ring_ent->async)
+		list_add(&ring_ent->list, &queue->async_ent_avail_queue);
+	else
+		list_add(&ring_ent->list, &queue->sync_ent_avail_queue);
+
+	set_bit(FRRS_WAIT, &ring_ent->state);
+}
+
+/*
+ * fuse_uring_req_fetch command handling
+ */
+static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
+			    struct io_uring_cmd *cmd, unsigned int issue_flags)
+__must_hold(ring_ent->queue->lock)
+{
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	struct fuse_ring *ring = queue->ring;
+	int ret = 0;
+	int nr_ring_sqe;
+
+	/* register requests for foreground requests first, then backgrounds */
+	if (queue->nr_req_sync >= ring->max_nr_sync) {
+		queue->nr_req_async++;
+		ring_ent->async = 1;
+	} else
+		queue->nr_req_sync++;
+
+	fuse_uring_ent_avail(ring_ent, queue);
+
+	if (queue->nr_req_sync + queue->nr_req_async > ring->queue_depth) {
+		/* should be caught by ring state before and queue depth
+		 * check before
+		 */
+		WARN_ON(1);
+		pr_info("qid=%d tag=%d req cnt (fg=%d async=%d exceeds depth=%zu",
+			queue->qid, ring_ent->tag, queue->nr_req_sync,
+			queue->nr_req_async, ring->queue_depth);
+		ret = -ERANGE;
+	}
+
+	if (ret)
+		goto out; /* erange */
+
+	WRITE_ONCE(ring_ent->cmd, cmd);
+
+	nr_ring_sqe = ring->queue_depth * ring->nr_queues;
+	if (atomic_inc_return(&ring->nr_sqe_init) == nr_ring_sqe) {
+		fuse_uring_conn_cfg_limits(ring);
+		ring->ready = 1;
+	}
+
+out:
+	return ret;
+}
+
+static struct fuse_ring_queue *
+fuse_uring_get_verify_queue(struct fuse_ring *ring,
+			    const struct fuse_uring_cmd_req *cmd_req,
+			    unsigned int issue_flags)
+{
+	struct fuse_conn *fc = ring->fc;
+	struct fuse_ring_queue *queue;
+	int ret;
+
+	if (!(issue_flags & IO_URING_F_SQE128)) {
+		pr_info("qid=%d tag=%d SQE128 not set\n", cmd_req->qid,
+			cmd_req->tag);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (unlikely(!fc->connected)) {
+		ret = -ENOTCONN;
+		goto err;
+	}
+
+	if (unlikely(!ring->configured)) {
+		pr_info("command for a connection that is not ring configured\n");
+		ret = -ENODEV;
+		goto err;
+	}
+
+	if (unlikely(cmd_req->qid >= ring->nr_queues)) {
+		pr_devel("qid=%u >= nr-queues=%zu\n", cmd_req->qid,
+			 ring->nr_queues);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	queue = fuse_uring_get_queue(ring, cmd_req->qid);
+	if (unlikely(queue == NULL)) {
+		pr_info("Got NULL queue for qid=%d\n", cmd_req->qid);
+		ret = -EIO;
+		goto err;
+	}
+
+	if (unlikely(!queue->configured || queue->stopped)) {
+		pr_info("Ring or queue (qid=%u) not ready.\n", cmd_req->qid);
+		ret = -ENOTCONN;
+		goto err;
+	}
+
+	if (cmd_req->tag > ring->queue_depth) {
+		pr_info("tag=%u > queue-depth=%zu\n", cmd_req->tag,
+			ring->queue_depth);
+		ret = -EINVAL;
+		goto err;
+	}
+
+	return queue;
+
+err:
+	return ERR_PTR(ret);
+}
+
+/**
+ * Entry function from io_uring to handle the given passthrough command
+ * (op cocde IORING_OP_URING_CMD)
+ */
+int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe_cmd(cmd->sqe);
+	struct fuse_dev *fud = fuse_get_dev(cmd->file);
+	struct fuse_conn *fc = fud->fc;
+	struct fuse_ring *ring = fc->ring;
+	struct fuse_ring_queue *queue;
+	struct fuse_ring_ent *ring_ent = NULL;
+	u32 cmd_op = cmd->cmd_op;
+	int ret = 0;
+
+	if (!ring) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	queue = fuse_uring_get_verify_queue(ring, cmd_req, issue_flags);
+	if (IS_ERR(queue)) {
+		ret = PTR_ERR(queue);
+		goto out;
+	}
+
+	ring_ent = &queue->ring_ent[cmd_req->tag];
+
+	pr_devel("%s:%d received: cmd op %d qid %d (%p) tag %d  (%p)\n",
+		 __func__, __LINE__, cmd_op, cmd_req->qid, queue, cmd_req->tag,
+		 ring_ent);
+
+	spin_lock(&queue->lock);
+	if (unlikely(queue->stopped)) {
+		/* XXX how to ensure queue still exists? Add
+		 * an rw ring->stop lock? And take that at the beginning
+		 * of this function? Better would be to advise uring
+		 * not to call this function at all? Or free the queue memory
+		 * only, on daemon PF_EXITING?
+		 */
+		ret = -ENOTCONN;
+		goto err_unlock;
+	}
+
+	if (current == queue->server_task)
+		queue->uring_cmd_issue_flags = issue_flags;
+
+	switch (cmd_op) {
+	case FUSE_URING_REQ_FETCH:
+		if (queue->server_task == NULL) {
+			queue->server_task = current;
+			queue->uring_cmd_issue_flags = issue_flags;
+		}
+
+		/* No other bit must be set here */
+		if (ring_ent->state != BIT(FRRS_INIT)) {
+			pr_info_ratelimited(
+				"qid=%d tag=%d register req state %lu expected %lu",
+				cmd_req->qid, cmd_req->tag, ring_ent->state,
+				BIT(FRRS_INIT));
+			ret = -EINVAL;
+			goto err_unlock;
+		}
+
+		fuse_ring_ring_ent_unset_userspace(ring_ent);
+
+		ret = fuse_uring_fetch(ring_ent, cmd, issue_flags);
+		if (ret)
+			goto err_unlock;
+
+		/*
+		 * The ring entry is registered now and needs to be handled
+		 * for shutdown.
+		 */
+		atomic_inc(&ring->queue_refs);
+
+		spin_unlock(&queue->lock);
+		break;
+	default:
+		ret = -EINVAL;
+		pr_devel("Unknown uring command %d", cmd_op);
+		goto err_unlock;
+	}
+out:
+	pr_devel("uring cmd op=%d, qid=%d tag=%d ret=%d\n", cmd_op,
+		 cmd_req->qid, cmd_req->tag, ret);
+
+	if (ret < 0) {
+		if (ring_ent != NULL) {
+			pr_info_ratelimited("error: uring cmd op=%d, qid=%d tag=%d ret=%d\n",
+					    cmd_op, cmd_req->qid, cmd_req->tag,
+					    ret);
+
+			/* must not change the entry state, as userspace
+			 * might have sent random data, but valid requests
+			 * might be registered already - don't confuse those.
+			 */
+		}
+		io_uring_cmd_done(cmd, ret, 0, issue_flags);
+	}
+
+	return -EIOCBQUEUED;
+
+err_unlock:
+	spin_unlock(&queue->lock);
+	goto out;
+}
+
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 114e9c008013..b2be67bb2fa7 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -203,6 +203,7 @@ int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
 int fuse_uring_queue_cfg(struct fuse_ring *ring,
 			 struct fuse_ring_queue_config *qcfg);
 void fuse_uring_ring_destruct(struct fuse_ring *ring);
+int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 static inline void fuse_uring_conn_init(struct fuse_ring *ring,
 					struct fuse_conn *fc)
@@ -269,6 +270,11 @@ static inline bool fuse_uring_configured(struct fuse_conn *fc)
 	return false;
 }
 
+static inline bool fuse_per_core_queue(struct fuse_conn *fc)
+{
+	return fc->ring && fc->ring->per_core_queue;
+}
+
 #else /* CONFIG_FUSE_IO_URING */
 
 struct fuse_ring;
@@ -287,6 +293,12 @@ static inline bool fuse_uring_configured(struct fuse_conn *fc)
 	return false;
 }
 
+static inline bool fuse_per_core_queue(struct fuse_conn *fc)
+{
+	return false;
+}
+
+
 #endif /* CONFIG_FUSE_IO_URING */
 
 #endif /* _FS_FUSE_DEV_URING_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 88d4078c4171..379388c964a7 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1262,6 +1262,12 @@ struct fuse_supp_groups {
 /* The offset parameter is used to identify the request type */
 #define FUSE_URING_MMAP_OFF 0xf8000000ULL
 
+/*
+ * Request is background type. Daemon side is free to use this information
+ * to handle foreground/background CQEs with different priorities.
+ */
+#define FUSE_RING_REQ_FLAG_ASYNC (1ull << 0)
+
 /**
  * This structure mapped onto the
  */
@@ -1288,4 +1294,31 @@ struct fuse_ring_req {
 	char in_out_arg[];
 };
 
+/**
+ * sqe commands to the kernel
+ */
+enum fuse_uring_cmd {
+	FUSE_URING_REQ_INVALID = 0,
+
+	/* submit sqe to kernel to get a request */
+	FUSE_URING_REQ_FETCH = 1,
+
+	/* commit result and fetch next request */
+	FUSE_URING_REQ_COMMIT_AND_FETCH = 2,
+};
+
+/**
+ * In the 80B command area of the SQE.
+ */
+struct fuse_uring_cmd_req {
+	/* queue the command is for (queue index) */
+	uint16_t qid;
+
+	/* queue entry (array index) */
+	uint16_t tag;
+
+	/* pointer to struct fuse_uring_buf_req */
+	uint32_t flags;
+};
+
 #endif /* _LINUX_FUSE_H */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 10/19] fuse: {uring} Handle SQEs - register commands
  2024-05-29 18:00 ` [PATCH RFC v2 10/19] fuse: {uring} Handle SQEs - register commands Bernd Schubert
@ 2024-05-30 19:55   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 19:55 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:45PM +0200, Bernd Schubert wrote:
> This adds basic support for ring SQEs (with opcode=IORING_OP_URING_CMD).
> For now only FUSE_URING_REQ_FETCH is handled to register queue entries.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c             |   1 +
>  fs/fuse/dev_uring.c       | 267 ++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h     |  12 +++
>  include/uapi/linux/fuse.h |  33 ++++++
>  4 files changed, 313 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index cd5dc6ae9272..05a87731b5c3 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2510,6 +2510,7 @@ const struct file_operations fuse_dev_operations = {
>  	.compat_ioctl   = compat_ptr_ioctl,
>  #if IS_ENABLED(CONFIG_FUSE_IO_URING)
>  	.mmap		= fuse_uring_mmap,
> +	.uring_cmd	= fuse_uring_cmd,
>  #endif
>  };
>  EXPORT_SYMBOL_GPL(fuse_dev_operations);
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 2c0ccb378908..48b1118b64f4 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -31,6 +31,27 @@
>  #include <linux/topology.h>
>  #include <linux/io_uring/cmd.h>
>  
> +static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
> +{
> +	clear_bit(FRRS_USERSPACE, &ent->state);
> +	list_del_init(&ent->list);
> +}
> +
> +/* Update conn limits according to ring values */
> +static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
> +{
> +	struct fuse_conn *fc = ring->fc;
> +
> +	WRITE_ONCE(fc->max_pages, min_t(unsigned int, fc->max_pages,
> +					ring->req_arg_len / PAGE_SIZE));
> +
> +	/* This not ideal, as multiplication with nr_queue assumes the limit
> +	 * gets reached when all queues are used, but a single threaded
> +	 * application might already do that.
> +	 */
> +	WRITE_ONCE(fc->max_background, ring->nr_queues * ring->max_nr_async);
> +}
> +
>  /*
>   * Basic ring setup for this connection based on the provided configuration
>   */
> @@ -329,3 +350,249 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
>  	return 0;
>  }
>  
> +/*
> + * Put a ring request onto hold, it is no longer used for now.
> + */
> +static void fuse_uring_ent_avail(struct fuse_ring_ent *ring_ent,
> +				 struct fuse_ring_queue *queue)
> +	__must_hold(&queue->lock)

Sorry I'm just now bringing this up, but I'd love to see a

lockdep_assert_held(<whatever lock>);

in every place where you use __must_hold, so I get a nice big warning when I'm
running stuff.  I don't always run sparse, but I always test with lockdep on,
and that'll help me notice problems.

> +{
> +	struct fuse_ring *ring = queue->ring;
> +
> +	/* unsets all previous flags - basically resets */
> +	pr_devel("%s ring=%p qid=%d tag=%d state=%lu async=%d\n", __func__,
> +		 ring, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
> +		 ring_ent->async);
> +
> +	if (WARN_ON(test_bit(FRRS_USERSPACE, &ring_ent->state))) {
> +		pr_warn("%s qid=%d tag=%d state=%lu async=%d\n", __func__,
> +			ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
> +			ring_ent->async);
> +		return;
> +	}
> +
> +	WARN_ON_ONCE(!list_empty(&ring_ent->list));
> +
> +	if (ring_ent->async)
> +		list_add(&ring_ent->list, &queue->async_ent_avail_queue);
> +	else
> +		list_add(&ring_ent->list, &queue->sync_ent_avail_queue);
> +
> +	set_bit(FRRS_WAIT, &ring_ent->state);
> +}
> +
> +/*
> + * fuse_uring_req_fetch command handling
> + */
> +static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
> +			    struct io_uring_cmd *cmd, unsigned int issue_flags)
> +__must_hold(ring_ent->queue->lock)
> +{
> +	struct fuse_ring_queue *queue = ring_ent->queue;
> +	struct fuse_ring *ring = queue->ring;
> +	int ret = 0;
> +	int nr_ring_sqe;
> +
> +	/* register requests for foreground requests first, then backgrounds */
> +	if (queue->nr_req_sync >= ring->max_nr_sync) {
> +		queue->nr_req_async++;
> +		ring_ent->async = 1;
> +	} else
> +		queue->nr_req_sync++;

IIRC the style guidelines say if you use { in any part of the if, you've got to
use them for all of it.  But that may just be what we do in btrfs.  Normally I
wouldn't nit about it but I have comments elsewhere for this patch.

> +
> +	fuse_uring_ent_avail(ring_ent, queue);
> +
> +	if (queue->nr_req_sync + queue->nr_req_async > ring->queue_depth) {
> +		/* should be caught by ring state before and queue depth
> +		 * check before
> +		 */
> +		WARN_ON(1);
> +		pr_info("qid=%d tag=%d req cnt (fg=%d async=%d exceeds depth=%zu",
> +			queue->qid, ring_ent->tag, queue->nr_req_sync,
> +			queue->nr_req_async, ring->queue_depth);
> +		ret = -ERANGE;
> +	}
> +
> +	if (ret)
> +		goto out; /* erange */

This can just be

if (whatever) {
	WARN_ON_ONCE(1);
	return -ERANGE;
}

instead of the goto out thing.

> +
> +	WRITE_ONCE(ring_ent->cmd, cmd);
> +
> +	nr_ring_sqe = ring->queue_depth * ring->nr_queues;
> +	if (atomic_inc_return(&ring->nr_sqe_init) == nr_ring_sqe) {
> +		fuse_uring_conn_cfg_limits(ring);
> +		ring->ready = 1;
> +	}
> +
> +out:
> +	return ret;

And this can just be return 0 here with the above change.

> +}
> +
> +static struct fuse_ring_queue *
> +fuse_uring_get_verify_queue(struct fuse_ring *ring,
> +			    const struct fuse_uring_cmd_req *cmd_req,
> +			    unsigned int issue_flags)
> +{
> +	struct fuse_conn *fc = ring->fc;
> +	struct fuse_ring_queue *queue;
> +	int ret;
> +
> +	if (!(issue_flags & IO_URING_F_SQE128)) {
> +		pr_info("qid=%d tag=%d SQE128 not set\n", cmd_req->qid,
> +			cmd_req->tag);
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	if (unlikely(!fc->connected)) {
> +		ret = -ENOTCONN;
> +		goto err;
> +	}
> +
> +	if (unlikely(!ring->configured)) {
> +		pr_info("command for a connection that is not ring configured\n");
> +		ret = -ENODEV;
> +		goto err;
> +	}
> +
> +	if (unlikely(cmd_req->qid >= ring->nr_queues)) {
> +		pr_devel("qid=%u >= nr-queues=%zu\n", cmd_req->qid,
> +			 ring->nr_queues);
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	queue = fuse_uring_get_queue(ring, cmd_req->qid);
> +	if (unlikely(queue == NULL)) {
> +		pr_info("Got NULL queue for qid=%d\n", cmd_req->qid);
> +		ret = -EIO;
> +		goto err;
> +	}
> +
> +	if (unlikely(!queue->configured || queue->stopped)) {
> +		pr_info("Ring or queue (qid=%u) not ready.\n", cmd_req->qid);
> +		ret = -ENOTCONN;
> +		goto err;
> +	}
> +
> +	if (cmd_req->tag > ring->queue_depth) {
> +		pr_info("tag=%u > queue-depth=%zu\n", cmd_req->tag,
> +			ring->queue_depth);
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	return queue;
> +
> +err:
> +	return ERR_PTR(ret);

There's no cleanup here, so just make all the above

return ERR_PTR(-whatever)

instead of the goto err thing.

> +}
> +
> +/**
> + * Entry function from io_uring to handle the given passthrough command
> + * (op cocde IORING_OP_URING_CMD)
> + */

Docstyle thing.

> +int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> +{
> +	const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe_cmd(cmd->sqe);
> +	struct fuse_dev *fud = fuse_get_dev(cmd->file);
> +	struct fuse_conn *fc = fud->fc;
> +	struct fuse_ring *ring = fc->ring;
> +	struct fuse_ring_queue *queue;
> +	struct fuse_ring_ent *ring_ent = NULL;
> +	u32 cmd_op = cmd->cmd_op;
> +	int ret = 0;
> +
> +	if (!ring) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	queue = fuse_uring_get_verify_queue(ring, cmd_req, issue_flags);
> +	if (IS_ERR(queue)) {
> +		ret = PTR_ERR(queue);
> +		goto out;
> +	}
> +
> +	ring_ent = &queue->ring_ent[cmd_req->tag];
> +
> +	pr_devel("%s:%d received: cmd op %d qid %d (%p) tag %d  (%p)\n",
> +		 __func__, __LINE__, cmd_op, cmd_req->qid, queue, cmd_req->tag,
> +		 ring_ent);
> +
> +	spin_lock(&queue->lock);
> +	if (unlikely(queue->stopped)) {
> +		/* XXX how to ensure queue still exists? Add
> +		 * an rw ring->stop lock? And take that at the beginning
> +		 * of this function? Better would be to advise uring
> +		 * not to call this function at all? Or free the queue memory
> +		 * only, on daemon PF_EXITING?
> +		 */
> +		ret = -ENOTCONN;
> +		goto err_unlock;
> +	}
> +
> +	if (current == queue->server_task)
> +		queue->uring_cmd_issue_flags = issue_flags;
> +
> +	switch (cmd_op) {
> +	case FUSE_URING_REQ_FETCH:

This is all organized kind of oddly, I think I'd prefer if you put all the code
from above where we grab the queue lock and the bit below into a helper.

So instead of

spin_lock(&queue->lock);
blah

switch (cmd_op) {
case FUSE_URING_REQ_FETCH:
	blah
default:
	ret = -EINVAL;
}

you have

static int fuse_uring_req_fetch(queue, cmd, issue_flags)
{
	ring_ent = blah;
	spin_lock(&queue->lock);
	<blah>
	spin_unlock(&que->lock);
	return ret;
}

then

switch (cmd_op) {
case FUSE_URING_REQ_FETCH:
	ret = fuse_uring_req_fetch(queue, cmd, issue_flags);
	break;
default:
	ret = -EINVAL;
	break;
}

Alternatively just pushe all the queue stuff down into the case
FUSE_URING_REQ_FETCH part, but I think the helper is cleaner.

> +		if (queue->server_task == NULL) {
> +			queue->server_task = current;
> +			queue->uring_cmd_issue_flags = issue_flags;
> +		}
> +
> +		/* No other bit must be set here */
> +		if (ring_ent->state != BIT(FRRS_INIT)) {
> +			pr_info_ratelimited(
> +				"qid=%d tag=%d register req state %lu expected %lu",
> +				cmd_req->qid, cmd_req->tag, ring_ent->state,
> +				BIT(FRRS_INIT));
> +			ret = -EINVAL;
> +			goto err_unlock;
> +		}
> +
> +		fuse_ring_ring_ent_unset_userspace(ring_ent);
> +
> +		ret = fuse_uring_fetch(ring_ent, cmd, issue_flags);
> +		if (ret)
> +			goto err_unlock;
> +
> +		/*
> +		 * The ring entry is registered now and needs to be handled
> +		 * for shutdown.
> +		 */
> +		atomic_inc(&ring->queue_refs);
> +
> +		spin_unlock(&queue->lock);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +		pr_devel("Unknown uring command %d", cmd_op);
> +		goto err_unlock;
> +	}
> +out:
> +	pr_devel("uring cmd op=%d, qid=%d tag=%d ret=%d\n", cmd_op,
> +		 cmd_req->qid, cmd_req->tag, ret);
> +
> +	if (ret < 0) {
> +		if (ring_ent != NULL) {

You don't pull anything from ring_ent in the pr_info, so maybe drop the extra
if statement?  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (9 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 10/19] fuse: {uring} Handle SQEs - register commands Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 19:59   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 12/19] fuse: {uring} Add uring sqe commit and fetch support Bernd Schubert
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This adds support to existing fuse copy code to copy
from/to the ring buffer. The ring buffer is here mmaped
shared between kernel and userspace.

This also fuse_ prefixes the copy_out_args function

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c        | 60 ++++++++++++++++++++++++++++++----------------------
 fs/fuse/fuse_dev_i.h | 38 +++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 25 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 05a87731b5c3..a7d26440de39 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -637,21 +637,7 @@ static int unlock_request(struct fuse_req *req)
 	return err;
 }
 
-struct fuse_copy_state {
-	int write;
-	struct fuse_req *req;
-	struct iov_iter *iter;
-	struct pipe_buffer *pipebufs;
-	struct pipe_buffer *currbuf;
-	struct pipe_inode_info *pipe;
-	unsigned long nr_segs;
-	struct page *pg;
-	unsigned len;
-	unsigned offset;
-	unsigned move_pages:1;
-};
-
-static void fuse_copy_init(struct fuse_copy_state *cs, int write,
+void fuse_copy_init(struct fuse_copy_state *cs, int write,
 			   struct iov_iter *iter)
 {
 	memset(cs, 0, sizeof(*cs));
@@ -662,6 +648,7 @@ static void fuse_copy_init(struct fuse_copy_state *cs, int write,
 /* Unmap and put previous page of userspace buffer */
 static void fuse_copy_finish(struct fuse_copy_state *cs)
 {
+
 	if (cs->currbuf) {
 		struct pipe_buffer *buf = cs->currbuf;
 
@@ -726,6 +713,10 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			cs->pipebufs++;
 			cs->nr_segs++;
 		}
+	} else if (cs->is_uring) {
+		if (cs->ring.offset > cs->ring.buf_sz)
+			return -ERANGE;
+		cs->len = cs->ring.buf_sz - cs->ring.offset;
 	} else {
 		size_t off;
 		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
@@ -744,21 +735,35 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
 {
 	unsigned ncpy = min(*size, cs->len);
+
 	if (val) {
-		void *pgaddr = kmap_local_page(cs->pg);
-		void *buf = pgaddr + cs->offset;
+
+		void *pgaddr = NULL;
+		void *buf;
+
+		if (cs->is_uring) {
+			buf = cs->ring.buf + cs->ring.offset;
+			cs->ring.offset += ncpy;
+
+		} else {
+			pgaddr = kmap_local_page(cs->pg);
+			buf = pgaddr + cs->offset;
+		}
 
 		if (cs->write)
 			memcpy(buf, *val, ncpy);
 		else
 			memcpy(*val, buf, ncpy);
 
-		kunmap_local(pgaddr);
+		if (pgaddr)
+			kunmap_local(pgaddr);
+
 		*val += ncpy;
 	}
 	*size -= ncpy;
 	cs->len -= ncpy;
 	cs->offset += ncpy;
+
 	return ncpy;
 }
 
@@ -1006,9 +1011,9 @@ static int fuse_copy_one(struct fuse_copy_state *cs, void *val, unsigned size)
 }
 
 /* Copy request arguments to/from userspace buffer */
-static int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
-			  unsigned argpages, struct fuse_arg *args,
-			  int zeroing)
+int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
+		   unsigned int argpages, struct fuse_arg *args,
+		   int zeroing)
 {
 	int err = 0;
 	unsigned i;
@@ -1873,10 +1878,15 @@ static struct fuse_req *request_find(struct fuse_pqueue *fpq, u64 unique)
 	return NULL;
 }
 
-static int copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
-			 unsigned nbytes)
+int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
+		       unsigned int nbytes)
 {
-	unsigned reqsize = sizeof(struct fuse_out_header);
+
+	unsigned int reqsize = 0;
+
+	/* Uring has the out header outside of args */
+	if (!cs->is_uring)
+		reqsize = sizeof(struct fuse_out_header);
 
 	reqsize += fuse_len_args(args->out_numargs, args->out_args);
 
@@ -1976,7 +1986,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 	if (oh.error)
 		err = nbytes != sizeof(oh) ? -EINVAL : 0;
 	else
-		err = copy_out_args(cs, req->args, nbytes);
+		err = fuse_copy_out_args(cs, req->args, nbytes);
 	fuse_copy_finish(cs);
 
 	spin_lock(&fpq->lock);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index e6289bafb788..f3e69ab4c2be 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -13,6 +13,36 @@
 #define FUSE_INT_REQ_BIT (1ULL << 0)
 #define FUSE_REQ_ID_STEP (1ULL << 1)
 
+struct fuse_arg;
+struct fuse_args;
+
+struct fuse_copy_state {
+	int write;
+	struct fuse_req *req;
+	struct iov_iter *iter;
+	struct pipe_buffer *pipebufs;
+	struct pipe_buffer *currbuf;
+	struct pipe_inode_info *pipe;
+	unsigned long nr_segs;
+	struct page *pg;
+	unsigned int len;
+	unsigned int offset;
+	unsigned int move_pages:1, is_uring:1;
+	struct {
+		/* pointer into the ring buffer */
+		char *buf;
+
+		/* for copy to the ring request buffer, the buffer size - must
+		 * not be exceeded, for copy from the ring request buffer,
+		 * the size filled in by user space
+		 */
+		unsigned int buf_sz;
+
+		/* offset within buf while it is copying from/to the buf */
+		unsigned int offset;
+	} ring;
+};
+
 static inline struct fuse_dev *fuse_get_dev(struct file *file)
 {
 	/*
@@ -24,6 +54,14 @@ static inline struct fuse_dev *fuse_get_dev(struct file *file)
 
 void fuse_dev_end_requests(struct list_head *head);
 
+void fuse_copy_init(struct fuse_copy_state *cs, int write,
+			   struct iov_iter *iter);
+int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
+		   unsigned int argpages, struct fuse_arg *args,
+		   int zeroing);
+int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
+		       unsigned int nbytes);
+
 #endif
 
 

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer
  2024-05-29 18:00 ` [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer Bernd Schubert
@ 2024-05-30 19:59   ` Josef Bacik
  2024-09-01 11:56     ` Bernd Schubert
  2024-09-01 11:56     ` Bernd Schubert
  0 siblings, 2 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 19:59 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:46PM +0200, Bernd Schubert wrote:
> This adds support to existing fuse copy code to copy
> from/to the ring buffer. The ring buffer is here mmaped
> shared between kernel and userspace.
> 
> This also fuse_ prefixes the copy_out_args function
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c        | 60 ++++++++++++++++++++++++++++++----------------------
>  fs/fuse/fuse_dev_i.h | 38 +++++++++++++++++++++++++++++++++
>  2 files changed, 73 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 05a87731b5c3..a7d26440de39 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -637,21 +637,7 @@ static int unlock_request(struct fuse_req *req)
>  	return err;
>  }
>  
> -struct fuse_copy_state {
> -	int write;
> -	struct fuse_req *req;
> -	struct iov_iter *iter;
> -	struct pipe_buffer *pipebufs;
> -	struct pipe_buffer *currbuf;
> -	struct pipe_inode_info *pipe;
> -	unsigned long nr_segs;
> -	struct page *pg;
> -	unsigned len;
> -	unsigned offset;
> -	unsigned move_pages:1;
> -};
> -
> -static void fuse_copy_init(struct fuse_copy_state *cs, int write,
> +void fuse_copy_init(struct fuse_copy_state *cs, int write,
>  			   struct iov_iter *iter)
>  {
>  	memset(cs, 0, sizeof(*cs));
> @@ -662,6 +648,7 @@ static void fuse_copy_init(struct fuse_copy_state *cs, int write,
>  /* Unmap and put previous page of userspace buffer */
>  static void fuse_copy_finish(struct fuse_copy_state *cs)
>  {
> +

Extraneous newline.

>  	if (cs->currbuf) {
>  		struct pipe_buffer *buf = cs->currbuf;
>  
> @@ -726,6 +713,10 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>  			cs->pipebufs++;
>  			cs->nr_segs++;
>  		}
> +	} else if (cs->is_uring) {
> +		if (cs->ring.offset > cs->ring.buf_sz)
> +			return -ERANGE;
> +		cs->len = cs->ring.buf_sz - cs->ring.offset;
>  	} else {
>  		size_t off;
>  		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
> @@ -744,21 +735,35 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>  static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
>  {
>  	unsigned ncpy = min(*size, cs->len);
> +
>  	if (val) {
> -		void *pgaddr = kmap_local_page(cs->pg);
> -		void *buf = pgaddr + cs->offset;
> +
> +		void *pgaddr = NULL;
> +		void *buf;
> +
> +		if (cs->is_uring) {
> +			buf = cs->ring.buf + cs->ring.offset;
> +			cs->ring.offset += ncpy;
> +
> +		} else {
> +			pgaddr = kmap_local_page(cs->pg);
> +			buf = pgaddr + cs->offset;
> +		}
>  
>  		if (cs->write)
>  			memcpy(buf, *val, ncpy);
>  		else
>  			memcpy(*val, buf, ncpy);
>  
> -		kunmap_local(pgaddr);
> +		if (pgaddr)
> +			kunmap_local(pgaddr);
> +
>  		*val += ncpy;
>  	}
>  	*size -= ncpy;
>  	cs->len -= ncpy;
>  	cs->offset += ncpy;
> +

Extraneous newline.

Once those nits are fixed you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer
  2024-05-30 19:59   ` Josef Bacik
@ 2024-09-01 11:56     ` Bernd Schubert
  2024-09-01 11:56     ` Bernd Schubert
  1 sibling, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-09-01 11:56 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert; +Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/30/24 21:59, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:46PM +0200, Bernd Schubert wrote:
>> This adds support to existing fuse copy code to copy
>> from/to the ring buffer. The ring buffer is here mmaped
>> shared between kernel and userspace.
>>
>> This also fuse_ prefixes the copy_out_args function
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev.c        | 60 ++++++++++++++++++++++++++++++----------------------
>>  fs/fuse/fuse_dev_i.h | 38 +++++++++++++++++++++++++++++++++
>>  2 files changed, 73 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 05a87731b5c3..a7d26440de39 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -637,21 +637,7 @@ static int unlock_request(struct fuse_req *req)
>>  	return err;
>>  }
>>  
>> -struct fuse_copy_state {
>> -	int write;
>> -	struct fuse_req *req;
>> -	struct iov_iter *iter;
>> -	struct pipe_buffer *pipebufs;
>> -	struct pipe_buffer *currbuf;
>> -	struct pipe_inode_info *pipe;
>> -	unsigned long nr_segs;
>> -	struct page *pg;
>> -	unsigned len;
>> -	unsigned offset;
>> -	unsigned move_pages:1;
>> -};
>> -
>> -static void fuse_copy_init(struct fuse_copy_state *cs, int write,
>> +void fuse_copy_init(struct fuse_copy_state *cs, int write,
>>  			   struct iov_iter *iter)
>>  {
>>  	memset(cs, 0, sizeof(*cs));
>> @@ -662,6 +648,7 @@ static void fuse_copy_init(struct fuse_copy_state *cs, int write,
>>  /* Unmap and put previous page of userspace buffer */
>>  static void fuse_copy_finish(struct fuse_copy_state *cs)
>>  {
>> +
> 
> Extraneous newline.
> 
>>  	if (cs->currbuf) {
>>  		struct pipe_buffer *buf = cs->currbuf;
>>  
>> @@ -726,6 +713,10 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>>  			cs->pipebufs++;
>>  			cs->nr_segs++;
>>  		}
>> +	} else if (cs->is_uring) {
>> +		if (cs->ring.offset > cs->ring.buf_sz)
>> +			return -ERANGE;
>> +		cs->len = cs->ring.buf_sz - cs->ring.offset;
>>  	} else {
>>  		size_t off;
>>  		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
>> @@ -744,21 +735,35 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>>  static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
>>  {
>>  	unsigned ncpy = min(*size, cs->len);
>> +
>>  	if (val) {
>> -		void *pgaddr = kmap_local_page(cs->pg);
>> -		void *buf = pgaddr + cs->offset;
>> +
>> +		void *pgaddr = NULL;
>> +		void *buf;
>> +
>> +		if (cs->is_uring) {
>> +			buf = cs->ring.buf + cs->ring.offset;
>> +			cs->ring.offset += ncpy;
>> +
>> +		} else {
>> +			pgaddr = kmap_local_page(cs->pg);
>> +			buf = pgaddr + cs->offset;
>> +		}
>>  
>>  		if (cs->write)
>>  			memcpy(buf, *val, ncpy);
>>  		else
>>  			memcpy(*val, buf, ncpy);
>>  
>> -		kunmap_local(pgaddr);
>> +		if (pgaddr)
>> +			kunmap_local(pgaddr);
>> +
>>  		*val += ncpy;
>>  	}
>>  	*size -= ncpy;
>>  	cs->len -= ncpy;
>>  	cs->offset += ncpy;
>> +
> 
> Extraneous newline.
> 
> Once those nits are fixed you can add
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks again for your reviews! I won't add this for now, as there are
too many changes after removing mmap.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer
  2024-05-30 19:59   ` Josef Bacik
  2024-09-01 11:56     ` Bernd Schubert
@ 2024-09-01 11:56     ` Bernd Schubert
  1 sibling, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-09-01 11:56 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert; +Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/30/24 21:59, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:46PM +0200, Bernd Schubert wrote:
>> This adds support to existing fuse copy code to copy
>> from/to the ring buffer. The ring buffer is here mmaped
>> shared between kernel and userspace.
>>
>> This also fuse_ prefixes the copy_out_args function
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev.c        | 60 ++++++++++++++++++++++++++++++----------------------
>>  fs/fuse/fuse_dev_i.h | 38 +++++++++++++++++++++++++++++++++
>>  2 files changed, 73 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 05a87731b5c3..a7d26440de39 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -637,21 +637,7 @@ static int unlock_request(struct fuse_req *req)
>>  	return err;
>>  }
>>  
>> -struct fuse_copy_state {
>> -	int write;
>> -	struct fuse_req *req;
>> -	struct iov_iter *iter;
>> -	struct pipe_buffer *pipebufs;
>> -	struct pipe_buffer *currbuf;
>> -	struct pipe_inode_info *pipe;
>> -	unsigned long nr_segs;
>> -	struct page *pg;
>> -	unsigned len;
>> -	unsigned offset;
>> -	unsigned move_pages:1;
>> -};
>> -
>> -static void fuse_copy_init(struct fuse_copy_state *cs, int write,
>> +void fuse_copy_init(struct fuse_copy_state *cs, int write,
>>  			   struct iov_iter *iter)
>>  {
>>  	memset(cs, 0, sizeof(*cs));
>> @@ -662,6 +648,7 @@ static void fuse_copy_init(struct fuse_copy_state *cs, int write,
>>  /* Unmap and put previous page of userspace buffer */
>>  static void fuse_copy_finish(struct fuse_copy_state *cs)
>>  {
>> +
> 
> Extraneous newline.
> 
>>  	if (cs->currbuf) {
>>  		struct pipe_buffer *buf = cs->currbuf;
>>  
>> @@ -726,6 +713,10 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>>  			cs->pipebufs++;
>>  			cs->nr_segs++;
>>  		}
>> +	} else if (cs->is_uring) {
>> +		if (cs->ring.offset > cs->ring.buf_sz)
>> +			return -ERANGE;
>> +		cs->len = cs->ring.buf_sz - cs->ring.offset;
>>  	} else {
>>  		size_t off;
>>  		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
>> @@ -744,21 +735,35 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>>  static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
>>  {
>>  	unsigned ncpy = min(*size, cs->len);
>> +
>>  	if (val) {
>> -		void *pgaddr = kmap_local_page(cs->pg);
>> -		void *buf = pgaddr + cs->offset;
>> +
>> +		void *pgaddr = NULL;
>> +		void *buf;
>> +
>> +		if (cs->is_uring) {
>> +			buf = cs->ring.buf + cs->ring.offset;
>> +			cs->ring.offset += ncpy;
>> +
>> +		} else {
>> +			pgaddr = kmap_local_page(cs->pg);
>> +			buf = pgaddr + cs->offset;
>> +		}
>>  
>>  		if (cs->write)
>>  			memcpy(buf, *val, ncpy);
>>  		else
>>  			memcpy(*val, buf, ncpy);
>>  
>> -		kunmap_local(pgaddr);
>> +		if (pgaddr)
>> +			kunmap_local(pgaddr);
>> +
>>  		*val += ncpy;
>>  	}
>>  	*size -= ncpy;
>>  	cs->len -= ncpy;
>>  	cs->offset += ncpy;
>> +
> 
> Extraneous newline.
> 
> Once those nits are fixed you can add
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks again for your reviews! I won't add this for now, as there are
too many changes after removing mmap.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 12/19] fuse: {uring} Add uring sqe commit and fetch support
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (10 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 20:08   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 13/19] fuse: {uring} Handle uring shutdown Bernd Schubert
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This adds support for fuse request completion through ring SQEs
(FUSE_URING_REQ_COMMIT_AND_FETCH handling). After committing
the ring entry it becomes available for new fuse requests.
Handling of requests through the ring (SQE/CQE handling)
is complete now.

Fuse request data are copied through the mmaped ring buffer,
there is no support for any zero copy yet.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 311 insertions(+)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 48b1118b64f4..5269b3f8891e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -31,12 +31,23 @@
 #include <linux/topology.h>
 #include <linux/io_uring/cmd.h>
 
+static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
+					    bool set_err, int error,
+					    unsigned int issue_flags);
+
 static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
 {
 	clear_bit(FRRS_USERSPACE, &ent->state);
 	list_del_init(&ent->list);
 }
 
+static void
+fuse_uring_async_send_to_ring(struct io_uring_cmd *cmd,
+			      unsigned int issue_flags)
+{
+	io_uring_cmd_done(cmd, 0, 0, issue_flags);
+}
+
 /* Update conn limits according to ring values */
 static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
 {
@@ -350,6 +361,188 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
 	return 0;
 }
 
+/*
+ * Checks for errors and stores it into the request
+ */
+static int fuse_uring_ring_ent_has_err(struct fuse_ring *ring,
+				       struct fuse_ring_ent *ring_ent)
+{
+	struct fuse_conn *fc = ring->fc;
+	struct fuse_req *req = ring_ent->fuse_req;
+	struct fuse_out_header *oh = &req->out.h;
+	int err;
+
+	if (oh->unique == 0) {
+		/* Not supportd through request based uring, this needs another
+		 * ring from user space to kernel
+		 */
+		pr_warn("Unsupported fuse-notify\n");
+		err = -EINVAL;
+		goto seterr;
+	}
+
+	if (oh->error <= -512 || oh->error > 0) {
+		err = -EINVAL;
+		goto seterr;
+	}
+
+	if (oh->error) {
+		err = oh->error;
+		pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__,
+			 err, req->args->opcode, req->out.h.error);
+		goto err; /* error already set */
+	}
+
+	if ((oh->unique & ~FUSE_INT_REQ_BIT) != req->in.h.unique) {
+		pr_warn("Unpexted seqno mismatch, expected: %llu got %llu\n",
+			req->in.h.unique, oh->unique & ~FUSE_INT_REQ_BIT);
+		err = -ENOENT;
+		goto seterr;
+	}
+
+	/* Is it an interrupt reply ID?	 */
+	if (oh->unique & FUSE_INT_REQ_BIT) {
+		err = 0;
+		if (oh->error == -ENOSYS)
+			fc->no_interrupt = 1;
+		else if (oh->error == -EAGAIN) {
+			/* XXX Interrupts not handled yet */
+			/* err = queue_interrupt(req); */
+			pr_warn("Intrerupt EAGAIN not supported yet");
+			err = -EINVAL;
+		}
+
+		goto seterr;
+	}
+
+	return 0;
+
+seterr:
+	pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__, err,
+		 req->args->opcode, req->out.h.error);
+	oh->error = err;
+err:
+	pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__, err,
+		 req->args->opcode, req->out.h.error);
+	return err;
+}
+
+/*
+ * Copy data from the ring buffer to the fuse request
+ */
+static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
+				     struct fuse_req *req,
+				     struct fuse_ring_req *rreq)
+{
+	struct fuse_copy_state cs;
+	struct fuse_args *args = req->args;
+
+	fuse_copy_init(&cs, 0, NULL);
+	cs.is_uring = 1;
+	cs.ring.buf = rreq->in_out_arg;
+
+	if (rreq->in_out_arg_len > ring->req_arg_len) {
+		pr_devel("Max ring buffer len exceeded (%u vs %zu\n",
+			 rreq->in_out_arg_len, ring->req_arg_len);
+		return -EINVAL;
+	}
+	cs.ring.buf_sz = rreq->in_out_arg_len;
+	cs.req = req;
+
+	pr_devel("%s:%d buf=%p len=%d args=%d\n", __func__, __LINE__,
+		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs);
+
+	return fuse_copy_out_args(&cs, args, rreq->in_out_arg_len);
+}
+
+/*
+ * Copy data from the req to the ring buffer
+ */
+static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
+				   struct fuse_ring_req *rreq)
+{
+	struct fuse_copy_state cs;
+	struct fuse_args *args = req->args;
+	int err;
+
+	fuse_copy_init(&cs, 1, NULL);
+	cs.is_uring = 1;
+	cs.ring.buf = rreq->in_out_arg;
+	cs.ring.buf_sz = ring->req_arg_len;
+	cs.req = req;
+
+	pr_devel("%s:%d buf=%p len=%d args=%d\n", __func__, __LINE__,
+		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs);
+
+	err = fuse_copy_args(&cs, args->in_numargs, args->in_pages,
+			     (struct fuse_arg *)args->in_args, 0);
+	rreq->in_out_arg_len = cs.ring.offset;
+
+	pr_devel("%s:%d buf=%p len=%d args=%d err=%d\n", __func__, __LINE__,
+		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs, err);
+
+	return err;
+}
+
+/*
+ * Write data to the ring buffer and send the request to userspace,
+ * userspace will read it
+ * This is comparable with classical read(/dev/fuse)
+ */
+static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
+				    unsigned int issue_flags, bool send_in_task)
+{
+	struct fuse_ring *ring = ring_ent->queue->ring;
+	struct fuse_ring_req *rreq = ring_ent->rreq;
+	struct fuse_req *req = ring_ent->fuse_req;
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	int err = 0;
+
+	spin_lock(&queue->lock);
+
+	if (WARN_ON(test_bit(FRRS_USERSPACE, &ring_ent->state) ||
+		   (test_bit(FRRS_FREED, &ring_ent->state)))) {
+		pr_err("qid=%d tag=%d ring-req=%p buf_req=%p invalid state %lu on send\n",
+		       queue->qid, ring_ent->tag, ring_ent, rreq,
+		       ring_ent->state);
+		err = -EIO;
+	} else {
+		set_bit(FRRS_USERSPACE, &ring_ent->state);
+		list_add(&ring_ent->list, &queue->ent_in_userspace);
+	}
+
+	spin_unlock(&queue->lock);
+	if (err)
+		goto err;
+
+	err = fuse_uring_copy_to_ring(ring, req, rreq);
+	if (unlikely(err)) {
+		spin_lock(&queue->lock);
+		fuse_ring_ring_ent_unset_userspace(ring_ent);
+		spin_unlock(&queue->lock);
+		goto err;
+	}
+
+	/* ring req go directly into the shared memory buffer */
+	rreq->in = req->in.h;
+	set_bit(FR_SENT, &req->flags);
+
+	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu issue_flags=%u\n",
+		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+		 rreq->in.opcode, rreq->in.unique, issue_flags);
+
+	if (send_in_task)
+		io_uring_cmd_complete_in_task(ring_ent->cmd,
+					      fuse_uring_async_send_to_ring);
+	else
+		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
+
+	return;
+
+err:
+	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
+}
+
 /*
  * Put a ring request onto hold, it is no longer used for now.
  */
@@ -381,6 +574,104 @@ static void fuse_uring_ent_avail(struct fuse_ring_ent *ring_ent,
 	set_bit(FRRS_WAIT, &ring_ent->state);
 }
 
+/*
+ * Assign a fuse queue entry to the given entry
+ */
+static void fuse_uring_add_req_to_ring_ent(struct fuse_ring_ent *ring_ent,
+					   struct fuse_req *req)
+{
+	clear_bit(FRRS_WAIT, &ring_ent->state);
+	list_del_init(&req->list);
+	clear_bit(FR_PENDING, &req->flags);
+	ring_ent->fuse_req = req;
+	set_bit(FRRS_FUSE_REQ, &ring_ent->state);
+}
+
+/*
+ * Release a uring entry and fetch the next fuse request if available
+ *
+ * @return true if a new request has been fetched
+ */
+static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
+{
+	struct fuse_req *req = NULL;
+	struct fuse_ring_queue *queue = ring_ent->queue;
+	struct list_head *req_queue = ring_ent->async ?
+		&queue->async_fuse_req_queue : &queue->sync_fuse_req_queue;
+
+	spin_lock(&ring_ent->queue->lock);
+	fuse_uring_ent_avail(ring_ent, queue);
+	if (!list_empty(req_queue)) {
+		req = list_first_entry(req_queue, struct fuse_req, list);
+		fuse_uring_add_req_to_ring_ent(ring_ent, req);
+		list_del_init(&ring_ent->list);
+	}
+	spin_unlock(&ring_ent->queue->lock);
+
+	return req ? true : false;
+}
+
+/*
+ * Finalize a fuse request, then fetch and send the next entry, if available
+ *
+ * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
+ */
+static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
+					    bool set_err, int error,
+					    unsigned int issue_flags)
+{
+	struct fuse_req *req = ring_ent->fuse_req;
+	int has_next;
+
+	if (set_err)
+		req->out.h.error = error;
+
+	clear_bit(FR_SENT, &req->flags);
+	fuse_request_end(ring_ent->fuse_req);
+	ring_ent->fuse_req = NULL;
+	clear_bit(FRRS_FUSE_REQ, &ring_ent->state);
+
+	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
+	if (has_next) {
+		/* called within uring context - use provided flags */
+		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
+	}
+}
+
+/*
+ * Read data from the ring buffer, which user space has written to
+ * This is comparible with handling of classical write(/dev/fuse).
+ * Also make the ring request available again for new fuse requests.
+ */
+static void fuse_uring_commit_and_release(struct fuse_dev *fud,
+					  struct fuse_ring_ent *ring_ent,
+					  unsigned int issue_flags)
+{
+	struct fuse_ring_req *rreq = ring_ent->rreq;
+	struct fuse_req *req = ring_ent->fuse_req;
+	ssize_t err = 0;
+	bool set_err = false;
+
+	req->out.h = rreq->out;
+
+	err = fuse_uring_ring_ent_has_err(fud->fc->ring, ring_ent);
+	if (err) {
+		/* req->out.h.error already set */
+		pr_devel("%s:%d err=%zd oh->err=%d\n", __func__, __LINE__, err,
+			 req->out.h.error);
+		goto out;
+	}
+
+	err = fuse_uring_copy_from_ring(fud->fc->ring, req, rreq);
+	if (err)
+		set_err = true;
+
+out:
+	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
+		 req->args->opcode, req->out.h.error);
+	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
+}
+
 /*
  * fuse_uring_req_fetch command handling
  */
@@ -566,6 +857,26 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 
 		spin_unlock(&queue->lock);
 		break;
+	case FUSE_URING_REQ_COMMIT_AND_FETCH:
+		if (unlikely(!ring->ready)) {
+			pr_info("commit and fetch, but fuse-uringis not ready.");
+			goto err_unlock;
+		}
+
+		if (!test_bit(FRRS_USERSPACE, &ring_ent->state)) {
+			pr_info("qid=%d tag=%d state %lu SQE already handled\n",
+				queue->qid, ring_ent->tag, ring_ent->state);
+			goto err_unlock;
+		}
+
+		fuse_ring_ring_ent_unset_userspace(ring_ent);
+		spin_unlock(&queue->lock);
+
+		WRITE_ONCE(ring_ent->cmd, cmd);
+		fuse_uring_commit_and_release(fud, ring_ent, issue_flags);
+
+		ret = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		pr_devel("Unknown uring command %d", cmd_op);

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 12/19] fuse: {uring} Add uring sqe commit and fetch support
  2024-05-29 18:00 ` [PATCH RFC v2 12/19] fuse: {uring} Add uring sqe commit and fetch support Bernd Schubert
@ 2024-05-30 20:08   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 20:08 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:47PM +0200, Bernd Schubert wrote:
> This adds support for fuse request completion through ring SQEs
> (FUSE_URING_REQ_COMMIT_AND_FETCH handling). After committing
> the ring entry it becomes available for new fuse requests.
> Handling of requests through the ring (SQE/CQE handling)
> is complete now.
> 
> Fuse request data are copied through the mmaped ring buffer,
> there is no support for any zero copy yet.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev_uring.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 311 insertions(+)
> 
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 48b1118b64f4..5269b3f8891e 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -31,12 +31,23 @@
>  #include <linux/topology.h>
>  #include <linux/io_uring/cmd.h>
>  
> +static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
> +					    bool set_err, int error,
> +					    unsigned int issue_flags);
> +

Just order this above all the users instead of putting a declaration here.

>  static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
>  {
>  	clear_bit(FRRS_USERSPACE, &ent->state);
>  	list_del_init(&ent->list);
>  }
>  
> +static void
> +fuse_uring_async_send_to_ring(struct io_uring_cmd *cmd,
> +			      unsigned int issue_flags)
> +{
> +	io_uring_cmd_done(cmd, 0, 0, issue_flags);
> +}
> +
>  /* Update conn limits according to ring values */
>  static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
>  {
> @@ -350,6 +361,188 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
>  	return 0;
>  }
>  
> +/*
> + * Checks for errors and stores it into the request
> + */
> +static int fuse_uring_ring_ent_has_err(struct fuse_ring *ring,
> +				       struct fuse_ring_ent *ring_ent)
> +{
> +	struct fuse_conn *fc = ring->fc;
> +	struct fuse_req *req = ring_ent->fuse_req;
> +	struct fuse_out_header *oh = &req->out.h;
> +	int err;
> +
> +	if (oh->unique == 0) {
> +		/* Not supportd through request based uring, this needs another
> +		 * ring from user space to kernel
> +		 */
> +		pr_warn("Unsupported fuse-notify\n");
> +		err = -EINVAL;
> +		goto seterr;
> +	}
> +
> +	if (oh->error <= -512 || oh->error > 0) {

What is -512?  No magic numbers please.

> +		err = -EINVAL;
> +		goto seterr;
> +	}
> +
> +	if (oh->error) {
> +		err = oh->error;
> +		pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__,
> +			 err, req->args->opcode, req->out.h.error);
> +		goto err; /* error already set */
> +	}
> +
> +	if ((oh->unique & ~FUSE_INT_REQ_BIT) != req->in.h.unique) {
> +		pr_warn("Unpexted seqno mismatch, expected: %llu got %llu\n",
> +			req->in.h.unique, oh->unique & ~FUSE_INT_REQ_BIT);
> +		err = -ENOENT;
> +		goto seterr;
> +	}
> +
> +	/* Is it an interrupt reply ID?	 */
> +	if (oh->unique & FUSE_INT_REQ_BIT) {
> +		err = 0;
> +		if (oh->error == -ENOSYS)
> +			fc->no_interrupt = 1;
> +		else if (oh->error == -EAGAIN) {
> +			/* XXX Interrupts not handled yet */
> +			/* err = queue_interrupt(req); */
> +			pr_warn("Intrerupt EAGAIN not supported yet");
> +			err = -EINVAL;
> +		}
> +
> +		goto seterr;
> +	}
> +
> +	return 0;
> +
> +seterr:
> +	pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__, err,
> +		 req->args->opcode, req->out.h.error);
> +	oh->error = err;
> +err:
> +	pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__, err,
> +		 req->args->opcode, req->out.h.error);
> +	return err;
> +}
> +
> +/*
> + * Copy data from the ring buffer to the fuse request
> + */
> +static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> +				     struct fuse_req *req,
> +				     struct fuse_ring_req *rreq)
> +{
> +	struct fuse_copy_state cs;
> +	struct fuse_args *args = req->args;
> +
> +	fuse_copy_init(&cs, 0, NULL);
> +	cs.is_uring = 1;
> +	cs.ring.buf = rreq->in_out_arg;
> +
> +	if (rreq->in_out_arg_len > ring->req_arg_len) {
> +		pr_devel("Max ring buffer len exceeded (%u vs %zu\n",
> +			 rreq->in_out_arg_len, ring->req_arg_len);
> +		return -EINVAL;
> +	}
> +	cs.ring.buf_sz = rreq->in_out_arg_len;
> +	cs.req = req;
> +
> +	pr_devel("%s:%d buf=%p len=%d args=%d\n", __func__, __LINE__,
> +		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs);
> +
> +	return fuse_copy_out_args(&cs, args, rreq->in_out_arg_len);
> +}
> +
> +/*
> + * Copy data from the req to the ring buffer
> + */
> +static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
> +				   struct fuse_ring_req *rreq)
> +{
> +	struct fuse_copy_state cs;
> +	struct fuse_args *args = req->args;
> +	int err;
> +
> +	fuse_copy_init(&cs, 1, NULL);
> +	cs.is_uring = 1;
> +	cs.ring.buf = rreq->in_out_arg;
> +	cs.ring.buf_sz = ring->req_arg_len;
> +	cs.req = req;
> +
> +	pr_devel("%s:%d buf=%p len=%d args=%d\n", __func__, __LINE__,
> +		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs);
> +
> +	err = fuse_copy_args(&cs, args->in_numargs, args->in_pages,
> +			     (struct fuse_arg *)args->in_args, 0);
> +	rreq->in_out_arg_len = cs.ring.offset;

Is this ok if there's an error?  I genuinely don't know, maybe add a comment for
idiots like me?

> +
> +	pr_devel("%s:%d buf=%p len=%d args=%d err=%d\n", __func__, __LINE__,
> +		 cs.ring.buf, cs.ring.buf_sz, args->out_numargs, err);
> +
> +	return err;
> +}
> +
> +/*
> + * Write data to the ring buffer and send the request to userspace,
> + * userspace will read it
> + * This is comparable with classical read(/dev/fuse)
> + */
> +static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
> +				    unsigned int issue_flags, bool send_in_task)
> +{
> +	struct fuse_ring *ring = ring_ent->queue->ring;
> +	struct fuse_ring_req *rreq = ring_ent->rreq;
> +	struct fuse_req *req = ring_ent->fuse_req;
> +	struct fuse_ring_queue *queue = ring_ent->queue;
> +	int err = 0;
> +
> +	spin_lock(&queue->lock);
> +
> +	if (WARN_ON(test_bit(FRRS_USERSPACE, &ring_ent->state) ||
> +		   (test_bit(FRRS_FREED, &ring_ent->state)))) {

WARN_ON(x || b)

Makes me sad when it trips because IDK which one it was, please make them have
their own warn condition.

Also I don't love using WARN_ON() in an if statement if it can be avoided, so
maybe

if (test_bit() || test_bit()) {
	WARN_ON_ONCE(test_bit(USERSPACE));
	WARN_ON_ONCE(test_bit(FREED));
	err = -EIO;
}

Also again I'm sorry for not bringing this up early, I'd prefer WARN_ON_ONCE().
History has shown me many a hung box because I thought this would never happen
and now it's spewing stack traces to my slow ass serial console and I can't get
the box to respond at all.

> +		pr_err("qid=%d tag=%d ring-req=%p buf_req=%p invalid state %lu on send\n",
> +		       queue->qid, ring_ent->tag, ring_ent, rreq,
> +		       ring_ent->state);
> +		err = -EIO;
> +	} else {
> +		set_bit(FRRS_USERSPACE, &ring_ent->state);
> +		list_add(&ring_ent->list, &queue->ent_in_userspace);
> +	}
> +
> +	spin_unlock(&queue->lock);
> +	if (err)
> +		goto err;
> +
> +	err = fuse_uring_copy_to_ring(ring, req, rreq);
> +	if (unlikely(err)) {
> +		spin_lock(&queue->lock);
> +		fuse_ring_ring_ent_unset_userspace(ring_ent);
> +		spin_unlock(&queue->lock);
> +		goto err;
> +	}
> +
> +	/* ring req go directly into the shared memory buffer */
> +	rreq->in = req->in.h;
> +	set_bit(FR_SENT, &req->flags);
> +
> +	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu issue_flags=%u\n",
> +		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
> +		 rreq->in.opcode, rreq->in.unique, issue_flags);
> +
> +	if (send_in_task)
> +		io_uring_cmd_complete_in_task(ring_ent->cmd,
> +					      fuse_uring_async_send_to_ring);
> +	else
> +		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
> +
> +	return;
> +
> +err:
> +	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
> +}
> +
>  /*
>   * Put a ring request onto hold, it is no longer used for now.
>   */
> @@ -381,6 +574,104 @@ static void fuse_uring_ent_avail(struct fuse_ring_ent *ring_ent,
>  	set_bit(FRRS_WAIT, &ring_ent->state);
>  }
>  
> +/*
> + * Assign a fuse queue entry to the given entry
> + */
> +static void fuse_uring_add_req_to_ring_ent(struct fuse_ring_ent *ring_ent,
> +					   struct fuse_req *req)
> +{
> +	clear_bit(FRRS_WAIT, &ring_ent->state);
> +	list_del_init(&req->list);
> +	clear_bit(FR_PENDING, &req->flags);
> +	ring_ent->fuse_req = req;
> +	set_bit(FRRS_FUSE_REQ, &ring_ent->state);
> +}
> +
> +/*
> + * Release a uring entry and fetch the next fuse request if available
> + *
> + * @return true if a new request has been fetched
> + */
> +static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
> +{
> +	struct fuse_req *req = NULL;
> +	struct fuse_ring_queue *queue = ring_ent->queue;
> +	struct list_head *req_queue = ring_ent->async ?
> +		&queue->async_fuse_req_queue : &queue->sync_fuse_req_queue;
> +
> +	spin_lock(&ring_ent->queue->lock);
> +	fuse_uring_ent_avail(ring_ent, queue);
> +	if (!list_empty(req_queue)) {
> +		req = list_first_entry(req_queue, struct fuse_req, list);
> +		fuse_uring_add_req_to_ring_ent(ring_ent, req);
> +		list_del_init(&ring_ent->list);
> +	}
> +	spin_unlock(&ring_ent->queue->lock);
> +
> +	return req ? true : false;
> +}
> +
> +/*
> + * Finalize a fuse request, then fetch and send the next entry, if available
> + *
> + * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
> + */
> +static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
> +					    bool set_err, int error,
> +					    unsigned int issue_flags)
> +{
> +	struct fuse_req *req = ring_ent->fuse_req;
> +	int has_next;
> +
> +	if (set_err)
> +		req->out.h.error = error;

The set_err thing seems redundant since we always have it set to true if error
is set, so just drop this bit and set error if there's an error.

> +
> +	clear_bit(FR_SENT, &req->flags);
> +	fuse_request_end(ring_ent->fuse_req);
> +	ring_ent->fuse_req = NULL;
> +	clear_bit(FRRS_FUSE_REQ, &ring_ent->state);
> +
> +	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
> +	if (has_next) {
> +		/* called within uring context - use provided flags */
> +		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
> +	}
> +}
> +
> +/*
> + * Read data from the ring buffer, which user space has written to
> + * This is comparible with handling of classical write(/dev/fuse).
> + * Also make the ring request available again for new fuse requests.
> + */
> +static void fuse_uring_commit_and_release(struct fuse_dev *fud,
> +					  struct fuse_ring_ent *ring_ent,
> +					  unsigned int issue_flags)
> +{
> +	struct fuse_ring_req *rreq = ring_ent->rreq;
> +	struct fuse_req *req = ring_ent->fuse_req;
> +	ssize_t err = 0;
> +	bool set_err = false;
> +
> +	req->out.h = rreq->out;
> +
> +	err = fuse_uring_ring_ent_has_err(fud->fc->ring, ring_ent);
> +	if (err) {
> +		/* req->out.h.error already set */
> +		pr_devel("%s:%d err=%zd oh->err=%d\n", __func__, __LINE__, err,
> +			 req->out.h.error);
> +		goto out;
> +	}
> +
> +	err = fuse_uring_copy_from_ring(fud->fc->ring, req, rreq);
> +	if (err)
> +		set_err = true;
> +
> +out:
> +	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
> +		 req->args->opcode, req->out.h.error);
> +	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
> +}
> +
>  /*
>   * fuse_uring_req_fetch command handling
>   */
> @@ -566,6 +857,26 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>  
>  		spin_unlock(&queue->lock);
>  		break;
> +	case FUSE_URING_REQ_COMMIT_AND_FETCH:
> +		if (unlikely(!ring->ready)) {
> +			pr_info("commit and fetch, but fuse-uringis not ready.");
> +			goto err_unlock;
> +		}
> +
> +		if (!test_bit(FRRS_USERSPACE, &ring_ent->state)) {
> +			pr_info("qid=%d tag=%d state %lu SQE already handled\n",
> +				queue->qid, ring_ent->tag, ring_ent->state);
> +			goto err_unlock;
> +		}
> +
> +		fuse_ring_ring_ent_unset_userspace(ring_ent);
> +		spin_unlock(&queue->lock);
> +
> +		WRITE_ONCE(ring_ent->cmd, cmd);
> +		fuse_uring_commit_and_release(fud, ring_ent, issue_flags);
> +
> +		ret = 0;
> +		break;

Hmm ok this changes my comments on the previous patch slightly, tho I think
still it would be better to push this code into a helper as well and do the
locking in there, let me go look at the resulting code...yeah ok I think it's
still better to just have these two cases in their own helper.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 13/19] fuse: {uring} Handle uring shutdown
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (11 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 12/19] fuse: {uring} Add uring sqe commit and fetch support Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 20:21   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring Bernd Schubert
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c         |  10 +++
 fs/fuse/dev_uring.c   | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h |  67 +++++++++++++++++
 3 files changed, 271 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index a7d26440de39..6ffd216b27c8 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2202,6 +2202,8 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		fc->connected = 0;
 		spin_unlock(&fc->bg_lock);
 
+		fuse_uring_set_stopped(fc);
+
 		fuse_set_initialized(fc);
 		list_for_each_entry(fud, &fc->devices, entry) {
 			struct fuse_pqueue *fpq = &fud->pq;
@@ -2245,6 +2247,12 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		spin_unlock(&fc->lock);
 
 		fuse_dev_end_requests(&to_end);
+
+		/*
+		 * fc->lock must not be taken to avoid conflicts with io-uring
+		 * locks
+		 */
+		fuse_uring_abort(fc);
 	} else {
 		spin_unlock(&fc->lock);
 	}
@@ -2256,6 +2264,8 @@ void fuse_wait_aborted(struct fuse_conn *fc)
 	/* matches implicit memory barrier in fuse_drop_waiting() */
 	smp_mb();
 	wait_event(fc->blocked_waitq, atomic_read(&fc->num_waiting) == 0);
+
+	fuse_uring_wait_stopped_queues(fc);
 }
 
 int fuse_dev_release(struct inode *inode, struct file *file)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 5269b3f8891e..6001ba4d6e82 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -48,6 +48,44 @@ fuse_uring_async_send_to_ring(struct io_uring_cmd *cmd,
 	io_uring_cmd_done(cmd, 0, 0, issue_flags);
 }
 
+/* Abort all list queued request on the given ring queue */
+static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
+{
+	struct fuse_req *req;
+	LIST_HEAD(sync_list);
+	LIST_HEAD(async_list);
+
+	spin_lock(&queue->lock);
+
+	list_for_each_entry(req, &queue->sync_fuse_req_queue, list)
+		clear_bit(FR_PENDING, &req->flags);
+	list_for_each_entry(req, &queue->async_fuse_req_queue, list)
+		clear_bit(FR_PENDING, &req->flags);
+
+	list_splice_init(&queue->async_fuse_req_queue, &sync_list);
+	list_splice_init(&queue->sync_fuse_req_queue, &async_list);
+
+	spin_unlock(&queue->lock);
+
+	/* must not hold queue lock to avoid order issues with fi->lock */
+	fuse_dev_end_requests(&sync_list);
+	fuse_dev_end_requests(&async_list);
+}
+
+void fuse_uring_abort_end_requests(struct fuse_ring *ring)
+{
+	int qid;
+
+	for (qid = 0; qid < ring->nr_queues; qid++) {
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+		if (!queue->configured)
+			continue;
+
+		fuse_uring_abort_end_queue_requests(queue);
+	}
+}
+
 /* Update conn limits according to ring values */
 static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
 {
@@ -361,6 +399,162 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
 	return 0;
 }
 
+static void fuse_uring_stop_fuse_req_end(struct fuse_ring_ent *ent)
+{
+	struct fuse_req *req = ent->fuse_req;
+
+	ent->fuse_req = NULL;
+	clear_bit(FRRS_FUSE_REQ, &ent->state);
+	clear_bit(FR_SENT, &req->flags);
+	req->out.h.error = -ECONNABORTED;
+	fuse_request_end(req);
+}
+
+/*
+ * Release a request/entry on connection shutdown
+ */
+static bool fuse_uring_try_entry_stop(struct fuse_ring_ent *ent,
+				      bool need_cmd_done)
+	__must_hold(ent->queue->lock)
+{
+	struct fuse_ring_queue *queue = ent->queue;
+	bool released = false;
+
+	if (test_bit(FRRS_FREED, &ent->state))
+		goto out; /* no work left, freed before */
+
+	if (ent->state == BIT(FRRS_INIT) || test_bit(FRRS_WAIT, &ent->state) ||
+	    test_bit(FRRS_USERSPACE, &ent->state)) {
+		set_bit(FRRS_FREED, &ent->state);
+
+		if (need_cmd_done) {
+			pr_devel("qid=%d tag=%d sending cmd_done\n", queue->qid,
+				 ent->tag);
+
+			spin_unlock(&queue->lock);
+			io_uring_cmd_done(ent->cmd, -ENOTCONN, 0,
+					  IO_URING_F_UNLOCKED);
+			spin_lock(&queue->lock);
+		}
+
+		if (ent->fuse_req)
+			fuse_uring_stop_fuse_req_end(ent);
+		released = true;
+	}
+out:
+	return released;
+}
+
+static void fuse_uring_stop_list_entries(struct list_head *head,
+					 struct fuse_ring_queue *queue,
+					 bool need_cmd_done)
+{
+	struct fuse_ring *ring = queue->ring;
+	struct fuse_ring_ent *ent, *next;
+	ssize_t queue_refs = SSIZE_MAX;
+
+	list_for_each_entry_safe(ent, next, head, list) {
+		if (fuse_uring_try_entry_stop(ent, need_cmd_done)) {
+			queue_refs = atomic_dec_return(&ring->queue_refs);
+			list_del_init(&ent->list);
+		}
+
+		if (WARN_ON_ONCE(queue_refs < 0))
+			pr_warn("qid=%d queue_refs=%zd", queue->qid,
+				queue_refs);
+	}
+}
+
+static void fuse_uring_stop_queue(struct fuse_ring_queue *queue)
+	__must_hold(&queue->lock)
+{
+	fuse_uring_stop_list_entries(&queue->ent_in_userspace, queue, false);
+	fuse_uring_stop_list_entries(&queue->async_ent_avail_queue, queue, true);
+	fuse_uring_stop_list_entries(&queue->sync_ent_avail_queue, queue, true);
+}
+
+/*
+ * Log state debug info
+ */
+static void fuse_uring_stop_ent_state(struct fuse_ring *ring)
+{
+	int qid, tag;
+
+	for (qid = 0; qid < ring->nr_queues; qid++) {
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+		for (tag = 0; tag < ring->queue_depth; tag++) {
+			struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+			if (!test_bit(FRRS_FREED, &ent->state))
+				pr_info("ring=%p qid=%d tag=%d state=%lu\n",
+					ring, qid, tag, ent->state);
+		}
+	}
+	ring->stop_debug_log = 1;
+}
+
+static void fuse_uring_async_stop_queues(struct work_struct *work)
+{
+	int qid;
+	struct fuse_ring *ring =
+		container_of(work, struct fuse_ring, stop_work.work);
+
+	for (qid = 0; qid < ring->nr_queues; qid++) {
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+		if (!queue->configured)
+			continue;
+
+		spin_lock(&queue->lock);
+		fuse_uring_stop_queue(queue);
+		spin_unlock(&queue->lock);
+	}
+
+	if (atomic_read(&ring->queue_refs) > 0) {
+		if (time_after(jiffies,
+			       ring->stop_time + FUSE_URING_STOP_WARN_TIMEOUT))
+			fuse_uring_stop_ent_state(ring);
+
+		pr_info("ring=%p scheduling intervalled queue stop\n", ring);
+
+		schedule_delayed_work(&ring->stop_work,
+				      FUSE_URING_STOP_INTERVAL);
+	} else {
+		wake_up_all(&ring->stop_waitq);
+	}
+}
+
+/*
+ * Stop the ring queues
+ */
+void fuse_uring_stop_queues(struct fuse_ring *ring)
+{
+	int qid;
+
+	for (qid = 0; qid < ring->nr_queues; qid++) {
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+		if (!queue->configured)
+			continue;
+
+		spin_lock(&queue->lock);
+		fuse_uring_stop_queue(queue);
+		spin_unlock(&queue->lock);
+	}
+
+	if (atomic_read(&ring->queue_refs) > 0) {
+		pr_info("ring=%p scheduling intervalled queue stop\n", ring);
+		ring->stop_time = jiffies;
+		INIT_DELAYED_WORK(&ring->stop_work,
+				  fuse_uring_async_stop_queues);
+		schedule_delayed_work(&ring->stop_work,
+				      FUSE_URING_STOP_INTERVAL);
+	} else {
+		wake_up_all(&ring->stop_waitq);
+	}
+}
+
 /*
  * Checks for errors and stores it into the request
  */
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index b2be67bb2fa7..e5fc84e2f3ea 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -16,6 +16,9 @@
 /* IORING_MAX_ENTRIES */
 #define FUSE_URING_MAX_QUEUE_DEPTH 32768
 
+#define FUSE_URING_STOP_WARN_TIMEOUT (5 * HZ)
+#define FUSE_URING_STOP_INTERVAL (HZ/20)
+
 enum fuse_ring_req_state {
 
 	/* request is basially initialized */
@@ -203,6 +206,7 @@ int fuse_uring_mmap(struct file *filp, struct vm_area_struct *vma);
 int fuse_uring_queue_cfg(struct fuse_ring *ring,
 			 struct fuse_ring_queue_config *qcfg);
 void fuse_uring_ring_destruct(struct fuse_ring *ring);
+void fuse_uring_stop_queues(struct fuse_ring *ring);
 int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 
 static inline void fuse_uring_conn_init(struct fuse_ring *ring,
@@ -275,6 +279,58 @@ static inline bool fuse_per_core_queue(struct fuse_conn *fc)
 	return fc->ring && fc->ring->per_core_queue;
 }
 
+static inline void fuse_uring_set_stopped_queues(struct fuse_ring *ring)
+{
+	int qid;
+
+	for (qid = 0; qid < ring->nr_queues; qid++) {
+		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+		if (!queue->configured)
+			continue;
+
+		spin_lock(&queue->lock);
+		queue->stopped = 1;
+		spin_unlock(&queue->lock);
+	}
+}
+
+/*
+ *  Set per queue aborted flag
+ */
+static inline void fuse_uring_set_stopped(struct fuse_conn *fc)
+	__must_hold(fc->lock)
+{
+	if (fc->ring == NULL)
+		return;
+
+	fc->ring->ready = false;
+
+	fuse_uring_set_stopped_queues(fc->ring);
+}
+
+static inline void fuse_uring_abort(struct fuse_conn *fc)
+{
+	struct fuse_ring *ring = fc->ring;
+
+	if (ring == NULL)
+		return;
+
+	if (ring->configured && atomic_read(&ring->queue_refs) > 0) {
+		fuse_uring_abort_end_requests(ring);
+		fuse_uring_stop_queues(ring);
+	}
+}
+
+static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
+{
+	struct fuse_ring *ring = fc->ring;
+
+	if (ring && ring->configured)
+		wait_event(ring->stop_waitq,
+			   atomic_read(&ring->queue_refs) == 0);
+}
+
 #else /* CONFIG_FUSE_IO_URING */
 
 struct fuse_ring;
@@ -298,6 +354,17 @@ static inline bool fuse_per_core_queue(struct fuse_conn *fc)
 	return false;
 }
 
+static inline void fuse_uring_set_stopped(struct fuse_conn *fc)
+{
+}
+
+static inline void fuse_uring_abort(struct fuse_conn *fc)
+{
+}
+
+static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
+{
+}
 
 #endif /* CONFIG_FUSE_IO_URING */
 

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 13/19] fuse: {uring} Handle uring shutdown
  2024-05-29 18:00 ` [PATCH RFC v2 13/19] fuse: {uring} Handle uring shutdown Bernd Schubert
@ 2024-05-30 20:21   ` Josef Bacik
  0 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 20:21 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:48PM +0200, Bernd Schubert wrote:
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c         |  10 +++
>  fs/fuse/dev_uring.c   | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h |  67 +++++++++++++++++
>  3 files changed, 271 insertions(+)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index a7d26440de39..6ffd216b27c8 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2202,6 +2202,8 @@ void fuse_abort_conn(struct fuse_conn *fc)
>  		fc->connected = 0;
>  		spin_unlock(&fc->bg_lock);
>  
> +		fuse_uring_set_stopped(fc);
> +
>  		fuse_set_initialized(fc);
>  		list_for_each_entry(fud, &fc->devices, entry) {
>  			struct fuse_pqueue *fpq = &fud->pq;
> @@ -2245,6 +2247,12 @@ void fuse_abort_conn(struct fuse_conn *fc)
>  		spin_unlock(&fc->lock);
>  
>  		fuse_dev_end_requests(&to_end);
> +
> +		/*
> +		 * fc->lock must not be taken to avoid conflicts with io-uring
> +		 * locks
> +		 */
> +		fuse_uring_abort(fc);

Perhaps a 

lockdep_assert_not_held(&fc->lock)

in fuse_uring_abort() then?

>  	} else {
>  		spin_unlock(&fc->lock);
>  	}
> @@ -2256,6 +2264,8 @@ void fuse_wait_aborted(struct fuse_conn *fc)
>  	/* matches implicit memory barrier in fuse_drop_waiting() */
>  	smp_mb();
>  	wait_event(fc->blocked_waitq, atomic_read(&fc->num_waiting) == 0);
> +
> +	fuse_uring_wait_stopped_queues(fc);
>  }
>  
>  int fuse_dev_release(struct inode *inode, struct file *file)
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 5269b3f8891e..6001ba4d6e82 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -48,6 +48,44 @@ fuse_uring_async_send_to_ring(struct io_uring_cmd *cmd,
>  	io_uring_cmd_done(cmd, 0, 0, issue_flags);
>  }
>  
> +/* Abort all list queued request on the given ring queue */
> +static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
> +{
> +	struct fuse_req *req;
> +	LIST_HEAD(sync_list);
> +	LIST_HEAD(async_list);
> +
> +	spin_lock(&queue->lock);
> +
> +	list_for_each_entry(req, &queue->sync_fuse_req_queue, list)
> +		clear_bit(FR_PENDING, &req->flags);
> +	list_for_each_entry(req, &queue->async_fuse_req_queue, list)
> +		clear_bit(FR_PENDING, &req->flags);
> +
> +	list_splice_init(&queue->async_fuse_req_queue, &sync_list);
> +	list_splice_init(&queue->sync_fuse_req_queue, &async_list);
> +
> +	spin_unlock(&queue->lock);
> +
> +	/* must not hold queue lock to avoid order issues with fi->lock */
> +	fuse_dev_end_requests(&sync_list);
> +	fuse_dev_end_requests(&async_list);
> +}
> +
> +void fuse_uring_abort_end_requests(struct fuse_ring *ring)
> +{
> +	int qid;
> +
> +	for (qid = 0; qid < ring->nr_queues; qid++) {
> +		struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
> +
> +		if (!queue->configured)
> +			continue;
> +
> +		fuse_uring_abort_end_queue_requests(queue);
> +	}
> +}
> +
>  /* Update conn limits according to ring values */
>  static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
>  {
> @@ -361,6 +399,162 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
>  	return 0;
>  }
>  
> +static void fuse_uring_stop_fuse_req_end(struct fuse_ring_ent *ent)
> +{
> +	struct fuse_req *req = ent->fuse_req;
> +
> +	ent->fuse_req = NULL;
> +	clear_bit(FRRS_FUSE_REQ, &ent->state);
> +	clear_bit(FR_SENT, &req->flags);
> +	req->out.h.error = -ECONNABORTED;
> +	fuse_request_end(req);
> +}
> +
> +/*
> + * Release a request/entry on connection shutdown
> + */
> +static bool fuse_uring_try_entry_stop(struct fuse_ring_ent *ent,
> +				      bool need_cmd_done)
> +	__must_hold(ent->queue->lock)
> +{
> +	struct fuse_ring_queue *queue = ent->queue;
> +	bool released = false;
> +
> +	if (test_bit(FRRS_FREED, &ent->state))
> +		goto out; /* no work left, freed before */

Just return false;

> +
> +	if (ent->state == BIT(FRRS_INIT) || test_bit(FRRS_WAIT, &ent->state) ||
> +	    test_bit(FRRS_USERSPACE, &ent->state)) {

Again, apologies for just now noticing this, but this is kind of a complicated
state machine.

I think I'd rather you just use ent->state as an actual state machine, so it has
one value and one value only at any given time, which appears to be what happens
except in that we have FRRS_INIT set in addition to whatever other bit is set.

Rework this so it's less complicated, because it's quite difficult to follow in
it's current form.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (12 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 13/19] fuse: {uring} Handle uring shutdown Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 20:32   ` Josef Bacik
  2024-05-29 18:00 ` [PATCH RFC v2 15/19] export __wake_on_current_cpu Bernd Schubert
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This enables enqueuing requests through fuse uring queues.

For initial simplicity requests are always allocated the normal way
then added to ring queues lists and only then copied to ring queue
entries. Later on the allocation and adding the requests to a list
can be avoided, by directly using a ring entry. This introduces
some code complexity and is therefore not done for now.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c         | 80 +++++++++++++++++++++++++++++++++++++++-----
 fs/fuse/dev_uring.c   | 92 ++++++++++++++++++++++++++++++++++++++++++---------
 fs/fuse/dev_uring_i.h | 17 ++++++++++
 3 files changed, 165 insertions(+), 24 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6ffd216b27c8..c7fd3849a105 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -218,13 +218,29 @@ const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
 };
 EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
 
-static void queue_request_and_unlock(struct fuse_iqueue *fiq,
-				     struct fuse_req *req)
+
+static void queue_request_and_unlock(struct fuse_conn *fc,
+				     struct fuse_req *req, bool allow_uring)
 __releases(fiq->lock)
 {
+	struct fuse_iqueue *fiq = &fc->iq;
+
 	req->in.h.len = sizeof(struct fuse_in_header) +
 		fuse_len_args(req->args->in_numargs,
 			      (struct fuse_arg *) req->args->in_args);
+
+	if (allow_uring && fuse_uring_ready(fc)) {
+		int res;
+
+		/* this lock is not needed at all for ring req handling */
+		spin_unlock(&fiq->lock);
+		res = fuse_uring_queue_fuse_req(fc, req);
+		if (!res)
+			return;
+
+		/* fallthrough, handled through /dev/fuse read/write */
+	}
+
 	list_add_tail(&req->list, &fiq->pending);
 	fiq->ops->wake_pending_and_unlock(fiq);
 }
@@ -261,7 +277,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
 		fc->active_background++;
 		spin_lock(&fiq->lock);
 		req->in.h.unique = fuse_get_unique(fiq);
-		queue_request_and_unlock(fiq, req);
+		queue_request_and_unlock(fc, req, true);
 	}
 }
 
@@ -405,7 +421,8 @@ static void request_wait_answer(struct fuse_req *req)
 
 static void __fuse_request_send(struct fuse_req *req)
 {
-	struct fuse_iqueue *fiq = &req->fm->fc->iq;
+	struct fuse_conn *fc = req->fm->fc;
+	struct fuse_iqueue *fiq = &fc->iq;
 
 	BUG_ON(test_bit(FR_BACKGROUND, &req->flags));
 	spin_lock(&fiq->lock);
@@ -417,7 +434,7 @@ static void __fuse_request_send(struct fuse_req *req)
 		/* acquire extra reference, since request is still needed
 		   after fuse_request_end() */
 		__fuse_get_request(req);
-		queue_request_and_unlock(fiq, req);
+		queue_request_and_unlock(fc, req, true);
 
 		request_wait_answer(req);
 		/* Pairs with smp_wmb() in fuse_request_end() */
@@ -487,6 +504,10 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
 	if (args->force) {
 		atomic_inc(&fc->num_waiting);
 		req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
+		if (unlikely(!req)) {
+			ret = -ENOTCONN;
+			goto err;
+		}
 
 		if (!args->nocreds)
 			fuse_force_creds(req);
@@ -514,16 +535,55 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
 	}
 	fuse_put_request(req);
 
+err:
 	return ret;
 }
 
-static bool fuse_request_queue_background(struct fuse_req *req)
+static bool fuse_request_queue_background_uring(struct fuse_conn *fc,
+					       struct fuse_req *req)
+{
+	struct fuse_iqueue *fiq = &fc->iq;
+	int err;
+
+	req->in.h.unique = fuse_get_unique(fiq);
+	req->in.h.len = sizeof(struct fuse_in_header) +
+		fuse_len_args(req->args->in_numargs,
+			      (struct fuse_arg *) req->args->in_args);
+
+	err = fuse_uring_queue_fuse_req(fc, req);
+	if (!err) {
+		/* XXX remove and lets the users of that use per queue values -
+		 * avoid the shared spin lock...
+		 * Is this needed at all?
+		 */
+		spin_lock(&fc->bg_lock);
+		fc->num_background++;
+		fc->active_background++;
+
+
+		/* XXX block when per ring queues get occupied */
+		if (fc->num_background == fc->max_background)
+			fc->blocked = 1;
+		spin_unlock(&fc->bg_lock);
+	}
+
+	return err ? false : true;
+}
+
+/*
+ * @return true if queued
+ */
+static int fuse_request_queue_background(struct fuse_req *req)
 {
 	struct fuse_mount *fm = req->fm;
 	struct fuse_conn *fc = fm->fc;
 	bool queued = false;
 
 	WARN_ON(!test_bit(FR_BACKGROUND, &req->flags));
+
+	if (fuse_uring_ready(fc))
+		return fuse_request_queue_background_uring(fc, req);
+
 	if (!test_bit(FR_WAITING, &req->flags)) {
 		__set_bit(FR_WAITING, &req->flags);
 		atomic_inc(&fc->num_waiting);
@@ -576,7 +636,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
 				    struct fuse_args *args, u64 unique)
 {
 	struct fuse_req *req;
-	struct fuse_iqueue *fiq = &fm->fc->iq;
+	struct fuse_conn *fc = fm->fc;
+	struct fuse_iqueue *fiq = &fc->iq;
 	int err = 0;
 
 	req = fuse_get_req(fm, false);
@@ -590,7 +651,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
 
 	spin_lock(&fiq->lock);
 	if (fiq->connected) {
-		queue_request_and_unlock(fiq, req);
+		/* uring for notify not supported yet */
+		queue_request_and_unlock(fc, req, false);
 	} else {
 		err = -ENODEV;
 		spin_unlock(&fiq->lock);
@@ -2205,6 +2267,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
 		fuse_uring_set_stopped(fc);
 
 		fuse_set_initialized(fc);
+
 		list_for_each_entry(fud, &fc->devices, entry) {
 			struct fuse_pqueue *fpq = &fud->pq;
 
@@ -2478,6 +2541,7 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
 		if (res != 0)
 			return res;
 		break;
+
 		case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
 			fud->uring_dev = 1;
 			res = fuse_uring_queue_cfg(fc->ring, &cfg.qconf);
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 6001ba4d6e82..fe80e66150c3 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -32,8 +32,7 @@
 #include <linux/io_uring/cmd.h>
 
 static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
-					    bool set_err, int error,
-					    unsigned int issue_flags);
+					    bool set_err, int error);
 
 static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
 {
@@ -683,8 +682,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
  * userspace will read it
  * This is comparable with classical read(/dev/fuse)
  */
-static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
-				    unsigned int issue_flags, bool send_in_task)
+static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
 {
 	struct fuse_ring *ring = ring_ent->queue->ring;
 	struct fuse_ring_req *rreq = ring_ent->rreq;
@@ -721,20 +719,17 @@ static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
 	rreq->in = req->in.h;
 	set_bit(FR_SENT, &req->flags);
 
-	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu issue_flags=%u\n",
+	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu\n",
 		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
-		 rreq->in.opcode, rreq->in.unique, issue_flags);
+		 rreq->in.opcode, rreq->in.unique);
 
-	if (send_in_task)
-		io_uring_cmd_complete_in_task(ring_ent->cmd,
-					      fuse_uring_async_send_to_ring);
-	else
-		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
+	io_uring_cmd_complete_in_task(ring_ent->cmd,
+				      fuse_uring_async_send_to_ring);
 
 	return;
 
 err:
-	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
+	fuse_uring_req_end_and_get_next(ring_ent, true, err);
 }
 
 /*
@@ -811,8 +806,7 @@ static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
  * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
  */
 static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
-					    bool set_err, int error,
-					    unsigned int issue_flags)
+					    bool set_err, int error)
 {
 	struct fuse_req *req = ring_ent->fuse_req;
 	int has_next;
@@ -828,7 +822,7 @@ static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
 	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
 	if (has_next) {
 		/* called within uring context - use provided flags */
-		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
+		fuse_uring_send_to_ring(ring_ent);
 	}
 }
 
@@ -863,7 +857,7 @@ static void fuse_uring_commit_and_release(struct fuse_dev *fud,
 out:
 	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
 		 req->args->opcode, req->out.h.error);
-	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
+	fuse_uring_req_end_and_get_next(ring_ent, set_err, err);
 }
 
 /*
@@ -1101,3 +1095,69 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	goto out;
 }
 
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+{
+	struct fuse_ring *ring = fc->ring;
+	struct fuse_ring_queue *queue;
+	int qid = 0;
+	struct fuse_ring_ent *ring_ent = NULL;
+	int res;
+	bool async = test_bit(FR_BACKGROUND, &req->flags);
+	struct list_head *req_queue, *ent_queue;
+
+	if (ring->per_core_queue) {
+		/*
+		 * async requests are best handled on another core, the current
+		 * core can do application/page handling, while the async request
+		 * is handled on another core in userspace.
+		 * For sync request the application has to wait - no processing, so
+		 * the request should continue on the current core and avoid context
+		 * switches.
+		 * XXX This should be on the same numa node and not busy - is there
+		 * a scheduler function available  that could make this decision?
+		 * It should also not persistently switch between cores - makes
+		 * it hard for the scheduler.
+		 */
+		qid = task_cpu(current);
+
+		if (unlikely(qid >= ring->nr_queues)) {
+			WARN_ONCE(1,
+				  "Core number (%u) exceeds nr ueues (%zu)\n",
+				  qid, ring->nr_queues);
+			qid = 0;
+		}
+	}
+
+	queue = fuse_uring_get_queue(ring, qid);
+	req_queue = async ? &queue->async_fuse_req_queue :
+			    &queue->sync_fuse_req_queue;
+	ent_queue = async ? &queue->async_ent_avail_queue :
+			    &queue->sync_ent_avail_queue;
+
+	spin_lock(&queue->lock);
+
+	if (unlikely(queue->stopped)) {
+		res = -ENOTCONN;
+		goto err_unlock;
+	}
+
+	if (list_empty(ent_queue)) {
+		list_add_tail(&req->list, req_queue);
+	} else {
+		ring_ent =
+			list_first_entry(ent_queue, struct fuse_ring_ent, list);
+		list_del(&ring_ent->list);
+		fuse_uring_add_req_to_ring_ent(ring_ent, req);
+	}
+	spin_unlock(&queue->lock);
+
+	if (ring_ent != NULL)
+		fuse_uring_send_to_ring(ring_ent);
+
+	return 0;
+
+err_unlock:
+	spin_unlock(&queue->lock);
+	return res;
+}
+
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index e5fc84e2f3ea..5d7e1e6e7a82 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -208,6 +208,7 @@ int fuse_uring_queue_cfg(struct fuse_ring *ring,
 void fuse_uring_ring_destruct(struct fuse_ring *ring);
 void fuse_uring_stop_queues(struct fuse_ring *ring);
 int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req);
 
 static inline void fuse_uring_conn_init(struct fuse_ring *ring,
 					struct fuse_conn *fc)
@@ -331,6 +332,11 @@ static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
 			   atomic_read(&ring->queue_refs) == 0);
 }
 
+static inline bool fuse_uring_ready(struct fuse_conn *fc)
+{
+	return fc->ring && fc->ring->ready;
+}
+
 #else /* CONFIG_FUSE_IO_URING */
 
 struct fuse_ring;
@@ -366,6 +372,17 @@ static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
 {
 }
 
+static inline bool fuse_uring_ready(struct fuse_conn *fc)
+{
+	return false;
+}
+
+static inline int
+fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+{
+	return -EPFNOSUPPORT;
+}
+
 #endif /* CONFIG_FUSE_IO_URING */
 
 #endif /* _FS_FUSE_DEV_URING_I_H */

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring
  2024-05-29 18:00 ` [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring Bernd Schubert
@ 2024-05-30 20:32   ` Josef Bacik
  2024-05-30 21:26     ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 20:32 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 08:00:49PM +0200, Bernd Schubert wrote:
> This enables enqueuing requests through fuse uring queues.
> 
> For initial simplicity requests are always allocated the normal way
> then added to ring queues lists and only then copied to ring queue
> entries. Later on the allocation and adding the requests to a list
> can be avoided, by directly using a ring entry. This introduces
> some code complexity and is therefore not done for now.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c         | 80 +++++++++++++++++++++++++++++++++++++++-----
>  fs/fuse/dev_uring.c   | 92 ++++++++++++++++++++++++++++++++++++++++++---------
>  fs/fuse/dev_uring_i.h | 17 ++++++++++
>  3 files changed, 165 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 6ffd216b27c8..c7fd3849a105 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -218,13 +218,29 @@ const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
>  };
>  EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
>  
> -static void queue_request_and_unlock(struct fuse_iqueue *fiq,
> -				     struct fuse_req *req)
> +
> +static void queue_request_and_unlock(struct fuse_conn *fc,
> +				     struct fuse_req *req, bool allow_uring)
>  __releases(fiq->lock)
>  {
> +	struct fuse_iqueue *fiq = &fc->iq;
> +
>  	req->in.h.len = sizeof(struct fuse_in_header) +
>  		fuse_len_args(req->args->in_numargs,
>  			      (struct fuse_arg *) req->args->in_args);
> +
> +	if (allow_uring && fuse_uring_ready(fc)) {
> +		int res;
> +
> +		/* this lock is not needed at all for ring req handling */
> +		spin_unlock(&fiq->lock);
> +		res = fuse_uring_queue_fuse_req(fc, req);
> +		if (!res)
> +			return;
> +
> +		/* fallthrough, handled through /dev/fuse read/write */

We need the lock here because we're modifying &fiq->pending, this will end in
tears.

> +	}
> +
>  	list_add_tail(&req->list, &fiq->pending);
>  	fiq->ops->wake_pending_and_unlock(fiq);
>  }
> @@ -261,7 +277,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
>  		fc->active_background++;
>  		spin_lock(&fiq->lock);
>  		req->in.h.unique = fuse_get_unique(fiq);
> -		queue_request_and_unlock(fiq, req);
> +		queue_request_and_unlock(fc, req, true);
>  	}
>  }
>  
> @@ -405,7 +421,8 @@ static void request_wait_answer(struct fuse_req *req)
>  
>  static void __fuse_request_send(struct fuse_req *req)
>  {
> -	struct fuse_iqueue *fiq = &req->fm->fc->iq;
> +	struct fuse_conn *fc = req->fm->fc;
> +	struct fuse_iqueue *fiq = &fc->iq;
>  
>  	BUG_ON(test_bit(FR_BACKGROUND, &req->flags));
>  	spin_lock(&fiq->lock);
> @@ -417,7 +434,7 @@ static void __fuse_request_send(struct fuse_req *req)
>  		/* acquire extra reference, since request is still needed
>  		   after fuse_request_end() */
>  		__fuse_get_request(req);
> -		queue_request_and_unlock(fiq, req);
> +		queue_request_and_unlock(fc, req, true);
>  
>  		request_wait_answer(req);
>  		/* Pairs with smp_wmb() in fuse_request_end() */
> @@ -487,6 +504,10 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
>  	if (args->force) {
>  		atomic_inc(&fc->num_waiting);
>  		req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
> +		if (unlikely(!req)) {
> +			ret = -ENOTCONN;
> +			goto err;
> +		}

This is extraneous, and not possible since we're doing __GFP_NOFAIL.

>  
>  		if (!args->nocreds)
>  			fuse_force_creds(req);
> @@ -514,16 +535,55 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
>  	}
>  	fuse_put_request(req);
>  
> +err:
>  	return ret;
>  }
>  
> -static bool fuse_request_queue_background(struct fuse_req *req)
> +static bool fuse_request_queue_background_uring(struct fuse_conn *fc,
> +					       struct fuse_req *req)
> +{
> +	struct fuse_iqueue *fiq = &fc->iq;
> +	int err;
> +
> +	req->in.h.unique = fuse_get_unique(fiq);
> +	req->in.h.len = sizeof(struct fuse_in_header) +
> +		fuse_len_args(req->args->in_numargs,
> +			      (struct fuse_arg *) req->args->in_args);
> +
> +	err = fuse_uring_queue_fuse_req(fc, req);
> +	if (!err) {

I'd rather

if (err)
	return false;

Then the rest of this code.

Also generally speaking I think you're correct, below isn't needed because the
queues themselves have their own limits, so I think just delete this bit.

> +		/* XXX remove and lets the users of that use per queue values -
> +		 * avoid the shared spin lock...
> +		 * Is this needed at all?
> +		 */
> +		spin_lock(&fc->bg_lock);
> +		fc->num_background++;
> +		fc->active_background++;
> +
> +
> +		/* XXX block when per ring queues get occupied */
> +		if (fc->num_background == fc->max_background)
> +			fc->blocked = 1;
> +		spin_unlock(&fc->bg_lock);
> +	}
> +
> +	return err ? false : true;
> +}
> +
> +/*
> + * @return true if queued
> + */
> +static int fuse_request_queue_background(struct fuse_req *req)
>  {
>  	struct fuse_mount *fm = req->fm;
>  	struct fuse_conn *fc = fm->fc;
>  	bool queued = false;
>  
>  	WARN_ON(!test_bit(FR_BACKGROUND, &req->flags));
> +
> +	if (fuse_uring_ready(fc))
> +		return fuse_request_queue_background_uring(fc, req);
> +
>  	if (!test_bit(FR_WAITING, &req->flags)) {
>  		__set_bit(FR_WAITING, &req->flags);
>  		atomic_inc(&fc->num_waiting);
> @@ -576,7 +636,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
>  				    struct fuse_args *args, u64 unique)
>  {
>  	struct fuse_req *req;
> -	struct fuse_iqueue *fiq = &fm->fc->iq;
> +	struct fuse_conn *fc = fm->fc;
> +	struct fuse_iqueue *fiq = &fc->iq;
>  	int err = 0;
>  
>  	req = fuse_get_req(fm, false);
> @@ -590,7 +651,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
>  
>  	spin_lock(&fiq->lock);
>  	if (fiq->connected) {
> -		queue_request_and_unlock(fiq, req);
> +		/* uring for notify not supported yet */
> +		queue_request_and_unlock(fc, req, false);
>  	} else {
>  		err = -ENODEV;
>  		spin_unlock(&fiq->lock);
> @@ -2205,6 +2267,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
>  		fuse_uring_set_stopped(fc);
>  
>  		fuse_set_initialized(fc);
> +

Extraneous newline.

>  		list_for_each_entry(fud, &fc->devices, entry) {
>  			struct fuse_pqueue *fpq = &fud->pq;
>  
> @@ -2478,6 +2541,7 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>  		if (res != 0)
>  			return res;
>  		break;
> +

Extraneous newline.

>  		case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
>  			fud->uring_dev = 1;
>  			res = fuse_uring_queue_cfg(fc->ring, &cfg.qconf);
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 6001ba4d6e82..fe80e66150c3 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -32,8 +32,7 @@
>  #include <linux/io_uring/cmd.h>
>  
>  static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
> -					    bool set_err, int error,
> -					    unsigned int issue_flags);
> +					    bool set_err, int error);
>  
>  static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
>  {
> @@ -683,8 +682,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
>   * userspace will read it
>   * This is comparable with classical read(/dev/fuse)
>   */
> -static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
> -				    unsigned int issue_flags, bool send_in_task)
> +static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
>  {
>  	struct fuse_ring *ring = ring_ent->queue->ring;
>  	struct fuse_ring_req *rreq = ring_ent->rreq;
> @@ -721,20 +719,17 @@ static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
>  	rreq->in = req->in.h;
>  	set_bit(FR_SENT, &req->flags);
>  
> -	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu issue_flags=%u\n",
> +	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu\n",
>  		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
> -		 rreq->in.opcode, rreq->in.unique, issue_flags);
> +		 rreq->in.opcode, rreq->in.unique);
>  
> -	if (send_in_task)
> -		io_uring_cmd_complete_in_task(ring_ent->cmd,
> -					      fuse_uring_async_send_to_ring);
> -	else
> -		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
> +	io_uring_cmd_complete_in_task(ring_ent->cmd,
> +				      fuse_uring_async_send_to_ring);
>  
>  	return;
>  
>  err:
> -	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
> +	fuse_uring_req_end_and_get_next(ring_ent, true, err);
>  }
>  
>  /*
> @@ -811,8 +806,7 @@ static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
>   * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
>   */
>  static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
> -					    bool set_err, int error,
> -					    unsigned int issue_flags)
> +					    bool set_err, int error)
>  {
>  	struct fuse_req *req = ring_ent->fuse_req;
>  	int has_next;
> @@ -828,7 +822,7 @@ static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
>  	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
>  	if (has_next) {
>  		/* called within uring context - use provided flags */
> -		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
> +		fuse_uring_send_to_ring(ring_ent);
>  	}
>  }
>  
> @@ -863,7 +857,7 @@ static void fuse_uring_commit_and_release(struct fuse_dev *fud,
>  out:
>  	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
>  		 req->args->opcode, req->out.h.error);
> -	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
> +	fuse_uring_req_end_and_get_next(ring_ent, set_err, err);
>  }
>  
>  /*
> @@ -1101,3 +1095,69 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>  	goto out;
>  }
>  
> +int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
> +{
> +	struct fuse_ring *ring = fc->ring;
> +	struct fuse_ring_queue *queue;
> +	int qid = 0;
> +	struct fuse_ring_ent *ring_ent = NULL;
> +	int res;
> +	bool async = test_bit(FR_BACKGROUND, &req->flags);
> +	struct list_head *req_queue, *ent_queue;
> +
> +	if (ring->per_core_queue) {
> +		/*
> +		 * async requests are best handled on another core, the current
> +		 * core can do application/page handling, while the async request
> +		 * is handled on another core in userspace.
> +		 * For sync request the application has to wait - no processing, so
> +		 * the request should continue on the current core and avoid context
> +		 * switches.
> +		 * XXX This should be on the same numa node and not busy - is there
> +		 * a scheduler function available  that could make this decision?
> +		 * It should also not persistently switch between cores - makes
> +		 * it hard for the scheduler.
> +		 */
> +		qid = task_cpu(current);
> +
> +		if (unlikely(qid >= ring->nr_queues)) {
> +			WARN_ONCE(1,
> +				  "Core number (%u) exceeds nr ueues (%zu)\n",
> +				  qid, ring->nr_queues);
> +			qid = 0;
> +		}
> +	}
> +
> +	queue = fuse_uring_get_queue(ring, qid);
> +	req_queue = async ? &queue->async_fuse_req_queue :
> +			    &queue->sync_fuse_req_queue;
> +	ent_queue = async ? &queue->async_ent_avail_queue :
> +			    &queue->sync_ent_avail_queue;
> +
> +	spin_lock(&queue->lock);
> +
> +	if (unlikely(queue->stopped)) {
> +		res = -ENOTCONN;
> +		goto err_unlock;

This is the only place we use err_unlock, just do

if (unlikely(queue->stopped)) {
	spin_unlock(&queue->lock);
	return -ENOTCONN;
}

and then you can get rid of res.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring
  2024-05-30 20:32   ` Josef Bacik
@ 2024-05-30 21:26     ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 21:26 UTC (permalink / raw)
  To: Josef Bacik, Bernd Schubert; +Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/30/24 22:32, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:49PM +0200, Bernd Schubert wrote:
>> This enables enqueuing requests through fuse uring queues.
>>
>> For initial simplicity requests are always allocated the normal way
>> then added to ring queues lists and only then copied to ring queue
>> entries. Later on the allocation and adding the requests to a list
>> can be avoided, by directly using a ring entry. This introduces
>> some code complexity and is therefore not done for now.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev.c         | 80 +++++++++++++++++++++++++++++++++++++++-----
>>  fs/fuse/dev_uring.c   | 92 ++++++++++++++++++++++++++++++++++++++++++---------
>>  fs/fuse/dev_uring_i.h | 17 ++++++++++
>>  3 files changed, 165 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 6ffd216b27c8..c7fd3849a105 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -218,13 +218,29 @@ const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
>>  };
>>  EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
>>  
>> -static void queue_request_and_unlock(struct fuse_iqueue *fiq,
>> -				     struct fuse_req *req)
>> +
>> +static void queue_request_and_unlock(struct fuse_conn *fc,
>> +				     struct fuse_req *req, bool allow_uring)
>>  __releases(fiq->lock)
>>  {
>> +	struct fuse_iqueue *fiq = &fc->iq;
>> +
>>  	req->in.h.len = sizeof(struct fuse_in_header) +
>>  		fuse_len_args(req->args->in_numargs,
>>  			      (struct fuse_arg *) req->args->in_args);
>> +
>> +	if (allow_uring && fuse_uring_ready(fc)) {
>> +		int res;
>> +
>> +		/* this lock is not needed at all for ring req handling */
>> +		spin_unlock(&fiq->lock);
>> +		res = fuse_uring_queue_fuse_req(fc, req);
>> +		if (!res)
>> +			return;
>> +
>> +		/* fallthrough, handled through /dev/fuse read/write */
> 
> We need the lock here because we're modifying &fiq->pending, this will end in
> tears.

Ouch right, sorry that I had missed that. I will actually remove the
fallthrough altogether, not needed anymore.

> 
>> +	}
>> +
>>  	list_add_tail(&req->list, &fiq->pending);
>>  	fiq->ops->wake_pending_and_unlock(fiq);
>>  }
>> @@ -261,7 +277,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
>>  		fc->active_background++;
>>  		spin_lock(&fiq->lock);
>>  		req->in.h.unique = fuse_get_unique(fiq);
>> -		queue_request_and_unlock(fiq, req);
>> +		queue_request_and_unlock(fc, req, true);
>>  	}
>>  }
>>  
>> @@ -405,7 +421,8 @@ static void request_wait_answer(struct fuse_req *req)
>>  
>>  static void __fuse_request_send(struct fuse_req *req)
>>  {
>> -	struct fuse_iqueue *fiq = &req->fm->fc->iq;
>> +	struct fuse_conn *fc = req->fm->fc;
>> +	struct fuse_iqueue *fiq = &fc->iq;
>>  
>>  	BUG_ON(test_bit(FR_BACKGROUND, &req->flags));
>>  	spin_lock(&fiq->lock);
>> @@ -417,7 +434,7 @@ static void __fuse_request_send(struct fuse_req *req)
>>  		/* acquire extra reference, since request is still needed
>>  		   after fuse_request_end() */
>>  		__fuse_get_request(req);
>> -		queue_request_and_unlock(fiq, req);
>> +		queue_request_and_unlock(fc, req, true);
>>  
>>  		request_wait_answer(req);
>>  		/* Pairs with smp_wmb() in fuse_request_end() */
>> @@ -487,6 +504,10 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
>>  	if (args->force) {
>>  		atomic_inc(&fc->num_waiting);
>>  		req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
>> +		if (unlikely(!req)) {
>> +			ret = -ENOTCONN;
>> +			goto err;
>> +		}
> 
> This is extraneous, and not possible since we're doing __GFP_NOFAIL.
> 
>>  
>>  		if (!args->nocreds)
>>  			fuse_force_creds(req);
>> @@ -514,16 +535,55 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
>>  	}
>>  	fuse_put_request(req);
>>  
>> +err:
>>  	return ret;
>>  }
>>  
>> -static bool fuse_request_queue_background(struct fuse_req *req)
>> +static bool fuse_request_queue_background_uring(struct fuse_conn *fc,
>> +					       struct fuse_req *req)
>> +{
>> +	struct fuse_iqueue *fiq = &fc->iq;
>> +	int err;
>> +
>> +	req->in.h.unique = fuse_get_unique(fiq);
>> +	req->in.h.len = sizeof(struct fuse_in_header) +
>> +		fuse_len_args(req->args->in_numargs,
>> +			      (struct fuse_arg *) req->args->in_args);
>> +
>> +	err = fuse_uring_queue_fuse_req(fc, req);
>> +	if (!err) {
> 
> I'd rather
> 
> if (err)
> 	return false;
> 
> Then the rest of this code.
> 
> Also generally speaking I think you're correct, below isn't needed because the
> queues themselves have their own limits, so I think just delete this bit.
> 
>> +		/* XXX remove and lets the users of that use per queue values -
>> +		 * avoid the shared spin lock...
>> +		 * Is this needed at all?
>> +		 */
>> +		spin_lock(&fc->bg_lock);
>> +		fc->num_background++;
>> +		fc->active_background++;


I now actually think we still need it, because in the current version it
queues to queue->async_fuse_req_queue  / queue->sync_fuse_req_queue.

>> +
>> +
>> +		/* XXX block when per ring queues get occupied */
>> +		if (fc->num_background == fc->max_background)
>> +			fc->blocked = 1;

I need to double check again, but I think I can just remove both XXX.

I also just see an issue with fc->active_background, fuse_request_end()
is decreasing it unconditionally, but with uring we never increase it
(and don't need it). I think I need an FR_URING flag.


>> +		spin_unlock(&fc->bg_lock);
>> +	}
>> +
>> +	return err ? false : true;
>> +}
>> +
>> +/*
>> + * @return true if queued
>> + */
>> +static int fuse_request_queue_background(struct fuse_req *req)
>>  {
>>  	struct fuse_mount *fm = req->fm;
>>  	struct fuse_conn *fc = fm->fc;
>>  	bool queued = false;
>>  
>>  	WARN_ON(!test_bit(FR_BACKGROUND, &req->flags));
>> +
>> +	if (fuse_uring_ready(fc))
>> +		return fuse_request_queue_background_uring(fc, req);
>> +
>>  	if (!test_bit(FR_WAITING, &req->flags)) {
>>  		__set_bit(FR_WAITING, &req->flags);
>>  		atomic_inc(&fc->num_waiting);
>> @@ -576,7 +636,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
>>  				    struct fuse_args *args, u64 unique)
>>  {
>>  	struct fuse_req *req;
>> -	struct fuse_iqueue *fiq = &fm->fc->iq;
>> +	struct fuse_conn *fc = fm->fc;
>> +	struct fuse_iqueue *fiq = &fc->iq;
>>  	int err = 0;
>>  
>>  	req = fuse_get_req(fm, false);
>> @@ -590,7 +651,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
>>  
>>  	spin_lock(&fiq->lock);
>>  	if (fiq->connected) {
>> -		queue_request_and_unlock(fiq, req);
>> +		/* uring for notify not supported yet */
>> +		queue_request_and_unlock(fc, req, false);
>>  	} else {
>>  		err = -ENODEV;
>>  		spin_unlock(&fiq->lock);
>> @@ -2205,6 +2267,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
>>  		fuse_uring_set_stopped(fc);
>>  
>>  		fuse_set_initialized(fc);
>> +
> 
> Extraneous newline.
> 
>>  		list_for_each_entry(fud, &fc->devices, entry) {
>>  			struct fuse_pqueue *fpq = &fud->pq;
>>  
>> @@ -2478,6 +2541,7 @@ static long fuse_uring_ioctl(struct file *file, __u32 __user *argp)
>>  		if (res != 0)
>>  			return res;
>>  		break;
>> +
> 
> Extraneous newline.
> 

Sorry, these two slipped through.

>>  		case FUSE_URING_IOCTL_CMD_QUEUE_CFG:
>>  			fud->uring_dev = 1;
>>  			res = fuse_uring_queue_cfg(fc->ring, &cfg.qconf);
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> index 6001ba4d6e82..fe80e66150c3 100644
>> --- a/fs/fuse/dev_uring.c
>> +++ b/fs/fuse/dev_uring.c
>> @@ -32,8 +32,7 @@
>>  #include <linux/io_uring/cmd.h>
>>  
>>  static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
>> -					    bool set_err, int error,
>> -					    unsigned int issue_flags);
>> +					    bool set_err, int error);
>>  
>>  static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
>>  {
>> @@ -683,8 +682,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
>>   * userspace will read it
>>   * This is comparable with classical read(/dev/fuse)
>>   */
>> -static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
>> -				    unsigned int issue_flags, bool send_in_task)
>> +static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
>>  {
>>  	struct fuse_ring *ring = ring_ent->queue->ring;
>>  	struct fuse_ring_req *rreq = ring_ent->rreq;
>> @@ -721,20 +719,17 @@ static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
>>  	rreq->in = req->in.h;
>>  	set_bit(FR_SENT, &req->flags);
>>  
>> -	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu issue_flags=%u\n",
>> +	pr_devel("%s qid=%d tag=%d state=%lu cmd-done op=%d unique=%llu\n",
>>  		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
>> -		 rreq->in.opcode, rreq->in.unique, issue_flags);
>> +		 rreq->in.opcode, rreq->in.unique);
>>  
>> -	if (send_in_task)
>> -		io_uring_cmd_complete_in_task(ring_ent->cmd,
>> -					      fuse_uring_async_send_to_ring);
>> -	else
>> -		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
>> +	io_uring_cmd_complete_in_task(ring_ent->cmd,
>> +				      fuse_uring_async_send_to_ring);

Oops, here went something wrong, in the previous patch, which had
introduce the "if (send_in_task)" - this part of later patch.

>>  
>>  	return;
>>  
>>  err:
>> -	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
>> +	fuse_uring_req_end_and_get_next(ring_ent, true, err);
>>  }
>>  
>>  /*
>> @@ -811,8 +806,7 @@ static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
>>   * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
>>   */
>>  static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
>> -					    bool set_err, int error,
>> -					    unsigned int issue_flags)
>> +					    bool set_err, int error)
>>  {
>>  	struct fuse_req *req = ring_ent->fuse_req;
>>  	int has_next;
>> @@ -828,7 +822,7 @@ static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
>>  	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
>>  	if (has_next) {
>>  		/* called within uring context - use provided flags */
>> -		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
>> +		fuse_uring_send_to_ring(ring_ent);
>>  	}
>>  }
>>  
>> @@ -863,7 +857,7 @@ static void fuse_uring_commit_and_release(struct fuse_dev *fud,
>>  out:
>>  	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
>>  		 req->args->opcode, req->out.h.error);
>> -	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
>> +	fuse_uring_req_end_and_get_next(ring_ent, set_err, err);
>>  }
>>  
>>  /*
>> @@ -1101,3 +1095,69 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>>  	goto out;
>>  }
>>  
>> +int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
>> +{
>> +	struct fuse_ring *ring = fc->ring;
>> +	struct fuse_ring_queue *queue;
>> +	int qid = 0;
>> +	struct fuse_ring_ent *ring_ent = NULL;
>> +	int res;
>> +	bool async = test_bit(FR_BACKGROUND, &req->flags);
>> +	struct list_head *req_queue, *ent_queue;
>> +
>> +	if (ring->per_core_queue) {
>> +		/*
>> +		 * async requests are best handled on another core, the current
>> +		 * core can do application/page handling, while the async request
>> +		 * is handled on another core in userspace.
>> +		 * For sync request the application has to wait - no processing, so
>> +		 * the request should continue on the current core and avoid context
>> +		 * switches.
>> +		 * XXX This should be on the same numa node and not busy - is there
>> +		 * a scheduler function available  that could make this decision?
>> +		 * It should also not persistently switch between cores - makes
>> +		 * it hard for the scheduler.
>> +		 */
>> +		qid = task_cpu(current);
>> +
>> +		if (unlikely(qid >= ring->nr_queues)) {
>> +			WARN_ONCE(1,
>> +				  "Core number (%u) exceeds nr ueues (%zu)\n",
>> +				  qid, ring->nr_queues);
>> +			qid = 0;
>> +		}
>> +	}
>> +
>> +	queue = fuse_uring_get_queue(ring, qid);
>> +	req_queue = async ? &queue->async_fuse_req_queue :
>> +			    &queue->sync_fuse_req_queue;
>> +	ent_queue = async ? &queue->async_ent_avail_queue :
>> +			    &queue->sync_ent_avail_queue;
>> +
>> +	spin_lock(&queue->lock);
>> +
>> +	if (unlikely(queue->stopped)) {
>> +		res = -ENOTCONN;
>> +		goto err_unlock;
> 
> This is the only place we use err_unlock, just do
> 
> if (unlikely(queue->stopped)) {
> 	spin_unlock(&queue->lock);
> 	return -ENOTCONN;
> }
> 
> and then you can get rid of res.  Thanks,


Thanks, will do.
(I personally typically avoid unlock/return in the middle of a function
as one can easily miss the unlock with new code additions - I have bad
experience with that).


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (13 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 20:37   ` Josef Bacik
  2024-05-31 13:51   ` Christoph Hellwig
  2024-05-29 18:00 ` [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu Bernd Schubert
                   ` (7 subsequent siblings)
  22 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert
  Cc: Ingo Molnar, Peter Zijlstra, Andrei Vagin

This is needed by fuse-over-io-uring to wake up the waiting
application thread on the core it was submitted from.
Avoiding core switching is actually a major factor for
fuse performance improvements of fuse-over-io-uring.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrei Vagin <avagin@google.com>
---
 kernel/sched/wait.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..6576a1ef5d43 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -132,6 +132,7 @@ void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode
 {
 	__wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
 }
+EXPORT_SYMBOL(__wake_up_on_current_cpu);
 
 /*
  * Same as __wake_up but called with the spinlock in wait_queue_head_t held.

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-05-29 18:00 ` [PATCH RFC v2 15/19] export __wake_on_current_cpu Bernd Schubert
@ 2024-05-30 20:37   ` Josef Bacik
  2024-06-04  9:26     ` Peter Zijlstra
  2024-05-31 13:51   ` Christoph Hellwig
  1 sibling, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 20:37 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin

On Wed, May 29, 2024 at 08:00:50PM +0200, Bernd Schubert wrote:
> This is needed by fuse-over-io-uring to wake up the waiting
> application thread on the core it was submitted from.
> Avoiding core switching is actually a major factor for
> fuse performance improvements of fuse-over-io-uring.
> 
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrei Vagin <avagin@google.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Probably best to submit this as a one-off so the sched guys can take it and it's
not in the middle of a fuse patchset they may be ignoring.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-05-30 20:37   ` Josef Bacik
@ 2024-06-04  9:26     ` Peter Zijlstra
  2024-06-04  9:36       ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2024-06-04  9:26 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	bernd.schubert, Ingo Molnar, Andrei Vagin

On Thu, May 30, 2024 at 04:37:29PM -0400, Josef Bacik wrote:
> On Wed, May 29, 2024 at 08:00:50PM +0200, Bernd Schubert wrote:
> > This is needed by fuse-over-io-uring to wake up the waiting
> > application thread on the core it was submitted from.
> > Avoiding core switching is actually a major factor for
> > fuse performance improvements of fuse-over-io-uring.
> > 
> > Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Andrei Vagin <avagin@google.com>
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 
> Probably best to submit this as a one-off so the sched guys can take it and it's
> not in the middle of a fuse patchset they may be ignoring.  Thanks,

On its own its going to not be applied. Never merge an EXPORT without a
user.

As is, I don't have enough of the series to even see the user, so yeah,
not happy :/

And as hch said, this very much needs to be a GPL export.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-06-04  9:26     ` Peter Zijlstra
@ 2024-06-04  9:36       ` Bernd Schubert
  2024-06-04 19:27         ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-06-04  9:36 UTC (permalink / raw)
  To: Peter Zijlstra, Josef Bacik
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	bernd.schubert@fastmail.fm, Ingo Molnar, Andrei Vagin

On 6/4/24 11:26, Peter Zijlstra wrote:
> On Thu, May 30, 2024 at 04:37:29PM -0400, Josef Bacik wrote:
>> On Wed, May 29, 2024 at 08:00:50PM +0200, Bernd Schubert wrote:
>>> This is needed by fuse-over-io-uring to wake up the waiting
>>> application thread on the core it was submitted from.
>>> Avoiding core switching is actually a major factor for
>>> fuse performance improvements of fuse-over-io-uring.
>>>
>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Andrei Vagin <avagin@google.com>
>>
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>
>> Probably best to submit this as a one-off so the sched guys can take it and it's
>> not in the middle of a fuse patchset they may be ignoring.  Thanks,
> 
> On its own its going to not be applied. Never merge an EXPORT without a
> user.
> 
> As is, I don't have enough of the series to even see the user, so yeah,
> not happy :/
> 
> And as hch said, this very much needs to be a GPL export.

Sorry, accidentally done without the _GPL. What is the right way to get this merged? 
First merge the entire fuse-io-uring series and then add on this? I already have these 
optimization patches at the end of the series... The user for this is in the next patch

[PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index c7fd3849a105..851c5fa99946 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
                spin_unlock(&fc->bg_lock);
        } else {
                /* Wake up waiter sleeping in request_wait_answer() */
-               wake_up(&req->waitq);
+               if (fuse_per_core_queue(fc))
+                       __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
+               else
+                       wake_up(&req->waitq);
        }

        if (test_bit(FR_ASYNC, &req->flags))




Thank,
Bernd


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-06-04  9:36       ` Bernd Schubert
@ 2024-06-04 19:27         ` Peter Zijlstra
  2024-09-01 12:07           ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2024-06-04 19:27 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Josef Bacik, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, bernd.schubert@fastmail.fm,
	Ingo Molnar, Andrei Vagin

On Tue, Jun 04, 2024 at 09:36:08AM +0000, Bernd Schubert wrote:
> On 6/4/24 11:26, Peter Zijlstra wrote:
> > On Thu, May 30, 2024 at 04:37:29PM -0400, Josef Bacik wrote:
> >> On Wed, May 29, 2024 at 08:00:50PM +0200, Bernd Schubert wrote:
> >>> This is needed by fuse-over-io-uring to wake up the waiting
> >>> application thread on the core it was submitted from.
> >>> Avoiding core switching is actually a major factor for
> >>> fuse performance improvements of fuse-over-io-uring.
> >>>
> >>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> >>> Cc: Ingo Molnar <mingo@redhat.com>
> >>> Cc: Peter Zijlstra <peterz@infradead.org>
> >>> Cc: Andrei Vagin <avagin@google.com>
> >>
> >> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> >>
> >> Probably best to submit this as a one-off so the sched guys can take it and it's
> >> not in the middle of a fuse patchset they may be ignoring.  Thanks,
> > 
> > On its own its going to not be applied. Never merge an EXPORT without a
> > user.
> > 
> > As is, I don't have enough of the series to even see the user, so yeah,
> > not happy :/
> > 
> > And as hch said, this very much needs to be a GPL export.
> 
> Sorry, accidentally done without the _GPL. What is the right way to get this merged? 
> First merge the entire fuse-io-uring series and then add on this? I already have these 
> optimization patches at the end of the series... The user for this is in the next patch

Yeah, but you didn't send me the next patch, did you? So I have no
clue.. :-)

Anyway, if you could add a wee comment to __wake_up_con_current_cpu()
along with the EXPORT_SYMBOL_GPL() that might be nice. I suppose you can
copy paste from __wake_up() and then edit a wee bit.

> [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu
> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index c7fd3849a105..851c5fa99946 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
>                 spin_unlock(&fc->bg_lock);
>         } else {
>                 /* Wake up waiter sleeping in request_wait_answer() */
> -               wake_up(&req->waitq);
> +               if (fuse_per_core_queue(fc))
> +                       __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
> +               else
> +                       wake_up(&req->waitq);
>         }
> 
>         if (test_bit(FR_ASYNC, &req->flags))

Fair enough, although do we want a helper like wake_up() -- something
like wake_up_on_current_cpu() ?



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-06-04 19:27         ` Peter Zijlstra
@ 2024-09-01 12:07           ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-09-01 12:07 UTC (permalink / raw)
  To: Peter Zijlstra, Bernd Schubert
  Cc: Josef Bacik, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, Ingo Molnar, Andrei Vagin



On 6/4/24 21:27, Peter Zijlstra wrote:
> On Tue, Jun 04, 2024 at 09:36:08AM +0000, Bernd Schubert wrote:
>> On 6/4/24 11:26, Peter Zijlstra wrote:
>>> On Thu, May 30, 2024 at 04:37:29PM -0400, Josef Bacik wrote:
>>>> On Wed, May 29, 2024 at 08:00:50PM +0200, Bernd Schubert wrote:
>>>>> This is needed by fuse-over-io-uring to wake up the waiting
>>>>> application thread on the core it was submitted from.
>>>>> Avoiding core switching is actually a major factor for
>>>>> fuse performance improvements of fuse-over-io-uring.
>>>>>
>>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>>> Cc: Andrei Vagin <avagin@google.com>
>>>>
>>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>>>
>>>> Probably best to submit this as a one-off so the sched guys can take it and it's
>>>> not in the middle of a fuse patchset they may be ignoring.  Thanks,
>>>
>>> On its own its going to not be applied. Never merge an EXPORT without a
>>> user.
>>>
>>> As is, I don't have enough of the series to even see the user, so yeah,
>>> not happy :/
>>>
>>> And as hch said, this very much needs to be a GPL export.
>>
>> Sorry, accidentally done without the _GPL. What is the right way to get this merged? 
>> First merge the entire fuse-io-uring series and then add on this? I already have these 
>> optimization patches at the end of the series... The user for this is in the next patch
> 
> Yeah, but you didn't send me the next patch, did you? So I have no
> clue.. :-)
> 
> Anyway, if you could add a wee comment to __wake_up_con_current_cpu()
> along with the EXPORT_SYMBOL_GPL() that might be nice. I suppose you can
> copy paste from __wake_up() and then edit a wee bit.
> 
>> [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index c7fd3849a105..851c5fa99946 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
>>                 spin_unlock(&fc->bg_lock);
>>         } else {
>>                 /* Wake up waiter sleeping in request_wait_answer() */
>> -               wake_up(&req->waitq);
>> +               if (fuse_per_core_queue(fc))
>> +                       __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
>> +               else
>> +                       wake_up(&req->waitq);
>>         }
>>
>>         if (test_bit(FR_ASYNC, &req->flags))
> 
> Fair enough, although do we want a helper like wake_up() -- something
> like wake_up_on_current_cpu() ?

Thank you and yes sure!
I remove the patch and optimization from RFCv3, we first need to agree
on the taken approach and get that merged. Will send submit this
optimization immediately after.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 15/19] export __wake_on_current_cpu
  2024-05-29 18:00 ` [PATCH RFC v2 15/19] export __wake_on_current_cpu Bernd Schubert
  2024-05-30 20:37   ` Josef Bacik
@ 2024-05-31 13:51   ` Christoph Hellwig
  1 sibling, 0 replies; 113+ messages in thread
From: Christoph Hellwig @ 2024-05-31 13:51 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin

On Wed, May 29, 2024 at 08:00:50PM +0200, Bernd Schubert wrote:
> This is needed by fuse-over-io-uring to wake up the waiting
> application thread on the core it was submitted from.
> Avoiding core switching is actually a major factor for
> fuse performance improvements of fuse-over-io-uring.

Then maybe split that into a separate enhancement?

> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -132,6 +132,7 @@ void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode
>  {
>  	__wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
>  }
> +EXPORT_SYMBOL(__wake_up_on_current_cpu);

I'll leave it to the scheduler maintainers if they want this exported
at all and if yes with the __-prefix and without a kerneldoc comment
explaining the API, but anything this low-level should be
EXPORT_SYMBOL_GPL for sure.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (14 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 15/19] export __wake_on_current_cpu Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-30 16:44   ` Shachar Sharon
  2024-05-29 18:00 ` [PATCH RFC v2 17/19] fuse: {uring} Send async requests to qid of core + 1 Bernd Schubert
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

Most of the performance improvements
with fuse-over-io-uring for synchronous requests is the possibility
to run processing on the submitting cpu core and to also wake
the submitting process on the same core - switching between
cpu cores.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index c7fd3849a105..851c5fa99946 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
 		spin_unlock(&fc->bg_lock);
 	} else {
 		/* Wake up waiter sleeping in request_wait_answer() */
-		wake_up(&req->waitq);
+		if (fuse_per_core_queue(fc))
+			__wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
+		else
+			wake_up(&req->waitq);
 	}
 
 	if (test_bit(FR_ASYNC, &req->flags))

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu
  2024-05-29 18:00 ` [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu Bernd Schubert
@ 2024-05-30 16:44   ` Shachar Sharon
  2024-05-30 16:59     ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Shachar Sharon @ 2024-05-30 16:44 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert

On Wed, May 29, 2024 at 10:36 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
> Most of the performance improvements
> with fuse-over-io-uring for synchronous requests is the possibility
> to run processing on the submitting cpu core and to also wake
> the submitting process on the same core - switching between
> cpu cores.
>
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> ---
>  fs/fuse/dev.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index c7fd3849a105..851c5fa99946 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
>                 spin_unlock(&fc->bg_lock);
>         } else {
>                 /* Wake up waiter sleeping in request_wait_answer() */
> -               wake_up(&req->waitq);
> +               if (fuse_per_core_queue(fc))
> +                       __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
> +               else
> +                       wake_up(&req->waitq);

Would it be possible to apply this idea for regular FUSE connection?
What would happen if some (buggy or malicious) userspace FUSE server uses
sched_setaffinity(2) to run only on a subset of active CPUs?


>         }
>
>         if (test_bit(FR_ASYNC, &req->flags))
>
> --
> 2.40.1
>
>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu
  2024-05-30 16:44   ` Shachar Sharon
@ 2024-05-30 16:59     ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 16:59 UTC (permalink / raw)
  To: Shachar Sharon, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel



On 5/30/24 18:44, Shachar Sharon wrote:
> On Wed, May 29, 2024 at 10:36 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> Most of the performance improvements
>> with fuse-over-io-uring for synchronous requests is the possibility
>> to run processing on the submitting cpu core and to also wake
>> the submitting process on the same core - switching between
>> cpu cores.
>>
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>> ---
>>  fs/fuse/dev.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index c7fd3849a105..851c5fa99946 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
>>                 spin_unlock(&fc->bg_lock);
>>         } else {
>>                 /* Wake up waiter sleeping in request_wait_answer() */
>> -               wake_up(&req->waitq);
>> +               if (fuse_per_core_queue(fc))
>> +                       __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
>> +               else
>> +                       wake_up(&req->waitq);
> 
> Would it be possible to apply this idea for regular FUSE connection?

I probably should have written it in the commit message, without uring
performance is the same or slightly worse. With direct-IO reads

jobs    /dev/fuse         /dev/fuse
        (migrate off)     (migrate on)
1           2023             1652
2           3375   	     2805
4           3823             4193
8           7796             8161
16          8520             8518
24          8361             8084
32          8717             8342


(in MB/s).

I think there is no improvement as daemon threads process requests on
random cores. I.e. request processing doesn't happen on the same core
a request was submitted to.


> What would happen if some (buggy or malicious) userspace FUSE server uses
> sched_setaffinity(2) to run only on a subset of active CPUs?


The request goes to the ring, which cpu it eventually handles should not
matter. Performance will not be optimal then.
That being said, the introduction mail points out an issue with xfstest
generic/650,
which disables/enables CPUs in a loop - I need to investigate what
happens there.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 17/19] fuse: {uring} Send async requests to qid of core + 1
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (15 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 18:00 ` [PATCH RFC v2 18/19] fuse: {uring} Set a min cpu offset io-size for reads/writes Bernd Schubert
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This is another performance optimization - async requests are
better served on another core.
Async blocking requests are marked as such and treated as sync requests.

Example with mmap read
fio --size=1G --numjobs=32 --ioengine=mmap --output-format=normal,terse\
--directory=/scratch/dest/ --rw=read --bs=4K --group_reporting \
job-file.fio

jobs  /dev/fuse	uring   gain    uring   gain      gain
                               (core+1)
                      (to dev)         (to dev)  (uring same-core)
1	   124.61   306.59  2.46    255.51   2.05     0.83
2	   248.83   580.00  2.33    563.00   2.26     0.97
4	   611.47  1049.65  1.72    998.57   1.63     0.95
8	  1499.95  1848.42  1.23   1990.64   1.33     1.08
16	  2206.30  2890.24  1.31   3439.13   1.56     1.19
24	  2545.68  2704.87  1.06   4527.63   1.78     1.67
32	  2233.52  2574.37  1.15   5263.09   2.36     2.04

Interesting here is that the max gain comes with more core usage,
I had actually expected the other way around.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
 fs/fuse/dev_uring.c | 5 ++++-
 fs/fuse/file.c      | 1 +
 fs/fuse/fuse_i.h    | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index fe80e66150c3..dff210658172 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1106,6 +1106,8 @@ int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 	struct list_head *req_queue, *ent_queue;
 
 	if (ring->per_core_queue) {
+		int cpu_off;
+
 		/*
 		 * async requests are best handled on another core, the current
 		 * core can do application/page handling, while the async request
@@ -1118,7 +1120,8 @@ int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 		 * It should also not persistently switch between cores - makes
 		 * it hard for the scheduler.
 		 */
-		qid = task_cpu(current);
+		cpu_off = async ? 1 : 0;
+		qid = (task_cpu(current) + cpu_off) % ring->nr_queues;
 
 		if (unlikely(qid >= ring->nr_queues)) {
 			WARN_ONCE(1,
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b57ce4157640..6fda1e7bd7f4 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -791,6 +791,7 @@ static ssize_t fuse_async_req_send(struct fuse_mount *fm,
 
 	ia->ap.args.end = fuse_aio_complete_req;
 	ia->ap.args.may_block = io->should_dirty;
+	ia->ap.args.async_blocking = io->blocking;
 	err = fuse_simple_background(fm, &ia->ap.args, GFP_KERNEL);
 	if (err)
 		fuse_aio_complete_req(fm, &ia->ap.args, err);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fadc51a22bb9..7dcf0472df67 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -309,6 +309,7 @@ struct fuse_args {
 	bool may_block:1;
 	bool is_ext:1;
 	bool is_pinned:1;
+	bool async_blocking : 1;
 	struct fuse_in_arg in_args[3];
 	struct fuse_arg out_args[2];
 	void (*end)(struct fuse_mount *fm, struct fuse_args *args, int error);

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 18/19] fuse: {uring} Set a min cpu offset io-size for reads/writes
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (16 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 17/19] fuse: {uring} Send async requests to qid of core + 1 Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-29 18:00 ` [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends Bernd Schubert
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert

This is another optimization - async switches between cores
(as of now uses core + 1) to send IO, but using another
core also means overhead - set a minimal IO size for that.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>

---
I didn't annotate exact benchmark data, but can extract it
(but needs verification)

jobs	/dev/fuse	   uring      uring           uring
                    (same core)  (core + 1) (conditional core + 1)
1         127598      313944      261641         330445
2         254806      593925      576516         551392
4         626144     1074837     1022533        1065389
8         1535953    1892787     2038420        2087627
16        2259253    2959607     3521665        3602580
24        2606776    2769790     4636297        4670717
32        2287126    2636150     5389404        5763385

I.e. this is mostly to compensate for slight degradation
with core + 1 for small requests with few cores.
---
 fs/fuse/dev_uring.c   | 69 +++++++++++++++++++++++++++++++++++++--------------
 fs/fuse/dev_uring_i.h |  7 ++++++
 fs/fuse/file.c        | 14 ++++++++++-
 3 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index dff210658172..cdc5836edb6e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1095,18 +1095,33 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	goto out;
 }
 
-int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+static int fuse_uring_get_req_qid(struct fuse_req *req, struct fuse_ring *ring,
+				  bool async)
 {
-	struct fuse_ring *ring = fc->ring;
-	struct fuse_ring_queue *queue;
-	int qid = 0;
-	struct fuse_ring_ent *ring_ent = NULL;
-	int res;
-	bool async = test_bit(FR_BACKGROUND, &req->flags);
-	struct list_head *req_queue, *ent_queue;
+	int cpu_off = 0;
+	size_t req_size = 0;
+	int qid;
 
-	if (ring->per_core_queue) {
-		int cpu_off;
+	if (!ring->per_core_queue)
+		return 0;
+
+	/*
+	 * async has on a different core (see below) introduces context
+	 * switching - should be avoided for small requests
+	 */
+	if (async) {
+		switch (req->args->opcode) {
+		case FUSE_READ:
+			req_size = req->args->out_args[0].size;
+			break;
+		case FUSE_WRITE:
+			req_size = req->args->in_args[1].size;
+			break;
+		default:
+			/* anything else, <= 4K */
+			req_size = 0;
+			break;
+		}
 
 		/*
 		 * async requests are best handled on another core, the current
@@ -1120,17 +1135,33 @@ int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 		 * It should also not persistently switch between cores - makes
 		 * it hard for the scheduler.
 		 */
-		cpu_off = async ? 1 : 0;
-		qid = (task_cpu(current) + cpu_off) % ring->nr_queues;
-
-		if (unlikely(qid >= ring->nr_queues)) {
-			WARN_ONCE(1,
-				  "Core number (%u) exceeds nr ueues (%zu)\n",
-				  qid, ring->nr_queues);
-			qid = 0;
-		}
+		if (req_size > FUSE_URING_MIN_ASYNC_SIZE)
+			cpu_off = 1;
 	}
 
+	qid = (task_cpu(current) + cpu_off) % ring->nr_queues;
+
+	if (unlikely(qid >= ring->nr_queues)) {
+		WARN_ONCE(1, "Core number (%u) exceeds nr queues (%zu)\n",
+			  qid, ring->nr_queues);
+		qid = 0;
+	}
+
+	return qid;
+}
+
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+{
+	struct fuse_ring *ring = fc->ring;
+	struct fuse_ring_queue *queue;
+	struct fuse_ring_ent *ring_ent = NULL;
+	int res;
+	int async = test_bit(FR_BACKGROUND, &req->flags) &&
+		    !req->args->async_blocking;
+	struct list_head *ent_queue, *req_queue;
+	int qid;
+
+	qid = fuse_uring_get_req_qid(req, ring, async);
 	queue = fuse_uring_get_queue(ring, qid);
 	req_queue = async ? &queue->async_fuse_req_queue :
 			    &queue->sync_fuse_req_queue;
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 5d7e1e6e7a82..0b201becdf5a 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -11,6 +11,13 @@
 #include "linux/compiler_types.h"
 #include "linux/rbtree_types.h"
 
+/**
+ * Minimal async size with uring communication. Async is handled on a different
+ * core and that has overhead, so the async queue is only used beginning
+ * with a certain size - XXX should this be a tunable parameter?
+ */
+#define FUSE_URING_MIN_ASYNC_SIZE (16384)
+
 #if IS_ENABLED(CONFIG_FUSE_IO_URING)
 
 /* IORING_MAX_ENTRIES */
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6fda1e7bd7f4..4fc742bf0588 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -7,6 +7,7 @@
 */
 
 #include "fuse_i.h"
+#include "dev_uring_i.h"
 
 #include <linux/pagemap.h>
 #include <linux/slab.h>
@@ -955,11 +956,22 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
 {
 	struct fuse_file *ff = file->private_data;
 	struct fuse_mount *fm = ff->fm;
+	struct fuse_conn *fc = fm->fc;
 	struct fuse_args_pages *ap = &ia->ap;
 	loff_t pos = page_offset(ap->pages[0]);
 	size_t count = ap->num_pages << PAGE_SHIFT;
 	ssize_t res;
 	int err;
+	unsigned int async = fc->async_read;
+
+	/*
+	 * sync request stay longer on the same core - important with uring
+	 * Check here and not only in dev_uring.c as we have control in
+	 * fuse_simple_request if it should wake up on the same core,
+	 * avoids application core switching
+	 */
+	if (async && fuse_uring_ready(fc) && count <= FUSE_URING_MIN_ASYNC_SIZE)
+		async = 0;
 
 	ap->args.out_pages = true;
 	ap->args.page_zeroing = true;
@@ -974,7 +986,7 @@ static void fuse_send_readpages(struct fuse_io_args *ia, struct file *file)
 
 	fuse_read_args_fill(ia, file, pos, count, FUSE_READ);
 	ia->read.attr_ver = fuse_get_attr_version(fm->fc);
-	if (fm->fc->async_read) {
+	if (async) {
 		ia->ff = fuse_file_get(ff);
 		ap->args.end = fuse_readpages_end;
 		err = fuse_simple_background(fm, &ap->args, GFP_KERNEL);

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (17 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 18/19] fuse: {uring} Set a min cpu offset io-size for reads/writes Bernd Schubert
@ 2024-05-29 18:00 ` Bernd Schubert
  2024-05-31 16:24   ` Jens Axboe
  2024-05-30  7:07 ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Amir Goldstein
                   ` (3 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert
  Cc: io-uring

This is to avoid using async completion tasks
(i.e. context switches) when not needed.

Cc: io-uring@vger.kernel.org
Signed-off-by: Bernd Schubert <bschubert@ddn.com>

---
This condition should be better verified by io-uring developers.

} else if (current->io_uring) {
    /* There are two cases here
     * 1) fuse-server side uses multiple threads accessing
     *    the ring
     * 2) IO requests through io-uring
     */
    send_in_task = true;
    issue_flags = 0;
---
 fs/fuse/dev_uring.c | 57 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index cdc5836edb6e..74407e5e86fa 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -32,7 +32,8 @@
 #include <linux/io_uring/cmd.h>
 
 static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
-					    bool set_err, int error);
+					    bool set_err, int error,
+					    unsigned int issue_flags);
 
 static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
 {
@@ -682,7 +683,9 @@ static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
  * userspace will read it
  * This is comparable with classical read(/dev/fuse)
  */
-static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
+static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
+				    unsigned int issue_flags,
+				    bool send_in_task)
 {
 	struct fuse_ring *ring = ring_ent->queue->ring;
 	struct fuse_ring_req *rreq = ring_ent->rreq;
@@ -723,13 +726,16 @@ static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
 		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
 		 rreq->in.opcode, rreq->in.unique);
 
-	io_uring_cmd_complete_in_task(ring_ent->cmd,
-				      fuse_uring_async_send_to_ring);
+	if (send_in_task)
+		io_uring_cmd_complete_in_task(ring_ent->cmd,
+					      fuse_uring_async_send_to_ring);
+	else
+		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
 
 	return;
 
 err:
-	fuse_uring_req_end_and_get_next(ring_ent, true, err);
+	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
 }
 
 /*
@@ -806,7 +812,8 @@ static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
  * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
  */
 static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
-					    bool set_err, int error)
+					    bool set_err, int error,
+					    unsigned int issue_flags)
 {
 	struct fuse_req *req = ring_ent->fuse_req;
 	int has_next;
@@ -822,7 +829,7 @@ static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
 	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
 	if (has_next) {
 		/* called within uring context - use provided flags */
-		fuse_uring_send_to_ring(ring_ent);
+		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
 	}
 }
 
@@ -857,7 +864,7 @@ static void fuse_uring_commit_and_release(struct fuse_dev *fud,
 out:
 	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
 		 req->args->opcode, req->out.h.error);
-	fuse_uring_req_end_and_get_next(ring_ent, set_err, err);
+	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
 }
 
 /*
@@ -1156,10 +1163,12 @@ int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring_ent *ring_ent = NULL;
 	int res;
-	int async = test_bit(FR_BACKGROUND, &req->flags) &&
-		    !req->args->async_blocking;
+	int async_req = test_bit(FR_BACKGROUND, &req->flags);
+	int async = async_req && !req->args->async_blocking;
 	struct list_head *ent_queue, *req_queue;
 	int qid;
+	bool send_in_task;
+	unsigned int issue_flags;
 
 	qid = fuse_uring_get_req_qid(req, ring, async);
 	queue = fuse_uring_get_queue(ring, qid);
@@ -1182,11 +1191,37 @@ int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 			list_first_entry(ent_queue, struct fuse_ring_ent, list);
 		list_del(&ring_ent->list);
 		fuse_uring_add_req_to_ring_ent(ring_ent, req);
+		if (current == queue->server_task) {
+			issue_flags = queue->uring_cmd_issue_flags;
+		} else if (current->io_uring) {
+			/* There are two cases here
+			 * 1) fuse-server side uses multiple threads accessing
+			 *    the ring. We only have stored issue_flags for
+			 *    into the queue for one thread (the first one
+			 *    that submits FUSE_URING_REQ_FETCH)
+			 * 2) IO requests through io-uring, we do not have
+			 *    issue flags at all for these
+			 */
+			send_in_task = true;
+			issue_flags = 0;
+		} else {
+			if (async_req) {
+				/*
+				 * page cache writes might hold an upper
+				 * spinlockl, which conflicts with the io-uring
+				 * mutex
+				 */
+				send_in_task = true;
+				issue_flags = 0;
+			} else {
+				issue_flags = IO_URING_F_UNLOCKED;
+			}
+		}
 	}
 	spin_unlock(&queue->lock);
 
 	if (ring_ent != NULL)
-		fuse_uring_send_to_ring(ring_ent);
+		fuse_uring_send_to_ring(ring_ent, issue_flags, send_in_task);
 
 	return 0;
 

-- 
2.40.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends
  2024-05-29 18:00 ` [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends Bernd Schubert
@ 2024-05-31 16:24   ` Jens Axboe
  2024-05-31 17:36     ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-05-31 16:24 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	bernd.schubert
  Cc: io-uring

On 5/29/24 12:00 PM, Bernd Schubert wrote:
> This is to avoid using async completion tasks
> (i.e. context switches) when not needed.
> 
> Cc: io-uring@vger.kernel.org
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>

This patch is very confusing, even after having pulled the other
changes. In general, would be great if the io_uring list was CC'ed on
the whole series, it's very hard to review just a single patch, when you
don't have the full picture.

Outside of that, would be super useful to include a blurb on how you set
things up for testing, and how you run the testing. That would really
help in terms of being able to run and test it, and also to propose
changes that might make a big difference.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends
  2024-05-31 16:24   ` Jens Axboe
@ 2024-05-31 17:36     ` Bernd Schubert
  2024-05-31 19:10       ` Jens Axboe
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-31 17:36 UTC (permalink / raw)
  To: Jens Axboe, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, bernd.schubert@fastmail.fm
  Cc: io-uring@vger.kernel.org

On 5/31/24 18:24, Jens Axboe wrote:
> On 5/29/24 12:00 PM, Bernd Schubert wrote:
>> This is to avoid using async completion tasks
>> (i.e. context switches) when not needed.
>>
>> Cc: io-uring@vger.kernel.org
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> 
> This patch is very confusing, even after having pulled the other
> changes. In general, would be great if the io_uring list was CC'ed on

Hmm, let me try to explain. And yes, I definitely need to add these details 
to the commit message

Without the patch:

<sending a struct fuse_req> 

fuse_uring_queue_fuse_req
    fuse_uring_send_to_ring
        io_uring_cmd_complete_in_task

<async task runs>
    io_uring_cmd_done()

Now I would like to call io_uring_cmd_done() directly without another task
whenever possible. I didn't benchmark it, but another task is in general
against the entire concept. That is where the patch comes in

fuse_uring_queue_fuse_req() now adds the information if io_uring_cmd_done() 
shall be called directly or via io_uring_cmd_complete_in_task().

Doing it directly requires the knowledge of issue_flags - these are the
conditions in fuse_uring_queue_fuse_req.

1) (current == queue->server_task)
fuse_uring_cmd (IORING_OP_URING_CMD) received a completion for a 
previous fuse_req, after completion it fetched the next fuse_req and 
wants to send it - for 'current == queue->server_task' issue flags
got stored in struct fuse_ring_queue::uring_cmd_issue_flags

2) 'else if (current->io_uring)'

(actually documented in the code)

2.1 This might be through IORING_OP_URING_CMD as well, but then server 
side uses multiple threads to access the same ring - not nice. We only
store issue_flags into the queue for 'current == queue->server_task', so
we do not know issue_flags - sending through task is needed.

2.2 This might be an application request through the mount point, through
the io-uring interface. We do know issue flags either.
(That one was actually a surprise for me, when xfstests caught it.
Initially I had a condition to send without the extra task then lockdep
caught that.

In both cases it has to use a tasks.

My question here is if 'current->io_uring' is reliable.

3) everything else

3.1) For async requests, interesting are cached reads and writes here. At a minimum
writes a holding a spin lock and that lock conflicts with the mutex io-uring is taking - 
we need a task as well

3.2) sync - no lock being hold, it can send without the extra task.

> the whole series, it's very hard to review just a single patch, when you
> don't have the full picture.

Sorry, I will do that for the next version.

> 
> Outside of that, would be super useful to include a blurb on how you set
> things up for testing, and how you run the testing. That would really
> help in terms of being able to run and test it, and also to propose
> changes that might make a big difference.
> 

Will do in the next version. 
You basically need my libfuse uring branch
(right now commit history is not cleaned up) and follow
instructions in <libfuse>/xfstests/README.md how to run xfstests.
Missing is a slight patch for that dir to set extra daemon parameters,
like direct-io (fuse' FOPEN_DIRECT_IO) and io-uring. Will add that libfuse
during the next days.

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends
  2024-05-31 17:36     ` Bernd Schubert
@ 2024-05-31 19:10       ` Jens Axboe
  2024-06-01 16:37         ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-05-31 19:10 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, bernd.schubert@fastmail.fm
  Cc: io-uring@vger.kernel.org

On 5/31/24 11:36 AM, Bernd Schubert wrote:
> On 5/31/24 18:24, Jens Axboe wrote:
>> On 5/29/24 12:00 PM, Bernd Schubert wrote:
>>> This is to avoid using async completion tasks
>>> (i.e. context switches) when not needed.
>>>
>>> Cc: io-uring@vger.kernel.org
>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>
>> This patch is very confusing, even after having pulled the other
>> changes. In general, would be great if the io_uring list was CC'ed on
> 
> Hmm, let me try to explain. And yes, I definitely need to add these details 
> to the commit message
> 
> Without the patch:
> 
> <sending a struct fuse_req> 
> 
> fuse_uring_queue_fuse_req
>     fuse_uring_send_to_ring
>         io_uring_cmd_complete_in_task
>         
> <async task runs>
>     io_uring_cmd_done()

And this is a worthwhile optimization, you always want to complete it
line if at all possible. But none of this logic or code belongs in fuse,
it really should be provided by io_uring helpers.

I would just drop this patch for now and focus on the core
functionality. Send out a version with that, and then we'll be happy to
help this as performant as it can be. This is where the ask on "how to
reproduce your numbers" comes from - with that, it's usually trivial to
spot areas where things could be improved. And I strongly suspect that
will involve providing you with the right API to use here, and perhaps
refactoring a bit on the fuse side. Making up issue_flags is _really_
not something a user should do.

> 1) (current == queue->server_task)
> fuse_uring_cmd (IORING_OP_URING_CMD) received a completion for a 
> previous fuse_req, after completion it fetched the next fuse_req and 
> wants to send it - for 'current == queue->server_task' issue flags
> got stored in struct fuse_ring_queue::uring_cmd_issue_flags

And queue->server_task is the owner of the ring? Then yes that is safe
> 
> 2) 'else if (current->io_uring)'
> 
> (actually documented in the code)
> 
> 2.1 This might be through IORING_OP_URING_CMD as well, but then server 
> side uses multiple threads to access the same ring - not nice. We only
> store issue_flags into the queue for 'current == queue->server_task', so
> we do not know issue_flags - sending through task is needed.

What's the path leading to you not having the issue_flags?

> 2.2 This might be an application request through the mount point, through
> the io-uring interface. We do know issue flags either.
> (That one was actually a surprise for me, when xfstests caught it.
> Initially I had a condition to send without the extra task then lockdep
> caught that.

In general, if you don't know the context (eg you don't have issue_flags
passed in), you should probably assume the only way is to sanely proceed
is to have it processed by the task itself.

> 
> In both cases it has to use a tasks.
> 
> 
> My question here is if 'current->io_uring' is reliable.

Yes that will be reliable in the sense that it tells you that the
current task has (at least) one io_uring context setup. But it doesn't
tell you anything beyond that, like if it's the owner of this request.

> 3) everything else
> 
> 3.1) For async requests, interesting are cached reads and writes here. At a minimum
> writes a holding a spin lock and that lock conflicts with the mutex io-uring is taking - 
> we need a task as well
> 
> 3.2) sync - no lock being hold, it can send without the extra task.

As mentioned, let's drop this patch 19 for now. Send out what you have
with instructions on how to test it, and I'll give it a spin and see
what we can do about this.

>> Outside of that, would be super useful to include a blurb on how you set
>> things up for testing, and how you run the testing. That would really
>> help in terms of being able to run and test it, and also to propose
>> changes that might make a big difference.
>>
> 
> Will do in the next version. 
> You basically need my libfuse uring branch
> (right now commit history is not cleaned up) and follow
> instructions in <libfuse>/xfstests/README.md how to run xfstests.
> Missing is a slight patch for that dir to set extra daemon parameters,
> like direct-io (fuse' FOPEN_DIRECT_IO) and io-uring. Will add that libfuse
> during the next days.

I'll leave the xfstests to you for now, but running some perf testing
just to verify how it's being used would be useful and help improve it
for sure.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends
  2024-05-31 19:10       ` Jens Axboe
@ 2024-06-01 16:37         ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-01 16:37 UTC (permalink / raw)
  To: Jens Axboe, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org
  Cc: io-uring@vger.kernel.org



On 5/31/24 21:10, Jens Axboe wrote:
> On 5/31/24 11:36 AM, Bernd Schubert wrote:
>> On 5/31/24 18:24, Jens Axboe wrote:
>>> On 5/29/24 12:00 PM, Bernd Schubert wrote:
>>>> This is to avoid using async completion tasks
>>>> (i.e. context switches) when not needed.
>>>>
>>>> Cc: io-uring@vger.kernel.org
>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>
>>> This patch is very confusing, even after having pulled the other
>>> changes. In general, would be great if the io_uring list was CC'ed on
>>
>> Hmm, let me try to explain. And yes, I definitely need to add these details 
>> to the commit message
>>
>> Without the patch:
>>
>> <sending a struct fuse_req> 
>>
>> fuse_uring_queue_fuse_req
>>     fuse_uring_send_to_ring
>>         io_uring_cmd_complete_in_task
>>         
>> <async task runs>
>>     io_uring_cmd_done()
> 
> And this is a worthwhile optimization, you always want to complete it
> line if at all possible. But none of this logic or code belongs in fuse,
> it really should be provided by io_uring helpers.
> 
> I would just drop this patch for now and focus on the core
> functionality. Send out a version with that, and then we'll be happy to
> help this as performant as it can be. This is where the ask on "how to
> reproduce your numbers" comes from - with that, it's usually trivial to
> spot areas where things could be improved. And I strongly suspect that
> will involve providing you with the right API to use here, and perhaps
> refactoring a bit on the fuse side. Making up issue_flags is _really_
> not something a user should do.

Great that you agree, I don't like the issue_flag handling in fuse code either. 
I will also follow your suggestion to drop this patch. 


> 
>> 1) (current == queue->server_task)
>> fuse_uring_cmd (IORING_OP_URING_CMD) received a completion for a 
>> previous fuse_req, after completion it fetched the next fuse_req and 
>> wants to send it - for 'current == queue->server_task' issue flags
>> got stored in struct fuse_ring_queue::uring_cmd_issue_flags
> 
> And queue->server_task is the owner of the ring? Then yes that is safe

Yeah, it is the thread that submits SQEs - should be the owner of the ring, 
unless daemon side does something wrong (given that there are several
userspace implementation and not a single libfuse only, we need to expect
and handle implementation errors, though).

>>
>> 2) 'else if (current->io_uring)'
>>
>> (actually documented in the code)
>>
>> 2.1 This might be through IORING_OP_URING_CMD as well, but then server 
>> side uses multiple threads to access the same ring - not nice. We only
>> store issue_flags into the queue for 'current == queue->server_task', so
>> we do not know issue_flags - sending through task is needed.
> 
> What's the path leading to you not having the issue_flags?

We get issue flags here, but I want to keep changes to libfuse small and want
to avoid changing non uring related function signatures. Which is the the
why we store issue_flags for the presumed ring owner thread in the queue data
structure, but we don't have it for possible other threads then

Example:

IORING_OP_URING_CMD
   fuse_uring_cmd
       fuse_uring_commit_and_release
           fuse_uring_req_end_and_get_next --> until here issue_flags passed
               fuse_request_end -> generic fuse function,  issue_flags not passed
                   req->args->end() / fuse_writepage_end
                       fuse_simple_background
                           fuse_request_queue_background
                               fuse_request_queue_background_uring
                                   fuse_uring_queue_fuse_req
                                       fuse_uring_send_to_ring
                                           io_uring_cmd_done
                   
      
I.e. we had issue_flags up to fuse_uring_req_end_and_get_next(), but then
call into generic fuse functions and stop passing through issue_flags.
For the ring-owner we take issue flags stored by fuse_uring_cmd()
into struct fuse_ring_queue, but if daemon side uses multiple threads to
access the ring we won't have that. Well, we could allow it and store
it into an array or rb-tree, but I don't like that multiple threads access
something that is optimized to have a thread per core already.

> 
>> 2.2 This might be an application request through the mount point, through
>> the io-uring interface. We do know issue flags either.
>> (That one was actually a surprise for me, when xfstests caught it.
>> Initially I had a condition to send without the extra task then lockdep
>> caught that.
> 
> In general, if you don't know the context (eg you don't have issue_flags
> passed in), you should probably assume the only way is to sanely proceed
> is to have it processed by the task itself.
> 
>>
>> In both cases it has to use a tasks.
>>
>>
>> My question here is if 'current->io_uring' is reliable.
> 
> Yes that will be reliable in the sense that it tells you that the
> current task has (at least) one io_uring context setup. But it doesn't
> tell you anything beyond that, like if it's the owner of this request.

Yeah, you can see that it just checks for current->io_uring and then
uses a task.

> 
>> 3) everything else
>>
>> 3.1) For async requests, interesting are cached reads and writes here. At a minimum
>> writes a holding a spin lock and that lock conflicts with the mutex io-uring is taking - 
>> we need a task as well
>>
>> 3.2) sync - no lock being hold, it can send without the extra task.
> 
> As mentioned, let's drop this patch 19 for now. Send out what you have
> with instructions on how to test it, and I'll give it a spin and see
> what we can do about this.
> 
>>> Outside of that, would be super useful to include a blurb on how you set
>>> things up for testing, and how you run the testing. That would really
>>> help in terms of being able to run and test it, and also to propose
>>> changes that might make a big difference.
>>>
>>
>> Will do in the next version. 
>> You basically need my libfuse uring branch
>> (right now commit history is not cleaned up) and follow
>> instructions in <libfuse>/xfstests/README.md how to run xfstests.
>> Missing is a slight patch for that dir to set extra daemon parameters,
>> like direct-io (fuse' FOPEN_DIRECT_IO) and io-uring. Will add that libfuse
>> during the next days.
> 
> I'll leave the xfstests to you for now, but running some perf testing
> just to verify how it's being used would be useful and help improve it
> for sure.
> 

Ah you meant performance tests. I used libfuse/example/passthrough_hp from
my uring branch and then fio on top of that for reads/writes and mdtest from
the ior repo for metadata. Maybe I should upload my scripts somewhere.


Thanks,
Beernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (18 preceding siblings ...)
  2024-05-29 18:00 ` [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends Bernd Schubert
@ 2024-05-30  7:07 ` Amir Goldstein
  2024-05-30 12:09   ` Bernd Schubert
  2024-05-30 15:36 ` Kent Overstreet
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 113+ messages in thread
From: Amir Goldstein @ 2024-05-30  7:07 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel, bernd.schubert, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring

On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
> From: Bernd Schubert <bschubert@ddn.com>
>
> This adds support for uring communication between kernel and
> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> appraoch was taken from ublk.  The patches are in RFC state,
> some major changes are still to be expected.
>
> Motivation for these patches is all to increase fuse performance.
> In fuse-over-io-uring requests avoid core switching (application
> on core X, processing of fuse server on random core Y) and use
> shared memory between kernel and userspace to transfer data.
> Similar approaches have been taken by ZUFS and FUSE2, though
> not over io-uring, but through ioctl IOs
>
> https://lwn.net/Articles/756625/
> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2
>
> Avoiding cache line bouncing / numa systems was discussed
> between Amir and Miklos before and Miklos had posted
> part of the private discussion here
> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
>
> This cache line bouncing should be addressed by these patches
> as well.
>
> I had also noticed waitq wake-up latencies in fuse before
> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/
>
> This spinning approach helped with performance (>40% improvement
> for file creates), but due to random server side thread/core utilization
> spinning cannot be well controlled in /dev/fuse mode.
> With fuse-over-io-uring requests are handled on the same core
> (sync requests) or on core+1 (large async requests) and performance
> improvements are achieved without spinning.
>
> Splice/zero-copy is not supported yet, Ming Lei is working
> on io-uring support for ublk_drv, but I think so far there
> is no final agreement on the approach to be taken yet.
> Fuse-over-io-uring runs significantly faster than reads/writes
> over /dev/fuse, even with splice enabled, so missing zc
> should not be a blocking issue.
>
> The patches have been tested with multiple xfstest runs in a VM
> (32 cores) with a kernel that has several debug options
> enabled (like KASAN and MSAN).
> For some tests xfstests reports that O_DIRECT is not supported,
> I need to investigate that. Interesting part is that exactly
> these tests fail in plain /dev/fuse posix mode. I had to disabled
> generic/650, which is enabling/disabling cpu cores - given ring
> threads are bound to cores issues with that are no totally
> unexpected, but then there (scheduler) kernel messages that
> core binding for these threads is removed - this needs
> to be further investigates.
> Nice effect in io-uring mode is that tests run faster (like
> generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still
> slow as this is with ASAN/leak-detection/etc.
>
> The corresponding libfuse patches are on my uring branch,
> but need cleanup for submission - will happen during the next
> days.
> https://github.com/bsbernd/libfuse/tree/uring
>
> If it should make review easier, patches posted here are on
> this branch
> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2
>
> TODO list for next RFC versions
> - Let the ring configure ioctl return information, like mmap/queue-buf size
> - Request kernel side address and len for a request - avoid calculation in userspace?
> - multiple IO sizes per queue (avoiding a calculation in userspace is probably even
>   more important)
> - FUSE_INTERRUPT handling?
> - Logging (adds fields in the ioctl and also ring-request),
>   any mismatch between client and server is currently very hard to understand
>   through error codes
>
> Future work
> - notifications, probably on their own ring
> - zero copy
>
> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
> which, resulted in some tuning patches (at the end of the
> patch series).
>
> Some benchmark results
> ======================
>
> System used for the benchmark is a 32 core (HyperThreading enabled)
> Xeon E5-2650 system. I don't have local disks attached that could do
> >5GB/s IOs, for paged and dio results a patched version of passthrough-hp
> was used that bypasses final reads/writes.
>
> paged reads
> -----------
>             128K IO size                      1024K IO size
> jobs   /dev/fuse     uring    gain     /dev/fuse    uring   gain
>  1        1117        1921    1.72        1902       1942   1.02
>  2        2502        3527    1.41        3066       3260   1.06
>  4        5052        6125    1.21        5994       6097   1.02
>  8        6273       10855    1.73        7101      10491   1.48
> 16        6373       11320    1.78        7660      11419   1.49
> 24        6111        9015    1.48        7600       9029   1.19
> 32        5725        7968    1.39        6986       7961   1.14
>
> dio reads (1024K)
> -----------------
>
> jobs   /dev/fuse  uring   gain
> 1           2023   3998   2.42
> 2           3375   7950   2.83
> 4           3823   15022  3.58
> 8           7796   22591  2.77
> 16          8520   27864  3.27
> 24          8361   20617  2.55
> 32          8717   12971  1.55
>
> mmap reads (4K)
> ---------------
> (sequential, I probably should have made it random, sequential exposes
> a rather interesting/weird 'optimized' memcpy issue - sequential becomes
> reversed order 4K read)
> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/
>
> jobs  /dev/fuse     uring    gain
> 1       130          323     2.49
> 2       219          538     2.46
> 4       503         1040     2.07
> 8       1472        2039     1.38
> 16      2191        3518     1.61
> 24      2453        4561     1.86
> 32      2178        5628     2.58
>
> (Results on request, setting MAP_HUGETLB much improves performance
> for both, io-uring mode then has a slight advantage only.)
>
> creates/s
> ----------
> threads /dev/fuse     uring   gain
> 1          3944       10121   2.57
> 2          8580       24524   2.86
> 4         16628       44426   2.67
> 8         46746       56716   1.21
> 16        79740      102966   1.29
> 20        80284      119502   1.49
>
> (the gain drop with >=8 cores needs to be investigated)

Hi Bernd,

Those are impressive results!

When approaching the FUSE uring feature from marketing POV,
I think that putting the emphasis on metadata operations is the
best approach.

Not the dio reads are not important (I know that is part of your use case),
but I imagine there are a lot more people out there waiting for
improvement in metadata operations overhead.

To me it helps to know what the current main pain points are
for people using FUSE filesystems wrt performance.

Although it may not be uptodate, the most comprehensive
study about FUSE performance overhead is this FAST17 paper:

https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

In this paper, table 3 summarizes the different overheads observed
per workload. According to this table, the workloads that degrade
performance worse on an optimized passthrough fs over SSD are:
- many file creates
- many file deletes
- many small file reads
In all these workloads, it was millions of files over many directories.
The highest performance regression reported was -83% on many
small file creations.

The moral of this long story is that it would be nice to know
what performance improvement FUSE uring can aspire to.
This is especially relevant for people that would be interested
in combining the benefits of FUSE passthrough (for data) and
FUSE uring (for metadata).

What did passthrough_hp do in your patched version with creates?
Did it actually create the files?
In how many directories?
Maybe the directory inode lock impeded performance improvement
with >=8 threads?

>
> Remaining TODO list for RFCv3:
> --------------------------------
> 1) Let the ring configure ioctl return information,
> like mmap/queue-buf size
>
> Right now libfuse and kernel have lots of duplicated setup code
> and any kind of pointer/offset mismatch results in a non-working
> ring that is hard to debug - probably better when the kernel does
> the calculations and returns that to server side
>
> 2) In combination with 1, ring requests should retrieve their
> userspace address and length from kernel side instead of
> calculating it through the mmaped queue buffer on their own.
> (Introduction of FUSE_URING_BUF_ADDR_FETCH)
>
> 3) Add log buffer into the ioctl and ring-request
>
> This is to provide better error messages (instead of just
> errno)
>
> 3) Multiple IO sizes per queue
>
> Small IOs and metadata requests do not need large buffer sizes,
> we need multiple IO sizes per queue.
>
> 4) FUSE_INTERRUPT handling
>
> These are not handled yet, kernel side is probably not difficult
> anymore as ring entries take fuse requests through lists.
>
> Long term TODO:
> --------------
> Notifications through io-uring, maybe with a separated ring,
> but I'm not sure yet.

Is that going to improve performance in any real life workload?

Thanks,
Amir.

>
> Changes since RFCv1
> -------------------
> - No need to hold the task of the server side anymore.  Also no
>   ioctls/threads waiting for shutdown anymore.  Shutdown now more
>   works like the traditional fuse way.
> - Each queue clones the fuse and device release makes an  exception
>   for io-uring. Reason is that queued IORING_OP_URING_CMD
>   (through .uring_cmd) prevent a device release. I.e. a killed
>   server side typically triggers fuse_abort_conn(). This was the
>   reason for the async stop-monitor in v1 and reference on the daemon
>   task. However it was very racy and annotated immediately by Miklos.
> - In v1 the offset parameter to mmap was identifying the QID, in v2
>   server side is expected to send mmap from a core bound ring thread
>   in numa mode and numa node is taken through the core of that thread.
>   Kernel side of the mmap buffer is stored in an rbtree and assigned
>   to the right qid through an additional queue ioctl.
> - Release of IORING_OP_URING_CMD is done through lists now, instead
>   of iterating over the entire array of queues/entries and does not
>   depend on the entry state anymore (a bit of the state is still left
>   for sanity check).
> - Finding free ring queue entries is done through lists and not through
>   a bitmap anymore
> - Many other code changes and bug fixes
> - Performance tunings
>
> ---
> Bernd Schubert (19):
>       fuse: rename to fuse_dev_end_requests and make non-static
>       fuse: Move fuse_get_dev to header file
>       fuse: Move request bits
>       fuse: Add fuse-io-uring design documentation
>       fuse: Add a uring config ioctl
>       Add a vmalloc_node_user function
>       fuse uring: Add an mmap method
>       fuse: Add the queue configuration ioctl
>       fuse: {uring} Add a dev_release exception for fuse-over-io-uring
>       fuse: {uring} Handle SQEs - register commands
>       fuse: Add support to copy from/to the ring buffer
>       fuse: {uring} Add uring sqe commit and fetch support
>       fuse: {uring} Handle uring shutdown
>       fuse: {uring} Allow to queue to the ring
>       export __wake_on_current_cpu
>       fuse: {uring} Wake requests on the the current cpu
>       fuse: {uring} Send async requests to qid of core + 1
>       fuse: {uring} Set a min cpu offset io-size for reads/writes
>       fuse: {uring} Optimize async sends
>
>  Documentation/filesystems/fuse-io-uring.rst |  167 ++++
>  fs/fuse/Kconfig                             |   12 +
>  fs/fuse/Makefile                            |    1 +
>  fs/fuse/dev.c                               |  310 +++++--
>  fs/fuse/dev_uring.c                         | 1232 +++++++++++++++++++++++++++
>  fs/fuse/dev_uring_i.h                       |  395 +++++++++
>  fs/fuse/file.c                              |   15 +-
>  fs/fuse/fuse_dev_i.h                        |   67 ++
>  fs/fuse/fuse_i.h                            |    9 +
>  fs/fuse/inode.c                             |    3 +
>  include/linux/vmalloc.h                     |    1 +
>  include/uapi/linux/fuse.h                   |  135 +++
>  kernel/sched/wait.c                         |    1 +
>  mm/nommu.c                                  |    6 +
>  mm/vmalloc.c                                |   41 +-
>  15 files changed, 2330 insertions(+), 65 deletions(-)
> ---
> base-commit: dd5a440a31fae6e459c0d6271dddd62825505361
> change-id: 20240529-fuse-uring-for-6-9-rfc2-out-f0a009005fdf
>
> Best regards,
> --
> Bernd Schubert <bschubert@ddn.com>
>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30  7:07 ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Amir Goldstein
@ 2024-05-30 12:09   ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 12:09 UTC (permalink / raw)
  To: Amir Goldstein, Bernd Schubert
  Cc: Miklos Szeredi, linux-fsdevel, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring, Josef Bacik



On 5/30/24 09:07, Amir Goldstein wrote:
> On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>> appraoch was taken from ublk.  The patches are in RFC state,
>> some major changes are still to be expected.
>>
>> Motivation for these patches is all to increase fuse performance.
>> In fuse-over-io-uring requests avoid core switching (application
>> on core X, processing of fuse server on random core Y) and use
>> shared memory between kernel and userspace to transfer data.
>> Similar approaches have been taken by ZUFS and FUSE2, though
>> not over io-uring, but through ioctl IOs
>>
>> https://lwn.net/Articles/756625/
>> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2
>>
>> Avoiding cache line bouncing / numa systems was discussed
>> between Amir and Miklos before and Miklos had posted
>> part of the private discussion here
>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
>>
>> This cache line bouncing should be addressed by these patches
>> as well.
>>
>> I had also noticed waitq wake-up latencies in fuse before
>> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/
>>
>> This spinning approach helped with performance (>40% improvement
>> for file creates), but due to random server side thread/core utilization
>> spinning cannot be well controlled in /dev/fuse mode.
>> With fuse-over-io-uring requests are handled on the same core
>> (sync requests) or on core+1 (large async requests) and performance
>> improvements are achieved without spinning.
>>
>> Splice/zero-copy is not supported yet, Ming Lei is working
>> on io-uring support for ublk_drv, but I think so far there
>> is no final agreement on the approach to be taken yet.
>> Fuse-over-io-uring runs significantly faster than reads/writes
>> over /dev/fuse, even with splice enabled, so missing zc
>> should not be a blocking issue.
>>
>> The patches have been tested with multiple xfstest runs in a VM
>> (32 cores) with a kernel that has several debug options
>> enabled (like KASAN and MSAN).
>> For some tests xfstests reports that O_DIRECT is not supported,
>> I need to investigate that. Interesting part is that exactly
>> these tests fail in plain /dev/fuse posix mode. I had to disabled
>> generic/650, which is enabling/disabling cpu cores - given ring
>> threads are bound to cores issues with that are no totally
>> unexpected, but then there (scheduler) kernel messages that
>> core binding for these threads is removed - this needs
>> to be further investigates.
>> Nice effect in io-uring mode is that tests run faster (like
>> generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still
>> slow as this is with ASAN/leak-detection/etc.
>>
>> The corresponding libfuse patches are on my uring branch,
>> but need cleanup for submission - will happen during the next
>> days.
>> https://github.com/bsbernd/libfuse/tree/uring
>>
>> If it should make review easier, patches posted here are on
>> this branch
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2
>>
>> TODO list for next RFC versions
>> - Let the ring configure ioctl return information, like mmap/queue-buf size
>> - Request kernel side address and len for a request - avoid calculation in userspace?
>> - multiple IO sizes per queue (avoiding a calculation in userspace is probably even
>>   more important)
>> - FUSE_INTERRUPT handling?
>> - Logging (adds fields in the ioctl and also ring-request),
>>   any mismatch between client and server is currently very hard to understand
>>   through error codes
>>
>> Future work
>> - notifications, probably on their own ring
>> - zero copy
>>
>> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
>> which, resulted in some tuning patches (at the end of the
>> patch series).
>>
>> Some benchmark results
>> ======================
>>
>> System used for the benchmark is a 32 core (HyperThreading enabled)
>> Xeon E5-2650 system. I don't have local disks attached that could do
>>> 5GB/s IOs, for paged and dio results a patched version of passthrough-hp
>> was used that bypasses final reads/writes.
>>
>> paged reads
>> -----------
>>             128K IO size                      1024K IO size
>> jobs   /dev/fuse     uring    gain     /dev/fuse    uring   gain
>>  1        1117        1921    1.72        1902       1942   1.02
>>  2        2502        3527    1.41        3066       3260   1.06
>>  4        5052        6125    1.21        5994       6097   1.02
>>  8        6273       10855    1.73        7101      10491   1.48
>> 16        6373       11320    1.78        7660      11419   1.49
>> 24        6111        9015    1.48        7600       9029   1.19
>> 32        5725        7968    1.39        6986       7961   1.14
>>
>> dio reads (1024K)
>> -----------------
>>
>> jobs   /dev/fuse  uring   gain
>> 1           2023   3998   2.42
>> 2           3375   7950   2.83
>> 4           3823   15022  3.58
>> 8           7796   22591  2.77
>> 16          8520   27864  3.27
>> 24          8361   20617  2.55
>> 32          8717   12971  1.55
>>
>> mmap reads (4K)
>> ---------------
>> (sequential, I probably should have made it random, sequential exposes
>> a rather interesting/weird 'optimized' memcpy issue - sequential becomes
>> reversed order 4K read)
>> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/
>>
>> jobs  /dev/fuse     uring    gain
>> 1       130          323     2.49
>> 2       219          538     2.46
>> 4       503         1040     2.07
>> 8       1472        2039     1.38
>> 16      2191        3518     1.61
>> 24      2453        4561     1.86
>> 32      2178        5628     2.58
>>
>> (Results on request, setting MAP_HUGETLB much improves performance
>> for both, io-uring mode then has a slight advantage only.)
>>
>> creates/s
>> ----------
>> threads /dev/fuse     uring   gain
>> 1          3944       10121   2.57
>> 2          8580       24524   2.86
>> 4         16628       44426   2.67
>> 8         46746       56716   1.21
>> 16        79740      102966   1.29
>> 20        80284      119502   1.49
>>
>> (the gain drop with >=8 cores needs to be investigated)
> 

Hi Amir,

> Hi Bernd,
> 
> Those are impressive results!

thank you!


> 
> When approaching the FUSE uring feature from marketing POV,
> I think that putting the emphasis on metadata operations is the
> best approach.

I can add in some more results and probably need to redo at least the
metadata tests. I have all the results in google docs and in plain text
files, just a bit cumbersome maybe also spam to post all of it here.

> 
> Not the dio reads are not important (I know that is part of your use case),
> but I imagine there are a lot more people out there waiting for
> improvement in metadata operations overhead.

I think the DIO use case is declining. My fuse work is now related to
the DDN Infina project, which has a DLM - this will all go via cache and
notifications (into from/to client/server) I need to start to work on
that asap... I'm also not too happy yet about cached writes/reads - need
to find time to investigate where the limit is.

> 
> To me it helps to know what the current main pain points are
> for people using FUSE filesystems wrt performance.
> 
> Although it may not be uptodate, the most comprehensive
> study about FUSE performance overhead is this FAST17 paper:
> 
> https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

Yeah, I had seen it. Just checking again, interesting is actually their
instrumentation branch

https://github.com/sbu-fsl/fuse-kernel-instrumentation

This should be very useful upstream, in combination with Josefs fuse
tracepoints (btw, thanks for the tracepoint patch Josef! I'm going to
look at it and test it tomorrow).


> 
> In this paper, table 3 summarizes the different overheads observed
> per workload. According to this table, the workloads that degrade
> performance worse on an optimized passthrough fs over SSD are:
> - many file creates
> - many file deletes
> - many small file reads
> In all these workloads, it was millions of files over many directories.
> The highest performance regression reported was -83% on many
> small file creations.
> 
> The moral of this long story is that it would be nice to know
> what performance improvement FUSE uring can aspire to.
> This is especially relevant for people that would be interested
> in combining the benefits of FUSE passthrough (for data) and
> FUSE uring (for metadata).

As written above, I can add a few more data. But if possible I wouldn't
like to concentrate on benchmarking - this can be super time consuming
and doesn't help unless one investigates what is actually limiting
performance. Right now we see that io-uring helps, fixing the other
limits is then the next step, imho.

> 
> What did passthrough_hp do in your patched version with creates?
> Did it actually create the files?

Yeah, it creates files, I think on xfs (or ext4). I had tried tmpfs
first, but it had issues with seekdir/telldir until recently - will
switch back to tmpfs for next tests.

> In how many directories?
> Maybe the directory inode lock impeded performance improvement
> with >=8 threads?

I don't think the directory inode lock is an issue - this should be one
(or more directories) per thread

Basically

/usr/lib64/openmpi/bin/mpirun \
            --mca btl self -n $i --oversubscribe \
            ./mdtest -F -n40000 -i1 \
                -d /scratch/dest -u -b2 | tee ${fname}-$i.out


(mdtest is really convenient for meta operations, although requires mpi,
recent versions are here (the initial LLNL project merged with ior).

https://github.com/hpc/ior

"-F"
Perform test on files only (no directories).

"-n" number_of_items
Every process will creat/stat/remove # directories and files

"-i" iterations
The number of iterations the test will run

"-u"
Create a unique working directory for each task

"-b" branching_factor
The branching factor of the hierarchical directory structure [default: 1].


(The older LLNL repo has a better mdtest README
https://github.com/LLNL/mdtest)


Also, regarding metadata, I definitely need to find time resume work on
atomic-open. Besides performance, there is another use case
https://github.com/libfuse/libfuse/issues/945. Sweet Tea Dorminy / Josef
also seem to need that.

> 
>>
>> Remaining TODO list for RFCv3:
>> --------------------------------
>> 1) Let the ring configure ioctl return information,
>> like mmap/queue-buf size
>>
>> Right now libfuse and kernel have lots of duplicated setup code
>> and any kind of pointer/offset mismatch results in a non-working
>> ring that is hard to debug - probably better when the kernel does
>> the calculations and returns that to server side
>>
>> 2) In combination with 1, ring requests should retrieve their
>> userspace address and length from kernel side instead of
>> calculating it through the mmaped queue buffer on their own.
>> (Introduction of FUSE_URING_BUF_ADDR_FETCH)
>>
>> 3) Add log buffer into the ioctl and ring-request
>>
>> This is to provide better error messages (instead of just
>> errno)
>>
>> 3) Multiple IO sizes per queue
>>
>> Small IOs and metadata requests do not need large buffer sizes,
>> we need multiple IO sizes per queue.
>>
>> 4) FUSE_INTERRUPT handling
>>
>> These are not handled yet, kernel side is probably not difficult
>> anymore as ring entries take fuse requests through lists.
>>
>> Long term TODO:
>> --------------
>> Notifications through io-uring, maybe with a separated ring,
>> but I'm not sure yet.
> 
> Is that going to improve performance in any real life workload?
> 


I'm rather sure that we at DDN will need it for our project with the
DLM. I have other priorities for now - once it comes up, adding
notifications over uring shouldn't be difficult.



Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (19 preceding siblings ...)
  2024-05-30  7:07 ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Amir Goldstein
@ 2024-05-30 15:36 ` Kent Overstreet
  2024-05-30 16:02   ` Bernd Schubert
  2024-05-30 20:47 ` Josef Bacik
  2024-06-11  8:20 ` Miklos Szeredi
  22 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 15:36 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring

On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
> From: Bernd Schubert <bschubert@ddn.com>
> 
> This adds support for uring communication between kernel and
> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> appraoch was taken from ublk.  The patches are in RFC state,
> some major changes are still to be expected.
> 
> Motivation for these patches is all to increase fuse performance.
> In fuse-over-io-uring requests avoid core switching (application
> on core X, processing of fuse server on random core Y) and use
> shared memory between kernel and userspace to transfer data.
> Similar approaches have been taken by ZUFS and FUSE2, though
> not over io-uring, but through ioctl IOs

What specifically is it about io-uring that's helpful here? Besides the
ringbuffer?

So the original mess was that because we didn't have a generic
ringbuffer, we had aio, tracing, and god knows what else all
implementing their own special purpose ringbuffers (all with weird
quirks of debatable or no usefulness).

It seems to me that what fuse (and a lot of other things want) is just a
clean simple easy to use generic ringbuffer for sending what-have-you
back and forth between the kernel and userspace - in this case RPCs from
the kernel to userspace.

But instead, the solution seems to be just toss everything into a new
giant subsystem?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 15:36 ` Kent Overstreet
@ 2024-05-30 16:02   ` Bernd Schubert
  2024-05-30 16:10     ` Kent Overstreet
  2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
  0 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 16:02 UTC (permalink / raw)
  To: Kent Overstreet, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Jens Axboe, Ming Lei, Pavel Begunkov, Josef Bacik



On 5/30/24 17:36, Kent Overstreet wrote:
> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>> appraoch was taken from ublk.  The patches are in RFC state,
>> some major changes are still to be expected.
>>
>> Motivation for these patches is all to increase fuse performance.
>> In fuse-over-io-uring requests avoid core switching (application
>> on core X, processing of fuse server on random core Y) and use
>> shared memory between kernel and userspace to transfer data.
>> Similar approaches have been taken by ZUFS and FUSE2, though
>> not over io-uring, but through ioctl IOs
> 
> What specifically is it about io-uring that's helpful here? Besides the
> ringbuffer?
> 
> So the original mess was that because we didn't have a generic
> ringbuffer, we had aio, tracing, and god knows what else all
> implementing their own special purpose ringbuffers (all with weird
> quirks of debatable or no usefulness).
> 
> It seems to me that what fuse (and a lot of other things want) is just a
> clean simple easy to use generic ringbuffer for sending what-have-you
> back and forth between the kernel and userspace - in this case RPCs from
> the kernel to userspace.
> 
> But instead, the solution seems to be just toss everything into a new
> giant subsystem?


Hmm, initially I had thought about writing my own ring buffer, but then 
io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
need? From interface point of view, io-uring seems easy to use here, 
has everything we need and kind of the same thing is used for ublk - 
what speaks against io-uring? And what other suggestion do you have?

I guess the same concern would also apply to ublk_drv. 

Well, decoupling from io-uring might help to get for zero-copy, as there
doesn't seem to be an agreement with Mings approaches (sorry I'm only
silently following for now).

From our side, a customer has pointed out security concerns for io-uring. 
My thinking so far was to implemented the required io-uring pieces into 
an module and access it with ioctls... Which would also allow to
backport it to RHEL8/RHEL9.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:02   ` Bernd Schubert
@ 2024-05-30 16:10     ` Kent Overstreet
  2024-05-30 16:17       ` Bernd Schubert
  2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
  1 sibling, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 16:10 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring, Jens Axboe, Ming Lei, Pavel Begunkov,
	Josef Bacik

On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote:
> Hmm, initially I had thought about writing my own ring buffer, but then 
> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> need? From interface point of view, io-uring seems easy to use here, 
> has everything we need and kind of the same thing is used for ublk - 
> what speaks against io-uring? And what other suggestion do you have?
> 
> I guess the same concern would also apply to ublk_drv. 
> 
> Well, decoupling from io-uring might help to get for zero-copy, as there
> doesn't seem to be an agreement with Mings approaches (sorry I'm only
> silently following for now).
> 
> From our side, a customer has pointed out security concerns for io-uring. 
> My thinking so far was to implemented the required io-uring pieces into 
> an module and access it with ioctls... Which would also allow to
> backport it to RHEL8/RHEL9.

Well, I've been starting to sketch out a ringbuffer() syscall, which
would work on any (supported) file descriptor and give you a ringbuffer
for reading or writing (or call it twice for both).

That seems to be what fuse really wants, no? You're already using a file
descriptor and your own RPC format, you just want a faster
communications channel.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:10     ` Kent Overstreet
@ 2024-05-30 16:17       ` Bernd Schubert
  2024-05-30 17:30         ` Kent Overstreet
                           ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 16:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring, Jens Axboe, Ming Lei, Pavel Begunkov,
	Josef Bacik



On 5/30/24 18:10, Kent Overstreet wrote:
> On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote:
>> Hmm, initially I had thought about writing my own ring buffer, but then 
>> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
>> need? From interface point of view, io-uring seems easy to use here, 
>> has everything we need and kind of the same thing is used for ublk - 
>> what speaks against io-uring? And what other suggestion do you have?
>>
>> I guess the same concern would also apply to ublk_drv. 
>>
>> Well, decoupling from io-uring might help to get for zero-copy, as there
>> doesn't seem to be an agreement with Mings approaches (sorry I'm only
>> silently following for now).
>>
>> From our side, a customer has pointed out security concerns for io-uring. 
>> My thinking so far was to implemented the required io-uring pieces into 
>> an module and access it with ioctls... Which would also allow to
>> backport it to RHEL8/RHEL9.
> 
> Well, I've been starting to sketch out a ringbuffer() syscall, which
> would work on any (supported) file descriptor and give you a ringbuffer
> for reading or writing (or call it twice for both).
> 
> That seems to be what fuse really wants, no? You're already using a file
> descriptor and your own RPC format, you just want a faster
> communications channel.

Fine with me, if you have something better/simpler with less security
concerns - why not. We just need a community agreement on that.

Do you have something I could look at?

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:17       ` Bernd Schubert
@ 2024-05-30 17:30         ` Kent Overstreet
  2024-05-30 19:09         ` Josef Bacik
  2024-05-31  3:53         ` [PATCH] fs: sys_ringbuffer() (WIP) Kent Overstreet
  2 siblings, 0 replies; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 17:30 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring, Jens Axboe, Ming Lei, Pavel Begunkov,
	Josef Bacik

On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote:
> 
> 
> On 5/30/24 18:10, Kent Overstreet wrote:
> > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote:
> >> Hmm, initially I had thought about writing my own ring buffer, but then 
> >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> >> need? From interface point of view, io-uring seems easy to use here, 
> >> has everything we need and kind of the same thing is used for ublk - 
> >> what speaks against io-uring? And what other suggestion do you have?
> >>
> >> I guess the same concern would also apply to ublk_drv. 
> >>
> >> Well, decoupling from io-uring might help to get for zero-copy, as there
> >> doesn't seem to be an agreement with Mings approaches (sorry I'm only
> >> silently following for now).
> >>
> >> From our side, a customer has pointed out security concerns for io-uring. 
> >> My thinking so far was to implemented the required io-uring pieces into 
> >> an module and access it with ioctls... Which would also allow to
> >> backport it to RHEL8/RHEL9.
> > 
> > Well, I've been starting to sketch out a ringbuffer() syscall, which
> > would work on any (supported) file descriptor and give you a ringbuffer
> > for reading or writing (or call it twice for both).
> > 
> > That seems to be what fuse really wants, no? You're already using a file
> > descriptor and your own RPC format, you just want a faster
> > communications channel.
> 
> Fine with me, if you have something better/simpler with less security
> concerns - why not. We just need a community agreement on that.
> 
> Do you have something I could look at?

Like I said it's at the early sketch stage, I haven't written any code
yet. But I'm envisioning something very simple - just a syscall that
gives you a mapped buffer of a specified size with head and tail pointers.

But this has been kicking around for awhile, so if you're interested I
could probably have something for you to try out in the next few days.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:17       ` Bernd Schubert
  2024-05-30 17:30         ` Kent Overstreet
@ 2024-05-30 19:09         ` Josef Bacik
  2024-05-30 20:05           ` Kent Overstreet
  2024-05-31  3:53         ` [PATCH] fs: sys_ringbuffer() (WIP) Kent Overstreet
  2 siblings, 1 reply; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 19:09 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Kent Overstreet, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Jens Axboe, Ming Lei,
	Pavel Begunkov

On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote:
> 
> 
> On 5/30/24 18:10, Kent Overstreet wrote:
> > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote:
> >> Hmm, initially I had thought about writing my own ring buffer, but then 
> >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> >> need? From interface point of view, io-uring seems easy to use here, 
> >> has everything we need and kind of the same thing is used for ublk - 
> >> what speaks against io-uring? And what other suggestion do you have?
> >>
> >> I guess the same concern would also apply to ublk_drv. 
> >>
> >> Well, decoupling from io-uring might help to get for zero-copy, as there
> >> doesn't seem to be an agreement with Mings approaches (sorry I'm only
> >> silently following for now).
> >>
> >> From our side, a customer has pointed out security concerns for io-uring. 
> >> My thinking so far was to implemented the required io-uring pieces into 
> >> an module and access it with ioctls... Which would also allow to
> >> backport it to RHEL8/RHEL9.
> > 
> > Well, I've been starting to sketch out a ringbuffer() syscall, which
> > would work on any (supported) file descriptor and give you a ringbuffer
> > for reading or writing (or call it twice for both).
> > 
> > That seems to be what fuse really wants, no? You're already using a file
> > descriptor and your own RPC format, you just want a faster
> > communications channel.
> 
> Fine with me, if you have something better/simpler with less security
> concerns - why not. We just need a community agreement on that.
> 
> Do you have something I could look at?

FWIW I have no strong feelings between using iouring vs any other ringbuffer
mechanism we come up with in the future.

That being said iouring is here now, is proven to work, and these are good
performance improvements.  If in the future something else comes along that
gives us better performance then absolutely we should explore adding that
functionality.  But this solves the problem today, and I need the problem solved
yesterday, so continuing with this patchset is very much a worthwhile
investment, one that I'm very happy you're tackling Bernd instead of me ;).
Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 19:09         ` Josef Bacik
@ 2024-05-30 20:05           ` Kent Overstreet
  0 siblings, 0 replies; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 20:05 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Jens Axboe, Ming Lei,
	Pavel Begunkov

On Thu, May 30, 2024 at 03:09:41PM -0400, Josef Bacik wrote:
> On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote:
> > 
> > 
> > On 5/30/24 18:10, Kent Overstreet wrote:
> > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote:
> > >> Hmm, initially I had thought about writing my own ring buffer, but then 
> > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> > >> need? From interface point of view, io-uring seems easy to use here, 
> > >> has everything we need and kind of the same thing is used for ublk - 
> > >> what speaks against io-uring? And what other suggestion do you have?
> > >>
> > >> I guess the same concern would also apply to ublk_drv. 
> > >>
> > >> Well, decoupling from io-uring might help to get for zero-copy, as there
> > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only
> > >> silently following for now).
> > >>
> > >> From our side, a customer has pointed out security concerns for io-uring. 
> > >> My thinking so far was to implemented the required io-uring pieces into 
> > >> an module and access it with ioctls... Which would also allow to
> > >> backport it to RHEL8/RHEL9.
> > > 
> > > Well, I've been starting to sketch out a ringbuffer() syscall, which
> > > would work on any (supported) file descriptor and give you a ringbuffer
> > > for reading or writing (or call it twice for both).
> > > 
> > > That seems to be what fuse really wants, no? You're already using a file
> > > descriptor and your own RPC format, you just want a faster
> > > communications channel.
> > 
> > Fine with me, if you have something better/simpler with less security
> > concerns - why not. We just need a community agreement on that.
> > 
> > Do you have something I could look at?
> 
> FWIW I have no strong feelings between using iouring vs any other ringbuffer
> mechanism we come up with in the future.
> 
> That being said iouring is here now, is proven to work, and these are good
> performance improvements.  If in the future something else comes along that
> gives us better performance then absolutely we should explore adding that
> functionality.  But this solves the problem today, and I need the problem solved
> yesterday, so continuing with this patchset is very much a worthwhile
> investment, one that I'm very happy you're tackling Bernd instead of me ;).
> Thanks,

I suspect a ringbuffer syscall will actually be simpler than switching
to io_uring. Let me see if I can cook something up quickly - there's no
rocket science here and this all stuff we've done before so it shouldn't
take too long (famous last works...)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH] fs: sys_ringbuffer() (WIP)
  2024-05-30 16:17       ` Bernd Schubert
  2024-05-30 17:30         ` Kent Overstreet
  2024-05-30 19:09         ` Josef Bacik
@ 2024-05-31  3:53         ` Kent Overstreet
  2024-05-31 13:11           ` kernel test robot
  2024-05-31 15:49           ` kernel test robot
  2 siblings, 2 replies; 113+ messages in thread
From: Kent Overstreet @ 2024-05-31  3:53 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring, Jens Axboe, Ming Lei, Pavel Begunkov,
	Josef Bacik

On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote:
> 
> 
> On 5/30/24 18:10, Kent Overstreet wrote:
> > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote:
> >> Hmm, initially I had thought about writing my own ring buffer, but then 
> >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> >> need? From interface point of view, io-uring seems easy to use here, 
> >> has everything we need and kind of the same thing is used for ublk - 
> >> what speaks against io-uring? And what other suggestion do you have?
> >>
> >> I guess the same concern would also apply to ublk_drv. 
> >>
> >> Well, decoupling from io-uring might help to get for zero-copy, as there
> >> doesn't seem to be an agreement with Mings approaches (sorry I'm only
> >> silently following for now).
> >>
> >> From our side, a customer has pointed out security concerns for io-uring. 
> >> My thinking so far was to implemented the required io-uring pieces into 
> >> an module and access it with ioctls... Which would also allow to
> >> backport it to RHEL8/RHEL9.
> > 
> > Well, I've been starting to sketch out a ringbuffer() syscall, which
> > would work on any (supported) file descriptor and give you a ringbuffer
> > for reading or writing (or call it twice for both).
> > 
> > That seems to be what fuse really wants, no? You're already using a file
> > descriptor and your own RPC format, you just want a faster
> > communications channel.
> 
> Fine with me, if you have something better/simpler with less security
> concerns - why not. We just need a community agreement on that.
> 
> Do you have something I could look at?

Here you go. Not tested yet, but all the essentials should be there.

there's something else _really_ slick we should be able to do with this:
add support to pipes, and then - if both ends of a pipe ask for a
ringbuffer, map them the _same_ ringbuffer, zero copy and completely
bypassing the kernel and neither end has to know if the other end
supports ringbuffers or just normal pipes.

-- >8 --
Add new syscalls for generic ringbuffers that can be attached to
arbitrary (supporting) file descriptors.

A ringbuffer consists of:
 - a single page for head/tail pointers, size/mask, and other ancilliary
   metadata, described by 'struct ringbuffer_ptrs'
 - a data buffer, consisting of one or more pages mapped at
   'ringbuffer_ptrs.data_offset' above the address of 'ringbuffer_ptrs'

The data buffer is always a power of two size. Head and tail pointers
are u32 byte offsets, and they are stored unmasked (i.e., they use the
full 32 bit range) - they must be masked for reading.

- ringbuffer(int fd, int rw, u32 size, ulong *addr)

Create or get address of an existing ringbuffer for either reads or
writes, of at least size bytes, and attach it to the given file
descriptor; the address of the ringbuffer is returned via addr.

Since files can be shared between processes in different address spaces
a ringbuffer may be mapped into multiple address spaces via this
syscall.

- ringbuffer_wait(int fd, int rw)

Wait for space to be availaable (on a ringbuffer for writing), or data
to be available (on a ringbuffer for writing).

todo: add parameters for timeout, minimum amount of data/space to wait for

- ringbuffer_wakeup(int fd, int rw)

Required after writing to a previously empty ringbuffer, or reading from
a previously full ringbuffer to notify waiters on the other end

todo - investigate integrating with futexes?
todo - add extra fields to ringbuffer_ptrs for waiting on a minimum
amount of data/space, i.e. to signal when a wakeup is required

Kernel interfaces:
 - To indicate that ringbuffers are supported on a file, set
   FOP_RINGBUFFER_READ and/or FOP_RINGBUFFER_WRITE in your
   file_operations.
 - To read or write to a file's associated ringbuffers
   (file->f_ringbuffer), use ringbuffer_read() or ringbuffer_write().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   3 +
 arch/x86/entry/syscalls/syscall_64.tbl |   3 +
 fs/Makefile                            |   1 +
 fs/file_table.c                        |   2 +
 fs/ringbuffer.c                        | 478 +++++++++++++++++++++++++
 include/linux/fs.h                     |  14 +
 include/linux/mm_types.h               |   4 +
 include/linux/ringbuffer_sys.h         |  15 +
 include/uapi/linux/ringbuffer_sys.h    |  38 ++
 init/Kconfig                           |   8 +
 kernel/fork.c                          |   1 +
 11 files changed, 567 insertions(+)
 create mode 100644 fs/ringbuffer.c
 create mode 100644 include/linux/ringbuffer_sys.h
 create mode 100644 include/uapi/linux/ringbuffer_sys.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7fd1f57ad3d3..2385359eaf75 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -467,3 +467,6 @@
 460	i386	lsm_set_self_attr	sys_lsm_set_self_attr
 461	i386	lsm_list_modules	sys_lsm_list_modules
 462	i386	mseal 			sys_mseal
+463	i386	ringbuffer		sys_ringbuffer
+464	i386	ringbuffer_wait		sys_ringbuffer_wait
+465	i386	ringbuffer_wakeup	sys_ringbuffer_wakeup
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index a396f6e6ab5b..942602ece075 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -384,6 +384,9 @@
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
 462 	common  mseal			sys_mseal
+463	common	ringbuffer		sys_ringbuffer
+464	common	ringbuffer_wait		sys_ringbuffer_wait
+465	common	ringbuffer_wakeup	sys_ringbuffer_wakeup
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/Makefile b/fs/Makefile
index 6ecc9b0a53f2..48e54ac01fb1 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_RINGBUFFER)        += ringbuffer.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FS_VERITY)		+= verity/
diff --git a/fs/file_table.c b/fs/file_table.c
index 4f03beed4737..9675f22d6615 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -25,6 +25,7 @@
 #include <linux/sysctl.h>
 #include <linux/percpu_counter.h>
 #include <linux/percpu.h>
+#include <linux/ringbuffer_sys.h>
 #include <linux/task_work.h>
 #include <linux/swap.h>
 #include <linux/kmemleak.h>
@@ -412,6 +413,7 @@ static void __fput(struct file *file)
 	 */
 	eventpoll_release(file);
 	locks_remove_file(file);
+	ringbuffer_file_exit(file);
 
 	security_file_release(file);
 	if (unlikely(file->f_flags & FASYNC)) {
diff --git a/fs/ringbuffer.c b/fs/ringbuffer.c
new file mode 100644
index 000000000000..cef8ca8b9416
--- /dev/null
+++ b/fs/ringbuffer.c
@@ -0,0 +1,478 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/darray.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/mman.h>
+#include <linux/mount.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/pseudo_fs.h>
+#include <linux/ringbuffer_sys.h>
+#include <uapi/linux/ringbuffer_sys.h>
+#include <linux/syscalls.h>
+
+#define RINGBUFFER_FS_MAGIC			0xa10a10a2
+
+static DEFINE_MUTEX(ringbuffer_lock);
+
+static struct vfsmount *ringbuffer_mnt;
+
+struct ringbuffer_mapping {
+	ulong			addr;
+	struct mm_struct	*mm;
+};
+
+struct ringbuffer {
+	wait_queue_head_t	wait[2];
+	spinlock_t		lock;
+	int			rw;
+	u32			size;	/* always a power of two */
+	u32			mask;	/* size - 1 */
+	struct file		*io_file;
+	/* hidden internal file for the mmap */
+	struct file		*rb_file;
+	struct ringbuffer_ptrs	*ptrs;
+	void			*data;
+	DARRAY(struct ringbuffer_mapping) mms;
+};
+
+static const struct address_space_operations ringbuffer_aops = {
+	.dirty_folio	= noop_dirty_folio,
+#if 0
+	.migrate_folio	= ringbuffer_migrate_folio,
+#endif
+};
+
+#if 0
+static int ringbuffer_mremap(struct vm_area_struct *vma)
+{
+	struct file *file = vma->vm_file;
+	struct mm_struct *mm = vma->vm_mm;
+	struct kioctx_table *table;
+	int i, res = -EINVAL;
+
+	spin_lock(&mm->ioctx_lock);
+	rcu_read_lock();
+	table = rcu_dereference(mm->ioctx_table);
+	if (!table)
+		goto out_unlock;
+
+	for (i = 0; i < table->nr; i++) {
+		struct kioctx *ctx;
+
+		ctx = rcu_dereference(table->table[i]);
+		if (ctx && ctx->ringbuffer_file == file) {
+			if (!atomic_read(&ctx->dead)) {
+				ctx->user_id = ctx->mmap_base = vma->vm_start;
+				res = 0;
+			}
+			break;
+		}
+	}
+
+out_unlock:
+	rcu_read_unlock();
+	spin_unlock(&mm->ioctx_lock);
+	return res;
+}
+#endif
+
+static const struct vm_operations_struct ringbuffer_vm_ops = {
+#if 0
+	.mremap		= ringbuffer_mremap,
+#endif
+#if IS_ENABLED(CONFIG_MMU)
+	.fault		= filemap_fault,
+	.map_pages	= filemap_map_pages,
+	.page_mkwrite	= filemap_page_mkwrite,
+#endif
+};
+
+static int ringbuffer_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vm_flags_set(vma, VM_DONTEXPAND);
+	vma->vm_ops = &ringbuffer_vm_ops;
+	return 0;
+}
+
+static const struct file_operations ringbuffer_fops = {
+	.mmap = ringbuffer_mmap,
+};
+
+static void ringbuffer_free(struct ringbuffer *rb)
+{
+	rb->io_file->f_ringbuffers[rb->rw] = NULL;
+
+	darray_for_each(rb->mms, map)
+		darray_for_each_reverse(map->mm->ringbuffers, rb2)
+			if (rb == *rb2)
+				darray_remove_item(&map->mm->ringbuffers, rb2);
+
+	if (rb->rb_file) {
+		/* Kills mapping: */
+		truncate_setsize(file_inode(rb->rb_file), 0);
+
+		/* Prevent further access to the kioctx from migratepages */
+		struct address_space *mapping = rb->rb_file->f_mapping;
+		spin_lock(&mapping->i_private_lock);
+		mapping->i_private_data = NULL;
+		spin_unlock(&mapping->i_private_lock);
+
+		fput(rb->rb_file);
+	}
+
+	free_pages((ulong) rb->data, get_order(rb->size));
+	free_page((ulong) rb->ptrs);
+	kfree(rb);
+}
+
+static int ringbuffer_map(struct ringbuffer *rb, ulong *addr)
+{
+	struct mm_struct *mm = current->mm;
+
+	int ret = darray_make_room(&rb->mms, 1) ?:
+		darray_make_room(&mm->ringbuffers, 1);
+	if (ret)
+		return ret;
+
+	ret = mmap_write_lock_killable(mm);
+	if (ret)
+		return ret;
+
+	ulong unused;
+	struct ringbuffer_mapping map = {
+		.addr = do_mmap(rb->rb_file, 0, rb->size + PAGE_SIZE,
+				PROT_READ|PROT_WRITE,
+				MAP_SHARED, 0, 0, &unused, NULL),
+		.mm = mm,
+	};
+	mmap_write_unlock(mm);
+
+	ret = PTR_ERR_OR_ZERO((void *) map.addr);
+	if (ret)
+		return ret;
+
+	ret =   darray_push(&mm->ringbuffers, rb) ?:
+		darray_push(&rb->mms, map);
+	BUG_ON(ret); /* we preallocated */
+
+	*addr = map.addr;
+	return 0;
+}
+
+static int ringbuffer_get_addr_or_map(struct ringbuffer *rb, ulong *addr)
+{
+	struct mm_struct *mm = current->mm;
+
+	darray_for_each(rb->mms, map)
+		if (map->mm == mm) {
+			*addr = map->addr;
+			return 0;
+		}
+
+	return ringbuffer_map(rb, addr);
+}
+
+static struct ringbuffer *ringbuffer_alloc(struct file *file, int rw, u32 size,
+					   ulong *addr)
+{
+	unsigned order = get_order(size);
+	size = PAGE_SIZE << order;
+
+	struct ringbuffer *rb = kzalloc(sizeof(*rb), GFP_KERNEL);
+	if (!rb)
+		return ERR_PTR(-ENOMEM);
+
+	init_waitqueue_head(&rb->wait[READ]);
+	init_waitqueue_head(&rb->wait[WRITE]);
+	spin_lock_init(&rb->lock);
+	rb->rw		= rw;
+	rb->size	= size;
+	rb->mask	= size - 1;
+	rb->io_file	= file;
+
+	rb->ptrs = (void *) __get_free_page(GFP_KERNEL|__GFP_ZERO);
+	rb->data = (void *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
+	if (!rb->ptrs || !rb->data)
+		goto err;
+
+	rb->ptrs->size	= size;
+	rb->ptrs->mask	= size - 1;
+	rb->ptrs->data_offset = PAGE_SIZE;
+
+	struct inode *inode = alloc_anon_inode(ringbuffer_mnt->mnt_sb);
+	int ret = PTR_ERR_OR_ZERO(inode);
+	if (ret)
+		goto err;
+
+	inode->i_mapping->a_ops = &ringbuffer_aops;
+	inode->i_mapping->i_private_data = rb;
+	inode->i_size = size;
+
+	rb->rb_file = alloc_file_pseudo(inode, ringbuffer_mnt, "[ringbuffer]",
+				     O_RDWR, &ringbuffer_fops);
+	ret = PTR_ERR_OR_ZERO(rb->rb_file);
+	if (ret)
+		goto err_iput;
+
+	ret = filemap_add_folio(rb->rb_file->f_mapping,
+				page_folio(virt_to_page(rb->ptrs)),
+				0, GFP_KERNEL);
+	if (ret)
+		goto err;
+
+	/* todo - implement a fallback when high order allocation fails */
+	ret = filemap_add_folio(rb->rb_file->f_mapping,
+				page_folio(virt_to_page(rb->data)),
+				1, GFP_KERNEL);
+	if (ret)
+		goto err;
+
+	ret = ringbuffer_map(rb, addr);
+	if (ret)
+		goto err;
+
+	return rb;
+err_iput:
+	iput(inode);
+err:
+	ringbuffer_free(rb);
+	return ERR_PTR(ret);
+}
+
+/* file is going away, tear down ringbuffers: */
+void ringbuffer_file_exit(struct file *file)
+{
+	mutex_lock(&ringbuffer_lock);
+	for (unsigned i = 0; i < ARRAY_SIZE(file->f_ringbuffers); i++)
+		if (file->f_ringbuffers[i])
+			ringbuffer_free(file->f_ringbuffers[i]);
+	mutex_unlock(&ringbuffer_lock);
+}
+
+/*
+ * XXX: we require synchronization when killing a ringbuffer (because no longer
+ * mapped anywhere) to a file that is still open (and in use)
+ */
+static void ringbuffer_mm_drop(struct mm_struct *mm, struct ringbuffer *rb)
+{
+	darray_for_each_reverse(rb->mms, map)
+		if (mm == map->mm)
+			darray_remove_item(&rb->mms, map);
+
+	if (!rb->mms.nr)
+		ringbuffer_free(rb);
+}
+
+void ringbuffer_mm_exit(struct mm_struct *mm)
+{
+	mutex_lock(&ringbuffer_lock);
+	darray_for_each_reverse(mm->ringbuffers, rb)
+		ringbuffer_mm_drop(mm, *rb);
+	mutex_unlock(&ringbuffer_lock);
+
+	darray_exit(&mm->ringbuffers);
+}
+
+SYSCALL_DEFINE4(ringbuffer, unsigned, fd, int, rw, u32, size, ulong __user *, ringbufferp)
+{
+	ulong rb_addr;
+
+	int ret = get_user(rb_addr, ringbufferp);
+	if (unlikely(ret))
+		return ret;
+
+	if (unlikely(rb_addr || !size || rw > WRITE))
+		return -EINVAL;
+
+	struct fd f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	if (!(f.file->f_op->fop_flags & (rw == READ ? FOP_RINGBUFFER_READ : FOP_RINGBUFFER_WRITE))) {
+		ret = -EOPNOTSUPP;
+		goto err;
+	}
+
+	mutex_lock(&ringbuffer_lock);
+	struct ringbuffer *rb = f.file->f_ringbuffers[rw];
+	if (rb) {
+		ret = ringbuffer_get_addr_or_map(rb, &rb_addr);
+		if (ret)
+			goto err_unlock;
+
+		ret = put_user(rb_addr, ringbufferp);
+	} else {
+		rb = ringbuffer_alloc(f.file, rw, size, &rb_addr);
+		ret = PTR_ERR_OR_ZERO(rb);
+		if (ret)
+			goto err_unlock;
+
+		ret = put_user(rb_addr, ringbufferp);
+		if (ret) {
+			ringbuffer_free(rb);
+			goto err_unlock;
+		}
+
+		f.file->f_ringbuffers[rw] = rb;
+	}
+err_unlock:
+	mutex_unlock(&ringbuffer_lock);
+err:
+	fdput(f);
+	return ret;
+}
+
+static bool __ringbuffer_read(struct ringbuffer *rb, void **data, size_t *len,
+			       bool nonblocking, size_t *ret)
+{
+	u32 head = rb->ptrs->head;
+	u32 tail = rb->ptrs->tail;
+
+	if (head == tail)
+		return 0;
+
+	ulong flags;
+	spin_lock_irqsave(&rb->lock, flags);
+	/* Multiple consumers - recheck under lock: */
+	tail = rb->ptrs->tail;
+
+	while (*len && tail != head) {
+		u32 tail_masked = tail & rb->mask;
+		u32 b = min(*len,
+			min(head - tail,
+			    rb->size - tail_masked));
+
+		memcpy(*data, rb->data + tail_masked, b);
+		tail	+= b;
+		*data	+= b;
+		*len	-= b;
+		*ret	+= b;
+	}
+
+	smp_store_release(&rb->ptrs->tail, tail);
+	spin_unlock_irqrestore(&rb->lock, flags);
+
+	return !*len || nonblocking;
+}
+
+size_t ringbuffer_read(struct ringbuffer *rb, void *data, size_t len, bool nonblocking)
+{
+	size_t ret = 0;
+	wait_event(rb->wait[READ], __ringbuffer_read(rb, &data, &len, nonblocking, &ret));
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ringbuffer_read);
+
+static bool __ringbuffer_write(struct ringbuffer *rb, void **data, size_t *len,
+			       bool nonblocking, size_t *ret)
+{
+	u32 head = rb->ptrs->head;
+	u32 tail = rb->ptrs->tail;
+
+	if (head - tail >= rb->size)
+		return 0;
+
+	ulong flags;
+	spin_lock_irqsave(&rb->lock, flags);
+	/* Multiple producers - recheck under lock: */
+	head = rb->ptrs->head;
+
+	while (*len && head - tail < rb->size) {
+		u32 head_masked = head & rb->mask;
+		u32 b = min(*len,
+			min(tail + rb->size - head,
+			    rb->size - head_masked));
+
+		memcpy(rb->data + head_masked, *data, b);
+		head	+= b;
+		*data	+= b;
+		*len	-= b;
+		*ret	+= b;
+	}
+
+	smp_store_release(&rb->ptrs->head, head);
+	spin_unlock_irqrestore(&rb->lock, flags);
+
+	return !*len || nonblocking;
+}
+
+size_t ringbuffer_write(struct ringbuffer *rb, void *data, size_t len, bool nonblocking)
+{
+	size_t ret = 0;
+	wait_event(rb->wait[WRITE], __ringbuffer_write(rb, &data, &len, nonblocking, &ret));
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ringbuffer_write);
+
+SYSCALL_DEFINE2(ringbuffer_wait, unsigned, fd, int, rw)
+{
+	int ret = 0;
+
+	if (rw > WRITE)
+		return -EINVAL;
+
+	struct fd f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	struct ringbuffer *rb = f.file->f_ringbuffers[rw];
+	if (!rb) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	struct ringbuffer_ptrs *rp = rb->ptrs;
+	wait_event(rb->wait[rw], rw == READ
+		   ? rp->head != rp->tail
+		   : rp->head - rp->tail < rb->size);
+err:
+	fdput(f);
+	return ret;
+}
+
+SYSCALL_DEFINE2(ringbuffer_wakeup, unsigned, fd, int, rw)
+{
+	int ret = 0;
+
+	if (rw > WRITE)
+		return -EINVAL;
+
+	struct fd f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	struct ringbuffer *rb = f.file->f_ringbuffers[rw];
+	if (!rb) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	wake_up(&rb->wait[rw]);
+err:
+	fdput(f);
+	return ret;
+}
+
+static int ringbuffer_init_fs_context(struct fs_context *fc)
+{
+	if (!init_pseudo(fc, RINGBUFFER_FS_MAGIC))
+		return -ENOMEM;
+	fc->s_iflags |= SB_I_NOEXEC;
+	return 0;
+}
+
+static int __init ringbuffer_setup(void)
+{
+	static struct file_system_type ringbuffer_fs = {
+		.name		= "ringbuffer",
+		.init_fs_context = ringbuffer_init_fs_context,
+		.kill_sb	= kill_anon_super,
+	};
+	ringbuffer_mnt = kern_mount(&ringbuffer_fs);
+	if (IS_ERR(ringbuffer_mnt))
+		panic("Failed to create ringbuffer fs mount.");
+	return 0;
+}
+__initcall(ringbuffer_setup);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..ba30fdfff5cb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -978,6 +978,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 		index <  ra->start + ra->size);
 }
 
+struct ringbuffer;
+
 /*
  * f_{lock,count,pos_lock} members can be highly contended and share
  * the same cacheline. f_{lock,mode} are very frequently used together
@@ -1024,6 +1026,14 @@ struct file {
 	struct address_space	*f_mapping;
 	errseq_t		f_wb_err;
 	errseq_t		f_sb_err; /* for syncfs */
+
+#ifdef CONFIG_RINGBUFFER
+	/*
+	 * Ringbuffers for reading/writing without syncall overhead, created by
+	 * ringbuffer(2)
+	 */
+	struct ringbuffer	*f_ringbuffers[2];
+#endif
 } __randomize_layout
   __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
@@ -2051,6 +2061,10 @@ struct file_operations {
 #define FOP_DIO_PARALLEL_WRITE	((__force fop_flags_t)(1 << 3))
 /* Contains huge pages */
 #define FOP_HUGE_PAGES		((__force fop_flags_t)(1 << 4))
+/* Supports read ringbuffers */
+#define FOP_RINGBUFFER_READ	((__force fop_flags_t)(1 << 5))
+/* Supports write ringbuffers */
+#define FOP_RINGBUFFER_WRITE	((__force fop_flags_t)(1 << 6))
 
 /* Wrap a directory iterator that needs exclusive inode access */
 int wrap_directory_iterator(struct file *, struct dir_context *,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 24323c7d0bd4..6e412718ce7e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@
 #include <linux/mm_types_task.h>
 
 #include <linux/auxvec.h>
+#include <linux/darray_types.h>
 #include <linux/kref.h>
 #include <linux/list.h>
 #include <linux/spinlock.h>
@@ -911,6 +912,9 @@ struct mm_struct {
 		spinlock_t			ioctx_lock;
 		struct kioctx_table __rcu	*ioctx_table;
 #endif
+#ifdef CONFIG_RINGBUFFER
+		DARRAY(struct ringbuffer *)	ringbuffers;
+#endif
 #ifdef CONFIG_MEMCG
 		/*
 		 * "owner" points to a task that is regarded as the canonical
diff --git a/include/linux/ringbuffer_sys.h b/include/linux/ringbuffer_sys.h
new file mode 100644
index 000000000000..e9b3d0a0910f
--- /dev/null
+++ b/include/linux/ringbuffer_sys.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RINGBUFFER_SYS_H
+#define _LINUX_RINGBUFFER_SYS_H
+
+struct file;
+void ringbuffer_file_exit(struct file *file);
+
+struct mm_struct;
+void ringbuffer_mm_exit(struct mm_struct *mm);
+
+struct ringbuffer;
+size_t ringbuffer_read(struct ringbuffer *rb, void *data, size_t len, bool nonblocking);
+size_t ringbuffer_write(struct ringbuffer *rb, void *data, size_t len, bool nonblocking);
+
+#endif /* _LINUX_RINGBUFFER_SYS_H */
diff --git a/include/uapi/linux/ringbuffer_sys.h b/include/uapi/linux/ringbuffer_sys.h
new file mode 100644
index 000000000000..d7a3af42da91
--- /dev/null
+++ b/include/uapi/linux/ringbuffer_sys.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_RINGBUFFER_SYS_H
+#define _UAPI_LINUX_RINGBUFFER_SYS_H
+
+/*
+ * ringbuffer_ptrs - head and tail pointers for a ringbuffer, mappped to
+ * userspace:
+ */
+struct ringbuffer_ptrs {
+	/*
+	 * We use u32s because this type is shared between the kernel and
+	 * userspace - ulong/size_t won't work here, we might be 32bit userland
+	 * and 64 bit kernel, and u64 would be preferable (reduced probability
+	 * of ABA) but not all architectures can atomically read/write to a u64;
+	 * we need to avoid torn reads/writes.
+	 *
+	 * head and tail pointers are incremented and stored without masking;
+	 * this is to avoid ABA and differentiate between a full and empty
+	 * buffer - they must be masked with @mask to get an actual offset into
+	 * the data buffer.
+	 *
+	 * All units are in bytes.
+	 *
+	 * Data is emitted at head, consumed from tail.
+	 */
+	u32		head;
+	u32		tail;
+	u32		size;	/* always a power of two */
+	u32		mask;	/* size - 1 */
+
+	/*
+	 * Starting offset of data buffer, from the start of this struct - will
+	 * always be PAGE_SIZE.
+	 */
+	u32		data_offset;
+};
+
+#endif /* _UAPI_LINUX_RINGBUFFER_SYS_H */
diff --git a/init/Kconfig b/init/Kconfig
index 72404c1f2157..1ff8eaa43e2f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1673,6 +1673,14 @@ config IO_URING
 	  applications to submit and complete IO through submission and
 	  completion rings that are shared between the kernel and application.
 
+config RINGBUFFER
+	bool "Enable ringbuffer() syscall" if EXPERT
+	default y
+	help
+	  This option adds support for generic ringbuffers, which can be
+	  attached to any (supported) file descriptor, allowing for reading and
+	  writing without syscall overhead.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..ea160a9abd60 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,6 +1340,7 @@ static inline void __mmput(struct mm_struct *mm)
 	VM_BUG_ON(atomic_read(&mm->mm_users));
 
 	uprobe_clear_state(mm);
+	ringbuffer_mm_exit(mm);
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
-- 
2.45.1


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH] fs: sys_ringbuffer() (WIP)
  2024-05-31  3:53         ` [PATCH] fs: sys_ringbuffer() (WIP) Kent Overstreet
@ 2024-05-31 13:11           ` kernel test robot
  2024-05-31 15:49           ` kernel test robot
  1 sibling, 0 replies; 113+ messages in thread
From: kernel test robot @ 2024-05-31 13:11 UTC (permalink / raw)
  To: Kent Overstreet, Bernd Schubert
  Cc: oe-kbuild-all, Miklos Szeredi, Amir Goldstein, linux-fsdevel,
	Andrew Morton, Linux Memory Management List, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Jens Axboe, Ming Lei,
	Pavel Begunkov, Josef Bacik

Hi Kent,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.10-rc1]
[cannot apply to tip/x86/asm next-20240531]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/fs-sys_ringbuffer-WIP/20240531-115626
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/ytprj7mx37dna3n3kbiskgvris4nfvv63u3v7wogdrlzbikkmt%40chgq5hw3ny3r
patch subject: [PATCH] fs: sys_ringbuffer() (WIP)
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240531/202405312050.MkVo7MG8-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240531/202405312050.MkVo7MG8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405312050.MkVo7MG8-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/sched/signal.h:14,
                    from include/linux/ptrace.h:7,
                    from arch/openrisc/kernel/asm-offsets.c:28:
>> include/linux/mm_types.h:8:10: fatal error: linux/darray_types.h: No such file or directory
       8 | #include <linux/darray_types.h>
         |          ^~~~~~~~~~~~~~~~~~~~~~
   compilation terminated.
   make[3]: *** [scripts/Makefile.build:117: arch/openrisc/kernel/asm-offsets.s] Error 1
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1208: prepare0] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:240: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:240: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +8 include/linux/mm_types.h

     6	
     7	#include <linux/auxvec.h>
   > 8	#include <linux/darray_types.h>
     9	#include <linux/kref.h>
    10	#include <linux/list.h>
    11	#include <linux/spinlock.h>
    12	#include <linux/rbtree.h>
    13	#include <linux/maple_tree.h>
    14	#include <linux/rwsem.h>
    15	#include <linux/completion.h>
    16	#include <linux/cpumask.h>
    17	#include <linux/uprobes.h>
    18	#include <linux/rcupdate.h>
    19	#include <linux/page-flags-layout.h>
    20	#include <linux/workqueue.h>
    21	#include <linux/seqlock.h>
    22	#include <linux/percpu_counter.h>
    23	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH] fs: sys_ringbuffer() (WIP)
  2024-05-31  3:53         ` [PATCH] fs: sys_ringbuffer() (WIP) Kent Overstreet
  2024-05-31 13:11           ` kernel test robot
@ 2024-05-31 15:49           ` kernel test robot
  1 sibling, 0 replies; 113+ messages in thread
From: kernel test robot @ 2024-05-31 15:49 UTC (permalink / raw)
  To: Kent Overstreet, Bernd Schubert
  Cc: llvm, oe-kbuild-all, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, Linux Memory Management List,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring, Jens Axboe,
	Ming Lei, Pavel Begunkov, Josef Bacik

Hi Kent,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.10-rc1]
[cannot apply to tip/x86/asm next-20240531]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/fs-sys_ringbuffer-WIP/20240531-115626
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/ytprj7mx37dna3n3kbiskgvris4nfvv63u3v7wogdrlzbikkmt%40chgq5hw3ny3r
patch subject: [PATCH] fs: sys_ringbuffer() (WIP)
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20240531/202405312226.yKqHcQE4-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240531/202405312226.yKqHcQE4-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405312226.yKqHcQE4-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/um/kernel/asm-offsets.c:1:
   In file included from arch/x86/um/shared/sysdep/kernel-offsets.h:5:
   In file included from include/linux/crypto.h:17:
   In file included from include/linux/slab.h:16:
   In file included from include/linux/gfp.h:7:
   In file included from include/linux/mmzone.h:22:
>> include/linux/mm_types.h:8:10: fatal error: 'linux/darray_types.h' file not found
       8 | #include <linux/darray_types.h>
         |          ^~~~~~~~~~~~~~~~~~~~~~
   1 error generated.
   make[3]: *** [scripts/Makefile.build:117: arch/um/kernel/asm-offsets.s] Error 1
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1208: prepare0] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:240: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:240: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +8 include/linux/mm_types.h

     6	
     7	#include <linux/auxvec.h>
   > 8	#include <linux/darray_types.h>
     9	#include <linux/kref.h>
    10	#include <linux/list.h>
    11	#include <linux/spinlock.h>
    12	#include <linux/rbtree.h>
    13	#include <linux/maple_tree.h>
    14	#include <linux/rwsem.h>
    15	#include <linux/completion.h>
    16	#include <linux/cpumask.h>
    17	#include <linux/uprobes.h>
    18	#include <linux/rcupdate.h>
    19	#include <linux/page-flags-layout.h>
    20	#include <linux/workqueue.h>
    21	#include <linux/seqlock.h>
    22	#include <linux/percpu_counter.h>
    23	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:02   ` Bernd Schubert
  2024-05-30 16:10     ` Kent Overstreet
@ 2024-05-30 16:21     ` Jens Axboe
  2024-05-30 16:32       ` Bernd Schubert
                         ` (2 more replies)
  1 sibling, 3 replies; 113+ messages in thread
From: Jens Axboe @ 2024-05-30 16:21 UTC (permalink / raw)
  To: Bernd Schubert, Kent Overstreet, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Ming Lei, Pavel Begunkov, Josef Bacik

On 5/30/24 10:02 AM, Bernd Schubert wrote:
> 
> 
> On 5/30/24 17:36, Kent Overstreet wrote:
>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
>>> From: Bernd Schubert <bschubert@ddn.com>
>>>
>>> This adds support for uring communication between kernel and
>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>>> appraoch was taken from ublk.  The patches are in RFC state,
>>> some major changes are still to be expected.
>>>
>>> Motivation for these patches is all to increase fuse performance.
>>> In fuse-over-io-uring requests avoid core switching (application
>>> on core X, processing of fuse server on random core Y) and use
>>> shared memory between kernel and userspace to transfer data.
>>> Similar approaches have been taken by ZUFS and FUSE2, though
>>> not over io-uring, but through ioctl IOs
>>
>> What specifically is it about io-uring that's helpful here? Besides the
>> ringbuffer?
>>
>> So the original mess was that because we didn't have a generic
>> ringbuffer, we had aio, tracing, and god knows what else all
>> implementing their own special purpose ringbuffers (all with weird
>> quirks of debatable or no usefulness).
>>
>> It seems to me that what fuse (and a lot of other things want) is just a
>> clean simple easy to use generic ringbuffer for sending what-have-you
>> back and forth between the kernel and userspace - in this case RPCs from
>> the kernel to userspace.
>>
>> But instead, the solution seems to be just toss everything into a new
>> giant subsystem?
> 
> 
> Hmm, initially I had thought about writing my own ring buffer, but then 
> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> need? From interface point of view, io-uring seems easy to use here, 
> has everything we need and kind of the same thing is used for ublk - 
> what speaks against io-uring? And what other suggestion do you have?
> 
> I guess the same concern would also apply to ublk_drv. 
> 
> Well, decoupling from io-uring might help to get for zero-copy, as there
> doesn't seem to be an agreement with Mings approaches (sorry I'm only
> silently following for now).

If you have an interest in the zero copy, do chime in, it would
certainly help get some closure on that feature. I don't think anyone
disagrees it's a useful and needed feature, but there are different view
points on how it's best solved.

> From our side, a customer has pointed out security concerns for io-uring. 

That's just bs and fud these days.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
@ 2024-05-30 16:32       ` Bernd Schubert
  2024-05-30 17:26         ` Jens Axboe
  2024-05-30 17:16       ` Kent Overstreet
  2024-06-04 23:45       ` Ming Lei
  2 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-05-30 16:32 UTC (permalink / raw)
  To: Jens Axboe, Kent Overstreet, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Ming Lei, Pavel Begunkov, Josef Bacik



On 5/30/24 18:21, Jens Axboe wrote:
> On 5/30/24 10:02 AM, Bernd Schubert wrote:
>>
>>
>> On 5/30/24 17:36, Kent Overstreet wrote:
>>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
>>>> From: Bernd Schubert <bschubert@ddn.com>
>>>>
>>>> This adds support for uring communication between kernel and
>>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>>>> appraoch was taken from ublk.  The patches are in RFC state,
>>>> some major changes are still to be expected.
>>>>
>>>> Motivation for these patches is all to increase fuse performance.
>>>> In fuse-over-io-uring requests avoid core switching (application
>>>> on core X, processing of fuse server on random core Y) and use
>>>> shared memory between kernel and userspace to transfer data.
>>>> Similar approaches have been taken by ZUFS and FUSE2, though
>>>> not over io-uring, but through ioctl IOs
>>>
>>> What specifically is it about io-uring that's helpful here? Besides the
>>> ringbuffer?
>>>
>>> So the original mess was that because we didn't have a generic
>>> ringbuffer, we had aio, tracing, and god knows what else all
>>> implementing their own special purpose ringbuffers (all with weird
>>> quirks of debatable or no usefulness).
>>>
>>> It seems to me that what fuse (and a lot of other things want) is just a
>>> clean simple easy to use generic ringbuffer for sending what-have-you
>>> back and forth between the kernel and userspace - in this case RPCs from
>>> the kernel to userspace.
>>>
>>> But instead, the solution seems to be just toss everything into a new
>>> giant subsystem?
>>
>>
>> Hmm, initially I had thought about writing my own ring buffer, but then 
>> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
>> need? From interface point of view, io-uring seems easy to use here, 
>> has everything we need and kind of the same thing is used for ublk - 
>> what speaks against io-uring? And what other suggestion do you have?
>>
>> I guess the same concern would also apply to ublk_drv. 
>>
>> Well, decoupling from io-uring might help to get for zero-copy, as there
>> doesn't seem to be an agreement with Mings approaches (sorry I'm only
>> silently following for now).
> 
> If you have an interest in the zero copy, do chime in, it would
> certainly help get some closure on that feature. I don't think anyone
> disagrees it's a useful and needed feature, but there are different view
> points on how it's best solved.

We had a bit of discussion with Ming about that last year, besides that
I got busy with other parts, it got a bit less of personal interest for
me as our project really needs to access the buffer (additional
checksums, sending it out over network library (libfabric), possibly
even preprocessing of some data) - I think it makes sense if I work on
the other fuse parts first and only come back zero copy a bit later.

> 
>> From our side, a customer has pointed out security concerns for io-uring. 
> 
> That's just bs and fud these days.

I wasn't in contact with that customer personally, I had just seen their
email.It would probably help if RHEL would eventually gain io-uring
support - almost all of HPC systems are using it or a clone. I was
always hoping that RHEL would get it before I'm done with
fuse-over-io-uring, now I'm not so sure anymore.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:32       ` Bernd Schubert
@ 2024-05-30 17:26         ` Jens Axboe
  0 siblings, 0 replies; 113+ messages in thread
From: Jens Axboe @ 2024-05-30 17:26 UTC (permalink / raw)
  To: Bernd Schubert, Kent Overstreet, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Ming Lei, Pavel Begunkov, Josef Bacik

On 5/30/24 10:32 AM, Bernd Schubert wrote:
> 
> 
> On 5/30/24 18:21, Jens Axboe wrote:
>> On 5/30/24 10:02 AM, Bernd Schubert wrote:
>>>
>>>
>>> On 5/30/24 17:36, Kent Overstreet wrote:
>>>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
>>>>> From: Bernd Schubert <bschubert@ddn.com>
>>>>>
>>>>> This adds support for uring communication between kernel and
>>>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>>>>> appraoch was taken from ublk.  The patches are in RFC state,
>>>>> some major changes are still to be expected.
>>>>>
>>>>> Motivation for these patches is all to increase fuse performance.
>>>>> In fuse-over-io-uring requests avoid core switching (application
>>>>> on core X, processing of fuse server on random core Y) and use
>>>>> shared memory between kernel and userspace to transfer data.
>>>>> Similar approaches have been taken by ZUFS and FUSE2, though
>>>>> not over io-uring, but through ioctl IOs
>>>>
>>>> What specifically is it about io-uring that's helpful here? Besides the
>>>> ringbuffer?
>>>>
>>>> So the original mess was that because we didn't have a generic
>>>> ringbuffer, we had aio, tracing, and god knows what else all
>>>> implementing their own special purpose ringbuffers (all with weird
>>>> quirks of debatable or no usefulness).
>>>>
>>>> It seems to me that what fuse (and a lot of other things want) is just a
>>>> clean simple easy to use generic ringbuffer for sending what-have-you
>>>> back and forth between the kernel and userspace - in this case RPCs from
>>>> the kernel to userspace.
>>>>
>>>> But instead, the solution seems to be just toss everything into a new
>>>> giant subsystem?
>>>
>>>
>>> Hmm, initially I had thought about writing my own ring buffer, but then 
>>> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
>>> need? From interface point of view, io-uring seems easy to use here, 
>>> has everything we need and kind of the same thing is used for ublk - 
>>> what speaks against io-uring? And what other suggestion do you have?
>>>
>>> I guess the same concern would also apply to ublk_drv. 
>>>
>>> Well, decoupling from io-uring might help to get for zero-copy, as there
>>> doesn't seem to be an agreement with Mings approaches (sorry I'm only
>>> silently following for now).
>>
>> If you have an interest in the zero copy, do chime in, it would
>> certainly help get some closure on that feature. I don't think anyone
>> disagrees it's a useful and needed feature, but there are different view
>> points on how it's best solved.
> 
> We had a bit of discussion with Ming about that last year, besides that
> I got busy with other parts, it got a bit less of personal interest for
> me as our project really needs to access the buffer (additional
> checksums, sending it out over network library (libfabric), possibly
> even preprocessing of some data) - I think it makes sense if I work on
> the other fuse parts first and only come back zero copy a bit later.

Ah I see - yes if you're going to be touching the data anyway, zero copy
is less of a concern. Some memory bandwidth can still be saved if you're
not touching all of it, of course. But if you are, you're probably
better off copying it in the first place.

>>> From our side, a customer has pointed out security concerns for io-uring. 
>>
>> That's just bs and fud these days.
> 
> I wasn't in contact with that customer personally, I had just seen their
> email.It would probably help if RHEL would eventually gain io-uring
> support - almost all of HPC systems are using it or a clone. I was
> always hoping that RHEL would get it before I'm done with
> fuse-over-io-uring, now I'm not so sure anymore.

Not sure what the RHEL status is. I know backports are done on the
io_uring side, but not sure what base they are currently on. I strongly
suspect that would be a gating factor for getting it enabled. If it's
too out of date, then performance isn't going to be as good as current
mainline anyway.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
  2024-05-30 16:32       ` Bernd Schubert
@ 2024-05-30 17:16       ` Kent Overstreet
  2024-05-30 17:28         ` Jens Axboe
  2024-06-04 23:45       ` Ming Lei
  2 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 17:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Ming Lei, Pavel Begunkov,
	Josef Bacik

On Thu, May 30, 2024 at 10:21:19AM -0600, Jens Axboe wrote:
> On 5/30/24 10:02 AM, Bernd Schubert wrote:
> > From our side, a customer has pointed out security concerns for io-uring. 
> 
> That's just bs and fud these days.

You have a history of being less than responsive with bug reports, and
this sort of attitude is not the attitude of a responsible maintainer.

From what I've seen those concerns were well founded, so if you want to
be taking seriously I'd be talking about what was done to address them
instead of namecalling.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 17:16       ` Kent Overstreet
@ 2024-05-30 17:28         ` Jens Axboe
  2024-05-30 17:58           ` Kent Overstreet
  0 siblings, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-05-30 17:28 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Ming Lei, Pavel Begunkov,
	Josef Bacik

On 5/30/24 11:16 AM, Kent Overstreet wrote:
> On Thu, May 30, 2024 at 10:21:19AM -0600, Jens Axboe wrote:
>> On 5/30/24 10:02 AM, Bernd Schubert wrote:
>>> From our side, a customer has pointed out security concerns for io-uring. 
>>
>> That's just bs and fud these days.
> 
> You have a history of being less than responsive with bug reports, and
> this sort of attitude is not the attitude of a responsible maintainer.

Ok... That's a bold claim. We actually tend to bug reports quickly and
get them resolved in a timely manner. Maybe I've been less responsive on
a bug report from you, but that's usually because the emails turn out
like this one, with odd and unwarranted claims. Not taking the bait.

If you're referring to the file reference and umount issue, yes I do
very much want to get that one resolved. I do have patches for that, but
was never quite happy with them. As it isn't a stability or safety
concern, and not a practical concern outside of the test case in
question, it hasn't been super high on the radar unfortunately.

> From what I've seen those concerns were well founded, so if you want to
> be taking seriously I'd be talking about what was done to address them
> instead of namecalling.

I have addressed it several times in the past. tldr is that yeah the
initial history of io_uring wasn't great, due to some unfortunate
initial design choices (mostly around async worker setup and
identities). Those have since been rectified, and the code base is
stable and solid these days.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 17:28         ` Jens Axboe
@ 2024-05-30 17:58           ` Kent Overstreet
  2024-05-30 18:48             ` Jens Axboe
  0 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 17:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Ming Lei, Pavel Begunkov,
	Josef Bacik

On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote:
> I have addressed it several times in the past. tldr is that yeah the
> initial history of io_uring wasn't great, due to some unfortunate
> initial design choices (mostly around async worker setup and
> identities).

Not to pick on you too much but the initial history looked pretty messy
to me - a lot of layering violations - it made aio.c look clean.

I know you were in "get shit done" mode, but at some point we have to
take a step back and ask "what are the different core concepts being
expressed here, and can we start picking them apart?". A generic
ringbuffer would be a good place to start.

I'd also really like to see some more standardized mechanisms for "I'm a
kernel thread doing work on behalf of some other user thread" - this
comes up elsewhere, I'm talking with David Howells right now about
fsconfig which is another place it is or will be coming up.

> Those have since been rectified, and the code base is
> stable and solid these days.

good tests, code coverage analysis to verify, good syzbot coverage?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 17:58           ` Kent Overstreet
@ 2024-05-30 18:48             ` Jens Axboe
  2024-05-30 19:35               ` Kent Overstreet
  0 siblings, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-05-30 18:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Ming Lei, Pavel Begunkov,
	Josef Bacik

On 5/30/24 11:58 AM, Kent Overstreet wrote:
> On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote:
>> I have addressed it several times in the past. tldr is that yeah the
>> initial history of io_uring wasn't great, due to some unfortunate
>> initial design choices (mostly around async worker setup and
>> identities).
> 
> Not to pick on you too much but the initial history looked pretty messy
> to me - a lot of layering violations - it made aio.c look clean.

Oh I certainly agree, the initial code was in a much worse state than it
is in now. Lots of things have happened there, like splitting things up
and adding appropriate layering. That was more of a code hygiene kind of
thing, to make it easier to understand, maintain, and develop.

Any new subsystem is going to see lots of initial churn, regardless of
how long it's been developed before going into upstream. We certainly
had lots of churn, where these days it's stabilized. I don't think
that's unusual, particularly for something that attempts to do certain
things very differently. I would've loved to start with our current
state, but I don't think years of being out of tree would've completely
solved that. Some things you just don't find until it's in tree,
unfortunately.

> I know you were in "get shit done" mode, but at some point we have to
> take a step back and ask "what are the different core concepts being
> expressed here, and can we start picking them apart?". A generic
> ringbuffer would be a good place to start.
> 
> I'd also really like to see some more standardized mechanisms for "I'm a
> kernel thread doing work on behalf of some other user thread" - this
> comes up elsewhere, I'm talking with David Howells right now about
> fsconfig which is another place it is or will be coming up.

That does exist, and it came from the io_uring side of needing exactly
that. This is why we have create_io_thread(). IMHO it's the only sane
way to do it, trying to guesstimate what happens deep down in a random
callstack, and setting things up appropriately, is impossible. This is
where most of the earlier day io_uring issues came from, and what I
referred to as a "poor initial design choice".

>> Those have since been rectified, and the code base is
>> stable and solid these days.
> 
> good tests, code coverage analysis to verify, good syzbot coverage?

3x yes. Obviously I'm always going to say that tests could be better,
have better coverage, cover more things, because nothing is perfect (and
if you think it is, you're fooling yourself) and as a maintainer I want
perfect coverage. But we're pretty diligent these days about adding
tests for everything. And any regression or bug report always gets test
cases written.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 18:48             ` Jens Axboe
@ 2024-05-30 19:35               ` Kent Overstreet
  2024-05-31  0:11                 ` Jens Axboe
  0 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-05-30 19:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Ming Lei, Pavel Begunkov,
	Josef Bacik

On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote:
> On 5/30/24 11:58 AM, Kent Overstreet wrote:
> > On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote:
> >> I have addressed it several times in the past. tldr is that yeah the
> >> initial history of io_uring wasn't great, due to some unfortunate
> >> initial design choices (mostly around async worker setup and
> >> identities).
> > 
> > Not to pick on you too much but the initial history looked pretty messy
> > to me - a lot of layering violations - it made aio.c look clean.
> 
> Oh I certainly agree, the initial code was in a much worse state than it
> is in now. Lots of things have happened there, like splitting things up
> and adding appropriate layering. That was more of a code hygiene kind of
> thing, to make it easier to understand, maintain, and develop.
> 
> Any new subsystem is going to see lots of initial churn, regardless of
> how long it's been developed before going into upstream. We certainly
> had lots of churn, where these days it's stabilized. I don't think
> that's unusual, particularly for something that attempts to do certain
> things very differently. I would've loved to start with our current
> state, but I don't think years of being out of tree would've completely
> solved that. Some things you just don't find until it's in tree,
> unfortunately.

Well, the main thing I would've liked is a bit more discussion in the
early days of io_uring; there are things we could've done differently
back then that could've got us something cleaner in the long run.

My main complaints were always
 - yet another special purpose ringbuffer, and
 - yet another parallel syscall interface.

We've got too many of those too (aio is another), and the API
fragmentation is a real problem for userspace that just wants to be able
to issue arbitrary syscalls asynchronously. IO uring could've just been
serializing syscall numbers and arguments - that would have worked fine.

Given the history of failed AIO replacements just wanting to shove in
something working was understandable, but I don't think those would have
been big asks.

> > I'd also really like to see some more standardized mechanisms for "I'm a
> > kernel thread doing work on behalf of some other user thread" - this
> > comes up elsewhere, I'm talking with David Howells right now about
> > fsconfig which is another place it is or will be coming up.
> 
> That does exist, and it came from the io_uring side of needing exactly
> that. This is why we have create_io_thread(). IMHO it's the only sane
> way to do it, trying to guesstimate what happens deep down in a random
> callstack, and setting things up appropriately, is impossible. This is
> where most of the earlier day io_uring issues came from, and what I
> referred to as a "poor initial design choice".

Thanks, I wasn't aware of that - that's worth highlighting. I may switch
thread_with_file to that, and the fsconfig work David and I are talking
about can probably use it as well.

We really should have something lighter weight that we can use for work
items though, that's our standard mechanism for deferred work, not
spinning up a whole kthread. We do have kthread_use_mm() - there's no
reason we couldn't do an expanded version of that for all the other
shared resources that need to be available.

This was also another blocker in the other aborted AIO replacements, so
reusing clone flags does seem like the most reasonable way to make
progress here, but I would wager there's other stuff in task_struct that
really should be shared and isn't. task_struct is up to 825 (!) lines
now, which means good luck on even finding that stuff - maybe at some
point we'll have to get a giant effort going to clean up and organize
task_struct, like willy's been doing with struct page.

> >> Those have since been rectified, and the code base is
> >> stable and solid these days.
> > 
> > good tests, code coverage analysis to verify, good syzbot coverage?
> 
> 3x yes. Obviously I'm always going to say that tests could be better,
> have better coverage, cover more things, because nothing is perfect (and
> if you think it is, you're fooling yourself) and as a maintainer I want
> perfect coverage. But we're pretty diligent these days about adding
> tests for everything. And any regression or bug report always gets test
> cases written.

*nod* that's encouraging. Looking forward to the umount issue being
fixed so I can re-enable it in my tests...

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 19:35               ` Kent Overstreet
@ 2024-05-31  0:11                 ` Jens Axboe
  0 siblings, 0 replies; 113+ messages in thread
From: Jens Axboe @ 2024-05-31  0:11 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Andrei Vagin, io-uring, Ming Lei, Pavel Begunkov,
	Josef Bacik

On 5/30/24 1:35 PM, Kent Overstreet wrote:
> On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote:
>> On 5/30/24 11:58 AM, Kent Overstreet wrote:
>>> On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote:
>>>> I have addressed it several times in the past. tldr is that yeah the
>>>> initial history of io_uring wasn't great, due to some unfortunate
>>>> initial design choices (mostly around async worker setup and
>>>> identities).
>>>
>>> Not to pick on you too much but the initial history looked pretty messy
>>> to me - a lot of layering violations - it made aio.c look clean.
>>
>> Oh I certainly agree, the initial code was in a much worse state than it
>> is in now. Lots of things have happened there, like splitting things up
>> and adding appropriate layering. That was more of a code hygiene kind of
>> thing, to make it easier to understand, maintain, and develop.
>>
>> Any new subsystem is going to see lots of initial churn, regardless of
>> how long it's been developed before going into upstream. We certainly
>> had lots of churn, where these days it's stabilized. I don't think
>> that's unusual, particularly for something that attempts to do certain
>> things very differently. I would've loved to start with our current
>> state, but I don't think years of being out of tree would've completely
>> solved that. Some things you just don't find until it's in tree,
>> unfortunately.
> 
> Well, the main thing I would've liked is a bit more discussion in the
> early days of io_uring; there are things we could've done differently
> back then that could've got us something cleaner in the long run.

No matter how much discussion would've been had, there always would've
been compromises or realizations that "yeah that thing there should've
been done differently". In any case, pointless to argue about that, as
the only thing we can change is how things look in the present and
future. At least I don't have a time machine.

> My main complaints were always
>  - yet another special purpose ringbuffer, and
>  - yet another parallel syscall interface.

Exactly how many "parallel syscall interfaces" do we have?

> We've got too many of those too (aio is another), and the API

Like which ones? aio is a special case async interface for O_DIRECT IO,
that's about it. It's not a generic IO interface. It's literally dio
only. And yes, then it has the option of syncing a file, and poll got
added some years ago as well. But for the longest duration of aio, it
was just dio aio. The early versions of io_uring actually added on top of
that, but I didn't feel like it belonged there.

> fragmentation is a real problem for userspace that just wants to be able
> to issue arbitrary syscalls asynchronously. IO uring could've just been
> serializing syscall numbers and arguments - that would have worked fine.

That doesn't work at all. If all syscalls had been designed with
issue + wait semantics, then yeah that would obviously be the way that
it would've been done. You just add all of them, and pass arguments,
done. But that's not reality. You can do that if you just offload to a
worker thread, but that's not how you get performance. And you could
very much STILL do just that, in fact it'd be trivial to wire up. But
nobody would use it, because something that just always punts to a
thread would be awful for performance. You may as well just do that
offload in userspace then.

Hence the majority of the work for wiring up operations that implement
existing functionality in an async way is core work. The io_uring
interface for it is the simplest part, once you have the underpinnings
doing what you want. Sometimes that's some ugly "this can't block, if it
does, return -EAGAIN", and sometimes it's refactoring things a bit so
that you can tap into the inevitable waitqueue. There's no single
recipe, it all depends on how it currently works.

> Given the history of failed AIO replacements just wanting to shove in
> something working was understandable, but I don't think those would have
> been big asks.

What are these failed AIO replacements? aio is for storage, io_uring was
never meant to be a storage only interface. The only other attempt I can
recall, back in the day, was the acall and threadlet stuff that Ingo and
zab worked on. And even that attempted to support async in a performant
way, by doing work inline whenever possible. But hard to use, as you'd
return as a different pid if the original task blocked.

>>> I'd also really like to see some more standardized mechanisms for "I'm a
>>> kernel thread doing work on behalf of some other user thread" - this
>>> comes up elsewhere, I'm talking with David Howells right now about
>>> fsconfig which is another place it is or will be coming up.
>>
>> That does exist, and it came from the io_uring side of needing exactly
>> that. This is why we have create_io_thread(). IMHO it's the only sane
>> way to do it, trying to guesstimate what happens deep down in a random
>> callstack, and setting things up appropriately, is impossible. This is
>> where most of the earlier day io_uring issues came from, and what I
>> referred to as a "poor initial design choice".
> 
> Thanks, I wasn't aware of that - that's worth highlighting. I may switch
> thread_with_file to that, and the fsconfig work David and I are talking
> about can probably use it as well.
> 
> We really should have something lighter weight that we can use for work
> items though, that's our standard mechanism for deferred work, not
> spinning up a whole kthread. We do have kthread_use_mm() - there's no
> reason we couldn't do an expanded version of that for all the other
> shared resources that need to be available.

Like io-wq does for io_uring? That's why it's there. io_uring tries not
to rely on it very much, it's considered the slow path for the above
mentioned reasons of why thread offload generally isn't a great idea.
But at least it doesn't require a full fork for each item.

> This was also another blocker in the other aborted AIO replacements, so
> reusing clone flags does seem like the most reasonable way to make
> progress here, but I would wager there's other stuff in task_struct that
> really should be shared and isn't. task_struct is up to 825 (!) lines
> now, which means good luck on even finding that stuff - maybe at some
> point we'll have to get a giant effort going to clean up and organize
> task_struct, like willy's been doing with struct page.

Well that thing is an unwieldy beast and has been for many years. So
yeah, very much agree that it needs some tender love and care, and we'd
be better off for it.

>>>> Those have since been rectified, and the code base is
>>>> stable and solid these days.
>>>
>>> good tests, code coverage analysis to verify, good syzbot coverage?
>>
>> 3x yes. Obviously I'm always going to say that tests could be better,
>> have better coverage, cover more things, because nothing is perfect (and
>> if you think it is, you're fooling yourself) and as a maintainer I want
>> perfect coverage. But we're pretty diligent these days about adding
>> tests for everything. And any regression or bug report always gets test
>> cases written.
> 
> *nod* that's encouraging. Looking forward to the umount issue being
> fixed so I can re-enable it in my tests...

I'll pick it up again soon enough, I'll let you know.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
  2024-05-30 16:32       ` Bernd Schubert
  2024-05-30 17:16       ` Kent Overstreet
@ 2024-06-04 23:45       ` Ming Lei
  2 siblings, 0 replies; 113+ messages in thread
From: Ming Lei @ 2024-06-04 23:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Bernd Schubert, Kent Overstreet, Bernd Schubert, Miklos Szeredi,
	Amir Goldstein, linux-fsdevel, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Pavel Begunkov, Josef Bacik, Ming Lei

On Fri, May 31, 2024 at 12:21 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 5/30/24 10:02 AM, Bernd Schubert wrote:
> >
> >
> > On 5/30/24 17:36, Kent Overstreet wrote:
> >> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
> >>> From: Bernd Schubert <bschubert@ddn.com>
> >>>
> >>> This adds support for uring communication between kernel and
> >>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> >>> appraoch was taken from ublk.  The patches are in RFC state,
> >>> some major changes are still to be expected.
> >>>
> >>> Motivation for these patches is all to increase fuse performance.
> >>> In fuse-over-io-uring requests avoid core switching (application
> >>> on core X, processing of fuse server on random core Y) and use
> >>> shared memory between kernel and userspace to transfer data.
> >>> Similar approaches have been taken by ZUFS and FUSE2, though
> >>> not over io-uring, but through ioctl IOs
> >>
> >> What specifically is it about io-uring that's helpful here? Besides the
> >> ringbuffer?
> >>
> >> So the original mess was that because we didn't have a generic
> >> ringbuffer, we had aio, tracing, and god knows what else all
> >> implementing their own special purpose ringbuffers (all with weird
> >> quirks of debatable or no usefulness).
> >>
> >> It seems to me that what fuse (and a lot of other things want) is just a
> >> clean simple easy to use generic ringbuffer for sending what-have-you
> >> back and forth between the kernel and userspace - in this case RPCs from
> >> the kernel to userspace.
> >>
> >> But instead, the solution seems to be just toss everything into a new
> >> giant subsystem?
> >
> >
> > Hmm, initially I had thought about writing my own ring buffer, but then
> > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we
> > need? From interface point of view, io-uring seems easy to use here,
> > has everything we need and kind of the same thing is used for ublk -
> > what speaks against io-uring? And what other suggestion do you have?
> >
> > I guess the same concern would also apply to ublk_drv.
> >
> > Well, decoupling from io-uring might help to get for zero-copy, as there
> > doesn't seem to be an agreement with Mings approaches (sorry I'm only
> > silently following for now).

We have concluded pipe & splice isn't good for zero copy, and io_uring
provides zc in async way, which is really nice for async application.

>
> If you have an interest in the zero copy, do chime in, it would
> certainly help get some closure on that feature. I don't think anyone
> disagrees it's a useful and needed feature, but there are different view
> points on how it's best solved.

Now generic sqe group feature is being added, and generic zero copy can be
built over it easily, can you or anyone take a look?

https://lore.kernel.org/linux-block/20240511001214.173711-1-ming.lei@redhat.com/


Thanks,
Ming


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (20 preceding siblings ...)
  2024-05-30 15:36 ` Kent Overstreet
@ 2024-05-30 20:47 ` Josef Bacik
  2024-06-11  8:20 ` Miklos Szeredi
  22 siblings, 0 replies; 113+ messages in thread
From: Josef Bacik @ 2024-05-30 20:47 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel, bernd.schubert,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring

On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote:
> From: Bernd Schubert <bschubert@ddn.com>
> 
> This adds support for uring communication between kernel and
> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> appraoch was taken from ublk.  The patches are in RFC state,
> some major changes are still to be expected.
> 

First, thanks for tackling this, this is a pretty big project and pretty
important, you've put a lot of work into this and it's pretty great.

A few things that I've pointed out elsewhere, but bear repeating and keeping in
mind for the entire patch series.

1. Make sure you've got changelogs.  There's several patches that just don't
   have changelogs.  I get things where it's like "add a mmap interface", but it
   would be good to have an explanation as to why you're adding it and what we
   hope to get out of that change.

2. Again as I stated elsewhere, you add a bunch of structs and stuff that aren't
   related to the current patch, which makes it difficult for me to work out
   what's needed or how it's used, so I go back and forth between the code and
   the patch a lot, and I've probably missed a few things.

3. Drop the CPU scheduling changes for this first pass.  The performance
   optimizations are definitely worth pursuing, but I don't want to get hung up
   in waiting on the scheduler dependencies to land.  Additionally what looks
   like it works in your setup may not necessarily be good for everybody's
   setup.  Having the baseline stuff in and working well, and then providing
   patches to change the CPU stuff in a way that we can test in a variety of
   different environments to validate the wins would be better.  As someone who
   has to regularly go figure out what in the scheduler changed to make all of
   our metrics look bad again, I'm very wary of changes that make CPU selection
   policy decisions in a way that aren't changeable without changing the code.

Thanks,

Josef

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
                   ` (21 preceding siblings ...)
  2024-05-30 20:47 ` Josef Bacik
@ 2024-06-11  8:20 ` Miklos Szeredi
  2024-06-11 10:26   ` Bernd Schubert
  22 siblings, 1 reply; 113+ messages in thread
From: Miklos Szeredi @ 2024-06-11  8:20 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Amir Goldstein, linux-fsdevel, bernd.schubert, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring

On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote:
>
> From: Bernd Schubert <bschubert@ddn.com>
>
> This adds support for uring communication between kernel and
> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
> appraoch was taken from ublk.  The patches are in RFC state,
> some major changes are still to be expected.

Thank you very much for tackling this.  I think this is an important
feature and one that could potentially have a significant effect on
fuse performance, which is something many people would love to see.

I'm thinking about the architecture and there are some questions:

Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV?  That's
would just be the async part, without the mapped buffer.  I suspect
most of the performance advantage comes from this and the per-CPU
queue, not from the mapped buffer, yet most of the complexity seems to
be related to the mapped buffer.

Maybe there's an advantage in using an atomic op for WRITEV + READV,
but I'm not quite seeing it yet, since there's no syscall overhead for
separate ops.

What's the reason for separate async and sync request queues?

> Avoiding cache line bouncing / numa systems was discussed
> between Amir and Miklos before and Miklos had posted
> part of the private discussion here
> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
>
> This cache line bouncing should be addressed by these patches
> as well.

Why do you think this is solved?

> I had also noticed waitq wake-up latencies in fuse before
> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/
>
> This spinning approach helped with performance (>40% improvement
> for file creates), but due to random server side thread/core utilization
> spinning cannot be well controlled in /dev/fuse mode.
> With fuse-over-io-uring requests are handled on the same core
> (sync requests) or on core+1 (large async requests) and performance
> improvements are achieved without spinning.

I feel this should be a scheduler decision, but the selecting the
queue needs to be based on that decision.  Maybe the scheduler people
can help out with this.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-11  8:20 ` Miklos Szeredi
@ 2024-06-11 10:26   ` Bernd Schubert
  2024-06-11 15:35     ` Miklos Szeredi
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-06-11 10:26 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert
  Cc: Amir Goldstein, linux-fsdevel, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring

On 6/11/24 10:20, Miklos Szeredi wrote:
> On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>> From: Bernd Schubert <bschubert@ddn.com>
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>> appraoch was taken from ublk.  The patches are in RFC state,
>> some major changes are still to be expected.
> 
> Thank you very much for tackling this.  I think this is an important
> feature and one that could potentially have a significant effect on
> fuse performance, which is something many people would love to see.
> 
> I'm thinking about the architecture and there are some questions:
> 
> Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV?  That's
> would just be the async part, without the mapped buffer.  I suspect
> most of the performance advantage comes from this and the per-CPU
> queue, not from the mapped buffer, yet most of the complexity seems to
> be related to the mapped buffer.

I didn't try because IORING_OP_URING_CMD seems to be exactly made for
our case.

Firstly and although I didn't have time to look into it yet, but with
the current approach it should be rather simple to switch to another
ring as Kent has suggested.

Secondly, with IORING_OP_URING_CMD we already have only a single command
to submit requests and fetch the next one - half of the system calls.

Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
https://github.com/uroni/fuseuring?
I.e. it hook into the existing fuse and just changes from read()/write()
of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
control which ring/queue and which ring-entry handles the request.

Thirdly, initially I had even removed the allocation of 'struct
fuse_req' and directly allocated these on available ring entries. I.e.
the application thread was writing to the mmap ring buffer. I just
removed that code for now, as it introduced additional complexity with
an unknown performance outcome. If it should be helpful we could add
that later. I don't think we have that flexibility with
IORING_OP_READV/IORING_OP_WRITEV.

And then, personally I do not see mmap complexity - it is just very
convenient to write/read to/from the ring buffer from any kernel thread.
Currently not supported by io-uring, but I think we could even avoid any
kind of system call and let the application poll for results. Similar to
what IORING_SETUP_SQPOLL does, but without the requirement of another
kernel thread.

Most complexity and issues I got, come from the requirement of io_uring
to complete requests with io_uring_op_cmd_done. In RFCv1 you had
annotated the async shutdown method - that was indeed really painful and
resulted in a never ending number of shutdown races. Once I removed that
and also removed using a bitmap (I don't even know why I used a bitmap
in the first place in RFCv1 and not lists as in RFCv2) shutdown became
manageable.

If there would be way to tell io-uring that kernel fuse is done and to
let it complete itself whatever was not completed yet, it would would be
great help. Although from my point of view, that could be done once this
series is merged.
Or we decide to switch to Kents new ring buffer and might not have that
problem at all...

> 
> Maybe there's an advantage in using an atomic op for WRITEV + READV,
> but I'm not quite seeing it yet, since there's no syscall overhead for
> separate ops.

Here I get confused, could please explain?
Current fuse has a read + write operation - a read() system call to
process a fuse request and a write() call to submit the result and then
read() to fetch the next request.
If userspace has to send IORING_OP_READV to fetch a request and complete
with IORING_OP_IORING_OP_WRITEV it would go through existing code path
with operations? Well, maybe userspace could submit with IOSQE_IO_LINK,
but that sounds like it would need to send two ring entries? I.e. memory
and processing overhead?

And then, no way to further optimize and do fuse_req allocation on the
ring (if there are free entries). And probably also no way that we ever
let the application work in the SQPOLL way, because the application
thread does not have the right to read from the fuse-server buffer? I.e.
what I mean is that IORING_OP_URING_CMD gives a better flexibility.

Btw, another issue that is avoided with the new ring-request layout is
compatibility and alignment. The fuse header is always in a 4K section
of the request data follow then. I.e. extending request sizes does not
impose compatibility issues any more and also for passthrough and
similar - O_DIRECT can be passed through to the backend file system.
Although these issues probably need to be solved into the current fuse
protocol.

> 
> What's the reason for separate async and sync request queues?

To have credits for IO operations. For an overlay file system it might
not matter, because it might get stuck with another system call in the
underlying file system. But for something that is not system call bound
and that has control, async and sync can be handled with priority given
by userspace.

As I had written in the introduction mail, I'm currently working on
different IO sizes per ring queue - it gets even more fine grained and
with the advantage of reduced memory usage per queue when the queue has
entries for many small requests and a few large ones.

Next step would here to add credits for reads/writes (or to reserve
credits for meta operations) in the sync queue, so that meta operations
can always go through. If there should be async meta operations (through
application io-uring requests?) would need to be done for the async
queue as well.

Last but not least, with separation there is no global async queue
anymore - no global lock and cache issues.

> 
>> Avoiding cache line bouncing / numa systems was discussed
>> between Amir and Miklos before and Miklos had posted
>> part of the private discussion here
>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
>>
>> This cache line bouncing should be addressed by these patches
>> as well.
> 
> Why do you think this is solved?

I _guess_ that writing to the mmaped buffer and processing requests on
the same cpu core should make it possible to keep things in cpu cache. I
did not verify that with perf, though.

> 
>> I had also noticed waitq wake-up latencies in fuse before
>> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/
>>
>> This spinning approach helped with performance (>40% improvement
>> for file creates), but due to random server side thread/core utilization
>> spinning cannot be well controlled in /dev/fuse mode.
>> With fuse-over-io-uring requests are handled on the same core
>> (sync requests) or on core+1 (large async requests) and performance
>> improvements are achieved without spinning.
> 
> I feel this should be a scheduler decision, but the selecting the
> queue needs to be based on that decision.  Maybe the scheduler people
> can help out with this.

For sync requests getting the scheduler involved is what is responsible
for making really fuse slow. It schedules on random cores, that are in
sleep states and additionally frequency scaling does not go up. We
really need to stay on the same core here, as that is submitting the
result, the core is already running (i.e. not sleeping) and has data in
its cache. All benchmark results with sync requests point that out.

For async requests, the above still applies, but basically one thread is
writing/reading and the other thread handles/provides the data. Random
switching of cores is then still not good. At best and to be tested,
submitting rather large chunks to other cores.
What is indeed to be discussed (and think annotated in the corresponding
patch description), if there is a way to figure out if the other core is
already busy. But then the scheduler does not know what it is busy with
- are these existing fuse requests or something else. That part is
really hard and I don't think it makes sense to discuss this right now
before the main part is merged. IMHO, better to add a config flag for
the current cpu+1 scheduling with an annotation that this setting might
go away in the future.

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-11 10:26   ` Bernd Schubert
@ 2024-06-11 15:35     ` Miklos Szeredi
  2024-06-11 17:37       ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Miklos Szeredi @ 2024-06-11 15:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring

On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:

> Secondly, with IORING_OP_URING_CMD we already have only a single command
> to submit requests and fetch the next one - half of the system calls.
>
> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
> https://github.com/uroni/fuseuring?
> I.e. it hook into the existing fuse and just changes from read()/write()
> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
> control which ring/queue and which ring-entry handles the request.

Unlike system calls, io_uring ops should have very little overhead.
That's one of the main selling points of io_uring (as described in the
io_uring(7) man page).

So I don't think it matters to performance whether there's a combined
WRITEV + READV (or COMMIT + FETCH) op or separate ops.

The advantage of separate ops is more flexibility and less complexity
(do only one thing in an op).

> Thirdly, initially I had even removed the allocation of 'struct
> fuse_req' and directly allocated these on available ring entries. I.e.
> the application thread was writing to the mmap ring buffer. I just
> removed that code for now, as it introduced additional complexity with
> an unknown performance outcome. If it should be helpful we could add
> that later. I don't think we have that flexibility with
> IORING_OP_READV/IORING_OP_WRITEV.

I think I understand what you'd like to see in the end: basically a
reverse io_uring, where requests are placed on a "submission queue" by
the kernel and completed requests are placed on a completion queue by
the userspace.  Exactly the opposite of current io_uring.

The major difference between your idea of a fuse_uring and the
io_uring seems to be that you place not only the request on the shared
buffer, but the data as well.   I don't think this is a good idea,
since it will often incur one more memory copy.  Otherwise the idea
itself seems sound.

The implementation twisted due to having to integrate it with
io_uring.  Unfortunately placing fuse requests directly into the
io_uring queue doesn't work, due to the reversal of roles and the size
difference between sqe and cqe entries.  Also the shared buffer seems
to lose its ring aspect due to the fact that fuse doesn't get notified
when a request is taken off the queue (io_uring_cqe_seen(3)).

So I think either better integration with io_uring is needed with
support for "reverse submission" or a new interface.

> >
> > Maybe there's an advantage in using an atomic op for WRITEV + READV,
> > but I'm not quite seeing it yet, since there's no syscall overhead for
> > separate ops.
>
> Here I get confused, could please explain?
> Current fuse has a read + write operation - a read() system call to
> process a fuse request and a write() call to submit the result and then
> read() to fetch the next request.
> If userspace has to send IORING_OP_READV to fetch a request and complete
> with IORING_OP_IORING_OP_WRITEV it would go through existing code path
> with operations? Well, maybe userspace could submit with IOSQE_IO_LINK,
> but that sounds like it would need to send two ring entries? I.e. memory
> and processing overhead?

Overhead should be minimal.

> And then, no way to further optimize and do fuse_req allocation on the
> ring (if there are free entries). And probably also no way that we ever
> let the application work in the SQPOLL way, because the application
> thread does not have the right to read from the fuse-server buffer? I.e.
> what I mean is that IORING_OP_URING_CMD gives a better flexibility.

There should be no difference between IORING_OP_URING_CMD and
IORING_OP_WRITEV +  IORING_OP_READV in this respect.  At least I don't
see why polling would work differently: the writev should complete
immediately and then the readv is queued.  Same as what effectively
happens with IORING_OP_URING_CMD, no?

> Btw, another issue that is avoided with the new ring-request layout is
> compatibility and alignment. The fuse header is always in a 4K section
> of the request data follow then. I.e. extending request sizes does not
> impose compatibility issues any more and also for passthrough and
> similar - O_DIRECT can be passed through to the backend file system.
> Although these issues probably need to be solved into the current fuse
> protocol.

Yes.

> Last but not least, with separation there is no global async queue
> anymore - no global lock and cache issues.

The global async code should be moved into the /dev/fuse specific
"legacy" queuing so it doesn't affect either uring or virtiofs
queuing.

> >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
> >>
> >> This cache line bouncing should be addressed by these patches
> >> as well.
> >
> > Why do you think this is solved?
>
>
> I _guess_ that writing to the mmaped buffer and processing requests on
> the same cpu core should make it possible to keep things in cpu cache. I
> did not verify that with perf, though.

Well, the issue is with any context switch that happens in the
multithreaded fuse server process.  Shared address spaces will have a
common "which CPU this is currently running on" bitmap
(mm->cpu_bitmap), which is updated whenever one of the threads using
this address space gets scheduled or descheduled.

Now imagine a fuse server running on a big numa system, which has
threads bound to each CPU.  The server tries to avoid using sharing
data structures between threads, so that cache remains local.  But it
can't avoid updating this bitmap on schedule.  The bitmap can pack 512
CPUs into a single cacheline, which means that thread locality is
compromised.

I'm somewhat surprised that this doesn't turn up in profiles in real
life, but I guess it's not a big deal in most cases.  I only observed
it with a special "no-op" fuse server running on big numa and with
per-thread queuing, etc. enabled (fuse2).

> For sync requests getting the scheduler involved is what is responsible
> for making really fuse slow. It schedules on random cores, that are in
> sleep states and additionally frequency scaling does not go up. We
> really need to stay on the same core here, as that is submitting the
> result, the core is already running (i.e. not sleeping) and has data in
> its cache. All benchmark results with sync requests point that out.

No arguments about that.

> For async requests, the above still applies, but basically one thread is
> writing/reading and the other thread handles/provides the data. Random
> switching of cores is then still not good. At best and to be tested,
> submitting rather large chunks to other cores.
> What is indeed to be discussed (and think annotated in the corresponding
> patch description), if there is a way to figure out if the other core is
> already busy. But then the scheduler does not know what it is busy with
> - are these existing fuse requests or something else. That part is
> really hard and I don't think it makes sense to discuss this right now
> before the main part is merged. IMHO, better to add a config flag for
> the current cpu+1 scheduling with an annotation that this setting might
> go away in the future.

The cpu + 1 seems pretty arbitrary, and could be a very bad decision
if there are independent tasks bound to certain CPUs or if the target
turns out to be on a very distant CPU.

I'm not sure what the right answer is.   It's probably something like:
try to schedule this on a CPU which is not busy but is not very
distant from this one.  The scheduler can probably answer this
question, but there's no current API for this.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-11 15:35     ` Miklos Szeredi
@ 2024-06-11 17:37       ` Bernd Schubert
  2024-06-11 23:35         ` Kent Overstreet
  2024-06-12  7:39         ` Miklos Szeredi
  0 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-11 17:37 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Kent Overstreet

On 6/11/24 17:35, Miklos Szeredi wrote:
> On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> 
>> Secondly, with IORING_OP_URING_CMD we already have only a single command
>> to submit requests and fetch the next one - half of the system calls.
>>
>> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
>> https://github.com/uroni/fuseuring?
>> I.e. it hook into the existing fuse and just changes from read()/write()
>> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
>> control which ring/queue and which ring-entry handles the request.
> 
> Unlike system calls, io_uring ops should have very little overhead.
> That's one of the main selling points of io_uring (as described in the
> io_uring(7) man page).
> 
> So I don't think it matters to performance whether there's a combined
> WRITEV + READV (or COMMIT + FETCH) op or separate ops.

This has to be performance proven and is no means what I'm seeing. How
should io-uring improve performance if you have the same number of
system calls?

As I see it (@Jens or @Pavel or anyone else please correct me if I'm
wrong), advantage of io-uring comes when there is no syscall overhead at
all - either you have a ring with multiple entries and then one side
operates on multiple entries or you have polling and no syscall overhead
either. We cannot afford cpu intensive polling - out of question,
besides that I had even tried SQPOLL and it made things worse (that is
actually where my idea about application polling comes from).
As I see it, for sync blocking calls (like meta operations) with one
entry in the queue, you would get no advantage with
IORING_OP_READV/IORING_OP_WRITEV -  io-uring has  do two system calls -
one to submit from kernel to userspace and another from userspace to
kernel. Why should io-uring be faster there?

And from my testing this is exactly what I had seen - io-uring for meta
requests (i.e. without a large request queue and *without* core
affinity) makes meta operations even slower that /dev/fuse.

For anything that imposes a large ring queue and where either side
(kernel or userspace) needs to process multiple ring entries - system
call overhead gets reduced by the queue size. Just for DIO or meta
operations that is hard to reach.

Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
I.e. we would not have encoded in the request which queue it belongs to?

> 
> The advantage of separate ops is more flexibility and less complexity
> (do only one thing in an op)

Did you look at patch 12/19? It just does
fuse_uring_req_end_and_get_next(). That part isn't complex, imho.

> 
>> Thirdly, initially I had even removed the allocation of 'struct
>> fuse_req' and directly allocated these on available ring entries. I.e.
>> the application thread was writing to the mmap ring buffer. I just
>> removed that code for now, as it introduced additional complexity with
>> an unknown performance outcome. If it should be helpful we could add
>> that later. I don't think we have that flexibility with
>> IORING_OP_READV/IORING_OP_WRITEV.
> 
> I think I understand what you'd like to see in the end: basically a
> reverse io_uring, where requests are placed on a "submission queue" by
> the kernel and completed requests are placed on a completion queue by
> the userspace.  Exactly the opposite of current io_uring.
> 
> The major difference between your idea of a fuse_uring and the
> io_uring seems to be that you place not only the request on the shared
> buffer, but the data as well.   I don't think this is a good idea,
> since it will often incur one more memory copy.  Otherwise the idea
> itself seems sound.

Coud you explain what you mean with "one more memory copy"? As it is
right now, 'struct fuse_req' is always allocated as it was before and
then a copy is done to the ring entry. No difference to legacy /dev/fuse
IO, which also copies to the read buffer.

If we would avoid allocating 'struct fuse_req' when there are free ring
entry requests we would reduce copies, but never increase?

Btw, advantage for the ring is on the libfuse side, where the
fuse-request buffer is assigned to the CQE and as long as the request is
not completed, the buffer is valid. (For /dev/fuse IO that could be
solved in libfuse by decoupling request memory from the thread, but with
the current ring buffer design that just happens more naturally and
memory is limited by the queue size.)

> 
> The implementation twisted due to having to integrate it with
> io_uring.  Unfortunately placing fuse requests directly into the
> io_uring queue doesn't work, due to the reversal of roles and the size
> difference between sqe and cqe entries.  Also the shared buffer seems
> to lose its ring aspect due to the fact that fuse doesn't get notified
> when a request is taken off the queue (io_uring_cqe_seen(3)).
> 
> So I think either better integration with io_uring is needed with
> support for "reverse submission" or a new interface.

Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
ublk_drv  also works exactly that way. I had pointed it out before,
initially I had considered to write a reverse io-uring myself and then
exactly at that time ublk came up.

The interface of that 'reverse io' to io-uring is really simple.

1) Userspace sends a IORING_OP_URING_CMD SQE
2) That CMD gets handled/queued by struct file_operations::uring_cmd /
fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
request
3) When fuse client has data to complete the request, it calls
io_uring_cmd_done() and fuse server receives a CQE with the fuse request.

Personally I don't see anything twisted here, one just needs to
understand that IORING_OP_URING_CMD was written for that reverse order.

(There came up a light twisting when io-uring introduced issue_flags -
that is part of discussion of patch 19/19 with Jens in the series. Jens
suggested to work on io-uring improvements once the main series is
merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
Jens for help once the other parts are merged. Right now that easy to
work around by always submitting with an io-uring task).

Also, that simplicity is the reason why I'm hesitating a bit to work on
Kents new ring, as io-uring already has all what we need and with a
rather simple interface.

Well, maybe you mean patch 09/19 "Add a dev_release exception for
fuse-over-io-uring". Yep, that is the shutdown part I'm not too happy
about and which initially lead to the async release thread in RFCv1.

> 
>>>
>>> Maybe there's an advantage in using an atomic op for WRITEV + READV,
>>> but I'm not quite seeing it yet, since there's no syscall overhead for
>>> separate ops.
>>
>> Here I get confused, could please explain?
>> Current fuse has a read + write operation - a read() system call to
>> process a fuse request and a write() call to submit the result and then
>> read() to fetch the next request.
>> If userspace has to send IORING_OP_READV to fetch a request and complete
>> with IORING_OP_IORING_OP_WRITEV it would go through existing code path
>> with operations? Well, maybe userspace could submit with IOSQE_IO_LINK,
>> but that sounds like it would need to send two ring entries? I.e. memory
>> and processing overhead?
> 
> Overhead should be minimal.

See above, for single entry blocking requests you get two system calls +
io-uring overhead.

> 
>> And then, no way to further optimize and do fuse_req allocation on the
>> ring (if there are free entries). And probably also no way that we ever
>> let the application work in the SQPOLL way, because the application
>> thread does not have the right to read from the fuse-server buffer? I.e.
>> what I mean is that IORING_OP_URING_CMD gives a better flexibility.
> 
> There should be no difference between IORING_OP_URING_CMD and
> IORING_OP_WRITEV +  IORING_OP_READV in this respect.  At least I don't
> see why polling would work differently: the writev should complete
> immediately and then the readv is queued.  Same as what effectively
> happens with IORING_OP_URING_CMD, no?

Polling yes, but without shared memory the application thread does not
have the right to read from fuse userspace server request?

> 
>> Btw, another issue that is avoided with the new ring-request layout is
>> compatibility and alignment. The fuse header is always in a 4K section
>> of the request data follow then. I.e. extending request sizes does not
>> impose compatibility issues any more and also for passthrough and
>> similar - O_DIRECT can be passed through to the backend file system.
>> Although these issues probably need to be solved into the current fuse
>> protocol.
> 
> Yes.
> 
>> Last but not least, with separation there is no global async queue
>> anymore - no global lock and cache issues.
> 
> The global async code should be moved into the /dev/fuse specific
> "legacy" queuing so it doesn't affect either uring or virtiofs
> queuing.

Yep, wait a few days I have seen your recent patch and I'm may to add
that to my series. I actually considered to point of that the bg queue
could be handled by that series as well, but then preferred to just add
patch for that in my series, which will make use of it for the ring queue.

> 
>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
>>>>
>>>> This cache line bouncing should be addressed by these patches
>>>> as well.
>>>
>>> Why do you think this is solved?
>>
>>
>> I _guess_ that writing to the mmaped buffer and processing requests on
>> the same cpu core should make it possible to keep things in cpu cache. I
>> did not verify that with perf, though.
> 
> Well, the issue is with any context switch that happens in the
> multithreaded fuse server process.  Shared address spaces will have a
> common "which CPU this is currently running on" bitmap
> (mm->cpu_bitmap), which is updated whenever one of the threads using
> this address space gets scheduled or descheduled.
> 
> Now imagine a fuse server running on a big numa system, which has
> threads bound to each CPU.  The server tries to avoid using sharing
> data structures between threads, so that cache remains local.  But it
> can't avoid updating this bitmap on schedule.  The bitmap can pack 512
> CPUs into a single cacheline, which means that thread locality is
> compromised.

To be honest, I wonder how you worked around scheduler issues on waking
up the application thread. Did you core bind application threads as well
(I mean besides fuse server threads)? We now have this (unexported)
wake_on_current_cpu. Last year that still wasn't working perfectly well
and  Hillf Danton has suggested the 'seesaw' approach. And with that the
scheduler was working very well. You could get the same with application
core binding, but with 512 CPUs that is certainly not done manually
anymore. Did you use a script to bind application threads or did you
core bind from within the application?

> 
> I'm somewhat surprised that this doesn't turn up in profiles in real
> life, but I guess it's not a big deal in most cases.  I only observed
> it with a special "no-op" fuse server running on big numa and with
> per-thread queuing, etc. enabled (fuse2).

Ok, I'm testing only with 32 cores and two numa nodes. For final
benchmarking I could try to get a more recent AMD based system with 96
cores. I don't think we have anything near 512 CPUs in the lab. I'm not
aware of such customer systems either.

> 
>> For sync requests getting the scheduler involved is what is responsible
>> for making really fuse slow. It schedules on random cores, that are in
>> sleep states and additionally frequency scaling does not go up. We
>> really need to stay on the same core here, as that is submitting the
>> result, the core is already running (i.e. not sleeping) and has data in
>> its cache. All benchmark results with sync requests point that out.
> 
> No arguments about that.
> 
>> For async requests, the above still applies, but basically one thread is
>> writing/reading and the other thread handles/provides the data. Random
>> switching of cores is then still not good. At best and to be tested,
>> submitting rather large chunks to other cores.
>> What is indeed to be discussed (and think annotated in the corresponding
>> patch description), if there is a way to figure out if the other core is
>> already busy. But then the scheduler does not know what it is busy with
>> - are these existing fuse requests or something else. That part is
>> really hard and I don't think it makes sense to discuss this right now
>> before the main part is merged. IMHO, better to add a config flag for
>> the current cpu+1 scheduling with an annotation that this setting might
>> go away in the future.
> 
> The cpu + 1 seems pretty arbitrary, and could be a very bad decision
> if there are independent tasks bound to certain CPUs or if the target
> turns out to be on a very distant CPU.
> 
> I'm not sure what the right answer is.   It's probably something like:
> try to schedule this on a CPU which is not busy but is not very
> distant from this one.  The scheduler can probably answer this
> question, but there's no current API for this.

This is just another optimization. What you write is true and I was
aware of that. It was just a rather simple optimization that improved
results - enough to demo it. We can work with scheduler people in the
future on that or we add a bit of our own logic and create mapping of
cpu -> next-cpu-on-same-numa-node. Certainly an API and help from the
scheduler would be preferred.

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-11 17:37       ` Bernd Schubert
@ 2024-06-11 23:35         ` Kent Overstreet
  2024-06-12 13:53           ` Bernd Schubert
  2024-06-12  7:39         ` Miklos Szeredi
  1 sibling, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-06-11 23:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Bernd Schubert, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring

On Tue, Jun 11, 2024 at 07:37:30PM GMT, Bernd Schubert wrote:
> 
> 
> On 6/11/24 17:35, Miklos Szeredi wrote:
> > On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> > 
> >> Secondly, with IORING_OP_URING_CMD we already have only a single command
> >> to submit requests and fetch the next one - half of the system calls.
> >>
> >> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
> >> https://github.com/uroni/fuseuring?
> >> I.e. it hook into the existing fuse and just changes from read()/write()
> >> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
> >> control which ring/queue and which ring-entry handles the request.
> > 
> > Unlike system calls, io_uring ops should have very little overhead.
> > That's one of the main selling points of io_uring (as described in the
> > io_uring(7) man page).
> > 
> > So I don't think it matters to performance whether there's a combined
> > WRITEV + READV (or COMMIT + FETCH) op or separate ops.
> 
> This has to be performance proven and is no means what I'm seeing. How
> should io-uring improve performance if you have the same number of
> system calls?
> 
> As I see it (@Jens or @Pavel or anyone else please correct me if I'm
> wrong), advantage of io-uring comes when there is no syscall overhead at
> all - either you have a ring with multiple entries and then one side
> operates on multiple entries or you have polling and no syscall overhead
> either. We cannot afford cpu intensive polling - out of question,
> besides that I had even tried SQPOLL and it made things worse (that is
> actually where my idea about application polling comes from).
> As I see it, for sync blocking calls (like meta operations) with one
> entry in the queue, you would get no advantage with
> IORING_OP_READV/IORING_OP_WRITEV -  io-uring has  do two system calls -
> one to submit from kernel to userspace and another from userspace to
> kernel. Why should io-uring be faster there?
> 
> And from my testing this is exactly what I had seen - io-uring for meta
> requests (i.e. without a large request queue and *without* core
> affinity) makes meta operations even slower that /dev/fuse.
> 
> For anything that imposes a large ring queue and where either side
> (kernel or userspace) needs to process multiple ring entries - system
> call overhead gets reduced by the queue size. Just for DIO or meta
> operations that is hard to reach.
> 
> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
> change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
> I.e. we would not have encoded in the request which queue it belongs to?

Want to try out my new ringbuffer syscall?

I haven't yet dug far into the fuse protocol or /dev/fuse code yet, only
skimmed. But using it to replace the read/write syscall overhead should
be straightforward; you'll want to spin up a kthread for responding to
requests.

The next thing I was going to look at is how you guys are using splice,
we want to get away from that too.

Brian was also saying the fuse virtio_fs code may be worth
investigating, maybe that could be adapted?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-11 23:35         ` Kent Overstreet
@ 2024-06-12 13:53           ` Bernd Schubert
  2024-06-12 14:19             ` Kent Overstreet
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 13:53 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Miklos Szeredi, Bernd Schubert, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring



On 6/12/24 01:35, Kent Overstreet wrote:
> On Tue, Jun 11, 2024 at 07:37:30PM GMT, Bernd Schubert wrote:
>>
>>
>> On 6/11/24 17:35, Miklos Szeredi wrote:
>>> On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>>>
>>>> Secondly, with IORING_OP_URING_CMD we already have only a single command
>>>> to submit requests and fetch the next one - half of the system calls.
>>>>
>>>> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
>>>> https://github.com/uroni/fuseuring?
>>>> I.e. it hook into the existing fuse and just changes from read()/write()
>>>> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
>>>> control which ring/queue and which ring-entry handles the request.
>>>
>>> Unlike system calls, io_uring ops should have very little overhead.
>>> That's one of the main selling points of io_uring (as described in the
>>> io_uring(7) man page).
>>>
>>> So I don't think it matters to performance whether there's a combined
>>> WRITEV + READV (or COMMIT + FETCH) op or separate ops.
>>
>> This has to be performance proven and is no means what I'm seeing. How
>> should io-uring improve performance if you have the same number of
>> system calls?
>>
>> As I see it (@Jens or @Pavel or anyone else please correct me if I'm
>> wrong), advantage of io-uring comes when there is no syscall overhead at
>> all - either you have a ring with multiple entries and then one side
>> operates on multiple entries or you have polling and no syscall overhead
>> either. We cannot afford cpu intensive polling - out of question,
>> besides that I had even tried SQPOLL and it made things worse (that is
>> actually where my idea about application polling comes from).
>> As I see it, for sync blocking calls (like meta operations) with one
>> entry in the queue, you would get no advantage with
>> IORING_OP_READV/IORING_OP_WRITEV -  io-uring has  do two system calls -
>> one to submit from kernel to userspace and another from userspace to
>> kernel. Why should io-uring be faster there?
>>
>> And from my testing this is exactly what I had seen - io-uring for meta
>> requests (i.e. without a large request queue and *without* core
>> affinity) makes meta operations even slower that /dev/fuse.
>>
>> For anything that imposes a large ring queue and where either side
>> (kernel or userspace) needs to process multiple ring entries - system
>> call overhead gets reduced by the queue size. Just for DIO or meta
>> operations that is hard to reach.
>>
>> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
>> change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
>> I.e. we would not have encoded in the request which queue it belongs to?
> 
> Want to try out my new ringbuffer syscall?
> 
> I haven't yet dug far into the fuse protocol or /dev/fuse code yet, only
> skimmed. But using it to replace the read/write syscall overhead should
> be straightforward; you'll want to spin up a kthread for responding to
> requests.

I will definitely look at it this week. Although I don't like the idea
to have a new kthread. We already have an application thread and have
the fuse server thread, why do we need another one?

> 
> The next thing I was going to look at is how you guys are using splice,
> we want to get away from that too.

Well, Ming Lei is working on that for ublk_drv and I guess that new approach
could be adapted as well onto the current way of io-uring.
It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.

https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/

> 
> Brian was also saying the fuse virtio_fs code may be worth
> investigating, maybe that could be adapted?

I need to check, but really, the majority of the new additions
is just to set up things, shutdown and to have sanity checks.
Request sending/completing to/from the ring is not that much new lines.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 13:53           ` Bernd Schubert
@ 2024-06-12 14:19             ` Kent Overstreet
  2024-06-12 15:40               ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-06-12 14:19 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Miklos Szeredi, Bernd Schubert, Amir Goldstein, linux-fsdevel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring

On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
> I will definitely look at it this week. Although I don't like the idea
> to have a new kthread. We already have an application thread and have
> the fuse server thread, why do we need another one?

Ok, I hadn't found the fuse server thread - that should be fine.

> > 
> > The next thing I was going to look at is how you guys are using splice,
> > we want to get away from that too.
> 
> Well, Ming Lei is working on that for ublk_drv and I guess that new approach
> could be adapted as well onto the current way of io-uring.
> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
> 
> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
> 
> > 
> > Brian was also saying the fuse virtio_fs code may be worth
> > investigating, maybe that could be adapted?
> 
> I need to check, but really, the majority of the new additions
> is just to set up things, shutdown and to have sanity checks.
> Request sending/completing to/from the ring is not that much new lines.

What I'm wondering is how read/write requests are handled. Are the data
payloads going in the same ringbuffer as the commands? That could work,
if the ringbuffer is appropriately sized, but alignment is a an issue.

We just looked up the device DMA requirements and with modern NVME only
4 byte alignment is required, but the block layer likely isn't set up to
handle that.

So - prearranged buffer? Or are you using splice to get pages that
userspace has read into into the kernel pagecache?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 14:19             ` Kent Overstreet
@ 2024-06-12 15:40               ` Bernd Schubert
  2024-06-12 15:55                 ` Kent Overstreet
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 15:40 UTC (permalink / raw)
  To: Kent Overstreet, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	Andrew Morton, linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring@vger.kernel.org

On 6/12/24 16:19, Kent Overstreet wrote:
> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
>> I will definitely look at it this week. Although I don't like the idea
>> to have a new kthread. We already have an application thread and have
>> the fuse server thread, why do we need another one?
> 
> Ok, I hadn't found the fuse server thread - that should be fine.
> 
>>>
>>> The next thing I was going to look at is how you guys are using splice,
>>> we want to get away from that too.
>>
>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach
>> could be adapted as well onto the current way of io-uring.
>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
>>
>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
>>
>>>
>>> Brian was also saying the fuse virtio_fs code may be worth
>>> investigating, maybe that could be adapted?
>>
>> I need to check, but really, the majority of the new additions
>> is just to set up things, shutdown and to have sanity checks.
>> Request sending/completing to/from the ring is not that much new lines.
> 
> What I'm wondering is how read/write requests are handled. Are the data
> payloads going in the same ringbuffer as the commands? That could work,
> if the ringbuffer is appropriately sized, but alignment is a an issue.

That is exactly the big discussion Miklos and I have. Basically in my
series another buffer is vmalloced, mmaped and then assigned to ring entries.
Fuse meta headers and application payload goes into that buffer.
In both kernel/userspace directions. io-uring only allows 80B, so only a
really small request would fit into it.
Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse
header - intrinsically fixed in the ring patches.

I will now try without mmap and just provide a user buffer as pointer in the 80B
section.

> 
> We just looked up the device DMA requirements and with modern NVME only
> 4 byte alignment is required, but the block layer likely isn't set up to
> handle that.

I think existing fuse headers have and their data have a 4 byte alignment.
Maybe even 8 byte, I don't remember without looking through all request types.
If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp
without the ring patches it will fail because of alignment. Needs to be fixed
in legacy fuse and would also avoid compat issues we had in libfuse when the
kernel header was updated.

> 
> So - prearranged buffer? Or are you using splice to get pages that
> userspace has read into into the kernel pagecache?

I didn't even try to use splice yet, because for the DDN (my employer) use case
we cannot use  zero copy, at least not without violating the rule that one
cannot access the application buffer in userspace.

I will definitely look into Mings work, as it will be useful for others.

Cheers,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 15:40               ` Bernd Schubert
@ 2024-06-12 15:55                 ` Kent Overstreet
  2024-06-12 16:15                   ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-06-12 15:55 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin,
	io-uring@vger.kernel.org

On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote:
> On 6/12/24 16:19, Kent Overstreet wrote:
> > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
> >> I will definitely look at it this week. Although I don't like the idea
> >> to have a new kthread. We already have an application thread and have
> >> the fuse server thread, why do we need another one?
> > 
> > Ok, I hadn't found the fuse server thread - that should be fine.
> > 
> >>>
> >>> The next thing I was going to look at is how you guys are using splice,
> >>> we want to get away from that too.
> >>
> >> Well, Ming Lei is working on that for ublk_drv and I guess that new approach
> >> could be adapted as well onto the current way of io-uring.
> >> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
> >>
> >> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
> >>
> >>>
> >>> Brian was also saying the fuse virtio_fs code may be worth
> >>> investigating, maybe that could be adapted?
> >>
> >> I need to check, but really, the majority of the new additions
> >> is just to set up things, shutdown and to have sanity checks.
> >> Request sending/completing to/from the ring is not that much new lines.
> > 
> > What I'm wondering is how read/write requests are handled. Are the data
> > payloads going in the same ringbuffer as the commands? That could work,
> > if the ringbuffer is appropriately sized, but alignment is a an issue.
> 
> That is exactly the big discussion Miklos and I have. Basically in my
> series another buffer is vmalloced, mmaped and then assigned to ring entries.
> Fuse meta headers and application payload goes into that buffer.
> In both kernel/userspace directions. io-uring only allows 80B, so only a
> really small request would fit into it.

Well, the generic ringbuffer would lift that restriction.

> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse
> header - intrinsically fixed in the ring patches.

*nod*

That's the big question, put the data inline (with potential alignment
hassles) or manage (and map) a separate data structure.

Maybe padding could be inserted to solve alignment?

A separate data structure would only really be useful if it enabled zero
copy, but that should probably be a secondary enhancement.

> I will now try without mmap and just provide a user buffer as pointer in the 80B
> section.
> 
> 
> > 
> > We just looked up the device DMA requirements and with modern NVME only
> > 4 byte alignment is required, but the block layer likely isn't set up to
> > handle that.
> 
> I think existing fuse headers have and their data have a 4 byte alignment.
> Maybe even 8 byte, I don't remember without looking through all request types.
> If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp
> without the ring patches it will fail because of alignment. Needs to be fixed
> in legacy fuse and would also avoid compat issues we had in libfuse when the
> kernel header was updated.
> 
> > 
> > So - prearranged buffer? Or are you using splice to get pages that
> > userspace has read into into the kernel pagecache?
> 
> I didn't even try to use splice yet, because for the DDN (my employer) use case
> we cannot use  zero copy, at least not without violating the rule that one
> cannot access the application buffer in userspace.

DDN - lustre related?

> 
> I will definitely look into Mings work, as it will be useful for others.
> 
> 
> Cheers,
> Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 15:55                 ` Kent Overstreet
@ 2024-06-12 16:15                   ` Bernd Schubert
  2024-06-12 16:24                     ` Kent Overstreet
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 16:15 UTC (permalink / raw)
  To: Kent Overstreet, Bernd Schubert
  Cc: Miklos Szeredi, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	Andrew Morton, linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring@vger.kernel.org



On 6/12/24 17:55, Kent Overstreet wrote:
> On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote:
>> On 6/12/24 16:19, Kent Overstreet wrote:
>>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
>>>> I will definitely look at it this week. Although I don't like the idea
>>>> to have a new kthread. We already have an application thread and have
>>>> the fuse server thread, why do we need another one?
>>>
>>> Ok, I hadn't found the fuse server thread - that should be fine.
>>>
>>>>>
>>>>> The next thing I was going to look at is how you guys are using splice,
>>>>> we want to get away from that too.
>>>>
>>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach
>>>> could be adapted as well onto the current way of io-uring.
>>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
>>>>
>>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
>>>>
>>>>>
>>>>> Brian was also saying the fuse virtio_fs code may be worth
>>>>> investigating, maybe that could be adapted?
>>>>
>>>> I need to check, but really, the majority of the new additions
>>>> is just to set up things, shutdown and to have sanity checks.
>>>> Request sending/completing to/from the ring is not that much new lines.
>>>
>>> What I'm wondering is how read/write requests are handled. Are the data
>>> payloads going in the same ringbuffer as the commands? That could work,
>>> if the ringbuffer is appropriately sized, but alignment is a an issue.
>>
>> That is exactly the big discussion Miklos and I have. Basically in my
>> series another buffer is vmalloced, mmaped and then assigned to ring entries.
>> Fuse meta headers and application payload goes into that buffer.
>> In both kernel/userspace directions. io-uring only allows 80B, so only a
>> really small request would fit into it.
> 
> Well, the generic ringbuffer would lift that restriction.

Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated
in that code. At least all that setup code would be moved out of fuse. I will
eventually come to your patches today.
Now we only need to convince Miklos that your ring is better ;)

> 
>> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse
>> header - intrinsically fixed in the ring patches.
> 
> *nod*
> 
> That's the big question, put the data inline (with potential alignment
> hassles) or manage (and map) a separate data structure.
> 
> Maybe padding could be inserted to solve alignment?

Right now I have this struct:

struct fuse_ring_req {
	union {
		/* The first 4K are command data */
		char ring_header[FUSE_RING_HEADER_BUF_SIZE];

		struct {
			uint64_t flags;

			/* enum fuse_ring_buf_cmd */
			uint32_t in_out_arg_len;
			uint32_t padding;

			/* kernel fills in, reads out */
			union {
				struct fuse_in_header in;
				struct fuse_out_header out;
			};
		};
	};

	char in_out_arg[];
};


Data go into in_out_arg, i.e. headers are padded by the union.
I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size
and not a fixed 4K.

(I just see the stale comment 'enum fuse_ring_buf_cmd',
will remove it in the next series)


> 
> A separate data structure would only really be useful if it enabled zero
> copy, but that should probably be a secondary enhancement.
> 
>> I will now try without mmap and just provide a user buffer as pointer in the 80B
>> section.
>>
>>
>>>
>>> We just looked up the device DMA requirements and with modern NVME only
>>> 4 byte alignment is required, but the block layer likely isn't set up to
>>> handle that.
>>
>> I think existing fuse headers have and their data have a 4 byte alignment.
>> Maybe even 8 byte, I don't remember without looking through all request types.
>> If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp
>> without the ring patches it will fail because of alignment. Needs to be fixed
>> in legacy fuse and would also avoid compat issues we had in libfuse when the
>> kernel header was updated.
>>
>>>
>>> So - prearranged buffer? Or are you using splice to get pages that
>>> userspace has read into into the kernel pagecache?
>>
>> I didn't even try to use splice yet, because for the DDN (my employer) use case
>> we cannot use  zero copy, at least not without violating the rule that one
>> cannot access the application buffer in userspace.
> 
> DDN - lustre related?

I have bit of ancient Lustre background, also with DDN, then went to Fraunhofer
for FhGFS/BeeGFS (kind of competing with Lustre).
Back at DDN initially on IME (burst buffer) and now Infinia. Lustre is mostly HPC
only, Infina is kind of everything.



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 16:15                   ` Bernd Schubert
@ 2024-06-12 16:24                     ` Kent Overstreet
  2024-06-12 16:44                       ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Kent Overstreet @ 2024-06-12 16:24 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin,
	io-uring@vger.kernel.org

On Wed, Jun 12, 2024 at 06:15:57PM GMT, Bernd Schubert wrote:
> 
> 
> On 6/12/24 17:55, Kent Overstreet wrote:
> > On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote:
> > > On 6/12/24 16:19, Kent Overstreet wrote:
> > > > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
> > > > > I will definitely look at it this week. Although I don't like the idea
> > > > > to have a new kthread. We already have an application thread and have
> > > > > the fuse server thread, why do we need another one?
> > > > 
> > > > Ok, I hadn't found the fuse server thread - that should be fine.
> > > > 
> > > > > > 
> > > > > > The next thing I was going to look at is how you guys are using splice,
> > > > > > we want to get away from that too.
> > > > > 
> > > > > Well, Ming Lei is working on that for ublk_drv and I guess that new approach
> > > > > could be adapted as well onto the current way of io-uring.
> > > > > It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
> > > > > 
> > > > > https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
> > > > > 
> > > > > > 
> > > > > > Brian was also saying the fuse virtio_fs code may be worth
> > > > > > investigating, maybe that could be adapted?
> > > > > 
> > > > > I need to check, but really, the majority of the new additions
> > > > > is just to set up things, shutdown and to have sanity checks.
> > > > > Request sending/completing to/from the ring is not that much new lines.
> > > > 
> > > > What I'm wondering is how read/write requests are handled. Are the data
> > > > payloads going in the same ringbuffer as the commands? That could work,
> > > > if the ringbuffer is appropriately sized, but alignment is a an issue.
> > > 
> > > That is exactly the big discussion Miklos and I have. Basically in my
> > > series another buffer is vmalloced, mmaped and then assigned to ring entries.
> > > Fuse meta headers and application payload goes into that buffer.
> > > In both kernel/userspace directions. io-uring only allows 80B, so only a
> > > really small request would fit into it.
> > 
> > Well, the generic ringbuffer would lift that restriction.
> 
> Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated
> in that code. At least all that setup code would be moved out of fuse. I will
> eventually come to your patches today.
> Now we only need to convince Miklos that your ring is better ;)
> 
> > 
> > > Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse
> > > header - intrinsically fixed in the ring patches.
> > 
> > *nod*
> > 
> > That's the big question, put the data inline (with potential alignment
> > hassles) or manage (and map) a separate data structure.
> > 
> > Maybe padding could be inserted to solve alignment?
> 
> Right now I have this struct:
> 
> struct fuse_ring_req {
> 	union {
> 		/* The first 4K are command data */
> 		char ring_header[FUSE_RING_HEADER_BUF_SIZE];
> 
> 		struct {
> 			uint64_t flags;
> 
> 			/* enum fuse_ring_buf_cmd */
> 			uint32_t in_out_arg_len;
> 			uint32_t padding;
> 
> 			/* kernel fills in, reads out */
> 			union {
> 				struct fuse_in_header in;
> 				struct fuse_out_header out;
> 			};
> 		};
> 	};
> 
> 	char in_out_arg[];
> };
> 
> 
> Data go into in_out_arg, i.e. headers are padded by the union.
> I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size
> and not a fixed 4K.

I would make the commands variable sized, so that commands with no data
buffers don't need padding, and then when you do have a data command you
only pad out that specific command so that the data buffer starts on a
page boundary.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 16:24                     ` Kent Overstreet
@ 2024-06-12 16:44                       ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 16:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Bernd Schubert, Miklos Szeredi, Amir Goldstein,
	linux-fsdevel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org,
	Ingo Molnar, Peter Zijlstra, Andrei Vagin,
	io-uring@vger.kernel.org



On 6/12/24 18:24, Kent Overstreet wrote:
> On Wed, Jun 12, 2024 at 06:15:57PM GMT, Bernd Schubert wrote:
>>
>>
>> On 6/12/24 17:55, Kent Overstreet wrote:
>>> On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote:
>>>> On 6/12/24 16:19, Kent Overstreet wrote:
>>>>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote:
>>>>>> I will definitely look at it this week. Although I don't like the idea
>>>>>> to have a new kthread. We already have an application thread and have
>>>>>> the fuse server thread, why do we need another one?
>>>>>
>>>>> Ok, I hadn't found the fuse server thread - that should be fine.
>>>>>
>>>>>>>
>>>>>>> The next thing I was going to look at is how you guys are using splice,
>>>>>>> we want to get away from that too.
>>>>>>
>>>>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach
>>>>>> could be adapted as well onto the current way of io-uring.
>>>>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV.
>>>>>>
>>>>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/
>>>>>>
>>>>>>>
>>>>>>> Brian was also saying the fuse virtio_fs code may be worth
>>>>>>> investigating, maybe that could be adapted?
>>>>>>
>>>>>> I need to check, but really, the majority of the new additions
>>>>>> is just to set up things, shutdown and to have sanity checks.
>>>>>> Request sending/completing to/from the ring is not that much new lines.
>>>>>
>>>>> What I'm wondering is how read/write requests are handled. Are the data
>>>>> payloads going in the same ringbuffer as the commands? That could work,
>>>>> if the ringbuffer is appropriately sized, but alignment is a an issue.
>>>>
>>>> That is exactly the big discussion Miklos and I have. Basically in my
>>>> series another buffer is vmalloced, mmaped and then assigned to ring entries.
>>>> Fuse meta headers and application payload goes into that buffer.
>>>> In both kernel/userspace directions. io-uring only allows 80B, so only a
>>>> really small request would fit into it.
>>>
>>> Well, the generic ringbuffer would lift that restriction.
>>
>> Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated
>> in that code. At least all that setup code would be moved out of fuse. I will
>> eventually come to your patches today.
>> Now we only need to convince Miklos that your ring is better ;)
>>
>>>
>>>> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse
>>>> header - intrinsically fixed in the ring patches.
>>>
>>> *nod*
>>>
>>> That's the big question, put the data inline (with potential alignment
>>> hassles) or manage (and map) a separate data structure.
>>>
>>> Maybe padding could be inserted to solve alignment?
>>
>> Right now I have this struct:
>>
>> struct fuse_ring_req {
>> 	union {
>> 		/* The first 4K are command data */
>> 		char ring_header[FUSE_RING_HEADER_BUF_SIZE];
>>
>> 		struct {
>> 			uint64_t flags;
>>
>> 			/* enum fuse_ring_buf_cmd */
>> 			uint32_t in_out_arg_len;
>> 			uint32_t padding;
>>
>> 			/* kernel fills in, reads out */
>> 			union {
>> 				struct fuse_in_header in;
>> 				struct fuse_out_header out;
>> 			};
>> 		};
>> 	};
>>
>> 	char in_out_arg[];
>> };
>>
>>
>> Data go into in_out_arg, i.e. headers are padded by the union.
>> I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size
>> and not a fixed 4K.
> 
> I would make the commands variable sized, so that commands with no data
> buffers don't need padding, and then when you do have a data command you
> only pad out that specific command so that the data buffer starts on a
> page boundary.


The same buffer is used for kernel to userspace and the other way around 
- it is attached to the ring entry. Either direction will always have
data, where would a dynamic sizing then be useful?

Well, some "data" like the node id don't need to be aligned - we could 
save memory for that. I still would like to have some padding so that
headers could be grown without any kind of compat issues. Though almost 
4K is probably too much for that.

Thanks for pointing it out, will improve it!

Cheers,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-11 17:37       ` Bernd Schubert
  2024-06-11 23:35         ` Kent Overstreet
@ 2024-06-12  7:39         ` Miklos Szeredi
  2024-06-12 13:32           ` Bernd Schubert
  1 sibling, 1 reply; 113+ messages in thread
From: Miklos Szeredi @ 2024-06-12  7:39 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Amir Goldstein, linux-fsdevel, Andrew Morton,
	linux-mm, Ingo Molnar, Peter Zijlstra, Andrei Vagin, io-uring,
	Kent Overstreet

On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:

> > So I don't think it matters to performance whether there's a combined
> > WRITEV + READV (or COMMIT + FETCH) op or separate ops.
>
> This has to be performance proven and is no means what I'm seeing. How
> should io-uring improve performance if you have the same number of
> system calls?

The ops can be queued together and submitted together.  Two separate
(but possibly linked) ops should result in exactly the same number of
syscalls as a single combined op.

> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
> change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
> I.e. we would not have encoded in the request which queue it belongs to?

The original idea was to use the cloned /dev/fuse fd to sort requests
into separate queues.  That was only half finished: the input queue is
currently shared by all clones, but once a request is read by the
server from a particular clone it is put into a separate processing
queue.   Adding separate input queues to each clone should also be
possible.

I'm not saying this is definitely the direction we should be taking,
but it's something to consider.

> > The advantage of separate ops is more flexibility and less complexity
> > (do only one thing in an op)
>
> Did you look at patch 12/19? It just does
> fuse_uring_req_end_and_get_next(). That part isn't complex, imho.

That function name indicates that this is too complex: it is doing two
independent things (ending one request and fetching the next).

Fine if it's a valid optimization, but I'm saying that it likely isn't.

> > The major difference between your idea of a fuse_uring and the
> > io_uring seems to be that you place not only the request on the shared
> > buffer, but the data as well.   I don't think this is a good idea,
> > since it will often incur one more memory copy.  Otherwise the idea
> > itself seems sound.
>
> Coud you explain what you mean with "one more memory copy"?

If the filesystem is providing the result of a READ request as a
pointer to a buffer (which can be the case with fuse_reply_data()),
then that buffer will need to be copied to the shared buffer, and from
the shared buffer to the read destination.

That first copy is unnecessary if the kernel receives the pointer to
the userspace buffer and copies the data directly to the destination.

> > So I think either better integration with io_uring is needed with
> > support for "reverse submission" or a new interface.
>
> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
> ublk_drv  also works exactly that way. I had pointed it out before,
> initially I had considered to write a reverse io-uring myself and then
> exactly at that time ublk came up.

I'm just looking for answers why this architecture is the best.  Maybe
it is, but I find it too complex and can't explain why it's going to
perform better than a much simpler single ring model.

> The interface of that 'reverse io' to io-uring is really simple.
>
> 1) Userspace sends a IORING_OP_URING_CMD SQE
> 2) That CMD gets handled/queued by struct file_operations::uring_cmd /
> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
> request
> 3) When fuse client has data to complete the request, it calls
> io_uring_cmd_done() and fuse server receives a CQE with the fuse request.
>
> Personally I don't see anything twisted here, one just needs to
> understand that IORING_OP_URING_CMD was written for that reverse order.

That's just my gut feeling.   fuse/dev_uring.c is 1233 in this RFC.
And that's just the queuing.

> (There came up a light twisting when io-uring introduced issue_flags -
> that is part of discussion of patch 19/19 with Jens in the series. Jens
> suggested to work on io-uring improvements once the main series is
> merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
> Jens for help once the other parts are merged. Right now that easy to
> work around by always submitting with an io-uring task).
>
> Also, that simplicity is the reason why I'm hesitating a bit to work on
> Kents new ring, as io-uring already has all what we need and with a
> rather simple interface.

I'm in favor of using io_uring, if possible.

I'm also in favor of a single shared buffer (ring) if possible.  Using
cloned fd + plain READV / WRITEV ops is one possibility.

But I'm not opposed to IORING_OP_URING_CMD either.   Btw, fuse reply
could be inlined in the majority of cases into that 80 byte free space
in the sqe.  Also might consider an extended cqe mode, where short
fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112
byte payload).

> To be honest, I wonder how you worked around scheduler issues on waking
> up the application thread. Did you core bind application threads as well
> (I mean besides fuse server threads)? We now have this (unexported)
> wake_on_current_cpu. Last year that still wasn't working perfectly well
> and  Hillf Danton has suggested the 'seesaw' approach. And with that the
> scheduler was working very well. You could get the same with application
> core binding, but with 512 CPUs that is certainly not done manually
> anymore. Did you use a script to bind application threads or did you
> core bind from within the application?

Probably, I don't remember anymore.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12  7:39         ` Miklos Szeredi
@ 2024-06-12 13:32           ` Bernd Schubert
  2024-06-12 13:46             ` Bernd Schubert
  2024-06-12 14:07             ` Miklos Szeredi
  0 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 13:32 UTC (permalink / raw)
  To: Miklos Szeredi, Bernd Schubert
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org, Andrew Morton,
	linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra, Andrei Vagin,
	io-uring@vger.kernel.org, Kent Overstreet

On 6/12/24 09:39, Miklos Szeredi wrote:
> On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> 
>>> So I don't think it matters to performance whether there's a combined
>>> WRITEV + READV (or COMMIT + FETCH) op or separate ops.
>>
>> This has to be performance proven and is no means what I'm seeing. How
>> should io-uring improve performance if you have the same number of
>> system calls?
> 
> The ops can be queued together and submitted together.  Two separate
> (but possibly linked) ops should result in exactly the same number of
> syscalls as a single combined op.

As I wrote before, that requires the double amount of queue buffer memory.
Goes totally into the opposite direction of what I'm currently working
on - to use less memory, as not all requests need 1MB buffers and as we want
to use as little as possible memory.

> 
>> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
>> change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
>> I.e. we would not have encoded in the request which queue it belongs to?
> 
> The original idea was to use the cloned /dev/fuse fd to sort requests
> into separate queues.  That was only half finished: the input queue is
> currently shared by all clones, but once a request is read by the
> server from a particular clone it is put into a separate processing
> queue.   Adding separate input queues to each clone should also be
> possible.
> 
> I'm not saying this is definitely the direction we should be taking,
> but it's something to consider.

I was considering to use that for the mmap method, but then found it easier
to just add per numa to an rb-tree. Well, maybe I should reconsider, as
the current patch series clones the device anyway.

> 
>>> The advantage of separate ops is more flexibility and less complexity
>>> (do only one thing in an op)
>>
>> Did you look at patch 12/19? It just does
>> fuse_uring_req_end_and_get_next(). That part isn't complex, imho.
> 
> That function name indicates that this is too complex: it is doing two
> independent things (ending one request and fetching the next).
> 
> Fine if it's a valid optimization, but I'm saying that it likely isn't.

Would it help, if it would be in two lines? Like
fuse_uring_req_end();
fuse_uring_req_next();

It has to check if there are requests queued, as it goes on hold if not
and then won't process the queue.

> 
>>> The major difference between your idea of a fuse_uring and the
>>> io_uring seems to be that you place not only the request on the shared
>>> buffer, but the data as well.   I don't think this is a good idea,
>>> since it will often incur one more memory copy.  Otherwise the idea
>>> itself seems sound.
>>
>> Coud you explain what you mean with "one more memory copy"?
> 
> If the filesystem is providing the result of a READ request as a
> pointer to a buffer (which can be the case with fuse_reply_data()),
> then that buffer will need to be copied to the shared buffer, and from
> the shared buffer to the read destination.
> 
> That first copy is unnecessary if the kernel receives the pointer to
> the userspace buffer and copies the data directly to the destination.

I didn't do that yet, as we are going to use the ring buffer for requests,
i.e. the ring buffer immediately gets all the data from network, there is
no copy. Even if the ring buffer would get data from local disk - there
is no need to use a separate application buffer anymore. And with that
there is just no extra copy

Keep in mind that the ring buffers are coupled with the request and not
the processing thread as in current libfuse - the buffer is valid as
long as the request is not completed. That part is probably harder with
IORING_OP_READV/IORING_OP_WRITEV, especially if you need two of them.


Your idea sounds useful if userspace would have its own cache outside
of ring buffers and that could be added in as another optimization as
something like:

- fuse-ring-req gets a user pointer
- flag if user pointer it set
- kernel acts on the flag
- completion gets send to userspace by:
    - if next request is already in the queue : piggy-back completion
      into the next request
    - if no next request: send a separate completion message

If you want, I could add that as another optimization patch to the next RFC.

Maybe that is also useful for existing libfuse applications that are used
to the fact that the request buffer is associated with the thread and not
the request itself.

If we want real zero copy, we can add in Mings work on that.
I'm going to look into that today or tomorrow.


> 
>>> So I think either better integration with io_uring is needed with
>>> support for "reverse submission" or a new interface.
>>
>> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
>> ublk_drv  also works exactly that way. I had pointed it out before,
>> initially I had considered to write a reverse io-uring myself and then
>> exactly at that time ublk came up.
> 
> I'm just looking for answers why this architecture is the best.  Maybe
> it is, but I find it too complex and can't explain why it's going to
> perform better than a much simpler single ring model.
> 
>> The interface of that 'reverse io' to io-uring is really simple.
>>
>> 1) Userspace sends a IORING_OP_URING_CMD SQE
>> 2) That CMD gets handled/queued by struct file_operations::uring_cmd /
>> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
>> request
>> 3) When fuse client has data to complete the request, it calls
>> io_uring_cmd_done() and fuse server receives a CQE with the fuse request.
>>
>> Personally I don't see anything twisted here, one just needs to
>> understand that IORING_OP_URING_CMD was written for that reverse order.
> 
> That's just my gut feeling.   fuse/dev_uring.c is 1233 in this RFC.
> And that's just the queuing.


Dunno, from my point of view the main logic is so much simpler than what
fuse_dev_do_read() has - that function checks multiple queues, takes multiple
locks, has to add the fuse-req to a (now hashed) list and has restart loop.
If possible, I really wouldn't like to make that even more complex.

But we want to add it:

- Selecting different ring entries based on their io-size (what I'm currently
adding in, to reduce memory per queue). I don't think that
with the current wake logic that would be even possible

- Add in credits for for different IO types to avoid that an IO type
can fill the entire queue. Right now the ring has only have separation of
sync/async, but I don't think that is enough.


To compare, could you please check the code flow of
FUSE_URING_REQ_COMMIT_AND_FETCH? I have no issue to split
fuse_uring_req_end_and_get_next() into two functions.

What I mean is that the code flow is hopefully not be hard to follow,
it ends the request and then puts the entry on the avail lets
Then it checks if there is pending fuse request and handles that,
from my point of view that is much easier to follow and with less conditions
than what fuse_dev_do_read() has. And that although there is already
a separation of sync and async queues.

         
> 
>> (There came up a light twisting when io-uring introduced issue_flags -
>> that is part of discussion of patch 19/19 with Jens in the series. Jens
>> suggested to work on io-uring improvements once the main series is
>> merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
>> Jens for help once the other parts are merged. Right now that easy to
>> work around by always submitting with an io-uring task).
>>
>> Also, that simplicity is the reason why I'm hesitating a bit to work on
>> Kents new ring, as io-uring already has all what we need and with a
>> rather simple interface.
> 
> I'm in favor of using io_uring, if possible.

I have no objections, but I would also like to see an RHEL version with it...

> 
> I'm also in favor of a single shared buffer (ring) if possible.  Using
> cloned fd + plain READV / WRITEV ops is one possibility.


Cons:
- READV / WRITEV would need to be coupled in order to avoid two io-uring
cmd-enter system calls - double amount of memory
- Not impossible, but harder to achieve - request buffer belongs
to the request itself.
- request data section and fuse header are not clearly separated,
data alignment compat issues.
- different IO sizes hard to impossible - with large queue sizes high memory
usage, even if the majority would need small requests only
- Request type credits much harder to achieve
- The vfs application cannot directly write into the ring buffer, reduces
future optimizations
- new zero copy approach Ming Lei is working on cannot be used
- Not very flexible for future additions, IMHO

Pros:
- probably less code additions
- No shutdown issues
- existing splice works

> 
> But I'm not opposed to IORING_OP_URING_CMD either.   Btw, fuse reply
> could be inlined in the majority of cases into that 80 byte free space
> in the sqe.  Also might consider an extended cqe mode, where short
> fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112
> byte payload).


That conflicts with that we want to have the fuse header in a separate section
to avoid alignment and compat issues. In fact, we might even want to make that
header section depending on the system page size.
And then what would be the advantage, we have the buffer anyway?


Thanks,
Bernd


PS: What I definitely realize that I should have talked at LSFMM2023 why
I had taken that approach and should have reduced slides about the
architecture and performance.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 13:32           ` Bernd Schubert
@ 2024-06-12 13:46             ` Bernd Schubert
  2024-06-12 14:07             ` Miklos Szeredi
  1 sibling, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 13:46 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org, Andrew Morton,
	linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra, Andrei Vagin,
	io-uring@vger.kernel.org, Kent Overstreet



On 6/12/24 15:32, Bernd Schubert wrote:
> On 6/12/24 09:39, Miklos Szeredi wrote:
>>> Personally I don't see anything twisted here, one just needs to
>>> understand that IORING_OP_URING_CMD was written for that reverse order.
>>
>> That's just my gut feeling.   fuse/dev_uring.c is 1233 in this RFC.
>> And that's just the queuing.
> 


Btw, counting lines, majority of that is not queuing and handling requests,
but setting up things, shutdown (start/stop is already almost half of the file)
and then and doing sanity checks, as in fuse_uring_get_verify_queue().

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 13:32           ` Bernd Schubert
  2024-06-12 13:46             ` Bernd Schubert
@ 2024-06-12 14:07             ` Miklos Szeredi
  2024-06-12 14:56               ` Bernd Schubert
  1 sibling, 1 reply; 113+ messages in thread
From: Miklos Szeredi @ 2024-06-12 14:07 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bernd Schubert, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	Andrew Morton, linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring@vger.kernel.org, Kent Overstreet

On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote:

> I didn't do that yet, as we are going to use the ring buffer for requests,
> i.e. the ring buffer immediately gets all the data from network, there is
> no copy. Even if the ring buffer would get data from local disk - there
> is no need to use a separate application buffer anymore. And with that
> there is just no extra copy

Let's just tackle this shared request buffer, as it seems to be a
central part of your design.

You say the shared buffer is used to immediately get the data from the
network (or various other sources), which is completely viable.

And then the kernel will do the copy from the shared buffer.  Single copy, fine.

But if the buffer wasn't shared?  What would be the difference?
Single copy also.

Why is the shared buffer better?  I mean it may even be worse due to
cache aliasing issues on certain architectures.  copy_to_user() /
copy_from_user() are pretty darn efficient.

Why is it better to have that buffer managed by kernel?  Being locked
in memory (being unswappable) is probably a disadvantage as well.  And
if locking is required, it can be done on the user buffer.

And there are all the setup and teardown complexities...

Note: the ring buffer used by io_uring is different.  It literally
allows communication without invoking any system calls in certain
cases.  That shared buffer doesn't add anything like that.  At least I
don't see what it actually adds.

Hmm?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 14:07             ` Miklos Szeredi
@ 2024-06-12 14:56               ` Bernd Schubert
  2024-08-02 23:03                 ` Bernd Schubert
  2024-08-29 22:32                 ` Bernd Schubert
  0 siblings, 2 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-06-12 14:56 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Bernd Schubert, Amir Goldstein, linux-fsdevel@vger.kernel.org,
	Andrew Morton, linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring@vger.kernel.org, Kent Overstreet

On 6/12/24 16:07, Miklos Szeredi wrote:
> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote:
> 
>> I didn't do that yet, as we are going to use the ring buffer for requests,
>> i.e. the ring buffer immediately gets all the data from network, there is
>> no copy. Even if the ring buffer would get data from local disk - there
>> is no need to use a separate application buffer anymore. And with that
>> there is just no extra copy
> 
> Let's just tackle this shared request buffer, as it seems to be a
> central part of your design.
> 
> You say the shared buffer is used to immediately get the data from the
> network (or various other sources), which is completely viable.
> 
> And then the kernel will do the copy from the shared buffer.  Single copy, fine.
> 
> But if the buffer wasn't shared?  What would be the difference?
> Single copy also.
> 
> Why is the shared buffer better?  I mean it may even be worse due to
> cache aliasing issues on certain architectures.  copy_to_user() /
> copy_from_user() are pretty darn efficient.

Right now we have:

- Application thread writes into the buffer, then calls io_uring_cmd_done

I can try to do without mmap and set a pointer to the user buffer in the 
80B section of the SQE. I'm not sure if the application is allowed to 
write into that buffer, possibly/probably we will be forced to use 
io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that 
anyway). My greatest fear here is that the extra task has performance 
implications for sync requests.

> 
> Why is it better to have that buffer managed by kernel?  Being locked
> in memory (being unswappable) is probably a disadvantage as well.  And
> if locking is required, it can be done on the user buffer.

Well, let me try to give the buffer in the 80B section.

> 
> And there are all the setup and teardown complexities...

If the buffer in the 80B section works setup becomes easier, mmap and 
ioctls go away. Teardown, well, we still need the workaround as we need 
to handle io_uring_cmd_done, but if you could live with that for the 
instance, I would ask Jens or Pavel or Ming for help if we could solve 
that in io-uring itself.
Is the ring workaround in fuse_dev_release() acceptable for you? Or do 
you have any another idea about it?

> 
> Note: the ring buffer used by io_uring is different.  It literally
> allows communication without invoking any system calls in certain
> cases.  That shared buffer doesn't add anything like that.  At least I
> don't see what it actually adds.
> 
> Hmm?

The application can write into the buffer. We won't shared queue buffers 
if we could solve the same with a user pointer.

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 14:56               ` Bernd Schubert
@ 2024-08-02 23:03                 ` Bernd Schubert
  2024-08-29 22:32                 ` Bernd Schubert
  1 sibling, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-08-02 23:03 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org, Andrew Morton,
	linux-mm@kvack.org, Ingo Molnar, Peter Zijlstra, Andrei Vagin,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik



On 6/12/24 16:56, Bernd Schubert wrote:
> On 6/12/24 16:07, Miklos Szeredi wrote:
>> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>>> I didn't do that yet, as we are going to use the ring buffer for requests,
>>> i.e. the ring buffer immediately gets all the data from network, there is
>>> no copy. Even if the ring buffer would get data from local disk - there
>>> is no need to use a separate application buffer anymore. And with that
>>> there is just no extra copy
>>
>> Let's just tackle this shared request buffer, as it seems to be a
>> central part of your design.
>>
>> You say the shared buffer is used to immediately get the data from the
>> network (or various other sources), which is completely viable.
>>
>> And then the kernel will do the copy from the shared buffer.  Single copy, fine.
>>
>> But if the buffer wasn't shared?  What would be the difference?
>> Single copy also.
>>
>> Why is the shared buffer better?  I mean it may even be worse due to
>> cache aliasing issues on certain architectures.  copy_to_user() /
>> copy_from_user() are pretty darn efficient.
> 
> Right now we have:
> 
> - Application thread writes into the buffer, then calls io_uring_cmd_done
> 
> I can try to do without mmap and set a pointer to the user buffer in the 
> 80B section of the SQE. I'm not sure if the application is allowed to 
> write into that buffer, possibly/probably we will be forced to use 
> io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that 
> anyway). My greatest fear here is that the extra task has performance 
> implications for sync requests.
> 
> 
>>
>> Why is it better to have that buffer managed by kernel?  Being locked
>> in memory (being unswappable) is probably a disadvantage as well.  And
>> if locking is required, it can be done on the user buffer.
> 
> Well, let me try to give the buffer in the 80B section.
> 
>>
>> And there are all the setup and teardown complexities...
> 
> If the buffer in the 80B section works setup becomes easier, mmap and 
> ioctls go away. Teardown, well, we still need the workaround as we need 
> to handle io_uring_cmd_done, but if you could live with that for the 
> instance, I would ask Jens or Pavel or Ming for help if we could solve 
> that in io-uring itself.
> Is the ring workaround in fuse_dev_release() acceptable for you? Or do 
> you have any another idea about it?
> 
>>


Short update, I have this working for some time now with a hack patch
that just adds in a user buffer (without removing mmap, it is just
unused). Initially I thought that is a lot slower, but after removing
all the kernel debug options perf loss is just around 5% and I think I
can get back the remaining by having iov_iter_get_pages2() of the user
buffer in the initialization (with additional code overhead).

I hope to have new patches by mid of next week. I also want to get rid
of the difference of buffer layout between uring and /dev/fuse as that
can be troublesome for other changes like alignment. That might require
an io-uring CQE128, though.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-06-12 14:56               ` Bernd Schubert
  2024-08-02 23:03                 ` Bernd Schubert
@ 2024-08-29 22:32                 ` Bernd Schubert
  2024-08-30 13:12                   ` Jens Axboe
  1 sibling, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-08-29 22:32 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong, Jens Axboe

[I shortened the CC list as that long came up only due to mmap and optimizations]

On 6/12/24 16:56, Bernd Schubert wrote:
> On 6/12/24 16:07, Miklos Szeredi wrote:
>> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote:
>>
>>> I didn't do that yet, as we are going to use the ring buffer for requests,
>>> i.e. the ring buffer immediately gets all the data from network, there is
>>> no copy. Even if the ring buffer would get data from local disk - there
>>> is no need to use a separate application buffer anymore. And with that
>>> there is just no extra copy
>>
>> Let's just tackle this shared request buffer, as it seems to be a
>> central part of your design.
>>
>> You say the shared buffer is used to immediately get the data from the
>> network (or various other sources), which is completely viable.
>>
>> And then the kernel will do the copy from the shared buffer.  Single copy, fine.
>>
>> But if the buffer wasn't shared?  What would be the difference?
>> Single copy also.
>>
>> Why is the shared buffer better?  I mean it may even be worse due to
>> cache aliasing issues on certain architectures.  copy_to_user() /
>> copy_from_user() are pretty darn efficient.
> 
> Right now we have:
> 
> - Application thread writes into the buffer, then calls io_uring_cmd_done
> 
> I can try to do without mmap and set a pointer to the user buffer in the 
> 80B section of the SQE. I'm not sure if the application is allowed to 
> write into that buffer, possibly/probably we will be forced to use 
> io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that 
> anyway). My greatest fear here is that the extra task has performance 
> implications for sync requests.
> 
> 
>>
>> Why is it better to have that buffer managed by kernel?  Being locked
>> in memory (being unswappable) is probably a disadvantage as well.  And
>> if locking is required, it can be done on the user buffer.
> 
> Well, let me try to give the buffer in the 80B section.
> 
>>
>> And there are all the setup and teardown complexities...
> 
> If the buffer in the 80B section works setup becomes easier, mmap and 
> ioctls go away. Teardown, well, we still need the workaround as we need 
> to handle io_uring_cmd_done, but if you could live with that for the 
> instance, I would ask Jens or Pavel or Ming for help if we could solve 
> that in io-uring itself.
> Is the ring workaround in fuse_dev_release() acceptable for you? Or do 
> you have any another idea about it?
> 
>>
>> Note: the ring buffer used by io_uring is different.  It literally
>> allows communication without invoking any system calls in certain
>> cases.  That shared buffer doesn't add anything like that.  At least I
>> don't see what it actually adds.
>>
>> Hmm?
> 
> The application can write into the buffer. We won't shared queue buffers 
> if we could solve the same with a user pointer.


Wanted to send out a new series today, 

https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap

but then just noticed a tear down issue.

 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
[ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G           O       6.10.0+ #48
[ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 1525.922470] Workqueue: events io_fallback_req_func
[ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80
[ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7
[ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002
[ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001
[ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0
[ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000
[ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001
[ 1525.968225] FS:  0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000
[ 1525.973932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0
[ 1525.980030] Call Trace:
[ 1525.981371]  <TASK>
[ 1525.982567]  ? __die_body+0x66/0xb0
[ 1525.984376]  ? die_addr+0xc1/0x100
[ 1525.986111]  ? exc_general_protection+0x1c6/0x330
[ 1525.988401]  ? asm_exc_general_protection+0x22/0x30
[ 1525.990864]  ? __lock_acquire+0x74/0x7b80
[ 1525.992901]  ? mark_lock+0x9f/0x360
[ 1525.994635]  ? __lock_acquire+0x1420/0x7b80
[ 1525.996629]  ? attach_entity_load_avg+0x47d/0x550
[ 1525.998765]  ? hlock_conflict+0x5a/0x1f0
[ 1526.000515]  ? __bfs+0x2dc/0x5a0
[ 1526.001993]  lock_acquire+0x1fb/0x3d0
[ 1526.004727]  ? gup_fast_fallback+0x13f/0x1d80
[ 1526.006586]  ? gup_fast_fallback+0x13f/0x1d80
[ 1526.008412]  gup_fast_fallback+0x158/0x1d80
[ 1526.010170]  ? gup_fast_fallback+0x13f/0x1d80
[ 1526.011999]  ? __lock_acquire+0x2b07/0x7b80
[ 1526.013793]  __iov_iter_get_pages_alloc+0x36e/0x980
[ 1526.015876]  ? do_raw_spin_unlock+0x5a/0x8a0
[ 1526.017734]  iov_iter_get_pages2+0x56/0x70
[ 1526.019491]  fuse_copy_fill+0x48e/0x980 [fuse]
[ 1526.021400]  fuse_copy_args+0x174/0x6a0 [fuse]
[ 1526.023199]  fuse_uring_prepare_send+0x319/0x6c0 [fuse]
[ 1526.025178]  fuse_uring_send_req_in_task+0x42/0x100 [fuse]
[ 1526.027163]  io_fallback_req_func+0xb4/0x170
[ 1526.028737]  ? process_scheduled_works+0x75b/0x1160
[ 1526.030445]  process_scheduled_works+0x85c/0x1160
[ 1526.032073]  worker_thread+0x8ba/0xce0
[ 1526.033388]  kthread+0x23e/0x2b0
[ 1526.035404]  ? pr_cont_work_flush+0x290/0x290
[ 1526.036958]  ? kthread_blkcg+0xa0/0xa0
[ 1526.038321]  ret_from_fork+0x30/0x60
[ 1526.039600]  ? kthread_blkcg+0xa0/0xa0
[ 1526.040942]  ret_from_fork_asm+0x11/0x20
[ 1526.042353]  </TASK>


We probably need to call iov_iter_get_pages2() immediately
on submitting the buffer from fuse server and not only when needed.
I had planned to do that as optimization later on, I think
it is also needed to avoid io_uring_cmd_complete_in_task().

The part I don't like here is that with mmap we had a complex
initialization - but then either it worked or did not. No exceptions
at IO time. And run time was just a copy into the buffer. 
Without mmap initialization is much simpler, but now complexity shifts
to IO time.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-29 22:32                 ` Bernd Schubert
@ 2024-08-30 13:12                   ` Jens Axboe
  2024-08-30 13:28                     ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-08-30 13:12 UTC (permalink / raw)
  To: Bernd Schubert, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong

On 8/29/24 4:32 PM, Bernd Schubert wrote:
> Wanted to send out a new series today, 
> 
> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap
> 
> but then just noticed a tear down issue.
> 
>  1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G           O       6.10.0+ #48
> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 1525.922470] Workqueue: events io_fallback_req_func
> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80
> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7
> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002
> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001
> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0
> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000
> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001
> [ 1525.968225] FS:  0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000
> [ 1525.973932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0
> [ 1525.980030] Call Trace:
> [ 1525.981371]  <TASK>
> [ 1525.982567]  ? __die_body+0x66/0xb0
> [ 1525.984376]  ? die_addr+0xc1/0x100
> [ 1525.986111]  ? exc_general_protection+0x1c6/0x330
> [ 1525.988401]  ? asm_exc_general_protection+0x22/0x30
> [ 1525.990864]  ? __lock_acquire+0x74/0x7b80
> [ 1525.992901]  ? mark_lock+0x9f/0x360
> [ 1525.994635]  ? __lock_acquire+0x1420/0x7b80
> [ 1525.996629]  ? attach_entity_load_avg+0x47d/0x550
> [ 1525.998765]  ? hlock_conflict+0x5a/0x1f0
> [ 1526.000515]  ? __bfs+0x2dc/0x5a0
> [ 1526.001993]  lock_acquire+0x1fb/0x3d0
> [ 1526.004727]  ? gup_fast_fallback+0x13f/0x1d80
> [ 1526.006586]  ? gup_fast_fallback+0x13f/0x1d80
> [ 1526.008412]  gup_fast_fallback+0x158/0x1d80
> [ 1526.010170]  ? gup_fast_fallback+0x13f/0x1d80
> [ 1526.011999]  ? __lock_acquire+0x2b07/0x7b80
> [ 1526.013793]  __iov_iter_get_pages_alloc+0x36e/0x980
> [ 1526.015876]  ? do_raw_spin_unlock+0x5a/0x8a0
> [ 1526.017734]  iov_iter_get_pages2+0x56/0x70
> [ 1526.019491]  fuse_copy_fill+0x48e/0x980 [fuse]
> [ 1526.021400]  fuse_copy_args+0x174/0x6a0 [fuse]
> [ 1526.023199]  fuse_uring_prepare_send+0x319/0x6c0 [fuse]
> [ 1526.025178]  fuse_uring_send_req_in_task+0x42/0x100 [fuse]
> [ 1526.027163]  io_fallback_req_func+0xb4/0x170
> [ 1526.028737]  ? process_scheduled_works+0x75b/0x1160
> [ 1526.030445]  process_scheduled_works+0x85c/0x1160
> [ 1526.032073]  worker_thread+0x8ba/0xce0
> [ 1526.033388]  kthread+0x23e/0x2b0
> [ 1526.035404]  ? pr_cont_work_flush+0x290/0x290
> [ 1526.036958]  ? kthread_blkcg+0xa0/0xa0
> [ 1526.038321]  ret_from_fork+0x30/0x60
> [ 1526.039600]  ? kthread_blkcg+0xa0/0xa0
> [ 1526.040942]  ret_from_fork_asm+0x11/0x20
> [ 1526.042353]  </TASK>
> 
> 
> We probably need to call iov_iter_get_pages2() immediately
> on submitting the buffer from fuse server and not only when needed.
> I had planned to do that as optimization later on, I think
> it is also needed to avoid io_uring_cmd_complete_in_task().

I think you do, but it's not really what's wrong here - fallback work is
being invoked as the ring is being torn down, either directly or because
the task is exiting. Your task_work should check if this is the case,
and just do -ECANCELED for this case rather than attempt to execute the
work. Most task_work doesn't do much outside of post a completion, but
yours seems complex in that attempts to map pages as well, for example.
In any case, regardless of whether you move the gup to the actual issue
side of things (which I think you should), then you'd want something
ala:

if (req->task != current)
	don't issue, -ECANCELED

in your task_work.

> The part I don't like here is that with mmap we had a complex
> initialization - but then either it worked or did not. No exceptions
> at IO time. And run time was just a copy into the buffer. 
> Without mmap initialization is much simpler, but now complexity shifts
> to IO time.

I'll take a look at your code. But I'd say just fix the missing check
above and send out what you have, it's much easier to iterate on the
list rather than poking at patches in some git branch somewhere.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-30 13:12                   ` Jens Axboe
@ 2024-08-30 13:28                     ` Bernd Schubert
  2024-08-30 13:33                       ` Jens Axboe
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-08-30 13:28 UTC (permalink / raw)
  To: Jens Axboe, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong

On 8/30/24 15:12, Jens Axboe wrote:
> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>> Wanted to send out a new series today, 
>>
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap
>>
>> but then just noticed a tear down issue.
>>
>>  1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
>> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G           O       6.10.0+ #48
>> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [ 1525.922470] Workqueue: events io_fallback_req_func
>> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80
>> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7
>> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002
>> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001
>> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0
>> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
>> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000
>> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001
>> [ 1525.968225] FS:  0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000
>> [ 1525.973932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0
>> [ 1525.980030] Call Trace:
>> [ 1525.981371]  <TASK>
>> [ 1525.982567]  ? __die_body+0x66/0xb0
>> [ 1525.984376]  ? die_addr+0xc1/0x100
>> [ 1525.986111]  ? exc_general_protection+0x1c6/0x330
>> [ 1525.988401]  ? asm_exc_general_protection+0x22/0x30
>> [ 1525.990864]  ? __lock_acquire+0x74/0x7b80
>> [ 1525.992901]  ? mark_lock+0x9f/0x360
>> [ 1525.994635]  ? __lock_acquire+0x1420/0x7b80
>> [ 1525.996629]  ? attach_entity_load_avg+0x47d/0x550
>> [ 1525.998765]  ? hlock_conflict+0x5a/0x1f0
>> [ 1526.000515]  ? __bfs+0x2dc/0x5a0
>> [ 1526.001993]  lock_acquire+0x1fb/0x3d0
>> [ 1526.004727]  ? gup_fast_fallback+0x13f/0x1d80
>> [ 1526.006586]  ? gup_fast_fallback+0x13f/0x1d80
>> [ 1526.008412]  gup_fast_fallback+0x158/0x1d80
>> [ 1526.010170]  ? gup_fast_fallback+0x13f/0x1d80
>> [ 1526.011999]  ? __lock_acquire+0x2b07/0x7b80
>> [ 1526.013793]  __iov_iter_get_pages_alloc+0x36e/0x980
>> [ 1526.015876]  ? do_raw_spin_unlock+0x5a/0x8a0
>> [ 1526.017734]  iov_iter_get_pages2+0x56/0x70
>> [ 1526.019491]  fuse_copy_fill+0x48e/0x980 [fuse]
>> [ 1526.021400]  fuse_copy_args+0x174/0x6a0 [fuse]
>> [ 1526.023199]  fuse_uring_prepare_send+0x319/0x6c0 [fuse]
>> [ 1526.025178]  fuse_uring_send_req_in_task+0x42/0x100 [fuse]
>> [ 1526.027163]  io_fallback_req_func+0xb4/0x170
>> [ 1526.028737]  ? process_scheduled_works+0x75b/0x1160
>> [ 1526.030445]  process_scheduled_works+0x85c/0x1160
>> [ 1526.032073]  worker_thread+0x8ba/0xce0
>> [ 1526.033388]  kthread+0x23e/0x2b0
>> [ 1526.035404]  ? pr_cont_work_flush+0x290/0x290
>> [ 1526.036958]  ? kthread_blkcg+0xa0/0xa0
>> [ 1526.038321]  ret_from_fork+0x30/0x60
>> [ 1526.039600]  ? kthread_blkcg+0xa0/0xa0
>> [ 1526.040942]  ret_from_fork_asm+0x11/0x20
>> [ 1526.042353]  </TASK>
>>
>>
>> We probably need to call iov_iter_get_pages2() immediately
>> on submitting the buffer from fuse server and not only when needed.
>> I had planned to do that as optimization later on, I think
>> it is also needed to avoid io_uring_cmd_complete_in_task().
> 
> I think you do, but it's not really what's wrong here - fallback work is
> being invoked as the ring is being torn down, either directly or because
> the task is exiting. Your task_work should check if this is the case,
> and just do -ECANCELED for this case rather than attempt to execute the
> work. Most task_work doesn't do much outside of post a completion, but
> yours seems complex in that attempts to map pages as well, for example.
> In any case, regardless of whether you move the gup to the actual issue
> side of things (which I think you should), then you'd want something
> ala:
> 
> if (req->task != current)
> 	don't issue, -ECANCELED
> 
> in your task_work.

Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong 
into __io_uring_cmd_do_in_task then? Because my task_work_cb function 
(passed to io_uring_cmd_complete_in_task) doesn't even have the request.

I'm going to test this in a bit

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 21ac5fb2d5f0..c06b9fcff48f 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -120,6 +120,11 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
 {
        struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
 
+       if (req->task != current) {
+               /* don't issue, -ECANCELED */
+               return;
+       }
+
        /* task_work executor checks the deffered list completion */
        ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
 }



> 
>> The part I don't like here is that with mmap we had a complex
>> initialization - but then either it worked or did not. No exceptions
>> at IO time. And run time was just a copy into the buffer. 
>> Without mmap initialization is much simpler, but now complexity shifts
>> to IO time.
> 
> I'll take a look at your code. But I'd say just fix the missing check
> above and send out what you have, it's much easier to iterate on the
> list rather than poking at patches in some git branch somewhere.
> 

I'm almost through updating it, will send something out definitely today.
I will just keep the last patch that pins user buffer pages on top of the series 
- will avoid all the rebasing.


Thanks,
Bernd

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-30 13:28                     ` Bernd Schubert
@ 2024-08-30 13:33                       ` Jens Axboe
  2024-08-30 14:55                         ` Pavel Begunkov
  0 siblings, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-08-30 13:33 UTC (permalink / raw)
  To: Bernd Schubert, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong

On 8/30/24 7:28 AM, Bernd Schubert wrote:
> On 8/30/24 15:12, Jens Axboe wrote:
>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>> Wanted to send out a new series today, 
>>>
>>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap
>>>
>>> but then just noticed a tear down issue.
>>>
>>>  1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
>>> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G           O       6.10.0+ #48
>>> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>>> [ 1525.922470] Workqueue: events io_fallback_req_func
>>> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80
>>> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7
>>> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002
>>> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001
>>> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0
>>> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
>>> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000
>>> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001
>>> [ 1525.968225] FS:  0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000
>>> [ 1525.973932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0
>>> [ 1525.980030] Call Trace:
>>> [ 1525.981371]  <TASK>
>>> [ 1525.982567]  ? __die_body+0x66/0xb0
>>> [ 1525.984376]  ? die_addr+0xc1/0x100
>>> [ 1525.986111]  ? exc_general_protection+0x1c6/0x330
>>> [ 1525.988401]  ? asm_exc_general_protection+0x22/0x30
>>> [ 1525.990864]  ? __lock_acquire+0x74/0x7b80
>>> [ 1525.992901]  ? mark_lock+0x9f/0x360
>>> [ 1525.994635]  ? __lock_acquire+0x1420/0x7b80
>>> [ 1525.996629]  ? attach_entity_load_avg+0x47d/0x550
>>> [ 1525.998765]  ? hlock_conflict+0x5a/0x1f0
>>> [ 1526.000515]  ? __bfs+0x2dc/0x5a0
>>> [ 1526.001993]  lock_acquire+0x1fb/0x3d0
>>> [ 1526.004727]  ? gup_fast_fallback+0x13f/0x1d80
>>> [ 1526.006586]  ? gup_fast_fallback+0x13f/0x1d80
>>> [ 1526.008412]  gup_fast_fallback+0x158/0x1d80
>>> [ 1526.010170]  ? gup_fast_fallback+0x13f/0x1d80
>>> [ 1526.011999]  ? __lock_acquire+0x2b07/0x7b80
>>> [ 1526.013793]  __iov_iter_get_pages_alloc+0x36e/0x980
>>> [ 1526.015876]  ? do_raw_spin_unlock+0x5a/0x8a0
>>> [ 1526.017734]  iov_iter_get_pages2+0x56/0x70
>>> [ 1526.019491]  fuse_copy_fill+0x48e/0x980 [fuse]
>>> [ 1526.021400]  fuse_copy_args+0x174/0x6a0 [fuse]
>>> [ 1526.023199]  fuse_uring_prepare_send+0x319/0x6c0 [fuse]
>>> [ 1526.025178]  fuse_uring_send_req_in_task+0x42/0x100 [fuse]
>>> [ 1526.027163]  io_fallback_req_func+0xb4/0x170
>>> [ 1526.028737]  ? process_scheduled_works+0x75b/0x1160
>>> [ 1526.030445]  process_scheduled_works+0x85c/0x1160
>>> [ 1526.032073]  worker_thread+0x8ba/0xce0
>>> [ 1526.033388]  kthread+0x23e/0x2b0
>>> [ 1526.035404]  ? pr_cont_work_flush+0x290/0x290
>>> [ 1526.036958]  ? kthread_blkcg+0xa0/0xa0
>>> [ 1526.038321]  ret_from_fork+0x30/0x60
>>> [ 1526.039600]  ? kthread_blkcg+0xa0/0xa0
>>> [ 1526.040942]  ret_from_fork_asm+0x11/0x20
>>> [ 1526.042353]  </TASK>
>>>
>>>
>>> We probably need to call iov_iter_get_pages2() immediately
>>> on submitting the buffer from fuse server and not only when needed.
>>> I had planned to do that as optimization later on, I think
>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>
>> I think you do, but it's not really what's wrong here - fallback work is
>> being invoked as the ring is being torn down, either directly or because
>> the task is exiting. Your task_work should check if this is the case,
>> and just do -ECANCELED for this case rather than attempt to execute the
>> work. Most task_work doesn't do much outside of post a completion, but
>> yours seems complex in that attempts to map pages as well, for example.
>> In any case, regardless of whether you move the gup to the actual issue
>> side of things (which I think you should), then you'd want something
>> ala:
>>
>> if (req->task != current)
>> 	don't issue, -ECANCELED
>>
>> in your task_work.
> 
> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong 
> into __io_uring_cmd_do_in_task then? Because my task_work_cb function 
> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.

Yeah it probably does, the uring_cmd case is a bit special is that it's
a set of helpers around task_work that can be consumed by eg fuse and
ublk. The existing users don't really do anything complicated on that
side, hence there's no real need to check. But since the ring/task is
going away, we should be able to generically do it in the helpers like
you did below.

> I'm going to test this in a bit
> 
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 21ac5fb2d5f0..c06b9fcff48f 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -120,6 +120,11 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
>  {
>         struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>  
> +       if (req->task != current) {
> +               /* don't issue, -ECANCELED */
> +               return;
> +       }
> +
>         /* task_work executor checks the deffered list completion */
>         ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
>  }
> 
> 
> 
>>
>>> The part I don't like here is that with mmap we had a complex
>>> initialization - but then either it worked or did not. No exceptions
>>> at IO time. And run time was just a copy into the buffer. 
>>> Without mmap initialization is much simpler, but now complexity shifts
>>> to IO time.
>>
>> I'll take a look at your code. But I'd say just fix the missing check
>> above and send out what you have, it's much easier to iterate on the
>> list rather than poking at patches in some git branch somewhere.
>>
> 
> I'm almost through updating it, will send something out definitely
> today. I will just keep the last patch that pins user buffer pages on
> top of the series - will avoid all the rebasing.

Excellent! Would be great if you can include how to test it as well, as
then I can give it a spin as well and test any potential code changes
proposed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-30 13:33                       ` Jens Axboe
@ 2024-08-30 14:55                         ` Pavel Begunkov
  2024-08-30 15:10                           ` Bernd Schubert
  2024-08-30 20:08                           ` Jens Axboe
  0 siblings, 2 replies; 113+ messages in thread
From: Pavel Begunkov @ 2024-08-30 14:55 UTC (permalink / raw)
  To: Jens Axboe, Bernd Schubert, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong

On 8/30/24 14:33, Jens Axboe wrote:
> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>> On 8/30/24 15:12, Jens Axboe wrote:
>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>> We probably need to call iov_iter_get_pages2() immediately
>>>> on submitting the buffer from fuse server and not only when needed.
>>>> I had planned to do that as optimization later on, I think
>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>
>>> I think you do, but it's not really what's wrong here - fallback work is
>>> being invoked as the ring is being torn down, either directly or because
>>> the task is exiting. Your task_work should check if this is the case,
>>> and just do -ECANCELED for this case rather than attempt to execute the
>>> work. Most task_work doesn't do much outside of post a completion, but
>>> yours seems complex in that attempts to map pages as well, for example.
>>> In any case, regardless of whether you move the gup to the actual issue
>>> side of things (which I think you should), then you'd want something
>>> ala:
>>>
>>> if (req->task != current)
>>> 	don't issue, -ECANCELED
>>>
>>> in your task_work.nvme_uring_task_cb
>>
>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
> 
> Yeah it probably does, the uring_cmd case is a bit special is that it's
> a set of helpers around task_work that can be consumed by eg fuse and
> ublk. The existing users don't really do anything complicated on that
> side, hence there's no real need to check. But since the ring/task is
> going away, we should be able to generically do it in the helpers like
> you did below.

That won't work, we should give commands an opportunity to clean up
after themselves. I'm pretty sure it will break existing users.
For now we can pass a flag to the callback, fuse would need to
check it and fail. Compile tested only

commit a5b382f150b44476ccfa84cefdb22ce2ceeb12f1
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Aug 30 15:43:32 2024 +0100

     io_uring/cmd: let cmds tw know about dying task
     
     When the taks that submitted a request is dying, a task work for that
     request might get run by a kernel thread or even worse by a half
     dismantled task. We can't just cancel the task work without running the
     callback as the cmd might need to do some clean up, so pass a flag
     instead. If set, it's not safe to access any task resources and the
     callback is expected to cancel the cmd ASAP.
     
     Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ace7ac056d51..a89abec98832 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -37,6 +37,7 @@ enum io_uring_cmd_flags {
  	/* set when uring wants to cancel a previously issued command */
  	IO_URING_F_CANCEL		= (1 << 11),
  	IO_URING_F_COMPAT		= (1 << 12),
+	IO_URING_F_TASK_DEAD		= (1 << 13),
  };
  
  struct io_zcrx_ifq;
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8391c7c7c1ec..55bdcb4b63b3 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
  static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
  {
  	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	unsigned flags = IO_URING_F_COMPLETE_DEFER;
+
+	if (req->task->flags & PF_EXITING)
+		flags |= IO_URING_F_TASK_DEAD;
  
  	/* task_work executor checks the deffered list completion */
-	ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
+	ioucmd->task_work_cb(ioucmd, flags);
  }
  
  void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,

-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-30 14:55                         ` Pavel Begunkov
@ 2024-08-30 15:10                           ` Bernd Schubert
  2024-08-30 20:08                           ` Jens Axboe
  1 sibling, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-08-30 15:10 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong



On 8/30/24 16:55, Pavel Begunkov wrote:
> On 8/30/24 14:33, Jens Axboe wrote:
>> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>>> On 8/30/24 15:12, Jens Axboe wrote:
>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>>> We probably need to call iov_iter_get_pages2() immediately
>>>>> on submitting the buffer from fuse server and not only when needed.
>>>>> I had planned to do that as optimization later on, I think
>>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>>
>>>> I think you do, but it's not really what's wrong here - fallback
>>>> work is
>>>> being invoked as the ring is being torn down, either directly or
>>>> because
>>>> the task is exiting. Your task_work should check if this is the case,
>>>> and just do -ECANCELED for this case rather than attempt to execute the
>>>> work. Most task_work doesn't do much outside of post a completion, but
>>>> yours seems complex in that attempts to map pages as well, for example.
>>>> In any case, regardless of whether you move the gup to the actual issue
>>>> side of things (which I think you should), then you'd want something
>>>> ala:
>>>>
>>>> if (req->task != current)
>>>>     don't issue, -ECANCELED
>>>>
>>>> in your task_work.nvme_uring_task_cb
>>>
>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
>>
>> Yeah it probably does, the uring_cmd case is a bit special is that it's
>> a set of helpers around task_work that can be consumed by eg fuse and
>> ublk. The existing users don't really do anything complicated on that
>> side, hence there's no real need to check. But since the ring/task is
>> going away, we should be able to generically do it in the helpers like
>> you did below.
> 
> That won't work, we should give commands an opportunity to clean up
> after themselves. I'm pretty sure it will break existing users.
> For now we can pass a flag to the callback, fuse would need to
> check it and fail. Compile tested only
> 
> commit a5b382f150b44476ccfa84cefdb22ce2ceeb12f1
> Author: Pavel Begunkov <asml.silence@gmail.com>
> Date:   Fri Aug 30 15:43:32 2024 +0100
> 
>     io_uring/cmd: let cmds tw know about dying task
>         When the taks that submitted a request is dying, a task work for
> that
>     request might get run by a kernel thread or even worse by a half
>     dismantled task. We can't just cancel the task work without running the
>     callback as the cmd might need to do some clean up, so pass a flag
>     instead. If set, it's not safe to access any task resources and the
>     callback is expected to cancel the cmd ASAP.
>         Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> diff --git a/include/linux/io_uring_types.h
> b/include/linux/io_uring_types.h
> index ace7ac056d51..a89abec98832 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -37,6 +37,7 @@ enum io_uring_cmd_flags {
>      /* set when uring wants to cancel a previously issued command */
>      IO_URING_F_CANCEL        = (1 << 11),
>      IO_URING_F_COMPAT        = (1 << 12),
> +    IO_URING_F_TASK_DEAD        = (1 << 13),
>  };
>  
>  struct io_zcrx_ifq;
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8391c7c7c1ec..55bdcb4b63b3 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
>  static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state
> *ts)
>  {
>      struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct
> io_uring_cmd);
> +    unsigned flags = IO_URING_F_COMPLETE_DEFER;
> +
> +    if (req->task->flags & PF_EXITING)
> +        flags |= IO_URING_F_TASK_DEAD;
>  
>      /* task_work executor checks the deffered list completion */
> -    ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
> +    ioucmd->task_work_cb(ioucmd, flags);
>  }
>  
>  void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
> 


Thanks and yeah you are right, the previous patch would have missed an
io_uring_cmd_done for fuse-uring as well.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-30 14:55                         ` Pavel Begunkov
  2024-08-30 15:10                           ` Bernd Schubert
@ 2024-08-30 20:08                           ` Jens Axboe
  2024-08-31  0:02                             ` Bernd Schubert
  1 sibling, 1 reply; 113+ messages in thread
From: Jens Axboe @ 2024-08-30 20:08 UTC (permalink / raw)
  To: Pavel Begunkov, Bernd Schubert, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong

On 8/30/24 8:55 AM, Pavel Begunkov wrote:
> On 8/30/24 14:33, Jens Axboe wrote:
>> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>>> On 8/30/24 15:12, Jens Axboe wrote:
>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>>> We probably need to call iov_iter_get_pages2() immediately
>>>>> on submitting the buffer from fuse server and not only when needed.
>>>>> I had planned to do that as optimization later on, I think
>>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>>
>>>> I think you do, but it's not really what's wrong here - fallback work is
>>>> being invoked as the ring is being torn down, either directly or because
>>>> the task is exiting. Your task_work should check if this is the case,
>>>> and just do -ECANCELED for this case rather than attempt to execute the
>>>> work. Most task_work doesn't do much outside of post a completion, but
>>>> yours seems complex in that attempts to map pages as well, for example.
>>>> In any case, regardless of whether you move the gup to the actual issue
>>>> side of things (which I think you should), then you'd want something
>>>> ala:
>>>>
>>>> if (req->task != current)
>>>>     don't issue, -ECANCELED
>>>>
>>>> in your task_work.nvme_uring_task_cb
>>>
>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
>>
>> Yeah it probably does, the uring_cmd case is a bit special is that it's
>> a set of helpers around task_work that can be consumed by eg fuse and
>> ublk. The existing users don't really do anything complicated on that
>> side, hence there's no real need to check. But since the ring/task is
>> going away, we should be able to generically do it in the helpers like
>> you did below.
> 
> That won't work, we should give commands an opportunity to clean up
> after themselves. I'm pretty sure it will break existing users.
> For now we can pass a flag to the callback, fuse would need to
> check it and fail. Compile tested only

Right, I did actually consider that yesterday and why I replied with the
fuse callback needing to do it, but then forgot... Since we can't do a
generic cleanup callback, it'll have to be done in the handler.

I do like making this generic and not needing individual task_work
handlers like this checking for some magic, so I like the flag addition.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-30 20:08                           ` Jens Axboe
@ 2024-08-31  0:02                             ` Bernd Schubert
  2024-08-31  0:49                               ` Bernd Schubert
  0 siblings, 1 reply; 113+ messages in thread
From: Bernd Schubert @ 2024-08-31  0:02 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong



On 8/30/24 22:08, Jens Axboe wrote:
> On 8/30/24 8:55 AM, Pavel Begunkov wrote:
>> On 8/30/24 14:33, Jens Axboe wrote:
>>> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>>>> On 8/30/24 15:12, Jens Axboe wrote:
>>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>>>> We probably need to call iov_iter_get_pages2() immediately
>>>>>> on submitting the buffer from fuse server and not only when needed.
>>>>>> I had planned to do that as optimization later on, I think
>>>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>>>
>>>>> I think you do, but it's not really what's wrong here - fallback work is
>>>>> being invoked as the ring is being torn down, either directly or because
>>>>> the task is exiting. Your task_work should check if this is the case,
>>>>> and just do -ECANCELED for this case rather than attempt to execute the
>>>>> work. Most task_work doesn't do much outside of post a completion, but
>>>>> yours seems complex in that attempts to map pages as well, for example.
>>>>> In any case, regardless of whether you move the gup to the actual issue
>>>>> side of things (which I think you should), then you'd want something
>>>>> ala:
>>>>>
>>>>> if (req->task != current)
>>>>>     don't issue, -ECANCELED
>>>>>
>>>>> in your task_work.nvme_uring_task_cb
>>>>
>>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
>>>
>>> Yeah it probably does, the uring_cmd case is a bit special is that it's
>>> a set of helpers around task_work that can be consumed by eg fuse and
>>> ublk. The existing users don't really do anything complicated on that
>>> side, hence there's no real need to check. But since the ring/task is
>>> going away, we should be able to generically do it in the helpers like
>>> you did below.
>>
>> That won't work, we should give commands an opportunity to clean up
>> after themselves. I'm pretty sure it will break existing users.
>> For now we can pass a flag to the callback, fuse would need to
>> check it and fail. Compile tested only
> 
> Right, I did actually consider that yesterday and why I replied with the
> fuse callback needing to do it, but then forgot... Since we can't do a
> generic cleanup callback, it'll have to be done in the handler.
> 
> I do like making this generic and not needing individual task_work
> handlers like this checking for some magic, so I like the flag addition.
> 

Found another issue in (error handling in my code) while working on page 
pinning of the user buffer and fixed that first. Ways to late now (or early)
to continue with the page pinning, but I gave Pavels patch a try with the
additional patch below - same issue. 
I added a warn message to see if triggers - doesn't come up

	if (unlikely(issue_flags & IO_URING_F_TASK_DEAD)) {
		pr_warn("IO_URING_F_TASK_DEAD");
		goto terminating;
	}


I could digg further, but I'm actually not sure if we need to. With early page pinning
the entire function should go away, as I hope that the application can write into the
buffer again. Although I'm not sure yet if Miklos will like that pinning.


bschubert2@imesrv6 linux.git>stg show handle-IO_URING_F_TASK_DEAD
commit 42b4dae795bd37918455bad0ce3eea64b28be03c (HEAD -> fuse-uring-for-6.10-rfc3-without-mmap)
Author: Bernd Schubert <bschubert@ddn.com>
Date:   Sat Aug 31 01:26:26 2024 +0200

    fuse: {uring} Handle IO_URING_F_TASK_DEAD
    
    The ring task is terminating, it not safe to still access
    its resources. Also no need for further actions.

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e557f595133b..1d5dfa9c0965 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1003,6 +1003,9 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
        BUILD_BUG_ON(sizeof(pdu) > sizeof(cmd->pdu));
        int err;
 
+       if (unlikely(issue_flags & IO_URING_F_TASK_DEAD))
+               goto terminating;
+
        err = fuse_uring_prepare_send(ring_ent);
        if (err)
                goto err;
@@ -1017,6 +1020,10 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
        return;
 err:
        fuse_uring_next_fuse_req(ring_ent, queue);
+
+terminating:
+       /* Avoid all actions as the task that issues the ring is terminating */
+       io_uring_cmd_done(cmd, -ECANCELED, 0, issue_flags);
 }
 
 /* queue a fuse request and send it if a ring entry is available */




Thanks,
Bernd

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
  2024-08-31  0:02                             ` Bernd Schubert
@ 2024-08-31  0:49                               ` Bernd Schubert
  0 siblings, 0 replies; 113+ messages in thread
From: Bernd Schubert @ 2024-08-31  0:49 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, Bernd Schubert, Miklos Szeredi
  Cc: Amir Goldstein, linux-fsdevel@vger.kernel.org,
	io-uring@vger.kernel.org, Kent Overstreet, Josef Bacik,
	Joanne Koong



On 8/31/24 02:02, Bernd Schubert wrote:
> 
> 
> On 8/30/24 22:08, Jens Axboe wrote:
>> On 8/30/24 8:55 AM, Pavel Begunkov wrote:
>>> On 8/30/24 14:33, Jens Axboe wrote:
>>>> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>>>>> On 8/30/24 15:12, Jens Axboe wrote:
>>>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>>>>> We probably need to call iov_iter_get_pages2() immediately
>>>>>>> on submitting the buffer from fuse server and not only when needed.
>>>>>>> I had planned to do that as optimization later on, I think
>>>>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>>>>
>>>>>> I think you do, but it's not really what's wrong here - fallback work is
>>>>>> being invoked as the ring is being torn down, either directly or because
>>>>>> the task is exiting. Your task_work should check if this is the case,
>>>>>> and just do -ECANCELED for this case rather than attempt to execute the
>>>>>> work. Most task_work doesn't do much outside of post a completion, but
>>>>>> yours seems complex in that attempts to map pages as well, for example.
>>>>>> In any case, regardless of whether you move the gup to the actual issue
>>>>>> side of things (which I think you should), then you'd want something
>>>>>> ala:
>>>>>>
>>>>>> if (req->task != current)
>>>>>>     don't issue, -ECANCELED
>>>>>>
>>>>>> in your task_work.nvme_uring_task_cb
>>>>>
>>>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>>>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>>>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
>>>>
>>>> Yeah it probably does, the uring_cmd case is a bit special is that it's
>>>> a set of helpers around task_work that can be consumed by eg fuse and
>>>> ublk. The existing users don't really do anything complicated on that
>>>> side, hence there's no real need to check. But since the ring/task is
>>>> going away, we should be able to generically do it in the helpers like
>>>> you did below.
>>>
>>> That won't work, we should give commands an opportunity to clean up
>>> after themselves. I'm pretty sure it will break existing users.
>>> For now we can pass a flag to the callback, fuse would need to
>>> check it and fail. Compile tested only
>>
>> Right, I did actually consider that yesterday and why I replied with the
>> fuse callback needing to do it, but then forgot... Since we can't do a
>> generic cleanup callback, it'll have to be done in the handler.
>>
>> I do like making this generic and not needing individual task_work
>> handlers like this checking for some magic, so I like the flag addition.
>>
> 
> Found another issue in (error handling in my code) while working on page 
> pinning of the user buffer and fixed that first. Ways to late now (or early)
> to continue with the page pinning, but I gave Pavels patch a try with the
> additional patch below - same issue. 
> I added a warn message to see if triggers - doesn't come up
> 
> 	if (unlikely(issue_flags & IO_URING_F_TASK_DEAD)) {
> 		pr_warn("IO_URING_F_TASK_DEAD");
> 		goto terminating;
> 	}
> 
> 
> I could digg further, but I'm actually not sure if we need to. With early page pinning
> the entire function should go away, as I hope that the application can write into the
> buffer again. Although I'm not sure yet if Miklos will like that pinning.

Works with page pinning, new series comes once I got some sleep (still
need to write the change log).


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 113+ messages in thread

end of thread, other threads:[~2024-09-01 12:07 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 01/19] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
2024-05-29 21:09   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 02/19] fuse: Move fuse_get_dev to header file Bernd Schubert
2024-05-29 21:09   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 03/19] fuse: Move request bits Bernd Schubert
2024-05-29 21:10   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 04/19] fuse: Add fuse-io-uring design documentation Bernd Schubert
2024-05-29 21:17   ` Josef Bacik
2024-05-30 12:50     ` Bernd Schubert
2024-05-30 14:59       ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 05/19] fuse: Add a uring config ioctl Bernd Schubert
2024-05-29 21:24   ` Josef Bacik
2024-05-30 12:51     ` Bernd Schubert
2024-06-03 13:03   ` Miklos Szeredi
2024-06-03 13:48     ` Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 06/19] Add a vmalloc_node_user function Bernd Schubert
2024-05-30 15:10   ` Josef Bacik
2024-05-30 16:13     ` Bernd Schubert
2024-05-31 13:56   ` Christoph Hellwig
2024-06-03 15:59     ` Kent Overstreet
2024-06-03 19:24       ` Bernd Schubert
2024-06-04  4:20         ` Christoph Hellwig
2024-06-07  2:30           ` Dave Chinner
2024-06-07  4:49             ` Christoph Hellwig
2024-06-04  4:08       ` Christoph Hellwig
2024-05-29 18:00 ` [PATCH RFC v2 07/19] fuse uring: Add an mmap method Bernd Schubert
2024-05-30 15:37   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 08/19] fuse: Add the queue configuration ioctl Bernd Schubert
2024-05-30 15:54   ` Josef Bacik
2024-05-30 17:49     ` Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 09/19] fuse: {uring} Add a dev_release exception for fuse-over-io-uring Bernd Schubert
2024-05-30 19:00   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 10/19] fuse: {uring} Handle SQEs - register commands Bernd Schubert
2024-05-30 19:55   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 11/19] fuse: Add support to copy from/to the ring buffer Bernd Schubert
2024-05-30 19:59   ` Josef Bacik
2024-09-01 11:56     ` Bernd Schubert
2024-09-01 11:56     ` Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 12/19] fuse: {uring} Add uring sqe commit and fetch support Bernd Schubert
2024-05-30 20:08   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 13/19] fuse: {uring} Handle uring shutdown Bernd Schubert
2024-05-30 20:21   ` Josef Bacik
2024-05-29 18:00 ` [PATCH RFC v2 14/19] fuse: {uring} Allow to queue to the ring Bernd Schubert
2024-05-30 20:32   ` Josef Bacik
2024-05-30 21:26     ` Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 15/19] export __wake_on_current_cpu Bernd Schubert
2024-05-30 20:37   ` Josef Bacik
2024-06-04  9:26     ` Peter Zijlstra
2024-06-04  9:36       ` Bernd Schubert
2024-06-04 19:27         ` Peter Zijlstra
2024-09-01 12:07           ` Bernd Schubert
2024-05-31 13:51   ` Christoph Hellwig
2024-05-29 18:00 ` [PATCH RFC v2 16/19] fuse: {uring} Wake requests on the the current cpu Bernd Schubert
2024-05-30 16:44   ` Shachar Sharon
2024-05-30 16:59     ` Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 17/19] fuse: {uring} Send async requests to qid of core + 1 Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 18/19] fuse: {uring} Set a min cpu offset io-size for reads/writes Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends Bernd Schubert
2024-05-31 16:24   ` Jens Axboe
2024-05-31 17:36     ` Bernd Schubert
2024-05-31 19:10       ` Jens Axboe
2024-06-01 16:37         ` Bernd Schubert
2024-05-30  7:07 ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Amir Goldstein
2024-05-30 12:09   ` Bernd Schubert
2024-05-30 15:36 ` Kent Overstreet
2024-05-30 16:02   ` Bernd Schubert
2024-05-30 16:10     ` Kent Overstreet
2024-05-30 16:17       ` Bernd Schubert
2024-05-30 17:30         ` Kent Overstreet
2024-05-30 19:09         ` Josef Bacik
2024-05-30 20:05           ` Kent Overstreet
2024-05-31  3:53         ` [PATCH] fs: sys_ringbuffer() (WIP) Kent Overstreet
2024-05-31 13:11           ` kernel test robot
2024-05-31 15:49           ` kernel test robot
2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
2024-05-30 16:32       ` Bernd Schubert
2024-05-30 17:26         ` Jens Axboe
2024-05-30 17:16       ` Kent Overstreet
2024-05-30 17:28         ` Jens Axboe
2024-05-30 17:58           ` Kent Overstreet
2024-05-30 18:48             ` Jens Axboe
2024-05-30 19:35               ` Kent Overstreet
2024-05-31  0:11                 ` Jens Axboe
2024-06-04 23:45       ` Ming Lei
2024-05-30 20:47 ` Josef Bacik
2024-06-11  8:20 ` Miklos Szeredi
2024-06-11 10:26   ` Bernd Schubert
2024-06-11 15:35     ` Miklos Szeredi
2024-06-11 17:37       ` Bernd Schubert
2024-06-11 23:35         ` Kent Overstreet
2024-06-12 13:53           ` Bernd Schubert
2024-06-12 14:19             ` Kent Overstreet
2024-06-12 15:40               ` Bernd Schubert
2024-06-12 15:55                 ` Kent Overstreet
2024-06-12 16:15                   ` Bernd Schubert
2024-06-12 16:24                     ` Kent Overstreet
2024-06-12 16:44                       ` Bernd Schubert
2024-06-12  7:39         ` Miklos Szeredi
2024-06-12 13:32           ` Bernd Schubert
2024-06-12 13:46             ` Bernd Schubert
2024-06-12 14:07             ` Miklos Szeredi
2024-06-12 14:56               ` Bernd Schubert
2024-08-02 23:03                 ` Bernd Schubert
2024-08-29 22:32                 ` Bernd Schubert
2024-08-30 13:12                   ` Jens Axboe
2024-08-30 13:28                     ` Bernd Schubert
2024-08-30 13:33                       ` Jens Axboe
2024-08-30 14:55                         ` Pavel Begunkov
2024-08-30 15:10                           ` Bernd Schubert
2024-08-30 20:08                           ` Jens Axboe
2024-08-31  0:02                             ` Bernd Schubert
2024-08-31  0:49                               ` Bernd Schubert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).