* [PATCH 00/15] export/fuse: Use coroutines and multi-threading
@ 2025-03-25 16:05 Hanna Czenczek
2025-03-25 16:06 ` [PATCH 01/15] fuse: Copy write buffer content before polling Hanna Czenczek
` (14 more replies)
0 siblings, 15 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:05 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Hi,
This series first contains two bug fix patches (one more of a fix than
the other), then a couple of small modifications to prepare for the big
ones:
We remove libfuse from the request processing path, only using it for
mounting. This does not really have a performance impact (but see the
benchmark results in that patch’s commit message), but it allows me to
sleep easier when it comes to concurrency, because I don’t know what
guarantees libfuse makes for coroutine concurrency.
In general, I just don’t feel that libfuse is really the best choice for
us: It seems primarily designed for projects that only provide a
filesystem, and nothing else, i.e. it provides a variety of main loops
and you’re supposed to use them. QEMU however has its own main loop and
event processing, so the opacity of libfuse’s request processing makes
me uneasy. Also, FUSE request parsing is not that hard.
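To illustrate that last point, here is a rough sketch (not the code added
later in this series) of what reading and dispatching a single request from
the /dev/fuse FD boils down to, using the request layout from <linux/fuse.h>;
handler bodies, replies and error handling are omitted:

#include <linux/fuse.h>
#include <unistd.h>

/* Sketch: read one FUSE request from fuse_fd and dispatch on its opcode. */
static void handle_one_request(int fuse_fd)
{
    /* FUSE_MIN_READ_BUFFER is the minimum read size the kernel accepts */
    char buf[FUSE_MIN_READ_BUFFER];
    ssize_t len = read(fuse_fd, buf, sizeof(buf));
    const struct fuse_in_header *in = (const struct fuse_in_header *)buf;

    if (len < (ssize_t)sizeof(*in) || len != in->len) {
        return; /* error, short read, or truncated request: do not trust it */
    }

    switch (in->opcode) {
    case FUSE_INIT:
        /* negotiate protocol version and limits, reply with fuse_init_out */
        break;
    case FUSE_GETATTR:
        /* reply with a fuse_attr_out describing the export */
        break;
    case FUSE_READ:
    case FUSE_WRITE:
        /* operation-specific structs (and write data) follow the header */
        break;
    default:
        /* reply with -ENOSYS for anything not implemented */
        break;
    }
}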
Then, this series makes request processing run in coroutines.
Finally, it adds FUSE multi-threading (i.e. one FUSE FD per I/O thread).
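(Background for that last point: the kernel allows cloning a FUSE session’s
/dev/fuse FD via the FUSE_DEV_IOC_CLONE ioctl, which is also how libfuse’s
own multi-threaded loop obtains per-thread FDs. A rough sketch of that
general mechanism, not necessarily the exact code in the multi-threading
patch:)

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fuse.h>

/* Sketch: clone an existing FUSE session FD so that another thread (here: an
 * I/O thread) can read requests for the same session from its own FD. */
static int clone_fuse_fd(int session_fd)
{
    uint32_t src = session_fd;
    int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

    if (clone_fd < 0) {
        return -errno;
    }
    if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &src) < 0) {
        int err = -errno;
        close(clone_fd);
        return err;
    }
    return clone_fd; /* attach a read handler for this FD in the I/O thread */
}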
Hanna Czenczek (15):
fuse: Copy write buffer content before polling
fuse: Ensure init clean-up even with error_fatal
fuse: Remove superfluous empty line
fuse: Explicitly set inode ID to 1
fuse: Change setup_... to mount_fuse_export()
fuse: Fix mount options
fuse: Set direct_io and parallel_direct_writes
fuse: Introduce fuse_{at,de}tach_handlers()
fuse: Introduce fuse_{inc,dec}_in_flight()
fuse: Add halted flag
fuse: Manually process requests (without libfuse)
fuse: Reduce max read size
fuse: Process requests in coroutines
fuse: Implement multi-threading
fuse: Increase MAX_WRITE_SIZE with a second buffer
qapi/block-export.json | 8 +-
block/export/fuse.c | 1227 ++++++++++++++++++++++++++++--------
tests/qemu-iotests/308 | 4 +-
tests/qemu-iotests/308.out | 5 +-
4 files changed, 965 insertions(+), 279 deletions(-)
--
2.48.1
* [PATCH 01/15] fuse: Copy write buffer content before polling
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 14:47 ` Stefan Hajnoczi
2025-04-01 13:44 ` Eric Blake
2025-03-25 16:06 ` [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
` (13 subsequent siblings)
14 siblings, 2 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, qemu-stable
Polling in I/O functions can lead to nested read_from_fuse_export()
calls, overwriting the request buffer's content. The only function
affected by this is fuse_write(), which therefore must use a bounce
buffer or corruption may occur.
Note that in addition we do not know whether libfuse-internal structures
can cope with this nesting, and even if we did, we probably cannot rely
on it in the future. This is the main reason why we want to remove
libfuse from the I/O path.
I do not have a good reproducer for this other than:
$ dd if=/dev/urandom of=image bs=1M count=4096
$ dd if=/dev/zero of=copy bs=1M count=4096
$ touch fuse-export
$ qemu-storage-daemon \
--blockdev file,node-name=file,filename=copy \
--export \
fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
&
Other shell:
$ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
$ killall -SIGINT qemu-storage-daemon
$ qemu-img compare image copy
Content mismatch at offset 0!
(The -t none in qemu-img convert is important.)
I tried reproducing this with throttle and small aio_write requests from
another qemu-io instance, but for some reason all requests are perfectly
serialized then.
I think in theory we should get parallel writes only if we set
fi->parallel_direct_writes in fuse_open(). In fact, I can confirm that
if we do that, the throttle-based reproducer works (i.e. it does get
parallel (nested) write requests). I have no idea why we still get
parallel requests with qemu-img convert anyway.
Also, a later patch in this series will set fi->parallel_direct_writes
and note that it makes basically no difference when running fio on the
current libfuse-based version of our code. It does make a difference
without libfuse. So something quite fishy is going on.
I will try to investigate further what the root cause is, but I think
for now let's assume that calling blk_pwrite() can invalidate the buffer
contents through nested polling.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 465cc9891d..a12f479492 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -301,6 +301,12 @@ static void read_from_fuse_export(void *opaque)
goto out;
}
+ /*
+ * Note that polling in any request-processing function can lead to a nested
+ * read_from_fuse_export() call, which will overwrite the contents of
+ * exp->fuse_buf. Anything that takes a buffer needs to take care that the
+ * content is copied before potentially polling.
+ */
fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
out:
@@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
size_t size, off_t offset, struct fuse_file_info *fi)
{
FuseExport *exp = fuse_req_userdata(req);
+ void *copied;
int64_t length;
int ret;
@@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
return;
}
+ /*
+ * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
+ * I/O function may do), read_from_fuse_export() may be nested, overwriting
+ * the request buffer content. Therefore, we must copy it here.
+ */
+ copied = blk_blockalign(exp->common.blk, size);
+ memcpy(copied, buf, size);
+
/**
* Clients will expect short writes at EOF, so we have to limit
* offset+size to the image length.
@@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
length = blk_getlength(exp->common.blk);
if (length < 0) {
fuse_reply_err(req, -length);
- return;
+ goto free_buffer;
}
if (offset + size > length) {
@@ -653,19 +668,22 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
fuse_reply_err(req, -ret);
- return;
+ goto free_buffer;
}
} else {
size = length - offset;
}
}
- ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+ ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
if (ret >= 0) {
fuse_reply_write(req, size);
} else {
fuse_reply_err(req, -ret);
}
+
+free_buffer:
+ qemu_vfree(copied);
}
/**
--
2.48.1
* [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
2025-03-25 16:06 ` [PATCH 01/15] fuse: Copy write buffer content before polling Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-26 5:47 ` Markus Armbruster
2025-03-27 14:51 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 03/15] fuse: Remove superfluous empty line Hanna Czenczek
` (12 subsequent siblings)
14 siblings, 2 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
When exports are created on the command line (with the storage daemon),
errp is going to point to error_fatal. Without ERRP_GUARD, we would
exit immediately when *errp is set, i.e. skip the clean-up code under
the `fail` label. Use ERRP_GUARD so we always run that code.
As far as I know, this has no actual impact right now[1], but it is
still better to make this right.
[1] Not cleaning up the mount point is the only thing I can imagine
would be problematic, but that is the last thing we attempt, so if
it fails, it will clean itself up.
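For reference, a simplified sketch of the pattern in question (not the actual
fuse_export_create() code; do_mount() and clean_up_partial_state() are
made-up helpers):

#include "qapi/error.h"

static int do_mount(void);               /* hypothetical helper */
static void clean_up_partial_state(void); /* hypothetical helper */

/*
 * With errp pointing to error_fatal and no ERRP_GUARD(), error_setg() exits
 * the process immediately, so the code under 'fail' never runs.  ERRP_GUARD()
 * redirects errp to a local Error object that is only propagated back (and
 * error_fatal thus only acted upon) when the function returns.
 */
static int create_something(Error **errp)
{
    ERRP_GUARD();
    int ret;

    ret = do_mount();
    if (ret < 0) {
        error_setg(errp, "Mounting failed");
        goto fail;
    }
    return 0;

fail:
    clean_up_partial_state(); /* now runs even with error_fatal */
    return ret;
}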
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index a12f479492..7c035dd6ca 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -119,6 +119,7 @@ static int fuse_export_create(BlockExport *blk_exp,
BlockExportOptions *blk_exp_args,
Error **errp)
{
+ ERRP_GUARD(); /* ensure clean-up even with error_fatal */
FuseExport *exp = container_of(blk_exp, FuseExport, common);
BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
int ret;
--
2.48.1
* [PATCH 03/15] fuse: Remove superfluous empty line
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
2025-03-25 16:06 ` [PATCH 01/15] fuse: Copy write buffer content before polling Hanna Czenczek
2025-03-25 16:06 ` [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 14:53 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 04/15] fuse: Explicitly set inode ID to 1 Hanna Czenczek
` (11 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 7c035dd6ca..17ad1d7b90 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -464,7 +464,6 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
}
if (add_resize_perm) {
-
if (!qemu_in_main_thread()) {
/* Changing permissions like below only works in the main thread */
return -EPERM;
--
2.48.1
* [PATCH 04/15] fuse: Explicitly set inode ID to 1
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (2 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 03/15] fuse: Remove superfluous empty line Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 14:54 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 05/15] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
` (10 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Setting .st_ino to the FUSE inode ID is kind of arbitrary. While in
practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because
we only have the root inode, that is not obvious in fuse_getattr().
Just explicitly set it to 1 (i.e. no functional change).
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 17ad1d7b90..10606454c3 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -432,7 +432,7 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
}
statbuf = (struct stat) {
- .st_ino = inode,
+ .st_ino = 1,
.st_mode = exp->st_mode,
.st_nlink = 1,
.st_uid = exp->st_uid,
--
2.48.1
* [PATCH 05/15] fuse: Change setup_... to mount_fuse_export()
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (3 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 04/15] fuse: Explicitly set inode ID to 1 Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 14:55 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 06/15] fuse: Fix mount options Hanna Czenczek
` (9 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
There is no clear separation between what should go into
setup_fuse_export() and what should stay in fuse_export_create().
Make it clear that setup_fuse_export() is for mounting only. Rename it,
and move everything that has nothing to do with mounting up into
fuse_export_create().
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 49 ++++++++++++++++++++-------------------------
1 file changed, 22 insertions(+), 27 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 10606454c3..7bdec43b5c 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -72,8 +72,7 @@ static void fuse_export_delete(BlockExport *exp);
static void init_exports_table(void);
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
- bool allow_other, Error **errp);
+static int mount_fuse_export(FuseExport *exp, Error **errp);
static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
@@ -193,23 +192,32 @@ static int fuse_export_create(BlockExport *blk_exp,
exp->st_gid = getgid();
if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
- /* Ignore errors on our first attempt */
- ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
- exp->allow_other = ret == 0;
+ /* Try allow_other == true first, ignore errors */
+ exp->allow_other = true;
+ ret = mount_fuse_export(exp, NULL);
if (ret < 0) {
- ret = setup_fuse_export(exp, args->mountpoint, false, errp);
+ exp->allow_other = false;
+ ret = mount_fuse_export(exp, errp);
}
} else {
exp->allow_other = args->allow_other == FUSE_EXPORT_ALLOW_OTHER_ON;
- ret = setup_fuse_export(exp, args->mountpoint, exp->allow_other, errp);
+ ret = mount_fuse_export(exp, errp);
}
if (ret < 0) {
goto fail;
}
+ g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
+
+ aio_set_fd_handler(exp->common.ctx,
+ fuse_session_fd(exp->fuse_session),
+ read_from_fuse_export, NULL, NULL, NULL, exp);
+ exp->fd_handler_set_up = true;
+
return 0;
fail:
+ fuse_export_shutdown(blk_exp);
fuse_export_delete(blk_exp);
return ret;
}
@@ -227,10 +235,10 @@ static void init_exports_table(void)
}
/**
- * Create exp->fuse_session and mount it.
+ * Create exp->fuse_session and mount it. Expects exp->mountpoint,
+ * exp->writable, and exp->allow_other to be set as intended for the mount.
*/
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
- bool allow_other, Error **errp)
+static int mount_fuse_export(FuseExport *exp, Error **errp)
{
const char *fuse_argv[4];
char *mount_opts;
@@ -243,7 +251,7 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
*/
mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
FUSE_MAX_BOUNCE_BYTES,
- allow_other ? ",allow_other" : "");
+ exp->allow_other ? ",allow_other" : "");
fuse_argv[0] = ""; /* Dummy program name */
fuse_argv[1] = "-o";
@@ -256,30 +264,17 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
g_free(mount_opts);
if (!exp->fuse_session) {
error_setg(errp, "Failed to set up FUSE session");
- ret = -EIO;
- goto fail;
+ return -EIO;
}
- ret = fuse_session_mount(exp->fuse_session, mountpoint);
+ ret = fuse_session_mount(exp->fuse_session, exp->mountpoint);
if (ret < 0) {
error_setg(errp, "Failed to mount FUSE session to export");
- ret = -EIO;
- goto fail;
+ return -EIO;
}
exp->mounted = true;
- g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
-
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
- exp->fd_handler_set_up = true;
-
return 0;
-
-fail:
- fuse_export_shutdown(&exp->common);
- return ret;
}
/**
--
2.48.1
* [PATCH 06/15] fuse: Fix mount options
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (4 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 05/15] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 14:58 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 07/15] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
` (8 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Since I actually took a look into how mounting with libfuse works[1], I
now know that the FUSE mount options are not exactly standard mount
system call options. Specifically:
- We should add "nosuid,nodev,noatime" because that is going to be
translated into the respective MS_ mount flags; and those flags make
sense for us.
- We can set rw/ro to make the mount writable or not. It makes sense to
set this flag to produce a better error message for read-only exports
(EROFS instead of EACCES).
This changes behavior as can be seen in iotest 308: It is no longer
possible to modify metadata of read-only exports.
In addition, in the comment, we can note that the FUSE mount() system
call actually expects some more parameters that we can omit because
fusermount3 (i.e. libfuse) will figure them out by itself:
- fd: /dev/fuse fd
- rootmode: Inode mode of the root node
- user_id/group_id: Mounter's UID/GID
[1] It invokes fusermount3, an SUID libfuse helper program, which parses
and processes some mount options before actually invoking the
mount() system call.
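(For illustration: for a writable export with allow_other enabled, the option
string built by mount_fuse_export() after this patch looks roughly like the
following, with max_read being FUSE_MAX_BOUNCE_BYTES, i.e. 64 MiB:

    rw,nosuid,nodev,noatime,max_read=67108864,default_permissions,allow_other

fusermount3 then turns the first four entries into MS_* mount flags and
appends the fd=, rootmode=, user_id= and group_id= options before issuing the
actual mount() call.)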
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 14 +++++++++++---
tests/qemu-iotests/308 | 4 ++--
tests/qemu-iotests/308.out | 3 ++-
3 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 7bdec43b5c..0d20995a0e 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -246,10 +246,18 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
int ret;
/*
- * max_read needs to match what fuse_init() sets.
- * max_write need not be supplied.
+ * Note that these mount options differ from what we would pass to a direct
+ * mount() call:
+ * - nosuid, nodev, and noatime are not understood by the kernel; libfuse
+ * uses those options to construct the mount flags (MS_*)
+ * - The FUSE kernel driver requires additional options (fd, rootmode,
+ * user_id, group_id); these will be set by libfuse.
+ * Note that max_read is set here, while max_write is set via the FUSE INIT
+ * operation.
*/
- mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
+ mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
+ "default_permissions%s",
+ exp->writable ? "rw" : "ro",
FUSE_MAX_BOUNCE_BYTES,
exp->allow_other ? ",allow_other" : "");
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index ea81dc496a..266b109ff3 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -177,7 +177,7 @@ stat -c 'Permissions pre-chmod: %a' "$EXT_MP"
chmod u+w "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
stat -c 'Permissions post-+w: %a' "$EXT_MP"
-# But that we can set, say, +x (if we are so inclined)
+# Same for other flags, like, say +x
chmod u+x "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
stat -c 'Permissions post-+x: %a' "$EXT_MP"
@@ -235,7 +235,7 @@ output=$($QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" 2>&1 \
# Expected reference output: Opening the file fails because it has no
# write permission
-reference="Could not open 'TEST_DIR/t.IMGFMT': Permission denied"
+reference="Could not open 'TEST_DIR/t.IMGFMT': Read-only file system"
if echo "$output" | grep -q "$reference"; then
echo "Writing to read-only export failed: OK"
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index e5e233691d..aa96faab6d 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -53,7 +53,8 @@ Images are identical.
Permissions pre-chmod: 400
chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
Permissions post-+w: 400
-Permissions post-+x: 500
+chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
+Permissions post-+x: 400
=== Mount over existing file ===
{'execute': 'block-export-add',
--
2.48.1
* [PATCH 07/15] fuse: Set direct_io and parallel_direct_writes
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (5 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 06/15] fuse: Fix mount options Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:09 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
` (7 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
In fuse_open(), set these flags:
- direct_io: We probably don't actually want the host page cache to be
used for our exports. QEMU block exports are supposed to represent the
image as-is (and that image is potentially changing).
This causes a change in iotest 308's reference output.
- parallel_direct_writes: We can (now) cope with parallel writes, so we
should set this flag. For some reason, it doesn't seem to make an
actual performance difference with libfuse, but it does make a
difference without it, so let's set it.
(See "fuse: Copy write buffer content before polling" for further
discussion.)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 2 ++
tests/qemu-iotests/308.out | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 0d20995a0e..2df6297d61 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -576,6 +576,8 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
static void fuse_open(fuse_req_t req, fuse_ino_t inode,
struct fuse_file_info *fi)
{
+ fi->direct_io = true;
+ fi->parallel_direct_writes = true;
fuse_reply_open(req, fi);
}
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index aa96faab6d..2d7a38d63d 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -131,7 +131,7 @@ wrote 65536/65536 bytes at offset 1048576
--- Try growing non-growable export ---
(OK: Lengths of export and original are the same)
-dd: error writing 'TEST_DIR/t.IMGFMT.fuse': Input/output error
+dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
1+0 records in
0+0 records out
--
2.48.1
* [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers()
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (6 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 07/15] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:12 ` Stefan Hajnoczi
2025-04-01 13:55 ` Eric Blake
2025-03-25 16:06 ` [PATCH 09/15] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
` (6 subsequent siblings)
14 siblings, 2 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Pull setting up and tearing down the AIO context handlers into two
dedicated functions.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 2df6297d61..bd98809d71 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
-static void fuse_export_drained_begin(void *opaque)
+static void fuse_attach_handlers(FuseExport *exp)
{
- FuseExport *exp = opaque;
+ aio_set_fd_handler(exp->common.ctx,
+ fuse_session_fd(exp->fuse_session),
+ read_from_fuse_export, NULL, NULL, NULL, exp);
+ exp->fd_handler_set_up = true;
+}
+static void fuse_detach_handlers(FuseExport *exp)
+{
aio_set_fd_handler(exp->common.ctx,
fuse_session_fd(exp->fuse_session),
NULL, NULL, NULL, NULL, NULL);
exp->fd_handler_set_up = false;
}
+static void fuse_export_drained_begin(void *opaque)
+{
+ fuse_detach_handlers(opaque);
+}
+
static void fuse_export_drained_end(void *opaque)
{
FuseExport *exp = opaque;
/* Refresh AioContext in case it changed */
exp->common.ctx = blk_get_aio_context(exp->common.blk);
-
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
- exp->fd_handler_set_up = true;
+ fuse_attach_handlers(exp);
}
static bool fuse_export_drained_poll(void *opaque)
@@ -209,11 +216,7 @@ static int fuse_export_create(BlockExport *blk_exp,
g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
- exp->fd_handler_set_up = true;
-
+ fuse_attach_handlers(exp);
return 0;
fail:
@@ -329,10 +332,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
fuse_session_exit(exp->fuse_session);
if (exp->fd_handler_set_up) {
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- NULL, NULL, NULL, NULL, NULL);
- exp->fd_handler_set_up = false;
+ fuse_detach_handlers(exp);
}
}
--
2.48.1
* [PATCH 09/15] fuse: Introduce fuse_{inc,dec}_in_flight()
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (7 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:13 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 10/15] fuse: Add halted flag Hanna Czenczek
` (5 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
This is how vduse-blk.c does it, and it does seem better to have
dedicated functions for it.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index bd98809d71..e50dd91d3e 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,6 +78,25 @@ static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
+static void fuse_inc_in_flight(FuseExport *exp)
+{
+ if (qatomic_fetch_inc(&exp->in_flight) == 0) {
+ /* Prevent export from being deleted */
+ blk_exp_ref(&exp->common);
+ }
+}
+
+static void fuse_dec_in_flight(FuseExport *exp)
+{
+ if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+ /* Wake AIO_WAIT_WHILE() */
+ aio_wait_kick();
+
+ /* Now the export can be deleted */
+ blk_exp_unref(&exp->common);
+ }
+}
+
static void fuse_attach_handlers(FuseExport *exp)
{
aio_set_fd_handler(exp->common.ctx,
@@ -297,9 +316,7 @@ static void read_from_fuse_export(void *opaque)
FuseExport *exp = opaque;
int ret;
- blk_exp_ref(&exp->common);
-
- qatomic_inc(&exp->in_flight);
+ fuse_inc_in_flight(exp);
do {
ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
@@ -317,11 +334,7 @@ static void read_from_fuse_export(void *opaque)
fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
out:
- if (qatomic_fetch_dec(&exp->in_flight) == 1) {
- aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
- }
-
- blk_exp_unref(&exp->common);
+ fuse_dec_in_flight(exp);
}
static void fuse_export_shutdown(BlockExport *blk_exp)
--
2.48.1
* [PATCH 10/15] fuse: Add halted flag
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (8 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 09/15] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:15 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 11/15] fuse: Manually process requests (without libfuse) Hanna Czenczek
` (4 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
This is a flag that we will want when processing FUSE requests
ourselves: When the kernel sends us e.g. a truncated request (i.e. we
receive less data than the request's indicated length), we cannot rely
on subsequent data to be valid. In that case, we set this flag, halting
all FUSE request processing.
We plan to only use this flag in cases that would effectively be kernel
bugs.
(Right now, the flag is unused because libfuse still does our request
processing.)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index e50dd91d3e..3dd50badb3 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -53,6 +53,13 @@ typedef struct FuseExport {
unsigned int in_flight; /* atomic */
bool mounted, fd_handler_set_up;
+ /*
+ * Set when there was an unrecoverable error and no requests should be read
+ * from the device anymore (basically only in case of something we would
+ * consider a kernel bug)
+ */
+ bool halted;
+
char *mountpoint;
bool writable;
bool growable;
@@ -69,6 +76,7 @@ static const struct fuse_lowlevel_ops fuse_ops;
static void fuse_export_shutdown(BlockExport *exp);
static void fuse_export_delete(BlockExport *exp);
+static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
static void init_exports_table(void);
@@ -99,6 +107,10 @@ static void fuse_dec_in_flight(FuseExport *exp)
static void fuse_attach_handlers(FuseExport *exp)
{
+ if (exp->halted) {
+ return;
+ }
+
aio_set_fd_handler(exp->common.ctx,
fuse_session_fd(exp->fuse_session),
read_from_fuse_export, NULL, NULL, NULL, exp);
@@ -316,6 +328,10 @@ static void read_from_fuse_export(void *opaque)
FuseExport *exp = opaque;
int ret;
+ if (unlikely(exp->halted)) {
+ return;
+ }
+
fuse_inc_in_flight(exp);
do {
@@ -374,6 +390,20 @@ static void fuse_export_delete(BlockExport *blk_exp)
g_free(exp->mountpoint);
}
+/**
+ * Halt the export: Detach FD handlers, and set exp->halted to true, preventing
+ * fuse_attach_handlers() from re-attaching them, therefore stopping all further
+ * request processing.
+ *
+ * Call this function when an unrecoverable error happens that makes processing
+ * all future requests unreliable.
+ */
+static void fuse_export_halt(FuseExport *exp)
+{
+ exp->halted = true;
+ fuse_detach_handlers(exp);
+}
+
/**
* Check whether @path points to a regular file. If not, put an
* appropriate message into *errp.
--
2.48.1
* [PATCH 11/15] fuse: Manually process requests (without libfuse)
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (9 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 10/15] fuse: Add halted flag Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:35 ` Stefan Hajnoczi
2025-04-01 14:35 ` Eric Blake
2025-03-25 16:06 ` [PATCH 12/15] fuse: Reduce max read size Hanna Czenczek
` (3 subsequent siblings)
14 siblings, 2 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Manually read requests from the /dev/fuse FD and process them, without
using libfuse. This allows us to safely add parallel request processing
in coroutines later, without having to worry about libfuse internals.
(Technically, we already have exactly that problem with
read_from_fuse_export()/read_from_fuse_fd() nesting.)
We will continue to use libfuse for mounting the filesystem; fusermount3
is effectively a libfuse helper program, so it should know best how
to interact with it. (Doing it manually without libfuse, while doable,
is a bit of a pain, and it is not clear to me how stable the "protocol"
actually is.)
Take this opportunity of quite a major rewrite to update the Copyright
line with corrected information that has surfaced in the meantime.
Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
except 'sync', which are iodepth=1 and pvsync2):
file:
read:
seq aio: 78.6k ±1.3k IOPS
rand aio: 39.3k ±2.9k
seq sync: 32.5k ±0.7k
rand sync: 9.9k ±0.1k
write:
seq aio: 61.9k ±0.5k
rand aio: 61.2k ±0.6k
seq sync: 27.9k ±0.2k
rand sync: 27.6k ±0.4k
null:
read:
seq aio: 214.0k ±5.9k
rand aio: 212.7k ±4.5k
seq sync: 90.3k ±6.5k
rand sync: 89.7k ±5.1k
write:
seq aio: 203.9k ±1.5k
rand aio: 201.4k ±3.6k
seq sync: 86.1k ±6.2k
rand sync: 84.9k ±5.3k
And with this patch applied:
file:
read:
seq aio: 76.6k ±1.8k (- 3 %)
rand aio: 26.7k ±0.4k (-32 %)
seq sync: 47.7k ±1.2k (+47 %)
rand sync: 10.1k ±0.2k (+ 2 %)
write:
seq aio: 58.1k ±0.5k (- 6 %)
rand aio: 58.1k ±0.5k (- 5 %)
seq sync: 36.3k ±0.3k (+30 %)
rand sync: 36.1k ±0.4k (+31 %)
null:
read:
seq aio: 268.4k ±3.4k (+25 %)
rand aio: 265.3k ±2.1k (+25 %)
seq sync: 134.3k ±2.7k (+49 %)
rand sync: 132.4k ±1.4k (+48 %)
write:
seq aio: 275.3k ±1.7k (+35 %)
rand aio: 272.3k ±1.9k (+35 %)
seq sync: 130.7k ±1.6k (+52 %)
rand sync: 127.4k ±2.4k (+50 %)
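(These look like fio parameters; an invocation matching them would be roughly
as follows, with /path/to/export standing in for the respective export's
mount point, --rw set to read/randread/write/randwrite for the respective
rows, and the 'sync' rows using --ioengine=pvsync2 --iodepth=1 instead:)

$ fio --name=rand-aio --filename=/path/to/export --direct=1 --rw=randread \
      --bs=4k --ioengine=libaio --iodepth=16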
So clearly the AIO file results are actually not good, and random reads
are indeed quite terrible. On the other hand, we can see from the sync
and null results that request handling should in theory be quicker. How
does this fit together?
I believe the bad AIO results are an artifact of the accidental parallel
request processing we have due to nested polling: Depending on how the
actual request processing is structured and how long request processing
takes, more or less requests will be submitted in parallel. So because
of the restructuring, I think this patch accidentally changes how many
requests end up being submitted in parallel, which decreases
performance.
(I have seen something like this before: In RSD, without having
implemented a polling mode, the debug build tended to have better
performance than the more optimized release build, because the debug
build, taking longer to submit requests, ended up processing more
requests in parallel.)
In any case, once we use coroutines throughout the code, performance
will improve again across the board.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 793 +++++++++++++++++++++++++++++++-------------
1 file changed, 567 insertions(+), 226 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 3dd50badb3..407b101018 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -1,7 +1,7 @@
/*
* Present a block device as a raw image through FUSE
*
- * Copyright (c) 2020 Max Reitz <mreitz@redhat.com>
+ * Copyright (c) 2020, 2025 Hanna Czenczek <hreitz@redhat.com>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -27,12 +27,15 @@
#include "block/qapi.h"
#include "qapi/error.h"
#include "qapi/qapi-commands-block.h"
+#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "system/block-backend.h"
#include <fuse.h>
#include <fuse_lowlevel.h>
+#include "standard-headers/linux/fuse.h"
+
#if defined(CONFIG_FALLOCATE_ZERO_RANGE)
#include <linux/falloc.h>
#endif
@@ -42,17 +45,27 @@
#endif
/* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
-
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+/* Small enough to fit in the request buffer */
+#define FUSE_MAX_WRITE_BYTES (4 * 1024)
typedef struct FuseExport {
BlockExport common;
struct fuse_session *fuse_session;
- struct fuse_buf fuse_buf;
unsigned int in_flight; /* atomic */
bool mounted, fd_handler_set_up;
+ /*
+ * The request buffer must be able to hold a full write, and/or at least
+ * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
+ */
+ char request_buf[MAX_CONST(
+ sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
+ FUSE_MAX_WRITE_BYTES,
+ FUSE_MIN_READ_BUFFER
+ )];
+
/*
* Set when there was an unrecoverable error and no requests should be read
* from the device anymore (basically only in case of something we would
@@ -60,6 +73,8 @@ typedef struct FuseExport {
*/
bool halted;
+ int fuse_fd;
+
char *mountpoint;
bool writable;
bool growable;
@@ -72,19 +87,20 @@ typedef struct FuseExport {
} FuseExport;
static GHashTable *exports;
-static const struct fuse_lowlevel_ops fuse_ops;
static void fuse_export_shutdown(BlockExport *exp);
static void fuse_export_delete(BlockExport *exp);
-static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
+static void fuse_export_halt(FuseExport *exp);
static void init_exports_table(void);
static int mount_fuse_export(FuseExport *exp, Error **errp);
-static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
+static bool poll_fuse_fd(void *opaque);
+static void read_fuse_fd(void *opaque);
+static void fuse_process_request(FuseExport *exp);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -105,22 +121,27 @@ static void fuse_dec_in_flight(FuseExport *exp)
}
}
+/**
+ * Attach FUSE FD read and poll handlers.
+ */
static void fuse_attach_handlers(FuseExport *exp)
{
if (exp->halted) {
return;
}
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
+ aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
+ read_fuse_fd, NULL, poll_fuse_fd,
+ read_fuse_fd, exp);
exp->fd_handler_set_up = true;
}
+/**
+ * Detach FUSE FD read and poll handlers.
+ */
static void fuse_detach_handlers(FuseExport *exp)
{
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
+ aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
NULL, NULL, NULL, NULL, NULL);
exp->fd_handler_set_up = false;
}
@@ -247,6 +268,14 @@ static int fuse_export_create(BlockExport *blk_exp,
g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
+ exp->fuse_fd = fuse_session_fd(exp->fuse_session);
+ ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
+ if (ret < 0) {
+ ret = -errno;
+ error_setg_errno(errp, errno, "Failed to make FUSE FD non-blocking");
+ goto fail;
+ }
+
fuse_attach_handlers(exp);
return 0;
@@ -292,7 +321,7 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
"default_permissions%s",
exp->writable ? "rw" : "ro",
- FUSE_MAX_BOUNCE_BYTES,
+ FUSE_MAX_READ_BYTES,
exp->allow_other ? ",allow_other" : "");
fuse_argv[0] = ""; /* Dummy program name */
@@ -301,8 +330,8 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
fuse_argv[3] = NULL;
fuse_args = (struct fuse_args)FUSE_ARGS_INIT(3, (char **)fuse_argv);
- exp->fuse_session = fuse_session_new(&fuse_args, &fuse_ops,
- sizeof(fuse_ops), exp);
+ /* We just create the session for mounting/unmounting, no need to set ops */
+ exp->fuse_session = fuse_session_new(&fuse_args, NULL, 0, NULL);
g_free(mount_opts);
if (!exp->fuse_session) {
error_setg(errp, "Failed to set up FUSE session");
@@ -320,55 +349,94 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
}
/**
- * Callback to be invoked when the FUSE session FD can be read from.
- * (This is basically the FUSE event loop.)
+ * Try to read a single request from the FUSE FD.
+ * If a request is available, process it, and return true.
+ * Otherwise, return false.
*/
-static void read_from_fuse_export(void *opaque)
+static bool read_from_fuse_fd(void *opaque)
{
FuseExport *exp = opaque;
- int ret;
+ int fuse_fd = exp->fuse_fd;
+ ssize_t ret;
+ const struct fuse_in_header *in_hdr;
+
+ fuse_inc_in_flight(exp);
if (unlikely(exp->halted)) {
- return;
+ goto no_request;
}
- fuse_inc_in_flight(exp);
+ ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
+ sizeof(exp->request_buf)));
+ if (ret < 0 && errno == EAGAIN) {
+ /* No request available */
+ goto no_request;
+ } else if (unlikely(ret < 0)) {
+ error_report("Failed to read from FUSE device: %s", strerror(-ret));
+ goto no_request;
+ }
- do {
- ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
- } while (ret == -EINTR);
- if (ret < 0) {
- goto out;
+ if (unlikely(ret < sizeof(*in_hdr))) {
+ error_report("Incomplete read from FUSE device, expected at least %zu "
+ "bytes, read %zi bytes; cannot trust subsequent "
+ "requests, halting the export",
+ sizeof(*in_hdr), ret);
+ fuse_export_halt(exp);
+ goto no_request;
}
- /*
- * Note that polling in any request-processing function can lead to a nested
- * read_from_fuse_export() call, which will overwrite the contents of
- * exp->fuse_buf. Anything that takes a buffer needs to take care that the
- * content is copied before potentially polling.
- */
- fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
+ in_hdr = (const struct fuse_in_header *)exp->request_buf;
+ if (unlikely(ret != in_hdr->len)) {
+ error_report("Number of bytes read from FUSE device does not match "
+ "request size, expected %" PRIu32 " bytes, read %zi "
+ "bytes; cannot trust subsequent requests, halting the "
+ "export",
+ in_hdr->len, ret);
+ fuse_export_halt(exp);
+ goto no_request;
+ }
+
+ fuse_process_request(exp);
+ fuse_dec_in_flight(exp);
+ return true;
-out:
+no_request:
fuse_dec_in_flight(exp);
+ return false;
+}
+
+/**
+ * Check the FUSE FD for whether it is readable or not. Because we cannot
+ * reasonably do this without reading a request at the same time, also read and
+ * process that request if any.
+ * (To be used as a poll handler for the FUSE FD.)
+ */
+static bool poll_fuse_fd(void *opaque)
+{
+ return read_from_fuse_fd(opaque);
+}
+
+/**
+ * Read a request from the FUSE FD.
+ * (To be used as a handler for when the FUSE FD becomes readable.)
+ */
+static void read_fuse_fd(void *opaque)
+{
+ read_from_fuse_fd(opaque);
}
static void fuse_export_shutdown(BlockExport *blk_exp)
{
FuseExport *exp = container_of(blk_exp, FuseExport, common);
- if (exp->fuse_session) {
- fuse_session_exit(exp->fuse_session);
-
- if (exp->fd_handler_set_up) {
- fuse_detach_handlers(exp);
- }
+ if (exp->fd_handler_set_up) {
+ fuse_detach_handlers(exp);
}
if (exp->mountpoint) {
/*
- * Safe to drop now, because we will not handle any requests
- * for this export anymore anyway.
+ * Safe to drop now, because we will not handle any requests for this
+ * export anymore anyway (at least not from the main thread).
*/
g_hash_table_remove(exports, exp->mountpoint);
}
@@ -386,7 +454,6 @@ static void fuse_export_delete(BlockExport *blk_exp)
fuse_session_destroy(exp->fuse_session);
}
- free(exp->fuse_buf.mem);
g_free(exp->mountpoint);
}
@@ -428,46 +495,57 @@ static bool is_regular_file(const char *path, Error **errp)
}
/**
- * A chance to set change some parameters supplied to FUSE_INIT.
+ * Process FUSE INIT.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_init(void *userdata, struct fuse_conn_info *conn)
+static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
+ uint32_t max_readahead, uint32_t flags)
{
- /*
- * MIN_NON_ZERO() would not be wrong here, but what we set here
- * must equal what has been passed to fuse_session_new().
- * Therefore, as long as max_read must be passed as a mount option
- * (which libfuse claims will be changed at some point), we have
- * to set max_read to a fixed value here.
- */
- conn->max_read = FUSE_MAX_BOUNCE_BYTES;
+ const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
- conn->max_write = MIN_NON_ZERO(BDRV_REQUEST_MAX_BYTES, conn->max_write);
-}
+ *out = (struct fuse_init_out) {
+ .major = FUSE_KERNEL_VERSION,
+ .minor = FUSE_KERNEL_MINOR_VERSION,
+ .max_readahead = max_readahead,
+ .max_write = FUSE_MAX_WRITE_BYTES,
+ .flags = flags & supported_flags,
+ .flags2 = 0,
-/**
- * Let clients look up files. Always return ENOENT because we only
- * care about the mountpoint itself.
- */
-static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
-{
- fuse_reply_err(req, ENOENT);
+ /* libfuse maximum: 2^16 - 1 */
+ .max_background = UINT16_MAX,
+
+ /* libfuse default: max_background * 3 / 4 */
+ .congestion_threshold = (int)UINT16_MAX * 3 / 4,
+
+ /* libfuse default: 1 */
+ .time_gran = 1,
+
+ /*
+ * probably unneeded without FUSE_MAX_PAGES, but this would be the
+ * libfuse default
+ */
+ .max_pages = DIV_ROUND_UP(FUSE_MAX_WRITE_BYTES,
+ qemu_real_host_page_size()),
+
+ /* Only needed for mappings (i.e. DAX) */
+ .map_alignment = 0,
+ };
+
+ return sizeof(*out);
}
/**
* Let clients get file attributes (i.e., stat() the file).
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
- struct fuse_file_info *fi)
+static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
{
- struct stat statbuf;
int64_t length, allocated_blocks;
time_t now = time(NULL);
- FuseExport *exp = fuse_req_userdata(req);
length = blk_getlength(exp->common.blk);
if (length < 0) {
- fuse_reply_err(req, -length);
- return;
+ return length;
}
allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
@@ -477,21 +555,24 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
}
- statbuf = (struct stat) {
- .st_ino = 1,
- .st_mode = exp->st_mode,
- .st_nlink = 1,
- .st_uid = exp->st_uid,
- .st_gid = exp->st_gid,
- .st_size = length,
- .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
- .st_blocks = allocated_blocks,
- .st_atime = now,
- .st_mtime = now,
- .st_ctime = now,
+ *out = (struct fuse_attr_out) {
+ .attr_valid = 1,
+ .attr = {
+ .ino = 1,
+ .mode = exp->st_mode,
+ .nlink = 1,
+ .uid = exp->st_uid,
+ .gid = exp->st_gid,
+ .size = length,
+ .blksize = blk_bs(exp->common.blk)->bl.request_alignment,
+ .blocks = allocated_blocks,
+ .atime = now,
+ .mtime = now,
+ .ctime = now,
+ },
};
- fuse_reply_attr(req, &statbuf, 1.);
+ return sizeof(*out);
}
static int fuse_do_truncate(const FuseExport *exp, int64_t size,
@@ -544,160 +625,149 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
* permit access: Read-only exports cannot be given +w, and exports
* without allow_other cannot be given a different UID or GID, and
* they cannot be given non-owner access.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
- int to_set, struct fuse_file_info *fi)
+static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
+ uint32_t to_set, uint64_t size, uint32_t mode,
+ uint32_t uid, uint32_t gid)
{
- FuseExport *exp = fuse_req_userdata(req);
int supported_attrs;
int ret;
- supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
+ /* SIZE and MODE are actually supported, the others can be safely ignored */
+ supported_attrs = FATTR_SIZE | FATTR_MODE |
+ FATTR_FH | FATTR_LOCKOWNER | FATTR_KILL_SUIDGID;
if (exp->allow_other) {
- supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
+ supported_attrs |= FATTR_UID | FATTR_GID;
}
if (to_set & ~supported_attrs) {
- fuse_reply_err(req, ENOTSUP);
- return;
+ return -ENOTSUP;
}
/* Do some argument checks first before committing to anything */
- if (to_set & FUSE_SET_ATTR_MODE) {
+ if (to_set & FATTR_MODE) {
/*
* Without allow_other, non-owners can never access the export, so do
* not allow setting permissions for them
*/
- if (!exp->allow_other &&
- (statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
- {
- fuse_reply_err(req, EPERM);
- return;
+ if (!exp->allow_other && (mode & (S_IRWXG | S_IRWXO)) != 0) {
+ return -EPERM;
}
/* +w for read-only exports makes no sense, disallow it */
- if (!exp->writable &&
- (statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
- {
- fuse_reply_err(req, EROFS);
- return;
+ if (!exp->writable && (mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0) {
+ return -EROFS;
}
}
- if (to_set & FUSE_SET_ATTR_SIZE) {
+ if (to_set & FATTR_SIZE) {
if (!exp->writable) {
- fuse_reply_err(req, EACCES);
- return;
+ return -EACCES;
}
- ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+ ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
}
- if (to_set & FUSE_SET_ATTR_MODE) {
+ if (to_set & FATTR_MODE) {
/* Ignore FUSE-supplied file type, only change the mode */
- exp->st_mode = (statbuf->st_mode & 07777) | S_IFREG;
+ exp->st_mode = (mode & 07777) | S_IFREG;
}
- if (to_set & FUSE_SET_ATTR_UID) {
- exp->st_uid = statbuf->st_uid;
+ if (to_set & FATTR_UID) {
+ exp->st_uid = uid;
}
- if (to_set & FUSE_SET_ATTR_GID) {
- exp->st_gid = statbuf->st_gid;
+ if (to_set & FATTR_GID) {
+ exp->st_gid = gid;
}
- fuse_getattr(req, inode, fi);
+ return fuse_getattr(exp, out);
}
/**
- * Let clients open a file (i.e., the exported image).
+ * Open an inode. We only have a single inode in our exported filesystem, so we
+ * just acknowledge the request.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_open(fuse_req_t req, fuse_ino_t inode,
- struct fuse_file_info *fi)
+static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
{
- fi->direct_io = true;
- fi->parallel_direct_writes = true;
- fuse_reply_open(req, fi);
+ *out = (struct fuse_open_out) {
+ .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
+ };
+ return sizeof(*out);
}
/**
- * Handle client reads from the exported image.
+ * Handle client reads from the exported image. Allocates *bufptr and reads
+ * data from the block device into that buffer.
+ * Returns the buffer (read) size on success, and -errno on error.
*/
-static void fuse_read(fuse_req_t req, fuse_ino_t inode,
- size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_read(FuseExport *exp, void **bufptr,
+ uint64_t offset, uint32_t size)
{
- FuseExport *exp = fuse_req_userdata(req);
- int64_t length;
+ int64_t blk_len;
void *buf;
int ret;
/* Limited by max_read, should not happen */
- if (size > FUSE_MAX_BOUNCE_BYTES) {
- fuse_reply_err(req, EINVAL);
- return;
+ if (size > FUSE_MAX_READ_BYTES) {
+ return -EINVAL;
}
/**
* Clients will expect short reads at EOF, so we have to limit
* offset+size to the image length.
*/
- length = blk_getlength(exp->common.blk);
- if (length < 0) {
- fuse_reply_err(req, -length);
- return;
+ blk_len = blk_getlength(exp->common.blk);
+ if (blk_len < 0) {
+ return blk_len;
}
- if (offset + size > length) {
- size = length - offset;
+ if (offset + size > blk_len) {
+ size = blk_len - offset;
}
buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
if (!buf) {
- fuse_reply_err(req, ENOMEM);
- return;
+ return -ENOMEM;
}
ret = blk_pread(exp->common.blk, offset, size, buf, 0);
- if (ret >= 0) {
- fuse_reply_buf(req, buf, size);
- } else {
- fuse_reply_err(req, -ret);
+ if (ret < 0) {
+ qemu_vfree(buf);
+ return ret;
}
- qemu_vfree(buf);
+ *bufptr = buf;
+ return size;
}
/**
- * Handle client writes to the exported image.
+ * Handle client writes to the exported image. @buf has the data to be written
+ * and will be copied to a bounce buffer before polling for the first time.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
- size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
+ uint64_t offset, uint32_t size, const void *buf)
{
- FuseExport *exp = fuse_req_userdata(req);
void *copied;
- int64_t length;
+ int64_t blk_len;
int ret;
/* Limited by max_write, should not happen */
if (size > BDRV_REQUEST_MAX_BYTES) {
- fuse_reply_err(req, EINVAL);
- return;
+ return -EINVAL;
}
if (!exp->writable) {
- fuse_reply_err(req, EACCES);
- return;
+ return -EACCES;
}
- /*
- * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
- * I/O function may do), read_from_fuse_export() may be nested, overwriting
- * the request buffer content. Therefore, we must copy it here.
- */
+ /* Must copy to bounce buffer before polling (to allow nesting) */
copied = blk_blockalign(exp->common.blk, size);
memcpy(copied, buf, size);
@@ -705,55 +775,57 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
* Clients will expect short writes at EOF, so we have to limit
* offset+size to the image length.
*/
- length = blk_getlength(exp->common.blk);
- if (length < 0) {
- fuse_reply_err(req, -length);
- goto free_buffer;
+ blk_len = blk_getlength(exp->common.blk);
+ if (blk_len < 0) {
+ ret = blk_len;
+ goto fail_free_buffer;
}
- if (offset + size > length) {
+ if (offset + size > blk_len) {
if (exp->growable) {
ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- goto free_buffer;
+ goto fail_free_buffer;
}
} else {
- size = length - offset;
+ size = blk_len - offset;
}
}
ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
- if (ret >= 0) {
- fuse_reply_write(req, size);
- } else {
- fuse_reply_err(req, -ret);
+ if (ret < 0) {
+ goto fail_free_buffer;
}
-free_buffer:
qemu_vfree(copied);
+
+ *out = (struct fuse_write_out) {
+ .size = size,
+ };
+ return sizeof(*out);
+
+fail_free_buffer:
+ qemu_vfree(copied);
+ return ret;
}
/**
* Let clients perform various fallocate() operations.
+ * Return 0 on success (no 'out' object), and -errno on error.
*/
-static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
- off_t offset, off_t length,
- struct fuse_file_info *fi)
+static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
+ uint32_t mode)
{
- FuseExport *exp = fuse_req_userdata(req);
int64_t blk_len;
int ret;
if (!exp->writable) {
- fuse_reply_err(req, EACCES);
- return;
+ return -EACCES;
}
blk_len = blk_getlength(exp->common.blk);
if (blk_len < 0) {
- fuse_reply_err(req, -blk_len);
- return;
+ return blk_len;
}
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -765,16 +837,14 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
if (!mode) {
/* We can only fallocate at the EOF with a truncate */
if (offset < blk_len) {
- fuse_reply_err(req, EOPNOTSUPP);
- return;
+ return -EOPNOTSUPP;
}
if (offset > blk_len) {
/* No preallocation needed here */
ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
}
@@ -784,8 +854,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
else if (mode & FALLOC_FL_PUNCH_HOLE) {
if (!(mode & FALLOC_FL_KEEP_SIZE)) {
- fuse_reply_err(req, EINVAL);
- return;
+ return -EINVAL;
}
do {
@@ -813,8 +882,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
ret = fuse_do_truncate(exp, offset + length, false,
PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
}
@@ -832,44 +900,38 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
ret = -EOPNOTSUPP;
}
- fuse_reply_err(req, ret < 0 ? -ret : 0);
+ return ret < 0 ? ret : 0;
}
/**
* Let clients fsync the exported image.
+ * Return 0 on success (no 'out' object), and -errno on error.
*/
-static void fuse_fsync(fuse_req_t req, fuse_ino_t inode, int datasync,
- struct fuse_file_info *fi)
+static ssize_t fuse_fsync(FuseExport *exp)
{
- FuseExport *exp = fuse_req_userdata(req);
- int ret;
-
- ret = blk_flush(exp->common.blk);
- fuse_reply_err(req, ret < 0 ? -ret : 0);
+ return blk_flush(exp->common.blk);
}
/**
* Called before an FD to the exported image is closed. (libfuse
* notes this to be a way to return last-minute errors.)
+ * Return 0 on success (no 'out' object), and -errno on error.
*/
-static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
- struct fuse_file_info *fi)
+static ssize_t fuse_flush(FuseExport *exp)
{
- fuse_fsync(req, inode, 1, fi);
+ return blk_flush(exp->common.blk);
}
#ifdef CONFIG_FUSE_LSEEK
/**
* Let clients inquire allocation status.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
- int whence, struct fuse_file_info *fi)
+static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+ uint64_t offset, uint32_t whence)
{
- FuseExport *exp = fuse_req_userdata(req);
-
if (whence != SEEK_HOLE && whence != SEEK_DATA) {
- fuse_reply_err(req, EINVAL);
- return;
+ return -EINVAL;
}
while (true) {
@@ -879,8 +941,7 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
offset, INT64_MAX, &pnum, NULL, NULL);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
if (!pnum && (ret & BDRV_BLOCK_EOF)) {
@@ -897,34 +958,38 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
blk_len = blk_getlength(exp->common.blk);
if (blk_len < 0) {
- fuse_reply_err(req, -blk_len);
- return;
+ return blk_len;
}
if (offset > blk_len || whence == SEEK_DATA) {
- fuse_reply_err(req, ENXIO);
- } else {
- fuse_reply_lseek(req, offset);
+ return -ENXIO;
}
- return;
+
+ *out = (struct fuse_lseek_out) {
+ .offset = offset,
+ };
+ return sizeof(*out);
}
if (ret & BDRV_BLOCK_DATA) {
if (whence == SEEK_DATA) {
- fuse_reply_lseek(req, offset);
- return;
+ *out = (struct fuse_lseek_out) {
+ .offset = offset,
+ };
+ return sizeof(*out);
}
} else {
if (whence == SEEK_HOLE) {
- fuse_reply_lseek(req, offset);
- return;
+ *out = (struct fuse_lseek_out) {
+ .offset = offset,
+ };
+ return sizeof(*out);
}
}
/* Safety check against infinite loops */
if (!pnum) {
- fuse_reply_err(req, ENXIO);
- return;
+ return -ENXIO;
}
offset += pnum;
@@ -932,21 +997,297 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
}
#endif
-static const struct fuse_lowlevel_ops fuse_ops = {
- .init = fuse_init,
- .lookup = fuse_lookup,
- .getattr = fuse_getattr,
- .setattr = fuse_setattr,
- .open = fuse_open,
- .read = fuse_read,
- .write = fuse_write,
- .fallocate = fuse_fallocate,
- .flush = fuse_flush,
- .fsync = fuse_fsync,
+/**
+ * Write a FUSE response to the given @fd, using a single buffer consecutively
+ * containing both the response header and data: Initialize *out_hdr, and write
+ * it plus @response_data_length consecutive bytes to @fd.
+ *
+ * @fd: FUSE file descriptor
+ * @req_id: Corresponding request ID
+ * @out_hdr: Pointer to buffer that will hold the output header, and
+ * additionally already contains @response_data_length data bytes
+ * starting at *out_hdr + 1.
+ * @err: Error code (-errno, or 0 in case of success)
+ * @response_data_length: Length of data to return (following *out_hdr)
+ */
+static int fuse_write_response(int fd, uint32_t req_id,
+ struct fuse_out_header *out_hdr, int err,
+ size_t response_data_length)
+{
+ void *write_ptr = out_hdr;
+ size_t to_write = sizeof(*out_hdr) + response_data_length;
+ ssize_t ret;
+
+ *out_hdr = (struct fuse_out_header) {
+ .len = to_write,
+ .error = err,
+ .unique = req_id,
+ };
+
+ while (true) {
+ ret = RETRY_ON_EINTR(write(fd, write_ptr, to_write));
+ if (ret < 0) {
+ ret = -errno;
+ error_report("Failed to write to FUSE device: %s", strerror(-ret));
+ return ret;
+ } else {
+ to_write -= ret;
+ if (to_write > 0) {
+ write_ptr += ret;
+ } else {
+ return 0; /* success */
+ }
+ }
+ }
+}
+
+/**
+ * Write a FUSE response to the given @fd, using separate buffers for the
+ * response header and data: Initialize *out_hdr, and write it plus the data in
+ * *buf to @fd.
+ *
+ * In contrast to fuse_write_response(), this function does not take an error
+ * code: the FUSE reply header it writes always signals success (error code 0),
+ * so it can only be used for successful responses. (The function itself may
+ * still return -errno if writing to @fd fails.)
+ *
+ * @fd: FUSE file descriptor
+ * @req_id: Corresponding request ID
+ * @out_hdr: Pointer to buffer that will hold the output header
+ * @buf: Pointer to response data
+ * @buflen: Length of response data
+ */
+static int fuse_write_buf_response(int fd, uint32_t req_id,
+ struct fuse_out_header *out_hdr,
+ const void *buf, size_t buflen)
+{
+ struct iovec iov[2] = {
+ { out_hdr, sizeof(*out_hdr) },
+ { (void *)buf, buflen },
+ };
+ struct iovec *iovp = iov;
+ unsigned iov_count = ARRAY_SIZE(iov);
+ size_t to_write = sizeof(*out_hdr) + buflen;
+ ssize_t ret;
+
+ *out_hdr = (struct fuse_out_header) {
+ .len = to_write,
+ .unique = req_id,
+ };
+
+ while (true) {
+ ret = RETRY_ON_EINTR(writev(fd, iovp, iov_count));
+ if (ret < 0) {
+ ret = -errno;
+ error_report("Failed to write to FUSE device: %s", strerror(-ret));
+ return ret;
+ } else {
+ to_write -= ret;
+ if (to_write > 0) {
+ iov_discard_front(&iovp, &iov_count, ret);
+ } else {
+ return 0; /* success */
+ }
+ }
+ }
+}
+
+/*
+ * For use in fuse_process_request():
+ * Returns a pointer to the parameter object for the given operation (inside of
+ * exp->request_buf, which is assumed to hold a fuse_in_header first).
+ * Verifies that the object is complete (exp->request_buf is large enough to
+ * hold it in one piece, and the request length includes the whole object).
+ *
+ * Note that exp->request_buf may be overwritten after polling, so the returned
+ * pointer must not be used across a function that may poll!
+ */
+#define FUSE_IN_OP_STRUCT(op_name, export) \
+ ({ \
+ const struct fuse_in_header *__in_hdr = \
+ (const struct fuse_in_header *)(export)->request_buf; \
+ const struct fuse_##op_name##_in *__in = \
+ (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
+ const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
+ uint32_t __req_len; \
+ \
+ QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
+ \
+ __req_len = __in_hdr->len; \
+ if (__req_len < __param_len) { \
+ warn_report("FUSE request truncated (%" PRIu32 " < %zu)", \
+ __req_len, __param_len); \
+ ret = -EINVAL; \
+ break; \
+ } \
+ __in; \
+ })
+
+/*
+ * For use in fuse_process_request():
+ * Returns a pointer to the return object for the given operation (inside of
+ * out_buf, which is assumed to hold a fuse_out_header first).
+ * Verifies that out_buf is large enough to hold the whole object.
+ *
+ * (out_buf should be a char[] array.)
+ */
+#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
+ ({ \
+ struct fuse_out_header *__out_hdr = \
+ (struct fuse_out_header *)(out_buf); \
+ struct fuse_##op_name##_out *__out = \
+ (struct fuse_##op_name##_out *)(__out_hdr + 1); \
+ \
+ QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
+ sizeof(out_buf)); \
+ \
+ __out; \
+ })
+
+/**
+ * Process a FUSE request, incl. writing the response.
+ *
+ * Note that polling in any request-processing function can lead to a nested
+ * read_from_fuse_fd() call, which will overwrite the contents of
+ * exp->request_buf. Anything that takes a buffer needs to take care that the
+ * content is copied before potentially polling.
+ */
+static void fuse_process_request(FuseExport *exp)
+{
+ uint32_t opcode;
+ uint64_t req_id;
+ /*
+ * Return buffer. Must be large enough to hold all return headers, but does
+ * not include space for data returned by read requests.
+ * (FUSE_OUT_OP_STRUCT() verifies at compile time that out_buf is indeed
+ * large enough.)
+ */
+ char out_buf[sizeof(struct fuse_out_header) +
+ MAX_CONST(sizeof(struct fuse_init_out),
+ MAX_CONST(sizeof(struct fuse_open_out),
+ MAX_CONST(sizeof(struct fuse_attr_out),
+ MAX_CONST(sizeof(struct fuse_write_out),
+ sizeof(struct fuse_lseek_out)))))];
+ struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf;
+ /* For read requests: Data to be returned */
+ void *out_data_buffer = NULL;
+ ssize_t ret;
+
+ /* Limit scope to ensure pointer is no longer used after polling */
+ {
+ const struct fuse_in_header *in_hdr =
+ (const struct fuse_in_header *)exp->request_buf;
+
+ opcode = in_hdr->opcode;
+ req_id = in_hdr->unique;
+ }
+
+ switch (opcode) {
+ case FUSE_INIT: {
+ const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
+ ret = fuse_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
+ in->max_readahead, in->flags);
+ break;
+ }
+
+ case FUSE_OPEN:
+ ret = fuse_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
+ break;
+
+ case FUSE_RELEASE:
+ ret = 0;
+ break;
+
+ case FUSE_LOOKUP:
+ ret = -ENOENT; /* There is no node but the root node */
+ break;
+
+ case FUSE_GETATTR:
+ ret = fuse_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
+ break;
+
+ case FUSE_SETATTR: {
+ const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
+ ret = fuse_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
+ in->valid, in->size, in->mode, in->uid, in->gid);
+ break;
+ }
+
+ case FUSE_READ: {
+ const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
+ ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+ break;
+ }
+
+ case FUSE_WRITE: {
+ const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
+ uint32_t req_len;
+
+ req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
+ if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
+ in->size)) {
+ warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
+ req_len - sizeof(struct fuse_in_header) - sizeof(*in),
+ in->size);
+ ret = -EINVAL;
+ break;
+ }
+
+ /*
+ * poll_fuse_fd() has checked that in_hdr->len matches the number of
+ * bytes read, which cannot exceed the max_write value we set
+ * (FUSE_MAX_WRITE_BYTES). So we know that FUSE_MAX_WRITE_BYTES >=
+ * in_hdr->len >= in->size + (size of the headers), so this assertion must hold.
+ */
+ assert(in->size <= FUSE_MAX_WRITE_BYTES);
+
+ /*
+ * Passing a pointer to `in` (i.e. the request buffer) is fine because
+ * fuse_write() takes care to copy its contents before potentially
+ * polling.
+ */
+ ret = fuse_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
+ in->offset, in->size, in + 1);
+ break;
+ }
+
+ case FUSE_FALLOCATE: {
+ const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
+ ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+ break;
+ }
+
+ case FUSE_FSYNC:
+ ret = fuse_fsync(exp);
+ break;
+
+ case FUSE_FLUSH:
+ ret = fuse_flush(exp);
+ break;
+
#ifdef CONFIG_FUSE_LSEEK
- .lseek = fuse_lseek,
+ case FUSE_LSEEK: {
+ const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
+ ret = fuse_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
+ in->offset, in->whence);
+ break;
+ }
#endif
-};
+
+ default:
+ ret = -ENOSYS;
+ }
+
+ /* Ignore errors from fuse_write*(), nothing we can do anyway */
+ if (out_data_buffer) {
+ assert(ret >= 0);
+ fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
+ out_data_buffer, ret);
+ qemu_vfree(out_data_buffer);
+ } else {
+ fuse_write_response(exp->fuse_fd, req_id, out_hdr,
+ ret < 0 ? ret : 0,
+ ret < 0 ? 0 : ret);
+ }
+}
const BlockExportDriver blk_exp_fuse = {
.type = BLOCK_EXPORT_TYPE_FUSE,
--
2.48.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 12/15] fuse: Reduce max read size
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (10 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 11/15] fuse: Manually process requests (without libfuse) Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:35 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 13/15] fuse: Process requests in coroutines Hanna Czenczek
` (2 subsequent siblings)
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
We are going to introduce parallel processing via coroutines. With that, a
maximum read size of 64 MB becomes problematic, because it allows users of
the export to force us to allocate quite large amounts of memory with just
a few requests.
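As a rough illustration (the request count here is an assumed number, not a
measurement): 16 concurrent 64 MB reads would let a client drive up to
16 * 64 MB = 1 GB of bounce-buffer allocations, whereas with a 1 MB cap the
same 16 requests need at most 16 MB.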
At least tone it down to 1 MB, which is still probably far more than
enough. (Larger requests are split automatically by the FUSE kernel
driver anyway.)
(Yes, we inadvertently already had parallel request processing due to
nested polling before. Better to fix this late than never.)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 407b101018..1b399eeab7 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -45,7 +45,7 @@
#endif
/* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
/* Small enough to fit in the request buffer */
#define FUSE_MAX_WRITE_BYTES (4 * 1024)
--
2.48.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 13/15] fuse: Process requests in coroutines
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (11 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 12/15] fuse: Reduce max read size Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:38 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 14/15] fuse: Implement multi-threading Hanna Czenczek
2025-03-25 16:06 ` [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
14 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
and have read_from_fuse_fd() launch it inside of a newly created
coroutine instead of running it synchronously. This way, we can process
requests in parallel.
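For context, the hand-off between the FD handler and the new coroutine boils
down to the following pattern (condensed from read_fuse_fd()/poll_fuse_fd()
in the diff below, so the names are the ones this patch introduces; this is a
sketch, not the full code):

    static void read_fuse_fd(void *opaque)
    {
        /*
         * Lives on this (non-coroutine) stack, so the coroutine must only
         * touch it before yielding for the first time
         */
        FuseRequestCoParam co_param = {
            .exp = opaque,
            .got_request = -EINPROGRESS, /* "not decided yet" marker */
        };
        Coroutine *co;

        co = qemu_coroutine_create(co_read_from_fuse_fd, &co_param);
        /* Run until the coroutine yields (e.g. in blk_co_pread()) or returns */
        qemu_coroutine_enter(co);

        /* got_request is guaranteed to be set before the first yield */
        assert(co_param.got_request != -EINPROGRESS);
    }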
These are the benchmark results, compared to (a) the original results
with libfuse, and (b) the results after switching away from libfuse
(i.e. before this patch):
file: (vs. libfuse / vs. no libfuse)
read:
seq aio: 120.6k ±1.1k (+ 53 % / + 58 %)
rand aio: 113.3k ±5.9k (+188 % / +325 %)
seq sync: 52.4k ±0.4k (+ 61 % / + 10 %)
rand sync: 10.4k ±0.4k (+ 6 % / + 3 %)
write:
seq aio: 79.8k ±0.8k (+ 29 % / + 37 %)
rand aio: 79.0k ±0.6k (+ 29 % / + 36 %)
seq sync: 41.5k ±0.3k (+ 49 % / + 15 %)
rand sync: 41.4k ±0.2k (+ 50 % / + 15 %)
null:
read:
seq aio: 266.1k ±1.5k (+ 24 % / - 1 %)
rand aio: 264.1k ±2.5k (+ 24 % / ± 0 %)
seq sync: 135.6k ±3.2k (+ 50 % / + 1 %)
rand sync: 134.7k ±3.0k (+ 50 % / + 2 %)
write:
seq aio: 281.0k ±1.8k (+ 38 % / + 2 %)
rand aio: 288.1k ±6.1k (+ 43 % / + 6 %)
seq sync: 142.2k ±3.1k (+ 65 % / + 9 %)
rand sync: 141.1k ±2.9k (+ 66 % / + 11 %)
So for non-AIO cases (and the null driver, which does not yield), there
is little change; but for file AIO, results greatly improve, resolving
the performance issue we saw before (when switching away from libfuse).
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 209 ++++++++++++++++++++++++++------------------
1 file changed, 126 insertions(+), 83 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 1b399eeab7..345e833171 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -27,6 +27,7 @@
#include "block/qapi.h"
#include "qapi/error.h"
#include "qapi/qapi-commands-block.h"
+#include "qemu/coroutine.h"
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "system/block-backend.h"
@@ -86,6 +87,12 @@ typedef struct FuseExport {
gid_t st_gid;
} FuseExport;
+/* Parameters to the request processing coroutine */
+typedef struct FuseRequestCoParam {
+ FuseExport *exp;
+ int got_request;
+} FuseRequestCoParam;
+
static GHashTable *exports;
static void fuse_export_shutdown(BlockExport *exp);
@@ -100,7 +107,7 @@ static bool is_regular_file(const char *path, Error **errp);
static bool poll_fuse_fd(void *opaque);
static void read_fuse_fd(void *opaque);
-static void fuse_process_request(FuseExport *exp);
+static void coroutine_fn fuse_co_process_request(FuseExport *exp);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -350,12 +357,20 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
/**
* Try to read a single request from the FUSE FD.
- * If a request is available, process it, and return true.
- * Otherwise, return false.
+ * Takes a FuseRequestCoParam object pointer in `opaque`.
+ *
+ * If a request is available, process it, and set FuseRequestCoParam.got_request
+ * to 1. Otherwise, set it to 0.
+ * (Not using a boolean allows callers to initialize it e.g. with -EINPROGRESS.)
+ *
+ * The FuseRequestCoParam object is only accessed until yielding for the first
+ * time, i.e. may be dropped by the caller after running the coroutine until it
+ * yields.
*/
-static bool read_from_fuse_fd(void *opaque)
+static void coroutine_fn co_read_from_fuse_fd(void *opaque)
{
- FuseExport *exp = opaque;
+ FuseRequestCoParam *co_param = opaque;
+ FuseExport *exp = co_param->exp;
int fuse_fd = exp->fuse_fd;
ssize_t ret;
const struct fuse_in_header *in_hdr;
@@ -396,13 +411,15 @@ static bool read_from_fuse_fd(void *opaque)
goto no_request;
}
- fuse_process_request(exp);
+ /* Must set this before yielding */
+ co_param->got_request = 1;
+ fuse_co_process_request(exp);
fuse_dec_in_flight(exp);
- return true;
+ return;
no_request:
+ co_param->got_request = 0;
fuse_dec_in_flight(exp);
- return false;
}
/**
@@ -413,7 +430,17 @@ no_request:
*/
static bool poll_fuse_fd(void *opaque)
{
- return read_from_fuse_fd(opaque);
+ Coroutine *co;
+ FuseRequestCoParam co_param = {
+ .exp = opaque,
+ .got_request = -EINPROGRESS,
+ };
+
+ co = qemu_coroutine_create(co_read_from_fuse_fd, &co_param);
+ qemu_coroutine_enter(co);
+ assert(co_param.got_request != -EINPROGRESS);
+
+ return co_param.got_request;
}
/**
@@ -422,7 +449,15 @@ static bool poll_fuse_fd(void *opaque)
*/
static void read_fuse_fd(void *opaque)
{
- read_from_fuse_fd(opaque);
+ Coroutine *co;
+ FuseRequestCoParam co_param = {
+ .exp = opaque,
+ .got_request = -EINPROGRESS,
+ };
+
+ co = qemu_coroutine_create(co_read_from_fuse_fd, &co_param);
+ qemu_coroutine_enter(co);
+ assert(co_param.got_request != -EINPROGRESS);
}
static void fuse_export_shutdown(BlockExport *blk_exp)
@@ -498,8 +533,9 @@ static bool is_regular_file(const char *path, Error **errp)
* Process FUSE INIT.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
- uint32_t max_readahead, uint32_t flags)
+static ssize_t coroutine_fn
+fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
+ uint32_t max_readahead, uint32_t flags)
{
const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
@@ -538,17 +574,18 @@ static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
* Let clients get file attributes (i.e., stat() the file).
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
+static ssize_t coroutine_fn
+fuse_co_getattr(FuseExport *exp, struct fuse_attr_out *out)
{
int64_t length, allocated_blocks;
time_t now = time(NULL);
- length = blk_getlength(exp->common.blk);
+ length = blk_co_getlength(exp->common.blk);
if (length < 0) {
return length;
}
- allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
+ allocated_blocks = bdrv_co_get_allocated_file_size(blk_bs(exp->common.blk));
if (allocated_blocks <= 0) {
allocated_blocks = DIV_ROUND_UP(length, 512);
} else {
@@ -575,8 +612,9 @@ static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
return sizeof(*out);
}
-static int fuse_do_truncate(const FuseExport *exp, int64_t size,
- bool req_zero_write, PreallocMode prealloc)
+static int coroutine_fn
+fuse_co_do_truncate(const FuseExport *exp, int64_t size, bool req_zero_write,
+ PreallocMode prealloc)
{
uint64_t blk_perm, blk_shared_perm;
BdrvRequestFlags truncate_flags = 0;
@@ -605,8 +643,8 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
}
}
- ret = blk_truncate(exp->common.blk, size, true, prealloc,
- truncate_flags, NULL);
+ ret = blk_co_truncate(exp->common.blk, size, true, prealloc,
+ truncate_flags, NULL);
if (add_resize_perm) {
/* Must succeed, because we are only giving up the RESIZE permission */
@@ -627,9 +665,9 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
* they cannot be given non-owner access.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
- uint32_t to_set, uint64_t size, uint32_t mode,
- uint32_t uid, uint32_t gid)
+static ssize_t coroutine_fn
+fuse_co_setattr(FuseExport *exp, struct fuse_attr_out *out, uint32_t to_set,
+ uint64_t size, uint32_t mode, uint32_t uid, uint32_t gid)
{
int supported_attrs;
int ret;
@@ -666,7 +704,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
return -EACCES;
}
- ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
return ret;
}
@@ -685,7 +723,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
exp->st_gid = gid;
}
- return fuse_getattr(exp, out);
+ return fuse_co_getattr(exp, out);
}
/**
@@ -693,7 +731,8 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
* just acknowledge the request.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
+static ssize_t coroutine_fn
+fuse_co_open(FuseExport *exp, struct fuse_open_out *out)
{
*out = (struct fuse_open_out) {
.open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
@@ -706,8 +745,8 @@ static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
* data from the block device into that buffer.
* Returns the buffer (read) size on success, and -errno on error.
*/
-static ssize_t fuse_read(FuseExport *exp, void **bufptr,
- uint64_t offset, uint32_t size)
+static ssize_t coroutine_fn
+fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
{
int64_t blk_len;
void *buf;
@@ -722,7 +761,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
* Clients will expect short reads at EOF, so we have to limit
* offset+size to the image length.
*/
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
return blk_len;
}
@@ -736,7 +775,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
return -ENOMEM;
}
- ret = blk_pread(exp->common.blk, offset, size, buf, 0);
+ ret = blk_co_pread(exp->common.blk, offset, size, buf, 0);
if (ret < 0) {
qemu_vfree(buf);
return ret;
@@ -748,11 +787,12 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
/**
* Handle client writes to the exported image. @buf has the data to be written
- * and will be copied to a bounce buffer before polling for the first time.
+ * and will be copied to a bounce buffer before yielding for the first time.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
- uint64_t offset, uint32_t size, const void *buf)
+static ssize_t coroutine_fn
+fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
+ uint64_t offset, uint32_t size, const void *buf)
{
void *copied;
int64_t blk_len;
@@ -767,7 +807,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
return -EACCES;
}
- /* Must copy to bounce buffer before polling (to allow nesting) */
+ /* Must copy to bounce buffer before potentially yielding */
copied = blk_blockalign(exp->common.blk, size);
memcpy(copied, buf, size);
@@ -775,7 +815,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
* Clients will expect short writes at EOF, so we have to limit
* offset+size to the image length.
*/
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
ret = blk_len;
goto fail_free_buffer;
@@ -783,7 +823,8 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
if (offset + size > blk_len) {
if (exp->growable) {
- ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, offset + size, true,
+ PREALLOC_MODE_OFF);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -792,7 +833,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
}
}
- ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
+ ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -813,8 +854,9 @@ fail_free_buffer:
* Let clients perform various fallocate() operations.
* Return 0 on success (no 'out' object), and -errno on error.
*/
-static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
- uint32_t mode)
+static ssize_t coroutine_fn
+fuse_co_fallocate(FuseExport *exp,
+ uint64_t offset, uint64_t length, uint32_t mode)
{
int64_t blk_len;
int ret;
@@ -823,7 +865,7 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
return -EACCES;
}
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
return blk_len;
}
@@ -842,14 +884,14 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
if (offset > blk_len) {
/* No preallocation needed here */
- ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
if (ret < 0) {
return ret;
}
}
- ret = fuse_do_truncate(exp, offset + length, true,
- PREALLOC_MODE_FALLOC);
+ ret = fuse_co_do_truncate(exp, offset + length, true,
+ PREALLOC_MODE_FALLOC);
}
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
else if (mode & FALLOC_FL_PUNCH_HOLE) {
@@ -860,8 +902,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
do {
int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
- ret = blk_pwrite_zeroes(exp->common.blk, offset, size,
- BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
+ ret = blk_co_pwrite_zeroes(exp->common.blk, offset, size,
+ BDRV_REQ_MAY_UNMAP |
+ BDRV_REQ_NO_FALLBACK);
if (ret == -ENOTSUP) {
/*
* fallocate() specifies to return EOPNOTSUPP for unsupported
@@ -879,8 +922,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
else if (mode & FALLOC_FL_ZERO_RANGE) {
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + length > blk_len) {
/* No need for zeroes, we are going to write them ourselves */
- ret = fuse_do_truncate(exp, offset + length, false,
- PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, offset + length, false,
+ PREALLOC_MODE_OFF);
if (ret < 0) {
return ret;
}
@@ -889,8 +932,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
do {
int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
- ret = blk_pwrite_zeroes(exp->common.blk,
- offset, size, 0);
+ ret = blk_co_pwrite_zeroes(exp->common.blk,
+ offset, size, 0);
offset += size;
length -= size;
} while (ret == 0 && length > 0);
@@ -907,9 +950,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
* Let clients fsync the exported image.
* Return 0 on success (no 'out' object), and -errno on error.
*/
-static ssize_t fuse_fsync(FuseExport *exp)
+static ssize_t coroutine_fn fuse_co_fsync(FuseExport *exp)
{
- return blk_flush(exp->common.blk);
+ return blk_co_flush(exp->common.blk);
}
/**
@@ -917,9 +960,9 @@ static ssize_t fuse_fsync(FuseExport *exp)
* notes this to be a way to return last-minute errors.)
* Return 0 on success (no 'out' object), and -errno on error.
*/
-static ssize_t fuse_flush(FuseExport *exp)
+static ssize_t coroutine_fn fuse_co_flush(FuseExport *exp)
{
- return blk_flush(exp->common.blk);
+ return blk_co_flush(exp->common.blk);
}
#ifdef CONFIG_FUSE_LSEEK
@@ -927,8 +970,9 @@ static ssize_t fuse_flush(FuseExport *exp)
* Let clients inquire allocation status.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
- uint64_t offset, uint32_t whence)
+static ssize_t coroutine_fn
+fuse_co_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+ uint64_t offset, uint32_t whence)
{
if (whence != SEEK_HOLE && whence != SEEK_DATA) {
return -EINVAL;
@@ -938,8 +982,8 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
int64_t pnum;
int ret;
- ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
- offset, INT64_MAX, &pnum, NULL, NULL);
+ ret = bdrv_co_block_status_above(blk_bs(exp->common.blk), NULL,
+ offset, INT64_MAX, &pnum, NULL, NULL);
if (ret < 0) {
return ret;
}
@@ -956,7 +1000,7 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
* and @blk_len (the client-visible EOF).
*/
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
return blk_len;
}
@@ -1091,14 +1135,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
}
/*
- * For use in fuse_process_request():
+ * For use in fuse_co_process_request():
* Returns a pointer to the parameter object for the given operation (inside of
* exp->request_buf, which is assumed to hold a fuse_in_header first).
* Verifies that the object is complete (exp->request_buf is large enough to
* hold it in one piece, and the request length includes the whole object).
*
- * Note that exp->request_buf may be overwritten after polling, so the returned
- * pointer must not be used across a function that may poll!
+ * Note that exp->request_buf may be overwritten after yielding, so the returned
+ * pointer must not be used across a function that may yield!
*/
#define FUSE_IN_OP_STRUCT(op_name, export) \
({ \
@@ -1122,7 +1166,7 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
})
/*
- * For use in fuse_process_request():
+ * For use in fuse_co_process_request():
* Returns a pointer to the return object for the given operation (inside of
* out_buf, which is assumed to hold a fuse_out_header first).
* Verifies that out_buf is large enough to hold the whole object.
@@ -1145,12 +1189,11 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
/**
* Process a FUSE request, incl. writing the response.
*
- * Note that polling in any request-processing function can lead to a nested
- * read_from_fuse_fd() call, which will overwrite the contents of
- * exp->request_buf. Anything that takes a buffer needs to take care that the
- * content is copied before potentially polling.
+ * Note that yielding in any request-processing function can overwrite the
+ * contents of exp->request_buf. Anything that takes a buffer needs to take
+ * care that the content is copied before yielding.
*/
-static void fuse_process_request(FuseExport *exp)
+static void coroutine_fn fuse_co_process_request(FuseExport *exp)
{
uint32_t opcode;
uint64_t req_id;
@@ -1171,7 +1214,7 @@ static void fuse_process_request(FuseExport *exp)
void *out_data_buffer = NULL;
ssize_t ret;
- /* Limit scope to ensure pointer is no longer used after polling */
+ /* Limit scope to ensure pointer is no longer used after yielding */
{
const struct fuse_in_header *in_hdr =
(const struct fuse_in_header *)exp->request_buf;
@@ -1183,13 +1226,13 @@ static void fuse_process_request(FuseExport *exp)
switch (opcode) {
case FUSE_INIT: {
const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
- ret = fuse_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
- in->max_readahead, in->flags);
+ ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
+ in->max_readahead, in->flags);
break;
}
case FUSE_OPEN:
- ret = fuse_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
+ ret = fuse_co_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
break;
case FUSE_RELEASE:
@@ -1201,19 +1244,19 @@ static void fuse_process_request(FuseExport *exp)
break;
case FUSE_GETATTR:
- ret = fuse_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
+ ret = fuse_co_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
break;
case FUSE_SETATTR: {
const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
- ret = fuse_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
- in->valid, in->size, in->mode, in->uid, in->gid);
+ ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
+ in->valid, in->size, in->mode, in->uid, in->gid);
break;
}
case FUSE_READ: {
const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
- ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+ ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
break;
}
@@ -1241,33 +1284,33 @@ static void fuse_process_request(FuseExport *exp)
/*
* Passing a pointer to `in` (i.e. the request buffer) is fine because
- * fuse_write() takes care to copy its contents before potentially
- * polling.
+ * fuse_co_write() takes care to copy its contents before potentially
+ * yielding.
*/
- ret = fuse_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
- in->offset, in->size, in + 1);
+ ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
+ in->offset, in->size, in + 1);
break;
}
case FUSE_FALLOCATE: {
const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
- ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+ ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
break;
}
case FUSE_FSYNC:
- ret = fuse_fsync(exp);
+ ret = fuse_co_fsync(exp);
break;
case FUSE_FLUSH:
- ret = fuse_flush(exp);
+ ret = fuse_co_flush(exp);
break;
#ifdef CONFIG_FUSE_LSEEK
case FUSE_LSEEK: {
const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
- ret = fuse_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
- in->offset, in->whence);
+ ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
+ in->offset, in->whence);
break;
}
#endif
--
2.48.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 14/15] fuse: Implement multi-threading
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (12 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 13/15] fuse: Process requests in coroutines Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-26 5:38 ` Markus Armbruster
` (2 more replies)
2025-03-25 16:06 ` [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
14 siblings, 3 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
(via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
We can use this to implement multi-threading.
Note that the interface presented here differs from the multi-queue
interface of virtio-blk: The latter maps virtqueues to iothreads, which
allows processing multiple virtqueues in a single iothread. The
equivalent (processing multiple FDs in a single iothread) would not make
sense for FUSE because those FDs are used in a round-robin fashion by
the FUSE kernel driver. Putting two of them into a single iothread will
just create a bottleneck.
Therefore, all we need is an array of iothreads, and we will create one
"queue" (FD) per thread.
These are the benchmark results when using four threads (compared to a
single thread); note that fio still only uses a single job, but
performance can still be improved because of said round-robin usage for
the queues. (Not in the sync case, though, in which case I guess it
just adds overhead.)
file:
read:
seq aio: 264.8k ±0.8k (+120 %)
rand aio: 143.8k ±0.4k (+ 27 %)
seq sync: 49.9k ±0.5k (- 5 %)
rand sync: 10.3k ±0.1k (- 1 %)
write:
seq aio: 226.6k ±2.1k (+184 %)
rand aio: 225.9k ±1.8k (+186 %)
seq sync: 36.9k ±0.6k (- 11 %)
rand sync: 36.9k ±0.2k (- 11 %)
null:
read:
seq aio: 315.2k ±11.0k (+18 %)
rand aio: 300.5k ±10.8k (+14 %)
seq sync: 114.2k ± 3.6k (-16 %)
rand sync: 112.5k ± 2.8k (-16 %)
write:
seq aio: 222.6k ±6.8k (-21 %)
rand aio: 220.5k ±6.8k (-23 %)
seq sync: 117.2k ±3.7k (-18 %)
rand sync: 116.3k ±4.4k (-18 %)
(I don't know what's going on in the null-write AIO case, sorry.)
Here are the results for numjobs=4:
"Before", i.e. without multithreading in QSD/FUSE (results compared to
numjobs=1):
file:
read:
seq aio: 104.7k ± 0.4k (- 13 %)
rand aio: 111.5k ± 0.4k (- 2 %)
seq sync: 71.0k ±13.8k (+ 36 %)
rand sync: 41.4k ± 0.1k (+297 %)
write:
seq aio: 79.4k ±0.1k (- 1 %)
rand aio: 78.6k ±0.1k (± 0 %)
seq sync: 83.3k ±0.1k (+101 %)
rand sync: 82.0k ±0.2k (+ 98 %)
null:
read:
seq aio: 260.5k ±1.5k (- 2 %)
rand aio: 260.1k ±1.4k (- 2 %)
seq sync: 291.8k ±1.3k (+115 %)
rand sync: 280.1k ±1.7k (+115 %)
write:
seq aio: 280.1k ±1.7k (± 0 %)
rand aio: 279.5k ±1.4k (- 3 %)
seq sync: 306.7k ±2.2k (+116 %)
rand sync: 305.9k ±1.8k (+117 %)
(As probably expected, little difference in the AIO case, but great
improvements in the sync case because it kind of gives it an artificial
iodepth of 4.)
"After", i.e. with four threads in QSD/FUSE (now results compared to the
above):
file:
read:
seq aio: 193.3k ± 1.8k (+ 85 %)
rand aio: 329.3k ± 0.3k (+195 %)
seq sync: 66.2k ±13.0k (- 7 %)
rand sync: 40.1k ± 0.0k (- 3 %)
write:
seq aio: 219.7k ±0.8k (+177 %)
rand aio: 217.2k ±1.5k (+176 %)
seq sync: 92.5k ±0.2k (+ 11 %)
rand sync: 91.9k ±0.2k (+ 12 %)
null:
read:
seq aio: 706.7k ±2.1k (+171 %)
rand aio: 714.7k ±3.2k (+175 %)
seq sync: 431.7k ±3.0k (+ 48 %)
rand sync: 435.4k ±2.8k (+ 50 %)
write:
seq aio: 746.9k ±2.8k (+167 %)
rand aio: 749.0k ±4.9k (+168 %)
seq sync: 420.7k ±3.1k (+ 37 %)
rand sync: 419.1k ±2.5k (+ 37 %)
So this helps mainly for the AIO cases, but also in the null sync cases,
because null is always CPU-bound, so more threads help.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
qapi/block-export.json | 8 +-
block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
2 files changed, 179 insertions(+), 43 deletions(-)
diff --git a/qapi/block-export.json b/qapi/block-export.json
index c783e01a53..0bdd5992eb 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -179,12 +179,18 @@
# mount the export with allow_other, and if that fails, try again
# without. (since 6.1; default: auto)
#
+# @iothreads: Enables multi-threading: Handle requests in each of the
+# given iothreads (instead of the block device's iothread, or the
+# export's "main" iothread). For this, the FUSE FD is duplicated so
+# there is one FD per iothread. (since 10.1)
+#
# Since: 6.0
##
{ 'struct': 'BlockExportOptionsFuse',
'data': { 'mountpoint': 'str',
'*growable': 'bool',
- '*allow-other': 'FuseExportAllowOther' },
+ '*allow-other': 'FuseExportAllowOther',
+ '*iothreads': ['str'] },
'if': 'CONFIG_FUSE' }
##
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 345e833171..0edd994392 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -31,11 +31,14 @@
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "system/block-backend.h"
+#include "system/block-backend.h"
+#include "system/iothread.h"
#include <fuse.h>
#include <fuse_lowlevel.h>
#include "standard-headers/linux/fuse.h"
+#include <sys/ioctl.h>
#if defined(CONFIG_FALLOCATE_ZERO_RANGE)
#include <linux/falloc.h>
@@ -50,12 +53,17 @@
/* Small enough to fit in the request buffer */
#define FUSE_MAX_WRITE_BYTES (4 * 1024)
-typedef struct FuseExport {
- BlockExport common;
+typedef struct FuseExport FuseExport;
- struct fuse_session *fuse_session;
- unsigned int in_flight; /* atomic */
- bool mounted, fd_handler_set_up;
+/*
+ * One FUSE "queue", representing one FUSE FD from which requests are fetched
+ * and processed. Each queue is tied to an AioContext.
+ */
+typedef struct FuseQueue {
+ FuseExport *exp;
+
+ AioContext *ctx;
+ int fuse_fd;
/*
* The request buffer must be able to hold a full write, and/or at least
@@ -66,6 +74,14 @@ typedef struct FuseExport {
FUSE_MAX_WRITE_BYTES,
FUSE_MIN_READ_BUFFER
)];
+} FuseQueue;
+
+struct FuseExport {
+ BlockExport common;
+
+ struct fuse_session *fuse_session;
+ unsigned int in_flight; /* atomic */
+ bool mounted, fd_handler_set_up;
/*
* Set when there was an unrecoverable error and no requests should be read
@@ -74,7 +90,15 @@ typedef struct FuseExport {
*/
bool halted;
- int fuse_fd;
+ int num_queues;
+ FuseQueue *queues;
+ /*
+ * True if this export should follow the generic export's AioContext.
+ * Will be false if the queues' AioContexts have been explicitly set by the
+ * user, i.e. are expected to stay in those contexts.
+ * (I.e. is always false if there is more than one queue.)
+ */
+ bool follow_aio_context;
char *mountpoint;
bool writable;
@@ -85,11 +109,11 @@ typedef struct FuseExport {
mode_t st_mode;
uid_t st_uid;
gid_t st_gid;
-} FuseExport;
+};
/* Parameters to the request processing coroutine */
typedef struct FuseRequestCoParam {
- FuseExport *exp;
+ FuseQueue *q;
int got_request;
} FuseRequestCoParam;
@@ -102,12 +126,13 @@ static void fuse_export_halt(FuseExport *exp);
static void init_exports_table(void);
static int mount_fuse_export(FuseExport *exp, Error **errp);
+static int clone_fuse_fd(int fd, Error **errp);
static bool is_regular_file(const char *path, Error **errp);
static bool poll_fuse_fd(void *opaque);
static void read_fuse_fd(void *opaque);
-static void coroutine_fn fuse_co_process_request(FuseExport *exp);
+static void coroutine_fn fuse_co_process_request(FuseQueue *q);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -137,9 +162,11 @@ static void fuse_attach_handlers(FuseExport *exp)
return;
}
- aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
- read_fuse_fd, NULL, poll_fuse_fd,
- read_fuse_fd, exp);
+ for (int i = 0; i < exp->num_queues; i++) {
+ aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+ read_fuse_fd, NULL, poll_fuse_fd,
+ read_fuse_fd, &exp->queues[i]);
+ }
exp->fd_handler_set_up = true;
}
@@ -148,8 +175,10 @@ static void fuse_attach_handlers(FuseExport *exp)
*/
static void fuse_detach_handlers(FuseExport *exp)
{
- aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
- NULL, NULL, NULL, NULL, NULL);
+ for (int i = 0; i < exp->num_queues; i++) {
+ aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+ NULL, NULL, NULL, NULL, NULL);
+ }
exp->fd_handler_set_up = false;
}
@@ -164,6 +193,11 @@ static void fuse_export_drained_end(void *opaque)
/* Refresh AioContext in case it changed */
exp->common.ctx = blk_get_aio_context(exp->common.blk);
+ if (exp->follow_aio_context) {
+ assert(exp->num_queues == 1);
+ exp->queues[0].ctx = exp->common.ctx;
+ }
+
fuse_attach_handlers(exp);
}
@@ -187,10 +221,52 @@ static int fuse_export_create(BlockExport *blk_exp,
ERRP_GUARD(); /* ensure clean-up even with error_fatal */
FuseExport *exp = container_of(blk_exp, FuseExport, common);
BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
+ FuseQueue *q;
int ret;
assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
+ if (args->iothreads) {
+ strList *e;
+
+ exp->follow_aio_context = false;
+ exp->num_queues = 0;
+ for (e = args->iothreads; e; e = e->next) {
+ exp->num_queues++;
+ }
+ if (exp->num_queues < 1) {
+ error_setg(errp, "Need at least one queue");
+ ret = -EINVAL;
+ goto fail;
+ }
+ exp->queues = g_new0(FuseQueue, exp->num_queues);
+ q = exp->queues;
+ for (e = args->iothreads; e; e = e->next) {
+ IOThread *iothread = iothread_by_id(e->value);
+
+ if (!iothread) {
+ error_setg(errp, "IOThread \"%s\" does not exist", e->value);
+ ret = -EINVAL;
+ goto fail;
+ }
+
+ *(q++) = (FuseQueue) {
+ .exp = exp,
+ .ctx = iothread_get_aio_context(iothread),
+ .fuse_fd = -1,
+ };
+ }
+ } else {
+ exp->follow_aio_context = true;
+ exp->num_queues = 1;
+ exp->queues = g_new(FuseQueue, exp->num_queues);
+ exp->queues[0] = (FuseQueue) {
+ .exp = exp,
+ .ctx = exp->common.ctx,
+ .fuse_fd = -1,
+ };
+ }
+
/* For growable and writable exports, take the RESIZE permission */
if (args->growable || blk_exp_args->writable) {
uint64_t blk_perm, blk_shared_perm;
@@ -275,14 +351,24 @@ static int fuse_export_create(BlockExport *blk_exp,
g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
- exp->fuse_fd = fuse_session_fd(exp->fuse_session);
- ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
+ assert(exp->num_queues >= 1);
+ exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
+ ret = fcntl(exp->queues[0].fuse_fd, F_SETFL, O_NONBLOCK);
if (ret < 0) {
ret = -errno;
error_setg_errno(errp, errno, "Failed to make FUSE FD non-blocking");
goto fail;
}
+ for (int i = 1; i < exp->num_queues; i++) {
+ int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
+ if (fd < 0) {
+ ret = fd;
+ goto fail;
+ }
+ exp->queues[i].fuse_fd = fd;
+ }
+
fuse_attach_handlers(exp);
return 0;
@@ -355,6 +441,39 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
return 0;
}
+/**
+ * Clone the given /dev/fuse file descriptor, yielding a second FD from which
+ * requests can be pulled for the associated filesystem. Returns an FD on
+ * success, and -errno on error.
+ */
+static int clone_fuse_fd(int fd, Error **errp)
+{
+ uint32_t src_fd = fd;
+ int new_fd;
+ int ret;
+
+ /*
+ * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
+ * (fuse_clone_chan()).
+ */
+ new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (new_fd < 0) {
+ ret = -errno;
+ error_setg_errno(errp, errno, "Failed to open /dev/fuse");
+ return ret;
+ }
+
+ ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
+ if (ret < 0) {
+ ret = -errno;
+ error_setg_errno(errp, errno, "Failed to clone FUSE FD");
+ close(new_fd);
+ return ret;
+ }
+
+ return new_fd;
+}
+
/**
* Try to read a single request from the FUSE FD.
* Takes a FuseRequestCoParam object pointer in `opaque`.
@@ -370,8 +489,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
static void coroutine_fn co_read_from_fuse_fd(void *opaque)
{
FuseRequestCoParam *co_param = opaque;
- FuseExport *exp = co_param->exp;
- int fuse_fd = exp->fuse_fd;
+ FuseQueue *q = co_param->q;
+ int fuse_fd = q->fuse_fd;
+ FuseExport *exp = q->exp;
ssize_t ret;
const struct fuse_in_header *in_hdr;
@@ -381,8 +501,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
- sizeof(exp->request_buf)));
+ ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
if (ret < 0 && errno == EAGAIN) {
/* No request available */
goto no_request;
@@ -400,7 +519,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- in_hdr = (const struct fuse_in_header *)exp->request_buf;
+ in_hdr = (const struct fuse_in_header *)q->request_buf;
if (unlikely(ret != in_hdr->len)) {
error_report("Number of bytes read from FUSE device does not match "
"request size, expected %" PRIu32 " bytes, read %zi "
@@ -413,7 +532,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
/* Must set this before yielding */
co_param->got_request = 1;
- fuse_co_process_request(exp);
+ fuse_co_process_request(q);
fuse_dec_in_flight(exp);
return;
@@ -432,7 +551,7 @@ static bool poll_fuse_fd(void *opaque)
{
Coroutine *co;
FuseRequestCoParam co_param = {
- .exp = opaque,
+ .q = opaque,
.got_request = -EINPROGRESS,
};
@@ -451,7 +570,7 @@ static void read_fuse_fd(void *opaque)
{
Coroutine *co;
FuseRequestCoParam co_param = {
- .exp = opaque,
+ .q = opaque,
.got_request = -EINPROGRESS,
};
@@ -481,6 +600,16 @@ static void fuse_export_delete(BlockExport *blk_exp)
{
FuseExport *exp = container_of(blk_exp, FuseExport, common);
+ for (int i = 0; i < exp->num_queues; i++) {
+ FuseQueue *q = &exp->queues[i];
+
+ /* Queue 0's FD belongs to the FUSE session */
+ if (i > 0 && q->fuse_fd >= 0) {
+ close(q->fuse_fd);
+ }
+ }
+ g_free(exp->queues);
+
if (exp->fuse_session) {
if (exp->mounted) {
fuse_session_unmount(exp->fuse_session);
@@ -1137,23 +1266,23 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
/*
* For use in fuse_co_process_request():
* Returns a pointer to the parameter object for the given operation (inside of
- * exp->request_buf, which is assumed to hold a fuse_in_header first).
- * Verifies that the object is complete (exp->request_buf is large enough to
+ * q->request_buf, which is assumed to hold a fuse_in_header first).
+ * Verifies that the object is complete (q->request_buf is large enough to
* hold it in one piece, and the request length includes the whole object).
*
- * Note that exp->request_buf may be overwritten after yielding, so the returned
+ * Note that q->request_buf may be overwritten after yielding, so the returned
* pointer must not be used across a function that may yield!
*/
-#define FUSE_IN_OP_STRUCT(op_name, export) \
+#define FUSE_IN_OP_STRUCT(op_name, queue) \
({ \
const struct fuse_in_header *__in_hdr = \
- (const struct fuse_in_header *)(export)->request_buf; \
+ (const struct fuse_in_header *)(q)->request_buf; \
const struct fuse_##op_name##_in *__in = \
(const struct fuse_##op_name##_in *)(__in_hdr + 1); \
const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
uint32_t __req_len; \
\
- QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
+ QEMU_BUILD_BUG_ON(sizeof((q)->request_buf) < __param_len); \
\
__req_len = __in_hdr->len; \
if (__req_len < __param_len) { \
@@ -1190,11 +1319,12 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
* Process a FUSE request, incl. writing the response.
*
* Note that yielding in any request-processing function can overwrite the
- * contents of exp->request_buf. Anything that takes a buffer needs to take
+ * contents of q->request_buf. Anything that takes a buffer needs to take
* care that the content is copied before yielding.
*/
-static void coroutine_fn fuse_co_process_request(FuseExport *exp)
+static void coroutine_fn fuse_co_process_request(FuseQueue *q)
{
+ FuseExport *exp = q->exp;
uint32_t opcode;
uint64_t req_id;
/*
@@ -1217,7 +1347,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
/* Limit scope to ensure pointer is no longer used after yielding */
{
const struct fuse_in_header *in_hdr =
- (const struct fuse_in_header *)exp->request_buf;
+ (const struct fuse_in_header *)q->request_buf;
opcode = in_hdr->opcode;
req_id = in_hdr->unique;
@@ -1225,7 +1355,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
switch (opcode) {
case FUSE_INIT: {
- const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
+ const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
in->max_readahead, in->flags);
break;
@@ -1248,23 +1378,23 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
break;
case FUSE_SETATTR: {
- const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
+ const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q);
ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
in->valid, in->size, in->mode, in->uid, in->gid);
break;
}
case FUSE_READ: {
- const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
+ const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q);
ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
break;
}
case FUSE_WRITE: {
- const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
+ const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q);
uint32_t req_len;
- req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
+ req_len = ((const struct fuse_in_header *)q->request_buf)->len;
if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
in->size)) {
warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
@@ -1293,7 +1423,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
}
case FUSE_FALLOCATE: {
- const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
+ const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q);
ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
break;
}
@@ -1308,7 +1438,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
#ifdef CONFIG_FUSE_LSEEK
case FUSE_LSEEK: {
- const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
+ const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, q);
ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
in->offset, in->whence);
break;
@@ -1322,11 +1452,11 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
/* Ignore errors from fuse_write*(), nothing we can do anyway */
if (out_data_buffer) {
assert(ret >= 0);
- fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
+ fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
out_data_buffer, ret);
qemu_vfree(out_data_buffer);
} else {
- fuse_write_response(exp->fuse_fd, req_id, out_hdr,
+ fuse_write_response(q->fuse_fd, req_id, out_hdr,
ret < 0 ? ret : 0,
ret < 0 ? 0 : ret);
}
--
2.48.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (13 preceding siblings ...)
2025-03-25 16:06 ` [PATCH 14/15] fuse: Implement multi-threading Hanna Czenczek
@ 2025-03-25 16:06 ` Hanna Czenczek
2025-03-27 15:59 ` Stefan Hajnoczi
2025-04-01 20:24 ` Eric Blake
14 siblings, 2 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-25 16:06 UTC (permalink / raw)
To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf
We probably want to support larger write sizes than just 4k; 64k seems
nice. However, we cannot read partial requests from the FUSE FD; we always
have to read each request in full, so our read buffer must be large enough
to accommodate a potential 64k write if we want to support that size.
Always allocating FuseRequest objects with 64k buffers in them seems
wasteful, though. But we can get around the issue by splitting the
buffer into two and using readv(): One part will hold all normal (up to
4k) write requests and all other requests, and a second part (the
"spill-over buffer") will be used only for larger write requests. Each
FuseQueue has its own spill-over buffer, and only if we find it used
when reading a request will we move its ownership into the FuseRequest
object and allocate a new spill-over buffer for the queue.
This way, we get to support "large" write sizes without having to
allocate big buffers when they aren't used.
Also, this even reduces the size of the FuseRequest objects because the
read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
the requests we support are not quite so large (except for >4k writes),
so until now, we basically had to have useless padding in there.
With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
is easily met and we can decrease the size of the buffer portion that is
right inside of FuseRequest.
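In code, the hand-off looks roughly like this (condensed from the
co_read_from_fuse_fd() changes below; q is the FuseQueue, and spillover_buf
is the local variable that is later passed to fuse_co_process_request()):

    void *spillover_buf = NULL;
    struct iovec iov[2] = {
        { .iov_base = q->request_buf,   .iov_len = sizeof(q->request_buf) },
        { .iov_base = q->spillover_buf, .iov_len = FUSE_SPILLOVER_BUF_SIZE },
    };
    ssize_t ret;

    ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));

    if (ret > (ssize_t)sizeof(q->request_buf)) {
        /*
         * A large write spilled over: move buffer ownership to this request;
         * a new spill-over buffer is allocated before the next read
         */
        spillover_buf = q->spillover_buf;
        q->spillover_buf = NULL;
    }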
As for benchmarks, the benefit of this patch can be shown easily by
writing a 4G image (with qemu-img convert) to a FUSE export:
- Before this patch: Takes 25.6 s (14.4 s with -t none)
- After this patch: Takes 4.5 s (5.5 s with -t none)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 118 insertions(+), 19 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 0edd994392..a24c5538b3 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -50,8 +50,17 @@
/* Prevent overly long bounce buffer allocations */
#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
-/* Small enough to fit in the request buffer */
-#define FUSE_MAX_WRITE_BYTES (4 * 1024)
+/*
+ * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
+ * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
+ * determine the split between the size of the in-place buffer in FuseRequest
+ * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
+ * detailed explanation.
+ */
+#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
+#define FUSE_MAX_WRITE_BYTES (64 * 1024)
+#define FUSE_SPILLOVER_BUF_SIZE \
+ (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
typedef struct FuseExport FuseExport;
@@ -67,15 +76,49 @@ typedef struct FuseQueue {
/*
* The request buffer must be able to hold a full write, and/or at least
- * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
+ * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
+ * This however is just the first part of the buffer; every read is given
+ * a vector of this buffer (which should be enough for all normal requests,
+ * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
+ * spill-over buffer below.
+ * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
+ * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
+ */
+ char request_buf[sizeof(struct fuse_in_header) +
+ sizeof(struct fuse_write_in) +
+ FUSE_IN_PLACE_WRITE_BYTES];
+
+ /*
+ * When retrieving a FUSE request, the destination buffer must always be
+ * sufficiently large for the whole request, i.e. with max_write=64k, we
+ * must provide a buffer that fits the WRITE header and 64 kB of space for
+ * data.
+ * We do want to support 64k write requests without requiring them to be
+ * split up, but at the same time, do not want to do such a large allocation
+ * for every single request.
+ * Therefore, the FuseRequest object provides an in-line buffer that is
+ * enough for write requests up to 4k (and all other requests), and for
+ * every request that is bigger, we provide a spill-over buffer here (for
+ * the remaining 64k - 4k = 60k).
+ * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
+ * I/O vector, and then checks the return value (number of bytes read) to
+ * find out whether the spill-over buffer was used. If so, it will move the
+ * buffer to the request, and will allocate a new spill-over buffer for the
+ * next request.
+ *
+ * Free this buffer with qemu_vfree().
*/
- char request_buf[MAX_CONST(
- sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
- FUSE_MAX_WRITE_BYTES,
- FUSE_MIN_READ_BUFFER
- )];
+ void *spillover_buf;
} FuseQueue;
+/*
+ * Verify that FuseQueue.request_buf plus the spill-over buffer together
+ * are big enough to be accepted by the FUSE kernel driver.
+ */
+QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
+ FUSE_SPILLOVER_BUF_SIZE <
+ FUSE_MIN_READ_BUFFER);
+
struct FuseExport {
BlockExport common;
@@ -132,7 +175,8 @@ static bool is_regular_file(const char *path, Error **errp);
static bool poll_fuse_fd(void *opaque);
static void read_fuse_fd(void *opaque);
-static void coroutine_fn fuse_co_process_request(FuseQueue *q);
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -494,6 +538,8 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
FuseExport *exp = q->exp;
ssize_t ret;
const struct fuse_in_header *in_hdr;
+ struct iovec iov[2];
+ void *spillover_buf = NULL;
fuse_inc_in_flight(exp);
@@ -501,7 +547,20 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
+ /*
+ * If handling the last request consumed the spill-over buffer, allocate a
+ * new one. Align it to the block device's alignment, which admittedly is
+ * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
+ */
+ if (unlikely(!q->spillover_buf)) {
+ q->spillover_buf = blk_blockalign(exp->common.blk,
+ FUSE_SPILLOVER_BUF_SIZE);
+ }
+ /* Construct the I/O vector to hold the FUSE request */
+ iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
+ iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
+
+ ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
if (ret < 0 && errno == EAGAIN) {
/* No request available */
goto no_request;
@@ -530,9 +589,15 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
+ if (unlikely(ret > sizeof(q->request_buf))) {
+ /* Spillover buffer used, take ownership */
+ spillover_buf = q->spillover_buf;
+ q->spillover_buf = NULL;
+ }
+
/* Must set this before yielding */
co_param->got_request = 1;
- fuse_co_process_request(q);
+ fuse_co_process_request(q, spillover_buf);
fuse_dec_in_flight(exp);
return;
@@ -607,6 +672,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
if (i > 0 && q->fuse_fd >= 0) {
close(q->fuse_fd);
}
+ if (q->spillover_buf) {
+ qemu_vfree(q->spillover_buf);
+ }
}
g_free(exp->queues);
@@ -915,17 +983,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
}
/**
- * Handle client writes to the exported image. @buf has the data to be written
- * and will be copied to a bounce buffer before yielding for the first time.
+ * Handle client writes to the exported image. @in_place_buf has the first
+ * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
+ * contains the rest (if any; NULL otherwise).
+ * Data in @in_place_buf is assumed to be overwritten after yielding, so will
+ * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
+ * assumed to be exclusively owned and will be used as-is.
* Return the number of bytes written to *out on success, and -errno on error.
*/
static ssize_t coroutine_fn
fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
- uint64_t offset, uint32_t size, const void *buf)
+ uint64_t offset, uint32_t size,
+ const void *in_place_buf, const void *spillover_buf)
{
+ size_t in_place_size;
void *copied;
int64_t blk_len;
int ret;
+ struct iovec iov[2];
+ QEMUIOVector qiov;
/* Limited by max_write, should not happen */
if (size > BDRV_REQUEST_MAX_BYTES) {
@@ -937,8 +1013,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
}
/* Must copy to bounce buffer before potentially yielding */
- copied = blk_blockalign(exp->common.blk, size);
- memcpy(copied, buf, size);
+ in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
+ copied = blk_blockalign(exp->common.blk, in_place_size);
+ memcpy(copied, in_place_buf, in_place_size);
/**
* Clients will expect short writes at EOF, so we have to limit
@@ -962,7 +1039,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
}
}
- ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
+ iov[0] = (struct iovec) {
+ .iov_base = copied,
+ .iov_len = in_place_size,
+ };
+ if (size > FUSE_IN_PLACE_WRITE_BYTES) {
+ assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
+ iov[1] = (struct iovec) {
+ .iov_base = (void *)spillover_buf,
+ .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+ };
+ qemu_iovec_init_external(&qiov, iov, 2);
+ } else {
+ qemu_iovec_init_external(&qiov, iov, 1);
+ }
+ ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -1321,8 +1412,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
* Note that yielding in any request-processing function can overwrite the
* contents of q->request_buf. Anything that takes a buffer needs to take
* care that the content is copied before yielding.
+ *
+ * @spillover_buf can contain the tail of a write request too large to fit into
+ * q->request_buf. This function takes ownership of it (i.e. will free it),
+ * which assumes that its contents will not be overwritten by concurrent
+ * requests (as opposed to q->request_buf).
*/
-static void coroutine_fn fuse_co_process_request(FuseQueue *q)
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf)
{
FuseExport *exp = q->exp;
uint32_t opcode;
@@ -1418,7 +1515,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
* yielding.
*/
ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
- in->offset, in->size, in + 1);
+ in->offset, in->size, in + 1, spillover_buf);
break;
}
@@ -1460,6 +1557,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
ret < 0 ? ret : 0,
ret < 0 ? 0 : ret);
}
+
+ qemu_vfree(spillover_buf);
}
const BlockExportDriver blk_exp_fuse = {
--
2.48.1
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-25 16:06 ` [PATCH 14/15] fuse: Implement multi-threading Hanna Czenczek
@ 2025-03-26 5:38 ` Markus Armbruster
2025-03-26 9:55 ` Hanna Czenczek
2025-03-27 15:55 ` Stefan Hajnoczi
2025-04-01 14:58 ` Eric Blake
2 siblings, 1 reply; 59+ messages in thread
From: Markus Armbruster @ 2025-03-26 5:38 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
Hanna Czenczek <hreitz@redhat.com> writes:
> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>
> We can use this to implement multi-threading.
>
> Note that the interface presented here differs from the multi-queue
> interface of virtio-blk: The latter maps virtqueues to iothreads, which
> allows processing multiple virtqueues in a single iothread. The
> equivalent (processing multiple FDs in a single iothread) would not make
> sense for FUSE because those FDs are used in a round-robin fashion by
> the FUSE kernel driver. Putting two of them into a single iothread will
> just create a bottleneck.
>
> Therefore, all we need is an array of iothreads, and we will create one
> "queue" (FD) per thread.
[...]
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> qapi/block-export.json | 8 +-
> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
> 2 files changed, 179 insertions(+), 43 deletions(-)
>
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index c783e01a53..0bdd5992eb 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -179,12 +179,18 @@
> # mount the export with allow_other, and if that fails, try again
> # without. (since 6.1; default: auto)
> #
> +# @iothreads: Enables multi-threading: Handle requests in each of the
> +# given iothreads (instead of the block device's iothread, or the
> +# export's "main" iothread).
When does "the block device's iothread" apply, and when "the export's
main iothread"? Is this something the QMP user needs to know?
> +# For this, the FUSE FD is duplicated so
> +# there is one FD per iothread. (since 10.1)
Is the file descriptor duplication something the QMP user needs to know?
> +#
> # Since: 6.0
> ##
> { 'struct': 'BlockExportOptionsFuse',
> 'data': { 'mountpoint': 'str',
> '*growable': 'bool',
> - '*allow-other': 'FuseExportAllowOther' },
> + '*allow-other': 'FuseExportAllowOther',
> + '*iothreads': ['str'] },
> 'if': 'CONFIG_FUSE' }
>
> ##
[...]
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal
2025-03-25 16:06 ` [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
@ 2025-03-26 5:47 ` Markus Armbruster
2025-03-26 9:49 ` Hanna Czenczek
2025-03-27 14:51 ` Stefan Hajnoczi
1 sibling, 1 reply; 59+ messages in thread
From: Markus Armbruster @ 2025-03-26 5:47 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
Hanna Czenczek <hreitz@redhat.com> writes:
> When exports are created on the command line (with the storage daemon),
> errp is going to point to error_fatal. Without ERRP_GUARD, we would
> exit immediately when *errp is set, i.e. skip the clean-up code under
> the `fail` label. Use ERRP_GUARD so we always run that code.
>
> As far as I know, this has no actual impact right now[1], but it is
> still better to make this right.
>
> [1] Not cleaning up the mount point is the only thing I can imagine
> would be problematic, but that is the last thing we attempt, so if
> it fails, it will clean itself up.
Hmm.
The pattern is "no cleanup with &error_fatal or &error_abort, but not
cleaning up then is harmless". How many instances do we have? My gut
feeling is in the hundreds. Why is "fixing" just this one worth the
bother?
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index a12f479492..7c035dd6ca 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -119,6 +119,7 @@ static int fuse_export_create(BlockExport *blk_exp,
> BlockExportOptions *blk_exp_args,
> Error **errp)
> {
> + ERRP_GUARD(); /* ensure clean-up even with error_fatal */
> FuseExport *exp = container_of(blk_exp, FuseExport, common);
> BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
> int ret;
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal
2025-03-26 5:47 ` Markus Armbruster
@ 2025-03-26 9:49 ` Hanna Czenczek
0 siblings, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-26 9:49 UTC (permalink / raw)
To: Markus Armbruster; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 26.03.25 06:47, Markus Armbruster wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> When exports are created on the command line (with the storage daemon),
>> errp is going to point to error_fatal. Without ERRP_GUARD, we would
>> exit immediately when *errp is set, i.e. skip the clean-up code under
>> the `fail` label. Use ERRP_GUARD so we always run that code.
>>
>> As far as I know, this has no actual impact right now[1], but it is
>> still better to make this right.
>>
>> [1] Not cleaning up the mount point is the only thing I can imagine
>> would be problematic, but that is the last thing we attempt, so if
>> it fails, it will clean itself up.
> Hmm.
>
> The pattern is "no cleanup with &error_fatal or &error_abort, but not
> cleaning up then is harmless". How many instances do we have? My gut
> feeling is in the hundreds. Why is "fixing" just this one worth the
> bother?
Because:
1. This one is in FUSE code, which I’m reworking in this series.
2. I did encounter this issue while playing around with manual mounting
last year. I don’t think it has visible impact when mounting with
libfuse, but why leave out a fix for something that can be triggered by
making valid changes to the code?
Hanna
>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/export/fuse.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index a12f479492..7c035dd6ca 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -119,6 +119,7 @@ static int fuse_export_create(BlockExport *blk_exp,
>> BlockExportOptions *blk_exp_args,
>> Error **errp)
>> {
>> + ERRP_GUARD(); /* ensure clean-up even with error_fatal */
>> FuseExport *exp = container_of(blk_exp, FuseExport, common);
>> BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
>> int ret;
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-26 5:38 ` Markus Armbruster
@ 2025-03-26 9:55 ` Hanna Czenczek
2025-03-26 11:41 ` Markus Armbruster
0 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-26 9:55 UTC (permalink / raw)
To: Markus Armbruster; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 26.03.25 06:38, Markus Armbruster wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>
>> We can use this to implement multi-threading.
>>
>> Note that the interface presented here differs from the multi-queue
>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>> allows processing multiple virtqueues in a single iothread. The
>> equivalent (processing multiple FDs in a single iothread) would not make
>> sense for FUSE because those FDs are used in a round-robin fashion by
>> the FUSE kernel driver. Putting two of them into a single iothread will
>> just create a bottleneck.
>>
>> Therefore, all we need is an array of iothreads, and we will create one
>> "queue" (FD) per thread.
> [...]
>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> qapi/block-export.json | 8 +-
>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>
>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>> index c783e01a53..0bdd5992eb 100644
>> --- a/qapi/block-export.json
>> +++ b/qapi/block-export.json
>> @@ -179,12 +179,18 @@
>> # mount the export with allow_other, and if that fails, try again
>> # without. (since 6.1; default: auto)
>> #
>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>> +# given iothreads (instead of the block device's iothread, or the
>> +# export's "main" iothread).
> When does "the block device's iothread" apply, and when "the export's
> main iothread"?
Depends on where you set the iothread option.
> Is this something the QMP user needs to know?
I think so, because e.g. if you set iothread on the device and the
export, you’ll get a conflict. But if you set it there and set this
option, you won’t. This option will just override the device/export option.
>
>
>> +# For this, the FUSE FD is duplicated so
>> +# there is one FD per iothread. (since 10.1)
> Is the file descriptor duplication something the QMP user needs to know?
I found this technical detail interesting, i.e. how multiqueue is
implemented for FUSE. Compare virtio devices, for which we make it
clear that virtqueues are mapped to I/O threads (not just in
documentation, but actually in option naming). Is it something they
must not know?
Hanna
>
>> +#
>> # Since: 6.0
>> ##
>> { 'struct': 'BlockExportOptionsFuse',
>> 'data': { 'mountpoint': 'str',
>> '*growable': 'bool',
>> - '*allow-other': 'FuseExportAllowOther' },
>> + '*allow-other': 'FuseExportAllowOther',
>> + '*iothreads': ['str'] },
>> 'if': 'CONFIG_FUSE' }
>>
>> ##
> [...]
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-26 9:55 ` Hanna Czenczek
@ 2025-03-26 11:41 ` Markus Armbruster
2025-03-26 13:56 ` Hanna Czenczek
0 siblings, 1 reply; 59+ messages in thread
From: Markus Armbruster @ 2025-03-26 11:41 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
Hanna Czenczek <hreitz@redhat.com> writes:
> On 26.03.25 06:38, Markus Armbruster wrote:
>> Hanna Czenczek <hreitz@redhat.com> writes:
>>
>>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>>
>>> We can use this to implement multi-threading.
>>>
>>> Note that the interface presented here differs from the multi-queue
>>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>>> allows processing multiple virtqueues in a single iothread. The
>>> equivalent (processing multiple FDs in a single iothread) would not make
>>> sense for FUSE because those FDs are used in a round-robin fashion by
>>> the FUSE kernel driver. Putting two of them into a single iothread will
>>> just create a bottleneck.
>>>
>>> Therefore, all we need is an array of iothreads, and we will create one
>>> "queue" (FD) per thread.
>>
>> [...]
>>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>> qapi/block-export.json | 8 +-
>>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>>
>>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>>> index c783e01a53..0bdd5992eb 100644
>>> --- a/qapi/block-export.json
>>> +++ b/qapi/block-export.json
>>> @@ -179,12 +179,18 @@
>>> # mount the export with allow_other, and if that fails, try again
>>> # without. (since 6.1; default: auto)
>>> #
>>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>>> +# given iothreads (instead of the block device's iothread, or the
>>> +# export's "main" iothread).
>>
>> When does "the block device's iothread" apply, and when "the export's
>> main iothread"?
>
> Depends on where you set the iothread option.
Assuming QMP users need to know (see right below), can we trust they
understand which one applies when? If not, can we provide clues?
>> Is this something the QMP user needs to know?
>
> I think so, because e.g. if you set iothread on the device and the export, you’ll get a conflict. But if you set it there and set this option, you won’t. This option will just override the device/export option.
Do we think the doc comment sufficient for QMP users to figure this out?
If not, can we provide clues?
In particular, do we think they can go from an export failure to the
setting @iothreads here? Perhaps the error message will guide them.
What is the message?
>>> +# For this, the FUSE FD is duplicated so
>>> +# there is one FD per iothread. (since 10.1)
>>
>> Is the file descriptor duplication something the QMP user needs to know?
>
> I found this technical detail interesting, i.e. how multiqueue is implemented for FUSE. Compare virtio devices, for which we make it clear that virtqueues are mapped to I/O threads (not just in documentation, but actually in option naming). Is it something they must not know?
Interesting to whom?
Users of QMP? Then explaining it in the doc comment (and thus the QEMU
QMP Reference Manual) is proper.
Just developers? Then the doc comment is the wrong spot.
The QEMU QMP Reference Manual is for users of QMP. It's dense reading.
Information the users are not expected to need / understand makes that
worse.
> Hanna
>
>>
>>> +#
>>> # Since: 6.0
>>> ##
>>> { 'struct': 'BlockExportOptionsFuse',
>>> 'data': { 'mountpoint': 'str',
>>> '*growable': 'bool',
>>> - '*allow-other': 'FuseExportAllowOther' },
>>> + '*allow-other': 'FuseExportAllowOther',
>>> + '*iothreads': ['str'] },
>>> 'if': 'CONFIG_FUSE' }
>>> ##
>> [...]
>>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-26 11:41 ` Markus Armbruster
@ 2025-03-26 13:56 ` Hanna Czenczek
2025-03-27 12:18 ` Markus Armbruster via
0 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-26 13:56 UTC (permalink / raw)
To: Markus Armbruster; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 26.03.25 12:41, Markus Armbruster wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 26.03.25 06:38, Markus Armbruster wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>>>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>>>
>>>> We can use this to implement multi-threading.
>>>>
>>>> Note that the interface presented here differs from the multi-queue
>>>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>>>> allows processing multiple virtqueues in a single iothread. The
>>>> equivalent (processing multiple FDs in a single iothread) would not make
>>>> sense for FUSE because those FDs are used in a round-robin fashion by
>>>> the FUSE kernel driver. Putting two of them into a single iothread will
>>>> just create a bottleneck.
>>>>
>>>> Therefore, all we need is an array of iothreads, and we will create one
>>>> "queue" (FD) per thread.
>>> [...]
>>>
>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>> ---
>>>> qapi/block-export.json | 8 +-
>>>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>>>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>>>
>>>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>>>> index c783e01a53..0bdd5992eb 100644
>>>> --- a/qapi/block-export.json
>>>> +++ b/qapi/block-export.json
>>>> @@ -179,12 +179,18 @@
>>>> # mount the export with allow_other, and if that fails, try again
>>>> # without. (since 6.1; default: auto)
>>>> #
>>>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>>>> +# given iothreads (instead of the block device's iothread, or the
>>>> +# export's "main" iothread).
>>> When does "the block device's iothread" apply, and when "the export's
>>> main iothread"?
>> Depends on where you set the iothread option.
> Assuming QMP users need to know (see right below), can we trust they
> understand which one applies when? If not, can we provide clues?
I don’t understand what exactly you mean, but which one applies when has
nothing to do with this option, but with the @iothread (and
@fixed-iothread) option(s) on BlockExportOptions, which do document this.
>
>>> Is this something the QMP user needs to know?
>> I think so, because e.g. if you set iothread on the device and the export, you’ll get a conflict. But if you set it there and set this option, you won’t. This option will just override the device/export option.
> Do we think the doc comment sufficient for QMP users to figure this out?
As for conflict, BlockExportOptions.iothread and
BlockExportOptions.fixed-iothread do.
As for overriding, I do think so. Do you not? I’m always open to
suggestions.
> If not, can we provide clues?
>
> In particular, do we think they can go from an export failure to the
> setting @iothreads here? Perhaps the error message will guide them.
> What is the message?
I don’t understand what failure you mean.
>
>>>> +# For this, the FUSE FD is duplicated so
>>>> +# there is one FD per iothread. (since 10.1)
>>> Is the file descriptor duplication something the QMP user needs to know?
>> I found this technical detail interesting, i.e. how multiqueue is implemented for FUSE. Compare virtio devices, for which we make it clear that virtqueues are mapped to I/O threads (not just in documentation, but actually in option naming). Is it something they must not know?
> Interesting to whom?
>
> Users of QMP? Then explaining it in the doc comment (and thus the QEMU
> QMP Reference Manual) is proper.
Yes, QEMU users. I find this information interesting to users because
virtio explains how multiqueue works there (see IOThreadVirtQueueMapping
in virtio.json), and this explains that for FUSE exports, there are no
virt queues, but requests come from that FD, which explains implicitly
why this doesn’t use the IOThreadVirtQueueMapping type.
In fact, if anything, I would even expand on the explanation to say that
requests are generally distributed in a round-robin fashion across FUSE
FDs regardless of where they originate from, contrasting with
virtqueues, which are generally tied to vCPUs.
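For illustration, the FD duplication mentioned above boils down to
something like the following sketch (not the QEMU code; error handling
reduced to bare returns):
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fuse.h>

    /* Clone an already-mounted /dev/fuse FD so that another iothread can
     * read requests from its own queue; returns the new FD, or -1 on error. */
    static int clone_fuse_fd(int session_fd)
    {
        int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

        if (clone_fd < 0) {
            return -1;
        }
        /* Attach the new FD to the existing FUSE connection */
        if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &session_fd) < 0) {
            close(clone_fd);
            return -1;
        }
        return clone_fd;
    }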
Hanna
>
> Just developers? Then the doc comment is the wrong spot.
>
> The QEMU QMP Reference Manual is for users of QMP. It's dense reading.
> Information the users are not expected to need / understand makes that
> worse.
>
>> Hanna
>>
>>>> +#
>>>> # Since: 6.0
>>>> ##
>>>> { 'struct': 'BlockExportOptionsFuse',
>>>> 'data': { 'mountpoint': 'str',
>>>> '*growable': 'bool',
>>>> - '*allow-other': 'FuseExportAllowOther' },
>>>> + '*allow-other': 'FuseExportAllowOther',
>>>> + '*iothreads': ['str'] },
>>>> 'if': 'CONFIG_FUSE' }
>>>> ##
>>> [...]
>>>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-26 13:56 ` Hanna Czenczek
@ 2025-03-27 12:18 ` Markus Armbruster via
2025-03-27 13:45 ` Hanna Czenczek
0 siblings, 1 reply; 59+ messages in thread
From: Markus Armbruster via @ 2025-03-27 12:18 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
Hanna Czenczek <hreitz@redhat.com> writes:
> On 26.03.25 12:41, Markus Armbruster wrote:
>> Hanna Czenczek <hreitz@redhat.com> writes:
>>
>>> On 26.03.25 06:38, Markus Armbruster wrote:
>>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>>
>>>>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>>>>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>>>>
>>>>> We can use this to implement multi-threading.
>>>>>
>>>>> Note that the interface presented here differs from the multi-queue
>>>>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>>>>> allows processing multiple virtqueues in a single iothread. The
>>>>> equivalent (processing multiple FDs in a single iothread) would not make
>>>>> sense for FUSE because those FDs are used in a round-robin fashion by
>>>>> the FUSE kernel driver. Putting two of them into a single iothread will
>>>>> just create a bottleneck.
>>>>>
>>>>> Therefore, all we need is an array of iothreads, and we will create one
>>>>> "queue" (FD) per thread.
>>>>
>>>> [...]
>>>>
>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>> ---
>>>>> qapi/block-export.json | 8 +-
>>>>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>>>>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>>>>
>>>>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>>>>> index c783e01a53..0bdd5992eb 100644
>>>>> --- a/qapi/block-export.json
>>>>> +++ b/qapi/block-export.json
>>>>> @@ -179,12 +179,18 @@
>>>>> # mount the export with allow_other, and if that fails, try again
>>>>> # without. (since 6.1; default: auto)
>>>>> #
>>>>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>>>>> +# given iothreads (instead of the block device's iothread, or the
>>>>> +# export's "main" iothread).
>>>>
>>>> When does "the block device's iothread" apply, and when "the export's
>>>> main iothread"?
>>>
>>> Depends on where you set the iothread option.
>>
>> Assuming QMP users need to know (see right below), can we trust they
>> understand which one applies when? If not, can we provide clues?
>
> I don’t understand what exactly you mean, but which one applies when has nothing to do with this option, but with the @iothread (and @fixed-iothread) option(s) on BlockExportOptions, which do document this.
Can you point me to the spot?
>>>> Is this something the QMP user needs to know?
>>>
>>> I think so, because e.g. if you set iothread on the device and the export, you’ll get a conflict. But if you set it there and set this option, you won’t. This option will just override the device/export option.
>>
>> Do we think the doc comment sufficient for QMP users to figure this out?
>
> As for conflict, BlockExportOptions.iothread and BlockExportOptions.fixed-iothread do.
>
> As for overriding, I do think so. Do you not? I’m always open to suggestions.
>
>> If not, can we provide clues?
>>
>> In particular, do we think they can go from an export failure to the
>> setting @iothreads here? Perhaps the error message will guide them.
>> What is the message?
>
> I don’t understand what failure you mean.
You wrote "you'll get a conflict". I assume this manifests as failure
of a QMP command (let's ignore CLI to keep things simple here).
Do we think ordinary users running into that failure can figure out they
can avoid it by setting @iothreads?
What's that failure's error message?
>>>>> +# For this, the FUSE FD is duplicated so
>>>>> +# there is one FD per iothread. (since 10.1)
>>>>
>>>> Is the file descriptor duplication something the QMP user needs to know?
>>>
>>> I found this technical detail interesting, i.e. how multiqueue is implemented for FUSE. Compare virtio devices, for which we make it clear that virtqueues are mapped to I/O threads (not just in documentation, but actually in option naming). Is it something they must not know?
>>
>> Interesting to whom?
>>
>> Users of QMP? Then explaining it in the doc comment (and thus the QEMU
>> QMP Reference Manual) is proper.
>
> Yes, QEMU users. I find this information interesting to users because virtio explains how multiqueue works there (see IOThreadVirtQueueMapping in virtio.json), and this explains that for FUSE exports, there are no virt queues, but requests come from that FD, which explains implicitly why this doesn’t use the IOThreadVirtQueueMapping type.
>
> In fact, if anything, I would even expand on the explanation to say that requests are generally distributed in a round-robin fashion across FUSE FDs regardless of where they originate from, contrasting with virtqueues, which are generally tied to vCPUs.
Up to you. I lack context to judge.
Making yourself or your fellow experts understand how this works is not
the same as making users understand how to use it. Making me understand
is not the same either, but it might be closer.
If this isn't useful to you, let me know, and I'll shut up :)
> Hanna
>
>>
>> Just developers? Then the doc comment is the wrong spot.
>>
>> The QEMU QMP Reference Manual is for users of QMP. It's dense reading.
>> Information the users are not expected to need / understand makes that
>> worse.
>>
>>> Hanna
>>>
>>>>> +#
>>>>> # Since: 6.0
>>>>> ##
>>>>> { 'struct': 'BlockExportOptionsFuse',
>>>>> 'data': { 'mountpoint': 'str',
>>>>> '*growable': 'bool',
>>>>> - '*allow-other': 'FuseExportAllowOther' },
>>>>> + '*allow-other': 'FuseExportAllowOther',
>>>>> + '*iothreads': ['str'] },
>>>>> 'if': 'CONFIG_FUSE' }
>>>>> ##
>>>> [...]
>>>>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-27 12:18 ` Markus Armbruster via
@ 2025-03-27 13:45 ` Hanna Czenczek
2025-04-01 12:05 ` Kevin Wolf
0 siblings, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-03-27 13:45 UTC (permalink / raw)
To: Markus Armbruster; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 27.03.25 13:18, Markus Armbruster wrote:
> Hanna Czenczek <hreitz@redhat.com> writes:
>
>> On 26.03.25 12:41, Markus Armbruster wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> On 26.03.25 06:38, Markus Armbruster wrote:
>>>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>>>
>>>>>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>>>>>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>>>>>
>>>>>> We can use this to implement multi-threading.
>>>>>>
>>>>>> Note that the interface presented here differs from the multi-queue
>>>>>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>>>>>> allows processing multiple virtqueues in a single iothread. The
>>>>>> equivalent (processing multiple FDs in a single iothread) would not make
>>>>>> sense for FUSE because those FDs are used in a round-robin fashion by
>>>>>> the FUSE kernel driver. Putting two of them into a single iothread will
>>>>>> just create a bottleneck.
>>>>>>
>>>>>> Therefore, all we need is an array of iothreads, and we will create one
>>>>>> "queue" (FD) per thread.
>>>>> [...]
>>>>>
>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>> ---
>>>>>> qapi/block-export.json | 8 +-
>>>>>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>>>>>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>>>>>
>>>>>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>>>>>> index c783e01a53..0bdd5992eb 100644
>>>>>> --- a/qapi/block-export.json
>>>>>> +++ b/qapi/block-export.json
>>>>>> @@ -179,12 +179,18 @@
>>>>>> # mount the export with allow_other, and if that fails, try again
>>>>>> # without. (since 6.1; default: auto)
>>>>>> #
>>>>>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>>>>>> +# given iothreads (instead of the block device's iothread, or the
>>>>>> +# export's "main" iothread).
>>>>> When does "the block device's iothread" apply, and when "the export's
>>>>> main iothread"?
>>>> Depends on where you set the iothread option.
>>> Assuming QMP users need to know (see right below), can we trust they
>>> understand which one applies when? If not, can we provide clues?
>> I don’t understand what exactly you mean, but which one applies when has nothing to do with this option, but with the @iothread (and @fixed-iothread) option(s) on BlockExportOptions, which do document this.
> Can you point me to the spot?
Sure:
https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#object-QMP-block-export.BlockExportOptions
(iothread and fixed-iothread)
>
>>>>> Is this something the QMP user needs to know?
>>>> I think so, because e.g. if you set iothread on the device and the export, you’ll get a conflict. But if you set it there and set this option, you won’t. This option will just override the device/export option.
>>> Do we think the doc comment sufficient for QMP users to figure this out?
>> As for conflict, BlockExportOptions.iothread and BlockExportOptions.fixed-iothread do.
>>
>> As for overriding, I do think so. Do you not? I’m always open to suggestions.
>>
>>> If not, can we provide clues?
>>>
>>> In particular, do we think they can go from an export failure to the
>>> setting @iothreads here? Perhaps the error message will guide them.
>>> What is the message?
>> I don’t understand what failure you mean.
> You wrote "you'll get a conflict". I assume this manifests as failure
> of a QMP command (let's ignore CLI to keep things simple here).
If you set the @iothread option on both the (guest) device and the
export (and also @fixed-iothread on the export), then you’ll get an
error. Nothing to do with this new @iothreads option here.
> Do we think ordinary users running into that failure can figure out they
> can avoid it by setting @iothreads?
It shouldn’t affect the failure. Setting @iothread on both device and
export (with @fixed-iothread) will always cause an error, and should.
Setting this option is not supposed to “fix” that configuration error.
Theoretically, setting @iothreads here could make it so that
BlockExportOptions.iothread (and/or fixed-iothread) is ignored, because
that thread will no longer be used for export-issued I/O; but in
practice, setting that option (BlockExportOptions.iothread) moves that
export and the whole BDS tree behind it to that I/O thread, so if you
haven’t specified an I/O thread on the guest device, the guest device
will then use that thread. So making @iothreads silently completely
ignore BlockExportOptions.iothread may cause surprising behavior.
Maybe we could make setting @iothreads here and the generic
BlockExportOptions.iothread at the same time an error. That would save
us the explanation here.
> What's that failure's error message?
$ echo '{"execute":"qmp_capabilities"}
{"execute":"block-export-add",
"arguments":{"type":"fuse",
"id":"exp",
"node-name":"null",
"mountpoint":"/tmp/fuse-export",
"iothread":"iothr1",
"fixed-iothread":true}}' |
build/qemu-system-x86_64 \
-object iothread,id=iothr0 \
-object iothread,id=iothr1 \
-blockdev null-co,node-name=null \
-device virtio-blk,drive=null,iothread=iothr0 \
-qmp stdio
{"QMP": {"version": {"qemu": {"micro": 91, "minor": 2, "major": 9},
"package": "v10.0.0-rc1"}, "capabilities": ["oob"]}}
{"return": {}}
{"error": {"class": "GenericError", "desc": "Cannot change iothread of
active block backend"}}
>
>>>>>> +# For this, the FUSE FD is duplicated so
>>>>>> +# there is one FD per iothread. (since 10.1)
>>>>> Is the file descriptor duplication something the QMP user needs to know?
>>>> I found this technical detail interesting, i.e. how multiqueue is implemented for FUSE. Compare virtio devices, for which we make it clear that virtqueues are mapped to I/O threads (not just in documentation, but actually in option naming). Is it something they must not know?
>>> Interesting to whom?
>>>
>>> Users of QMP? Then explaining it in the doc comment (and thus the QEMU
>>> QMP Reference Manual) is proper.
>> Yes, QEMU users. I find this information interesting to users because virtio explains how multiqueue works there (see IOThreadVirtQueueMapping in virtio.json), and this explains that for FUSE exports, there are no virt queues, but requests come from that FD, which explains implicitly why this doesn’t use the IOThreadVirtQueueMapping type.
>>
>> In fact, if anything, I would even expand on the explanation to say that requests are generally distributed in a round-robin fashion across FUSE FDs regardless of where they originate from, contrasting with virtqueues, which are generally tied to vCPUs.
> Up to you. I lack context to judge.
>
> Making yourself or your fellow experts understand how this works is not
> the same as making users understand how to use it. Making me understand
> is not the same either, but it might be closer.
This part of the documentation would concern itself less with “how to
use it”, and more “when to use it”: This round-robin distribution of
requests across FDs means that even if I/O is run in a single thread,
using multiple threads for the export may improve performance (as shown
in the commit message) – in contrast to virtqueue-based systems. So I
think that’s important information to users.
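To sketch that point in code (plain blocking reads in worker threads,
assuming fuse_fds[] already holds one cloned FD per worker; the series
itself uses iothreads and coroutines rather than this loop):
    #include <pthread.h>
    #include <unistd.h>
    #include <linux/fuse.h>

    #define NUM_WORKERS 4

    /* Each worker owns exactly one /dev/fuse FD; the kernel distributes
     * incoming requests across all cloned FDs, so every worker gets a share
     * of the load regardless of where the requests originate. */
    static void *queue_worker(void *opaque)
    {
        int fd = *(int *)opaque;
        char buf[FUSE_MIN_READ_BUFFER + 64 * 1024];

        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len < (ssize_t)sizeof(struct fuse_in_header)) {
                break; /* unmounted, or a read error */
            }
            /* ...decode the fuse_in_header and handle the request here... */
        }
        return NULL;
    }

    static void run_workers(int fuse_fds[NUM_WORKERS])
    {
        pthread_t workers[NUM_WORKERS];

        for (int i = 0; i < NUM_WORKERS; i++) {
            pthread_create(&workers[i], NULL, queue_worker, &fuse_fds[i]);
        }
        for (int i = 0; i < NUM_WORKERS; i++) {
            pthread_join(workers[i], NULL);
        }
    }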
Hanna
>
> If this isn't useful to you, let me know, and I'll shut up :)
>
>> Hanna
>>
>>> Just developers? Then the doc comment is the wrong spot.
>>>
>>> The QEMU QMP Reference Manual is for users of QMP. It's dense reading.
>>> Information the users are not expected to need / understand makes that
>>> worse.
>>>
>>>> Hanna
>>>>
>>>>>> +#
>>>>>> # Since: 6.0
>>>>>> ##
>>>>>> { 'struct': 'BlockExportOptionsFuse',
>>>>>> 'data': { 'mountpoint': 'str',
>>>>>> '*growable': 'bool',
>>>>>> - '*allow-other': 'FuseExportAllowOther' },
>>>>>> + '*allow-other': 'FuseExportAllowOther',
>>>>>> + '*iothreads': ['str'] },
>>>>>> 'if': 'CONFIG_FUSE' }
>>>>>> ##
>>>>> [...]
>>>>>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 01/15] fuse: Copy write buffer content before polling
2025-03-25 16:06 ` [PATCH 01/15] fuse: Copy write buffer content before polling Hanna Czenczek
@ 2025-03-27 14:47 ` Stefan Hajnoczi
2025-04-04 11:17 ` Hanna Czenczek
2025-04-01 13:44 ` Eric Blake
1 sibling, 1 reply; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 14:47 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf, qemu-stable
On Tue, Mar 25, 2025 at 05:06:35PM +0100, Hanna Czenczek wrote:
> Polling in I/O functions can lead to nested read_from_fuse_export()
"Polling" means several different things. "aio_poll()" or "nested event
loop" would be clearer.
> calls, overwriting the request buffer's content. The only function
> affected by this is fuse_write(), which therefore must use a bounce
> buffer or corruption may occur.
>
> Note that in addition we do not know whether libfuse-internal structures
> can cope with this nesting, and even if we did, we probably cannot rely
> on it in the future. This is the main reason why we want to remove
> libfuse from the I/O path.
>
> I do not have a good reproducer for this other than:
>
> $ dd if=/dev/urandom of=image bs=1M count=4096
> $ dd if=/dev/zero of=copy bs=1M count=4096
> $ touch fuse-export
> $ qemu-storage-daemon \
> --blockdev file,node-name=file,filename=copy \
> --export \
> fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
> &
>
> Other shell:
> $ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
> $ killall -SIGINT qemu-storage-daemon
> $ qemu-img compare image copy
> Content mismatch at offset 0!
>
> (The -t none in qemu-img convert is important.)
>
> I tried reproducing this with throttle and small aio_write requests from
> another qemu-io instance, but for some reason all requests are perfectly
> serialized then.
>
> I think in theory we should get parallel writes only if we set
> fi->parallel_direct_writes in fuse_open(). In fact, I can confirm that
> if we do that, that throttle-based reproducer works (i.e. does get
> parallel (nested) write requests). I have no idea why we still get
> parallel requests with qemu-img convert anyway.
>
> Also, a later patch in this series will set fi->parallel_direct_writes
> and note that it makes basically no difference when running fio on the
> current libfuse-based version of our code. It does make a difference
> without libfuse. So something quite fishy is going on.
>
> I will try to investigate further what the root cause is, but I think
> for now let's assume that calling blk_pwrite() can invalidate the buffer
> contents through nested polling.
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 465cc9891d..a12f479492 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -301,6 +301,12 @@ static void read_from_fuse_export(void *opaque)
> goto out;
> }
>
> + /*
> + * Note that polling in any request-processing function can lead to a nested
> + * read_from_fuse_export() call, which will overwrite the contents of
> + * exp->fuse_buf. Anything that takes a buffer needs to take care that the
> + * content is copied before potentially polling.
> + */
> fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
It seems safer to allocate a fuse_buf per request instead of copying the
data buffer only for write requests. Other request types might be
affected too (e.g. nested reads of different sizes).
I guess later on in this series a per-request fuse_buf will be
introduced anyway, so it doesn't matter what we do in this commit.
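A rough sketch of that suggestion, assuming libfuse3's
fuse_session_receive_buf()/fuse_session_process_buf() pair and ignoring
the export's in-flight accounting and FD handler registration:
    #define FUSE_USE_VERSION 31
    #include <fuse_lowlevel.h>
    #include <stdlib.h>

    /* Give every request its own buffer so that a nested event-loop
     * iteration cannot overwrite it while an outer request is still
     * being processed. */
    static void process_one_request(struct fuse_session *se)
    {
        struct fuse_buf fbuf = { .mem = NULL }; /* receive_buf() allocates .mem */
        int ret = fuse_session_receive_buf(se, &fbuf);

        if (ret > 0) {
            fuse_session_process_buf(se, &fbuf);
        }
        free(fbuf.mem); /* free(NULL) is fine if nothing was received */
    }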
>
> out:
> @@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> size_t size, off_t offset, struct fuse_file_info *fi)
> {
> FuseExport *exp = fuse_req_userdata(req);
> + void *copied;
> int64_t length;
> int ret;
>
> @@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> return;
> }
>
> + /*
> + * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
> + * I/O function may do), read_from_fuse_export() may be nested, overwriting
> + * the request buffer content. Therefore, we must copy it here.
> + */
> + copied = blk_blockalign(exp->common.blk, size);
> + memcpy(copied, buf, size);
> +
> /**
> * Clients will expect short writes at EOF, so we have to limit
> * offset+size to the image length.
> @@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> length = blk_getlength(exp->common.blk);
> if (length < 0) {
> fuse_reply_err(req, -length);
> - return;
> + goto free_buffer;
> }
>
> if (offset + size > length) {
> @@ -653,19 +668,22 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
> if (ret < 0) {
> fuse_reply_err(req, -ret);
> - return;
> + goto free_buffer;
> }
> } else {
> size = length - offset;
> }
> }
>
> - ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
> + ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
> if (ret >= 0) {
> fuse_reply_write(req, size);
> } else {
> fuse_reply_err(req, -ret);
> }
> +
> +free_buffer:
> + qemu_vfree(copied);
> }
>
> /**
> --
> 2.48.1
>
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal
2025-03-25 16:06 ` [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
2025-03-26 5:47 ` Markus Armbruster
@ 2025-03-27 14:51 ` Stefan Hajnoczi
1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 14:51 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:42PM +0100, Hanna Czenczek wrote:
> When exports are created on the command line (with the storage daemon),
> errp is going to point to error_fatal. Without ERRP_GUARD, we would
> exit immediately when *errp is set, i.e. skip the clean-up code under
> the `fail` label. Use ERRP_GUARD so we always run that code.
>
> As far as I know, this has no actual impact right now[1], but it is
> still better to make this right.
>
> [1] Not cleaning up the mount point is the only thing I can imagine
> would be problematic, but that is the last thing we attempt, so if
> it fails, it will clean itself up.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 1 +
> 1 file changed, 1 insertion(+)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 03/15] fuse: Remove superfluous empty line
2025-03-25 16:06 ` [PATCH 03/15] fuse: Remove superfluous empty line Hanna Czenczek
@ 2025-03-27 14:53 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 14:53 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:43PM +0100, Hanna Czenczek wrote:
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 1 -
> 1 file changed, 1 deletion(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 04/15] fuse: Explicitly set inode ID to 1
2025-03-25 16:06 ` [PATCH 04/15] fuse: Explicitly set inode ID to 1 Hanna Czenczek
@ 2025-03-27 14:54 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 14:54 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:44PM +0100, Hanna Czenczek wrote:
> Setting .st_ino to the FUSE inode ID is kind of arbitrary. While in
> practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because
> we only have the root inode, that is not obvious in fuse_getattr().
>
> Just explicitly set it to 1 (i.e. no functional change).
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 05/15] fuse: Change setup_... to mount_fuse_export()
2025-03-25 16:06 ` [PATCH 05/15] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
@ 2025-03-27 14:55 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 14:55 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:45PM +0100, Hanna Czenczek wrote:
> There is no clear separation between what should go into
> setup_fuse_export() and what should stay in fuse_export_create().
>
> Make it clear that setup_fuse_export() is for mounting only. Rename it,
> and move everything that has nothing to do with mounting up into
> fuse_export_create().
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 49 ++++++++++++++++++++-------------------------
> 1 file changed, 22 insertions(+), 27 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 06/15] fuse: Fix mount options
2025-03-25 16:06 ` [PATCH 06/15] fuse: Fix mount options Hanna Czenczek
@ 2025-03-27 14:58 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 14:58 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:46PM +0100, Hanna Czenczek wrote:
> Since I actually took a look into how mounting with libfuse works[1], I
> now know that the FUSE mount options are not exactly standard mount
> system call options. Specifically:
> - We should add "nosuid,nodev,noatime" because that is going to be
> translated into the respective MS_ mount flags; and those flags make
> sense for us.
> - We can set rw/ro to make the mount writable or not. It makes sense to
> set this flag to produce a better error message for read-only exports
> (EROFS instead of EACCES).
> This changes behavior as can be seen in iotest 308: It is no longer
> possible to modify metadata of read-only exports.
>
> In addition, in the comment, we can note that the FUSE mount() system
> call actually expects some more parameters that we can omit because
> fusermount3 (i.e. libfuse) will figure them out by itself:
> - fd: /dev/fuse fd
> - rootmode: Inode mode of the root node
> - user_id/group_id: Mounter's UID/GID
>
> [1] It invokes fusermount3, an SUID libfuse helper program, which parses
> and processes some mount options before actually invoking the
> mount() system call.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 14 +++++++++++---
> tests/qemu-iotests/308 | 4 ++--
> tests/qemu-iotests/308.out | 3 ++-
> 3 files changed, 15 insertions(+), 6 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 07/15] fuse: Set direct_io and parallel_direct_writes
2025-03-25 16:06 ` [PATCH 07/15] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
@ 2025-03-27 15:09 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:09 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:47PM +0100, Hanna Czenczek wrote:
> In fuse_open(), set these flags:
> - direct_io: We probably actually don't want to have the host page cache
> be used for our exports. QEMU block exports are supposed to represent
> the image as-is (and thus potentially changing).
> This causes a change in iotest 308's reference output.
>
> - parallel_direct_writes: We can (now) cope with parallel writes, so we
> should set this flag. For some reason, it doesn't seem to make an
> actual performance difference with libfuse, but it does make a
> difference without it, so let's set it.
> (See "fuse: Copy write buffer content before polling" for further
> discussion.)
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 2 ++
> tests/qemu-iotests/308.out | 2 +-
> 2 files changed, 3 insertions(+), 1 deletion(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers()
2025-03-25 16:06 ` [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
@ 2025-03-27 15:12 ` Stefan Hajnoczi
2025-04-01 13:55 ` Eric Blake
1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:12 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:48PM +0100, Hanna Czenczek wrote:
> Pull setting up and tearing down the AIO context handlers into two
> dedicated functions.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 09/15] fuse: Introduce fuse_{inc,dec}_in_flight()
2025-03-25 16:06 ` [PATCH 09/15] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
@ 2025-03-27 15:13 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:13 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:49PM +0100, Hanna Czenczek wrote:
> This is how vduse-blk.c does it, and it does seem better to have
> dedicated functions for it.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 29 +++++++++++++++++++++--------
> 1 file changed, 21 insertions(+), 8 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 10/15] fuse: Add halted flag
2025-03-25 16:06 ` [PATCH 10/15] fuse: Add halted flag Hanna Czenczek
@ 2025-03-27 15:15 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:15 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:50PM +0100, Hanna Czenczek wrote:
> This is a flag that we will want when processing FUSE requests
> ourselves: When the kernel sends us e.g. a truncated request (i.e. we
> receive less data than the request's indicated length), we cannot rely
> on subsequent data to be valid. Then, we are going to set this flag,
> halting all FUSE request processing.
>
> We plan to only use this flag in cases that would effectively be kernel
> bugs.
>
> (Right now, the flag is unused because libfuse still does our request
> processing.)
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 11/15] fuse: Manually process requests (without libfuse)
2025-03-25 16:06 ` [PATCH 11/15] fuse: Manually process requests (without libfuse) Hanna Czenczek
@ 2025-03-27 15:35 ` Stefan Hajnoczi
2025-04-04 12:36 ` Hanna Czenczek
2025-04-01 14:35 ` Eric Blake
1 sibling, 1 reply; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:35 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:51PM +0100, Hanna Czenczek wrote:
> Manually read requests from the /dev/fuse FD and process them, without
> using libfuse. This allows us to safely add parallel request processing
> in coroutines later, without having to worry about libfuse internals.
> (Technically, we already have exactly that problem with
> read_from_fuse_export()/read_from_fuse_fd() nesting.)
>
> We will continue to use libfuse for mounting the filesystem; fusermount3
> is effectively a helper program of libfuse, so it should know best how
> to interact with it. (Doing it manually without libfuse, while doable,
> is a bit of a pain, and it is not clear to me how stable the "protocol"
> actually is.)
>
> Take this opportunity of quite a major rewrite to update the Copyright
> line with corrected information that has surfaced in the meantime.
>
> Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
> except 'sync', which are iodepth=1 and pvsync2):
>
> file:
> read:
> seq aio: 78.6k ±1.3k IOPS
> rand aio: 39.3k ±2.9k
> seq sync: 32.5k ±0.7k
> rand sync: 9.9k ±0.1k
> write:
> seq aio: 61.9k ±0.5k
> rand aio: 61.2k ±0.6k
> seq sync: 27.9k ±0.2k
> rand sync: 27.6k ±0.4k
> null:
> read:
> seq aio: 214.0k ±5.9k
> rand aio: 212.7k ±4.5k
> seq sync: 90.3k ±6.5k
> rand sync: 89.7k ±5.1k
> write:
> seq aio: 203.9k ±1.5k
> rand aio: 201.4k ±3.6k
> seq sync: 86.1k ±6.2k
> rand sync: 84.9k ±5.3k
>
> And with this patch applied:
>
> file:
> read:
> seq aio: 76.6k ±1.8k (- 3 %)
> rand aio: 26.7k ±0.4k (-32 %)
> seq sync: 47.7k ±1.2k (+47 %)
> rand sync: 10.1k ±0.2k (+ 2 %)
> write:
> seq aio: 58.1k ±0.5k (- 6 %)
> rand aio: 58.1k ±0.5k (- 5 %)
> seq sync: 36.3k ±0.3k (+30 %)
> rand sync: 36.1k ±0.4k (+31 %)
> null:
> read:
> seq aio: 268.4k ±3.4k (+25 %)
> rand aio: 265.3k ±2.1k (+25 %)
> seq sync: 134.3k ±2.7k (+49 %)
> rand sync: 132.4k ±1.4k (+48 %)
> write:
> seq aio: 275.3k ±1.7k (+35 %)
> rand aio: 272.3k ±1.9k (+35 %)
> seq sync: 130.7k ±1.6k (+52 %)
> rand sync: 127.4k ±2.4k (+50 %)
>
> So clearly the AIO file results are actually not good, and random reads
> are indeed quite terrible. On the other hand, we can see from the sync
> and null results that request handling should in theory be quicker. How
> does this fit together?
>
> I believe the bad AIO results are an artifact of the accidental parallel
> request processing we have due to nested polling: Depending on how the
> actual request processing is structured and how long request processing
> takes, more or less requests will be submitted in parallel. So because
> of the restructuring, I think this patch accidentally changes how many
> requests end up being submitted in parallel, which decreases
> performance.
>
> (I have seen something like this before: In RSD, without having
> implemented a polling mode, the debug build tended to have better
> performance than the more optimized release build, because the debug
> build, taking longer to submit requests, ended up processing more
> requests in parallel.)
>
> In any case, once we use coroutines throughout the code, performance
> will improve again across the board.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 793 +++++++++++++++++++++++++++++++-------------
> 1 file changed, 567 insertions(+), 226 deletions(-)
>
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 3dd50badb3..407b101018 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -1,7 +1,7 @@
> /*
> * Present a block device as a raw image through FUSE
> *
> - * Copyright (c) 2020 Max Reitz <mreitz@redhat.com>
> + * Copyright (c) 2020, 2025 Hanna Czenczek <hreitz@redhat.com>
> *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> @@ -27,12 +27,15 @@
> #include "block/qapi.h"
> #include "qapi/error.h"
> #include "qapi/qapi-commands-block.h"
> +#include "qemu/error-report.h"
> #include "qemu/main-loop.h"
> #include "system/block-backend.h"
>
> #include <fuse.h>
> #include <fuse_lowlevel.h>
>
> +#include "standard-headers/linux/fuse.h"
> +
> #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
> #include <linux/falloc.h>
> #endif
> @@ -42,17 +45,27 @@
> #endif
>
> /* Prevent overly long bounce buffer allocations */
> -#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
> -
> +#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
> +/* Small enough to fit in the request buffer */
> +#define FUSE_MAX_WRITE_BYTES (4 * 1024)
>
> typedef struct FuseExport {
> BlockExport common;
>
> struct fuse_session *fuse_session;
> - struct fuse_buf fuse_buf;
> unsigned int in_flight; /* atomic */
> bool mounted, fd_handler_set_up;
>
> + /*
> + * The request buffer must be able to hold a full write, and/or at least
> + * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
> + */
> + char request_buf[MAX_CONST(
> + sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
> + FUSE_MAX_WRITE_BYTES,
> + FUSE_MIN_READ_BUFFER
> + )];
> +
> /*
> * Set when there was an unrecoverable error and no requests should be read
> * from the device anymore (basically only in case of something we would
> @@ -60,6 +73,8 @@ typedef struct FuseExport {
> */
> bool halted;
>
> + int fuse_fd;
> +
> char *mountpoint;
> bool writable;
> bool growable;
> @@ -72,19 +87,20 @@ typedef struct FuseExport {
> } FuseExport;
>
> static GHashTable *exports;
> -static const struct fuse_lowlevel_ops fuse_ops;
>
> static void fuse_export_shutdown(BlockExport *exp);
> static void fuse_export_delete(BlockExport *exp);
> -static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
> +static void fuse_export_halt(FuseExport *exp);
>
> static void init_exports_table(void);
>
> static int mount_fuse_export(FuseExport *exp, Error **errp);
> -static void read_from_fuse_export(void *opaque);
>
> static bool is_regular_file(const char *path, Error **errp);
>
> +static bool poll_fuse_fd(void *opaque);
> +static void read_fuse_fd(void *opaque);
> +static void fuse_process_request(FuseExport *exp);
>
> static void fuse_inc_in_flight(FuseExport *exp)
> {
> @@ -105,22 +121,27 @@ static void fuse_dec_in_flight(FuseExport *exp)
> }
> }
>
> +/**
> + * Attach FUSE FD read and poll handlers.
> + */
> static void fuse_attach_handlers(FuseExport *exp)
> {
> if (exp->halted) {
> return;
> }
>
> - aio_set_fd_handler(exp->common.ctx,
> - fuse_session_fd(exp->fuse_session),
> - read_from_fuse_export, NULL, NULL, NULL, exp);
> + aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
> + read_fuse_fd, NULL, poll_fuse_fd,
> + read_fuse_fd, exp);
> exp->fd_handler_set_up = true;
> }
>
> +/**
> + * Detach FUSE FD read and poll handlers.
> + */
> static void fuse_detach_handlers(FuseExport *exp)
> {
> - aio_set_fd_handler(exp->common.ctx,
> - fuse_session_fd(exp->fuse_session),
> + aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
> NULL, NULL, NULL, NULL, NULL);
> exp->fd_handler_set_up = false;
> }
> @@ -247,6 +268,14 @@ static int fuse_export_create(BlockExport *blk_exp,
>
> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>
> + exp->fuse_fd = fuse_session_fd(exp->fuse_session);
> + ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
> + if (ret < 0) {
> + ret = -errno;
> + error_setg_errno(errp, errno, "Failed to make FUSE FD non-blocking");
> + goto fail;
> + }
> +
> fuse_attach_handlers(exp);
> return 0;
>
> @@ -292,7 +321,7 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
> "default_permissions%s",
> exp->writable ? "rw" : "ro",
> - FUSE_MAX_BOUNCE_BYTES,
> + FUSE_MAX_READ_BYTES,
> exp->allow_other ? ",allow_other" : "");
>
> fuse_argv[0] = ""; /* Dummy program name */
> @@ -301,8 +330,8 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> fuse_argv[3] = NULL;
> fuse_args = (struct fuse_args)FUSE_ARGS_INIT(3, (char **)fuse_argv);
>
> - exp->fuse_session = fuse_session_new(&fuse_args, &fuse_ops,
> - sizeof(fuse_ops), exp);
> + /* We just create the session for mounting/unmounting, no need to set ops */
> + exp->fuse_session = fuse_session_new(&fuse_args, NULL, 0, NULL);
> g_free(mount_opts);
> if (!exp->fuse_session) {
> error_setg(errp, "Failed to set up FUSE session");
> @@ -320,55 +349,94 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> }
>
> /**
> - * Callback to be invoked when the FUSE session FD can be read from.
> - * (This is basically the FUSE event loop.)
> + * Try to read a single request from the FUSE FD.
> + * If a request is available, process it, and return true.
> + * Otherwise, return false.
> */
> -static void read_from_fuse_export(void *opaque)
> +static bool read_from_fuse_fd(void *opaque)
> {
> FuseExport *exp = opaque;
> - int ret;
> + int fuse_fd = exp->fuse_fd;
> + ssize_t ret;
> + const struct fuse_in_header *in_hdr;
> +
> + fuse_inc_in_flight(exp);
>
> if (unlikely(exp->halted)) {
> - return;
> + goto no_request;
> }
>
> - fuse_inc_in_flight(exp);
> + ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
> + sizeof(exp->request_buf)));
> + if (ret < 0 && errno == EAGAIN) {
> + /* No request available */
> + goto no_request;
> + } else if (unlikely(ret < 0)) {
> + error_report("Failed to read from FUSE device: %s", strerror(-ret));
> + goto no_request;
> + }
>
> - do {
> - ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
> - } while (ret == -EINTR);
> - if (ret < 0) {
> - goto out;
> + if (unlikely(ret < sizeof(*in_hdr))) {
> + error_report("Incomplete read from FUSE device, expected at least %zu "
> + "bytes, read %zi bytes; cannot trust subsequent "
> + "requests, halting the export",
> + sizeof(*in_hdr), ret);
> + fuse_export_halt(exp);
> + goto no_request;
> }
>
> - /*
> - * Note that polling in any request-processing function can lead to a nested
> - * read_from_fuse_export() call, which will overwrite the contents of
> - * exp->fuse_buf. Anything that takes a buffer needs to take care that the
> - * content is copied before potentially polling.
> - */
> - fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
> + in_hdr = (const struct fuse_in_header *)exp->request_buf;
> + if (unlikely(ret != in_hdr->len)) {
> + error_report("Number of bytes read from FUSE device does not match "
> + "request size, expected %" PRIu32 " bytes, read %zi "
> + "bytes; cannot trust subsequent requests, halting the "
> + "export",
> + in_hdr->len, ret);
> + fuse_export_halt(exp);
> + goto no_request;
> + }
> +
> + fuse_process_request(exp);
> + fuse_dec_in_flight(exp);
> + return true;
>
> -out:
> +no_request:
> fuse_dec_in_flight(exp);
> + return false;
> +}
> +
> +/**
> + * Check the FUSE FD for whether it is readable or not. Because we cannot
> + * reasonably do this without reading a request at the same time, also read and
> + * process that request if any.
> + * (To be used as a poll handler for the FUSE FD.)
> + */
> +static bool poll_fuse_fd(void *opaque)
> +{
> + return read_from_fuse_fd(opaque);
> +}
The other io_poll() callbacks in QEMU peek at memory whereas this one
invokes the read(2) syscall. Two reasons why this is a problem:
1. Syscall latency is too high. Other fd handlers will be delayed by
microseconds.
2. This doesn't scale. If every component in QEMU does this then the
event loop degrades to O(n) of non-blocking read(2) syscalls where n
is the number of fds.
Also, handling the request inside the io_poll() callback skews
AioContext's time accounting because time spent handling the request
will be accounted as "polling time". The adaptive polling calculation
will think it polled for longer than it did.
If there is no way to peek at memory, please don't implement the
io_poll() callback.
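For illustration, here is a rough sketch of the difference (all names below
are invented for the sketch; /dev/fuse offers no comparable shared-memory
doorbell, which is exactly the limitation):

  /*
   * Hypothetical example, not QEMU code: a poll handler that peeks at
   * memory (e.g. a virtqueue index) versus one that must issue a
   * non-blocking read(2) on every polling iteration.
   */
  #include <stdbool.h>
  #include <stdint.h>
  #include <unistd.h>

  struct poll_state {
      volatile uint16_t *doorbell; /* shared-memory progress counter */
      uint16_t last_seen;
      int fd;                      /* O_NONBLOCK fd, e.g. /dev/fuse */
  };

  /* Cheap: two loads, fine to call many times per polling window */
  static bool poll_by_peeking(struct poll_state *s)
  {
      return *s->doorbell != s->last_seen;
  }

  /*
   * Expensive: one syscall per call, and a successful read consumes the
   * request as a side effect, so "polling" and "handling" get mixed up.
   */
  static bool poll_by_syscall(struct poll_state *s, void *buf, size_t len)
  {
      return read(s->fd, buf, len) > 0;
  }

The adaptive polling loop is built around the first shape; the second is
what this patch currently does in poll_fuse_fd().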
> +
> +/**
> + * Read a request from the FUSE FD.
> + * (To be used as a handler for when the FUSE FD becomes readable.)
> + */
> +static void read_fuse_fd(void *opaque)
> +{
> + read_from_fuse_fd(opaque);
> }
>
> static void fuse_export_shutdown(BlockExport *blk_exp)
> {
> FuseExport *exp = container_of(blk_exp, FuseExport, common);
>
> - if (exp->fuse_session) {
> - fuse_session_exit(exp->fuse_session);
> -
> - if (exp->fd_handler_set_up) {
> - fuse_detach_handlers(exp);
> - }
> + if (exp->fd_handler_set_up) {
> + fuse_detach_handlers(exp);
> }
>
> if (exp->mountpoint) {
> /*
> - * Safe to drop now, because we will not handle any requests
> - * for this export anymore anyway.
> + * Safe to drop now, because we will not handle any requests for this
> + * export anymore anyway (at least not from the main thread).
> */
> g_hash_table_remove(exports, exp->mountpoint);
> }
> @@ -386,7 +454,6 @@ static void fuse_export_delete(BlockExport *blk_exp)
> fuse_session_destroy(exp->fuse_session);
> }
>
> - free(exp->fuse_buf.mem);
> g_free(exp->mountpoint);
> }
>
> @@ -428,46 +495,57 @@ static bool is_regular_file(const char *path, Error **errp)
> }
>
> /**
> - * A chance to set change some parameters supplied to FUSE_INIT.
> + * Process FUSE INIT.
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_init(void *userdata, struct fuse_conn_info *conn)
> +static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
> + uint32_t max_readahead, uint32_t flags)
> {
> - /*
> - * MIN_NON_ZERO() would not be wrong here, but what we set here
> - * must equal what has been passed to fuse_session_new().
> - * Therefore, as long as max_read must be passed as a mount option
> - * (which libfuse claims will be changed at some point), we have
> - * to set max_read to a fixed value here.
> - */
> - conn->max_read = FUSE_MAX_BOUNCE_BYTES;
> + const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
>
> - conn->max_write = MIN_NON_ZERO(BDRV_REQUEST_MAX_BYTES, conn->max_write);
> -}
> + *out = (struct fuse_init_out) {
> + .major = FUSE_KERNEL_VERSION,
> + .minor = FUSE_KERNEL_MINOR_VERSION,
> + .max_readahead = max_readahead,
> + .max_write = FUSE_MAX_WRITE_BYTES,
> + .flags = flags & supported_flags,
> + .flags2 = 0,
>
> -/**
> - * Let clients look up files. Always return ENOENT because we only
> - * care about the mountpoint itself.
> - */
> -static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
> -{
> - fuse_reply_err(req, ENOENT);
> + /* libfuse maximum: 2^16 - 1 */
> + .max_background = UINT16_MAX,
> +
> + /* libfuse default: max_background * 3 / 4 */
> + .congestion_threshold = (int)UINT16_MAX * 3 / 4,
> +
> + /* libfuse default: 1 */
> + .time_gran = 1,
> +
> + /*
> + * probably unneeded without FUSE_MAX_PAGES, but this would be the
> + * libfuse default
> + */
> + .max_pages = DIV_ROUND_UP(FUSE_MAX_WRITE_BYTES,
> + qemu_real_host_page_size()),
> +
> + /* Only needed for mappings (i.e. DAX) */
> + .map_alignment = 0,
> + };
> +
> + return sizeof(*out);
> }
>
> /**
> * Let clients get file attributes (i.e., stat() the file).
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
> - struct fuse_file_info *fi)
> +static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
> {
> - struct stat statbuf;
> int64_t length, allocated_blocks;
> time_t now = time(NULL);
> - FuseExport *exp = fuse_req_userdata(req);
>
> length = blk_getlength(exp->common.blk);
> if (length < 0) {
> - fuse_reply_err(req, -length);
> - return;
> + return length;
> }
>
> allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
> @@ -477,21 +555,24 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
> allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
> }
>
> - statbuf = (struct stat) {
> - .st_ino = 1,
> - .st_mode = exp->st_mode,
> - .st_nlink = 1,
> - .st_uid = exp->st_uid,
> - .st_gid = exp->st_gid,
> - .st_size = length,
> - .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
> - .st_blocks = allocated_blocks,
> - .st_atime = now,
> - .st_mtime = now,
> - .st_ctime = now,
> + *out = (struct fuse_attr_out) {
> + .attr_valid = 1,
> + .attr = {
> + .ino = 1,
> + .mode = exp->st_mode,
> + .nlink = 1,
> + .uid = exp->st_uid,
> + .gid = exp->st_gid,
> + .size = length,
> + .blksize = blk_bs(exp->common.blk)->bl.request_alignment,
> + .blocks = allocated_blocks,
> + .atime = now,
> + .mtime = now,
> + .ctime = now,
> + },
> };
>
> - fuse_reply_attr(req, &statbuf, 1.);
> + return sizeof(*out);
> }
>
> static int fuse_do_truncate(const FuseExport *exp, int64_t size,
> @@ -544,160 +625,149 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
> * permit access: Read-only exports cannot be given +w, and exports
> * without allow_other cannot be given a different UID or GID, and
> * they cannot be given non-owner access.
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
> - int to_set, struct fuse_file_info *fi)
> +static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
> + uint32_t to_set, uint64_t size, uint32_t mode,
> + uint32_t uid, uint32_t gid)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> int supported_attrs;
> int ret;
>
> - supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
> + /* SIZE and MODE are actually supported, the others can be safely ignored */
> + supported_attrs = FATTR_SIZE | FATTR_MODE |
> + FATTR_FH | FATTR_LOCKOWNER | FATTR_KILL_SUIDGID;
> if (exp->allow_other) {
> - supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
> + supported_attrs |= FATTR_UID | FATTR_GID;
> }
>
> if (to_set & ~supported_attrs) {
> - fuse_reply_err(req, ENOTSUP);
> - return;
> + return -ENOTSUP;
> }
>
> /* Do some argument checks first before committing to anything */
> - if (to_set & FUSE_SET_ATTR_MODE) {
> + if (to_set & FATTR_MODE) {
> /*
> * Without allow_other, non-owners can never access the export, so do
> * not allow setting permissions for them
> */
> - if (!exp->allow_other &&
> - (statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
> - {
> - fuse_reply_err(req, EPERM);
> - return;
> + if (!exp->allow_other && (mode & (S_IRWXG | S_IRWXO)) != 0) {
> + return -EPERM;
> }
>
> /* +w for read-only exports makes no sense, disallow it */
> - if (!exp->writable &&
> - (statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
> - {
> - fuse_reply_err(req, EROFS);
> - return;
> + if (!exp->writable && (mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0) {
> + return -EROFS;
> }
> }
>
> - if (to_set & FUSE_SET_ATTR_SIZE) {
> + if (to_set & FATTR_SIZE) {
> if (!exp->writable) {
> - fuse_reply_err(req, EACCES);
> - return;
> + return -EACCES;
> }
>
> - ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
> + ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
> if (ret < 0) {
> - fuse_reply_err(req, -ret);
> - return;
> + return ret;
> }
> }
>
> - if (to_set & FUSE_SET_ATTR_MODE) {
> + if (to_set & FATTR_MODE) {
> /* Ignore FUSE-supplied file type, only change the mode */
> - exp->st_mode = (statbuf->st_mode & 07777) | S_IFREG;
> + exp->st_mode = (mode & 07777) | S_IFREG;
> }
>
> - if (to_set & FUSE_SET_ATTR_UID) {
> - exp->st_uid = statbuf->st_uid;
> + if (to_set & FATTR_UID) {
> + exp->st_uid = uid;
> }
>
> - if (to_set & FUSE_SET_ATTR_GID) {
> - exp->st_gid = statbuf->st_gid;
> + if (to_set & FATTR_GID) {
> + exp->st_gid = gid;
> }
>
> - fuse_getattr(req, inode, fi);
> + return fuse_getattr(exp, out);
> }
>
> /**
> - * Let clients open a file (i.e., the exported image).
> + * Open an inode. We only have a single inode in our exported filesystem, so we
> + * just acknowledge the request.
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_open(fuse_req_t req, fuse_ino_t inode,
> - struct fuse_file_info *fi)
> +static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
> {
> - fi->direct_io = true;
> - fi->parallel_direct_writes = true;
> - fuse_reply_open(req, fi);
> + *out = (struct fuse_open_out) {
> + .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
> + };
> + return sizeof(*out);
> }
>
> /**
> - * Handle client reads from the exported image.
> + * Handle client reads from the exported image. Allocates *bufptr and reads
> + * data from the block device into that buffer.
> + * Returns the buffer (read) size on success, and -errno on error.
> */
> -static void fuse_read(fuse_req_t req, fuse_ino_t inode,
> - size_t size, off_t offset, struct fuse_file_info *fi)
> +static ssize_t fuse_read(FuseExport *exp, void **bufptr,
> + uint64_t offset, uint32_t size)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> - int64_t length;
> + int64_t blk_len;
> void *buf;
> int ret;
>
> /* Limited by max_read, should not happen */
> - if (size > FUSE_MAX_BOUNCE_BYTES) {
> - fuse_reply_err(req, EINVAL);
> - return;
> + if (size > FUSE_MAX_READ_BYTES) {
> + return -EINVAL;
> }
>
> /**
> * Clients will expect short reads at EOF, so we have to limit
> * offset+size to the image length.
> */
> - length = blk_getlength(exp->common.blk);
> - if (length < 0) {
> - fuse_reply_err(req, -length);
> - return;
> + blk_len = blk_getlength(exp->common.blk);
> + if (blk_len < 0) {
> + return blk_len;
> }
>
> - if (offset + size > length) {
> - size = length - offset;
> + if (offset + size > blk_len) {
> + size = blk_len - offset;
> }
>
> buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
> if (!buf) {
> - fuse_reply_err(req, ENOMEM);
> - return;
> + return -ENOMEM;
> }
>
> ret = blk_pread(exp->common.blk, offset, size, buf, 0);
> - if (ret >= 0) {
> - fuse_reply_buf(req, buf, size);
> - } else {
> - fuse_reply_err(req, -ret);
> + if (ret < 0) {
> + qemu_vfree(buf);
> + return ret;
> }
>
> - qemu_vfree(buf);
> + *bufptr = buf;
> + return size;
> }
>
> /**
> - * Handle client writes to the exported image.
> + * Handle client writes to the exported image. @buf has the data to be written
> + * and will be copied to a bounce buffer before polling for the first time.
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> - size_t size, off_t offset, struct fuse_file_info *fi)
> +static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
> + uint64_t offset, uint32_t size, const void *buf)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> void *copied;
> - int64_t length;
> + int64_t blk_len;
> int ret;
>
> /* Limited by max_write, should not happen */
> if (size > BDRV_REQUEST_MAX_BYTES) {
> - fuse_reply_err(req, EINVAL);
> - return;
> + return -EINVAL;
> }
>
> if (!exp->writable) {
> - fuse_reply_err(req, EACCES);
> - return;
> + return -EACCES;
> }
>
> - /*
> - * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
> - * I/O function may do), read_from_fuse_export() may be nested, overwriting
> - * the request buffer content. Therefore, we must copy it here.
> - */
> + /* Must copy to bounce buffer before polling (to allow nesting) */
> copied = blk_blockalign(exp->common.blk, size);
> memcpy(copied, buf, size);
>
> @@ -705,55 +775,57 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> * Clients will expect short writes at EOF, so we have to limit
> * offset+size to the image length.
> */
> - length = blk_getlength(exp->common.blk);
> - if (length < 0) {
> - fuse_reply_err(req, -length);
> - goto free_buffer;
> + blk_len = blk_getlength(exp->common.blk);
> + if (blk_len < 0) {
> + ret = blk_len;
> + goto fail_free_buffer;
> }
>
> - if (offset + size > length) {
> + if (offset + size > blk_len) {
> if (exp->growable) {
> ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
> if (ret < 0) {
> - fuse_reply_err(req, -ret);
> - goto free_buffer;
> + goto fail_free_buffer;
> }
> } else {
> - size = length - offset;
> + size = blk_len - offset;
> }
> }
>
> ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
> - if (ret >= 0) {
> - fuse_reply_write(req, size);
> - } else {
> - fuse_reply_err(req, -ret);
> + if (ret < 0) {
> + goto fail_free_buffer;
> }
>
> -free_buffer:
> qemu_vfree(copied);
> +
> + *out = (struct fuse_write_out) {
> + .size = size,
> + };
> + return sizeof(*out);
> +
> +fail_free_buffer:
> + qemu_vfree(copied);
> + return ret;
> }
>
> /**
> * Let clients perform various fallocate() operations.
> + * Return 0 on success (no 'out' object), and -errno on error.
> */
> -static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
> - off_t offset, off_t length,
> - struct fuse_file_info *fi)
> +static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
> + uint32_t mode)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> int64_t blk_len;
> int ret;
>
> if (!exp->writable) {
> - fuse_reply_err(req, EACCES);
> - return;
> + return -EACCES;
> }
>
> blk_len = blk_getlength(exp->common.blk);
> if (blk_len < 0) {
> - fuse_reply_err(req, -blk_len);
> - return;
> + return blk_len;
> }
>
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> @@ -765,16 +837,14 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
> if (!mode) {
> /* We can only fallocate at the EOF with a truncate */
> if (offset < blk_len) {
> - fuse_reply_err(req, EOPNOTSUPP);
> - return;
> + return -EOPNOTSUPP;
> }
>
> if (offset > blk_len) {
> /* No preallocation needed here */
> ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
> if (ret < 0) {
> - fuse_reply_err(req, -ret);
> - return;
> + return ret;
> }
> }
>
> @@ -784,8 +854,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> else if (mode & FALLOC_FL_PUNCH_HOLE) {
> if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> - fuse_reply_err(req, EINVAL);
> - return;
> + return -EINVAL;
> }
>
> do {
> @@ -813,8 +882,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
> ret = fuse_do_truncate(exp, offset + length, false,
> PREALLOC_MODE_OFF);
> if (ret < 0) {
> - fuse_reply_err(req, -ret);
> - return;
> + return ret;
> }
> }
>
> @@ -832,44 +900,38 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
> ret = -EOPNOTSUPP;
> }
>
> - fuse_reply_err(req, ret < 0 ? -ret : 0);
> + return ret < 0 ? ret : 0;
> }
>
> /**
> * Let clients fsync the exported image.
> + * Return 0 on success (no 'out' object), and -errno on error.
> */
> -static void fuse_fsync(fuse_req_t req, fuse_ino_t inode, int datasync,
> - struct fuse_file_info *fi)
> +static ssize_t fuse_fsync(FuseExport *exp)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> - int ret;
> -
> - ret = blk_flush(exp->common.blk);
> - fuse_reply_err(req, ret < 0 ? -ret : 0);
> + return blk_flush(exp->common.blk);
> }
>
> /**
> * Called before an FD to the exported image is closed. (libfuse
> * notes this to be a way to return last-minute errors.)
> + * Return 0 on success (no 'out' object), and -errno on error.
> */
> -static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
> - struct fuse_file_info *fi)
> +static ssize_t fuse_flush(FuseExport *exp)
> {
> - fuse_fsync(req, inode, 1, fi);
> + return blk_flush(exp->common.blk);
> }
>
> #ifdef CONFIG_FUSE_LSEEK
> /**
> * Let clients inquire allocation status.
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
> - int whence, struct fuse_file_info *fi)
> +static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
> + uint64_t offset, uint32_t whence)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> -
> if (whence != SEEK_HOLE && whence != SEEK_DATA) {
> - fuse_reply_err(req, EINVAL);
> - return;
> + return -EINVAL;
> }
>
> while (true) {
> @@ -879,8 +941,7 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
> ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
> offset, INT64_MAX, &pnum, NULL, NULL);
> if (ret < 0) {
> - fuse_reply_err(req, -ret);
> - return;
> + return ret;
> }
>
> if (!pnum && (ret & BDRV_BLOCK_EOF)) {
> @@ -897,34 +958,38 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
>
> blk_len = blk_getlength(exp->common.blk);
> if (blk_len < 0) {
> - fuse_reply_err(req, -blk_len);
> - return;
> + return blk_len;
> }
>
> if (offset > blk_len || whence == SEEK_DATA) {
> - fuse_reply_err(req, ENXIO);
> - } else {
> - fuse_reply_lseek(req, offset);
> + return -ENXIO;
> }
> - return;
> +
> + *out = (struct fuse_lseek_out) {
> + .offset = offset,
> + };
> + return sizeof(*out);
> }
>
> if (ret & BDRV_BLOCK_DATA) {
> if (whence == SEEK_DATA) {
> - fuse_reply_lseek(req, offset);
> - return;
> + *out = (struct fuse_lseek_out) {
> + .offset = offset,
> + };
> + return sizeof(*out);
> }
> } else {
> if (whence == SEEK_HOLE) {
> - fuse_reply_lseek(req, offset);
> - return;
> + *out = (struct fuse_lseek_out) {
> + .offset = offset,
> + };
> + return sizeof(*out);
> }
> }
>
> /* Safety check against infinite loops */
> if (!pnum) {
> - fuse_reply_err(req, ENXIO);
> - return;
> + return -ENXIO;
> }
>
> offset += pnum;
> @@ -932,21 +997,297 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
> }
> #endif
>
> -static const struct fuse_lowlevel_ops fuse_ops = {
> - .init = fuse_init,
> - .lookup = fuse_lookup,
> - .getattr = fuse_getattr,
> - .setattr = fuse_setattr,
> - .open = fuse_open,
> - .read = fuse_read,
> - .write = fuse_write,
> - .fallocate = fuse_fallocate,
> - .flush = fuse_flush,
> - .fsync = fuse_fsync,
> +/**
> + * Write a FUSE response to the given @fd, using a single buffer consecutively
> + * containing both the response header and data: Initialize *out_hdr, and write
> + * it plus @response_data_length consecutive bytes to @fd.
> + *
> + * @fd: FUSE file descriptor
> + * @req_id: Corresponding request ID
> + * @out_hdr: Pointer to buffer that will hold the output header, and
> + * additionally already contains @response_data_length data bytes
> + * starting at *out_hdr + 1.
> + * @err: Error code (-errno, or 0 in case of success)
> + * @response_data_length: Length of data to return (following *out_hdr)
> + */
> +static int fuse_write_response(int fd, uint32_t req_id,
> + struct fuse_out_header *out_hdr, int err,
> + size_t response_data_length)
> +{
> + void *write_ptr = out_hdr;
> + size_t to_write = sizeof(*out_hdr) + response_data_length;
> + ssize_t ret;
> +
> + *out_hdr = (struct fuse_out_header) {
> + .len = to_write,
> + .error = err,
> + .unique = req_id,
> + };
> +
> + while (true) {
> + ret = RETRY_ON_EINTR(write(fd, write_ptr, to_write));
> + if (ret < 0) {
> + ret = -errno;
> + error_report("Failed to write to FUSE device: %s", strerror(-ret));
> + return ret;
> + } else {
> + to_write -= ret;
> + if (to_write > 0) {
> + write_ptr += ret;
> + } else {
> + return 0; /* success */
> + }
> + }
> + }
> +}
> +
> +/**
> + * Write a FUSE response to the given @fd, using separate buffers for the
> + * response header and data: Initialize *out_hdr, and write it plus the data in
> + * *buf to @fd.
> + *
> + * In contrast to fuse_write_response(), this function cannot return errors, and
> + * will always return success (error code 0).
> + *
> + * @fd: FUSE file descriptor
> + * @req_id: Corresponding request ID
> + * @out_hdr: Pointer to buffer that will hold the output header
> + * @buf: Pointer to response data
> + * @buflen: Length of response data
> + */
> +static int fuse_write_buf_response(int fd, uint32_t req_id,
> + struct fuse_out_header *out_hdr,
> + const void *buf, size_t buflen)
> +{
> + struct iovec iov[2] = {
> + { out_hdr, sizeof(*out_hdr) },
> + { (void *)buf, buflen },
> + };
> + struct iovec *iovp = iov;
> + unsigned iov_count = ARRAY_SIZE(iov);
> + size_t to_write = sizeof(*out_hdr) + buflen;
> + ssize_t ret;
> +
> + *out_hdr = (struct fuse_out_header) {
> + .len = to_write,
> + .unique = req_id,
> + };
> +
> + while (true) {
> + ret = RETRY_ON_EINTR(writev(fd, iovp, iov_count));
> + if (ret < 0) {
> + ret = -errno;
> + error_report("Failed to write to FUSE device: %s", strerror(-ret));
> + return ret;
> + } else {
> + to_write -= ret;
> + if (to_write > 0) {
> + iov_discard_front(&iovp, &iov_count, ret);
> + } else {
> + return 0; /* success */
> + }
> + }
> + }
> +}
> +
> +/*
> + * For use in fuse_process_request():
> + * Returns a pointer to the parameter object for the given operation (inside of
> + * exp->request_buf, which is assumed to hold a fuse_in_header first).
> + * Verifies that the object is complete (exp->request_buf is large enough to
> + * hold it in one piece, and the request length includes the whole object).
> + *
> + * Note that exp->request_buf may be overwritten after polling, so the returned
> + * pointer must not be used across a function that may poll!
> + */
> +#define FUSE_IN_OP_STRUCT(op_name, export) \
> + ({ \
> + const struct fuse_in_header *__in_hdr = \
> + (const struct fuse_in_header *)(export)->request_buf; \
> + const struct fuse_##op_name##_in *__in = \
> + (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
> + const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
> + uint32_t __req_len; \
> + \
> + QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
> + \
> + __req_len = __in_hdr->len; \
> + if (__req_len < __param_len) { \
> + warn_report("FUSE request truncated (%" PRIu32 " < %zu)", \
> + __req_len, __param_len); \
> + ret = -EINVAL; \
> + break; \
> + } \
> + __in; \
> + })
> +
> +/*
> + * For use in fuse_process_request():
> + * Returns a pointer to the return object for the given operation (inside of
> + * out_buf, which is assumed to hold a fuse_out_header first).
> + * Verifies that out_buf is large enough to hold the whole object.
> + *
> + * (out_buf should be a char[] array.)
> + */
> +#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
> + ({ \
> + struct fuse_out_header *__out_hdr = \
> + (struct fuse_out_header *)(out_buf); \
> + struct fuse_##op_name##_out *__out = \
> + (struct fuse_##op_name##_out *)(__out_hdr + 1); \
> + \
> + QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
> + sizeof(out_buf)); \
> + \
> + __out; \
> + })
> +
> +/**
> + * Process a FUSE request, incl. writing the response.
> + *
> + * Note that polling in any request-processing function can lead to a nested
> + * read_from_fuse_fd() call, which will overwrite the contents of
> + * exp->request_buf. Anything that takes a buffer needs to take care that the
> + * content is copied before potentially polling.
> + */
> +static void fuse_process_request(FuseExport *exp)
> +{
> + uint32_t opcode;
> + uint64_t req_id;
> + /*
> + * Return buffer. Must be large enough to hold all return headers, but does
> + * not include space for data returned by read requests.
> + * (FUSE_IN_OP_STRUCT() verifies at compile time that out_buf is indeed
> + * large enough.)
> + */
> + char out_buf[sizeof(struct fuse_out_header) +
> + MAX_CONST(sizeof(struct fuse_init_out),
> + MAX_CONST(sizeof(struct fuse_open_out),
> + MAX_CONST(sizeof(struct fuse_attr_out),
> + MAX_CONST(sizeof(struct fuse_write_out),
> + sizeof(struct fuse_lseek_out)))))];
> + struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf;
> + /* For read requests: Data to be returned */
> + void *out_data_buffer = NULL;
> + ssize_t ret;
> +
> + /* Limit scope to ensure pointer is no longer used after polling */
> + {
> + const struct fuse_in_header *in_hdr =
> + (const struct fuse_in_header *)exp->request_buf;
> +
> + opcode = in_hdr->opcode;
> + req_id = in_hdr->unique;
> + }
> +
> + switch (opcode) {
> + case FUSE_INIT: {
> + const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
> + ret = fuse_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
> + in->max_readahead, in->flags);
> + break;
> + }
> +
> + case FUSE_OPEN:
> + ret = fuse_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
> + break;
> +
> + case FUSE_RELEASE:
> + ret = 0;
> + break;
> +
> + case FUSE_LOOKUP:
> + ret = -ENOENT; /* There is no node but the root node */
> + break;
> +
> + case FUSE_GETATTR:
> + ret = fuse_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
> + break;
> +
> + case FUSE_SETATTR: {
> + const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
> + ret = fuse_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
> + in->valid, in->size, in->mode, in->uid, in->gid);
> + break;
> + }
> +
> + case FUSE_READ: {
> + const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
> + ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
> + break;
> + }
> +
> + case FUSE_WRITE: {
> + const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
> + uint32_t req_len;
> +
> + req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
> + if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
> + in->size)) {
> + warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
> + req_len - sizeof(struct fuse_in_header) - sizeof(*in),
> + in->size);
> + ret = -EINVAL;
> + break;
> + }
> +
> + /*
> + * poll_fuse_fd() has checked that in_hdr->len matches the number of
> + * bytes read, which cannot exceed the max_write value we set
> + * (FUSE_MAX_WRITE_BYTES). So we know that FUSE_MAX_WRITE_BYTES >=
> + * in_hdr->len >= in->size + X, so this assertion must hold.
> + */
> + assert(in->size <= FUSE_MAX_WRITE_BYTES);
> +
> + /*
> + * Passing a pointer to `in` (i.e. the request buffer) is fine because
> + * fuse_write() takes care to copy its contents before potentially
> + * polling.
> + */
> + ret = fuse_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
> + in->offset, in->size, in + 1);
> + break;
> + }
> +
> + case FUSE_FALLOCATE: {
> + const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
> + ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
> + break;
> + }
> +
> + case FUSE_FSYNC:
> + ret = fuse_fsync(exp);
> + break;
> +
> + case FUSE_FLUSH:
> + ret = fuse_flush(exp);
> + break;
> +
> #ifdef CONFIG_FUSE_LSEEK
> - .lseek = fuse_lseek,
> + case FUSE_LSEEK: {
> + const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
> + ret = fuse_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
> + in->offset, in->whence);
> + break;
> + }
> #endif
> -};
> +
> + default:
> + ret = -ENOSYS;
> + }
> +
> + /* Ignore errors from fuse_write*(), nothing we can do anyway */
> + if (out_data_buffer) {
> + assert(ret >= 0);
> + fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
> + out_data_buffer, ret);
> + qemu_vfree(out_data_buffer);
> + } else {
> + fuse_write_response(exp->fuse_fd, req_id, out_hdr,
> + ret < 0 ? ret : 0,
> + ret < 0 ? 0 : ret);
> + }
> +}
>
> const BlockExportDriver blk_exp_fuse = {
> .type = BLOCK_EXPORT_TYPE_FUSE,
> --
> 2.48.1
>
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 12/15] fuse: Reduce max read size
2025-03-25 16:06 ` [PATCH 12/15] fuse: Reduce max read size Hanna Czenczek
@ 2025-03-27 15:35 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:35 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:52PM +0100, Hanna Czenczek wrote:
> We are going to introduce parallel processing via coroutines, a maximum
> read size of 64 MB may be problematic, allowing users of the export to
> force us to allocate quite large amounts of memory with just a few
> requests.
>
> At least tone it down to 1 MB, which is still probably far more than
> enough. (Larger requests are split automatically by the FUSE kernel
> driver anyway.)
>
> (Yes, we inadvertently already had parallel request processing due to
> nested polling before. Better to fix this late than never.)
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 13/15] fuse: Process requests in coroutines
2025-03-25 16:06 ` [PATCH 13/15] fuse: Process requests in coroutines Hanna Czenczek
@ 2025-03-27 15:38 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:38 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:53PM +0100, Hanna Czenczek wrote:
> Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
> and have read_from_fuse_fd() launch it inside of a newly created
> coroutine instead of running it synchronously. This way, we can process
> requests in parallel.
>
> These are the benchmark results, compared to (a) the original results
> with libfuse, and (b) the results after switching away from libfuse
> (i.e. before this patch):
>
> file: (vs. libfuse / vs. no libfuse)
> read:
> seq aio: 120.6k ±1.1k (+ 53 % / + 58 %)
> rand aio: 113.3k ±5.9k (+188 % / +325 %)
> seq sync: 52.4k ±0.4k (+ 61 % / + 10 %)
> rand sync: 10.4k ±0.4k (+ 6 % / + 3 %)
> write:
> seq aio: 79.8k ±0.8k (+ 29 % / + 37 %)
> rand aio: 79.0k ±0.6k (+ 29 % / + 36 %)
> seq sync: 41.5k ±0.3k (+ 49 % / + 15 %)
> rand sync: 41.4k ±0.2k (+ 50 % / + 15 %)
> null:
> read:
> seq aio: 266.1k ±1.5k (+ 24 % / - 1 %)
> rand aio: 264.1k ±2.5k (+ 24 % / ± 0 %)
> seq sync: 135.6k ±3.2k (+ 50 % / + 1 %)
> rand sync: 134.7k ±3.0k (+ 50 % / + 2 %)
> write:
> seq aio: 281.0k ±1.8k (+ 38 % / + 2 %)
> rand aio: 288.1k ±6.1k (+ 43 % / + 6 %)
> seq sync: 142.2k ±3.1k (+ 65 % / + 9 %)
> rand sync: 141.1k ±2.9k (+ 66 % / + 11 %)
>
> So for non-AIO cases (and the null driver, which does not yield), there
> is little change; but for file AIO, results greatly improve, resolving
> the performance issue we saw before (when switching away from libfuse).
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 209 ++++++++++++++++++++++++++------------------
> 1 file changed, 126 insertions(+), 83 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-25 16:06 ` [PATCH 14/15] fuse: Implement multi-threading Hanna Czenczek
2025-03-26 5:38 ` Markus Armbruster
@ 2025-03-27 15:55 ` Stefan Hajnoczi
2025-04-01 20:36 ` Eric Blake
2025-04-04 12:49 ` Hanna Czenczek
2025-04-01 14:58 ` Eric Blake
2 siblings, 2 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:55 UTC (permalink / raw)
To: Hanna Czenczek, Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote:
> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>
> We can use this to implement multi-threading.
>
> Note that the interface presented here differs from the multi-queue
> interface of virtio-blk: The latter maps virtqueues to iothreads, which
> allows processing multiple virtqueues in a single iothread. The
> equivalent (processing multiple FDs in a single iothread) would not make
> sense for FUSE because those FDs are used in a round-robin fashion by
> the FUSE kernel driver. Putting two of them into a single iothread will
> just create a bottleneck.
This text might be outdated. virtio-blk's new iothread-vq-mapping
parameter provides the "array of iothreads" mentioned below and a way to
assign virtqueues to those IOThreads.
>
> Therefore, all we need is an array of iothreads, and we will create one
> "queue" (FD) per thread.
>
> These are the benchmark results when using four threads (compared to a
> single thread); note that fio still only uses a single job, but
> performance can still be improved because of said round-robin usage for
> the queues. (Not in the sync case, though, in which case I guess it
> just adds overhead.)
Interesting. FUSE-over-io_uring seems to be different from
FUSE_DEV_IOC_CLONE here. It doesn't do round-robin. It uses CPU affinity
instead, handing requests to the io_uring context associated with the
current CPU when possible.
>
> file:
> read:
> seq aio: 264.8k ±0.8k (+120 %)
> rand aio: 143.8k ±0.4k (+ 27 %)
> seq sync: 49.9k ±0.5k (- 5 %)
> rand sync: 10.3k ±0.1k (- 1 %)
> write:
> seq aio: 226.6k ±2.1k (+184 %)
> rand aio: 225.9k ±1.8k (+186 %)
> seq sync: 36.9k ±0.6k (- 11 %)
> rand sync: 36.9k ±0.2k (- 11 %)
> null:
> read:
> seq aio: 315.2k ±11.0k (+18 %)
> rand aio: 300.5k ±10.8k (+14 %)
> seq sync: 114.2k ± 3.6k (-16 %)
> rand sync: 112.5k ± 2.8k (-16 %)
> write:
> seq aio: 222.6k ±6.8k (-21 %)
> rand aio: 220.5k ±6.8k (-23 %)
> seq sync: 117.2k ±3.7k (-18 %)
> rand sync: 116.3k ±4.4k (-18 %)
>
> (I don't know what's going on in the null-write AIO case, sorry.)
>
> Here's results for numjobs=4:
>
> "Before", i.e. without multithreading in QSD/FUSE (results compared to
> numjobs=1):
>
> file:
> read:
> seq aio: 104.7k ± 0.4k (- 13 %)
> rand aio: 111.5k ± 0.4k (- 2 %)
> seq sync: 71.0k ±13.8k (+ 36 %)
> rand sync: 41.4k ± 0.1k (+297 %)
> write:
> seq aio: 79.4k ±0.1k (- 1 %)
> rand aio: 78.6k ±0.1k (± 0 %)
> seq sync: 83.3k ±0.1k (+101 %)
> rand sync: 82.0k ±0.2k (+ 98 %)
> null:
> read:
> seq aio: 260.5k ±1.5k (- 2 %)
> rand aio: 260.1k ±1.4k (- 2 %)
> seq sync: 291.8k ±1.3k (+115 %)
> rand sync: 280.1k ±1.7k (+115 %)
> write:
> seq aio: 280.1k ±1.7k (± 0 %)
> rand aio: 279.5k ±1.4k (- 3 %)
> seq sync: 306.7k ±2.2k (+116 %)
> rand sync: 305.9k ±1.8k (+117 %)
>
> (As probably expected, little difference in the AIO case, but great
> improvements in the sync case because it kind of gives it an artificial
> iodepth of 4.)
>
> "After", i.e. with four threads in QSD/FUSE (now results compared to the
> above):
>
> file:
> read:
> seq aio: 193.3k ± 1.8k (+ 85 %)
> rand aio: 329.3k ± 0.3k (+195 %)
> seq sync: 66.2k ±13.0k (- 7 %)
> rand sync: 40.1k ± 0.0k (- 3 %)
> write:
> seq aio: 219.7k ±0.8k (+177 %)
> rand aio: 217.2k ±1.5k (+176 %)
> seq sync: 92.5k ±0.2k (+ 11 %)
> rand sync: 91.9k ±0.2k (+ 12 %)
> null:
> read:
> seq aio: 706.7k ±2.1k (+171 %)
> rand aio: 714.7k ±3.2k (+175 %)
> seq sync: 431.7k ±3.0k (+ 48 %)
> rand sync: 435.4k ±2.8k (+ 50 %)
> write:
> seq aio: 746.9k ±2.8k (+167 %)
> rand aio: 749.0k ±4.9k (+168 %)
> seq sync: 420.7k ±3.1k (+ 37 %)
> rand sync: 419.1k ±2.5k (+ 37 %)
>
> So this helps mainly for the AIO cases, but also in the null sync cases,
> because null is always CPU-bound, so more threads help.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> qapi/block-export.json | 8 +-
> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
> 2 files changed, 179 insertions(+), 43 deletions(-)
>
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index c783e01a53..0bdd5992eb 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -179,12 +179,18 @@
> # mount the export with allow_other, and if that fails, try again
> # without. (since 6.1; default: auto)
> #
> +# @iothreads: Enables multi-threading: Handle requests in each of the
> +# given iothreads (instead of the block device's iothread, or the
> +# export's "main" iothread). For this, the FUSE FD is duplicated so
> +# there is one FD per iothread. (since 10.1)
This option isn't FUSE-specific but FUSE is the first export type to
support it. Please add it to BlockExportOptions instead and refuse
export creation when the export type only supports 1 IOThread.
Eric: Are you interested in implementing support for multiple IOThreads
in the NBD export? I remember some time ago we talked about NBD
multi-conn support, although maybe that was for the client rather than
the server.
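Roughly what I have in mind, as a sketch (the field name and helper below
are invented for illustration, not the actual QAPI or driver interface):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Each export driver would advertise whether it can use >1 iothread */
  typedef struct ExportDriverSketch {
      const char *type_name;
      bool supports_multiple_iothreads;
  } ExportDriverSketch;

  /* Generic export creation rejects the option for drivers that cannot */
  static int check_iothreads(const ExportDriverSketch *drv, size_t n_iothreads)
  {
      if (n_iothreads > 1 && !drv->supports_multiple_iothreads) {
          /* in QEMU proper this would be error_setg(errp, ...) plus -EINVAL */
          fprintf(stderr, "export type '%s' supports only one iothread\n",
                  drv->type_name);
          return -1;
      }
      return 0;
  }

That way the option can live in BlockExportOptions, and export types that
only support a single iothread simply leave the flag unset until they gain
support.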
> +#
> # Since: 6.0
> ##
> { 'struct': 'BlockExportOptionsFuse',
> 'data': { 'mountpoint': 'str',
> '*growable': 'bool',
> - '*allow-other': 'FuseExportAllowOther' },
> + '*allow-other': 'FuseExportAllowOther',
> + '*iothreads': ['str'] },
> 'if': 'CONFIG_FUSE' }
>
> ##
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 345e833171..0edd994392 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -31,11 +31,14 @@
> #include "qemu/error-report.h"
> #include "qemu/main-loop.h"
> #include "system/block-backend.h"
> +#include "system/block-backend.h"
> +#include "system/iothread.h"
>
> #include <fuse.h>
> #include <fuse_lowlevel.h>
>
> #include "standard-headers/linux/fuse.h"
> +#include <sys/ioctl.h>
>
> #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
> #include <linux/falloc.h>
> @@ -50,12 +53,17 @@
> /* Small enough to fit in the request buffer */
> #define FUSE_MAX_WRITE_BYTES (4 * 1024)
>
> -typedef struct FuseExport {
> - BlockExport common;
> +typedef struct FuseExport FuseExport;
>
> - struct fuse_session *fuse_session;
> - unsigned int in_flight; /* atomic */
> - bool mounted, fd_handler_set_up;
> +/*
> + * One FUSE "queue", representing one FUSE FD from which requests are fetched
> + * and processed. Each queue is tied to an AioContext.
> + */
> +typedef struct FuseQueue {
> + FuseExport *exp;
> +
> + AioContext *ctx;
> + int fuse_fd;
>
> /*
> * The request buffer must be able to hold a full write, and/or at least
> @@ -66,6 +74,14 @@ typedef struct FuseExport {
> FUSE_MAX_WRITE_BYTES,
> FUSE_MIN_READ_BUFFER
> )];
> +} FuseQueue;
> +
> +struct FuseExport {
> + BlockExport common;
> +
> + struct fuse_session *fuse_session;
> + unsigned int in_flight; /* atomic */
> + bool mounted, fd_handler_set_up;
>
> /*
> * Set when there was an unrecoverable error and no requests should be read
> @@ -74,7 +90,15 @@ typedef struct FuseExport {
> */
> bool halted;
>
> - int fuse_fd;
> + int num_queues;
> + FuseQueue *queues;
> + /*
> + * True if this export should follow the generic export's AioContext.
> + * Will be false if the queues' AioContexts have been explicitly set by the
> + * user, i.e. are expected to stay in those contexts.
> + * (I.e. is always false if there is more than one queue.)
> + */
> + bool follow_aio_context;
>
> char *mountpoint;
> bool writable;
> @@ -85,11 +109,11 @@ typedef struct FuseExport {
> mode_t st_mode;
> uid_t st_uid;
> gid_t st_gid;
> -} FuseExport;
> +};
>
> /* Parameters to the request processing coroutine */
> typedef struct FuseRequestCoParam {
> - FuseExport *exp;
> + FuseQueue *q;
> int got_request;
> } FuseRequestCoParam;
>
> @@ -102,12 +126,13 @@ static void fuse_export_halt(FuseExport *exp);
> static void init_exports_table(void);
>
> static int mount_fuse_export(FuseExport *exp, Error **errp);
> +static int clone_fuse_fd(int fd, Error **errp);
>
> static bool is_regular_file(const char *path, Error **errp);
>
> static bool poll_fuse_fd(void *opaque);
> static void read_fuse_fd(void *opaque);
> -static void coroutine_fn fuse_co_process_request(FuseExport *exp);
> +static void coroutine_fn fuse_co_process_request(FuseQueue *q);
>
> static void fuse_inc_in_flight(FuseExport *exp)
> {
> @@ -137,9 +162,11 @@ static void fuse_attach_handlers(FuseExport *exp)
> return;
> }
>
> - aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
> - read_fuse_fd, NULL, poll_fuse_fd,
> - read_fuse_fd, exp);
> + for (int i = 0; i < exp->num_queues; i++) {
> + aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
> + read_fuse_fd, NULL, poll_fuse_fd,
> + read_fuse_fd, &exp->queues[i]);
> + }
> exp->fd_handler_set_up = true;
> }
>
> @@ -148,8 +175,10 @@ static void fuse_attach_handlers(FuseExport *exp)
> */
> static void fuse_detach_handlers(FuseExport *exp)
> {
> - aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
> - NULL, NULL, NULL, NULL, NULL);
> + for (int i = 0; i < exp->num_queues; i++) {
> + aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
> + NULL, NULL, NULL, NULL, NULL);
> + }
> exp->fd_handler_set_up = false;
> }
>
> @@ -164,6 +193,11 @@ static void fuse_export_drained_end(void *opaque)
>
> /* Refresh AioContext in case it changed */
> exp->common.ctx = blk_get_aio_context(exp->common.blk);
> + if (exp->follow_aio_context) {
> + assert(exp->num_queues == 1);
> + exp->queues[0].ctx = exp->common.ctx;
> + }
> +
> fuse_attach_handlers(exp);
> }
>
> @@ -187,10 +221,52 @@ static int fuse_export_create(BlockExport *blk_exp,
> ERRP_GUARD(); /* ensure clean-up even with error_fatal */
> FuseExport *exp = container_of(blk_exp, FuseExport, common);
> BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
> + FuseQueue *q;
> int ret;
>
> assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>
> + if (args->iothreads) {
> + strList *e;
> +
> + exp->follow_aio_context = false;
> + exp->num_queues = 0;
> + for (e = args->iothreads; e; e = e->next) {
> + exp->num_queues++;
> + }
> + if (exp->num_queues < 1) {
> + error_setg(errp, "Need at least one queue");
> + ret = -EINVAL;
> + goto fail;
> + }
> + exp->queues = g_new0(FuseQueue, exp->num_queues);
> + q = exp->queues;
> + for (e = args->iothreads; e; e = e->next) {
> + IOThread *iothread = iothread_by_id(e->value);
> +
> + if (!iothread) {
> + error_setg(errp, "IOThread \"%s\" does not exist", e->value);
> + ret = -EINVAL;
> + goto fail;
> + }
> +
> + *(q++) = (FuseQueue) {
> + .exp = exp,
> + .ctx = iothread_get_aio_context(iothread),
> + .fuse_fd = -1,
> + };
> + }
> + } else {
> + exp->follow_aio_context = true;
> + exp->num_queues = 1;
> + exp->queues = g_new(FuseQueue, exp->num_queues);
> + exp->queues[0] = (FuseQueue) {
> + .exp = exp,
> + .ctx = exp->common.ctx,
> + .fuse_fd = -1,
> + };
> + }
> +
> /* For growable and writable exports, take the RESIZE permission */
> if (args->growable || blk_exp_args->writable) {
> uint64_t blk_perm, blk_shared_perm;
> @@ -275,14 +351,24 @@ static int fuse_export_create(BlockExport *blk_exp,
>
> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>
> - exp->fuse_fd = fuse_session_fd(exp->fuse_session);
> - ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
> + assert(exp->num_queues >= 1);
> + exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
> + ret = fcntl(exp->queues[0].fuse_fd, F_SETFL, O_NONBLOCK);
> if (ret < 0) {
> ret = -errno;
> error_setg_errno(errp, errno, "Failed to make FUSE FD non-blocking");
> goto fail;
> }
>
> + for (int i = 1; i < exp->num_queues; i++) {
> + int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
> + if (fd < 0) {
> + ret = fd;
> + goto fail;
> + }
> + exp->queues[i].fuse_fd = fd;
> + }
> +
> fuse_attach_handlers(exp);
> return 0;
>
> @@ -355,6 +441,39 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> return 0;
> }
>
> +/**
> + * Clone the given /dev/fuse file descriptor, yielding a second FD from which
> + * requests can be pulled for the associated filesystem. Returns an FD on
> + * success, and -errno on error.
> + */
> +static int clone_fuse_fd(int fd, Error **errp)
> +{
> + uint32_t src_fd = fd;
> + int new_fd;
> + int ret;
> +
> + /*
> + * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
> + * (fuse_clone_chan()).
> + */
> + new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
> + if (new_fd < 0) {
> + ret = -errno;
> + error_setg_errno(errp, errno, "Failed to open /dev/fuse");
> + return ret;
> + }
> +
> + ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
> + if (ret < 0) {
> + ret = -errno;
> + error_setg_errno(errp, errno, "Failed to clone FUSE FD");
> + close(new_fd);
> + return ret;
> + }
> +
> + return new_fd;
> +}
> +
> /**
> * Try to read a single request from the FUSE FD.
> * Takes a FuseRequestCoParam object pointer in `opaque`.
> @@ -370,8 +489,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> {
> FuseRequestCoParam *co_param = opaque;
> - FuseExport *exp = co_param->exp;
> - int fuse_fd = exp->fuse_fd;
> + FuseQueue *q = co_param->q;
> + int fuse_fd = q->fuse_fd;
> + FuseExport *exp = q->exp;
> ssize_t ret;
> const struct fuse_in_header *in_hdr;
>
> @@ -381,8 +501,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
> - sizeof(exp->request_buf)));
> + ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> if (ret < 0 && errno == EAGAIN) {
> /* No request available */
> goto no_request;
> @@ -400,7 +519,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - in_hdr = (const struct fuse_in_header *)exp->request_buf;
> + in_hdr = (const struct fuse_in_header *)q->request_buf;
> if (unlikely(ret != in_hdr->len)) {
> error_report("Number of bytes read from FUSE device does not match "
> "request size, expected %" PRIu32 " bytes, read %zi "
> @@ -413,7 +532,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>
> /* Must set this before yielding */
> co_param->got_request = 1;
> - fuse_co_process_request(exp);
> + fuse_co_process_request(q);
> fuse_dec_in_flight(exp);
> return;
>
> @@ -432,7 +551,7 @@ static bool poll_fuse_fd(void *opaque)
> {
> Coroutine *co;
> FuseRequestCoParam co_param = {
> - .exp = opaque,
> + .q = opaque,
> .got_request = -EINPROGRESS,
> };
>
> @@ -451,7 +570,7 @@ static void read_fuse_fd(void *opaque)
> {
> Coroutine *co;
> FuseRequestCoParam co_param = {
> - .exp = opaque,
> + .q = opaque,
> .got_request = -EINPROGRESS,
> };
>
> @@ -481,6 +600,16 @@ static void fuse_export_delete(BlockExport *blk_exp)
> {
> FuseExport *exp = container_of(blk_exp, FuseExport, common);
>
> + for (int i = 0; i < exp->num_queues; i++) {
> + FuseQueue *q = &exp->queues[i];
> +
> + /* Queue 0's FD belongs to the FUSE session */
> + if (i > 0 && q->fuse_fd >= 0) {
> + close(q->fuse_fd);
> + }
> + }
> + g_free(exp->queues);
> +
> if (exp->fuse_session) {
> if (exp->mounted) {
> fuse_session_unmount(exp->fuse_session);
> @@ -1137,23 +1266,23 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> /*
> * For use in fuse_co_process_request():
> * Returns a pointer to the parameter object for the given operation (inside of
> - * exp->request_buf, which is assumed to hold a fuse_in_header first).
> - * Verifies that the object is complete (exp->request_buf is large enough to
> + * q->request_buf, which is assumed to hold a fuse_in_header first).
> + * Verifies that the object is complete (q->request_buf is large enough to
> * hold it in one piece, and the request length includes the whole object).
> *
> - * Note that exp->request_buf may be overwritten after yielding, so the returned
> + * Note that q->request_buf may be overwritten after yielding, so the returned
> * pointer must not be used across a function that may yield!
> */
> -#define FUSE_IN_OP_STRUCT(op_name, export) \
> +#define FUSE_IN_OP_STRUCT(op_name, queue) \
> ({ \
> const struct fuse_in_header *__in_hdr = \
> - (const struct fuse_in_header *)(export)->request_buf; \
> + (const struct fuse_in_header *)(q)->request_buf; \
> const struct fuse_##op_name##_in *__in = \
> (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
> const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
> uint32_t __req_len; \
> \
> - QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
> + QEMU_BUILD_BUG_ON(sizeof((q)->request_buf) < __param_len); \
> \
> __req_len = __in_hdr->len; \
> if (__req_len < __param_len) { \
> @@ -1190,11 +1319,12 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> * Process a FUSE request, incl. writing the response.
> *
> * Note that yielding in any request-processing function can overwrite the
> - * contents of exp->request_buf. Anything that takes a buffer needs to take
> + * contents of q->request_buf. Anything that takes a buffer needs to take
> * care that the content is copied before yielding.
> */
> -static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> +static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> {
> + FuseExport *exp = q->exp;
> uint32_t opcode;
> uint64_t req_id;
> /*
> @@ -1217,7 +1347,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> /* Limit scope to ensure pointer is no longer used after yielding */
> {
> const struct fuse_in_header *in_hdr =
> - (const struct fuse_in_header *)exp->request_buf;
> + (const struct fuse_in_header *)q->request_buf;
>
> opcode = in_hdr->opcode;
> req_id = in_hdr->unique;
> @@ -1225,7 +1355,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
>
> switch (opcode) {
> case FUSE_INIT: {
> - const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
> + const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
> ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
> in->max_readahead, in->flags);
> break;
> @@ -1248,23 +1378,23 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> break;
>
> case FUSE_SETATTR: {
> - const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
> + const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q);
> ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
> in->valid, in->size, in->mode, in->uid, in->gid);
> break;
> }
>
> case FUSE_READ: {
> - const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
> + const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q);
> ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
> break;
> }
>
> case FUSE_WRITE: {
> - const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
> + const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q);
> uint32_t req_len;
>
> - req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
> + req_len = ((const struct fuse_in_header *)q->request_buf)->len;
> if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
> in->size)) {
> warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
> @@ -1293,7 +1423,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> }
>
> case FUSE_FALLOCATE: {
> - const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
> + const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q);
> ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
> break;
> }
> @@ -1308,7 +1438,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
>
> #ifdef CONFIG_FUSE_LSEEK
> case FUSE_LSEEK: {
> - const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
> + const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, q);
> ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
> in->offset, in->whence);
> break;
> @@ -1322,11 +1452,11 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> /* Ignore errors from fuse_write*(), nothing we can do anyway */
> if (out_data_buffer) {
> assert(ret >= 0);
> - fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
> + fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
> out_data_buffer, ret);
> qemu_vfree(out_data_buffer);
> } else {
> - fuse_write_response(exp->fuse_fd, req_id, out_hdr,
> + fuse_write_response(q->fuse_fd, req_id, out_hdr,
> ret < 0 ? ret : 0,
> ret < 0 ? 0 : ret);
> }
> --
> 2.48.1
>
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-03-25 16:06 ` [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
@ 2025-03-27 15:59 ` Stefan Hajnoczi
2025-04-01 20:24 ` Eric Blake
1 sibling, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-03-27 15:59 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:55PM +0100, Hanna Czenczek wrote:
> We probably want to support larger write sizes than just 4k; 64k seems
> nice. However, we cannot read partial requests from the FUSE FD, we
> always have to read requests in full; so our read buffer must be large
> enough to accommodate potential 64k writes if we want to support that.
>
> Always allocating FuseRequest objects with 64k buffers in them seems
> wasteful, though. But we can get around the issue by splitting the
> buffer into two and using readv(): One part will hold all normal (up to
> 4k) write requests and all other requests, and a second part (the
> "spill-over buffer") will be used only for larger write requests. Each
> FuseQueue has its own spill-over buffer, and only if we find it used
> when reading a request will we move its ownership into the FuseRequest
> object and allocate a new spill-over buffer for the queue.
>
> This way, we get to support "large" write sizes without having to
> allocate big buffers when they aren't used.
>
> Also, this even reduces the size of the FuseRequest objects because the
> read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> the requests we support are not quite so large (except for >4k writes),
> so until now, we basically had to have useless padding in there.
>
> With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> is easily met and we can decrease the size of the buffer portion that is
> right inside of FuseRequest.
>
> As for benchmarks, the benefit of this patch can be shown easily by
> writing a 4G image (with qemu-img convert) to a FUSE export:
> - Before this patch: Takes 25.6 s (14.4 s with -t none)
> - After this patch: Takes 4.5 s (5.5 s with -t none)
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 118 insertions(+), 19 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-27 13:45 ` Hanna Czenczek
@ 2025-04-01 12:05 ` Kevin Wolf
2025-04-01 20:31 ` Eric Blake
2025-04-04 12:45 ` Hanna Czenczek
0 siblings, 2 replies; 59+ messages in thread
From: Kevin Wolf @ 2025-04-01 12:05 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: Markus Armbruster, qemu-block, qemu-devel
Am 27.03.2025 um 14:45 hat Hanna Czenczek geschrieben:
> On 27.03.25 13:18, Markus Armbruster wrote:
> > Hanna Czenczek <hreitz@redhat.com> writes:
> >
> > > On 26.03.25 12:41, Markus Armbruster wrote:
> > > > Hanna Czenczek <hreitz@redhat.com> writes:
> > > >
> > > > > On 26.03.25 06:38, Markus Armbruster wrote:
> > > > > > Hanna Czenczek <hreitz@redhat.com> writes:
> > > > > >
> > > > > > > FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> > > > > > > (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
> > > > > > >
> > > > > > > We can use this to implement multi-threading.
> > > > > > >
> > > > > > > Note that the interface presented here differs from the multi-queue
> > > > > > > interface of virtio-blk: The latter maps virtqueues to iothreads, which
> > > > > > > allows processing multiple virtqueues in a single iothread. The
> > > > > > > equivalent (processing multiple FDs in a single iothread) would not make
> > > > > > > sense for FUSE because those FDs are used in a round-robin fashion by
> > > > > > > the FUSE kernel driver. Putting two of them into a single iothread will
> > > > > > > just create a bottleneck.
> > > > > > >
> > > > > > > Therefore, all we need is an array of iothreads, and we will create one
> > > > > > > "queue" (FD) per thread.
> > > > > > [...]
> > > > > >
> > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > ---
> > > > > > > qapi/block-export.json | 8 +-
> > > > > > > block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
> > > > > > > 2 files changed, 179 insertions(+), 43 deletions(-)
> > > > > > >
> > > > > > > diff --git a/qapi/block-export.json b/qapi/block-export.json
> > > > > > > index c783e01a53..0bdd5992eb 100644
> > > > > > > --- a/qapi/block-export.json
> > > > > > > +++ b/qapi/block-export.json
> > > > > > > @@ -179,12 +179,18 @@
> > > > > > > # mount the export with allow_other, and if that fails, try again
> > > > > > > # without. (since 6.1; default: auto)
> > > > > > > #
> > > > > > > +# @iothreads: Enables multi-threading: Handle requests in each of the
> > > > > > > +# given iothreads (instead of the block device's iothread, or the
> > > > > > > +# export's "main" iothread).
> > > > > > When does "the block device's iothread" apply, and when "the export's
> > > > > > main iothread"?
> > > > > Depends on where you set the iothread option.
> > > > Assuming QMP users need to know (see right below), can we trust they
> > > > understand which one applies when? If not, can we provide clues?
> > > I don’t understand what exactly you mean, but which one applies when has nothing to do with this option, but with the @iothread (and @fixed-iothread) option(s) on BlockExportOptions, which do document this.
> > Can you point me to the spot?
>
> Sure: https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#object-QMP-block-export.BlockExportOptions
> (iothread and fixed-iothread)
>
> >
> > > > > > Is this something the QMP user needs to know?
> > > > > I think so, because e.g. if you set iothread on the device and the export, you’ll get a conflict. But if you set it there and set this option, you won’t. This option will just override the device/export option.
> > > > Do we think the doc comment sufficient for QMP users to figure this out?
> > > As for conflict, BlockExportOptions.iothread and BlockExportOptions.fixed-iothread do.
> > >
> > > As for overriding, I do think so. Do you not? I’m always open to suggestions.
> > >
> > > > If not, can we provide clues?
> > > >
> > > > In particular, do we think they can go from an export failure to the
> > > > setting @iothreads here? Perhaps the error message will guide them.
> > > > What is the message?
> > > I don’t understand what failure you mean.
> > You wrote "you'll get a conflict". I assume this manifests as failure
> > of a QMP command (let's ignore CLI to keep things simple here).
>
> If you set the @iothread option on both the (guest) device and the export
> (and also @fixed-iothread on the export), then you’ll get an error. Nothing
> to do with this new @iothreads option here.
>
> > Do we think ordinary users running into that failure can figure out they
> > can avoid it by setting @iothreads?
>
> It shouldn’t affect the failure. Setting @iothread on both device and
> export (with @fixed-iothread) will always cause an error, and should.
> Setting this option is not supposed to “fix” that configuration error.
>
> Theoretically, setting @iothreads here could make it so that
> BlockExportOptions.iothread (and/or fixed-iothread) is ignored, because that
> thread will no longer be used for export-issued I/O; but in practice,
> setting that option (BlockExportOptions.iothread) moves that export and the
> whole BDS tree behind it to that I/O thread, so if you haven’t specified an
> I/O thread on the guest device, the guest device will then use that thread.
> So making @iothreads silently completely ignore BlockExportOptions.iothread
> may cause surprising behavior.
>
> Maybe we could make setting @iothreads here and the generic
> BlockExportOptions.iothread at the same time an error. That would save us
> the explanation here.
This raises the question of whether the better interface wouldn't be to change
the BlockExportOptions.iothread from 'str' to an alternate between 'str'
and ['str'], allowing the user to specify multiple iothreads in the
already existing related option without creating conflicting options.
In the long run, I would be surprised if FUSE remained the only export
supporting multiple iothreads. At least the virtio based ones
(vhost-user-blk and VDUSE) even have a precedent in the virtio-blk device
itself, so while I don't know how much interest there is in actually
implementing it, in theory we know it makes sense.
Kevin
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 01/15] fuse: Copy write buffer content before polling
2025-03-25 16:06 ` [PATCH 01/15] fuse: Copy write buffer content before polling Hanna Czenczek
2025-03-27 14:47 ` Stefan Hajnoczi
@ 2025-04-01 13:44 ` Eric Blake
2025-04-04 11:18 ` Hanna Czenczek
1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2025-04-01 13:44 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf, qemu-stable
On Tue, Mar 25, 2025 at 05:06:35PM +0100, Hanna Czenczek wrote:
> Polling in I/O functions can lead to nested read_from_fuse_export()
> calls, overwriting the request buffer's content. The only function
> affected by this is fuse_write(), which therefore must use a bounce
> buffer or corruption may occur.
>
> Note that in addition we do not know whether libfuse-internal structures
> can cope with this nesting, and even if we did, we probably cannot rely
> on it in the future. This is the main reason why we want to remove
> libfuse from the I/O path.
>
> @@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> size_t size, off_t offset, struct fuse_file_info *fi)
> {
> FuseExport *exp = fuse_req_userdata(req);
> + void *copied;
Do we have a good way to annotate variables that require qemu_vfree()
if non-NULL for automatic cleanup? If so, should this be annotated
and set to NULL here,...
> int64_t length;
> int ret;
>
> @@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> return;
> }
>
> + /*
> + * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
> + * I/O function may do), read_from_fuse_export() may be nested, overwriting
> + * the request buffer content. Therefore, we must copy it here.
> + */
> + copied = blk_blockalign(exp->common.blk, size);
> + memcpy(copied, buf, size);
> +
> /**
> * Clients will expect short writes at EOF, so we have to limit
> * offset+size to the image length.
> @@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> length = blk_getlength(exp->common.blk);
> if (length < 0) {
> fuse_reply_err(req, -length);
> - return;
> + goto free_buffer;
...so that this and similar hunks are not needed?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers()
2025-03-25 16:06 ` [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
2025-03-27 15:12 ` Stefan Hajnoczi
@ 2025-04-01 13:55 ` Eric Blake
2025-04-04 11:24 ` Hanna Czenczek
1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2025-04-01 13:55 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:48PM +0100, Hanna Czenczek wrote:
> Pull setting up and tearing down the AIO context handlers into two
> dedicated functions.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 32 ++++++++++++++++----------------
> 1 file changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 2df6297d61..bd98809d71 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque);
> static bool is_regular_file(const char *path, Error **errp);
>
>
> -static void fuse_export_drained_begin(void *opaque)
> +static void fuse_attach_handlers(FuseExport *exp)
> {
> - FuseExport *exp = opaque;
> + aio_set_fd_handler(exp->common.ctx,
> + fuse_session_fd(exp->fuse_session),
> + read_from_fuse_export, NULL, NULL, NULL, exp);
> + exp->fd_handler_set_up = true;
I found this name mildly confusing (does "set_up=true" mean that I
still need to set up, or that I am already set up); would it be better
as s/fd_handler_set_up/fd_handler_armed/g ?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 11/15] fuse: Manually process requests (without libfuse)
2025-03-25 16:06 ` [PATCH 11/15] fuse: Manually process requests (without libfuse) Hanna Czenczek
2025-03-27 15:35 ` Stefan Hajnoczi
@ 2025-04-01 14:35 ` Eric Blake
2025-04-04 11:30 ` Hanna Czenczek
2025-04-04 11:42 ` Hanna Czenczek
1 sibling, 2 replies; 59+ messages in thread
From: Eric Blake @ 2025-04-01 14:35 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:51PM +0100, Hanna Czenczek wrote:
> Manually read requests from the /dev/fuse FD and process them, without
> using libfuse. This allows us to safely add parallel request processing
> in coroutines later, without having to worry about libfuse internals.
> (Technically, we already have exactly that problem with
> read_from_fuse_export()/read_from_fuse_fd() nesting.)
>
> We will continue to use libfuse for mounting the filesystem; fusermount3
> is effectively a helper program of libfuse, so it should know best how
> to interact with it. (Doing it manually without libfuse, while doable,
> is a bit of a pain, and it is not clear to me how stable the "protocol"
> actually is.)
>
> @@ -247,6 +268,14 @@ static int fuse_export_create(BlockExport *blk_exp,
>
> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>
> + exp->fuse_fd = fuse_session_fd(exp->fuse_session);
> + ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
fcntl(F_SETFL) should be used in a read-modify-write pattern with
F_GETFL (otherwise, you are nuking any other flags that might have
been important).
See also block/file-posix.c:fcntl_setfl. Maybe we should hoist that
into a common helper in util/osdep.c?
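For illustration, here is a minimal sketch of that read-modify-write pattern
(the helper name is made up here, it is not something the patch or this
thread introduces):

    #include <errno.h>
    #include <fcntl.h>

    /* Hypothetical helper; same idea as block/file-posix.c's fcntl_setfl(). */
    static int set_fd_nonblock(int fd)
    {
        int flags = fcntl(fd, F_GETFL);

        if (flags < 0) {
            return -errno;
        }
        if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
            return -errno;
        }
        return 0;
    }

That way, any status flags the FD already carries are preserved instead of
being overwritten.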
> /**
> - * Handle client reads from the exported image.
> + * Handle client reads from the exported image. Allocates *bufptr and reads
> + * data from the block device into that buffer.
Worth calling out that *bufptr must be freed with qemu_vfree...
> + * Returns the buffer (read) size on success, and -errno on error.
> */
> -static void fuse_read(fuse_req_t req, fuse_ino_t inode,
> - size_t size, off_t offset, struct fuse_file_info *fi)
> +static ssize_t fuse_read(FuseExport *exp, void **bufptr,
> + uint64_t offset, uint32_t size)
...
> {
> buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
> if (!buf) {
> - fuse_reply_err(req, ENOMEM);
> - return;
> + return -ENOMEM;
> }
>
> ret = blk_pread(exp->common.blk, offset, size, buf, 0);
> - if (ret >= 0) {
> - fuse_reply_buf(req, buf, size);
> - } else {
> - fuse_reply_err(req, -ret);
> + if (ret < 0) {
> + qemu_vfree(buf);
> + return ret;
...since internal cleanup recognizes that plain free() is wrong?
> #ifdef CONFIG_FUSE_LSEEK
> /**
> * Let clients inquire allocation status.
> + * Return the number of bytes written to *out on success, and -errno on error.
> */
> -static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
> - int whence, struct fuse_file_info *fi)
> +static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
> + uint64_t offset, uint32_t whence)
> {
> - FuseExport *exp = fuse_req_userdata(req);
> -
> if (whence != SEEK_HOLE && whence != SEEK_DATA) {
> - fuse_reply_err(req, EINVAL);
> - return;
> + return -EINVAL;
Unrelated to this patch, but any reason why we only SEEK_HOLE/DATA
(and not, say, SEEK_SET)? Is it because we aren't really maintaining
the notion of a current offset? I guess that works as long as the
caller is always using pread/pwrite (never bare read/write where
depending on our internal offset would matter).
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-25 16:06 ` [PATCH 14/15] fuse: Implement multi-threading Hanna Czenczek
2025-03-26 5:38 ` Markus Armbruster
2025-03-27 15:55 ` Stefan Hajnoczi
@ 2025-04-01 14:58 ` Eric Blake
2 siblings, 0 replies; 59+ messages in thread
From: Eric Blake @ 2025-04-01 14:58 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote:
> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>
> We can use this to implement multi-threading.
>
> Note that the interface presented here differs from the multi-queue
> interface of virtio-blk: The latter maps virtqueues to iothreads, which
> allows processing multiple virtqueues in a single iothread. The
> equivalent (processing multiple FDs in a single iothread) would not make
> sense for FUSE because those FDs are used in a round-robin fashion by
> the FUSE kernel driver. Putting two of them into a single iothread will
> just create a bottleneck.
>
> Therefore, all we need is an array of iothreads, and we will create one
> "queue" (FD) per thread.
>
> @@ -275,14 +351,24 @@ static int fuse_export_create(BlockExport *blk_exp,
>
> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>
> - exp->fuse_fd = fuse_session_fd(exp->fuse_session);
> - ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
> + assert(exp->num_queues >= 1);
> + exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
> + ret = fcntl(exp->queues[0].fuse_fd, F_SETFL, O_NONBLOCK);
As mentioned before, F_SETFL should be set by read-modify-write.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-03-25 16:06 ` [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
2025-03-27 15:59 ` Stefan Hajnoczi
@ 2025-04-01 20:24 ` Eric Blake
2025-04-04 13:04 ` Hanna Czenczek
1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2025-04-01 20:24 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Kevin Wolf
On Tue, Mar 25, 2025 at 05:06:55PM +0100, Hanna Czenczek wrote:
> We probably want to support larger write sizes than just 4k; 64k seems
> nice. However, we cannot read partial requests from the FUSE FD, we
> always have to read requests in full; so our read buffer must be large
> enough to accommodate potential 64k writes if we want to support that.
>
> Always allocating FuseRequest objects with 64k buffers in them seems
> wasteful, though. But we can get around the issue by splitting the
> buffer into two and using readv(): One part will hold all normal (up to
> 4k) write requests and all other requests, and a second part (the
> "spill-over buffer") will be used only for larger write requests. Each
> FuseQueue has its own spill-over buffer, and only if we find it used
> when reading a request will we move its ownership into the FuseRequest
> object and allocate a new spill-over buffer for the queue.
>
> This way, we get to support "large" write sizes without having to
> allocate big buffers when they aren't used.
>
> Also, this even reduces the size of the FuseRequest objects because the
> read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> the requests we support are not quite so large (except for >4k writes),
> so until now, we basically had to have useless padding in there.
>
> With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> is easily met and we can decrease the size of the buffer portion that is
> right inside of FuseRequest.
>
> As for benchmarks, the benefit of this patch can be shown easily by
> writing a 4G image (with qemu-img convert) to a FUSE export:
> - Before this patch: Takes 25.6 s (14.4 s with -t none)
> - After this patch: Takes 4.5 s (5.5 s with -t none)
Hmm - 64k is still small for some tasks; with qcow2, you can have
clusters up to 2M, and writing a cluster at a time seems to be a
reasonable desire. Or in other storage areas, nbdcopy (from libnbd)
defaults to 256k and allows up to 32M, or we can even look at lvm that
defaults to extent size of 4M (although with lvm thin volumes,
behavior is more like qcow2 with subclusters in that you don't have to
allocate the entire extent at once).
Maybe it makes sense to have this as a hierarchical spillover (first
level spills over to 64k, second level spills over to 2M or whatever
ACTUAL maximum you are willing to support)?
Or you can feel free to ignore my comments on this patch - allowing
64k instead of 4k is ALREADY a nice change.
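For illustration only, a rough fragment of how such a tiered spill-over
readv() could look; the buffer sizes, names, and allocation calls below are
placeholders, not what the patch actually does:

    /* needs <sys/uio.h> for readv()/struct iovec;
     * fuse_fd is the (possibly cloned) /dev/fuse FD from the surrounding code */
    char inline_buf[4096];                                  /* header + small writes */
    void *spill_64k = qemu_memalign(4096, 64 * 1024);       /* first spill-over level */
    void *spill_2m = qemu_memalign(4096, 2 * 1024 * 1024);  /* second, "actual max" level */
    struct iovec iov[] = {
        { .iov_base = inline_buf, .iov_len = sizeof(inline_buf) },
        { .iov_base = spill_64k,  .iov_len = 64 * 1024 },
        { .iov_base = spill_2m,   .iov_len = 2 * 1024 * 1024 },
    };
    ssize_t len = readv(fuse_fd, iov, 3);

Only the levels a given request actually reached would then have their
ownership moved into the request object; the untouched ones stay with the
queue for reuse.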
> @@ -501,7 +547,20 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> + /*
> + * If handling the last request consumed the spill-over buffer, allocate a
> + * new one. Align it to the block device's alignment, which admittedly is
> + * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
What are typical block device minimum alignments? If 4k is typical,
then it should be easy to make FUSE_IN_PLACE_WRITE_BYTES be 4k. If
64k is typical, we're already doomed at being aligned. Is being
unaligned going to cause the block layer to add bounce buffers on us,
if the block layer has a larger alignment than what we use here?
> @@ -915,17 +983,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> }
>
> /**
> - * Handle client writes to the exported image. @buf has the data to be written
> - * and will be copied to a bounce buffer before yielding for the first time.
> + * Handle client writes to the exported image. @in_place_buf has the first
> + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
> + * contains the rest (if any; NULL otherwise).
> + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
> + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
> + * assumed to be exclusively owned and will be used as-is.
Makes sense, although it leads to some interesting bookkeeping. (I'm
wondering if nbd would benefit from this split-level buffering, since
it supports reads and writes up to 32M)
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-04-01 12:05 ` Kevin Wolf
@ 2025-04-01 20:31 ` Eric Blake
2025-04-04 12:45 ` Hanna Czenczek
1 sibling, 0 replies; 59+ messages in thread
From: Eric Blake @ 2025-04-01 20:31 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Hanna Czenczek, Markus Armbruster, qemu-block, qemu-devel
On Tue, Apr 01, 2025 at 02:05:40PM +0200, Kevin Wolf wrote:
> > Maybe we could make setting @iothreads here and the generic
> > BlockExportOptions.iothread at the same time an error. That would save us
> > the explanation here.
>
> This raises the question of whether the better interface wouldn't be to change
> the BlockExportOptions.iothread from 'str' to an alternate between 'str'
> and ['str'], allowing the user to specify multiple iothreads in the
> already existing related option without creating conflicting options.
>
> In the long run, I would be surprised if FUSE remained the only export
> supporting multiple iothreads. At least the virtio based ones
> (vhost-user-blk and VDUSE) even have a precedent in the virtio-blk device
> itself, so while I don't know how much interest there is in actually
> implementing it, in theory we know it makes sense.
And I really want my work on NBD Multi-conn support to merge well with
multiple iothreads. That is yet another case where even if the I/O is
single-threaded, having parallel connections to the NBD server via
round-robin of requests can improve throughput. But if you can afford
to assign four iothreads to manage four TCP connections to a single
NBD server, you'll get even better response.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-27 15:55 ` Stefan Hajnoczi
@ 2025-04-01 20:36 ` Eric Blake
2025-04-02 13:20 ` Stefan Hajnoczi
2025-04-04 12:49 ` Hanna Czenczek
1 sibling, 1 reply; 59+ messages in thread
From: Eric Blake @ 2025-04-01 20:36 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Hanna Czenczek, qemu-block, qemu-devel, Kevin Wolf
On Thu, Mar 27, 2025 at 11:55:57AM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote:
> > FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> > (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
> >
> > We can use this to implement multi-threading.
> >
> > Note that the interface presented here differs from the multi-queue
> > interface of virtio-blk: The latter maps virtqueues to iothreads, which
> > allows processing multiple virtqueues in a single iothread. The
> > equivalent (processing multiple FDs in a single iothread) would not make
> > sense for FUSE because those FDs are used in a round-robin fashion by
> > the FUSE kernel driver. Putting two of them into a single iothread will
> > just create a bottleneck.
>
> This text might be outdated. virtio-blk's new iothread-vq-mapping
> parameter provides the "array of iothreads" mentioned below and a way to
> assign virtqueues to those IOThreads.
>
> > +++ b/qapi/block-export.json
> > @@ -179,12 +179,18 @@
> > # mount the export with allow_other, and if that fails, try again
> > # without. (since 6.1; default: auto)
> > #
> > +# @iothreads: Enables multi-threading: Handle requests in each of the
> > +# given iothreads (instead of the block device's iothread, or the
> > +# export's "main" iothread). For this, the FUSE FD is duplicated so
> > +# there is one FD per iothread. (since 10.1)
>
> This option isn't FUSE-specific but FUSE is the first export type to
> support it. Please add it to BlockExportOptions instead and refuse
> export creation when the export type only supports 1 IOThread.
>
> Eric: Are you interested in implementing support for multiple IOThreads
> in the NBD export? I remember some time ago we talked about NBD
> multi-conn support, although maybe that was for the client rather than
> the server.
The NBD server already supports clients that make requests through
multiple TCP sockets, but right now, that server is not taking
advantage of iothreads to spread the TCP load.
And yes, I am in the middle of working on adding client NBD multi-conn
support (reviving Rich Jones' preliminary patches on what it would
take to have qemu open parallel TCP sockets to a supporting NBD
server), which also will use a round-robin approach (but here, the
round-robin is something we would code up in qemu, rather than the
behavior handed to us by the FUSE kernel layer). Pinning specific
iothreads to a specific TCP socket may or may not make sense, but I
definitely want to have support for handing a pool of iothreads to an
NBD client that will be using multi-conn.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-04-01 20:36 ` Eric Blake
@ 2025-04-02 13:20 ` Stefan Hajnoczi
2025-04-03 17:59 ` Eric Blake
0 siblings, 1 reply; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-04-02 13:20 UTC (permalink / raw)
To: Eric Blake; +Cc: Hanna Czenczek, qemu-block, qemu-devel, Kevin Wolf
On Tue, Apr 01, 2025 at 03:36:40PM -0500, Eric Blake wrote:
> On Thu, Mar 27, 2025 at 11:55:57AM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote:
> > > FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> > > (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
> > >
> > > We can use this to implement multi-threading.
> > >
> > > Note that the interface presented here differs from the multi-queue
> > > interface of virtio-blk: The latter maps virtqueues to iothreads, which
> > > allows processing multiple virtqueues in a single iothread. The
> > > equivalent (processing multiple FDs in a single iothread) would not make
> > > sense for FUSE because those FDs are used in a round-robin fashion by
> > > the FUSE kernel driver. Putting two of them into a single iothread will
> > > just create a bottleneck.
> >
> > This text might be outdated. virtio-blk's new iothread-vq-mapping
> > parameter provides the "array of iothreads" mentioned below and a way to
> > assign virtqueues to those IOThreads.
> >
>
> > > +++ b/qapi/block-export.json
> > > @@ -179,12 +179,18 @@
> > > # mount the export with allow_other, and if that fails, try again
> > > # without. (since 6.1; default: auto)
> > > #
> > > +# @iothreads: Enables multi-threading: Handle requests in each of the
> > > +# given iothreads (instead of the block device's iothread, or the
> > > +# export's "main" iothread). For this, the FUSE FD is duplicated so
> > > +# there is one FD per iothread. (since 10.1)
> >
> > This option isn't FUSE-specific but FUSE is the first export type to
> > support it. Please add it to BlockExportOptions instead and refuse
> > export creation when the export type only supports 1 IOThread.
> >
> > Eric: Are you interested in implementing support for multiple IOThreads
> > in the NBD export? I remember some time ago we talked about NBD
> > multi-conn support, although maybe that was for the client rather than
> > the server.
>
> The NBD server already supports clients that make requests through
> multiple TCP sockets, but right now, that server is not taking
> advantage of iothreads to spread the TCP load.
>
> And yes, I am in the middle of working on adding client NBD multi-conn
> support (reviving Rich Jones' preliminary patches on what it would
> take to have qemu open parallel TCP sockets to a supporting NBD
> server), which also will use a round-robin approach (but here, the
> round-robin is something we would code up in qemu, rather than the
> behavior handed to us by the FUSE kernel layer). Pinning specific
> iothreads to a specific TCP socket may or may not make sense, but I
> definitely want to have support for handing a pool of iothreads to an
> NBD client that will be using multi-conn.
Here I'm asking Hanna to make the "iothreads" export parameter generic
for all export types (not just for FUSE). Do you think that the NBD
export will be able to use the generic parameter or would you prefer an
NBD-specific export parameter?
Stefan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-04-02 13:20 ` Stefan Hajnoczi
@ 2025-04-03 17:59 ` Eric Blake
0 siblings, 0 replies; 59+ messages in thread
From: Eric Blake @ 2025-04-03 17:59 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Hanna Czenczek, qemu-block, qemu-devel, Kevin Wolf
On Wed, Apr 02, 2025 at 09:20:33AM -0400, Stefan Hajnoczi wrote:
> > > Eric: Are you interested in implementing support for multiple IOThreads
> > > in the NBD export? I remember some time ago we talked about NBD
> > > multi-conn support, although maybe that was for the client rather than
> > > the server.
> >
> > The NBD server already supports clients that make requests through
> > multiple TCP sockets, but right now, that server is not taking
> > advantage of iothreads to spread the TCP load.
> >
> > And yes, I am in the middle of working on adding client NBD multi-conn
> > support (reviving Rich Jones' preliminary patches on what it would
> > take to have qemu open parallel TCP sockets to a supporting NBD
> > server), which also will use a round-robin approach (but here, the
> > round-robin is something we would code up in qemu, rather than the
> > behavior handed to us by the FUSE kernel layer). Pinning specific
> > iothreads to a specific TCP socket may or may not make sense, but I
> > definitely want to have support for handing a pool of iothreads to an
> > NBD client that will be using multi-conn.
>
> Here I'm asking Hanna to make the "iothreads" export parameter generic
> for all export types (not just for FUSE). Do you think that the NBD
> export will be able to use the generic parameter or would you prefer an
> NBD-specific export parameter?
I would prefer to use a generic parameter. NBD will already need a
specific parameter on whether to attempt to use multiple TCP sockets
if the server advertises multi-conn. But how to map those TCP sockets
to iothreads feels like a better fit for a generic iothreads; and
perhaps the NBD parameter can also be made smart enough to auto-set
the number of TCP sockets based on the number of available iothreads
if there is no explicit override.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 01/15] fuse: Copy write buffer content before polling
2025-03-27 14:47 ` Stefan Hajnoczi
@ 2025-04-04 11:17 ` Hanna Czenczek
0 siblings, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 11:17 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, Kevin Wolf, qemu-stable
On 27.03.25 15:47, Stefan Hajnoczi wrote:
> On Tue, Mar 25, 2025 at 05:06:35PM +0100, Hanna Czenczek wrote:
>> Polling in I/O functions can lead to nested read_from_fuse_export()
> "Polling" means several different things. "aio_poll()" or "nested event
> loop" would be clearer.
Sure!
>> calls, overwriting the request buffer's content. The only function
>> affected by this is fuse_write(), which therefore must use a bounce
>> buffer or corruption may occur.
>>
>> Note that in addition we do not know whether libfuse-internal structures
>> can cope with this nesting, and even if we did, we probably cannot rely
>> on it in the future. This is the main reason why we want to remove
>> libfuse from the I/O path.
>>
>> I do not have a good reproducer for this other than:
>>
>> $ dd if=/dev/urandom of=image bs=1M count=4096
>> $ dd if=/dev/zero of=copy bs=1M count=4096
>> $ touch fuse-export
>> $ qemu-storage-daemon \
>> --blockdev file,node-name=file,filename=copy \
>> --export \
>> fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
>> &
>>
>> Other shell:
>> $ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
>> $ killall -SIGINT qemu-storage-daemon
>> $ qemu-img compare image copy
>> Content mismatch at offset 0!
>>
>> (The -t none in qemu-img convert is important.)
>>
>> I tried reproducing this with throttle and small aio_write requests from
>> another qemu-io instance, but for some reason all requests are perfectly
>> serialized then.
>>
>> I think in theory we should get parallel writes only if we set
>> fi->parallel_direct_writes in fuse_open(). In fact, I can confirm that
>> if we do that, that throttle-based reproducer works (i.e. does get
>> parallel (nested) write requests). I have no idea why we still get
>> parallel requests with qemu-img convert anyway.
>>
>> Also, a later patch in this series will set fi->parallel_direct_writes
>> and note that it makes basically no difference when running fio on the
>> current libfuse-based version of our code. It does make a difference
>> without libfuse. So something quite fishy is going on.
>>
>> I will try to investigate further what the root cause is, but I think
>> for now let's assume that calling blk_pwrite() can invalidate the buffer
>> contents through nested polling.
>>
>> Cc: qemu-stable@nongnu.org
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/export/fuse.c | 24 +++++++++++++++++++++---
>> 1 file changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index 465cc9891d..a12f479492 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -301,6 +301,12 @@ static void read_from_fuse_export(void *opaque)
>> goto out;
>> }
>>
>> + /*
>> + * Note that polling in any request-processing function can lead to a nested
>> + * read_from_fuse_export() call, which will overwrite the contents of
>> + * exp->fuse_buf. Anything that takes a buffer needs to take care that the
>> + * content is copied before potentially polling.
>> + */
>> fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
> It seems safer to allocate a fuse_buf per request instead copying the
> data buffer only for write requests. Other request types might be
> affected too (e.g. nested reads of different sizes).
I don’t think anything else is affected, but I absolutely agree that it
would be more obviously safe to have a dedicated buffer for each request.
However, I decided to do it this way so I wouldn’t negatively affect
performance before converting to coroutines – I felt it would be a bit
unfair. But if you think that’s alright, I’m happy to use a dedicated
buffer per request!
>
> I guess later on in this series a per-request fuse_buf will be
> introduced anyway, so it doesn't matter what we do in this commit.
Kind of; this one is CC-ed to qemu-stable (the rest is not), so it does
matter for stable releases.
Hanna
>
>>
>> out:
>> @@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> size_t size, off_t offset, struct fuse_file_info *fi)
>> {
>> FuseExport *exp = fuse_req_userdata(req);
>> + void *copied;
>> int64_t length;
>> int ret;
>>
>> @@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> return;
>> }
>>
>> + /*
>> + * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
>> + * I/O function may do), read_from_fuse_export() may be nested, overwriting
>> + * the request buffer content. Therefore, we must copy it here.
>> + */
>> + copied = blk_blockalign(exp->common.blk, size);
>> + memcpy(copied, buf, size);
>> +
>> /**
>> * Clients will expect short writes at EOF, so we have to limit
>> * offset+size to the image length.
>> @@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> length = blk_getlength(exp->common.blk);
>> if (length < 0) {
>> fuse_reply_err(req, -length);
>> - return;
>> + goto free_buffer;
>> }
>>
>> if (offset + size > length) {
>> @@ -653,19 +668,22 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
>> if (ret < 0) {
>> fuse_reply_err(req, -ret);
>> - return;
>> + goto free_buffer;
>> }
>> } else {
>> size = length - offset;
>> }
>> }
>>
>> - ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
>> + ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
>> if (ret >= 0) {
>> fuse_reply_write(req, size);
>> } else {
>> fuse_reply_err(req, -ret);
>> }
>> +
>> +free_buffer:
>> + qemu_vfree(copied);
>> }
>>
>> /**
>> --
>> 2.48.1
>>
>>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 01/15] fuse: Copy write buffer content before polling
2025-04-01 13:44 ` Eric Blake
@ 2025-04-04 11:18 ` Hanna Czenczek
0 siblings, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 11:18 UTC (permalink / raw)
To: Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf, qemu-stable
On 01.04.25 15:44, Eric Blake wrote:
> On Tue, Mar 25, 2025 at 05:06:35PM +0100, Hanna Czenczek wrote:
>> Polling in I/O functions can lead to nested read_from_fuse_export()
>> calls, overwriting the request buffer's content. The only function
>> affected by this is fuse_write(), which therefore must use a bounce
>> buffer or corruption may occur.
>>
>> Note that in addition we do not know whether libfuse-internal structures
>> can cope with this nesting, and even if we did, we probably cannot rely
>> on it in the future. This is the main reason why we want to remove
>> libfuse from the I/O path.
>>
>> @@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> size_t size, off_t offset, struct fuse_file_info *fi)
>> {
>> FuseExport *exp = fuse_req_userdata(req);
>> + void *copied;
> Do we have a good way to annotate variables that require qemu_vfree()
> if non-NULL for automatic cleanup?
That would be news to me, but if so, I’ll be happy to use it. (The
problem is distinguishing between what needs qemu_vfree(), and what
needs g_free(), I suppose.)
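(For illustration, a minimal sketch of what such an annotation could look
like, using the GCC/Clang cleanup attribute; the macro name is made up and
an equivalent helper may well already exist in the tree:

    static inline void vfree_cleanup(void *opaque)
    {
        void **ptr = opaque;
        qemu_vfree(*ptr);   /* qemu_vfree(NULL) is expected to be a no-op */
    }
    #define AUTO_VFREE __attribute__((cleanup(vfree_cleanup)))

    /* hypothetical use in fuse_write(): */
    AUTO_VFREE void *copied = NULL;
    copied = blk_blockalign(exp->common.blk, size);
    memcpy(copied, buf, size);

With that, the early-return paths would no longer need the goto-based
clean-up, as long as it stays obvious which buffers are qemu_vfree()- and
which are g_free()-allocated.)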
Hanna
> If so, should this be annotated
> and set to NULL here,...
>
>> int64_t length;
>> int ret;
>>
>> @@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> return;
>> }
>>
>> + /*
>> + * Heed the note on read_from_fuse_export(): If we poll (which any blk_*()
>> + * I/O function may do), read_from_fuse_export() may be nested, overwriting
>> + * the request buffer content. Therefore, we must copy it here.
>> + */
>> + copied = blk_blockalign(exp->common.blk, size);
>> + memcpy(copied, buf, size);
>> +
>> /**
>> * Clients will expect short writes at EOF, so we have to limit
>> * offset+size to the image length.
>> @@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> length = blk_getlength(exp->common.blk);
>> if (length < 0) {
>> fuse_reply_err(req, -length);
>> - return;
>> + goto free_buffer;
> ...so that this and similar hunks are not needed?
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers()
2025-04-01 13:55 ` Eric Blake
@ 2025-04-04 11:24 ` Hanna Czenczek
0 siblings, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 11:24 UTC (permalink / raw)
To: Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 01.04.25 15:55, Eric Blake wrote:
> On Tue, Mar 25, 2025 at 05:06:48PM +0100, Hanna Czenczek wrote:
>> Pull setting up and tearing down the AIO context handlers into two
>> dedicated functions.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/export/fuse.c | 32 ++++++++++++++++----------------
>> 1 file changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index 2df6297d61..bd98809d71 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque);
>> static bool is_regular_file(const char *path, Error **errp);
>>
>>
>> -static void fuse_export_drained_begin(void *opaque)
>> +static void fuse_attach_handlers(FuseExport *exp)
>> {
>> - FuseExport *exp = opaque;
>> + aio_set_fd_handler(exp->common.ctx,
>> + fuse_session_fd(exp->fuse_session),
>> + read_from_fuse_export, NULL, NULL, NULL, exp);
>> + exp->fd_handler_set_up = true;
> I found this name mildly confusing (does "set_up=true" mean that I
> still need to set up, or that I am already set up);
Not my fault that English has irregular verbs. FWIW, if I meant the
former, I’d probably call it “set_up_fd_handler” instead.
> would it be better
> as s/fd_handler_set_up/fd_handler_armed/g ?
I prefer the less militaristic “installed”, but sure.
Hanna
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 11/15] fuse: Manually process requests (without libfuse)
2025-04-01 14:35 ` Eric Blake
@ 2025-04-04 11:30 ` Hanna Czenczek
2025-04-04 11:42 ` Hanna Czenczek
1 sibling, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 11:30 UTC (permalink / raw)
To: Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 01.04.25 16:35, Eric Blake wrote:
> On Tue, Mar 25, 2025 at 05:06:51PM +0100, Hanna Czenczek wrote:
>> Manually read requests from the /dev/fuse FD and process them, without
>> using libfuse. This allows us to safely add parallel request processing
>> in coroutines later, without having to worry about libfuse internals.
>> (Technically, we already have exactly that problem with
>> read_from_fuse_export()/read_from_fuse_fd() nesting.)
>>
>> We will continue to use libfuse for mounting the filesystem; fusermount3
>> is effectively a helper program of libfuse, so it should know best how
>> to interact with it. (Doing it manually without libfuse, while doable,
>> is a bit of a pain, and it is not clear to me how stable the "protocol"
>> actually is.)
>>
>> @@ -247,6 +268,14 @@ static int fuse_export_create(BlockExport *blk_exp,
>>
>> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>>
>> + exp->fuse_fd = fuse_session_fd(exp->fuse_session);
>> + ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
> fcntl(F_SETFL) should be used in a read-modify-write pattern with
> F_GETFL (otherwise, you are nuking any other flags that might have
> been important).
>
> See also block/file-posix.c:fcntl_setfl. Maybe we should hoist that
> into a common helper in util/osdep.c?
>
>> /**
>> - * Handle client reads from the exported image.
>> + * Handle client reads from the exported image. Allocates *bufptr and reads
>> + * data from the block device into that buffer.
> Worth calling out that *bufptr must be freed with qemu_vfree...
>
>> + * Returns the buffer (read) size on success, and -errno on error.
>> */
>> -static void fuse_read(fuse_req_t req, fuse_ino_t inode,
>> - size_t size, off_t offset, struct fuse_file_info *fi)
>> +static ssize_t fuse_read(FuseExport *exp, void **bufptr,
>> + uint64_t offset, uint32_t size)
> ...
>> {
>> buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
>> if (!buf) {
>> - fuse_reply_err(req, ENOMEM);
>> - return;
>> + return -ENOMEM;
>> }
>>
>> ret = blk_pread(exp->common.blk, offset, size, buf, 0);
>> - if (ret >= 0) {
>> - fuse_reply_buf(req, buf, size);
>> - } else {
>> - fuse_reply_err(req, -ret);
>> + if (ret < 0) {
>> + qemu_vfree(buf);
>> + return ret;
> ...since internal cleanup recognizes that plain free() is wrong?
>
>> #ifdef CONFIG_FUSE_LSEEK
>> /**
>> * Let clients inquire allocation status.
>> + * Return the number of bytes written to *out on success, and -errno on error.
>> */
>> -static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
>> - int whence, struct fuse_file_info *fi)
>> +static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
>> + uint64_t offset, uint32_t whence)
>> {
>> - FuseExport *exp = fuse_req_userdata(req);
>> -
>> if (whence != SEEK_HOLE && whence != SEEK_DATA) {
>> - fuse_reply_err(req, EINVAL);
>> - return;
>> + return -EINVAL;
> Unrelated to this patch, but any reason why we only SEEK_HOLE/DATA
> (and not, say, SEEK_SET)? Is it because we aren't really maintaining
> the notion of a current offset? I guess that works as long as the
> caller is always using pread/pwrite (never bare read/write where
> depending on our internal offset would matter).
Because FUSE doesn’t send SEEK_SET; FDs’ in-file offsets are maintained
by the kernel.
Hanna
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 11/15] fuse: Manually process requests (without libfuse)
2025-04-01 14:35 ` Eric Blake
2025-04-04 11:30 ` Hanna Czenczek
@ 2025-04-04 11:42 ` Hanna Czenczek
1 sibling, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 11:42 UTC (permalink / raw)
To: Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf
Sorry, replied too early. :)
On 01.04.25 16:35, Eric Blake wrote:
> On Tue, Mar 25, 2025 at 05:06:51PM +0100, Hanna Czenczek wrote:
>> Manually read requests from the /dev/fuse FD and process them, without
>> using libfuse. This allows us to safely add parallel request processing
>> in coroutines later, without having to worry about libfuse internals.
>> (Technically, we already have exactly that problem with
>> read_from_fuse_export()/read_from_fuse_fd() nesting.)
>>
>> We will continue to use libfuse for mounting the filesystem; fusermount3
>> is effectively a helper program of libfuse, so it should know best how
>> to interact with it. (Doing it manually without libfuse, while doable,
>> is a bit of a pain, and it is not clear to me how stable the "protocol"
>> actually is.)
>>
>> @@ -247,6 +268,14 @@ static int fuse_export_create(BlockExport *blk_exp,
>>
>> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>>
>> + exp->fuse_fd = fuse_session_fd(exp->fuse_session);
>> + ret = fcntl(exp->fuse_fd, F_SETFL, O_NONBLOCK);
> fcntl(F_SETFL) should be used in a read-modify-write pattern with
> F_GETFL (otherwise, you are nuking any other flags that might have
> been important).
>
> See also block/file-posix.c:fcntl_setfl. Maybe we should hoist that
> into a common helper in util/osdep.c?
Sounds good.
>> /**
>> - * Handle client reads from the exported image.
>> + * Handle client reads from the exported image. Allocates *bufptr and reads
>> + * data from the block device into that buffer.
> Worth calling out that *bufptr must be freed with qemu_vfree...
Yep, I’ll add it.
Hanna
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 11/15] fuse: Manually process requests (without libfuse)
2025-03-27 15:35 ` Stefan Hajnoczi
@ 2025-04-04 12:36 ` Hanna Czenczek
0 siblings, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 12:36 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 27.03.25 16:35, Stefan Hajnoczi wrote:
> On Tue, Mar 25, 2025 at 05:06:51PM +0100, Hanna Czenczek wrote:
>> Manually read requests from the /dev/fuse FD and process them, without
>> using libfuse. This allows us to safely add parallel request processing
>> in coroutines later, without having to worry about libfuse internals.
>> (Technically, we already have exactly that problem with
>> read_from_fuse_export()/read_from_fuse_fd() nesting.)
>>
>> We will continue to use libfuse for mounting the filesystem; fusermount3
>> is effectively a helper program of libfuse, so it should know best how
>> to interact with it. (Doing it manually without libfuse, while doable,
>> is a bit of a pain, and it is not clear to me how stable the "protocol"
>> actually is.)
>>
>> Take this opportunity of quite a major rewrite to update the Copyright
>> line with corrected information that has surfaced in the meantime.
>>
>> Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
>> except 'sync', which are iodepth=1 and pvsync2):
>>
>> file:
>> read:
>> seq aio: 78.6k ±1.3k IOPS
>> rand aio: 39.3k ±2.9k
>> seq sync: 32.5k ±0.7k
>> rand sync: 9.9k ±0.1k
>> write:
>> seq aio: 61.9k ±0.5k
>> rand aio: 61.2k ±0.6k
>> seq sync: 27.9k ±0.2k
>> rand sync: 27.6k ±0.4k
>> null:
>> read:
>> seq aio: 214.0k ±5.9k
>> rand aio: 212.7k ±4.5k
>> seq sync: 90.3k ±6.5k
>> rand sync: 89.7k ±5.1k
>> write:
>> seq aio: 203.9k ±1.5k
>> rand aio: 201.4k ±3.6k
>> seq sync: 86.1k ±6.2k
>> rand sync: 84.9k ±5.3k
>>
>> And with this patch applied:
>>
>> file:
>> read:
>> seq aio: 76.6k ±1.8k (- 3 %)
>> rand aio: 26.7k ±0.4k (-32 %)
>> seq sync: 47.7k ±1.2k (+47 %)
>> rand sync: 10.1k ±0.2k (+ 2 %)
>> write:
>> seq aio: 58.1k ±0.5k (- 6 %)
>> rand aio: 58.1k ±0.5k (- 5 %)
>> seq sync: 36.3k ±0.3k (+30 %)
>> rand sync: 36.1k ±0.4k (+31 %)
>> null:
>> read:
>> seq aio: 268.4k ±3.4k (+25 %)
>> rand aio: 265.3k ±2.1k (+25 %)
>> seq sync: 134.3k ±2.7k (+49 %)
>> rand sync: 132.4k ±1.4k (+48 %)
>> write:
>> seq aio: 275.3k ±1.7k (+35 %)
>> rand aio: 272.3k ±1.9k (+35 %)
>> seq sync: 130.7k ±1.6k (+52 %)
>> rand sync: 127.4k ±2.4k (+50 %)
>>
>> So clearly the AIO file results are actually not good, and random reads
>> are indeed quite terrible. On the other hand, we can see from the sync
>> and null results that request handling should in theory be quicker. How
>> does this fit together?
>>
>> I believe the bad AIO results are an artifact of the accidental parallel
>> request processing we have due to nested polling: Depending on how the
>> actual request processing is structured and how long request processing
>> takes, more or less requests will be submitted in parallel. So because
>> of the restructuring, I think this patch accidentally changes how many
>> requests end up being submitted in parallel, which decreases
>> performance.
>>
>> (I have seen something like this before: In RSD, without having
>> implemented a polling mode, the debug build tended to have better
>> performance than the more optimized release build, because the debug
>> build, taking longer to submit requests, ended up processing more
>> requests in parallel.)
>>
>> In any case, once we use coroutines throughout the code, performance
>> will improve again across the board.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/export/fuse.c | 793 +++++++++++++++++++++++++++++++-------------
>> 1 file changed, 567 insertions(+), 226 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index 3dd50badb3..407b101018 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
[...]
>> +/**
>> + * Check the FUSE FD for whether it is readable or not. Because we cannot
>> + * reasonably do this without reading a request at the same time, also read and
>> + * process that request if any.
>> + * (To be used as a poll handler for the FUSE FD.)
>> + */
>> +static bool poll_fuse_fd(void *opaque)
>> +{
>> + return read_from_fuse_fd(opaque);
>> +}
> The other io_poll() callbacks in QEMU peek at memory whereas this one
> invokes the read(2) syscall. Two reasons why this is a problem:
> 1. Syscall latency is too high. Other fd handlers will be delayed by
> microseconds.
> 2. This doesn't scale. If every component in QEMU does this then the
> event loop degrades to O(n) of non-blocking read(2) syscalls where n
> is the number of fds.
>
> Also, handling the request inside the io_poll() callback skews
> AioContext's time accounting because time spent handling the request
> will be accounted as "polling time". The adaptive polling calculation
> will think it polled for longer than it did.
>
> If there is no way to peek at memory, please don't implement the
> io_poll() callback.
Got it, thanks!
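(For the record, a poll handler that only peeks at memory would look more
like the following; this is a purely hypothetical sketch with made-up
structure names, just to illustrate the pattern:)

#include <stdbool.h>
#include <stdint.h>
#include "qemu/atomic.h"

/* Hypothetical queue state: a producer index the other side advances in
 * shared memory, and our own cached consumer index. */
typedef struct ExampleQueue {
    uint32_t *shared_producer_idx;
    uint32_t consumer_idx;
} ExampleQueue;

static bool example_queue_poll(void *opaque)
{
    ExampleQueue *q = opaque;

    /* No syscall: just compare two indices in memory */
    return qatomic_read(q->shared_producer_idx) != q->consumer_idx;
}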
Hanna
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-04-01 12:05 ` Kevin Wolf
2025-04-01 20:31 ` Eric Blake
@ 2025-04-04 12:45 ` Hanna Czenczek
1 sibling, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 12:45 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Markus Armbruster, qemu-block, qemu-devel
On 01.04.25 14:05, Kevin Wolf wrote:
> Am 27.03.2025 um 14:45 hat Hanna Czenczek geschrieben:
>> On 27.03.25 13:18, Markus Armbruster wrote:
>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>
>>>> On 26.03.25 12:41, Markus Armbruster wrote:
>>>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>>>
>>>>>> On 26.03.25 06:38, Markus Armbruster wrote:
>>>>>>> Hanna Czenczek <hreitz@redhat.com> writes:
>>>>>>>
>>>>>>>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>>>>>>>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>>>>>>>
>>>>>>>> We can use this to implement multi-threading.
>>>>>>>>
>>>>>>>> Note that the interface presented here differs from the multi-queue
>>>>>>>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>>>>>>>> allows processing multiple virtqueues in a single iothread. The
>>>>>>>> equivalent (processing multiple FDs in a single iothread) would not make
>>>>>>>> sense for FUSE because those FDs are used in a round-robin fashion by
>>>>>>>> the FUSE kernel driver. Putting two of them into a single iothread will
>>>>>>>> just create a bottleneck.
>>>>>>>>
>>>>>>>> Therefore, all we need is an array of iothreads, and we will create one
>>>>>>>> "queue" (FD) per thread.
>>>>>>> [...]
>>>>>>>
>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>> ---
>>>>>>>> qapi/block-export.json | 8 +-
>>>>>>>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>>>>>>>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>>>>>>>> index c783e01a53..0bdd5992eb 100644
>>>>>>>> --- a/qapi/block-export.json
>>>>>>>> +++ b/qapi/block-export.json
>>>>>>>> @@ -179,12 +179,18 @@
>>>>>>>> # mount the export with allow_other, and if that fails, try again
>>>>>>>> # without. (since 6.1; default: auto)
>>>>>>>> #
>>>>>>>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>>>>>>>> +# given iothreads (instead of the block device's iothread, or the
>>>>>>>> +# export's "main" iothread).
>>>>>>> When does "the block device's iothread" apply, and when "the export's
>>>>>>> main iothread"?
>>>>>> Depends on where you set the iothread option.
>>>>> Assuming QMP users need to know (see right below), can we trust they
>>>>> understand which one applies when? If not, can we provide clues?
>>>> I don’t understand what exactly you mean, but which one applies when has nothing to do with this option, but with the @iothread (and @fixed-iothread) option(s) on BlockExportOptions, which do document this.
>>> Can you point me to the spot?
>> Sure: https://www.qemu.org/docs/master/interop/qemu-qmp-ref.html#object-QMP-block-export.BlockExportOptions
>> (iothread and fixed-iothread)
>>
>>>>>>> Is this something the QMP user needs to know?
>>>>>> I think so, because e.g. if you set iothread on the device and the export, you’ll get a conflict. But if you set it there and set this option, you won’t. This option will just override the device/export option.
>>>>> Do we think the doc comment sufficient for QMP users to figure this out?
>>>> As for conflict, BlockExportOptions.iothread and BlockExportOptions.fixed-iothread do.
>>>>
>>>> As for overriding, I do think so. Do you not? I’m always open to suggestions.
>>>>
>>>>> If not, can we provide clues?
>>>>>
>>>>> In particular, do we think they can go from an export failure to the
>>>>> setting @iothreads here? Perhaps the error message will guide them.
>>>>> What is the message?
>>>> I don’t understand what failure you mean.
>>> You wrote "you'll get a conflict". I assume this manifests as failure
>>> of a QMP command (let's ignore CLI to keep things simple here).
>> If you set the @iothread option on both the (guest) device and the export
>> (and also @fixed-iothread on the export), then you’ll get an error. Nothing
>> to do with this new @iothreads option here.
>>
>>> Do we think ordinary users running into that failure can figure out they
>>> can avoid it by setting @iothreads?
>> It shouldn’t affect the failure. Setting @iothread on both device and
>> export (with @fixed-iothread) will always cause an error, and should.
>> Setting this option is not supposed to “fix” that configuration error.
>>
>> Theoretically, setting @iothreads here could make it so that
>> BlockExportOptions.iothread (and/or fixed-iothread) is ignored, because that
>> thread will no longer be used for export-issued I/O; but in practice,
>> setting that option (BlockExportOptions.iothread) moves that export and the
>> whole BDS tree behind it to that I/O thread, so if you haven’t specified an
>> I/O thread on the guest device, the guest device will then use that thread.
>> So making @iothreads silently completely ignore BlockExportOptions.iothread
>> may cause surprising behavior.
>>
>> Maybe we could make setting @iothreads here and the generic
>> BlockExportOptions.iothread at the same time an error. That would save us
>> the explanation here.
> This raises the question if the better interface wouldn't be to change
> the BlockExportOptions.iothread from 'str' to an alternate between 'str'
> and ['str'], allowing the user to specify multiple iothreads in the
> already existing related option without creating conflicting options.
Sounds good.
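Something roughly like this, I imagine (a hypothetical QAPI sketch; the
type and branch names are made up, and it assumes an alternate can
distinguish a string from a list):

# Hypothetical sketch only:
{ 'alternate': 'IOThreadSpec',
  'data': { 'single': 'str',
            'list': ['str'] } }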
> In the long run, I would be surprised if FUSE remained the only export
> supporting multiple iothreads. At least the virtio based ones
> (vhost-user-blk and VDUSE) even have precedence in the virtio-blk device
> itself, so while I don't know how much interest there is in actually
> implementing it, in theory we know it makes sense.
For the virtio-based ones, I don’t know whether the interface should
allow mapping virtqueues to threads, though (as virtio-blk allows now).
It doesn’t make sense for FUSE because of the round-robin nature, but
for other exports, I don’t know.
But I’m happy to not worry about that for now. :)
Hanna
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-03-27 15:55 ` Stefan Hajnoczi
2025-04-01 20:36 ` Eric Blake
@ 2025-04-04 12:49 ` Hanna Czenczek
2025-04-07 14:02 ` Stefan Hajnoczi
1 sibling, 1 reply; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 12:49 UTC (permalink / raw)
To: Stefan Hajnoczi, Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 27.03.25 16:55, Stefan Hajnoczi wrote:
> On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote:
>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>
>> We can use this to implement multi-threading.
>>
>> Note that the interface presented here differs from the multi-queue
>> interface of virtio-blk: The latter maps virtqueues to iothreads, which
>> allows processing multiple virtqueues in a single iothread. The
>> equivalent (processing multiple FDs in a single iothread) would not make
>> sense for FUSE because those FDs are used in a round-robin fashion by
>> the FUSE kernel driver. Putting two of them into a single iothread will
>> just create a bottleneck.
> This text might be outdated. virtio-blk's new iothread-vq-mapping
> parameter provides the "array of iothreads" mentioned below and a way to
> assign virtqueues to those IOThreads.
Ah, yes. The difference is still that with FUSE, there is no such
assignment, because it wouldn’t make sense. But I can change s/maps
virtqueues/allows mapping virtqueues/, and s/differs from/is only a
subset of/, if that’s alright.
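(For context, the cloning itself is little code; here is a minimal
sketch, assuming the FUSE_DEV_IOC_CLONE definition from <linux/fuse.h>:)

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fuse.h>

/* Open a fresh /dev/fuse FD and attach it to the session that
 * @session_fd belongs to; the kernel then distributes requests across
 * all FDs of that session.  Returns the new FD or -errno. */
static int clone_fuse_fd(int session_fd)
{
    uint32_t src_fd = session_fd;
    int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

    if (clone_fd < 0) {
        return -errno;
    }
    if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &src_fd) < 0) {
        int ret = -errno;
        close(clone_fd);
        return ret;
    }
    return clone_fd;
}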
>> Therefore, all we need is an array of iothreads, and we will create one
>> "queue" (FD) per thread.
>>
>> These are the benchmark results when using four threads (compared to a
>> single thread); note that fio still only uses a single job, but
>> performance can still be improved because of said round-robin usage for
>> the queues. (Not in the sync case, though, in which case I guess it
>> just adds overhead.)
> Interesting. FUSE-over-io_uring seems to be different from
> FUSE_DEV_IOC_CLONE here. It doesn't do round-robin. It uses CPU affinity
> instead, handing requests to the io_uring context associated with the
> current CPU when possible.
Do you think that should have implications for the QAPI interface?
[...]
>> qapi/block-export.json | 8 +-
>> block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
>> 2 files changed, 179 insertions(+), 43 deletions(-)
>>
>> diff --git a/qapi/block-export.json b/qapi/block-export.json
>> index c783e01a53..0bdd5992eb 100644
>> --- a/qapi/block-export.json
>> +++ b/qapi/block-export.json
>> @@ -179,12 +179,18 @@
>> # mount the export with allow_other, and if that fails, try again
>> # without. (since 6.1; default: auto)
>> #
>> +# @iothreads: Enables multi-threading: Handle requests in each of the
>> +# given iothreads (instead of the block device's iothread, or the
>> +# export's "main" iothread). For this, the FUSE FD is duplicated so
>> +# there is one FD per iothread. (since 10.1)
> This option isn't FUSE-specific but FUSE is the first export type to
> support it. Please add it to BlockExportOptions instead and refuse
> export creation when the export type only supports 1 IOThread.
Makes sense. I’ll try to go with what Kevin suggested, i.e. have
@iothread be an alternate type.
Hanna
>
> Eric: Are you interested in implementing support for multiple IOThreads
> in the NBD export? I remember some time ago we talked about NBD
> multi-conn support, although maybe that was for the client rather than
> the server.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-04-01 20:24 ` Eric Blake
@ 2025-04-04 13:04 ` Hanna Czenczek
0 siblings, 0 replies; 59+ messages in thread
From: Hanna Czenczek @ 2025-04-04 13:04 UTC (permalink / raw)
To: Eric Blake; +Cc: qemu-block, qemu-devel, Kevin Wolf
On 01.04.25 22:24, Eric Blake wrote:
> On Tue, Mar 25, 2025 at 05:06:55PM +0100, Hanna Czenczek wrote:
>> We probably want to support larger write sizes than just 4k; 64k seems
>> nice. However, we cannot read partial requests from the FUSE FD, we
>> always have to read requests in full; so our read buffer must be large
>> enough to accommodate potential 64k writes if we want to support that.
>>
>> Always allocating FuseRequest objects with 64k buffers in them seems
>> wasteful, though. But we can get around the issue by splitting the
>> buffer into two and using readv(): One part will hold all normal (up to
>> 4k) write requests and all other requests, and a second part (the
>> "spill-over buffer") will be used only for larger write requests. Each
>> FuseQueue has its own spill-over buffer, and only if we find it used
>> when reading a request will we move its ownership into the FuseRequest
>> object and allocate a new spill-over buffer for the queue.
>>
>> This way, we get to support "large" write sizes without having to
>> allocate big buffers when they aren't used.
>>
>> Also, this even reduces the size of the FuseRequest objects because the
>> read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
>> the requests we support are not quite so large (except for >4k writes),
>> so until now, we basically had to have useless padding in there.
>>
>> With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
>> is easily met and we can decrease the size of the buffer portion that is
>> right inside of FuseRequest.
>>
>> As for benchmarks, the benefit of this patch can be shown easily by
>> writing a 4G image (with qemu-img convert) to a FUSE export:
>> - Before this patch: Takes 25.6 s (14.4 s with -t none)
>> - After this patch: Takes 4.5 s (5.5 s with -t none)
> Hmm - 64k is still small for some tasks; with qcow2, you can have
> clusters up to 2M, and writing a cluster at a time seems to be a
> reasonable desire.
What do you mean specifically? Do you think it’s reasonable from a
performance point of view, or for potential atomicity, or...?
> Or in other storage areas, nbdcopy (from libnbd)
> defaults to 256k and allows up to 32M, or we can even look at lvm that
> defaults to extent size of 4M (although with lvm thin volumes,
> behavior is more like qcow2 with subclusters in that you don't have to
> allocate the entire extent at once).
The problem I see is that other back-ends can probably read the request
header, figure out how much memory they need, and allocate the buffer.
That doesn’t work with FUSE; you have to read the whole request at once.
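That is what the readv() split in this patch is about; roughly like this
(a fragment-level sketch only, and the spill-over field and constant
names here are illustrative rather than necessarily the exact ones from
the patch):

struct iovec iov[2] = {
    /* Covers the request header and "small" (up to 4k) write payloads */
    { .iov_base = q->request_buf,   .iov_len = sizeof(q->request_buf) },
    /* Only actually filled by larger write payloads */
    { .iov_base = q->spillover_buf, .iov_len = SPILLOVER_BUF_SIZE },
};
ssize_t ret = RETRY_ON_EINTR(readv(q->fuse_fd, iov, 2));

if (ret > (ssize_t)sizeof(q->request_buf)) {
    /* The spill-over buffer was used: move its ownership into the
     * request object and allocate a new one for the queue */
}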
> Maybe it makes sense to have this as a hierarchical spillover (first
> level spills over to 64k, second level spills over to 2M or whatever
> ACTUAL maximum you are willing to support)?
Sounds quite complicated, so the question is what we’d want this for and
whether it’d be worth it.
I also wonder whether rate-limiting should come into play at some
point. FUSE doesn’t require network, so a user could submit the same 2M
buffer many times, and thus, without itself having to allocate a lot of
memory, cause big memory allocations in qemu(-storage-daemon)...
>
> Or you can feel free to ignore my comments on this patch - allowing
> 64k instead of 4k is ALREADY a nice change.
>
>> @@ -501,7 +547,20 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>> goto no_request;
>> }
>>
>> - ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
>> + /*
>> + * If handling the last request consumed the spill-over buffer, allocate a
>> + * new one. Align it to the block device's alignment, which admittedly is
>> + * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
> What are typical block device minimum alignments? If 4k is typical,
> then it should be easy to make FUSE_IN_PLACE_WRITE_BYTES be 4k.
Yes, but no. We’d have to allocate request_buf such that for WRITE
requests, the data in it would be aligned, i.e. after the header. We
cannot split it (one part for the FUSE headers, one part for the data),
because that might break up non-WRITE requests, which would make them
hard to handle.
Locally, I have a version of this series that does introduce a function
to allow allocating buffers such that an offset within them is aligned;
but in the end I didn’t see much of a performance difference, so I
decided to send this simpler version instead.
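(Purely for illustration, such a helper could look like the following;
this is a hypothetical sketch, not the code from that branch:)

#include <glib.h>
#include <stdint.h>

/* Allocate a buffer of @size bytes such that the byte at offset @ofs
 * within it ends up @align-aligned (@align must be a power of two).
 * The pointer to hand to g_free() is returned in *to_free. */
static void *alloc_with_aligned_offset(size_t size, size_t ofs,
                                       size_t align, void **to_free)
{
    uint8_t *raw = g_malloc(size + align - 1);
    uintptr_t data = (uintptr_t)raw + ofs;
    size_t shift = (align - (data & (align - 1))) & (align - 1);

    *to_free = raw;
    return raw + shift;
}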
Hanna
> If
> 64k is typical, we're already doomed at being aligned. Is being
> unaligned going to cause the block layer to add bounce buffers on us,
> if the block layer has a larger alignment than what we use here?
>
>> @@ -915,17 +983,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
>> }
>>
>> /**
>> - * Handle client writes to the exported image. @buf has the data to be written
>> - * and will be copied to a bounce buffer before yielding for the first time.
>> + * Handle client writes to the exported image. @in_place_buf has the first
>> + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
>> + * contains the rest (if any; NULL otherwise).
>> + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
>> + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
>> + * assumed to be exclusively owned and will be used as-is.
> Makes sense, although it leads to some interesting bookkeeping. (I'm
> wondering if nbd would benefit from this split-level buffering, since
> it supports reads and writes up to 32M)
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH 14/15] fuse: Implement multi-threading
2025-04-04 12:49 ` Hanna Czenczek
@ 2025-04-07 14:02 ` Stefan Hajnoczi
0 siblings, 0 replies; 59+ messages in thread
From: Stefan Hajnoczi @ 2025-04-07 14:02 UTC (permalink / raw)
To: Hanna Czenczek; +Cc: Eric Blake, qemu-block, qemu-devel, Kevin Wolf
On Fri, Apr 04, 2025 at 02:49:08PM +0200, Hanna Czenczek wrote:
> On 27.03.25 16:55, Stefan Hajnoczi wrote:
> > On Tue, Mar 25, 2025 at 05:06:54PM +0100, Hanna Czenczek wrote:
> > > FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> > > (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
> > >
> > > We can use this to implement multi-threading.
> > >
> > > Note that the interface presented here differs from the multi-queue
> > > interface of virtio-blk: The latter maps virtqueues to iothreads, which
> > > allows processing multiple virtqueues in a single iothread. The
> > > equivalent (processing multiple FDs in a single iothread) would not make
> > > sense for FUSE because those FDs are used in a round-robin fashion by
> > > the FUSE kernel driver. Putting two of them into a single iothread will
> > > just create a bottleneck.
> > This text might be outdated. virtio-blk's new iothread-vq-mapping
> > parameter provides the "array of iothreads" mentioned below and a way to
> > assign virtqueues to those IOThreads.
>
> Ah, yes. The difference is still that with FUSE, there is no such
> assignment, because it wouldn’t make sense. But I can change s/maps
> virtqueues/allows mapping virtqueues/, and s/differs from/is only a subset
> of/, if that’s alright.
Sure, thanks!
> > > Therefore, all we need is an array of iothreads, and we will create one
> > > "queue" (FD) per thread.
> > >
> > > These are the benchmark results when using four threads (compared to a
> > > single thread); note that fio still only uses a single job, but
> > > performance can still be improved because of said round-robin usage for
> > > the queues. (Not in the sync case, though, in which case I guess it
> > > just adds overhead.)
> > Interesting. FUSE-over-io_uring seems to be different from
> > FUSE_DEV_IOC_CLONE here. It doesn't do round-robin. It uses CPU affinity
> > instead, handing requests to the io_uring context associated with the
> > current CPU when possible.
>
> Do you think that should have implications for the QAPI interface?
It would be helpful to document the behavior so users know when
round-robin or CPU affinity is used, but the parameter itself would be
unchanged: an array of IOThreads.
>
> [...]
>
> > > qapi/block-export.json | 8 +-
> > > block/export/fuse.c | 214 +++++++++++++++++++++++++++++++++--------
> > > 2 files changed, 179 insertions(+), 43 deletions(-)
> > >
> > > diff --git a/qapi/block-export.json b/qapi/block-export.json
> > > index c783e01a53..0bdd5992eb 100644
> > > --- a/qapi/block-export.json
> > > +++ b/qapi/block-export.json
> > > @@ -179,12 +179,18 @@
> > > # mount the export with allow_other, and if that fails, try again
> > > # without. (since 6.1; default: auto)
> > > #
> > > +# @iothreads: Enables multi-threading: Handle requests in each of the
> > > +# given iothreads (instead of the block device's iothread, or the
> > > +# export's "main" iothread). For this, the FUSE FD is duplicated so
> > > +# there is one FD per iothread. (since 10.1)
> > This option isn't FUSE-specific but FUSE is the first export type to
> > support it. Please add it to BlockExportOptions instead and refuse
> > export creation when the export type only supports 1 IOThread.
>
> Makes sense. I’ll try to go with what Kevin suggested, i.e. have @iothread
> be an alternate type.
>
> Hanna
>
> >
> > Eric: Are you interested in implementing support for multiple IOThreads
> > in the NBD export? I remember some time ago we talked about NBD
> > multi-conn support, although maybe that was for the client rather than
> > the server.
>
^ permalink raw reply [flat|nested] 59+ messages in thread
end of thread, other threads:[~2025-04-07 14:04 UTC | newest]
Thread overview: 59+ messages
2025-03-25 16:05 [PATCH 00/15] export/fuse: Use coroutines and multi-threading Hanna Czenczek
2025-03-25 16:06 ` [PATCH 01/15] fuse: Copy write buffer content before polling Hanna Czenczek
2025-03-27 14:47 ` Stefan Hajnoczi
2025-04-04 11:17 ` Hanna Czenczek
2025-04-01 13:44 ` Eric Blake
2025-04-04 11:18 ` Hanna Czenczek
2025-03-25 16:06 ` [PATCH 02/15] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
2025-03-26 5:47 ` Markus Armbruster
2025-03-26 9:49 ` Hanna Czenczek
2025-03-27 14:51 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 03/15] fuse: Remove superfluous empty line Hanna Czenczek
2025-03-27 14:53 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 04/15] fuse: Explicitly set inode ID to 1 Hanna Czenczek
2025-03-27 14:54 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 05/15] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
2025-03-27 14:55 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 06/15] fuse: Fix mount options Hanna Czenczek
2025-03-27 14:58 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 07/15] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
2025-03-27 15:09 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 08/15] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
2025-03-27 15:12 ` Stefan Hajnoczi
2025-04-01 13:55 ` Eric Blake
2025-04-04 11:24 ` Hanna Czenczek
2025-03-25 16:06 ` [PATCH 09/15] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
2025-03-27 15:13 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 10/15] fuse: Add halted flag Hanna Czenczek
2025-03-27 15:15 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 11/15] fuse: Manually process requests (without libfuse) Hanna Czenczek
2025-03-27 15:35 ` Stefan Hajnoczi
2025-04-04 12:36 ` Hanna Czenczek
2025-04-01 14:35 ` Eric Blake
2025-04-04 11:30 ` Hanna Czenczek
2025-04-04 11:42 ` Hanna Czenczek
2025-03-25 16:06 ` [PATCH 12/15] fuse: Reduce max read size Hanna Czenczek
2025-03-27 15:35 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 13/15] fuse: Process requests in coroutines Hanna Czenczek
2025-03-27 15:38 ` Stefan Hajnoczi
2025-03-25 16:06 ` [PATCH 14/15] fuse: Implement multi-threading Hanna Czenczek
2025-03-26 5:38 ` Markus Armbruster
2025-03-26 9:55 ` Hanna Czenczek
2025-03-26 11:41 ` Markus Armbruster
2025-03-26 13:56 ` Hanna Czenczek
2025-03-27 12:18 ` Markus Armbruster via
2025-03-27 13:45 ` Hanna Czenczek
2025-04-01 12:05 ` Kevin Wolf
2025-04-01 20:31 ` Eric Blake
2025-04-04 12:45 ` Hanna Czenczek
2025-03-27 15:55 ` Stefan Hajnoczi
2025-04-01 20:36 ` Eric Blake
2025-04-02 13:20 ` Stefan Hajnoczi
2025-04-03 17:59 ` Eric Blake
2025-04-04 12:49 ` Hanna Czenczek
2025-04-07 14:02 ` Stefan Hajnoczi
2025-04-01 14:58 ` Eric Blake
2025-03-25 16:06 ` [PATCH 15/15] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
2025-03-27 15:59 ` Stefan Hajnoczi
2025-04-01 20:24 ` Eric Blake
2025-04-04 13:04 ` Hanna Czenczek