* [PATCH v2 01/21] fuse: Copy write buffer content before polling
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-09 14:45 ` Stefan Hajnoczi
2025-06-04 13:27 ` [PATCH v2 02/21] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
` (20 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
aio_poll() in I/O functions can lead to nested read_from_fuse_export()
calls, overwriting the request buffer's content. The only function
affected by this is fuse_write(), which therefore must use a bounce
buffer or corruption may occur.
Note that in addition we do not know whether libfuse-internal structures
can cope with this nesting, and even if we did, we probably cannot rely
on it in the future. This is the main reason why we want to remove
libfuse from the I/O path.
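For illustration, the problematic call chain looks roughly like this
(a simplified sketch based on the description above, not a literal
trace):

  read_from_fuse_export()
    fuse_session_process_buf(&exp->fuse_buf)
      fuse_write(req, ..., buf, ...)       /* buf points into exp->fuse_buf */
        blk_pwrite(exp->common.blk, offset, size, buf, 0)
          aio_poll()                       /* nested event loop iteration */
            read_from_fuse_export()        /* nested call */
              fuse_session_receive_buf()   /* overwrites exp->fuse_buf, i.e.
                                              the data blk_pwrite() is still
                                              reading from */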
I do not have a good reproducer for this other than:
$ dd if=/dev/urandom of=image bs=1M count=4096
$ dd if=/dev/zero of=copy bs=1M count=4096
$ touch fuse-export
$ qemu-storage-daemon \
--blockdev file,node-name=file,filename=copy \
--export \
fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
&
Other shell:
$ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
$ killall -SIGINT qemu-storage-daemon
$ qemu-img compare image copy
Content mismatch at offset 0!
(The -t none in qemu-img convert is important.)
I tried reproducing this with throttle and small aio_write requests from
another qemu-io instance, but for some reason all requests are perfectly
serialized then.
I think in theory we should get parallel writes only if we set
fi->parallel_direct_writes in fuse_open(). In fact, I can confirm that
if we do that, that throttle-based reproducer works (i.e. does get
parallel (nested) write requests). I have no idea why we still get
parallel requests with qemu-img convert anyway.
Also, a later patch in this series will set fi->parallel_direct_writes
and note that it makes basically no difference when running fio on the
current libfuse-based version of our code. It does make a difference
without libfuse. So something quite fishy is going on.
I will try to investigate further what the root cause is, but I think
for now let's assume that calling blk_pwrite() can invalidate the buffer
contents through nested polling.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 465cc9891d..b967e88d2b 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -301,6 +301,12 @@ static void read_from_fuse_export(void *opaque)
goto out;
}
+ /*
+ * Note that aio_poll() in any request-processing function can lead to a
+ * nested read_from_fuse_export() call, which will overwrite the contents of
+ * exp->fuse_buf. Anything that takes a buffer needs to take care that the
+ * content is copied before potentially polling via aio_poll().
+ */
fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
out:
@@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
size_t size, off_t offset, struct fuse_file_info *fi)
{
FuseExport *exp = fuse_req_userdata(req);
+ void *copied;
int64_t length;
int ret;
@@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
return;
}
+ /*
+ * Heed the note on read_from_fuse_export(): If we call aio_poll() (which
+ * any blk_*() I/O function may do), read_from_fuse_export() may be nested,
+ * overwriting the request buffer content. Therefore, we must copy it here.
+ */
+ copied = blk_blockalign(exp->common.blk, size);
+ memcpy(copied, buf, size);
+
/**
* Clients will expect short writes at EOF, so we have to limit
* offset+size to the image length.
@@ -645,7 +660,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
length = blk_getlength(exp->common.blk);
if (length < 0) {
fuse_reply_err(req, -length);
- return;
+ goto free_buffer;
}
if (offset + size > length) {
@@ -653,19 +668,22 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
fuse_reply_err(req, -ret);
- return;
+ goto free_buffer;
}
} else {
size = length - offset;
}
}
- ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+ ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
if (ret >= 0) {
fuse_reply_write(req, size);
} else {
fuse_reply_err(req, -ret);
}
+
+free_buffer:
+ qemu_vfree(copied);
}
/**
--
2.49.0
* Re: [PATCH v2 01/21] fuse: Copy write buffer content before polling
2025-06-04 13:27 ` [PATCH v2 01/21] fuse: Copy write buffer content before polling Hanna Czenczek
@ 2025-06-09 14:45 ` Stefan Hajnoczi
0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 14:45 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:27:53PM +0200, Hanna Czenczek wrote:
> aio_poll() in I/O functions can lead to nested read_from_fuse_export()
> calls, overwriting the request buffer's content. The only function
> affected by this is fuse_write(), which therefore must use a bounce
> buffer or corruption may occur.
>
> Note that in addition we do not know whether libfuse-internal structures
> can cope with this nesting, and even if we did, we probably cannot rely
> on it in the future. This is the main reason why we want to remove
> libfuse from the I/O path.
>
> I do not have a good reproducer for this other than:
>
> $ dd if=/dev/urandom of=image bs=1M count=4096
> $ dd if=/dev/zero of=copy bs=1M count=4096
> $ touch fuse-export
> $ qemu-storage-daemon \
> --blockdev file,node-name=file,filename=copy \
> --export \
> fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
> &
>
> Other shell:
> $ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
> $ killall -SIGINT qemu-storage-daemon
> $ qemu-img compare image copy
> Content mismatch at offset 0!
>
> (The -t none in qemu-img convert is important.)
>
> I tried reproducing this with throttle and small aio_write requests from
> another qemu-io instance, but for some reason all requests are perfectly
> serialized then.
>
> I think in theory we should get parallel writes only if we set
> fi->parallel_direct_writes in fuse_open(). In fact, I can confirm that
> if we do that, that throttle-based reproducer works (i.e. does get
> parallel (nested) write requests). I have no idea why we still get
> parallel requests with qemu-img convert anyway.
>
> Also, a later patch in this series will set fi->parallel_direct_writes
> and note that it makes basically no difference when running fio on the
> current libfuse-based version of our code. It does make a difference
> without libfuse. So something quite fishy is going on.
>
> I will try to investigate further what the root cause is, but I think
> for now let's assume that calling blk_pwrite() can invalidate the buffer
> contents through nested polling.
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* [PATCH v2 02/21] fuse: Ensure init clean-up even with error_fatal
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 01/21] fuse: Copy write buffer content before polling Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 03/21] fuse: Remove superfluous empty line Hanna Czenczek
` (19 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
When exports are created on the command line (with the storage daemon),
errp is going to point to error_fatal. Without ERRP_GUARD, we would
exit immediately when *errp is set, i.e. skip the clean-up code under
the `fail` label. Use ERRP_GUARD so we always run that code.
As far as I know, this has no actual impact right now[1], but it is
still better to make this right.
[1] Not cleaning up the mount point is the only thing I can imagine
would be problematic, but that is the last thing we attempt, so if
it fails, it will clean itself up.
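For reference, a rough sketch of the general ERRP_GUARD() pattern this
relies on (simplified, not the actual export code; the helper names are
made up for illustration):

  static int create_something(Error **errp)
  {
      ERRP_GUARD();  /* errp now points to a local Error *; if the caller
                      * passed &error_fatal, the fatal exit is deferred
                      * until this function returns, i.e. after clean-up */
      int ret;

      ret = do_first_step(errp);  /* may call error_setg(errp, ...) */
      if (ret < 0) {
          goto fail;  /* without ERRP_GUARD(), error_setg() on &error_fatal
                       * would have exited before we ever get here */
      }
      return 0;

  fail:
      clean_up();
      return ret;
  }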
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index b967e88d2b..b224ce591d 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -119,6 +119,7 @@ static int fuse_export_create(BlockExport *blk_exp,
BlockExportOptions *blk_exp_args,
Error **errp)
{
+ ERRP_GUARD(); /* ensure clean-up even with error_fatal */
FuseExport *exp = container_of(blk_exp, FuseExport, common);
BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
int ret;
--
2.49.0
* [PATCH v2 03/21] fuse: Remove superfluous empty line
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 01/21] fuse: Copy write buffer content before polling Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 02/21] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 04/21] fuse: Explicitly set inode ID to 1 Hanna Czenczek
` (18 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index b224ce591d..a93316e1f4 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -464,7 +464,6 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
}
if (add_resize_perm) {
-
if (!qemu_in_main_thread()) {
/* Changing permissions like below only works in the main thread */
return -EPERM;
--
2.49.0
* [PATCH v2 04/21] fuse: Explicitly set inode ID to 1
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (2 preceding siblings ...)
2025-06-04 13:27 ` [PATCH v2 03/21] fuse: Remove superfluous empty line Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 05/21] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
` (17 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Setting .st_ino to the FUSE inode ID is kind of arbitrary. While in
practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because
we only have the root inode, that is not obvious in fuse_getattr().
Just explicitly set it to 1 (i.e. no functional change).
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index a93316e1f4..60d68d8fdd 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -432,7 +432,7 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
}
statbuf = (struct stat) {
- .st_ino = inode,
+ .st_ino = 1,
.st_mode = exp->st_mode,
.st_nlink = 1,
.st_uid = exp->st_uid,
--
2.49.0
* [PATCH v2 05/21] fuse: Change setup_... to mount_fuse_export()
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (3 preceding siblings ...)
2025-06-04 13:27 ` [PATCH v2 04/21] fuse: Explicitly set inode ID to 1 Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 06/21] fuse: Fix mount options Hanna Czenczek
` (16 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
There is no clear separation between what should go into
setup_fuse_export() and what should stay in fuse_export_create().
Make it clear that setup_fuse_export() is for mounting only. Rename it,
and move everything that has nothing to do with mounting up into
fuse_export_create().
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 49 ++++++++++++++++++++-------------------------
1 file changed, 22 insertions(+), 27 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 60d68d8fdd..01a5716bdd 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -72,8 +72,7 @@ static void fuse_export_delete(BlockExport *exp);
static void init_exports_table(void);
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
- bool allow_other, Error **errp);
+static int mount_fuse_export(FuseExport *exp, Error **errp);
static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
@@ -193,23 +192,32 @@ static int fuse_export_create(BlockExport *blk_exp,
exp->st_gid = getgid();
if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
- /* Ignore errors on our first attempt */
- ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
- exp->allow_other = ret == 0;
+ /* Try allow_other == true first, ignore errors */
+ exp->allow_other = true;
+ ret = mount_fuse_export(exp, NULL);
if (ret < 0) {
- ret = setup_fuse_export(exp, args->mountpoint, false, errp);
+ exp->allow_other = false;
+ ret = mount_fuse_export(exp, errp);
}
} else {
exp->allow_other = args->allow_other == FUSE_EXPORT_ALLOW_OTHER_ON;
- ret = setup_fuse_export(exp, args->mountpoint, exp->allow_other, errp);
+ ret = mount_fuse_export(exp, errp);
}
if (ret < 0) {
goto fail;
}
+ g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
+
+ aio_set_fd_handler(exp->common.ctx,
+ fuse_session_fd(exp->fuse_session),
+ read_from_fuse_export, NULL, NULL, NULL, exp);
+ exp->fd_handler_set_up = true;
+
return 0;
fail:
+ fuse_export_shutdown(blk_exp);
fuse_export_delete(blk_exp);
return ret;
}
@@ -227,10 +235,10 @@ static void init_exports_table(void)
}
/**
- * Create exp->fuse_session and mount it.
+ * Create exp->fuse_session and mount it. Expects exp->mountpoint,
+ * exp->writable, and exp->allow_other to be set as intended for the mount.
*/
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
- bool allow_other, Error **errp)
+static int mount_fuse_export(FuseExport *exp, Error **errp)
{
const char *fuse_argv[4];
char *mount_opts;
@@ -243,7 +251,7 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
*/
mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
FUSE_MAX_BOUNCE_BYTES,
- allow_other ? ",allow_other" : "");
+ exp->allow_other ? ",allow_other" : "");
fuse_argv[0] = ""; /* Dummy program name */
fuse_argv[1] = "-o";
@@ -256,30 +264,17 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
g_free(mount_opts);
if (!exp->fuse_session) {
error_setg(errp, "Failed to set up FUSE session");
- ret = -EIO;
- goto fail;
+ return -EIO;
}
- ret = fuse_session_mount(exp->fuse_session, mountpoint);
+ ret = fuse_session_mount(exp->fuse_session, exp->mountpoint);
if (ret < 0) {
error_setg(errp, "Failed to mount FUSE session to export");
- ret = -EIO;
- goto fail;
+ return -EIO;
}
exp->mounted = true;
- g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
-
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
- exp->fd_handler_set_up = true;
-
return 0;
-
-fail:
- fuse_export_shutdown(&exp->common);
- return ret;
}
/**
--
2.49.0
* [PATCH v2 06/21] fuse: Fix mount options
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (4 preceding siblings ...)
2025-06-04 13:27 ` [PATCH v2 05/21] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-04 13:27 ` [PATCH v2 07/21] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
` (15 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Now that I have actually looked into how mounting with libfuse works[1],
I know that the FUSE mount options are not exactly standard mount system
call options. Specifically:
- We should add "nosuid,nodev,noatime" because that is going to be
translated into the respective MS_ mount flags; and those flags make
sense for us.
- We can set rw/ro to make the mount writable or not. It makes sense to
set this flag to produce a better error message for read-only exports
(EROFS instead of EACCES).
This changes behavior as can be seen in iotest 308: It is no longer
possible to modify metadata of read-only exports.
In addition, we can note in the code comment that the mount() system
call for FUSE actually expects some more parameters, which we can omit
because fusermount3 (i.e. libfuse) will figure them out by itself:
- fd: /dev/fuse fd
- rootmode: Inode mode of the root node
- user_id/group_id: Mounter's UID/GID
[1] It invokes fusermount3, an SUID libfuse helper program, which parses
and processes some mount options before actually invoking the
mount() system call.
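For illustration, assuming a read-only export with allow_other enabled
and the current 64 MiB max_read, the option string built here would look
roughly like (illustrative, not captured from an actual run):

  ro,nosuid,nodev,noatime,max_read=67108864,default_permissions,allow_other

fusermount3 translates rw/ro/nosuid/nodev/noatime into the corresponding
MS_* mount flags and passes the remaining options, plus the fd, rootmode,
user_id, and group_id parameters it determines itself, in the mount()
data string.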
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 14 +++++++++++---
tests/qemu-iotests/308 | 4 ++--
tests/qemu-iotests/308.out | 3 ++-
3 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 01a5716bdd..9d110ce949 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -246,10 +246,18 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
int ret;
/*
- * max_read needs to match what fuse_init() sets.
- * max_write need not be supplied.
+ * Note that these mount options differ from what we would pass to a direct
+ * mount() call:
+ * - nosuid, nodev, and noatime are not understood by the kernel; libfuse
+ * uses those options to construct the mount flags (MS_*)
+ * - The FUSE kernel driver requires additional options (fd, rootmode,
+ * user_id, group_id); these will be set by libfuse.
+ * Note that max_read is set here, while max_write is set via the FUSE INIT
+ * operation.
*/
- mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
+ mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
+ "default_permissions%s",
+ exp->writable ? "rw" : "ro",
FUSE_MAX_BOUNCE_BYTES,
exp->allow_other ? ",allow_other" : "");
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 6eced3aefb..033d5cbe22 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -178,7 +178,7 @@ stat -c 'Permissions pre-chmod: %a' "$EXT_MP"
chmod u+w "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
stat -c 'Permissions post-+w: %a' "$EXT_MP"
-# But that we can set, say, +x (if we are so inclined)
+# Same for other flags, like, say +x
chmod u+x "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
stat -c 'Permissions post-+x: %a' "$EXT_MP"
@@ -236,7 +236,7 @@ output=$($QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" 2>&1 \
# Expected reference output: Opening the file fails because it has no
# write permission
-reference="Could not open 'TEST_DIR/t.IMGFMT': Permission denied"
+reference="Could not open 'TEST_DIR/t.IMGFMT': Read-only file system"
if echo "$output" | grep -q "$reference"; then
echo "Writing to read-only export failed: OK"
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index e5e233691d..aa96faab6d 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -53,7 +53,8 @@ Images are identical.
Permissions pre-chmod: 400
chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
Permissions post-+w: 400
-Permissions post-+x: 500
+chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
+Permissions post-+x: 400
=== Mount over existing file ===
{'execute': 'block-export-add',
--
2.49.0
* [PATCH v2 07/21] fuse: Set direct_io and parallel_direct_writes
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (5 preceding siblings ...)
2025-06-04 13:27 ` [PATCH v2 06/21] fuse: Fix mount options Hanna Czenczek
@ 2025-06-04 13:27 ` Hanna Czenczek
2025-06-04 13:28 ` [PATCH v2 08/21] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
` (14 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:27 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
In fuse_open(), set these flags:
- direct_io: We probably actually don't want to have the host page cache
be used for our exports. QEMU block exports are supposed to represent
the image as-is (and thus potentially changing).
This causes a change in iotest 308's reference output.
- parallel_direct_writes: We can (now) cope with parallel writes, so we
should set this flag. For some reason, it doesn't seem to make an
actual performance difference with libfuse, but it does make a
difference without it, so let's set it.
(See "fuse: Copy write buffer content before polling" for further
discussion.)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 2 ++
tests/qemu-iotests/308.out | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 9d110ce949..e1134a27d6 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -576,6 +576,8 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
static void fuse_open(fuse_req_t req, fuse_ino_t inode,
struct fuse_file_info *fi)
{
+ fi->direct_io = true;
+ fi->parallel_direct_writes = true;
fuse_reply_open(req, fi);
}
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index aa96faab6d..2d7a38d63d 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -131,7 +131,7 @@ wrote 65536/65536 bytes at offset 1048576
--- Try growing non-growable export ---
(OK: Lengths of export and original are the same)
-dd: error writing 'TEST_DIR/t.IMGFMT.fuse': Input/output error
+dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
1+0 records in
0+0 records out
--
2.49.0
* [PATCH v2 08/21] fuse: Introduce fuse_{at,de}tach_handlers()
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (6 preceding siblings ...)
2025-06-04 13:27 ` [PATCH v2 07/21] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:28 ` [PATCH v2 09/21] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
` (13 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Pull setting up and tearing down the AIO context handlers into two
dedicated functions.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index e1134a27d6..15ec7a5c05 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
-static void fuse_export_drained_begin(void *opaque)
+static void fuse_attach_handlers(FuseExport *exp)
{
- FuseExport *exp = opaque;
+ aio_set_fd_handler(exp->common.ctx,
+ fuse_session_fd(exp->fuse_session),
+ read_from_fuse_export, NULL, NULL, NULL, exp);
+ exp->fd_handler_set_up = true;
+}
+static void fuse_detach_handlers(FuseExport *exp)
+{
aio_set_fd_handler(exp->common.ctx,
fuse_session_fd(exp->fuse_session),
NULL, NULL, NULL, NULL, NULL);
exp->fd_handler_set_up = false;
}
+static void fuse_export_drained_begin(void *opaque)
+{
+ fuse_detach_handlers(opaque);
+}
+
static void fuse_export_drained_end(void *opaque)
{
FuseExport *exp = opaque;
/* Refresh AioContext in case it changed */
exp->common.ctx = blk_get_aio_context(exp->common.blk);
-
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
- exp->fd_handler_set_up = true;
+ fuse_attach_handlers(exp);
}
static bool fuse_export_drained_poll(void *opaque)
@@ -209,11 +216,7 @@ static int fuse_export_create(BlockExport *blk_exp,
g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
- exp->fd_handler_set_up = true;
-
+ fuse_attach_handlers(exp);
return 0;
fail:
@@ -329,10 +332,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
fuse_session_exit(exp->fuse_session);
if (exp->fd_handler_set_up) {
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- NULL, NULL, NULL, NULL, NULL);
- exp->fd_handler_set_up = false;
+ fuse_detach_handlers(exp);
}
}
--
2.49.0
* [PATCH v2 09/21] fuse: Introduce fuse_{inc,dec}_in_flight()
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (7 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 08/21] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:28 ` [PATCH v2 10/21] fuse: Add halted flag Hanna Czenczek
` (12 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
This is how vduse-blk.c does it, and it does seem better to have
dedicated functions for it.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 15ec7a5c05..bcbeaf92f4 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,6 +78,25 @@ static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
+static void fuse_inc_in_flight(FuseExport *exp)
+{
+ if (qatomic_fetch_inc(&exp->in_flight) == 0) {
+ /* Prevent export from being deleted */
+ blk_exp_ref(&exp->common);
+ }
+}
+
+static void fuse_dec_in_flight(FuseExport *exp)
+{
+ if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+ /* Wake AIO_WAIT_WHILE() */
+ aio_wait_kick();
+
+ /* Now the export can be deleted */
+ blk_exp_unref(&exp->common);
+ }
+}
+
static void fuse_attach_handlers(FuseExport *exp)
{
aio_set_fd_handler(exp->common.ctx,
@@ -297,9 +316,7 @@ static void read_from_fuse_export(void *opaque)
FuseExport *exp = opaque;
int ret;
- blk_exp_ref(&exp->common);
-
- qatomic_inc(&exp->in_flight);
+ fuse_inc_in_flight(exp);
do {
ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
@@ -317,11 +334,7 @@ static void read_from_fuse_export(void *opaque)
fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
out:
- if (qatomic_fetch_dec(&exp->in_flight) == 1) {
- aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
- }
-
- blk_exp_unref(&exp->common);
+ fuse_dec_in_flight(exp);
}
static void fuse_export_shutdown(BlockExport *blk_exp)
--
2.49.0
* [PATCH v2 10/21] fuse: Add halted flag
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (8 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 09/21] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:28 ` [PATCH v2 11/21] fuse: Rename length to blk_len in fuse_write() Hanna Czenczek
` (11 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
This is a flag that we will want when processing FUSE requests
ourselves: When the kernel sends us e.g. a truncated request (i.e. we
receive less data than the request's indicated length), we cannot rely
on subsequent data to be valid. Then, we are going to set this flag,
halting all FUSE request processing.
We plan to only use this flag in cases that would effectively be kernel
bugs.
(Right now, the flag is unused because libfuse still does our request
processing.)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index bcbeaf92f4..044fbbf1fe 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -53,6 +53,13 @@ typedef struct FuseExport {
unsigned int in_flight; /* atomic */
bool mounted, fd_handler_set_up;
+ /*
+ * Set when there was an unrecoverable error and no requests should be read
+ * from the device anymore (basically only in case of something we would
+ * consider a kernel bug)
+ */
+ bool halted;
+
char *mountpoint;
bool writable;
bool growable;
@@ -69,6 +76,7 @@ static const struct fuse_lowlevel_ops fuse_ops;
static void fuse_export_shutdown(BlockExport *exp);
static void fuse_export_delete(BlockExport *exp);
+static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
static void init_exports_table(void);
@@ -99,6 +107,10 @@ static void fuse_dec_in_flight(FuseExport *exp)
static void fuse_attach_handlers(FuseExport *exp)
{
+ if (exp->halted) {
+ return;
+ }
+
aio_set_fd_handler(exp->common.ctx,
fuse_session_fd(exp->fuse_session),
read_from_fuse_export, NULL, NULL, NULL, exp);
@@ -316,6 +328,10 @@ static void read_from_fuse_export(void *opaque)
FuseExport *exp = opaque;
int ret;
+ if (unlikely(exp->halted)) {
+ return;
+ }
+
fuse_inc_in_flight(exp);
do {
@@ -374,6 +390,20 @@ static void fuse_export_delete(BlockExport *blk_exp)
g_free(exp->mountpoint);
}
+/**
+ * Halt the export: Detach FD handlers, and set exp->halted to true, preventing
+ * fuse_attach_handlers() from re-attaching them, therefore stopping all further
+ * request processing.
+ *
+ * Call this function when an unrecoverable error happens that makes processing
+ * all future requests unreliable.
+ */
+static void fuse_export_halt(FuseExport *exp)
+{
+ exp->halted = true;
+ fuse_detach_handlers(exp);
+}
+
/**
* Check whether @path points to a regular file. If not, put an
* appropriate message into *errp.
--
2.49.0
* [PATCH v2 11/21] fuse: Rename length to blk_len in fuse_write()
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (9 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 10/21] fuse: Add halted flag Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-09 14:48 ` Stefan Hajnoczi
2025-06-04 13:28 ` [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
` (10 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
The term "length" is ambiguous, use "blk_len" instead to be clear.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 044fbbf1fe..fd7887889c 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -679,7 +679,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
{
FuseExport *exp = fuse_req_userdata(req);
void *copied;
- int64_t length;
+ int64_t blk_len;
int ret;
/* Limited by max_write, should not happen */
@@ -705,13 +705,13 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
* Clients will expect short writes at EOF, so we have to limit
* offset+size to the image length.
*/
- length = blk_getlength(exp->common.blk);
- if (length < 0) {
- fuse_reply_err(req, -length);
+ blk_len = blk_getlength(exp->common.blk);
+ if (blk_len < 0) {
+ fuse_reply_err(req, -blk_len);
goto free_buffer;
}
- if (offset + size > length) {
+ if (offset + size > blk_len) {
if (exp->growable) {
ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
@@ -719,7 +719,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
goto free_buffer;
}
} else {
- size = length - offset;
+ size = blk_len - offset;
}
}
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 11/21] fuse: Rename length to blk_len in fuse_write()
2025-06-04 13:28 ` [PATCH v2 11/21] fuse: Rename length to blk_len in fuse_write() Hanna Czenczek
@ 2025-06-09 14:48 ` Stefan Hajnoczi
0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 14:48 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:28:03PM +0200, Hanna Czenczek wrote:
> The term "length" is ambiguous, use "blk_len" instead to be clear.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (10 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 11/21] fuse: Rename length to blk_len in fuse_write() Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 15:18 ` Eric Blake
2025-06-09 15:03 ` Stefan Hajnoczi
2025-06-04 13:28 ` [PATCH v2 13/21] fuse: Manually process requests (without libfuse) Hanna Czenczek
` (9 subsequent siblings)
21 siblings, 2 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Move file-posix's helper to add a flag (or a set of flags) to an FD's
existing set of flags into osdep.c for other places to use.
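A minimal usage sketch (mirroring how a later patch in this series uses
the helper to make the /dev/fuse FD non-blocking; error handling
abbreviated):

  ret = qemu_fcntl_addfl(fd, O_NONBLOCK);
  if (ret < 0) {
      error_setg_errno(errp, -ret, "Failed to make FD non-blocking");
      return ret;
  }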
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
include/qemu/osdep.h | 1 +
block/file-posix.c | 17 +----------------
util/osdep.c | 18 ++++++++++++++++++
3 files changed, 20 insertions(+), 16 deletions(-)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 96fe51bc39..49b729edc1 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -774,6 +774,7 @@ static inline void qemu_reset_optind(void)
}
int qemu_fdatasync(int fd);
+int qemu_fcntl_addfl(int fd, int flag);
/**
* qemu_close_all_open_fd:
diff --git a/block/file-posix.c b/block/file-posix.c
index 9b5f08ccb2..045e94d54d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1047,21 +1047,6 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
return ret;
}
-/* Sets a specific flag */
-static int fcntl_setfl(int fd, int flag)
-{
- int flags;
-
- flags = fcntl(fd, F_GETFL);
- if (flags == -1) {
- return -errno;
- }
- if (fcntl(fd, F_SETFL, flags | flag) == -1) {
- return -errno;
- }
- return 0;
-}
-
static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
int *open_flags, uint64_t perm, Error **errp)
{
@@ -1100,7 +1085,7 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
/* dup the original fd */
fd = qemu_dup(s->fd);
if (fd >= 0) {
- ret = fcntl_setfl(fd, *open_flags);
+ ret = qemu_fcntl_addfl(fd, *open_flags);
if (ret) {
qemu_close(fd);
fd = -1;
diff --git a/util/osdep.c b/util/osdep.c
index 770369831b..ce5c6a7f59 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -613,3 +613,21 @@ int qemu_fdatasync(int fd)
return fsync(fd);
#endif
}
+
+/**
+ * Set the given flag(s) (fcntl GETFL/SETFL) on the given FD, while retaining
+ * other flags.
+ */
+int qemu_fcntl_addfl(int fd, int flag)
+{
+ int flags;
+
+ flags = fcntl(fd, F_GETFL);
+ if (flags == -1) {
+ return -errno;
+ }
+ if (fcntl(fd, F_SETFL, flags | flag) == -1) {
+ return -errno;
+ }
+ return 0;
+}
--
2.49.0
* Re: [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c
2025-06-04 13:28 ` [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
@ 2025-06-04 15:18 ` Eric Blake
2025-06-09 15:03 ` Stefan Hajnoczi
1 sibling, 0 replies; 40+ messages in thread
From: Eric Blake @ 2025-06-04 15:18 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:28:04PM +0200, Hanna Czenczek wrote:
> Move file-posix's helper to add a flag (or a set of flags) to an FD's
> existing set of flags into osdep.c for other places to use.
>
> Suggested-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> include/qemu/osdep.h | 1 +
> block/file-posix.c | 17 +----------------
> util/osdep.c | 18 ++++++++++++++++++
> 3 files changed, 20 insertions(+), 16 deletions(-)
>
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
* Re: [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c
2025-06-04 13:28 ` [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
2025-06-04 15:18 ` Eric Blake
@ 2025-06-09 15:03 ` Stefan Hajnoczi
2025-07-01 7:24 ` Hanna Czenczek
1 sibling, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 15:03 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:28:04PM +0200, Hanna Czenczek wrote:
> Move file-posix's helper to add a flag (or a set of flags) to an FD's
> existing set of flags into osdep.c for other places to use.
>
> Suggested-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> include/qemu/osdep.h | 1 +
> block/file-posix.c | 17 +----------------
> util/osdep.c | 18 ++++++++++++++++++
> 3 files changed, 20 insertions(+), 16 deletions(-)
I was curious if putting POSIX fcntl(2) in osdep.c would work on
Windows. It does not:
x86_64-w64-mingw32-gcc -m64 -Ilibqemuutil.a.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/glib-2.0 -I/usr/x86_64-w64-mingw32/sys-root/mingw/lib/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -fstack-protector-strong -Wempty-body -Wendif-labels -Wexpansion-to-defined -Wformat-security -Wformat-y2k -Wignored-qualifiers -Wimplicit-fallthrough=2 -Winit-self -Wmissing-format-attribute -Wmissing-prototypes -Wnested-externs -Wold-style-declaration -Wold-style-definition -Wredundant-decls -Wshadow=local -Wstrict-prototypes -Wtype-limits -Wundef -Wvla -Wwrite-strings -Wno-missing-include-dirs -Wno-psabi -Wno-shift-negative-value -iquote . -iquote /home/stefanha/qemu -iquote /home/stefanha/qemu/include -iquote /home/stefanha/qemu/host/include/x86_64 -iquote /home/stefanha/qemu/host/include/generic -iquote /home/stefanha/qemu/tcg/i386 -mms-bitfields -mms-bitfields -mcx16 -msse2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -fno-pie -no-pie -ftrivial-auto-var-init=zero -fzero-call-used-regs=used-gpr -mms-bitfields -mms-bitfields -pthread -mms-bitfields -MD -MQ libqemuutil.a.p/util_osdep.c.obj -MF libqemuutil.a.p/util_osdep.c.obj.d -o libqemuutil.a.p/util_osdep.c.obj -c ../util/osdep.c
../util/osdep.c: In function 'qemu_fcntl_addfl':
../util/osdep.c:625:13: error: implicit declaration of function 'fcntl' [-Wimplicit-function-declaration]
625 | flags = fcntl(fd, F_GETFL);
| ^~~~~
../util/osdep.c:625:13: error: nested extern declaration of 'fcntl' [-Werror=nested-externs]
../util/osdep.c:625:23: error: 'F_GETFL' undeclared (first use in this function)
625 | flags = fcntl(fd, F_GETFL);
| ^~~~~~~
../util/osdep.c:625:23: note: each undeclared identifier is reported only once for each function it appears in
../util/osdep.c:629:19: error: 'F_SETFL' undeclared (first use in this function)
629 | if (fcntl(fd, F_SETFL, flags | flag) == -1) {
| ^~~~~~~
cc1: all warnings being treated as errors
>
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 96fe51bc39..49b729edc1 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -774,6 +774,7 @@ static inline void qemu_reset_optind(void)
> }
>
> int qemu_fdatasync(int fd);
> +int qemu_fcntl_addfl(int fd, int flag);
>
> /**
> * qemu_close_all_open_fd:
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 9b5f08ccb2..045e94d54d 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1047,21 +1047,6 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
> return ret;
> }
>
> -/* Sets a specific flag */
> -static int fcntl_setfl(int fd, int flag)
> -{
> - int flags;
> -
> - flags = fcntl(fd, F_GETFL);
> - if (flags == -1) {
> - return -errno;
> - }
> - if (fcntl(fd, F_SETFL, flags | flag) == -1) {
> - return -errno;
> - }
> - return 0;
> -}
> -
> static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
> int *open_flags, uint64_t perm, Error **errp)
> {
> @@ -1100,7 +1085,7 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
> /* dup the original fd */
> fd = qemu_dup(s->fd);
> if (fd >= 0) {
> - ret = fcntl_setfl(fd, *open_flags);
> + ret = qemu_fcntl_addfl(fd, *open_flags);
> if (ret) {
> qemu_close(fd);
> fd = -1;
> diff --git a/util/osdep.c b/util/osdep.c
> index 770369831b..ce5c6a7f59 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -613,3 +613,21 @@ int qemu_fdatasync(int fd)
> return fsync(fd);
> #endif
> }
> +
> +/**
> + * Set the given flag(s) (fcntl GETFL/SETFL) on the given FD, while retaining
> + * other flags.
> + */
> +int qemu_fcntl_addfl(int fd, int flag)
> +{
> + int flags;
> +
> + flags = fcntl(fd, F_GETFL);
> + if (flags == -1) {
> + return -errno;
> + }
> + if (fcntl(fd, F_SETFL, flags | flag) == -1) {
> + return -errno;
> + }
> + return 0;
> +}
> --
> 2.49.0
>
* Re: [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c
2025-06-09 15:03 ` Stefan Hajnoczi
@ 2025-07-01 7:24 ` Hanna Czenczek
0 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-07-01 7:24 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On 09.06.25 17:03, Stefan Hajnoczi wrote:
> On Wed, Jun 04, 2025 at 03:28:04PM +0200, Hanna Czenczek wrote:
>> Move file-posix's helper to add a flag (or a set of flags) to an FD's
>> existing set of flags into osdep.c for other places to use.
>>
>> Suggested-by: Eric Blake <eblake@redhat.com>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> include/qemu/osdep.h | 1 +
>> block/file-posix.c | 17 +----------------
>> util/osdep.c | 18 ++++++++++++++++++
>> 3 files changed, 20 insertions(+), 16 deletions(-)
> I was curious if putting POSIX fcntl(2) in osdep.c would work on
> Windows. It does not:
>
> x86_64-w64-mingw32-gcc -m64 -Ilibqemuutil.a.p -I. -I.. -Iqapi -Itrace -Iui -Iui/shader -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/glib-2.0 -I/usr/x86_64-w64-mingw32/sys-root/mingw/lib/glib-2.0/include -fdiagnostics-color=auto -Wall -Winvalid-pch -Werror -std=gnu11 -O2 -g -fstack-protector-strong -Wempty-body -Wendif-labels -Wexpansion-to-defined -Wformat-security -Wformat-y2k -Wignored-qualifiers -Wimplicit-fallthrough=2 -Winit-self -Wmissing-format-attribute -Wmissing-prototypes -Wnested-externs -Wold-style-declaration -Wold-style-definition -Wredundant-decls -Wshadow=local -Wstrict-prototypes -Wtype-limits -Wundef -Wvla -Wwrite-strings -Wno-missing-include-dirs -Wno-psabi -Wno-shift-negative-value -iquote . -iquote /home/stefanha/qemu -iquote /home/stefanha/qemu/include -iquote /home/stefanha/qemu/host/include/x86_64 -iquote /home/stefanha/qemu/host/include/generic -iquote /home/stefanha/qemu/tcg/i386 -mms-bitfields -mms-bitfields -mcx16 -msse2 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -fno-strict-aliasing -fno-common -fwrapv -fno-pie -no-pie -ftrivial-auto-var-init=zero -fzero-call-used-regs=used-gpr -mms-bitfields -mms-bitfields -pthread -mms-bitfields -MD -MQ libqemuutil.a.p/util_osdep.c.obj -MF libqemuutil.a.p/util_osdep.c.obj.d -o libqemuutil.a.p/util_osdep.c.obj -c ../util/osdep.c
> ../util/osdep.c: In function 'qemu_fcntl_addfl':
> ../util/osdep.c:625:13: error: implicit declaration of function 'fcntl' [-Wimplicit-function-declaration]
> 625 | flags = fcntl(fd, F_GETFL);
> | ^~~~~
> ../util/osdep.c:625:13: error: nested extern declaration of 'fcntl' [-Werror=nested-externs]
> ../util/osdep.c:625:23: error: 'F_GETFL' undeclared (first use in this function)
> 625 | flags = fcntl(fd, F_GETFL);
> | ^~~~~~~
> ../util/osdep.c:625:23: note: each undeclared identifier is reported only once for each function it appears in
> ../util/osdep.c:629:19: error: 'F_SETFL' undeclared (first use in this function)
> 629 | if (fcntl(fd, F_SETFL, flags | flag) == -1) {
> | ^~~~~~~
> cc1: all warnings being treated as errors
Ah, thanks!
I’ll move it up into the #ifndef _WIN32 block around qemu_dup_flags()
and friends.
Hanna
* [PATCH v2 13/21] fuse: Manually process requests (without libfuse)
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (11 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 12/21] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-09 16:54 ` Stefan Hajnoczi
2025-06-04 13:28 ` [PATCH v2 14/21] fuse: Reduce max read size Hanna Czenczek
` (8 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Manually read requests from the /dev/fuse FD and process them, without
using libfuse. This allows us to safely add parallel request processing
in coroutines later, without having to worry about libfuse internals.
(Technically, we already have exactly that problem with
read_from_fuse_export()/read_from_fuse_fd() nesting.)
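Roughly, each request read from /dev/fuse looks like this (layout per
standard-headers/linux/fuse.h, simplified):

  struct fuse_in_header    /* len, opcode, unique, nodeid, ... */
  opcode-specific struct   /* e.g. struct fuse_write_in for FUSE_WRITE */
  payload                  /* e.g. the write data, up to max_write bytes */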
We will continue to use libfuse for mounting the filesystem; fusermount3
is effectively a helper program of libfuse, so it should know best how
to interact with it. (Doing it manually without libfuse, while doable,
is a bit of a pain, and it is not clear to me how stable the "protocol"
actually is.)
Take this opportunity of quite a major rewrite to update the Copyright
line with corrected information that has surfaced in the meantime.
Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
except 'sync', which are iodepth=1 and pvsync2):
file:
read:
seq aio: 78.6k ±1.3k IOPS
rand aio: 39.3k ±2.9k
seq sync: 32.5k ±0.7k
rand sync: 9.9k ±0.1k
write:
seq aio: 61.9k ±0.5k
rand aio: 61.2k ±0.6k
seq sync: 27.9k ±0.2k
rand sync: 27.6k ±0.4k
null:
read:
seq aio: 214.0k ±5.9k
rand aio: 212.7k ±4.5k
seq sync: 90.3k ±6.5k
rand sync: 89.7k ±5.1k
write:
seq aio: 203.9k ±1.5k
rand aio: 201.4k ±3.6k
seq sync: 86.1k ±6.2k
rand sync: 84.9k ±5.3k
And with this patch applied:
file:
read:
seq aio: 76.6k ±1.8k (- 3 %)
rand aio: 26.7k ±0.4k (-32 %)
seq sync: 47.7k ±1.2k (+47 %)
rand sync: 10.1k ±0.2k (+ 2 %)
write:
seq aio: 58.1k ±0.5k (- 6 %)
rand aio: 58.1k ±0.5k (- 5 %)
seq sync: 36.3k ±0.3k (+30 %)
rand sync: 36.1k ±0.4k (+31 %)
null:
read:
seq aio: 268.4k ±3.4k (+25 %)
rand aio: 265.3k ±2.1k (+25 %)
seq sync: 134.3k ±2.7k (+49 %)
rand sync: 132.4k ±1.4k (+48 %)
write:
seq aio: 275.3k ±1.7k (+35 %)
rand aio: 272.3k ±1.9k (+35 %)
seq sync: 130.7k ±1.6k (+52 %)
rand sync: 127.4k ±2.4k (+50 %)
So clearly the AIO file results are actually not good, and random reads
are indeed quite terrible. On the other hand, we can see from the sync
and null results that request handling should in theory be quicker. How
does this fit together?
I believe the bad AIO results are an artifact of the accidental parallel
request processing we have due to nested polling: Depending on how the
actual request processing is structured and how long request processing
takes, more or fewer requests will be submitted in parallel. So because
of the restructuring, I think this patch accidentally changes how many
requests end up being submitted in parallel, which decreases
performance.
(I have seen something like this before: In RSD, without having
implemented a polling mode, the debug build tended to have better
performance than the more optimized release build, because the debug
build, taking longer to submit requests, ended up processing more
requests in parallel.)
In any case, once we use coroutines throughout the code, performance
will improve again across the board.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 754 +++++++++++++++++++++++++++++++-------------
1 file changed, 535 insertions(+), 219 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index fd7887889c..926f97a885 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -1,7 +1,7 @@
/*
* Present a block device as a raw image through FUSE
*
- * Copyright (c) 2020 Max Reitz <mreitz@redhat.com>
+ * Copyright (c) 2020, 2025 Hanna Czenczek <hreitz@redhat.com>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -27,12 +27,15 @@
#include "block/qapi.h"
#include "qapi/error.h"
#include "qapi/qapi-commands-block.h"
+#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "system/block-backend.h"
#include <fuse.h>
#include <fuse_lowlevel.h>
+#include "standard-headers/linux/fuse.h"
+
#if defined(CONFIG_FALLOCATE_ZERO_RANGE)
#include <linux/falloc.h>
#endif
@@ -42,17 +45,27 @@
#endif
/* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
-
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+/* Small enough to fit in the request buffer */
+#define FUSE_MAX_WRITE_BYTES (4 * 1024)
typedef struct FuseExport {
BlockExport common;
struct fuse_session *fuse_session;
- struct fuse_buf fuse_buf;
unsigned int in_flight; /* atomic */
bool mounted, fd_handler_set_up;
+ /*
+ * The request buffer must be able to hold a full write, and/or at least
+ * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
+ */
+ char request_buf[MAX_CONST(
+ sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
+ FUSE_MAX_WRITE_BYTES,
+ FUSE_MIN_READ_BUFFER
+ )];
+
/*
* Set when there was an unrecoverable error and no requests should be read
* from the device anymore (basically only in case of something we would
@@ -60,6 +73,8 @@ typedef struct FuseExport {
*/
bool halted;
+ int fuse_fd;
+
char *mountpoint;
bool writable;
bool growable;
@@ -72,19 +87,19 @@ typedef struct FuseExport {
} FuseExport;
static GHashTable *exports;
-static const struct fuse_lowlevel_ops fuse_ops;
static void fuse_export_shutdown(BlockExport *exp);
static void fuse_export_delete(BlockExport *exp);
-static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
+static void fuse_export_halt(FuseExport *exp);
static void init_exports_table(void);
static int mount_fuse_export(FuseExport *exp, Error **errp);
-static void read_from_fuse_export(void *opaque);
static bool is_regular_file(const char *path, Error **errp);
+static void read_from_fuse_fd(void *opaque);
+static void fuse_process_request(FuseExport *exp);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -105,22 +120,26 @@ static void fuse_dec_in_flight(FuseExport *exp)
}
}
+/**
+ * Attach FUSE FD read handler.
+ */
static void fuse_attach_handlers(FuseExport *exp)
{
if (exp->halted) {
return;
}
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
- read_from_fuse_export, NULL, NULL, NULL, exp);
+ aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
+ read_from_fuse_fd, NULL, NULL, NULL, exp);
exp->fd_handler_set_up = true;
}
+/**
+ * Detach FUSE FD read handler.
+ */
static void fuse_detach_handlers(FuseExport *exp)
{
- aio_set_fd_handler(exp->common.ctx,
- fuse_session_fd(exp->fuse_session),
+ aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
NULL, NULL, NULL, NULL, NULL);
exp->fd_handler_set_up = false;
}
@@ -247,6 +266,13 @@ static int fuse_export_create(BlockExport *blk_exp,
g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
+ exp->fuse_fd = fuse_session_fd(exp->fuse_session);
+ ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
+ if (ret < 0) {
+ error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
+ goto fail;
+ }
+
fuse_attach_handlers(exp);
return 0;
@@ -292,7 +318,7 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
"default_permissions%s",
exp->writable ? "rw" : "ro",
- FUSE_MAX_BOUNCE_BYTES,
+ FUSE_MAX_READ_BYTES,
exp->allow_other ? ",allow_other" : "");
fuse_argv[0] = ""; /* Dummy program name */
@@ -301,8 +327,8 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
fuse_argv[3] = NULL;
fuse_args = (struct fuse_args)FUSE_ARGS_INIT(3, (char **)fuse_argv);
- exp->fuse_session = fuse_session_new(&fuse_args, &fuse_ops,
- sizeof(fuse_ops), exp);
+ /* We just create the session for mounting/unmounting, no need to set ops */
+ exp->fuse_session = fuse_session_new(&fuse_args, NULL, 0, NULL);
g_free(mount_opts);
if (!exp->fuse_session) {
error_setg(errp, "Failed to set up FUSE session");
@@ -320,36 +346,54 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
}
/**
- * Callback to be invoked when the FUSE session FD can be read from.
- * (This is basically the FUSE event loop.)
+ * Try to read and process a single request from the FUSE FD.
*/
-static void read_from_fuse_export(void *opaque)
+static void read_from_fuse_fd(void *opaque)
{
FuseExport *exp = opaque;
- int ret;
+ int fuse_fd = exp->fuse_fd;
+ ssize_t ret;
+ const struct fuse_in_header *in_hdr;
+
+ fuse_inc_in_flight(exp);
if (unlikely(exp->halted)) {
- return;
+ goto no_request;
}
- fuse_inc_in_flight(exp);
+ ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
+ sizeof(exp->request_buf)));
+ if (ret < 0 && errno == EAGAIN) {
+ /* No request available */
+ goto no_request;
+ } else if (unlikely(ret < 0)) {
+ error_report("Failed to read from FUSE device: %s", strerror(-ret));
+ goto no_request;
+ }
- do {
- ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
- } while (ret == -EINTR);
- if (ret < 0) {
- goto out;
+ if (unlikely(ret < sizeof(*in_hdr))) {
+ error_report("Incomplete read from FUSE device, expected at least %zu "
+ "bytes, read %zi bytes; cannot trust subsequent "
+ "requests, halting the export",
+ sizeof(*in_hdr), ret);
+ fuse_export_halt(exp);
+ goto no_request;
}
- /*
- * Note that aio_poll() in any request-processing function can lead to a
- * nested read_from_fuse_export() call, which will overwrite the contents of
- * exp->fuse_buf. Anything that takes a buffer needs to take care that the
- * content is copied before potentially polling via aio_poll().
- */
- fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
+ in_hdr = (const struct fuse_in_header *)exp->request_buf;
+ if (unlikely(ret != in_hdr->len)) {
+ error_report("Number of bytes read from FUSE device does not match "
+ "request size, expected %" PRIu32 " bytes, read %zi "
+ "bytes; cannot trust subsequent requests, halting the "
+ "export",
+ in_hdr->len, ret);
+ fuse_export_halt(exp);
+ goto no_request;
+ }
+
+ fuse_process_request(exp);
-out:
+no_request:
fuse_dec_in_flight(exp);
}
@@ -357,18 +401,14 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
{
FuseExport *exp = container_of(blk_exp, FuseExport, common);
- if (exp->fuse_session) {
- fuse_session_exit(exp->fuse_session);
-
- if (exp->fd_handler_set_up) {
- fuse_detach_handlers(exp);
- }
+ if (exp->fd_handler_set_up) {
+ fuse_detach_handlers(exp);
}
if (exp->mountpoint) {
/*
- * Safe to drop now, because we will not handle any requests
- * for this export anymore anyway.
+ * Safe to drop now, because we will not handle any requests for this
+ * export anymore anyway (at least not from the main thread).
*/
g_hash_table_remove(exports, exp->mountpoint);
}
@@ -386,7 +426,6 @@ static void fuse_export_delete(BlockExport *blk_exp)
fuse_session_destroy(exp->fuse_session);
}
- free(exp->fuse_buf.mem);
g_free(exp->mountpoint);
}
@@ -428,46 +467,57 @@ static bool is_regular_file(const char *path, Error **errp)
}
/**
- * A chance to set change some parameters supplied to FUSE_INIT.
+ * Process FUSE INIT.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_init(void *userdata, struct fuse_conn_info *conn)
+static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
+ uint32_t max_readahead, uint32_t flags)
{
- /*
- * MIN_NON_ZERO() would not be wrong here, but what we set here
- * must equal what has been passed to fuse_session_new().
- * Therefore, as long as max_read must be passed as a mount option
- * (which libfuse claims will be changed at some point), we have
- * to set max_read to a fixed value here.
- */
- conn->max_read = FUSE_MAX_BOUNCE_BYTES;
+ const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
- conn->max_write = MIN_NON_ZERO(BDRV_REQUEST_MAX_BYTES, conn->max_write);
-}
+ *out = (struct fuse_init_out) {
+ .major = FUSE_KERNEL_VERSION,
+ .minor = FUSE_KERNEL_MINOR_VERSION,
+ .max_readahead = max_readahead,
+ .max_write = FUSE_MAX_WRITE_BYTES,
+ .flags = flags & supported_flags,
+ .flags2 = 0,
-/**
- * Let clients look up files. Always return ENOENT because we only
- * care about the mountpoint itself.
- */
-static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
-{
- fuse_reply_err(req, ENOENT);
+ /* libfuse maximum: 2^16 - 1 */
+ .max_background = UINT16_MAX,
+
+ /* libfuse default: max_background * 3 / 4 */
+ .congestion_threshold = (int)UINT16_MAX * 3 / 4,
+
+ /* libfuse default: 1 */
+ .time_gran = 1,
+
+ /*
+ * probably unneeded without FUSE_MAX_PAGES, but this would be the
+ * libfuse default
+ */
+ .max_pages = DIV_ROUND_UP(FUSE_MAX_WRITE_BYTES,
+ qemu_real_host_page_size()),
+
+ /* Only needed for mappings (i.e. DAX) */
+ .map_alignment = 0,
+ };
+
+ return sizeof(*out);
}
/**
* Let clients get file attributes (i.e., stat() the file).
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
- struct fuse_file_info *fi)
+static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
{
- struct stat statbuf;
int64_t length, allocated_blocks;
time_t now = time(NULL);
- FuseExport *exp = fuse_req_userdata(req);
length = blk_getlength(exp->common.blk);
if (length < 0) {
- fuse_reply_err(req, -length);
- return;
+ return length;
}
allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
@@ -477,21 +527,24 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
}
- statbuf = (struct stat) {
- .st_ino = 1,
- .st_mode = exp->st_mode,
- .st_nlink = 1,
- .st_uid = exp->st_uid,
- .st_gid = exp->st_gid,
- .st_size = length,
- .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
- .st_blocks = allocated_blocks,
- .st_atime = now,
- .st_mtime = now,
- .st_ctime = now,
+ *out = (struct fuse_attr_out) {
+ .attr_valid = 1,
+ .attr = {
+ .ino = 1,
+ .mode = exp->st_mode,
+ .nlink = 1,
+ .uid = exp->st_uid,
+ .gid = exp->st_gid,
+ .size = length,
+ .blksize = blk_bs(exp->common.blk)->bl.request_alignment,
+ .blocks = allocated_blocks,
+ .atime = now,
+ .mtime = now,
+ .ctime = now,
+ },
};
- fuse_reply_attr(req, &statbuf, 1.);
+ return sizeof(*out);
}
static int fuse_do_truncate(const FuseExport *exp, int64_t size,
@@ -544,159 +597,151 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
* permit access: Read-only exports cannot be given +w, and exports
* without allow_other cannot be given a different UID or GID, and
* they cannot be given non-owner access.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
- int to_set, struct fuse_file_info *fi)
+static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
+ uint32_t to_set, uint64_t size, uint32_t mode,
+ uint32_t uid, uint32_t gid)
{
- FuseExport *exp = fuse_req_userdata(req);
int supported_attrs;
int ret;
- supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
+ /* SIZE and MODE are actually supported, the others can be safely ignored */
+ supported_attrs = FATTR_SIZE | FATTR_MODE |
+ FATTR_FH | FATTR_LOCKOWNER | FATTR_KILL_SUIDGID;
if (exp->allow_other) {
- supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
+ supported_attrs |= FATTR_UID | FATTR_GID;
}
if (to_set & ~supported_attrs) {
- fuse_reply_err(req, ENOTSUP);
- return;
+ return -ENOTSUP;
}
/* Do some argument checks first before committing to anything */
- if (to_set & FUSE_SET_ATTR_MODE) {
+ if (to_set & FATTR_MODE) {
/*
* Without allow_other, non-owners can never access the export, so do
* not allow setting permissions for them
*/
- if (!exp->allow_other &&
- (statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
- {
- fuse_reply_err(req, EPERM);
- return;
+ if (!exp->allow_other && (mode & (S_IRWXG | S_IRWXO)) != 0) {
+ return -EPERM;
}
/* +w for read-only exports makes no sense, disallow it */
- if (!exp->writable &&
- (statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
- {
- fuse_reply_err(req, EROFS);
- return;
+ if (!exp->writable && (mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0) {
+ return -EROFS;
}
}
- if (to_set & FUSE_SET_ATTR_SIZE) {
+ if (to_set & FATTR_SIZE) {
if (!exp->writable) {
- fuse_reply_err(req, EACCES);
- return;
+ return -EACCES;
}
- ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+ ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
}
- if (to_set & FUSE_SET_ATTR_MODE) {
+ if (to_set & FATTR_MODE) {
/* Ignore FUSE-supplied file type, only change the mode */
- exp->st_mode = (statbuf->st_mode & 07777) | S_IFREG;
+ exp->st_mode = (mode & 07777) | S_IFREG;
}
- if (to_set & FUSE_SET_ATTR_UID) {
- exp->st_uid = statbuf->st_uid;
+ if (to_set & FATTR_UID) {
+ exp->st_uid = uid;
}
- if (to_set & FUSE_SET_ATTR_GID) {
- exp->st_gid = statbuf->st_gid;
+ if (to_set & FATTR_GID) {
+ exp->st_gid = gid;
}
- fuse_getattr(req, inode, fi);
+ return fuse_getattr(exp, out);
}
/**
- * Let clients open a file (i.e., the exported image).
+ * Open an inode. We only have a single inode in our exported filesystem, so we
+ * just acknowledge the request.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_open(fuse_req_t req, fuse_ino_t inode,
- struct fuse_file_info *fi)
+static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
{
- fi->direct_io = true;
- fi->parallel_direct_writes = true;
- fuse_reply_open(req, fi);
+ *out = (struct fuse_open_out) {
+ .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
+ };
+ return sizeof(*out);
}
/**
- * Handle client reads from the exported image.
+ * Handle client reads from the exported image. Allocates *bufptr and reads
+ * data from the block device into that buffer.
+ * Returns the buffer (read) size on success, and -errno on error.
+ * After use, *bufptr must be freed via qemu_vfree().
*/
-static void fuse_read(fuse_req_t req, fuse_ino_t inode,
- size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_read(FuseExport *exp, void **bufptr,
+ uint64_t offset, uint32_t size)
{
- FuseExport *exp = fuse_req_userdata(req);
- int64_t length;
+ int64_t blk_len;
void *buf;
int ret;
/* Limited by max_read, should not happen */
- if (size > FUSE_MAX_BOUNCE_BYTES) {
- fuse_reply_err(req, EINVAL);
- return;
+ if (size > FUSE_MAX_READ_BYTES) {
+ return -EINVAL;
}
/**
* Clients will expect short reads at EOF, so we have to limit
* offset+size to the image length.
*/
- length = blk_getlength(exp->common.blk);
- if (length < 0) {
- fuse_reply_err(req, -length);
- return;
+ blk_len = blk_getlength(exp->common.blk);
+ if (blk_len < 0) {
+ return blk_len;
}
- if (offset + size > length) {
- size = length - offset;
+ if (offset + size > blk_len) {
+ size = blk_len - offset;
}
buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
if (!buf) {
- fuse_reply_err(req, ENOMEM);
- return;
+ return -ENOMEM;
}
ret = blk_pread(exp->common.blk, offset, size, buf, 0);
- if (ret >= 0) {
- fuse_reply_buf(req, buf, size);
- } else {
- fuse_reply_err(req, -ret);
+ if (ret < 0) {
+ qemu_vfree(buf);
+ return ret;
}
- qemu_vfree(buf);
+ *bufptr = buf;
+ return size;
}
/**
- * Handle client writes to the exported image.
+ * Handle client writes to the exported image. @buf has the data to be written
+ * and will be copied to a bounce buffer before polling for the first time.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
- size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
+ uint64_t offset, uint32_t size, const void *buf)
{
- FuseExport *exp = fuse_req_userdata(req);
void *copied;
int64_t blk_len;
int ret;
/* Limited by max_write, should not happen */
if (size > BDRV_REQUEST_MAX_BYTES) {
- fuse_reply_err(req, EINVAL);
- return;
+ return -EINVAL;
}
if (!exp->writable) {
- fuse_reply_err(req, EACCES);
- return;
+ return -EACCES;
}
/*
- * Heed the note on read_from_fuse_export(): If we call aio_poll() (which
- * any blk_*() I/O function may do), read_from_fuse_export() may be nested,
- * overwriting the request buffer content. Therefore, we must copy it here.
+ * Must copy to bounce buffer before calling aio_poll() (to allow nesting)
*/
copied = blk_blockalign(exp->common.blk, size);
memcpy(copied, buf, size);
@@ -707,16 +752,15 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
*/
blk_len = blk_getlength(exp->common.blk);
if (blk_len < 0) {
- fuse_reply_err(req, -blk_len);
- goto free_buffer;
+ ret = blk_len;
+ goto fail_free_buffer;
}
if (offset + size > blk_len) {
if (exp->growable) {
ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- goto free_buffer;
+ goto fail_free_buffer;
}
} else {
size = blk_len - offset;
@@ -724,36 +768,39 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
}
ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
- if (ret >= 0) {
- fuse_reply_write(req, size);
- } else {
- fuse_reply_err(req, -ret);
+ if (ret < 0) {
+ goto fail_free_buffer;
}
-free_buffer:
qemu_vfree(copied);
+
+ *out = (struct fuse_write_out) {
+ .size = size,
+ };
+ return sizeof(*out);
+
+fail_free_buffer:
+ qemu_vfree(copied);
+ return ret;
}
/**
* Let clients perform various fallocate() operations.
+ * Return 0 on success (no 'out' object), and -errno on error.
*/
-static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
- off_t offset, off_t length,
- struct fuse_file_info *fi)
+static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
+ uint32_t mode)
{
- FuseExport *exp = fuse_req_userdata(req);
int64_t blk_len;
int ret;
if (!exp->writable) {
- fuse_reply_err(req, EACCES);
- return;
+ return -EACCES;
}
blk_len = blk_getlength(exp->common.blk);
if (blk_len < 0) {
- fuse_reply_err(req, -blk_len);
- return;
+ return blk_len;
}
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -765,16 +812,14 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
if (!mode) {
/* We can only fallocate at the EOF with a truncate */
if (offset < blk_len) {
- fuse_reply_err(req, EOPNOTSUPP);
- return;
+ return -EOPNOTSUPP;
}
if (offset > blk_len) {
/* No preallocation needed here */
ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
}
@@ -784,8 +829,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
else if (mode & FALLOC_FL_PUNCH_HOLE) {
if (!(mode & FALLOC_FL_KEEP_SIZE)) {
- fuse_reply_err(req, EINVAL);
- return;
+ return -EINVAL;
}
do {
@@ -813,8 +857,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
ret = fuse_do_truncate(exp, offset + length, false,
PREALLOC_MODE_OFF);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
}
@@ -832,44 +875,38 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
ret = -EOPNOTSUPP;
}
- fuse_reply_err(req, ret < 0 ? -ret : 0);
+ return ret < 0 ? ret : 0;
}
/**
* Let clients fsync the exported image.
+ * Return 0 on success (no 'out' object), and -errno on error.
*/
-static void fuse_fsync(fuse_req_t req, fuse_ino_t inode, int datasync,
- struct fuse_file_info *fi)
+static ssize_t fuse_fsync(FuseExport *exp)
{
- FuseExport *exp = fuse_req_userdata(req);
- int ret;
-
- ret = blk_flush(exp->common.blk);
- fuse_reply_err(req, ret < 0 ? -ret : 0);
+ return blk_flush(exp->common.blk);
}
/**
* Called before an FD to the exported image is closed. (libfuse
* notes this to be a way to return last-minute errors.)
+ * Return 0 on success (no 'out' object), and -errno on error.
*/
-static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
- struct fuse_file_info *fi)
+static ssize_t fuse_flush(FuseExport *exp)
{
- fuse_fsync(req, inode, 1, fi);
+ return blk_flush(exp->common.blk);
}
#ifdef CONFIG_FUSE_LSEEK
/**
* Let clients inquire allocation status.
+ * Return the number of bytes written to *out on success, and -errno on error.
*/
-static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
- int whence, struct fuse_file_info *fi)
+static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+ uint64_t offset, uint32_t whence)
{
- FuseExport *exp = fuse_req_userdata(req);
-
if (whence != SEEK_HOLE && whence != SEEK_DATA) {
- fuse_reply_err(req, EINVAL);
- return;
+ return -EINVAL;
}
while (true) {
@@ -879,8 +916,7 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
offset, INT64_MAX, &pnum, NULL, NULL);
if (ret < 0) {
- fuse_reply_err(req, -ret);
- return;
+ return ret;
}
if (!pnum && (ret & BDRV_BLOCK_EOF)) {
@@ -897,34 +933,38 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
blk_len = blk_getlength(exp->common.blk);
if (blk_len < 0) {
- fuse_reply_err(req, -blk_len);
- return;
+ return blk_len;
}
if (offset > blk_len || whence == SEEK_DATA) {
- fuse_reply_err(req, ENXIO);
- } else {
- fuse_reply_lseek(req, offset);
+ return -ENXIO;
}
- return;
+
+ *out = (struct fuse_lseek_out) {
+ .offset = offset,
+ };
+ return sizeof(*out);
}
if (ret & BDRV_BLOCK_DATA) {
if (whence == SEEK_DATA) {
- fuse_reply_lseek(req, offset);
- return;
+ *out = (struct fuse_lseek_out) {
+ .offset = offset,
+ };
+ return sizeof(*out);
}
} else {
if (whence == SEEK_HOLE) {
- fuse_reply_lseek(req, offset);
- return;
+ *out = (struct fuse_lseek_out) {
+ .offset = offset,
+ };
+ return sizeof(*out);
}
}
/* Safety check against infinite loops */
if (!pnum) {
- fuse_reply_err(req, ENXIO);
- return;
+ return -ENXIO;
}
offset += pnum;
@@ -932,21 +972,297 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
}
#endif
-static const struct fuse_lowlevel_ops fuse_ops = {
- .init = fuse_init,
- .lookup = fuse_lookup,
- .getattr = fuse_getattr,
- .setattr = fuse_setattr,
- .open = fuse_open,
- .read = fuse_read,
- .write = fuse_write,
- .fallocate = fuse_fallocate,
- .flush = fuse_flush,
- .fsync = fuse_fsync,
+/**
+ * Write a FUSE response to the given @fd, using a single buffer consecutively
+ * containing both the response header and data: Initialize *out_hdr, and write
+ * it plus @response_data_length consecutive bytes to @fd.
+ *
+ * @fd: FUSE file descriptor
+ * @req_id: Corresponding request ID
+ * @out_hdr: Pointer to buffer that will hold the output header, and
+ * additionally already contains @response_data_length data bytes
+ * starting at *out_hdr + 1.
+ * @err: Error code (-errno, or 0 in case of success)
+ * @response_data_length: Length of data to return (following *out_hdr)
+ */
+static int fuse_write_response(int fd, uint32_t req_id,
+ struct fuse_out_header *out_hdr, int err,
+ size_t response_data_length)
+{
+ void *write_ptr = out_hdr;
+ size_t to_write = sizeof(*out_hdr) + response_data_length;
+ ssize_t ret;
+
+ *out_hdr = (struct fuse_out_header) {
+ .len = to_write,
+ .error = err,
+ .unique = req_id,
+ };
+
+ while (true) {
+ ret = RETRY_ON_EINTR(write(fd, write_ptr, to_write));
+ if (ret < 0) {
+ ret = -errno;
+ error_report("Failed to write to FUSE device: %s", strerror(-ret));
+ return ret;
+ } else {
+ to_write -= ret;
+ if (to_write > 0) {
+ write_ptr += ret;
+ } else {
+ return 0; /* success */
+ }
+ }
+ }
+}
+
+/**
+ * Write a FUSE response to the given @fd, using separate buffers for the
+ * response header and data: Initialize *out_hdr, and write it plus the data in
+ * *buf to @fd.
+ *
+ * In contrast to fuse_write_response(), this function cannot pass an error code
+ * to the client: the error field in the response header is always 0 (success).
+ *
+ * @fd: FUSE file descriptor
+ * @req_id: Corresponding request ID
+ * @out_hdr: Pointer to buffer that will hold the output header
+ * @buf: Pointer to response data
+ * @buflen: Length of response data
+ */
+static int fuse_write_buf_response(int fd, uint32_t req_id,
+ struct fuse_out_header *out_hdr,
+ const void *buf, size_t buflen)
+{
+ struct iovec iov[2] = {
+ { out_hdr, sizeof(*out_hdr) },
+ { (void *)buf, buflen },
+ };
+ struct iovec *iovp = iov;
+ unsigned iov_count = ARRAY_SIZE(iov);
+ size_t to_write = sizeof(*out_hdr) + buflen;
+ ssize_t ret;
+
+ *out_hdr = (struct fuse_out_header) {
+ .len = to_write,
+ .unique = req_id,
+ };
+
+ while (true) {
+ ret = RETRY_ON_EINTR(writev(fd, iovp, iov_count));
+ if (ret < 0) {
+ ret = -errno;
+ error_report("Failed to write to FUSE device: %s", strerror(-ret));
+ return ret;
+ } else {
+ to_write -= ret;
+ if (to_write > 0) {
+ iov_discard_front(&iovp, &iov_count, ret);
+ } else {
+ return 0; /* success */
+ }
+ }
+ }
+}
+
+/*
+ * For use in fuse_process_request():
+ * Returns a pointer to the parameter object for the given operation (inside of
+ * exp->request_buf, which is assumed to hold a fuse_in_header first).
+ * Verifies that the object is complete (exp->request_buf is large enough to
+ * hold it in one piece, and the request length includes the whole object).
+ *
+ * Note that exp->request_buf may be overwritten after polling, so the returned
+ * pointer must not be used across a function that may poll!
+ */
+#define FUSE_IN_OP_STRUCT(op_name, export) \
+ ({ \
+ const struct fuse_in_header *__in_hdr = \
+ (const struct fuse_in_header *)(export)->request_buf; \
+ const struct fuse_##op_name##_in *__in = \
+ (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
+ const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
+ uint32_t __req_len; \
+ \
+ QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
+ \
+ __req_len = __in_hdr->len; \
+ if (__req_len < __param_len) { \
+ warn_report("FUSE request truncated (%" PRIu32 " < %zu)", \
+ __req_len, __param_len); \
+ ret = -EINVAL; \
+ break; \
+ } \
+ __in; \
+ })
+
+/*
+ * For use in fuse_process_request():
+ * Returns a pointer to the return object for the given operation (inside of
+ * out_buf, which is assumed to hold a fuse_out_header first).
+ * Verifies that out_buf is large enough to hold the whole object.
+ *
+ * (out_buf should be a char[] array.)
+ */
+#define FUSE_OUT_OP_STRUCT(op_name, out_buf) \
+ ({ \
+ struct fuse_out_header *__out_hdr = \
+ (struct fuse_out_header *)(out_buf); \
+ struct fuse_##op_name##_out *__out = \
+ (struct fuse_##op_name##_out *)(__out_hdr + 1); \
+ \
+ QEMU_BUILD_BUG_ON(sizeof(*__out_hdr) + sizeof(*__out) > \
+ sizeof(out_buf)); \
+ \
+ __out; \
+ })
+
+/**
+ * Process a FUSE request, incl. writing the response.
+ *
+ * Note that polling in any request-processing function can lead to a nested
+ * read_from_fuse_fd() call, which will overwrite the contents of
+ * exp->request_buf. Anything that takes a buffer needs to take care that the
+ * content is copied before potentially polling.
+ */
+static void fuse_process_request(FuseExport *exp)
+{
+ uint32_t opcode;
+ uint64_t req_id;
+ /*
+ * Return buffer. Must be large enough to hold all return headers, but does
+ * not include space for data returned by read requests.
+ * (FUSE_OUT_OP_STRUCT() verifies at compile time that out_buf is indeed
+ * large enough.)
+ */
+ char out_buf[sizeof(struct fuse_out_header) +
+ MAX_CONST(sizeof(struct fuse_init_out),
+ MAX_CONST(sizeof(struct fuse_open_out),
+ MAX_CONST(sizeof(struct fuse_attr_out),
+ MAX_CONST(sizeof(struct fuse_write_out),
+ sizeof(struct fuse_lseek_out)))))];
+ struct fuse_out_header *out_hdr = (struct fuse_out_header *)out_buf;
+ /* For read requests: Data to be returned */
+ void *out_data_buffer = NULL;
+ ssize_t ret;
+
+ /* Limit scope to ensure pointer is no longer used after polling */
+ {
+ const struct fuse_in_header *in_hdr =
+ (const struct fuse_in_header *)exp->request_buf;
+
+ opcode = in_hdr->opcode;
+ req_id = in_hdr->unique;
+ }
+
+ switch (opcode) {
+ case FUSE_INIT: {
+ const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
+ ret = fuse_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
+ in->max_readahead, in->flags);
+ break;
+ }
+
+ case FUSE_OPEN:
+ ret = fuse_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
+ break;
+
+ case FUSE_RELEASE:
+ ret = 0;
+ break;
+
+ case FUSE_LOOKUP:
+ ret = -ENOENT; /* There is no node but the root node */
+ break;
+
+ case FUSE_GETATTR:
+ ret = fuse_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
+ break;
+
+ case FUSE_SETATTR: {
+ const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
+ ret = fuse_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
+ in->valid, in->size, in->mode, in->uid, in->gid);
+ break;
+ }
+
+ case FUSE_READ: {
+ const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
+ ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+ break;
+ }
+
+ case FUSE_WRITE: {
+ const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
+ uint32_t req_len;
+
+ req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
+ if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
+ in->size)) {
+ warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
+ req_len - sizeof(struct fuse_in_header) - sizeof(*in),
+ in->size);
+ ret = -EINVAL;
+ break;
+ }
+
+ /*
+ * read_from_fuse_fd() has checked that in_hdr->len matches the number of
+ * bytes read, which cannot exceed the max_write value we set
+ * (FUSE_MAX_WRITE_BYTES). So we know that FUSE_MAX_WRITE_BYTES >=
+ * in_hdr->len >= in->size + X, so this assertion must hold.
+ */
+ assert(in->size <= FUSE_MAX_WRITE_BYTES);
+
+ /*
+ * Passing a pointer to `in` (i.e. the request buffer) is fine because
+ * fuse_write() takes care to copy its contents before potentially
+ * polling.
+ */
+ ret = fuse_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
+ in->offset, in->size, in + 1);
+ break;
+ }
+
+ case FUSE_FALLOCATE: {
+ const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
+ ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+ break;
+ }
+
+ case FUSE_FSYNC:
+ ret = fuse_fsync(exp);
+ break;
+
+ case FUSE_FLUSH:
+ ret = fuse_flush(exp);
+ break;
+
#ifdef CONFIG_FUSE_LSEEK
- .lseek = fuse_lseek,
+ case FUSE_LSEEK: {
+ const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
+ ret = fuse_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
+ in->offset, in->whence);
+ break;
+ }
#endif
-};
+
+ default:
+ ret = -ENOSYS;
+ }
+
+ /* Ignore errors from fuse_write*(), nothing we can do anyway */
+ if (out_data_buffer) {
+ assert(ret >= 0);
+ fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
+ out_data_buffer, ret);
+ qemu_vfree(out_data_buffer);
+ } else {
+ fuse_write_response(exp->fuse_fd, req_id, out_hdr,
+ ret < 0 ? ret : 0,
+ ret < 0 ? 0 : ret);
+ }
+}
const BlockExportDriver blk_exp_fuse = {
.type = BLOCK_EXPORT_TYPE_FUSE,
--
2.49.0
* Re: [PATCH v2 13/21] fuse: Manually process requests (without libfuse)
2025-06-04 13:28 ` [PATCH v2 13/21] fuse: Manually process requests (without libfuse) Hanna Czenczek
@ 2025-06-09 16:54 ` Stefan Hajnoczi
0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 16:54 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:28:05PM +0200, Hanna Czenczek wrote:
> Manually read requests from the /dev/fuse FD and process them, without
> using libfuse. This allows us to safely add parallel request processing
> in coroutines later, without having to worry about libfuse internals.
> (Technically, we already have exactly that problem with
> read_from_fuse_export()/read_from_fuse_fd() nesting.)
>
> We will continue to use libfuse for mounting the filesystem; fusermount3
> is effectively a helper program of libfuse, so it should know best how
> to interact with it. (Doing it manually without libfuse, while doable,
> is a bit of a pain, and it is not clear to me how stable the "protocol"
> actually is.)
>
> Take this opportunity of quite a major rewrite to update the Copyright
> line with corrected information that has surfaced in the meantime.
>
> Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
> except 'sync', which are iodepth=1 and pvsync2):
>
> file:
> read:
> seq aio: 78.6k ±1.3k IOPS
> rand aio: 39.3k ±2.9k
> seq sync: 32.5k ±0.7k
> rand sync: 9.9k ±0.1k
> write:
> seq aio: 61.9k ±0.5k
> rand aio: 61.2k ±0.6k
> seq sync: 27.9k ±0.2k
> rand sync: 27.6k ±0.4k
> null:
> read:
> seq aio: 214.0k ±5.9k
> rand aio: 212.7k ±4.5k
> seq sync: 90.3k ±6.5k
> rand sync: 89.7k ±5.1k
> write:
> seq aio: 203.9k ±1.5k
> rand aio: 201.4k ±3.6k
> seq sync: 86.1k ±6.2k
> rand sync: 84.9k ±5.3k
>
> And with this patch applied:
>
> file:
> read:
> seq aio: 76.6k ±1.8k (- 3 %)
> rand aio: 26.7k ±0.4k (-32 %)
> seq sync: 47.7k ±1.2k (+47 %)
> rand sync: 10.1k ±0.2k (+ 2 %)
> write:
> seq aio: 58.1k ±0.5k (- 6 %)
> rand aio: 58.1k ±0.5k (- 5 %)
> seq sync: 36.3k ±0.3k (+30 %)
> rand sync: 36.1k ±0.4k (+31 %)
> null:
> read:
> seq aio: 268.4k ±3.4k (+25 %)
> rand aio: 265.3k ±2.1k (+25 %)
> seq sync: 134.3k ±2.7k (+49 %)
> rand sync: 132.4k ±1.4k (+48 %)
> write:
> seq aio: 275.3k ±1.7k (+35 %)
> rand aio: 272.3k ±1.9k (+35 %)
> seq sync: 130.7k ±1.6k (+52 %)
> rand sync: 127.4k ±2.4k (+50 %)
>
> So clearly the AIO file results are actually not good, and random reads
> are indeed quite terrible. On the other hand, we can see from the sync
> and null results that request handling should in theory be quicker. How
> does this fit together?
>
> I believe the bad AIO results are an artifact of the accidental parallel
> request processing we have due to nested polling: Depending on how the
> actual request processing is structured and how long request processing
> takes, more or less requests will be submitted in parallel. So because
> of the restructuring, I think this patch accidentally changes how many
> requests end up being submitted in parallel, which decreases
> performance.
>
> (I have seen something like this before: In RSD, without having
> implemented a polling mode, the debug build tended to have better
> performance than the more optimized release build, because the debug
> build, taking longer to submit requests, ended up processing more
> requests in parallel.)
>
> In any case, once we use coroutines throughout the code, performance
> will improve again across the board.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 754 +++++++++++++++++++++++++++++++-------------
> 1 file changed, 535 insertions(+), 219 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* [PATCH v2 14/21] fuse: Reduce max read size
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (12 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 13/21] fuse: Manually process requests (without libfuse) Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:28 ` [PATCH v2 15/21] fuse: Process requests in coroutines Hanna Czenczek
` (7 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
We are going to introduce parallel processing via coroutines, so a
maximum read size of 64 MB may become problematic: it would allow users
of the export to force us to allocate quite large amounts of memory with
just a few requests.
At least tone it down to 1 MB, which is still probably far more than
enough. (Larger requests are split automatically by the FUSE kernel
driver anyway.)
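To put that bound in perspective (the in-flight count here is only an
illustrative assumption): with, say, 16 requests in flight, the worst-case
bounce-buffer footprint drops from 16 × 64 MB = 1 GB to 16 × 1 MB = 16 MB.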
(Yes, we inadvertently already had parallel request processing due to
nested polling before. Better to fix this late than never.)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 926f97a885..ec3a307229 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -45,7 +45,7 @@
#endif
/* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
/* Small enough to fit in the request buffer */
#define FUSE_MAX_WRITE_BYTES (4 * 1024)
--
2.49.0
* [PATCH v2 15/21] fuse: Process requests in coroutines
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (13 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 14/21] fuse: Reduce max read size Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-05 8:12 ` Hanna Czenczek
2025-06-09 16:57 ` Stefan Hajnoczi
2025-06-04 13:28 ` [PATCH v2 16/21] block/export: Add multi-threading interface Hanna Czenczek
` (6 subsequent siblings)
21 siblings, 2 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
and have read_from_fuse_fd() launch it inside of a newly created
coroutine instead of running it synchronously. This way, we can process
requests in parallel.
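Schematically, the FD handler then becomes a thin trampoline that spawns one
coroutine per readable event (a condensed sketch of the code added below; the
halted check and error handling are omitted):

static void read_from_fuse_fd(void *opaque)
{
    FuseExport *exp = opaque;
    Coroutine *co = qemu_coroutine_create(co_read_from_fuse_fd, exp);

    /* Balanced by fuse_dec_in_flight() at the end of co_read_from_fuse_fd() */
    fuse_inc_in_flight(exp);

    qemu_coroutine_enter(co);
}

co_read_from_fuse_fd() reads a single request and hands it to
fuse_co_process_request(), which can now yield in blk_co_*() calls instead of
polling, so several such coroutines can be in flight at the same time.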
These are the benchmark results, compared to (a) the original results
with libfuse, and (b) the results after switching away from libfuse
(i.e. before this patch):
file: (vs. libfuse / vs. no libfuse)
read:
seq aio: 120.6k ±1.1k (+ 53 % / + 58 %)
rand aio: 113.3k ±5.9k (+188 % / +325 %)
seq sync: 52.4k ±0.4k (+ 61 % / + 10 %)
rand sync: 10.4k ±0.4k (+ 6 % / + 3 %)
write:
seq aio: 79.8k ±0.8k (+ 29 % / + 37 %)
rand aio: 79.0k ±0.6k (+ 29 % / + 36 %)
seq sync: 41.5k ±0.3k (+ 49 % / + 15 %)
rand sync: 41.4k ±0.2k (+ 50 % / + 15 %)
null:
read:
seq aio: 266.1k ±1.5k (+ 24 % / - 1 %)
rand aio: 264.1k ±2.5k (+ 24 % / ± 0 %)
seq sync: 135.6k ±3.2k (+ 50 % / + 1 %)
rand sync: 134.7k ±3.0k (+ 50 % / + 2 %)
write:
seq aio: 281.0k ±1.8k (+ 38 % / + 2 %)
rand aio: 288.1k ±6.1k (+ 43 % / + 6 %)
seq sync: 142.2k ±3.1k (+ 65 % / + 9 %)
rand sync: 141.1k ±2.9k (+ 66 % / + 11 %)
So for non-AIO cases (and the null driver, which does not yield), there
is little change; but for file AIO, results greatly improve, resolving
the performance issue we saw before (when switching away from libfuse).
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 194 ++++++++++++++++++++++++++------------------
1 file changed, 113 insertions(+), 81 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index ec3a307229..75d80da616 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -27,6 +27,7 @@
#include "block/qapi.h"
#include "qapi/error.h"
#include "qapi/qapi-commands-block.h"
+#include "qemu/coroutine.h"
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "system/block-backend.h"
@@ -86,6 +87,12 @@ typedef struct FuseExport {
gid_t st_gid;
} FuseExport;
+/* Parameters to the request processing coroutine */
+typedef struct FuseRequestCoParam {
+ FuseExport *exp;
+ int got_request;
+} FuseRequestCoParam;
+
static GHashTable *exports;
static void fuse_export_shutdown(BlockExport *exp);
@@ -99,7 +106,7 @@ static int mount_fuse_export(FuseExport *exp, Error **errp);
static bool is_regular_file(const char *path, Error **errp);
static void read_from_fuse_fd(void *opaque);
-static void fuse_process_request(FuseExport *exp);
+static void coroutine_fn fuse_co_process_request(FuseExport *exp);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -346,17 +353,20 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
}
/**
- * Try to read and process a single request from the FUSE FD.
+ * Try to read a single request from the FUSE FD.
+ * Takes a FuseExport pointer in `opaque`.
+ *
+ * Assumes the export's in-flight counter has already been incremented.
+ *
+ * If a request is available, process it.
*/
-static void read_from_fuse_fd(void *opaque)
+static void coroutine_fn co_read_from_fuse_fd(void *opaque)
{
FuseExport *exp = opaque;
int fuse_fd = exp->fuse_fd;
ssize_t ret;
const struct fuse_in_header *in_hdr;
- fuse_inc_in_flight(exp);
-
if (unlikely(exp->halted)) {
goto no_request;
}
@@ -391,12 +401,28 @@ static void read_from_fuse_fd(void *opaque)
goto no_request;
}
- fuse_process_request(exp);
+ fuse_co_process_request(exp);
no_request:
fuse_dec_in_flight(exp);
}
+/**
+ * Try to read and process a single request from the FUSE FD.
+ * (To be used as a handler for when the FUSE FD becomes readable.)
+ * Takes a FuseExport pointer in `opaque`.
+ */
+static void read_from_fuse_fd(void *opaque)
+{
+ FuseExport *exp = opaque;
+ Coroutine *co;
+
+ co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
+ /* Decremented by co_read_from_fuse_fd() */
+ fuse_inc_in_flight(exp);
+ qemu_coroutine_enter(co);
+}
+
static void fuse_export_shutdown(BlockExport *blk_exp)
{
FuseExport *exp = container_of(blk_exp, FuseExport, common);
@@ -470,8 +496,9 @@ static bool is_regular_file(const char *path, Error **errp)
* Process FUSE INIT.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
- uint32_t max_readahead, uint32_t flags)
+static ssize_t coroutine_fn
+fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
+ uint32_t max_readahead, uint32_t flags)
{
const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
@@ -510,17 +537,18 @@ static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
* Let clients get file attributes (i.e., stat() the file).
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
+static ssize_t coroutine_fn
+fuse_co_getattr(FuseExport *exp, struct fuse_attr_out *out)
{
int64_t length, allocated_blocks;
time_t now = time(NULL);
- length = blk_getlength(exp->common.blk);
+ length = blk_co_getlength(exp->common.blk);
if (length < 0) {
return length;
}
- allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
+ allocated_blocks = bdrv_co_get_allocated_file_size(blk_bs(exp->common.blk));
if (allocated_blocks <= 0) {
allocated_blocks = DIV_ROUND_UP(length, 512);
} else {
@@ -547,8 +575,9 @@ static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
return sizeof(*out);
}
-static int fuse_do_truncate(const FuseExport *exp, int64_t size,
- bool req_zero_write, PreallocMode prealloc)
+static int coroutine_fn
+fuse_co_do_truncate(const FuseExport *exp, int64_t size, bool req_zero_write,
+ PreallocMode prealloc)
{
uint64_t blk_perm, blk_shared_perm;
BdrvRequestFlags truncate_flags = 0;
@@ -577,8 +606,8 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
}
}
- ret = blk_truncate(exp->common.blk, size, true, prealloc,
- truncate_flags, NULL);
+ ret = blk_co_truncate(exp->common.blk, size, true, prealloc,
+ truncate_flags, NULL);
if (add_resize_perm) {
/* Must succeed, because we are only giving up the RESIZE permission */
@@ -599,9 +628,9 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
* they cannot be given non-owner access.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
- uint32_t to_set, uint64_t size, uint32_t mode,
- uint32_t uid, uint32_t gid)
+static ssize_t coroutine_fn
+fuse_co_setattr(FuseExport *exp, struct fuse_attr_out *out, uint32_t to_set,
+ uint64_t size, uint32_t mode, uint32_t uid, uint32_t gid)
{
int supported_attrs;
int ret;
@@ -638,7 +667,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
return -EACCES;
}
- ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
if (ret < 0) {
return ret;
}
@@ -657,7 +686,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
exp->st_gid = gid;
}
- return fuse_getattr(exp, out);
+ return fuse_co_getattr(exp, out);
}
/**
@@ -665,7 +694,8 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
* just acknowledge the request.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
+static ssize_t coroutine_fn
+fuse_co_open(FuseExport *exp, struct fuse_open_out *out)
{
*out = (struct fuse_open_out) {
.open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
@@ -679,8 +709,8 @@ static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
* Returns the buffer (read) size on success, and -errno on error.
* After use, *bufptr must be freed via qemu_vfree().
*/
-static ssize_t fuse_read(FuseExport *exp, void **bufptr,
- uint64_t offset, uint32_t size)
+static ssize_t coroutine_fn
+fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
{
int64_t blk_len;
void *buf;
@@ -695,7 +725,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
* Clients will expect short reads at EOF, so we have to limit
* offset+size to the image length.
*/
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
return blk_len;
}
@@ -709,7 +739,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
return -ENOMEM;
}
- ret = blk_pread(exp->common.blk, offset, size, buf, 0);
+ ret = blk_co_pread(exp->common.blk, offset, size, buf, 0);
if (ret < 0) {
qemu_vfree(buf);
return ret;
@@ -721,11 +751,12 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
/**
* Handle client writes to the exported image. @buf has the data to be written
- * and will be copied to a bounce buffer before polling for the first time.
+ * and will be copied to a bounce buffer before yielding for the first time.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
- uint64_t offset, uint32_t size, const void *buf)
+static ssize_t coroutine_fn
+fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
+ uint64_t offset, uint32_t size, const void *buf)
{
void *copied;
int64_t blk_len;
@@ -740,9 +771,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
return -EACCES;
}
- /*
- * Must copy to bounce buffer before calling aio_poll() (to allow nesting)
- */
+ /* Must copy to bounce buffer before potentially yielding */
copied = blk_blockalign(exp->common.blk, size);
memcpy(copied, buf, size);
@@ -750,7 +779,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
* Clients will expect short writes at EOF, so we have to limit
* offset+size to the image length.
*/
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
ret = blk_len;
goto fail_free_buffer;
@@ -758,7 +787,8 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
if (offset + size > blk_len) {
if (exp->growable) {
- ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, offset + size, true,
+ PREALLOC_MODE_OFF);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -767,7 +797,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
}
}
- ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
+ ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -788,8 +818,9 @@ fail_free_buffer:
* Let clients perform various fallocate() operations.
* Return 0 on success (no 'out' object), and -errno on error.
*/
-static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
- uint32_t mode)
+static ssize_t coroutine_fn
+fuse_co_fallocate(FuseExport *exp,
+ uint64_t offset, uint64_t length, uint32_t mode)
{
int64_t blk_len;
int ret;
@@ -798,7 +829,7 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
return -EACCES;
}
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
return blk_len;
}
@@ -817,14 +848,14 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
if (offset > blk_len) {
/* No preallocation needed here */
- ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
if (ret < 0) {
return ret;
}
}
- ret = fuse_do_truncate(exp, offset + length, true,
- PREALLOC_MODE_FALLOC);
+ ret = fuse_co_do_truncate(exp, offset + length, true,
+ PREALLOC_MODE_FALLOC);
}
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
else if (mode & FALLOC_FL_PUNCH_HOLE) {
@@ -835,8 +866,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
do {
int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
- ret = blk_pwrite_zeroes(exp->common.blk, offset, size,
- BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
+ ret = blk_co_pwrite_zeroes(exp->common.blk, offset, size,
+ BDRV_REQ_MAY_UNMAP |
+ BDRV_REQ_NO_FALLBACK);
if (ret == -ENOTSUP) {
/*
* fallocate() specifies to return EOPNOTSUPP for unsupported
@@ -854,8 +886,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
else if (mode & FALLOC_FL_ZERO_RANGE) {
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + length > blk_len) {
/* No need for zeroes, we are going to write them ourselves */
- ret = fuse_do_truncate(exp, offset + length, false,
- PREALLOC_MODE_OFF);
+ ret = fuse_co_do_truncate(exp, offset + length, false,
+ PREALLOC_MODE_OFF);
if (ret < 0) {
return ret;
}
@@ -864,8 +896,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
do {
int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
- ret = blk_pwrite_zeroes(exp->common.blk,
- offset, size, 0);
+ ret = blk_co_pwrite_zeroes(exp->common.blk,
+ offset, size, 0);
offset += size;
length -= size;
} while (ret == 0 && length > 0);
@@ -882,9 +914,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
* Let clients fsync the exported image.
* Return 0 on success (no 'out' object), and -errno on error.
*/
-static ssize_t fuse_fsync(FuseExport *exp)
+static ssize_t coroutine_fn fuse_co_fsync(FuseExport *exp)
{
- return blk_flush(exp->common.blk);
+ return blk_co_flush(exp->common.blk);
}
/**
@@ -892,9 +924,9 @@ static ssize_t fuse_fsync(FuseExport *exp)
* notes this to be a way to return last-minute errors.)
* Return 0 on success (no 'out' object), and -errno on error.
*/
-static ssize_t fuse_flush(FuseExport *exp)
+static ssize_t coroutine_fn fuse_co_flush(FuseExport *exp)
{
- return blk_flush(exp->common.blk);
+ return blk_co_flush(exp->common.blk);
}
#ifdef CONFIG_FUSE_LSEEK
@@ -902,8 +934,9 @@ static ssize_t fuse_flush(FuseExport *exp)
* Let clients inquire allocation status.
* Return the number of bytes written to *out on success, and -errno on error.
*/
-static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
- uint64_t offset, uint32_t whence)
+static ssize_t coroutine_fn
+fuse_co_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+ uint64_t offset, uint32_t whence)
{
if (whence != SEEK_HOLE && whence != SEEK_DATA) {
return -EINVAL;
@@ -913,8 +946,8 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
int64_t pnum;
int ret;
- ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
- offset, INT64_MAX, &pnum, NULL, NULL);
+ ret = bdrv_co_block_status_above(blk_bs(exp->common.blk), NULL,
+ offset, INT64_MAX, &pnum, NULL, NULL);
if (ret < 0) {
return ret;
}
@@ -931,7 +964,7 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
* and @blk_len (the client-visible EOF).
*/
- blk_len = blk_getlength(exp->common.blk);
+ blk_len = blk_co_getlength(exp->common.blk);
if (blk_len < 0) {
return blk_len;
}
@@ -1066,14 +1099,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
}
/*
- * For use in fuse_process_request():
+ * For use in fuse_co_process_request():
* Returns a pointer to the parameter object for the given operation (inside of
* exp->request_buf, which is assumed to hold a fuse_in_header first).
* Verifies that the object is complete (exp->request_buf is large enough to
* hold it in one piece, and the request length includes the whole object).
*
- * Note that exp->request_buf may be overwritten after polling, so the returned
- * pointer must not be used across a function that may poll!
+ * Note that exp->request_buf may be overwritten after yielding, so the returned
+ * pointer must not be used across a function that may yield!
*/
#define FUSE_IN_OP_STRUCT(op_name, export) \
({ \
@@ -1097,7 +1130,7 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
})
/*
- * For use in fuse_process_request():
+ * For use in fuse_co_process_request():
* Returns a pointer to the return object for the given operation (inside of
* out_buf, which is assumed to hold a fuse_out_header first).
* Verifies that out_buf is large enough to hold the whole object.
@@ -1120,12 +1153,11 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
/**
* Process a FUSE request, incl. writing the response.
*
- * Note that polling in any request-processing function can lead to a nested
- * read_from_fuse_fd() call, which will overwrite the contents of
- * exp->request_buf. Anything that takes a buffer needs to take care that the
- * content is copied before potentially polling.
+ * Note that yielding in any request-processing function can overwrite the
+ * contents of exp->request_buf. Anything that takes a buffer needs to take
+ * care that the content is copied before yielding.
*/
-static void fuse_process_request(FuseExport *exp)
+static void coroutine_fn fuse_co_process_request(FuseExport *exp)
{
uint32_t opcode;
uint64_t req_id;
@@ -1146,7 +1178,7 @@ static void fuse_process_request(FuseExport *exp)
void *out_data_buffer = NULL;
ssize_t ret;
- /* Limit scope to ensure pointer is no longer used after polling */
+ /* Limit scope to ensure pointer is no longer used after yielding */
{
const struct fuse_in_header *in_hdr =
(const struct fuse_in_header *)exp->request_buf;
@@ -1158,13 +1190,13 @@ static void fuse_process_request(FuseExport *exp)
switch (opcode) {
case FUSE_INIT: {
const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
- ret = fuse_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
- in->max_readahead, in->flags);
+ ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
+ in->max_readahead, in->flags);
break;
}
case FUSE_OPEN:
- ret = fuse_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
+ ret = fuse_co_open(exp, FUSE_OUT_OP_STRUCT(open, out_buf));
break;
case FUSE_RELEASE:
@@ -1176,19 +1208,19 @@ static void fuse_process_request(FuseExport *exp)
break;
case FUSE_GETATTR:
- ret = fuse_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
+ ret = fuse_co_getattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf));
break;
case FUSE_SETATTR: {
const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
- ret = fuse_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
- in->valid, in->size, in->mode, in->uid, in->gid);
+ ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
+ in->valid, in->size, in->mode, in->uid, in->gid);
break;
}
case FUSE_READ: {
const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
- ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+ ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
break;
}
@@ -1216,33 +1248,33 @@ static void fuse_process_request(FuseExport *exp)
/*
* Passing a pointer to `in` (i.e. the request buffer) is fine because
- * fuse_write() takes care to copy its contents before potentially
- * polling.
+ * fuse_co_write() takes care to copy its contents before potentially
+ * yielding.
*/
- ret = fuse_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
- in->offset, in->size, in + 1);
+ ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
+ in->offset, in->size, in + 1);
break;
}
case FUSE_FALLOCATE: {
const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
- ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+ ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
break;
}
case FUSE_FSYNC:
- ret = fuse_fsync(exp);
+ ret = fuse_co_fsync(exp);
break;
case FUSE_FLUSH:
- ret = fuse_flush(exp);
+ ret = fuse_co_flush(exp);
break;
#ifdef CONFIG_FUSE_LSEEK
case FUSE_LSEEK: {
const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
- ret = fuse_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
- in->offset, in->whence);
+ ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
+ in->offset, in->whence);
break;
}
#endif
--
2.49.0
* Re: [PATCH v2 15/21] fuse: Process requests in coroutines
2025-06-04 13:28 ` [PATCH v2 15/21] fuse: Process requests in coroutines Hanna Czenczek
@ 2025-06-05 8:12 ` Hanna Czenczek
2025-06-09 16:57 ` Stefan Hajnoczi
1 sibling, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-05 8:12 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Stefan Hajnoczi, Kevin Wolf, Markus Armbruster,
Brian Song
On 04.06.25 15:28, Hanna Czenczek wrote:
> Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
> and have read_from_fuse_fd() launch it inside of a newly created
> coroutine instead of running it synchronously. This way, we can process
> requests in parallel.
>
> These are the benchmark results, compared to (a) the original results
> with libfuse, and (b) the results after switching away from libfuse
> (i.e. before this patch):
>
> file: (vs. libfuse / vs. no libfuse)
> read:
> seq aio: 120.6k ±1.1k (+ 53 % / + 58 %)
> rand aio: 113.3k ±5.9k (+188 % / +325 %)
> seq sync: 52.4k ±0.4k (+ 61 % / + 10 %)
> rand sync: 10.4k ±0.4k (+ 6 % / + 3 %)
> write:
> seq aio: 79.8k ±0.8k (+ 29 % / + 37 %)
> rand aio: 79.0k ±0.6k (+ 29 % / + 36 %)
> seq sync: 41.5k ±0.3k (+ 49 % / + 15 %)
> rand sync: 41.4k ±0.2k (+ 50 % / + 15 %)
> null:
> read:
> seq aio: 266.1k ±1.5k (+ 24 % / - 1 %)
> rand aio: 264.1k ±2.5k (+ 24 % / ± 0 %)
> seq sync: 135.6k ±3.2k (+ 50 % / + 1 %)
> rand sync: 134.7k ±3.0k (+ 50 % / + 2 %)
> write:
> seq aio: 281.0k ±1.8k (+ 38 % / + 2 %)
> rand aio: 288.1k ±6.1k (+ 43 % / + 6 %)
> seq sync: 142.2k ±3.1k (+ 65 % / + 9 %)
> rand sync: 141.1k ±2.9k (+ 66 % / + 11 %)
>
> So for non-AIO cases (and the null driver, which does not yield), there
> is little change; but for file AIO, results greatly improve, resolving
> the performance issue we saw before (when switching away from libfuse).
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Sorry, I should have dropped the R-b :/
There are non-trivial changes to this patch from v1.
Hanna
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 194 ++++++++++++++++++++++++++------------------
> 1 file changed, 113 insertions(+), 81 deletions(-)
* Re: [PATCH v2 15/21] fuse: Process requests in coroutines
2025-06-04 13:28 ` [PATCH v2 15/21] fuse: Process requests in coroutines Hanna Czenczek
2025-06-05 8:12 ` Hanna Czenczek
@ 2025-06-09 16:57 ` Stefan Hajnoczi
1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 16:57 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:28:07PM +0200, Hanna Czenczek wrote:
> Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
> and have read_from_fuse_fd() launch it inside of a newly created
> coroutine instead of running it synchronously. This way, we can process
> requests in parallel.
>
> These are the benchmark results, compared to (a) the original results
> with libfuse, and (b) the results after switching away from libfuse
> (i.e. before this patch):
>
> file: (vs. libfuse / vs. no libfuse)
> read:
> seq aio: 120.6k ±1.1k (+ 53 % / + 58 %)
> rand aio: 113.3k ±5.9k (+188 % / +325 %)
> seq sync: 52.4k ±0.4k (+ 61 % / + 10 %)
> rand sync: 10.4k ±0.4k (+ 6 % / + 3 %)
> write:
> seq aio: 79.8k ±0.8k (+ 29 % / + 37 %)
> rand aio: 79.0k ±0.6k (+ 29 % / + 36 %)
> seq sync: 41.5k ±0.3k (+ 49 % / + 15 %)
> rand sync: 41.4k ±0.2k (+ 50 % / + 15 %)
> null:
> read:
> seq aio: 266.1k ±1.5k (+ 24 % / - 1 %)
> rand aio: 264.1k ±2.5k (+ 24 % / ± 0 %)
> seq sync: 135.6k ±3.2k (+ 50 % / + 1 %)
> rand sync: 134.7k ±3.0k (+ 50 % / + 2 %)
> write:
> seq aio: 281.0k ±1.8k (+ 38 % / + 2 %)
> rand aio: 288.1k ±6.1k (+ 43 % / + 6 %)
> seq sync: 142.2k ±3.1k (+ 65 % / + 9 %)
> rand sync: 141.1k ±2.9k (+ 66 % / + 11 %)
>
> So for non-AIO cases (and the null driver, which does not yield), there
> is little change; but for file AIO, results greatly improve, resolving
> the performance issue we saw before (when switching away from libfuse).
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 194 ++++++++++++++++++++++++++------------------
> 1 file changed, 113 insertions(+), 81 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 16/21] block/export: Add multi-threading interface
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (14 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 15/21] fuse: Process requests in coroutines Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:58 ` Markus Armbruster
2025-06-09 17:00 ` Stefan Hajnoczi
2025-06-04 13:28 ` [PATCH v2 17/21] iotests/307: Test multi-thread export interface Hanna Czenczek
` (5 subsequent siblings)
21 siblings, 2 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Make BlockExportType.iothread an alternate between a single-thread
variant 'str' and a multi-threading variant '[str]'.
In contrast to the single-thread setting, the multi-threading setting
will not change the BDS's context (and so is incompatible with the
fixed-iothread setting), but instead just pass a list to the export
driver, with which it can do whatever it wants.
Currently no export driver supports multi-threading, so they all return
an error when receiving such a list.
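For illustration only (not part of this patch), here is what the two
variants would look like on the QMP wire, assuming an NBD export of a node
"fmt" and iothread objects "iothread0"/"iothread1" (these names are taken
from the iotest later in this series):
  # Single-thread variant (string): moves the node to that iothread,
  # as before
  {"execute": "block-export-add",
   "arguments": {"id": "export0", "type": "nbd", "node-name": "fmt",
                 "iothread": "iothread0"}}
  # Multi-threading variant (list): leaves the node's context alone and
  # hands the list to the export driver; with this patch alone, every
  # driver still rejects it
  {"execute": "block-export-add",
   "arguments": {"id": "export0", "type": "nbd", "node-name": "fmt",
                 "iothread": ["iothread0", "iothread1"]}}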
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
qapi/block-export.json | 34 +++++++++++++++++---
include/block/export.h | 12 +++++--
block/export/export.c | 48 +++++++++++++++++++++++++---
block/export/fuse.c | 7 ++++
block/export/vduse-blk.c | 7 ++++
block/export/vhost-user-blk-server.c | 8 +++++
nbd/server.c | 6 ++++
7 files changed, 111 insertions(+), 11 deletions(-)
diff --git a/qapi/block-export.json b/qapi/block-export.json
index c783e01a53..3ebad4ecef 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -362,14 +362,16 @@
# to the export before completion is signalled. (since: 5.2;
# default: false)
#
-# @iothread: The name of the iothread object where the export will
-# run. The default is to use the thread currently associated with
-# the block node. (since: 5.2)
+# @iothread: The name(s) of one or more iothread object(s) where the
+# export will run. The default is to use the thread currently
+# associated with the block node. (since: 5.2; multi-threading
+# since 10.1)
#
# @fixed-iothread: True prevents the block node from being moved to
# another thread while the export is active. If true and
# @iothread is given, export creation fails if the block node
-# cannot be moved to the iothread. The default is false.
+# cannot be moved to the iothread. Must not be true when giving
+# multiple iothreads for @iothread. The default is false.
# (since: 5.2)
#
# @allow-inactive: If true, the export allows the exported node to be inactive.
@@ -385,7 +387,7 @@
'base': { 'type': 'BlockExportType',
'id': 'str',
'*fixed-iothread': 'bool',
- '*iothread': 'str',
+ '*iothread': 'BlockExportIothreads',
'node-name': 'str',
'*writable': 'bool',
'*writethrough': 'bool',
@@ -401,6 +403,28 @@
'if': 'CONFIG_VDUSE_BLK_EXPORT' }
} }
+##
+# @BlockExportIothreads:
+#
+# Specify a single or multiple I/O threads in which to run a block export's I/O.
+#
+# @single: Run the export's I/O in the given single I/O thread.
+#
+# @multi: Use multi-threading across the given set of I/O threads, which
+# must not be empty. Note: Passing a single I/O thread via this variant is
+# still treated as multi-threading, which is different from using the
+# @single variant. In particular, even if there only is a single I/O thread
+# in the set, export types that do not support multi-threading will
+# generally reject this variant, and BlockExportOptions.fixed-iothread is
+# always incompatible with it.
+#
+# Since: 10.1
+##
+{ 'alternate': 'BlockExportIothreads',
+ 'data': {
+ 'single': 'str',
+ 'multi': ['str'] } }
+
##
# @block-export-add:
#
diff --git a/include/block/export.h b/include/block/export.h
index 4bd9531d4d..ca45da928c 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -32,8 +32,16 @@ typedef struct BlockExportDriver {
/* True if the export type supports running on an inactive node */
bool supports_inactive;
- /* Creates and starts a new block export */
- int (*create)(BlockExport *, BlockExportOptions *, Error **);
+ /*
+ * Creates and starts a new block export.
+ *
+ * If the user passed a set of I/O threads for multi-threading, @multithread
+ * is a list of the @multithread_count corresponding contexts (freed by the
+ * caller). Note that @exp->ctx has no relation to that list.
+ */
+ int (*create)(BlockExport *exp, BlockExportOptions *opts,
+ AioContext *const *multithread, size_t multithread_count,
+ Error **errp);
/*
* Frees a removed block export. This function is only called after all
diff --git a/block/export/export.c b/block/export/export.c
index f3bbf11070..b733f269f3 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -76,16 +76,26 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
{
bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
bool allow_inactive = export->has_allow_inactive && export->allow_inactive;
+ bool multithread = export->iothread &&
+ export->iothread->type == QTYPE_QLIST;
const BlockExportDriver *drv;
BlockExport *exp = NULL;
BlockDriverState *bs;
BlockBackend *blk = NULL;
AioContext *ctx;
+ AioContext **multithread_ctxs = NULL;
+ size_t multithread_count = 0;
uint64_t perm;
int ret;
GLOBAL_STATE_CODE();
+ if (fixed_iothread && multithread) {
+ error_setg(errp,
+ "Cannot use fixed-iothread for a multi-threaded export");
+ return NULL;
+ }
+
if (!id_wellformed(export->id)) {
error_setg(errp, "Invalid block export id");
return NULL;
@@ -116,14 +126,16 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
ctx = bdrv_get_aio_context(bs);
- if (export->iothread) {
+ /* Move the BDS to the target I/O thread, if it is a single one */
+ if (export->iothread && !multithread) {
+ const char *iothread_id = export->iothread->u.single;
IOThread *iothread;
AioContext *new_ctx;
Error **set_context_errp;
- iothread = iothread_by_id(export->iothread);
+ iothread = iothread_by_id(iothread_id);
if (!iothread) {
- error_setg(errp, "iothread \"%s\" not found", export->iothread);
+ error_setg(errp, "iothread \"%s\" not found", iothread_id);
goto fail;
}
@@ -137,6 +149,32 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
} else if (fixed_iothread) {
goto fail;
}
+ } else if (multithread) {
+ strList *iothread_list = export->iothread->u.multi;
+ size_t i;
+
+ multithread_count = 0;
+ for (strList *e = iothread_list; e; e = e->next) {
+ multithread_count++;
+ }
+
+ if (multithread_count == 0) {
+ error_setg(errp, "The set of I/O threads must not be empty");
+ return NULL;
+ }
+
+ multithread_ctxs = g_new(AioContext *, multithread_count);
+ i = 0;
+ for (strList *e = iothread_list; e; e = e->next) {
+ IOThread *iothread = iothread_by_id(e->value);
+
+ if (!iothread) {
+ error_setg(errp, "iothread \"%s\" not found", e->value);
+ goto fail;
+ }
+ multithread_ctxs[i++] = iothread_get_aio_context(iothread);
+ }
+ assert(i == multithread_count);
}
bdrv_graph_rdlock_main_loop();
@@ -195,7 +233,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
.blk = blk,
};
- ret = drv->create(exp, export, errp);
+ ret = drv->create(exp, export, multithread_ctxs, multithread_count, errp);
if (ret < 0) {
goto fail;
}
@@ -203,6 +241,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
assert(exp->blk != NULL);
QLIST_INSERT_HEAD(&block_exports, exp, next);
+ g_free(multithread_ctxs);
return exp;
fail:
@@ -214,6 +253,7 @@ fail:
g_free(exp->id);
g_free(exp);
}
+ g_free(multithread_ctxs);
return NULL;
}
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 75d80da616..44f0b796b3 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -180,6 +180,8 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
static int fuse_export_create(BlockExport *blk_exp,
BlockExportOptions *blk_exp_args,
+ AioContext *const *multithread,
+ size_t mt_count,
Error **errp)
{
ERRP_GUARD(); /* ensure clean-up even with error_fatal */
@@ -189,6 +191,11 @@ static int fuse_export_create(BlockExport *blk_exp,
assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
+ if (multithread) {
+ error_setg(errp, "FUSE export does not support multi-threading");
+ return -EINVAL;
+ }
+
/* For growable and writable exports, take the RESIZE permission */
if (args->growable || blk_exp_args->writable) {
uint64_t blk_perm, blk_shared_perm;
diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index bd852e538d..bf70c98dd6 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -266,6 +266,7 @@ static const BlockDevOps vduse_block_ops = {
};
static int vduse_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+ AioContext *const *multithread, size_t mt_count,
Error **errp)
{
VduseBlkExport *vblk_exp = container_of(exp, VduseBlkExport, export);
@@ -301,6 +302,12 @@ static int vduse_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
return -EINVAL;
}
}
+
+ if (multithread) {
+ error_setg(errp, "vduse-blk export does not support multi-threading");
+ return -EINVAL;
+ }
+
vblk_exp->num_queues = num_queues;
vblk_exp->handler.blk = exp->blk;
vblk_exp->handler.serial = g_strdup(vblk_opts->serial ?: "");
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index d9d2014d9b..481d4b7441 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -315,6 +315,7 @@ static const BlockDevOps vu_blk_dev_ops = {
};
static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+ AioContext *const *multithread, size_t mt_count,
Error **errp)
{
VuBlkExport *vexp = container_of(exp, VuBlkExport, export);
@@ -340,6 +341,13 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
error_setg(errp, "num-queues must be greater than 0");
return -EINVAL;
}
+
+ if (multithread) {
+ error_setg(errp,
+ "vhost-user-blk export does not support multi-threading");
+ return -EINVAL;
+ }
+
vexp->handler.blk = exp->blk;
vexp->handler.serial = g_strdup("vhost_user_blk");
vexp->handler.logical_block_size = logical_block_size;
diff --git a/nbd/server.c b/nbd/server.c
index d242be9811..a1736a5a24 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1793,6 +1793,7 @@ static const BlockDevOps nbd_block_ops = {
};
static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
+ AioContext *const *multithread, size_t mt_count,
Error **errp)
{
NBDExport *exp = container_of(blk_exp, NBDExport, common);
@@ -1829,6 +1830,11 @@ static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
return -EEXIST;
}
+ if (multithread) {
+ error_setg(errp, "NBD export does not support multi-threading");
+ return -EINVAL;
+ }
+
size = blk_getlength(blk);
if (size < 0) {
error_setg_errno(errp, -size,
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 16/21] block/export: Add multi-threading interface
2025-06-04 13:28 ` [PATCH v2 16/21] block/export: Add multi-threading interface Hanna Czenczek
@ 2025-06-04 13:58 ` Markus Armbruster
2025-06-09 17:00 ` Stefan Hajnoczi
1 sibling, 0 replies; 40+ messages in thread
From: Markus Armbruster @ 2025-06-04 13:58 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Kevin Wolf, Brian Song
Hanna Czenczek <hreitz@redhat.com> writes:
> Make BlockExportType.iothread an alternate between a single-thread
> variant 'str' and a multi-threading variant '[str]'.
>
> In contrast to the single-thread setting, the multi-threading setting
> will not change the BDS's context (and so is incompatible with the
> fixed-iothread setting), but instead just pass a list to the export
> driver, with which it can do whatever it wants.
>
> Currently no export driver supports multi-threading, so they all return
> an error when receiving such a list.
>
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 16/21] block/export: Add multi-threading interface
2025-06-04 13:28 ` [PATCH v2 16/21] block/export: Add multi-threading interface Hanna Czenczek
2025-06-04 13:58 ` Markus Armbruster
@ 2025-06-09 17:00 ` Stefan Hajnoczi
1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 17:00 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]
On Wed, Jun 04, 2025 at 03:28:08PM +0200, Hanna Czenczek wrote:
> Make BlockExportType.iothread an alternate between a single-thread
> variant 'str' and a multi-threading variant '[str]'.
>
> In contrast to the single-thread setting, the multi-threading setting
> will not change the BDS's context (and so is incompatible with the
> fixed-iothread setting), but instead just pass a list to the export
> driver, with which it can do whatever it wants.
>
> Currently no export driver supports multi-threading, so they all return
> an error when receiving such a list.
>
> Suggested-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> qapi/block-export.json | 34 +++++++++++++++++---
> include/block/export.h | 12 +++++--
> block/export/export.c | 48 +++++++++++++++++++++++++---
> block/export/fuse.c | 7 ++++
> block/export/vduse-blk.c | 7 ++++
> block/export/vhost-user-blk-server.c | 8 +++++
> nbd/server.c | 6 ++++
> 7 files changed, 111 insertions(+), 11 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 17/21] iotests/307: Test multi-thread export interface
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (15 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 16/21] block/export: Add multi-threading interface Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:28 ` [PATCH v2 18/21] fuse: Implement multi-threading Hanna Czenczek
` (4 subsequent siblings)
21 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Test the QAPI interface for multi-threaded exports. None of our exports
currently support multi-threading, so it's always an error in the end,
but we can still test the specific errors.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
tests/qemu-iotests/307 | 47 ++++++++++++++++++++++++++++++++++++++
tests/qemu-iotests/307.out | 18 +++++++++++++++
2 files changed, 65 insertions(+)
diff --git a/tests/qemu-iotests/307 b/tests/qemu-iotests/307
index b429b5aa50..f6ee3ebec0 100755
--- a/tests/qemu-iotests/307
+++ b/tests/qemu-iotests/307
@@ -142,5 +142,52 @@ with iotests.FilePath('image') as img, \
vm.qmp_log('query-block-exports')
iotests.qemu_nbd_list_log('-k', socket)
+ iotests.log('\n=== Using multi-thread with NBD ===')
+
+ # Actual multi-threading; (currently) not supported by NBD
+ vm.qmp_log('block-export-add',
+ id='export0',
+ type='nbd',
+ node_name='fmt',
+ iothread=['iothread0', 'iothread1'])
+
+ # Should be treated the same way as actual multi-threading, even if there's
+ # only a single thread
+ vm.qmp_log('block-export-add',
+ id='export0',
+ type='nbd',
+ node_name='fmt',
+ iothread=['iothread0'])
+
+ iotests.log('\n=== Empty thread list')
+
+ # Simply not allowed
+ vm.qmp_log('block-export-add',
+ id='export0',
+ type='nbd',
+ node_name='fmt',
+ iothread=[])
+
+ iotests.log('\n=== Non-existent thread name in list')
+
+ # Expect an error, even if NBD does not support multi-threading, because the
+ # list is parsed before being passed to NBD
+ vm.qmp_log('block-export-add',
+ id='export0',
+ type='nbd',
+ node_name='fmt',
+ iothread=['iothread0', 'nothread', 'iothread1'])
+
+ iotests.log('\n=== Multi-thread with fixed-iothread')
+
+ # With multi-threading, there is no single context to give the BDS, so it is
+ # just left where it is. fixed-iothread does not make sense then.
+ vm.qmp_log('block-export-add',
+ id='export0',
+ type='nbd',
+ node_name='fmt',
+ iothread=['iothread0', 'iothread1'],
+ fixed_iothread=True)
+
iotests.log('\n=== Shut down QEMU ===')
vm.shutdown()
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index f645f3315f..a9b37d3ac1 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -134,4 +134,22 @@ read failed: Input/output error
exports available: 0
+=== Using multi-thread with NBD ===
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "NBD export does not support multi-threading"}}
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "NBD export does not support multi-threading"}}
+
+=== Empty thread list
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": [], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "The set of I/O threads must not be empty"}}
+
+=== Non-existent thread name in list
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0", "nothread", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "iothread \"nothread\" not found"}}
+
+=== Multi-thread with fixed-iothread
+{"execute": "block-export-add", "arguments": {"fixed-iothread": true, "id": "export0", "iothread": ["iothread0", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "Cannot use fixed-iothread for a multi-threaded export"}}
+
=== Shut down QEMU ===
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v2 18/21] fuse: Implement multi-threading
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (16 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 17/21] iotests/307: Test multi-thread export interface Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-09 18:10 ` Stefan Hajnoczi
2025-06-27 1:08 ` Brian
2025-06-04 13:28 ` [PATCH v2 19/21] qapi/block-export: Document FUSE's multi-threading Hanna Czenczek
` (3 subsequent siblings)
21 siblings, 2 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
(via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
We can use this to implement multi-threading.
For configuration, we don't need any more information beyond the simple
array provided by the core block export interface: The FUSE kernel
driver feeds these FDs in a round-robin fashion, so all of them are
equivalent and we want to have exactly one per thread.
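For reference, the cloning sequence that clone_fuse_fd() in the diff below
implements boils down to an open() plus an ioctl(); this is a condensed,
standalone sketch (illustration only, error reporting trimmed compared to
the real code):
  #include <errno.h>
  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/fuse.h>     /* FUSE_DEV_IOC_CLONE */
  /* session_fd: the FD returned by fuse_session_fd() for the mounted export */
  static int clone_fuse_fd_sketch(int session_fd)
  {
      uint32_t src_fd = session_fd;  /* the clone ioctl takes a uint32_t */
      int new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
      if (new_fd < 0) {
          return -errno;
      }
      /* Attach new_fd to the same session; the kernel then distributes
       * incoming requests round-robin across all cloned FDs. */
      if (ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd) < 0) {
          int ret = -errno;
          close(new_fd);
          return ret;
      }
      return new_fd;
  }
Each cloned FD then gets its own FuseQueue, with its own fd handler running
in its own AioContext.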
These are the benchmark results when using four threads (compared to a
single thread); note that fio still only uses a single job, but
performance can still be improved because of said round-robin usage for
the queues. (Not in the sync case, though, in which case I guess it
just adds overhead.)
file:
  read:
    seq aio: 264.8k ±0.8k (+120 %)
    rand aio: 143.8k ±0.4k (+ 27 %)
    seq sync: 49.9k ±0.5k (- 5 %)
    rand sync: 10.3k ±0.1k (- 1 %)
  write:
    seq aio: 226.6k ±2.1k (+184 %)
    rand aio: 225.9k ±1.8k (+186 %)
    seq sync: 36.9k ±0.6k (- 11 %)
    rand sync: 36.9k ±0.2k (- 11 %)
null:
  read:
    seq aio: 315.2k ±11.0k (+18 %)
    rand aio: 300.5k ±10.8k (+14 %)
    seq sync: 114.2k ± 3.6k (-16 %)
    rand sync: 112.5k ± 2.8k (-16 %)
  write:
    seq aio: 222.6k ±6.8k (-21 %)
    rand aio: 220.5k ±6.8k (-23 %)
    seq sync: 117.2k ±3.7k (-18 %)
    rand sync: 116.3k ±4.4k (-18 %)
(I don't know what's going on in the null-write AIO case, sorry.)
Here's results for numjobs=4:
"Before", i.e. without multithreading in QSD/FUSE (results compared to
numjobs=1):
file:
  read:
    seq aio: 104.7k ± 0.4k (- 13 %)
    rand aio: 111.5k ± 0.4k (- 2 %)
    seq sync: 71.0k ±13.8k (+ 36 %)
    rand sync: 41.4k ± 0.1k (+297 %)
  write:
    seq aio: 79.4k ±0.1k (- 1 %)
    rand aio: 78.6k ±0.1k (± 0 %)
    seq sync: 83.3k ±0.1k (+101 %)
    rand sync: 82.0k ±0.2k (+ 98 %)
null:
  read:
    seq aio: 260.5k ±1.5k (- 2 %)
    rand aio: 260.1k ±1.4k (- 2 %)
    seq sync: 291.8k ±1.3k (+115 %)
    rand sync: 280.1k ±1.7k (+115 %)
  write:
    seq aio: 280.1k ±1.7k (± 0 %)
    rand aio: 279.5k ±1.4k (- 3 %)
    seq sync: 306.7k ±2.2k (+116 %)
    rand sync: 305.9k ±1.8k (+117 %)
(As probably expected, little difference in the AIO case, but great
improvements in the sync case because it kind of gives it an artificial
iodepth of 4.)
"After", i.e. with four threads in QSD/FUSE (now results compared to the
above):
file:
  read:
    seq aio: 193.3k ± 1.8k (+ 85 %)
    rand aio: 329.3k ± 0.3k (+195 %)
    seq sync: 66.2k ±13.0k (- 7 %)
    rand sync: 40.1k ± 0.0k (- 3 %)
  write:
    seq aio: 219.7k ±0.8k (+177 %)
    rand aio: 217.2k ±1.5k (+176 %)
    seq sync: 92.5k ±0.2k (+ 11 %)
    rand sync: 91.9k ±0.2k (+ 12 %)
null:
  read:
    seq aio: 706.7k ±2.1k (+171 %)
    rand aio: 714.7k ±3.2k (+175 %)
    seq sync: 431.7k ±3.0k (+ 48 %)
    rand sync: 435.4k ±2.8k (+ 50 %)
  write:
    seq aio: 746.9k ±2.8k (+167 %)
    rand aio: 749.0k ±4.9k (+168 %)
    seq sync: 420.7k ±3.1k (+ 37 %)
    rand sync: 419.1k ±2.5k (+ 37 %)
So this helps mainly for the AIO cases, but also in the null sync cases,
because null is always CPU-bound, so more threads help.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 205 ++++++++++++++++++++++++++++++++++----------
1 file changed, 159 insertions(+), 46 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 44f0b796b3..cdec31f2a8 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -31,11 +31,14 @@
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "system/block-backend.h"
+#include "system/block-backend.h"
+#include "system/iothread.h"
#include <fuse.h>
#include <fuse_lowlevel.h>
#include "standard-headers/linux/fuse.h"
+#include <sys/ioctl.h>
#if defined(CONFIG_FALLOCATE_ZERO_RANGE)
#include <linux/falloc.h>
@@ -50,12 +53,17 @@
/* Small enough to fit in the request buffer */
#define FUSE_MAX_WRITE_BYTES (4 * 1024)
-typedef struct FuseExport {
- BlockExport common;
+typedef struct FuseExport FuseExport;
- struct fuse_session *fuse_session;
- unsigned int in_flight; /* atomic */
- bool mounted, fd_handler_set_up;
+/*
+ * One FUSE "queue", representing one FUSE FD from which requests are fetched
+ * and processed. Each queue is tied to an AioContext.
+ */
+typedef struct FuseQueue {
+ FuseExport *exp;
+
+ AioContext *ctx;
+ int fuse_fd;
/*
* The request buffer must be able to hold a full write, and/or at least
@@ -66,6 +74,14 @@ typedef struct FuseExport {
FUSE_MAX_WRITE_BYTES,
FUSE_MIN_READ_BUFFER
)];
+} FuseQueue;
+
+struct FuseExport {
+ BlockExport common;
+
+ struct fuse_session *fuse_session;
+ unsigned int in_flight; /* atomic */
+ bool mounted, fd_handler_set_up;
/*
* Set when there was an unrecoverable error and no requests should be read
@@ -74,7 +90,15 @@ typedef struct FuseExport {
*/
bool halted;
- int fuse_fd;
+ int num_queues;
+ FuseQueue *queues;
+ /*
+ * True if this export should follow the generic export's AioContext.
+ * Will be false if the queues' AioContexts have been explicitly set by the
+ * user, i.e. are expected to stay in those contexts.
+ * (I.e. is always false if there is more than one queue.)
+ */
+ bool follow_aio_context;
char *mountpoint;
bool writable;
@@ -85,11 +109,11 @@ typedef struct FuseExport {
mode_t st_mode;
uid_t st_uid;
gid_t st_gid;
-} FuseExport;
+};
/* Parameters to the request processing coroutine */
typedef struct FuseRequestCoParam {
- FuseExport *exp;
+ FuseQueue *q;
int got_request;
} FuseRequestCoParam;
@@ -102,11 +126,12 @@ static void fuse_export_halt(FuseExport *exp);
static void init_exports_table(void);
static int mount_fuse_export(FuseExport *exp, Error **errp);
+static int clone_fuse_fd(int fd, Error **errp);
static bool is_regular_file(const char *path, Error **errp);
static void read_from_fuse_fd(void *opaque);
-static void coroutine_fn fuse_co_process_request(FuseExport *exp);
+static void coroutine_fn fuse_co_process_request(FuseQueue *q);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -136,8 +161,11 @@ static void fuse_attach_handlers(FuseExport *exp)
return;
}
- aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
- read_from_fuse_fd, NULL, NULL, NULL, exp);
+ for (int i = 0; i < exp->num_queues; i++) {
+ aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+ read_from_fuse_fd, NULL, NULL, NULL,
+ &exp->queues[i]);
+ }
exp->fd_handler_set_up = true;
}
@@ -146,8 +174,10 @@ static void fuse_attach_handlers(FuseExport *exp)
*/
static void fuse_detach_handlers(FuseExport *exp)
{
- aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
- NULL, NULL, NULL, NULL, NULL);
+ for (int i = 0; i < exp->num_queues; i++) {
+ aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+ NULL, NULL, NULL, NULL, NULL);
+ }
exp->fd_handler_set_up = false;
}
@@ -162,6 +192,11 @@ static void fuse_export_drained_end(void *opaque)
/* Refresh AioContext in case it changed */
exp->common.ctx = blk_get_aio_context(exp->common.blk);
+ if (exp->follow_aio_context) {
+ assert(exp->num_queues == 1);
+ exp->queues[0].ctx = exp->common.ctx;
+ }
+
fuse_attach_handlers(exp);
}
@@ -192,8 +227,32 @@ static int fuse_export_create(BlockExport *blk_exp,
assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
if (multithread) {
- error_setg(errp, "FUSE export does not support multi-threading");
- return -EINVAL;
+ /* Guaranteed by common export code */
+ assert(mt_count >= 1);
+
+ exp->follow_aio_context = false;
+ exp->num_queues = mt_count;
+ exp->queues = g_new(FuseQueue, mt_count);
+
+ for (size_t i = 0; i < mt_count; i++) {
+ exp->queues[i] = (FuseQueue) {
+ .exp = exp,
+ .ctx = multithread[i],
+ .fuse_fd = -1,
+ };
+ }
+ } else {
+ /* Guaranteed by common export code */
+ assert(mt_count == 0);
+
+ exp->follow_aio_context = true;
+ exp->num_queues = 1;
+ exp->queues = g_new(FuseQueue, 1);
+ exp->queues[0] = (FuseQueue) {
+ .exp = exp,
+ .ctx = exp->common.ctx,
+ .fuse_fd = -1,
+ };
}
/* For growable and writable exports, take the RESIZE permission */
@@ -280,13 +339,23 @@ static int fuse_export_create(BlockExport *blk_exp,
g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
- exp->fuse_fd = fuse_session_fd(exp->fuse_session);
- ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
+ assert(exp->num_queues >= 1);
+ exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
+ ret = qemu_fcntl_addfl(exp->queues[0].fuse_fd, O_NONBLOCK);
if (ret < 0) {
error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
goto fail;
}
+ for (int i = 1; i < exp->num_queues; i++) {
+ int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
+ if (fd < 0) {
+ ret = fd;
+ goto fail;
+ }
+ exp->queues[i].fuse_fd = fd;
+ }
+
fuse_attach_handlers(exp);
return 0;
@@ -359,9 +428,42 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
return 0;
}
+/**
+ * Clone the given /dev/fuse file descriptor, yielding a second FD from which
+ * requests can be pulled for the associated filesystem. Returns an FD on
+ * success, and -errno on error.
+ */
+static int clone_fuse_fd(int fd, Error **errp)
+{
+ uint32_t src_fd = fd;
+ int new_fd;
+ int ret;
+
+ /*
+ * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
+ * (fuse_clone_chan()).
+ */
+ new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (new_fd < 0) {
+ ret = -errno;
+ error_setg_errno(errp, errno, "Failed to open /dev/fuse");
+ return ret;
+ }
+
+ ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
+ if (ret < 0) {
+ ret = -errno;
+ error_setg_errno(errp, errno, "Failed to clone FUSE FD");
+ close(new_fd);
+ return ret;
+ }
+
+ return new_fd;
+}
+
/**
* Try to read a single request from the FUSE FD.
- * Takes a FuseExport pointer in `opaque`.
+ * Takes a FuseQueue pointer in `opaque`.
*
* Assumes the export's in-flight counter has already been incremented.
*
@@ -369,8 +471,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
*/
static void coroutine_fn co_read_from_fuse_fd(void *opaque)
{
- FuseExport *exp = opaque;
- int fuse_fd = exp->fuse_fd;
+ FuseQueue *q = opaque;
+ int fuse_fd = q->fuse_fd;
+ FuseExport *exp = q->exp;
ssize_t ret;
const struct fuse_in_header *in_hdr;
@@ -378,8 +481,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
- sizeof(exp->request_buf)));
+ ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
if (ret < 0 && errno == EAGAIN) {
/* No request available */
goto no_request;
@@ -397,7 +499,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- in_hdr = (const struct fuse_in_header *)exp->request_buf;
+ in_hdr = (const struct fuse_in_header *)q->request_buf;
if (unlikely(ret != in_hdr->len)) {
error_report("Number of bytes read from FUSE device does not match "
"request size, expected %" PRIu32 " bytes, read %zi "
@@ -408,7 +510,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- fuse_co_process_request(exp);
+ fuse_co_process_request(q);
no_request:
fuse_dec_in_flight(exp);
@@ -417,16 +519,16 @@ no_request:
/**
* Try to read and process a single request from the FUSE FD.
* (To be used as a handler for when the FUSE FD becomes readable.)
- * Takes a FuseExport pointer in `opaque`.
+ * Takes a FuseQueue pointer in `opaque`.
*/
static void read_from_fuse_fd(void *opaque)
{
- FuseExport *exp = opaque;
+ FuseQueue *q = opaque;
Coroutine *co;
- co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
+ co = qemu_coroutine_create(co_read_from_fuse_fd, q);
/* Decremented by co_read_from_fuse_fd() */
- fuse_inc_in_flight(exp);
+ fuse_inc_in_flight(q->exp);
qemu_coroutine_enter(co);
}
@@ -451,6 +553,16 @@ static void fuse_export_delete(BlockExport *blk_exp)
{
FuseExport *exp = container_of(blk_exp, FuseExport, common);
+ for (int i = 0; i < exp->num_queues; i++) {
+ FuseQueue *q = &exp->queues[i];
+
+ /* Queue 0's FD belongs to the FUSE session */
+ if (i > 0 && q->fuse_fd >= 0) {
+ close(q->fuse_fd);
+ }
+ }
+ g_free(exp->queues);
+
if (exp->fuse_session) {
if (exp->mounted) {
fuse_session_unmount(exp->fuse_session);
@@ -1108,23 +1220,23 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
/*
* For use in fuse_co_process_request():
* Returns a pointer to the parameter object for the given operation (inside of
- * exp->request_buf, which is assumed to hold a fuse_in_header first).
- * Verifies that the object is complete (exp->request_buf is large enough to
+ * q->request_buf, which is assumed to hold a fuse_in_header first).
+ * Verifies that the object is complete (q->request_buf is large enough to
* hold it in one piece, and the request length includes the whole object).
*
- * Note that exp->request_buf may be overwritten after yielding, so the returned
+ * Note that q->request_buf may be overwritten after yielding, so the returned
* pointer must not be used across a function that may yield!
*/
-#define FUSE_IN_OP_STRUCT(op_name, export) \
+#define FUSE_IN_OP_STRUCT(op_name, queue) \
({ \
const struct fuse_in_header *__in_hdr = \
- (const struct fuse_in_header *)(export)->request_buf; \
+ (const struct fuse_in_header *)(q)->request_buf; \
const struct fuse_##op_name##_in *__in = \
(const struct fuse_##op_name##_in *)(__in_hdr + 1); \
const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
uint32_t __req_len; \
\
- QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
+ QEMU_BUILD_BUG_ON(sizeof((q)->request_buf) < __param_len); \
\
__req_len = __in_hdr->len; \
if (__req_len < __param_len) { \
@@ -1161,11 +1273,12 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
* Process a FUSE request, incl. writing the response.
*
* Note that yielding in any request-processing function can overwrite the
- * contents of exp->request_buf. Anything that takes a buffer needs to take
+ * contents of q->request_buf. Anything that takes a buffer needs to take
* care that the content is copied before yielding.
*/
-static void coroutine_fn fuse_co_process_request(FuseExport *exp)
+static void coroutine_fn fuse_co_process_request(FuseQueue *q)
{
+ FuseExport *exp = q->exp;
uint32_t opcode;
uint64_t req_id;
/*
@@ -1188,7 +1301,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
/* Limit scope to ensure pointer is no longer used after yielding */
{
const struct fuse_in_header *in_hdr =
- (const struct fuse_in_header *)exp->request_buf;
+ (const struct fuse_in_header *)q->request_buf;
opcode = in_hdr->opcode;
req_id = in_hdr->unique;
@@ -1196,7 +1309,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
switch (opcode) {
case FUSE_INIT: {
- const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
+ const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
in->max_readahead, in->flags);
break;
@@ -1219,23 +1332,23 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
break;
case FUSE_SETATTR: {
- const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
+ const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q);
ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
in->valid, in->size, in->mode, in->uid, in->gid);
break;
}
case FUSE_READ: {
- const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
+ const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q);
ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
break;
}
case FUSE_WRITE: {
- const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
+ const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q);
uint32_t req_len;
- req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
+ req_len = ((const struct fuse_in_header *)q->request_buf)->len;
if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
in->size)) {
warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
@@ -1264,7 +1377,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
}
case FUSE_FALLOCATE: {
- const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
+ const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q);
ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
break;
}
@@ -1279,7 +1392,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
#ifdef CONFIG_FUSE_LSEEK
case FUSE_LSEEK: {
- const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
+ const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, q);
ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
in->offset, in->whence);
break;
@@ -1293,11 +1406,11 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
/* Ignore errors from fuse_write*(), nothing we can do anyway */
if (out_data_buffer) {
assert(ret >= 0);
- fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
+ fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
out_data_buffer, ret);
qemu_vfree(out_data_buffer);
} else {
- fuse_write_response(exp->fuse_fd, req_id, out_hdr,
+ fuse_write_response(q->fuse_fd, req_id, out_hdr,
ret < 0 ? ret : 0,
ret < 0 ? 0 : ret);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 18/21] fuse: Implement multi-threading
2025-06-04 13:28 ` [PATCH v2 18/21] fuse: Implement multi-threading Hanna Czenczek
@ 2025-06-09 18:10 ` Stefan Hajnoczi
2025-06-27 1:08 ` Brian
1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 18:10 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
[-- Attachment #1: Type: text/plain, Size: 3820 bytes --]
On Wed, Jun 04, 2025 at 03:28:10PM +0200, Hanna Czenczek wrote:
> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>
> We can use this to implement multi-threading.
>
> For configuration, we don't need any more information beyond the simple
> array provided by the core block export interface: The FUSE kernel
> driver feeds these FDs in a round-robin fashion, so all of them are
> equivalent and we want to have exactly one per thread.
>
> These are the benchmark results when using four threads (compared to a
> single thread); note that fio still only uses a single job, but
> performance can still be improved because of said round-robin usage for
> the queues. (Not in the sync case, though, in which case I guess it
> just adds overhead.)
>
> file:
> read:
> seq aio: 264.8k ±0.8k (+120 %)
> rand aio: 143.8k ±0.4k (+ 27 %)
> seq sync: 49.9k ±0.5k (- 5 %)
> rand sync: 10.3k ±0.1k (- 1 %)
> write:
> seq aio: 226.6k ±2.1k (+184 %)
> rand aio: 225.9k ±1.8k (+186 %)
> seq sync: 36.9k ±0.6k (- 11 %)
> rand sync: 36.9k ±0.2k (- 11 %)
> null:
> read:
> seq aio: 315.2k ±11.0k (+18 %)
> rand aio: 300.5k ±10.8k (+14 %)
> seq sync: 114.2k ± 3.6k (-16 %)
> rand sync: 112.5k ± 2.8k (-16 %)
> write:
> seq aio: 222.6k ±6.8k (-21 %)
> rand aio: 220.5k ±6.8k (-23 %)
> seq sync: 117.2k ±3.7k (-18 %)
> rand sync: 116.3k ±4.4k (-18 %)
>
> (I don't know what's going on in the null-write AIO case, sorry.)
>
> Here's results for numjobs=4:
>
> "Before", i.e. without multithreading in QSD/FUSE (results compared to
> numjobs=1):
>
> file:
> read:
> seq aio: 104.7k ± 0.4k (- 13 %)
> rand aio: 111.5k ± 0.4k (- 2 %)
> seq sync: 71.0k ±13.8k (+ 36 %)
> rand sync: 41.4k ± 0.1k (+297 %)
> write:
> seq aio: 79.4k ±0.1k (- 1 %)
> rand aio: 78.6k ±0.1k (± 0 %)
> seq sync: 83.3k ±0.1k (+101 %)
> rand sync: 82.0k ±0.2k (+ 98 %)
> null:
> read:
> seq aio: 260.5k ±1.5k (- 2 %)
> rand aio: 260.1k ±1.4k (- 2 %)
> seq sync: 291.8k ±1.3k (+115 %)
> rand sync: 280.1k ±1.7k (+115 %)
> write:
> seq aio: 280.1k ±1.7k (± 0 %)
> rand aio: 279.5k ±1.4k (- 3 %)
> seq sync: 306.7k ±2.2k (+116 %)
> rand sync: 305.9k ±1.8k (+117 %)
>
> (As probably expected, little difference in the AIO case, but great
> improvements in the sync case because it kind of gives it an artificial
> iodepth of 4.)
>
> "After", i.e. with four threads in QSD/FUSE (now results compared to the
> above):
>
> file:
> read:
> seq aio: 193.3k ± 1.8k (+ 85 %)
> rand aio: 329.3k ± 0.3k (+195 %)
> seq sync: 66.2k ±13.0k (- 7 %)
> rand sync: 40.1k ± 0.0k (- 3 %)
> write:
> seq aio: 219.7k ±0.8k (+177 %)
> rand aio: 217.2k ±1.5k (+176 %)
> seq sync: 92.5k ±0.2k (+ 11 %)
> rand sync: 91.9k ±0.2k (+ 12 %)
> null:
> read:
> seq aio: 706.7k ±2.1k (+171 %)
> rand aio: 714.7k ±3.2k (+175 %)
> seq sync: 431.7k ±3.0k (+ 48 %)
> rand sync: 435.4k ±2.8k (+ 50 %)
> write:
> seq aio: 746.9k ±2.8k (+167 %)
> rand aio: 749.0k ±4.9k (+168 %)
> seq sync: 420.7k ±3.1k (+ 37 %)
> rand sync: 419.1k ±2.5k (+ 37 %)
>
> So this helps mainly for the AIO cases, but also in the null sync cases,
> because null is always CPU-bound, so more threads help.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 205 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 159 insertions(+), 46 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 18/21] fuse: Implement multi-threading
2025-06-04 13:28 ` [PATCH v2 18/21] fuse: Implement multi-threading Hanna Czenczek
2025-06-09 18:10 ` Stefan Hajnoczi
@ 2025-06-27 1:08 ` Brian
2025-07-01 7:31 ` Hanna Czenczek
1 sibling, 1 reply; 40+ messages in thread
From: Brian @ 2025-06-27 1:08 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Stefan Hajnoczi, Kevin Wolf, Markus Armbruster
On 6/4/25 9:28 AM, Hanna Czenczek wrote:
> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>
> We can use this to implement multi-threading.
>
> For configuration, we don't need any more information beyond the simple
> array provided by the core block export interface: The FUSE kernel
> driver feeds these FDs in a round-robin fashion, so all of them are
> equivalent and we want to have exactly one per thread.
>
> These are the benchmark results when using four threads (compared to a
> single thread); note that fio still only uses a single job, but
> performance can still be improved because of said round-robin usage for
> the queues. (Not in the sync case, though, in which case I guess it
> just adds overhead.)
>
> file:
> read:
> seq aio: 264.8k ±0.8k (+120 %)
> rand aio: 143.8k ±0.4k (+ 27 %)
> seq sync: 49.9k ±0.5k (- 5 %)
> rand sync: 10.3k ±0.1k (- 1 %)
> write:
> seq aio: 226.6k ±2.1k (+184 %)
> rand aio: 225.9k ±1.8k (+186 %)
> seq sync: 36.9k ±0.6k (- 11 %)
> rand sync: 36.9k ±0.2k (- 11 %)
> null:
> read:
> seq aio: 315.2k ±11.0k (+18 %)
> rand aio: 300.5k ±10.8k (+14 %)
> seq sync: 114.2k ± 3.6k (-16 %)
> rand sync: 112.5k ± 2.8k (-16 %)
> write:
> seq aio: 222.6k ±6.8k (-21 %)
> rand aio: 220.5k ±6.8k (-23 %)
> seq sync: 117.2k ±3.7k (-18 %)
> rand sync: 116.3k ±4.4k (-18 %)
>
> (I don't know what's going on in the null-write AIO case, sorry.)
>
> Here's results for numjobs=4:
>
> "Before", i.e. without multithreading in QSD/FUSE (results compared to
> numjobs=1):
>
> file:
> read:
> seq aio: 104.7k ± 0.4k (- 13 %)
> rand aio: 111.5k ± 0.4k (- 2 %)
> seq sync: 71.0k ±13.8k (+ 36 %)
> rand sync: 41.4k ± 0.1k (+297 %)
> write:
> seq aio: 79.4k ±0.1k (- 1 %)
> rand aio: 78.6k ±0.1k (± 0 %)
> seq sync: 83.3k ±0.1k (+101 %)
> rand sync: 82.0k ±0.2k (+ 98 %)
> null:
> read:
> seq aio: 260.5k ±1.5k (- 2 %)
> rand aio: 260.1k ±1.4k (- 2 %)
> seq sync: 291.8k ±1.3k (+115 %)
> rand sync: 280.1k ±1.7k (+115 %)
> write:
> seq aio: 280.1k ±1.7k (± 0 %)
> rand aio: 279.5k ±1.4k (- 3 %)
> seq sync: 306.7k ±2.2k (+116 %)
> rand sync: 305.9k ±1.8k (+117 %)
>
> (As probably expected, little difference in the AIO case, but great
> improvements in the sync case because it kind of gives it an artificial
> iodepth of 4.)
>
> "After", i.e. with four threads in QSD/FUSE (now results compared to the
> above):
>
> file:
> read:
> seq aio: 193.3k ± 1.8k (+ 85 %)
> rand aio: 329.3k ± 0.3k (+195 %)
> seq sync: 66.2k ±13.0k (- 7 %)
> rand sync: 40.1k ± 0.0k (- 3 %)
> write:
> seq aio: 219.7k ±0.8k (+177 %)
> rand aio: 217.2k ±1.5k (+176 %)
> seq sync: 92.5k ±0.2k (+ 11 %)
> rand sync: 91.9k ±0.2k (+ 12 %)
> null:
> read:
> seq aio: 706.7k ±2.1k (+171 %)
> rand aio: 714.7k ±3.2k (+175 %)
> seq sync: 431.7k ±3.0k (+ 48 %)
> rand sync: 435.4k ±2.8k (+ 50 %)
> write:
> seq aio: 746.9k ±2.8k (+167 %)
> rand aio: 749.0k ±4.9k (+168 %)
> seq sync: 420.7k ±3.1k (+ 37 %)
> rand sync: 419.1k ±2.5k (+ 37 %)
>
> So this helps mainly for the AIO cases, but also in the null sync cases,
> because null is always CPU-bound, so more threads help.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> block/export/fuse.c | 205 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 159 insertions(+), 46 deletions(-)
>
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 44f0b796b3..cdec31f2a8 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -31,11 +31,14 @@
> #include "qemu/error-report.h"
> #include "qemu/main-loop.h"
> #include "system/block-backend.h"
> +#include "system/block-backend.h"
> +#include "system/iothread.h"
>
> #include <fuse.h>
> #include <fuse_lowlevel.h>
>
> #include "standard-headers/linux/fuse.h"
> +#include <sys/ioctl.h>
>
> #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
> #include <linux/falloc.h>
> @@ -50,12 +53,17 @@
> /* Small enough to fit in the request buffer */
> #define FUSE_MAX_WRITE_BYTES (4 * 1024)
>
> -typedef struct FuseExport {
> - BlockExport common;
> +typedef struct FuseExport FuseExport;
>
> - struct fuse_session *fuse_session;
> - unsigned int in_flight; /* atomic */
> - bool mounted, fd_handler_set_up;
> +/*
> + * One FUSE "queue", representing one FUSE FD from which requests are fetched
> + * and processed. Each queue is tied to an AioContext.
> + */
> +typedef struct FuseQueue {
> + FuseExport *exp;
> +
> + AioContext *ctx;
> + int fuse_fd;
>
> /*
> * The request buffer must be able to hold a full write, and/or at least
> @@ -66,6 +74,14 @@ typedef struct FuseExport {
> FUSE_MAX_WRITE_BYTES,
> FUSE_MIN_READ_BUFFER
> )];
> +} FuseQueue;
> +
> +struct FuseExport {
> + BlockExport common;
> +
> + struct fuse_session *fuse_session;
> + unsigned int in_flight; /* atomic */
> + bool mounted, fd_handler_set_up;
>
> /*
> * Set when there was an unrecoverable error and no requests should be read
> @@ -74,7 +90,15 @@ typedef struct FuseExport {
> */
> bool halted;
>
> - int fuse_fd;
> + int num_queues;
> + FuseQueue *queues;
> + /*
> + * True if this export should follow the generic export's AioContext.
> + * Will be false if the queues' AioContexts have been explicitly set by the
> + * user, i.e. are expected to stay in those contexts.
> + * (I.e. is always false if there is more than one queue.)
> + */
> + bool follow_aio_context;
>
> char *mountpoint;
> bool writable;
> @@ -85,11 +109,11 @@ typedef struct FuseExport {
> mode_t st_mode;
> uid_t st_uid;
> gid_t st_gid;
> -} FuseExport;
> +};
>
> /* Parameters to the request processing coroutine */
> typedef struct FuseRequestCoParam {
> - FuseExport *exp;
> + FuseQueue *q;
> int got_request;
> } FuseRequestCoParam;
>
> @@ -102,11 +126,12 @@ static void fuse_export_halt(FuseExport *exp);
> static void init_exports_table(void);
>
> static int mount_fuse_export(FuseExport *exp, Error **errp);
> +static int clone_fuse_fd(int fd, Error **errp);
>
> static bool is_regular_file(const char *path, Error **errp);
>
> static void read_from_fuse_fd(void *opaque);
> -static void coroutine_fn fuse_co_process_request(FuseExport *exp);
> +static void coroutine_fn fuse_co_process_request(FuseQueue *q);
>
> static void fuse_inc_in_flight(FuseExport *exp)
> {
> @@ -136,8 +161,11 @@ static void fuse_attach_handlers(FuseExport *exp)
> return;
> }
>
> - aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
> - read_from_fuse_fd, NULL, NULL, NULL, exp);
> + for (int i = 0; i < exp->num_queues; i++) {
> + aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
> + read_from_fuse_fd, NULL, NULL, NULL,
> + &exp->queues[i]);
> + }
> exp->fd_handler_set_up = true;
> }
>
> @@ -146,8 +174,10 @@ static void fuse_attach_handlers(FuseExport *exp)
> */
> static void fuse_detach_handlers(FuseExport *exp)
> {
> - aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
> - NULL, NULL, NULL, NULL, NULL);
> + for (int i = 0; i < exp->num_queues; i++) {
> + aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
> + NULL, NULL, NULL, NULL, NULL);
> + }
> exp->fd_handler_set_up = false;
> }
>
> @@ -162,6 +192,11 @@ static void fuse_export_drained_end(void *opaque)
>
> /* Refresh AioContext in case it changed */
> exp->common.ctx = blk_get_aio_context(exp->common.blk);
> + if (exp->follow_aio_context) {
> + assert(exp->num_queues == 1);
> + exp->queues[0].ctx = exp->common.ctx;
> + }
> +
> fuse_attach_handlers(exp);
> }
>
> @@ -192,8 +227,32 @@ static int fuse_export_create(BlockExport *blk_exp,
> assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>
> if (multithread) {
> - error_setg(errp, "FUSE export does not support multi-threading");
> - return -EINVAL;
> + /* Guaranteed by common export code */
> + assert(mt_count >= 1);
> +
> + exp->follow_aio_context = false;
> + exp->num_queues = mt_count;
> + exp->queues = g_new(FuseQueue, mt_count);
> +
> + for (size_t i = 0; i < mt_count; i++) {
> + exp->queues[i] = (FuseQueue) {
> + .exp = exp,
> + .ctx = multithread[i],
> + .fuse_fd = -1,
> + };
> + }
> + } else {
> + /* Guaranteed by common export code */
> + assert(mt_count == 0);
> +
> + exp->follow_aio_context = true;
> + exp->num_queues = 1;
> + exp->queues = g_new(FuseQueue, 1);
> + exp->queues[0] = (FuseQueue) {
> + .exp = exp,
> + .ctx = exp->common.ctx,
> + .fuse_fd = -1,
> + };
> }
>
> /* For growable and writable exports, take the RESIZE permission */
> @@ -280,13 +339,23 @@ static int fuse_export_create(BlockExport *blk_exp,
>
> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>
> - exp->fuse_fd = fuse_session_fd(exp->fuse_session);
> - ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
> + assert(exp->num_queues >= 1);
> + exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
> + ret = qemu_fcntl_addfl(exp->queues[0].fuse_fd, O_NONBLOCK);
> if (ret < 0) {
> error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
> goto fail;
> }
>
> + for (int i = 1; i < exp->num_queues; i++) {
> + int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
> + if (fd < 0) {
> + ret = fd;
> + goto fail;
> + }
> + exp->queues[i].fuse_fd = fd;
> + }
> +
> fuse_attach_handlers(exp);
> return 0;
>
> @@ -359,9 +428,42 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> return 0;
> }
>
> +/**
> + * Clone the given /dev/fuse file descriptor, yielding a second FD from which
> + * requests can be pulled for the associated filesystem. Returns an FD on
> + * success, and -errno on error.
> + */
> +static int clone_fuse_fd(int fd, Error **errp)
> +{
> + uint32_t src_fd = fd;
> + int new_fd;
> + int ret;
> +
> + /*
> + * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
> + * (fuse_clone_chan()).
> + */
> + new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
> + if (new_fd < 0) {
> + ret = -errno;
> + error_setg_errno(errp, errno, "Failed to open /dev/fuse");
> + return ret;
> + }
> +
> + ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
> + if (ret < 0) {
> + ret = -errno;
> + error_setg_errno(errp, errno, "Failed to clone FUSE FD");
> + close(new_fd);
> + return ret;
> + }
> +
> + return new_fd;
> +}
> +
> /**
> * Try to read a single request from the FUSE FD.
> - * Takes a FuseExport pointer in `opaque`.
> + * Takes a FuseQueue pointer in `opaque`.
> *
> * Assumes the export's in-flight counter has already been incremented.
> *
> @@ -369,8 +471,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
> */
> static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> {
> - FuseExport *exp = opaque;
> - int fuse_fd = exp->fuse_fd;
> + FuseQueue *q = opaque;
> + int fuse_fd = q->fuse_fd;
> + FuseExport *exp = q->exp;
> ssize_t ret;
> const struct fuse_in_header *in_hdr;
>
> @@ -378,8 +481,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
> - sizeof(exp->request_buf)));
> + ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> if (ret < 0 && errno == EAGAIN) {
> /* No request available */
> goto no_request;
> @@ -397,7 +499,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - in_hdr = (const struct fuse_in_header *)exp->request_buf;
> + in_hdr = (const struct fuse_in_header *)q->request_buf;
> if (unlikely(ret != in_hdr->len)) {
> error_report("Number of bytes read from FUSE device does not match "
> "request size, expected %" PRIu32 " bytes, read %zi "
> @@ -408,7 +510,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - fuse_co_process_request(exp);
> + fuse_co_process_request(q);
>
> no_request:
> fuse_dec_in_flight(exp);
> @@ -417,16 +519,16 @@ no_request:
> /**
> * Try to read and process a single request from the FUSE FD.
> * (To be used as a handler for when the FUSE FD becomes readable.)
> - * Takes a FuseExport pointer in `opaque`.
> + * Takes a FuseQueue pointer in `opaque`.
> */
> static void read_from_fuse_fd(void *opaque)
> {
> - FuseExport *exp = opaque;
> + FuseQueue *q = opaque;
> Coroutine *co;
>
> - co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
> + co = qemu_coroutine_create(co_read_from_fuse_fd, q);
> /* Decremented by co_read_from_fuse_fd() */
> - fuse_inc_in_flight(exp);
> + fuse_inc_in_flight(q->exp);
> qemu_coroutine_enter(co);
> }
>
> @@ -451,6 +553,16 @@ static void fuse_export_delete(BlockExport *blk_exp)
> {
> FuseExport *exp = container_of(blk_exp, FuseExport, common);
>
> + for (int i = 0; i < exp->num_queues; i++) {
> + FuseQueue *q = &exp->queues[i];
> +
> + /* Queue 0's FD belongs to the FUSE session */
> + if (i > 0 && q->fuse_fd >= 0) {
> + close(q->fuse_fd);
> + }
> + }
> + g_free(exp->queues);
> +
> if (exp->fuse_session) {
> if (exp->mounted) {
> fuse_session_unmount(exp->fuse_session);
> @@ -1108,23 +1220,23 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> /*
> * For use in fuse_co_process_request():
> * Returns a pointer to the parameter object for the given operation (inside of
> - * exp->request_buf, which is assumed to hold a fuse_in_header first).
> - * Verifies that the object is complete (exp->request_buf is large enough to
> + * q->request_buf, which is assumed to hold a fuse_in_header first).
> + * Verifies that the object is complete (q->request_buf is large enough to
> * hold it in one piece, and the request length includes the whole object).
> *
> - * Note that exp->request_buf may be overwritten after yielding, so the returned
> + * Note that q->request_buf may be overwritten after yielding, so the returned
> * pointer must not be used across a function that may yield!
> */
> -#define FUSE_IN_OP_STRUCT(op_name, export) \
> +#define FUSE_IN_OP_STRUCT(op_name, queue) \
Should `q` here actually be `queue` (i.e. the macro's second parameter)?
As written, the code only works because the argument callers pass as the
second parameter happens to be named `q` in the calling function.
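A hypothetical, standalone illustration of that mismatch (not QEMU code,
names invented): the macro parameter is `queue`, but the body refers to `q`,
so the expansion is only correct when the caller's argument is spelled `q`:
  #include <stdio.h>
  typedef struct Queue {
      int request_buf[4];
  } Queue;
  /* Parameter is named `queue`, but the body uses `q` */
  #define FIRST_WORD(queue) ((q)->request_buf[0])
  int main(void)
  {
      Queue queue_obj = { .request_buf = { 42 } };
      Queue *q = &queue_obj;          /* works only because this is named `q` */
      printf("%d\n", FIRST_WORD(q));  /* expands to ((q)->request_buf[0]) */
      return 0;
  }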
> ({ \
> const struct fuse_in_header *__in_hdr = \
> - (const struct fuse_in_header *)(export)->request_buf; \
> + (const struct fuse_in_header *)(q)->request_buf; \
The same issue here.
> const struct fuse_##op_name##_in *__in = \
> (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
> const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
> uint32_t __req_len; \
> \
> - QEMU_BUILD_BUG_ON(sizeof((export)->request_buf) < __param_len); \
> + QEMU_BUILD_BUG_ON(sizeof((q)->request_buf) < __param_len); \
> \
> __req_len = __in_hdr->len; \
> if (__req_len < __param_len) { \
> @@ -1161,11 +1273,12 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> * Process a FUSE request, incl. writing the response.
> *
> * Note that yielding in any request-processing function can overwrite the
> - * contents of exp->request_buf. Anything that takes a buffer needs to take
> + * contents of q->request_buf. Anything that takes a buffer needs to take
> * care that the content is copied before yielding.
> */
> -static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> +static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> {
> + FuseExport *exp = q->exp;
> uint32_t opcode;
> uint64_t req_id;
> /*
> @@ -1188,7 +1301,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> /* Limit scope to ensure pointer is no longer used after yielding */
> {
> const struct fuse_in_header *in_hdr =
> - (const struct fuse_in_header *)exp->request_buf;
> + (const struct fuse_in_header *)q->request_buf;
>
> opcode = in_hdr->opcode;
> req_id = in_hdr->unique;
> @@ -1196,7 +1309,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
>
> switch (opcode) {
> case FUSE_INIT: {
> - const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, exp);
> + const struct fuse_init_in *in = FUSE_IN_OP_STRUCT(init, q);
> ret = fuse_co_init(exp, FUSE_OUT_OP_STRUCT(init, out_buf),
> in->max_readahead, in->flags);
> break;
> @@ -1219,23 +1332,23 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> break;
>
> case FUSE_SETATTR: {
> - const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, exp);
> + const struct fuse_setattr_in *in = FUSE_IN_OP_STRUCT(setattr, q);
> ret = fuse_co_setattr(exp, FUSE_OUT_OP_STRUCT(attr, out_buf),
> in->valid, in->size, in->mode, in->uid, in->gid);
> break;
> }
>
> case FUSE_READ: {
> - const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, exp);
> + const struct fuse_read_in *in = FUSE_IN_OP_STRUCT(read, q);
> ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
> break;
> }
>
> case FUSE_WRITE: {
> - const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, exp);
> + const struct fuse_write_in *in = FUSE_IN_OP_STRUCT(write, q);
> uint32_t req_len;
>
> - req_len = ((const struct fuse_in_header *)exp->request_buf)->len;
> + req_len = ((const struct fuse_in_header *)q->request_buf)->len;
> if (unlikely(req_len < sizeof(struct fuse_in_header) + sizeof(*in) +
> in->size)) {
> warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
> @@ -1264,7 +1377,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> }
>
> case FUSE_FALLOCATE: {
> - const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, exp);
> + const struct fuse_fallocate_in *in = FUSE_IN_OP_STRUCT(fallocate, q);
> ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
> break;
> }
> @@ -1279,7 +1392,7 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
>
> #ifdef CONFIG_FUSE_LSEEK
> case FUSE_LSEEK: {
> - const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, exp);
> + const struct fuse_lseek_in *in = FUSE_IN_OP_STRUCT(lseek, q);
> ret = fuse_co_lseek(exp, FUSE_OUT_OP_STRUCT(lseek, out_buf),
> in->offset, in->whence);
> break;
> @@ -1293,11 +1406,11 @@ static void coroutine_fn fuse_co_process_request(FuseExport *exp)
> /* Ignore errors from fuse_write*(), nothing we can do anyway */
> if (out_data_buffer) {
> assert(ret >= 0);
> - fuse_write_buf_response(exp->fuse_fd, req_id, out_hdr,
> + fuse_write_buf_response(q->fuse_fd, req_id, out_hdr,
> out_data_buffer, ret);
> qemu_vfree(out_data_buffer);
> } else {
> - fuse_write_response(exp->fuse_fd, req_id, out_hdr,
> + fuse_write_response(q->fuse_fd, req_id, out_hdr,
> ret < 0 ? ret : 0,
> ret < 0 ? 0 : ret);
> }
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 18/21] fuse: Implement multi-threading
2025-06-27 1:08 ` Brian
@ 2025-07-01 7:31 ` Hanna Czenczek
0 siblings, 0 replies; 40+ messages in thread
From: Hanna Czenczek @ 2025-07-01 7:31 UTC (permalink / raw)
To: Brian, qemu-block
Cc: qemu-devel, Stefan Hajnoczi, Kevin Wolf, Markus Armbruster
On 27.06.25 03:08, Brian wrote:
>
>
> On 6/4/25 9:28 AM, Hanna Czenczek wrote:
>> FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
>> (via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).
>>
>> We can use this to implement multi-threading.
>>
>> For configuration, we don't need any more information beyond the simple
>> array provided by the core block export interface: The FUSE kernel
>> driver feeds these FDs in a round-robin fashion, so all of them are
>> equivalent and we want to have exactly one per thread.
>>
>> These are the benchmark results when using four threads (compared to a
>> single thread); note that fio still only uses a single job, but
>> performance can still be improved because of said round-robin usage for
>> the queues. (Not in the sync case, though, in which case I guess it
>> just adds overhead.)
>>
>> file:
>> read:
>> seq aio: 264.8k ±0.8k (+120 %)
>> rand aio: 143.8k ±0.4k (+ 27 %)
>> seq sync: 49.9k ±0.5k (- 5 %)
>> rand sync: 10.3k ±0.1k (- 1 %)
>> write:
>> seq aio: 226.6k ±2.1k (+184 %)
>> rand aio: 225.9k ±1.8k (+186 %)
>> seq sync: 36.9k ±0.6k (- 11 %)
>> rand sync: 36.9k ±0.2k (- 11 %)
>> null:
>> read:
>> seq aio: 315.2k ±11.0k (+18 %)
>> rand aio: 300.5k ±10.8k (+14 %)
>> seq sync: 114.2k ± 3.6k (-16 %)
>> rand sync: 112.5k ± 2.8k (-16 %)
>> write:
>> seq aio: 222.6k ±6.8k (-21 %)
>> rand aio: 220.5k ±6.8k (-23 %)
>> seq sync: 117.2k ±3.7k (-18 %)
>> rand sync: 116.3k ±4.4k (-18 %)
>>
>> (I don't know what's going on in the null-write AIO case, sorry.)
>>
>> Here's results for numjobs=4:
>>
>> "Before", i.e. without multithreading in QSD/FUSE (results compared to
>> numjobs=1):
>>
>> file:
>> read:
>> seq aio: 104.7k ± 0.4k (- 13 %)
>> rand aio: 111.5k ± 0.4k (- 2 %)
>> seq sync: 71.0k ±13.8k (+ 36 %)
>> rand sync: 41.4k ± 0.1k (+297 %)
>> write:
>> seq aio: 79.4k ±0.1k (- 1 %)
>> rand aio: 78.6k ±0.1k (± 0 %)
>> seq sync: 83.3k ±0.1k (+101 %)
>> rand sync: 82.0k ±0.2k (+ 98 %)
>> null:
>> read:
>> seq aio: 260.5k ±1.5k (- 2 %)
>> rand aio: 260.1k ±1.4k (- 2 %)
>> seq sync: 291.8k ±1.3k (+115 %)
>> rand sync: 280.1k ±1.7k (+115 %)
>> write:
>> seq aio: 280.1k ±1.7k (± 0 %)
>> rand aio: 279.5k ±1.4k (- 3 %)
>> seq sync: 306.7k ±2.2k (+116 %)
>> rand sync: 305.9k ±1.8k (+117 %)
>>
>> (As probably expected, little difference in the AIO case, but great
>> improvements in the sync case because it kind of gives it an artificial
>> iodepth of 4.)
>>
>> "After", i.e. with four threads in QSD/FUSE (now results compared to the
>> above):
>>
>> file:
>> read:
>> seq aio: 193.3k ± 1.8k (+ 85 %)
>> rand aio: 329.3k ± 0.3k (+195 %)
>> seq sync: 66.2k ±13.0k (- 7 %)
>> rand sync: 40.1k ± 0.0k (- 3 %)
>> write:
>> seq aio: 219.7k ±0.8k (+177 %)
>> rand aio: 217.2k ±1.5k (+176 %)
>> seq sync: 92.5k ±0.2k (+ 11 %)
>> rand sync: 91.9k ±0.2k (+ 12 %)
>> null:
>> read:
>> seq aio: 706.7k ±2.1k (+171 %)
>> rand aio: 714.7k ±3.2k (+175 %)
>> seq sync: 431.7k ±3.0k (+ 48 %)
>> rand sync: 435.4k ±2.8k (+ 50 %)
>> write:
>> seq aio: 746.9k ±2.8k (+167 %)
>> rand aio: 749.0k ±4.9k (+168 %)
>> seq sync: 420.7k ±3.1k (+ 37 %)
>> rand sync: 419.1k ±2.5k (+ 37 %)
>>
>> So this helps mainly for the AIO cases, but also in the null sync cases,
>> because null is always CPU-bound, so more threads help.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>> block/export/fuse.c | 205 ++++++++++++++++++++++++++++++++++----------
>> 1 file changed, 159 insertions(+), 46 deletions(-)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index 44f0b796b3..cdec31f2a8 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -31,11 +31,14 @@
>> #include "qemu/error-report.h"
>> #include "qemu/main-loop.h"
>> #include "system/block-backend.h"
>> +#include "system/block-backend.h"
>> +#include "system/iothread.h"
>> #include <fuse.h>
>> #include <fuse_lowlevel.h>
>> #include "standard-headers/linux/fuse.h"
>> +#include <sys/ioctl.h>
>> #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
>> #include <linux/falloc.h>
>> @@ -50,12 +53,17 @@
>> /* Small enough to fit in the request buffer */
>> #define FUSE_MAX_WRITE_BYTES (4 * 1024)
>> -typedef struct FuseExport {
>> - BlockExport common;
>> +typedef struct FuseExport FuseExport;
>> - struct fuse_session *fuse_session;
>> - unsigned int in_flight; /* atomic */
>> - bool mounted, fd_handler_set_up;
>> +/*
>> + * One FUSE "queue", representing one FUSE FD from which requests are fetched
>> + * and processed. Each queue is tied to an AioContext.
>> + */
>> +typedef struct FuseQueue {
>> + FuseExport *exp;
>> +
>> + AioContext *ctx;
>> + int fuse_fd;
>> /*
>>      * The request buffer must be able to hold a full write, and/or at least
>> @@ -66,6 +74,14 @@ typedef struct FuseExport {
>> FUSE_MAX_WRITE_BYTES,
>> FUSE_MIN_READ_BUFFER
>> )];
>> +} FuseQueue;
>> +
>> +struct FuseExport {
>> + BlockExport common;
>> +
>> + struct fuse_session *fuse_session;
>> + unsigned int in_flight; /* atomic */
>> + bool mounted, fd_handler_set_up;
>> /*
>>      * Set when there was an unrecoverable error and no requests should be read
>> @@ -74,7 +90,15 @@ typedef struct FuseExport {
>> */
>> bool halted;
>> - int fuse_fd;
>> + int num_queues;
>> + FuseQueue *queues;
>> + /*
>> +     * True if this export should follow the generic export's AioContext.
>> +     * Will be false if the queues' AioContexts have been explicitly set by the
>> + * user, i.e. are expected to stay in those contexts.
>> + * (I.e. is always false if there is more than one queue.)
>> + */
>> + bool follow_aio_context;
>> char *mountpoint;
>> bool writable;
>> @@ -85,11 +109,11 @@ typedef struct FuseExport {
>> mode_t st_mode;
>> uid_t st_uid;
>> gid_t st_gid;
>> -} FuseExport;
>> +};
>> /* Parameters to the request processing coroutine */
>> typedef struct FuseRequestCoParam {
>> - FuseExport *exp;
>> + FuseQueue *q;
>> int got_request;
>> } FuseRequestCoParam;
>> @@ -102,11 +126,12 @@ static void fuse_export_halt(FuseExport *exp);
>> static void init_exports_table(void);
>> static int mount_fuse_export(FuseExport *exp, Error **errp);
>> +static int clone_fuse_fd(int fd, Error **errp);
>> static bool is_regular_file(const char *path, Error **errp);
>> static void read_from_fuse_fd(void *opaque);
>> -static void coroutine_fn fuse_co_process_request(FuseExport *exp);
>> +static void coroutine_fn fuse_co_process_request(FuseQueue *q);
>> static void fuse_inc_in_flight(FuseExport *exp)
>> {
>> @@ -136,8 +161,11 @@ static void fuse_attach_handlers(FuseExport *exp)
>> return;
>> }
>> - aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
>> - read_from_fuse_fd, NULL, NULL, NULL, exp);
>> + for (int i = 0; i < exp->num_queues; i++) {
>> + aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>> + read_from_fuse_fd, NULL, NULL, NULL,
>> + &exp->queues[i]);
>> + }
>> exp->fd_handler_set_up = true;
>> }
>> @@ -146,8 +174,10 @@ static void fuse_attach_handlers(FuseExport *exp)
>> */
>> static void fuse_detach_handlers(FuseExport *exp)
>> {
>> - aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
>> - NULL, NULL, NULL, NULL, NULL);
>> + for (int i = 0; i < exp->num_queues; i++) {
>> + aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
>> + NULL, NULL, NULL, NULL, NULL);
>> + }
>> exp->fd_handler_set_up = false;
>> }
>> @@ -162,6 +192,11 @@ static void fuse_export_drained_end(void *opaque)
>> /* Refresh AioContext in case it changed */
>> exp->common.ctx = blk_get_aio_context(exp->common.blk);
>> + if (exp->follow_aio_context) {
>> + assert(exp->num_queues == 1);
>> + exp->queues[0].ctx = exp->common.ctx;
>> + }
>> +
>> fuse_attach_handlers(exp);
>> }
>> @@ -192,8 +227,32 @@ static int fuse_export_create(BlockExport *blk_exp,
>> assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
>> if (multithread) {
>> -        error_setg(errp, "FUSE export does not support multi-threading");
>> - return -EINVAL;
>> + /* Guaranteed by common export code */
>> + assert(mt_count >= 1);
>> +
>> + exp->follow_aio_context = false;
>> + exp->num_queues = mt_count;
>> + exp->queues = g_new(FuseQueue, mt_count);
>> +
>> + for (size_t i = 0; i < mt_count; i++) {
>> + exp->queues[i] = (FuseQueue) {
>> + .exp = exp,
>> + .ctx = multithread[i],
>> + .fuse_fd = -1,
>> + };
>> + }
>> + } else {
>> + /* Guaranteed by common export code */
>> + assert(mt_count == 0);
>> +
>> + exp->follow_aio_context = true;
>> + exp->num_queues = 1;
>> + exp->queues = g_new(FuseQueue, 1);
>> + exp->queues[0] = (FuseQueue) {
>> + .exp = exp,
>> + .ctx = exp->common.ctx,
>> + .fuse_fd = -1,
>> + };
>> }
>>      /* For growable and writable exports, take the RESIZE permission */
>> @@ -280,13 +339,23 @@ static int fuse_export_create(BlockExport *blk_exp,
>> g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
>> - exp->fuse_fd = fuse_session_fd(exp->fuse_session);
>> - ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
>> + assert(exp->num_queues >= 1);
>> + exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
>> + ret = qemu_fcntl_addfl(exp->queues[0].fuse_fd, O_NONBLOCK);
>> if (ret < 0) {
>>          error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
>> goto fail;
>> }
>> + for (int i = 1; i < exp->num_queues; i++) {
>> + int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
>> + if (fd < 0) {
>> + ret = fd;
>> + goto fail;
>> + }
>> + exp->queues[i].fuse_fd = fd;
>> + }
>> +
>> fuse_attach_handlers(exp);
>> return 0;
>> @@ -359,9 +428,42 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
>> return 0;
>> }
>> +/**
>> + * Clone the given /dev/fuse file descriptor, yielding a second FD from which
>> + * requests can be pulled for the associated filesystem.  Returns an FD on
>> + * success, and -errno on error.
>> + */
>> +static int clone_fuse_fd(int fd, Error **errp)
>> +{
>> + uint32_t src_fd = fd;
>> + int new_fd;
>> + int ret;
>> +
>> + /*
>> + * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
>> + * (fuse_clone_chan()).
>> + */
>> + new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
>> + if (new_fd < 0) {
>> + ret = -errno;
>> + error_setg_errno(errp, errno, "Failed to open /dev/fuse");
>> + return ret;
>> + }
>> +
>> + ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
>> + if (ret < 0) {
>> + ret = -errno;
>> + error_setg_errno(errp, errno, "Failed to clone FUSE FD");
>> + close(new_fd);
>> + return ret;
>> + }
>> +
>> + return new_fd;
>> +}
>> +
>> /**
>> * Try to read a single request from the FUSE FD.
>> - * Takes a FuseExport pointer in `opaque`.
>> + * Takes a FuseQueue pointer in `opaque`.
>> *
>>  * Assumes the export's in-flight counter has already been incremented.
>> *
>> @@ -369,8 +471,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
>> */
>> static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>> {
>> - FuseExport *exp = opaque;
>> - int fuse_fd = exp->fuse_fd;
>> + FuseQueue *q = opaque;
>> + int fuse_fd = q->fuse_fd;
>> + FuseExport *exp = q->exp;
>> ssize_t ret;
>> const struct fuse_in_header *in_hdr;
>> @@ -378,8 +481,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>> goto no_request;
>> }
>> - ret = RETRY_ON_EINTR(read(fuse_fd, exp->request_buf,
>> - sizeof(exp->request_buf)));
>> +    ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
>> if (ret < 0 && errno == EAGAIN) {
>> /* No request available */
>> goto no_request;
>> @@ -397,7 +499,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>> goto no_request;
>> }
>> - in_hdr = (const struct fuse_in_header *)exp->request_buf;
>> + in_hdr = (const struct fuse_in_header *)q->request_buf;
>> if (unlikely(ret != in_hdr->len)) {
>>         error_report("Number of bytes read from FUSE device does not match "
>>                      "request size, expected %" PRIu32 " bytes, read %zi "
>> @@ -408,7 +510,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
>> goto no_request;
>> }
>> - fuse_co_process_request(exp);
>> + fuse_co_process_request(q);
>> no_request:
>> fuse_dec_in_flight(exp);
>> @@ -417,16 +519,16 @@ no_request:
>> /**
>> * Try to read and process a single request from the FUSE FD.
>> * (To be used as a handler for when the FUSE FD becomes readable.)
>> - * Takes a FuseExport pointer in `opaque`.
>> + * Takes a FuseQueue pointer in `opaque`.
>> */
>> static void read_from_fuse_fd(void *opaque)
>> {
>> - FuseExport *exp = opaque;
>> + FuseQueue *q = opaque;
>> Coroutine *co;
>> - co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
>> + co = qemu_coroutine_create(co_read_from_fuse_fd, q);
>> /* Decremented by co_read_from_fuse_fd() */
>> - fuse_inc_in_flight(exp);
>> + fuse_inc_in_flight(q->exp);
>> qemu_coroutine_enter(co);
>> }
>> @@ -451,6 +553,16 @@ static void fuse_export_delete(BlockExport *blk_exp)
>> {
>> FuseExport *exp = container_of(blk_exp, FuseExport, common);
>> + for (int i = 0; i < exp->num_queues; i++) {
>> + FuseQueue *q = &exp->queues[i];
>> +
>> + /* Queue 0's FD belongs to the FUSE session */
>> + if (i > 0 && q->fuse_fd >= 0) {
>> + close(q->fuse_fd);
>> + }
>> + }
>> + g_free(exp->queues);
>> +
>> if (exp->fuse_session) {
>> if (exp->mounted) {
>> fuse_session_unmount(exp->fuse_session);
>> @@ -1108,23 +1220,23 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
>> /*
>> * For use in fuse_co_process_request():
>>  * Returns a pointer to the parameter object for the given operation (inside of
>> - * exp->request_buf, which is assumed to hold a fuse_in_header first).
>> - * Verifies that the object is complete (exp->request_buf is large enough to
>> + * q->request_buf, which is assumed to hold a fuse_in_header first).
>> + * Verifies that the object is complete (q->request_buf is large enough to
>>  * hold it in one piece, and the request length includes the whole object).
>>  *
>> - * Note that exp->request_buf may be overwritten after yielding, so the returned
>> + * Note that q->request_buf may be overwritten after yielding, so the returned
>> * pointer must not be used across a function that may yield!
>> */
>> -#define FUSE_IN_OP_STRUCT(op_name, export) \
>> +#define FUSE_IN_OP_STRUCT(op_name, queue) \
>
> Should `q` actually be `queue` (i.e. the second parameter)?
Right!  (And in the version before this patch, the comment should have said
`export`, not `exp`.)
Thanks, I’ll fix it.
Hanna
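For illustration, a minimal sketch of the fix under discussion (not an actual
follow-up patch): the parameter name is just used consistently, the rest of
the macro body stays exactly as in the patch.

#define FUSE_IN_OP_STRUCT(op_name, queue) \
    ({ \
        const struct fuse_in_header *__in_hdr = \
            (const struct fuse_in_header *)(queue)->request_buf; \
        const struct fuse_##op_name##_in *__in = \
            (const struct fuse_##op_name##_in *)(__in_hdr + 1); \
        const size_t __param_len = sizeof(*__in_hdr) + sizeof(*__in); \
        /* ... */ \
        QEMU_BUILD_BUG_ON(sizeof((queue)->request_buf) < __param_len); \
        /* ... length check against __in_hdr->len as in the patch ... */ \
    })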
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 19/21] qapi/block-export: Document FUSE's multi-threading
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (17 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 18/21] fuse: Implement multi-threading Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-04 13:58 ` Markus Armbruster
2025-06-04 13:28 ` [PATCH v2 20/21] iotests/308: Add multi-threading sanity test Hanna Czenczek
` (2 subsequent siblings)
21 siblings, 1 reply; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Document for users that FUSE's multi-threading implementation
distributes requests in a round-robin manner, regardless of where they
originate from.
As noted by Stefan, this will probably change with a FUSE-over-io_uring
implementation (which is supposed to have CPU affinity), but documenting
that is left for once that is done.
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
qapi/block-export.json | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/qapi/block-export.json b/qapi/block-export.json
index 3ebad4ecef..f30690f54c 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -163,6 +163,11 @@
# Options for exporting a block graph node on some (file) mountpoint
# as a raw image.
#
+# Multi-threading note: The FUSE export supports multi-threading.
+# Currently, requests are distributed across these threads in a
+# round-robin fashion, i.e. independently of the CPU core from which a
+# request originates.
+#
# @mountpoint: Path on which to export the block device via FUSE.
# This must point to an existing regular file.
#
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 19/21] qapi/block-export: Document FUSE's multi-threading
2025-06-04 13:28 ` [PATCH v2 19/21] qapi/block-export: Document FUSE's multi-threading Hanna Czenczek
@ 2025-06-04 13:58 ` Markus Armbruster
0 siblings, 0 replies; 40+ messages in thread
From: Markus Armbruster @ 2025-06-04 13:58 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Stefan Hajnoczi, Kevin Wolf, Brian Song
Hanna Czenczek <hreitz@redhat.com> writes:
> Document for users that FUSE's multi-threading implementation
> distributes requests in a round-robin manner, regardless of where they
> originate from.
>
> As noted by Stefan, this will probably change with a FUSE-over-io_uring
> implementation (which is supposed to have CPU affinity), but documenting
> that is left for once that is done.
>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 20/21] iotests/308: Add multi-threading sanity test
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (18 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 19/21] qapi/block-export: Document FUSE's multi-threading Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-09 18:12 ` Stefan Hajnoczi
2025-06-04 13:28 ` [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
2025-06-09 18:14 ` [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Stefan Hajnoczi
21 siblings, 1 reply; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
Run qemu-img bench on a simple multi-threaded FUSE export to test that
it works.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
tests/qemu-iotests/308 | 51 ++++++++++++++++++++++++++++++++++
tests/qemu-iotests/308.out | 56 ++++++++++++++++++++++++++++++++++++++
2 files changed, 107 insertions(+)
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 033d5cbe22..2960412285 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -415,6 +415,57 @@ $QEMU_IO -c 'read -P 0 0 64M' "$TEST_IMG" | _filter_qemu_io
_cleanup_test_img
+echo
+echo '=== Multi-threading ==='
+
+# Just set up a null block device, export it (with multi-threading), and run
+# qemu-img bench on it (to get parallel requests)
+
+_launch_qemu
+_send_qemu_cmd $QEMU_HANDLE \
+ "{'execute': 'qmp_capabilities'}" \
+ 'return'
+
+_send_qemu_cmd $QEMU_HANDLE \
+ "{'execute': 'blockdev-add',
+ 'arguments': {
+ 'driver': 'null-co',
+ 'node-name': 'null'
+ } }" \
+ 'return'
+
+for id in iothread{0,1,2,3}; do
+ _send_qemu_cmd $QEMU_HANDLE \
+ "{'execute': 'object-add',
+ 'arguments': {
+ 'qom-type': 'iothread',
+ 'id': '$id'
+ } }" \
+ 'return'
+done
+
+echo
+
+iothreads="['iothread0', 'iothread1', 'iothread2', 'iothread3']"
+fuse_export_add \
+ 'export' \
+ "'mountpoint': '$EXT_MP', 'iothread': $iothreads" \
+ 'return' \
+ 'null'
+
+echo
+$QEMU_IMG bench -f raw "$EXT_MP" |
+ sed -e 's/[0-9.]\+ seconds/X.XXX seconds/'
+echo
+
+fuse_export_del 'export'
+
+_send_qemu_cmd $QEMU_HANDLE \
+ "{'execute': 'quit'}" \
+ 'return'
+
+wait=yes _cleanup_qemu
+
# success, all done
echo "*** done"
rm -f $seq.full
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 2d7a38d63d..04e6913c5c 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -207,4 +207,60 @@ read 67108864/67108864 bytes at offset 0
{"return": {}}
read 67108864/67108864 bytes at offset 0
64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Multi-threading ===
+{'execute': 'qmp_capabilities'}
+{"return": {}}
+{'execute': 'blockdev-add',
+ 'arguments': {
+ 'driver': 'null-co',
+ 'node-name': 'null'
+ } }
+{"return": {}}
+{'execute': 'object-add',
+ 'arguments': {
+ 'qom-type': 'iothread',
+ 'id': 'iothread0'
+ } }
+{"return": {}}
+{'execute': 'object-add',
+ 'arguments': {
+ 'qom-type': 'iothread',
+ 'id': 'iothread1'
+ } }
+{"return": {}}
+{'execute': 'object-add',
+ 'arguments': {
+ 'qom-type': 'iothread',
+ 'id': 'iothread2'
+ } }
+{"return": {}}
+{'execute': 'object-add',
+ 'arguments': {
+ 'qom-type': 'iothread',
+ 'id': 'iothread3'
+ } }
+{"return": {}}
+
+{'execute': 'block-export-add',
+ 'arguments': {
+ 'type': 'fuse',
+ 'id': 'export',
+ 'node-name': 'null',
+ 'mountpoint': 'TEST_DIR/t.IMGFMT.fuse', 'iothread': ['iothread0', 'iothread1', 'iothread2', 'iothread3']
+ } }
+{"return": {}}
+
+Sending 75000 read requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 4096)
+Run completed in X.XXX seconds.
+
+{'execute': 'block-export-del',
+ 'arguments': {
+ 'id': 'export'
+ } }
+{"return": {}}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "BLOCK_EXPORT_DELETED", "data": {"id": "export"}}
+{'execute': 'quit'}
+{"timestamp": {"seconds": TIMESTAMP, "microseconds": TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
+{"return": {}}
*** done
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 20/21] iotests/308: Add multi-threading sanity test
2025-06-04 13:28 ` [PATCH v2 20/21] iotests/308: Add multi-threading sanity test Hanna Czenczek
@ 2025-06-09 18:12 ` Stefan Hajnoczi
0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 18:12 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:28:12PM +0200, Hanna Czenczek wrote:
> Run qemu-img bench on a simple multi-threaded FUSE export to test that
> it works.
>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
> tests/qemu-iotests/308 | 51 ++++++++++++++++++++++++++++++++++
> tests/qemu-iotests/308.out | 56 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 107 insertions(+)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (19 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 20/21] iotests/308: Add multi-threading sanity test Hanna Czenczek
@ 2025-06-04 13:28 ` Hanna Czenczek
2025-06-10 23:37 ` Brian
2025-06-09 18:14 ` [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Stefan Hajnoczi
21 siblings, 1 reply; 40+ messages in thread
From: Hanna Czenczek @ 2025-06-04 13:28 UTC (permalink / raw)
To: qemu-block
Cc: qemu-devel, Hanna Czenczek, Stefan Hajnoczi, Kevin Wolf,
Markus Armbruster, Brian Song
We probably want to support larger write sizes than just 4k; 64k seems
nice. However, we cannot read partial requests from the FUSE FD, we
always have to read requests in full; so our read buffer must be large
enough to accommodate potential 64k writes if we want to support that.
Always allocating FuseRequest objects with 64k buffers in them seems
wasteful, though. But we can get around the issue by splitting the
buffer into two and using readv(): One part will hold all normal (up to
4k) write requests and all other requests, and a second part (the
"spill-over buffer") will be used only for larger write requests. Each
FuseQueue has its own spill-over buffer, and only if we find it used
when reading a request will we move its ownership into the FuseRequest
object and allocate a new spill-over buffer for the queue.
This way, we get to support "large" write sizes without having to
allocate big buffers when they aren't used.
Also, this even reduces the size of the FuseRequest objects because the
read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
the requests we support are not quite so large (except for >4k writes),
so until now, we basically had to have useless padding in there.
With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
is easily met and we can decrease the size of the buffer portion that is
right inside of FuseRequest.
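To illustrate the mechanism described above, here is a condensed sketch of
what the read path in this patch boils down to (names as introduced by the
patch below; this is a summary, not additional code):

    void *spillover_buf = NULL;
    struct iovec iov[2] = {
        { .iov_base = q->request_buf,   .iov_len = sizeof(q->request_buf) },
        { .iov_base = q->spillover_buf, .iov_len = FUSE_SPILLOVER_BUF_SIZE },
    };
    ssize_t len = RETRY_ON_EINTR(readv(q->fuse_fd, iov, ARRAY_SIZE(iov)));

    if (len > (ssize_t)sizeof(q->request_buf)) {
        /* Request spilled over: hand the buffer to this request and leave
         * NULL behind so the next read allocates a fresh one */
        spillover_buf = q->spillover_buf;
        q->spillover_buf = NULL;
    }
    fuse_co_process_request(q, spillover_buf);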
As for benchmarks, the benefit of this patch can be shown easily by
writing a 4G image (with qemu-img convert) to a FUSE export:
- Before this patch: Takes 25.6 s (14.4 s with -t none)
- After this patch: Takes 4.5 s (5.5 s with -t none)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 118 insertions(+), 19 deletions(-)
diff --git a/block/export/fuse.c b/block/export/fuse.c
index cdec31f2a8..908266d101 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -50,8 +50,17 @@
/* Prevent overly long bounce buffer allocations */
#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
-/* Small enough to fit in the request buffer */
-#define FUSE_MAX_WRITE_BYTES (4 * 1024)
+/*
+ * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
+ * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
+ * determine the split between the size of the in-place buffer in FuseRequest
+ * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
+ * detailed explanation.
+ */
+#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
+#define FUSE_MAX_WRITE_BYTES (64 * 1024)
+#define FUSE_SPILLOVER_BUF_SIZE \
+ (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
typedef struct FuseExport FuseExport;
@@ -67,15 +76,49 @@ typedef struct FuseQueue {
/*
* The request buffer must be able to hold a full write, and/or at least
- * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
+ * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
+ * This however is just the first part of the buffer; every read is given
+ * a vector of this buffer (which should be enough for all normal requests,
+ * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
+ * spill-over buffer below.
+ * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
+ * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
+ */
+ char request_buf[sizeof(struct fuse_in_header) +
+ sizeof(struct fuse_write_in) +
+ FUSE_IN_PLACE_WRITE_BYTES];
+
+ /*
+ * When retrieving a FUSE request, the destination buffer must always be
+ * sufficiently large for the whole request, i.e. with max_write=64k, we
+ * must provide a buffer that fits the WRITE header and 64 kB of space for
+ * data.
+ * We do want to support 64k write requests without requiring them to be
+ * split up, but at the same time, do not want to do such a large allocation
+ * for every single request.
+ * Therefore, the FuseRequest object provides an in-line buffer that is
+ * enough for write requests up to 4k (and all other requests), and for
+ * every request that is bigger, we provide a spill-over buffer here (for
+ * the remaining 64k - 4k = 60k).
+ * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
+ * I/O vector, and then checks the return value (number of bytes read) to
+ * find out whether the spill-over buffer was used. If so, it will move the
+ * buffer to the request, and will allocate a new spill-over buffer for the
+ * next request.
+ *
+ * Free this buffer with qemu_vfree().
*/
- char request_buf[MAX_CONST(
- sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
- FUSE_MAX_WRITE_BYTES,
- FUSE_MIN_READ_BUFFER
- )];
+ void *spillover_buf;
} FuseQueue;
+/*
+ * Verify that FuseQueue.request_buf plus the spill-over buffer together
+ * are big enough to be accepted by the FUSE kernel driver.
+ */
+QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
+ FUSE_SPILLOVER_BUF_SIZE <
+ FUSE_MIN_READ_BUFFER);
+
struct FuseExport {
BlockExport common;
@@ -131,7 +174,8 @@ static int clone_fuse_fd(int fd, Error **errp);
static bool is_regular_file(const char *path, Error **errp);
static void read_from_fuse_fd(void *opaque);
-static void coroutine_fn fuse_co_process_request(FuseQueue *q);
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf);
static void fuse_inc_in_flight(FuseExport *exp)
{
@@ -476,12 +520,27 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
FuseExport *exp = q->exp;
ssize_t ret;
const struct fuse_in_header *in_hdr;
+ struct iovec iov[2];
+ void *spillover_buf = NULL;
if (unlikely(exp->halted)) {
goto no_request;
}
- ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
+ /*
+ * If handling the last request consumed the spill-over buffer, allocate a
+ * new one. Align it to the block device's alignment, which admittedly is
+ * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
+ */
+ if (unlikely(!q->spillover_buf)) {
+ q->spillover_buf = blk_blockalign(exp->common.blk,
+ FUSE_SPILLOVER_BUF_SIZE);
+ }
+ /* Construct the I/O vector to hold the FUSE request */
+ iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
+ iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
+
+ ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
if (ret < 0 && errno == EAGAIN) {
/* No request available */
goto no_request;
@@ -510,7 +569,13 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
goto no_request;
}
- fuse_co_process_request(q);
+ if (unlikely(ret > sizeof(q->request_buf))) {
+ /* Spillover buffer used, take ownership */
+ spillover_buf = q->spillover_buf;
+ q->spillover_buf = NULL;
+ }
+
+ fuse_co_process_request(q, spillover_buf);
no_request:
fuse_dec_in_flight(exp);
@@ -560,6 +625,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
if (i > 0 && q->fuse_fd >= 0) {
close(q->fuse_fd);
}
+ if (q->spillover_buf) {
+ qemu_vfree(q->spillover_buf);
+ }
}
g_free(exp->queues);
@@ -869,17 +937,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
}
/**
- * Handle client writes to the exported image. @buf has the data to be written
- * and will be copied to a bounce buffer before yielding for the first time.
+ * Handle client writes to the exported image. @in_place_buf has the first
+ * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
+ * contains the rest (if any; NULL otherwise).
+ * Data in @in_place_buf is assumed to be overwritten after yielding, so will
+ * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
+ * assumed to be exclusively owned and will be used as-is.
* Return the number of bytes written to *out on success, and -errno on error.
*/
static ssize_t coroutine_fn
fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
- uint64_t offset, uint32_t size, const void *buf)
+ uint64_t offset, uint32_t size,
+ const void *in_place_buf, const void *spillover_buf)
{
+ size_t in_place_size;
void *copied;
int64_t blk_len;
int ret;
+ struct iovec iov[2];
+ QEMUIOVector qiov;
/* Limited by max_write, should not happen */
if (size > BDRV_REQUEST_MAX_BYTES) {
@@ -891,8 +967,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
}
/* Must copy to bounce buffer before potentially yielding */
- copied = blk_blockalign(exp->common.blk, size);
- memcpy(copied, buf, size);
+ in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
+ copied = blk_blockalign(exp->common.blk, in_place_size);
+ memcpy(copied, in_place_buf, in_place_size);
/**
* Clients will expect short writes at EOF, so we have to limit
@@ -916,7 +993,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
}
}
- ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
+ iov[0] = (struct iovec) {
+ .iov_base = copied,
+ .iov_len = in_place_size,
+ };
+ if (size > FUSE_IN_PLACE_WRITE_BYTES) {
+ assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
+ iov[1] = (struct iovec) {
+ .iov_base = (void *)spillover_buf,
+ .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
+ };
+ qemu_iovec_init_external(&qiov, iov, 2);
+ } else {
+ qemu_iovec_init_external(&qiov, iov, 1);
+ }
+ ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
if (ret < 0) {
goto fail_free_buffer;
}
@@ -1275,8 +1366,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
* Note that yielding in any request-processing function can overwrite the
* contents of q->request_buf. Anything that takes a buffer needs to take
* care that the content is copied before yielding.
+ *
+ * @spillover_buf can contain the tail of a write request too large to fit into
+ * q->request_buf. This function takes ownership of it (i.e. will free it),
+ * which assumes that its contents will not be overwritten by concurrent
+ * requests (as opposed to q->request_buf).
*/
-static void coroutine_fn fuse_co_process_request(FuseQueue *q)
+static void coroutine_fn
+fuse_co_process_request(FuseQueue *q, void *spillover_buf)
{
FuseExport *exp = q->exp;
uint32_t opcode;
@@ -1372,7 +1469,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
* yielding.
*/
ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
- in->offset, in->size, in + 1);
+ in->offset, in->size, in + 1, spillover_buf);
break;
}
@@ -1414,6 +1511,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
ret < 0 ? ret : 0,
ret < 0 ? 0 : ret);
}
+
+ qemu_vfree(spillover_buf);
}
const BlockExportDriver blk_exp_fuse = {
--
2.49.0
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-06-04 13:28 ` [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
@ 2025-06-10 23:37 ` Brian
2025-06-11 13:46 ` Stefan Hajnoczi
0 siblings, 1 reply; 40+ messages in thread
From: Brian @ 2025-06-10 23:37 UTC (permalink / raw)
To: Hanna Czenczek, qemu-block
Cc: qemu-devel, Stefan Hajnoczi, Kevin Wolf, Markus Armbruster
Hi all,
I'm currently working on the FUSE-over-io_uring feature for the QEMU
FUSE export. When I submit SQEs to the /dev/fuse file descriptor
during the register phase, the kernel returns an error (in
fuse_uring_create_ring_ent()). It seems this is because the payload
size (i.e. the spillover_buf size, which is FUSE_MAX_WRITE_BYTES (64k)
minus FUSE_IN_PLACE_WRITE_BYTES (4k), i.e. 60k) is smaller than
ring->max_payload_sz (which is 32 pages * 4k = 128k).
Do we need to increase the spillover_buf size, or is there any
other workaround?
Brian
On 6/4/25 9:28 AM, Hanna Czenczek wrote:
> We probably want to support larger write sizes than just 4k; 64k seems
> nice. However, we cannot read partial requests from the FUSE FD, we
> always have to read requests in full; so our read buffer must be large
> enough to accommodate potential 64k writes if we want to support that.
>
> Always allocating FuseRequest objects with 64k buffers in them seems
> wasteful, though. But we can get around the issue by splitting the
> buffer into two and using readv(): One part will hold all normal (up to
> 4k) write requests and all other requests, and a second part (the
> "spill-over buffer") will be used only for larger write requests. Each
> FuseQueue has its own spill-over buffer, and only if we find it used
> when reading a request will we move its ownership into the FuseRequest
> object and allocate a new spill-over buffer for the queue.
>
> This way, we get to support "large" write sizes without having to
> allocate big buffers when they aren't used.
>
> Also, this even reduces the size of the FuseRequest objects because the
> read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> the requests we support are not quite so large (except for >4k writes),
> so until now, we basically had to have useless padding in there.
>
> With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> is easily met and we can decrease the size of the buffer portion that is
> right inside of FuseRequest.
>
> As for benchmarks, the benefit of this patch can be shown easily by
> writing a 4G image (with qemu-img convert) to a FUSE export:
> - Before this patch: Takes 25.6 s (14.4 s with -t none)
> - After this patch: Takes 4.5 s (5.5 s with -t none)
>
> Reviewed-by: Stefan Hajnoczi<stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek<hreitz@redhat.com>
> ---
> block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 118 insertions(+), 19 deletions(-)
>
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index cdec31f2a8..908266d101 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -50,8 +50,17 @@
>
> /* Prevent overly long bounce buffer allocations */
> #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> -/* Small enough to fit in the request buffer */
> -#define FUSE_MAX_WRITE_BYTES (4 * 1024)
> +/*
> + * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
> + * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
> + * determine the split between the size of the in-place buffer in FuseRequest
> + * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
> + * detailed explanation.
> + */
> +#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
> +#define FUSE_MAX_WRITE_BYTES (64 * 1024)
> +#define FUSE_SPILLOVER_BUF_SIZE \
> + (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
>
> typedef struct FuseExport FuseExport;
>
> @@ -67,15 +76,49 @@ typedef struct FuseQueue {
>
> /*
> * The request buffer must be able to hold a full write, and/or at least
> - * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
> + * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
> + * This however is just the first part of the buffer; every read is given
> + * a vector of this buffer (which should be enough for all normal requests,
> + * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
> + * spill-over buffer below.
> + * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
> + * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
> + */
> + char request_buf[sizeof(struct fuse_in_header) +
> + sizeof(struct fuse_write_in) +
> + FUSE_IN_PLACE_WRITE_BYTES];
> +
> + /*
> + * When retrieving a FUSE request, the destination buffer must always be
> + * sufficiently large for the whole request, i.e. with max_write=64k, we
> + * must provide a buffer that fits the WRITE header and 64 kB of space for
> + * data.
> + * We do want to support 64k write requests without requiring them to be
> + * split up, but at the same time, do not want to do such a large allocation
> + * for every single request.
> + * Therefore, the FuseRequest object provides an in-line buffer that is
> + * enough for write requests up to 4k (and all other requests), and for
> + * every request that is bigger, we provide a spill-over buffer here (for
> + * the remaining 64k - 4k = 60k).
> + * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
> + * I/O vector, and then checks the return value (number of bytes read) to
> + * find out whether the spill-over buffer was used. If so, it will move the
> + * buffer to the request, and will allocate a new spill-over buffer for the
> + * next request.
> + *
> + * Free this buffer with qemu_vfree().
> */
> - char request_buf[MAX_CONST(
> - sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
> - FUSE_MAX_WRITE_BYTES,
> - FUSE_MIN_READ_BUFFER
> - )];
> + void *spillover_buf;
> } FuseQueue;
>
> +/*
> + * Verify that FuseQueue.request_buf plus the spill-over buffer together
> + * are big enough to be accepted by the FUSE kernel driver.
> + */
> +QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> + FUSE_SPILLOVER_BUF_SIZE <
> + FUSE_MIN_READ_BUFFER);
> +
> struct FuseExport {
> BlockExport common;
>
> @@ -131,7 +174,8 @@ static int clone_fuse_fd(int fd, Error **errp);
> static bool is_regular_file(const char *path, Error **errp);
>
> static void read_from_fuse_fd(void *opaque);
> -static void coroutine_fn fuse_co_process_request(FuseQueue *q);
> +static void coroutine_fn
> +fuse_co_process_request(FuseQueue *q, void *spillover_buf);
>
> static void fuse_inc_in_flight(FuseExport *exp)
> {
> @@ -476,12 +520,27 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> FuseExport *exp = q->exp;
> ssize_t ret;
> const struct fuse_in_header *in_hdr;
> + struct iovec iov[2];
> + void *spillover_buf = NULL;
>
> if (unlikely(exp->halted)) {
> goto no_request;
> }
>
> - ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> + /*
> + * If handling the last request consumed the spill-over buffer, allocate a
> + * new one. Align it to the block device's alignment, which admittedly is
> + * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
> + */
> + if (unlikely(!q->spillover_buf)) {
> + q->spillover_buf = blk_blockalign(exp->common.blk,
> + FUSE_SPILLOVER_BUF_SIZE);
> + }
> + /* Construct the I/O vector to hold the FUSE request */
> + iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
> + iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
> +
> + ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
> if (ret < 0 && errno == EAGAIN) {
> /* No request available */
> goto no_request;
> @@ -510,7 +569,13 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> goto no_request;
> }
>
> - fuse_co_process_request(q);
> + if (unlikely(ret > sizeof(q->request_buf))) {
> + /* Spillover buffer used, take ownership */
> + spillover_buf = q->spillover_buf;
> + q->spillover_buf = NULL;
> + }
> +
> + fuse_co_process_request(q, spillover_buf);
>
> no_request:
> fuse_dec_in_flight(exp);
> @@ -560,6 +625,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
> if (i > 0 && q->fuse_fd >= 0) {
> close(q->fuse_fd);
> }
> + if (q->spillover_buf) {
> + qemu_vfree(q->spillover_buf);
> + }
> }
> g_free(exp->queues);
>
> @@ -869,17 +937,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> }
>
> /**
> - * Handle client writes to the exported image. @buf has the data to be written
> - * and will be copied to a bounce buffer before yielding for the first time.
> + * Handle client writes to the exported image. @in_place_buf has the first
> + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
> + * contains the rest (if any; NULL otherwise).
> + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
> + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
> + * assumed to be exclusively owned and will be used as-is.
> * Return the number of bytes written to *out on success, and -errno on error.
> */
> static ssize_t coroutine_fn
> fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> - uint64_t offset, uint32_t size, const void *buf)
> + uint64_t offset, uint32_t size,
> + const void *in_place_buf, const void *spillover_buf)
> {
> + size_t in_place_size;
> void *copied;
> int64_t blk_len;
> int ret;
> + struct iovec iov[2];
> + QEMUIOVector qiov;
>
> /* Limited by max_write, should not happen */
> if (size > BDRV_REQUEST_MAX_BYTES) {
> @@ -891,8 +967,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> }
>
> /* Must copy to bounce buffer before potentially yielding */
> - copied = blk_blockalign(exp->common.blk, size);
> - memcpy(copied, buf, size);
> + in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
> + copied = blk_blockalign(exp->common.blk, in_place_size);
> + memcpy(copied, in_place_buf, in_place_size);
>
> /**
> * Clients will expect short writes at EOF, so we have to limit
> @@ -916,7 +993,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> }
> }
>
> - ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
> + iov[0] = (struct iovec) {
> + .iov_base = copied,
> + .iov_len = in_place_size,
> + };
> + if (size > FUSE_IN_PLACE_WRITE_BYTES) {
> + assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
> + iov[1] = (struct iovec) {
> + .iov_base = (void *)spillover_buf,
> + .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
> + };
> + qemu_iovec_init_external(&qiov, iov, 2);
> + } else {
> + qemu_iovec_init_external(&qiov, iov, 1);
> + }
> + ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
> if (ret < 0) {
> goto fail_free_buffer;
> }
> @@ -1275,8 +1366,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> * Note that yielding in any request-processing function can overwrite the
> * contents of q->request_buf. Anything that takes a buffer needs to take
> * care that the content is copied before yielding.
> + *
> + * @spillover_buf can contain the tail of a write request too large to fit into
> + * q->request_buf. This function takes ownership of it (i.e. will free it),
> + * which assumes that its contents will not be overwritten by concurrent
> + * requests (as opposed to q->request_buf).
> */
> -static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> +static void coroutine_fn
> +fuse_co_process_request(FuseQueue *q, void *spillover_buf)
> {
> FuseExport *exp = q->exp;
> uint32_t opcode;
> @@ -1372,7 +1469,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> * yielding.
> */
> ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
> - in->offset, in->size, in + 1);
> + in->offset, in->size, in + 1, spillover_buf);
> break;
> }
>
> @@ -1414,6 +1511,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> ret < 0 ? ret : 0,
> ret < 0 ? 0 : ret);
> }
> +
> + qemu_vfree(spillover_buf);
> }
>
> const BlockExportDriver blk_exp_fuse = {
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer
2025-06-10 23:37 ` Brian
@ 2025-06-11 13:46 ` Stefan Hajnoczi
0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-11 13:46 UTC (permalink / raw)
To: Brian; +Cc: Hanna Czenczek, qemu-block, qemu-devel, Kevin Wolf,
Markus Armbruster
On Tue, Jun 10, 2025 at 07:37:34PM -0400, Brian wrote:
> Hi all,
>
> I'm currently working on the FUSE-over-io_uring feature for the QEMU
> FUSE export. When I submit SQEs to the /dev/fuse file descriptor
> during the register phase, the kernel returns an error (in
> fuse_uring_create_ring_ent()). It seems this is because the payload
> size (i.e. the spillover_buf size, which is FUSE_MAX_WRITE_BYTES (64k)
> minus FUSE_IN_PLACE_WRITE_BYTES (4k), i.e. 60k) is smaller than
> ring->max_payload_sz (which is 32 pages * 4k = 128k).
>
> Do we need to increase the spillover_buf size, or is there any
> other workaround?
When you implement support for concurrent FUSE-over-io_uring requests
you'll need to pre-allocate a buffer for each request. That
pre-allocation code will be separate from the request buffer that this
patch series defines. So I think this issue is specific to
FUSE-over-io_uring and something that can be done in your upcoming
patches rather than something that affects this patch series.
How about going ahead and writing the FUSE-over-io_uring-specific code
for pre-allocating request buffers right away instead of using
FuseQueue->request_buf[] in your code?
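Purely as an illustration of that suggestion (all names below are
hypothetical, not existing QEMU or kernel API): each ring entry would own a
buffer sized for the ring's maximum payload, allocated up front, e.g.

/* Hypothetical per-entry state for FUSE-over-io_uring; illustrative only */
typedef struct FuseRingEnt {
    void *payload;      /* pre-allocated payload buffer for one request */
    size_t payload_sz;  /* the ring's maximum payload size */
} FuseRingEnt;

static void fuse_uring_prealloc_entries(FuseExport *exp, FuseRingEnt *ents,
                                        unsigned int nr, size_t max_payload_sz)
{
    for (unsigned int i = 0; i < nr; i++) {
        /* One buffer per in-flight request, so concurrent requests never
         * overwrite each other's data */
        ents[i].payload = blk_blockalign(exp->common.blk, max_payload_sz);
        ents[i].payload_sz = max_payload_sz;
    }
}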
Stefan
>
>
> Brian
>
> On 6/4/25 9:28 AM, Hanna Czenczek wrote:
> > We probably want to support larger write sizes than just 4k; 64k seems
> > nice. However, we cannot read partial requests from the FUSE FD, we
> > always have to read requests in full; so our read buffer must be large
> > enough to accommodate potential 64k writes if we want to support that.
> >
> > Always allocating FuseRequest objects with 64k buffers in them seems
> > wasteful, though. But we can get around the issue by splitting the
> > buffer into two and using readv(): One part will hold all normal (up to
> > 4k) write requests and all other requests, and a second part (the
> > "spill-over buffer") will be used only for larger write requests. Each
> > FuseQueue has its own spill-over buffer, and only if we find it used
> > when reading a request will we move its ownership into the FuseRequest
> > object and allocate a new spill-over buffer for the queue.
> >
> > This way, we get to support "large" write sizes without having to
> > allocate big buffers when they aren't used.
> >
> > Also, this even reduces the size of the FuseRequest objects because the
> > read buffer has to have at least FUSE_MIN_READ_BUFFER (8192) bytes; but
> > the requests we support are not quite so large (except for >4k writes),
> > so until now, we basically had to have useless padding in there.
> >
> > With the spill-over buffer added, the FUSE_MIN_READ_BUFFER requirement
> > is easily met and we can decrease the size of the buffer portion that is
> > right inside of FuseRequest.
> >
> > As for benchmarks, the benefit of this patch can be shown easily by
> > writing a 4G image (with qemu-img convert) to a FUSE export:
> > - Before this patch: Takes 25.6 s (14.4 s with -t none)
> > - After this patch: Takes 4.5 s (5.5 s with -t none)
> >
> > Reviewed-by: Stefan Hajnoczi<stefanha@redhat.com>
> > Signed-off-by: Hanna Czenczek<hreitz@redhat.com>
> > ---
> > block/export/fuse.c | 137 ++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 118 insertions(+), 19 deletions(-)
> >
> > diff --git a/block/export/fuse.c b/block/export/fuse.c
> > index cdec31f2a8..908266d101 100644
> > --- a/block/export/fuse.c
> > +++ b/block/export/fuse.c
> > @@ -50,8 +50,17 @@
> > /* Prevent overly long bounce buffer allocations */
> > #define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
> > -/* Small enough to fit in the request buffer */
> > -#define FUSE_MAX_WRITE_BYTES (4 * 1024)
> > +/*
> > + * FUSE_MAX_WRITE_BYTES determines the maximum number of bytes we support in a
> > + * write request; FUSE_IN_PLACE_WRITE_BYTES and FUSE_SPILLOVER_BUF_SIZE
> > + * determine the split between the size of the in-place buffer in FuseRequest
> > + * and the spill-over buffer in FuseQueue. See FuseQueue.spillover_buf for a
> > + * detailed explanation.
> > + */
> > +#define FUSE_IN_PLACE_WRITE_BYTES (4 * 1024)
> > +#define FUSE_MAX_WRITE_BYTES (64 * 1024)
> > +#define FUSE_SPILLOVER_BUF_SIZE \
> > + (FUSE_MAX_WRITE_BYTES - FUSE_IN_PLACE_WRITE_BYTES)
> > typedef struct FuseExport FuseExport;
> > @@ -67,15 +76,49 @@ typedef struct FuseQueue {
> > /*
> > * The request buffer must be able to hold a full write, and/or at least
> > - * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes
> > + * FUSE_MIN_READ_BUFFER (from linux/fuse.h) bytes.
> > + * This however is just the first part of the buffer; every read is given
> > + * a vector of this buffer (which should be enough for all normal requests,
> > + * which we check via the static assertion in FUSE_IN_OP_STRUCT()) and the
> > + * spill-over buffer below.
> > + * Therefore, the size of this buffer plus FUSE_SPILLOVER_BUF_SIZE must be
> > + * FUSE_MIN_READ_BUFFER or more (checked via static assertion below).
> > + */
> > + char request_buf[sizeof(struct fuse_in_header) +
> > + sizeof(struct fuse_write_in) +
> > + FUSE_IN_PLACE_WRITE_BYTES];
> > +
> > + /*
> > + * When retrieving a FUSE request, the destination buffer must always be
> > + * sufficiently large for the whole request, i.e. with max_write=64k, we
> > + * must provide a buffer that fits the WRITE header and 64 kB of space for
> > + * data.
> > + * We do want to support 64k write requests without requiring them to be
> > + * split up, but at the same time, do not want to do such a large allocation
> > + * for every single request.
> > + * Therefore, the FuseRequest object provides an in-line buffer that is
> > + * enough for write requests up to 4k (and all other requests), and for
> > + * every request that is bigger, we provide a spill-over buffer here (for
> > + * the remaining 64k - 4k = 60k).
> > + * When poll_fuse_fd() reads a FUSE request, it passes these buffers as an
> > + * I/O vector, and then checks the return value (number of bytes read) to
> > + * find out whether the spill-over buffer was used. If so, it will move the
> > + * buffer to the request, and will allocate a new spill-over buffer for the
> > + * next request.
> > + *
> > + * Free this buffer with qemu_vfree().
> > */
> > - char request_buf[MAX_CONST(
> > - sizeof(struct fuse_in_header) + sizeof(struct fuse_write_in) +
> > - FUSE_MAX_WRITE_BYTES,
> > - FUSE_MIN_READ_BUFFER
> > - )];
> > + void *spillover_buf;
> > } FuseQueue;
> > +/*
> > + * Verify that FuseQueue.request_buf plus the spill-over buffer together
> > + * are big enough to be accepted by the FUSE kernel driver.
> > + */
> > +QEMU_BUILD_BUG_ON(sizeof(((FuseQueue *)0)->request_buf) +
> > + FUSE_SPILLOVER_BUF_SIZE <
> > + FUSE_MIN_READ_BUFFER);
> > +
> > struct FuseExport {
> > BlockExport common;
> > @@ -131,7 +174,8 @@ static int clone_fuse_fd(int fd, Error **errp);
> > static bool is_regular_file(const char *path, Error **errp);
> > static void read_from_fuse_fd(void *opaque);
> > -static void coroutine_fn fuse_co_process_request(FuseQueue *q);
> > +static void coroutine_fn
> > +fuse_co_process_request(FuseQueue *q, void *spillover_buf);
> > static void fuse_inc_in_flight(FuseExport *exp)
> > {
> > @@ -476,12 +520,27 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> > FuseExport *exp = q->exp;
> > ssize_t ret;
> > const struct fuse_in_header *in_hdr;
> > + struct iovec iov[2];
> > + void *spillover_buf = NULL;
> > if (unlikely(exp->halted)) {
> > goto no_request;
> > }
> > - ret = RETRY_ON_EINTR(read(fuse_fd, q->request_buf, sizeof(q->request_buf)));
> > + /*
> > + * If handling the last request consumed the spill-over buffer, allocate a
> > + * new one. Align it to the block device's alignment, which admittedly is
> > + * only useful if FUSE_IN_PLACE_WRITE_BYTES is aligned, too.
> > + */
> > + if (unlikely(!q->spillover_buf)) {
> > + q->spillover_buf = blk_blockalign(exp->common.blk,
> > + FUSE_SPILLOVER_BUF_SIZE);
> > + }
> > + /* Construct the I/O vector to hold the FUSE request */
> > + iov[0] = (struct iovec) { q->request_buf, sizeof(q->request_buf) };
> > + iov[1] = (struct iovec) { q->spillover_buf, FUSE_SPILLOVER_BUF_SIZE };
> > +
> > + ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
> > if (ret < 0 && errno == EAGAIN) {
> > /* No request available */
> > goto no_request;
> > @@ -510,7 +569,13 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
> > goto no_request;
> > }
> > - fuse_co_process_request(q);
> > + if (unlikely(ret > sizeof(q->request_buf))) {
> > + /* Spillover buffer used, take ownership */
> > + spillover_buf = q->spillover_buf;
> > + q->spillover_buf = NULL;
> > + }
> > +
> > + fuse_co_process_request(q, spillover_buf);
> > no_request:
> > fuse_dec_in_flight(exp);
> > @@ -560,6 +625,9 @@ static void fuse_export_delete(BlockExport *blk_exp)
> > if (i > 0 && q->fuse_fd >= 0) {
> > close(q->fuse_fd);
> > }
> > + if (q->spillover_buf) {
> > + qemu_vfree(q->spillover_buf);
> > + }
> > }
> > g_free(exp->queues);
> > @@ -869,17 +937,25 @@ fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
> > }
> > /**
> > - * Handle client writes to the exported image. @buf has the data to be written
> > - * and will be copied to a bounce buffer before yielding for the first time.
> > + * Handle client writes to the exported image. @in_place_buf has the first
> > + * FUSE_IN_PLACE_WRITE_BYTES bytes of the data to be written, @spillover_buf
> > + * contains the rest (if any; NULL otherwise).
> > + * Data in @in_place_buf is assumed to be overwritten after yielding, so will
> > + * be copied to a bounce buffer beforehand. @spillover_buf in contrast is
> > + * assumed to be exclusively owned and will be used as-is.
> > * Return the number of bytes written to *out on success, and -errno on error.
> > */
> > static ssize_t coroutine_fn
> > fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > - uint64_t offset, uint32_t size, const void *buf)
> > + uint64_t offset, uint32_t size,
> > + const void *in_place_buf, const void *spillover_buf)
> > {
> > + size_t in_place_size;
> > void *copied;
> > int64_t blk_len;
> > int ret;
> > + struct iovec iov[2];
> > + QEMUIOVector qiov;
> > /* Limited by max_write, should not happen */
> > if (size > BDRV_REQUEST_MAX_BYTES) {
> > @@ -891,8 +967,9 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > }
> > /* Must copy to bounce buffer before potentially yielding */
> > - copied = blk_blockalign(exp->common.blk, size);
> > - memcpy(copied, buf, size);
> > + in_place_size = MIN(size, FUSE_IN_PLACE_WRITE_BYTES);
> > + copied = blk_blockalign(exp->common.blk, in_place_size);
> > + memcpy(copied, in_place_buf, in_place_size);
> > /**
> > * Clients will expect short writes at EOF, so we have to limit
> > @@ -916,7 +993,21 @@ fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
> > }
> > }
> > - ret = blk_co_pwrite(exp->common.blk, offset, size, copied, 0);
> > + iov[0] = (struct iovec) {
> > + .iov_base = copied,
> > + .iov_len = in_place_size,
> > + };
> > + if (size > FUSE_IN_PLACE_WRITE_BYTES) {
> > + assert(size - FUSE_IN_PLACE_WRITE_BYTES <= FUSE_SPILLOVER_BUF_SIZE);
> > + iov[1] = (struct iovec) {
> > + .iov_base = (void *)spillover_buf,
> > + .iov_len = size - FUSE_IN_PLACE_WRITE_BYTES,
> > + };
> > + qemu_iovec_init_external(&qiov, iov, 2);
> > + } else {
> > + qemu_iovec_init_external(&qiov, iov, 1);
> > + }
> > + ret = blk_co_pwritev(exp->common.blk, offset, size, &qiov, 0);
> > if (ret < 0) {
> > goto fail_free_buffer;
> > }
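
(Side note for readers: the scatter-gather write above is the same pattern
one would use with a plain pwritev(); a minimal sketch, with placeholder
names of my own rather than anything taken from the patch:)

  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  /*
   * head: in-place part that may be overwritten once we yield/poll, so it
   *       is bounce-copied first;
   * tail: spill-over part that the caller owns exclusively, used as-is.
   */
  static ssize_t write_split(int fd, off_t offset,
                             const void *head, size_t head_len,
                             const void *tail, size_t tail_len)
  {
      struct iovec iov[2];
      void *bounce = malloc(head_len);
      ssize_t ret;

      memcpy(bounce, head, head_len);
      iov[0] = (struct iovec) { bounce, head_len };
      iov[1] = (struct iovec) { (void *)tail, tail_len };

      ret = pwritev(fd, iov, tail_len > 0 ? 2 : 1, offset);
      free(bounce);
      return ret;
  }

Scatter-gather is what makes it unnecessary to ever copy the spill-over tail
into a contiguous buffer.
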
> > @@ -1275,8 +1366,14 @@ static int fuse_write_buf_response(int fd, uint32_t req_id,
> > * Note that yielding in any request-processing function can overwrite the
> > * contents of q->request_buf. Anything that takes a buffer needs to take
> > * care that the content is copied before yielding.
> > + *
> > + * @spillover_buf can contain the tail of a write request too large to fit into
> > + * q->request_buf. This function takes ownership of it (i.e. will free it),
> > + * which assumes that its contents will not be overwritten by concurrent
> > + * requests (as opposed to q->request_buf).
> > */
> > -static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > +static void coroutine_fn
> > +fuse_co_process_request(FuseQueue *q, void *spillover_buf)
> > {
> > FuseExport *exp = q->exp;
> > uint32_t opcode;
> > @@ -1372,7 +1469,7 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > * yielding.
> > */
> > ret = fuse_co_write(exp, FUSE_OUT_OP_STRUCT(write, out_buf),
> > - in->offset, in->size, in + 1);
> > + in->offset, in->size, in + 1, spillover_buf);
> > break;
> > }
> > @@ -1414,6 +1511,8 @@ static void coroutine_fn fuse_co_process_request(FuseQueue *q)
> > ret < 0 ? ret : 0,
> > ret < 0 ? 0 : ret);
> > }
> > +
> > + qemu_vfree(spillover_buf);
> > }
> > const BlockExportDriver blk_exp_fuse = {
* Re: [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading
2025-06-04 13:27 [PATCH v2 00/21] export/fuse: Use coroutines and multi-threading Hanna Czenczek
` (20 preceding siblings ...)
2025-06-04 13:28 ` [PATCH v2 21/21] fuse: Increase MAX_WRITE_SIZE with a second buffer Hanna Czenczek
@ 2025-06-09 18:14 ` Stefan Hajnoczi
21 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2025-06-09 18:14 UTC (permalink / raw)
To: Hanna Czenczek
Cc: qemu-block, qemu-devel, Kevin Wolf, Markus Armbruster, Brian Song
On Wed, Jun 04, 2025 at 03:27:52PM +0200, Hanna Czenczek wrote:
> Hi,
>
> This series:
> - Fixes some bugs/minor inconveniences,
> - Removes libfuse from the request processing path,
> - Makes the FUSE export use coroutines for request handling,
> - Introduces multi-threading into the FUSE export.
>
> More detail on the v1 cover letter:
>
> https://lists.nongnu.org/archive/html/qemu-block/2025-03/msg00359.html
>
>
> Changes from v1:
> - Patch 1: Clarified “polling” to be `aio_poll()`
> - Patch 11 (new): Pulled out from patch 13 (prev. 11)
> - Patch 12 (new): Suggested by Eric
> - Patch 13 (prev. 11):
> - Drop false polling handlers
> - Use qemu_fcntl_addfl() instead of fcntl(F_SETFL) to keep
> pre-existing FD flags
> - Add a note that the buffer returned by read needs to be freed via
> qemu_vfree()
> - Pulled out a variable rename into the new patch 11
> - Patch 15 (prev. 13):
> - Simplified the co_read_from_fuse_fd() interface thanks to no longer
> needing to support poll handlers
> - Increment in-flight counter before entering the coroutine to make it
> more obvious how this ensures that the export pointer remains valid
> throughout
> - Patch 16 (new): Add a common multi-threading interface for exports
> instead of a specific one just for FUSE
> - Patch 17 (new): Test cases for this new interface
> - Patch 18 (prev. 14):
> - Drop the contrasting with virtio-blk from the commit message;
> explaining the interface is no longer necessary now that it’s
> introduced separately in patch 16.
> - Generally, the interface definition is removed in favor of the new
> one in patch 16.
> - Some rebase conflicts (due to other changes earlier in this series).
> - Patch 19 (new): Stefan suggested adding an explicit note for users on
> how multi-threading behaves with FUSE, not least because in the future
> this behavior may depend on the specific implementation features
> chosen (io-uring or not). Because the actual multi-thread interface
> is now on the common export options, it is no longer obvious where to
> put this implementation note; I decided to put it into the general
> description of the FUSE export options, inside of this dedicated
> patch.
> - Patch 20 (new): Simple sanity test for FUSE multi-threading (just test
> that e.g. nothing crashes when running qemu-img bench on top)
> - Patch 21 (prev. 15): Rebase conflict due to the changes in patch 15;
> kept Stefan’s R-b anyway
>
>
> Review notes/suggestions I deliberately did not follow in v2:
> - Stefan suggested to make patch 1 simpler and more robust by allocating
> a new buffer for each request. This is indeed a simple change (for
> patch 1) that I wouldn’t mind, and that I also started to implement.
> However, eventually I decided against it:
> The problem doesn’t disappear with the rest of the series; it
> basically stays exactly the same, though instead of an implicit
> aio_poll() leading to nested polling, it turns into an implicit
> coroutine yield doing pretty much the same thing.
> For performance, it is better not to allocate a new buffer for each
> request; we only really need a bounce buffer for writes, as there is
> no other case where we’d continue to read the request buffer after
> yielding. Therefore, the final state I would like to have after this
> series is to use a common request buffer for all requests on a single
> queue, only using a bounce buffer for writes.
> With that, I think it’s better to implement exactly that right from
> the start, rather than introducing a new intermediate state.
>
>
> git-backport-diff from v1:
>
> Key:
> [----] : patches are identical
> [####] : number of functional differences between upstream/downstream patch
> [down] : patch is downstream-only
> The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively
>
> 001/21:[0012] [FC] 'fuse: Copy write buffer content before polling'
> 002/21:[----] [--] 'fuse: Ensure init clean-up even with error_fatal'
> 003/21:[----] [--] 'fuse: Remove superfluous empty line'
> 004/21:[----] [--] 'fuse: Explicitly set inode ID to 1'
> 005/21:[----] [--] 'fuse: Change setup_... to mount_fuse_export()'
> 006/21:[----] [--] 'fuse: Fix mount options'
> 007/21:[----] [--] 'fuse: Set direct_io and parallel_direct_writes'
> 008/21:[----] [--] 'fuse: Introduce fuse_{at,de}tach_handlers()'
> 009/21:[----] [--] 'fuse: Introduce fuse_{inc,dec}_in_flight()'
> 010/21:[----] [--] 'fuse: Add halted flag'
> 011/21:[down] 'fuse: Rename length to blk_len in fuse_write()'
> 012/21:[down] 'block: Move qemu_fcntl_addfl() into osdep.c'
> 013/21:[0077] [FC] 'fuse: Manually process requests (without libfuse)'
> 014/21:[----] [--] 'fuse: Reduce max read size'
> 015/21:[0061] [FC] 'fuse: Process requests in coroutines'
> 016/21:[down] 'block/export: Add multi-threading interface'
> 017/21:[down] 'iotests/307: Test multi-thread export interface'
> 018/21:[0077] [FC] 'fuse: Implement multi-threading'
> 019/21:[down] 'qapi/block-export: Document FUSE's multi-threading'
> 020/21:[down] 'iotests/308: Add multi-threading sanity test'
> 021/21:[0002] [FC] 'fuse: Increase MAX_WRITE_SIZE with a second buffer'
>
>
> Hanna Czenczek (21):
> fuse: Copy write buffer content before polling
> fuse: Ensure init clean-up even with error_fatal
> fuse: Remove superfluous empty line
> fuse: Explicitly set inode ID to 1
> fuse: Change setup_... to mount_fuse_export()
> fuse: Fix mount options
> fuse: Set direct_io and parallel_direct_writes
> fuse: Introduce fuse_{at,de}tach_handlers()
> fuse: Introduce fuse_{inc,dec}_in_flight()
> fuse: Add halted flag
> fuse: Rename length to blk_len in fuse_write()
> block: Move qemu_fcntl_addfl() into osdep.c
> fuse: Manually process requests (without libfuse)
> fuse: Reduce max read size
> fuse: Process requests in coroutines
> block/export: Add multi-threading interface
> iotests/307: Test multi-thread export interface
> fuse: Implement multi-threading
> qapi/block-export: Document FUSE's multi-threading
> iotests/308: Add multi-threading sanity test
> fuse: Increase MAX_WRITE_SIZE with a second buffer
>
> qapi/block-export.json | 39 +-
> include/block/export.h | 12 +-
> include/qemu/osdep.h | 1 +
> block/export/export.c | 48 +-
> block/export/fuse.c | 1181 ++++++++++++++++++++------
> block/export/vduse-blk.c | 7 +
> block/export/vhost-user-blk-server.c | 8 +
> block/file-posix.c | 17 +-
> nbd/server.c | 6 +
> util/osdep.c | 18 +
> tests/qemu-iotests/307 | 47 +
> tests/qemu-iotests/307.out | 18 +
> tests/qemu-iotests/308 | 55 +-
> tests/qemu-iotests/308.out | 61 +-
> 14 files changed, 1213 insertions(+), 305 deletions(-)
>
> --
> 2.49.0
>
Nice, looks like there are some goodies in here that Brian can use in
his FUSE-over-io_uring project, like the new iothread=['iothread0',
'iothread1', ...] syntax for specifying multiple IOThreads.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>