* [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading
@ 2026-02-18 13:26 Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 01/24] fuse: Copy write buffer content before polling Hanna Czenczek
                   ` (24 more replies)
  0 siblings, 25 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Hi,

This series:
- Fixes some bugs/minor inconveniences,
- Removes libfuse from the request processing path,
- Makes the FUSE export use coroutines for request handling,
- Implements multi-threading for the FUSE export.

More detail on the v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2025-03/msg00359.html

v2 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2025-06/msg00040.html

v3 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00005.html


I noticed some performance differences vs. my previous benchmarks;
notably, the introduction of coroutines did not improve performance much
(except for random reads).  However, when I rerun the same benchmarks on
the old branch, I see no improvement there either, so something about my
host system must have changed.


Changes from v3:
- Patch 1: Use QEMU_AUTO_VFREE
- Patch 6: Added (fix for pre-existing bug)
- Patch 11: Make flag atomic, better this way for multithreading later
- Patch 12: Rename not only in fuse_write(), but fuse_read() as well
- Patch 13: Added (don’t truncate if we don’t want to test that)
- Patch 14: Added (fix for pre-existing bugs)
- Patch 16: Rewrote core parts:
  - Remove the macros to handle different FUSE requests in the same
    buffer, instead use unions
  - Restructure how we read data from FUSE:
    No longer one request_buf per queue, but instead a non-WRITE buffer
    on the stack, and a WRITE data buffer on the heap. We pass this via
    readv(), but it means that for non-WRITE requests, data that we want
    on the stack may spill into the heap buffer, so we need to copy it
    back. We cache the data buffer between non-WRITE requests so we at
    least don’t have to reallocate it all the time.
  - Actually take care of older FUSE versions down to 7.9 (2007)
  - Handle some more requests that should be handled: STATFS, DESTROY,
    FORGET, BATCH_FORGET
  - Don’t handle short writes on /dev/fuse
  - Move initializing fuse_out_header into the callers of
    fuse_write_response() and fuse_write_buf_response()
- Patch 18: Rebase conflicts because of patch 16, and removed the unused
  FuseRequestCoParam
- Patch 19: Adhere to QAPI max line length
- Patch 21: Added, not sure if absolutely necessary, but won’t hurt
  either
- Patch 22: Rebase conflicts because of patch 16
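
The buffer layout described for patch 16 can be illustrated in isolation.
The following is only a sketch under stated assumptions, not the code from
the patch: it reads from an arbitrary fd (e.g. a pipe) instead of
/dev/fuse, and read_message(), HDR_BUF_SIZE and DATA_BUF_SIZE are
hypothetical names and sizes.  It shows readv() filling a small stack
buffer first, letting the rest spill into a cached heap buffer, and then
copying the spilled part back next to the header.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical sizes, chosen for illustration only */
#define HDR_BUF_SIZE 16
#define DATA_BUF_SIZE 64

/*
 * Read one message from fd.  The first HDR_BUF_SIZE bytes land in a stack
 * buffer, anything beyond that spills into a heap buffer that is cached in
 * *data_cache between calls.  Both parts are then copied into out so the
 * caller sees one contiguous message.  Returns the message length, or -1.
 */
static ssize_t read_message(int fd, char *out, size_t out_size,
                            char **data_cache)
{
    char hdr[HDR_BUF_SIZE];
    struct iovec iov[2];
    size_t in_hdr, spilled;
    ssize_t len;

    assert(out && data_cache);

    if (!*data_cache) {
        /* Allocate once, reuse for subsequent requests */
        *data_cache = malloc(DATA_BUF_SIZE);
        if (!*data_cache) {
            return -1;
        }
    }

    /* readv() fills the entries in order: stack buffer, then heap buffer */
    iov[0].iov_base = hdr;
    iov[0].iov_len = HDR_BUF_SIZE;
    iov[1].iov_base = *data_cache;
    iov[1].iov_len = DATA_BUF_SIZE;

    len = readv(fd, iov, 2);
    if (len < 0 || (size_t)len > out_size) {
        return -1;
    }

    in_hdr = (size_t)len < HDR_BUF_SIZE ? (size_t)len : HDR_BUF_SIZE;
    spilled = (size_t)len - in_hdr;

    /* Copy the spilled tail back so it follows the header contiguously */
    memcpy(out, hdr, in_hdr);
    memcpy(out + in_hdr, *data_cache, spilled);
    return len;
}
```

In this sketch every message is reassembled; the series only needs the
copy-back for non-WRITE requests, while WRITE payloads stay in the heap
buffer.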


git-backport-diff from v3:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/24:[0009] [FC] 'fuse: Copy write buffer content before polling'
002/24:[----] [--] 'fuse: Ensure init clean-up even with error_fatal'
003/24:[----] [--] 'fuse: Remove superfluous empty line'
004/24:[----] [--] 'fuse: Explicitly set inode ID to 1'
005/24:[----] [--] 'fuse: Change setup_... to mount_fuse_export()'
006/24:[down] 'fuse: Destroy session on mount_fuse_export() fail'
007/24:[----] [--] 'fuse: Fix mount options'
008/24:[----] [--] 'fuse: Set direct_io and parallel_direct_writes'
009/24:[----] [--] 'fuse: Introduce fuse_{at,de}tach_handlers()'
010/24:[----] [--] 'fuse: Introduce fuse_{inc,dec}_in_flight()'
011/24:[0008] [FC] 'fuse: Add halted flag'
012/24:[0012] [FC] 'fuse: fuse_{read,write}: Rename length to blk_len'
013/24:[down] 'iotests/308: Use conv=notrunc to test growability'
014/24:[down] 'fuse: Explicitly handle non-grow post-EOF accesses'
015/24:[----] [--] 'block: Move qemu_fcntl_addfl() into osdep.c'
016/24:[0718] [FC] 'fuse: Manually process requests (without libfuse)'
017/24:[----] [-C] 'fuse: Reduce max read size'
018/24:[0102] [FC] 'fuse: Process requests in coroutines'
019/24:[0018] [FC] 'block/export: Add multi-threading interface'
020/24:[----] [--] 'iotests/307: Test multi-thread export interface'
021/24:[down] 'fuse: Make shared export state atomic'
022/24:[0084] [FC] 'fuse: Implement multi-threading'
023/24:[----] [--] 'qapi/block-export: Document FUSE's multi-threading'
024/24:[----] [--] 'iotests/308: Add multi-threading sanity test'


Hanna Czenczek (24):
  fuse: Copy write buffer content before polling
  fuse: Ensure init clean-up even with error_fatal
  fuse: Remove superfluous empty line
  fuse: Explicitly set inode ID to 1
  fuse: Change setup_... to mount_fuse_export()
  fuse: Destroy session on mount_fuse_export() fail
  fuse: Fix mount options
  fuse: Set direct_io and parallel_direct_writes
  fuse: Introduce fuse_{at,de}tach_handlers()
  fuse: Introduce fuse_{inc,dec}_in_flight()
  fuse: Add halted flag
  fuse: fuse_{read,write}: Rename length to blk_len
  iotests/308: Use conv=notrunc to test growability
  fuse: Explicitly handle non-grow post-EOF accesses
  block: Move qemu_fcntl_addfl() into osdep.c
  fuse: Manually process requests (without libfuse)
  fuse: Reduce max read size
  fuse: Process requests in coroutines
  block/export: Add multi-threading interface
  iotests/307: Test multi-thread export interface
  fuse: Make shared export state atomic
  fuse: Implement multi-threading
  qapi/block-export: Document FUSE's multi-threading
  iotests/308: Add multi-threading sanity test

 qapi/block-export.json               |   41 +-
 include/block/export.h               |   12 +-
 include/qemu/osdep.h                 |    1 +
 block/export/export.c                |   48 +-
 block/export/fuse.c                  | 1278 ++++++++++++++++++++------
 block/export/vduse-blk.c             |    7 +
 block/export/vhost-user-blk-server.c |    8 +
 block/file-posix.c                   |   17 +-
 nbd/server.c                         |    6 +
 util/osdep.c                         |   18 +
 tests/qemu-iotests/307               |   47 +
 tests/qemu-iotests/307.out           |   18 +
 tests/qemu-iotests/308               |   95 +-
 tests/qemu-iotests/308.out           |   71 +-
 14 files changed, 1358 insertions(+), 309 deletions(-)

-- 
2.53.0




* [PATCH v4 01/24] fuse: Copy write buffer content before polling
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 02/24] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

aio_poll() in I/O functions can lead to nested read_from_fuse_export()
calls, overwriting the request buffer's content.  The only function
affected by this is fuse_write(), which therefore must use a bounce
buffer or corruption may occur.

Note that in addition we do not know whether libfuse-internal structures
can cope with this nesting, and even if we did, we probably cannot rely
on it in the future.  This is the main reason why we want to remove
libfuse from the I/O path.

I do not have a good reproducer for this other than:

$ dd if=/dev/urandom of=image bs=1M count=4096
$ dd if=/dev/zero of=copy bs=1M count=4096
$ touch fuse-export
$ qemu-storage-daemon \
    --blockdev file,node-name=file,filename=copy \
    --export \
    fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
    &

Other shell:
$ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
$ killall -SIGINT qemu-storage-daemon
$ qemu-img compare image copy
Content mismatch at offset 0!

(The -t none in qemu-img convert is important.)

I tried reproducing this with throttle and small aio_write requests from
another qemu-io instance, but for some reason all requests are perfectly
serialized then.

I think in theory we should get parallel writes only if we set
fi->parallel_direct_writes in fuse_open().  In fact, I can confirm that
if we do that, that throttle-based reproducer works (i.e. does get
parallel (nested) write requests).  I have no idea why we still get
parallel requests with qemu-img convert anyway.

Also, a later patch in this series will set fi->parallel_direct_writes
and note that it makes basically no difference when running fio on the
current libfuse-based version of our code.  It does make a difference
without libfuse.  So something quite fishy is going on.

I will try to investigate further what the root cause is, but I think
for now let's assume that calling blk_pwrite() can invalidate the buffer
contents through nested polling.
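
The hazard can be modelled without QEMU or libfuse.  The sketch below is
an assumption-laden stand-in, not the export code: req_buf plays the role
of the shared request buffer, nested_read() the role of a nested
read_from_fuse_export() call, and fake_pwrite() the role of a blk_pwrite()
that polls before it consumes its source buffer.  All of these names are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define REQ_BUF_SIZE 32

/* Shared request buffer (stand-in for the export's request buffer) */
static char req_buf[REQ_BUF_SIZE];

/* Stand-in for a nested request read: refills the shared buffer */
static void nested_read(void)
{
    memset(req_buf, 'B', REQ_BUF_SIZE);
}

/*
 * Stand-in for a write that polls first (which may run nested_read()) and
 * only then consumes the source buffer.
 */
static void fake_pwrite(char *dst, const char *src, size_t size)
{
    nested_read();
    memcpy(dst, src, size);
}

/* Unsafe: hands out a pointer into the shared buffer across the poll */
static void write_unsafe(char *dst, size_t size)
{
    assert(size <= REQ_BUF_SIZE);
    fake_pwrite(dst, req_buf, size);
}

/* Safe: copies the payload into a bounce buffer before anything can poll */
static int write_with_bounce(char *dst, size_t size)
{
    char *copied;

    assert(size <= REQ_BUF_SIZE);
    copied = malloc(size);
    if (!copied) {
        return -1;
    }
    memcpy(copied, req_buf, size);
    fake_pwrite(dst, copied, size);
    free(copied);
    return 0;
}
```

With the shared buffer pre-filled with 'A', write_unsafe() ends up writing
the nested request's 'B' bytes, while write_with_bounce() writes the
original payload.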

Cc: qemu-stable@nongnu.org
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 8cf4572f78..cea9de61f1 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -301,6 +301,12 @@ static void read_from_fuse_export(void *opaque)
         goto out;
     }
 
+    /*
+     * Note that aio_poll() in any request-processing function can lead to a
+     * nested read_from_fuse_export() call, which will overwrite the contents of
+     * exp->fuse_buf.  Anything that takes a buffer needs to take care that the
+     * content is copied before potentially polling via aio_poll().
+     */
     fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
@@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
                        size_t size, off_t offset, struct fuse_file_info *fi)
 {
     FuseExport *exp = fuse_req_userdata(req);
+    QEMU_AUTO_VFREE void *copied = NULL;
     int64_t length;
     int ret;
 
@@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
         return;
     }
 
+    /*
+     * Heed the note on read_from_fuse_export(): If we call aio_poll() (which
+     * any blk_*() I/O function may do), read_from_fuse_export() may be nested,
+     * overwriting the request buffer content.  Therefore, we must copy it here.
+     */
+    copied = blk_blockalign(exp->common.blk, size);
+    memcpy(copied, buf, size);
+
     /**
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
@@ -660,7 +675,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
         }
     }
 
-    ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+    ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
     if (ret >= 0) {
         fuse_reply_write(req, size);
     } else {
-- 
2.53.0




* [PATCH v4 02/24] fuse: Ensure init clean-up even with error_fatal
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 01/24] fuse: Copy write buffer content before polling Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 03/24] fuse: Remove superfluous empty line Hanna Czenczek
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

When exports are created on the command line (with the storage daemon),
errp is going to point to error_fatal.  Without ERRP_GUARD, we would
exit immediately when *errp is set, i.e. skip the clean-up code under
the `fail` label.  Use ERRP_GUARD so we always run that code.

As far as I know, this has no actual impact right now[1], but it is
still better to make this right.

[1] Not cleaning up the mount point is the only thing I can imagine
    would be problematic, but that is the last thing we attempt, so if
    it fails, it will clean itself up.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index cea9de61f1..2ed22c6b2f 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -119,6 +119,7 @@ static int fuse_export_create(BlockExport *blk_exp,
                               BlockExportOptions *blk_exp_args,
                               Error **errp)
 {
+    ERRP_GUARD(); /* ensure clean-up even with error_fatal */
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
     BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
     int ret;
-- 
2.53.0




* [PATCH v4 03/24] fuse: Remove superfluous empty line
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 01/24] fuse: Copy write buffer content before polling Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 02/24] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 04/24] fuse: Explicitly set inode ID to 1 Hanna Czenczek
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 2ed22c6b2f..4cdf527d69 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -464,7 +464,6 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
     }
 
     if (add_resize_perm) {
-
         if (!qemu_in_main_thread()) {
             /* Changing permissions like below only works in the main thread */
             return -EPERM;
-- 
2.53.0




* [PATCH v4 04/24] fuse: Explicitly set inode ID to 1
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (2 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 03/24] fuse: Remove superfluous empty line Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 05/24] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Setting .st_ino to the FUSE inode ID is kind of arbitrary.  While in
practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because
we only have the root inode, that is not obvious in fuse_getattr().

Just explicitly set it to 1 (i.e. no functional change).

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 4cdf527d69..a56f645c05 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -432,7 +432,7 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
     }
 
     statbuf = (struct stat) {
-        .st_ino     = inode,
+        .st_ino     = 1,
         .st_mode    = exp->st_mode,
         .st_nlink   = 1,
         .st_uid     = exp->st_uid,
-- 
2.53.0




* [PATCH v4 05/24] fuse: Change setup_... to mount_fuse_export()
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (3 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 04/24] fuse: Explicitly set inode ID to 1 Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 06/24] fuse: Destroy session on mount_fuse_export() fail Hanna Czenczek
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

There is no clear separation between what should go into
setup_fuse_export() and what should stay in fuse_export_create().

Make it clear that setup_fuse_export() is for mounting only.  Rename it,
and move everything that has nothing to do with mounting up into
fuse_export_create().

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 49 ++++++++++++++++++++-------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index a56f645c05..00bb2ffee4 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -72,8 +72,7 @@ static void fuse_export_delete(BlockExport *exp);
 
 static void init_exports_table(void);
 
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
-                             bool allow_other, Error **errp);
+static int mount_fuse_export(FuseExport *exp, Error **errp);
 static void read_from_fuse_export(void *opaque);
 
 static bool is_regular_file(const char *path, Error **errp);
@@ -193,23 +192,32 @@ static int fuse_export_create(BlockExport *blk_exp,
     exp->st_gid = getgid();
 
     if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
-        /* Ignore errors on our first attempt */
-        ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
-        exp->allow_other = ret == 0;
+        /* Try allow_other == true first, ignore errors */
+        exp->allow_other = true;
+        ret = mount_fuse_export(exp, NULL);
         if (ret < 0) {
-            ret = setup_fuse_export(exp, args->mountpoint, false, errp);
+            exp->allow_other = false;
+            ret = mount_fuse_export(exp, errp);
         }
     } else {
         exp->allow_other = args->allow_other == FUSE_EXPORT_ALLOW_OTHER_ON;
-        ret = setup_fuse_export(exp, args->mountpoint, exp->allow_other, errp);
+        ret = mount_fuse_export(exp, errp);
     }
     if (ret < 0) {
         goto fail;
     }
 
+    g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
+
+    aio_set_fd_handler(exp->common.ctx,
+                       fuse_session_fd(exp->fuse_session),
+                       read_from_fuse_export, NULL, NULL, NULL, exp);
+    exp->fd_handler_set_up = true;
+
     return 0;
 
 fail:
+    fuse_export_shutdown(blk_exp);
     fuse_export_delete(blk_exp);
     return ret;
 }
@@ -227,10 +235,10 @@ static void init_exports_table(void)
 }
 
 /**
- * Create exp->fuse_session and mount it.
+ * Create exp->fuse_session and mount it.  Expects exp->mountpoint,
+ * exp->writable, and exp->allow_other to be set as intended for the mount.
  */
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
-                             bool allow_other, Error **errp)
+static int mount_fuse_export(FuseExport *exp, Error **errp)
 {
     const char *fuse_argv[4];
     char *mount_opts;
@@ -243,7 +251,7 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
      */
     mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
                                  FUSE_MAX_BOUNCE_BYTES,
-                                 allow_other ? ",allow_other" : "");
+                                 exp->allow_other ? ",allow_other" : "");
 
     fuse_argv[0] = ""; /* Dummy program name */
     fuse_argv[1] = "-o";
@@ -256,30 +264,17 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
     g_free(mount_opts);
     if (!exp->fuse_session) {
         error_setg(errp, "Failed to set up FUSE session");
-        ret = -EIO;
-        goto fail;
+        return -EIO;
     }
 
-    ret = fuse_session_mount(exp->fuse_session, mountpoint);
+    ret = fuse_session_mount(exp->fuse_session, exp->mountpoint);
     if (ret < 0) {
         error_setg(errp, "Failed to mount FUSE session to export");
-        ret = -EIO;
-        goto fail;
+        return -EIO;
     }
     exp->mounted = true;
 
-    g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
-
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
-    exp->fd_handler_set_up = true;
-
     return 0;
-
-fail:
-    fuse_export_shutdown(&exp->common);
-    return ret;
 }
 
 /**
-- 
2.53.0




* [PATCH v4 06/24] fuse: Destroy session on mount_fuse_export() fail
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (4 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 05/24] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 07/24] fuse: Fix mount options Hanna Czenczek
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

If mount_fuse_export() fails to mount the session, destroy it.
Depending on the allow_other configuration, fuse_export_create() may
retry this function on error, which may leak one session instance
otherwise.
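
The leak is easiest to see in a reduced model.  This is a sketch with
hypothetical types and helpers (Session, Export, session_new(),
session_mount() and so on), not the libfuse API: mounting is assumed to
fail whenever allow_other is requested, and live_sessions counts the
sessions that have been created but not destroyed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Session {
    bool allow_other;
} Session;

typedef struct Export {
    Session *session;
    bool allow_other;
} Export;

/* Number of sessions created but not yet destroyed */
static int live_sessions;

static Session *session_new(bool allow_other)
{
    Session *s = malloc(sizeof(*s));

    if (s) {
        s->allow_other = allow_other;
        live_sessions++;
    }
    return s;
}

static void session_destroy(Session *s)
{
    assert(s && live_sessions > 0);
    free(s);
    live_sessions--;
}

/* Assumption for this sketch: mounting with allow_other always fails */
static int session_mount(const Session *s)
{
    return s->allow_other ? -1 : 0;
}

static int mount_export(Export *exp)
{
    exp->session = session_new(exp->allow_other);
    if (!exp->session) {
        return -1;
    }

    if (session_mount(exp->session) < 0) {
        /*
         * Without this clean-up, a retry would overwrite exp->session and
         * leak the first instance.
         */
        session_destroy(exp->session);
        exp->session = NULL;
        return -1;
    }
    return 0;
}

/* Try allow_other first, fall back to mounting without it */
static int mount_export_auto(Export *exp)
{
    exp->allow_other = true;
    if (mount_export(exp) == 0) {
        return 0;
    }
    exp->allow_other = false;
    return mount_export(exp);
}
```

After mount_export_auto() succeeds on the second attempt, exactly one
session is alive; dropping the clean-up in mount_export() would leave two.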

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 00bb2ffee4..82560ca071 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -270,11 +270,17 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     ret = fuse_session_mount(exp->fuse_session, exp->mountpoint);
     if (ret < 0) {
         error_setg(errp, "Failed to mount FUSE session to export");
-        return -EIO;
+        ret = -EIO;
+        goto fail;
     }
     exp->mounted = true;
 
     return 0;
+
+fail:
+    fuse_session_destroy(exp->fuse_session);
+    exp->fuse_session = NULL;
+    return ret;
 }
 
 /**
-- 
2.53.0




* [PATCH v4 07/24] fuse: Fix mount options
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (5 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 06/24] fuse: Destroy session on mount_fuse_export() fail Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-03-05 17:56   ` Kevin Wolf
  2026-02-18 13:26 ` [PATCH v4 08/24] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Since I actually took a look into how mounting with libfuse works[1], I
now know that the FUSE mount options are not exactly standard mount
system call options.  Specifically:
- We should add "nosuid,nodev,noatime" because that is going to be
  translated into the respective MS_ mount flags; and those flags make
  sense for us.
- We can set rw/ro to make the mount writable or not.  It makes sense to
  set this flag to produce a better error message for read-only exports
  (EROFS instead of EACCES).
  This changes behavior as can be seen in iotest 308: It is no longer
  possible to modify metadata of read-only exports.

In addition, in the comment, we can note that the FUSE mount() system
call actually expects some more parameters that we can omit because
fusermount3 (i.e. libfuse) will figure them out by itself:
- fd: /dev/fuse fd
- rootmode: Inode mode of the root node
- user_id/group_id: Mounter's UID/GID

[1] It invokes fusermount3, an SUID libfuse helper program, which parses
    and processes some mount options before actually invoking the
    mount() system call.
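
As an illustration of the resulting option string, here is a standalone
sketch that uses snprintf() in place of g_strdup_printf().  The helper
name build_mount_opts() and the max_read parameter are assumptions made
for this example; in the patch the value comes from FUSE_MAX_BOUNCE_BYTES
and the flags come from the export state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the option string handed to libfuse.  Returns the string length,
 * or -1 if it does not fit into buf.
 */
static int build_mount_opts(char *buf, size_t buf_size, bool writable,
                            bool allow_other, size_t max_read)
{
    int n;

    assert(buf && buf_size > 0);

    n = snprintf(buf, buf_size,
                 "%s,nosuid,nodev,noatime,max_read=%zu,default_permissions%s",
                 writable ? "rw" : "ro",
                 max_read,
                 allow_other ? ",allow_other" : "");

    /* snprintf() reports the untruncated length; treat truncation as error */
    return (n < 0 || (size_t)n >= buf_size) ? -1 : n;
}
```

For a read-only export with allow_other and a max_read of 65536, this
yields "ro,nosuid,nodev,noatime,max_read=65536,default_permissions,allow_other".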

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c        | 14 +++++++++++---
 tests/qemu-iotests/308     |  4 ++--
 tests/qemu-iotests/308.out |  3 ++-
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 82560ca071..0422cf4b8a 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -246,10 +246,18 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     int ret;
 
     /*
-     * max_read needs to match what fuse_init() sets.
-     * max_write need not be supplied.
+     * Note that these mount options differ from what we would pass to a direct
+     * mount() call:
+     * - nosuid, nodev, and noatime are not understood by the kernel; libfuse
+     *   uses those options to construct the mount flags (MS_*)
+     * - The FUSE kernel driver requires additional options (fd, rootmode,
+     *   user_id, group_id); these will be set by libfuse.
+     * Note that max_read is set here, while max_write is set via the FUSE INIT
+     * operation.
      */
-    mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
+    mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
+                                 "default_permissions%s",
+                                 exp->writable ? "rw" : "ro",
                                  FUSE_MAX_BOUNCE_BYTES,
                                  exp->allow_other ? ",allow_other" : "");
 
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 6eced3aefb..033d5cbe22 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -178,7 +178,7 @@ stat -c 'Permissions pre-chmod: %a' "$EXT_MP"
 chmod u+w "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
 stat -c 'Permissions post-+w: %a' "$EXT_MP"
 
-# But that we can set, say, +x (if we are so inclined)
+# Same for other flags, like, say +x
 chmod u+x "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
 stat -c 'Permissions post-+x: %a' "$EXT_MP"
 
@@ -236,7 +236,7 @@ output=$($QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" 2>&1 \
 
 # Expected reference output: Opening the file fails because it has no
 # write permission
-reference="Could not open 'TEST_DIR/t.IMGFMT': Permission denied"
+reference="Could not open 'TEST_DIR/t.IMGFMT': Read-only file system"
 
 if echo "$output" | grep -q "$reference"; then
     echo "Writing to read-only export failed: OK"
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index e5e233691d..aa96faab6d 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -53,7 +53,8 @@ Images are identical.
 Permissions pre-chmod: 400
 chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
 Permissions post-+w: 400
-Permissions post-+x: 500
+chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
+Permissions post-+x: 400
 
 === Mount over existing file ===
 {'execute': 'block-export-add',
-- 
2.53.0




* [PATCH v4 08/24] fuse: Set direct_io and parallel_direct_writes
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (6 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 07/24] fuse: Fix mount options Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 09/24] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

In fuse_open(), set these flags:
- direct_io: We probably do not want the host page cache to be used for
  our exports.  QEMU block exports are supposed to represent the image
  as-is (and thus potentially changing).
  This causes a change in iotest 308's reference output.

- parallel_direct_writes: We can (now) cope with parallel writes, so we
  should set this flag.  For some reason, it doesn't seem to make an
  actual performance difference with libfuse, but it does make a
  difference without it, so let's set it.
  (See "fuse: Copy write buffer content before polling" for further
  discussion.)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c        | 2 ++
 tests/qemu-iotests/308.out | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 0422cf4b8a..d0e3c6bf61 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -582,6 +582,8 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
 static void fuse_open(fuse_req_t req, fuse_ino_t inode,
                       struct fuse_file_info *fi)
 {
+    fi->direct_io = true;
+    fi->parallel_direct_writes = true;
     fuse_reply_open(req, fi);
 }
 
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index aa96faab6d..2d7a38d63d 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -131,7 +131,7 @@ wrote 65536/65536 bytes at offset 1048576
 
 --- Try growing non-growable export ---
 (OK: Lengths of export and original are the same)
-dd: error writing 'TEST_DIR/t.IMGFMT.fuse': Input/output error
+dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
 1+0 records in
 0+0 records out
 
-- 
2.53.0




* [PATCH v4 09/24] fuse: Introduce fuse_{at,de}tach_handlers()
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (7 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 08/24] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 10/24] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Pull setting up and tearing down the AIO context handlers into two
dedicated functions.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index d0e3c6bf61..5953407f20 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
-static void fuse_export_drained_begin(void *opaque)
+static void fuse_attach_handlers(FuseExport *exp)
 {
-    FuseExport *exp = opaque;
+    aio_set_fd_handler(exp->common.ctx,
+                       fuse_session_fd(exp->fuse_session),
+                       read_from_fuse_export, NULL, NULL, NULL, exp);
+    exp->fd_handler_set_up = true;
+}
 
+static void fuse_detach_handlers(FuseExport *exp)
+{
     aio_set_fd_handler(exp->common.ctx,
                        fuse_session_fd(exp->fuse_session),
                        NULL, NULL, NULL, NULL, NULL);
     exp->fd_handler_set_up = false;
 }
 
+static void fuse_export_drained_begin(void *opaque)
+{
+    fuse_detach_handlers(opaque);
+}
+
 static void fuse_export_drained_end(void *opaque)
 {
     FuseExport *exp = opaque;
 
     /* Refresh AioContext in case it changed */
     exp->common.ctx = blk_get_aio_context(exp->common.blk);
-
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
-    exp->fd_handler_set_up = true;
+    fuse_attach_handlers(exp);
 }
 
 static bool fuse_export_drained_poll(void *opaque)
@@ -209,11 +216,7 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
 
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
-    exp->fd_handler_set_up = true;
-
+    fuse_attach_handlers(exp);
     return 0;
 
 fail:
@@ -335,10 +338,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
         fuse_session_exit(exp->fuse_session);
 
         if (exp->fd_handler_set_up) {
-            aio_set_fd_handler(exp->common.ctx,
-                               fuse_session_fd(exp->fuse_session),
-                               NULL, NULL, NULL, NULL, NULL);
-            exp->fd_handler_set_up = false;
+            fuse_detach_handlers(exp);
         }
     }
 
-- 
2.53.0




* [PATCH v4 10/24] fuse: Introduce fuse_{inc,dec}_in_flight()
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (8 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 09/24] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 11/24] fuse: Add halted flag Hanna Czenczek
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

This is how vduse-blk.c does it, and it does seem better to have
dedicated functions for it.
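
As a rough illustration of the counting scheme (not the QEMU code itself),
the sketch below uses C11 atomics in place of qatomic_*().  The
CountedExport type, its refcount/kicks fields and both function names are
hypothetical stand-ins for blk_exp_ref()/blk_exp_unref() and
aio_wait_kick(): only the first request takes a reference, and only the
last one wakes waiters and drops it.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for an export; not QEMU's FuseExport. */
typedef struct CountedExport {
    atomic_uint in_flight;  /* requests currently being processed */
    unsigned refcount;      /* plain counter, for illustration only */
    unsigned kicks;         /* number of wake-ups sent to waiters */
} CountedExport;

static void counted_inc_in_flight(CountedExport *exp)
{
    /* atomic_fetch_add() returns the old value: 0 means first request */
    if (atomic_fetch_add(&exp->in_flight, 1) == 0) {
        /* Keep the export alive while requests are in flight */
        exp->refcount++;
    }
}

static void counted_dec_in_flight(CountedExport *exp)
{
    /* Old value 1 means this was the last request in flight */
    if (atomic_fetch_sub(&exp->in_flight, 1) == 1) {
        /* Wake anyone waiting for in_flight to reach zero */
        exp->kicks++;
        /* Now the export may be deleted again */
        exp->refcount--;
    }
}
```

Every request path would pair one counted_inc_in_flight() call with exactly
one counted_dec_in_flight() call, including on error paths.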

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 5953407f20..fc75a5e74d 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,6 +78,25 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
+static void fuse_inc_in_flight(FuseExport *exp)
+{
+    if (qatomic_fetch_inc(&exp->in_flight) == 0) {
+        /* Prevent export from being deleted */
+        blk_exp_ref(&exp->common);
+    }
+}
+
+static void fuse_dec_in_flight(FuseExport *exp)
+{
+    if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+        /* Wake AIO_WAIT_WHILE() */
+        aio_wait_kick();
+
+        /* Now the export can be deleted */
+        blk_exp_unref(&exp->common);
+    }
+}
+
 static void fuse_attach_handlers(FuseExport *exp)
 {
     aio_set_fd_handler(exp->common.ctx,
@@ -303,9 +322,7 @@ static void read_from_fuse_export(void *opaque)
     FuseExport *exp = opaque;
     int ret;
 
-    blk_exp_ref(&exp->common);
-
-    qatomic_inc(&exp->in_flight);
+    fuse_inc_in_flight(exp);
 
     do {
         ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
@@ -323,11 +340,7 @@ static void read_from_fuse_export(void *opaque)
     fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
-    if (qatomic_fetch_dec(&exp->in_flight) == 1) {
-        aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
-    }
-
-    blk_exp_unref(&exp->common);
+    fuse_dec_in_flight(exp);
 }
 
 static void fuse_export_shutdown(BlockExport *blk_exp)
-- 
2.53.0




* [PATCH v4 11/24] fuse: Add halted flag
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (9 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 10/24] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 12/24] fuse: fuse_{read,write}: Rename length to blk_len Hanna Czenczek
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

This is a flag that we will want when processing FUSE requests
ourselves: When the kernel sends us e.g. a truncated request (i.e. we
receive less data than the request's indicated length), we cannot rely
on subsequent data to be valid.  Then, we are going to set this flag,
halting all FUSE request processing.

We plan to only use this flag in cases that would effectively be kernel
bugs.

While not necessary yet, access the flag atomically so that it will be
safe to use once we introduce multi-threading.

(Right now, the flag is unused because libfuse still does our request
processing.)
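
The intended behaviour can be sketched as follows, assuming C11 atomics in
place of qatomic_*().  HaltableExport and the haltable_*() functions are
hypothetical names for illustration; the point is only that the flag is
set once, that attaching becomes a no-op afterwards, and that the request
path checks the flag before doing any work.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an export; not QEMU's FuseExport. */
typedef struct HaltableExport {
    atomic_bool halted;     /* set once on unrecoverable error */
    bool handler_attached;  /* stand-in for the FD handler registration */
    unsigned processed;     /* number of requests handled */
} HaltableExport;

static void haltable_attach(HaltableExport *exp)
{
    /* Once halted, never re-attach the handler */
    if (atomic_load(&exp->halted)) {
        return;
    }
    exp->handler_attached = true;
}

static void haltable_detach(HaltableExport *exp)
{
    exp->handler_attached = false;
}

static void haltable_halt(HaltableExport *exp)
{
    /* Set the flag first so a concurrent attach cannot undo the detach */
    atomic_store(&exp->halted, true);
    haltable_detach(exp);
}

/* Returns true if the request was processed, false if the export is halted */
static bool haltable_handle_request(HaltableExport *exp)
{
    if (atomic_load(&exp->halted)) {
        return false;
    }
    exp->processed++;
    return true;
}
```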

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index fc75a5e74d..f6a5f4fa0a 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -53,6 +53,13 @@ typedef struct FuseExport {
     unsigned int in_flight; /* atomic */
     bool mounted, fd_handler_set_up;
 
+    /*
+     * Set when there was an unrecoverable error and no requests should be read
+     * from the device anymore (basically only in case of something we would
+     * consider a kernel bug).  Access atomically.
+     */
+    bool halted;
+
     char *mountpoint;
     bool writable;
     bool growable;
@@ -69,6 +76,7 @@ static const struct fuse_lowlevel_ops fuse_ops;
 
 static void fuse_export_shutdown(BlockExport *exp);
 static void fuse_export_delete(BlockExport *exp);
+static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
 
 static void init_exports_table(void);
 
@@ -99,6 +107,10 @@ static void fuse_dec_in_flight(FuseExport *exp)
 
 static void fuse_attach_handlers(FuseExport *exp)
 {
+    if (qatomic_read(&exp->halted)) {
+        return;
+    }
+
     aio_set_fd_handler(exp->common.ctx,
                        fuse_session_fd(exp->fuse_session),
                        read_from_fuse_export, NULL, NULL, NULL, exp);
@@ -322,6 +334,10 @@ static void read_from_fuse_export(void *opaque)
     FuseExport *exp = opaque;
     int ret;
 
+    if (unlikely(qatomic_read(&exp->halted))) {
+        return;
+    }
+
     fuse_inc_in_flight(exp);
 
     do {
@@ -380,6 +396,20 @@ static void fuse_export_delete(BlockExport *blk_exp)
     g_free(exp->mountpoint);
 }
 
+/**
+ * Halt the export: Detach FD handlers, and set exp->halted to true, preventing
+ * fuse_attach_handlers() from re-attaching them, therefore stopping all further
+ * request processing.
+ *
+ * Call this function when an unrecoverable error happens that makes processing
+ * all future requests unreliable.
+ */
+static void fuse_export_halt(FuseExport *exp)
+{
+    qatomic_set(&exp->halted, true);
+    fuse_detach_handlers(exp);
+}
+
 /**
  * Check whether @path points to a regular file.  If not, put an
  * appropriate message into *errp.
-- 
2.53.0




* [PATCH v4 12/24] fuse: fuse_{read,write}: Rename length to blk_len
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (10 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 11/24] fuse: Add halted flag Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 13/24] iotests/308: Use conv=notrunc to test growability Hanna Czenczek
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

The term "length" is ambiguous, use "blk_len" instead to be clear.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index f6a5f4fa0a..d45c6b814f 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -637,7 +637,7 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
                       size_t size, off_t offset, struct fuse_file_info *fi)
 {
     FuseExport *exp = fuse_req_userdata(req);
-    int64_t length;
+    int64_t blk_len;
     void *buf;
     int ret;
 
@@ -651,14 +651,14 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
      * Clients will expect short reads at EOF, so we have to limit
      * offset+size to the image length.
      */
-    length = blk_getlength(exp->common.blk);
-    if (length < 0) {
-        fuse_reply_err(req, -length);
+    blk_len = blk_getlength(exp->common.blk);
+    if (blk_len < 0) {
+        fuse_reply_err(req, -blk_len);
         return;
     }
 
-    if (offset + size > length) {
-        size = length - offset;
+    if (offset + size > blk_len) {
+        size = blk_len - offset;
     }
 
     buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
@@ -685,7 +685,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
 {
     FuseExport *exp = fuse_req_userdata(req);
     QEMU_AUTO_VFREE void *copied = NULL;
-    int64_t length;
+    int64_t blk_len;
     int ret;
 
     /* Limited by max_write, should not happen */
@@ -711,13 +711,13 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
      */
-    length = blk_getlength(exp->common.blk);
-    if (length < 0) {
-        fuse_reply_err(req, -length);
+    blk_len = blk_getlength(exp->common.blk);
+    if (blk_len < 0) {
+        fuse_reply_err(req, -blk_len);
         return;
     }
 
-    if (offset + size > length) {
+    if (offset + size > blk_len) {
         if (exp->growable) {
             ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
@@ -725,7 +725,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
                 return;
             }
         } else {
-            size = length - offset;
+            size = blk_len - offset;
         }
     }
 
-- 
2.53.0




* [PATCH v4 13/24] iotests/308: Use conv=notrunc to test growability
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (11 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 12/24] fuse: fuse_{read,write}: Rename length to blk_len Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 14/24] fuse: Explicitly handle non-grow post-EOF accesses Hanna Czenczek
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Without conv=notrunc, dd will automatically truncate the output file to
the @seek value at least.  We want to test post-EOF I/O, not truncate,
so pass conv=notrunc.

(It does not make a difference in practice because we only seek to the
EOF, so the truncate effectively does nothing, but this is still
cleaner.)

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 tests/qemu-iotests/308 | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 033d5cbe22..6ecb275555 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -296,7 +296,8 @@ orig_disk_usage=$(disk_usage "$TEST_IMG")
 # Should fail (exports are non-growable by default)
 # (Note that qemu-io can never write beyond the EOF, so we have to use
 # dd here)
-dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$orig_len 2>&1 \
+dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$orig_len \
+    conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 echo
@@ -333,7 +334,7 @@ fuse_export_add \
     'node-protocol'
 
 # Now we should be able to write beyond the EOF
-dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$new_len 2>&1 \
+dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$new_len conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 new_len=$(get_proto_len "$EXT_MP" "$TEST_IMG")
-- 
2.53.0




* [PATCH v4 14/24] fuse: Explicitly handle non-grow post-EOF accesses
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (12 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 13/24] iotests/308: Use conv=notrunc to test growability Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 15/24] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

When reading from / writing to non-growable exports, we cap the I/O size
at `blk_len - offset`.  This will underflow for accesses that are
completely past the disk end.

Check and handle that case explicitly.

This is also enough to ensure that `offset + size` will not overflow;
blk_len is int64_t, size is uint32_t, `offset < blk_len`, so from
`INT64_MAX + UINT32_MAX < UINT64_MAX` it follows that `offset + size`
cannot overflow.

Just one catch: We have to allow write accesses to growable exports past
the EOF, so then we cannot rely on `offset < blk_len`, but have to
verify explicitly that `offset + size` does not overflow.
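
The order of these checks can be sketched in isolation.  The helper below
is hypothetical (clamp_io_size() and its parameters are not part of the
patch) and assumes a 64-bit unsigned offset, a 32-bit size and a
non-negative blk_len: first handle accesses entirely past the EOF, then
check for wrap-around, and only then cap the size.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute how many bytes of a request may be
 * accessed.  Returns 0 and stores the result in *out_size, or returns -1
 * if `offset + size` would overflow.  The caller guarantees blk_len >= 0.
 */
static int clamp_io_size(uint64_t offset, uint32_t size, int64_t blk_len,
                         bool growable_write, uint32_t *out_size)
{
    uint64_t len = (uint64_t)blk_len;

    if (offset >= len && !growable_write) {
        /* Entirely past the EOF: zero-length access, nothing to cap */
        *out_size = 0;
        return 0;
    }

    /*
     * With offset < len <= INT64_MAX, this cannot wrap; it can only
     * happen for growable writes, where offset is not bounded by len.
     */
    if (offset + size < offset) {
        return -1;
    }

    if (offset + size > len && !growable_write) {
        /* Partially past the EOF: len - offset is smaller than size here */
        size = (uint32_t)(len - offset);
    }

    *out_size = size;
    return 0;
}
```

For a growable write, the caller would grow the image to `offset + size`
instead of capping the size.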

The negative consequences of not having this commit are luckily limited
because blk_pread() and blk_pwrite() will reject post-EOF requests
anyway, so a `size` underflow post-EOF will just result in an I/O error.
So:
- Post-EOF reads will incorrectly result in I/O errors instead of just
  0-length reads.  We will also attempt to allocate a very large buffer,
  which is wrong and not good, but not terrible.
- Post-EOF writes on non-growable exports will result in I/O errors
  instead of 0-length writes (which generally indicate ENOSPC).
- Post-EOF writes on growable exports can theoretically overflow
  `offset + size` and truncate the export down to a much too small size,
  but in practice, FUSE will never send an offset greater than
  INT64_MAX, preventing a uint64_t overflow.  (fuse_write_args_fill() in
  the kernel uses loff_t for the offset, which is signed.)

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c        | 20 +++++++++++++++++++-
 tests/qemu-iotests/308     | 35 ++++++++++++++++++++++++++++++-----
 tests/qemu-iotests/308.out | 10 ++++++++++
 3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index d45c6b814f..af0a8de17b 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -657,6 +657,16 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
         return;
     }
 
+    if (offset >= blk_len) {
+        /*
+         * Technically libfuse does not allow returning a zero error code for
+         * read requests, but in practice this is a 0-length read (and a future
+         * commit will change this code anyway)
+         */
+        fuse_reply_err(req, 0);
+        return;
+    }
+
     if (offset + size > blk_len) {
         size = blk_len - offset;
     }
@@ -717,7 +727,15 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
         return;
     }
 
-    if (offset + size > blk_len) {
+    if (offset >= blk_len && !exp->growable) {
+        fuse_reply_write(req, 0);
+        return;
+    }
+
+    if (offset + size < offset) {
+        fuse_reply_err(req, EINVAL);
+        return;
+    } else if (offset + size > blk_len) {
         if (exp->growable) {
             ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 6ecb275555..a83c6fc01f 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -300,16 +300,34 @@ dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$orig_len \
     conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
+# And one really squarely post-EOF write
+dd if=/dev/zero of="$EXT_MP" bs=1 count=1 seek=$((orig_len + 32 * 1024)) \
+    conv=notrunc 2>&1 \
+    | _filter_testdir | _filter_imgfmt
+
+# Half-post-EOF reads
+dd if="$EXT_MP" of=/dev/null bs=1 count=64k skip=$((orig_len - 32 * 1024)) \
+    2>&1 | _filter_testdir | _filter_imgfmt
+
+# And one really squarely post-EOF read
+dd if="$EXT_MP" of=/dev/null bs=1 count=1 skip=$((orig_len + 32 * 1024)) \
+    2>&1 | _filter_testdir | _filter_imgfmt
+
 echo
 echo '--- Resize export ---'
 
 # But we can truncate it explicitly; even with fallocate
-fallocate -o "$orig_len" -l 64k "$EXT_MP"
+# (Make sure we extend it to a length not divisible by 128k, we need that below)
+bs=$((128 * 1024))
+extend_to=$(((orig_len + bs - 1) / bs * bs + bs / 2))
+extend_by=$((extend_to - orig_len))
+
+fallocate -o "$orig_len" -l $extend_by "$EXT_MP"
 
 new_len=$(get_proto_len "$EXT_MP" "$TEST_IMG")
-if [ "$new_len" != "$((orig_len + 65536))" ]; then
+if [ "$new_len" != "$extend_to" ]; then
     echo 'ERROR: Unexpected post-truncate image size:'
-    echo "$new_len != $((orig_len + 65536))"
+    echo "$new_len != $extend_to"
 else
     echo 'OK: Post-truncate image size is as expected'
 fi
@@ -322,6 +340,13 @@ else
     echo "$orig_disk_usage => $new_disk_usage"
 fi
 
+# Use this opportunity to test a read access across the (now no longer so much
+# aligned) EOF.  dd can only do requests with a length of its block size, and
+# all of its seek/skip values are in bs units, so it is hard to do a request
+# across the EOF if the EOF is at a power of two (64M).
+dd if="$EXT_MP" of=/dev/null bs=$bs count=2 skip=$((extend_to / bs)) \
+    2>&1 | _filter_testdir | _filter_imgfmt
+
 echo
 echo '--- Try growing growable export ---'
 
@@ -338,9 +363,9 @@ dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$new_len conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 new_len=$(get_proto_len "$EXT_MP" "$TEST_IMG")
-if [ "$new_len" != "$((orig_len + 131072))" ]; then
+if [ "$new_len" != "$((extend_to + 65536))" ]; then
     echo 'ERROR: Unexpected post-grow image size:'
-    echo "$new_len != $((orig_len + 131072))"
+    echo "$new_len != $((extend_to + 65536))"
 else
     echo 'OK: Post-grow image size is as expected'
 fi
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 2d7a38d63d..ebeaf64b48 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -134,11 +134,21 @@ wrote 65536/65536 bytes at offset 1048576
 dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
 1+0 records in
 0+0 records out
+dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
+1+0 records in
+0+0 records out
+32768+0 records in
+32768+0 records out
+dd: TEST_DIR/t.IMGFMT.fuse: cannot skip to specified offset
+0+0 records in
+0+0 records out
 
 --- Resize export ---
 (OK: Lengths of export and original are the same)
 OK: Post-truncate image size is as expected
 OK: Disk usage grew with fallocate
+0+1 records in
+0+1 records out
 
 --- Try growing growable export ---
 {'execute': 'block-export-del',
-- 
2.53.0




* [PATCH v4 15/24] block: Move qemu_fcntl_addfl() into osdep.c
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (13 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 14/24] fuse: Explicitly handle non-grow post-EOF accesses Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 16/24] fuse: Manually process requests (without libfuse) Hanna Czenczek
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Move file-posix's helper to add a flag (or a set of flags) to an FD's
existing set of flags into osdep.c for other places to use.

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/qemu/osdep.h |  1 +
 block/file-posix.c   | 17 +----------------
 util/osdep.c         | 18 ++++++++++++++++++
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b384b5b506..f151578b5c 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -633,6 +633,7 @@ int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive);
 int qemu_unlock_fd(int fd, int64_t start, int64_t len);
 int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
 bool qemu_has_ofd_lock(void);
+int qemu_fcntl_addfl(int fd, int flag);
 #endif
 
 bool qemu_has_direct_io(void);
diff --git a/block/file-posix.c b/block/file-posix.c
index 6265d2e248..e49b13d6ab 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1056,21 +1056,6 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
     return ret;
 }
 
-/* Sets a specific flag */
-static int fcntl_setfl(int fd, int flag)
-{
-    int flags;
-
-    flags = fcntl(fd, F_GETFL);
-    if (flags == -1) {
-        return -errno;
-    }
-    if (fcntl(fd, F_SETFL, flags | flag) == -1) {
-        return -errno;
-    }
-    return 0;
-}
-
 static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
                                  int *open_flags, uint64_t perm, Error **errp)
 {
@@ -1109,7 +1094,7 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
         /* dup the original fd */
         fd = qemu_dup(s->fd);
         if (fd >= 0) {
-            ret = fcntl_setfl(fd, *open_flags);
+            ret = qemu_fcntl_addfl(fd, *open_flags);
             if (ret) {
                 qemu_close(fd);
                 fd = -1;
diff --git a/util/osdep.c b/util/osdep.c
index 770369831b..000e7daac8 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -280,6 +280,24 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
         return fl.l_type == F_UNLCK ? 0 : -EAGAIN;
     }
 }
+
+/**
+ * Set the given flag(s) (fcntl GETFL/SETFL) on the given FD, while retaining
+ * other flags.
+ */
+int qemu_fcntl_addfl(int fd, int flag)
+{
+    int flags;
+
+    flags = fcntl(fd, F_GETFL);
+    if (flags == -1) {
+        return -errno;
+    }
+    if (fcntl(fd, F_SETFL, flags | flag) == -1) {
+        return -errno;
+    }
+    return 0;
+}
 #endif
 
 bool qemu_has_direct_io(void)
-- 
2.53.0




* [PATCH v4 16/24] fuse: Manually process requests (without libfuse)
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (14 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 15/24] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-26 19:26   ` Kevin Wolf
  2026-02-26 19:29   ` Kevin Wolf
  2026-02-18 13:26 ` [PATCH v4 17/24] fuse: Reduce max read size Hanna Czenczek
                   ` (8 subsequent siblings)
  24 siblings, 2 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Manually read requests from the /dev/fuse FD and process them, without
using libfuse.  This allows us to safely add parallel request processing
in coroutines later, without having to worry about libfuse internals.
(Technically, we already have exactly that problem with
read_from_fuse_export()/read_from_fuse_fd() nesting.)
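
The split-buffer read that this patch's comments describe (header part
into one buffer, everything after it into a separate data buffer, via
readv()) can be illustrated with a simplified sketch.  SplitRequestBuf,
HEAD_LEN, TAIL_LEN and both functions are hypothetical and far smaller
than the real FUSE structures; the sketch only shows the scatter read and
the copy-back of header bytes that spilled into the data buffer.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define HEAD_LEN 8  /* hypothetical: header length of a WRITE request */
#define TAIL_LEN 8  /* hypothetical: extra header bytes of longer requests */

typedef struct SplitRequestBuf {
    char head[HEAD_LEN];
    char tail[TAIL_LEN];
} SplitRequestBuf;

/*
 * Read one message from fd: the first HEAD_LEN bytes go into req->head,
 * everything after that goes into data.  Returns the total number of
 * bytes read, or -1 on error (errno is set by readv()).
 */
static ssize_t read_split(int fd, SplitRequestBuf *req,
                          void *data, size_t data_len)
{
    struct iovec iov[2] = {
        { .iov_base = req->head, .iov_len = HEAD_LEN },
        { .iov_base = data,      .iov_len = data_len },
    };

    return readv(fd, iov, 2);
}

/*
 * For a request whose header is longer than HEAD_LEN, the rest of the
 * header has landed at the start of data; copy it back into req->tail.
 */
static void restore_tail(SplitRequestBuf *req, const void *data, size_t total)
{
    size_t spilled = total > HEAD_LEN ? total - HEAD_LEN : 0;

    if (spilled > TAIL_LEN) {
        spilled = TAIL_LEN;
    }
    memcpy(req->tail, data, spilled);
}
```

With this layout, a WRITE request's payload ends up exactly at the start
of the data buffer, so it can be used for I/O without another copy.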

We will continue to use libfuse for mounting the filesystem; fusermount3
is effectively a helper program of libfuse, so it should know best how
to interact with it.  (Doing it manually without libfuse, while doable,
is a bit of a pain, and it is not clear to me how stable the "protocol"
actually is.)

Take this opportunity of quite a major rewrite to update the Copyright
line with corrected information that has surfaced in the meantime.

Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
except 'sync', which are iodepth=1 and pvsync2):

file:
  read:
    seq aio:    99.8k ±1.5k IOPS
    rand aio:   50.5k ±1.0k
    seq sync:   36.1k ±1.1k
    rand sync:  10.0k ±0.1k
  write:
    seq aio:    72.0k ±9.3k
    rand aio:   70.6k ±2.5k
    seq sync:   30.6k ±0.8k
    rand sync:  30.1k ±1.0k
null:
  read:
    seq aio:   157.9k ±4.7k
    rand aio:  158.7k ±4.8k
    seq sync:   80.2k ±2.8k
    rand sync:  77.5k ±3.8k
  write:
    seq aio:   154.3k ±3.6k
    rand aio:  154.3k ±4.2k
    seq sync:   76.1k ±5.2k
    rand sync:  72.9k ±4.0k

And with this patch applied:

file:
  read:
    seq aio:   106.8k ±1.9k (+7%)
    rand aio:   48.3k ±8.8k (-4%)
    seq sync:   35.5k ±1.4k (-2%)
    rand sync:  10.0k ±0.2k (±0%)
  write:
    seq aio:    76.3k ±6.6k (+6%)
    rand aio:   76.4k ±1.5k (+8%)
    seq sync:   31.6k ±0.6k (+3%)
    rand sync:  30.9k ±0.8k (+3%)
null:
  read:
    seq aio:   161.7k ±6.0k (+2%)
    rand aio:  165.6k ±7.1k (+4%)
    seq sync:   80.5k ±3.0k (±0%)
    rand sync:  78.5k ±3.1k (+1%)
  write:
    seq aio:   185.1k ±3.3k (+20%)
    rand aio:  186.7k ±4.8k (+21%)
    seq sync:   82.5k ±4.2k (+8%)
    rand sync:  78.7k ±3.2k (+8%)

So not much difference, aside from write AIO to a null-co export getting
a bit better.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 944 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 720 insertions(+), 224 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index af0a8de17b..c481fb72a2 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -1,7 +1,7 @@
 /*
  * Present a block device as a raw image through FUSE
  *
- * Copyright (c) 2020 Max Reitz <mreitz@redhat.com>
+ * Copyright (c) 2020, 2025 Hanna Czenczek <hreitz@redhat.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -27,12 +27,15 @@
 #include "block/qapi.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block.h"
+#include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "system/block-backend.h"
 
 #include <fuse.h>
 #include <fuse_lowlevel.h>
 
+#include "standard-headers/linux/fuse.h"
+
 #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
 #endif
@@ -42,17 +45,102 @@
 #endif
 
 /* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+/* Small enough to fit in the request buffer */
+#define FUSE_MAX_WRITE_BYTES (64 * 1024)
 
+/*
+ * fuse_init_in structure before 7.36.  We don't need the flags2 field added
+ * there, so we can work with the smaller older structure to stay compatible
+ * with older kernels.
+ */
+struct fuse_init_in_compat {
+    uint32_t major;
+    uint32_t minor;
+    uint32_t max_readahead;
+    uint32_t flags;
+};
+
+typedef struct FuseRequestInHeader {
+    struct fuse_in_header common;
+    /* All supported requests */
+    union {
+        struct fuse_init_in_compat init;
+        struct fuse_open_in open;
+        struct fuse_setattr_in setattr;
+        struct fuse_read_in read;
+        struct fuse_write_in write;
+        struct fuse_fallocate_in fallocate;
+#ifdef CONFIG_FUSE_LSEEK
+        struct fuse_lseek_in lseek;
+#endif
+    };
+} FuseRequestInHeader;
+
+typedef struct FuseRequestOutHeader {
+    struct fuse_out_header common;
+    /* All supported requests */
+    union {
+        struct fuse_init_out init;
+        struct fuse_statfs_out statfs;
+        struct fuse_open_out open;
+        struct fuse_attr_out attr;
+        struct fuse_write_out write;
+#ifdef CONFIG_FUSE_LSEEK
+        struct fuse_lseek_out lseek;
+#endif
+    };
+} FuseRequestOutHeader;
+
+typedef union FuseRequestInHeaderBuf {
+    struct FuseRequestInHeader structured;
+    struct {
+        /*
+         * Part of the request header that is filled for write requests
+         * (Needed because we want the data to go into a different buffer, to
+         * avoid having to use a bounce buffer)
+         */
+        char head[sizeof(struct fuse_in_header) +
+                  sizeof(struct fuse_write_in)];
+        /*
+         * Rest of the request header for requests that have a longer header
+         * than write requests
+         */
+        char tail[sizeof(FuseRequestInHeader) -
+                  (sizeof(struct fuse_in_header) +
+                   sizeof(struct fuse_write_in))];
+    };
+} FuseRequestInHeaderBuf;
+
+QEMU_BUILD_BUG_ON(sizeof(FuseRequestInHeaderBuf) !=
+                  sizeof(FuseRequestInHeader));
+QEMU_BUILD_BUG_ON(sizeof(((FuseRequestInHeaderBuf *)0)->head) +
+                  sizeof(((FuseRequestInHeaderBuf *)0)->tail) !=
+                  sizeof(FuseRequestInHeader));
 
 typedef struct FuseExport {
     BlockExport common;
 
     struct fuse_session *fuse_session;
-    struct fuse_buf fuse_buf;
     unsigned int in_flight; /* atomic */
     bool mounted, fd_handler_set_up;
 
+    /*
+     * Cached buffer to receive the data of WRITE requests.  Cached because:
+     * To read requests, we put a FuseRequestInHeaderBuf (FRIHB) object on the
+     * stack, and a (WRITE data) buffer on the heap.  We pass FRIHB.head and the
+     * data buffer to readv().  This way, for WRITE requests, we get exactly
+     * their data in the data buffer and can avoid bounce buffering.
+     * However, for non-WRITE requests, some of the header may end up in the
+     * data buffer, so we will need to copy that back into the FRIHB object, and
+     * then we don't need the heap buffer anymore.  That is why we cache it, so
+     * we can trivially reuse it between non-WRITE requests.
+     *
+     * Note that these data buffers and thus req_write_data_cached are allocated
+     * via blk_blockalign() and thus need to be freed via qemu_vfree().
+     */
+    void *req_write_data_cached;
+
     /*
      * Set when there was an unrecoverable error and no requests should be read
      * from the device anymore (basically only in case of something we would
@@ -60,6 +148,8 @@ typedef struct FuseExport {
      */
     bool halted;
 
+    int fuse_fd;
+
     char *mountpoint;
     bool writable;
     bool growable;
@@ -71,20 +161,31 @@ typedef struct FuseExport {
     gid_t st_gid;
 } FuseExport;
 
+/*
+ * Verify that the size of FuseRequestInHeaderBuf.head plus the data
+ * buffer are big enough to be accepted by the FUSE kernel driver.
+ */
+QEMU_BUILD_BUG_ON(sizeof(((FuseRequestInHeaderBuf *)0)->head) +
+                  FUSE_MAX_WRITE_BYTES <
+                  FUSE_MIN_READ_BUFFER);
+
 static GHashTable *exports;
-static const struct fuse_lowlevel_ops fuse_ops;
 
 static void fuse_export_shutdown(BlockExport *exp);
 static void fuse_export_delete(BlockExport *exp);
-static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
+static void fuse_export_halt(FuseExport *exp);
 
 static void init_exports_table(void);
 
 static int mount_fuse_export(FuseExport *exp, Error **errp);
-static void read_from_fuse_export(void *opaque);
 
 static bool is_regular_file(const char *path, Error **errp);
 
+static void read_from_fuse_fd(void *opaque);
+static void fuse_process_request(FuseExport *exp,
+                                 const FuseRequestInHeader *in_hdr,
+                                 const void *data_buffer);
+static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err);
 
 static void fuse_inc_in_flight(FuseExport *exp)
 {
@@ -105,22 +206,26 @@ static void fuse_dec_in_flight(FuseExport *exp)
     }
 }
 
+/**
+ * Attach FUSE FD read handler.
+ */
 static void fuse_attach_handlers(FuseExport *exp)
 {
     if (qatomic_read(&exp->halted)) {
         return;
     }
 
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
+    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
+                       read_from_fuse_fd, NULL, NULL, NULL, exp);
     exp->fd_handler_set_up = true;
 }
 
+/**
+ * Detach FUSE FD read handler.
+ */
 static void fuse_detach_handlers(FuseExport *exp)
 {
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
+    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
                        NULL, NULL, NULL, NULL, NULL);
     exp->fd_handler_set_up = false;
 }
@@ -247,6 +352,13 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
 
+    exp->fuse_fd = fuse_session_fd(exp->fuse_session);
+    ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
+        goto fail;
+    }
+
     fuse_attach_handlers(exp);
     return 0;
 
@@ -278,6 +390,17 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     char *mount_opts;
     struct fuse_args fuse_args;
     int ret;
+    /*
+     * We just create the session for mounting/unmounting, no need to provide
+     * any operations.  However, since libfuse commit 52a633a5d, we have to
+     * provide some op struct and cannot just pass NULL (even though the commit
+     * message ("allow passing ops as NULL") seems to imply the exact opposite,
+     * as does the comment added to fuse_session_new_fn() ("To create a no-op
+     * session just for mounting pass op as NULL.")).
+     * This is how said libfuse commit implements a no-op session internally, so
+     * do it the same way.
+     */
+    static const struct fuse_lowlevel_ops null_ops = { 0 };
 
     /*
      * Note that these mount options differ from what we would pass to a direct
@@ -292,7 +415,7 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
                                  "default_permissions%s",
                                  exp->writable ? "rw" : "ro",
-                                 FUSE_MAX_BOUNCE_BYTES,
+                                 FUSE_MAX_READ_BYTES,
                                  exp->allow_other ? ",allow_other" : "");
 
     fuse_argv[0] = ""; /* Dummy program name */
@@ -301,8 +424,8 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     fuse_argv[3] = NULL;
     fuse_args = (struct fuse_args)FUSE_ARGS_INIT(3, (char **)fuse_argv);
 
-    exp->fuse_session = fuse_session_new(&fuse_args, &fuse_ops,
-                                         sizeof(fuse_ops), exp);
+    exp->fuse_session = fuse_session_new(&fuse_args, &null_ops,
+                                         sizeof(null_ops), NULL);
     g_free(mount_opts);
     if (!exp->fuse_session) {
         error_setg(errp, "Failed to set up FUSE session");
@@ -326,36 +449,163 @@ fail:
 }
 
 /**
- * Callback to be invoked when the FUSE session FD can be read from.
- * (This is basically the FUSE event loop.)
+ * Allocate a buffer to receive WRITE data, or take the cached one.
  */
-static void read_from_fuse_export(void *opaque)
+static void *get_write_data_buffer(FuseExport *exp)
 {
-    FuseExport *exp = opaque;
-    int ret;
+    if (exp->req_write_data_cached) {
+        void *cached = exp->req_write_data_cached;
+        exp->req_write_data_cached = NULL;
+        return cached;
+    } else {
+        return blk_blockalign(exp->common.blk, FUSE_MAX_WRITE_BYTES);
+    }
+}
 
-    if (unlikely(qatomic_read(&exp->halted))) {
+/**
+ * Release a WRITE data buffer, possibly reusing it for a subsequent request.
+ */
+static void release_write_data_buffer(FuseExport *exp, void **buffer)
+{
+    if (!*buffer) {
         return;
     }
 
+    if (!exp->req_write_data_cached) {
+        exp->req_write_data_cached = *buffer;
+    } else {
+        qemu_vfree(*buffer);
+    }
+    *buffer = NULL;
+}
+
+/**
+ * Return the length of the specific operation's own in_header.
+ * Return -ENOSYS if the operation is not supported.
+ */
+static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
+{
+    switch (in_hdr->common.opcode) {
+    case FUSE_INIT:
+        return sizeof(in_hdr->init);
+    case FUSE_OPEN:
+        return sizeof(in_hdr->open);
+    case FUSE_SETATTR:
+        return sizeof(in_hdr->setattr);
+    case FUSE_READ:
+        return sizeof(in_hdr->read);
+    case FUSE_WRITE:
+        return sizeof(in_hdr->write);
+    case FUSE_FALLOCATE:
+        return sizeof(in_hdr->fallocate);
+#ifdef CONFIG_FUSE_LSEEK
+    case FUSE_LSEEK:
+        return sizeof(in_hdr->lseek);
+#endif
+    case FUSE_DESTROY:
+    case FUSE_STATFS:
+    case FUSE_RELEASE:
+    case FUSE_LOOKUP:
+    case FUSE_FORGET:
+    case FUSE_BATCH_FORGET:
+    case FUSE_GETATTR:
+    case FUSE_FSYNC:
+    case FUSE_FLUSH:
+        /* These requests don't have their own header or we don't care */
+        return 0;
+    default:
+        return -ENOSYS;
+    }
+}
+
+/**
+ * Try to read and process a single request from the FUSE FD.
+ */
+static void read_from_fuse_fd(void *opaque)
+{
+    FuseExport *exp = opaque;
+    int fuse_fd = exp->fuse_fd;
+    ssize_t ret;
+    FuseRequestInHeaderBuf in_hdr_buf;
+    const FuseRequestInHeader *in_hdr;
+    void *data_buffer = NULL;
+    struct iovec iov[2];
+    ssize_t op_hdr_len;
+
     fuse_inc_in_flight(exp);
 
-    do {
-        ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
-    } while (ret == -EINTR);
-    if (ret < 0) {
-        goto out;
+    if (unlikely(qatomic_read(&exp->halted))) {
+        goto no_request;
+    }
+
+    data_buffer = get_write_data_buffer(exp);
+
+    /* Construct the I/O vector to hold the FUSE request */
+    iov[0] = (struct iovec) { &in_hdr_buf.head, sizeof(in_hdr_buf.head) };
+    iov[1] = (struct iovec) { data_buffer, FUSE_MAX_WRITE_BYTES };
+    ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
+    if (ret < 0 && errno == EAGAIN) {
+        /* No request available */
+        goto no_request;
+    } else if (unlikely(ret < 0)) {
+        error_report("Failed to read from FUSE device: %s", strerror(errno));
+        goto no_request;
+    }
+
+    if (unlikely(ret < sizeof(in_hdr->common))) {
+        error_report("Incomplete read from FUSE device, expected at least %zu "
+                     "bytes, read %zi bytes; cannot trust subsequent "
+                     "requests, halting the export",
+                     sizeof(in_hdr->common), ret);
+        fuse_export_halt(exp);
+        goto no_request;
+    }
+    in_hdr = &in_hdr_buf.structured;
+
+    if (unlikely(ret != in_hdr->common.len)) {
+        error_report("Number of bytes read from FUSE device does not match "
+                     "request size, expected %" PRIu32 " bytes, read %zi "
+                     "bytes; cannot trust subsequent requests, halting the "
+                     "export",
+                     in_hdr->common.len, ret);
+        fuse_export_halt(exp);
+        goto no_request;
+    }
+
+    op_hdr_len = req_op_hdr_len(in_hdr);
+    if (op_hdr_len < 0) {
+        fuse_write_err(fuse_fd, &in_hdr->common, op_hdr_len);
+        goto no_request;
+    }
+
+    if (unlikely(ret < sizeof(in_hdr->common) + op_hdr_len)) {
+        error_report("FUSE request truncated, expected %zu bytes, read %zi "
+                     "bytes",
+                     sizeof(in_hdr->common) + op_hdr_len, ret);
+        fuse_write_err(fuse_fd, &in_hdr->common, -EINVAL);
+        goto no_request;
     }
 
     /*
-     * Note that aio_poll() in any request-processing function can lead to a
-     * nested read_from_fuse_export() call, which will overwrite the contents of
-     * exp->fuse_buf.  Anything that takes a buffer needs to take care that the
-     * content is copied before potentially polling via aio_poll().
+     * Only WRITE uses the write data buffer, so for non-WRITE requests longer
+     * than .head, we need to copy any data that spilled into data_buffer into
+     * .tail.  Then we can release the write data buffer.
      */
-    fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
+    if (in_hdr->common.opcode != FUSE_WRITE) {
+        if (ret > sizeof(in_hdr_buf.head)) {
+            size_t len;
+            /* Limit size to prevent overflow */
+            len = MIN(ret - sizeof(in_hdr_buf.head), sizeof(in_hdr_buf.tail));
+            memcpy(in_hdr_buf.tail, data_buffer, len);
+        }
 
-out:
+        release_write_data_buffer(exp, &data_buffer);
+    }
+
+    fuse_process_request(exp, in_hdr, data_buffer);
+
+no_request:
+    release_write_data_buffer(exp, &data_buffer);
     fuse_dec_in_flight(exp);
 }
 
@@ -363,18 +613,14 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
-    if (exp->fuse_session) {
-        fuse_session_exit(exp->fuse_session);
-
-        if (exp->fd_handler_set_up) {
-            fuse_detach_handlers(exp);
-        }
+    if (exp->fd_handler_set_up) {
+        fuse_detach_handlers(exp);
     }
 
     if (exp->mountpoint) {
         /*
-         * Safe to drop now, because we will not handle any requests
-         * for this export anymore anyway.
+         * Safe to drop now, because we will not handle any requests for this
+         * export anymore anyway (at least not from the main thread).
          */
         g_hash_table_remove(exports, exp->mountpoint);
     }
@@ -392,7 +638,7 @@ static void fuse_export_delete(BlockExport *blk_exp)
         fuse_session_destroy(exp->fuse_session);
     }
 
-    free(exp->fuse_buf.mem);
+    qemu_vfree(exp->req_write_data_cached);
     g_free(exp->mountpoint);
 }
 
@@ -434,46 +680,101 @@ static bool is_regular_file(const char *path, Error **errp)
 }
 
 /**
- * A chance to set change some parameters supplied to FUSE_INIT.
+ * Process FUSE INIT.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_init(void *userdata, struct fuse_conn_info *conn)
+static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
+                         const struct fuse_init_in_compat *in)
 {
+    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
+
+    if (in->major != 7) {
+        error_report("FUSE major version mismatch: We have 7, but kernel has %"
+                     PRIu32, in->major);
+        return -EINVAL;
+    }
+
+    /* 2007's 7.9 added fuse_attr.blksize; working around that would be hard */
+    if (in->minor < 9) {
+        error_report("FUSE minor version too old: 9 required, but kernel has %"
+                     PRIu32, in->minor);
+        return -EINVAL;
+    }
+
+    *out = (struct fuse_init_out) {
+        .major = 7,
+        .minor = MIN(FUSE_KERNEL_MINOR_VERSION, in->minor),
+        .max_readahead = in->max_readahead,
+        .max_write = FUSE_MAX_WRITE_BYTES,
+        .flags = in->flags & supported_flags,
+        .flags2 = 0,
+
+        /* libfuse maximum: 2^16 - 1 */
+        .max_background = UINT16_MAX,
+
+        /* libfuse default: max_background * 3 / 4 */
+        .congestion_threshold = (int)UINT16_MAX * 3 / 4,
+
+        /* libfuse default: 1 */
+        .time_gran = 1,
+
+        /*
+         * probably unneeded without FUSE_MAX_PAGES, but this would be the
+         * libfuse default
+         */
+        .max_pages = DIV_ROUND_UP(FUSE_MAX_WRITE_BYTES,
+                                  qemu_real_host_page_size()),
+
+        /* Only needed for mappings (i.e. DAX) */
+        .map_alignment = 0,
+    };
+
     /*
-     * MIN_NON_ZERO() would not be wrong here, but what we set here
-     * must equal what has been passed to fuse_session_new().
-     * Therefore, as long as max_read must be passed as a mount option
-     * (which libfuse claims will be changed at some point), we have
-     * to set max_read to a fixed value here.
+     * Before 7.23, fuse_init_out is shorter.
+     * Drop the tail (time_gran, max_pages, map_alignment).
      */
-    conn->max_read = FUSE_MAX_BOUNCE_BYTES;
-
-    conn->max_write = MIN_NON_ZERO(BDRV_REQUEST_MAX_BYTES, conn->max_write);
+    return out->minor >= 23 ? sizeof(*out) : FUSE_COMPAT_22_INIT_OUT_SIZE;
 }
 
 /**
- * Let clients look up files.  Always return ENOENT because we only
- * care about the mountpoint itself.
+ * Return some filesystem information, just to not break e.g. `df`.
  */
-static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
+static ssize_t fuse_statfs(FuseExport *exp, struct fuse_statfs_out *out)
 {
-    fuse_reply_err(req, ENOENT);
+    BlockDriverState *root_bs;
+    uint32_t opt_transfer = 512;
+
+    root_bs = blk_bs(exp->common.blk);
+    if (root_bs) {
+        opt_transfer = root_bs->bl.opt_transfer;
+        if (!opt_transfer) {
+            opt_transfer = root_bs->bl.request_alignment;
+        }
+        opt_transfer = MAX(opt_transfer, 512);
+    }
+
+    *out = (struct fuse_statfs_out) {
+        /* These are the fields libfuse sets by default */
+        .st = {
+            .namelen = 255,
+            .bsize = opt_transfer,
+        },
+    };
+    return sizeof(*out);
 }
 
 /**
  * Let clients get file attributes (i.e., stat() the file).
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
-                         struct fuse_file_info *fi)
+static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
 {
-    struct stat statbuf;
     int64_t length, allocated_blocks;
     time_t now = time(NULL);
-    FuseExport *exp = fuse_req_userdata(req);
 
     length = blk_getlength(exp->common.blk);
     if (length < 0) {
-        fuse_reply_err(req, -length);
-        return;
+        return length;
     }
 
     allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
@@ -483,21 +784,24 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
         allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
     }
 
-    statbuf = (struct stat) {
-        .st_ino     = 1,
-        .st_mode    = exp->st_mode,
-        .st_nlink   = 1,
-        .st_uid     = exp->st_uid,
-        .st_gid     = exp->st_gid,
-        .st_size    = length,
-        .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
-        .st_blocks  = allocated_blocks,
-        .st_atime   = now,
-        .st_mtime   = now,
-        .st_ctime   = now,
+    *out = (struct fuse_attr_out) {
+        .attr_valid = 1,
+        .attr = {
+            .ino        = 1,
+            .mode       = exp->st_mode,
+            .nlink      = 1,
+            .uid        = exp->st_uid,
+            .gid        = exp->st_gid,
+            .size       = length,
+            .blksize    = blk_bs(exp->common.blk)->bl.request_alignment,
+            .blocks     = allocated_blocks,
+            .atime      = now,
+            .mtime      = now,
+            .ctime      = now,
+        },
     };
 
-    fuse_reply_attr(req, &statbuf, 1.);
+    return sizeof(*out);
 }
 
 static int fuse_do_truncate(const FuseExport *exp, int64_t size,
@@ -550,101 +854,98 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
  * permit access: Read-only exports cannot be given +w, and exports
  * without allow_other cannot be given a different UID or GID, and
  * they cannot be given non-owner access.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
-                         int to_set, struct fuse_file_info *fi)
+static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
+                            uint32_t to_set, uint64_t size, uint32_t mode,
+                            uint32_t uid, uint32_t gid)
 {
-    FuseExport *exp = fuse_req_userdata(req);
     int supported_attrs;
     int ret;
 
-    supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
+    /* SIZE and MODE are actually supported; the others can be safely ignored */
+    supported_attrs = FATTR_SIZE | FATTR_MODE |
+        FATTR_FH | FATTR_LOCKOWNER | FATTR_KILL_SUIDGID;
     if (exp->allow_other) {
-        supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
+        supported_attrs |= FATTR_UID | FATTR_GID;
     }
 
     if (to_set & ~supported_attrs) {
-        fuse_reply_err(req, ENOTSUP);
-        return;
+        return -ENOTSUP;
     }
 
     /* Do some argument checks first before committing to anything */
-    if (to_set & FUSE_SET_ATTR_MODE) {
+    if (to_set & FATTR_MODE) {
         /*
          * Without allow_other, non-owners can never access the export, so do
          * not allow setting permissions for them
          */
-        if (!exp->allow_other &&
-            (statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
-        {
-            fuse_reply_err(req, EPERM);
-            return;
+        if (!exp->allow_other && (mode & (S_IRWXG | S_IRWXO)) != 0) {
+            return -EPERM;
         }
 
         /* +w for read-only exports makes no sense, disallow it */
-        if (!exp->writable &&
-            (statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
-        {
-            fuse_reply_err(req, EROFS);
-            return;
+        if (!exp->writable && (mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0) {
+            return -EROFS;
         }
     }
 
-    if (to_set & FUSE_SET_ATTR_SIZE) {
+    if (to_set & FATTR_SIZE) {
         if (!exp->writable) {
-            fuse_reply_err(req, EACCES);
-            return;
+            return -EACCES;
         }
 
-        ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+        ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
         if (ret < 0) {
-            fuse_reply_err(req, -ret);
-            return;
+            return ret;
         }
     }
 
-    if (to_set & FUSE_SET_ATTR_MODE) {
+    if (to_set & FATTR_MODE) {
         /* Ignore FUSE-supplied file type, only change the mode */
-        exp->st_mode = (statbuf->st_mode & 07777) | S_IFREG;
+        exp->st_mode = (mode & 07777) | S_IFREG;
     }
 
-    if (to_set & FUSE_SET_ATTR_UID) {
-        exp->st_uid = statbuf->st_uid;
+    if (to_set & FATTR_UID) {
+        exp->st_uid = uid;
     }
 
-    if (to_set & FUSE_SET_ATTR_GID) {
-        exp->st_gid = statbuf->st_gid;
+    if (to_set & FATTR_GID) {
+        exp->st_gid = gid;
     }
 
-    fuse_getattr(req, inode, fi);
+    return fuse_getattr(exp, out);
 }
 
 /**
- * Let clients open a file (i.e., the exported image).
+ * Open an inode.  We only have a single inode in our exported filesystem, so we
+ * just acknowledge the request.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_open(fuse_req_t req, fuse_ino_t inode,
-                      struct fuse_file_info *fi)
+static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
 {
-    fi->direct_io = true;
-    fi->parallel_direct_writes = true;
-    fuse_reply_open(req, fi);
+    *out = (struct fuse_open_out) {
+        .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
+    };
+    return sizeof(*out);
 }
 
 /**
- * Handle client reads from the exported image.
+ * Handle client reads from the exported image.  Allocates *bufptr and reads
+ * data from the block device into that buffer.
+ * Returns the buffer (read) size on success, and -errno on error.
+ * After use, *bufptr must be freed via qemu_vfree().
  */
-static void fuse_read(fuse_req_t req, fuse_ino_t inode,
-                      size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_read(FuseExport *exp, void **bufptr,
+                         uint64_t offset, uint32_t size)
 {
-    FuseExport *exp = fuse_req_userdata(req);
     int64_t blk_len;
     void *buf;
     int ret;
 
     /* Limited by max_read, should not happen */
-    if (size > FUSE_MAX_BOUNCE_BYTES) {
-        fuse_reply_err(req, EINVAL);
-        return;
+    if (size > FUSE_MAX_READ_BYTES) {
+        return -EINVAL;
     }
 
     /**
@@ -653,18 +954,12 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
      */
     blk_len = blk_getlength(exp->common.blk);
     if (blk_len < 0) {
-        fuse_reply_err(req, -blk_len);
-        return;
+        return blk_len;
     }
 
     if (offset >= blk_len) {
-        /*
-         * Technically libfuse does not allow returning a zero error code for
-         * read requests, but in practice this is a 0-length read (and a future
-         * commit will change this code anyway)
-         */
-        fuse_reply_err(req, 0);
-        return;
+        *bufptr = NULL;
+        return 0;
     }
 
     if (offset + size > blk_len) {
@@ -673,108 +968,96 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
 
     buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
     if (!buf) {
-        fuse_reply_err(req, ENOMEM);
-        return;
+        return -ENOMEM;
     }
 
     ret = blk_pread(exp->common.blk, offset, size, buf, 0);
-    if (ret >= 0) {
-        fuse_reply_buf(req, buf, size);
-    } else {
-        fuse_reply_err(req, -ret);
+    if (ret < 0) {
+        qemu_vfree(buf);
+        return ret;
     }
 
-    qemu_vfree(buf);
+    *bufptr = buf;
+    return size;
 }
 
 /**
- * Handle client writes to the exported image.
+ * Handle client writes to the exported image.  @buf has the data to be written.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
-                       size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
+                          uint64_t offset, uint32_t size, const void *buf)
 {
-    FuseExport *exp = fuse_req_userdata(req);
-    QEMU_AUTO_VFREE void *copied = NULL;
     int64_t blk_len;
     int ret;
 
+    QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
     /* Limited by max_write, should not happen */
-    if (size > BDRV_REQUEST_MAX_BYTES) {
-        fuse_reply_err(req, EINVAL);
-        return;
+    if (size > FUSE_MAX_WRITE_BYTES) {
+        return -EINVAL;
     }
 
     if (!exp->writable) {
-        fuse_reply_err(req, EACCES);
-        return;
+        return -EACCES;
     }
 
-    /*
-     * Heed the note on read_from_fuse_export(): If we call aio_poll() (which
-     * any blk_*() I/O function may do), read_from_fuse_export() may be nested,
-     * overwriting the request buffer content.  Therefore, we must copy it here.
-     */
-    copied = blk_blockalign(exp->common.blk, size);
-    memcpy(copied, buf, size);
-
     /**
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
      */
     blk_len = blk_getlength(exp->common.blk);
     if (blk_len < 0) {
-        fuse_reply_err(req, -blk_len);
-        return;
+        return blk_len;
     }
 
     if (offset >= blk_len && !exp->growable) {
-        fuse_reply_write(req, 0);
-        return;
+        *out = (struct fuse_write_out) {
+            .size = 0,
+        };
+        return sizeof(*out);
     }
 
     if (offset + size < offset) {
-        fuse_reply_err(req, EINVAL);
-        return;
+        return -EINVAL;
     } else if (offset + size > blk_len) {
         if (exp->growable) {
             ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
-                fuse_reply_err(req, -ret);
-                return;
+                return ret;
             }
         } else {
             size = blk_len - offset;
         }
     }
 
-    ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
-    if (ret >= 0) {
-        fuse_reply_write(req, size);
-    } else {
-        fuse_reply_err(req, -ret);
+    ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+    if (ret < 0) {
+        return ret;
     }
+
+    *out = (struct fuse_write_out) {
+        .size = size,
+    };
+    return sizeof(*out);
 }
 
 /**
  * Let clients perform various fallocate() operations.
+ * Return 0 on success (no 'out' object), and -errno on error.
  */
-static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
-                           off_t offset, off_t length,
-                           struct fuse_file_info *fi)
+static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
+                              uint32_t mode)
 {
-    FuseExport *exp = fuse_req_userdata(req);
     int64_t blk_len;
     int ret;
 
     if (!exp->writable) {
-        fuse_reply_err(req, EACCES);
-        return;
+        return -EACCES;
     }
 
     blk_len = blk_getlength(exp->common.blk);
     if (blk_len < 0) {
-        fuse_reply_err(req, -blk_len);
-        return;
+        return blk_len;
     }
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -786,16 +1069,14 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
     if (!mode) {
         /* We can only fallocate at the EOF with a truncate */
         if (offset < blk_len) {
-            fuse_reply_err(req, EOPNOTSUPP);
-            return;
+            return -EOPNOTSUPP;
         }
 
         if (offset > blk_len) {
             /* No preallocation needed here */
             ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
-                fuse_reply_err(req, -ret);
-                return;
+                return ret;
             }
         }
 
@@ -805,8 +1086,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
     else if (mode & FALLOC_FL_PUNCH_HOLE) {
         if (!(mode & FALLOC_FL_KEEP_SIZE)) {
-            fuse_reply_err(req, EINVAL);
-            return;
+            return -EINVAL;
         }
 
         do {
@@ -834,8 +1114,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
             ret = fuse_do_truncate(exp, offset + length, false,
                                    PREALLOC_MODE_OFF);
             if (ret < 0) {
-                fuse_reply_err(req, -ret);
-                return;
+                return ret;
             }
         }
 
@@ -853,44 +1132,38 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
         ret = -EOPNOTSUPP;
     }
 
-    fuse_reply_err(req, ret < 0 ? -ret : 0);
+    return ret < 0 ? ret : 0;
 }
 
 /**
  * Let clients fsync the exported image.
+ * Return 0 on success (no 'out' object), and -errno on error.
  */
-static void fuse_fsync(fuse_req_t req, fuse_ino_t inode, int datasync,
-                       struct fuse_file_info *fi)
+static ssize_t fuse_fsync(FuseExport *exp)
 {
-    FuseExport *exp = fuse_req_userdata(req);
-    int ret;
-
-    ret = blk_flush(exp->common.blk);
-    fuse_reply_err(req, ret < 0 ? -ret : 0);
+    return blk_flush(exp->common.blk);
 }
 
 /**
  * Called before an FD to the exported image is closed.  (libfuse
  * notes this to be a way to return last-minute errors.)
+ * Return 0 on success (no 'out' object), and -errno on error.
  */
-static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
-                        struct fuse_file_info *fi)
+static ssize_t fuse_flush(FuseExport *exp)
 {
-    fuse_fsync(req, inode, 1, fi);
+    return blk_flush(exp->common.blk);
 }
 
 #ifdef CONFIG_FUSE_LSEEK
 /**
  * Let clients inquire allocation status.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
-                       int whence, struct fuse_file_info *fi)
+static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+                          uint64_t offset, uint32_t whence)
 {
-    FuseExport *exp = fuse_req_userdata(req);
-
     if (whence != SEEK_HOLE && whence != SEEK_DATA) {
-        fuse_reply_err(req, EINVAL);
-        return;
+        return -EINVAL;
     }
 
     while (true) {
@@ -900,8 +1173,7 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
         ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
                                       offset, INT64_MAX, &pnum, NULL, NULL);
         if (ret < 0) {
-            fuse_reply_err(req, -ret);
-            return;
+            return ret;
         }
 
         if (!pnum && (ret & BDRV_BLOCK_EOF)) {
@@ -918,34 +1190,38 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
 
             blk_len = blk_getlength(exp->common.blk);
             if (blk_len < 0) {
-                fuse_reply_err(req, -blk_len);
-                return;
+                return blk_len;
             }
 
             if (offset > blk_len || whence == SEEK_DATA) {
-                fuse_reply_err(req, ENXIO);
-            } else {
-                fuse_reply_lseek(req, offset);
+                return -ENXIO;
             }
-            return;
+
+            *out = (struct fuse_lseek_out) {
+                .offset = offset,
+            };
+            return sizeof(*out);
         }
 
         if (ret & BDRV_BLOCK_DATA) {
             if (whence == SEEK_DATA) {
-                fuse_reply_lseek(req, offset);
-                return;
+                *out = (struct fuse_lseek_out) {
+                    .offset = offset,
+                };
+                return sizeof(*out);
             }
         } else {
             if (whence == SEEK_HOLE) {
-                fuse_reply_lseek(req, offset);
-                return;
+                *out = (struct fuse_lseek_out) {
+                    .offset = offset,
+                };
+                return sizeof(*out);
             }
         }
 
         /* Safety check against infinite loops */
         if (!pnum) {
-            fuse_reply_err(req, ENXIO);
-            return;
+            return -ENXIO;
         }
 
         offset += pnum;
@@ -953,21 +1229,241 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
 }
 #endif
 
-static const struct fuse_lowlevel_ops fuse_ops = {
-    .init       = fuse_init,
-    .lookup     = fuse_lookup,
-    .getattr    = fuse_getattr,
-    .setattr    = fuse_setattr,
-    .open       = fuse_open,
-    .read       = fuse_read,
-    .write      = fuse_write,
-    .fallocate  = fuse_fallocate,
-    .flush      = fuse_flush,
-    .fsync      = fuse_fsync,
+/**
+ * Write a FUSE response to the given @fd.
+ *
+ * Effectively, writes out_hdr->common.len bytes of the buffer that is *out_hdr.
+ *
+ * @fd: FUSE file descriptor
+ * @out_hdr: Request response header and request-specific response data
+ */
+static int fuse_write_response(int fd, FuseRequestOutHeader *out_hdr)
+{
+    size_t to_write = out_hdr->common.len;
+    ssize_t ret;
+
+    /* Must at least write fuse_out_header */
+    assert(to_write >= sizeof(out_hdr->common));
+
+    ret = RETRY_ON_EINTR(write(fd, out_hdr, to_write));
+    if (ret < 0) {
+        ret = -errno;
+        error_report("Failed to write to FUSE device: %s", strerror(-ret));
+        return ret;
+    }
+
+    /* Short writes are unexpected, treat them as errors */
+    if (ret != to_write) {
+        error_report("Short write to FUSE device, wrote %zi of %zu bytes",
+                     ret, to_write);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+/**
+ * Write a FUSE error response to @fd.
+ *
+ * @fd: FUSE file descriptor
+ * @in_hdr: Incoming request header to which to respond
+ * @err: Error code (-errno, must be negative!)
+ */
+static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err)
+{
+    FuseRequestOutHeader out_hdr = {
+        .common = {
+            .len = sizeof(out_hdr.common),
+            /* FUSE expects negative error values */
+            .error = err,
+            .unique = in_hdr->unique,
+        },
+    };
+
+    return fuse_write_response(fd, &out_hdr);
+}
+
+/**
+ * Write a FUSE response to the given @fd, using separate buffers for the
+ * response header and data.
+ *
+ * In contrast to fuse_write_response(), this function cannot return a full
+ * FuseRequestOutHeader (i.e. including request-specific response structs),
+ * but only FuseRequestOutHeader.common.  The remaining data must be in
+ * *buf.
+ *
+ * (Total length must be set in out_hdr->len.)
+ *
+ * @fd: FUSE file descriptor
+ * @out_hdr: Request response header
+ * @buf: Pointer to response data
+ */
+static int fuse_write_buf_response(int fd,
+                                   const struct fuse_out_header *out_hdr,
+                                   const void *buf)
+{
+    size_t to_write = out_hdr->len;
+    struct iovec iov[2] = {
+        { (void *)out_hdr, sizeof(*out_hdr) },
+        { (void *)buf, to_write - sizeof(*out_hdr) },
+    };
+    ssize_t ret;
+
+    /* *buf length must not be negative */
+    assert(to_write >= sizeof(*out_hdr));
+
+    ret = RETRY_ON_EINTR(writev(fd, iov, ARRAY_SIZE(iov)));
+    if (ret < 0) {
+        ret = -errno;
+        error_report("Failed to write to FUSE device: %s", strerror(-ret));
+        return ret;
+    }
+
+    /* Short writes are unexpected, treat them as errors */
+    if (ret != to_write) {
+        error_report("Short write to FUSE device, wrote %zi of %zu bytes",
+                     ret, to_write);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+/**
+ * Process a FUSE request, incl. writing the response.
+ */
+static void fuse_process_request(FuseExport *exp,
+                                 const FuseRequestInHeader *in_hdr,
+                                 const void *data_buffer)
+{
+    FuseRequestOutHeader out_hdr;
+    /* For read requests: Data to be returned */
+    void *out_data_buffer = NULL;
+    ssize_t ret;
+
+    switch (in_hdr->common.opcode) {
+    case FUSE_INIT:
+        ret = fuse_init(exp, &out_hdr.init, &in_hdr->init);
+        break;
+
+    case FUSE_DESTROY:
+        ret = 0;
+        break;
+
+    case FUSE_STATFS:
+        ret = fuse_statfs(exp, &out_hdr.statfs);
+        break;
+
+    case FUSE_OPEN:
+        ret = fuse_open(exp, &out_hdr.open);
+        break;
+
+    case FUSE_RELEASE:
+        ret = 0;
+        break;
+
+    case FUSE_LOOKUP:
+        ret = -ENOENT; /* There is no node but the root node */
+        break;
+
+    case FUSE_FORGET:
+    case FUSE_BATCH_FORGET:
+        /* These have no response, and there is nothing we need to do */
+        return;
+
+    case FUSE_GETATTR:
+        ret = fuse_getattr(exp, &out_hdr.attr);
+        break;
+
+    case FUSE_SETATTR: {
+        const struct fuse_setattr_in *in = &in_hdr->setattr;
+        ret = fuse_setattr(exp, &out_hdr.attr,
+                           in->valid, in->size, in->mode, in->uid, in->gid);
+        break;
+    }
+
+    case FUSE_READ: {
+        const struct fuse_read_in *in = &in_hdr->read;
+        ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+        break;
+    }
+
+    case FUSE_WRITE: {
+        const struct fuse_write_in *in = &in_hdr->write;
+        uint32_t req_len = in_hdr->common.len;
+
+        if (unlikely(req_len < sizeof(in_hdr->common) + sizeof(*in) +
+                               in->size)) {
+            warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
+                        req_len - sizeof(in_hdr->common) - sizeof(*in),
+                        in->size);
+            ret = -EINVAL;
+            break;
+        }
+
+        /*
+         * read_from_fuse_fd() has checked that in_hdr->len matches the number
+         * of bytes read, which cannot exceed the max_write value we set
+         * (FUSE_MAX_WRITE_BYTES).  So we know that FUSE_MAX_WRITE_BYTES >=
+         * in_hdr->len >= in->size + X, so this assertion must hold.
+         */
+        assert(in->size <= FUSE_MAX_WRITE_BYTES);
+
+        ret = fuse_write(exp, &out_hdr.write,
+                         in->offset, in->size, data_buffer);
+        break;
+    }
+
+    case FUSE_FALLOCATE: {
+        const struct fuse_fallocate_in *in = &in_hdr->fallocate;
+        ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+        break;
+    }
+
+    case FUSE_FSYNC:
+        ret = fuse_fsync(exp);
+        break;
+
+    case FUSE_FLUSH:
+        ret = fuse_flush(exp);
+        break;
+
 #ifdef CONFIG_FUSE_LSEEK
-    .lseek      = fuse_lseek,
+    case FUSE_LSEEK: {
+        const struct fuse_lseek_in *in = &in_hdr->lseek;
+        ret = fuse_lseek(exp, &out_hdr.lseek, in->offset, in->whence);
+        break;
+    }
 #endif
-};
+
+    default:
+        ret = -ENOSYS;
+    }
+
+    if (ret >= 0) {
+        out_hdr.common = (struct fuse_out_header) {
+            .len = sizeof(out_hdr.common) + ret,
+            .unique = in_hdr->common.unique,
+        };
+    } else {
+        /* fuse_read() must not return a buffer in case of error */
+        assert(out_data_buffer == NULL);
+
+        out_hdr.common = (struct fuse_out_header) {
+            .len = sizeof(out_hdr.common),
+            /* FUSE expects negative errno values */
+            .error = ret,
+            .unique = in_hdr->common.unique,
+        };
+    }
+
+    if (out_data_buffer) {
+        fuse_write_buf_response(exp->fuse_fd, &out_hdr.common, out_data_buffer);
+        qemu_vfree(out_data_buffer);
+    } else {
+        fuse_write_response(exp->fuse_fd, &out_hdr);
+    }
+}
 
 const BlockExportDriver blk_exp_fuse = {
     .type               = BLOCK_EXPORT_TYPE_FUSE,
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 17/24] fuse: Reduce max read size
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (15 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 16/24] fuse: Manually process requests (without libfuse) Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 18/24] fuse: Process requests in coroutines Hanna Czenczek
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

We are going to introduce parallel processing via coroutines.  With
that, a maximum read size of 64 MB may be problematic, because it allows
users of the export to force us to allocate quite large amounts of
memory with just a few requests.

At least tone it down to 1 MB, which is still probably far more than
enough.  (Larger requests are split automatically by the FUSE kernel
driver anyway.)

(Yes, we inadvertently already had parallel request processing due to
nested polling before.  Better to fix this late than never.)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
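Illustration only, not part of the patch: a rough sketch of the
arithmetic behind the commit message.  It assumes that
BDRV_REQUEST_MAX_BYTES is larger than either limit (so MIN() picks the
literal) and that every in-flight read allocates one bounce buffer of
the maximum size.  The helper name and the macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Old and new read limits from this patch, as plain byte counts */
#define OLD_MAX_READ_BYTES (UINT64_C(64) * 1024 * 1024)
#define NEW_MAX_READ_BYTES (UINT64_C(1) * 1024 * 1024)

/*
 * Hypothetical helper: upper bound on read bounce buffer memory when
 * @in_flight read requests are processed at the same time, each with
 * a buffer of at most @max_read_bytes.
 */
static uint64_t worst_case_read_buffer_bytes(uint32_t in_flight,
                                             uint64_t max_read_bytes)
{
    /* The product must fit into 64 bits */
    assert(in_flight == 0 || max_read_bytes <= UINT64_MAX / in_flight);

    return (uint64_t)in_flight * max_read_bytes;
}
```

With, say, 16 parallel reads (an arbitrary number picked for the
example), the bound drops from 1 GiB with the old limit to 16 MiB with
the new one.
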
 block/export/fuse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index c481fb72a2..8c5a1d397d 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -45,7 +45,7 @@
 #endif
 
 /* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
 /* Small enough to fit in the request buffer */
 #define FUSE_MAX_WRITE_BYTES (64 * 1024)
 
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 18/24] fuse: Process requests in coroutines
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (16 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 17/24] fuse: Reduce max read size Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 19/24] block/export: Add multi-threading interface Hanna Czenczek
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
and have read_from_fuse_fd() launch it inside of a newly created
coroutine instead of running it synchronously.  This way, we can process
requests in parallel.

These are the benchmark results, compared to (a) the original results
with libfuse, and (b) the results after switching away from libfuse
(i.e. before this patch):

file:                (vs. libfuse / vs. no libfuse)
  read:
    seq aio:    97.8k ±1.5k (-2%  / -8%)
    rand aio:   95.8k ±3.4k (+90% / +98%)
    seq sync:   34.5k ±1.0k (-4%  / -3%)
    rand sync:   9.9k ±0.1k (-1%  / -1%)
  write:
    seq aio:    68.7k ±1.3k (-5%  / -10%)
    rand aio:   68.9k ±1.1k (-2%  / -10%)
    seq sync:   30.6k ±0.9k (±0%  / -3%)
    rand sync:  30.6k ±0.6k (+1%  / -1%)
null:
  read:
    seq aio:   174.5k ±6.8k (+11% / +8%)
    rand aio:  170.9k ±5.7k (+8%  / +3%)
    seq sync:   82.0k ±3.3k (+2%  / +2%)
    rand sync:  78.0k ±4.0k (+1%  / -1%)
  write:
    seq aio:   196.0k ±2.8k (+27% / +6%)
    rand aio:  191.2k ±7.9k (+24% / +2%)
    seq sync:   83.3k ±4.4k (+9%  / +1%)
    rand sync:  79.5k ±4.4k (+9%  / +1%)

So there is not much difference, especially when compared to how it was
with libfuse, except for the randread AIO case with an actual file,
which improves greatly (+90% vs. libfuse, +98% vs. no libfuse).

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
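Illustration only, not part of the patch: a toy model of how the
in-flight counter is handed over in this patch.  The FD handler
increments the counter, and the function it launches decrements it on
every exit path.  This sketch uses plain function calls instead of QEMU
coroutines, and all Toy*/toy_* names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the export state; not QEMU's FuseExport */
typedef struct ToyExport {
    int in_flight;  /* requests currently being handled */
    bool halted;    /* when set, no requests are processed */
    int processed;  /* number of requests actually processed */
} ToyExport;

/*
 * Stand-in for the coroutine body.  Assumes the caller has already
 * incremented in_flight, and drops that reference on every path.
 */
static void toy_handle_request(void *opaque)
{
    ToyExport *exp = opaque;

    assert(exp->in_flight > 0);

    if (exp->halted) {
        goto no_request;
    }

    exp->processed++;

no_request:
    exp->in_flight--;
}

/* Stand-in for the FD handler: take the reference, then hand off */
static void toy_fd_readable(void *opaque)
{
    ToyExport *exp = opaque;

    /* Decremented by toy_handle_request() */
    exp->in_flight++;
    toy_handle_request(exp);
}
```

Calling toy_fd_readable() on a zero-initialized ToyExport leaves
in_flight at 0 afterwards, whether or not the export is halted.
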
 block/export/fuse.c | 174 ++++++++++++++++++++++++++------------------
 1 file changed, 102 insertions(+), 72 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 8c5a1d397d..b60affe86b 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -27,6 +27,7 @@
 #include "block/qapi.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block.h"
+#include "qemu/coroutine.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "system/block-backend.h"
@@ -182,9 +183,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp);
 static bool is_regular_file(const char *path, Error **errp);
 
 static void read_from_fuse_fd(void *opaque);
-static void fuse_process_request(FuseExport *exp,
-                                 const FuseRequestInHeader *in_hdr,
-                                 const void *data_buffer);
+static void coroutine_fn
+fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+                        const void *data_buffer);
 static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err);
 
 static void fuse_inc_in_flight(FuseExport *exp)
@@ -519,9 +520,14 @@ static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
 }
 
 /**
- * Try to read and process a single request from the FUSE FD.
+ * Try to read a single request from the FUSE FD.
+ * Takes a FuseExport pointer in `opaque`.
+ *
+ * Assumes the export's in-flight counter has already been incremented.
+ *
+ * If a request is available, process it.
  */
-static void read_from_fuse_fd(void *opaque)
+static void coroutine_fn co_read_from_fuse_fd(void *opaque)
 {
     FuseExport *exp = opaque;
     int fuse_fd = exp->fuse_fd;
@@ -532,8 +538,6 @@ static void read_from_fuse_fd(void *opaque)
     struct iovec iov[2];
     ssize_t op_hdr_len;
 
-    fuse_inc_in_flight(exp);
-
     if (unlikely(qatomic_read(&exp->halted))) {
         goto no_request;
     }
@@ -602,13 +606,29 @@ static void read_from_fuse_fd(void *opaque)
         release_write_data_buffer(exp, &data_buffer);
     }
 
-    fuse_process_request(exp, in_hdr, data_buffer);
+    fuse_co_process_request(exp, in_hdr, data_buffer);
 
 no_request:
     release_write_data_buffer(exp, &data_buffer);
     fuse_dec_in_flight(exp);
 }
 
+/**
+ * Try to read and process a single request from the FUSE FD.
+ * (To be used as a handler for when the FUSE FD becomes readable.)
+ * Takes a FuseExport pointer in `opaque`.
+ */
+static void read_from_fuse_fd(void *opaque)
+{
+    FuseExport *exp = opaque;
+    Coroutine *co;
+
+    co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
+    /* Decremented by co_read_from_fuse_fd() */
+    fuse_inc_in_flight(exp);
+    qemu_coroutine_enter(co);
+}
+
 static void fuse_export_shutdown(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
@@ -683,8 +703,9 @@ static bool is_regular_file(const char *path, Error **errp)
  * Process FUSE INIT.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
-                         const struct fuse_init_in_compat *in)
+static ssize_t coroutine_fn
+fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
+             const struct fuse_init_in_compat *in)
 {
     const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
 
@@ -739,7 +760,8 @@ static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
 /**
  * Return some filesystem information, just to not break e.g. `df`.
  */
-static ssize_t fuse_statfs(FuseExport *exp, struct fuse_statfs_out *out)
+static ssize_t coroutine_fn
+fuse_co_statfs(FuseExport *exp, struct fuse_statfs_out *out)
 {
     BlockDriverState *root_bs;
     uint32_t opt_transfer = 512;
@@ -767,17 +789,18 @@ static ssize_t fuse_statfs(FuseExport *exp, struct fuse_statfs_out *out)
  * Let clients get file attributes (i.e., stat() the file).
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
+static ssize_t coroutine_fn
+fuse_co_getattr(FuseExport *exp, struct fuse_attr_out *out)
 {
     int64_t length, allocated_blocks;
     time_t now = time(NULL);
 
-    length = blk_getlength(exp->common.blk);
+    length = blk_co_getlength(exp->common.blk);
     if (length < 0) {
         return length;
     }
 
-    allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
+    allocated_blocks = bdrv_co_get_allocated_file_size(blk_bs(exp->common.blk));
     if (allocated_blocks <= 0) {
         allocated_blocks = DIV_ROUND_UP(length, 512);
     } else {
@@ -804,8 +827,9 @@ static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
     return sizeof(*out);
 }
 
-static int fuse_do_truncate(const FuseExport *exp, int64_t size,
-                            bool req_zero_write, PreallocMode prealloc)
+static int coroutine_fn
+fuse_co_do_truncate(const FuseExport *exp, int64_t size, bool req_zero_write,
+                    PreallocMode prealloc)
 {
     uint64_t blk_perm, blk_shared_perm;
     BdrvRequestFlags truncate_flags = 0;
@@ -834,8 +858,8 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
         }
     }
 
-    ret = blk_truncate(exp->common.blk, size, true, prealloc,
-                       truncate_flags, NULL);
+    ret = blk_co_truncate(exp->common.blk, size, true, prealloc,
+                          truncate_flags, NULL);
 
     if (add_resize_perm) {
         /* Must succeed, because we are only giving up the RESIZE permission */
@@ -856,9 +880,9 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
  * they cannot be given non-owner access.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
-                            uint32_t to_set, uint64_t size, uint32_t mode,
-                            uint32_t uid, uint32_t gid)
+static ssize_t coroutine_fn
+fuse_co_setattr(FuseExport *exp, struct fuse_attr_out *out, uint32_t to_set,
+                uint64_t size, uint32_t mode, uint32_t uid, uint32_t gid)
 {
     int supported_attrs;
     int ret;
@@ -895,7 +919,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
             return -EACCES;
         }
 
-        ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
+        ret = fuse_co_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
         if (ret < 0) {
             return ret;
         }
@@ -914,7 +938,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
         exp->st_gid = gid;
     }
 
-    return fuse_getattr(exp, out);
+    return fuse_co_getattr(exp, out);
 }
 
 /**
@@ -922,7 +946,8 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
  * just acknowledge the request.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
+static ssize_t coroutine_fn
+fuse_co_open(FuseExport *exp, struct fuse_open_out *out)
 {
     *out = (struct fuse_open_out) {
         .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
@@ -936,8 +961,8 @@ static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
  * Returns the buffer (read) size on success, and -errno on error.
  * After use, *bufptr must be freed via qemu_vfree().
  */
-static ssize_t fuse_read(FuseExport *exp, void **bufptr,
-                         uint64_t offset, uint32_t size)
+static ssize_t coroutine_fn
+fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
 {
     int64_t blk_len;
     void *buf;
@@ -952,7 +977,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
      * Clients will expect short reads at EOF, so we have to limit
      * offset+size to the image length.
      */
-    blk_len = blk_getlength(exp->common.blk);
+    blk_len = blk_co_getlength(exp->common.blk);
     if (blk_len < 0) {
         return blk_len;
     }
@@ -971,7 +996,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
         return -ENOMEM;
     }
 
-    ret = blk_pread(exp->common.blk, offset, size, buf, 0);
+    ret = blk_co_pread(exp->common.blk, offset, size, buf, 0);
     if (ret < 0) {
         qemu_vfree(buf);
         return ret;
@@ -985,8 +1010,9 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
  * Handle client writes to the exported image.  @buf has the data to be written.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
-                          uint64_t offset, uint32_t size, const void *buf)
+static ssize_t coroutine_fn
+fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
+              uint64_t offset, uint32_t size, const void *buf)
 {
     int64_t blk_len;
     int ret;
@@ -1005,7 +1031,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
      */
-    blk_len = blk_getlength(exp->common.blk);
+    blk_len = blk_co_getlength(exp->common.blk);
     if (blk_len < 0) {
         return blk_len;
     }
@@ -1021,7 +1047,8 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
         return -EINVAL;
     } else if (offset + size > blk_len) {
         if (exp->growable) {
-            ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
+            ret = fuse_co_do_truncate(exp, offset + size, true,
+                                      PREALLOC_MODE_OFF);
             if (ret < 0) {
                 return ret;
             }
@@ -1030,7 +1057,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
         }
     }
 
-    ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+    ret = blk_co_pwrite(exp->common.blk, offset, size, buf, 0);
     if (ret < 0) {
         return ret;
     }
@@ -1045,8 +1072,9 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
  * Let clients perform various fallocate() operations.
  * Return 0 on success (no 'out' object), and -errno on error.
  */
-static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
-                              uint32_t mode)
+static ssize_t coroutine_fn
+fuse_co_fallocate(FuseExport *exp,
+                  uint64_t offset, uint64_t length, uint32_t mode)
 {
     int64_t blk_len;
     int ret;
@@ -1055,7 +1083,7 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
         return -EACCES;
     }
 
-    blk_len = blk_getlength(exp->common.blk);
+    blk_len = blk_co_getlength(exp->common.blk);
     if (blk_len < 0) {
         return blk_len;
     }
@@ -1074,14 +1102,14 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
 
         if (offset > blk_len) {
             /* No preallocation needed here */
-            ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
+            ret = fuse_co_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
                 return ret;
             }
         }
 
-        ret = fuse_do_truncate(exp, offset + length, true,
-                               PREALLOC_MODE_FALLOC);
+        ret = fuse_co_do_truncate(exp, offset + length, true,
+                                  PREALLOC_MODE_FALLOC);
     }
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
     else if (mode & FALLOC_FL_PUNCH_HOLE) {
@@ -1092,8 +1120,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
         do {
             int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
 
-            ret = blk_pwrite_zeroes(exp->common.blk, offset, size,
-                                    BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
+            ret = blk_co_pwrite_zeroes(exp->common.blk, offset, size,
+                                       BDRV_REQ_MAY_UNMAP |
+                                       BDRV_REQ_NO_FALLBACK);
             if (ret == -ENOTSUP) {
                 /*
                  * fallocate() specifies to return EOPNOTSUPP for unsupported
@@ -1111,8 +1140,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
     else if (mode & FALLOC_FL_ZERO_RANGE) {
         if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + length > blk_len) {
             /* No need for zeroes, we are going to write them ourselves */
-            ret = fuse_do_truncate(exp, offset + length, false,
-                                   PREALLOC_MODE_OFF);
+            ret = fuse_co_do_truncate(exp, offset + length, false,
+                                      PREALLOC_MODE_OFF);
             if (ret < 0) {
                 return ret;
             }
@@ -1121,8 +1150,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
         do {
             int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
 
-            ret = blk_pwrite_zeroes(exp->common.blk,
-                                    offset, size, 0);
+            ret = blk_co_pwrite_zeroes(exp->common.blk,
+                                       offset, size, 0);
             offset += size;
             length -= size;
         } while (ret == 0 && length > 0);
@@ -1139,9 +1168,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
  * Let clients fsync the exported image.
  * Return 0 on success (no 'out' object), and -errno on error.
  */
-static ssize_t fuse_fsync(FuseExport *exp)
+static ssize_t coroutine_fn fuse_co_fsync(FuseExport *exp)
 {
-    return blk_flush(exp->common.blk);
+    return blk_co_flush(exp->common.blk);
 }
 
 /**
@@ -1149,9 +1178,9 @@ static ssize_t fuse_fsync(FuseExport *exp)
  * notes this to be a way to return last-minute errors.)
  * Return 0 on success (no 'out' object), and -errno on error.
  */
-static ssize_t fuse_flush(FuseExport *exp)
+static ssize_t coroutine_fn fuse_co_flush(FuseExport *exp)
 {
-    return blk_flush(exp->common.blk);
+    return blk_co_flush(exp->common.blk);
 }
 
 #ifdef CONFIG_FUSE_LSEEK
@@ -1159,8 +1188,9 @@ static ssize_t fuse_flush(FuseExport *exp)
  * Let clients inquire allocation status.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
-                          uint64_t offset, uint32_t whence)
+static ssize_t coroutine_fn
+fuse_co_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+              uint64_t offset, uint32_t whence)
 {
     if (whence != SEEK_HOLE && whence != SEEK_DATA) {
         return -EINVAL;
@@ -1170,8 +1200,8 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
         int64_t pnum;
         int ret;
 
-        ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
-                                      offset, INT64_MAX, &pnum, NULL, NULL);
+        ret = bdrv_co_block_status_above(blk_bs(exp->common.blk), NULL,
+                                         offset, INT64_MAX, &pnum, NULL, NULL);
         if (ret < 0) {
             return ret;
         }
@@ -1188,7 +1218,7 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
              * and @blk_len (the client-visible EOF).
              */
 
-            blk_len = blk_getlength(exp->common.blk);
+            blk_len = blk_co_getlength(exp->common.blk);
             if (blk_len < 0) {
                 return blk_len;
             }
@@ -1332,9 +1362,9 @@ static int fuse_write_buf_response(int fd,
 /**
  * Process a FUSE request, incl. writing the response.
  */
-static void fuse_process_request(FuseExport *exp,
-                                 const FuseRequestInHeader *in_hdr,
-                                 const void *data_buffer)
+static void coroutine_fn
+fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+                        const void *data_buffer)
 {
     FuseRequestOutHeader out_hdr;
     /* For read requests: Data to be returned */
@@ -1343,7 +1373,7 @@ static void fuse_process_request(FuseExport *exp,
 
     switch (in_hdr->common.opcode) {
     case FUSE_INIT:
-        ret = fuse_init(exp, &out_hdr.init, &in_hdr->init);
+        ret = fuse_co_init(exp, &out_hdr.init, &in_hdr->init);
         break;
 
     case FUSE_DESTROY:
@@ -1351,11 +1381,11 @@ static void fuse_process_request(FuseExport *exp,
         break;
 
     case FUSE_STATFS:
-        ret = fuse_statfs(exp, &out_hdr.statfs);
+        ret = fuse_co_statfs(exp, &out_hdr.statfs);
         break;
 
     case FUSE_OPEN:
-        ret = fuse_open(exp, &out_hdr.open);
+        ret = fuse_co_open(exp, &out_hdr.open);
         break;
 
     case FUSE_RELEASE:
@@ -1372,19 +1402,19 @@ static void fuse_process_request(FuseExport *exp,
         return;
 
     case FUSE_GETATTR:
-        ret = fuse_getattr(exp, &out_hdr.attr);
+        ret = fuse_co_getattr(exp, &out_hdr.attr);
         break;
 
     case FUSE_SETATTR: {
         const struct fuse_setattr_in *in = &in_hdr->setattr;
-        ret = fuse_setattr(exp, &out_hdr.attr,
-                           in->valid, in->size, in->mode, in->uid, in->gid);
+        ret = fuse_co_setattr(exp, &out_hdr.attr,
+                              in->valid, in->size, in->mode, in->uid, in->gid);
         break;
     }
 
     case FUSE_READ: {
         const struct fuse_read_in *in = &in_hdr->read;
-        ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+        ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
         break;
     }
 
@@ -1402,36 +1432,36 @@ static void fuse_process_request(FuseExport *exp,
         }
 
         /*
-         * read_from_fuse_fd() has checked that in_hdr->len matches the number
-         * of bytes read, which cannot exceed the max_write value we set
+         * co_read_from_fuse_fd() has checked that in_hdr->len matches the
+         * number of bytes read, which cannot exceed the max_write value we set
          * (FUSE_MAX_WRITE_BYTES).  So we know that FUSE_MAX_WRITE_BYTES >=
          * in_hdr->len >= in->size + X, so this assertion must hold.
          */
         assert(in->size <= FUSE_MAX_WRITE_BYTES);
 
-        ret = fuse_write(exp, &out_hdr.write,
-                         in->offset, in->size, data_buffer);
+        ret = fuse_co_write(exp, &out_hdr.write,
+                            in->offset, in->size, data_buffer);
         break;
     }
 
     case FUSE_FALLOCATE: {
         const struct fuse_fallocate_in *in = &in_hdr->fallocate;
-        ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+        ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
         break;
     }
 
     case FUSE_FSYNC:
-        ret = fuse_fsync(exp);
+        ret = fuse_co_fsync(exp);
         break;
 
     case FUSE_FLUSH:
-        ret = fuse_flush(exp);
+        ret = fuse_co_flush(exp);
         break;
 
 #ifdef CONFIG_FUSE_LSEEK
     case FUSE_LSEEK: {
         const struct fuse_lseek_in *in = &in_hdr->lseek;
-        ret = fuse_lseek(exp, &out_hdr.lseek, in->offset, in->whence);
+        ret = fuse_co_lseek(exp, &out_hdr.lseek, in->offset, in->whence);
         break;
     }
 #endif
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 19/24] block/export: Add multi-threading interface
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (17 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 18/24] fuse: Process requests in coroutines Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 20/24] iotests/307: Test multi-thread export interface Hanna Czenczek
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Make BlockExportType.iothread an alternate between a single-thread
variant 'str' and a multi-threading variant '[str]'.

In contrast to the single-thread setting, the multi-threading setting
will not change the BDS's context (and so is incompatible with the
fixed-iothread setting), but instead just pass a list to the export
driver, with which it can do whatever it wants.

Currently no export driver supports multi-threading, so they all return
an error when receiving such a list.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
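Illustration only, not part of the patch: a self-contained sketch of the
rules described above for the 'str' / '[str]' alternate.  The types, the
function and the specific error codes are hypothetical and chosen for
the example; the generated QAPI types and the actual error reporting
look different.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum ToyIothreadsKind {
    TOY_IOTHREADS_SINGLE,   /* 'str' variant */
    TOY_IOTHREADS_MULTI,    /* '[str]' variant */
} ToyIothreadsKind;

/* Hypothetical model of the alternate as a tagged union */
typedef struct ToyIothreads {
    ToyIothreadsKind kind;
    union {
        const char *single;
        struct {
            const char *const *names;
            size_t count;
        } multi;
    } u;
} ToyIothreads;

/*
 * Return 0 if the combination is acceptable, -errno otherwise.
 * A one-element list is still treated as multi-threading.
 */
static int toy_check_iothreads(const ToyIothreads *iot, bool fixed_iothread,
                               bool driver_supports_multi)
{
    assert(iot != NULL);

    if (iot->kind == TOY_IOTHREADS_SINGLE) {
        return 0;
    }

    if (iot->u.multi.count == 0) {
        return -EINVAL;     /* the list must not be empty */
    }
    if (fixed_iothread) {
        return -EINVAL;     /* incompatible with fixed-iothread */
    }
    if (!driver_supports_multi) {
        return -ENOTSUP;    /* driver rejects the list */
    }

    return 0;
}
```

In this model, every export driver as of this patch would correspond to
driver_supports_multi == false, so any list is rejected.
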
 qapi/block-export.json               | 36 ++++++++++++++++++---
 include/block/export.h               | 12 +++++--
 block/export/export.c                | 48 +++++++++++++++++++++++++---
 block/export/fuse.c                  |  7 ++++
 block/export/vduse-blk.c             |  7 ++++
 block/export/vhost-user-blk-server.c |  8 +++++
 nbd/server.c                         |  6 ++++
 7 files changed, 113 insertions(+), 11 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 076954ef1a..160cd2e3ca 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -363,14 +363,16 @@
 #     to the export before completion is signalled.  (since: 5.2;
 #     default: false)
 #
-# @iothread: The name of the iothread object where the export will
-#     run.  The default is to use the thread currently associated with
-#     the block node.  (since: 5.2)
+# @iothread: The name(s) of one or more iothread object(s) where the
+#     export will run.  The default is to use the thread currently
+#     associated with the block node.  (since: 5.2; multi-threading
+#     since 10.1)
 #
 # @fixed-iothread: True prevents the block node from being moved to
 #     another thread while the export is active.  If true and
 #     @iothread is given, export creation fails if the block node
-#     cannot be moved to the iothread.  The default is false.
+#     cannot be moved to the iothread.  Must not be true when giving
+#     multiple iothreads for @iothread.  The default is false.
 #     (since: 5.2)
 #
 # @allow-inactive: If true, the export allows the exported node to be
@@ -387,7 +389,7 @@
   'base': { 'type': 'BlockExportType',
             'id': 'str',
             '*fixed-iothread': 'bool',
-            '*iothread': 'str',
+            '*iothread': 'BlockExportIothreads',
             'node-name': 'str',
             '*writable': 'bool',
             '*writethrough': 'bool',
@@ -403,6 +405,30 @@
                      'if': 'CONFIG_VDUSE_BLK_EXPORT' }
    } }
 
+##
+# @BlockExportIothreads:
+#
+# Specify a single or multiple I/O threads in which to run a block
+# export's I/O.
+#
+# @single: Run the export's I/O in the given single I/O thread.
+#
+# @multi: Use multi-threading across the given set of I/O threads,
+#     which must not be empty.  Note: Passing a single I/O thread via
+#     this variant is still treated as multi-threading, which is
+#     different from using the @single variant.  In particular, even
+#     if there only is a single I/O thread in the set, export types
+#     that do not support multi-threading will generally reject this
+#     variant, and BlockExportOptions.fixed-iothread is always
+#     incompatible with it.
+#
+# Since: 10.1
+##
+{ 'alternate': 'BlockExportIothreads',
+  'data': {
+      'single': 'str',
+      'multi': ['str'] } }
+
 ##
 # @block-export-add:
 #
diff --git a/include/block/export.h b/include/block/export.h
index 4bd9531d4d..ca45da928c 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -32,8 +32,16 @@ typedef struct BlockExportDriver {
     /* True if the export type supports running on an inactive node */
     bool supports_inactive;
 
-    /* Creates and starts a new block export */
-    int (*create)(BlockExport *, BlockExportOptions *, Error **);
+    /*
+     * Creates and starts a new block export.
+     *
+     * If the user passed a set of I/O threads for multi-threading, @multithread
+     * is a list of the @multithread_count corresponding contexts (freed by the
+     * caller).  Note that @exp->ctx has no relation to that list.
+     */
+    int (*create)(BlockExport *exp, BlockExportOptions *opts,
+                  AioContext *const *multithread, size_t multithread_count,
+                  Error **errp);
 
     /*
      * Frees a removed block export. This function is only called after all
diff --git a/block/export/export.c b/block/export/export.c
index f3bbf11070..b733f269f3 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -76,16 +76,26 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 {
     bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
     bool allow_inactive = export->has_allow_inactive && export->allow_inactive;
+    bool multithread = export->iothread &&
+        export->iothread->type == QTYPE_QLIST;
     const BlockExportDriver *drv;
     BlockExport *exp = NULL;
     BlockDriverState *bs;
     BlockBackend *blk = NULL;
     AioContext *ctx;
+    AioContext **multithread_ctxs = NULL;
+    size_t multithread_count = 0;
     uint64_t perm;
     int ret;
 
     GLOBAL_STATE_CODE();
 
+    if (fixed_iothread && multithread) {
+        error_setg(errp,
+                   "Cannot use fixed-iothread for a multi-threaded export");
+        return NULL;
+    }
+
     if (!id_wellformed(export->id)) {
         error_setg(errp, "Invalid block export id");
         return NULL;
@@ -116,14 +126,16 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 
     ctx = bdrv_get_aio_context(bs);
 
-    if (export->iothread) {
+    /* Move the BDS to the target I/O thread, if it is a single one */
+    if (export->iothread && !multithread) {
+        const char *iothread_id = export->iothread->u.single;
         IOThread *iothread;
         AioContext *new_ctx;
         Error **set_context_errp;
 
-        iothread = iothread_by_id(export->iothread);
+        iothread = iothread_by_id(iothread_id);
         if (!iothread) {
-            error_setg(errp, "iothread \"%s\" not found", export->iothread);
+            error_setg(errp, "iothread \"%s\" not found", iothread_id);
             goto fail;
         }
 
@@ -137,6 +149,32 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
         } else if (fixed_iothread) {
             goto fail;
         }
+    } else if (multithread) {
+        strList *iothread_list = export->iothread->u.multi;
+        size_t i;
+
+        multithread_count = 0;
+        for (strList *e = iothread_list; e; e = e->next) {
+            multithread_count++;
+        }
+
+        if (multithread_count == 0) {
+            error_setg(errp, "The set of I/O threads must not be empty");
+            return NULL;
+        }
+
+        multithread_ctxs = g_new(AioContext *, multithread_count);
+        i = 0;
+        for (strList *e = iothread_list; e; e = e->next) {
+            IOThread *iothread = iothread_by_id(e->value);
+
+            if (!iothread) {
+                error_setg(errp, "iothread \"%s\" not found", e->value);
+                goto fail;
+            }
+            multithread_ctxs[i++] = iothread_get_aio_context(iothread);
+        }
+        assert(i == multithread_count);
     }
 
     bdrv_graph_rdlock_main_loop();
@@ -195,7 +233,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
         .blk        = blk,
     };
 
-    ret = drv->create(exp, export, errp);
+    ret = drv->create(exp, export, multithread_ctxs, multithread_count, errp);
     if (ret < 0) {
         goto fail;
     }
@@ -203,6 +241,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
     assert(exp->blk != NULL);
 
     QLIST_INSERT_HEAD(&block_exports, exp, next);
+    g_free(multithread_ctxs);
     return exp;
 
 fail:
@@ -214,6 +253,7 @@ fail:
         g_free(exp->id);
         g_free(exp);
     }
+    g_free(multithread_ctxs);
     return NULL;
 }
 
diff --git a/block/export/fuse.c b/block/export/fuse.c
index b60affe86b..aa4f1c0307 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -260,6 +260,8 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
 
 static int fuse_export_create(BlockExport *blk_exp,
                               BlockExportOptions *blk_exp_args,
+                              AioContext *const *multithread,
+                              size_t mt_count,
                               Error **errp)
 {
     ERRP_GUARD(); /* ensure clean-up even with error_fatal */
@@ -269,6 +271,11 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
 
+    if (multithread) {
+        error_setg(errp, "FUSE export does not support multi-threading");
+        return -EINVAL;
+    }
+
     /* For growable and writable exports, take the RESIZE permission */
     if (args->growable || blk_exp_args->writable) {
         uint64_t blk_perm, blk_shared_perm;
diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index 8af13b7f0b..10dc673c56 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -267,6 +267,7 @@ static const BlockDevOps vduse_block_ops = {
 };
 
 static int vduse_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+                                AioContext *const *multithread, size_t mt_count,
                                 Error **errp)
 {
     VduseBlkExport *vblk_exp = container_of(exp, VduseBlkExport, export);
@@ -302,6 +303,12 @@ static int vduse_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
             return -EINVAL;
         }
     }
+
+    if (multithread) {
+        error_setg(errp, "vduse-blk export does not support multi-threading");
+        return -EINVAL;
+    }
+
     vblk_exp->num_queues = num_queues;
     vblk_exp->handler.blk = exp->blk;
     vblk_exp->handler.serial = g_strdup(vblk_opts->serial ?: "");
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index a4d54e824f..e89422bb85 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -316,6 +316,7 @@ static const BlockDevOps vu_blk_dev_ops = {
 };
 
 static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+                             AioContext *const *multithread, size_t mt_count,
                              Error **errp)
 {
     VuBlkExport *vexp = container_of(exp, VuBlkExport, export);
@@ -341,6 +342,13 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
         error_setg(errp, "num-queues must be greater than 0");
         return -EINVAL;
     }
+
+    if (multithread) {
+        error_setg(errp,
+                   "vhost-user-blk export does not support multi-threading");
+        return -EINVAL;
+    }
+
     vexp->handler.blk = exp->blk;
     vexp->handler.serial = g_strdup("vhost_user_blk");
     vexp->handler.logical_block_size = logical_block_size;
diff --git a/nbd/server.c b/nbd/server.c
index acec0487a8..620097c58c 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1795,6 +1795,7 @@ static const BlockDevOps nbd_block_ops = {
 };
 
 static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
+                             AioContext *const *multithread, size_t mt_count,
                              Error **errp)
 {
     NBDExport *exp = container_of(blk_exp, NBDExport, common);
@@ -1831,6 +1832,11 @@ static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
         return -EEXIST;
     }
 
+    if (multithread) {
+        error_setg(errp, "NBD export does not support multi-threading");
+        return -EINVAL;
+    }
+
     size = blk_getlength(blk);
     if (size < 0) {
         error_setg_errno(errp, -size,
-- 
2.53.0




* [PATCH v4 20/24] iotests/307: Test multi-thread export interface
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (18 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 19/24] block/export: Add multi-threading interface Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 21/24] fuse: Make shared export state atomic Hanna Czenczek
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Test the QAPI interface for multi-threaded exports.  None of our exports
currently support multi-threading, so it's always an error in the end,
but we can still test the specific errors.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 tests/qemu-iotests/307     | 47 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/307.out | 18 +++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/tests/qemu-iotests/307 b/tests/qemu-iotests/307
index b429b5aa50..f6ee3ebec0 100755
--- a/tests/qemu-iotests/307
+++ b/tests/qemu-iotests/307
@@ -142,5 +142,52 @@ with iotests.FilePath('image') as img, \
     vm.qmp_log('query-block-exports')
     iotests.qemu_nbd_list_log('-k', socket)
 
+    iotests.log('\n=== Using multi-thread with NBD ===')
+
+    # Actual multi-threading; (currently) not supported by NBD
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0', 'iothread1'])
+
+    # Should be treated the same way as actual multi-threading, even if there's
+    # only a single thread
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0'])
+
+    iotests.log('\n=== Empty thread list')
+
+    # Simply not allowed
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=[])
+
+    iotests.log('\n=== Non-existent thread name in list')
+
+    # Expect an error, even if NBD does not support multi-threading, because the
+    # list is parsed before being passed to NBD
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0', 'nothread', 'iothread1'])
+
+    iotests.log('\n=== Multi-thread with fixed-iothread')
+
+    # With multi-threading, there is no single context to give the BDS, so it is
+    # just left where it is.  fixed-iothread does not make sense then.
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0', 'iothread1'],
+               fixed_iothread=True)
+
     iotests.log('\n=== Shut down QEMU ===')
     vm.shutdown()
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index f645f3315f..a9b37d3ac1 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -134,4 +134,22 @@ read failed: Input/output error
 exports available: 0
 
 
+=== Using multi-thread with NBD ===
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "NBD export does not support multi-threading"}}
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "NBD export does not support multi-threading"}}
+
+=== Empty thread list
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": [], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "The set of I/O threads must not be empty"}}
+
+=== Non-existent thread name in list
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0", "nothread", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "iothread \"nothread\" not found"}}
+
+=== Multi-thread with fixed-iothread
+{"execute": "block-export-add", "arguments": {"fixed-iothread": true, "id": "export0", "iothread": ["iothread0", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "Cannot use fixed-iothread for a multi-threaded export"}}
+
 === Shut down QEMU ===
-- 
2.53.0




* [PATCH v4 21/24] fuse: Make shared export state atomic
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (19 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 20/24] iotests/307: Test multi-thread export interface Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 22/24] fuse: Implement multi-threading Hanna Czenczek
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

The next commit is going to allow multi-threaded access to a FUSE
export.  In order to allow safe concurrent SETATTR operations that
can modify the shared st_mode, st_uid, and st_gid, make any access to
those fields atomic operations.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index aa4f1c0307..162cbdacfc 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -157,6 +157,7 @@ typedef struct FuseExport {
     /* Whether allow_other was used as a mount option or not */
     bool allow_other;
 
+    /* All atomic */
     mode_t st_mode;
     uid_t st_uid;
     gid_t st_gid;
@@ -267,6 +268,7 @@ static int fuse_export_create(BlockExport *blk_exp,
     ERRP_GUARD(); /* ensure clean-up even with error_fatal */
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
     BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
+    uint32_t st_mode;
     int ret;
 
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
@@ -335,12 +337,13 @@ static int fuse_export_create(BlockExport *blk_exp,
         args->allow_other = FUSE_EXPORT_ALLOW_OTHER_AUTO;
     }
 
-    exp->st_mode = S_IFREG | S_IRUSR;
+    st_mode = S_IFREG | S_IRUSR;
     if (exp->writable) {
-        exp->st_mode |= S_IWUSR;
+        st_mode |= S_IWUSR;
     }
-    exp->st_uid = getuid();
-    exp->st_gid = getgid();
+    qatomic_set(&exp->st_mode, st_mode);
+    qatomic_set(&exp->st_uid, getuid());
+    qatomic_set(&exp->st_gid, getgid());
 
     if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
         /* Try allow_other == true first, ignore errors */
@@ -818,10 +821,10 @@ fuse_co_getattr(FuseExport *exp, struct fuse_attr_out *out)
         .attr_valid = 1,
         .attr = {
             .ino        = 1,
-            .mode       = exp->st_mode,
+            .mode       = qatomic_read(&exp->st_mode),
             .nlink      = 1,
-            .uid        = exp->st_uid,
-            .gid        = exp->st_gid,
+            .uid        = qatomic_read(&exp->st_uid),
+            .gid        = qatomic_read(&exp->st_gid),
             .size       = length,
             .blksize    = blk_bs(exp->common.blk)->bl.request_alignment,
             .blocks     = allocated_blocks,
@@ -934,15 +937,15 @@ fuse_co_setattr(FuseExport *exp, struct fuse_attr_out *out, uint32_t to_set,
 
     if (to_set & FATTR_MODE) {
         /* Ignore FUSE-supplied file type, only change the mode */
-        exp->st_mode = (mode & 07777) | S_IFREG;
+        qatomic_set(&exp->st_mode, (mode & 07777) | S_IFREG);
     }
 
     if (to_set & FATTR_UID) {
-        exp->st_uid = uid;
+        qatomic_set(&exp->st_uid, uid);
     }
 
     if (to_set & FATTR_GID) {
-        exp->st_gid = gid;
+        qatomic_set(&exp->st_gid, gid);
     }
 
     return fuse_co_getattr(exp, out);
-- 
2.53.0




* [PATCH v4 22/24] fuse: Implement multi-threading
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (20 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 21/24] fuse: Make shared export state atomic Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 23/24] qapi/block-export: Document FUSE's multi-threading Hanna Czenczek
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
(via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).

We can use this to implement multi-threading.

For configuration, we don't need any more information beyond the simple
array provided by the core block export interface: The FUSE kernel
driver feeds these FDs in a round-robin fashion, so all of them are
equivalent and we want to have exactly one per thread.

These are the benchmark results when using four threads (compared to a
single thread); note that fio still only uses a single job, but
performance can still be improved because of said round-robin usage for
the queues.  (Not in the sync case, though, in which case I guess it
just adds overhead.)

file:
  read:
    seq aio:   261.7k ±1.7k  (+168%)
    rand aio:  129.2k ±14.3k (+35%)
    seq sync:   36.6k ±0.6k  (+6%)
    rand sync:  10.1k ±0.1k  (+2%)
  write:
    seq aio:   235.7k ±2.8k  (+243%)
    rand aio:  232.0k ±6.7k  (+237%)
    seq sync:   31.7k ±0.6k  (+4%)
    rand sync:  31.8k ±0.5k  (+4%)
null:
  read:
    seq aio:   253.8k ±12.3k (+45%)
    rand aio:  248.2k ±12.0k (+45%)
    seq sync:   91.6k ±2.4k  (+12%)
    rand sync:  91.3k ±2.1k  (+17%)
  write:
    seq aio:   208.2k ±9.8k  (+6%)
    rand aio:  207.0k ±7.4k  (+8%)
    seq sync:   91.2k ±1.9k  (+9%)
    rand sync:  90.4k ±2.5k  (+14%)

So moderate improvements in most cases, but quite improved AIO
performance with an actual underlying file.

Here are the results for numjobs=4:

"Before", i.e. without multithreading in QSD/FUSE (results compared to
numjobs=1):

file:
  read:
    seq aio:    85.5k ±0.4k (-13%)
    rand aio:   92.5k ±0.5k (-3%)
    seq sync:   54.5k ±9.1k (+58%)
    rand sync:  38.0k ±0.2k (+283%)
  write:
    seq aio:    67.3k ±0.3k (-2%)
    rand aio:   67.6k ±0.3k (-2%)
    seq sync:   69.3k ±0.5k (+126%)
    rand sync:  69.3k ±0.3k (+126%)
null:
  read:
    seq aio:   170.6k ±0.8k (-2%)
    rand aio:  170.9k ±0.9k (±0%)
    seq sync:  187.6k ±1.3k (+129%)
    rand sync: 188.9k ±0.9k (+142%)
  write:
    seq aio:   191.5k ±1.2k (-2%)
    rand aio:  193.8k ±1.4k (-1%)
    seq sync:  206.1k ±1.3k (+147%)
    rand sync: 206.1k ±1.2k (+159%)

As probably expected, little difference in the AIO case, but great
improvements in the sync cases because it kind of gives it an artificial
iodepth of 4.

"After", i.e. with four threads in QSD/FUSE (now results compared to the
above):

file:
  read:
    seq aio:   198.7k ±2.7k (+132%)
    rand aio:  317.3k ±0.6k (+243%)
    seq sync:   55.9k ±8.9k (+3%)
    rand sync:  39.1k ±0.0k (+3%)
  write:
    seq aio:   229.0k ±0.8k (+240%)
    rand aio:  227.0k ±1.3k (+235%)
    seq sync:  102.5k ±0.2k (+48%)
    rand sync: 101.7k ±0.2k (+47%)
null:
  read:
    seq aio:   584.0k ±1.5k (+242%)
    rand aio:  581.9k ±1.9k (+240%)
    seq sync:  270.6k ±0.9k (+44%)
    rand sync: 270.4k ±0.7k (+43%)
  write:
    seq aio:   598.4k ±2.0k (+212%)
    rand aio:  605.2k ±2.0k (+212%)
    seq sync:  274.0k ±0.8k (+33%)
    rand sync: 275.0k ±0.7k (+33%)

So this helps mainly for the AIO cases, but also in the null sync cases,
because null is always CPU-bound, so more threads help.
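
For reference, the relative changes in the "After" table can be reproduced
from the "Before" table.  The following is a small sketch under the
assumption that each percentage is the rounded relative change of the mean
values; relative_change_percent() is a hypothetical helper, not part of the
benchmark tooling.

```c
#include <assert.h>

/*
 * Relative change from @before to @after in percent, rounded to the nearest
 * integer (halves rounded away from zero).
 */
static long relative_change_percent(double before, double after)
{
    double pct;

    assert(before > 0.0);
    pct = (after / before - 1.0) * 100.0;
    return (long)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

With the file read numbers above, relative_change_percent(85.5, 198.7)
yields 132 for seq aio and relative_change_percent(92.5, 317.3) yields 243
for rand aio, matching the table.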

One unsolved mystery: When using a multithreaded export, running fio
with 1 job (benchmark at the top of this commit) yields better seqread
performance than doing so with 4 jobs.  Actually, with 4 jobs, it's
significantly worse than randread, which is quite strange.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/export/fuse.c | 193 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 153 insertions(+), 40 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 162cbdacfc..6777a7651b 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -31,11 +31,13 @@
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "system/block-backend.h"
+#include "system/iothread.h"
 
 #include <fuse.h>
 #include <fuse_lowlevel.h>
 
 #include "standard-headers/linux/fuse.h"
+#include <sys/ioctl.h>
 
 #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
@@ -119,12 +121,17 @@ QEMU_BUILD_BUG_ON(sizeof(((FuseRequestInHeaderBuf *)0)->head) +
                   sizeof(((FuseRequestInHeaderBuf *)0)->tail) !=
                   sizeof(FuseRequestInHeader));
 
-typedef struct FuseExport {
-    BlockExport common;
+typedef struct FuseExport FuseExport;
 
-    struct fuse_session *fuse_session;
-    unsigned int in_flight; /* atomic */
-    bool mounted, fd_handler_set_up;
+/*
+ * One FUSE "queue", representing one FUSE FD from which requests are fetched
+ * and processed.  Each queue is tied to an AioContext.
+ */
+typedef struct FuseQueue {
+    FuseExport *exp;
+
+    AioContext *ctx;
+    int fuse_fd;
 
     /*
      * Cached buffer to receive the data of WRITE requests.  Cached because:
@@ -141,6 +148,14 @@ typedef struct FuseExport {
      * via blk_blockalign() and thus need to be freed via qemu_vfree().
      */
     void *req_write_data_cached;
+} FuseQueue;
+
+struct FuseExport {
+    BlockExport common;
+
+    struct fuse_session *fuse_session;
+    unsigned int in_flight; /* atomic */
+    bool mounted, fd_handler_set_up;
 
     /*
      * Set when there was an unrecoverable error and no requests should be read
@@ -149,7 +164,15 @@ typedef struct FuseExport {
      */
     bool halted;
 
-    int fuse_fd;
+    int num_queues;
+    FuseQueue *queues;
+    /*
+     * True if this export should follow the generic export's AioContext.
+     * Will be false if the queues' AioContexts have been explicitly set by the
+     * user, i.e. are expected to stay in those contexts.
+     * (I.e. is always false if there is more than one queue.)
+     */
+    bool follow_aio_context;
 
     char *mountpoint;
     bool writable;
@@ -161,7 +184,7 @@ typedef struct FuseExport {
     mode_t st_mode;
     uid_t st_uid;
     gid_t st_gid;
-} FuseExport;
+};
 
 /*
  * Verify that the size of FuseRequestInHeaderBuf.head plus the data
@@ -180,12 +203,13 @@ static void fuse_export_halt(FuseExport *exp);
 static void init_exports_table(void);
 
 static int mount_fuse_export(FuseExport *exp, Error **errp);
+static int clone_fuse_fd(int fd, Error **errp);
 
 static bool is_regular_file(const char *path, Error **errp);
 
 static void read_from_fuse_fd(void *opaque);
 static void coroutine_fn
-fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+fuse_co_process_request(FuseQueue *q, const FuseRequestInHeader *in_hdr,
                         const void *data_buffer);
 static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err);
 
@@ -217,8 +241,11 @@ static void fuse_attach_handlers(FuseExport *exp)
         return;
     }
 
-    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
-                       read_from_fuse_fd, NULL, NULL, NULL, exp);
+    for (int i = 0; i < exp->num_queues; i++) {
+        aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+                           read_from_fuse_fd, NULL, NULL, NULL,
+                           &exp->queues[i]);
+    }
     exp->fd_handler_set_up = true;
 }
 
@@ -227,8 +254,10 @@ static void fuse_attach_handlers(FuseExport *exp)
  */
 static void fuse_detach_handlers(FuseExport *exp)
 {
-    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
-                       NULL, NULL, NULL, NULL, NULL);
+    for (int i = 0; i < exp->num_queues; i++) {
+        aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+                           NULL, NULL, NULL, NULL, NULL);
+    }
     exp->fd_handler_set_up = false;
 }
 
@@ -243,6 +272,11 @@ static void fuse_export_drained_end(void *opaque)
 
     /* Refresh AioContext in case it changed */
     exp->common.ctx = blk_get_aio_context(exp->common.blk);
+    if (exp->follow_aio_context) {
+        assert(exp->num_queues == 1);
+        exp->queues[0].ctx = exp->common.ctx;
+    }
+
     fuse_attach_handlers(exp);
 }
 
@@ -274,8 +308,32 @@ static int fuse_export_create(BlockExport *blk_exp,
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
 
     if (multithread) {
-        error_setg(errp, "FUSE export does not support multi-threading");
-        return -EINVAL;
+        /* Guaranteed by common export code */
+        assert(mt_count >= 1);
+
+        exp->follow_aio_context = false;
+        exp->num_queues = mt_count;
+        exp->queues = g_new(FuseQueue, mt_count);
+
+        for (size_t i = 0; i < mt_count; i++) {
+            exp->queues[i] = (FuseQueue) {
+                .exp = exp,
+                .ctx = multithread[i],
+                .fuse_fd = -1,
+            };
+        }
+    } else {
+        /* Guaranteed by common export code */
+        assert(mt_count == 0);
+
+        exp->follow_aio_context = true;
+        exp->num_queues = 1;
+        exp->queues = g_new(FuseQueue, 1);
+        exp->queues[0] = (FuseQueue) {
+            .exp = exp,
+            .ctx = exp->common.ctx,
+            .fuse_fd = -1,
+        };
     }
 
     /* For growable and writable exports, take the RESIZE permission */
@@ -287,7 +345,7 @@ static int fuse_export_create(BlockExport *blk_exp,
         ret = blk_set_perm(exp->common.blk, blk_perm | BLK_PERM_RESIZE,
                            blk_shared_perm, errp);
         if (ret < 0) {
-            return ret;
+            goto fail;
         }
     }
 
@@ -363,13 +421,23 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
 
-    exp->fuse_fd = fuse_session_fd(exp->fuse_session);
-    ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
+    assert(exp->num_queues >= 1);
+    exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
+    ret = qemu_fcntl_addfl(exp->queues[0].fuse_fd, O_NONBLOCK);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
         goto fail;
     }
 
+    for (int i = 1; i < exp->num_queues; i++) {
+        int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
+        if (fd < 0) {
+            ret = fd;
+            goto fail;
+        }
+        exp->queues[i].fuse_fd = fd;
+    }
+
     fuse_attach_handlers(exp);
     return 0;
 
@@ -462,28 +530,28 @@ fail:
 /**
  * Allocate a buffer to receive WRITE data, or take the cached one.
  */
-static void *get_write_data_buffer(FuseExport *exp)
+static void *get_write_data_buffer(FuseQueue *q)
 {
-    if (exp->req_write_data_cached) {
-        void *cached = exp->req_write_data_cached;
-        exp->req_write_data_cached = NULL;
+    if (q->req_write_data_cached) {
+        void *cached = q->req_write_data_cached;
+        q->req_write_data_cached = NULL;
         return cached;
     } else {
-        return blk_blockalign(exp->common.blk, FUSE_MAX_WRITE_BYTES);
+        return blk_blockalign(q->exp->common.blk, FUSE_MAX_WRITE_BYTES);
     }
 }
 
 /**
  * Release a WRITE data buffer, possibly reusing it for a subsequent request.
  */
-static void release_write_data_buffer(FuseExport *exp, void **buffer)
+static void release_write_data_buffer(FuseQueue *q, void **buffer)
 {
     if (!*buffer) {
         return;
     }
 
-    if (!exp->req_write_data_cached) {
-        exp->req_write_data_cached = *buffer;
+    if (!q->req_write_data_cached) {
+        q->req_write_data_cached = *buffer;
     } else {
         qemu_vfree(*buffer);
     }
@@ -529,9 +597,42 @@ static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
     }
 }
 
+/**
+ * Clone the given /dev/fuse file descriptor, yielding a second FD from which
+ * requests can be pulled for the associated filesystem.  Returns an FD on
+ * success, and -errno on error.
+ */
+static int clone_fuse_fd(int fd, Error **errp)
+{
+    uint32_t src_fd = fd;
+    int new_fd;
+    int ret;
+
+    /*
+     * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
+     * (fuse_clone_chan()).
+     */
+    new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
+    if (new_fd < 0) {
+        ret = -errno;
+        error_setg_errno(errp, errno, "Failed to open /dev/fuse");
+        return ret;
+    }
+
+    ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
+    if (ret < 0) {
+        ret = -errno;
+        error_setg_errno(errp, errno, "Failed to clone FUSE FD");
+        close(new_fd);
+        return ret;
+    }
+
+    return new_fd;
+}
+
 /**
  * Try to read a single request from the FUSE FD.
- * Takes a FuseExport pointer in `opaque`.
+ * Takes a FuseQueue pointer in `opaque`.
  *
  * Assumes the export's in-flight counter has already been incremented.
  *
@@ -539,8 +640,9 @@ static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
  */
 static void coroutine_fn co_read_from_fuse_fd(void *opaque)
 {
-    FuseExport *exp = opaque;
-    int fuse_fd = exp->fuse_fd;
+    FuseQueue *q = opaque;
+    int fuse_fd = q->fuse_fd;
+    FuseExport *exp = q->exp;
     ssize_t ret;
     FuseRequestInHeaderBuf in_hdr_buf;
     const FuseRequestInHeader *in_hdr;
@@ -552,7 +654,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
         goto no_request;
     }
 
-    data_buffer = get_write_data_buffer(exp);
+    data_buffer = get_write_data_buffer(q);
 
     /* Construct the I/O vector to hold the FUSE request */
     iov[0] = (struct iovec) { &in_hdr_buf.head, sizeof(in_hdr_buf.head) };
@@ -613,29 +715,29 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
             memcpy(in_hdr_buf.tail, data_buffer, len);
         }
 
-        release_write_data_buffer(exp, &data_buffer);
+        release_write_data_buffer(q, &data_buffer);
     }
 
-    fuse_co_process_request(exp, in_hdr, data_buffer);
+    fuse_co_process_request(q, in_hdr, data_buffer);
 
 no_request:
-    release_write_data_buffer(exp, &data_buffer);
+    release_write_data_buffer(q, &data_buffer);
     fuse_dec_in_flight(exp);
 }
 
 /**
  * Try to read and process a single request from the FUSE FD.
  * (To be used as a handler for when the FUSE FD becomes readable.)
- * Takes a FuseExport pointer in `opaque`.
+ * Takes a FuseQueue pointer in `opaque`.
  */
 static void read_from_fuse_fd(void *opaque)
 {
-    FuseExport *exp = opaque;
+    FuseQueue *q = opaque;
     Coroutine *co;
 
-    co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
+    co = qemu_coroutine_create(co_read_from_fuse_fd, q);
     /* Decremented by co_read_from_fuse_fd() */
-    fuse_inc_in_flight(exp);
+    fuse_inc_in_flight(q->exp);
     qemu_coroutine_enter(co);
 }
 
@@ -660,6 +762,17 @@ static void fuse_export_delete(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
+    for (int i = 0; i < exp->num_queues; i++) {
+        FuseQueue *q = &exp->queues[i];
+
+        /* Queue 0's FD belongs to the FUSE session */
+        if (i > 0 && q->fuse_fd >= 0) {
+            close(q->fuse_fd);
+        }
+        qemu_vfree(q->req_write_data_cached);
+    }
+    g_free(exp->queues);
+
     if (exp->fuse_session) {
         if (exp->mounted) {
             fuse_session_unmount(exp->fuse_session);
@@ -668,7 +781,6 @@ static void fuse_export_delete(BlockExport *blk_exp)
         fuse_session_destroy(exp->fuse_session);
     }
 
-    qemu_vfree(exp->req_write_data_cached);
     g_free(exp->mountpoint);
 }
 
@@ -1373,10 +1485,11 @@ static int fuse_write_buf_response(int fd,
  * Process a FUSE request, incl. writing the response.
  */
 static void coroutine_fn
-fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+fuse_co_process_request(FuseQueue *q, const FuseRequestInHeader *in_hdr,
                         const void *data_buffer)
 {
     FuseRequestOutHeader out_hdr;
+    FuseExport *exp = q->exp;
     /* For read requests: Data to be returned */
     void *out_data_buffer = NULL;
     ssize_t ret;
@@ -1498,10 +1611,10 @@ fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
     }
 
     if (out_data_buffer) {
-        fuse_write_buf_response(exp->fuse_fd, &out_hdr.common, out_data_buffer);
+        fuse_write_buf_response(q->fuse_fd, &out_hdr.common, out_data_buffer);
         qemu_vfree(out_data_buffer);
     } else {
-        fuse_write_response(exp->fuse_fd, &out_hdr);
+        fuse_write_response(q->fuse_fd, &out_hdr);
     }
 }
 
-- 
2.53.0
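
Editorial illustration (not part of the patch): the per-queue WRITE buffer
handling above boils down to a single-slot cache. The following is a minimal
sketch of that pattern, assuming plain malloc()/free() in place of
blk_blockalign()/qemu_vfree(). DemoQueue, demo_get_buffer() and
demo_release_buffer() are hypothetical names, not QEMU identifiers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-queue state; not QEMU's FuseQueue. */
typedef struct DemoQueue {
    void *cached;     /* at most one spare buffer, or NULL */
    size_t buf_size;  /* size of every buffer handed out */
} DemoQueue;

/* Take the cached buffer if there is one, otherwise allocate a new one. */
static void *demo_get_buffer(DemoQueue *q)
{
    assert(q->buf_size > 0);

    if (q->cached) {
        void *buf = q->cached;
        q->cached = NULL;
        return buf;
    }
    return malloc(q->buf_size);
}

/*
 * Give a buffer back: keep it if the slot is empty, free it otherwise.
 * The sketch clears *buf so that releasing the same pointer twice is a no-op.
 */
static void demo_release_buffer(DemoQueue *q, void **buf)
{
    if (!*buf) {
        return;
    }

    if (!q->cached) {
        q->cached = *buf;
    } else {
        free(*buf);
    }
    *buf = NULL;
}
```

Because each queue owns its own slot in this sketch, a thread that only
touches its own queue needs no locking around the cache; that is an
assumption of the sketch, not a statement about the patch.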




* [PATCH v4 23/24] qapi/block-export: Document FUSE's multi-threading
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (21 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 22/24] fuse: Implement multi-threading Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-18 13:26 ` [PATCH v4 24/24] iotests/308: Add multi-threading sanity test Hanna Czenczek
  2026-02-26 19:54 ` [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Kevin Wolf
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Document for users that FUSE's multi-threading implementation
distributes requests in a round-robin manner, regardless of where they
originate from.

As noted by Stefan, this will probably change with a FUSE-over-io_uring
implementation (which is supposed to have CPU affinity), but documenting
that is deferred until it is implemented.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 qapi/block-export.json | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 160cd2e3ca..dd724acf1c 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -164,6 +164,11 @@
 # Options for exporting a block graph node on some (file) mountpoint
 # as a raw image.
 #
+# Multi-threading note: The FUSE export supports multi-threading.
+# Currently, requests are distributed across these threads in a
+# round-robin fashion, i.e. independently of the CPU core from which a
+# request originates.
+#
 # @mountpoint: Path on which to export the block device via FUSE.
 #     This must point to an existing regular file.
 #
-- 
2.53.0
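
Editorial illustration (not part of the patch): the documentation note above
says requests are spread round-robin, independently of the originating CPU
core. As a minimal sketch of what that property means, assuming a simple
counter-based dispatcher (RoundRobin and rr_pick_queue() are hypothetical
names; the message does not say where or how the distribution is actually
implemented):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dispatcher state; not taken from QEMU or the kernel. */
typedef struct RoundRobin {
    size_t num_queues;  /* number of worker queues/threads */
    size_t next;        /* index handed out for the next request */
} RoundRobin;

/*
 * Return the queue index for the next request.  origin_cpu is accepted
 * only to show that it is deliberately ignored by a round-robin policy.
 */
static size_t rr_pick_queue(RoundRobin *rr, unsigned origin_cpu)
{
    size_t picked;

    (void)origin_cpu;
    assert(rr->num_queues > 0);

    picked = rr->next;
    rr->next = (rr->next + 1) % rr->num_queues;
    return picked;
}
```

With four queues, consecutive requests land on queues 0, 1, 2, 3, 0, ...
no matter which CPU issued them; a CPU-affine scheme would instead derive
the index from origin_cpu.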




* [PATCH v4 24/24] iotests/308: Add multi-threading sanity test
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (22 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 23/24] qapi/block-export: Document FUSE's multi-threading Hanna Czenczek
@ 2026-02-18 13:26 ` Hanna Czenczek
  2026-02-26 19:54 ` [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Kevin Wolf
  24 siblings, 0 replies; 29+ messages in thread
From: Hanna Czenczek @ 2026-02-18 13:26 UTC (permalink / raw)
  To: qemu-block; +Cc: qemu-devel, Hanna Czenczek, Kevin Wolf, Brian Song

Run qemu-img bench on a simple multi-threaded FUSE export to test that
it works.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 tests/qemu-iotests/308     | 51 ++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/308.out | 56 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index a83c6fc01f..f4a06a522e 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -441,6 +441,57 @@ $QEMU_IO -c 'read -P 0 0 64M' "$TEST_IMG" | _filter_qemu_io
 
 _cleanup_test_img
 
+echo
+echo '=== Multi-threading ==='
+
+# Just set up a null block device, export it (with multi-threading), and run
+# qemu-img bench on it (to get parallel requests)
+
+_launch_qemu
+_send_qemu_cmd $QEMU_HANDLE \
+    "{'execute': 'qmp_capabilities'}" \
+    'return'
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{'execute': 'blockdev-add',
+      'arguments': {
+          'driver': 'null-co',
+          'node-name': 'null'
+      } }" \
+    'return'
+
+for id in iothread{0,1,2,3}; do
+    _send_qemu_cmd $QEMU_HANDLE \
+        "{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': '$id'
+          } }" \
+        'return'
+done
+
+echo
+
+iothreads="['iothread0', 'iothread1', 'iothread2', 'iothread3']"
+fuse_export_add \
+    'export' \
+    "'mountpoint': '$EXT_MP', 'iothread': $iothreads" \
+    'return' \
+    'null'
+
+echo
+$QEMU_IMG bench -f raw "$EXT_MP" |
+    sed -e 's/[0-9.]\+ seconds/X.XXX seconds/'
+echo
+
+fuse_export_del 'export'
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{'execute': 'quit'}" \
+    'return'
+
+wait=yes _cleanup_qemu
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index ebeaf64b48..580cc94e92 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -217,4 +217,60 @@ read 67108864/67108864 bytes at offset 0
 {"return": {}}
 read 67108864/67108864 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Multi-threading ===
+{'execute': 'qmp_capabilities'}
+{"return": {}}
+{'execute': 'blockdev-add',
+      'arguments': {
+          'driver': 'null-co',
+          'node-name': 'null'
+      } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread0'
+          } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread1'
+          } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread2'
+          } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread3'
+          } }
+{"return": {}}
+
+{'execute': 'block-export-add',
+          'arguments': {
+              'type': 'fuse',
+              'id': 'export',
+              'node-name': 'null',
+              'mountpoint': 'TEST_DIR/t.IMGFMT.fuse', 'iothread': ['iothread0', 'iothread1', 'iothread2', 'iothread3']
+          } }
+{"return": {}}
+
+Sending 75000 read requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 4096)
+Run completed in X.XXX seconds.
+
+{'execute': 'block-export-del',
+          'arguments': {
+              'id': 'export'
+          } }
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_EXPORT_DELETED", "data": {"id": "export"}}
+{'execute': 'quit'}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
+{"return": {}}
 *** done
-- 
2.53.0




* Re: [PATCH v4 16/24] fuse: Manually process requests (without libfuse)
  2026-02-18 13:26 ` [PATCH v4 16/24] fuse: Manually process requests (without libfuse) Hanna Czenczek
@ 2026-02-26 19:26   ` Kevin Wolf
  2026-02-26 19:29   ` Kevin Wolf
  1 sibling, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2026-02-26 19:26 UTC (permalink / raw)
  To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Brian Song

Am 18.02.2026 um 14:26 hat Hanna Czenczek geschrieben:
> Manually read requests from the /dev/fuse FD and process them, without
> using libfuse.  This allows us to safely add parallel request processing
> in coroutines later, without having to worry about libfuse internals.
> (Technically, we already have exactly that problem with
> read_from_fuse_export()/read_from_fuse_fd() nesting.)
> 
> We will continue to use libfuse for mounting the filesystem; fusermount3
> is effectively a helper program of libfuse, so it should know best how
> to interact with it.  (Doing it manually without libfuse, while doable,
> is a bit of a pain, and it is not clear to me how stable the "protocol"
> actually is.)
> 
> Take this opportunity of quite a major rewrite to update the Copyright
> line with corrected information that has surfaced in the meantime.
> 
> Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
> except 'sync', which are iodepth=1 and pvsync2):
> 
> file:
>   read:
>     seq aio:    99.8k ±1.5k IOPS
>     rand aio:   50.5k ±1.0k
>     seq sync:   36.1k ±1.1k
>     rand sync:  10.0k ±0.1k
>   write:
>     seq aio:    72.0k ±9.3k
>     rand aio:   70.6k ±2.5k
>     seq sync:   30.6k ±0.8k
>     rand sync:  30.1k ±1.0k
> null:
>   read:
>     seq aio:   157.9k ±4.7k
>     rand aio:  158.7k ±4.8k
>     seq sync:   80.2k ±2.8k
>     rand sync:  77.5k ±3.8k
>   write:
>     seq aio:   154.3k ±3.6k
>     rand aio:  154.3k ±4.2k
>     seq sync:   76.1k ±5.2k
>     rand sync:  72.9k ±4.0k
> 
> And with this patch applied:
> 
> file:
>   read:
>     seq aio:   106.8k ±1.9k (+7%)
>     rand aio:   48.3k ±8.8k (-4%)
>     seq sync:   35.5k ±1.4k (-2%)
>     rand sync:  10.0k ±0.2k (±0%)
>   write:
>     seq aio:    76.3k ±6.6k (+6%)
>     rand aio:   76.4k ±1.5k (+8%)
>     seq sync:   31.6k ±0.6k (+3%)
>     rand sync:  30.9k ±0.8k (+3%)
> null:
>   read:
>     seq aio:   161.7k ±6.0k (+2%)
>     rand aio:  165.6k ±7.1k (+4%)
>     seq sync:   80.5k ±3.0k (±0%)
>     rand sync:  78.5k ±3.1k (+1%)
>   write:
>     seq aio:   185.1k ±3.3k (+20%)
>     rand aio:  186.7k ±4.8k (+21%)
>     seq sync:   82.5k ±4.2k (+8%)
>     rand sync:  78.7k ±3.2k (+8%)
> 
> So not much difference, aside from write AIO to a null-co export getting
> a bit better.
> 
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  block/export/fuse.c | 944 +++++++++++++++++++++++++++++++++-----------
>  1 file changed, 720 insertions(+), 224 deletions(-)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index af0a8de17b..c481fb72a2 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -1,7 +1,7 @@
>  /*
>   * Present a block device as a raw image through FUSE
>   *
> - * Copyright (c) 2020 Max Reitz <mreitz@redhat.com>
> + * Copyright (c) 2020, 2025 Hanna Czenczek <hreitz@redhat.com>
>   *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
> @@ -27,12 +27,15 @@
>  #include "block/qapi.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-commands-block.h"
> +#include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "system/block-backend.h"
>  
>  #include <fuse.h>
>  #include <fuse_lowlevel.h>
>  
> +#include "standard-headers/linux/fuse.h"
> +
>  #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
>  #include <linux/falloc.h>
>  #endif
> @@ -42,17 +45,102 @@
>  #endif
>  
>  /* Prevent overly long bounce buffer allocations */
> -#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
> +#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
> +/* Small enough to fit in the request buffer */
> +#define FUSE_MAX_WRITE_BYTES (64 * 1024)

Is the comment stale now that you moved to two separate buffers?

>  /**
> - * Handle client reads from the exported image.
> + * Handle client reads from the exported image.  Allocates *bufptr and reads
> + * data from the block device into that buffer.
> + * Returns the buffer (read) size on success, and -errno on error.
> + * After use, *bufptr must be freed via qemu_vfree().
>   */
> -static void fuse_read(fuse_req_t req, fuse_ino_t inode,
> -                      size_t size, off_t offset, struct fuse_file_info *fi)
> +static ssize_t fuse_read(FuseExport *exp, void **bufptr,
> +                         uint64_t offset, uint32_t size)
>  {
> -    FuseExport *exp = fuse_req_userdata(req);
>      int64_t blk_len;
>      void *buf;
>      int ret;
>  
>      /* Limited by max_read, should not happen */
> -    if (size > FUSE_MAX_BOUNCE_BYTES) {
> -        fuse_reply_err(req, EINVAL);
> -        return;
> +    if (size > FUSE_MAX_READ_BYTES) {
> +        return -EINVAL;
>      }
>  
>      /**
> @@ -653,18 +954,12 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
>       */
>      blk_len = blk_getlength(exp->common.blk);
>      if (blk_len < 0) {
> -        fuse_reply_err(req, -blk_len);
> -        return;
> +        return blk_len;
>      }
>  
>      if (offset >= blk_len) {
> -        /*
> -         * Technically libfuse does not allow returning a zero error code for
> -         * read requests, but in practice this is a 0-length read (and a future
> -         * commit will change this code anyway)
> -         */
> -        fuse_reply_err(req, 0);
> -        return;
> +        *bufptr = NULL;
> +        return 0;

It feels a bit inconsistent to set *bufptr = NULL here, but not in the
error paths. Both cases depend on it being NULL afterwards, but the
caller already makes sure that it is NULL when it calls fuse_read().
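
For illustration only, a minimal sketch of the consistent variant: the
out-pointer is cleared once at the top, so every early return leaves it NULL
regardless of what the caller passed in. demo_read() is a hypothetical
function operating on an in-memory array, not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical reader over an in-memory "device" of dev_len bytes
 * (a negative dev_len simulates a failed length query).
 * Returns the number of bytes read, or -errno on error.
 * On success with data, *bufptr is a malloc()ed buffer the caller frees.
 * On error and on a zero-length read, *bufptr is always NULL.
 */
static ssize_t demo_read(const uint8_t *dev, int64_t dev_len,
                         void **bufptr, uint64_t offset, uint32_t size)
{
    void *buf;

    assert(bufptr != NULL);
    *bufptr = NULL;  /* one assignment covers every early return below */

    if (dev_len < 0) {
        return -EIO;
    }
    if (offset >= (uint64_t)dev_len) {
        return 0;  /* read starts at or after EOF */
    }
    if (size > (uint64_t)dev_len - offset) {
        /* Clamp to EOF; the result is smaller than size, so it fits */
        size = (uint32_t)((uint64_t)dev_len - offset);
    }
    if (size == 0) {
        return 0;
    }

    buf = malloc(size);
    if (!buf) {
        return -ENOMEM;
    }
    memcpy(buf, dev + offset, size);
    *bufptr = buf;
    return size;
}
```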

>      }
>  
>      if (offset + size > blk_len) {

Overall, this feels much nicer than v3!

Kevin




* Re: [PATCH v4 16/24] fuse: Manually process requests (without libfuse)
  2026-02-18 13:26 ` [PATCH v4 16/24] fuse: Manually process requests (without libfuse) Hanna Czenczek
  2026-02-26 19:26   ` Kevin Wolf
@ 2026-02-26 19:29   ` Kevin Wolf
  1 sibling, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2026-02-26 19:29 UTC (permalink / raw)
  To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Brian Song, bernd

Am 18.02.2026 um 14:26 hat Hanna Czenczek geschrieben:
> @@ -278,6 +390,17 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
>      char *mount_opts;
>      struct fuse_args fuse_args;
>      int ret;
> +    /*
> +     * We just create the session for mounting/unmounting, no need to provide
> +     * any operations.  However, since libfuse commit 52a633a5d, we have to
> +     * provide some op struct and cannot just pass NULL (even though the commit
> +     * message ("allow passing ops as NULL") seems to imply the exact opposite,
> +     * as does the comment added to fuse_session_new_fn() ("To create a no-op
> +     * session just for mounting pass op as NULL.")).
> +     * This is how said libfuse commit implements a no-op session internally, so
> +     * do it the same way.
> +     */
> +    static const struct fuse_lowlevel_ops null_ops = { 0 };
>  
>      /*
>       * Note that these mount options differ from what we would pass to a direct
> @@ -301,8 +424,8 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
>      fuse_argv[3] = NULL;
>      fuse_args = (struct fuse_args)FUSE_ARGS_INIT(3, (char **)fuse_argv);
>  
> -    exp->fuse_session = fuse_session_new(&fuse_args, &fuse_ops,
> -                                         sizeof(fuse_ops), exp);
> +    exp->fuse_session = fuse_session_new(&fuse_args, &null_ops,
> +                                         sizeof(null_ops), NULL);

Bernd, is it intentional that the external interface changed in the way
the comment explains or is this accidental breakage?

Kevin




* Re: [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading
  2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
                   ` (23 preceding siblings ...)
  2026-02-18 13:26 ` [PATCH v4 24/24] iotests/308: Add multi-threading sanity test Hanna Czenczek
@ 2026-02-26 19:54 ` Kevin Wolf
  24 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2026-02-26 19:54 UTC (permalink / raw)
  To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Brian Song

Am 18.02.2026 um 14:26 hat Hanna Czenczek geschrieben:
> Hi,
> 
> This series:
> - Fixes some bugs/minor inconveniences,
> - Removes libfuse from the request processing path,
> - Make the FUSE export use coroutines for request handling,
> 
> More detail on the v1 cover letter:
> https://lists.nongnu.org/archive/html/qemu-block/2025-03/msg00359.html
> 
> v2 cover letter:
> https://lists.nongnu.org/archive/html/qemu-block/2025-06/msg00040.html
> 
> v3 cover letter:
> https://lists.nongnu.org/archive/html/qemu-block/2025-07/msg00005.html
> 
> 
> I noticed some performance differences vs. my previous benchmarks;
> notably, performance didn’t improve with the introduction of coroutines
> much (except for random read performance).  However, when I run the same
> benchmarks on the old branch again, I see no performance improvement
> either.  Something about my host system must have changed.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

I had very few minor comments. Let me know if you'd like to have these
points changed while I apply the series.

Kevin




* Re: [PATCH v4 07/24] fuse: Fix mount options
  2026-02-18 13:26 ` [PATCH v4 07/24] fuse: Fix mount options Hanna Czenczek
@ 2026-03-05 17:56   ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2026-03-05 17:56 UTC (permalink / raw)
  To: Hanna Czenczek; +Cc: qemu-block, qemu-devel, Brian Song

Am 18.02.2026 um 14:26 hat Hanna Czenczek geschrieben:
> Since I actually took a look into how mounting with libfuse works[1], I
> now know that the FUSE mount options are not exactly standard mount
> system call options.  Specifically:
> - We should add "nosuid,nodev,noatime" because that is going to be
>   translated into the respective MS_ mount flags; and those flags make
>   sense for us.
> - We can set rw/ro to make the mount writable or not.  It makes sense to
>   set this flag to produce a better error message for read-only exports
>   (EROFS instead of EACCES).
>   This changes behavior as can be seen in iotest 308: It is no longer
>   possible to modify metadata of read-only exports.
> 
> In addition, in the comment, we can note that the FUSE mount() system
> call actually expects some more parameters that we can omit because
> fusermount3 (i.e. libfuse) will figure them out by itself:
> - fd: /dev/fuse fd
> - rootmode: Inode mode of the root node
> - user_id/group_id: Mounter's UID/GID
> 
> [1] It invokes fusermount3, an SUID libfuse helper program, which parses
>     and processes some mount options before actually invoking the
>     mount() system call.
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>

This breaks the fuse-allow-other iotest for me.

It doesn't look to me like the test actually requires a read-only
export, so I'd suggest just making it writable, as below. I'll apply
this now, if there are any objections let me know so that we can change
it again before I send a pull request.

Kevin

diff --git a/tests/qemu-iotests/tests/fuse-allow-other b/tests/qemu-iotests/tests/fuse-allow-other
index 19f494aefb1..eaa39f8f236 100755
--- a/tests/qemu-iotests/tests/fuse-allow-other
+++ b/tests/qemu-iotests/tests/fuse-allow-other
@@ -101,7 +101,8 @@ run_permission_test()

     fuse_export_add 'export' \
         "'mountpoint': '$EXT_MP',
-         'allow-other': '$1'"
+         'allow-other': '$1',
+         'writable': true"

     # Should always work
     echo '(Removing all permissions)'
diff --git a/tests/qemu-iotests/tests/fuse-allow-other.out b/tests/qemu-iotests/tests/fuse-allow-other.out
index 3219fc35e05..62660b40bfc 100644
--- a/tests/qemu-iotests/tests/fuse-allow-other.out
+++ b/tests/qemu-iotests/tests/fuse-allow-other.out
@@ -12,7 +12,8 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65536
                   'id': 'export',
                   'node-name': 'node-format',
                   'mountpoint': 'TEST_DIR/fuse-export',
-         'allow-other': 'off'
+         'allow-other': 'off',
+         'writable': true
               } }
 {"return": {}}
 (Removing all permissions)
@@ -41,7 +42,8 @@ stat: cannot statx 'fuse-export': Permission denied
                   'id': 'export',
                   'node-name': 'node-format',
                   'mountpoint': 'TEST_DIR/fuse-export',
-         'allow-other': 'on'
+         'allow-other': 'on',
+         'writable': true
               } }
 {"return": {}}
 (Removing all permissions)
@@ -68,7 +70,8 @@ Permissions seen by nobody: 440
                   'id': 'export',
                   'node-name': 'node-format',
                   'mountpoint': 'TEST_DIR/fuse-export',
-         'allow-other': 'auto'
+         'allow-other': 'auto',
+         'writable': true
               } }
 {"return": {}}
 (Removing all permissions)





Thread overview: 29+ messages
2026-02-18 13:26 [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 01/24] fuse: Copy write buffer content before polling Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 02/24] fuse: Ensure init clean-up even with error_fatal Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 03/24] fuse: Remove superfluous empty line Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 04/24] fuse: Explicitly set inode ID to 1 Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 05/24] fuse: Change setup_... to mount_fuse_export() Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 06/24] fuse: Destroy session on mount_fuse_export() fail Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 07/24] fuse: Fix mount options Hanna Czenczek
2026-03-05 17:56   ` Kevin Wolf
2026-02-18 13:26 ` [PATCH v4 08/24] fuse: Set direct_io and parallel_direct_writes Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 09/24] fuse: Introduce fuse_{at,de}tach_handlers() Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 10/24] fuse: Introduce fuse_{inc,dec}_in_flight() Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 11/24] fuse: Add halted flag Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 12/24] fuse: fuse_{read,write}: Rename length to blk_len Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 13/24] iotests/308: Use conv=notrunc to test growability Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 14/24] fuse: Explicitly handle non-grow post-EOF accesses Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 15/24] block: Move qemu_fcntl_addfl() into osdep.c Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 16/24] fuse: Manually process requests (without libfuse) Hanna Czenczek
2026-02-26 19:26   ` Kevin Wolf
2026-02-26 19:29   ` Kevin Wolf
2026-02-18 13:26 ` [PATCH v4 17/24] fuse: Reduce max read size Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 18/24] fuse: Process requests in coroutines Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 19/24] block/export: Add multi-threading interface Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 20/24] iotests/307: Test multi-thread export interface Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 21/24] fuse: Make shared export state atomic Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 22/24] fuse: Implement multi-threading Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 23/24] qapi/block-export: Document FUSE's multi-threading Hanna Czenczek
2026-02-18 13:26 ` [PATCH v4 24/24] iotests/308: Add multi-threading sanity test Hanna Czenczek
2026-02-26 19:54 ` [PATCH v4 00/24] export/fuse: Use coroutines and multi-threading Kevin Wolf
