[PULL 00/28] Block layer patches

public inbox for qemu-devel@nongnu.org

* [PULL 00/28] Block layer patches
@ 2026-03-10 16:25 Kevin Wolf
  2026-03-10 16:25 ` [PULL 01/28] fuse: Copy write buffer content before polling Kevin Wolf
                   ` (28 more replies)
  0 siblings, 29 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

The following changes since commit 31ee190665dd50054c39cef5ad740680aabda382:

  Merge tag 'hw-misc-20260309' of https://github.com/philmd/qemu into staging (2026-03-09 17:19:26 +0000)

are available in the Git repository at:

  https://gitlab.com/kmwolf/qemu.git tags/for-upstream

for you to fetch changes up to 7b13fc97d7235006d2ccc7a132ecb70802ba258f:

  block/curl: add support for S3 presigned URLs (2026-03-10 15:48:48 +0100)

----------------------------------------------------------------
Block layer patches

- export/fuse: Use coroutines and multi-threading
- curl: Add force-range option
- nfs: add support for libnfs v6

----------------------------------------------------------------
Antoine Damhet (2):
      qapi: block: Refactor HTTP(s) common arguments
      block/curl: add support for S3 presigned URLs

Hanna Czenczek (25):
      fuse: Copy write buffer content before polling
      fuse: Ensure init clean-up even with error_fatal
      fuse: Remove superfluous empty line
      fuse: Explicitly set inode ID to 1
      fuse: Change setup_... to mount_fuse_export()
      fuse: Destroy session on mount_fuse_export() fail
      fuse: Fix mount options
      fuse: Set direct_io and parallel_direct_writes
      fuse: Introduce fuse_{at,de}tach_handlers()
      fuse: Introduce fuse_{inc,dec}_in_flight()
      fuse: Add halted flag
      fuse: fuse_{read,write}: Rename length to blk_len
      iotests/308: Use conv=notrunc to test growability
      fuse: Explicitly handle non-grow post-EOF accesses
      block: Move qemu_fcntl_addfl() into osdep.c
      fuse: Drop permission changes in fuse_do_truncate
      fuse: Manually process requests (without libfuse)
      fuse: Reduce max read size
      fuse: Process requests in coroutines
      block/export: Add multi-threading interface
      iotests/307: Test multi-thread export interface
      fuse: Make shared export state atomic
      fuse: Implement multi-threading
      qapi/block-export: Document FUSE's multi-threading
      iotests/308: Add multi-threading sanity test

Peter Lieven (1):
      block/nfs: add support for libnfs v6

 qapi/block-core.json                          |   21 +-
 qapi/block-export.json                        |   41 +-
 docs/system/device-url-syntax.rst.inc         |    6 +
 include/block/export.h                        |   12 +-
 include/qemu/osdep.h                          |    1 +
 block/curl.c                                  |  104 +-
 block/export/export.c                         |   48 +-
 block/export/fuse.c                           | 1311 +++++++++++++++++++------
 block/export/vduse-blk.c                      |    7 +
 block/export/vhost-user-blk-server.c          |    8 +
 block/file-posix.c                            |   17 +-
 block/nfs.c                                   |   50 +-
 nbd/server.c                                  |    6 +
 util/osdep.c                                  |   18 +
 block/trace-events                            |    1 +
 meson.build                                   |    2 +-
 tests/qemu-iotests/307                        |   47 +
 tests/qemu-iotests/307.out                    |   18 +
 tests/qemu-iotests/308                        |   95 +-
 tests/qemu-iotests/308.out                    |   71 +-
 tests/qemu-iotests/tests/fuse-allow-other     |    3 +-
 tests/qemu-iotests/tests/fuse-allow-other.out |    9 +-
 22 files changed, 1506 insertions(+), 390 deletions(-)




* [PULL 01/28] fuse: Copy write buffer content before polling
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
@ 2026-03-10 16:25 ` Kevin Wolf
  2026-03-10 16:25 ` [PULL 02/28] fuse: Ensure init clean-up even with error_fatal Kevin Wolf
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

aio_poll() in I/O functions can lead to nested read_from_fuse_export()
calls, overwriting the request buffer's content.  The only function
affected by this is fuse_write(), which therefore must use a bounce
buffer or corruption may occur.

Note that in addition we do not know whether libfuse-internal structures
can cope with this nesting, and even if we did, we probably cannot rely
on it in the future.  This is the main reason why we want to remove
libfuse from the I/O path.

I do not have a good reproducer for this other than:

$ dd if=/dev/urandom of=image bs=1M count=4096
$ dd if=/dev/zero of=copy bs=1M count=4096
$ touch fuse-export
$ qemu-storage-daemon \
    --blockdev file,node-name=file,filename=copy \
    --export \
    fuse,id=exp,node-name=file,mountpoint=fuse-export,writable=true \
    &

Other shell:
$ qemu-img convert -p -n -f raw -O raw -t none image fuse-export
$ killall -SIGINT qemu-storage-daemon
$ qemu-img compare image copy
Content mismatch at offset 0!

(The -t none in qemu-img convert is important.)

I tried reproducing this with throttle and small aio_write requests from
another qemu-io instance, but for some reason all requests are perfectly
serialized then.

I think in theory we should get parallel writes only if we set
fi->parallel_direct_writes in fuse_open().  In fact, I can confirm that
if we do that, that throttle-based reproducer works (i.e. does get
parallel (nested) write requests).  I have no idea why we still get
parallel requests with qemu-img convert anyway.

Also, a later patch in this series will set fi->parallel_direct_writes
and note that it makes basically no difference when running fio on the
current libfuse-based version of our code.  It does make a difference
without libfuse.  So something quite fishy is going on.

I will try to investigate further what the root cause is, but I think
for now let's assume that calling blk_pwrite() can invalidate the buffer
contents through nested polling.
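
The hazard can be modelled outside QEMU.  What follows is only a minimal
sketch: shared_buf stands in for exp->fuse_buf, request_arrives() for a
nested read_from_fuse_export() call, and polling_write() for a
blk_pwrite() that polls before it consumes its source buffer.  All of
these names are hypothetical and none of this is QEMU code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for exp->fuse_buf: one buffer shared by all requests. */
static char shared_buf[16];

/* Hypothetical stand-in for read_from_fuse_export(): refills the shared buffer. */
static void request_arrives(const char *payload)
{
    memset(shared_buf, 0, sizeof(shared_buf));
    strncpy(shared_buf, payload, sizeof(shared_buf) - 1);
}

/*
 * Hypothetical stand-in for blk_pwrite(): it "polls" first, which lets a
 * nested request overwrite shared_buf, and only then reads from src.
 */
static void polling_write(char *dst, const char *src, size_t size)
{
    request_arrives("BBBB");
    memcpy(dst, src, size);
}

/* Unsafe: hands the shared buffer straight to the polling write. */
static void write_without_copy(char *dst, size_t size)
{
    polling_write(dst, shared_buf, size);
}

/* Safe: copies the payload to a private bounce buffer before any polling. */
static int write_with_copy(char *dst, size_t size)
{
    char *copied = malloc(size);

    if (!copied) {
        return -1;
    }
    memcpy(copied, shared_buf, size);
    polling_write(dst, copied, size);
    free(copied);
    return 0;
}
```

With "AAAA" in shared_buf, write_without_copy() ends up writing the
nested request's "BBBB", while write_with_copy() writes the original
payload.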

Cc: qemu-stable@nongnu.org
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-2-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 8cf4572f78d..cea9de61f1d 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -301,6 +301,12 @@ static void read_from_fuse_export(void *opaque)
         goto out;
     }

+    /*
+     * Note that aio_poll() in any request-processing function can lead to a
+     * nested read_from_fuse_export() call, which will overwrite the contents of
+     * exp->fuse_buf.  Anything that takes a buffer needs to take care that the
+     * content is copied before potentially polling via aio_poll().
+     */
     fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);

 out:
@@ -624,6 +630,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
                        size_t size, off_t offset, struct fuse_file_info *fi)
 {
     FuseExport *exp = fuse_req_userdata(req);
+    QEMU_AUTO_VFREE void *copied = NULL;
     int64_t length;
     int ret;

@@ -638,6 +645,14 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
         return;
     }

+    /*
+     * Heed the note on read_from_fuse_export(): If we call aio_poll() (which
+     * any blk_*() I/O function may do), read_from_fuse_export() may be nested,
+     * overwriting the request buffer content.  Therefore, we must copy it here.
+     */
+    copied = blk_blockalign(exp->common.blk, size);
+    memcpy(copied, buf, size);
+
     /**
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
@@ -660,7 +675,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
         }
     }

-    ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+    ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
     if (ret >= 0) {
         fuse_reply_write(req, size);
     } else {
-- 
2.53.0


* [PULL 02/28] fuse: Ensure init clean-up even with error_fatal
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
  2026-03-10 16:25 ` [PULL 01/28] fuse: Copy write buffer content before polling Kevin Wolf
@ 2026-03-10 16:25 ` Kevin Wolf
  2026-03-10 16:25 ` [PULL 03/28] fuse: Remove superfluous empty line Kevin Wolf
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

When exports are created on the command line (with the storage daemon),
errp is going to point to error_fatal.  Without ERRP_GUARD, we would
exit immediately when *errp is set, i.e. skip the clean-up code under
the `fail` label.  Use ERRP_GUARD so we always run that code.

As far as I know, this has no actual impact right now[1], but it is
still better to make this right.

[1] Not cleaning up the mount point is the only thing I can imagine
    would be problematic, but that is the last thing we attempt, so if
    it fails, it will clean itself up.
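
The ordering problem can be shown with a toy model.  This is a sketch
under heavy simplification, not QEMU's error API: set_error() stands in
for setting an error on a sink that may be fatal, and the trace string
records what would happen in which order.  All names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Records the order of events for inspection. */
static char trace[64];

static void log_event(const char *event)
{
    strncat(trace, event, sizeof(trace) - strlen(trace) - 1);
}

/*
 * Toy model of setting an error: with a fatal sink the process would exit
 * right here.  We record that and return true to mean "process is gone".
 */
static bool set_error(bool fatal)
{
    if (fatal) {
        log_event("exit;");
        return true;
    }
    return false;
}

/* Without a guard: the error hits the fatal sink before clean-up can run. */
static void create_unguarded(bool fatal)
{
    if (set_error(fatal)) {
        return; /* the clean-up below is never reached */
    }
    log_event("cleanup;");
}

/*
 * With a guard: the error is first kept in a local variable, clean-up runs,
 * and only when the function returns is the error passed on to the sink.
 */
static void create_guarded(bool fatal)
{
    bool local_error = true;

    log_event("cleanup;");
    if (local_error) {
        set_error(fatal);
    }
}
```

With a fatal sink, the unguarded variant records only "exit;", whereas
the guarded one records "cleanup;exit;".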

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-3-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index cea9de61f1d..2ed22c6b2f5 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -119,6 +119,7 @@ static int fuse_export_create(BlockExport *blk_exp,
                               BlockExportOptions *blk_exp_args,
                               Error **errp)
 {
+    ERRP_GUARD(); /* ensure clean-up even with error_fatal */
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
     BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
     int ret;
-- 
2.53.0




* [PULL 03/28] fuse: Remove superfluous empty line
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
  2026-03-10 16:25 ` [PULL 01/28] fuse: Copy write buffer content before polling Kevin Wolf
  2026-03-10 16:25 ` [PULL 02/28] fuse: Ensure init clean-up even with error_fatal Kevin Wolf
@ 2026-03-10 16:25 ` Kevin Wolf
  2026-03-10 16:25 ` [PULL 04/28] fuse: Explicitly set inode ID to 1 Kevin Wolf
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-4-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 2ed22c6b2f5..4cdf527d691 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -464,7 +464,6 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
     }
 
     if (add_resize_perm) {
-
         if (!qemu_in_main_thread()) {
             /* Changing permissions like below only works in the main thread */
             return -EPERM;
-- 
2.53.0




* [PULL 04/28] fuse: Explicitly set inode ID to 1
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (2 preceding siblings ...)
  2026-03-10 16:25 ` [PULL 03/28] fuse: Remove superfluous empty line Kevin Wolf
@ 2026-03-10 16:25 ` Kevin Wolf
  2026-03-10 16:25 ` [PULL 05/28] fuse: Change setup_... to mount_fuse_export() Kevin Wolf
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Setting .st_ino to the FUSE inode ID is kind of arbitrary.  While in
practice it is going to be fixed (to FUSE_ROOT_ID, which is 1) because
we only have the root inode, that is not obvious in fuse_getattr().

Just explicitly set it to 1 (i.e. no functional change).

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-5-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 4cdf527d691..a56f645c05d 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -432,7 +432,7 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
     }
 
     statbuf = (struct stat) {
-        .st_ino     = inode,
+        .st_ino     = 1,
         .st_mode    = exp->st_mode,
         .st_nlink   = 1,
         .st_uid     = exp->st_uid,
-- 
2.53.0




* [PULL 05/28] fuse: Change setup_... to mount_fuse_export()
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (3 preceding siblings ...)
  2026-03-10 16:25 ` [PULL 04/28] fuse: Explicitly set inode ID to 1 Kevin Wolf
@ 2026-03-10 16:25 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 06/28] fuse: Destroy session on mount_fuse_export() fail Kevin Wolf
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:25 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

There is no clear separation between what should go into
setup_fuse_export() and what should stay in fuse_export_create().

Make it clear that setup_fuse_export() is for mounting only.  Rename it,
and move everything that has nothing to do with mounting up into
fuse_export_create().

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-6-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 49 ++++++++++++++++++++-------------------------
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index a56f645c05d..00bb2ffee40 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -72,8 +72,7 @@ static void fuse_export_delete(BlockExport *exp);
 
 static void init_exports_table(void);
 
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
-                             bool allow_other, Error **errp);
+static int mount_fuse_export(FuseExport *exp, Error **errp);
 static void read_from_fuse_export(void *opaque);
 
 static bool is_regular_file(const char *path, Error **errp);
@@ -193,23 +192,32 @@ static int fuse_export_create(BlockExport *blk_exp,
     exp->st_gid = getgid();
 
     if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
-        /* Ignore errors on our first attempt */
-        ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
-        exp->allow_other = ret == 0;
+        /* Try allow_other == true first, ignore errors */
+        exp->allow_other = true;
+        ret = mount_fuse_export(exp, NULL);
         if (ret < 0) {
-            ret = setup_fuse_export(exp, args->mountpoint, false, errp);
+            exp->allow_other = false;
+            ret = mount_fuse_export(exp, errp);
         }
     } else {
         exp->allow_other = args->allow_other == FUSE_EXPORT_ALLOW_OTHER_ON;
-        ret = setup_fuse_export(exp, args->mountpoint, exp->allow_other, errp);
+        ret = mount_fuse_export(exp, errp);
     }
     if (ret < 0) {
         goto fail;
     }
 
+    g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
+
+    aio_set_fd_handler(exp->common.ctx,
+                       fuse_session_fd(exp->fuse_session),
+                       read_from_fuse_export, NULL, NULL, NULL, exp);
+    exp->fd_handler_set_up = true;
+
     return 0;
 
 fail:
+    fuse_export_shutdown(blk_exp);
     fuse_export_delete(blk_exp);
     return ret;
 }
@@ -227,10 +235,10 @@ static void init_exports_table(void)
 }
 
 /**
- * Create exp->fuse_session and mount it.
+ * Create exp->fuse_session and mount it.  Expects exp->mountpoint,
+ * exp->writable, and exp->allow_other to be set as intended for the mount.
  */
-static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
-                             bool allow_other, Error **errp)
+static int mount_fuse_export(FuseExport *exp, Error **errp)
 {
     const char *fuse_argv[4];
     char *mount_opts;
@@ -243,7 +251,7 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
      */
     mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
                                  FUSE_MAX_BOUNCE_BYTES,
-                                 allow_other ? ",allow_other" : "");
+                                 exp->allow_other ? ",allow_other" : "");
 
     fuse_argv[0] = ""; /* Dummy program name */
     fuse_argv[1] = "-o";
@@ -256,30 +264,17 @@ static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
     g_free(mount_opts);
     if (!exp->fuse_session) {
         error_setg(errp, "Failed to set up FUSE session");
-        ret = -EIO;
-        goto fail;
+        return -EIO;
     }
 
-    ret = fuse_session_mount(exp->fuse_session, mountpoint);
+    ret = fuse_session_mount(exp->fuse_session, exp->mountpoint);
     if (ret < 0) {
         error_setg(errp, "Failed to mount FUSE session to export");
-        ret = -EIO;
-        goto fail;
+        return -EIO;
     }
     exp->mounted = true;
 
-    g_hash_table_insert(exports, g_strdup(mountpoint), NULL);
-
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
-    exp->fd_handler_set_up = true;
-
     return 0;
-
-fail:
-    fuse_export_shutdown(&exp->common);
-    return ret;
 }
 
 /**
-- 
2.53.0




* [PULL 06/28] fuse: Destroy session on mount_fuse_export() fail
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (4 preceding siblings ...)
  2026-03-10 16:25 ` [PULL 05/28] fuse: Change setup_... to mount_fuse_export() Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 07/28] fuse: Fix mount options Kevin Wolf
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

If mount_fuse_export() fails to mount the session, destroy it.
Depending on the allow_other configuration, fuse_export_create() may
retry this function on error, which may leak one session instance
otherwise.
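
The leak on retry can be sketched with a counter of live sessions.
This is an illustrative model only: struct demo_export, session_new(),
session_destroy() and mount_export() are hypothetical stand-ins, and the
mount is simply assumed to fail whenever allow_other is set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a FUSE session object. */
struct session {
    int unused;
};

/* Number of sessions currently allocated; a leak leaves this too high. */
static int live_sessions;

static struct session *session_new(void)
{
    struct session *s = calloc(1, sizeof(*s));

    if (s) {
        live_sessions++;
    }
    return s;
}

static void session_destroy(struct session *s)
{
    free(s);
    live_sessions--;
}

/* Hypothetical stand-in for the export state. */
struct demo_export {
    struct session *session;
    bool allow_other;
};

/* Assumption for this sketch: mounting fails whenever allow_other is set. */
static int mount_export(struct demo_export *exp)
{
    exp->session = session_new();
    if (!exp->session) {
        return -1;
    }
    if (exp->allow_other) {
        /* Destroy the session on failure so that a retry cannot leak it. */
        session_destroy(exp->session);
        exp->session = NULL;
        return -1;
    }
    return 0;
}

/* Mirrors the "auto" logic: try allow_other first, then retry without it. */
static int create_export_auto(struct demo_export *exp)
{
    int ret;

    exp->allow_other = true;
    ret = mount_export(exp);
    if (ret < 0) {
        exp->allow_other = false;
        ret = mount_export(exp);
    }
    return ret;
}
```

After create_export_auto() succeeds on the second attempt, exactly one
session is alive.  Without the destroy call in the failure branch, the
first attempt's session would be overwritten and lost.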

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-7-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 00bb2ffee40..82560ca071f 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -270,11 +270,17 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     ret = fuse_session_mount(exp->fuse_session, exp->mountpoint);
     if (ret < 0) {
         error_setg(errp, "Failed to mount FUSE session to export");
-        return -EIO;
+        ret = -EIO;
+        goto fail;
     }
     exp->mounted = true;
 
     return 0;
+
+fail:
+    fuse_session_destroy(exp->fuse_session);
+    exp->fuse_session = NULL;
+    return ret;
 }
 
 /**
-- 
2.53.0




* [PULL 07/28] fuse: Fix mount options
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (5 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 06/28] fuse: Destroy session on mount_fuse_export() fail Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 08/28] fuse: Set direct_io and parallel_direct_writes Kevin Wolf
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Since I actually took a look into how mounting with libfuse works[1], I
now know that the FUSE mount options are not exactly standard mount
system call options.  Specifically:
- We should add "nosuid,nodev,noatime" because that is going to be
  translated into the respective MS_ mount flags; and those flags make
  sense for us.
- We can set rw/ro to make the mount writable or not.  It makes sense to
  set this flag to produce a better error message for read-only exports
  (EROFS instead of EACCES).
  This changes behavior as can be seen in iotest 308: It is no longer
  possible to modify metadata of read-only exports.
  Similarly, in fuse-allow-other, we must now make the export writable
  to use SETATTR.

In addition, in the comment, we can note that the FUSE mount() system
call actually expects some more parameters that we can omit because
fusermount3 (i.e. libfuse) will figure them out by itself:
- fd: /dev/fuse fd
- rootmode: Inode mode of the root node
- user_id/group_id: Mounter's UID/GID

[1] It invokes fusermount3, an SUID libfuse helper program, which parses
    and processes some mount options before actually invoking the
    mount() system call.
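
As a standalone illustration, the option string could be assembled as
below.  build_mount_opts() is a hypothetical helper, and
EXAMPLE_MAX_READ is a placeholder value chosen for this sketch, not the
constant the export code uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder for illustration only; not the value used by the export code. */
#define EXAMPLE_MAX_READ ((size_t)1048576)

/*
 * Hypothetical helper: formats the option string handed to libfuse.
 * Returns the snprintf() result, i.e. the length the full string needs.
 */
static int build_mount_opts(char *out, size_t out_len,
                            bool writable, bool allow_other)
{
    return snprintf(out, out_len,
                    "%s,nosuid,nodev,noatime,max_read=%zu,"
                    "default_permissions%s",
                    writable ? "rw" : "ro",
                    EXAMPLE_MAX_READ,
                    allow_other ? ",allow_other" : "");
}
```

For a read-only export with allow_other enabled, this yields
"ro,nosuid,nodev,noatime,max_read=1048576,default_permissions,allow_other".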

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-8-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c                           | 14 +++++++++++---
 tests/qemu-iotests/308                        |  4 ++--
 tests/qemu-iotests/308.out                    |  3 ++-
 tests/qemu-iotests/tests/fuse-allow-other     |  3 ++-
 tests/qemu-iotests/tests/fuse-allow-other.out |  9 ++++++---
 5 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 82560ca071f..0422cf4b8af 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -246,10 +246,18 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     int ret;
 
     /*
-     * max_read needs to match what fuse_init() sets.
-     * max_write need not be supplied.
+     * Note that these mount options differ from what we would pass to a direct
+     * mount() call:
+     * - nosuid, nodev, and noatime are not understood by the kernel; libfuse
+     *   uses those options to construct the mount flags (MS_*)
+     * - The FUSE kernel driver requires additional options (fd, rootmode,
+     *   user_id, group_id); these will be set by libfuse.
+     * Note that max_read is set here, while max_write is set via the FUSE INIT
+     * operation.
      */
-    mount_opts = g_strdup_printf("max_read=%zu,default_permissions%s",
+    mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
+                                 "default_permissions%s",
+                                 exp->writable ? "rw" : "ro",
                                  FUSE_MAX_BOUNCE_BYTES,
                                  exp->allow_other ? ",allow_other" : "");
 
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 6eced3aefb9..033d5cbe222 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -178,7 +178,7 @@ stat -c 'Permissions pre-chmod: %a' "$EXT_MP"
 chmod u+w "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
 stat -c 'Permissions post-+w: %a' "$EXT_MP"
 
-# But that we can set, say, +x (if we are so inclined)
+# Same for other flags, like, say +x
 chmod u+x "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
 stat -c 'Permissions post-+x: %a' "$EXT_MP"
 
@@ -236,7 +236,7 @@ output=$($QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" 2>&1 \
 
 # Expected reference output: Opening the file fails because it has no
 # write permission
-reference="Could not open 'TEST_DIR/t.IMGFMT': Permission denied"
+reference="Could not open 'TEST_DIR/t.IMGFMT': Read-only file system"
 
 if echo "$output" | grep -q "$reference"; then
     echo "Writing to read-only export failed: OK"
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index e5e233691d6..aa96faab6d0 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -53,7 +53,8 @@ Images are identical.
 Permissions pre-chmod: 400
 chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
 Permissions post-+w: 400
-Permissions post-+x: 500
+chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
+Permissions post-+x: 400
 
 === Mount over existing file ===
 {'execute': 'block-export-add',
diff --git a/tests/qemu-iotests/tests/fuse-allow-other b/tests/qemu-iotests/tests/fuse-allow-other
index 19f494aefb1..eaa39f8f236 100755
--- a/tests/qemu-iotests/tests/fuse-allow-other
+++ b/tests/qemu-iotests/tests/fuse-allow-other
@@ -101,7 +101,8 @@ run_permission_test()
 
     fuse_export_add 'export' \
         "'mountpoint': '$EXT_MP',
-         'allow-other': '$1'"
+         'allow-other': '$1',
+         'writable': true"
 
     # Should always work
     echo '(Removing all permissions)'
diff --git a/tests/qemu-iotests/tests/fuse-allow-other.out b/tests/qemu-iotests/tests/fuse-allow-other.out
index 3219fc35e05..62660b40bfc 100644
--- a/tests/qemu-iotests/tests/fuse-allow-other.out
+++ b/tests/qemu-iotests/tests/fuse-allow-other.out
@@ -12,7 +12,8 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65536
                   'id': 'export',
                   'node-name': 'node-format',
                   'mountpoint': 'TEST_DIR/fuse-export',
-         'allow-other': 'off'
+         'allow-other': 'off',
+         'writable': true
               } }
 {"return": {}}
 (Removing all permissions)
@@ -41,7 +42,8 @@ stat: cannot statx 'fuse-export': Permission denied
                   'id': 'export',
                   'node-name': 'node-format',
                   'mountpoint': 'TEST_DIR/fuse-export',
-         'allow-other': 'on'
+         'allow-other': 'on',
+         'writable': true
               } }
 {"return": {}}
 (Removing all permissions)
@@ -68,7 +70,8 @@ Permissions seen by nobody: 440
                   'id': 'export',
                   'node-name': 'node-format',
                   'mountpoint': 'TEST_DIR/fuse-export',
-         'allow-other': 'auto'
+         'allow-other': 'auto',
+         'writable': true
               } }
 {"return": {}}
 (Removing all permissions)
-- 
2.53.0




* [PULL 08/28] fuse: Set direct_io and parallel_direct_writes
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (6 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 07/28] fuse: Fix mount options Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 09/28] fuse: Introduce fuse_{at,de}tach_handlers() Kevin Wolf
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

In fuse_open(), set these flags:
- direct_io: We probably do not want the host page cache to be used for
  our exports.  QEMU block exports are supposed to represent the image
  as-is (and thus potentially changing).
  This causes a change in iotest 308's reference output.

- parallel_direct_writes: We can (now) cope with parallel writes, so we
  should set this flag.  For some reason, it doesn't seem to make an
  actual performance difference with libfuse, but it does make a
  difference without it, so let's set it.
  (See "fuse: Copy write buffer content before polling" for further
  discussion.)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-9-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c        | 2 ++
 tests/qemu-iotests/308.out | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 0422cf4b8af..d0e3c6bf61f 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -582,6 +582,8 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
 static void fuse_open(fuse_req_t req, fuse_ino_t inode,
                       struct fuse_file_info *fi)
 {
+    fi->direct_io = true;
+    fi->parallel_direct_writes = true;
     fuse_reply_open(req, fi);
 }
 
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index aa96faab6d0..2d7a38d63d2 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -131,7 +131,7 @@ wrote 65536/65536 bytes at offset 1048576
 
 --- Try growing non-growable export ---
 (OK: Lengths of export and original are the same)
-dd: error writing 'TEST_DIR/t.IMGFMT.fuse': Input/output error
+dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
 1+0 records in
 0+0 records out
 
-- 
2.53.0




* [PULL 09/28] fuse: Introduce fuse_{at,de}tach_handlers()
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (7 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 08/28] fuse: Set direct_io and parallel_direct_writes Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 10/28] fuse: Introduce fuse_{inc,dec}_in_flight() Kevin Wolf
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Pull setting up and tearing down the AIO context handlers into two
dedicated functions.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-10-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index d0e3c6bf61f..5953407f20f 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,27 +78,34 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
-static void fuse_export_drained_begin(void *opaque)
+static void fuse_attach_handlers(FuseExport *exp)
 {
-    FuseExport *exp = opaque;
+    aio_set_fd_handler(exp->common.ctx,
+                       fuse_session_fd(exp->fuse_session),
+                       read_from_fuse_export, NULL, NULL, NULL, exp);
+    exp->fd_handler_set_up = true;
+}
 
+static void fuse_detach_handlers(FuseExport *exp)
+{
     aio_set_fd_handler(exp->common.ctx,
                        fuse_session_fd(exp->fuse_session),
                        NULL, NULL, NULL, NULL, NULL);
     exp->fd_handler_set_up = false;
 }
 
+static void fuse_export_drained_begin(void *opaque)
+{
+    fuse_detach_handlers(opaque);
+}
+
 static void fuse_export_drained_end(void *opaque)
 {
     FuseExport *exp = opaque;
 
     /* Refresh AioContext in case it changed */
     exp->common.ctx = blk_get_aio_context(exp->common.blk);
-
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
-    exp->fd_handler_set_up = true;
+    fuse_attach_handlers(exp);
 }
 
 static bool fuse_export_drained_poll(void *opaque)
@@ -209,11 +216,7 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
 
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
-    exp->fd_handler_set_up = true;
-
+    fuse_attach_handlers(exp);
     return 0;
 
 fail:
@@ -335,10 +338,7 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
         fuse_session_exit(exp->fuse_session);
 
         if (exp->fd_handler_set_up) {
-            aio_set_fd_handler(exp->common.ctx,
-                               fuse_session_fd(exp->fuse_session),
-                               NULL, NULL, NULL, NULL, NULL);
-            exp->fd_handler_set_up = false;
+            fuse_detach_handlers(exp);
         }
     }
 
-- 
2.53.0




* [PULL 10/28] fuse: Introduce fuse_{inc,dec}_in_flight()
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (8 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 09/28] fuse: Introduce fuse_{at,de}tach_handlers() Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 11/28] fuse: Add halted flag Kevin Wolf
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

This is how vduse-blk.c does it, and it does seem better to have
dedicated functions for it.
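
As a rough illustration of the pattern these two helpers implement, here is a
minimal sketch using C11 atomics instead of QEMU's qatomic_*() helpers.  The
Export struct, its refcount and kicks fields, and the export_*() names are
hypothetical stand-ins for FuseExport, blk_exp_ref()/blk_exp_unref() and
aio_wait_kick(); they are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for FuseExport; only what the sketch needs. */
typedef struct Export {
    atomic_uint in_flight; /* number of requests currently being processed */
    int refcount;          /* stand-in for the blk_exp_ref()/unref() count */
    int kicks;             /* counts how often waiters would be woken */
} Export;

static void export_inc_in_flight(Export *exp)
{
    /* fetch_add returns the old value: 0 means this is the first request */
    if (atomic_fetch_add(&exp->in_flight, 1) == 0) {
        exp->refcount++;   /* keep the export alive while it is busy */
    }
}

static void export_dec_in_flight(Export *exp)
{
    /* An old value of 1 means this was the last request in flight */
    if (atomic_fetch_sub(&exp->in_flight, 1) == 1) {
        exp->kicks++;      /* stand-in for aio_wait_kick() */
        assert(exp->refcount > 0);
        exp->refcount--;   /* now the export may be deleted */
    }
}
```

Only the 0 -> 1 and 1 -> 0 transitions touch the reference count, so nested
or parallel requests do not take additional references.  (The plain int
refcount is a simplification for the sketch and is not thread-safe.)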

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-11-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 5953407f20f..fc75a5e74d9 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -78,6 +78,25 @@ static void read_from_fuse_export(void *opaque);
 static bool is_regular_file(const char *path, Error **errp);
 
 
+static void fuse_inc_in_flight(FuseExport *exp)
+{
+    if (qatomic_fetch_inc(&exp->in_flight) == 0) {
+        /* Prevent export from being deleted */
+        blk_exp_ref(&exp->common);
+    }
+}
+
+static void fuse_dec_in_flight(FuseExport *exp)
+{
+    if (qatomic_fetch_dec(&exp->in_flight) == 1) {
+        /* Wake AIO_WAIT_WHILE() */
+        aio_wait_kick();
+
+        /* Now the export can be deleted */
+        blk_exp_unref(&exp->common);
+    }
+}
+
 static void fuse_attach_handlers(FuseExport *exp)
 {
     aio_set_fd_handler(exp->common.ctx,
@@ -303,9 +322,7 @@ static void read_from_fuse_export(void *opaque)
     FuseExport *exp = opaque;
     int ret;
 
-    blk_exp_ref(&exp->common);
-
-    qatomic_inc(&exp->in_flight);
+    fuse_inc_in_flight(exp);
 
     do {
         ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
@@ -323,11 +340,7 @@ static void read_from_fuse_export(void *opaque)
     fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
 
 out:
-    if (qatomic_fetch_dec(&exp->in_flight) == 1) {
-        aio_wait_kick(); /* wake AIO_WAIT_WHILE() */
-    }
-
-    blk_exp_unref(&exp->common);
+    fuse_dec_in_flight(exp);
 }
 
 static void fuse_export_shutdown(BlockExport *blk_exp)
-- 
2.53.0




* [PULL 11/28] fuse: Add halted flag
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (9 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 10/28] fuse: Introduce fuse_{inc,dec}_in_flight() Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 12/28] fuse: fuse_{read,write}: Rename length to blk_len Kevin Wolf
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

This is a flag that we will want when processing FUSE requests
ourselves: When the kernel sends us e.g. a truncated request (i.e. we
receive less data than the request's indicated length), we cannot rely
on subsequent data to be valid.  Then, we are going to set this flag,
halting all FUSE request processing.

We plan to only use this flag in cases that would effectively be kernel
bugs.

While not necessary yet, access the flag atomically so that it will be
safe to use once we introduce multi-threading.

(Right now, the flag is unused because libfuse still does our request
processing.)
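
The intended behaviour can be sketched as follows, with C11 atomics standing
in for qatomic_read()/qatomic_set().  The Export struct and the export_*()
names are hypothetical; the handler_attached field plays the role of
fd_handler_set_up, and no real AIO handlers are involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for FuseExport; only what the sketch needs. */
typedef struct Export {
    atomic_bool halted;    /* set once, never cleared */
    bool handler_attached; /* stand-in for fd_handler_set_up */
} Export;

static void export_attach_handlers(Export *exp)
{
    if (atomic_load(&exp->halted)) {
        return; /* a halted export must never resume reading requests */
    }
    exp->handler_attached = true;
}

static void export_detach_handlers(Export *exp)
{
    exp->handler_attached = false;
}

static void export_halt(Export *exp)
{
    /* Same order as in the patch: set the flag first, then detach */
    atomic_store(&exp->halted, true);
    export_detach_handlers(exp);
    assert(!exp->handler_attached);
}
```

Once export_halt() has run, a later export_attach_handlers() call (for
example from a drained_end callback) becomes a no-op, so request processing
stays stopped.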

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-12-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index fc75a5e74d9..f6a5f4fa0a0 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -53,6 +53,13 @@ typedef struct FuseExport {
     unsigned int in_flight; /* atomic */
     bool mounted, fd_handler_set_up;
 
+    /*
+     * Set when there was an unrecoverable error and no requests should be read
+     * from the device anymore (basically only in case of something we would
+     * consider a kernel bug).  Access atomically.
+     */
+    bool halted;
+
     char *mountpoint;
     bool writable;
     bool growable;
@@ -69,6 +76,7 @@ static const struct fuse_lowlevel_ops fuse_ops;
 
 static void fuse_export_shutdown(BlockExport *exp);
 static void fuse_export_delete(BlockExport *exp);
+static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
 
 static void init_exports_table(void);
 
@@ -99,6 +107,10 @@ static void fuse_dec_in_flight(FuseExport *exp)
 
 static void fuse_attach_handlers(FuseExport *exp)
 {
+    if (qatomic_read(&exp->halted)) {
+        return;
+    }
+
     aio_set_fd_handler(exp->common.ctx,
                        fuse_session_fd(exp->fuse_session),
                        read_from_fuse_export, NULL, NULL, NULL, exp);
@@ -322,6 +334,10 @@ static void read_from_fuse_export(void *opaque)
     FuseExport *exp = opaque;
     int ret;
 
+    if (unlikely(qatomic_read(&exp->halted))) {
+        return;
+    }
+
     fuse_inc_in_flight(exp);
 
     do {
@@ -380,6 +396,20 @@ static void fuse_export_delete(BlockExport *blk_exp)
     g_free(exp->mountpoint);
 }
 
+/**
+ * Halt the export: Detach FD handlers, and set exp->halted to true, preventing
+ * fuse_attach_handlers() from re-attaching them, therefore stopping all further
+ * request processing.
+ *
+ * Call this function when an unrecoverable error happens that makes processing
+ * all future requests unreliable.
+ */
+static void fuse_export_halt(FuseExport *exp)
+{
+    qatomic_set(&exp->halted, true);
+    fuse_detach_handlers(exp);
+}
+
 /**
  * Check whether @path points to a regular file.  If not, put an
  * appropriate message into *errp.
-- 
2.53.0




* [PULL 12/28] fuse: fuse_{read,write}: Rename length to blk_len
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (10 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 11/28] fuse: Add halted flag Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 13/28] iotests/308: Use conv=notrunc to test growability Kevin Wolf
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

The term "length" is ambiguous, use "blk_len" instead to be clear.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-13-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index f6a5f4fa0a0..d45c6b814fe 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -637,7 +637,7 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
                       size_t size, off_t offset, struct fuse_file_info *fi)
 {
     FuseExport *exp = fuse_req_userdata(req);
-    int64_t length;
+    int64_t blk_len;
     void *buf;
     int ret;
 
@@ -651,14 +651,14 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
      * Clients will expect short reads at EOF, so we have to limit
      * offset+size to the image length.
      */
-    length = blk_getlength(exp->common.blk);
-    if (length < 0) {
-        fuse_reply_err(req, -length);
+    blk_len = blk_getlength(exp->common.blk);
+    if (blk_len < 0) {
+        fuse_reply_err(req, -blk_len);
         return;
     }
 
-    if (offset + size > length) {
-        size = length - offset;
+    if (offset + size > blk_len) {
+        size = blk_len - offset;
     }
 
     buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
@@ -685,7 +685,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
 {
     FuseExport *exp = fuse_req_userdata(req);
     QEMU_AUTO_VFREE void *copied = NULL;
-    int64_t length;
+    int64_t blk_len;
     int ret;
 
     /* Limited by max_write, should not happen */
@@ -711,13 +711,13 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
      */
-    length = blk_getlength(exp->common.blk);
-    if (length < 0) {
-        fuse_reply_err(req, -length);
+    blk_len = blk_getlength(exp->common.blk);
+    if (blk_len < 0) {
+        fuse_reply_err(req, -blk_len);
         return;
     }
 
-    if (offset + size > length) {
+    if (offset + size > blk_len) {
         if (exp->growable) {
             ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
@@ -725,7 +725,7 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
                 return;
             }
         } else {
-            size = length - offset;
+            size = blk_len - offset;
         }
     }
 
-- 
2.53.0




* [PULL 13/28] iotests/308: Use conv=notrunc to test growability
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (11 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 12/28] fuse: fuse_{read,write}: Rename length to blk_len Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 14/28] fuse: Explicitly handle non-grow post-EOF accesses Kevin Wolf
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Without conv=notrunc, dd will automatically truncate the output file to
the @seek value at least.  We want to test post-EOF I/O, not truncate,
so pass conv=notrunc.

(It does not make a difference in practice because we only seek to the
EOF, so the truncate effectively does nothing, but this is still
cleaner.)

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-14-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/308 | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 033d5cbe222..6ecb275555a 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -296,7 +296,8 @@ orig_disk_usage=$(disk_usage "$TEST_IMG")
 # Should fail (exports are non-growable by default)
 # (Note that qemu-io can never write beyond the EOF, so we have to use
 # dd here)
-dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$orig_len 2>&1 \
+dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$orig_len \
+    conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 echo
@@ -333,7 +334,7 @@ fuse_export_add \
     'node-protocol'
 
 # Now we should be able to write beyond the EOF
-dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$new_len 2>&1 \
+dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$new_len conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 new_len=$(get_proto_len "$EXT_MP" "$TEST_IMG")
-- 
2.53.0




* [PULL 14/28] fuse: Explicitly handle non-grow post-EOF accesses
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (12 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 13/28] iotests/308: Use conv=notrunc to test growability Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 15/28] block: Move qemu_fcntl_addfl() into osdep.c Kevin Wolf
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

When reading from / writing to non-growable exports, we cap the I/O size
to `blk_len - offset`.  This will underflow for accesses that are
completely past the disk end.

Check and handle that case explicitly.

This is also enough to ensure that `offset + size` will not overflow;
blk_len is int64_t, size is uint32_t, `offset < blk_len`, so from
`INT64_MAX + UINT32_MAX < UINT64_MAX` it follows that `offset + size`
cannot overflow.

Just one catch: We have to allow write accesses to growable exports past
the EOF, so then we cannot rely on `offset < blk_len`, but have to
verify explicitly that `offset + size` does not overflow.

The negative consequences of not having this commit are luckily limited
because blk_pread() and blk_pwrite() will reject post-EOF requests
anyway, so a `size` underflow post-EOF will just result in an I/O error.
So:
- Post-EOF reads will incorrectly result in I/O errors instead of just
  0-length reads.  We will also attempt to allocate a very large buffer,
  which is wrong and not good, but not terrible.
- Post-EOF writes on non-growable exports will result in I/O errors
  instead of 0-length writes (which generally indicate ENOSPC).
- Post-EOF writes on growable exports can theoretically overflow
  `offset + size` and truncate the export down to a much too small size,
  but in practice, FUSE will never send an offset greater than INT64_MAX,
  preventing a uint64_t overflow.  (fuse_write_args_fill() in the kernel
  uses loff_t for the offset, which is signed.)
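
The checks described above can be sketched in isolation like this.  The
helper name clamp_io_size(), its return convention, and the parameter types
(a 64-bit unsigned offset and a 32-bit size) are assumptions made for the
illustration, not code from the patch; in particular, the real write path
grows the image for growable exports, which the sketch only models by
returning the full size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the number of bytes that may be accessed
 * (possibly 0), or -1 if offset + size would wrap around.
 * growable_write is true only for writes to growable exports.
 */
static int64_t clamp_io_size(uint64_t offset, uint32_t size,
                             int64_t blk_len, bool growable_write)
{
    assert(blk_len >= 0);

    if (offset >= (uint64_t)blk_len && !growable_write) {
        return 0;  /* entirely past the EOF: 0-length access */
    }

    /*
     * With offset < blk_len <= INT64_MAX and size <= UINT32_MAX the sum
     * cannot wrap, so this can only trigger for growable writes.
     */
    if (offset + size < offset) {
        return -1;
    }

    if (offset + size > (uint64_t)blk_len && !growable_write) {
        /* Cannot underflow: offset < blk_len was established above */
        return (int64_t)((uint64_t)blk_len - offset);
    }

    return size;
}
```

For a 50-byte non-growable export, an access at offset 100 yields 0, and a
20-byte access at offset 40 is shortened to 10 bytes.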

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-15-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c        | 20 +++++++++++++++++++-
 tests/qemu-iotests/308     | 35 ++++++++++++++++++++++++++++++-----
 tests/qemu-iotests/308.out | 10 ++++++++++
 3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index d45c6b814fe..af0a8de17b1 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -657,6 +657,16 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
         return;
     }
 
+    if (offset >= blk_len) {
+        /*
+         * Technically libfuse does not allow returning a zero error code for
+         * read requests, but in practice this is a 0-length read (and a future
+         * commit will change this code anyway)
+         */
+        fuse_reply_err(req, 0);
+        return;
+    }
+
     if (offset + size > blk_len) {
         size = blk_len - offset;
     }
@@ -717,7 +727,15 @@ static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
         return;
     }
 
-    if (offset + size > blk_len) {
+    if (offset >= blk_len && !exp->growable) {
+        fuse_reply_write(req, 0);
+        return;
+    }
+
+    if (offset + size < offset) {
+        fuse_reply_err(req, EINVAL);
+        return;
+    } else if (offset + size > blk_len) {
         if (exp->growable) {
             ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index 6ecb275555a..a83c6fc01fb 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -300,16 +300,34 @@ dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$orig_len \
     conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
+# And one really squarely post-EOF write
+dd if=/dev/zero of="$EXT_MP" bs=1 count=1 seek=$((orig_len + 32 * 1024)) \
+    conv=notrunc 2>&1 \
+    | _filter_testdir | _filter_imgfmt
+
+# Half-post-EOF reads
+dd if="$EXT_MP" of=/dev/null bs=1 count=64k skip=$((orig_len - 32 * 1024)) \
+    2>&1 | _filter_testdir | _filter_imgfmt
+
+# And one really squarely post-EOF read
+dd if="$EXT_MP" of=/dev/null bs=1 count=1 skip=$((orig_len + 32 * 1024)) \
+    2>&1 | _filter_testdir | _filter_imgfmt
+
 echo
 echo '--- Resize export ---'
 
 # But we can truncate it explicitly; even with fallocate
-fallocate -o "$orig_len" -l 64k "$EXT_MP"
+# (Make sure we extend it to a length not divisible by 128k, we need that below)
+bs=$((128 * 1024))
+extend_to=$(((orig_len + bs - 1) / bs * bs + bs / 2))
+extend_by=$((extend_to - orig_len))
+
+fallocate -o "$orig_len" -l $extend_by "$EXT_MP"
 
 new_len=$(get_proto_len "$EXT_MP" "$TEST_IMG")
-if [ "$new_len" != "$((orig_len + 65536))" ]; then
+if [ "$new_len" != "$extend_to" ]; then
     echo 'ERROR: Unexpected post-truncate image size:'
-    echo "$new_len != $((orig_len + 65536))"
+    echo "$new_len != $extend_to"
 else
     echo 'OK: Post-truncate image size is as expected'
 fi
@@ -322,6 +340,13 @@ else
     echo "$orig_disk_usage => $new_disk_usage"
 fi
 
+# Use this opportunity to test a read access across the (now no longer so much
+# aligned) EOF.  dd can only do requests with a length of its block size, and
+# all of its seek/skip values are in bs units, so it is hard to do a request
+# across the EOF if the EOF is at a power of two (64M).
+dd if="$EXT_MP" of=/dev/null bs=$bs count=2 skip=$((extend_to / bs)) \
+    2>&1 | _filter_testdir | _filter_imgfmt
+
 echo
 echo '--- Try growing growable export ---'
 
@@ -338,9 +363,9 @@ dd if=/dev/zero of="$EXT_MP" bs=1 count=64k seek=$new_len conv=notrunc 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 new_len=$(get_proto_len "$EXT_MP" "$TEST_IMG")
-if [ "$new_len" != "$((orig_len + 131072))" ]; then
+if [ "$new_len" != "$((extend_to + 65536))" ]; then
     echo 'ERROR: Unexpected post-grow image size:'
-    echo "$new_len != $((orig_len + 131072))"
+    echo "$new_len != $((extend_to + 65536))"
 else
     echo 'OK: Post-grow image size is as expected'
 fi
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 2d7a38d63d2..ebeaf64b486 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -134,11 +134,21 @@ wrote 65536/65536 bytes at offset 1048576
 dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
 1+0 records in
 0+0 records out
+dd: error writing 'TEST_DIR/t.IMGFMT.fuse': No space left on device
+1+0 records in
+0+0 records out
+32768+0 records in
+32768+0 records out
+dd: TEST_DIR/t.IMGFMT.fuse: cannot skip to specified offset
+0+0 records in
+0+0 records out
 
 --- Resize export ---
 (OK: Lengths of export and original are the same)
 OK: Post-truncate image size is as expected
 OK: Disk usage grew with fallocate
+0+1 records in
+0+1 records out
 
 --- Try growing growable export ---
 {'execute': 'block-export-del',
-- 
2.53.0




* [PULL 15/28] block: Move qemu_fcntl_addfl() into osdep.c
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (13 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 14/28] fuse: Explicitly handle non-grow post-EOF accesses Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 16/28] fuse: Drop permission changes in fuse_do_truncate Kevin Wolf
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Move file-posix's helper to add a flag (or a set of flags) to an FD's
existing set of flags into osdep.c for other places to use.
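
For readers unfamiliar with the GETFL/SETFL pattern, here is a small
standalone sketch of how such a helper is typically used.  add_fd_flags() is
a local copy of the same logic under a hypothetical name, and
make_nonblocking() is an invented caller; neither name exists in QEMU.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Local stand-in with the same logic as the helper moved by this patch */
static int add_fd_flags(int fd, int flag)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1) {
        return -errno;
    }
    /* OR the new flag(s) into the existing set so nothing else is lost */
    if (fcntl(fd, F_SETFL, flags | flag) == -1) {
        return -errno;
    }
    return 0;
}

/* Hypothetical caller: switch an FD to non-blocking mode; 0 or -errno */
static int make_nonblocking(int fd)
{
    int ret = add_fd_flags(fd, O_NONBLOCK);

    if (ret == 0) {
        /* The flag must now be visible via F_GETFL */
        assert(fcntl(fd, F_GETFL) & O_NONBLOCK);
    }
    return ret;
}
```

Calling F_SETFL with only the new flag would clear any other status flags
that were set before (such as O_APPEND), which is why the current flags are
read first.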

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-16-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/qemu/osdep.h |  1 +
 block/file-posix.c   | 17 +----------------
 util/osdep.c         | 18 ++++++++++++++++++
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b384b5b506b..f151578b5ce 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -633,6 +633,7 @@ int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive);
 int qemu_unlock_fd(int fd, int64_t start, int64_t len);
 int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
 bool qemu_has_ofd_lock(void);
+int qemu_fcntl_addfl(int fd, int flag);
 #endif
 
 bool qemu_has_direct_io(void);
diff --git a/block/file-posix.c b/block/file-posix.c
index 6265d2e248b..e49b13d6abb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1056,21 +1056,6 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
     return ret;
 }
 
-/* Sets a specific flag */
-static int fcntl_setfl(int fd, int flag)
-{
-    int flags;
-
-    flags = fcntl(fd, F_GETFL);
-    if (flags == -1) {
-        return -errno;
-    }
-    if (fcntl(fd, F_SETFL, flags | flag) == -1) {
-        return -errno;
-    }
-    return 0;
-}
-
 static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
                                  int *open_flags, uint64_t perm, Error **errp)
 {
@@ -1109,7 +1094,7 @@ static int raw_reconfigure_getfd(BlockDriverState *bs, int flags,
         /* dup the original fd */
         fd = qemu_dup(s->fd);
         if (fd >= 0) {
-            ret = fcntl_setfl(fd, *open_flags);
+            ret = qemu_fcntl_addfl(fd, *open_flags);
             if (ret) {
                 qemu_close(fd);
                 fd = -1;
diff --git a/util/osdep.c b/util/osdep.c
index 770369831bc..000e7daac8b 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -280,6 +280,24 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
         return fl.l_type == F_UNLCK ? 0 : -EAGAIN;
     }
 }
+
+/**
+ * Set the given flag(s) (fcntl GETFL/SETFL) on the given FD, while retaining
+ * other flags.
+ */
+int qemu_fcntl_addfl(int fd, int flag)
+{
+    int flags;
+
+    flags = fcntl(fd, F_GETFL);
+    if (flags == -1) {
+        return -errno;
+    }
+    if (fcntl(fd, F_SETFL, flags | flag) == -1) {
+        return -errno;
+    }
+    return 0;
+}
 #endif
 
 bool qemu_has_direct_io(void)
-- 
2.53.0




* [PULL 16/28] fuse: Drop permission changes in fuse_do_truncate
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (14 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 15/28] block: Move qemu_fcntl_addfl() into osdep.c Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 17/28] fuse: Manually process requests (without libfuse) Kevin Wolf
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

This function is always called with writable == true.  This makes
add_resize_perm always false, and thus we can drop the quite ugly
permission-changing code.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-17-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 34 ++--------------------------------
 1 file changed, 2 insertions(+), 32 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index af0a8de17b1..b7a710c29f8 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -503,44 +503,14 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 static int fuse_do_truncate(const FuseExport *exp, int64_t size,
                             bool req_zero_write, PreallocMode prealloc)
 {
-    uint64_t blk_perm, blk_shared_perm;
     BdrvRequestFlags truncate_flags = 0;
-    bool add_resize_perm;
-    int ret, ret_check;
-
-    /* Growable and writable exports have a permanent RESIZE permission */
-    add_resize_perm = !exp->growable && !exp->writable;
 
     if (req_zero_write) {
         truncate_flags |= BDRV_REQ_ZERO_WRITE;
     }
 
-    if (add_resize_perm) {
-        if (!qemu_in_main_thread()) {
-            /* Changing permissions like below only works in the main thread */
-            return -EPERM;
-        }
-
-        blk_get_perm(exp->common.blk, &blk_perm, &blk_shared_perm);
-
-        ret = blk_set_perm(exp->common.blk, blk_perm | BLK_PERM_RESIZE,
-                           blk_shared_perm, NULL);
-        if (ret < 0) {
-            return ret;
-        }
-    }
-
-    ret = blk_truncate(exp->common.blk, size, true, prealloc,
-                       truncate_flags, NULL);
-
-    if (add_resize_perm) {
-        /* Must succeed, because we are only giving up the RESIZE permission */
-        ret_check = blk_set_perm(exp->common.blk, blk_perm,
-                                 blk_shared_perm, &error_abort);
-        assert(ret_check == 0);
-    }
-
-    return ret;
+    return blk_truncate(exp->common.blk, size, true, prealloc,
+                        truncate_flags, NULL);
 }
 
 /**
-- 
2.53.0




* [PULL 17/28] fuse: Manually process requests (without libfuse)
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (15 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 16/28] fuse: Drop permission changes in fuse_do_truncate Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 18/28] fuse: Reduce max read size Kevin Wolf
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Manually read requests from the /dev/fuse FD and process them, without
using libfuse.  This allows us to safely add parallel request processing
in coroutines later, without having to worry about libfuse internals.
(Technically, we already have exactly that problem with
read_from_fuse_export()/read_from_fuse_fd() nesting.)
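
To give an idea of what reading a request manually involves, here is a
simplified sketch that splits one incoming message into a fixed-size header
and a separate payload buffer with readv(), so the payload does not need to
be copied out of a bounce buffer.  The ReqHeader layout and read_request()
are hypothetical simplifications (the real header is struct fuse_in_header),
and whether the patch uses exactly this call is not visible in this excerpt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical, simplified header; not the real struct fuse_in_header */
typedef struct ReqHeader {
    uint32_t len;    /* total request length, header included */
    uint32_t opcode;
} ReqHeader;

/*
 * Read one request from fd: the header lands in *hdr, any payload goes
 * directly into payload.  Returns the payload length, or -1 if the
 * request is truncated or otherwise inconsistent.
 */
static ssize_t read_request(int fd, ReqHeader *hdr,
                            void *payload, size_t payload_cap)
{
    struct iovec iov[2] = {
        { .iov_base = hdr, .iov_len = sizeof(*hdr) },
        { .iov_base = payload, .iov_len = payload_cap },
    };
    ssize_t n;

    assert(hdr != NULL);
    n = readv(fd, iov, 2);

    /* Less data than the header announces: do not trust what follows */
    if (n < (ssize_t)sizeof(*hdr) || (size_t)n != hdr->len) {
        return -1;
    }
    return n - (ssize_t)sizeof(*hdr);
}
```

The -1 case corresponds to the truncated-request situation for which the
halted flag was introduced earlier in this series.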

We will continue to use libfuse for mounting the filesystem; fusermount3
is effectively a helper program of libfuse, so it should know best how
to interact with it.  (Doing it manually without libfuse, while doable,
is a bit of a pain, and it is not clear to me how stable the "protocol"
actually is.)

Take this opportunity of quite a major rewrite to update the Copyright
line with corrected information that has surfaced in the meantime.

Here are some benchmarks from before this patch (4k, iodepth=16, libaio;
except 'sync', which are iodepth=1 and pvsync2):

file:
  read:
    seq aio:    99.8k ±1.5k IOPS
    rand aio:   50.5k ±1.0k
    seq sync:   36.1k ±1.1k
    rand sync:  10.0k ±0.1k
  write:
    seq aio:    72.0k ±9.3k
    rand aio:   70.6k ±2.5k
    seq sync:   30.6k ±0.8k
    rand sync:  30.1k ±1.0k
null:
  read:
    seq aio:   157.9k ±4.7k
    rand aio:  158.7k ±4.8k
    seq sync:   80.2k ±2.8k
    rand sync:  77.5k ±3.8k
  write:
    seq aio:   154.3k ±3.6k
    rand aio:  154.3k ±4.2k
    seq sync:   76.1k ±5.2k
    rand sync:  72.9k ±4.0k

And with this patch applied:

file:
  read:
    seq aio:   106.8k ±1.9k (+7%)
    rand aio:   48.3k ±8.8k (-4%)
    seq sync:   35.5k ±1.4k (-2%)
    rand sync:  10.0k ±0.2k (±0%)
  write:
    seq aio:    76.3k ±6.6k (+6%)
    rand aio:   76.4k ±1.5k (+8%)
    seq sync:   31.6k ±0.6k (+3%)
    rand sync:  30.9k ±0.8k (+3%)
null:
  read:
    seq aio:   161.7k ±6.0k (+2%)
    rand aio:  165.6k ±7.1k (+4%)
    seq sync:   80.5k ±3.0k (±0%)
    rand sync:  78.5k ±3.1k (+1%)
  write:
    seq aio:   185.1k ±3.3k (+20%)
    rand aio:  186.7k ±4.8k (+21%)
    seq sync:   82.5k ±4.2k (+8%)
    rand sync:  78.7k ±3.2k (+8%)

So not much difference, aside from write AIO to a null-co export getting
a bit better.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-18-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 945 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 721 insertions(+), 224 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index b7a710c29f8..bd099d12911 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -1,7 +1,7 @@
 /*
  * Present a block device as a raw image through FUSE
  *
- * Copyright (c) 2020 Max Reitz <mreitz@redhat.com>
+ * Copyright (c) 2020, 2025 Hanna Czenczek <hreitz@redhat.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -27,12 +27,15 @@
 #include "block/qapi.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block.h"
+#include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "system/block-backend.h"
 
 #include <fuse.h>
 #include <fuse_lowlevel.h>
 
+#include "standard-headers/linux/fuse.h"
+
 #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
 #endif
@@ -42,17 +45,101 @@
 #endif
 
 /* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_WRITE_BYTES (64 * 1024)
+
+/*
+ * fuse_init_in structure before 7.36.  We don't need the flags2 field added
+ * there, so we can work with the smaller older structure to stay compatible
+ * with older kernels.
+ */
+struct fuse_init_in_compat {
+    uint32_t major;
+    uint32_t minor;
+    uint32_t max_readahead;
+    uint32_t flags;
+};
+
+typedef struct FuseRequestInHeader {
+    struct fuse_in_header common;
+    /* All supported requests */
+    union {
+        struct fuse_init_in_compat init;
+        struct fuse_open_in open;
+        struct fuse_setattr_in setattr;
+        struct fuse_read_in read;
+        struct fuse_write_in write;
+        struct fuse_fallocate_in fallocate;
+#ifdef CONFIG_FUSE_LSEEK
+        struct fuse_lseek_in lseek;
+#endif
+    };
+} FuseRequestInHeader;
+
+typedef struct FuseRequestOutHeader {
+    struct fuse_out_header common;
+    /* All supported requests */
+    union {
+        struct fuse_init_out init;
+        struct fuse_statfs_out statfs;
+        struct fuse_open_out open;
+        struct fuse_attr_out attr;
+        struct fuse_write_out write;
+#ifdef CONFIG_FUSE_LSEEK
+        struct fuse_lseek_out lseek;
+#endif
+    };
+} FuseRequestOutHeader;
+
+typedef union FuseRequestInHeaderBuf {
+    struct FuseRequestInHeader structured;
+    struct {
+        /*
+         * Part of the request header that is filled for write requests
+         * (Needed because we want the data to go into a different buffer, to
+         * avoid having to use a bounce buffer)
+         */
+        char head[sizeof(struct fuse_in_header) +
+                  sizeof(struct fuse_write_in)];
+        /*
+         * Rest of the request header for requests that have a longer header
+         * than write requests
+         */
+        char tail[sizeof(FuseRequestInHeader) -
+                  (sizeof(struct fuse_in_header) +
+                   sizeof(struct fuse_write_in))];
+    };
+} FuseRequestInHeaderBuf;
 
+QEMU_BUILD_BUG_ON(sizeof(FuseRequestInHeaderBuf) !=
+                  sizeof(FuseRequestInHeader));
+QEMU_BUILD_BUG_ON(sizeof(((FuseRequestInHeaderBuf *)0)->head) +
+                  sizeof(((FuseRequestInHeaderBuf *)0)->tail) !=
+                  sizeof(FuseRequestInHeader));
 
 typedef struct FuseExport {
     BlockExport common;
 
     struct fuse_session *fuse_session;
-    struct fuse_buf fuse_buf;
     unsigned int in_flight; /* atomic */
     bool mounted, fd_handler_set_up;
 
+    /*
+     * Cached buffer to receive the data of WRITE requests.  Cached because:
+     * To read requests, we put a FuseRequestInHeaderBuf (FRIHB) object on the
+     * stack, and a (WRITE data) buffer on the heap.  We pass FRIHB.head and the
+     * data buffer to readv().  This way, for WRITE requests, we get exactly
+     * their data in the data buffer and can avoid bounce buffering.
+     * However, for non-WRITE requests, some of the header may end up in the
+     * data buffer, so we will need to copy that back into the FRIHB object, and
+     * then we don't need the heap buffer anymore.  That is why we cache it, so
+     * we can trivially reuse it between non-WRITE requests.
+     *
+     * Note that these data buffers and thus req_write_data_cached are allocated
+     * via blk_blockalign() and thus need to be freed via qemu_vfree().
+     */
+    void *req_write_data_cached;
+
     /*
      * Set when there was an unrecoverable error and no requests should be read
      * from the device anymore (basically only in case of something we would
@@ -60,6 +147,8 @@ typedef struct FuseExport {
      */
     bool halted;
 
+    int fuse_fd;
+
     char *mountpoint;
     bool writable;
     bool growable;
@@ -71,20 +160,31 @@ typedef struct FuseExport {
     gid_t st_gid;
 } FuseExport;
 
+/*
+ * Verify that the size of FuseRequestInHeaderBuf.head plus the data
+ * buffer are big enough to be accepted by the FUSE kernel driver.
+ */
+QEMU_BUILD_BUG_ON(sizeof(((FuseRequestInHeaderBuf *)0)->head) +
+                  FUSE_MAX_WRITE_BYTES <
+                  FUSE_MIN_READ_BUFFER);
+
 static GHashTable *exports;
-static const struct fuse_lowlevel_ops fuse_ops;
 
 static void fuse_export_shutdown(BlockExport *exp);
 static void fuse_export_delete(BlockExport *exp);
-static void fuse_export_halt(FuseExport *exp) G_GNUC_UNUSED;
+static void fuse_export_halt(FuseExport *exp);
 
 static void init_exports_table(void);
 
 static int mount_fuse_export(FuseExport *exp, Error **errp);
-static void read_from_fuse_export(void *opaque);
 
 static bool is_regular_file(const char *path, Error **errp);
 
+static void read_from_fuse_fd(void *opaque);
+static void fuse_process_request(FuseExport *exp,
+                                 const FuseRequestInHeader *in_hdr,
+                                 const void *data_buffer);
+static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err);
 
 static void fuse_inc_in_flight(FuseExport *exp)
 {
@@ -105,22 +205,26 @@ static void fuse_dec_in_flight(FuseExport *exp)
     }
 }
 
+/**
+ * Attach FUSE FD read handler.
+ */
 static void fuse_attach_handlers(FuseExport *exp)
 {
     if (qatomic_read(&exp->halted)) {
         return;
     }
 
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
-                       read_from_fuse_export, NULL, NULL, NULL, exp);
+    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
+                       read_from_fuse_fd, NULL, NULL, NULL, exp);
     exp->fd_handler_set_up = true;
 }
 
+/**
+ * Detach FUSE FD read handler.
+ */
 static void fuse_detach_handlers(FuseExport *exp)
 {
-    aio_set_fd_handler(exp->common.ctx,
-                       fuse_session_fd(exp->fuse_session),
+    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
                        NULL, NULL, NULL, NULL, NULL);
     exp->fd_handler_set_up = false;
 }
@@ -247,6 +351,13 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
 
+    exp->fuse_fd = fuse_session_fd(exp->fuse_session);
+    ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
+        goto fail;
+    }
+
     fuse_attach_handlers(exp);
     return 0;
 
@@ -278,6 +389,17 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     char *mount_opts;
     struct fuse_args fuse_args;
     int ret;
+    /*
+     * We just create the session for mounting/unmounting, no need to provide
+     * any operations.  However, since libfuse commit 52a633a5d, we have to
+     * provide some op struct and cannot just pass NULL (even though the commit
+     * message ("allow passing ops as NULL") seems to imply the exact opposite,
+     * as does the comment added to fuse_session_new_fn() ("To create a no-op
+     * session just for mounting pass op as NULL.")).
+     * This is how said libfuse commit implements a no-op session internally, so
+     * do it the same way.
+     */
+    static const struct fuse_lowlevel_ops null_ops = { 0 };
 
     /*
      * Note that these mount options differ from what we would pass to a direct
@@ -292,7 +414,7 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     mount_opts = g_strdup_printf("%s,nosuid,nodev,noatime,max_read=%zu,"
                                  "default_permissions%s",
                                  exp->writable ? "rw" : "ro",
-                                 FUSE_MAX_BOUNCE_BYTES,
+                                 FUSE_MAX_READ_BYTES,
                                  exp->allow_other ? ",allow_other" : "");
 
     fuse_argv[0] = ""; /* Dummy program name */
@@ -301,8 +423,8 @@ static int mount_fuse_export(FuseExport *exp, Error **errp)
     fuse_argv[3] = NULL;
     fuse_args = (struct fuse_args)FUSE_ARGS_INIT(3, (char **)fuse_argv);
 
-    exp->fuse_session = fuse_session_new(&fuse_args, &fuse_ops,
-                                         sizeof(fuse_ops), exp);
+    exp->fuse_session = fuse_session_new(&fuse_args, &null_ops,
+                                         sizeof(null_ops), NULL);
     g_free(mount_opts);
     if (!exp->fuse_session) {
         error_setg(errp, "Failed to set up FUSE session");
@@ -326,36 +448,163 @@ fail:
 }
 
 /**
- * Callback to be invoked when the FUSE session FD can be read from.
- * (This is basically the FUSE event loop.)
+ * Allocate a buffer to receive WRITE data, or take the cached one.
  */
-static void read_from_fuse_export(void *opaque)
+static void *get_write_data_buffer(FuseExport *exp)
 {
-    FuseExport *exp = opaque;
-    int ret;
+    if (exp->req_write_data_cached) {
+        void *cached = exp->req_write_data_cached;
+        exp->req_write_data_cached = NULL;
+        return cached;
+    } else {
+        return blk_blockalign(exp->common.blk, FUSE_MAX_WRITE_BYTES);
+    }
+}
 
-    if (unlikely(qatomic_read(&exp->halted))) {
+/**
+ * Release a WRITE data buffer, possibly reusing it for a subsequent request.
+ */
+static void release_write_data_buffer(FuseExport *exp, void **buffer)
+{
+    if (!*buffer) {
         return;
     }
 
+    if (!exp->req_write_data_cached) {
+        exp->req_write_data_cached = *buffer;
+    } else {
+        qemu_vfree(*buffer);
+    }
+    *buffer = NULL;
+}
+
+/**
+ * Return the length of the specific operation's own in_header.
+ * Return -ENOSYS if the operation is not supported.
+ */
+static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
+{
+    switch (in_hdr->common.opcode) {
+    case FUSE_INIT:
+        return sizeof(in_hdr->init);
+    case FUSE_OPEN:
+        return sizeof(in_hdr->open);
+    case FUSE_SETATTR:
+        return sizeof(in_hdr->setattr);
+    case FUSE_READ:
+        return sizeof(in_hdr->read);
+    case FUSE_WRITE:
+        return sizeof(in_hdr->write);
+    case FUSE_FALLOCATE:
+        return sizeof(in_hdr->fallocate);
+#ifdef CONFIG_FUSE_LSEEK
+    case FUSE_LSEEK:
+        return sizeof(in_hdr->lseek);
+#endif
+    case FUSE_DESTROY:
+    case FUSE_STATFS:
+    case FUSE_RELEASE:
+    case FUSE_LOOKUP:
+    case FUSE_FORGET:
+    case FUSE_BATCH_FORGET:
+    case FUSE_GETATTR:
+    case FUSE_FSYNC:
+    case FUSE_FLUSH:
+        /* These requests don't have their own header or we don't care */
+        return 0;
+    default:
+        return -ENOSYS;
+    }
+}
+
+/**
+ * Try to read and process a single request from the FUSE FD.
+ */
+static void read_from_fuse_fd(void *opaque)
+{
+    FuseExport *exp = opaque;
+    int fuse_fd = exp->fuse_fd;
+    ssize_t ret;
+    FuseRequestInHeaderBuf in_hdr_buf;
+    const FuseRequestInHeader *in_hdr;
+    void *data_buffer = NULL;
+    struct iovec iov[2];
+    ssize_t op_hdr_len;
+
     fuse_inc_in_flight(exp);
 
-    do {
-        ret = fuse_session_receive_buf(exp->fuse_session, &exp->fuse_buf);
-    } while (ret == -EINTR);
-    if (ret < 0) {
-        goto out;
+    if (unlikely(qatomic_read(&exp->halted))) {
+        goto no_request;
+    }
+
+    data_buffer = get_write_data_buffer(exp);
+
+    /* Construct the I/O vector to hold the FUSE request */
+    iov[0] = (struct iovec) { &in_hdr_buf.head, sizeof(in_hdr_buf.head) };
+    iov[1] = (struct iovec) { data_buffer, FUSE_MAX_WRITE_BYTES };
+    ret = RETRY_ON_EINTR(readv(fuse_fd, iov, ARRAY_SIZE(iov)));
+    if (ret < 0 && errno == EAGAIN) {
+        /* No request available */
+        goto no_request;
+    } else if (unlikely(ret < 0)) {
+        error_report("Failed to read from FUSE device: %s", strerror(errno));
+        goto no_request;
+    }
+
+    if (unlikely(ret < sizeof(in_hdr->common))) {
+        error_report("Incomplete read from FUSE device, expected at least %zu "
+                     "bytes, read %zi bytes; cannot trust subsequent "
+                     "requests, halting the export",
+                     sizeof(in_hdr->common), ret);
+        fuse_export_halt(exp);
+        goto no_request;
+    }
+    in_hdr = &in_hdr_buf.structured;
+
+    if (unlikely(ret != in_hdr->common.len)) {
+        error_report("Number of bytes read from FUSE device does not match "
+                     "request size, expected %" PRIu32 " bytes, read %zi "
+                     "bytes; cannot trust subsequent requests, halting the "
+                     "export",
+                     in_hdr->common.len, ret);
+        fuse_export_halt(exp);
+        goto no_request;
+    }
+
+    op_hdr_len = req_op_hdr_len(in_hdr);
+    if (op_hdr_len < 0) {
+        fuse_write_err(fuse_fd, &in_hdr->common, op_hdr_len);
+        goto no_request;
+    }
+
+    if (unlikely(ret < sizeof(in_hdr->common) + op_hdr_len)) {
+        error_report("FUSE request truncated, expected %zu bytes, read %zi "
+                     "bytes",
+                     sizeof(in_hdr->common) + op_hdr_len, ret);
+        fuse_write_err(fuse_fd, &in_hdr->common, -EINVAL);
+        goto no_request;
     }
 
     /*
-     * Note that aio_poll() in any request-processing function can lead to a
-     * nested read_from_fuse_export() call, which will overwrite the contents of
-     * exp->fuse_buf.  Anything that takes a buffer needs to take care that the
-     * content is copied before potentially polling via aio_poll().
+     * Only WRITE uses the write data buffer, so for non-WRITE requests longer
+     * than .head, we need to copy any data that spilled into data_buffer into
+     * .tail.  Then we can release the write data buffer.
      */
-    fuse_session_process_buf(exp->fuse_session, &exp->fuse_buf);
+    if (in_hdr->common.opcode != FUSE_WRITE) {
+        if (ret > sizeof(in_hdr_buf.head)) {
+            size_t len;
+            /* Limit size to prevent overflow */
+            len = MIN(ret - sizeof(in_hdr_buf.head), sizeof(in_hdr_buf.tail));
+            memcpy(in_hdr_buf.tail, data_buffer, len);
+        }
+
+        release_write_data_buffer(exp, &data_buffer);
+    }
 
-out:
+    fuse_process_request(exp, in_hdr, data_buffer);
+
+no_request:
+    release_write_data_buffer(exp, &data_buffer);
     fuse_dec_in_flight(exp);
 }
 
@@ -363,18 +612,14 @@ static void fuse_export_shutdown(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
-    if (exp->fuse_session) {
-        fuse_session_exit(exp->fuse_session);
-
-        if (exp->fd_handler_set_up) {
-            fuse_detach_handlers(exp);
-        }
+    if (exp->fd_handler_set_up) {
+        fuse_detach_handlers(exp);
     }
 
     if (exp->mountpoint) {
         /*
-         * Safe to drop now, because we will not handle any requests
-         * for this export anymore anyway.
+         * Safe to drop now, because we will not handle any requests for this
+         * export anymore anyway (at least not from the main thread).
          */
         g_hash_table_remove(exports, exp->mountpoint);
     }
@@ -392,7 +637,7 @@ static void fuse_export_delete(BlockExport *blk_exp)
         fuse_session_destroy(exp->fuse_session);
     }
 
-    free(exp->fuse_buf.mem);
+    qemu_vfree(exp->req_write_data_cached);
     g_free(exp->mountpoint);
 }
 
@@ -434,46 +679,101 @@ static bool is_regular_file(const char *path, Error **errp)
 }
 
 /**
- * A chance to set change some parameters supplied to FUSE_INIT.
+ * Process FUSE INIT.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_init(void *userdata, struct fuse_conn_info *conn)
+static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
+                         const struct fuse_init_in_compat *in)
 {
+    const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
+
+    if (in->major != 7) {
+        error_report("FUSE major version mismatch: We have 7, but kernel has %"
+                     PRIu32, in->major);
+        return -EINVAL;
+    }
+
+    /* 2007's 7.9 added fuse_attr.blksize; working around that would be hard */
+    if (in->minor < 9) {
+        error_report("FUSE minor version too old: 9 required, but kernel has %"
+                     PRIu32, in->minor);
+        return -EINVAL;
+    }
+
+    *out = (struct fuse_init_out) {
+        .major = 7,
+        .minor = MIN(FUSE_KERNEL_MINOR_VERSION, in->minor),
+        .max_readahead = in->max_readahead,
+        .max_write = FUSE_MAX_WRITE_BYTES,
+        .flags = in->flags & supported_flags,
+        .flags2 = 0,
+
+        /* libfuse maximum: 2^16 - 1 */
+        .max_background = UINT16_MAX,
+
+        /* libfuse default: max_background * 3 / 4 */
+        .congestion_threshold = (int)UINT16_MAX * 3 / 4,
+
+        /* libfuse default: 1 */
+        .time_gran = 1,
+
+        /*
+         * probably unneeded without FUSE_MAX_PAGES, but this would be the
+         * libfuse default
+         */
+        .max_pages = DIV_ROUND_UP(FUSE_MAX_WRITE_BYTES,
+                                  qemu_real_host_page_size()),
+
+        /* Only needed for mappings (i.e. DAX) */
+        .map_alignment = 0,
+    };
+
     /*
-     * MIN_NON_ZERO() would not be wrong here, but what we set here
-     * must equal what has been passed to fuse_session_new().
-     * Therefore, as long as max_read must be passed as a mount option
-     * (which libfuse claims will be changed at some point), we have
-     * to set max_read to a fixed value here.
+     * Before 7.23, fuse_init_out is shorter.
+     * Drop the tail (time_gran, max_pages, map_alignment).
      */
-    conn->max_read = FUSE_MAX_BOUNCE_BYTES;
-
-    conn->max_write = MIN_NON_ZERO(BDRV_REQUEST_MAX_BYTES, conn->max_write);
+    return out->minor >= 23 ? sizeof(*out) : FUSE_COMPAT_22_INIT_OUT_SIZE;
 }
 
 /**
- * Let clients look up files.  Always return ENOENT because we only
- * care about the mountpoint itself.
+ * Return some filesystem information, just to not break e.g. `df`.
  */
-static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
+static ssize_t fuse_statfs(FuseExport *exp, struct fuse_statfs_out *out)
 {
-    fuse_reply_err(req, ENOENT);
+    BlockDriverState *root_bs;
+    uint32_t opt_transfer = 512;
+
+    root_bs = blk_bs(exp->common.blk);
+    if (root_bs) {
+        opt_transfer = root_bs->bl.opt_transfer;
+        if (!opt_transfer) {
+            opt_transfer = root_bs->bl.request_alignment;
+        }
+        opt_transfer = MAX(opt_transfer, 512);
+    }
+
+    *out = (struct fuse_statfs_out) {
+        /* These are the fields libfuse sets by default */
+        .st = {
+            .namelen = 255,
+            .bsize = opt_transfer,
+        },
+    };
+    return sizeof(*out);
 }
 
 /**
  * Let clients get file attributes (i.e., stat() the file).
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
-                         struct fuse_file_info *fi)
+static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
 {
-    struct stat statbuf;
     int64_t length, allocated_blocks;
     time_t now = time(NULL);
-    FuseExport *exp = fuse_req_userdata(req);
 
     length = blk_getlength(exp->common.blk);
     if (length < 0) {
-        fuse_reply_err(req, -length);
-        return;
+        return length;
     }
 
     allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
@@ -483,21 +783,24 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
         allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
     }
 
-    statbuf = (struct stat) {
-        .st_ino     = 1,
-        .st_mode    = exp->st_mode,
-        .st_nlink   = 1,
-        .st_uid     = exp->st_uid,
-        .st_gid     = exp->st_gid,
-        .st_size    = length,
-        .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
-        .st_blocks  = allocated_blocks,
-        .st_atime   = now,
-        .st_mtime   = now,
-        .st_ctime   = now,
+    *out = (struct fuse_attr_out) {
+        .attr_valid = 1,
+        .attr = {
+            .ino        = 1,
+            .mode       = exp->st_mode,
+            .nlink      = 1,
+            .uid        = exp->st_uid,
+            .gid        = exp->st_gid,
+            .size       = length,
+            .blksize    = blk_bs(exp->common.blk)->bl.request_alignment,
+            .blocks     = allocated_blocks,
+            .atime      = now,
+            .mtime      = now,
+            .ctime      = now,
+        },
     };
 
-    fuse_reply_attr(req, &statbuf, 1.);
+    return sizeof(*out);
 }
 
 static int fuse_do_truncate(const FuseExport *exp, int64_t size,
@@ -520,101 +823,99 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
  * permit access: Read-only exports cannot be given +w, and exports
  * without allow_other cannot be given a different UID or GID, and
  * they cannot be given non-owner access.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
-                         int to_set, struct fuse_file_info *fi)
+static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
+                            uint32_t to_set, uint64_t size, uint32_t mode,
+                            uint32_t uid, uint32_t gid)
 {
-    FuseExport *exp = fuse_req_userdata(req);
     int supported_attrs;
     int ret;
 
-    supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
+    /* SIZE and MODE are actually supported, the others can be safely ignored */
+    supported_attrs = FATTR_SIZE | FATTR_MODE |
+        FATTR_FH | FATTR_LOCKOWNER | FATTR_KILL_SUIDGID;
     if (exp->allow_other) {
-        supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
+        supported_attrs |= FATTR_UID | FATTR_GID;
     }
 
     if (to_set & ~supported_attrs) {
-        fuse_reply_err(req, ENOTSUP);
-        return;
+        return -ENOTSUP;
     }
 
     /* Do some argument checks first before committing to anything */
-    if (to_set & FUSE_SET_ATTR_MODE) {
+    if (to_set & FATTR_MODE) {
         /*
          * Without allow_other, non-owners can never access the export, so do
          * not allow setting permissions for them
          */
-        if (!exp->allow_other &&
-            (statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
-        {
-            fuse_reply_err(req, EPERM);
-            return;
+        if (!exp->allow_other && (mode & (S_IRWXG | S_IRWXO)) != 0) {
+            return -EPERM;
         }
 
         /* +w for read-only exports makes no sense, disallow it */
-        if (!exp->writable &&
-            (statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
-        {
-            fuse_reply_err(req, EROFS);
-            return;
+        if (!exp->writable && (mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0) {
+            return -EROFS;
         }
     }
 
-    if (to_set & FUSE_SET_ATTR_SIZE) {
+    if (to_set & FATTR_SIZE) {
         if (!exp->writable) {
-            fuse_reply_err(req, EACCES);
-            return;
+            return -EACCES;
         }
 
-        ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+        ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
         if (ret < 0) {
-            fuse_reply_err(req, -ret);
-            return;
+            return ret;
         }
     }
 
-    if (to_set & FUSE_SET_ATTR_MODE) {
+    if (to_set & FATTR_MODE) {
         /* Ignore FUSE-supplied file type, only change the mode */
-        exp->st_mode = (statbuf->st_mode & 07777) | S_IFREG;
+        exp->st_mode = (mode & 07777) | S_IFREG;
     }
 
-    if (to_set & FUSE_SET_ATTR_UID) {
-        exp->st_uid = statbuf->st_uid;
+    if (to_set & FATTR_UID) {
+        exp->st_uid = uid;
     }
 
-    if (to_set & FUSE_SET_ATTR_GID) {
-        exp->st_gid = statbuf->st_gid;
+    if (to_set & FATTR_GID) {
+        exp->st_gid = gid;
     }
 
-    fuse_getattr(req, inode, fi);
+    return fuse_getattr(exp, out);
 }
 
 /**
- * Let clients open a file (i.e., the exported image).
+ * Open an inode.  We only have a single inode in our exported filesystem, so we
+ * just acknowledge the request.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_open(fuse_req_t req, fuse_ino_t inode,
-                      struct fuse_file_info *fi)
+static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
 {
-    fi->direct_io = true;
-    fi->parallel_direct_writes = true;
-    fuse_reply_open(req, fi);
+    *out = (struct fuse_open_out) {
+        .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
+    };
+    return sizeof(*out);
 }
 
 /**
- * Handle client reads from the exported image.
+ * Handle client reads from the exported image.  Allocates *bufptr and reads
+ * data from the block device into that buffer.
+ * Returns the buffer (read) size on success, and -errno on error.
+ * Note: If the returned size is 0, *bufptr will be set to NULL.
+ * After use, *bufptr must be freed via qemu_vfree().
  */
-static void fuse_read(fuse_req_t req, fuse_ino_t inode,
-                      size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_read(FuseExport *exp, void **bufptr,
+                         uint64_t offset, uint32_t size)
 {
-    FuseExport *exp = fuse_req_userdata(req);
     int64_t blk_len;
     void *buf;
     int ret;
 
     /* Limited by max_read, should not happen */
-    if (size > FUSE_MAX_BOUNCE_BYTES) {
-        fuse_reply_err(req, EINVAL);
-        return;
+    if (size > FUSE_MAX_READ_BYTES) {
+        return -EINVAL;
     }
 
     /**
@@ -623,18 +924,13 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
      */
     blk_len = blk_getlength(exp->common.blk);
     if (blk_len < 0) {
-        fuse_reply_err(req, -blk_len);
-        return;
+        return blk_len;
     }
 
     if (offset >= blk_len) {
-        /*
-         * Technically libfuse does not allow returning a zero error code for
-         * read requests, but in practice this is a 0-length read (and a future
-         * commit will change this code anyway)
-         */
-        fuse_reply_err(req, 0);
-        return;
+        /* Explicitly set to NULL because we return success here */
+        *bufptr = NULL;
+        return 0;
     }
 
     if (offset + size > blk_len) {
@@ -643,108 +939,96 @@ static void fuse_read(fuse_req_t req, fuse_ino_t inode,
 
     buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
     if (!buf) {
-        fuse_reply_err(req, ENOMEM);
-        return;
+        return -ENOMEM;
     }
 
     ret = blk_pread(exp->common.blk, offset, size, buf, 0);
-    if (ret >= 0) {
-        fuse_reply_buf(req, buf, size);
-    } else {
-        fuse_reply_err(req, -ret);
+    if (ret < 0) {
+        qemu_vfree(buf);
+        return ret;
     }
 
-    qemu_vfree(buf);
+    *bufptr = buf;
+    return size;
 }
 
 /**
- * Handle client writes to the exported image.
+ * Handle client writes to the exported image.  @buf has the data to be written.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
-                       size_t size, off_t offset, struct fuse_file_info *fi)
+static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
+                          uint64_t offset, uint32_t size, const void *buf)
 {
-    FuseExport *exp = fuse_req_userdata(req);
-    QEMU_AUTO_VFREE void *copied = NULL;
     int64_t blk_len;
     int ret;
 
+    QEMU_BUILD_BUG_ON(FUSE_MAX_WRITE_BYTES > BDRV_REQUEST_MAX_BYTES);
     /* Limited by max_write, should not happen */
-    if (size > BDRV_REQUEST_MAX_BYTES) {
-        fuse_reply_err(req, EINVAL);
-        return;
+    if (size > FUSE_MAX_WRITE_BYTES) {
+        return -EINVAL;
     }
 
     if (!exp->writable) {
-        fuse_reply_err(req, EACCES);
-        return;
+        return -EACCES;
     }
 
-    /*
-     * Heed the note on read_from_fuse_export(): If we call aio_poll() (which
-     * any blk_*() I/O function may do), read_from_fuse_export() may be nested,
-     * overwriting the request buffer content.  Therefore, we must copy it here.
-     */
-    copied = blk_blockalign(exp->common.blk, size);
-    memcpy(copied, buf, size);
-
     /**
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
      */
     blk_len = blk_getlength(exp->common.blk);
     if (blk_len < 0) {
-        fuse_reply_err(req, -blk_len);
-        return;
+        return blk_len;
     }
 
     if (offset >= blk_len && !exp->growable) {
-        fuse_reply_write(req, 0);
-        return;
+        *out = (struct fuse_write_out) {
+            .size = 0,
+        };
+        return sizeof(*out);
     }
 
     if (offset + size < offset) {
-        fuse_reply_err(req, EINVAL);
-        return;
+        return -EINVAL;
     } else if (offset + size > blk_len) {
         if (exp->growable) {
             ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
-                fuse_reply_err(req, -ret);
-                return;
+                return ret;
             }
         } else {
             size = blk_len - offset;
         }
     }
 
-    ret = blk_pwrite(exp->common.blk, offset, size, copied, 0);
-    if (ret >= 0) {
-        fuse_reply_write(req, size);
-    } else {
-        fuse_reply_err(req, -ret);
+    ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+    if (ret < 0) {
+        return ret;
     }
+
+    *out = (struct fuse_write_out) {
+        .size = size,
+    };
+    return sizeof(*out);
 }
 
 /**
  * Let clients perform various fallocate() operations.
+ * Return 0 on success (no 'out' object), and -errno on error.
  */
-static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
-                           off_t offset, off_t length,
-                           struct fuse_file_info *fi)
+static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
+                              uint32_t mode)
 {
-    FuseExport *exp = fuse_req_userdata(req);
     int64_t blk_len;
     int ret;
 
     if (!exp->writable) {
-        fuse_reply_err(req, EACCES);
-        return;
+        return -EACCES;
     }
 
     blk_len = blk_getlength(exp->common.blk);
     if (blk_len < 0) {
-        fuse_reply_err(req, -blk_len);
-        return;
+        return blk_len;
     }
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -756,16 +1040,14 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
     if (!mode) {
         /* We can only fallocate at the EOF with a truncate */
         if (offset < blk_len) {
-            fuse_reply_err(req, EOPNOTSUPP);
-            return;
+            return -EOPNOTSUPP;
         }
 
         if (offset > blk_len) {
             /* No preallocation needed here */
             ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
-                fuse_reply_err(req, -ret);
-                return;
+                return ret;
             }
         }
 
@@ -775,8 +1057,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
     else if (mode & FALLOC_FL_PUNCH_HOLE) {
         if (!(mode & FALLOC_FL_KEEP_SIZE)) {
-            fuse_reply_err(req, EINVAL);
-            return;
+            return -EINVAL;
         }
 
         do {
@@ -804,8 +1085,7 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
             ret = fuse_do_truncate(exp, offset + length, false,
                                    PREALLOC_MODE_OFF);
             if (ret < 0) {
-                fuse_reply_err(req, -ret);
-                return;
+                return ret;
             }
         }
 
@@ -823,44 +1103,38 @@ static void fuse_fallocate(fuse_req_t req, fuse_ino_t inode, int mode,
         ret = -EOPNOTSUPP;
     }
 
-    fuse_reply_err(req, ret < 0 ? -ret : 0);
+    return ret < 0 ? ret : 0;
 }
 
 /**
  * Let clients fsync the exported image.
+ * Return 0 on success (no 'out' object), and -errno on error.
  */
-static void fuse_fsync(fuse_req_t req, fuse_ino_t inode, int datasync,
-                       struct fuse_file_info *fi)
+static ssize_t fuse_fsync(FuseExport *exp)
 {
-    FuseExport *exp = fuse_req_userdata(req);
-    int ret;
-
-    ret = blk_flush(exp->common.blk);
-    fuse_reply_err(req, ret < 0 ? -ret : 0);
+    return blk_flush(exp->common.blk);
 }
 
 /**
  * Called before an FD to the exported image is closed.  (libfuse
  * notes this to be a way to return last-minute errors.)
+ * Return 0 on success (no 'out' object), and -errno on error.
  */
-static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
-                        struct fuse_file_info *fi)
+static ssize_t fuse_flush(FuseExport *exp)
 {
-    fuse_fsync(req, inode, 1, fi);
+    return blk_flush(exp->common.blk);
 }
 
 #ifdef CONFIG_FUSE_LSEEK
 /**
  * Let clients inquire allocation status.
+ * Return the number of bytes written to *out on success, and -errno on error.
  */
-static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
-                       int whence, struct fuse_file_info *fi)
+static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+                          uint64_t offset, uint32_t whence)
 {
-    FuseExport *exp = fuse_req_userdata(req);
-
     if (whence != SEEK_HOLE && whence != SEEK_DATA) {
-        fuse_reply_err(req, EINVAL);
-        return;
+        return -EINVAL;
     }
 
     while (true) {
@@ -870,8 +1144,7 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
         ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
                                       offset, INT64_MAX, &pnum, NULL, NULL);
         if (ret < 0) {
-            fuse_reply_err(req, -ret);
-            return;
+            return ret;
         }
 
         if (!pnum && (ret & BDRV_BLOCK_EOF)) {
@@ -888,34 +1161,38 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
 
             blk_len = blk_getlength(exp->common.blk);
             if (blk_len < 0) {
-                fuse_reply_err(req, -blk_len);
-                return;
+                return blk_len;
             }
 
             if (offset > blk_len || whence == SEEK_DATA) {
-                fuse_reply_err(req, ENXIO);
-            } else {
-                fuse_reply_lseek(req, offset);
+                return -ENXIO;
             }
-            return;
+
+            *out = (struct fuse_lseek_out) {
+                .offset = offset,
+            };
+            return sizeof(*out);
         }
 
         if (ret & BDRV_BLOCK_DATA) {
             if (whence == SEEK_DATA) {
-                fuse_reply_lseek(req, offset);
-                return;
+                *out = (struct fuse_lseek_out) {
+                    .offset = offset,
+                };
+                return sizeof(*out);
             }
         } else {
             if (whence == SEEK_HOLE) {
-                fuse_reply_lseek(req, offset);
-                return;
+                *out = (struct fuse_lseek_out) {
+                    .offset = offset,
+                };
+                return sizeof(*out);
             }
         }
 
         /* Safety check against infinite loops */
         if (!pnum) {
-            fuse_reply_err(req, ENXIO);
-            return;
+            return -ENXIO;
         }
 
         offset += pnum;
@@ -923,21 +1200,241 @@ static void fuse_lseek(fuse_req_t req, fuse_ino_t inode, off_t offset,
 }
 #endif
 
-static const struct fuse_lowlevel_ops fuse_ops = {
-    .init       = fuse_init,
-    .lookup     = fuse_lookup,
-    .getattr    = fuse_getattr,
-    .setattr    = fuse_setattr,
-    .open       = fuse_open,
-    .read       = fuse_read,
-    .write      = fuse_write,
-    .fallocate  = fuse_fallocate,
-    .flush      = fuse_flush,
-    .fsync      = fuse_fsync,
+/**
+ * Write a FUSE response to the given @fd.
+ *
+ * Effectively, writes out_hdr->common.len bytes of the buffer that is *out_hdr.
+ *
+ * @fd: FUSE file descriptor
+ * @out_hdr: Request response header and request-specific response data
+ */
+static int fuse_write_response(int fd, FuseRequestOutHeader *out_hdr)
+{
+    size_t to_write = out_hdr->common.len;
+    ssize_t ret;
+
+    /* Must at least write fuse_out_header */
+    assert(to_write >= sizeof(out_hdr->common));
+
+    ret = RETRY_ON_EINTR(write(fd, out_hdr, to_write));
+    if (ret < 0) {
+        ret = -errno;
+        error_report("Failed to write to FUSE device: %s", strerror(-ret));
+        return ret;
+    }
+
+    /* Short writes are unexpected, treat them as errors */
+    if (ret != to_write) {
+        error_report("Short write to FUSE device, wrote %zi of %zu bytes",
+                     ret, to_write);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+/**
+ * Write a FUSE error response to @fd.
+ *
+ * @fd: FUSE file descriptor
+ * @in_hdr: Incoming request header to which to respond
+ * @err: Error code (-errno, must be negative!)
+ */
+static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err)
+{
+    FuseRequestOutHeader out_hdr = {
+        .common = {
+            .len = sizeof(out_hdr.common),
+            /* FUSE expects negative error values */
+            .error = err,
+            .unique = in_hdr->unique,
+        },
+    };
+
+    return fuse_write_response(fd, &out_hdr);
+}
+
+/**
+ * Write a FUSE response to the given @fd, using separate buffers for the
+ * response header and data.
+ *
+ * In contrast to fuse_write_response(), this function cannot return a full
+ * FuseRequestOutHeader (i.e. including request-specific response structs),
+ * but only FuseRequestOutHeader.common.  The remaining data must be in
+ * *buf.
+ *
+ * (Total length must be set in out_hdr->len.)
+ *
+ * @fd: FUSE file descriptor
+ * @out_hdr: Request response header
+ * @buf: Pointer to response data
+ */
+static int fuse_write_buf_response(int fd,
+                                   const struct fuse_out_header *out_hdr,
+                                   const void *buf)
+{
+    size_t to_write = out_hdr->len;
+    struct iovec iov[2] = {
+        { (void *)out_hdr, sizeof(*out_hdr) },
+        { (void *)buf, to_write - sizeof(*out_hdr) },
+    };
+    ssize_t ret;
+
+    /* *buf length must not be negative */
+    assert(to_write >= sizeof(*out_hdr));
+
+    ret = RETRY_ON_EINTR(writev(fd, iov, ARRAY_SIZE(iov)));
+    if (ret < 0) {
+        ret = -errno;
+        error_report("Failed to write to FUSE device: %s", strerror(-ret));
+        return ret;
+    }
+
+    /* Short writes are unexpected, treat them as errors */
+    if (ret != to_write) {
+        error_report("Short write to FUSE device, wrote %zi of %zu bytes",
+                     ret, to_write);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+/**
+ * Process a FUSE request, incl. writing the response.
+ */
+static void fuse_process_request(FuseExport *exp,
+                                 const FuseRequestInHeader *in_hdr,
+                                 const void *data_buffer)
+{
+    FuseRequestOutHeader out_hdr;
+    /* For read requests: Data to be returned */
+    void *out_data_buffer = NULL;
+    ssize_t ret;
+
+    switch (in_hdr->common.opcode) {
+    case FUSE_INIT:
+        ret = fuse_init(exp, &out_hdr.init, &in_hdr->init);
+        break;
+
+    case FUSE_DESTROY:
+        ret = 0;
+        break;
+
+    case FUSE_STATFS:
+        ret = fuse_statfs(exp, &out_hdr.statfs);
+        break;
+
+    case FUSE_OPEN:
+        ret = fuse_open(exp, &out_hdr.open);
+        break;
+
+    case FUSE_RELEASE:
+        ret = 0;
+        break;
+
+    case FUSE_LOOKUP:
+        ret = -ENOENT; /* There is no node but the root node */
+        break;
+
+    case FUSE_FORGET:
+    case FUSE_BATCH_FORGET:
+        /* These have no response, and there is nothing we need to do */
+        return;
+
+    case FUSE_GETATTR:
+        ret = fuse_getattr(exp, &out_hdr.attr);
+        break;
+
+    case FUSE_SETATTR: {
+        const struct fuse_setattr_in *in = &in_hdr->setattr;
+        ret = fuse_setattr(exp, &out_hdr.attr,
+                           in->valid, in->size, in->mode, in->uid, in->gid);
+        break;
+    }
+
+    case FUSE_READ: {
+        const struct fuse_read_in *in = &in_hdr->read;
+        ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+        break;
+    }
+
+    case FUSE_WRITE: {
+        const struct fuse_write_in *in = &in_hdr->write;
+        uint32_t req_len = in_hdr->common.len;
+
+        if (unlikely(req_len < sizeof(in_hdr->common) + sizeof(*in) +
+                               in->size)) {
+            warn_report("FUSE WRITE truncated; received %zu bytes of %" PRIu32,
+                        req_len - sizeof(in_hdr->common) - sizeof(*in),
+                        in->size);
+            ret = -EINVAL;
+            break;
+        }
+
+        /*
+         * read_from_fuse_fd() has checked that in_hdr->len matches the number
+         * of bytes read, which cannot exceed the max_write value we set
+         * (FUSE_MAX_WRITE_BYTES).  So we know that FUSE_MAX_WRITE_BYTES >=
+         * in_hdr->len >= in->size + X, so this assertion must hold.
+         */
+        assert(in->size <= FUSE_MAX_WRITE_BYTES);
+
+        ret = fuse_write(exp, &out_hdr.write,
+                         in->offset, in->size, data_buffer);
+        break;
+    }
+
+    case FUSE_FALLOCATE: {
+        const struct fuse_fallocate_in *in = &in_hdr->fallocate;
+        ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+        break;
+    }
+
+    case FUSE_FSYNC:
+        ret = fuse_fsync(exp);
+        break;
+
+    case FUSE_FLUSH:
+        ret = fuse_flush(exp);
+        break;
+
 #ifdef CONFIG_FUSE_LSEEK
-    .lseek      = fuse_lseek,
+    case FUSE_LSEEK: {
+        const struct fuse_lseek_in *in = &in_hdr->lseek;
+        ret = fuse_lseek(exp, &out_hdr.lseek, in->offset, in->whence);
+        break;
+    }
 #endif
-};
+
+    default:
+        ret = -ENOSYS;
+    }
+
+    if (ret >= 0) {
+        out_hdr.common = (struct fuse_out_header) {
+            .len = sizeof(out_hdr.common) + ret,
+            .unique = in_hdr->common.unique,
+        };
+    } else {
+        /* fuse_read() must not return a buffer in case of error */
+        assert(out_data_buffer == NULL);
+
+        out_hdr.common = (struct fuse_out_header) {
+            .len = sizeof(out_hdr.common),
+            /* FUSE expects negative errno values */
+            .error = ret,
+            .unique = in_hdr->common.unique,
+        };
+    }
+
+    if (out_data_buffer) {
+        fuse_write_buf_response(exp->fuse_fd, &out_hdr.common, out_data_buffer);
+        qemu_vfree(out_data_buffer);
+    } else {
+        fuse_write_response(exp->fuse_fd, &out_hdr);
+    }
+}
 
 const BlockExportDriver blk_exp_fuse = {
     .type               = BLOCK_EXPORT_TYPE_FUSE,
-- 
2.53.0




* [PULL 18/28] fuse: Reduce max read size
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (16 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 17/28] fuse: Manually process requests (without libfuse) Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 19/28] fuse: Process requests in coroutines Kevin Wolf
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

We are going to introduce parallel processing via coroutines, so a
maximum read size of 64 MB may be problematic: it would allow users of
the export to force us to allocate quite large amounts of memory with
just a few requests.

At least tone it down to 1 MB, which is still probably far more than
enough.  (Larger requests are split automatically by the FUSE kernel
driver anyway.)

(Yes, we inadvertently already had parallel request processing due to
nested polling before.  Better to fix this late than never.)
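
To make the memory concern concrete, here is a small sketch of the
worst-case arithmetic.  Both helper names and the request counts used
with them are illustrative assumptions and do not appear in the patch;
only the 64 MB and 1 MB limits come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the bounce buffer memory in use
 * when `in_flight` read requests each allocate up to `max_read_bytes`.
 */
static uint64_t worst_case_read_buffers(uint64_t in_flight,
                                        uint64_t max_read_bytes)
{
    return in_flight * max_read_bytes;
}

/*
 * Hypothetical helper: number of requests needed to cover `total` bytes
 * when a single request may carry at most `max_read_bytes`.
 */
static uint64_t split_count(uint64_t total, uint64_t max_read_bytes)
{
    assert(max_read_bytes > 0);
    return (total + max_read_bytes - 1) / max_read_bytes;
}
```

With, say, 16 requests in flight (an arbitrary figure), the old limit
permits up to 1 GiB of buffers, while the new one caps this at 16 MiB.
A 64 MiB read is then simply carried by 64 requests of 1 MiB each.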

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-19-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index bd099d12911..f32e74f39dd 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -45,7 +45,7 @@
 #endif
 
 /* Prevent overly long bounce buffer allocations */
-#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+#define FUSE_MAX_READ_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 1 * 1024 * 1024))
 #define FUSE_MAX_WRITE_BYTES (64 * 1024)
 
 /*
-- 
2.53.0




* [PULL 19/28] fuse: Process requests in coroutines
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (17 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 18/28] fuse: Reduce max read size Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 20/28] block/export: Add multi-threading interface Kevin Wolf
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Make fuse_process_request() a coroutine_fn (fuse_co_process_request())
and have read_from_fuse_fd() launch it inside of a newly created
coroutine instead of running it synchronously.  This way, we can process
requests in parallel.
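
As a rough, standalone illustration of that pattern (not QEMU code),
the sketch below uses POSIX ucontext in place of QEMU's coroutine API.
All names in it (Request, start_request(), request_enter(), ...) are
made up for this example, and it assumes a Linux/glibc system.  The
handler bumps an in-flight counter, enters a fresh coroutine, and
returns as soon as that coroutine yields, so a second request can be
started before the first one has finished.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

/* Illustrative stand-in for one request running in its own coroutine */
typedef struct Request {
    ucontext_t ctx;     /* the coroutine's saved context */
    ucontext_t caller;  /* where to continue on yield or completion */
    char *stack;
    int done;
} Request;

static int in_flight;     /* requests started but not yet completed */
static Request *current;  /* request being entered or resumed */

/* Suspend the running coroutine and go back to whoever entered it */
static void request_yield(void)
{
    Request *r = current;
    swapcontext(&r->ctx, &r->caller);
}

/* Coroutine body: "wait for I/O", then complete the request */
static void co_process(void)
{
    Request *r = current;

    request_yield();    /* stands in for waiting on block I/O */
    r->done = 1;
    in_flight--;        /* pairs with the increment in start_request() */
    /* Returning continues at uc_link, i.e. r->caller */
}

/* Enter (or resume) a request's coroutine until it yields or finishes */
static void request_enter(Request *r)
{
    current = r;
    swapcontext(&r->caller, &r->ctx);
}

/* Handler: create a coroutine for one request and enter it right away */
static Request *start_request(void)
{
    Request *r = calloc(1, sizeof(*r));

    assert(r != NULL);
    r->stack = malloc(STACK_SIZE);
    assert(r->stack != NULL);

    getcontext(&r->ctx);
    r->ctx.uc_stack.ss_sp = r->stack;
    r->ctx.uc_stack.ss_size = STACK_SIZE;
    r->ctx.uc_link = &r->caller;
    makecontext(&r->ctx, co_process, 0);

    in_flight++;        /* decremented by co_process() */
    request_enter(r);
    return r;
}

static void request_free(Request *r)
{
    free(r->stack);
    free(r);
}
```

Calling start_request() twice leaves two requests in flight at once;
each completes only when it is resumed via request_enter(), in any
order.  In the actual patch, the block layer takes care of resuming
the coroutine once the I/O is done.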

These are the benchmark results, compared to (a) the original results
with libfuse, and (b) the results after switching away from libfuse
(i.e. before this patch):

file:                (vs. libfuse / vs. no libfuse)
  read:
    seq aio:    97.8k ±1.5k (-2%  / -8%)
    rand aio:   95.8k ±3.4k (+90% / +98%)
    seq sync:   34.5k ±1.0k (-4%  / -3%)
    rand sync:   9.9k ±0.1k (-1%  / -1%)
  write:
    seq aio:    68.7k ±1.3k (-5%  / -10%)
    rand aio:   68.9k ±1.1k (-2%  / -10%)
    seq sync:   30.6k ±0.9k (±0%  / -3%)
    rand sync:  30.6k ±0.6k (+1%  / -1%)
null:
  read:
    seq aio:   174.5k ±6.8k (+11% / +8%)
    rand aio:  170.9k ±5.7k (+8%  / +3%)
    seq sync:   82.0k ±3.3k (+2%  / +2%)
    rand sync:  78.0k ±4.0k (+1%  / -1%)
  write:
    seq aio:   196.0k ±2.8k (+27% / +6%)
    rand aio:  191.2k ±7.9k (+24% / +2%)
    seq sync:   83.3k ±4.4k (+9%  / +1%)
    rand sync:  79.5k ±4.4k (+9%  / +1%)

So there is not much difference, especially compared to the libfuse
results, except for the random-read AIO case with an actual file, which
improves greatly.
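
For readers who want absolute figures for the earlier configurations,
the relative changes can be inverted.  The helper below is a
hypothetical illustration, not part of the patch, and because the
percentages in the table are rounded, its results are estimates only.

```c
#include <assert.h>

/*
 * Hypothetical helper: given a measured value and its relative change
 * in percent, return the baseline value it was compared against.
 */
static double implied_baseline(double value, double change_pct)
{
    assert(change_pct > -100.0);
    return value / (1.0 + change_pct / 100.0);
}
```

Applied to the file "rand aio" read row (95.8k, +90% / +98%), this
gives roughly 50k for the libfuse version and roughly 48k for the
version without libfuse but before this patch.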

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-20-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 176 ++++++++++++++++++++++++++------------------
 1 file changed, 104 insertions(+), 72 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index f32e74f39dd..7c072409d83 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -27,6 +27,7 @@
 #include "block/qapi.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block.h"
+#include "qemu/coroutine.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "system/block-backend.h"
@@ -181,9 +182,9 @@ static int mount_fuse_export(FuseExport *exp, Error **errp);
 static bool is_regular_file(const char *path, Error **errp);
 
 static void read_from_fuse_fd(void *opaque);
-static void fuse_process_request(FuseExport *exp,
-                                 const FuseRequestInHeader *in_hdr,
-                                 const void *data_buffer);
+static void coroutine_fn
+fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+                        const void *data_buffer);
 static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err);
 
 static void fuse_inc_in_flight(FuseExport *exp)
@@ -518,9 +519,14 @@ static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
 }
 
 /**
- * Try to read and process a single request from the FUSE FD.
+ * Try to read a single request from the FUSE FD.
+ * Takes a FuseExport pointer in `opaque`.
+ *
+ * Assumes the export's in-flight counter has already been incremented.
+ *
+ * If a request is available, process it.
  */
-static void read_from_fuse_fd(void *opaque)
+static void coroutine_fn co_read_from_fuse_fd(void *opaque)
 {
     FuseExport *exp = opaque;
     int fuse_fd = exp->fuse_fd;
@@ -531,8 +537,6 @@ static void read_from_fuse_fd(void *opaque)
     struct iovec iov[2];
     ssize_t op_hdr_len;
 
-    fuse_inc_in_flight(exp);
-
     if (unlikely(qatomic_read(&exp->halted))) {
         goto no_request;
     }
@@ -601,13 +605,29 @@ static void read_from_fuse_fd(void *opaque)
         release_write_data_buffer(exp, &data_buffer);
     }
 
-    fuse_process_request(exp, in_hdr, data_buffer);
+    fuse_co_process_request(exp, in_hdr, data_buffer);
 
 no_request:
     release_write_data_buffer(exp, &data_buffer);
     fuse_dec_in_flight(exp);
 }
 
+/**
+ * Try to read and process a single request from the FUSE FD.
+ * (To be used as a handler for when the FUSE FD becomes readable.)
+ * Takes a FuseExport pointer in `opaque`.
+ */
+static void read_from_fuse_fd(void *opaque)
+{
+    FuseExport *exp = opaque;
+    Coroutine *co;
+
+    co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
+    /* Decremented by co_read_from_fuse_fd() */
+    fuse_inc_in_flight(exp);
+    qemu_coroutine_enter(co);
+}
+
 static void fuse_export_shutdown(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
@@ -682,8 +702,9 @@ static bool is_regular_file(const char *path, Error **errp)
  * Process FUSE INIT.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
-                         const struct fuse_init_in_compat *in)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_init(FuseExport *exp, struct fuse_init_out *out,
+             const struct fuse_init_in_compat *in)
 {
     const uint32_t supported_flags = FUSE_ASYNC_READ | FUSE_ASYNC_DIO;
 
@@ -738,7 +759,8 @@ static ssize_t fuse_init(FuseExport *exp, struct fuse_init_out *out,
 /**
  * Return some filesystem information, just to not break e.g. `df`.
  */
-static ssize_t fuse_statfs(FuseExport *exp, struct fuse_statfs_out *out)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_statfs(FuseExport *exp, struct fuse_statfs_out *out)
 {
     BlockDriverState *root_bs;
     uint32_t opt_transfer = 512;
@@ -766,17 +788,18 @@ static ssize_t fuse_statfs(FuseExport *exp, struct fuse_statfs_out *out)
  * Let clients get file attributes (i.e., stat() the file).
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_getattr(FuseExport *exp, struct fuse_attr_out *out)
 {
     int64_t length, allocated_blocks;
     time_t now = time(NULL);
 
-    length = blk_getlength(exp->common.blk);
+    length = blk_co_getlength(exp->common.blk);
     if (length < 0) {
         return length;
     }
 
-    allocated_blocks = bdrv_get_allocated_file_size(blk_bs(exp->common.blk));
+    allocated_blocks = bdrv_co_get_allocated_file_size(blk_bs(exp->common.blk));
     if (allocated_blocks <= 0) {
         allocated_blocks = DIV_ROUND_UP(length, 512);
     } else {
@@ -803,8 +826,9 @@ static ssize_t fuse_getattr(FuseExport *exp, struct fuse_attr_out *out)
     return sizeof(*out);
 }
 
-static int fuse_do_truncate(const FuseExport *exp, int64_t size,
-                            bool req_zero_write, PreallocMode prealloc)
+static int coroutine_fn GRAPH_RDLOCK
+fuse_co_do_truncate(const FuseExport *exp, int64_t size, bool req_zero_write,
+                    PreallocMode prealloc)
 {
     BdrvRequestFlags truncate_flags = 0;
 
@@ -812,8 +836,8 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
         truncate_flags |= BDRV_REQ_ZERO_WRITE;
     }
 
-    return blk_truncate(exp->common.blk, size, true, prealloc,
-                        truncate_flags, NULL);
+    return blk_co_truncate(exp->common.blk, size, true, prealloc,
+                           truncate_flags, NULL);
 }
 
 /**
@@ -825,9 +849,9 @@ static int fuse_do_truncate(const FuseExport *exp, int64_t size,
  * they cannot be given non-owner access.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
-                            uint32_t to_set, uint64_t size, uint32_t mode,
-                            uint32_t uid, uint32_t gid)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_setattr(FuseExport *exp, struct fuse_attr_out *out, uint32_t to_set,
+                uint64_t size, uint32_t mode, uint32_t uid, uint32_t gid)
 {
     int supported_attrs;
     int ret;
@@ -864,7 +888,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
             return -EACCES;
         }
 
-        ret = fuse_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
+        ret = fuse_co_do_truncate(exp, size, true, PREALLOC_MODE_OFF);
         if (ret < 0) {
             return ret;
         }
@@ -883,7 +907,7 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
         exp->st_gid = gid;
     }
 
-    return fuse_getattr(exp, out);
+    return fuse_co_getattr(exp, out);
 }
 
 /**
@@ -891,7 +915,8 @@ static ssize_t fuse_setattr(FuseExport *exp, struct fuse_attr_out *out,
  * just acknowledge the request.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_open(FuseExport *exp, struct fuse_open_out *out)
 {
     *out = (struct fuse_open_out) {
         .open_flags = FOPEN_DIRECT_IO | FOPEN_PARALLEL_DIRECT_WRITES,
@@ -906,8 +931,8 @@ static ssize_t fuse_open(FuseExport *exp, struct fuse_open_out *out)
  * Note: If the returned size is 0, *bufptr will be set to NULL.
  * After use, *bufptr must be freed via qemu_vfree().
  */
-static ssize_t fuse_read(FuseExport *exp, void **bufptr,
-                         uint64_t offset, uint32_t size)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_read(FuseExport *exp, void **bufptr, uint64_t offset, uint32_t size)
 {
     int64_t blk_len;
     void *buf;
@@ -922,7 +947,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
      * Clients will expect short reads at EOF, so we have to limit
      * offset+size to the image length.
      */
-    blk_len = blk_getlength(exp->common.blk);
+    blk_len = blk_co_getlength(exp->common.blk);
     if (blk_len < 0) {
         return blk_len;
     }
@@ -942,7 +967,7 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
         return -ENOMEM;
     }
 
-    ret = blk_pread(exp->common.blk, offset, size, buf, 0);
+    ret = blk_co_pread(exp->common.blk, offset, size, buf, 0);
     if (ret < 0) {
         qemu_vfree(buf);
         return ret;
@@ -956,8 +981,9 @@ static ssize_t fuse_read(FuseExport *exp, void **bufptr,
  * Handle client writes to the exported image.  @buf has the data to be written.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
-                          uint64_t offset, uint32_t size, const void *buf)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_write(FuseExport *exp, struct fuse_write_out *out,
+              uint64_t offset, uint32_t size, const void *buf)
 {
     int64_t blk_len;
     int ret;
@@ -976,7 +1002,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
      * Clients will expect short writes at EOF, so we have to limit
      * offset+size to the image length.
      */
-    blk_len = blk_getlength(exp->common.blk);
+    blk_len = blk_co_getlength(exp->common.blk);
     if (blk_len < 0) {
         return blk_len;
     }
@@ -992,7 +1018,8 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
         return -EINVAL;
     } else if (offset + size > blk_len) {
         if (exp->growable) {
-            ret = fuse_do_truncate(exp, offset + size, true, PREALLOC_MODE_OFF);
+            ret = fuse_co_do_truncate(exp, offset + size, true,
+                                      PREALLOC_MODE_OFF);
             if (ret < 0) {
                 return ret;
             }
@@ -1001,7 +1028,7 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
         }
     }
 
-    ret = blk_pwrite(exp->common.blk, offset, size, buf, 0);
+    ret = blk_co_pwrite(exp->common.blk, offset, size, buf, 0);
     if (ret < 0) {
         return ret;
     }
@@ -1016,8 +1043,9 @@ static ssize_t fuse_write(FuseExport *exp, struct fuse_write_out *out,
  * Let clients perform various fallocate() operations.
  * Return 0 on success (no 'out' object), and -errno on error.
  */
-static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
-                              uint32_t mode)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_fallocate(FuseExport *exp,
+                  uint64_t offset, uint64_t length, uint32_t mode)
 {
     int64_t blk_len;
     int ret;
@@ -1026,7 +1054,7 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
         return -EACCES;
     }
 
-    blk_len = blk_getlength(exp->common.blk);
+    blk_len = blk_co_getlength(exp->common.blk);
     if (blk_len < 0) {
         return blk_len;
     }
@@ -1045,14 +1073,14 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
 
         if (offset > blk_len) {
             /* No preallocation needed here */
-            ret = fuse_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
+            ret = fuse_co_do_truncate(exp, offset, true, PREALLOC_MODE_OFF);
             if (ret < 0) {
                 return ret;
             }
         }
 
-        ret = fuse_do_truncate(exp, offset + length, true,
-                               PREALLOC_MODE_FALLOC);
+        ret = fuse_co_do_truncate(exp, offset + length, true,
+                                  PREALLOC_MODE_FALLOC);
     }
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
     else if (mode & FALLOC_FL_PUNCH_HOLE) {
@@ -1063,8 +1091,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
         do {
             int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
 
-            ret = blk_pwrite_zeroes(exp->common.blk, offset, size,
-                                    BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK);
+            ret = blk_co_pwrite_zeroes(exp->common.blk, offset, size,
+                                       BDRV_REQ_MAY_UNMAP |
+                                       BDRV_REQ_NO_FALLBACK);
             if (ret == -ENOTSUP) {
                 /*
                  * fallocate() specifies to return EOPNOTSUPP for unsupported
@@ -1082,8 +1111,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
     else if (mode & FALLOC_FL_ZERO_RANGE) {
         if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + length > blk_len) {
             /* No need for zeroes, we are going to write them ourselves */
-            ret = fuse_do_truncate(exp, offset + length, false,
-                                   PREALLOC_MODE_OFF);
+            ret = fuse_co_do_truncate(exp, offset + length, false,
+                                      PREALLOC_MODE_OFF);
             if (ret < 0) {
                 return ret;
             }
@@ -1092,8 +1121,8 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
         do {
             int size = MIN(length, BDRV_REQUEST_MAX_BYTES);
 
-            ret = blk_pwrite_zeroes(exp->common.blk,
-                                    offset, size, 0);
+            ret = blk_co_pwrite_zeroes(exp->common.blk,
+                                       offset, size, 0);
             offset += size;
             length -= size;
         } while (ret == 0 && length > 0);
@@ -1110,9 +1139,9 @@ static ssize_t fuse_fallocate(FuseExport *exp, uint64_t offset, uint64_t length,
  * Let clients fsync the exported image.
  * Return 0 on success (no 'out' object), and -errno on error.
  */
-static ssize_t fuse_fsync(FuseExport *exp)
+static ssize_t coroutine_fn GRAPH_RDLOCK fuse_co_fsync(FuseExport *exp)
 {
-    return blk_flush(exp->common.blk);
+    return blk_co_flush(exp->common.blk);
 }
 
 /**
@@ -1120,9 +1149,9 @@ static ssize_t fuse_fsync(FuseExport *exp)
  * notes this to be a way to return last-minute errors.)
  * Return 0 on success (no 'out' object), and -errno on error.
  */
-static ssize_t fuse_flush(FuseExport *exp)
+static ssize_t coroutine_fn GRAPH_RDLOCK fuse_co_flush(FuseExport *exp)
 {
-    return blk_flush(exp->common.blk);
+    return blk_co_flush(exp->common.blk);
 }
 
 #ifdef CONFIG_FUSE_LSEEK
@@ -1130,8 +1159,9 @@ static ssize_t fuse_flush(FuseExport *exp)
  * Let clients inquire allocation status.
  * Return the number of bytes written to *out on success, and -errno on error.
  */
-static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
-                          uint64_t offset, uint32_t whence)
+static ssize_t coroutine_fn GRAPH_RDLOCK
+fuse_co_lseek(FuseExport *exp, struct fuse_lseek_out *out,
+              uint64_t offset, uint32_t whence)
 {
     if (whence != SEEK_HOLE && whence != SEEK_DATA) {
         return -EINVAL;
@@ -1141,8 +1171,8 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
         int64_t pnum;
         int ret;
 
-        ret = bdrv_block_status_above(blk_bs(exp->common.blk), NULL,
-                                      offset, INT64_MAX, &pnum, NULL, NULL);
+        ret = bdrv_co_block_status_above(blk_bs(exp->common.blk), NULL,
+                                         offset, INT64_MAX, &pnum, NULL, NULL);
         if (ret < 0) {
             return ret;
         }
@@ -1159,7 +1189,7 @@ static ssize_t fuse_lseek(FuseExport *exp, struct fuse_lseek_out *out,
              * and @blk_len (the client-visible EOF).
              */
 
-            blk_len = blk_getlength(exp->common.blk);
+            blk_len = blk_co_getlength(exp->common.blk);
             if (blk_len < 0) {
                 return blk_len;
             }
@@ -1303,18 +1333,20 @@ static int fuse_write_buf_response(int fd,
 /**
  * Process a FUSE request, incl. writing the response.
  */
-static void fuse_process_request(FuseExport *exp,
-                                 const FuseRequestInHeader *in_hdr,
-                                 const void *data_buffer)
+static void coroutine_fn
+fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+                        const void *data_buffer)
 {
     FuseRequestOutHeader out_hdr;
     /* For read requests: Data to be returned */
     void *out_data_buffer = NULL;
     ssize_t ret;
 
+    GRAPH_RDLOCK_GUARD();
+
     switch (in_hdr->common.opcode) {
     case FUSE_INIT:
-        ret = fuse_init(exp, &out_hdr.init, &in_hdr->init);
+        ret = fuse_co_init(exp, &out_hdr.init, &in_hdr->init);
         break;
 
     case FUSE_DESTROY:
@@ -1322,11 +1354,11 @@ static void fuse_process_request(FuseExport *exp,
         break;
 
     case FUSE_STATFS:
-        ret = fuse_statfs(exp, &out_hdr.statfs);
+        ret = fuse_co_statfs(exp, &out_hdr.statfs);
         break;
 
     case FUSE_OPEN:
-        ret = fuse_open(exp, &out_hdr.open);
+        ret = fuse_co_open(exp, &out_hdr.open);
         break;
 
     case FUSE_RELEASE:
@@ -1343,19 +1375,19 @@ static void fuse_process_request(FuseExport *exp,
         return;
 
     case FUSE_GETATTR:
-        ret = fuse_getattr(exp, &out_hdr.attr);
+        ret = fuse_co_getattr(exp, &out_hdr.attr);
         break;
 
     case FUSE_SETATTR: {
         const struct fuse_setattr_in *in = &in_hdr->setattr;
-        ret = fuse_setattr(exp, &out_hdr.attr,
-                           in->valid, in->size, in->mode, in->uid, in->gid);
+        ret = fuse_co_setattr(exp, &out_hdr.attr,
+                              in->valid, in->size, in->mode, in->uid, in->gid);
         break;
     }
 
     case FUSE_READ: {
         const struct fuse_read_in *in = &in_hdr->read;
-        ret = fuse_read(exp, &out_data_buffer, in->offset, in->size);
+        ret = fuse_co_read(exp, &out_data_buffer, in->offset, in->size);
         break;
     }
 
@@ -1373,36 +1405,36 @@ static void fuse_process_request(FuseExport *exp,
         }
 
         /*
-         * read_from_fuse_fd() has checked that in_hdr->len matches the number
-         * of bytes read, which cannot exceed the max_write value we set
+         * co_read_from_fuse_fd() has checked that in_hdr->len matches the
+         * number of bytes read, which cannot exceed the max_write value we set
          * (FUSE_MAX_WRITE_BYTES).  So we know that FUSE_MAX_WRITE_BYTES >=
          * in_hdr->len >= in->size + X, so this assertion must hold.
          */
         assert(in->size <= FUSE_MAX_WRITE_BYTES);
 
-        ret = fuse_write(exp, &out_hdr.write,
-                         in->offset, in->size, data_buffer);
+        ret = fuse_co_write(exp, &out_hdr.write,
+                            in->offset, in->size, data_buffer);
         break;
     }
 
     case FUSE_FALLOCATE: {
         const struct fuse_fallocate_in *in = &in_hdr->fallocate;
-        ret = fuse_fallocate(exp, in->offset, in->length, in->mode);
+        ret = fuse_co_fallocate(exp, in->offset, in->length, in->mode);
         break;
     }
 
     case FUSE_FSYNC:
-        ret = fuse_fsync(exp);
+        ret = fuse_co_fsync(exp);
         break;
 
     case FUSE_FLUSH:
-        ret = fuse_flush(exp);
+        ret = fuse_co_flush(exp);
         break;
 
 #ifdef CONFIG_FUSE_LSEEK
     case FUSE_LSEEK: {
         const struct fuse_lseek_in *in = &in_hdr->lseek;
-        ret = fuse_lseek(exp, &out_hdr.lseek, in->offset, in->whence);
+        ret = fuse_co_lseek(exp, &out_hdr.lseek, in->offset, in->whence);
         break;
     }
 #endif
-- 
2.53.0




* [PULL 20/28] block/export: Add multi-threading interface
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (18 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 19/28] fuse: Process requests in coroutines Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 21/28] iotests/307: Test multi-thread export interface Kevin Wolf
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Make BlockExportType.iothread an alternate between a single-thread
variant 'str' and a multi-threading variant '[str]'.

In contrast to the single-thread setting, the multi-threading setting
will not change the BDS's context (and so is incompatible with the
fixed-iothread setting), but instead just pass a list to the export
driver, with which it can do whatever it wants.

Currently no export driver supports multi-threading, so they all return
an error when receiving such a list.
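
To illustrate the distinction between the two variants, here is a minimal
C sketch that models the alternate as a tagged union.  All names in it
(IothreadSpec, spec_is_multithread, spec_allows_fixed_iothread) are
hypothetical and not part of QEMU; the sketch only mirrors the rules stated
above: a list is always treated as multi-threading, even if it holds a
single entry, and fixed-iothread cannot be combined with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the 'str' / '[str]' alternate; not QEMU's types. */
typedef enum {
    SPEC_SINGLE, /* a plain string: one I/O thread, the node is moved there */
    SPEC_MULTI,  /* a list: multi-threading, the node stays where it is */
} IothreadSpecKind;

typedef struct {
    IothreadSpecKind kind;
    union {
        const char *single;
        struct {
            const char *const *names;
            size_t count;
        } multi;
    } u;
} IothreadSpec;

/* A NULL spec means the option was omitted entirely. */
static bool spec_is_multithread(const IothreadSpec *spec)
{
    return spec != NULL && spec->kind == SPEC_MULTI;
}

/* fixed-iothread only makes sense when the node is moved to one thread. */
static bool spec_allows_fixed_iothread(const IothreadSpec *spec)
{
    return !spec_is_multithread(spec);
}
```

Note that the decision depends only on which variant was used, never on the
number of entries in the list.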

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-21-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qapi/block-export.json               | 36 ++++++++++++++++++---
 include/block/export.h               | 12 +++++--
 block/export/export.c                | 48 +++++++++++++++++++++++++---
 block/export/fuse.c                  |  7 ++++
 block/export/vduse-blk.c             |  7 ++++
 block/export/vhost-user-blk-server.c |  8 +++++
 nbd/server.c                         |  6 ++++
 7 files changed, 113 insertions(+), 11 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 076954ef1ac..160cd2e3ca0 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -363,14 +363,16 @@
 #     to the export before completion is signalled.  (since: 5.2;
 #     default: false)
 #
-# @iothread: The name of the iothread object where the export will
-#     run.  The default is to use the thread currently associated with
-#     the block node.  (since: 5.2)
+# @iothread: The name(s) of one or more iothread object(s) where the
+#     export will run.  The default is to use the thread currently
+#     associated with the block node.  (since: 5.2; multi-threading
+#     since 10.1)
 #
 # @fixed-iothread: True prevents the block node from being moved to
 #     another thread while the export is active.  If true and
 #     @iothread is given, export creation fails if the block node
-#     cannot be moved to the iothread.  The default is false.
+#     cannot be moved to the iothread.  Must not be true when giving
+#     multiple iothreads for @iothread.  The default is false.
 #     (since: 5.2)
 #
 # @allow-inactive: If true, the export allows the exported node to be
@@ -387,7 +389,7 @@
   'base': { 'type': 'BlockExportType',
             'id': 'str',
             '*fixed-iothread': 'bool',
-            '*iothread': 'str',
+            '*iothread': 'BlockExportIothreads',
             'node-name': 'str',
             '*writable': 'bool',
             '*writethrough': 'bool',
@@ -403,6 +405,30 @@
                      'if': 'CONFIG_VDUSE_BLK_EXPORT' }
    } }
 
+##
+# @BlockExportIothreads:
+#
+# Specify a single or multiple I/O threads in which to run a block
+# export's I/O.
+#
+# @single: Run the export's I/O in the given single I/O thread.
+#
+# @multi: Use multi-threading across the given set of I/O threads,
+#     which must not be empty.  Note: Passing a single I/O thread via
+#     this variant is still treated as multi-threading, which is
+#     different from using the @single variant.  In particular, even
+#     if there only is a single I/O thread in the set, export types
+#     that do not support multi-threading will generally reject this
+#     variant, and BlockExportOptions.fixed-iothread is always
+#     incompatible with it.
+#
+# Since: 10.1
+##
+{ 'alternate': 'BlockExportIothreads',
+  'data': {
+      'single': 'str',
+      'multi': ['str'] } }
+
 ##
 # @block-export-add:
 #
diff --git a/include/block/export.h b/include/block/export.h
index 4bd9531d4d9..ca45da928c7 100644
--- a/include/block/export.h
+++ b/include/block/export.h
@@ -32,8 +32,16 @@ typedef struct BlockExportDriver {
     /* True if the export type supports running on an inactive node */
     bool supports_inactive;
 
-    /* Creates and starts a new block export */
-    int (*create)(BlockExport *, BlockExportOptions *, Error **);
+    /*
+     * Creates and starts a new block export.
+     *
+     * If the user passed a set of I/O threads for multi-threading, @multithread
+     * is a list of the @multithread_count corresponding contexts (freed by the
+     * caller).  Note that @exp->ctx has no relation to that list.
+     */
+    int (*create)(BlockExport *exp, BlockExportOptions *opts,
+                  AioContext *const *multithread, size_t multithread_count,
+                  Error **errp);
 
     /*
      * Frees a removed block export. This function is only called after all
diff --git a/block/export/export.c b/block/export/export.c
index f3bbf11070d..b733f269f3e 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -76,16 +76,26 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 {
     bool fixed_iothread = export->has_fixed_iothread && export->fixed_iothread;
     bool allow_inactive = export->has_allow_inactive && export->allow_inactive;
+    bool multithread = export->iothread &&
+        export->iothread->type == QTYPE_QLIST;
     const BlockExportDriver *drv;
     BlockExport *exp = NULL;
     BlockDriverState *bs;
     BlockBackend *blk = NULL;
     AioContext *ctx;
+    AioContext **multithread_ctxs = NULL;
+    size_t multithread_count = 0;
     uint64_t perm;
     int ret;
 
     GLOBAL_STATE_CODE();
 
+    if (fixed_iothread && multithread) {
+        error_setg(errp,
+                   "Cannot use fixed-iothread for a multi-threaded export");
+        return NULL;
+    }
+
     if (!id_wellformed(export->id)) {
         error_setg(errp, "Invalid block export id");
         return NULL;
@@ -116,14 +126,16 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 
     ctx = bdrv_get_aio_context(bs);
 
-    if (export->iothread) {
+    /* Move the BDS to the target I/O thread, if it is a single one */
+    if (export->iothread && !multithread) {
+        const char *iothread_id = export->iothread->u.single;
         IOThread *iothread;
         AioContext *new_ctx;
         Error **set_context_errp;
 
-        iothread = iothread_by_id(export->iothread);
+        iothread = iothread_by_id(iothread_id);
         if (!iothread) {
-            error_setg(errp, "iothread \"%s\" not found", export->iothread);
+            error_setg(errp, "iothread \"%s\" not found", iothread_id);
             goto fail;
         }
 
@@ -137,6 +149,32 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
         } else if (fixed_iothread) {
             goto fail;
         }
+    } else if (multithread) {
+        strList *iothread_list = export->iothread->u.multi;
+        size_t i;
+
+        multithread_count = 0;
+        for (strList *e = iothread_list; e; e = e->next) {
+            multithread_count++;
+        }
+
+        if (multithread_count == 0) {
+            error_setg(errp, "The set of I/O threads must not be empty");
+            return NULL;
+        }
+
+        multithread_ctxs = g_new(AioContext *, multithread_count);
+        i = 0;
+        for (strList *e = iothread_list; e; e = e->next) {
+            IOThread *iothread = iothread_by_id(e->value);
+
+            if (!iothread) {
+                error_setg(errp, "iothread \"%s\" not found", e->value);
+                goto fail;
+            }
+            multithread_ctxs[i++] = iothread_get_aio_context(iothread);
+        }
+        assert(i == multithread_count);
     }
 
     bdrv_graph_rdlock_main_loop();
@@ -195,7 +233,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
         .blk        = blk,
     };
 
-    ret = drv->create(exp, export, errp);
+    ret = drv->create(exp, export, multithread_ctxs, multithread_count, errp);
     if (ret < 0) {
         goto fail;
     }
@@ -203,6 +241,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
     assert(exp->blk != NULL);
 
     QLIST_INSERT_HEAD(&block_exports, exp, next);
+    g_free(multithread_ctxs);
     return exp;
 
 fail:
@@ -214,6 +253,7 @@ fail:
         g_free(exp->id);
         g_free(exp);
     }
+    g_free(multithread_ctxs);
     return NULL;
 }
 
diff --git a/block/export/fuse.c b/block/export/fuse.c
index 7c072409d83..1c794118393 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -259,6 +259,8 @@ static const BlockDevOps fuse_export_blk_dev_ops = {
 
 static int fuse_export_create(BlockExport *blk_exp,
                               BlockExportOptions *blk_exp_args,
+                              AioContext *const *multithread,
+                              size_t mt_count,
                               Error **errp)
 {
     ERRP_GUARD(); /* ensure clean-up even with error_fatal */
@@ -268,6 +270,11 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
 
+    if (multithread) {
+        error_setg(errp, "FUSE export does not support multi-threading");
+        return -EINVAL;
+    }
+
     /* For growable and writable exports, take the RESIZE permission */
     if (args->growable || blk_exp_args->writable) {
         uint64_t blk_perm, blk_shared_perm;
diff --git a/block/export/vduse-blk.c b/block/export/vduse-blk.c
index 8af13b7f0bf..10dc673c566 100644
--- a/block/export/vduse-blk.c
+++ b/block/export/vduse-blk.c
@@ -267,6 +267,7 @@ static const BlockDevOps vduse_block_ops = {
 };
 
 static int vduse_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+                                AioContext *const *multithread, size_t mt_count,
                                 Error **errp)
 {
     VduseBlkExport *vblk_exp = container_of(exp, VduseBlkExport, export);
@@ -302,6 +303,12 @@ static int vduse_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
             return -EINVAL;
         }
     }
+
+    if (multithread) {
+        error_setg(errp, "vduse-blk export does not support multi-threading");
+        return -EINVAL;
+    }
+
     vblk_exp->num_queues = num_queues;
     vblk_exp->handler.blk = exp->blk;
     vblk_exp->handler.serial = g_strdup(vblk_opts->serial ?: "");
diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index a4d54e824f2..e89422bb85a 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -316,6 +316,7 @@ static const BlockDevOps vu_blk_dev_ops = {
 };
 
 static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
+                             AioContext *const *multithread, size_t mt_count,
                              Error **errp)
 {
     VuBlkExport *vexp = container_of(exp, VuBlkExport, export);
@@ -341,6 +342,13 @@ static int vu_blk_exp_create(BlockExport *exp, BlockExportOptions *opts,
         error_setg(errp, "num-queues must be greater than 0");
         return -EINVAL;
     }
+
+    if (multithread) {
+        error_setg(errp,
+                   "vhost-user-blk export does not support multi-threading");
+        return -EINVAL;
+    }
+
     vexp->handler.blk = exp->blk;
     vexp->handler.serial = g_strdup("vhost_user_blk");
     vexp->handler.logical_block_size = logical_block_size;
diff --git a/nbd/server.c b/nbd/server.c
index acec0487a8b..620097c58ca 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1795,6 +1795,7 @@ static const BlockDevOps nbd_block_ops = {
 };
 
 static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
+                             AioContext *const *multithread, size_t mt_count,
                              Error **errp)
 {
     NBDExport *exp = container_of(blk_exp, NBDExport, common);
@@ -1831,6 +1832,11 @@ static int nbd_export_create(BlockExport *blk_exp, BlockExportOptions *exp_args,
         return -EEXIST;
     }
 
+    if (multithread) {
+        error_setg(errp, "NBD export does not support multi-threading");
+        return -EINVAL;
+    }
+
     size = blk_getlength(blk);
     if (size < 0) {
         error_setg_errno(errp, -size,
-- 
2.53.0




* [PULL 21/28] iotests/307: Test multi-thread export interface
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (19 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 20/28] block/export: Add multi-threading interface Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 22/28] fuse: Make shared export state atomic Kevin Wolf
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Test the QAPI interface for multi-threaded exports.  None of our exports
currently support multi-threading, so it's always an error in the end,
but we can still test the specific errors.
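
The order in which these errors are reported can be sketched in C as
follows.  This is a simplified model, not the QEMU code: the function and
enum names are hypothetical, and the set of existing I/O threads
(iothread0, iothread1) is an assumption taken from the names this test
uses.  The order of the checks follows the preceding patch: the
fixed-iothread conflict is detected first, then the empty list, then
unknown thread names, and only then does the export driver get to reject
multi-threading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    CHECK_OK,
    CHECK_FIXED_IOTHREAD_CONFLICT,
    CHECK_EMPTY_LIST,
    CHECK_THREAD_NOT_FOUND,
    CHECK_DRIVER_UNSUPPORTED,
} CheckResult;

/* Assumed set of existing I/O threads, matching the names used in the test */
static const char *const existing[] = { "iothread0", "iothread1" };

static bool thread_exists(const char *name)
{
    for (size_t i = 0; i < sizeof(existing) / sizeof(existing[0]); i++) {
        if (strcmp(existing[i], name) == 0) {
            return true;
        }
    }
    return false;
}

/* Hypothetical model of the checks applied to a multi-thread request */
static CheckResult check_multithread_request(const char *const *names,
                                             size_t count,
                                             bool fixed_iothread,
                                             bool driver_supports_mt)
{
    if (fixed_iothread) {
        return CHECK_FIXED_IOTHREAD_CONFLICT;
    }
    if (count == 0) {
        return CHECK_EMPTY_LIST;
    }
    for (size_t i = 0; i < count; i++) {
        if (!thread_exists(names[i])) {
            return CHECK_THREAD_NOT_FOUND;
        }
    }
    /* Only a fully resolved list is handed to the export driver */
    if (!driver_supports_mt) {
        return CHECK_DRIVER_UNSUPPORTED;
    }
    return CHECK_OK;
}
```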

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-22-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/307     | 47 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/307.out | 18 +++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/tests/qemu-iotests/307 b/tests/qemu-iotests/307
index b429b5aa50a..f6ee3ebec08 100755
--- a/tests/qemu-iotests/307
+++ b/tests/qemu-iotests/307
@@ -142,5 +142,52 @@ with iotests.FilePath('image') as img, \
     vm.qmp_log('query-block-exports')
     iotests.qemu_nbd_list_log('-k', socket)
 
+    iotests.log('\n=== Using multi-thread with NBD ===')
+
+    # Actual multi-threading; (currently) not supported by NBD
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0', 'iothread1'])
+
+    # Should be treated the same way as actual multi-threading, even if there's
+    # only a single thread
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0'])
+
+    iotests.log('\n=== Empty thread list')
+
+    # Simply not allowed
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=[])
+
+    iotests.log('\n=== Non-existent thread name in list')
+
+    # Expect an error, even if NBD does not support multi-threading, because the
+    # list is parsed before being passed to NBD
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0', 'nothread', 'iothread1'])
+
+    iotests.log('\n=== Multi-thread with fixed-iothread')
+
+    # With multi-threading, there is no single context to give the BDS, so it is
+    # just left where it is.  fixed-iothread does not make sense then.
+    vm.qmp_log('block-export-add',
+               id='export0',
+               type='nbd',
+               node_name='fmt',
+               iothread=['iothread0', 'iothread1'],
+               fixed_iothread=True)
+
     iotests.log('\n=== Shut down QEMU ===')
     vm.shutdown()
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index f645f3315f8..a9b37d3ac11 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -134,4 +134,22 @@ read failed: Input/output error
 exports available: 0
 
 
+=== Using multi-thread with NBD ===
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "NBD export does not support multi-threading"}}
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "NBD export does not support multi-threading"}}
+
+=== Empty thread list
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": [], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "The set of I/O threads must not be empty"}}
+
+=== Non-existent thread name in list
+{"execute": "block-export-add", "arguments": {"id": "export0", "iothread": ["iothread0", "nothread", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "iothread \"nothread\" not found"}}
+
+=== Multi-thread with fixed-iothread
+{"execute": "block-export-add", "arguments": {"fixed-iothread": true, "id": "export0", "iothread": ["iothread0", "iothread1"], "node-name": "fmt", "type": "nbd"}}
+{"error": {"class": "GenericError", "desc": "Cannot use fixed-iothread for a multi-threaded export"}}
+
 === Shut down QEMU ===
-- 
2.53.0




* [PULL 22/28] fuse: Make shared export state atomic
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (20 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 21/28] iotests/307: Test multi-thread export interface Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 23/28] fuse: Implement multi-threading Kevin Wolf
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

The next commit is going to allow multi-threaded access to a FUSE
export.  In order to allow safe concurrent SETATTR operations that
can modify the shared st_mode, st_uid, and st_gid, make any access to
those fields atomic operations.
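
As a rough illustration of the pattern, the following sketch uses C11
<stdatomic.h> with relaxed ordering in place of QEMU's qatomic_set() and
qatomic_read() helpers; that substitution, the SharedAttrs type, and the
function names are assumptions made for this example only.  Each field is
read and written atomically on its own; the sketch does not attempt to
keep the three fields consistent with each other.

```c
#define _XOPEN_SOURCE 700 /* for S_IFREG and S_IFDIR in <sys/stat.h> */

#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical stand-in for the shared st_mode/st_uid/st_gid fields */
typedef struct {
    _Atomic uint32_t mode;
    _Atomic uint32_t uid;
    _Atomic uint32_t gid;
} SharedAttrs;

static void attrs_init(SharedAttrs *a, uint32_t uid, uint32_t gid)
{
    atomic_init(&a->mode, (uint32_t)(S_IFREG | S_IRUSR));
    atomic_init(&a->uid, uid);
    atomic_init(&a->gid, gid);
}

static void attrs_set_mode(SharedAttrs *a, uint32_t mode)
{
    /* Ignore the supplied file type, keep only the permission bits */
    atomic_store_explicit(&a->mode, (mode & 07777u) | (uint32_t)S_IFREG,
                          memory_order_relaxed);
}

static uint32_t attrs_get_mode(SharedAttrs *a)
{
    return atomic_load_explicit(&a->mode, memory_order_relaxed);
}

static void attrs_set_uid(SharedAttrs *a, uint32_t uid)
{
    atomic_store_explicit(&a->uid, uid, memory_order_relaxed);
}

static uint32_t attrs_get_uid(SharedAttrs *a)
{
    return atomic_load_explicit(&a->uid, memory_order_relaxed);
}
```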

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-23-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 1c794118393..fe1b6ad5ffc 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -156,6 +156,7 @@ typedef struct FuseExport {
     /* Whether allow_other was used as a mount option or not */
     bool allow_other;
 
+    /* All atomic */
     mode_t st_mode;
     uid_t st_uid;
     gid_t st_gid;
@@ -266,6 +267,7 @@ static int fuse_export_create(BlockExport *blk_exp,
     ERRP_GUARD(); /* ensure clean-up even with error_fatal */
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
     BlockExportOptionsFuse *args = &blk_exp_args->u.fuse;
+    uint32_t st_mode;
     int ret;
 
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
@@ -334,12 +336,13 @@ static int fuse_export_create(BlockExport *blk_exp,
         args->allow_other = FUSE_EXPORT_ALLOW_OTHER_AUTO;
     }
 
-    exp->st_mode = S_IFREG | S_IRUSR;
+    st_mode = S_IFREG | S_IRUSR;
     if (exp->writable) {
-        exp->st_mode |= S_IWUSR;
+        st_mode |= S_IWUSR;
     }
-    exp->st_uid = getuid();
-    exp->st_gid = getgid();
+    qatomic_set(&exp->st_mode, st_mode);
+    qatomic_set(&exp->st_uid, getuid());
+    qatomic_set(&exp->st_gid, getgid());
 
     if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
         /* Try allow_other == true first, ignore errors */
@@ -817,10 +820,10 @@ fuse_co_getattr(FuseExport *exp, struct fuse_attr_out *out)
         .attr_valid = 1,
         .attr = {
             .ino        = 1,
-            .mode       = exp->st_mode,
+            .mode       = qatomic_read(&exp->st_mode),
             .nlink      = 1,
-            .uid        = exp->st_uid,
-            .gid        = exp->st_gid,
+            .uid        = qatomic_read(&exp->st_uid),
+            .gid        = qatomic_read(&exp->st_gid),
             .size       = length,
             .blksize    = blk_bs(exp->common.blk)->bl.request_alignment,
             .blocks     = allocated_blocks,
@@ -903,15 +906,15 @@ fuse_co_setattr(FuseExport *exp, struct fuse_attr_out *out, uint32_t to_set,
 
     if (to_set & FATTR_MODE) {
         /* Ignore FUSE-supplied file type, only change the mode */
-        exp->st_mode = (mode & 07777) | S_IFREG;
+        qatomic_set(&exp->st_mode, (mode & 07777) | S_IFREG);
     }
 
     if (to_set & FATTR_UID) {
-        exp->st_uid = uid;
+        qatomic_set(&exp->st_uid, uid);
     }
 
     if (to_set & FATTR_GID) {
-        exp->st_gid = gid;
+        qatomic_set(&exp->st_gid, gid);
     }
 
     return fuse_co_getattr(exp, out);
-- 
2.53.0




* [PULL 23/28] fuse: Implement multi-threading
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (21 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 22/28] fuse: Make shared export state atomic Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 24/28] qapi/block-export: Document FUSE's multi-threading Kevin Wolf
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

FUSE allows creating multiple request queues by "cloning" /dev/fuse FDs
(via open("/dev/fuse") + ioctl(FUSE_DEV_IOC_CLONE)).

We can use this to implement multi-threading.

For configuration, we don't need any more information beyond the simple
array provided by the core block export interface: The FUSE kernel
driver feeds these FDs in a round-robin fashion, so all of them are
equivalent and we want to have exactly one per thread.
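
The cloning step itself can be sketched as follows.  This is not the
clone_fuse_fd() helper added by this patch, only a minimal stand-alone
illustration of the open() + ioctl(FUSE_DEV_IOC_CLONE) sequence named
above; the function name, the early check for a negative FD, and the
-errno return convention are assumptions of this example.  It requires
Linux and the kernel's <linux/fuse.h> header.

```c
#define _POSIX_C_SOURCE 200809L /* for O_CLOEXEC */

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <linux/fuse.h>

/*
 * Hypothetical helper: open a new /dev/fuse FD and attach it to the FUSE
 * session that @session_fd belongs to.  Returns the new FD on success,
 * or a negative errno value on failure.
 */
static int clone_fuse_fd_sketch(int session_fd)
{
    uint32_t src_fd;
    int new_fd;

    if (session_fd < 0) {
        return -EBADF;
    }
    src_fd = (uint32_t)session_fd;

    new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);
    if (new_fd < 0) {
        return -errno;
    }

    /* The ioctl is issued on the new FD and names the existing one */
    if (ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd) < 0) {
        int err = errno;

        close(new_fd);
        return -err;
    }

    return new_fd;
}
```

Each FD obtained this way can then be polled from its own thread, which is
what makes the one-queue-per-thread layout possible.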

These are the benchmark results when using four threads (compared to a
single thread); note that fio still only uses a single job, but
performance can still be improved because of said round-robin usage for
the queues.  (Not in the sync case, though, in which case I guess it
just adds overhead.)

file:
  read:
    seq aio:   261.7k ±1.7k  (+168%)
    rand aio:  129.2k ±14.3k (+35%)
    seq sync:   36.6k ±0.6k  (+6%)
    rand sync:  10.1k ±0.1k  (+2%)
  write:
    seq aio:   235.7k ±2.8k  (+243%)
    rand aio:  232.0k ±6.7k  (+237%)
    seq sync:   31.7k ±0.6k  (+4%)
    rand sync:  31.8k ±0.5k  (+4%)
null:
  read:
    seq aio:   253.8k ±12.3k (+45%)
    rand aio:  248.2k ±12.0k (+45%)
    seq sync:   91.6k ±2.4k  (+12%)
    rand sync:  91.3k ±2.1k  (+17%)
  write:
    seq aio:   208.2k ±9.8k  (+6%)
    rand aio:  207.0k ±7.4k  (+8%)
    seq sync:   91.2k ±1.9k  (+9%)
    rand sync:  90.4k ±2.5k  (+14%)

So moderate improvements in most cases, but quite improved AIO
performance with an actual underlying file.

Here are the results for numjobs=4:

"Before", i.e. without multithreading in QSD/FUSE (results compared to
numjobs=1):

file:
  read:
    seq aio:    85.5k ±0.4k (-13%)
    rand aio:   92.5k ±0.5k (-3%)
    seq sync:   54.5k ±9.1k (+58%)
    rand sync:  38.0k ±0.2k (+283%)
  write:
    seq aio:    67.3k ±0.3k (-2%)
    rand aio:   67.6k ±0.3k (-2%)
    seq sync:   69.3k ±0.5k (+126%)
    rand sync:  69.3k ±0.3k (+126%)
null:
  read:
    seq aio:   170.6k ±0.8k (-2%)
    rand aio:  170.9k ±0.9k (±0%)
    seq sync:  187.6k ±1.3k (+129%)
    rand sync: 188.9k ±0.9k (+142%)
  write:
    seq aio:   191.5k ±1.2k (-2%)
    rand aio:  193.8k ±1.4k (-1%)
    seq sync:  206.1k ±1.3k (+147%)
    rand sync: 206.1k ±1.2k (+159%)

As probably expected, little difference in the AIO case, but great
improvements in the sync cases because it kind of gives it an artificial
iodepth of 4.

"After", i.e. with four threads in QSD/FUSE (now results compared to the
above):

file:
  read:
    seq aio:   198.7k ±2.7k (+132%)
    rand aio:  317.3k ±0.6k (+243%)
    seq sync:   55.9k ±8.9k (+3%)
    rand sync:  39.1k ±0.0k (+3%)
  write:
    seq aio:   229.0k ±0.8k (+240%)
    rand aio:  227.0k ±1.3k (+235%)
    seq sync:  102.5k ±0.2k (+48%)
    rand sync: 101.7k ±0.2k (+47%)
null:
  read:
    seq aio:   584.0k ±1.5k (+242%)
    rand aio:  581.9k ±1.9k (+240%)
    seq sync:  270.6k ±0.9k (+44%)
    rand sync: 270.4k ±0.7k (+43%)
  write:
    seq aio:   598.4k ±2.0k (+212%)
    rand aio:  605.2k ±2.0k (+212%)
    seq sync:  274.0k ±0.8k (+33%)
    rand sync: 275.0k ±0.7k (+33%)

So this helps mainly for the AIO cases, but also in the null sync cases,
because null is always CPU-bound, so more threads help.
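
For reference, the percentages in these tables are relative changes
against the respective baseline.  The small C sketch below (helper names
are hypothetical) shows the arithmetic, using the file/read/seq aio
figures for numjobs=4 from the two tables above: 85.5k before and 198.7k
after gives roughly +132%.  The published percentages were presumably
computed from unrounded measurements, so recomputing them from the rounded
table values may differ slightly in other rows.

```c
#include <assert.h>

/* Relative change of @after versus @before, in percent */
static double percent_change(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}

/* Baseline implied by a measured value and its reported relative change */
static double implied_baseline(double value, double change_percent)
{
    return value / (1.0 + change_percent / 100.0);
}
```

For example, percent_change(85.5, 198.7) is about 132.4, and
implied_baseline(85.5, -13.0) is about 98.3, i.e. the single-job,
single-thread seq aio read result must have been close to 98k.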

One unsolved mystery: When using a multithreaded export, running fio
with 1 job (benchmark at the top of this commit) yields better seqread
performance than doing so with 4 jobs.  Actually, with 4 jobs, it's
significantly worse than randread, which is quite strange.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-24-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/export/fuse.c | 193 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 153 insertions(+), 40 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index fe1b6ad5ffc..a2a478d2934 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -31,11 +31,13 @@
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "system/block-backend.h"
+#include "system/iothread.h"
 
 #include <fuse.h>
 #include <fuse_lowlevel.h>
 
 #include "standard-headers/linux/fuse.h"
+#include <sys/ioctl.h>
 
 #if defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
@@ -118,12 +120,17 @@ QEMU_BUILD_BUG_ON(sizeof(((FuseRequestInHeaderBuf *)0)->head) +
                   sizeof(((FuseRequestInHeaderBuf *)0)->tail) !=
                   sizeof(FuseRequestInHeader));
 
-typedef struct FuseExport {
-    BlockExport common;
+typedef struct FuseExport FuseExport;
 
-    struct fuse_session *fuse_session;
-    unsigned int in_flight; /* atomic */
-    bool mounted, fd_handler_set_up;
+/*
+ * One FUSE "queue", representing one FUSE FD from which requests are fetched
+ * and processed.  Each queue is tied to an AioContext.
+ */
+typedef struct FuseQueue {
+    FuseExport *exp;
+
+    AioContext *ctx;
+    int fuse_fd;
 
     /*
      * Cached buffer to receive the data of WRITE requests.  Cached because:
@@ -140,6 +147,14 @@ typedef struct FuseExport {
      * via blk_blockalign() and thus need to be freed via qemu_vfree().
      */
     void *req_write_data_cached;
+} FuseQueue;
+
+struct FuseExport {
+    BlockExport common;
+
+    struct fuse_session *fuse_session;
+    unsigned int in_flight; /* atomic */
+    bool mounted, fd_handler_set_up;
 
     /*
      * Set when there was an unrecoverable error and no requests should be read
@@ -148,7 +163,15 @@ typedef struct FuseExport {
      */
     bool halted;
 
-    int fuse_fd;
+    int num_queues;
+    FuseQueue *queues;
+    /*
+     * True if this export should follow the generic export's AioContext.
+     * Will be false if the queues' AioContexts have been explicitly set by the
+     * user, i.e. are expected to stay in those contexts.
+     * (I.e. is always false if there is more than one queue.)
+     */
+    bool follow_aio_context;
 
     char *mountpoint;
     bool writable;
@@ -160,7 +183,7 @@ typedef struct FuseExport {
     mode_t st_mode;
     uid_t st_uid;
     gid_t st_gid;
-} FuseExport;
+};
 
 /*
  * Verify that the size of FuseRequestInHeaderBuf.head plus the data
@@ -179,12 +202,13 @@ static void fuse_export_halt(FuseExport *exp);
 static void init_exports_table(void);
 
 static int mount_fuse_export(FuseExport *exp, Error **errp);
+static int clone_fuse_fd(int fd, Error **errp);
 
 static bool is_regular_file(const char *path, Error **errp);
 
 static void read_from_fuse_fd(void *opaque);
 static void coroutine_fn
-fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+fuse_co_process_request(FuseQueue *q, const FuseRequestInHeader *in_hdr,
                         const void *data_buffer);
 static int fuse_write_err(int fd, const struct fuse_in_header *in_hdr, int err);
 
@@ -216,8 +240,11 @@ static void fuse_attach_handlers(FuseExport *exp)
         return;
     }
 
-    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
-                       read_from_fuse_fd, NULL, NULL, NULL, exp);
+    for (int i = 0; i < exp->num_queues; i++) {
+        aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+                           read_from_fuse_fd, NULL, NULL, NULL,
+                           &exp->queues[i]);
+    }
     exp->fd_handler_set_up = true;
 }
 
@@ -226,8 +253,10 @@ static void fuse_attach_handlers(FuseExport *exp)
  */
 static void fuse_detach_handlers(FuseExport *exp)
 {
-    aio_set_fd_handler(exp->common.ctx, exp->fuse_fd,
-                       NULL, NULL, NULL, NULL, NULL);
+    for (int i = 0; i < exp->num_queues; i++) {
+        aio_set_fd_handler(exp->queues[i].ctx, exp->queues[i].fuse_fd,
+                           NULL, NULL, NULL, NULL, NULL);
+    }
     exp->fd_handler_set_up = false;
 }
 
@@ -242,6 +271,11 @@ static void fuse_export_drained_end(void *opaque)
 
     /* Refresh AioContext in case it changed */
     exp->common.ctx = blk_get_aio_context(exp->common.blk);
+    if (exp->follow_aio_context) {
+        assert(exp->num_queues == 1);
+        exp->queues[0].ctx = exp->common.ctx;
+    }
+
     fuse_attach_handlers(exp);
 }
 
@@ -273,8 +307,32 @@ static int fuse_export_create(BlockExport *blk_exp,
     assert(blk_exp_args->type == BLOCK_EXPORT_TYPE_FUSE);
 
     if (multithread) {
-        error_setg(errp, "FUSE export does not support multi-threading");
-        return -EINVAL;
+        /* Guaranteed by common export code */
+        assert(mt_count >= 1);
+
+        exp->follow_aio_context = false;
+        exp->num_queues = mt_count;
+        exp->queues = g_new(FuseQueue, mt_count);
+
+        for (size_t i = 0; i < mt_count; i++) {
+            exp->queues[i] = (FuseQueue) {
+                .exp = exp,
+                .ctx = multithread[i],
+                .fuse_fd = -1,
+            };
+        }
+    } else {
+        /* Guaranteed by common export code */
+        assert(mt_count == 0);
+
+        exp->follow_aio_context = true;
+        exp->num_queues = 1;
+        exp->queues = g_new(FuseQueue, 1);
+        exp->queues[0] = (FuseQueue) {
+            .exp = exp,
+            .ctx = exp->common.ctx,
+            .fuse_fd = -1,
+        };
     }
 
     /* For growable and writable exports, take the RESIZE permission */
@@ -286,7 +344,7 @@ static int fuse_export_create(BlockExport *blk_exp,
         ret = blk_set_perm(exp->common.blk, blk_perm | BLK_PERM_RESIZE,
                            blk_shared_perm, errp);
         if (ret < 0) {
-            return ret;
+            goto fail;
         }
     }
 
@@ -362,13 +420,23 @@ static int fuse_export_create(BlockExport *blk_exp,
 
     g_hash_table_insert(exports, g_strdup(exp->mountpoint), NULL);
 
-    exp->fuse_fd = fuse_session_fd(exp->fuse_session);
-    ret = qemu_fcntl_addfl(exp->fuse_fd, O_NONBLOCK);
+    assert(exp->num_queues >= 1);
+    exp->queues[0].fuse_fd = fuse_session_fd(exp->fuse_session);
+    ret = qemu_fcntl_addfl(exp->queues[0].fuse_fd, O_NONBLOCK);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Failed to make FUSE FD non-blocking");
         goto fail;
     }
 
+    for (int i = 1; i < exp->num_queues; i++) {
+        int fd = clone_fuse_fd(exp->queues[0].fuse_fd, errp);
+        if (fd < 0) {
+            ret = fd;
+            goto fail;
+        }
+        exp->queues[i].fuse_fd = fd;
+    }
+
     fuse_attach_handlers(exp);
     return 0;
 
@@ -461,28 +529,28 @@ fail:
 /**
  * Allocate a buffer to receive WRITE data, or take the cached one.
  */
-static void *get_write_data_buffer(FuseExport *exp)
+static void *get_write_data_buffer(FuseQueue *q)
 {
-    if (exp->req_write_data_cached) {
-        void *cached = exp->req_write_data_cached;
-        exp->req_write_data_cached = NULL;
+    if (q->req_write_data_cached) {
+        void *cached = q->req_write_data_cached;
+        q->req_write_data_cached = NULL;
         return cached;
     } else {
-        return blk_blockalign(exp->common.blk, FUSE_MAX_WRITE_BYTES);
+        return blk_blockalign(q->exp->common.blk, FUSE_MAX_WRITE_BYTES);
     }
 }
 
 /**
  * Release a WRITE data buffer, possibly reusing it for a subsequent request.
  */
-static void release_write_data_buffer(FuseExport *exp, void **buffer)
+static void release_write_data_buffer(FuseQueue *q, void **buffer)
 {
     if (!*buffer) {
         return;
     }
 
-    if (!exp->req_write_data_cached) {
-        exp->req_write_data_cached = *buffer;
+    if (!q->req_write_data_cached) {
+        q->req_write_data_cached = *buffer;
     } else {
         qemu_vfree(*buffer);
     }
@@ -528,9 +596,42 @@ static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
     }
 }
 
+/**
+ * Clone the given /dev/fuse file descriptor, yielding a second FD from which
+ * requests can be pulled for the associated filesystem.  Returns an FD on
+ * success, and -errno on error.
+ */
+static int clone_fuse_fd(int fd, Error **errp)
+{
+    uint32_t src_fd = fd;
+    int new_fd;
+    int ret;
+
+    /*
+     * The name "/dev/fuse" is fixed, see libfuse's lib/fuse_loop_mt.c
+     * (fuse_clone_chan()).
+     */
+    new_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC | O_NONBLOCK);
+    if (new_fd < 0) {
+        ret = -errno;
+        error_setg_errno(errp, errno, "Failed to open /dev/fuse");
+        return ret;
+    }
+
+    ret = ioctl(new_fd, FUSE_DEV_IOC_CLONE, &src_fd);
+    if (ret < 0) {
+        ret = -errno;
+        error_setg_errno(errp, errno, "Failed to clone FUSE FD");
+        close(new_fd);
+        return ret;
+    }
+
+    return new_fd;
+}
+
 /**
  * Try to read a single request from the FUSE FD.
- * Takes a FuseExport pointer in `opaque`.
+ * Takes a FuseQueue pointer in `opaque`.
  *
  * Assumes the export's in-flight counter has already been incremented.
  *
@@ -538,8 +639,9 @@ static ssize_t req_op_hdr_len(const FuseRequestInHeader *in_hdr)
  */
 static void coroutine_fn co_read_from_fuse_fd(void *opaque)
 {
-    FuseExport *exp = opaque;
-    int fuse_fd = exp->fuse_fd;
+    FuseQueue *q = opaque;
+    int fuse_fd = q->fuse_fd;
+    FuseExport *exp = q->exp;
     ssize_t ret;
     FuseRequestInHeaderBuf in_hdr_buf;
     const FuseRequestInHeader *in_hdr;
@@ -551,7 +653,7 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
         goto no_request;
     }
 
-    data_buffer = get_write_data_buffer(exp);
+    data_buffer = get_write_data_buffer(q);
 
     /* Construct the I/O vector to hold the FUSE request */
     iov[0] = (struct iovec) { &in_hdr_buf.head, sizeof(in_hdr_buf.head) };
@@ -612,29 +714,29 @@ static void coroutine_fn co_read_from_fuse_fd(void *opaque)
             memcpy(in_hdr_buf.tail, data_buffer, len);
         }
 
-        release_write_data_buffer(exp, &data_buffer);
+        release_write_data_buffer(q, &data_buffer);
     }
 
-    fuse_co_process_request(exp, in_hdr, data_buffer);
+    fuse_co_process_request(q, in_hdr, data_buffer);
 
 no_request:
-    release_write_data_buffer(exp, &data_buffer);
+    release_write_data_buffer(q, &data_buffer);
     fuse_dec_in_flight(exp);
 }
 
 /**
  * Try to read and process a single request from the FUSE FD.
  * (To be used as a handler for when the FUSE FD becomes readable.)
- * Takes a FuseExport pointer in `opaque`.
+ * Takes a FuseQueue pointer in `opaque`.
  */
 static void read_from_fuse_fd(void *opaque)
 {
-    FuseExport *exp = opaque;
+    FuseQueue *q = opaque;
     Coroutine *co;
 
-    co = qemu_coroutine_create(co_read_from_fuse_fd, exp);
+    co = qemu_coroutine_create(co_read_from_fuse_fd, q);
     /* Decremented by co_read_from_fuse_fd() */
-    fuse_inc_in_flight(exp);
+    fuse_inc_in_flight(q->exp);
     qemu_coroutine_enter(co);
 }
 
@@ -659,6 +761,17 @@ static void fuse_export_delete(BlockExport *blk_exp)
 {
     FuseExport *exp = container_of(blk_exp, FuseExport, common);
 
+    for (int i = 0; i < exp->num_queues; i++) {
+        FuseQueue *q = &exp->queues[i];
+
+        /* Queue 0's FD belongs to the FUSE session */
+        if (i > 0 && q->fuse_fd >= 0) {
+            close(q->fuse_fd);
+        }
+        qemu_vfree(q->req_write_data_cached);
+    }
+    g_free(exp->queues);
+
     if (exp->fuse_session) {
         if (exp->mounted) {
             fuse_session_unmount(exp->fuse_session);
@@ -667,7 +780,6 @@ static void fuse_export_delete(BlockExport *blk_exp)
         fuse_session_destroy(exp->fuse_session);
     }
 
-    qemu_vfree(exp->req_write_data_cached);
     g_free(exp->mountpoint);
 }
 
@@ -1344,10 +1456,11 @@ static int fuse_write_buf_response(int fd,
  * Process a FUSE request, incl. writing the response.
  */
 static void coroutine_fn
-fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
+fuse_co_process_request(FuseQueue *q, const FuseRequestInHeader *in_hdr,
                         const void *data_buffer)
 {
     FuseRequestOutHeader out_hdr;
+    FuseExport *exp = q->exp;
     /* For read requests: Data to be returned */
     void *out_data_buffer = NULL;
     ssize_t ret;
@@ -1471,10 +1584,10 @@ fuse_co_process_request(FuseExport *exp, const FuseRequestInHeader *in_hdr,
     }
 
     if (out_data_buffer) {
-        fuse_write_buf_response(exp->fuse_fd, &out_hdr.common, out_data_buffer);
+        fuse_write_buf_response(q->fuse_fd, &out_hdr.common, out_data_buffer);
         qemu_vfree(out_data_buffer);
     } else {
-        fuse_write_response(exp->fuse_fd, &out_hdr);
+        fuse_write_response(q->fuse_fd, &out_hdr);
     }
 }
 
-- 
2.53.0




* [PULL 24/28] qapi/block-export: Document FUSE's multi-threading
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (22 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 23/28] fuse: Implement multi-threading Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 25/28] iotests/308: Add multi-threading sanity test Kevin Wolf
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Document for users that FUSE's multi-threading implementation
distributes requests in a round-robin manner, regardless of where they
originate from.
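
As a rough illustration of what "round-robin" means here, the following
toy dispatcher hands each request to the next queue in turn, ignoring
where the request came from. It is only a sketch: RoundRobin and
rr_pick_queue() are hypothetical names, and this is neither QEMU nor
kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy dispatcher; not taken from QEMU or the kernel. */
typedef struct RoundRobin {
    size_t num_queues;  /* number of worker queues (threads) */
    size_t next;        /* queue that receives the next request */
} RoundRobin;

/*
 * Return the queue index for the next request and advance the cursor.
 * The originating CPU is deliberately not an input.
 */
static size_t rr_pick_queue(RoundRobin *rr)
{
    size_t q;

    assert(rr && rr->num_queues > 0);

    q = rr->next;
    rr->next = (rr->next + 1) % rr->num_queues;
    return q;
}
```

With four queues, successive calls yield 0, 1, 2, 3 and then wrap around
to 0 again.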

As noted by Stefan, this will probably change with a FUSE-over-io_uring
implementation (which is supposed to have CPU affinity), but documenting
that is left for once that is done.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-25-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qapi/block-export.json | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index 160cd2e3ca0..dd724acf1cb 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -164,6 +164,11 @@
 # Options for exporting a block graph node on some (file) mountpoint
 # as a raw image.
 #
+# Multi-threading note: The FUSE export supports multi-threading.
+# Currently, requests are distributed across these threads in a
+# round-robin fashion, i.e. independently of the CPU core from which a
+# request originates.
+#
 # @mountpoint: Path on which to export the block device via FUSE.
 #     This must point to an existing regular file.
 #
-- 
2.53.0




* [PULL 25/28] iotests/308: Add multi-threading sanity test
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (23 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 24/28] qapi/block-export: Document FUSE's multi-threading Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 26/28] block/nfs: add support for libnfs v6 Kevin Wolf
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Hanna Czenczek <hreitz@redhat.com>

Run qemu-img bench on a simple multi-threaded FUSE export to test that
it works.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20260309150856.26800-26-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/308     | 51 ++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/308.out | 56 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index a83c6fc01fb..f4a06a522ef 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -441,6 +441,57 @@ $QEMU_IO -c 'read -P 0 0 64M' "$TEST_IMG" | _filter_qemu_io
 
 _cleanup_test_img
 
+echo
+echo '=== Multi-threading ==='
+
+# Just set up a null block device, export it (with multi-threading), and run
+# qemu-img bench on it (to get parallel requests)
+
+_launch_qemu
+_send_qemu_cmd $QEMU_HANDLE \
+    "{'execute': 'qmp_capabilities'}" \
+    'return'
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{'execute': 'blockdev-add',
+      'arguments': {
+          'driver': 'null-co',
+          'node-name': 'null'
+      } }" \
+    'return'
+
+for id in iothread{0,1,2,3}; do
+    _send_qemu_cmd $QEMU_HANDLE \
+        "{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': '$id'
+          } }" \
+        'return'
+done
+
+echo
+
+iothreads="['iothread0', 'iothread1', 'iothread2', 'iothread3']"
+fuse_export_add \
+    'export' \
+    "'mountpoint': '$EXT_MP', 'iothread': $iothreads" \
+    'return' \
+    'null'
+
+echo
+$QEMU_IMG bench -f raw "$EXT_MP" |
+    sed -e 's/[0-9.]\+ seconds/X.XXX seconds/'
+echo
+
+fuse_export_del 'export'
+
+_send_qemu_cmd $QEMU_HANDLE \
+    "{'execute': 'quit'}" \
+    'return'
+
+wait=yes _cleanup_qemu
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index ebeaf64b486..580cc94e92f 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -217,4 +217,60 @@ read 67108864/67108864 bytes at offset 0
 {"return": {}}
 read 67108864/67108864 bytes at offset 0
 64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+=== Multi-threading ===
+{'execute': 'qmp_capabilities'}
+{"return": {}}
+{'execute': 'blockdev-add',
+      'arguments': {
+          'driver': 'null-co',
+          'node-name': 'null'
+      } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread0'
+          } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread1'
+          } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread2'
+          } }
+{"return": {}}
+{'execute': 'object-add',
+          'arguments': {
+              'qom-type': 'iothread',
+              'id': 'iothread3'
+          } }
+{"return": {}}
+
+{'execute': 'block-export-add',
+          'arguments': {
+              'type': 'fuse',
+              'id': 'export',
+              'node-name': 'null',
+              'mountpoint': 'TEST_DIR/t.IMGFMT.fuse', 'iothread': ['iothread0', 'iothread1', 'iothread2', 'iothread3']
+          } }
+{"return": {}}
+
+Sending 75000 read requests, 4096 bytes each, 64 in parallel (starting at offset 0, step size 4096)
+Run completed in X.XXX seconds.
+
+{'execute': 'block-export-del',
+          'arguments': {
+              'id': 'export'
+          } }
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_EXPORT_DELETED", "data": {"id": "export"}}
+{'execute': 'quit'}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
+{"return": {}}
 *** done
-- 
2.53.0




* [PULL 26/28] block/nfs: add support for libnfs v6
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (24 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 25/28] iotests/308: Add multi-threading sanity test Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-12  9:41   ` Peter Maydell
  2026-03-10 16:26 ` [PULL 27/28] qapi: block: Refactor HTTP(s) common arguments Kevin Wolf
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Peter Lieven <pl@dlhnet.de>

libnfs v6 added a new API structure for read and write requests.

This effectively also adds zero copy read support for cases where
the qiov coming from the block layer has only one vector.
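
The buffer choice can be sketched in plain C as follows. This is a
simplified illustration using POSIX struct iovec instead of QEMUIOVector;
pick_read_buffer() and scatter_to_iov() are hypothetical helpers, not
functions from block/nfs.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: choose the buffer a read should land in.
 * With exactly one vector the caller's memory is used directly (zero
 * copy); otherwise a temporary bounce buffer is allocated and *bounce is
 * set.  Returns NULL only if allocating the bounce buffer failed.
 */
static void *pick_read_buffer(const struct iovec *iov, int niov,
                              size_t bytes, bool *bounce)
{
    assert(iov && niov >= 1 && bounce);

    if (niov == 1) {
        *bounce = false;
        return iov[0].iov_base;
    }

    *bounce = true;
    /* Allocate at least one byte so that NULL always means failure */
    return malloc(bytes ? bytes : 1);
}

/*
 * Hypothetical helper: copy up to `len` bytes from a linear buffer into
 * the vectors, in order.  Returns the number of bytes copied.
 */
static size_t scatter_to_iov(const struct iovec *iov, int niov,
                             const void *buf, size_t len)
{
    size_t done = 0;

    for (int i = 0; i < niov && done < len; i++) {
        size_t n = iov[i].iov_len;

        if (n > len - done) {
            n = len - done;
        }
        memcpy(iov[i].iov_base, (const char *)buf + done, n);
        done += n;
    }
    return done;
}
```

When the bounce flag is set, the caller scatters the data into the
vectors after the read completes and then frees the temporary buffer.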

The .bdrv_refresh_limits implementation is needed because libnfs v6
silently dropped support for splitting large read/write requests into
chunks.
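
To illustrate the effect of such a limit, the sketch below takes the
smaller of the read and write maximums and computes how many requests a
transfer has to be split into when each request may carry at most that
many bytes. transfer_limit() and count_chunks() are hypothetical helpers
for illustration only; in QEMU the splitting is done by the generic block
layer based on the reported limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one limit covering both reads and writes. */
static uint32_t transfer_limit(uint32_t readmax, uint32_t writemax)
{
    return readmax < writemax ? readmax : writemax;
}

/*
 * Hypothetical helper: number of requests needed for `bytes` bytes when
 * a single request may carry at most `max_transfer` bytes.
 */
static uint64_t count_chunks(uint64_t bytes, uint32_t max_transfer)
{
    assert(max_transfer > 0);

    /* Full chunks, plus one more for a trailing partial chunk */
    return bytes / max_transfer + (bytes % max_transfer != 0 ? 1 : 0);
}
```

For example, with a read maximum of 1 MiB and a write maximum of 64 KiB,
the limit is 64 KiB, so a 200000-byte request needs four chunks.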

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Peter Lieven <pl@dlhnet.de>
Message-ID: <20260306142840.72923-1-pl@dlhnet.de>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/nfs.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 meson.build |  2 +-
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index b78f4f86e85..53e267fa755 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -69,7 +69,9 @@ typedef struct NFSClient {
 typedef struct NFSRPC {
     BlockDriverState *bs;
     int ret;
+#ifndef LIBNFS_API_V2
     QEMUIOVector *iov;
+#endif
     struct stat *st;
     Coroutine *co;
     NFSClient *client;
@@ -237,6 +239,7 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
     NFSRPC *task = private_data;
     task->ret = ret;
     assert(!task->st);
+#ifndef LIBNFS_API_V2
     if (task->ret > 0 && task->iov) {
         if (task->ret <= task->iov->size) {
             qemu_iovec_from_buf(task->iov, 0, data, task->ret);
@@ -244,6 +247,7 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
             task->ret = -EIO;
         }
     }
+#endif
     if (task->ret < 0) {
         error_report("NFS Error: %s", nfs_get_error(nfs));
     }
@@ -266,13 +270,36 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
 {
     NFSClient *client = bs->opaque;
     NFSRPC task;
+    char *buf = NULL;
+    bool my_buffer = false;
 
     nfs_co_init_task(bs, &task);
-    task.iov = iov;
+
+#ifdef LIBNFS_API_V2
+    if (iov->niov != 1) {
+        buf = g_try_malloc(bytes);
+        if (bytes && buf == NULL) {
+            return -ENOMEM;
+        }
+        my_buffer = true;
+    } else {
+        buf = iov->iov[0].iov_base;
+    }
+#endif
 
     WITH_QEMU_LOCK_GUARD(&client->mutex) {
+#ifdef LIBNFS_API_V2
+        if (nfs_pread_async(client->context, client->fh,
+                            buf, bytes, offset,
+                            nfs_co_generic_cb, &task) != 0) {
+#else
+        task.iov = iov;
         if (nfs_pread_async(client->context, client->fh,
                             offset, bytes, nfs_co_generic_cb, &task) != 0) {
+#endif
+            if (my_buffer) {
+                g_free(buf);
+            }
             return -ENOMEM;
         }
 
@@ -280,6 +307,13 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
     }
     qemu_coroutine_yield();
 
+    if (my_buffer) {
+        if (task.ret > 0) {
+            qemu_iovec_from_buf(iov, 0, buf, task.ret);
+        }
+        g_free(buf);
+    }
+
     if (task.ret < 0) {
         return task.ret;
     }
@@ -315,9 +349,15 @@ static int coroutine_fn nfs_co_pwritev(BlockDriverState *bs, int64_t offset,
     }
 
     WITH_QEMU_LOCK_GUARD(&client->mutex) {
+#ifdef LIBNFS_API_V2
+        if (nfs_pwrite_async(client->context, client->fh,
+                             buf, bytes, offset,
+                             nfs_co_generic_cb, &task) != 0) {
+#else
         if (nfs_pwrite_async(client->context, client->fh,
                              offset, bytes, buf,
                              nfs_co_generic_cb, &task) != 0) {
+#endif
             if (my_buffer) {
                 g_free(buf);
             }
@@ -856,6 +896,13 @@ static void coroutine_fn nfs_co_invalidate_cache(BlockDriverState *bs,
 }
 #endif
 
+static void nfs_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    NFSClient *client = bs->opaque;
+    bs->bl.max_transfer = MIN((uint32_t)nfs_get_readmax(client->context),
+                              (uint32_t)nfs_get_writemax(client->context));
+}
+
 static const char *nfs_strong_runtime_opts[] = {
     "path",
     "user",
@@ -893,6 +940,7 @@ static BlockDriver bdrv_nfs = {
     .bdrv_detach_aio_context        = nfs_detach_aio_context,
     .bdrv_attach_aio_context        = nfs_attach_aio_context,
     .bdrv_refresh_filename          = nfs_refresh_filename,
+    .bdrv_refresh_limits            = nfs_refresh_limits,
     .bdrv_dirname                   = nfs_dirname,
 
     .strong_runtime_opts            = nfs_strong_runtime_opts,
diff --git a/meson.build b/meson.build
index f45885f05a1..bb0d1d993a5 100644
--- a/meson.build
+++ b/meson.build
@@ -1157,7 +1157,7 @@ endif
 
 libnfs = not_found
 if not get_option('libnfs').auto() or have_block
-  libnfs = dependency('libnfs', version: ['>=1.9.3', '<6.0.0'],
+  libnfs = dependency('libnfs', version: '>=1.9.3',
                       required: get_option('libnfs'),
                       method: 'pkg-config')
 endif
-- 
2.53.0




* [PULL 27/28] qapi: block: Refactor HTTP(s) common arguments
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (25 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 26/28] block/nfs: add support for libnfs v6 Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-10 16:26 ` [PULL 28/28] block/curl: add support for S3 presigned URLs Kevin Wolf
  2026-03-11 10:43 ` [PULL 00/28] Block layer patches Peter Maydell
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Antoine Damhet <adamhet@scaleway.com>

The HTTPS curl block driver is a superset of the HTTP driver; reflect
that in the QAPI.

Suggested-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Antoine Damhet <adamhet@scaleway.com>
Message-ID: <20260227-fix-curl-v3-v3-2-eb8a4d88feef@scaleway.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qapi/block-core.json | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index da0b36a3751..8ba1fdc49d6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4600,23 +4600,14 @@
 # Driver specific block device options for HTTPS connections over the
 # curl backend.  URLs must start with "https://".
 #
-# @cookie: List of cookies to set; format is "name1=content1;
-#     name2=content2;" as explained by CURLOPT_COOKIE(3).  Defaults to
-#     no cookies.
-#
 # @sslverify: Whether to verify the SSL certificate's validity
 #     (defaults to true)
 #
-# @cookie-secret: ID of a QCryptoSecret object providing the cookie
-#     data in a secure way.  See @cookie for the format.  (since 2.10)
-#
 # Since: 2.9
 ##
 { 'struct': 'BlockdevOptionsCurlHttps',
-  'base': 'BlockdevOptionsCurlBase',
-  'data': { '*cookie': 'str',
-            '*sslverify': 'bool',
-            '*cookie-secret': 'str'} }
+  'base': 'BlockdevOptionsCurlHttp',
+  'data': { '*sslverify': 'bool'} }
 
 ##
 # @BlockdevOptionsCurlFtp:
-- 
2.53.0




* [PULL 28/28] block/curl: add support for S3 presigned URLs
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (26 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 27/28] qapi: block: Refactor HTTP(s) common arguments Kevin Wolf
@ 2026-03-10 16:26 ` Kevin Wolf
  2026-03-11 10:43 ` [PULL 00/28] Block layer patches Peter Maydell
  28 siblings, 0 replies; 35+ messages in thread
From: Kevin Wolf @ 2026-03-10 16:26 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel

From: Antoine Damhet <adamhet@scaleway.com>

S3 presigned URLs are signed for a specific HTTP method (typically GET
for our use cases). The curl block driver currently issues a HEAD
request to discover the web server features and the file size, which
fails with 'HTTP 403' (forbidden).

Add a 'force-range' option that skips the HEAD request and instead
issues a minimal GET request (querying 1 byte from the server) to
extract the file size from the 'Content-Range' response header. To
achieve this the 'curl_header_cb' is redesigned to generically parse
HTTP headers.
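
The size extraction from a 'Content-Range' value can be sketched in
standalone C as follows. This is an illustration only, using the C
library instead of QEMU's helpers; parse_content_range_size() is a
hypothetical name and not a function in block/curl.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the total size from a Content-Range value
 * of the form "bytes begin-end/full_size", e.g. "bytes 0-0/1048576".
 * Returns 0 on success and -1 if no usable size is present.
 */
static int parse_content_range_size(const char *val, uint64_t *size)
{
    const char *slash;
    char *end;
    unsigned long long n;

    assert(val && size);

    slash = strchr(val, '/');
    if (!slash || slash[1] < '0' || slash[1] > '9') {
        /* No '/' at all, or an unknown size such as "bytes 0-0/*" */
        return -1;
    }

    errno = 0;
    n = strtoull(slash + 1, &end, 10);
    if (errno != 0 || *end != '\0') {
        /* Overflow or trailing garbage after the number */
        return -1;
    }

    *size = n;
    return 0;
}
```

A caller would treat a failure here the same way as a server that did
not report a file size.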

$ $QEMU -drive \
    'driver=https,url=https://s3.example.com/some.img?X-Amz-Security-Token=XXX,force-range=true'

Enabling the 'force-range' option when the web server specified with
@url does not support range requests might cause the server to respond
successfully with 'HTTP 200' and attempt to send the whole file body.
With the 'CURLOPT_NOBODY' option set, libcurl will skip reading after
the headers and close the connection. QEMU still gracefully detects the
missing feature. This might waste a small number of TCP packets but is
otherwise transparent to the user.

Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Antoine Damhet <adamhet@scaleway.com>
Message-ID: <20260227-fix-curl-v3-v3-3-eb8a4d88feef@scaleway.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qapi/block-core.json                  |   8 +-
 docs/system/device-url-syntax.rst.inc |   6 ++
 block/curl.c                          | 104 ++++++++++++++++++--------
 block/trace-events                    |   1 +
 4 files changed, 85 insertions(+), 34 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 8ba1fdc49d6..f8d446b3d6e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4587,12 +4587,18 @@
 # @cookie-secret: ID of a QCryptoSecret object providing the cookie
 #     data in a secure way.  See @cookie for the format.  (since 2.10)
 #
+# @force-range: Don't issue a HEAD HTTP request to discover if the
+#     http server supports range requests and rely only on GET
+#     requests.  This is especially useful for S3 presigned URLs where
+#     HEAD requests are unauthorized.  (default: false; since 11.0)
+#
 # Since: 2.9
 ##
 { 'struct': 'BlockdevOptionsCurlHttp',
   'base': 'BlockdevOptionsCurlBase',
   'data': { '*cookie': 'str',
-            '*cookie-secret': 'str'} }
+            '*cookie-secret': 'str',
+            '*force-range': 'bool'} }
 
 ##
 # @BlockdevOptionsCurlHttps:
diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc
index aae65d138c0..996ce5418ff 100644
--- a/docs/system/device-url-syntax.rst.inc
+++ b/docs/system/device-url-syntax.rst.inc
@@ -179,6 +179,12 @@ These are specified using a special URL syntax.
       get the size of the image to be downloaded. If not set, the
       default timeout of 5 seconds is used.
 
+   ``force-range``
+      Don't issue a HEAD HTTP request to discover if the http server
+      supports range requests and rely only on GET requests. This
+      is especially useful for S3 presigned URLs where HEAD requests
+      are unauthorized. It defaults to 'false'.
+
    Note that when passing options to qemu explicitly, ``driver`` is the
    value of <protocol>.
 
diff --git a/block/curl.c b/block/curl.c
index 6dccf002564..66aecfb20ec 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -62,10 +62,12 @@
 #define CURL_BLOCK_OPT_PASSWORD_SECRET "password-secret"
 #define CURL_BLOCK_OPT_PROXY_USERNAME "proxy-username"
 #define CURL_BLOCK_OPT_PROXY_PASSWORD_SECRET "proxy-password-secret"
+#define CURL_BLOCK_OPT_FORCE_RANGE "force-range"
 
 #define CURL_BLOCK_OPT_READAHEAD_DEFAULT (256 * 1024)
 #define CURL_BLOCK_OPT_SSLVERIFY_DEFAULT true
 #define CURL_BLOCK_OPT_TIMEOUT_DEFAULT 5
+#define CURL_BLOCK_OPT_FORCE_RANGE_DEFAULT false
 
 struct BDRVCURLState;
 struct CURLState;
@@ -206,27 +208,33 @@ static size_t curl_header_cb(void *ptr, size_t size, size_t nmemb, void *opaque)
 {
     BDRVCURLState *s = opaque;
     size_t realsize = size * nmemb;
-    const char *p = ptr;
-    const char *end = p + realsize;
-    const char *t = "accept-ranges : bytes "; /* A lowercase template */
+    g_autofree char *header = g_strstrip(g_strndup(ptr, realsize));
+    char *val = strchr(header, ':');
 
-    /* check if header matches the "t" template */
-    for (;;) {
-        if (*t == ' ') { /* space in t matches any amount of isspace in p */
-            if (p < end && g_ascii_isspace(*p)) {
-                ++p;
-            } else {
-                ++t;
-            }
-        } else if (*t && p < end && *t == g_ascii_tolower(*p)) {
-            ++p, ++t;
-        } else {
-            break;
-        }
+    if (!val) {
+        return realsize;
     }
 
-    if (!*t && p == end) { /* if we managed to reach ends of both strings */
-        s->accept_range = true;
+    *val++ = '\0';
+    g_strchomp(header);
+    while (g_ascii_isspace(*val)) {
+        ++val;
+    }
+
+    trace_curl_header_cb(header, val);
+
+    if (!g_ascii_strcasecmp(header, "accept-ranges")) {
+        if (!g_ascii_strcasecmp(val, "bytes")) {
+            s->accept_range = true;
+        }
+    } else if (!g_ascii_strcasecmp(header, "Content-Range")) {
+        /* Content-Range fmt is `bytes begin-end/full_size` */
+        val = strchr(val, '/');
+        if (val) {
+            if (qemu_strtou64(val + 1, NULL, 10, &s->len) < 0) {
+                s->len = UINT64_MAX;
+            }
+        }
     }
 
     return realsize;
@@ -668,6 +676,11 @@ static QemuOptsList runtime_opts = {
             .type = QEMU_OPT_STRING,
             .help = "ID of secret used as password for HTTP proxy auth",
         },
+        {
+            .name = CURL_BLOCK_OPT_FORCE_RANGE,
+            .type = QEMU_OPT_BOOL,
+            .help = "Assume HTTP range requests are supported",
+        },
         { /* end of list */ }
     },
 };
@@ -690,6 +703,7 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags,
 #endif
     const char *secretid;
     const char *protocol_delimiter;
+    bool force_range;
     int ret;
 
     bdrv_graph_rdlock_main_loop();
@@ -807,35 +821,56 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     s->accept_range = false;
+    s->len = UINT64_MAX;
+    force_range = qemu_opt_get_bool(opts, CURL_BLOCK_OPT_FORCE_RANGE,
+                                    CURL_BLOCK_OPT_FORCE_RANGE_DEFAULT);
+    /*
+     * When minimal CURL will be bumped to `7.83`, the header callback + manual
+     * parsing can be replaced by `curl_easy_header` calls
+     */
     if (curl_easy_setopt(state->curl, CURLOPT_NOBODY, 1L) ||
         curl_easy_setopt(state->curl, CURLOPT_HEADERFUNCTION, curl_header_cb) ||
         curl_easy_setopt(state->curl, CURLOPT_HEADERDATA, s)) {
-        pstrcpy(state->errmsg, CURL_ERROR_SIZE,
-                "curl library initialization failed.");
-        goto out;
+        goto out_init;
+    }
+    if (force_range) {
+        if (curl_easy_setopt(state->curl, CURLOPT_CUSTOMREQUEST, "GET") ||
+            curl_easy_setopt(state->curl, CURLOPT_RANGE, "0-0")) {
+            goto out_init;
+        }
     }
+
     if (curl_easy_perform(state->curl))
         goto out;
-    /* CURL 7.55.0 deprecates CURLINFO_CONTENT_LENGTH_DOWNLOAD in favour of
-     * the *_T version which returns a more sensible type for content length.
-     */
+
+    if (!force_range) {
+        /*
+         * CURL 7.55.0 deprecates CURLINFO_CONTENT_LENGTH_DOWNLOAD in favour of
+         * the *_T version which returns a more sensible type for content
+         * length.
+         */
 #if LIBCURL_VERSION_NUM >= 0x073700
-    if (curl_easy_getinfo(state->curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T, &cl)) {
-        goto out;
-    }
+        if (curl_easy_getinfo(state->curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD_T,
+                              &cl)) {
+            goto out;
+        }
 #else
-    if (curl_easy_getinfo(state->curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD, &cl)) {
-        goto out;
-    }
+        if (curl_easy_getinfo(state->curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD,
+                              &cl)) {
+            goto out;
+        }
 #endif
-    if (cl < 0) {
+        if (cl >= 0) {
+            s->len = cl;
+        }
+    }
+
+    if (s->len == UINT64_MAX) {
         pstrcpy(state->errmsg, CURL_ERROR_SIZE,
                 "Server didn't report file size.");
         goto out;
     }
 
-    s->len = cl;
-
     if ((!strncasecmp(s->url, "http://", strlen("http://"))
         || !strncasecmp(s->url, "https://", strlen("https://")))
         && !s->accept_range) {
@@ -856,6 +891,9 @@ static int curl_open(BlockDriverState *bs, QDict *options, int flags,
     qemu_opts_del(opts);
     return 0;
 
+out_init:
+    pstrcpy(state->errmsg, CURL_ERROR_SIZE,
+            "curl library initialization failed.");
 out:
     error_setg(errp, "CURL: Error opening file: %s", state->errmsg);
     curl_easy_cleanup(state->curl);
diff --git a/block/trace-events b/block/trace-events
index c9b4736ff88..d170fc96f15 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -191,6 +191,7 @@ ssh_server_status(int status) "server status=%d"
 curl_timer_cb(long timeout_ms) "timer callback timeout_ms %ld"
 curl_sock_cb(int action, int fd) "sock action %d on fd %d"
 curl_read_cb(size_t realsize) "just reading %zu bytes"
+curl_header_cb(const char *key, const char *val) "looking at %s: %s"
 curl_open(const char *file) "opening %s"
 curl_open_size(uint64_t size) "size = %" PRIu64
 curl_setup_preadv(uint64_t bytes, uint64_t start, const char *range) "reading %" PRIu64 " at %" PRIu64 " (%s)"
-- 
2.53.0
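
As a rough illustration of the force-range probe in the patch above: when the
request is a GET with "Range: 0-0", a server that honours ranges typically
answers with a Content-Range header such as "bytes 0-0/12345", where the total
size follows the slash. The patch's curl_header_cb() is not shown in this
excerpt, so the following is only a sketch of what such parsing could look
like. The helper name parse_content_range_total() is hypothetical and is not
taken from QEMU.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper (not from the patch): extract the total length from a
 * Content-Range header value such as "bytes 0-0/12345".
 * Returns false if the value is malformed or the total is unknown ("*").
 */
static bool parse_content_range_total(const char *value, uint64_t *total)
{
    const char *p;
    char *end;
    unsigned long long n;

    assert(value != NULL && total != NULL);

    /* Skip optional leading whitespace after the header name */
    while (*value == ' ' || *value == '\t') {
        value++;
    }
    /* Only the "bytes" range unit is handled in this sketch */
    if (strncasecmp(value, "bytes ", 6) != 0) {
        return false;
    }
    p = strchr(value, '/');
    if (p == NULL) {
        return false;
    }
    p++;
    /* Rejects "*" (unknown length), signs and an empty total */
    if (!isdigit((unsigned char)*p)) {
        return false;
    }
    errno = 0;
    n = strtoull(p, &end, 10);
    if (errno == ERANGE) {
        return false;
    }
    /* Allow trailing whitespace and the CRLF that ends a header line */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n') {
        end++;
    }
    if (*end != '\0') {
        return false;
    }
    *total = n;
    return true;
}
```

A caller would leave the length at its "unknown" sentinel (UINT64_MAX in the
patch) whenever this returns false, which is what leads to the "Server didn't
report file size." error path.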




* Re: [PULL 00/28] Block layer patches
  2026-03-10 16:25 [PULL 00/28] Block layer patches Kevin Wolf
                   ` (27 preceding siblings ...)
  2026-03-10 16:26 ` [PULL 28/28] block/curl: add support for S3 presigned URLs Kevin Wolf
@ 2026-03-11 10:43 ` Peter Maydell
  28 siblings, 0 replies; 35+ messages in thread
From: Peter Maydell @ 2026-03-11 10:43 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel

On Tue, 10 Mar 2026 at 16:28, Kevin Wolf <kwolf@redhat.com> wrote:
>
> The following changes since commit 31ee190665dd50054c39cef5ad740680aabda382:
>
>   Merge tag 'hw-misc-20260309' of https://github.com/philmd/qemu into staging (2026-03-09 17:19:26 +0000)
>
> are available in the Git repository at:
>
>   https://gitlab.com/kmwolf/qemu.git tags/for-upstream
>
> for you to fetch changes up to 7b13fc97d7235006d2ccc7a132ecb70802ba258f:
>
>   block/curl: add support for S3 presigned URLs (2026-03-10 15:48:48 +0100)
>
> ----------------------------------------------------------------
> Block layer patches
>
> - export/fuse: Use coroutines and multi-threading
> - curl: Add force-range option
> - nfs: add support for libnfs v6
>



Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/11.0
for any user-visible changes.

-- PMM



* Re: [PULL 26/28] block/nfs: add support for libnfs v6
  2026-03-10 16:26 ` [PULL 26/28] block/nfs: add support for libnfs v6 Kevin Wolf
@ 2026-03-12  9:41   ` Peter Maydell
  2026-03-12 16:12     ` Kevin Wolf
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Maydell @ 2026-03-12  9:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel

On Tue, 10 Mar 2026 at 16:30, Kevin Wolf <kwolf@redhat.com> wrote:
>
> From: Peter Lieven <pl@dlhnet.de>
>
> libnfs v6 added a new api structure for read and write requests.
>
> This effectively also adds zero copy read support for cases where
> the qiov coming from the block layer has only one vector.
>
> The .bdrv_refresh_limits implementation is needed because libnfs v6
> silently dropped support for splitting large read/write requests into
> chunks.
>
> Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> Signed-off-by: Peter Lieven <pl@dlhnet.de>
> Message-ID: <20260306142840.72923-1-pl@dlhnet.de>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>


Hi; Coverity reports a potential issue with this code
(CID 1645631):

> @@ -266,13 +270,36 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
>  {
>      NFSClient *client = bs->opaque;
>      NFSRPC task;
> +    char *buf = NULL;
> +    bool my_buffer = false;
>
>      nfs_co_init_task(bs, &task);
> -    task.iov = iov;
> +
> +#ifdef LIBNFS_API_V2
> +    if (iov->niov != 1) {
> +        buf = g_try_malloc(bytes);
> +        if (bytes && buf == NULL) {
> +            return -ENOMEM;
> +        }
> +        my_buffer = true;

Here we have code that takes the "read zero bytes" case, and
still tries to malloc a 0-length buffer (which is of dubious
portability). Then it will continue...

> +    } else {
> +        buf = iov->iov[0].iov_base;
> +    }
> +#endif
>
>      WITH_QEMU_LOCK_GUARD(&client->mutex) {
> +#ifdef LIBNFS_API_V2
> +        if (nfs_pread_async(client->context, client->fh,
> +                            buf, bytes, offset,
> +                            nfs_co_generic_cb, &task) != 0) {
> +#else
> +        task.iov = iov;
>          if (nfs_pread_async(client->context, client->fh,
>                              offset, bytes, nfs_co_generic_cb, &task) != 0) {
> +#endif
> +            if (my_buffer) {
> +                g_free(buf);
> +            }
>              return -ENOMEM;
>          }
>
> @@ -280,6 +307,13 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
>      }
>      qemu_coroutine_yield();
>
> +    if (my_buffer) {
> +        if (task.ret > 0) {
> +            qemu_iovec_from_buf(iov, 0, buf, task.ret);

...and down here we use 'buf', but Coverity thinks it might be NULL
because we only returned -ENOMEM above for a NULL buffer if bytes != 0.
So we might be here with bytes == 0 and buf == NULL.

Maybe we can't get here, but maybe it would be simpler to handle
the "asked to read 0 bytes" case directly without calling into the
nfs library or allocating a 0 byte buffer?

> +        }
> +        g_free(buf);
> +    }
> +
>      if (task.ret < 0) {
>          return task.ret;
>      }

thanks
-- PMM



* Re: [PULL 26/28] block/nfs: add support for libnfs v6
  2026-03-12  9:41   ` Peter Maydell
@ 2026-03-12 16:12     ` Kevin Wolf
  2026-03-12 16:19       ` Peter Maydell
  0 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2026-03-12 16:12 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-block, qemu-devel

Am 12.03.2026 um 10:41 hat Peter Maydell geschrieben:
> On Tue, 10 Mar 2026 at 16:30, Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > From: Peter Lieven <pl@dlhnet.de>
> >
> > libnfs v6 added a new api structure for read and write requests.
> >
> > This effectively also adds zero copy read support for cases where
> > the qiov coming from the block layer has only one vector.
> >
> > The .bdrv_refresh_limits implementation is needed because libnfs v6
> > silently dropped support for splitting large read/write requests into
> > chunks.
> >
> > Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> > Signed-off-by: Peter Lieven <pl@dlhnet.de>
> > Message-ID: <20260306142840.72923-1-pl@dlhnet.de>
> > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
> 
> Hi; Coverity reports a potential issue with this code
> (CID 1645631):
> 
> > @@ -266,13 +270,36 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
> >  {
> >      NFSClient *client = bs->opaque;
> >      NFSRPC task;
> > +    char *buf = NULL;
> > +    bool my_buffer = false;
> >
> >      nfs_co_init_task(bs, &task);
> > -    task.iov = iov;
> > +
> > +#ifdef LIBNFS_API_V2
> > +    if (iov->niov != 1) {
> > +        buf = g_try_malloc(bytes);
> > +        if (bytes && buf == NULL) {
> > +            return -ENOMEM;
> > +        }
> > +        my_buffer = true;
> 
> Here we have code that takes the "read zero bytes" case, and
> still tries to malloc a 0-length buffer (which is of dubious
> portability). Then it will continue...

g_try_malloc() always returns NULL for allocating 0 bytes, so I don't
think portability is a problem here.

> > +    } else {
> > +        buf = iov->iov[0].iov_base;
> > +    }
> > +#endif
> >
> >      WITH_QEMU_LOCK_GUARD(&client->mutex) {
> > +#ifdef LIBNFS_API_V2
> > +        if (nfs_pread_async(client->context, client->fh,
> > +                            buf, bytes, offset,
> > +                            nfs_co_generic_cb, &task) != 0) {
> > +#else
> > +        task.iov = iov;
> >          if (nfs_pread_async(client->context, client->fh,
> >                              offset, bytes, nfs_co_generic_cb, &task) != 0) {
> > +#endif
> > +            if (my_buffer) {
> > +                g_free(buf);
> > +            }
> >              return -ENOMEM;
> >          }
> >
> > @@ -280,6 +307,13 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
> >      }
> >      qemu_coroutine_yield();
> >
> > +    if (my_buffer) {
> > +        if (task.ret > 0) {
> > +            qemu_iovec_from_buf(iov, 0, buf, task.ret);
> 
> ...and down here we use 'buf', but Coverity thinks it might be NULL
> because we only returned -ENOMEM above for a NULL buffer if bytes != 0.
> So we might be here with bytes == 0 and buf == NULL.

Yes, buf might be NULL, but copying 0 bytes from NULL isn't a problem
because you don't actually dereference the pointer then.

I think the part that Coverity doesn't understand is probably that
task.ret is limited to bytes (i.e. 0 in this case).

> Maybe we can't get here, but maybe it would be simpler to handle
> the "asked to read 0 bytes" case directly without calling into the
> nfs library or allocating a 0 byte buffer?

We could have an 'if (!bytes) { return 0; }' at the start of the
function if we want to.

Kevin
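
For readers following the thread, the early return suggested above can be
sketched in isolation. This is a simplified model under stated assumptions,
not QEMU code: try_alloc() imitates the zero-size behaviour the thread
attributes to g_try_malloc(), and bounce_read(), read_fn and fill_backend()
are hypothetical stand-ins for the bounce-buffer path of nfs_co_preadv() and
the libnfs call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Models an allocator that returns NULL for a zero-byte request */
static void *try_alloc(size_t n)
{
    return n ? malloc(n) : NULL;
}

/* Hypothetical backend signature: returns bytes read or a negative errno */
typedef long (*read_fn)(void *buf, size_t bytes);

static int backend_calls;

/* Sample backend for demonstration: fills the buffer with 'x' */
static long fill_backend(void *buf, size_t bytes)
{
    backend_calls++;
    memset(buf, 'x', bytes);
    return (long)bytes;
}

/* Read through a temporary buffer, then copy the result into dst */
static long bounce_read(read_fn backend, void *dst, size_t bytes)
{
    void *buf;
    long ret;

    if (!bytes) {
        /* Nothing to read: skip both the allocation and the backend call */
        return 0;
    }

    buf = try_alloc(bytes);
    if (buf == NULL) {
        return -ENOMEM;
    }

    ret = backend(buf, bytes);
    if (ret > 0) {
        /* buf is known to be non-NULL here, so no analyser guesswork */
        memcpy(dst, buf, (size_t)ret);
    }
    free(buf);
    return ret;
}
```

With the early return in place, the copy is only reachable with a non-NULL
buffer, which is the property the Coverity report is concerned with.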




* Re: [PULL 26/28] block/nfs: add support for libnfs v6
  2026-03-12 16:12     ` Kevin Wolf
@ 2026-03-12 16:19       ` Peter Maydell
  2026-03-12 16:47         ` Kevin Wolf
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Maydell @ 2026-03-12 16:19 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel

On Thu, 12 Mar 2026 at 16:13, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 12.03.2026 um 10:41 hat Peter Maydell geschrieben:
> > On Tue, 10 Mar 2026 at 16:30, Kevin Wolf <kwolf@redhat.com> wrote:
> > >
> > > From: Peter Lieven <pl@dlhnet.de>
> > >
> > > libnfs v6 added a new api structure for read and write requests.
> > >
> > > This effectively also adds zero copy read support for cases where
> > > the qiov coming from the block layer has only one vector.
> > >
> > > The .bdrv_refresh_limits implementation is needed because libnfs v6
> > > silently dropped support for splitting large read/write requests into
> > > chunks.
> > >
> > > Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> > > Signed-off-by: Peter Lieven <pl@dlhnet.de>
> > > Message-ID: <20260306142840.72923-1-pl@dlhnet.de>
> > > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >
> >
> > Hi; Coverity reports a potential issue with this code
> > (CID 1645631):
> >
> > > @@ -266,13 +270,36 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
> > >  {
> > >      NFSClient *client = bs->opaque;
> > >      NFSRPC task;
> > > +    char *buf = NULL;
> > > +    bool my_buffer = false;
> > >
> > >      nfs_co_init_task(bs, &task);
> > > -    task.iov = iov;
> > > +
> > > +#ifdef LIBNFS_API_V2
> > > +    if (iov->niov != 1) {
> > > +        buf = g_try_malloc(bytes);
> > > +        if (bytes && buf == NULL) {
> > > +            return -ENOMEM;
> > > +        }
> > > +        my_buffer = true;
> >
> > Here we have code that takes the "read zero bytes" case, and
> > still tries to malloc a 0-length buffer (which is of dubious
> > portability). Then it will continue...
>
> g_try_malloc() always returns NULL for allocating 0 bytes, so I don't
> think portability is a problem here.

Ah, so it does. The glib docs document this for g_malloc() but
don't mention it for g_try_malloc(), so I missed it.

> > > +    if (my_buffer) {
> > > +        if (task.ret > 0) {
> > > +            qemu_iovec_from_buf(iov, 0, buf, task.ret);
> >
> > ...and down here we use 'buf', but Coverity thinks it might be NULL
> > because we only returned -ENOMEM above for a NULL buffer if bytes != 0.
> > So we might be here with bytes == 0 and buf == NULL.
>
> Yes, buf might be NULL, but copying 0 bytes from NULL isn't a problem
> because you don't actually dereference the pointer then.
>
> I think the part that Coverity doesn't understand is probably that
> task.ret is limited to bytes (i.e. 0 in this case).

Mmm. Let me know if you'd prefer me to mark this issue in
Coverity as a false positive rather than changing the code.

-- PMM



* Re: [PULL 26/28] block/nfs: add support for libnfs v6
  2026-03-12 16:19       ` Peter Maydell
@ 2026-03-12 16:47         ` Kevin Wolf
  2026-03-20  9:50           ` Peter Maydell
  0 siblings, 1 reply; 35+ messages in thread
From: Kevin Wolf @ 2026-03-12 16:47 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-block, qemu-devel

Am 12.03.2026 um 17:19 hat Peter Maydell geschrieben:
> On Thu, 12 Mar 2026 at 16:13, Kevin Wolf <kwolf@redhat.com> wrote:
> >
> > Am 12.03.2026 um 10:41 hat Peter Maydell geschrieben:
> > > On Tue, 10 Mar 2026 at 16:30, Kevin Wolf <kwolf@redhat.com> wrote:
> > > >
> > > > From: Peter Lieven <pl@dlhnet.de>
> > > >
> > > > libnfs v6 added a new api structure for read and write requests.
> > > >
> > > > This effectively also adds zero copy read support for cases where
> > > > the qiov coming from the block layer has only one vector.
> > > >
> > > > The .bdrv_refresh_limits implementation is needed because libnfs v6
> > > > silently dropped support for splitting large read/write requests into
> > > > chunks.
> > > >
> > > > Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> > > > Signed-off-by: Peter Lieven <pl@dlhnet.de>
> > > > Message-ID: <20260306142840.72923-1-pl@dlhnet.de>
> > > > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> > > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > >
> > >
> > > Hi; Coverity reports a potential issue with this code
> > > (CID 1645631):
> > >
> > > > @@ -266,13 +270,36 @@ static int coroutine_fn nfs_co_preadv(BlockDriverState *bs, int64_t offset,
> > > >  {
> > > >      NFSClient *client = bs->opaque;
> > > >      NFSRPC task;
> > > > +    char *buf = NULL;
> > > > +    bool my_buffer = false;
> > > >
> > > >      nfs_co_init_task(bs, &task);
> > > > -    task.iov = iov;
> > > > +
> > > > +#ifdef LIBNFS_API_V2
> > > > +    if (iov->niov != 1) {
> > > > +        buf = g_try_malloc(bytes);
> > > > +        if (bytes && buf == NULL) {
> > > > +            return -ENOMEM;
> > > > +        }
> > > > +        my_buffer = true;
> > >
> > > Here we have code that takes the "read zero bytes" case, and
> > > still tries to malloc a 0-length buffer (which is of dubious
> > > portability). Then it will continue...
> >
> > g_try_malloc() always returns NULL for allocating 0 bytes, so I don't
> > think portability is a problem here.
> 
> Ah, so it does. The glib docs document this for g_malloc() but
> don't mention it for g_try_malloc(), so I missed it.
> 
> > > > +    if (my_buffer) {
> > > > +        if (task.ret > 0) {
> > > > +            qemu_iovec_from_buf(iov, 0, buf, task.ret);
> > >
> > > ...and down here we use 'buf', but Coverity thinks it might be NULL
> > > because we only returned -ENOMEM above for a NULL buffer if bytes != 0.
> > > So we might be here with bytes == 0 and buf == NULL.
> >
> > Yes, buf might be NULL, but copying 0 bytes from NULL isn't a problem
> > because you don't actually dereference the pointer then.
> >
> > I think the part that Coverity doesn't understand is probably that
> > task.ret is limited to bytes (i.e. 0 in this case).
> 
> Mmm. Let me know if you'd prefer me to mark this issue in
> Coverity as a false positive rather than changing the code.

I don't really mind either way. If someone posts a patch, I'll apply it
(though not sure if that would be only for 11.1 at this point).

Kevin




* Re: [PULL 26/28] block/nfs: add support for libnfs v6
  2026-03-12 16:47         ` Kevin Wolf
@ 2026-03-20  9:50           ` Peter Maydell
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Maydell @ 2026-03-20  9:50 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, qemu-devel, Peter Lieven

I just noticed I forgot to cc Peter Lieven on this email -- sorry
about that.

On Thu, 12 Mar 2026 at 16:47, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 12.03.2026 um 17:19 hat Peter Maydell geschrieben:
> > On Thu, 12 Mar 2026 at 16:13, Kevin Wolf <kwolf@redhat.com> wrote:
> > >
> > > Am 12.03.2026 um 10:41 hat Peter Maydell geschrieben:
> > > > On Tue, 10 Mar 2026 at 16:30, Kevin Wolf <kwolf@redhat.com> wrote:
> > > > >
> > > > > From: Peter Lieven <pl@dlhnet.de>
> > > > >
> > > > > libnfs v6 added a new api structure for read and write requests.
> > > > >
> > > > > This effectively also adds zero copy read support for cases where
> > > > > the qiov coming from the block layer has only one vector.
> > > > >
> > > > > The .bdrv_refresh_limits implementation is needed because libnfs v6
> > > > > silently dropped support for splitting large read/write requests into
> > > > > chunks.
> > > > >
> > > > > Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> > > > > Signed-off-by: Peter Lieven <pl@dlhnet.de>
> > > > > Message-ID: <20260306142840.72923-1-pl@dlhnet.de>
> > > > > Reviewed-by: Kevin Wolf <kwolf@redhat.com>
> > > > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > > >
> > > >
> > > > Hi; Coverity reports a potential issue with this code
> > > > (CID 1645631):

> > > > > +    if (my_buffer) {
> > > > > +        if (task.ret > 0) {
> > > > > +            qemu_iovec_from_buf(iov, 0, buf, task.ret);
> > > >
> > > > ...and down here we use 'buf', but Coverity thinks it might be NULL
> > > > because we only returned -ENOMEM above for a NULL buffer if bytes != 0.
> > > > So we might be here with bytes == 0 and buf == NULL.
> > >
> > > Yes, buf might be NULL, but copying 0 bytes from NULL isn't a problem
> > > because you don't actually dereference the pointer then.
> > >
> > > I think the part that Coverity doesn't understand is probably that
> > > task.ret is limited to bytes (i.e. 0 in this case).

> > > > Maybe we can't get here, but maybe it would be simpler to handle
> > > > the "asked to read 0 bytes" case directly without calling into the
> > > > nfs library or allocating a 0 byte buffer?

> > > We could have an 'if (!bytes) { return 0; }' at the start of the
> > > function if we want to.

> > Mmm. Let me know if you'd prefer me to mark this issue in
> > Coverity as a false positive rather than changing the code.
>
> I don't really mind either way. If someone posts a patch, I'll apply it
> (though not sure if that would be only for 11.1 at this point).

So what is the conclusion here -- do we want to change the code,
or are we happy with it as it is and want to tell Coverity this
is a false positive?

thanks
-- PMM


