All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion
@ 2026-05-20 14:25 Denis V. Lunev via qemu development
  2026-05-20 14:25 ` [PATCH v3] block/linux-aio: bound ioq_submit() recursion depth Denis V. Lunev via qemu development
  2026-05-21 15:55 ` [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion Stefan Hajnoczi
  0 siblings, 2 replies; 3+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-05-20 14:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha, pbonzini,
	Denis V. Lunev

Observed in production where a cached-I/O backup path was driven
through aio=native, making io_submit(2) complete synchronously and
closing the recursion cycle.  On the supported aio=native + cache=none
+ qcow2 configuration the cycle stays bounded by accident rather than
by construction; this patch bounds it explicitly.

Bisect:

  v8.1.0 (forward edge only)      no crash / 20
  84d61e5f36^                     no crash / 20
  84d61e5f36 (backward edge in)   crash at attempt 17
  v8.2.0                          crash at attempt  4
  master + this patch             no crash / 80

The closing commit is 84d61e5f36 ("virtio: use defer_call() in
virtio_irqfd_notify()").

No iotest: crash rate is 6..17 per 20 on unpatched master; a formal
test would be flaky.  The vmdk + aio=native + cache=none shape is
not otherwise exercised by the suite.

--- gen-workload.py -----------------------------------------------
#!/usr/bin/env python3
import random, sys
REGION  = 32 * 1024 * 1024
CLUSTER = 64 * 1024
SEED    = 0xC0FFEE
def main(out):
    r = random.Random(SEED); ops = []
    for _ in range(10000):
        off = r.randrange(0, REGION - 4096) & ~4095
        ops.append("aio_write -q %d 4k" % off)
    for i in range(10000):
        size, n = ("64k", 65536) if i < 5000 else ("128k", 131072)
        off = r.randrange(0, REGION - n) & ~(CLUSTER - 1)
        ops.append("aio_write -q -z -u %d %s" % (off, size))
    r.shuffle(ops); ops.append("aio_flush")
    open(out, "w").write("\n".join(ops) + "\n")
if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "t.cmds")
-------------------------------------------------------------------

--- repro.sh ------------------------------------------------------
#!/bin/bash
set -u
qimg=$1; qio=$2; label=$3; attempts=${4:-20}
cmds=${5:-$(dirname "$0")/t.cmds}
vmdk=/tmp/t.$label.vmdk; log=/tmp/repro_$label.log
: > "$log"
for i in $(seq 1 "$attempts"); do
    rm -f "$vmdk"
    "$qimg" create -f vmdk "$vmdk" 256M >/dev/null 2>&1
    "$qio" -f vmdk -n --cache=none --aio=native "$vmdk" < "$cmds" \
        >>"$log" 2>&1
    rc=$?
    [ $rc -ge 128 ] && { echo "CRASH attempt $i rc=$rc" >>"$log"; break; }
done
echo "DONE $label rc=$rc attempt=$i" >> "$log"
-------------------------------------------------------------------

  python3 gen-workload.py t.cmds
  ./repro.sh /path/to/qemu-img /path/to/qemu-io test 20

Notes:

 * IOQ_SUBMIT_MAX_DEPTH = 8.  Round headroom over the bounded depth
   of the supported async-completion path.
 * Per-thread __thread counter, matching util/defer-call.c's storage.
   A per-LinuxAioState field would let multiple devices on one
   thread recurse independently.

Changes from v2:
* moved depth guard to struct qemu_laiocb (suggestion from Stefan)

Changes from v1:
* removed all downstream marks

Denis V. Lunev (1):
  block/linux-aio: bound ioq_submit() recursion depth

 block/linux-aio.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
--
2.51.0



^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH v3] block/linux-aio: bound ioq_submit() recursion depth
  2026-05-20 14:25 [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion Denis V. Lunev via qemu development
@ 2026-05-20 14:25 ` Denis V. Lunev via qemu development
  2026-05-21 15:55 ` [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-05-20 14:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, qemu-stable, kwolf, hreitz, stefanha, pbonzini,
	Denis V. Lunev

qemu_laio_process_completions() wraps its body in defer_call_begin /
defer_call_end. Inside the section, completion callbacks wake coroutines
that queue new aiocbs; laio_do_submit() defers laio_deferred_fn. At the
bottom of qemu_laio_process_completions() the defer_call_end() fires
laio_deferred_fn, which calls ioq_submit(), closing the cycle:

  ioq_submit
    -> io_submit(2)                           // some sync completions
    -> qemu_laio_process_completions          // defer_call_begin
         -> aio_co_wake                       // resumes coroutine
              -> laio_do_submit
                   -> defer_call(laio_deferred_fn, s)   // enqueued
         -> defer_call_end                    // nesting drops to 0
              -> laio_deferred_fn
                   -> ioq_submit              // +1 stack frame, loop

When io_submit(2) returns asynchronously (O_DIRECT) the cycle
terminates in one extra frame: the fresh aiocb is still in flight, no
completion is drained, no coroutine wakes, no new submission queues.
When submissions complete synchronously (non-O_DIRECT, or per-descriptor
drivers such as vmdk) each level enqueues more work for the next
defer_call_end() to drain, so recursion grows without bound and QEMU
crashes with SIGSEGV on the thread guard page.

The cycle was closed by two performance commits, each correct in
isolation:

  076682885d ("block/linux-aio: convert to blk_io_plug_call() API")
    -- introduced laio_deferred_fn and wired
       laio_do_submit -> defer_call(laio_deferred_fn, s).

  84d61e5f36 ("virtio: use defer_call() in virtio_irqfd_notify()")
    -- added defer_call_begin/end around qemu_laio_process_completions
       so virtio-irqfd notifications batch across a completion pass.

The supported aio=native + cache=none pairing keeps submissions
asynchronous, so the cycle stays bounded; nothing in the code enforces
that contract. Observed in production as a SIGSEGV during a backup job
configured with --cached + aio=native; reproducible on upstream with
qemu-io against vmdk.

Cap ioq_submit() recursion with a counter on LaioQueue, which is only
accessed from the AioContext home thread. On overflow, return without
submitting. The pending work is drained by s->completion_bh, which
qemu_laio_process_completions() has already scheduled on entry -- no
work is lost; one event-loop round-trip of latency is paid only when
the bound is hit, which cannot happen on a supported configuration.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
---
 block/linux-aio.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 0a7424fbb3..5aaf2e8514 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -36,6 +36,19 @@
 /* Maximum number of requests in a batch. (default value) */
 #define DEFAULT_MAX_BATCH 32
 
+/*
+ * Bound on how deep ioq_submit() may recurse on a single LaioQueue via the
+ * ioq_submit -> qemu_laio_process_completions -> defer_call_end ->
+ * laio_deferred_fn -> ioq_submit cycle. The cycle terminates naturally
+ * when io_submit(2) returns asynchronously (O_DIRECT), but can grow
+ * without bound when submissions complete synchronously. On overflow
+ * the caller returns without submitting; the outermost
+ * qemu_laio_process_completions() has already scheduled s->completion_bh
+ * (via qemu_bh_schedule() at the top of that function), which resumes
+ * submission from the next event-loop dispatch.
+ */
+#define IOQ_SUBMIT_MAX_DEPTH 8
+
 struct qemu_laiocb {
     Coroutine *co;
     LinuxAioState *ctx;
@@ -61,6 +74,7 @@ typedef struct {
     unsigned int in_queue;
     unsigned int in_flight;
     bool blocked;
+    unsigned int submit_depth;
     QSIMPLEQ_HEAD(, qemu_laiocb) pending;
 } LaioQueue;
 
@@ -331,6 +345,7 @@ static void ioq_init(LaioQueue *io_q)
     io_q->in_queue = 0;
     io_q->in_flight = 0;
     io_q->blocked = false;
+    io_q->submit_depth = 0;
 }
 
 static void ioq_submit(LinuxAioState *s)
@@ -340,6 +355,11 @@ static void ioq_submit(LinuxAioState *s)
     QEMU_UNINITIALIZED struct iocb *iocbs[MAX_EVENTS];
     QSIMPLEQ_HEAD(, qemu_laiocb) completed;
 
+    if (s->io_q.submit_depth >= IOQ_SUBMIT_MAX_DEPTH) {
+        return;
+    }
+    s->io_q.submit_depth++;
+
     do {
         if (s->io_q.in_flight >= MAX_EVENTS) {
             break;
@@ -385,6 +405,8 @@ static void ioq_submit(LinuxAioState *s)
          * pended requests will be submitted from there.
          */
     }
+
+    s->io_q.submit_depth--;
 }
 
 static uint64_t laio_max_batch(LinuxAioState *s, uint64_t dev_max_batch)

base-commit: ac0cc20ad2fe0b8df2e5d9458e90a095ac711ab1
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion
  2026-05-20 14:25 [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion Denis V. Lunev via qemu development
  2026-05-20 14:25 ` [PATCH v3] block/linux-aio: bound ioq_submit() recursion depth Denis V. Lunev via qemu development
@ 2026-05-21 15:55 ` Stefan Hajnoczi
  1 sibling, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2026-05-21 15:55 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: qemu-devel, qemu-block, qemu-stable, kwolf, hreitz, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3629 bytes --]

On Wed, May 20, 2026 at 04:25:02PM +0200, Denis V. Lunev via qemu development wrote:
> Observed in production where a cached-I/O backup path was driven
> through aio=native, making io_submit(2) complete synchronously and
> closing the recursion cycle.  On the supported aio=native + cache=none
> + qcow2 configuration the cycle stays bounded by accident rather than
> by construction; this patch bounds it explicitly.
> 
> Bisect:
> 
>   v8.1.0 (forward edge only)      no crash / 20
>   84d61e5f36^                     no crash / 20
>   84d61e5f36 (backward edge in)   crash at attempt 17
>   v8.2.0                          crash at attempt  4
>   master + this patch             no crash / 80
> 
> The closing commit is 84d61e5f36 ("virtio: use defer_call() in
> virtio_irqfd_notify()").
> 
> No iotest: crash rate is 6..17 per 20 on unpatched master; a formal
> test would be flaky.  The vmdk + aio=native + cache=none shape is
> not otherwise exercised by the suite.
> 
> --- gen-workload.py -----------------------------------------------
> #!/usr/bin/env python3
> import random, sys
> REGION  = 32 * 1024 * 1024
> CLUSTER = 64 * 1024
> SEED    = 0xC0FFEE
> def main(out):
>     r = random.Random(SEED); ops = []
>     for _ in range(10000):
>         off = r.randrange(0, REGION - 4096) & ~4095
>         ops.append("aio_write -q %d 4k" % off)
>     for i in range(10000):
>         size, n = ("64k", 65536) if i < 5000 else ("128k", 131072)
>         off = r.randrange(0, REGION - n) & ~(CLUSTER - 1)
>         ops.append("aio_write -q -z -u %d %s" % (off, size))
>     r.shuffle(ops); ops.append("aio_flush")
>     open(out, "w").write("\n".join(ops) + "\n")
> if __name__ == "__main__":
>     main(sys.argv[1] if len(sys.argv) > 1 else "t.cmds")
> -------------------------------------------------------------------
> 
> --- repro.sh ------------------------------------------------------
> #!/bin/bash
> set -u
> qimg=$1; qio=$2; label=$3; attempts=${4:-20}
> cmds=${5:-$(dirname "$0")/t.cmds}
> vmdk=/tmp/t.$label.vmdk; log=/tmp/repro_$label.log
> : > "$log"
> for i in $(seq 1 "$attempts"); do
>     rm -f "$vmdk"
>     "$qimg" create -f vmdk "$vmdk" 256M >/dev/null 2>&1
>     "$qio" -f vmdk -n --cache=none --aio=native "$vmdk" < "$cmds" \
>         >>"$log" 2>&1
>     rc=$?
>     [ $rc -ge 128 ] && { echo "CRASH attempt $i rc=$rc" >>"$log"; break; }
> done
> echo "DONE $label rc=$rc attempt=$i" >> "$log"
> -------------------------------------------------------------------
> 
>   python3 gen-workload.py t.cmds
>   ./repro.sh /path/to/qemu-img /path/to/qemu-io test 20
> 
> Notes:
> 
>  * IOQ_SUBMIT_MAX_DEPTH = 8.  Round headroom over the bounded depth
>    of the supported async-completion path.
>  * Per-thread __thread counter, matching util/defer-call.c's storage.
>    A per-LinuxAioState field would let multiple devices on one
>    thread recurse independently.
> 
> Changes from v2:
> * moved depth guard to struct qemu_laiocb (suggestion from Stefan)
> 
> Changes from v1:
> * removed all downstream marks
> 
> Denis V. Lunev (1):
>   block/linux-aio: bound ioq_submit() recursion depth
> 
>  block/linux-aio.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Hanna Reitz <hreitz@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> --
> 2.51.0
> 
> 

Thanks, applied to my block tree:
https://gitlab.com/stefanha/qemu/commits/block

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-05-21 15:56 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-20 14:25 [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion Denis V. Lunev via qemu development
2026-05-20 14:25 ` [PATCH v3] block/linux-aio: bound ioq_submit() recursion depth Denis V. Lunev via qemu development
2026-05-21 15:55 ` [PATCH v3 0/1] block/linux-aio: fix reproducible SIGSEGV from unbounded ioq_submit() recursion Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.