[PATCH RFC v2 0/3] improve aio-polling efficiency

* [PATCH RFC v2 0/3] improve aio-polling efficiency
@ 2026-03-23 13:54 Jaehoon Kim
  2026-03-23 13:54 ` [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation Jaehoon Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jaehoon Kim @ 2026-03-23 13:54 UTC
  To: qemu-devel, qemu-block
  Cc: mjrosato, farman, pbonzini, stefanha, fam, armbru, eblake,
	berrange, eduardo, dave, sw, Jaehoon Kim

Dear all,

This is v2 of the RFC patch series to refine aio_poll adaptive
polling logic for better CPU efficiency.

v1: 
https://lore.kernel.org/qemu-devel/20260113174824.464720-1-jhkim@linux.ibm.com/

Changes in v2:
- Patch 2/3: Changed default POLL_WEIGHT_SHIFT from 2 to 3
  based on extensive testing. Updated commit message with
  detailed performance comparison showing weight=3 provides
  better balance between throughput and CPU savings

- Patch 3/3: Added proper initialization of poll-weight,
  poll-grow, and poll-shrink defaults in iothread.c

This patch series refines the aio_poll adaptive polling logic to
reduce unnecessary busy-waiting and improve CPU efficiency.

The first patch prevents redundant polling time calculation when
polling is disabled. The second patch enhances the adaptive polling
mechanism by dynamically adjusting the iothread's polling duration
based on event intervals measured by individual AioHandlers. The
third patch introduces a new 'poll-weight' parameter for runtime
control over how much the current interval influences the next
polling duration.

We evaluated the patches on s390x hosts with different configurations:

Initial testing (Fedora 42):
Using a single guest with 16 virtio block devices backed by FCP
multipath devices, I/O scheduler set to 'none'. Across four FIO
workload patterns (sequential R/W, random R/W), averaged over
numjobs 1, 4, 8, and 16:

 - Throughput: -3% to -8% (one iothread), -2% to -5% (two iothreads)
 - CPU usage: -10% to -25% (one iothread), -7% to -12% (two iothreads)

Additional validation (RHEL 10.1 GA + QEMU 10.0.0):
Comparing the baseline against poll-weight=2 and poll-weight=3 with FCP
and FICON storage, using 1 and 8 iothreads, averaged over numjobs 1, 4,
and 8:

Summary of results (% change vs baseline):

 - Throughput avg: -2.2% (weight=3), -2.4% (weight=2)
 - CPU consumption avg: -9.4% (weight=3), -10.9% (weight=2)

Weight=3 was selected as the default because it provides slightly
better throughput (-2.2% vs -2.4%) while maintaining substantial CPU
savings (-9.4%).

Best regards,
Jaehoon Kim

Jaehoon Kim (3):
  aio-poll: avoid unnecessary polling time computation
  aio-poll: refine iothread polling using weighted handler intervals
  qapi/iothread: introduce poll-weight parameter for aio-poll

 include/qemu/aio.h                |   8 +-
 include/system/iothread.h         |   1 +
 iothread.c                        |  34 ++++++-
 monitor/hmp-cmds.c                |   1 +
 qapi/misc.json                    |   7 ++
 qapi/qom.json                     |   8 +-
 qemu-options.hx                   |   7 +-
 tests/unit/test-nested-aio-poll.c |   2 +-
 util/aio-posix.c                  | 141 ++++++++++++++++++++----------
 util/aio-win32.c                  |   3 +-
 util/async.c                      |   2 +
 11 files changed, 161 insertions(+), 53 deletions(-)

-- 
2.50.1

* [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation
  2026-03-23 13:54 [PATCH RFC v2 0/3] improve aio-polling efficiency Jaehoon Kim
@ 2026-03-23 13:54 ` Jaehoon Kim
  2026-03-25 17:22   ` Stefan Hajnoczi
  2026-03-23 13:54 ` [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals Jaehoon Kim
  2026-03-23 13:54 ` [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll Jaehoon Kim
  2 siblings, 1 reply; 18+ messages in thread
From: Jaehoon Kim @ 2026-03-23 13:54 UTC
  To: qemu-devel, qemu-block
  Cc: mjrosato, farman, pbonzini, stefanha, fam, armbru, eblake,
	berrange, eduardo, dave, sw, Jaehoon Kim

Nodes are no longer added to poll_aio_handlers when adaptive polling is
disabled, preventing unnecessary try_poll_mode() calls. Additionally,
aio_poll() skips try_poll_mode() when timeout is 0.

This avoids iterating over all nodes to compute max_ns unnecessarily
when polling is disabled or timeout is 0.

Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
---
 util/aio-posix.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 488d964611..b02beb0505 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -307,9 +307,8 @@ static bool aio_dispatch_handler(AioContext *ctx, AioHandler *node)
      * fdmon_supports_polling(), but only until the fd fires for the first
      * time.
      */
-    if (!QLIST_IS_INSERTED(node, node_deleted) &&
-        !QLIST_IS_INSERTED(node, node_poll) &&
-        node->io_poll) {
+    if (ctx->poll_max_ns && !QLIST_IS_INSERTED(node, node_deleted) &&
+        !QLIST_IS_INSERTED(node, node_poll) && node->io_poll) {
         trace_poll_add(ctx, node, node->pfd.fd, revents);
         if (ctx->poll_started && node->io_poll_begin) {
             node->io_poll_begin(node->opaque);
@@ -631,7 +630,7 @@ static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
 bool aio_poll(AioContext *ctx, bool blocking)
 {
     AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
-    bool progress;
+    bool progress = false;
     bool use_notify_me;
     int64_t timeout;
     int64_t start = 0;
@@ -656,7 +655,9 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }
 
     timeout = blocking ? aio_compute_timeout(ctx) : 0;
-    progress = try_poll_mode(ctx, &ready_list, &timeout);
+    if ((ctx->poll_max_ns != 0) && (timeout != 0)) {
+        progress = try_poll_mode(ctx, &ready_list, &timeout);
+    }
     assert(!(timeout && progress));
 
     /*
-- 
2.50.1



* [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals
  2026-03-23 13:54 [PATCH RFC v2 0/3] improve aio-polling efficiency Jaehoon Kim
  2026-03-23 13:54 ` [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation Jaehoon Kim
@ 2026-03-23 13:54 ` Jaehoon Kim
  2026-03-25 20:37   ` Stefan Hajnoczi
  2026-03-23 13:54 ` [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll Jaehoon Kim
  2 siblings, 1 reply; 18+ messages in thread
From: Jaehoon Kim @ 2026-03-23 13:54 UTC
  To: qemu-devel, qemu-block
  Cc: mjrosato, farman, pbonzini, stefanha, fam, armbru, eblake,
	berrange, eduardo, dave, sw, Jaehoon Kim

Refine adaptive polling in aio_poll by updating iothread polling
duration based on weighted AioHandler event intervals.

Each AioHandler's poll.ns is updated using a weighted factor when an
event occurs. Idle handlers accumulate block_ns until poll_max_ns and
then reset to 0, preventing sporadically active handlers from
unnecessarily prolonging iothread polling.
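
As an illustration only, the per-handler update described above can be
modelled in isolation roughly as follows. This is a minimal sketch, not
the patch itself: handler_update() and its parameters are hypothetical
names, and the shift is passed in explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone model of one handler's estimate (poll.ns).
 * handler_update() is an illustrative name, not a QEMU function.
 */
static int64_t handler_update(int64_t ns, int64_t block_ns, bool has_event,
                              unsigned shift, int64_t max_ns)
{
    assert(shift < 63 && ns >= 0 && block_ns >= 0);

    if (has_event) {
        /* Weighted average: the new sample contributes 1/2^shift. */
        ns = ns ? (ns - (ns >> shift)) + (block_ns >> shift) : block_ns;
    } else if (ns != 0) {
        /* Idle handler that was active before: accumulate waiting time. */
        ns += block_ns;
    }

    /* Past the cap, reset so this handler stops influencing polling. */
    return ns > max_ns ? 0 : ns;
}
```

With shift=3 and a cap of 32768 ns, an estimate of 8000 ns that sees a
16000 ns interval moves to 9000 ns, while an idle handler whose
accumulated time exceeds the cap drops back to 0.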

The iothread polling duration is set based on the largest poll.ns among
active handlers. The shrink divider defaults to 2, matching the grow
rate, to reduce frequent poll_ns resets for slow devices.
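
For readers who want to experiment with the grow/shrink behaviour
outside QEMU, a rough stand-alone model might look like this. It is a
sketch under assumptions: struct poll_state and adjust_poll_ns() are
hypothetical names, and adj_block_ns stands for the largest per-handler
estimate among handlers that saw an event.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the context-wide state; not the real AioContext. */
struct poll_state {
    int64_t poll_ns;      /* current polling time in nanoseconds */
    int64_t poll_max_ns;  /* upper bound on the polling time */
    int64_t grow;         /* growth factor */
    int64_t shrink;       /* shrink divider */
};

static void adjust_poll_ns(struct poll_state *s, int64_t adj_block_ns)
{
    assert(s->grow > 0 && s->shrink > 0);

    if (adj_block_ns > s->poll_ns) {
        /* Grow: jump to the estimate if it is far ahead, else multiply. */
        if (adj_block_ns > s->poll_ns * s->grow) {
            s->poll_ns = adj_block_ns;
        } else {
            s->poll_ns *= s->grow;
        }
        if (s->poll_ns > s->poll_max_ns) {
            s->poll_ns = s->poll_max_ns;
        }
    } else if (adj_block_ns < s->poll_ns / s->shrink) {
        /* Shrink only when the estimate is well below the current value. */
        s->poll_ns /= s->shrink;
    }
}
```

Because shrinking divides rather than resetting to 0, a slow device
that briefly looks idle loses half of its polling window instead of
all of it.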

The default weight factor (POLL_WEIGHT_SHIFT=3, meaning the current
interval contributes 12.5% to the weighted average) was selected based
on extensive testing comparing QEMU 10.0.0 baseline vs poll-weight=2
and poll-weight=3 across various workloads.

The table below shows a comparison between:
- Baseline: host RHEL 10.1 GA + qemu-10.0.0-14.el10_1, guest RHEL 9.6 GA
- Patched (w=2/w=3): host RHEL 10.1 GA + qemu-10.0.0-14.el10_1, guest RHEL 9.6 GA
using FIO on FCP and FICON storage with 1 iothread and 8 iothreads.
The values shown are the averages for numjobs 1, 4, and 8.

Summary of results (% change vs baseline):

                    | poll-weight=2      | poll-weight=3
--------------------|--------------------|-------------------
Throughput avg      | -2.4% (all tests)  | -2.2% (all tests)
CPU consumption avg | -10.9% (all tests) | -9.4% (all tests)

Both weight=2 and weight=3 show significant CPU consumption reduction
(~10%) compared to baseline, which addresses the CPU utilization
regression observed in QEMU 10.0.0. The throughput impact is minimal
for both (~2%).

Weight=3 is selected as the default because it provides slightly better
throughput (-2.2% vs -2.4%) while still achieving substantial CPU
savings (-9.4%). The difference between weight=2 and weight=3 is small,
but weight=3 offers a better balance for general-purpose workloads.

Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
---
 include/qemu/aio.h |   4 +-
 util/aio-posix.c   | 135 +++++++++++++++++++++++++++++++--------------
 util/async.c       |   1 +
 3 files changed, 99 insertions(+), 41 deletions(-)

diff --git a/include/qemu/aio.h b/include/qemu/aio.h
index 8cca2360d1..6c77a190e9 100644
--- a/include/qemu/aio.h
+++ b/include/qemu/aio.h
@@ -195,7 +195,8 @@ struct BHListSlice {
 typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
 
 typedef struct AioPolledEvent {
-    int64_t ns;        /* current polling time in nanoseconds */
+    bool has_event; /* Flag to indicate if an event has occurred */
+    int64_t ns;     /* estimated block time in nanoseconds */
 } AioPolledEvent;
 
 struct AioContext {
@@ -306,6 +307,7 @@ struct AioContext {
     int poll_disable_cnt;
 
     /* Polling mode parameters */
+    int64_t poll_ns;        /* current polling time in nanoseconds */
     int64_t poll_max_ns;    /* maximum polling time in nanoseconds */
     int64_t poll_grow;      /* polling time growth factor */
     int64_t poll_shrink;    /* polling time shrink factor */
diff --git a/util/aio-posix.c b/util/aio-posix.c
index b02beb0505..2b3522f2f9 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -29,9 +29,11 @@
 
 /* Stop userspace polling on a handler if it isn't active for some time */
 #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
+#define POLL_WEIGHT_SHIFT   (3)
 
-static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
-                                int64_t block_ns);
+static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
+static void grow_polling_time(AioContext *ctx, int64_t block_ns);
+static void shrink_polling_time(AioContext *ctx, int64_t block_ns);
 
 bool aio_poll_disabled(AioContext *ctx)
 {
@@ -373,7 +375,7 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
          * add the handler to ctx->poll_aio_handlers.
          */
         if (ctx->poll_max_ns && QLIST_IS_INSERTED(node, node_poll)) {
-            adjust_polling_time(ctx, &node->poll, block_ns);
+            node->poll.has_event = true;
         }
     }
 
@@ -560,18 +562,13 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
 static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
                           int64_t *timeout)
 {
-    AioHandler *node;
     int64_t max_ns;
 
     if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
         return false;
     }
 
-    max_ns = 0;
-    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
-        max_ns = MAX(max_ns, node->poll.ns);
-    }
-    max_ns = qemu_soonest_timeout(*timeout, max_ns);
+    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
 
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         /*
@@ -587,46 +584,98 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
     return false;
 }
 
-static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
-                                int64_t block_ns)
+static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
 {
-    if (block_ns <= poll->ns) {
-        /* This is the sweet spot, no adjustment needed */
-    } else if (block_ns > ctx->poll_max_ns) {
-        /* We'd have to poll for too long, poll less */
-        int64_t old = poll->ns;
-
-        if (ctx->poll_shrink) {
-            poll->ns /= ctx->poll_shrink;
-        } else {
-            poll->ns = 0;
-        }
+    /*
+     * Reduce polling time if the block_ns is zero or
+     * less than the current poll_ns.
+     */
+    int64_t old = ctx->poll_ns;
+    int64_t shrink = ctx->poll_shrink;
 
-        trace_poll_shrink(ctx, old, poll->ns);
-    } else if (poll->ns < ctx->poll_max_ns &&
-               block_ns < ctx->poll_max_ns) {
-        /* There is room to grow, poll longer */
-        int64_t old = poll->ns;
-        int64_t grow = ctx->poll_grow;
+    if (shrink == 0) {
+        shrink = 2;
+    }
 
-        if (grow == 0) {
-            grow = 2;
-        }
+    if (block_ns < (ctx->poll_ns / shrink)) {
+        ctx->poll_ns /= shrink;
+    }
 
-        if (poll->ns) {
-            poll->ns *= grow;
-        } else {
-            poll->ns = 4000; /* start polling at 4 microseconds */
-        }
+    trace_poll_shrink(ctx, old, ctx->poll_ns);
+}
 
-        if (poll->ns > ctx->poll_max_ns) {
-            poll->ns = ctx->poll_max_ns;
-        }
+static void grow_polling_time(AioContext *ctx, int64_t block_ns)
+{
+    /* There is room to grow, poll longer */
+    int64_t old = ctx->poll_ns;
+    int64_t grow = ctx->poll_grow;
 
-        trace_poll_grow(ctx, old, poll->ns);
+    if (grow == 0) {
+        grow = 2;
     }
+
+    if (block_ns > ctx->poll_ns * grow) {
+        ctx->poll_ns = block_ns;
+    } else {
+        ctx->poll_ns *= grow;
+    }
+
+    if (ctx->poll_ns > ctx->poll_max_ns) {
+        ctx->poll_ns = ctx->poll_max_ns;
+    }
+
+    trace_poll_grow(ctx, old, ctx->poll_ns);
 }
 
+static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
+{
+    AioHandler *node;
+    int64_t adj_block_ns = -1;
+
+    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
+        if (node->poll.has_event) {
+            /*
+             * Update poll.ns for the node with an event.
+             * Uses a weighted average of the current block_ns and the previous
+             * poll.ns to smooth out polling time adjustments.
+             */
+            node->poll.ns = node->poll.ns
+                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
+                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
+
+            if (node->poll.ns > ctx->poll_max_ns) {
+                node->poll.ns = 0;
+            }
+            /*
+             * To avoid excessive polling time increase, update adj_block_ns
+             * for nodes with the event flag set to true
+             */
+            adj_block_ns = MAX(adj_block_ns, node->poll.ns);
+            node->poll.has_event = false;
+        } else {
+            /*
+             * No event now, but was active before.
+             * If it waits longer than poll_max_ns, poll.ns will stay 0
+             * until the next event arrives.
+             */
+            if (node->poll.ns != 0) {
+                node->poll.ns += block_ns;
+                if (node->poll.ns > ctx->poll_max_ns) {
+                    node->poll.ns = 0;
+                }
+            }
+        }
+    }
+
+    if (adj_block_ns >= 0) {
+        if (adj_block_ns > ctx->poll_ns) {
+            grow_polling_time(ctx, adj_block_ns);
+        } else {
+            shrink_polling_time(ctx, adj_block_ns);
+        }
+    }
+}
+
 bool aio_poll(AioContext *ctx, bool blocking)
 {
     AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
@@ -723,6 +772,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
     aio_free_deleted_handlers(ctx);
 
+    if (ctx->poll_max_ns) {
+        adjust_block_ns(ctx, block_ns);
+    }
+
     qemu_lockcnt_dec(&ctx->list_lock);
 
     progress |= timerlistgroup_run_timers(&ctx->tlg);
@@ -784,6 +837,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
 
     qemu_lockcnt_inc(&ctx->list_lock);
     QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+        node->poll.has_event = false;
         node->poll.ns = 0;
     }
     qemu_lockcnt_dec(&ctx->list_lock);
@@ -794,6 +848,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
     ctx->poll_max_ns = max_ns;
     ctx->poll_grow = grow;
     ctx->poll_shrink = shrink;
+    ctx->poll_ns = 0;
 
     aio_notify(ctx);
 }
diff --git a/util/async.c b/util/async.c
index 80d6b01a8a..9d3627566f 100644
--- a/util/async.c
+++ b/util/async.c
@@ -606,6 +606,7 @@ AioContext *aio_context_new(Error **errp)
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
     ctx->poll_max_ns = 0;
+    ctx->poll_ns = 0;
     ctx->poll_grow = 0;
     ctx->poll_shrink = 0;
 
-- 
2.50.1



* [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-23 13:54 [PATCH RFC v2 0/3] improve aio-polling efficiency Jaehoon Kim
  2026-03-23 13:54 ` [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation Jaehoon Kim
  2026-03-23 13:54 ` [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals Jaehoon Kim
@ 2026-03-23 13:54 ` Jaehoon Kim
  2026-03-25 14:04   ` Markus Armbruster
                     ` (2 more replies)
  2 siblings, 3 replies; 18+ messages in thread
From: Jaehoon Kim @ 2026-03-23 13:54 UTC
  To: qemu-devel, qemu-block
  Cc: mjrosato, farman, pbonzini, stefanha, fam, armbru, eblake,
	berrange, eduardo, dave, sw, Jaehoon Kim

Introduce a configurable poll-weight parameter for adaptive polling
in IOThread. This parameter replaces the hardcoded POLL_WEIGHT_SHIFT
constant, allowing runtime control over how much the most recent
event interval affects the next polling duration calculation.

The poll-weight parameter uses a shift value where larger values
decrease the weight of the current interval, enabling more gradual
adjustments. When set to 0, a default value of 3 is used (meaning
the current interval contributes approximately 1/8 to the weighted
average).
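
To make the shift semantics concrete, the mapping from poll-weight to
the share of the newest interval can be sketched as below. The helper
names effective_weight() and current_interval_share() are hypothetical
and exist only for this illustration; the 0-means-default rule follows
the description above.

```c
#include <assert.h>
#include <stdint.h>

#define POLL_WEIGHT_DEFAULT 3 /* default named in this commit message */

/* Hypothetical helper: map the user-visible value to the shift in use. */
static unsigned effective_weight(int64_t poll_weight)
{
    assert(poll_weight >= 0 && poll_weight < 32);
    return poll_weight == 0 ? POLL_WEIGHT_DEFAULT : (unsigned)poll_weight;
}

/*
 * Share of the newest interval in the weighted average, expressed in
 * hundredths of a percent (1250 means 12.5%).
 */
static unsigned current_interval_share(int64_t poll_weight)
{
    return 10000u >> effective_weight(poll_weight);
}
```

So poll-weight=1 gives the newest interval half of the average,
poll-weight=2 a quarter, and poll-weight=3 (or 0) one eighth.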

This patch also removes the hardcoded default values for poll-grow
and poll-shrink parameters from the grow_polling_time() and
shrink_polling_time() functions, as these defaults are now properly
initialized in iothread.c during IOThread creation.

Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
---
 include/qemu/aio.h                |  4 +++-
 include/system/iothread.h         |  1 +
 iothread.c                        | 34 ++++++++++++++++++++++++++++++-
 monitor/hmp-cmds.c                |  1 +
 qapi/misc.json                    |  7 +++++++
 qapi/qom.json                     |  8 +++++++-
 qemu-options.hx                   |  7 ++++++-
 tests/unit/test-nested-aio-poll.c |  2 +-
 util/aio-posix.c                  | 17 +++++-----------
 util/aio-win32.c                  |  3 ++-
 util/async.c                      |  1 +
 11 files changed, 67 insertions(+), 18 deletions(-)

diff --git a/include/qemu/aio.h b/include/qemu/aio.h
index 6c77a190e9..50b8db2712 100644
--- a/include/qemu/aio.h
+++ b/include/qemu/aio.h
@@ -311,6 +311,7 @@ struct AioContext {
     int64_t poll_max_ns;    /* maximum polling time in nanoseconds */
     int64_t poll_grow;      /* polling time growth factor */
     int64_t poll_shrink;    /* polling time shrink factor */
+    int64_t poll_weight;    /* weight of current interval in calculation */
 
     /* AIO engine parameters */
     int64_t aio_max_batch;  /* maximum number of requests in a batch */
@@ -792,12 +793,13 @@ void aio_context_destroy(AioContext *ctx);
  * @max_ns: how long to busy poll for, in nanoseconds
  * @grow: polling time growth factor
  * @shrink: polling time shrink factor
+ * @weight: weight factor applied to the current polling interval
  *
  * Poll mode can be disabled by setting poll_max_ns to 0.
  */
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
                                  int64_t grow, int64_t shrink,
-                                 Error **errp);
+                                 int64_t weight, Error **errp);
 
 /**
  * aio_context_set_aio_params:
diff --git a/include/system/iothread.h b/include/system/iothread.h
index e26d13c6c7..6ea57ed126 100644
--- a/include/system/iothread.h
+++ b/include/system/iothread.h
@@ -38,6 +38,7 @@ struct IOThread {
     int64_t poll_max_ns;
     int64_t poll_grow;
     int64_t poll_shrink;
+    int64_t poll_weight;
 };
 typedef struct IOThread IOThread;
 
diff --git a/iothread.c b/iothread.c
index caf68e0764..0389b8f7a8 100644
--- a/iothread.c
+++ b/iothread.c
@@ -32,8 +32,14 @@
  * workloads.
  */
 #define IOTHREAD_POLL_MAX_NS_DEFAULT 32768ULL
+#define IOTHREAD_POLL_GROW_DEFAULT 2ULL
+#define IOTHREAD_POLL_SHRINK_DEFAULT 2ULL
+#define IOTHREAD_POLL_WEIGHT_DEFAULT 3ULL
 #else
 #define IOTHREAD_POLL_MAX_NS_DEFAULT 0ULL
+#define IOTHREAD_POLL_GROW_DEFAULT 0ULL
+#define IOTHREAD_POLL_SHRINK_DEFAULT 0ULL
+#define IOTHREAD_POLL_WEIGHT_DEFAULT 0ULL
 #endif
 
 static void *iothread_run(void *opaque)
@@ -103,6 +109,10 @@ static void iothread_instance_init(Object *obj)
     IOThread *iothread = IOTHREAD(obj);
 
     iothread->poll_max_ns = IOTHREAD_POLL_MAX_NS_DEFAULT;
+    iothread->poll_grow = IOTHREAD_POLL_GROW_DEFAULT;
+    iothread->poll_shrink = IOTHREAD_POLL_SHRINK_DEFAULT;
+    iothread->poll_weight = IOTHREAD_POLL_WEIGHT_DEFAULT;
+
     iothread->thread_id = -1;
     qemu_sem_init(&iothread->init_done_sem, 0);
     /* By default, we don't run gcontext */
@@ -164,6 +174,7 @@ static void iothread_set_aio_context_params(EventLoopBase *base, Error **errp)
                                 iothread->poll_max_ns,
                                 iothread->poll_grow,
                                 iothread->poll_shrink,
+                                iothread->poll_weight,
                                 errp);
     if (*errp) {
         return;
@@ -233,6 +244,9 @@ static IOThreadParamInfo poll_grow_info = {
 static IOThreadParamInfo poll_shrink_info = {
     "poll-shrink", offsetof(IOThread, poll_shrink),
 };
+static IOThreadParamInfo poll_weight_info = {
+    "poll-weight", offsetof(IOThread, poll_weight),
+};
 
 static void iothread_get_param(Object *obj, Visitor *v,
         const char *name, IOThreadParamInfo *info, Error **errp)
@@ -260,7 +274,19 @@ static bool iothread_set_param(Object *obj, Visitor *v,
         return false;
     }
 
-    *field = value;
+    if (value == 0) {
+        if (info->offset == offsetof(IOThread, poll_grow)) {
+            *field = IOTHREAD_POLL_GROW_DEFAULT;
+        } else if (info->offset == offsetof(IOThread, poll_shrink)) {
+            *field = IOTHREAD_POLL_SHRINK_DEFAULT;
+        } else if (info->offset == offsetof(IOThread, poll_weight)) {
+            *field = IOTHREAD_POLL_WEIGHT_DEFAULT;
+        } else {
+            *field = value;
+        }
+    } else {
+        *field = value;
+    }
 
     return true;
 }
@@ -288,6 +314,7 @@ static void iothread_set_poll_param(Object *obj, Visitor *v,
                                     iothread->poll_max_ns,
                                     iothread->poll_grow,
                                     iothread->poll_shrink,
+                                    iothread->poll_weight,
                                     errp);
     }
 }
@@ -311,6 +338,10 @@ static void iothread_class_init(ObjectClass *klass, const void *class_data)
                               iothread_get_poll_param,
                               iothread_set_poll_param,
                               NULL, &poll_shrink_info);
+    object_class_property_add(klass, "poll-weight", "int",
+                              iothread_get_poll_param,
+                              iothread_set_poll_param,
+                              NULL, &poll_weight_info);
 }
 
 static const TypeInfo iothread_info = {
@@ -356,6 +387,7 @@ static int query_one_iothread(Object *object, void *opaque)
     info->poll_max_ns = iothread->poll_max_ns;
     info->poll_grow = iothread->poll_grow;
     info->poll_shrink = iothread->poll_shrink;
+    info->poll_weight = iothread->poll_weight;
     info->aio_max_batch = iothread->parent_obj.aio_max_batch;
 
     QAPI_LIST_APPEND(*tail, info);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index bad034937a..75b6e7fa65 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -206,6 +206,7 @@ void hmp_info_iothreads(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "  poll-max-ns=%" PRId64 "\n", value->poll_max_ns);
         monitor_printf(mon, "  poll-grow=%" PRId64 "\n", value->poll_grow);
         monitor_printf(mon, "  poll-shrink=%" PRId64 "\n", value->poll_shrink);
+        monitor_printf(mon, "  poll-weight=%" PRId64 "\n", value->poll_weight);
         monitor_printf(mon, "  aio-max-batch=%" PRId64 "\n",
                        value->aio_max_batch);
     }
diff --git a/qapi/misc.json b/qapi/misc.json
index 28c641fe2f..39d17010bc 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -85,6 +85,12 @@
 # @poll-shrink: how many ns will be removed from polling time, 0 means
 #     that it's not configured (since 2.9)
 #
+# @poll-weight: the weight factor for adaptive polling.
+#     Determines how much the current event interval contributes to
+#     the next polling time calculation.  Valid values are 1 or
+#     greater.  0 selects the system default value which is currently 3
+#     (since 10.2)
+#
 # @aio-max-batch: maximum number of requests in a batch for the AIO
 #     engine, 0 means that the engine will use its default (since 6.1)
 #
@@ -96,6 +102,7 @@
            'poll-max-ns': 'int',
            'poll-grow': 'int',
            'poll-shrink': 'int',
+           'poll-weight': 'int',
            'aio-max-batch': 'int' } }
 
 ##
diff --git a/qapi/qom.json b/qapi/qom.json
index c653248f85..feb80b6cfe 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -606,6 +606,11 @@
 #     algorithm detects it is spending too long polling without
 #     encountering events.  0 selects a default behaviour (default: 0)
 #
+# @poll-weight: the weight factor for adaptive polling.
+#     Determines how much the current event interval contributes to
+#     the next polling time calculation.  Valid values are 1 or
+#     greater.  If set to 0, the default value of 3 is used.
+#
 # The @aio-max-batch option is available since 6.1.
 #
 # Since: 2.0
@@ -614,7 +619,8 @@
   'base': 'EventLoopBaseProperties',
   'data': { '*poll-max-ns': 'int',
             '*poll-grow': 'int',
-            '*poll-shrink': 'int' } }
+            '*poll-shrink': 'int',
+            '*poll-weight': 'int' } }
 
 ##
 # @MainLoopProperties:
diff --git a/qemu-options.hx b/qemu-options.hx
index 69e5a874c1..8ddf6c8d36 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -6413,7 +6413,7 @@ SRST
 
             CN=laptop.example.com,O=Example Home,L=London,ST=London,C=GB
 
-    ``-object iothread,id=id,poll-max-ns=poll-max-ns,poll-grow=poll-grow,poll-shrink=poll-shrink,aio-max-batch=aio-max-batch``
+    ``-object iothread,id=id,poll-max-ns=poll-max-ns,poll-grow=poll-grow,poll-shrink=poll-shrink,poll-weight=poll-weight,aio-max-batch=aio-max-batch``
         Creates a dedicated event loop thread that devices can be
         assigned to. This is known as an IOThread. By default device
         emulation happens in vCPU threads or the main event loop thread.
@@ -6449,6 +6449,11 @@ SRST
         the polling time when the algorithm detects it is spending too
         long polling without encountering events.
 
+        The ``poll-weight`` parameter is the weight factor used in the
+        adaptive polling algorithm. It determines how much the most
+        recent event interval affects the calculation of the next
+        polling duration.
+
         The ``aio-max-batch`` parameter is the maximum number of requests
         in a batch for the AIO engine, 0 means that the engine will use
         its default.
diff --git a/tests/unit/test-nested-aio-poll.c b/tests/unit/test-nested-aio-poll.c
index 9ab1ad08a7..4c38f36fd4 100644
--- a/tests/unit/test-nested-aio-poll.c
+++ b/tests/unit/test-nested-aio-poll.c
@@ -81,7 +81,7 @@ static void test(void)
     qemu_set_current_aio_context(td.ctx);
 
     /* Enable polling */
-    aio_context_set_poll_params(td.ctx, 1000000, 2, 2, &error_abort);
+    aio_context_set_poll_params(td.ctx, 1000000, 2, 2, 3, &error_abort);
 
     /* Make the event notifier active (set) right away */
     event_notifier_init(&td.poll_notifier, 1);
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 2b3522f2f9..13b7f94911 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -29,7 +29,6 @@
 
 /* Stop userspace polling on a handler if it isn't active for some time */
 #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
-#define POLL_WEIGHT_SHIFT   (3)
 
 static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
 static void grow_polling_time(AioContext *ctx, int64_t block_ns);
@@ -593,10 +592,6 @@ static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
     int64_t old = ctx->poll_ns;
     int64_t shrink = ctx->poll_shrink;
 
-    if (shrink == 0) {
-        shrink = 2;
-    }
-
     if (block_ns < (ctx->poll_ns / shrink)) {
         ctx->poll_ns /= shrink;
     }
@@ -610,10 +605,6 @@ static void grow_polling_time(AioContext *ctx, int64_t block_ns)
     int64_t old = ctx->poll_ns;
     int64_t grow = ctx->poll_grow;
 
-    if (grow == 0) {
-        grow = 2;
-    }
-
     if (block_ns > ctx->poll_ns * grow) {
         ctx->poll_ns = block_ns;
     } else {
@@ -640,8 +631,8 @@ static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
              * poll.ns to smooth out polling time adjustments.
              */
             node->poll.ns = node->poll.ns
-                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
-                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
+                ? (node->poll.ns - (node->poll.ns >> ctx->poll_weight))
+                + (block_ns >> ctx->poll_weight) : block_ns;
 
             if (node->poll.ns > ctx->poll_max_ns) {
                 node->poll.ns = 0;
@@ -831,7 +822,8 @@ void aio_context_destroy(AioContext *ctx)
 }
 
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
-                                 int64_t grow, int64_t shrink, Error **errp)
+                                 int64_t grow, int64_t shrink,
+                                 int64_t weight, Error **errp)
 {
     AioHandler *node;
 
@@ -848,6 +840,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
     ctx->poll_max_ns = max_ns;
     ctx->poll_grow = grow;
     ctx->poll_shrink = shrink;
+    ctx->poll_weight = weight;
     ctx->poll_ns = 0;
 
     aio_notify(ctx);
diff --git a/util/aio-win32.c b/util/aio-win32.c
index 6e6f699e4b..1985843233 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -429,7 +429,8 @@ void aio_context_destroy(AioContext *ctx)
 }
 
 void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
-                                 int64_t grow, int64_t shrink, Error **errp)
+                                 int64_t grow, int64_t shrink,
+                                 int64_t weight, Error **errp)
 {
     if (max_ns) {
         error_setg(errp, "AioContext polling is not implemented on Windows");
diff --git a/util/async.c b/util/async.c
index 9d3627566f..741fcfd6a7 100644
--- a/util/async.c
+++ b/util/async.c
@@ -609,6 +609,7 @@ AioContext *aio_context_new(Error **errp)
     ctx->poll_ns = 0;
     ctx->poll_grow = 0;
     ctx->poll_shrink = 0;
+    ctx->poll_weight = 0;
 
     ctx->aio_max_batch = 0;
 
-- 
2.50.1



* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-23 13:54 ` [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll Jaehoon Kim
@ 2026-03-25 14:04   ` Markus Armbruster
  2026-03-26 15:55     ` JAEHOON KIM
  2026-03-25 16:52   ` Stefan Hajnoczi
  2026-03-25 16:56   ` Stefan Hajnoczi
  2 siblings, 1 reply; 18+ messages in thread
From: Markus Armbruster @ 2026-03-25 14:04 UTC (permalink / raw)
  To: Jaehoon Kim
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, stefanha, fam,
	armbru, eblake, berrange, eduardo, dave, sw

Jaehoon Kim <jhkim@linux.ibm.com> writes:

> Introduce a configurable poll-weight parameter for adaptive polling
> in IOThread. This parameter replaces the hardcoded POLL_WEIGHT_SHIFT
> constant, allowing runtime control over how much the most recent
> event interval affects the next polling duration calculation.
>
> The poll-weight parameter uses a shift value where larger values
> decrease the weight of the current interval, enabling more gradual
> adjustments. When set to 0, a default value of 3 is used (meaning
> the current interval contributes approximately 1/8 to the weighted
> average).
>
> This patch also removes the hardcoded default values for poll-grow
> and poll-shrink parameters from the grow_polling_time() and
> shrink_polling_time() functions, as these defaults are now properly
> initialized in iothread.c during IOThread creation.
>
> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>

[...]

> diff --git a/qapi/misc.json b/qapi/misc.json
> index 28c641fe2f..39d17010bc 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -85,6 +85,12 @@

Note: this is IOThreadInfo, used only as return value of
query-iothreads.

>  # @poll-shrink: how many ns will be removed from polling time, 0 means
>  #     that it's not configured (since 2.9)
>  #
> +# @poll-weight: the weight factor for adaptive polling.
> +#     Determines how much the current event interval contributes to
> +#     the next polling time calculation.  Valid values are 1 or
> +#     greater.  0 selects the system default value which is current 3

Does query-iothreads actually return 0?  I'd expect it to return the
value that is actually used.

> +#     (since 10.2)

11.1 most likely.

> +#
>  # @aio-max-batch: maximum number of requests in a batch for the AIO
>  #     engine, 0 means that the engine will use its default (since 6.1)
>  #
> @@ -96,6 +102,7 @@
>             'poll-max-ns': 'int',
>             'poll-grow': 'int',
>             'poll-shrink': 'int',
> +           'poll-weight': 'int',
>             'aio-max-batch': 'int' } }
>  
>  ##
> diff --git a/qapi/qom.json b/qapi/qom.json
> index c653248f85..feb80b6cfe 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -606,6 +606,11 @@
>  #     algorithm detects it is spending too long polling without
>  #     encountering events.  0 selects a default behaviour (default: 0)
>  #
> +# @poll-weight: the weight factor for adaptive polling.
> +#     Determines how much the current event interval contributes to
> +#     the next polling time calculation.  Valid values are 1 or
> +#     greater.  If set to 0, the default value of 3 is used.

The commit message hints at what the valid values mean; the doc comment
doesn't even do that.  Do users need to know?

Code [*] below uses it like time >> poll_weight, where @time is int64_t.
poll_weight > 63 is undefined behavior, which is a no-no.  Please reject
such values.  poll_weight == 63 results in zero.  Is that useful?
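
A minimal sketch of such a range check, in plain C.  The names
poll_weight_valid(), weighted_share() and POLL_WEIGHT_MAX are
hypothetical and not part of the patch; the bound is chosen only so that
the shift count stays well-defined for a 64-bit operand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bound: shift counts of 64 or more are undefined for int64_t. */
#define POLL_WEIGHT_MAX 63

/* Returns true if weight can safely be used as a shift count for int64_t. */
static bool poll_weight_valid(int64_t weight)
{
    return weight >= 0 && weight <= POLL_WEIGHT_MAX;
}

/* Share of ns that a sample contributes for a given (validated) weight. */
static int64_t weighted_share(int64_t ns, int64_t weight)
{
    assert(poll_weight_valid(weight));
    return ns >> weight;
}
```

A setter could call poll_weight_valid() and report an error instead of
storing the value.  Whether the upper bound should be lower than 63 is a
separate question, since very large weights make the sample's share
vanish.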

Missing: (default: 0) (since 11.1)

> +#
>  # The @aio-max-batch option is available since 6.1.
>  #
>  # Since: 2.0
> @@ -614,7 +619,8 @@
>    'base': 'EventLoopBaseProperties',
>    'data': { '*poll-max-ns': 'int',
>              '*poll-grow': 'int',
> -            '*poll-shrink': 'int' } }
> +            '*poll-shrink': 'int',
> +            '*poll-weight': 'int' } }
>  
>  ##
>  # @MainLoopProperties:
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 69e5a874c1..8ddf6c8d36 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -6413,7 +6413,7 @@ SRST
>  
>              CN=laptop.example.com,O=Example Home,L=London,ST=London,C=GB
>  
> -    ``-object iothread,id=id,poll-max-ns=poll-max-ns,poll-grow=poll-grow,poll-shrink=poll-shrink,aio-max-batch=aio-max-batch``
> +    ``-object iothread,id=id,poll-max-ns=poll-max-ns,poll-grow=poll-grow,poll-shrink=poll-shrink,poll-weight=poll-weight,aio-max-batch=aio-max-batch``
>          Creates a dedicated event loop thread that devices can be
>          assigned to. This is known as an IOThread. By default device
>          emulation happens in vCPU threads or the main event loop thread.
> @@ -6449,6 +6449,11 @@ SRST
>          the polling time when the algorithm detects it is spending too
>          long polling without encountering events.
>  
> +        The ``poll-weight`` parameter is the weight factor used in the
> +        adaptive polling algorithm. It determines how much the most
> +        recent event interval affects the calculation of the next
> +        polling duration.
> +
>          The ``aio-max-batch`` parameter is the maximum number of requests
>          in a batch for the AIO engine, 0 means that the engine will use
>          its default.
> diff --git a/tests/unit/test-nested-aio-poll.c b/tests/unit/test-nested-aio-poll.c
> index 9ab1ad08a7..4c38f36fd4 100644
> --- a/tests/unit/test-nested-aio-poll.c
> +++ b/tests/unit/test-nested-aio-poll.c
> @@ -81,7 +81,7 @@ static void test(void)
>      qemu_set_current_aio_context(td.ctx);
>  
>      /* Enable polling */
> -    aio_context_set_poll_params(td.ctx, 1000000, 2, 2, &error_abort);
> +    aio_context_set_poll_params(td.ctx, 1000000, 2, 2, 3, &error_abort);
>  
>      /* Make the event notifier active (set) right away */
>      event_notifier_init(&td.poll_notifier, 1);
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 2b3522f2f9..13b7f94911 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -29,7 +29,6 @@
>  
>  /* Stop userspace polling on a handler if it isn't active for some time */
>  #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
> -#define POLL_WEIGHT_SHIFT   (3)
>  
>  static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
>  static void grow_polling_time(AioContext *ctx, int64_t block_ns);
> @@ -593,10 +592,6 @@ static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
>      int64_t old = ctx->poll_ns;
>      int64_t shrink = ctx->poll_shrink;
>  
> -    if (shrink == 0) {
> -        shrink = 2;
> -    }
> -
>      if (block_ns < (ctx->poll_ns / shrink)) {
>          ctx->poll_ns /= shrink;
>      }
> @@ -610,10 +605,6 @@ static void grow_polling_time(AioContext *ctx, int64_t block_ns)
>      int64_t old = ctx->poll_ns;
>      int64_t grow = ctx->poll_grow;
>  
> -    if (grow == 0) {
> -        grow = 2;
> -    }
> -
>      if (block_ns > ctx->poll_ns * grow) {
>          ctx->poll_ns = block_ns;
>      } else {
> @@ -640,8 +631,8 @@ static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
>               * poll.ns to smooth out polling time adjustments.
>               */
>              node->poll.ns = node->poll.ns
> -                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
> -                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
> +                ? (node->poll.ns - (node->poll.ns >> ctx->poll_weight))
> +                + (block_ns >> ctx->poll_weight) : block_ns;

[*] This is the use of @poll-weight referred to above.
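
For illustration, the weighted update in the hunk above can be written
as a standalone C function.  ewma_update() is a hypothetical name; the
arithmetic mirrors the quoted expression, with shift standing in for
ctx->poll_weight.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blend a new sample into a running estimate.
 * The sample contributes roughly 1/2^shift of the result.
 * A previous estimate of 0 means "no history", so the sample is used as-is.
 */
static int64_t ewma_update(int64_t prev_ns, int64_t block_ns, unsigned shift)
{
    assert(shift < 64); /* larger shift counts are undefined behavior */
    if (prev_ns == 0) {
        return block_ns;
    }
    return (prev_ns - (prev_ns >> shift)) + (block_ns >> shift);
}
```

With shift 3, a previous estimate of 8000 ns and a new interval of
16000 ns give 8000 - 1000 + 2000 = 9000 ns.  With shift 1 the same
inputs give 4000 + 8000 = 12000 ns, so smaller shifts react faster.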

>  
>              if (node->poll.ns > ctx->poll_max_ns) {
>                  node->poll.ns = 0;
> @@ -831,7 +822,8 @@ void aio_context_destroy(AioContext *ctx)
>  }
>  
>  void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> -                                 int64_t grow, int64_t shrink, Error **errp)
> +                                 int64_t grow, int64_t shrink,
> +                                 int64_t weight, Error **errp)
>  {
>      AioHandler *node;
>  
> @@ -848,6 +840,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>      ctx->poll_max_ns = max_ns;
>      ctx->poll_grow = grow;
>      ctx->poll_shrink = shrink;
> +    ctx->poll_weight = weight;
>      ctx->poll_ns = 0;
>  
>      aio_notify(ctx);

[...]




* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-23 13:54 ` [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll Jaehoon Kim
  2026-03-25 14:04   ` Markus Armbruster
@ 2026-03-25 16:52   ` Stefan Hajnoczi
  2026-03-25 16:56   ` Stefan Hajnoczi
  2 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2026-03-25 16:52 UTC (permalink / raw)
  To: Jaehoon Kim
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On Mon, Mar 23, 2026 at 08:54:51AM -0500, Jaehoon Kim wrote:
> diff --git a/qapi/misc.json b/qapi/misc.json
> index 28c641fe2f..39d17010bc 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -85,6 +85,12 @@
>  # @poll-shrink: how many ns will be removed from polling time, 0 means
>  #     that it's not configured (since 2.9)
>  #
> +# @poll-weight: the weight factor for adaptive polling.
> +#     Determines how much the current event interval contributes to
> +#     the next polling time calculation.  Valid values are 1 or
> +#     greater.  0 selects the system default value which is current 3
> +#     (since 10.2)

QEMU 11.0 is already in hard freeze, so this patch will go into 11.1:

  (since 11.1)

> +#
>  # @aio-max-batch: maximum number of requests in a batch for the AIO
>  #     engine, 0 means that the engine will use its default (since 6.1)
>  #
> @@ -96,6 +102,7 @@
>             'poll-max-ns': 'int',
>             'poll-grow': 'int',
>             'poll-shrink': 'int',
> +           'poll-weight': 'int',
>             'aio-max-batch': 'int' } }
>  
>  ##
> diff --git a/qapi/qom.json b/qapi/qom.json
> index c653248f85..feb80b6cfe 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -606,6 +606,11 @@
>  #     algorithm detects it is spending too long polling without
>  #     encountering events.  0 selects a default behaviour (default: 0)
>  #
> +# @poll-weight: the weight factor for adaptive polling.
> +#     Determines how much the current event interval contributes to
> +#     the next polling time calculation.  Valid values are 1 or
> +#     greater.  If set to 0, the default value of 3 is used.

(since 11.1)

Other than that:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-23 13:54 ` [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll Jaehoon Kim
  2026-03-25 14:04   ` Markus Armbruster
  2026-03-25 16:52   ` Stefan Hajnoczi
@ 2026-03-25 16:56   ` Stefan Hajnoczi
  2026-03-26 16:13     ` JAEHOON KIM
  2 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2026-03-25 16:56 UTC (permalink / raw)
  To: Jaehoon Kim
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On Mon, Mar 23, 2026 at 08:54:51AM -0500, Jaehoon Kim wrote:
> @@ -831,7 +822,8 @@ void aio_context_destroy(AioContext *ctx)
>  }
>  
>  void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> -                                 int64_t grow, int64_t shrink, Error **errp)
> +                                 int64_t grow, int64_t shrink,
> +                                 int64_t weight, Error **errp)
>  {
>      AioHandler *node;
>  
> @@ -848,6 +840,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>      ctx->poll_max_ns = max_ns;
>      ctx->poll_grow = grow;
>      ctx->poll_shrink = shrink;
> +    ctx->poll_weight = weight;
>      ctx->poll_ns = 0;
>  
>      aio_notify(ctx);

On second thought, now that the divide-by-0 protection has been removed
and these fields are assumed to hold a valid value when poll_max_ns > 0,
aio_context_set_poll_params() needs the same 0-protection as
iothread_set_param().
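
A rough sketch of what that 0-protection could look like, as standalone
C.  The PollParams struct and both helper names are hypothetical
stand-ins, not the real AioContext or the real function body.  The
defaults used (2 for grow and shrink, 3 for weight) are the ones
mentioned in this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the polling fields of AioContext. */
typedef struct PollParams {
    int64_t grow;
    int64_t shrink;
    int64_t weight;
} PollParams;

/* Replace 0 ("not configured") with a usable default. */
static int64_t value_or_default(int64_t value, int64_t def)
{
    return value ? value : def;
}

static void poll_params_set(PollParams *p, int64_t grow, int64_t shrink,
                            int64_t weight)
{
    p->grow = value_or_default(grow, 2);
    p->shrink = value_or_default(shrink, 2);
    p->weight = value_or_default(weight, 3);
    /* Later code may divide by shrink and multiply by grow unchecked. */
    assert(p->grow != 0 && p->shrink != 0);
}
```

Doing the substitution in one place means every caller, including the
unit test, gets the same guarantee.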


* Re: [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation
  2026-03-23 13:54 ` [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation Jaehoon Kim
@ 2026-03-25 17:22   ` Stefan Hajnoczi
  2026-03-26 18:17     ` JAEHOON KIM
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2026-03-25 17:22 UTC (permalink / raw)
  To: Jaehoon Kim
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On Mon, Mar 23, 2026 at 08:54:49AM -0500, Jaehoon Kim wrote:
> Nodes are no longer added to poll_aio_handlers when adaptive polling is
> disabled, preventing unnecessary try_poll_mode() calls. Additionally,
> aio_poll() skips try_poll_mode() when timeout is 0.

Skipping when timeout is 0 seems risky to me. VIRTIO devices disable
guest kicks when polling mode is started. When aio_poll(ctx,
blocking=false) is called, we will skip polling and
ctx->fdmon_ops->need_wait(ctx) won't detect an event either. aio_poll()
will return without noticing that the VIRTIO device's AioHandler is
ready.

Is skipping when timeout is 0 necessary for performance or can it be
dropped from the patch?

Aside from this:

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>


* Re: [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals
  2026-03-23 13:54 ` [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals Jaehoon Kim
@ 2026-03-25 20:37   ` Stefan Hajnoczi
  2026-03-27  5:02     ` JAEHOON KIM
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2026-03-25 20:37 UTC (permalink / raw)
  To: Jaehoon Kim
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On Mon, Mar 23, 2026 at 08:54:50AM -0500, Jaehoon Kim wrote:
> Refine adaptive polling in aio_poll by updating iothread polling
> duration based on weighted AioHandler event intervals.
> 
> Each AioHandler's poll.ns is updated using a weighted factor when an
> event occurs. Idle handlers accumulate block_ns until poll_max_ns and
> then reset to 0, preventing sporadically active handlers from
> unnecessarily prolonging iothread polling.
> 
> The iothread polling duration is set based on the largest poll.ns among
> active handlers. The shrink divider defaults to 2, matching the grow
> rate, to reduce frequent poll_ns resets for slow devices.
> 
> The default weight factor (POLL_WEIGHT_SHIFT=3, meaning the current
> interval contributes 12.5% to the weighted average) was selected based
> on extensive testing comparing QEMU 10.0.0 baseline vs poll-weight=2
> and poll-weight=3 across various workloads.
> 
> The table below shows a comparison between:
> -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1, Guest: RHEL 9.6GA vs.
> -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1 (w=2/w=3), Guest: RHEL 9.6GA
> for FIO FCP and FICON with 1 iothread and 8 iothreads.
> The values shown are the averages for numjobs 1, 4, and 8.
> 
> Summary of results (% change vs baseline):
> 
>                     | poll-weight=2      | poll-weight=3
> --------------------|--------------------|-----------------
> Throughput avg      | -2.4% (all tests)  | -2.2% (all tests)
> CPU consumption avg | -10.9% (all tests) | -9.4% (all tests)
> 
> Both weight=2 and weight=3 show significant CPU consumption reduction
> (~10%) compared to baseline, which addresses the CPU utilization
> regression observed in QEMU 10.0.0. The throughput impact is minimal
> for both (~2%).
> 
> Weight=3 is selected as the default because it provides slightly better
> throughput (-2.2% vs -2.4%) while still achieving substantial CPU
> savings (-9.4%). The difference between weight=2 and weight=3 is small,
> but weight=3 offers a better balance for general-purpose workloads.
> 
> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
> ---
>  include/qemu/aio.h |   4 +-
>  util/aio-posix.c   | 135 +++++++++++++++++++++++++++++++--------------
>  util/async.c       |   1 +
>  3 files changed, 99 insertions(+), 41 deletions(-)
> 
> diff --git a/include/qemu/aio.h b/include/qemu/aio.h
> index 8cca2360d1..6c77a190e9 100644
> --- a/include/qemu/aio.h
> +++ b/include/qemu/aio.h
> @@ -195,7 +195,8 @@ struct BHListSlice {
>  typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
>  
>  typedef struct AioPolledEvent {
> -    int64_t ns;        /* current polling time in nanoseconds */
> +    bool has_event; /* Flag to indicate if an event has occurred */
> +    int64_t ns;     /* estimated block time in nanoseconds */
>  } AioPolledEvent;
>  
>  struct AioContext {
> @@ -306,6 +307,7 @@ struct AioContext {
>      int poll_disable_cnt;
>  
>      /* Polling mode parameters */
> +    int64_t poll_ns;        /* current polling time in nanoseconds */
>      int64_t poll_max_ns;    /* maximum polling time in nanoseconds */
>      int64_t poll_grow;      /* polling time growth factor */
>      int64_t poll_shrink;    /* polling time shrink factor */
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index b02beb0505..2b3522f2f9 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -29,9 +29,11 @@
>  
>  /* Stop userspace polling on a handler if it isn't active for some time */
>  #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
> +#define POLL_WEIGHT_SHIFT   (3)
>  
> -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
> -                                int64_t block_ns);
> +static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
> +static void grow_polling_time(AioContext *ctx, int64_t block_ns);
> +static void shrink_polling_time(AioContext *ctx, int64_t block_ns);
>  
>  bool aio_poll_disabled(AioContext *ctx)
>  {
> @@ -373,7 +375,7 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
>           * add the handler to ctx->poll_aio_handlers.

This comment refers to adjusting the polling time. The code no longer
does this and the comment should be updated.

>           */
>          if (ctx->poll_max_ns && QLIST_IS_INSERTED(node, node_poll)) {
> -            adjust_polling_time(ctx, &node->poll, block_ns);

aio_dispatch_ready_handlers() no longer uses the block_ns argument. It
can be removed.

> +            node->poll.has_event = true;
>          }
>      }
>  
> @@ -560,18 +562,13 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>  static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>                            int64_t *timeout)
>  {
> -    AioHandler *node;
>      int64_t max_ns;
>  
>      if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
>          return false;
>      }
>  
> -    max_ns = 0;
> -    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> -        max_ns = MAX(max_ns, node->poll.ns);
> -    }
> -    max_ns = qemu_soonest_timeout(*timeout, max_ns);
> +    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
>  
>      if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
>          /*
> @@ -587,46 +584,98 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>      return false;
>  }
>  
> -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
> -                                int64_t block_ns)
> +static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
>  {
> -    if (block_ns <= poll->ns) {
> -        /* This is the sweet spot, no adjustment needed */
> -    } else if (block_ns > ctx->poll_max_ns) {
> -        /* We'd have to poll for too long, poll less */
> -        int64_t old = poll->ns;
> -
> -        if (ctx->poll_shrink) {
> -            poll->ns /= ctx->poll_shrink;
> -        } else {
> -            poll->ns = 0;
> -        }
> +    /*
> +     * Reduce polling time if the block_ns is zero or
> +     * less than the current poll_ns.
> +     */
> +    int64_t old = ctx->poll_ns;
> +    int64_t shrink = ctx->poll_shrink;
>  
> -        trace_poll_shrink(ctx, old, poll->ns);
> -    } else if (poll->ns < ctx->poll_max_ns &&
> -               block_ns < ctx->poll_max_ns) {
> -        /* There is room to grow, poll longer */
> -        int64_t old = poll->ns;
> -        int64_t grow = ctx->poll_grow;
> +    if (shrink == 0) {
> +        shrink = 2;
> +    }
>  
> -        if (grow == 0) {
> -            grow = 2;
> -        }
> +    if (block_ns < (ctx->poll_ns / shrink)) {
> +        ctx->poll_ns /= shrink;
> +    }
>  
> -        if (poll->ns) {
> -            poll->ns *= grow;
> -        } else {
> -            poll->ns = 4000; /* start polling at 4 microseconds */
> -        }
> +    trace_poll_shrink(ctx, old, ctx->poll_ns);

This trace event should be inside if (block_ns < (ctx->poll_ns /
shrink)) like it was before this patch.
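
As a sketch of that placement, in standalone C: trace_shrink() and the
trace_calls counter are hypothetical stand-ins for the real trace event,
and the function takes plain values instead of an AioContext.

```c
#include <assert.h>
#include <stdint.h>

static int trace_calls; /* counts calls to the stand-in trace event */

/* Hypothetical stand-in for trace_poll_shrink(). */
static void trace_shrink(int64_t old_ns, int64_t new_ns)
{
    (void)old_ns;
    (void)new_ns;
    trace_calls++;
}

/* Shrink *poll_ns and emit the trace only when the shrink branch is taken. */
static void shrink_polling_time(int64_t *poll_ns, int64_t shrink,
                                int64_t block_ns)
{
    int64_t old = *poll_ns;

    assert(shrink > 0); /* the caller must have replaced 0 with a default */
    if (block_ns < old / shrink) {
        *poll_ns = old / shrink;
        trace_shrink(old, *poll_ns);
    }
}
```

This keeps the trace log free of entries where the old and new values
are identical because nothing was adjusted.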

> +}
>  
> -        if (poll->ns > ctx->poll_max_ns) {
> -            poll->ns = ctx->poll_max_ns;
> -        }
> +static void grow_polling_time(AioContext *ctx, int64_t block_ns)
> +{
> +    /* There is room to grow, poll longer */
> +    int64_t old = ctx->poll_ns;
> +    int64_t grow = ctx->poll_grow;
>  
> -        trace_poll_grow(ctx, old, poll->ns);
> +    if (grow == 0) {
> +        grow = 2;
>      }
> +
> +    if (block_ns > ctx->poll_ns * grow) {
> +        ctx->poll_ns = block_ns;
> +    } else {
> +        ctx->poll_ns *= grow;
> +    }
> +
> +    if (ctx->poll_ns > ctx->poll_max_ns) {
> +        ctx->poll_ns = ctx->poll_max_ns;
> +    }
> +
> +    trace_poll_grow(ctx, old, ctx->poll_ns);

Same here.

>  }
>  
> +static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
> +{
> +    AioHandler *node;
> +    int64_t adj_block_ns = -1;
> +
> +    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> +        if (node->poll.has_event) {

Did you consider unifying node->poll.has_event with
node->poll_idle_timeout, which is assigned now + POLL_IDLE_INTERVAL_NS
every time ->io_poll() detects an event?

For instance, rename node->poll_idle_timeout to
node->last_event_timestamp and assign now without adding
POLL_IDLE_INTERVAL_NS. Then use the field for both idle node removal and
adjust_block_ns() (pass in now).
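
Roughly, and only as a sketch in standalone C: the Handler struct and
both helpers are hypothetical.  In particular, deriving "had an event"
by comparing the timestamp against the start of the current iteration is
an assumption about how adjust_block_ns() might use the field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POLL_IDLE_INTERVAL_NS (7 * INT64_C(1000000000))

/* Hypothetical reduced handler: one timestamp replaces flag and deadline. */
typedef struct Handler {
    int64_t last_event_timestamp; /* set to "now" when an event is detected */
} Handler;

/* Idle removal: no event seen for at least the idle interval. */
static bool handler_is_idle(const Handler *h, int64_t now)
{
    assert(now >= h->last_event_timestamp); /* monotonic clock assumed */
    return now - h->last_event_timestamp >= POLL_IDLE_INTERVAL_NS;
}

/* Assumed replacement for has_event: an event arrived in this iteration. */
static bool handler_had_event(const Handler *h, int64_t iteration_start)
{
    return h->last_event_timestamp >= iteration_start;
}
```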

> +            /*
> +             * Update poll.ns for the node with an event.
> +             * Uses a weighted average of the current block_ns and the previous
> +             * poll.ns to smooth out polling time adjustments.
> +             */
> +            node->poll.ns = node->poll.ns
> +                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
> +                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
> +
> +            if (node->poll.ns > ctx->poll_max_ns) {
> +                node->poll.ns = 0;
> +            }

Previously:
-        if (poll->ns > ctx->poll_max_ns) {
-            poll->ns = ctx->poll_max_ns;
-        }

Was this causing excessive CPU consumption in your benchmarks?

Can you explain the rationale for zeroing the poll time? Aside from
reducing CPU consumption, it also reduces the chance that polling will
succeed and could therefore impact performance.

I'm asking about this because this patch makes several changes at once
and I'm not sure how the CPU usage and performance changes are
attributed to these multiple changes. I want to make sure the changes
merged are minimal and the best set - sometimes when multiple things are
changed at the same time, not all of them are beneficial.
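
To make the difference concrete, here are the two policies side by side
as standalone C helpers.  Both function names are hypothetical; each
body corresponds to one of the snippets quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Previous policy: a value above the maximum is capped at the maximum. */
static int64_t clamp_to_max(int64_t ns, int64_t max_ns)
{
    assert(max_ns >= 0);
    return ns > max_ns ? max_ns : ns;
}

/* Policy in this patch: a value above the maximum is dropped to 0. */
static int64_t reset_above_max(int64_t ns, int64_t max_ns)
{
    assert(max_ns >= 0);
    return ns > max_ns ? 0 : ns;
}
```

With a maximum of 32000 ns, a value of 40000 ns becomes 32000 ns under
the first policy and 0 under the second.  Values at or below the maximum
are unchanged by both.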

> +            /*
> +             * To avoid excessive polling time increase, update adj_block_ns
> +             * for nodes with the event flag set to true
> +             */
> +            adj_block_ns = MAX(adj_block_ns, node->poll.ns);

adj_block_ns is not the blocking time, it's the maximum current poll
time across all nodes. It would be clearer to change the variable name.

> +            node->poll.has_event = false;
> +         } else {

4-space indentation should be used.

> +            /*
> +             * No event now, but was active before.
> +             * If it waits longer than poll_max_ns, poll.ns will stay 0
> +             * until the next event arrives.
> +             */
> +            if (node->poll.ns != 0) {
> +                node->poll.ns += block_ns;

Why is block_ns being added to a recently inactive node's polling time?
Here node->poll.ns no longer measures the weighted time until the
handler had an event.

If the goal is to get rid of inactive nodes, then maybe the idle handler
removal mechanism should be made more aggressive instead?

> +                if (node->poll.ns > ctx->poll_max_ns) {
> +                    node->poll.ns = 0;
> +                }
> +            }
> +        }
> +    }
> +
> +    if (adj_block_ns >= 0) {
> +        if (adj_block_ns > ctx->poll_ns) {
> +            grow_polling_time(ctx, adj_block_ns);
> +        } else {
> +            shrink_polling_time(ctx, adj_block_ns);
> +         }
> +     }
> + }
> +
>  bool aio_poll(AioContext *ctx, bool blocking)
>  {
>      AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
> @@ -723,6 +772,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  
>      aio_free_deleted_handlers(ctx);
>  
> +    if (ctx->poll_max_ns) {
> +        adjust_block_ns(ctx, block_ns);
> +    }
> +
>      qemu_lockcnt_dec(&ctx->list_lock);
>  
>      progress |= timerlistgroup_run_timers(&ctx->tlg);
> @@ -784,6 +837,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>  
>      qemu_lockcnt_inc(&ctx->list_lock);
>      QLIST_FOREACH(node, &ctx->aio_handlers, node) {
> +        node->poll.has_event = false;
>          node->poll.ns = 0;
>      }
>      qemu_lockcnt_dec(&ctx->list_lock);
> @@ -794,6 +848,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>      ctx->poll_max_ns = max_ns;
>      ctx->poll_grow = grow;
>      ctx->poll_shrink = shrink;
> +    ctx->poll_ns = 0;
>  
>      aio_notify(ctx);
>  }
> diff --git a/util/async.c b/util/async.c
> index 80d6b01a8a..9d3627566f 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -606,6 +606,7 @@ AioContext *aio_context_new(Error **errp)
>      timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
>  
>      ctx->poll_max_ns = 0;
> +    ctx->poll_ns = 0;
>      ctx->poll_grow = 0;
>      ctx->poll_shrink = 0;
>  
> -- 
> 2.50.1
> 


* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-25 14:04   ` Markus Armbruster
@ 2026-03-26 15:55     ` JAEHOON KIM
  2026-03-27  5:49       ` Markus Armbruster
  0 siblings, 1 reply; 18+ messages in thread
From: JAEHOON KIM @ 2026-03-26 15:55 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, stefanha, fam,
	eblake, berrange, eduardo, dave, sw

On 3/25/2026 9:04 AM, Markus Armbruster wrote:
> Jaehoon Kim <jhkim@linux.ibm.com> writes:
>
>> Introduce a configurable poll-weight parameter for adaptive polling
>> in IOThread. This parameter replaces the hardcoded POLL_WEIGHT_SHIFT
>> constant, allowing runtime control over how much the most recent
>> event interval affects the next polling duration calculation.
>>
>> The poll-weight parameter uses a shift value where larger values
>> decrease the weight of the current interval, enabling more gradual
>> adjustments. When set to 0, a default value of 3 is used (meaning
>> the current interval contributes approximately 1/8 to the weighted
>> average).
>>
>> This patch also removes the hardcoded default values for poll-grow
>> and poll-shrink parameters from the grow_polling_time() and
>> shrink_polling_time() functions, as these defaults are now properly
>> initialized in iothread.c during IOThread creation.
>>
>> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
> [...]
>
>> diff --git a/qapi/misc.json b/qapi/misc.json
>> index 28c641fe2f..39d17010bc 100644
>> --- a/qapi/misc.json
>> +++ b/qapi/misc.json
>> @@ -85,6 +85,12 @@
> Note: this is IOThreadInfo, used only as return value of
> query-iothreads.
>
>>   # @poll-shrink: how many ns will be removed from polling time, 0 means
>>   #     that it's not configured (since 2.9)
>>   #
>> +# @poll-weight: the weight factor for adaptive polling.
>> +#     Determines how much the current event interval contributes to
>> +#     the next polling time calculation.  Valid values are 1 or
>> +#     greater.  0 selects the system default value which is current 3
> Does query-iothreads actually return 0?  I'd expect it to return the
> value that is actually used.
Thanks for your feedback.
I have changed iothread_instance_init() to expose the IOThread default
poll parameters, so that the query shows the system-initialized values
for grow, shrink, and weight.
I have also modified iothread_set_param() so that when the user inputs 0,
it reverts to the system default values.

>> +#     (since 10.2)
> 11.1 most likely.
Thanks for noticing. I'll update the version to 11.1 in the next patch.
>> +#
>>   # @aio-max-batch: maximum number of requests in a batch for the AIO
>>   #     engine, 0 means that the engine will use its default (since 6.1)
>>   #
>> @@ -96,6 +102,7 @@
>>              'poll-max-ns': 'int',
>>              'poll-grow': 'int',
>>              'poll-shrink': 'int',
>> +           'poll-weight': 'int',
>>              'aio-max-batch': 'int' } }
>>   
>>   ##
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index c653248f85..feb80b6cfe 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -606,6 +606,11 @@
>>   #     algorithm detects it is spending too long polling without
>>   #     encountering events.  0 selects a default behaviour (default: 0)
>>   #
>> +# @poll-weight: the weight factor for adaptive polling.
>> +#     Determines how much the current event interval contributes to
>> +#     the next polling time calculation.  Valid values are 1 or
>> +#     greater.  If set to 0, the default value of 3 is used.
> The commit message hints at what the valid values mean; the doc comment
> doesn't even do that.  Do users need to know?
>
> Code [*] below uses it like time >> poll_weight, where @time is int64_t.
> poll_weight > 63 is undefined behavior, which is a no-no.  Please reject
> such values.  poll_weight == 63 results in zero.  Is that useful?
>
> Missing: (default: 0) (since 11.1)
I agree. I will update the doc comment to give users a practical hint,
for example like this:

# @poll-weight: the weight factor for adaptive polling.
#     Determines how much the most recent event interval affects
#     the next polling duration calculation.
#     If set to 0, the system default value of 3 is used.
#     Typical values: 1 (high weight on recent interval),
#     2-4 (moderate weight on recent interval).
#     (default: 0) (since 11.1)

I will also add a check in the code so that values exceeding the maximum
allowed will revert to the system default.
>> +#
>>   # The @aio-max-batch option is available since 6.1.
>>   #
>>   # Since: 2.0
>> @@ -614,7 +619,8 @@
>>     'base': 'EventLoopBaseProperties',
>>     'data': { '*poll-max-ns': 'int',
>>               '*poll-grow': 'int',
>> -            '*poll-shrink': 'int' } }
>> +            '*poll-shrink': 'int',
>> +            '*poll-weight': 'int' } }
>>   
>>   ##
>>   # @MainLoopProperties:
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 69e5a874c1..8ddf6c8d36 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -6413,7 +6413,7 @@ SRST
>>   
>>               CN=laptop.example.com,O=Example Home,L=London,ST=London,C=GB
>>   
>> -    ``-object iothread,id=id,poll-max-ns=poll-max-ns,poll-grow=poll-grow,poll-shrink=poll-shrink,aio-max-batch=aio-max-batch``
>> +    ``-object iothread,id=id,poll-max-ns=poll-max-ns,poll-grow=poll-grow,poll-shrink=poll-shrink,poll-weight=poll-weight,aio-max-batch=aio-max-batch``
>>           Creates a dedicated event loop thread that devices can be
>>           assigned to. This is known as an IOThread. By default device
>>           emulation happens in vCPU threads or the main event loop thread.
>> @@ -6449,6 +6449,11 @@ SRST
>>           the polling time when the algorithm detects it is spending too
>>           long polling without encountering events.
>>   
>> +        The ``poll-weight`` parameter is the weight factor used in the
>> +        adaptive polling algorithm. It determines how much the most
>> +        recent event interval affects the calculation of the next
>> +        polling duration.
>> +
>>           The ``aio-max-batch`` parameter is the maximum number of requests
>>           in a batch for the AIO engine, 0 means that the engine will use
>>           its default.
>> diff --git a/tests/unit/test-nested-aio-poll.c b/tests/unit/test-nested-aio-poll.c
>> index 9ab1ad08a7..4c38f36fd4 100644
>> --- a/tests/unit/test-nested-aio-poll.c
>> +++ b/tests/unit/test-nested-aio-poll.c
>> @@ -81,7 +81,7 @@ static void test(void)
>>       qemu_set_current_aio_context(td.ctx);
>>   
>>       /* Enable polling */
>> -    aio_context_set_poll_params(td.ctx, 1000000, 2, 2, &error_abort);
>> +    aio_context_set_poll_params(td.ctx, 1000000, 2, 2, 3, &error_abort);
>>   
>>       /* Make the event notifier active (set) right away */
>>       event_notifier_init(&td.poll_notifier, 1);
>> diff --git a/util/aio-posix.c b/util/aio-posix.c
>> index 2b3522f2f9..13b7f94911 100644
>> --- a/util/aio-posix.c
>> +++ b/util/aio-posix.c
>> @@ -29,7 +29,6 @@
>>   
>>   /* Stop userspace polling on a handler if it isn't active for some time */
>>   #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
>> -#define POLL_WEIGHT_SHIFT   (3)
>>   
>>   static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
>>   static void grow_polling_time(AioContext *ctx, int64_t block_ns);
>> @@ -593,10 +592,6 @@ static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
>>       int64_t old = ctx->poll_ns;
>>       int64_t shrink = ctx->poll_shrink;
>>   
>> -    if (shrink == 0) {
>> -        shrink = 2;
>> -    }
>> -
>>       if (block_ns < (ctx->poll_ns / shrink)) {
>>           ctx->poll_ns /= shrink;
>>       }
>> @@ -610,10 +605,6 @@ static void grow_polling_time(AioContext *ctx, int64_t block_ns)
>>       int64_t old = ctx->poll_ns;
>>       int64_t grow = ctx->poll_grow;
>>   
>> -    if (grow == 0) {
>> -        grow = 2;
>> -    }
>> -
>>       if (block_ns > ctx->poll_ns * grow) {
>>           ctx->poll_ns = block_ns;
>>       } else {
>> @@ -640,8 +631,8 @@ static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
>>                * poll.ns to smooth out polling time adjustments.
>>                */
>>               node->poll.ns = node->poll.ns
>> -                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
>> -                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
>> +                ? (node->poll.ns - (node->poll.ns >> ctx->poll_weight))
>> +                + (block_ns >> ctx->poll_weight) : block_ns;
> [*] This is the use of @poll-weight referred to above.
>
>>   
>>               if (node->poll.ns > ctx->poll_max_ns) {
>>                   node->poll.ns = 0;
>> @@ -831,7 +822,8 @@ void aio_context_destroy(AioContext *ctx)
>>   }
>>   
>>   void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>> -                                 int64_t grow, int64_t shrink, Error **errp)
>> +                                 int64_t grow, int64_t shrink,
>> +                                 int64_t weight, Error **errp)
>>   {
>>       AioHandler *node;
>>   
>> @@ -848,6 +840,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>>       ctx->poll_max_ns = max_ns;
>>       ctx->poll_grow = grow;
>>       ctx->poll_shrink = shrink;
>> +    ctx->poll_weight = weight;
>>       ctx->poll_ns = 0;
>>   
>>       aio_notify(ctx);
> [...]
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-25 16:56   ` Stefan Hajnoczi
@ 2026-03-26 16:13     ` JAEHOON KIM
  0 siblings, 0 replies; 18+ messages in thread
From: JAEHOON KIM @ 2026-03-26 16:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On 3/25/2026 11:56 AM, Stefan Hajnoczi wrote:
> On Mon, Mar 23, 2026 at 08:54:51AM -0500, Jaehoon Kim wrote:
>> @@ -831,7 +822,8 @@ void aio_context_destroy(AioContext *ctx)
>>   }
>>   
>>   void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>> -                                 int64_t grow, int64_t shrink, Error **errp)
>> +                                 int64_t grow, int64_t shrink,
>> +                                 int64_t weight, Error **errp)
>>   {
>>       AioHandler *node;
>>   
>> @@ -848,6 +840,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>>       ctx->poll_max_ns = max_ns;
>>       ctx->poll_grow = grow;
>>       ctx->poll_shrink = shrink;
>> +    ctx->poll_weight = weight;
>>       ctx->poll_ns = 0;
>>   
>>       aio_notify(ctx);
> On second thought, now that the divide-by-0 protection has been removed
> and these fields are assumed to hold a valid value when poll_max_ns > 0,
> aio_context_set_poll_params() needs the same 0-protection as
> iothread_set_param().

Thanks for pointing that out. I only considered the user settings and
overlooked the possibility of this being configured in test code or
other areas.

I will add divide-by-zero protection to aio_context_set_poll_params()
and update the versioning to 11.1 in the next version.
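
As a rough illustration of that protection (not the actual patch), the
sketch below normalizes 0 to concrete defaults before the values are
stored. PollParams and poll_params_apply_defaults() are hypothetical
names used only for this example; the defaults (grow = 2, shrink = 2,
weight = 3) are the ones mentioned in this thread.

```c
#include <stdint.h>

/* Hypothetical stand-in for the polling fields of AioContext. */
typedef struct PollParams {
    int64_t grow;
    int64_t shrink;
    int64_t weight;
} PollParams;

/*
 * Replace 0 ("use the default") with a concrete value so that later code
 * can divide by shrink and shift by weight without special cases.
 */
static void poll_params_apply_defaults(PollParams *p)
{
    if (p->grow == 0) {
        p->grow = 2;
    }
    if (p->shrink == 0) {
        p->shrink = 2;
    }
    if (p->weight == 0) {
        p->weight = 3;
    }
}
```

With this in place, callers such as test code that pass 0 would get the
same behaviour as an IOThread created with default properties.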

Regards,
Jaehoon



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation
  2026-03-25 17:22   ` Stefan Hajnoczi
@ 2026-03-26 18:17     ` JAEHOON KIM
  2026-03-26 18:34       ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: JAEHOON KIM @ 2026-03-26 18:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On 3/25/2026 12:22 PM, Stefan Hajnoczi wrote:
> On Mon, Mar 23, 2026 at 08:54:49AM -0500, Jaehoon Kim wrote:
>> Nodes are no longer added to poll_aio_handlers when adaptive polling is
>> disabled, preventing unnecessary try_poll_mode() calls. Additionally,
>> aio_poll() skips try_poll_mode() when timeout is 0.
> Skipping when timeout is 0 seems risky to me. VIRTIO devices disable
> guest kicks when polling mode is started. When aio_poll(ctx,
> blocking=false) is called, we will skip polling and
> ctx->fdmon_ops->need_wait(ctx) won't detect an event either. aio_poll()
> will return without noticing that the VIRTIO device's AioHandler is
> ready.
>
> Is skipping when timeout 0 necessary for performance or can it be
> dropped from the patch?
>
> Aside from this:
>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Thank you for the review!
Removing the (timeout != 0) check does not cause any issue on my side,
so I plan to remove it in the next version.

However, I'd like to clarify: when timeout is 0, is there any concern
in the try_poll_mode function below that I might be missing?
I just want to make sure I understand correctly.

static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
                           int64_t *timeout)
{
     AioHandler *node;
     int64_t max_ns;

     if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
         return false;
     }

     max_ns = 0;
     QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
         max_ns = MAX(max_ns, node->poll.ns);
     }
     max_ns = qemu_soonest_timeout(*timeout, max_ns);

     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         /*
          * Enable poll mode. It pairs with the poll_set_started() in
          * aio_poll() which disables poll mode.
          */
         poll_set_started(ctx, ready_list, true);


     .....


Regards,
Jaehoon Kim.




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation
  2026-03-26 18:17     ` JAEHOON KIM
@ 2026-03-26 18:34       ` Stefan Hajnoczi
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Hajnoczi @ 2026-03-26 18:34 UTC (permalink / raw)
  To: JAEHOON KIM
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

[-- Attachment #1: Type: text/plain, Size: 2314 bytes --]

On Thu, Mar 26, 2026 at 01:17:44PM -0500, JAEHOON KIM wrote:
> On 3/25/2026 12:22 PM, Stefan Hajnoczi wrote:
> > On Mon, Mar 23, 2026 at 08:54:49AM -0500, Jaehoon Kim wrote:
> > > Nodes are no longer added to poll_aio_handlers when adaptive polling is
> > > disabled, preventing unnecessary try_poll_mode() calls. Additionally,
> > > aio_poll() skips try_poll_mode() when timeout is 0.
> > Skipping when timeout is 0 seems risky to me. VIRTIO devices disable
> > guest kicks when polling mode is started. When aio_poll(ctx,
> > blocking=false) is called, we will skip polling and
> > ctx->fdmon_ops->need_wait(ctx) won't detect an event either. aio_poll()
> > will return without noticing that the VIRTIO device's AioHandler is
> > ready.
> > 
> > Is skipping when timeout 0 necessary for performance or can it be
> > dropped from the patch?
> > 
> > Aside from this:
> > 
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Thank you for the review!
> Removing the (timeout != 0) check does not cause any issue on my side,
> so I plan to remove it in the next version.
> 
> However, I'd like to clarify: when timeout is 0, is there any concern
> in the try_poll_mode function below that I might be missing?
> I just want to make sure I understand correctly.
> 
> static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>                           int64_t *timeout)
> {
>     AioHandler *node;
>     int64_t max_ns;
> 
>     if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
>         return false;
>     }
> 
>     max_ns = 0;
>     QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
>         max_ns = MAX(max_ns, node->poll.ns);
>     }
>     max_ns = qemu_soonest_timeout(*timeout, max_ns);
> 
>     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {

I think you are pointing out that max_ns will be 0 and therefore polling
will be skipped? At least this is what came to mind when I read this
code again.

That is a problem because the exact scenario I described in my reply to
you can already happen in the existing code before your patch :(.

Avoiding the timeout != 0 check in your patch would help contain the bug
in try_poll_mode() rather than extending it to aio_poll() as well.
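
To make the max_ns == 0 case concrete, here is a small sketch. It
assumes a local stand-in, soonest_timeout(), modelled on
qemu_soonest_timeout() (unsigned comparison, so -1 means "no timeout"),
and a hypothetical would_poll() helper that ignores the need_wait()
part of the condition:

```c
#include <stdint.h>

/*
 * Stand-in modelled on qemu_soonest_timeout(): the unsigned comparison
 * makes -1 ("block forever") the largest value, so any real timeout wins.
 */
static int64_t soonest_timeout(int64_t t1, int64_t t2)
{
    return ((uint64_t)t1 < (uint64_t)t2) ? t1 : t2;
}

/*
 * Hypothetical helper: nonzero if the "max_ns && ..." check in
 * try_poll_mode() could pass for this timeout and polling time.
 */
static int would_poll(int64_t timeout, int64_t poll_ns)
{
    int64_t max_ns = soonest_timeout(timeout, poll_ns);

    return max_ns != 0;
}
```

With timeout == 0 (the non-blocking aio_poll() case), max_ns is always 0
and the polling branch is never entered, whatever poll_ns is.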

Thanks for pointing this out!

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals
  2026-03-25 20:37   ` Stefan Hajnoczi
@ 2026-03-27  5:02     ` JAEHOON KIM
  2026-03-30 19:17       ` Stefan Hajnoczi
  0 siblings, 1 reply; 18+ messages in thread
From: JAEHOON KIM @ 2026-03-27  5:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On 3/25/2026 3:37 PM, Stefan Hajnoczi wrote:
> On Mon, Mar 23, 2026 at 08:54:50AM -0500, Jaehoon Kim wrote:
>> Refine adaptive polling in aio_poll by updating iothread polling
>> duration based on weighted AioHandler event intervals.
>>
>> Each AioHandler's poll.ns is updated using a weighted factor when an
>> event occurs. Idle handlers accumulate block_ns until poll_max_ns and
>> then reset to 0, preventing sporadically active handlers from
>> unnecessarily prolonging iothread polling.
>>
>> The iothread polling duration is set based on the largest poll.ns among
>> active handlers. The shrink divider defaults to 2, matching the grow
>> rate, to reduce frequent poll_ns resets for slow devices.
>>
>> The default weight factor (POLL_WEIGHT_SHIFT=3, meaning the current
>> interval contributes 12.5% to the weighted average) was selected based
>> on extensive testing comparing QEMU 10.0.0 baseline vs poll-weight=2
>> and poll-weight=3 across various workloads.
>>
>> The table below shows a comparison between:
>> -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1, Guest: RHEL 9.6GA vs.
>> -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1 (w=2/w=3), Guest: RHEL 9.6GA
>> for FIO FCP and FICON with 1 iothread and 8 iothreads.
>> The values shown are the averages for numjobs 1, 4, and 8.
>>
>> Summary of results (% change vs baseline):
>>
>>                      | poll-weight=2      | poll-weight=3
>> --------------------|--------------------|-----------------
>> Throughput avg      | -2.4% (all tests)  | -2.2% (all tests)
>> CPU consumption avg | -10.9% (all tests) | -9.4% (all tests)
>>
>> Both weight=2 and weight=3 show significant CPU consumption reduction
>> (~10%) compared to baseline, which addresses the CPU utilization
>> regression observed in QEMU 10.0.0. The throughput impact is minimal
>> for both (~2%).
>>
>> Weight=3 is selected as the default because it provides slightly better
>> throughput (-2.2% vs -2.4%) while still achieving substantial CPU
>> savings (-9.4%). The difference between weight=2 and weight=3 is small,
>> but weight=3 offers a better balance for general-purpose workloads.
>>
>> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
>> ---
>>   include/qemu/aio.h |   4 +-
>>   util/aio-posix.c   | 135 +++++++++++++++++++++++++++++++--------------
>>   util/async.c       |   1 +
>>   3 files changed, 99 insertions(+), 41 deletions(-)
>>
>> diff --git a/include/qemu/aio.h b/include/qemu/aio.h
>> index 8cca2360d1..6c77a190e9 100644
>> --- a/include/qemu/aio.h
>> +++ b/include/qemu/aio.h
>> @@ -195,7 +195,8 @@ struct BHListSlice {
>>   typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
>>   
>>   typedef struct AioPolledEvent {
>> -    int64_t ns;        /* current polling time in nanoseconds */
>> +    bool has_event; /* Flag to indicate if an event has occurred */
>> +    int64_t ns;     /* estimated block time in nanoseconds */
>>   } AioPolledEvent;
>>   
>>   struct AioContext {
>> @@ -306,6 +307,7 @@ struct AioContext {
>>       int poll_disable_cnt;
>>   
>>       /* Polling mode parameters */
>> +    int64_t poll_ns;        /* current polling time in nanoseconds */
>>       int64_t poll_max_ns;    /* maximum polling time in nanoseconds */
>>       int64_t poll_grow;      /* polling time growth factor */
>>       int64_t poll_shrink;    /* polling time shrink factor */
>> diff --git a/util/aio-posix.c b/util/aio-posix.c
>> index b02beb0505..2b3522f2f9 100644
>> --- a/util/aio-posix.c
>> +++ b/util/aio-posix.c
>> @@ -29,9 +29,11 @@
>>   
>>   /* Stop userspace polling on a handler if it isn't active for some time */
>>   #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
>> +#define POLL_WEIGHT_SHIFT   (3)
>>   
>> -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
>> -                                int64_t block_ns);
>> +static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
>> +static void grow_polling_time(AioContext *ctx, int64_t block_ns);
>> +static void shrink_polling_time(AioContext *ctx, int64_t block_ns);
>>   
>>   bool aio_poll_disabled(AioContext *ctx)
>>   {
>> @@ -373,7 +375,7 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
>>            * add the handler to ctx->poll_aio_handlers.
> This comment refers to adjusting the polling time. The code no longer
> does this and the comment should be updated.
The comment about adjusting polling time is no longer accurate.
I will update it in the next version.
>>            */
>>           if (ctx->poll_max_ns && QLIST_IS_INSERTED(node, node_poll)) {
>> -            adjust_polling_time(ctx, &node->poll, block_ns);
> aio_dispatch_ready_handlers() no longer uses the block_ns argument. It
> can be removed.
I will remove the block_ns argument in the next version.
>> +            node->poll.has_event = true;
>>           }
>>       }
>>   
>> @@ -560,18 +562,13 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>>   static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>>                             int64_t *timeout)
>>   {
>> -    AioHandler *node;
>>       int64_t max_ns;
>>   
>>       if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
>>           return false;
>>       }
>>   
>> -    max_ns = 0;
>> -    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
>> -        max_ns = MAX(max_ns, node->poll.ns);
>> -    }
>> -    max_ns = qemu_soonest_timeout(*timeout, max_ns);
>> +    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
>>   
>>       if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
>>           /*
>> @@ -587,46 +584,98 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>>       return false;
>>   }
>>   
>> -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
>> -                                int64_t block_ns)
>> +static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
>>   {
>> -    if (block_ns <= poll->ns) {
>> -        /* This is the sweet spot, no adjustment needed */
>> -    } else if (block_ns > ctx->poll_max_ns) {
>> -        /* We'd have to poll for too long, poll less */
>> -        int64_t old = poll->ns;
>> -
>> -        if (ctx->poll_shrink) {
>> -            poll->ns /= ctx->poll_shrink;
>> -        } else {
>> -            poll->ns = 0;
>> -        }
>> +    /*
>> +     * Reduce polling time if the block_ns is zero or
>> +     * less than the current poll_ns.
>> +     */
>> +    int64_t old = ctx->poll_ns;
>> +    int64_t shrink = ctx->poll_shrink;
>>   
>> -        trace_poll_shrink(ctx, old, poll->ns);
>> -    } else if (poll->ns < ctx->poll_max_ns &&
>> -               block_ns < ctx->poll_max_ns) {
>> -        /* There is room to grow, poll longer */
>> -        int64_t old = poll->ns;
>> -        int64_t grow = ctx->poll_grow;
>> +    if (shrink == 0) {
>> +        shrink = 2;
>> +    }
>>   
>> -        if (grow == 0) {
>> -            grow = 2;
>> -        }
>> +    if (block_ns < (ctx->poll_ns / shrink)) {
>> +        ctx->poll_ns /= shrink;
>> +    }
>>   
>> -        if (poll->ns) {
>> -            poll->ns *= grow;
>> -        } else {
>> -            poll->ns = 4000; /* start polling at 4 microseconds */
>> -        }
>> +    trace_poll_shrink(ctx, old, ctx->poll_ns);
> This trace event should be inside if (block_ns < (ctx->poll_ns /
> shrink)) like it was before this patch.
>
>> +}
>>   
>> -        if (poll->ns > ctx->poll_max_ns) {
>> -            poll->ns = ctx->poll_max_ns;
>> -        }
>> +static void grow_polling_time(AioContext *ctx, int64_t block_ns)
>> +{
>> +    /* There is room to grow, poll longer */
>> +    int64_t old = ctx->poll_ns;
>> +    int64_t grow = ctx->poll_grow;
>>   
>> -        trace_poll_grow(ctx, old, poll->ns);
>> +    if (grow == 0) {
>> +        grow = 2;
>>       }
>> +
>> +    if (block_ns > ctx->poll_ns * grow) {
>> +        ctx->poll_ns = block_ns;
>> +    } else {
>> +        ctx->poll_ns *= grow;
>> +    }
>> +
>> +    if (ctx->poll_ns > ctx->poll_max_ns) {
>> +        ctx->poll_ns = ctx->poll_max_ns;
>> +    }
>> +
>> +    trace_poll_grow(ctx, old, ctx->poll_ns);
> Same here.
I will move the trace_poll_xxx() calls inside the if condition in the
next version.
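
For illustration only, the shrink case could look roughly like the
sketch below. shrink_poll_ns() is a hypothetical, reduced version of
shrink_polling_time() that works on plain integers, assumes shrink is
nonzero, and returns whether the value changed, which is the only case
where the trace event would be emitted:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Shrink *poll_ns if block_ns is clearly below it.
 * Returns true only when *poll_ns was changed, i.e. when a trace
 * event would be worth emitting.  @shrink must be nonzero.
 */
static bool shrink_poll_ns(int64_t *poll_ns, int64_t block_ns, int64_t shrink)
{
    if (block_ns < *poll_ns / shrink) {
        *poll_ns /= shrink;
        return true;
    }
    return false;
}
```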
>
>>   }
>>   
>> +static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
>> +{
>> +    AioHandler *node;
>> +    int64_t adj_block_ns = -1;
>> +
>> +    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
>> +        if (node->poll.has_event) {
> Did you consider unifying node->poll.has_event with
> node->poll_idle_timeout, which is assigned now + POLL_IDLE_INTERVAL_NS
> every time ->io_poll() detects an event?
>
> For instance, rename node->poll_idle_timeout to
> node->last_event_timestamp and assign now without adding
> POLL_IDLE_INTERVAL_NS. Then use the field for both idle node removal and
> adjust_block_ns() (pass in now).
Thank you for the suggestion, I think this is a good idea.
After testing, it seems that node->poll_idle_timeout can be reused as
you suggested, although a few adjustments are needed.

Currently, an event is detected and the AioHandler is added to the
ready_list in three cases: run_poll_handlers_once(),
ctx->fdmon_ops->wait(), and poll_set_started().

To accurately track the last event timestamp of an AioHandler, it seems
necessary to update the timestamp in the following two functions:

@@ -45,6 +45,7 @@ void aio_add_ready_handler(AioHandlerList *ready_list,
  {
      QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's list */
      node->pfd.revents = revents;
+    node->poll_idle_timeout = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
      QLIST_INSERT_HEAD(ready_list, node, node_ready);
  }

@@ -53,6 +54,7 @@ static void aio_add_poll_ready_handler(AioHandlerList *ready_list,
  {
      QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's list */
      node->poll_ready = true;
+    node->poll_idle_timeout = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
      QLIST_INSERT_HEAD(ready_list, node, node_ready);
  }

In addition, remove_idle_poll_handlers() would need some adjustments as
well. If this approach aligns with what you had in mind, I believe it
can be incorporated in the next version without any issues.
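
A minimal sketch of how one timestamp could serve both purposes,
assuming a hypothetical reduced Handler struct with the renamed field
last_event_timestamp and two hypothetical helpers. This is only meant
to show the idea, not how the next version will be written:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors POLL_IDLE_INTERVAL_NS (7 seconds) from util/aio-posix.c. */
#define IDLE_INTERVAL_NS (7 * INT64_C(1000000000))

/* Hypothetical, reduced stand-in for AioHandler. */
typedef struct Handler {
    int64_t last_event_timestamp; /* time of the most recent event, in ns */
} Handler;

/* True if the handler saw an event since the iteration began at @loop_start. */
static bool handler_had_event(const Handler *h, int64_t loop_start)
{
    return h->last_event_timestamp >= loop_start;
}

/* True if the handler has been quiet for at least the idle interval. */
static bool handler_is_idle(const Handler *h, int64_t now)
{
    return now - h->last_event_timestamp >= IDLE_INTERVAL_NS;
}
```

The first helper would replace the has_event flag in adjust_block_ns(),
the second the comparison against poll_idle_timeout during idle node
removal.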

>> +            /*
>> +             * Update poll.ns for the node with an event.
>> +             * Uses a weighted average of the current block_ns and the previous
>> +             * poll.ns to smooth out polling time adjustments.
>> +             */
>> +            node->poll.ns = node->poll.ns
>> +                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
>> +                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
>> +
>> +            if (node->poll.ns > ctx->poll_max_ns) {
>> +                node->poll.ns = 0;
>> +            }
> Previously:
> -        if (poll->ns > ctx->poll_max_ns) {
> -            poll->ns = ctx->poll_max_ns;
> -        }
>
> Was this causing excessive CPU consumption in your benchmarks?
>
> Can you explain the rationale for zeroing the poll time? Aside from
> reducing CPU consumption, it also reduces the chance that polling will
> succeed and could therefore impact performance.
>
> I'm asking about this because this patch makes several changes at once
> and I'm not sure how the CPU usage and performance changes are
> attributed to these multiple changes. I want to make sure the changes
> merged are minimal and the best set - sometimes when multiple things are
> changed at the same time, not all of them are beneficial.

The snippet you quoted under "Previously" is still reflected in the
logic within grow_polling_time(), where the same approach is used.

The difference is that previously, every AioHandler in the ready_list
would update its own poll.ns using block_ns. As in the snippet below,
if block_ns exceeded poll_max_ns, poll.ns would effectively be reset to
0 anyway.

-    } else if (block_ns > ctx->poll_max_ns) {
-        /* We'd have to poll for too long, poll less */
-        int64_t old = poll->ns;
-
-        if (ctx->poll_shrink) {
-            poll->ns /= ctx->poll_shrink;
-        } else {
-            poll->ns = 0;
-        }

I think I did not explain this part clearly enough in the commit message.
Here’s a more detailed explanation of the problem with the current
polling logic and of the approach:

Problem:
Starting from QEMU 10.0, poll.ns was introduced per event handler to
mitigate excessive fluctuations in IOThread polling times observed in
earlier versions (QEMU 9.x).

However, in the current design, poll.ns is updated only when an event
occurs, making it difficult to treat block_ns as a reliable event
interval. Also, the IOThread’s next polling time is determined by the
maximum poll.ns among all AioHandlers, which means idle AioHandlers
with high poll.ns can have an outsized impact on the polling duration.

For io_uring, idle AioHandlers are cleared after POLL_IDLE_INTERVAL_NS
(7s), but for ppoll/epoll there is no such mechanism, so CPU
consumption due to idle nodes can increase even more.

Approach:
To address this, we treat block_ns as an event interval and update each
AioHandler’s poll.ns using a weighted factor. This smooths out polling
time adjustments, preventing excessive fluctuations and ensuring that
recent event intervals are properly reflected, which helps maintain
performance while lowering CPU utilization.
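
As a worked example of the weighted update, the sketch below applies
the same expression as the patch to plain integers. weighted_poll_ns()
is a hypothetical helper name; weight is the shift count and is assumed
to be in the range 1..63:

```c
#include <stdint.h>

/*
 * Weighted average of the previous estimate and the newest interval:
 *   new = prev - prev / 2^weight + block / 2^weight
 * A previous value of 0 means "no estimate yet", so block_ns is taken as is.
 */
static int64_t weighted_poll_ns(int64_t prev_ns, int64_t block_ns,
                                unsigned weight)
{
    if (prev_ns == 0) {
        return block_ns;
    }
    return prev_ns - (prev_ns >> weight) + (block_ns >> weight);
}
```

With weight = 3, a previous estimate of 8000 ns and a new interval of
16000 ns give 8000 - 1000 + 2000 = 9000 ns, so the new interval
contributes 1/8. With weight = 1 the same inputs give 12000 ns.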

To use block_ns as an event interval, we update polling times for both
event and non-event AioHandlers in each loop iteration. Non-event
AioHandlers do not require a weighted factor; this allows for rapid
isolation of idle nodes, while ensuring that poll.ns can increase more
responsively when an event occurs within a few subsequent loops.
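
The non-event path can be sketched the same way. idle_poll_ns() is a
hypothetical helper that mirrors the else branch of adjust_block_ns()
on plain integers:

```c
#include <stdint.h>

/*
 * Handler had no event in this iteration: add the whole block_ns
 * (no weighting) and drop the estimate to 0 once it exceeds
 * poll_max_ns.  An estimate of 0 stays 0 until the next event.
 */
static int64_t idle_poll_ns(int64_t poll_ns, int64_t block_ns,
                            int64_t poll_max_ns)
{
    if (poll_ns == 0) {
        return 0;
    }
    poll_ns += block_ns;
    return poll_ns > poll_max_ns ? 0 : poll_ns;
}
```

For example, with poll_max_ns = 32000, an idle handler at 10000 ns moves
to 15000 ns after a 5000 ns iteration, while one at 30000 ns is reset
to 0 and no longer influences the iothread's polling time.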

>> +            /*
>> +             * To avoid excessive polling time increase, update adj_block_ns
>> +             * for nodes with the event flag set to true
>> +             */
>> +            adj_block_ns = MAX(adj_block_ns, node->poll.ns);
> adj_block_ns is not the blocking time, it's the maximum current poll
> time across all nodes. It would be clearer to change the variable name.
You're right. I will rename it to max_poll_ns to better reflect its purpose.
>> +            node->poll.has_event = false;
>> +         } else {
> 4-space indentation should be used.
I will also fix the indentation.
>
>> +            /*
>> +             * No event now, but was active before.
>> +             * If it waits longer than poll_max_ns, poll.ns will stay 0
>> +             * until the next event arrives.
>> +             */
>> +            if (node->poll.ns != 0) {
>> +                node->poll.ns += block_ns;
> Why is block_ns being added to a recently inactive node's polling time?
> Here node->poll.ns no longer measures the weighted time until the
> handler had an event.
>
> If the goal is to get rid of inactive nodes, then maybe the idle handler
> removal mechanism should be made more aggressive instead?
>
>> +                if (node->poll.ns > ctx->poll_max_ns) {
>> +                    node->poll.ns = 0;
>> +                }
>> +            }
>> +        }
>> +    }
>> +
>> +    if (adj_block_ns >= 0) {
>> +        if (adj_block_ns > ctx->poll_ns) {
>> +            grow_polling_time(ctx, adj_block_ns);
>> +        } else {
>> +            shrink_polling_time(ctx, adj_block_ns);
>> +         }
>> +     }
>> + }
>> +
>>   bool aio_poll(AioContext *ctx, bool blocking)
>>   {
>>       AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
>> @@ -723,6 +772,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>>   
>>       aio_free_deleted_handlers(ctx);
>>   
>> +    if (ctx->poll_max_ns) {
>> +        adjust_block_ns(ctx, block_ns);
>> +    }
>> +
>>       qemu_lockcnt_dec(&ctx->list_lock);
>>   
>>       progress |= timerlistgroup_run_timers(&ctx->tlg);
>> @@ -784,6 +837,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>>   
>>       qemu_lockcnt_inc(&ctx->list_lock);
>>       QLIST_FOREACH(node, &ctx->aio_handlers, node) {
>> +        node->poll.has_event = false;
>>           node->poll.ns = 0;
>>       }
>>       qemu_lockcnt_dec(&ctx->list_lock);
>> @@ -794,6 +848,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>>       ctx->poll_max_ns = max_ns;
>>       ctx->poll_grow = grow;
>>       ctx->poll_shrink = shrink;
>> +    ctx->poll_ns = 0;
>>   
>>       aio_notify(ctx);
>>   }
>> diff --git a/util/async.c b/util/async.c
>> index 80d6b01a8a..9d3627566f 100644
>> --- a/util/async.c
>> +++ b/util/async.c
>> @@ -606,6 +606,7 @@ AioContext *aio_context_new(Error **errp)
>>       timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
>>   
>>       ctx->poll_max_ns = 0;
>> +    ctx->poll_ns = 0;
>>       ctx->poll_grow = 0;
>>       ctx->poll_shrink = 0;
>>   
>> -- 
>> 2.50.1
>>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-26 15:55     ` JAEHOON KIM
@ 2026-03-27  5:49       ` Markus Armbruster
  2026-03-27 14:23         ` JAEHOON KIM
  0 siblings, 1 reply; 18+ messages in thread
From: Markus Armbruster @ 2026-03-27  5:49 UTC (permalink / raw)
  To: JAEHOON KIM
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, stefanha, fam,
	eblake, berrange, eduardo, dave, sw

JAEHOON KIM <jhkim@linux.ibm.com> writes:

> On 3/25/2026 9:04 AM, Markus Armbruster wrote:
>> Jaehoon Kim <jhkim@linux.ibm.com> writes:
>>
>>> Introduce a configurable poll-weight parameter for adaptive polling
>>> in IOThread. This parameter replaces the hardcoded POLL_WEIGHT_SHIFT
>>> constant, allowing runtime control over how much the most recent
>>> event interval affects the next polling duration calculation.
>>>
>>> The poll-weight parameter uses a shift value where larger values
>>> decrease the weight of the current interval, enabling more gradual
>>> adjustments. When set to 0, a default value of 3 is used (meaning
>>> the current interval contributes approximately 1/8 to the weighted
>>> average).
>>>
>>> This patch also removes the hardcoded default values for poll-grow
>>> and poll-shrink parameters from the grow_polling_time() and
>>> shrink_polling_time() functions, as these defaults are now properly
>>> initialized in iothread.c during IOThread creation.
>>>
>>> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>

[...]

>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>> index c653248f85..feb80b6cfe 100644
>>> --- a/qapi/qom.json
>>> +++ b/qapi/qom.json
>>> @@ -606,6 +606,11 @@
>>>  #     algorithm detects it is spending too long polling without
>>>  #     encountering events.  0 selects a default behaviour (default: 0)
>>>  #
>>> +# @poll-weight: the weight factor for adaptive polling.
>>> +#     Determines how much the current event interval contributes to
>>> +#     the next polling time calculation.  Valid values are 1 or
>>> +#     greater.  If set to 0, the default value of 3 is used.
>>
>> The commit message hints what the valid values mean, the doc comment
>> doesn't even do that.  Do users need to know?
>>
>> Code [*] below uses it like time >> poll_weight, where @time is int64_t.
>> poll_weight > 63 is undefined behavior, which is a no-no.  Please reject
>> such values.  poll_weight == 64 results in zero.  Is that useful?
>>
>> Missing: (default: 0) (since 11.1)
>
> I agree. I will update the doc comment to give users a practical hint,
> for example like this:
>
> # @poll-weight: the weight factor for adaptive polling.
> #     Determines how much the most recent event interval affects
> #     the next polling duration calculation.
> #     If set to 0, the system default value of 3 is used.
> #     Typical values: 1 (high weight on recent interval),
> #     2-4 (moderate weight on recent interval).
> #    (default: 0) (since 11.1)

Better, thanks!

> I will also add a check in the code so that values exceeding the maximum
> allowed will revert to the system default.

Don't silently "correct" invalid input, reject it!
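
Something along these lines, as a sketch only: check_poll_weight() and
POLL_WEIGHT_MAX are hypothetical names, the upper bound of 63 is an
assumption based on the int64_t shift, and real code would report the
failure through Error **errp instead of a bool:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound: shifting an int64_t by 64 or more is undefined. */
#define POLL_WEIGHT_MAX 63

/*
 * Hypothetical validator, not QEMU API.  Returns true if @weight is
 * acceptable: 0 selects the default, 1..POLL_WEIGHT_MAX is a shift count.
 * Anything else should make the property setter fail.
 */
static bool check_poll_weight(int64_t weight)
{
    return weight >= 0 && weight <= POLL_WEIGHT_MAX;
}
```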

[...]



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll
  2026-03-27  5:49       ` Markus Armbruster
@ 2026-03-27 14:23         ` JAEHOON KIM
  0 siblings, 0 replies; 18+ messages in thread
From: JAEHOON KIM @ 2026-03-27 14:23 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, stefanha, fam,
	eblake, berrange, eduardo, dave, sw

On 3/27/2026 12:49 AM, Markus Armbruster wrote:
> JAEHOON KIM <jhkim@linux.ibm.com> writes:
>
>> On 3/25/2026 9:04 AM, Markus Armbruster wrote:
>>> Jaehoon Kim <jhkim@linux.ibm.com> writes:
>>>
>>>> Introduce a configurable poll-weight parameter for adaptive polling
>>>> in IOThread. This parameter replaces the hardcoded POLL_WEIGHT_SHIFT
>>>> constant, allowing runtime control over how much the most recent
>>>> event interval affects the next polling duration calculation.
>>>>
>>>> The poll-weight parameter uses a shift value where larger values
>>>> decrease the weight of the current interval, enabling more gradual
>>>> adjustments. When set to 0, a default value of 3 is used (meaning
>>>> the current interval contributes approximately 1/8 to the weighted
>>>> average).
>>>>
>>>> This patch also removes the hardcoded default values for poll-grow
>>>> and poll-shrink parameters from the grow_polling_time() and
>>>> shrink_polling_time() functions, as these defaults are now properly
>>>> initialized in iothread.c during IOThread creation.
>>>>
>>>> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
> [...]
>
>>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>>> index c653248f85..feb80b6cfe 100644
>>>> --- a/qapi/qom.json
>>>> +++ b/qapi/qom.json
>>>> @@ -606,6 +606,11 @@
>>>>   #     algorithm detects it is spending too long polling without
>>>>   #     encountering events.  0 selects a default behaviour (default: 0)
>>>>   #
>>>> +# @poll-weight: the weight factor for adaptive polling.
>>>> +#     Determines how much the current event interval contributes to
>>>> +#     the next polling time calculation.  Valid values are 1 or
>>>> +#     greater.  If set to 0, the default value of 3 is used.
>>> The commit message hints what the valid values mean, the doc comment
>>> doesn't even that.  Do users need to know?
>>>
>>> Code [*] below uses it like time >> poll_weight, where @time is int64_t.
>>> poll_weight > 63 is undefined behavior, which is a no-no.  Please reject
>>> such values.  poll_weight == 64 results in zero.  Is that useful?
>>>
>>> Missing: (default: 0) (since 11.1)
>> I agree. I will update the doc comment to give users a practical hint,
>> for example like this:
>>
>> # @poll-weight: the weight factor for adaptive polling.
>> #     Determines how much the most recent event interval affects
>> #     the next polling duration calculation.
>> #     If set to 0, the system default value of 3 is used.
>> #     Typical values: 1 (high weight on recent interval),
>> #     2-4 (moderate weight on recent interval).
>> #    (default: 0) (since 11.1)
> Better, thanks!
>
>> I will also add a check in the code so that values exceeding the maximum
>> allowed will revert to the system default.
> Don't silently "correct" invalid input, reject it!
>
> [...]
Hi Markus,

Thank you for the review.
I will change the code to reject invalid values instead of silently
correcting them in the next version.

Regards,
Jaehoon.

>




* Re: [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals
  2026-03-27  5:02     ` JAEHOON KIM
@ 2026-03-30 19:17       ` Stefan Hajnoczi
  2026-03-31 20:42         ` JAEHOON KIM
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2026-03-30 19:17 UTC (permalink / raw)
  To: JAEHOON KIM
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw


On Fri, Mar 27, 2026 at 12:02:21AM -0500, JAEHOON KIM wrote:
> On 3/25/2026 3:37 PM, Stefan Hajnoczi wrote:
> > On Mon, Mar 23, 2026 at 08:54:50AM -0500, Jaehoon Kim wrote:
> > > Refine adaptive polling in aio_poll by updating iothread polling
> > > duration based on weighted AioHandler event intervals.
> > > 
> > > Each AioHandler's poll.ns is updated using a weighted factor when an
> > > event occurs. Idle handlers accumulate block_ns until poll_max_ns and
> > > then reset to 0, preventing sporadically active handlers from
> > > unnecessarily prolonging iothread polling.
> > > 
> > > The iothread polling duration is set based on the largest poll.ns among
> > > active handlers. The shrink divider defaults to 2, matching the grow
> > > rate, to reduce frequent poll_ns resets for slow devices.
> > > 
> > > The default weight factor (POLL_WEIGHT_SHIFT=3, meaning the current
> > > interval contributes 12.5% to the weighted average) was selected based
> > > on extensive testing comparing QEMU 10.0.0 baseline vs poll-weight=2
> > > and poll-weight=3 across various workloads.
> > > 
> > > The table below shows a comparison between:
> > > -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1, Guest: RHEL 9.6GA vs.
> > > -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1 (w=2/w=3), Guest: RHEL 9.6GA
> > > for FIO FCP and FICON with 1 iothread and 8 iothreads.
> > > The values shown are the averages for numjobs 1, 4, and 8.
> > > 
> > > Summary of results (% change vs baseline):
> > > 
> > >                      | poll-weight=2      | poll-weight=3
> > > --------------------|--------------------|-----------------
> > > Throughput avg      | -2.4% (all tests)  | -2.2% (all tests)
> > > CPU consumption avg | -10.9% (all tests) | -9.4% (all tests)
> > > 
> > > Both weight=2 and weight=3 show significant CPU consumption reduction
> > > (~10%) compared to baseline, which addresses the CPU utilization
> > > regression observed in QEMU 10.0.0. The throughput impact is minimal
> > > for both (~2%).
> > > 
> > > Weight=3 is selected as the default because it provides slightly better
> > > throughput (-2.2% vs -2.4%) while still achieving substantial CPU
> > > savings (-9.4%). The difference between weight=2 and weight=3 is small,
> > > but weight=3 offers a better balance for general-purpose workloads.
> > > 
> > > Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
> > > ---
> > >   include/qemu/aio.h |   4 +-
> > >   util/aio-posix.c   | 135 +++++++++++++++++++++++++++++++--------------
> > >   util/async.c       |   1 +
> > >   3 files changed, 99 insertions(+), 41 deletions(-)
> > > 
> > > diff --git a/include/qemu/aio.h b/include/qemu/aio.h
> > > index 8cca2360d1..6c77a190e9 100644
> > > --- a/include/qemu/aio.h
> > > +++ b/include/qemu/aio.h
> > > @@ -195,7 +195,8 @@ struct BHListSlice {
> > >   typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
> > >   typedef struct AioPolledEvent {
> > > -    int64_t ns;        /* current polling time in nanoseconds */
> > > +    bool has_event; /* Flag to indicate if an event has occurred */
> > > +    int64_t ns;     /* estimated block time in nanoseconds */
> > >   } AioPolledEvent;
> > >   struct AioContext {
> > > @@ -306,6 +307,7 @@ struct AioContext {
> > >       int poll_disable_cnt;
> > >       /* Polling mode parameters */
> > > +    int64_t poll_ns;        /* current polling time in nanoseconds */
> > >       int64_t poll_max_ns;    /* maximum polling time in nanoseconds */
> > >       int64_t poll_grow;      /* polling time growth factor */
> > >       int64_t poll_shrink;    /* polling time shrink factor */
> > > diff --git a/util/aio-posix.c b/util/aio-posix.c
> > > index b02beb0505..2b3522f2f9 100644
> > > --- a/util/aio-posix.c
> > > +++ b/util/aio-posix.c
> > > @@ -29,9 +29,11 @@
> > >   /* Stop userspace polling on a handler if it isn't active for some time */
> > >   #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
> > > +#define POLL_WEIGHT_SHIFT   (3)
> > > -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
> > > -                                int64_t block_ns);
> > > +static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
> > > +static void grow_polling_time(AioContext *ctx, int64_t block_ns);
> > > +static void shrink_polling_time(AioContext *ctx, int64_t block_ns);
> > >   bool aio_poll_disabled(AioContext *ctx)
> > >   {
> > > @@ -373,7 +375,7 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
> > >            * add the handler to ctx->poll_aio_handlers.
> > This comment refers to adjusting the polling time. The code no longer
> > does this and the comment should be updated.
> The comment about adjusting polling time is no longer accurate.
> I will update it in the next version.
> > >            */
> > >           if (ctx->poll_max_ns && QLIST_IS_INSERTED(node, node_poll)) {
> > > -            adjust_polling_time(ctx, &node->poll, block_ns);
> > aio_dispatch_ready_handlers() no longer uses the block_ns argument. It
> > can be removed.
> I will remove the block_ns argument in the next version.
> > > +            node->poll.has_event = true;
> > >           }
> > >       }
> > > @@ -560,18 +562,13 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
> > >   static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
> > >                             int64_t *timeout)
> > >   {
> > > -    AioHandler *node;
> > >       int64_t max_ns;
> > >       if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
> > >           return false;
> > >       }
> > > -    max_ns = 0;
> > > -    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> > > -        max_ns = MAX(max_ns, node->poll.ns);
> > > -    }
> > > -    max_ns = qemu_soonest_timeout(*timeout, max_ns);
> > > +    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
> > >       if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
> > >           /*
> > > @@ -587,46 +584,98 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
> > >       return false;
> > >   }
> > > -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
> > > -                                int64_t block_ns)
> > > +static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
> > >   {
> > > -    if (block_ns <= poll->ns) {
> > > -        /* This is the sweet spot, no adjustment needed */
> > > -    } else if (block_ns > ctx->poll_max_ns) {
> > > -        /* We'd have to poll for too long, poll less */
> > > -        int64_t old = poll->ns;
> > > -
> > > -        if (ctx->poll_shrink) {
> > > -            poll->ns /= ctx->poll_shrink;
> > > -        } else {
> > > -            poll->ns = 0;
> > > -        }
> > > +    /*
> > > +     * Reduce polling time if the block_ns is zero or
> > > +     * less than the current poll_ns.
> > > +     */
> > > +    int64_t old = ctx->poll_ns;
> > > +    int64_t shrink = ctx->poll_shrink;
> > > -        trace_poll_shrink(ctx, old, poll->ns);
> > > -    } else if (poll->ns < ctx->poll_max_ns &&
> > > -               block_ns < ctx->poll_max_ns) {
> > > -        /* There is room to grow, poll longer */
> > > -        int64_t old = poll->ns;
> > > -        int64_t grow = ctx->poll_grow;
> > > +    if (shrink == 0) {
> > > +        shrink = 2;
> > > +    }
> > > -        if (grow == 0) {
> > > -            grow = 2;
> > > -        }
> > > +    if (block_ns < (ctx->poll_ns / shrink)) {
> > > +        ctx->poll_ns /= shrink;
> > > +    }
> > > -        if (poll->ns) {
> > > -            poll->ns *= grow;
> > > -        } else {
> > > -            poll->ns = 4000; /* start polling at 4 microseconds */
> > > -        }
> > > +    trace_poll_shrink(ctx, old, ctx->poll_ns);
> > This trace event should be inside if (block_ns < (ctx->poll_ns /
> > shrink)) like it was before this patch.
> > 
> > > +}
> > > -        if (poll->ns > ctx->poll_max_ns) {
> > > -            poll->ns = ctx->poll_max_ns;
> > > -        }
> > > +static void grow_polling_time(AioContext *ctx, int64_t block_ns)
> > > +{
> > > +    /* There is room to grow, poll longer */
> > > +    int64_t old = ctx->poll_ns;
> > > +    int64_t grow = ctx->poll_grow;
> > > -        trace_poll_grow(ctx, old, poll->ns);
> > > +    if (grow == 0) {
> > > +        grow = 2;
> > >       }
> > > +
> > > +    if (block_ns > ctx->poll_ns * grow) {
> > > +        ctx->poll_ns = block_ns;
> > > +    } else {
> > > +        ctx->poll_ns *= grow;
> > > +    }
> > > +
> > > +    if (ctx->poll_ns > ctx->poll_max_ns) {
> > > +        ctx->poll_ns = ctx->poll_max_ns;
> > > +    }
> > > +
> > > +    trace_poll_grow(ctx, old, ctx->poll_ns);
> > Same here.
> I will move the trace_poll_xxx functions inside if condition in the
> next version.
> > 
> > >   }
> > > +static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
> > > +{
> > > +    AioHandler *node;
> > > +    int64_t adj_block_ns = -1;
> > > +
> > > +    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
> > > +        if (node->poll.has_event) {
> > Did you consider unifying node->poll.has_event with
> > node->poll_idle_timeout, which is assigned now + POLL_IDLE_INTERVAL_NS
> > every time ->io_poll() detects an event?
> > 
> > For instance, rename node->poll_idle_timeout to
> > node->last_event_timestamp and assign now without adding
> > POLL_IDLE_INTERVAL_NS. Then use the field for both idle node removal and
> > adjust_block_ns() (pass in now).
> Thank you for the suggestion, I think this is a good idea.
> After testing, it seems that node->poll_idle_timeout can be reused as you
> suggested, although a few adjustments are needed.
> 
> Currently, an event is detected and the AioHandler is added to the
> ready_list in three cases: run_poll_handler_once(),
> ctx->fdmon_ops->wait(), and poll_set_started().
> 
> To accurately track the last event timestamp of an AioHandler, it seems
> necessary to update the timestamp in the following two functions:
> 
> @@ -45,6 +45,7 @@ void aio_add_ready_handler(AioHandlerList *ready_list,
>  {
>      QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's list */
>      node->pfd.revents = revents;
> +    node->poll_idle_timeout = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      QLIST_INSERT_HEAD(ready_list, node, node_ready);
>  }
> 
> @@ -53,6 +54,7 @@ static void aio_add_poll_ready_handler(AioHandlerList *ready_list,
>  {
>      QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's list */
>      node->poll_ready = true;
> +    node->poll_idle_timeout = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      QLIST_INSERT_HEAD(ready_list, node, node_ready);
>  }
> 
> In addition, remove_idle_poll_handler() would need some adjustments as well.
> If this approach aligns with what you had in mind, I believe it can be
> incorporated in the next version without any issues.

That works but I worry a bit about calling qemu_clock_get_ns() for every
event. The timestamp does not need to be precise down to the nano-,
micro-, or even milli-second. I think you could instead pass 'now' into
aio_dispatch_ready_handlers() and assign the new
node->last_dispatch_timestamp field there.
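
A rough standalone sketch of that idea, under stated assumptions: the
Handler struct and dispatch_ready_handlers() below are simplified
stand-ins, not QEMU's AioHandler or aio_dispatch_ready_handlers(), and only
the last_dispatch_timestamp field name is taken from the suggestion above.
The caller samples the clock once and every ready handler receives the same
coarse timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an AIO handler; only the fields needed here. */
typedef struct Handler {
    int64_t last_dispatch_timestamp; /* coarse timestamp in nanoseconds */
    struct Handler *next_ready;      /* next entry in the ready list */
} Handler;

/*
 * Hypothetical dispatch loop: stamp each ready handler with 'now', which
 * the caller read from the clock once per loop iteration, instead of
 * reading the clock once per event.  Returns the number of handlers seen.
 */
static size_t dispatch_ready_handlers(Handler *ready_list, int64_t now)
{
    size_t count = 0;

    assert(now >= 0); /* clock readings are expected to be non-negative */
    for (Handler *h = ready_list; h != NULL; h = h->next_ready) {
        h->last_dispatch_timestamp = now;
        count++;
    }
    return count;
}
```

The trade-off is precision: all handlers dispatched in one iteration share
a timestamp, which is acceptable when the consumer only needs a coarse
"last seen" time.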

> > > +            /*
> > > +             * Update poll.ns for the node with an event.
> > > +             * Uses a weighted average of the current block_ns and the previous
> > > +             * poll.ns to smooth out polling time adjustments.
> > > +             */
> > > +            node->poll.ns = node->poll.ns
> > > +                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
> > > +                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
> > > +
> > > +            if (node->poll.ns > ctx->poll_max_ns) {
> > > +                node->poll.ns = 0;
> > > +            }
> > Previously:
> > -        if (poll->ns > ctx->poll_max_ns) {
> > -            poll->ns = ctx->poll_max_ns;
> > -        }
> > 
> > Was this causing excessive CPU consumption in your benchmarks?
> > 
> > Can you explain the rationale for zeroing the poll time? Aside from
> > reducing CPU consumption, it also reduces the chance that polling will
> > succeed and could therefore impact performance.
> > 
> > I'm asking about this because this patch makes several changes at once
> > and I'm not sure how the CPU usage and performance changes are
> > attributed to these multiple changes. I want to make sure the changes
> > merged are minimal and the best set - sometimes when multiple things are
> > changed at the same time, not all of them are beneficial.
> 
> This snippet you quoted under "Previously" is also reflected in the logic
> within grow_polling_time(), where the same approach is used.
> 
> The difference is that previously, all AioHandlers in the ready_list would
> update their own poll.ns using block_ns. As in the snippet below,
> if block_ns exceeded poll_max_ns, it would effectively be reset to 0 anyway.
> 
> -    } else if (block_ns > ctx->poll_max_ns) {
> -        /* We'd have to poll for too long, poll less */
> -        int64_t old = poll->ns;
> -
> -        if (ctx->poll_shrink) {
> -            poll->ns /= ctx->poll_shrink;
> -        } else {
> -            poll->ns = 0;
> -        }
> 
> I think I did not explain this part clearly enough in the commit message.
> Here’s a more detailed explanation of the current polling logic problem and
> approach:
> 
> Problem:
> Starting from QEMU 10.0, poll.ns was introduced per event handler to
> mitigate excessive fluctuations in IOThread polling times observed in
> earlier versions (QEMU 9.x).
> 
> However, in the current design, poll.ns is updated only when an event
> occurs, making it difficult to treat block_ns as a reliable event
> interval. Also, the IOThread’s next polling time is determined by the
> maximum poll.ns among all AioHandlers, which means idle AioHandlers with
> high poll.ns can have an outsized impact on the polling duration.
> 
> For io_uring, idle AioHandlers are cleared after POLL_IDLE_INTERVAL_NS
> (7s), but for ppoll/epoll there is no such mechanism, so CPU consumption
> due to idle nodes can increase even more.
> 
> Approach:
> To address this, we treat block_ns as an event interval and update each
> AioHandler’s poll.ns using a weighted factor. This smooths out polling
> time adjustments, preventing excessive fluctuations and ensuring that
> recent event intervals are properly reflected, which helps maintain
> performance while lowering CPU utilization.
> 
> To use block_ns as an event interval, we update polling times for both
> event and non-event AioHandlers in each loop iteration. Non-event
> AioHandlers do not require a weighted factor; this allows for rapid
> isolation of idle nodes, while ensuring that poll.ns can increase more
> responsively when an event occurs within a few subsequent loops.

Thanks for this information. Please include it in the commit description.
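
To make the weighted update in the quoted explanation concrete, here is a
small standalone sketch. The function name weighted_update() is
hypothetical; the arithmetic mirrors the expression in the quoted patch
(old - old/2^shift + sample/2^shift, seeded with the sample when there is
no history), with the shift passed as a parameter instead of the fixed
POLL_WEIGHT_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blend a new interval sample into a running estimate.  With shift = 3
 * the sample contributes roughly 1/8 and the old estimate roughly 7/8.
 * An estimate of 0 means "no history", so the sample is used directly.
 */
static int64_t weighted_update(int64_t old_ns, int64_t sample_ns,
                               unsigned shift)
{
    assert(shift >= 1 && shift <= 63);      /* keep the shift defined */
    assert(old_ns >= 0 && sample_ns >= 0);  /* durations are non-negative */

    if (old_ns == 0) {
        return sample_ns;
    }
    return (old_ns - (old_ns >> shift)) + (sample_ns >> shift);
}
```

For example, with an estimate of 8000 ns, a new 16000 ns interval and
shift 3, the result is 8000 - 1000 + 2000 = 9000 ns. A smaller shift makes
the estimate follow the latest interval more closely.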

> > > +            /*
> > > +             * To avoid excessive polling time increase, update adj_block_ns
> > > +             * for nodes with the event flag set to true
> > > +             */
> > > +            adj_block_ns = MAX(adj_block_ns, node->poll.ns);
> > adj_block_ns is not the blocking time, it's the maximum current poll
> > time across all nodes. It would be clearer to change the variable name.
> You're right. I will rename it to max_poll_ns to better reflect its purpose.
> > > +            node->poll.has_event = false;
> > > +         } else {
> > 4-space indentation should be used.
> I will also fix the indentation.
> > 
> > > +            /*
> > > +             * No event now, but was active before.
> > > +             * If it waits longer than poll_max_ns, poll.ns will stay 0
> > > +             * until the next event arrives.
> > > +             */
> > > +            if (node->poll.ns != 0) {
> > > +                node->poll.ns += block_ns;
> > Why is block_ns being added to a recently inactive node's polling time?
> > Here node->poll.ns no longer measures the weighted time until the
> > handler had an event.
> > 
> > If the goal is to get rid of inactive nodes, then maybe the idle handler
> > removal mechanism should be made more aggressive instead?
> > 
> > > +                if (node->poll.ns > ctx->poll_max_ns) {
> > > +                    node->poll.ns = 0;
> > > +                }
> > > +            }
> > > +        }
> > > +    }
> > > +
> > > +    if (adj_block_ns >= 0) {
> > > +        if (adj_block_ns > ctx->poll_ns) {
> > > +            grow_polling_time(ctx, adj_block_ns);
> > > +        } else {
> > > +            shrink_polling_time(ctx, adj_block_ns);
> > > +         }
> > > +     }
> > > + }
> > > +
> > >   bool aio_poll(AioContext *ctx, bool blocking)
> > >   {
> > >       AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
> > > @@ -723,6 +772,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
> > >       aio_free_deleted_handlers(ctx);
> > > +    if (ctx->poll_max_ns) {
> > > +        adjust_block_ns(ctx, block_ns);
> > > +    }
> > > +
> > >       qemu_lockcnt_dec(&ctx->list_lock);
> > >       progress |= timerlistgroup_run_timers(&ctx->tlg);
> > > @@ -784,6 +837,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> > >       qemu_lockcnt_inc(&ctx->list_lock);
> > >       QLIST_FOREACH(node, &ctx->aio_handlers, node) {
> > > +        node->poll.has_event = false;
> > >           node->poll.ns = 0;
> > >       }
> > >       qemu_lockcnt_dec(&ctx->list_lock);
> > > @@ -794,6 +848,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> > >       ctx->poll_max_ns = max_ns;
> > >       ctx->poll_grow = grow;
> > >       ctx->poll_shrink = shrink;
> > > +    ctx->poll_ns = 0;
> > >       aio_notify(ctx);
> > >   }
> > > diff --git a/util/async.c b/util/async.c
> > > index 80d6b01a8a..9d3627566f 100644
> > > --- a/util/async.c
> > > +++ b/util/async.c
> > > @@ -606,6 +606,7 @@ AioContext *aio_context_new(Error **errp)
> > >       timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
> > >       ctx->poll_max_ns = 0;
> > > +    ctx->poll_ns = 0;
> > >       ctx->poll_grow = 0;
> > >       ctx->poll_shrink = 0;
> > > -- 
> > > 2.50.1
> > > 
> 



* Re: [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals
  2026-03-30 19:17       ` Stefan Hajnoczi
@ 2026-03-31 20:42         ` JAEHOON KIM
  0 siblings, 0 replies; 18+ messages in thread
From: JAEHOON KIM @ 2026-03-31 20:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, mjrosato, farman, pbonzini, fam, armbru,
	eblake, berrange, eduardo, dave, sw

On 3/30/2026 2:17 PM, Stefan Hajnoczi wrote:
> On Fri, Mar 27, 2026 at 12:02:21AM -0500, JAEHOON KIM wrote:
>> On 3/25/2026 3:37 PM, Stefan Hajnoczi wrote:
>>> On Mon, Mar 23, 2026 at 08:54:50AM -0500, Jaehoon Kim wrote:
>>>> Refine adaptive polling in aio_poll by updating iothread polling
>>>> duration based on weighted AioHandler event intervals.
>>>>
>>>> Each AioHandler's poll.ns is updated using a weighted factor when an
>>>> event occurs. Idle handlers accumulate block_ns until poll_max_ns and
>>>> then reset to 0, preventing sporadically active handlers from
>>>> unnecessarily prolonging iothread polling.
>>>>
>>>> The iothread polling duration is set based on the largest poll.ns among
>>>> active handlers. The shrink divider defaults to 2, matching the grow
>>>> rate, to reduce frequent poll_ns resets for slow devices.
>>>>
>>>> The default weight factor (POLL_WEIGHT_SHIFT=3, meaning the current
>>>> interval contributes 12.5% to the weighted average) was selected based
>>>> on extensive testing comparing QEMU 10.0.0 baseline vs poll-weight=2
>>>> and poll-weight=3 across various workloads.
>>>>
>>>> The table below shows a comparison between:
>>>> -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1, Guest: RHEL 9.6GA vs.
>>>> -Host: RHEL 10.1 GA + qemu-10.0.0-14.el10_1 (w=2/w=3), Guest: RHEL 9.6GA
>>>> for FIO FCP and FICON with 1 iothread and 8 iothreads.
>>>> The values shown are the averages for numjobs 1, 4, and 8.
>>>>
>>>> Summary of results (% change vs baseline):
>>>>
>>>>                       | poll-weight=2      | poll-weight=3
>>>> --------------------|--------------------|-----------------
>>>> Throughput avg      | -2.4% (all tests)  | -2.2% (all tests)
>>>> CPU consumption avg | -10.9% (all tests) | -9.4% (all tests)
>>>>
>>>> Both weight=2 and weight=3 show significant CPU consumption reduction
>>>> (~10%) compared to baseline, which addresses the CPU utilization
>>>> regression observed in QEMU 10.0.0. The throughput impact is minimal
>>>> for both (~2%).
>>>>
>>>> Weight=3 is selected as the default because it provides slightly better
>>>> throughput (-2.2% vs -2.4%) while still achieving substantial CPU
>>>> savings (-9.4%). The difference between weight=2 and weight=3 is small,
>>>> but weight=3 offers a better balance for general-purpose workloads.
>>>>
>>>> Signed-off-by: Jaehoon Kim <jhkim@linux.ibm.com>
>>>> ---
>>>>    include/qemu/aio.h |   4 +-
>>>>    util/aio-posix.c   | 135 +++++++++++++++++++++++++++++++--------------
>>>>    util/async.c       |   1 +
>>>>    3 files changed, 99 insertions(+), 41 deletions(-)
>>>>
>>>> diff --git a/include/qemu/aio.h b/include/qemu/aio.h
>>>> index 8cca2360d1..6c77a190e9 100644
>>>> --- a/include/qemu/aio.h
>>>> +++ b/include/qemu/aio.h
>>>> @@ -195,7 +195,8 @@ struct BHListSlice {
>>>>    typedef QSLIST_HEAD(, AioHandler) AioHandlerSList;
>>>>    typedef struct AioPolledEvent {
>>>> -    int64_t ns;        /* current polling time in nanoseconds */
>>>> +    bool has_event; /* Flag to indicate if an event has occurred */
>>>> +    int64_t ns;     /* estimated block time in nanoseconds */
>>>>    } AioPolledEvent;
>>>>    struct AioContext {
>>>> @@ -306,6 +307,7 @@ struct AioContext {
>>>>        int poll_disable_cnt;
>>>>        /* Polling mode parameters */
>>>> +    int64_t poll_ns;        /* current polling time in nanoseconds */
>>>>        int64_t poll_max_ns;    /* maximum polling time in nanoseconds */
>>>>        int64_t poll_grow;      /* polling time growth factor */
>>>>        int64_t poll_shrink;    /* polling time shrink factor */
>>>> diff --git a/util/aio-posix.c b/util/aio-posix.c
>>>> index b02beb0505..2b3522f2f9 100644
>>>> --- a/util/aio-posix.c
>>>> +++ b/util/aio-posix.c
>>>> @@ -29,9 +29,11 @@
>>>>    /* Stop userspace polling on a handler if it isn't active for some time */
>>>>    #define POLL_IDLE_INTERVAL_NS (7 * NANOSECONDS_PER_SECOND)
>>>> +#define POLL_WEIGHT_SHIFT   (3)
>>>> -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
>>>> -                                int64_t block_ns);
>>>> +static void adjust_block_ns(AioContext *ctx, int64_t block_ns);
>>>> +static void grow_polling_time(AioContext *ctx, int64_t block_ns);
>>>> +static void shrink_polling_time(AioContext *ctx, int64_t block_ns);
>>>>    bool aio_poll_disabled(AioContext *ctx)
>>>>    {
>>>> @@ -373,7 +375,7 @@ static bool aio_dispatch_ready_handlers(AioContext *ctx,
>>>>             * add the handler to ctx->poll_aio_handlers.
>>> This comment refers to adjusting the polling time. The code no longer
>>> does this and the comment should be updated.
>> The comment about adjusting polling time is no longer accurate.
>> I will update it in the next version.
>>>>             */
>>>>            if (ctx->poll_max_ns && QLIST_IS_INSERTED(node, node_poll)) {
>>>> -            adjust_polling_time(ctx, &node->poll, block_ns);
>>> aio_dispatch_ready_handlers() no longer uses the block_ns argument. It
>>> can be removed.
>> I will remove the block_ns argument in the next version.
>>>> +            node->poll.has_event = true;
>>>>            }
>>>>        }
>>>> @@ -560,18 +562,13 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>>>>    static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>>>>                              int64_t *timeout)
>>>>    {
>>>> -    AioHandler *node;
>>>>        int64_t max_ns;
>>>>        if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
>>>>            return false;
>>>>        }
>>>> -    max_ns = 0;
>>>> -    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
>>>> -        max_ns = MAX(max_ns, node->poll.ns);
>>>> -    }
>>>> -    max_ns = qemu_soonest_timeout(*timeout, max_ns);
>>>> +    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
>>>>        if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
>>>>            /*
>>>> @@ -587,46 +584,98 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>>>>        return false;
>>>>    }
>>>> -static void adjust_polling_time(AioContext *ctx, AioPolledEvent *poll,
>>>> -                                int64_t block_ns)
>>>> +static void shrink_polling_time(AioContext *ctx, int64_t block_ns)
>>>>    {
>>>> -    if (block_ns <= poll->ns) {
>>>> -        /* This is the sweet spot, no adjustment needed */
>>>> -    } else if (block_ns > ctx->poll_max_ns) {
>>>> -        /* We'd have to poll for too long, poll less */
>>>> -        int64_t old = poll->ns;
>>>> -
>>>> -        if (ctx->poll_shrink) {
>>>> -            poll->ns /= ctx->poll_shrink;
>>>> -        } else {
>>>> -            poll->ns = 0;
>>>> -        }
>>>> +    /*
>>>> +     * Reduce polling time if the block_ns is zero or
>>>> +     * less than the current poll_ns.
>>>> +     */
>>>> +    int64_t old = ctx->poll_ns;
>>>> +    int64_t shrink = ctx->poll_shrink;
>>>> -        trace_poll_shrink(ctx, old, poll->ns);
>>>> -    } else if (poll->ns < ctx->poll_max_ns &&
>>>> -               block_ns < ctx->poll_max_ns) {
>>>> -        /* There is room to grow, poll longer */
>>>> -        int64_t old = poll->ns;
>>>> -        int64_t grow = ctx->poll_grow;
>>>> +    if (shrink == 0) {
>>>> +        shrink = 2;
>>>> +    }
>>>> -        if (grow == 0) {
>>>> -            grow = 2;
>>>> -        }
>>>> +    if (block_ns < (ctx->poll_ns / shrink)) {
>>>> +        ctx->poll_ns /= shrink;
>>>> +    }
>>>> -        if (poll->ns) {
>>>> -            poll->ns *= grow;
>>>> -        } else {
>>>> -            poll->ns = 4000; /* start polling at 4 microseconds */
>>>> -        }
>>>> +    trace_poll_shrink(ctx, old, ctx->poll_ns);
>>> This trace event should be inside if (block_ns < (ctx->poll_ns /
>>> shrink)) like it was before this patch.
>>>
>>>> +}
>>>> -        if (poll->ns > ctx->poll_max_ns) {
>>>> -            poll->ns = ctx->poll_max_ns;
>>>> -        }
>>>> +static void grow_polling_time(AioContext *ctx, int64_t block_ns)
>>>> +{
>>>> +    /* There is room to grow, poll longer */
>>>> +    int64_t old = ctx->poll_ns;
>>>> +    int64_t grow = ctx->poll_grow;
>>>> -        trace_poll_grow(ctx, old, poll->ns);
>>>> +    if (grow == 0) {
>>>> +        grow = 2;
>>>>        }
>>>> +
>>>> +    if (block_ns > ctx->poll_ns * grow) {
>>>> +        ctx->poll_ns = block_ns;
>>>> +    } else {
>>>> +        ctx->poll_ns *= grow;
>>>> +    }
>>>> +
>>>> +    if (ctx->poll_ns > ctx->poll_max_ns) {
>>>> +        ctx->poll_ns = ctx->poll_max_ns;
>>>> +    }
>>>> +
>>>> +    trace_poll_grow(ctx, old, ctx->poll_ns);
>>> Same here.
>> I will move the trace_poll_xxx functions inside if condition in the
>> next version.
>>>>    }
>>>> +static void adjust_block_ns(AioContext *ctx, int64_t block_ns)
>>>> +{
>>>> +    AioHandler *node;
>>>> +    int64_t adj_block_ns = -1;
>>>> +
>>>> +    QLIST_FOREACH(node, &ctx->poll_aio_handlers, node_poll) {
>>>> +        if (node->poll.has_event) {
>>> Did you consider unifying node->poll.has_event with
>>> node->poll_idle_timeout, which is assigned now + POLL_IDLE_INTERVAL_NS
>>> every time ->io_poll() detects an event?
>>>
>>> For instance, rename node->poll_idle_timeout to
>>> node->last_event_timestamp and assign now without adding
>>> POLL_IDLE_INTERVAL_NS. Then use the field for both idle node removal and
>>> adjust_block_ns() (pass in now).
>> Thank you for the suggestion, I think this is a good idea.
>> After testing, it seems that node->poll_idle_timeout can be reused as you
>> suggested, although a few adjustments are needed.
>>
>> Currently, an event is detected and the AioHandler is added to the
>> ready_list in three cases: run_poll_handler_once(),
>> ctx->fdmon_ops->wait(), and poll_set_started().
>>
>> To accurately track the last event timestamp of an AioHandler, it seems
>> necessary to update the timestamp in the following two functions:
>>
>> @@ -45,6 +45,7 @@ void aio_add_ready_handler(AioHandlerList *ready_list,
>>   {
>>       QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's
>> list */
>>       node->pfd.revents = revents;
>> +    node->poll_idle_timeout = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>       QLIST_INSERT_HEAD(ready_list, node, node_ready);
>>   }
>>
>> @@ -53,6 +54,7 @@ static void aio_add_poll_ready_handler(AioHandlerList
>> *ready_list,
>>   {
>>       QLIST_SAFE_REMOVE(node, node_ready); /* remove from nested parent's
>> list */
>>       node->poll_ready = true;
>> +    node->poll_idle_timeout = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>       QLIST_INSERT_HEAD(ready_list, node, node_ready);
>>   }
>>
>> In addition, remove_idle_poll_handler() would need some adjustments as well.
>> If this approach aligns with what you had in mind, I believe it can be
>> incorporated in the next version without any issues.
> That works but I worry a bit about calling qemu_clock_get_ns() for every
> event. The timestamp does not need to be precise down to the nano-,
> micro-, or even millisecond. I think you could instead pass 'now' into
> aio_dispatch_ready_handlers() and assign the new
> node->last_dispatch_timestamp field there.
Thanks for the suggestion, that makes sense.
You're right that calling qemu_clock_get_ns() on every event could introduce
unnecessary overhead, especially since we don't require high precision here.
I'll update the code accordingly.
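
A minimal sketch of that idea, not the actual patch: the caller reads the
clock once per loop iteration and passes 'now' down, so each dispatched
handler is stamped without its own clock read. The Handler struct,
dispatch_ready_handlers() and handler_is_idle() are hypothetical stand-ins
for AioHandler, aio_dispatch_ready_handlers() and the idle-removal check;
only the last_dispatch_timestamp field name comes from the review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for AioHandler; only the fields needed here. */
typedef struct Handler {
    int64_t last_dispatch_timestamp; /* coarse timestamp in ns */
    int ready;                       /* non-zero if an event was detected */
} Handler;

/*
 * Dispatch all ready handlers and stamp each with the caller's 'now'.
 * The clock is read once by the caller, not once per event.
 * Returns the number of handlers dispatched.
 */
static size_t dispatch_ready_handlers(Handler *handlers, size_t n,
                                      int64_t now)
{
    size_t dispatched = 0;

    assert(handlers != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (handlers[i].ready) {
            handlers[i].last_dispatch_timestamp = now;
            handlers[i].ready = 0;
            dispatched++;
        }
    }
    return dispatched;
}

/*
 * A handler counts as idle when no event was dispatched for at least
 * idle_interval_ns. The same field can serve idle removal and the
 * polling-time adjustment.
 */
static int handler_is_idle(const Handler *h, int64_t now,
                           int64_t idle_interval_ns)
{
    return now - h->last_dispatch_timestamp >= idle_interval_ns;
}
```

With this shape the timestamp is only as precise as the loop iteration,
which is enough for an idle timeout measured in seconds.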
>
>>>> +            /*
>>>> +             * Update poll.ns for the node with an event.
>>>> +             * Uses a weighted average of the current block_ns and the previous
>>>> +             * poll.ns to smooth out polling time adjustments.
>>>> +             */
>>>> +            node->poll.ns = node->poll.ns
>>>> +                ? (node->poll.ns - (node->poll.ns >> POLL_WEIGHT_SHIFT))
>>>> +                + (block_ns >> POLL_WEIGHT_SHIFT) : block_ns;
>>>> +
>>>> +            if (node->poll.ns > ctx->poll_max_ns) {
>>>> +                node->poll.ns = 0;
>>>> +            }
>>> Previously:
>>> -        if (poll->ns > ctx->poll_max_ns) {
>>> -            poll->ns = ctx->poll_max_ns;
>>> -        }
>>>
>>> Was this causing excessive CPU consumption in your benchmarks?
>>>
>>> Can you explain the rationale for zeroing the poll time? Aside from
>>> reducing CPU consumption, it also reduces the chance that polling will
>>> succeed and could therefore impact performance.
>>>
>>> I'm asking about this because this patch makes several changes at once
>>> and I'm not sure how the CPU usage and performance changes are
>>> attributed to these multiple changes. I want to make sure the changes
>>> merged are minimal and the best set - sometimes when multiple things are
>>> changed at the same time, not all of them are beneficial.
>> The snippet you quoted under "Previously" is also reflected in the logic
>> within grow_polling_time(), where the same approach is used.
>>
>> The difference is that previously, all AioHandlers in the ready_list would
>> update their own poll.ns using block_ns. As in the snippet below,
>> if block_ns exceeded poll_max_ns, poll.ns would effectively be reset to 0
>> anyway.
>>
>> -    } else if (block_ns > ctx->poll_max_ns) {
>> -        /* We'd have to poll for too long, poll less */
>> -        int64_t old = poll->ns;
>> -
>> -        if (ctx->poll_shrink) {
>> -            poll->ns /= ctx->poll_shrink;
>> -        } else {
>> -            poll->ns = 0;
>> -        }
>>
>> I think I did not explain this part clearly enough in the commit message.
>> Here is a more detailed explanation of the problem with the current
>> polling logic and of the approach taken:
>>
>> Problem:
>> Starting from QEMU 10.0, poll.ns was introduced per event handler to
>> mitigate excessive fluctuations in IOThread polling times observed in
>> earlier versions (QEMU 9.x).
>>
>> However, in the current design, poll.ns is updated only when an event
>> occurs, making it difficult to treat block_ns as a reliable event
>> interval. Also, the IOThread’s next polling time is determined by the
>> maximum poll.ns among all AioHandlers, which means idle AioHandlers with
>> a high poll.ns can have an outsized impact on the polling duration.
>>
>> For io_uring, idle AioHandlers are cleared after POLL_IDLE_INTERVAL_NS
>> (7s), but for ppoll/epoll there is no such mechanism, so CPU consumption
>> due to idle nodes can increase even more.
>>
>> Approach:
>> To address this, we treat block_ns as an event interval and update each
>> AioHandler’s poll.ns using a weighted factor. This smooths out polling
>> time adjustments, preventing excessive fluctuations and ensuring that
>> recent event intervals are properly reflected, which helps maintain
>> performance while lowering CPU utilization.
>>
>> To use block_ns as an event interval, we update polling times for both
>> event and non-event AioHandlers in each loop iteration. Non-event
>> AioHandlers do not require a weighted factor; this allows for rapid
>> isolation of idle nodes, while ensuring that poll.ns can increase more
>> responsively when an event occurs within a few subsequent loops.
> Thanks for this information. Please include it in the commit description.
Thanks, I'll include this in the commit description in the next version.
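
For reference, the two per-handler update rules from the quoted hunk can be
written as standalone functions. This is only a sketch: the function names
are made up for illustration, POLL_WEIGHT_SHIFT is fixed at 3 (the v2
default from the cover letter), and plain int64_t values stand in for
node->poll.ns, block_ns and ctx->poll_max_ns.

```c
#include <assert.h>
#include <stdint.h>

#define POLL_WEIGHT_SHIFT 3 /* v2 default per the cover letter */

/*
 * Handler that had an event: blend the new interval into the old value,
 * new = old - old/2^shift + block_ns/2^shift. The first sample is taken
 * as-is. A result above poll_max_ns resets the handler to 0.
 */
static int64_t update_active_poll_ns(int64_t poll_ns, int64_t block_ns,
                                     int64_t poll_max_ns)
{
    assert(poll_ns >= 0 && block_ns >= 0);

    if (poll_ns == 0) {
        poll_ns = block_ns;
    } else {
        poll_ns = (poll_ns - (poll_ns >> POLL_WEIGHT_SHIFT))
                  + (block_ns >> POLL_WEIGHT_SHIFT);
    }
    return poll_ns > poll_max_ns ? 0 : poll_ns;
}

/*
 * Handler without an event: no weighting, the interval is simply added,
 * so an idle handler crosses poll_max_ns quickly and then stays at 0
 * until its next event.
 */
static int64_t update_idle_poll_ns(int64_t poll_ns, int64_t block_ns,
                                   int64_t poll_max_ns)
{
    assert(poll_ns >= 0 && block_ns >= 0);

    if (poll_ns != 0) {
        poll_ns += block_ns;
        if (poll_ns > poll_max_ns) {
            poll_ns = 0;
        }
    }
    return poll_ns;
}
```

As a worked case with shift 3: a handler at 8000 ns that sees a 16000 ns
interval moves to 8000 - 1000 + 2000 = 9000 ns, so a single long interval
shifts the value by one eighth of the difference.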
>
>>>> +            /*
>>>> +             * To avoid excessive polling time increase, update adj_block_ns
>>>> +             * for nodes with the event flag set to true
>>>> +             */
>>>> +            adj_block_ns = MAX(adj_block_ns, node->poll.ns);
>>> adj_block_ns is not the blocking time, it's the maximum current poll
>>> time across all nodes. It would be clearer to change the variable name.
>> You're right. I will rename it to max_poll_ns to better reflect its purpose.
>>>> +            node->poll.has_event = false;
>>>> +         } else {
>>> 4-space indentation should be used.
>> I will also fix the indentation.
>>>> +            /*
>>>> +             * No event now, but was active before.
>>>> +             * If it waits longer than poll_max_ns, poll.ns will stay 0
>>>> +             * until the next event arrives.
>>>> +             */
>>>> +            if (node->poll.ns != 0) {
>>>> +                node->poll.ns += block_ns;
>>> Why is block_ns being added to a recently inactive node's polling time?
>>> Here node->poll.ns no longer measures the weighted time until the
>>> handler had an event.
>>>
>>> If the goal is to get rid of inactive nodes, then maybe the idle handler
>>> removal mechanism should be made more aggressive instead?
>>>
>>>> +                if (node->poll.ns > ctx->poll_max_ns) {
>>>> +                    node->poll.ns = 0;
>>>> +                }
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (adj_block_ns >= 0) {
>>>> +        if (adj_block_ns > ctx->poll_ns) {
>>>> +            grow_polling_time(ctx, adj_block_ns);
>>>> +        } else {
>>>> +            shrink_polling_time(ctx, adj_block_ns);
>>>> +         }
>>>> +     }
>>>> + }
>>>> +
>>>>    bool aio_poll(AioContext *ctx, bool blocking)
>>>>    {
>>>>        AioHandlerList ready_list = QLIST_HEAD_INITIALIZER(ready_list);
>>>> @@ -723,6 +772,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>>>>        aio_free_deleted_handlers(ctx);
>>>> +    if (ctx->poll_max_ns) {
>>>> +        adjust_block_ns(ctx, block_ns);
>>>> +    }
>>>> +
>>>>        qemu_lockcnt_dec(&ctx->list_lock);
>>>>        progress |= timerlistgroup_run_timers(&ctx->tlg);
>>>> @@ -784,6 +837,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>>>>        qemu_lockcnt_inc(&ctx->list_lock);
>>>>        QLIST_FOREACH(node, &ctx->aio_handlers, node) {
>>>> +        node->poll.has_event = false;
>>>>            node->poll.ns = 0;
>>>>        }
>>>>        qemu_lockcnt_dec(&ctx->list_lock);
>>>> @@ -794,6 +848,7 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
>>>>        ctx->poll_max_ns = max_ns;
>>>>        ctx->poll_grow = grow;
>>>>        ctx->poll_shrink = shrink;
>>>> +    ctx->poll_ns = 0;
>>>>        aio_notify(ctx);
>>>>    }
>>>> diff --git a/util/async.c b/util/async.c
>>>> index 80d6b01a8a..9d3627566f 100644
>>>> --- a/util/async.c
>>>> +++ b/util/async.c
>>>> @@ -606,6 +606,7 @@ AioContext *aio_context_new(Error **errp)
>>>>        timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
>>>>        ctx->poll_max_ns = 0;
>>>> +    ctx->poll_ns = 0;
>>>>        ctx->poll_grow = 0;
>>>>        ctx->poll_shrink = 0;
>>>> -- 
>>>> 2.50.1
>>>>
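
The grow rule from the grow_polling_time() hunk quoted earlier can be
checked in isolation. This is a sketch with a hypothetical helper name,
grow_poll_ns(), that takes the context fields as plain arguments and
returns the new value; tracing is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Grow the polling time: jump straight to block_ns when it exceeds the
 * grown value, otherwise multiply by the grow factor. A grow factor of 0
 * means "use the default of 2". The result is clamped to poll_max_ns.
 */
static int64_t grow_poll_ns(int64_t poll_ns, int64_t block_ns,
                            int64_t grow, int64_t poll_max_ns)
{
    assert(poll_ns >= 0 && block_ns >= 0 && grow >= 0);

    if (grow == 0) {
        grow = 2;
    }

    if (block_ns > poll_ns * grow) {
        poll_ns = block_ns;
    } else {
        poll_ns *= grow;
    }

    if (poll_ns > poll_max_ns) {
        poll_ns = poll_max_ns;
    }
    return poll_ns;
}
```

Starting from poll_ns == 0 the product is 0, so any positive block_ns is
adopted directly; that is how polling gets off the ground after the reset
in aio_context_set_poll_params().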




end of thread, other threads:[~2026-03-31 20:44 UTC | newest]

Thread overview: 18+ messages
2026-03-23 13:54 [PATCH RFC v2 0/3] improve aio-polling efficiency Jaehoon Kim
2026-03-23 13:54 ` [PATCH RFC v2 1/3] aio-poll: avoid unnecessary polling time computation Jaehoon Kim
2026-03-25 17:22   ` Stefan Hajnoczi
2026-03-26 18:17     ` JAEHOON KIM
2026-03-26 18:34       ` Stefan Hajnoczi
2026-03-23 13:54 ` [PATCH RFC v2 2/3] aio-poll: refine iothread polling using weighted handler intervals Jaehoon Kim
2026-03-25 20:37   ` Stefan Hajnoczi
2026-03-27  5:02     ` JAEHOON KIM
2026-03-30 19:17       ` Stefan Hajnoczi
2026-03-31 20:42         ` JAEHOON KIM
2026-03-23 13:54 ` [PATCH RFC v2 3/3] qapi/iothread: introduce poll-weight parameter for aio-poll Jaehoon Kim
2026-03-25 14:04   ` Markus Armbruster
2026-03-26 15:55     ` JAEHOON KIM
2026-03-27  5:49       ` Markus Armbruster
2026-03-27 14:23         ` JAEHOON KIM
2026-03-25 16:52   ` Stefan Hajnoczi
2026-03-25 16:56   ` Stefan Hajnoczi
2026-03-26 16:13     ` JAEHOON KIM
