* [PATCH] rcu: Unify force quiescent state
@ 2025-10-16  6:34 Akihiko Odaki
  2025-10-16 12:59 ` Paolo Bonzini
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Akihiko Odaki @ 2025-10-16  6:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Dmitry Osipenko, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin,
	Akihiko Odaki
Borrow the concept of force quiescent state from Linux to ensure readers
remain fast during normal operation and to avoid stalls.
Background
==========
The previous implementation had four steps to begin reclamation.
1. call_rcu_thread() would wait for the first callback.
2. call_rcu_thread() would periodically poll until a decent number of
   callbacks piled up or it timed out.
3. synchronize_rcu() would start a grace period (GP).
4. wait_for_readers() would wait for the GP to end. It would also
   trigger the force_rcu notifier to break busy loops in a read-side
   critical section if drain_call_rcu() had been called.
Problem
=======
The separation of waiting logic across these steps led to suboptimal
behavior:
The GP was delayed until call_rcu_thread() stopped polling.
force_rcu was not consistently triggered when call_rcu_thread() detected
a high number of pending callbacks or a timeout. This inconsistency
sometimes led to stalls, as reported in a virtio-gpu issue where memory
unmapping was blocked[1].
wait_for_readers() imposed unnecessary overhead in non-urgent cases by
unconditionally executing qatomic_set(&index->waiting, true) and
qemu_event_reset(&rcu_gp_event), which are necessary only for expedited
synchronization.
Solution
========
Move the polling from call_rcu_thread() to wait_for_readers() to prevent
delaying the GP. Additionally, reorganize wait_for_readers() to
distinguish between two states:
Normal State: It relies exclusively on periodic polling to detect
the end of the GP and maintains the read-side fast path.
Force Quiescent State: Whenever synchronization is expedited, it always
triggers force_rcu and executes both qatomic_set(&index->waiting, true)
and qemu_event_reset(&rcu_gp_event). This avoids stalls while confining
the read-side overhead to this state.
This unified approach, inspired by Linux RCU, ensures consistent and
efficient RCU grace period handling and is confirmed to resolve the
virtio-gpu issue.
[1] https://lore.kernel.org/qemu-devel/20251014111234.3190346-9-alex.bennee@linaro.org/
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
---
 util/rcu.c | 79 ++++++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 28 deletions(-)
diff --git a/util/rcu.c b/util/rcu.c
index b703c86f15a3..acac9446ea98 100644
--- a/util/rcu.c
+++ b/util/rcu.c
@@ -43,10 +43,14 @@
 #define RCU_GP_LOCKED           (1UL << 0)
 #define RCU_GP_CTR              (1UL << 1)
 
+
+#define RCU_CALL_MIN_SIZE        30
+
 unsigned long rcu_gp_ctr = RCU_GP_LOCKED;
 
 QemuEvent rcu_gp_event;
 static int in_drain_call_rcu;
+static int rcu_call_count;
 static QemuMutex rcu_registry_lock;
 static QemuMutex rcu_sync_lock;
 
@@ -76,15 +80,29 @@ static void wait_for_readers(void)
 {
     ThreadList qsreaders = QLIST_HEAD_INITIALIZER(qsreaders);
     struct rcu_reader_data *index, *tmp;
+    int sleeps = 0;
+    bool forced = false;
 
     for (;;) {
-        /* We want to be notified of changes made to rcu_gp_ongoing
-         * while we walk the list.
+        /*
+         * Force the grace period to end and wait for it if any of the
+         * following heuristical conditions are satisfied:
+         * - A decent number of callbacks piled up.
+         * - It timed out.
+         * - It is in a drain_call_rcu() call.
+         *
+         * Otherwise, periodically poll the grace period, hoping it ends
+         * promptly.
          */
-        qemu_event_reset(&rcu_gp_event);
+        if (!forced &&
+            (qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
+             sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
+            forced = true;
 
-        QLIST_FOREACH(index, &registry, node) {
-            qatomic_set(&index->waiting, true);
+            QLIST_FOREACH(index, &registry, node) {
+                notifier_list_notify(&index->force_rcu, NULL);
+                qatomic_set(&index->waiting, true);
+            }
         }
 
         /* Here, order the stores to index->waiting before the loads of
@@ -106,8 +124,6 @@ static void wait_for_readers(void)
                  * get some extra futex wakeups.
                  */
                 qatomic_set(&index->waiting, false);
-            } else if (qatomic_read(&in_drain_call_rcu)) {
-                notifier_list_notify(&index->force_rcu, NULL);
             }
         }
 
@@ -115,7 +131,8 @@ static void wait_for_readers(void)
             break;
         }
 
-        /* Wait for one thread to report a quiescent state and try again.
+        /*
+         * Sleep for a while and try again.
          * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
          * wait too much time.
          *
@@ -133,7 +150,20 @@ static void wait_for_readers(void)
          * rcu_registry_lock is released.
          */
         qemu_mutex_unlock(&rcu_registry_lock);
-        qemu_event_wait(&rcu_gp_event);
+
+        if (forced) {
+            qemu_event_wait(&rcu_gp_event);
+
+            /*
+             * We want to be notified of changes made to rcu_gp_ongoing
+             * while we walk the list.
+             */
+            qemu_event_reset(&rcu_gp_event);
+        } else {
+            g_usleep(10000);
+            sleeps++;
+        }
+
         qemu_mutex_lock(&rcu_registry_lock);
     }
 
@@ -173,15 +203,11 @@ void synchronize_rcu(void)
     }
 }
 
-
-#define RCU_CALL_MIN_SIZE        30
-
 /* Multi-producer, single-consumer queue based on urcu/static/wfqueue.h
  * from liburcu.  Note that head is only used by the consumer.
  */
 static struct rcu_head dummy;
 static struct rcu_head *head = &dummy, **tail = &dummy.next;
-static int rcu_call_count;
 static QemuEvent rcu_call_ready_event;
 
 static void enqueue(struct rcu_head *node)
@@ -259,30 +285,27 @@ static void *call_rcu_thread(void *opaque)
     rcu_register_thread();
 
     for (;;) {
-        int tries = 0;
-        int n = qatomic_read(&rcu_call_count);
+        int n;
 
-        /* Heuristically wait for a decent number of callbacks to pile up.
+        /*
          * Fetch rcu_call_count now, we only must process elements that were
          * added before synchronize_rcu() starts.
          */
-        while (n == 0 || (n < RCU_CALL_MIN_SIZE && ++tries <= 5)) {
-            g_usleep(10000);
-            if (n == 0) {
-                qemu_event_reset(&rcu_call_ready_event);
-                n = qatomic_read(&rcu_call_count);
-                if (n == 0) {
+        for (;;) {
+            qemu_event_reset(&rcu_call_ready_event);
+            n = qatomic_read(&rcu_call_count);
+            if (n) {
+                break;
+            }
+
 #if defined(CONFIG_MALLOC_TRIM)
-                    malloc_trim(4 * 1024 * 1024);
+            malloc_trim(4 * 1024 * 1024);
 #endif
-                    qemu_event_wait(&rcu_call_ready_event);
-                }
-            }
-            n = qatomic_read(&rcu_call_count);
+            qemu_event_wait(&rcu_call_ready_event);
         }
 
-        qatomic_sub(&rcu_call_count, n);
         synchronize_rcu();
+        qatomic_sub(&rcu_call_count, n);
         bql_lock();
         while (n > 0) {
             node = try_dequeue();
---
base-commit: 0dc905ac306c68649e05cdaf8434123c8f917b41
change-id: 20251015-force-c4e03a9ba719
Best regards,
--  
Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-16  6:34 [PATCH] rcu: Unify force quiescent state Akihiko Odaki
@ 2025-10-16 12:59 ` Paolo Bonzini
  2025-10-16 19:33 ` Dmitry Osipenko
  2025-10-19 20:25 ` Dmitry Osipenko
  2 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2025-10-16 12:59 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel
  Cc: David Hildenbrand, Eric Blake, Philippe Mathieu-Daudé,
	Dmitry Osipenko, Markus Armbruster, Marc-André Lureau,
	Peter Xu, Michael S. Tsirkin
On 10/16/25 08:34, Akihiko Odaki wrote:
> Borrow the concept of force quiescent state from Linux to ensure readers
> remain fast during normal operation and to avoid stalls.
Very nice solution!  The code of the call RCU thread is simpler.  I will 
follow up with extracting
             node = try_dequeue();
             while (!node) {
                 bql_unlock();
                 qemu_event_reset(&rcu_call_ready_event);
                 node = try_dequeue();
                 if (!node) {
                     qemu_event_wait(&rcu_call_ready_event);
                     node = try_dequeue();
                 }
                 bql_lock();
             }
into its own function dequeue().
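For illustration, a minimal sketch of such a dequeue() helper (just the
loop above wrapped in a function; not necessarily the final shape) could
be:

    /* Pop one callback, sleeping with the BQL dropped while the queue is empty. */
    static struct rcu_head *dequeue(void)
    {
        struct rcu_head *node = try_dequeue();

        while (!node) {
            bql_unlock();
            qemu_event_reset(&rcu_call_ready_event);
            node = try_dequeue();
            if (!node) {
                qemu_event_wait(&rcu_call_ready_event);
                node = try_dequeue();
            }
            bql_lock();
        }

        return node;
    }

The loop in call_rcu_thread() would then just call node = dequeue().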
Paolo
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-16  6:34 [PATCH] rcu: Unify force quiescent state Akihiko Odaki
  2025-10-16 12:59 ` Paolo Bonzini
@ 2025-10-16 19:33 ` Dmitry Osipenko
  2025-10-16 23:43   ` Akihiko Odaki
  2025-10-20  6:38   ` Paolo Bonzini
  2025-10-19 20:25 ` Dmitry Osipenko
  2 siblings, 2 replies; 14+ messages in thread
From: Dmitry Osipenko @ 2025-10-16 19:33 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 10/16/25 09:34, Akihiko Odaki wrote:
> -        /* Wait for one thread to report a quiescent state and try again.
> +        /*
> +         * Sleep for a while and try again.
>           * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
>           * wait too much time.
>           *
> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>           * rcu_registry_lock is released.
>           */
>          qemu_mutex_unlock(&rcu_registry_lock);
> -        qemu_event_wait(&rcu_gp_event);
> +
> +        if (forced) {
> +            qemu_event_wait(&rcu_gp_event);
> +
> +            /*
> +             * We want to be notified of changes made to rcu_gp_ongoing
> +             * while we walk the list.
> +             */
> +            qemu_event_reset(&rcu_gp_event);
> +        } else {
> +            g_usleep(10000);
> +            sleeps++;
Thanks a lot for this RCU improvement. It indeed removes the hard stalls
when unmapping virtio-gpu blobs.
Am I understanding correctly that we will potentially be hitting this
g_usleep(10000) and stalling virtio-gpu for the first ~10ms? I.e. the
MemoryRegion patches from Alex [1] are still needed to avoid stalls
entirely.
[1]
https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-alex.bennee@linaro.org/
-- 
Best regards,
Dmitry
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-16 19:33 ` Dmitry Osipenko
@ 2025-10-16 23:43   ` Akihiko Odaki
  2025-10-17  0:40     ` Akihiko Odaki
  2025-10-20  6:38   ` Paolo Bonzini
  1 sibling, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2025-10-16 23:43 UTC (permalink / raw)
  To: Dmitry Osipenko, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 2025/10/17 4:33, Dmitry Osipenko wrote:
> On 10/16/25 09:34, Akihiko Odaki wrote:
>> -        /* Wait for one thread to report a quiescent state and try again.
>> +        /*
>> +         * Sleep for a while and try again.
>>            * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
>>            * wait too much time.
>>            *
>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>            * rcu_registry_lock is released.
>>            */
>>           qemu_mutex_unlock(&rcu_registry_lock);
>> -        qemu_event_wait(&rcu_gp_event);
>> +
>> +        if (forced) {
>> +            qemu_event_wait(&rcu_gp_event);
>> +
>> +            /*
>> +             * We want to be notified of changes made to rcu_gp_ongoing
>> +             * while we walk the list.
>> +             */
>> +            qemu_event_reset(&rcu_gp_event);
>> +        } else {
>> +            g_usleep(10000);
>> +            sleeps++;
> 
> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
> with unmapping of virtio-gpu blobs.
> 
> Am I understanding correctly that potentially we will be hitting this
> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
> MemoryRegion patches from Alex [1] are still needed to avoid stalls
> entirely.
> 
> [1]
> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-alex.bennee@linaro.org/
That is right, but "avoiding stalls entirely" also causes use-after-free.
The problem with virtio-gpu on TCG is that TCG keeps using the old 
memory map until force_rcu is triggered. So, without force_rcu, the 
following pseudo-code on a guest will result in use-after-free:
address = blob_map(resource_id);
blob_unmap(resource_id);
for (i = 0; i < some_big_number; i++)
   *(uint8_t *)address = 0;
*(uint8_t *)address will dereference the blob until force_rcu is
triggered, so finalizing the MemoryRegion before force_rcu results in a
use-after-free.
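(For context, here is a simplified sketch of how a TCG vCPU thread
consumes force_rcu, patterned after what accel/tcg does; the struct and
function names are made up for the sketch and the real code differs in
detail:)

    /*
     * Each vCPU thread registers a force_rcu notifier.  When wait_for_readers()
     * fires it, the callback kicks that vCPU out of its execution loop so the
     * thread reaches a quiescent state and picks up the new memory map.
     */
    typedef struct ForceRcuNotifier {
        Notifier notifier;
        CPUState *cpu;
    } ForceRcuNotifier;

    static void do_force_rcu(Notifier *notifier, void *data)
    {
        ForceRcuNotifier *f = container_of(notifier, ForceRcuNotifier, notifier);

        cpu_exit(f->cpu);  /* interrupt the TCG execution loop */
    }

    /* in the vCPU thread, before entering the execution loop: */
    ForceRcuNotifier force_rcu = { .notifier.notify = do_force_rcu, .cpu = cpu };
    rcu_add_force_rcu_notifier(&force_rcu.notifier);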
The best option I have in mind to eliminate the delay entirely is to
call drain_call_rcu(), but I'm not for such a change (for now).
drain_call_rcu() eliminates the delay if the FlatView protected by RCU
is the only referrer of the MemoryRegion, but that is not guaranteed.
Performance should not be a concern anyway in this situation. The guest
should not waste CPU time by polling in the first place if you really
care about performance; since it's a para-virtualized device and not real
hardware, CPU time may be shared between the guest and the device, and
thus polling on the guest has an inherent risk of slowing down the
device. For performance-sensitive workloads, the guest should:
- avoid polling and
- accumulate commands instead of waiting for each
The delay will be less problematic if the guest does so, and I think at 
least Linux does avoid polling.
That said, stalling the guest forever in this situation is "wrong" (!= 
"bad performance"). I wrote this patch to guarantee forward progress, 
which is mandatory for semantic correctness.
Perhaps drain_call_rcu() may also make sense in other,
performance-sensitive scenarios, but it should be added only after
benchmarking or we will have a premature optimization.
Regards,
Akihiko Odaki
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-16 23:43   ` Akihiko Odaki
@ 2025-10-17  0:40     ` Akihiko Odaki
  2025-10-19 20:23       ` Dmitry Osipenko
  2025-10-22  3:30       ` Dmitry Osipenko
  0 siblings, 2 replies; 14+ messages in thread
From: Akihiko Odaki @ 2025-10-17  0:40 UTC (permalink / raw)
  To: Dmitry Osipenko, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 2025/10/17 8:43, Akihiko Odaki wrote:
> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>> -        /* Wait for one thread to report a quiescent state and try 
>>> again.
>>> +        /*
>>> +         * Sleep for a while and try again.
>>>            * Release rcu_registry_lock, so rcu_(un)register_thread() 
>>> doesn't
>>>            * wait too much time.
>>>            *
>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>            * rcu_registry_lock is released.
>>>            */
>>>           qemu_mutex_unlock(&rcu_registry_lock);
>>> -        qemu_event_wait(&rcu_gp_event);
>>> +
>>> +        if (forced) {
>>> +            qemu_event_wait(&rcu_gp_event);
>>> +
>>> +            /*
>>> +             * We want to be notified of changes made to rcu_gp_ongoing
>>> +             * while we walk the list.
>>> +             */
>>> +            qemu_event_reset(&rcu_gp_event);
>>> +        } else {
>>> +            g_usleep(10000);
>>> +            sleeps++;
>>
>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>> with unmapping of virtio-gpu blobs.
>>
>> Am I understanding correctly that potentially we will be hitting this
>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>> entirely.
>>
>> [1]
>> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6- 
>> alex.bennee@linaro.org/
> 
> That is right, but "avoiding stalls entirely" also causes use-after-free.
> 
> The problem with virtio-gpu on TCG is that TCG keeps using the old 
> memory map until force_rcu is triggered. So, without force_rcu, the 
> following pseudo-code on a guest will result in use-after-free:
> 
> address = blob_map(resource_id);
> blob_unmap(resource_id);
> 
> for (i = 0; i < some_big_number; i++)
>    *(uint8_t *)address = 0;
> 
> *(uint8_t *)address will dereference the blob until force_rcu is 
> triggered, so finalizing MemoryRegion before force_rcu results in use- 
> after-free.
> 
> The best option to eliminate the delay entirely I have in mind is to 
> call drain_call_rcu(), but I'm not for such a change (for now). 
> drain_call_rcu() eliminates the delay if the FlatView protected by RCU 
> is the only referrer of the MemoryRegion, but that is not guaranteed.
> 
> Performance should not be a concern anyway in this situation. The guest 
> should not waste CPU time by polling in the first place if you really 
> care performance; since it's a para-virtualized device and not a real 
> hardware, CPU time may be shared between the guest and the device, and 
> thus polling on the guest has an inherent risk of slowing down the 
> device. For performance-sensitive workloads, the guest should:
> 
> - avoid polling and
> - accumulate commands instead of waiting for each
> 
> The delay will be less problematic if the guest does so, and I think at 
> least Linux does avoid polling.
> 
> That said, stalling the guest forever in this situation is "wrong" (!= 
> "bad performance"). I wrote this patch to guarantee forward progress, 
> which is mandatory for semantic correctness.
> 
> Perhaps drain_call_rcu() may make sense also in other, performance- 
> sensitive scenarios, but it should be added after benchmark or we will 
> have a immature optimization.
I first thought just adding drain_call_rcu() would work but apparently 
it is not that simple. Adding drain_call_rcu() has a few problems:
- It drops the BQL, which should be avoided. Problems caused by
run_on_cpu(), which drops the BQL, were discussed on the list a few
times, and drain_call_rcu() may also suffer from them.
- It is less effective if the RCU thread enters g_usleep() before 
drain_call_rcu() is called.
- It slows down readers due to the nature of drain_call_rcu().
So, if you know of workloads that may suffer from the delay, it may be
a good idea to try them with the patches from Alex first, and then think
of a clean solution if they improve performance.
Regards,
Akihiko Odaki
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-17  0:40     ` Akihiko Odaki
@ 2025-10-19 20:23       ` Dmitry Osipenko
  2025-10-20  1:14         ` Akihiko Odaki
  2025-10-22  3:30       ` Dmitry Osipenko
  1 sibling, 1 reply; 14+ messages in thread
From: Dmitry Osipenko @ 2025-10-19 20:23 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 10/17/25 03:40, Akihiko Odaki wrote:
> On 2025/10/17 8:43, Akihiko Odaki wrote:
>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>> -        /* Wait for one thread to report a quiescent state and try
>>>> again.
>>>> +        /*
>>>> +         * Sleep for a while and try again.
>>>>            * Release rcu_registry_lock, so rcu_(un)register_thread()
>>>> doesn't
>>>>            * wait too much time.
>>>>            *
>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>            * rcu_registry_lock is released.
>>>>            */
>>>>           qemu_mutex_unlock(&rcu_registry_lock);
>>>> -        qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +        if (forced) {
>>>> +            qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +            /*
>>>> +             * We want to be notified of changes made to
>>>> rcu_gp_ongoing
>>>> +             * while we walk the list.
>>>> +             */
>>>> +            qemu_event_reset(&rcu_gp_event);
>>>> +        } else {
>>>> +            g_usleep(10000);
>>>> +            sleeps++;
>>>
>>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>>> with unmapping of virtio-gpu blobs.
>>>
>>> Am I understanding correctly that potentially we will be hitting this
>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>> entirely.
>>>
>>> [1]
>>> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-
>>> alex.bennee@linaro.org/
>>
>> That is right, but "avoiding stalls entirely" also causes use-after-free.
>>
>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>> memory map until force_rcu is triggered. So, without force_rcu, the
>> following pseudo-code on a guest will result in use-after-free:
>>
>> address = blob_map(resource_id);
>> blob_unmap(resource_id);
>>
>> for (i = 0; i < some_big_number; i++)
>>    *(uint8_t *)address = 0;
>>
>> *(uint8_t *)address will dereference the blob until force_rcu is
>> triggered, so finalizing MemoryRegion before force_rcu results in use-
>> after-free.
>>
>> The best option to eliminate the delay entirely I have in mind is to
>> call drain_call_rcu(), but I'm not for such a change (for now).
>> drain_call_rcu() eliminates the delay if the FlatView protected by RCU
>> is the only referrer of the MemoryRegion, but that is not guaranteed.
>>
>> Performance should not be a concern anyway in this situation. The
>> guest should not waste CPU time by polling in the first place if you
>> really care performance; since it's a para-virtualized device and not
>> a real hardware, CPU time may be shared between the guest and the
>> device, and thus polling on the guest has an inherent risk of slowing
>> down the device. For performance-sensitive workloads, the guest should:
>>
>> - avoid polling and
>> - accumulate commands instead of waiting for each
>>
>> The delay will be less problematic if the guest does so, and I think
>> at least Linux does avoid polling.
>>
>> That said, stalling the guest forever in this situation is "wrong" (!=
>> "bad performance"). I wrote this patch to guarantee forward progress,
>> which is mandatory for semantic correctness.
>>
>> Perhaps drain_call_rcu() may make sense also in other, performance-
>> sensitive scenarios, but it should be added after benchmark or we will
>> have a immature optimization.
> 
> I first thought just adding drain_call_rcu() would work but apparently
> it is not that simple. Adding drain_call_rcu() has a few problems:
> 
> - It drops the BQL, which should be avoided. Problems caused by
> run_on_cpu(), which drops the BQL, was discussed on the list for a few
> times and drain_call_rcu() may also suffer from them.
> 
> - It is less effective if the RCU thread enters g_usleep() before
> drain_call_rcu() is called.
> 
> - It slows down readers due to the nature of drain_call_rcu().
> 
> So, if you know some workload that may suffer from the delay, it may be
> a good idea to try them with the patches from Alex first, and then think
> of a clean solution if it improves performance.
Thanks a lot for the clarification. I'm seeing occasional 10ms stalls
with this patch applied; still, it's a huge improvement. Looking forward
to v2.
In addition to the guest waiting for virgl command completion, QEMU
display updates on the host are also blocked while the unmapping command
is suspended. This is a noticeable problem for interactive GFX
applications running on the guest.
-- 
Best regards,
Dmitry
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-19 20:23       ` Dmitry Osipenko
@ 2025-10-20  1:14         ` Akihiko Odaki
  2025-10-20 20:41           ` Dmitry Osipenko
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2025-10-20  1:14 UTC (permalink / raw)
  To: Dmitry Osipenko, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 2025/10/20 5:23, Dmitry Osipenko wrote:
> On 10/17/25 03:40, Akihiko Odaki wrote:
>> On 2025/10/17 8:43, Akihiko Odaki wrote:
>>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>>> -        /* Wait for one thread to report a quiescent state and try
>>>>> again.
>>>>> +        /*
>>>>> +         * Sleep for a while and try again.
>>>>>             * Release rcu_registry_lock, so rcu_(un)register_thread()
>>>>> doesn't
>>>>>             * wait too much time.
>>>>>             *
>>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>>             * rcu_registry_lock is released.
>>>>>             */
>>>>>            qemu_mutex_unlock(&rcu_registry_lock);
>>>>> -        qemu_event_wait(&rcu_gp_event);
>>>>> +
>>>>> +        if (forced) {
>>>>> +            qemu_event_wait(&rcu_gp_event);
>>>>> +
>>>>> +            /*
>>>>> +             * We want to be notified of changes made to
>>>>> rcu_gp_ongoing
>>>>> +             * while we walk the list.
>>>>> +             */
>>>>> +            qemu_event_reset(&rcu_gp_event);
>>>>> +        } else {
>>>>> +            g_usleep(10000);
>>>>> +            sleeps++;
>>>>
>>>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>>>> with unmapping of virtio-gpu blobs.
>>>>
>>>> Am I understanding correctly that potentially we will be hitting this
>>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>>> entirely.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-
>>>> alex.bennee@linaro.org/
>>>
>>> That is right, but "avoiding stalls entirely" also causes use-after-free.
>>>
>>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>>> memory map until force_rcu is triggered. So, without force_rcu, the
>>> following pseudo-code on a guest will result in use-after-free:
>>>
>>> address = blob_map(resource_id);
>>> blob_unmap(resource_id);
>>>
>>> for (i = 0; i < some_big_number; i++)
>>>     *(uint8_t *)address = 0;
>>>
>>> *(uint8_t *)address will dereference the blob until force_rcu is
>>> triggered, so finalizing MemoryRegion before force_rcu results in use-
>>> after-free.
>>>
>>> The best option to eliminate the delay entirely I have in mind is to
>>> call drain_call_rcu(), but I'm not for such a change (for now).
>>> drain_call_rcu() eliminates the delay if the FlatView protected by RCU
>>> is the only referrer of the MemoryRegion, but that is not guaranteed.
>>>
>>> Performance should not be a concern anyway in this situation. The
>>> guest should not waste CPU time by polling in the first place if you
>>> really care performance; since it's a para-virtualized device and not
>>> a real hardware, CPU time may be shared between the guest and the
>>> device, and thus polling on the guest has an inherent risk of slowing
>>> down the device. For performance-sensitive workloads, the guest should:
>>>
>>> - avoid polling and
>>> - accumulate commands instead of waiting for each
>>>
>>> The delay will be less problematic if the guest does so, and I think
>>> at least Linux does avoid polling.
>>>
>>> That said, stalling the guest forever in this situation is "wrong" (!=
>>> "bad performance"). I wrote this patch to guarantee forward progress,
>>> which is mandatory for semantic correctness.
>>>
>>> Perhaps drain_call_rcu() may make sense also in other, performance-
>>> sensitive scenarios, but it should be added after benchmark or we will
>>> have a immature optimization.
>>
>> I first thought just adding drain_call_rcu() would work but apparently
>> it is not that simple. Adding drain_call_rcu() has a few problems:
>>
>> - It drops the BQL, which should be avoided. Problems caused by
>> run_on_cpu(), which drops the BQL, was discussed on the list for a few
>> times and drain_call_rcu() may also suffer from them.
>>
>> - It is less effective if the RCU thread enters g_usleep() before
>> drain_call_rcu() is called.
>>
>> - It slows down readers due to the nature of drain_call_rcu().
>>
>> So, if you know some workload that may suffer from the delay, it may be
>> a good idea to try them with the patches from Alex first, and then think
>> of a clean solution if it improves performance.
> 
> Thanks a lot for the clarification. I'm seeing occasional 10ms stalls
> with this patch applied, still it's a huge improvement. Looking forward
> to v2.
Just for (further) clarification: 10ms stalls are present even
without this patch (correct me if I'm wrong). I think the stalls need to
be resolved with another patch instead of a v2 of this one, unless it is
a regression.
> 
> In addition to a guest waiting for the virgl commands completion, QEMU
> display updates on host are also blocked while unmapping cmd is
> suspended. This is a noticeable problem for interactive GFX applications
> running on guest.
I guess you meant that the scanout commands following unmapping commands
are blocked. While I can imagine that causing skipped frames and hurting
the user experience, it would be nice if you know reproduction cases or
affected workloads and could share them with me.
Regards,
Akihiko Odaki
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-20  1:14         ` Akihiko Odaki
@ 2025-10-20 20:41           ` Dmitry Osipenko
  0 siblings, 0 replies; 14+ messages in thread
From: Dmitry Osipenko @ 2025-10-20 20:41 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 10/20/25 04:14, Akihiko Odaki wrote:
...
>>> So, if you know some workload that may suffer from the delay, it may be
>>> a good idea to try them with the patches from Alex first, and then think
>>> of a clean solution if it improves performance.
>>
>> Thanks a lot for the clarification. I'm seeing occasional 10ms stalls
>> with this patch applied, still it's a huge improvement. Looking forward
>> to v2.
> 
> Just for (further) clarification, but 10ms stalls are present even
> without this patch (correct me if I'm wrong). I think the stalls need to
> be resolved with another patch instead of having v2 of this unless it is
> a regression.
Stalls are present without this patch and they are much worse than with
your patch: without your patch, unmapping stalls for 50-100ms; with your
patch, it stalls for 2-20ms.
There are no stalls at all with the patches from Alex; unmapping finishes
instantly and performance is ideal.
>> In addition to a guest waiting for the virgl commands completion, QEMU
>> display updates on host are also blocked while unmapping cmd is
>> suspended. This is a noticeable problem for interactive GFX applications
>> running on guest.
> 
> I guess you meant that the scanout commands following unmapping commands
> are blocked. While I can imagine that can cause frames skipped and
> damage user experience, it is nice if you know reproduction cases or
> affected workloads and share them with me.
Correct, scanout commands are blocked.
Running pretty much any VK application with venus reproduces the
problem. A simple reproduction case is to run vkmark with venus; it
noticeably stalls when switching between benchmark modes and when the
app quits.
With native contexts the problem is much more visible. Running any
desktop environment (KDE Plasma in my case) on the guest with amd/intel
nctx freezes badly when moving an application window around the desktop.
-- 
Best regards,
Dmitry
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-17  0:40     ` Akihiko Odaki
  2025-10-19 20:23       ` Dmitry Osipenko
@ 2025-10-22  3:30       ` Dmitry Osipenko
  2025-10-23  5:50         ` Akihiko Odaki
  1 sibling, 1 reply; 14+ messages in thread
From: Dmitry Osipenko @ 2025-10-22  3:30 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 10/17/25 03:40, Akihiko Odaki wrote:
> On 2025/10/17 8:43, Akihiko Odaki wrote:
>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>> -        /* Wait for one thread to report a quiescent state and try
>>>> again.
>>>> +        /*
>>>> +         * Sleep for a while and try again.
>>>>            * Release rcu_registry_lock, so rcu_(un)register_thread()
>>>> doesn't
>>>>            * wait too much time.
>>>>            *
>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>            * rcu_registry_lock is released.
>>>>            */
>>>>           qemu_mutex_unlock(&rcu_registry_lock);
>>>> -        qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +        if (forced) {
>>>> +            qemu_event_wait(&rcu_gp_event);
>>>> +
>>>> +            /*
>>>> +             * We want to be notified of changes made to
>>>> rcu_gp_ongoing
>>>> +             * while we walk the list.
>>>> +             */
>>>> +            qemu_event_reset(&rcu_gp_event);
>>>> +        } else {
>>>> +            g_usleep(10000);
>>>> +            sleeps++;
>>>
>>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>>> with unmapping of virtio-gpu blobs.
>>>
>>> Am I understanding correctly that potentially we will be hitting this
>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>> entirely.
>>>
>>> [1]
>>> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-
>>> alex.bennee@linaro.org/
>>
>> That is right, but "avoiding stalls entirely" also causes use-after-free.
>>
>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>> memory map until force_rcu is triggered. So, without force_rcu, the
>> following pseudo-code on a guest will result in use-after-free:
>>
>> address = blob_map(resource_id);
>> blob_unmap(resource_id);
>>
>> for (i = 0; i < some_big_number; i++)
>>    *(uint8_t *)address = 0;
>>
>> *(uint8_t *)address will dereference the blob until force_rcu is
>> triggered, so finalizing MemoryRegion before force_rcu results in use-
>> after-free.
>>
>> The best option to eliminate the delay entirely I have in mind is to
>> call drain_call_rcu(), but I'm not for such a change (for now).
>> drain_call_rcu() eliminates the delay if the FlatView protected by RCU
>> is the only referrer of the MemoryRegion, but that is not guaranteed.
>>
>> Performance should not be a concern anyway in this situation. The
>> guest should not waste CPU time by polling in the first place if you
>> really care performance; since it's a para-virtualized device and not
>> a real hardware, CPU time may be shared between the guest and the
>> device, and thus polling on the guest has an inherent risk of slowing
>> down the device. For performance-sensitive workloads, the guest should:
>>
>> - avoid polling and
>> - accumulate commands instead of waiting for each
>>
>> The delay will be less problematic if the guest does so, and I think
>> at least Linux does avoid polling.
>>
>> That said, stalling the guest forever in this situation is "wrong" (!=
>> "bad performance"). I wrote this patch to guarantee forward progress,
>> which is mandatory for semantic correctness.
>>
>> Perhaps drain_call_rcu() may make sense also in other, performance-
>> sensitive scenarios, but it should be added after benchmark or we will
>> have a immature optimization.
> 
> I first thought just adding drain_call_rcu() would work but apparently
> it is not that simple. Adding drain_call_rcu() has a few problems:
> 
> - It drops the BQL, which should be avoided. Problems caused by
> run_on_cpu(), which drops the BQL, was discussed on the list for a few
> times and drain_call_rcu() may also suffer from them.
> 
> - It is less effective if the RCU thread enters g_usleep() before
> drain_call_rcu() is called.
> 
> - It slows down readers due to the nature of drain_call_rcu().
> 
> So, if you know some workload that may suffer from the delay, it may be
> a good idea to try them with the patches from Alex first, and then think
> of a clean solution if it improves performance.
What's not clear to me is whether this use-after-free problem affects
only TCG or KVM too.
Can we unmap the blob immediately using Alex's method when using KVM?
It would be great if we could optimize unmapping for KVM. TCG performance
is a much lesser concern.
-- 
Best regards,
Dmitry
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-22  3:30       ` Dmitry Osipenko
@ 2025-10-23  5:50         ` Akihiko Odaki
  2025-10-23  9:50           ` Dmitry Osipenko
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2025-10-23  5:50 UTC (permalink / raw)
  To: Dmitry Osipenko, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 2025/10/22 12:30, Dmitry Osipenko wrote:
> On 10/17/25 03:40, Akihiko Odaki wrote:
>> On 2025/10/17 8:43, Akihiko Odaki wrote:
>>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>>> -        /* Wait for one thread to report a quiescent state and try
>>>>> again.
>>>>> +        /*
>>>>> +         * Sleep for a while and try again.
>>>>>             * Release rcu_registry_lock, so rcu_(un)register_thread()
>>>>> doesn't
>>>>>             * wait too much time.
>>>>>             *
>>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>>             * rcu_registry_lock is released.
>>>>>             */
>>>>>            qemu_mutex_unlock(&rcu_registry_lock);
>>>>> -        qemu_event_wait(&rcu_gp_event);
>>>>> +
>>>>> +        if (forced) {
>>>>> +            qemu_event_wait(&rcu_gp_event);
>>>>> +
>>>>> +            /*
>>>>> +             * We want to be notified of changes made to
>>>>> rcu_gp_ongoing
>>>>> +             * while we walk the list.
>>>>> +             */
>>>>> +            qemu_event_reset(&rcu_gp_event);
>>>>> +        } else {
>>>>> +            g_usleep(10000);
>>>>> +            sleeps++;
>>>>
>>>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>>>> with unmapping of virtio-gpu blobs.
>>>>
>>>> Am I understanding correctly that potentially we will be hitting this
>>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>>> entirely.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-
>>>> alex.bennee@linaro.org/
>>>
>>> That is right, but "avoiding stalls entirely" also causes use-after-free.
>>>
>>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>>> memory map until force_rcu is triggered. So, without force_rcu, the
>>> following pseudo-code on a guest will result in use-after-free:
>>>
>>> address = blob_map(resource_id);
>>> blob_unmap(resource_id);
>>>
>>> for (i = 0; i < some_big_number; i++)
>>>     *(uint8_t *)address = 0;
>>>
>>> *(uint8_t *)address will dereference the blob until force_rcu is
>>> triggered, so finalizing MemoryRegion before force_rcu results in use-
>>> after-free.
>>>
>>> The best option to eliminate the delay entirely I have in mind is to
>>> call drain_call_rcu(), but I'm not for such a change (for now).
>>> drain_call_rcu() eliminates the delay if the FlatView protected by RCU
>>> is the only referrer of the MemoryRegion, but that is not guaranteed.
>>>
>>> Performance should not be a concern anyway in this situation. The
>>> guest should not waste CPU time by polling in the first place if you
>>> really care performance; since it's a para-virtualized device and not
>>> a real hardware, CPU time may be shared between the guest and the
>>> device, and thus polling on the guest has an inherent risk of slowing
>>> down the device. For performance-sensitive workloads, the guest should:
>>>
>>> - avoid polling and
>>> - accumulate commands instead of waiting for each
>>>
>>> The delay will be less problematic if the guest does so, and I think
>>> at least Linux does avoid polling.
>>>
>>> That said, stalling the guest forever in this situation is "wrong" (!=
>>> "bad performance"). I wrote this patch to guarantee forward progress,
>>> which is mandatory for semantic correctness.
>>>
>>> Perhaps drain_call_rcu() may make sense also in other, performance-
>>> sensitive scenarios, but it should be added after benchmark or we will
>>> have a immature optimization.
>>
>> I first thought just adding drain_call_rcu() would work but apparently
>> it is not that simple. Adding drain_call_rcu() has a few problems:
>>
>> - It drops the BQL, which should be avoided. Problems caused by
>> run_on_cpu(), which drops the BQL, was discussed on the list for a few
>> times and drain_call_rcu() may also suffer from them.
>>
>> - It is less effective if the RCU thread enters g_usleep() before
>> drain_call_rcu() is called.
>>
>> - It slows down readers due to the nature of drain_call_rcu().
>>
>> So, if you know some workload that may suffer from the delay, it may be
>> a good idea to try them with the patches from Alex first, and then think
>> of a clean solution if it improves performance.
> 
> What's not clear to me is whether this use-after-free problem affects
> only TCG or KVM too.
> 
> Can we unmap blob immediately using Alex' method when using KVM? Would
> be great if we could optimize unmapping for KVM. TCG performance is a
> much lesser concern.
Unfortunately no; the use-after-free can occur whichever accelerator is
used. It was discussed thoroughly in the past. In short, the patch
breaks reference counting (which also breaks RCU, as it is indirectly
tracked with reference counting), so a use-after-free can be triggered
with a DMA operation.
The best summary I can find is the following:
https://lore.kernel.org/qemu-devel/6c4b9d6a-d86e-44cf-be48-f919b81eac15@rsg.ci.i.u-tokyo.ac.jp/
The breakage of reference counting can be checked with the code in the 
following email:
https://lore.kernel.org/qemu-devel/e9285843-0487-4754-a348-34f1a852b24c@daynix.com/
In another thread, I suggested adding an RCU API to request the force
quiescent state for virtio-gpu as a solution that works on both TCG and
KVM; I think it is also the easiest solution for KVM that avoids a
regression. The thread can be found at:
https://lore.kernel.org/qemu-devel/f5d55625-10e0-496f-9b3e-2010fe0c741f@rsg.ci.i.u-tokyo.ac.jp/
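To make the idea concrete, a purely hypothetical sketch (no such API
exists yet; the names are invented for illustration) could look like
this in util/rcu.c:

    static int rcu_force_requested;

    void rcu_request_force_quiescent_state(void)
    {
        qatomic_set(&rcu_force_requested, 1);
        /* wake call_rcu_thread() in case it is idle waiting for callbacks */
        qemu_event_set(&rcu_call_ready_event);
    }

    /*
     * wait_for_readers() would then also test (and clear) the flag in the
     * condition that switches to the force quiescent state, e.g.:
     *
     *     if (!forced &&
     *         (qatomic_xchg(&rcu_force_requested, 0) ||
     *          qatomic_read(&rcu_call_count) >= RCU_CALL_MIN_SIZE ||
     *          sleeps >= 5 || qatomic_read(&in_drain_call_rcu))) {
     */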
I think it is straightforward enough, but I'm worried that it may not be
implemented before the feature freeze. I am now just a hobbyist who has
little time to spend on QEMU, and RCU and the memory region management
are too important to change in haste.
Regards,
Akihiko Odaki
- * Re: [PATCH] rcu: Unify force quiescent state
  2025-10-23  5:50         ` Akihiko Odaki
@ 2025-10-23  9:50           ` Dmitry Osipenko
  0 siblings, 0 replies; 14+ messages in thread
From: Dmitry Osipenko @ 2025-10-23  9:50 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel, Alex Bennée
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 10/23/25 08:50, Akihiko Odaki wrote:
> On 2025/10/22 12:30, Dmitry Osipenko wrote:
>> On 10/17/25 03:40, Akihiko Odaki wrote:
>>> On 2025/10/17 8:43, Akihiko Odaki wrote:
>>>> On 2025/10/17 4:33, Dmitry Osipenko wrote:
>>>>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>>>>> -        /* Wait for one thread to report a quiescent state and try
>>>>>> again.
>>>>>> +        /*
>>>>>> +         * Sleep for a while and try again.
>>>>>>             * Release rcu_registry_lock, so rcu_(un)register_thread()
>>>>>> doesn't
>>>>>>             * wait too much time.
>>>>>>             *
>>>>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>>>>             * rcu_registry_lock is released.
>>>>>>             */
>>>>>>            qemu_mutex_unlock(&rcu_registry_lock);
>>>>>> -        qemu_event_wait(&rcu_gp_event);
>>>>>> +
>>>>>> +        if (forced) {
>>>>>> +            qemu_event_wait(&rcu_gp_event);
>>>>>> +
>>>>>> +            /*
>>>>>> +             * We want to be notified of changes made to
>>>>>> rcu_gp_ongoing
>>>>>> +             * while we walk the list.
>>>>>> +             */
>>>>>> +            qemu_event_reset(&rcu_gp_event);
>>>>>> +        } else {
>>>>>> +            g_usleep(10000);
>>>>>> +            sleeps++;
>>>>>
>>>>> Thanks a lot for this RCU improvement. It indeed removes the hard
>>>>> stalls
>>>>> with unmapping of virtio-gpu blobs.
>>>>>
>>>>> Am I understanding correctly that potentially we will be hitting this
>>>>> g_usleep(10000) and stall virtio-gpu for the first ~10ms? I.e. the
>>>>> MemoryRegion patches from Alex [1] are still needed to avoid stalls
>>>>> entirely.
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-
>>>>> alex.bennee@linaro.org/
>>>>
>>>> That is right, but "avoiding stalls entirely" also causes use-after-
>>>> free.
>>>>
>>>> The problem with virtio-gpu on TCG is that TCG keeps using the old
>>>> memory map until force_rcu is triggered. So, without force_rcu, the
>>>> following pseudo-code on a guest will result in use-after-free:
>>>>
>>>> address = blob_map(resource_id);
>>>> blob_unmap(resource_id);
>>>>
>>>> for (i = 0; i < some_big_number; i++)
>>>>     *(uint8_t *)address = 0;
>>>>
>>>> *(uint8_t *)address will dereference the blob until force_rcu is
>>>> triggered, so finalizing MemoryRegion before force_rcu results in use-
>>>> after-free.
>>>>
>>>> The best option to eliminate the delay entirely I have in mind is to
>>>> call drain_call_rcu(), but I'm not for such a change (for now).
>>>> drain_call_rcu() eliminates the delay if the FlatView protected by RCU
>>>> is the only referrer of the MemoryRegion, but that is not guaranteed.
>>>>
>>>> Performance should not be a concern anyway in this situation. The
>>>> guest should not waste CPU time by polling in the first place if you
>>>> really care performance; since it's a para-virtualized device and not
>>>> a real hardware, CPU time may be shared between the guest and the
>>>> device, and thus polling on the guest has an inherent risk of slowing
>>>> down the device. For performance-sensitive workloads, the guest should:
>>>>
>>>> - avoid polling and
>>>> - accumulate commands instead of waiting for each
>>>>
>>>> The delay will be less problematic if the guest does so, and I think
>>>> at least Linux does avoid polling.
>>>>
>>>> That said, stalling the guest forever in this situation is "wrong" (!=
>>>> "bad performance"). I wrote this patch to guarantee forward progress,
>>>> which is mandatory for semantic correctness.
>>>>
>>>> Perhaps drain_call_rcu() may make sense also in other, performance-
>>>> sensitive scenarios, but it should be added after benchmark or we will
>>>> have a immature optimization.
>>>
>>> I first thought just adding drain_call_rcu() would work but apparently
>>> it is not that simple. Adding drain_call_rcu() has a few problems:
>>>
>>> - It drops the BQL, which should be avoided. Problems caused by
>>> run_on_cpu(), which drops the BQL, was discussed on the list for a few
>>> times and drain_call_rcu() may also suffer from them.
>>>
>>> - It is less effective if the RCU thread enters g_usleep() before
>>> drain_call_rcu() is called.
>>>
>>> - It slows down readers due to the nature of drain_call_rcu().
>>>
>>> So, if you know some workload that may suffer from the delay, it may be
>>> a good idea to try them with the patches from Alex first, and then think
>>> of a clean solution if it improves performance.
>>
>> What's not clear to me is whether this use-after-free problem affects
>> only TCG or KVM too.
>>
>> Can we unmap the blob immediately using Alex's method when using KVM?
>> Would be great if we could optimize unmapping for KVM. TCG performance
>> is a much lesser concern.
> 
> Unfortunately no, the use-after-free can occur whichever accelerator is
> used. It was discussed thoroughly in the past. In short, the patch
> breaks reference counting (which also breaks RCU, as RCU is indirectly
> tracked with reference counting), so use-after-free can be triggered
> with a DMA operation.
> 
> The best summary I can find is the following:
> https://lore.kernel.org/qemu-devel/6c4b9d6a-d86e-44cf-be48-f919b81eac15@rsg.ci.i.u-tokyo.ac.jp/
> 
> The breakage of reference counting can be checked with the code in the
> following email:
> https://lore.kernel.org/qemu-devel/e9285843-0487-4754-a348-34f1a852b24c@daynix.com/
> 
> In another thread, I suggested adding an RCU API to request the force
> quiescent state for virtio-gpu as a solution that works on both TCG and
> KVM; I think it is also the easiest solution for KVM that avoids a
> regression. The thread can be found at:
> https://lore.kernel.org/qemu-devel/f5d55625-10e0-496f-9b3e-2010fe0c741f@rsg.ci.i.u-tokyo.ac.jp/
> 
> I think it is straightforward enough, but I'm worried that it may not be
> implemented before the feature freeze. I am now just a hobbyist who has
> little time to spend on QEMU, and RCU and the memory region management
> are too important to change in haste.
Thanks. The RCU polling improvement is already a very good start and
makes native context usable with QEMU.
You're likely the best person to work on the RCU API addition. I may
look into it too.
-- 
Best regards,
Dmitry
 
 
 
 
* Re: [PATCH] rcu: Unify force quiescent state
  2025-10-16 19:33 ` Dmitry Osipenko
  2025-10-16 23:43   ` Akihiko Odaki
@ 2025-10-20  6:38   ` Paolo Bonzini
  2025-10-21  4:14     ` Akihiko Odaki
  1 sibling, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2025-10-20  6:38 UTC (permalink / raw)
  To: Dmitry Osipenko, Akihiko Odaki, qemu-devel, Alex Bennée
  Cc: David Hildenbrand, Eric Blake, Philippe Mathieu-Daudé,
	Markus Armbruster, Marc-André Lureau, Peter Xu,
	Michael S. Tsirkin
On 10/16/25 21:33, Dmitry Osipenko wrote:
> On 10/16/25 09:34, Akihiko Odaki wrote:
>> -        /* Wait for one thread to report a quiescent state and try again.
>> +        /*
>> +         * Sleep for a while and try again.
>>            * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
>>            * wait too much time.
>>            *
>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>            * rcu_registry_lock is released.
>>            */
>>           qemu_mutex_unlock(&rcu_registry_lock);
>> -        qemu_event_wait(&rcu_gp_event);
>> +
>> +        if (forced) {
>> +            qemu_event_wait(&rcu_gp_event);
>> +
>> +            /*
>> +             * We want to be notified of changes made to rcu_gp_ongoing
>> +             * while we walk the list.
>> +             */
>> +            qemu_event_reset(&rcu_gp_event);
>> +        } else {
>> +            g_usleep(10000);
>> +            sleeps++;
> 
> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
> with unmapping of virtio-gpu blobs.
> 
> Am I understanding correctly that potentially we will be hitting this
> g_usleep(10000) and stall virtio-gpu for the first ~10ms?
Would it help to have some kind of exponential backoff, starting with 
1-3 ms and increasing after the first wait?  Something like 
1.5/3/6/12/12/12 ms would have a similar effect but reduce the wait if 
the vCPU kick is fast enough.
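For illustration, a rough sketch of such a backoff for the non-forced branch, using the 1.5/3/6/12 ms schedule mentioned above; this is not part of the posted patch, and the helper name is made up.

#include <glib.h>

/* Roughly 1.5 ms, 3 ms, 6 ms, then 12 ms for every further attempt. */
static void backoff_sleep(int sleeps)
{
    g_usleep(1500UL << MIN(sleeps, 3));   /* 1500, 3000, 6000, 12000 us */
}

In wait_for_readers() this would take the place of the fixed g_usleep(10000), with sleeps still incremented on each pass.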
Paolo
> I.e. the
> MemoryRegion patches from Alex [1] are still needed to avoid stalls
> entirely.
> 
> [1]
> https://lore.kernel.org/qemu-devel/20251014111234.3190346-6-alex.bennee@linaro.org/
> 
* Re: [PATCH] rcu: Unify force quiescent state
  2025-10-20  6:38   ` Paolo Bonzini
@ 2025-10-21  4:14     ` Akihiko Odaki
  0 siblings, 0 replies; 14+ messages in thread
From: Akihiko Odaki @ 2025-10-21  4:14 UTC (permalink / raw)
  To: Paolo Bonzini, Dmitry Osipenko, qemu-devel, Alex Bennée
  Cc: David Hildenbrand, Eric Blake, Philippe Mathieu-Daudé,
	Markus Armbruster, Marc-André Lureau, Peter Xu,
	Michael S. Tsirkin
On 2025/10/20 15:38, Paolo Bonzini wrote:
> On 10/16/25 21:33, Dmitry Osipenko wrote:
>> On 10/16/25 09:34, Akihiko Odaki wrote:
>>> -        /* Wait for one thread to report a quiescent state and try again.
>>> +        /*
>>> +         * Sleep for a while and try again.
>>>            * Release rcu_registry_lock, so rcu_(un)register_thread() doesn't
>>>            * wait too much time.
>>>            *
>>> @@ -133,7 +150,20 @@ static void wait_for_readers(void)
>>>            * rcu_registry_lock is released.
>>>            */
>>>           qemu_mutex_unlock(&rcu_registry_lock);
>>> -        qemu_event_wait(&rcu_gp_event);
>>> +
>>> +        if (forced) {
>>> +            qemu_event_wait(&rcu_gp_event);
>>> +
>>> +            /*
>>> +             * We want to be notified of changes made to rcu_gp_ongoing
>>> +             * while we walk the list.
>>> +             */
>>> +            qemu_event_reset(&rcu_gp_event);
>>> +        } else {
>>> +            g_usleep(10000);
>>> +            sleeps++;
>>
>> Thanks a lot for this RCU improvement. It indeed removes the hard stalls
>> with unmapping of virtio-gpu blobs.
>>
>> Am I understanding correctly that potentially we will be hitting this
>> g_usleep(10000) and stall virtio-gpu for the first ~10ms?
> 
> Would it help to have some kind of exponential backoff, starting with 
> 1-3 ms and increasing after the first wait?  Something like 
> 1.5/3/6/12/12/12 ms would have a similar effect but reduce the wait if 
> the vCPU kick is fast enough.
The vCPU kick is needed only for TCG and is triggered by force_rcu when
entering the force quiescent state. The state is entered only after all
polling checks have finished, so shortening only the early ones may not
be effective for TCG.
It also increases the overhead for users other than virtio-gpu. I
suppose the first 10 ms sleep is sufficient in most cases, which
minimizes context switching due to sleep/wakeup. However, a few of the
early sleeps and their wakeups become more likely to be triggered if
they are shortened.
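Schematically, the behaviour described above -- cheap polling rounds first, with force_rcu and the event-based wait only once polling gives up -- looks roughly like the sketch below. The stand-in helpers and the switch condition are hypothetical; the reader-list walk and the real trigger conditions are in the patch itself.

#include <glib.h>
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU internals, for illustration only. */
typedef struct QemuEvent QemuEvent;
extern QemuEvent rcu_gp_event;
extern void qemu_event_wait(QemuEvent *ev);
extern void qemu_event_reset(QemuEvent *ev);
extern bool all_readers_quiescent(void);  /* stand-in for the list walk */
extern bool expedite_requested(void);     /* e.g. drain_call_rcu() pending */
extern void notify_force_rcu(void);       /* runs the force_rcu notifiers */

static void wait_for_readers_sketch(void)
{
    int sleeps = 0;
    bool forced = false;

    while (!all_readers_quiescent()) {
        if (!forced && (expedite_requested() || sleeps >= 5 /* illustrative */)) {
            /*
             * Normal polling did not end the grace period: enter the
             * force quiescent state.  Only from this point do readers
             * pay any extra cost, and TCG vCPUs get kicked.
             */
            forced = true;
            notify_force_rcu();
        }

        if (forced) {
            /* Expedited path: wait for readers to report explicitly. */
            qemu_event_wait(&rcu_gp_event);
            qemu_event_reset(&rcu_gp_event);
        } else {
            /* Normal state: fixed 10 ms poll, no read-side overhead. */
            g_usleep(10000);
            sleeps++;
        }
    }
}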
Now I'm thinking of adding an RCU API to request the force quiescent 
state for virtio-gpu. It will be able to utilize the force quiescent 
state (particularly for TCG) and is zero cost unless virtio-gpu is used.
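A rough sketch of what such an API could look like; the name and comments are hypothetical, since nothing has been posted yet.

/*
 * Hypothetical addition to include/qemu/rcu.h -- not an existing API.
 * Ask the RCU thread to skip the normal polling sleeps for the current
 * (or next) grace period and go straight to the force quiescent state,
 * which also triggers force_rcu for TCG vCPUs.
 */
void rcu_request_force_quiescent_state(void);

A caller such as the virtio-gpu unmap path could invoke it right after queuing its reclamation callback with call_rcu(), so the cost stays at zero for code that never calls it.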
Regards,
Akihiko Odaki
 
 
* Re: [PATCH] rcu: Unify force quiescent state
  2025-10-16  6:34 [PATCH] rcu: Unify force quiescent state Akihiko Odaki
  2025-10-16 12:59 ` Paolo Bonzini
  2025-10-16 19:33 ` Dmitry Osipenko
@ 2025-10-19 20:25 ` Dmitry Osipenko
  2 siblings, 0 replies; 14+ messages in thread
From: Dmitry Osipenko @ 2025-10-19 20:25 UTC (permalink / raw)
  To: Akihiko Odaki, qemu-devel
  Cc: Paolo Bonzini, David Hildenbrand, Eric Blake,
	Philippe Mathieu-Daudé, Markus Armbruster,
	Marc-André Lureau, Peter Xu, Michael S. Tsirkin
On 10/16/25 09:34, Akihiko Odaki wrote:
> Borrow the concept of force quiescent state from Linux to ensure readers
> remain fast during normal operation and to avoid stalls.
> 
> Background
> ==========
> 
> The previous implementation had four steps to begin reclamation.
> 
> 1. call_rcu_thread() would wait for the first callback.
> 
> 2. call_rcu_thread() would periodically poll until a decent number of
>    callbacks piled up or it timed out.
> 
> 3. synchronize_rcu() would start a grace period (GP).
> 
> 4. wait_for_readers() would wait for the GP to end. It would also
>    trigger the force_rcu notifier to break busy loops in a read-side
>    critical section if drain_call_rcu() had been called.
> 
> Problem
> =======
> 
> The separation of waiting logic across these steps led to suboptimal
> behavior:
> 
> The GP was delayed until call_rcu_thread() stops polling.
> 
> force_rcu was not consistently triggered when call_rcu_thread() detected
> a high number of pending callbacks or a timeout. This inconsistency
> sometimes led to stalls, as reported in a virtio-gpu issue where memory
> unmapping was blocked[1].
> 
> wait_for_readers() imposed unnecessary overhead in non-urgent cases by
> unconditionally executing qatomic_set(&index->waiting, true) and
> qemu_event_reset(&rcu_gp_event), which are necessary only for expedited
> synchronization.
> 
> Solution
> ========
> 
> Move the polling in call_rcu_thread() to wait_for_readers() to prevent
> the delay of the GP. Additionally, reorganize wait_for_readers() to
> distinguish between two states:
> 
> Normal State: it relies exclusively on periodic polling to detect
> the end of the GP and maintains the read-side fast path.
> 
> Force Quiescent State: Whenever expediting synchronization, it always
> triggers force_rcu and executes both qatomic_set(&index->waiting, true)
> and qemu_event_reset(&rcu_gp_event). This avoids stalls while confining
> the read-side overhead to this state.
> 
> This unified approach, inspired by the Linux RCU, ensures consistent and
> efficient RCU grace period handling and confirms resolution of the
> virtio-gpu issue.
> 
> [1] https://lore.kernel.org/qemu-devel/20251014111234.3190346-9-alex.bennee@linaro.org/
> 
> Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
> ---
>  util/rcu.c | 79 ++++++++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 51 insertions(+), 28 deletions(-)
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> # drm native
contexts + venus
-- 
Best regards,
Dmitry
Thread overview: 14+ messages
2025-10-16  6:34 [PATCH] rcu: Unify force quiescent state Akihiko Odaki
2025-10-16 12:59 ` Paolo Bonzini
2025-10-16 19:33 ` Dmitry Osipenko
2025-10-16 23:43   ` Akihiko Odaki
2025-10-17  0:40     ` Akihiko Odaki
2025-10-19 20:23       ` Dmitry Osipenko
2025-10-20  1:14         ` Akihiko Odaki
2025-10-20 20:41           ` Dmitry Osipenko
2025-10-22  3:30       ` Dmitry Osipenko
2025-10-23  5:50         ` Akihiko Odaki
2025-10-23  9:50           ` Dmitry Osipenko
2025-10-20  6:38   ` Paolo Bonzini
2025-10-21  4:14     ` Akihiko Odaki
2025-10-19 20:25 ` Dmitry Osipenko