* [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool @ 2014-11-27 10:27 Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini This series removes the need for locking the coroutine pool. During collection of perf data it showed up that serious time is spent locking the coroutine pool during heavy I/O from the vServer. This also needs serious benchmarking. Using Kevin's qemu-img bench there is a 3-4% speedup. Inside vServers it seems to depend heavily on the individual CPU and SMP configuration. I just wanted to post these patches here to discuss if this is an idea that we should look further into. KNOWN ISSUES: - the I/O thread qobject is not destroyed on qemu termination. This seems to be the case for all qobjects?! Therefore the per-thread coroutine pool is not cleaned up. - without ioeventfd there is effectively no longer a coroutine pool since the coroutines are created in a different thread than the one they are freed in. Peter Lieven (3): Revert "coroutine: make pool size dynamic" block/block-backend.c: remove coroutine pool reservation qemu-coroutine: use a ring per thread for the pool block/block-backend.c | 6 ---- include/block/coroutine.h | 13 +------- iothread.c | 3 ++ qemu-coroutine.c | 78 +++++++++++++++------------------------------ 4 files changed, 30 insertions(+), 70 deletions(-) -- 1.7.9.5 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" 2014-11-27 10:27 [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven @ 2014-11-27 10:27 ` Peter Lieven 2014-11-28 12:42 ` Stefan Hajnoczi 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 2/3] block/block-backend.c: remove coroutine pool reservation Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. --- include/block/coroutine.h | 11 ----------- qemu-coroutine.c | 26 +++----------------------- 2 files changed, 3 insertions(+), 34 deletions(-) diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 793df0e..1d9343d 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -215,15 +215,4 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type, * Note that this function clobbers the handlers for the file descriptor. */ void coroutine_fn yield_until_fd_readable(int fd); - -/** - * Add or subtract from the coroutine pool size - * - * The coroutine implementation keeps a pool of coroutines to be reused by - * qemu_coroutine_create(). This makes coroutine creation cheap. Heavy - * coroutine users should call this to reserve pool space. Call it again with - * a negative number to release pool space. - */ -void qemu_coroutine_adjust_pool_size(int n); - #endif /* QEMU_COROUTINE_H */ diff --git a/qemu-coroutine.c b/qemu-coroutine.c index bd574aa..4708521 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -19,14 +19,14 @@ #include "block/coroutine_int.h" enum { - POOL_DEFAULT_SIZE = 64, + /* Maximum free pool size prevents holding too many freed coroutines */ + POOL_MAX_SIZE = 64, }; /** Free list to speed up creation */ static QemuMutex pool_lock; static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool); static unsigned int pool_size; -static unsigned int pool_max_size = POOL_DEFAULT_SIZE; Coroutine *qemu_coroutine_create(CoroutineEntry *entry) { @@ -55,7 +55,7 @@ static void coroutine_delete(Coroutine *co) { if (CONFIG_COROUTINE_POOL) { qemu_mutex_lock(&pool_lock); - if (pool_size < pool_max_size) { + if (pool_size < POOL_MAX_SIZE) { QSLIST_INSERT_HEAD(&pool, co, pool_next); co->caller = NULL; pool_size++; @@ -137,23 +137,3 @@ void coroutine_fn qemu_coroutine_yield(void) self->caller = NULL; coroutine_swap(self, to); } - -void qemu_coroutine_adjust_pool_size(int n) -{ - qemu_mutex_lock(&pool_lock); - - pool_max_size += n; - - /* Callers should never take away more than they added */ - assert(pool_max_size >= POOL_DEFAULT_SIZE); - - /* Trim oversized pool down to new max */ - while (pool_size > pool_max_size) { - Coroutine *co = QSLIST_FIRST(&pool); - QSLIST_REMOVE_HEAD(&pool, pool_next); - pool_size--; - qemu_coroutine_delete(co); - } - - qemu_mutex_unlock(&pool_lock); -} -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven @ 2014-11-28 12:42 ` Stefan Hajnoczi 2014-11-28 12:45 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Stefan Hajnoczi @ 2014-11-28 12:42 UTC (permalink / raw) To: Peter Lieven Cc: kwolf, famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini, mreitz [-- Attachment #1: Type: text/plain, Size: 306 bytes --] On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: > This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. Justification? Won't we hit the same problem as the global pool: either pool size is too large or too small. If you exceed the pool size performance drops dramatically. Stefan [-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" 2014-11-28 12:42 ` Stefan Hajnoczi @ 2014-11-28 12:45 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:45 UTC (permalink / raw) To: Stefan Hajnoczi Cc: kwolf, famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini, mreitz Am 28.11.2014 um 13:42 schrieb Stefan Hajnoczi: > On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: >> This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. > Justification? > > Won't we hit the same problem as the global pool: either pool size is > too large or too small. If you exceed the pool size performance drops > dramatically. Each thread has its own pool. But you can ignore this series I think. The idea Paolo came up with seems to be better. With my last improvement it seems to have equal performance without the drawback of different threads creating and destroying the coroutines. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Qemu-devel] [RFC PATCH 2/3] block/block-backend.c: remove coroutine pool reservation 2014-11-27 10:27 [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven @ 2014-11-27 10:27 ` Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini Signed-off-by: Peter Lieven <pl@kamp.de> --- block/block-backend.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index d0692b1..cf36ffb 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -15,9 +15,6 @@ #include "sysemu/blockdev.h" #include "qapi-event.h" -/* Number of coroutines to reserve per attached device model */ -#define COROUTINE_POOL_RESERVATION 64 - struct BlockBackend { char *name; int refcnt; @@ -261,8 +258,6 @@ int blk_attach_dev(BlockBackend *blk, void *dev) blk->dev = dev; bdrv_iostatus_reset(blk->bs); - /* We're expecting I/O from the device so bump up coroutine pool size */ - qemu_coroutine_adjust_pool_size(COROUTINE_POOL_RESERVATION); return 0; } @@ -290,7 +285,6 @@ void blk_detach_dev(BlockBackend *blk, void *dev) blk->dev_ops = NULL; blk->dev_opaque = NULL; bdrv_set_guest_block_size(blk->bs, 512); - qemu_coroutine_adjust_pool_size(-COROUTINE_POOL_RESERVATION); blk_unref(blk); } -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 10:27 [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 2/3] block/block-backend.c: remove coroutine pool reservation Peter Lieven @ 2014-11-27 10:27 ` Peter Lieven 2014-11-27 16:40 ` Paolo Bonzini 2014-11-28 12:40 ` Stefan Hajnoczi 2 siblings, 2 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini This patch creates a ring structure for the coroutine pool instead of a linked list. The implementation of the list has the issue that it always throws aways the latest coroutines instead of the oldest ones. This is a drawback since the latest used coroutines are more likely cached than old ones. Furthermore this patch creates a coroutine pool per thread to remove the need for locking. Signed-off-by: Peter Lieven <pl@kamp.de> --- include/block/coroutine.h | 2 +- iothread.c | 3 +++ qemu-coroutine.c | 54 ++++++++++++++++++++------------------------- 3 files changed, 28 insertions(+), 31 deletions(-) diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 1d9343d..f0b3a2d 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -96,7 +96,7 @@ Coroutine *coroutine_fn qemu_coroutine_self(void); */ bool qemu_in_coroutine(void); - +void coroutine_pool_cleanup(void); /** * CoQueues are a mechanism to queue coroutines in order to continue executing diff --git a/iothread.c b/iothread.c index 342a23f..b53529b 100644 --- a/iothread.c +++ b/iothread.c @@ -15,6 +15,7 @@ #include "qom/object_interfaces.h" #include "qemu/module.h" #include "block/aio.h" +#include "block/coroutine.h" #include "sysemu/iothread.h" #include "qmp-commands.h" #include "qemu/error-report.h" @@ -47,6 +48,8 @@ static void *iothread_run(void *opaque) } aio_context_release(iothread->ctx); } + + coroutine_pool_cleanup(); return NULL; } diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 4708521..1900155 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -24,22 +24,22 @@ enum { }; /** Free list to speed up creation */ -static QemuMutex pool_lock; -static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool); -static unsigned int pool_size; +static __thread struct CoRoutinePool { + Coroutine *ptrs[POOL_MAX_SIZE]; + unsigned int size; + unsigned int nextfree; +} CoPool; Coroutine *qemu_coroutine_create(CoroutineEntry *entry) { Coroutine *co = NULL; - if (CONFIG_COROUTINE_POOL) { - qemu_mutex_lock(&pool_lock); - co = QSLIST_FIRST(&pool); - if (co) { - QSLIST_REMOVE_HEAD(&pool, pool_next); - pool_size--; + if (CoPool.size) { + co = CoPool.ptrs[CoPool.nextfree]; + CoPool.size--; + CoPool.nextfree--; + CoPool.nextfree &= (POOL_MAX_SIZE - 1); } - qemu_mutex_unlock(&pool_lock); } if (!co) { @@ -54,36 +54,30 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) static void coroutine_delete(Coroutine *co) { if (CONFIG_COROUTINE_POOL) { - qemu_mutex_lock(&pool_lock); - if (pool_size < POOL_MAX_SIZE) { - QSLIST_INSERT_HEAD(&pool, co, pool_next); - co->caller = NULL; - pool_size++; - qemu_mutex_unlock(&pool_lock); - return; + CoPool.nextfree++; + CoPool.nextfree &= (POOL_MAX_SIZE - 1); + if (CoPool.size == POOL_MAX_SIZE) { + qemu_coroutine_delete(CoPool.ptrs[CoPool.nextfree]); + } else 
{ + CoPool.size++; } - qemu_mutex_unlock(&pool_lock); + co->caller = NULL; + CoPool.ptrs[CoPool.nextfree] = co; + } else { + qemu_coroutine_delete(co); } - - qemu_coroutine_delete(co); } static void __attribute__((constructor)) coroutine_pool_init(void) { - qemu_mutex_init(&pool_lock); } -static void __attribute__((destructor)) coroutine_pool_cleanup(void) +void __attribute__((destructor)) coroutine_pool_cleanup(void) { - Coroutine *co; - Coroutine *tmp; - - QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) { - QSLIST_REMOVE_HEAD(&pool, pool_next); - qemu_coroutine_delete(co); + printf("coroutine_pool_cleanup %lx pool %p\n", pthread_self(), &CoPool); + while (CoPool.size) { + qemu_coroutine_delete(qemu_coroutine_create(NULL)); } - - qemu_mutex_destroy(&pool_lock); } static void coroutine_swap(Coroutine *from, Coroutine *to) -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven @ 2014-11-27 16:40 ` Paolo Bonzini 2014-11-28 8:13 ` Peter Lieven 2014-11-28 12:40 ` Stefan Hajnoczi 1 sibling, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-27 16:40 UTC (permalink / raw) To: Peter Lieven, qemu-devel Cc: kwolf, famz, benoit, ming.lei, armbru, mreitz, stefanha On 27/11/2014 11:27, Peter Lieven wrote: > +static __thread struct CoRoutinePool { > + Coroutine *ptrs[POOL_MAX_SIZE]; > + unsigned int size; > + unsigned int nextfree; > +} CoPool; > The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused. Can you check if this is still the case? Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
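To make the failure mode Paolo describes concrete, here is a minimal standalone sketch (illustrative names, not QEMU code) of a purely thread-local pool. When one thread only allocates and another only frees — as with non-ioeventfd devices, where the VCPU thread creates the coroutine and the iothread destroys it — the allocating thread's pool stays empty and nothing is ever reused:

#include <stdlib.h>

#define POOL_MAX 64

struct obj { struct obj *next; };

static __thread struct obj *local_pool;      /* one free list per thread */
static __thread unsigned local_pool_size;

static struct obj *obj_alloc(void)
{
    struct obj *o = local_pool;
    if (o) {                                  /* reuse from this thread's pool */
        local_pool = o->next;
        local_pool_size--;
        return o;
    }
    return calloc(1, sizeof(*o));             /* pool empty: allocate fresh */
}

static void obj_free(struct obj *o)
{
    if (local_pool_size < POOL_MAX) {         /* return to *this* thread's pool */
        o->next = local_pool;
        local_pool = o;
        local_pool_size++;
    } else {
        free(o);
    }
}

/* If thread A only calls obj_alloc() and thread B only calls obj_free(),
 * A never hits its pool (every allocation is a fresh calloc) and B's pool
 * fills up once and then frees everything -- exactly the "coroutines cannot
 * be reused" situation described above. */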
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 16:40 ` Paolo Bonzini @ 2014-11-28 8:13 ` Peter Lieven [not found] ` <54784E55.6060405@redhat.com> 0 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 8:13 UTC (permalink / raw) To: Paolo Bonzini, qemu-devel Cc: kwolf, famz, benoit, ming.lei, armbru, mreitz, stefanha Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: > > On 27/11/2014 11:27, Peter Lieven wrote: >> +static __thread struct CoRoutinePool { >> + Coroutine *ptrs[POOL_MAX_SIZE]; >> + unsigned int size; >> + unsigned int nextfree; >> +} CoPool; >> > The per-thread ring unfortunately didn't work well last time it was > tested. Devices that do not use ioeventfd (not just the slow ones, even > decently performing ones like ahci, nvme or megasas) will create the > coroutine in the VCPU thread, and destroy it in the iothread. The > result is that coroutines cannot be reused. > > Can you check if this is still the case? I already tested at least for IDE and for ioeventfd=off. The coroutine is created in the vCPU thread and destroyed in the I/O thread. I also have a more complicated version which sets a per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads. For those who want to take a look: https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <54784E55.6060405@redhat.com>]
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool [not found] ` <54784E55.6060405@redhat.com> @ 2014-11-28 10:37 ` Peter Lieven 2014-11-28 11:14 ` Paolo Bonzini 0 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 10:37 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 11:28 schrieb Paolo Bonzini: > > On 28/11/2014 09:13, Peter Lieven wrote: >> Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: >>> On 27/11/2014 11:27, Peter Lieven wrote: >>>> +static __thread struct CoRoutinePool { >>>> + Coroutine *ptrs[POOL_MAX_SIZE]; >>>> + unsigned int size; >>>> + unsigned int nextfree; >>>> +} CoPool; >>>> >>> The per-thread ring unfortunately didn't work well last time it was >>> tested. Devices that do not use ioeventfd (not just the slow ones, even >>> decently performing ones like ahci, nvme or megasas) will create the >>> coroutine in the VCPU thread, and destroy it in the iothread. The >>> result is that coroutines cannot be reused. >>> >>> Can you check if this is still the case? >> I already tested at least for IDE and for ioeventfd=off. The coroutine >> is created in the vCPU thread and destroyed in the I/O thread. >> >> I also have a more complicated version which sets a per-thread coroutine pool only >> for dataplane, avoiding the lock for dedicated iothreads. >> >> For those who want to take a look: >> >> https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e > Can you test it against the patch I just sent in Kevin's linux-aio > coroutine thread? I was already doing it ;-) At least with test-coroutine.c... master: Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine plieven/perf_master2: Run operation 40000000 iterations 9.013785 s, 4437K operations/s, 225ns per coroutine plieven/perf_master: Run operation 40000000 iterations 11.072883 s, 3612K operations/s, 276ns per coroutine However, perf_master and perf_master2 seem to have a regression regarding nesting. @Kevin: Could that be the reason why they perform badly in some scenarios? Regarding the bypass that is being discussed: if it is not just a benchmark thing but really necessary for some people's use cases, why not add a new aio mode like "bypass" and use it only then? If the performance is really needed, the user might trade it in for lost features like I/O throttling, filters, etc. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 10:37 ` Peter Lieven @ 2014-11-28 11:14 ` Paolo Bonzini 2014-11-28 11:21 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 11:14 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster > master: > Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine > > paolo: > Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try "coroutine: Use __thread … " together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
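The "coroutine: Use __thread" patch referenced above is not included in this thread; the following sketch only illustrates, with made-up names, the kind of change being measured: replacing a dynamic TLS lookup for the current-coroutine pointer with compiler-level TLS, so the hot path becomes a plain TLS-relative load instead of a pthread_getspecific() call.

#include <pthread.h>
#include <stddef.h>

typedef struct Coroutine Coroutine;

/* Before: each lookup of the current coroutine goes through
 * pthread_getspecific(), which showed up as ~11% in the profile above. */
static pthread_key_t current_key;

static Coroutine *get_current_dynamic(void)
{
    return pthread_getspecific(current_key);
}

/* After: __thread lets the compiler emit a direct TLS access instead of a
 * library call. */
static __thread Coroutine *current;

static Coroutine *get_current_tls(void)
{
    return current;
}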
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:14 ` Paolo Bonzini @ 2014-11-28 11:21 ` Peter Lieven 2014-11-28 11:23 ` Paolo Bonzini 0 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:21 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >> master: >> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >> >> paolo: >> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine > Nice. :) > > Can you please try "coroutine: Use __thread … " together, too? I still > see 11% time spent in pthread_getspecific, and I get ~10% more indeed if > I apply it here (my times are 191/160/145). indeed: Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:21 ` Peter Lieven @ 2014-11-28 11:23 ` Paolo Bonzini 2014-11-28 11:27 ` Peter Lieven 2014-11-28 11:32 ` Peter Lieven 0 siblings, 2 replies; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 11:23 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 12:21, Peter Lieven wrote: > Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>> master: >>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>> >>> paolo: >>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >> Nice. :) >> >> Can you please try "coroutine: Use __thread … " together, too? I still >> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >> I apply it here (my times are 191/160/145). > > indeed: > > Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:23 ` Paolo Bonzini @ 2014-11-28 11:27 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:27 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: > > On 28/11/2014 12:21, Peter Lieven wrote: >> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>> master: >>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>> >>>> paolo: >>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>> Nice. :) >>> >>> Can you please try "coroutine: Use __thread … " together, too? I still >>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>> I apply it here (my times are 191/160/145). >> indeed: >> >> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine > Your perf_master2 uses the ring buffer unconditionally, right? I wonder > if we can use a similar algorithm but with arrays instead of lists... You mean an algorithm similar to perf_master2 or to the current implementation? The ring buffer seems to have a drawback when it comes to excessive coroutine nesting. My idea was that you do not throw away hot coroutines when the pool is full. However, I do not know if this is really a problem since the pool is only full when there is not much I/O. Or is this assumption too simplistic? Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:23 ` Paolo Bonzini 2014-11-28 11:27 ` Peter Lieven @ 2014-11-28 11:32 ` Peter Lieven 2014-11-28 11:46 ` Peter Lieven 2014-11-28 12:21 ` Paolo Bonzini 1 sibling, 2 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:32 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: > > On 28/11/2014 12:21, Peter Lieven wrote: >> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>> master: >>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>> >>>> paolo: >>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>> Nice. :) >>> >>> Can you please try "coroutine: Use __thread … " together, too? I still >>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>> I apply it here (my times are 191/160/145). >> indeed: >> >> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine > Your perf_master2 uses the ring buffer unconditionally, right? I wonder > if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following: diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..c79ee78 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ - pool_size = 0; + atomic_dec(&pool_size); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); I get: Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:32 ` Peter Lieven @ 2014-11-28 11:46 ` Peter Lieven 2014-11-28 12:26 ` Paolo Bonzini 2014-11-28 12:21 ` Paolo Bonzini 1 sibling, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:46 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:32 schrieb Peter Lieven: > Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: >> On 28/11/2014 12:21, Peter Lieven wrote: >>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>>> master: >>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>>> >>>>> paolo: >>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>>> Nice. :) >>>> >>>> Can you please try "coroutine: Use __thread … " together, too? I still >>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>>> I apply it here (my times are 191/160/145). >>> indeed: >>> >>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine >> Your perf_master2 uses the ring buffer unconditionally, right? I wonder >> if we can use a similar algorithm but with arrays instead of lists... > Why do you set pool_size = 0 in the create path? > > When I do the following: > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..c79ee78 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. > */ > - pool_size = 0; > + atomic_dec(&pool_size); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > > I get: > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine > Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:46 ` Peter Lieven @ 2014-11-28 12:26 ` Paolo Bonzini 2014-11-28 12:39 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:26 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 12:46, Peter Lieven wrote: > > I get: > > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine > > Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more > than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:26 ` Paolo Bonzini @ 2014-11-28 12:39 ` Peter Lieven 2014-11-28 12:45 ` Paolo Bonzini 2014-11-28 13:13 ` Peter Lieven 0 siblings, 2 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:39 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: > > On 28/11/2014 12:46, Peter Lieven wrote: >>> I get: >>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >> than one thread in need of a lot of coroutines? > Overall the algorithm is expected to adapt. The N threads contribute to > the global release pool, so the pool will fill up N times faster than if > you had only one thread. There can be some variance, which is why the > maximum size of the pool is twice the threshold (and probably could be > tuned better). > > Benchmarks are needed on real I/O too, of course, especially with high > queue depth. Yes, cool. The atomic operations are a bit tricky at the first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..edea162 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -25,8 +25,9 @@ enum { /** Free list to speed up creation */ static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); -static unsigned int pool_size; +static unsigned int release_pool_size; static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); +static __thread unsigned int alloc_pool_size; /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ static void coroutine_pool_cleanup(void *value); @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) if (CONFIG_COROUTINE_POOL) { co = QSLIST_FIRST(&alloc_pool); if (!co) { - if (pool_size > POOL_BATCH_SIZE) { - /* This is not exact; there could be a little skew between pool_size + if (release_pool_size > POOL_BATCH_SIZE) { + /* This is not exact; there could be a little skew between release_pool_size * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ - pool_size = 0; + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) */ g_private_set(&dummy_key, &dummy_key); } + } else { + alloc_pool_size--; } if (co) { QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) static void coroutine_delete(Coroutine *co) { if (CONFIG_COROUTINE_POOL) { - if (pool_size < POOL_BATCH_SIZE * 2) { + if (release_pool_size < POOL_BATCH_SIZE * 2) { co->caller = NULL; QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); - atomic_inc(&pool_size); + atomic_inc(&release_pool_size); + return; + } else if (alloc_pool_size < POOL_BATCH_SIZE) { + co->caller = NULL; + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); + alloc_pool_size++; return; } } Bug?: The release_pool is not cleanup up on termination I think. 
Peter ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:39 ` Peter Lieven @ 2014-11-28 12:45 ` Paolo Bonzini 2014-11-28 12:49 ` Peter Lieven 2014-11-28 13:17 ` Peter Lieven 2014-11-28 13:13 ` Peter Lieven 1 sibling, 2 replies; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:45 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 13:39, Peter Lieven wrote: > Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >> >> On 28/11/2014 12:46, Peter Lieven wrote: >>>> I get: >>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>> than one thread in need of a lot of coroutines? >> Overall the algorithm is expected to adapt. The N threads contribute to >> the global release pool, so the pool will fill up N times faster than if >> you had only one thread. There can be some variance, which is why the >> maximum size of the pool is twice the threshold (and probably could be >> tuned better). >> >> Benchmarks are needed on real I/O too, of course, especially with high >> queue depth. > > Yes, cool. The atomic operations are a bit tricky at the first glance ;-) > > Question: > Why is the pool_size increment atomic and the set to zero not? Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means). > Idea: > If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch. > Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine > > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..edea162 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -25,8 +25,9 @@ enum { > > /** Free list to speed up creation */ > static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); > -static unsigned int pool_size; > +static unsigned int release_pool_size; > static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); > +static __thread unsigned int alloc_pool_size; > > /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ > static void coroutine_pool_cleanup(void *value); > @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > if (CONFIG_COROUTINE_POOL) { > co = QSLIST_FIRST(&alloc_pool); > if (!co) { > - if (pool_size > POOL_BATCH_SIZE) { > - /* This is not exact; there could be a little skew between pool_size > + if (release_pool_size > POOL_BATCH_SIZE) { > + /* This is not exact; there could be a little skew between release_pool_size > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. 
> */ > - pool_size = 0; > + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > */ > g_private_set(&dummy_key, &dummy_key); > } > + } else { > + alloc_pool_size--; > } > if (co) { > QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); > @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > static void coroutine_delete(Coroutine *co) > { > if (CONFIG_COROUTINE_POOL) { > - if (pool_size < POOL_BATCH_SIZE * 2) { > + if (release_pool_size < POOL_BATCH_SIZE * 2) { > co->caller = NULL; > QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); > - atomic_inc(&pool_size); > + atomic_inc(&release_pool_size); > + return; > + } else if (alloc_pool_size < POOL_BATCH_SIZE) { > + co->caller = NULL; > + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); > + alloc_pool_size++; > return; > } > } > > > Bug?: > The release_pool is not cleanup up on termination I think. That's not necessary, it is global. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
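A short standalone illustration of the distinction Paolo draws above, written with C11 atomics rather than QEMU's own macros: a plain store of zero cannot be torn, so it is "atomic" in that sense, but it silently discards whatever concurrent producers added in the meantime; a read-modify-write such as atomic_fetch_and returns the old value atomically, so the consumer can credit it to its own counter (which is what the atomic_fetch_and(&release_pool_size, 0) variant in the quoted diff does).

#include <stdatomic.h>

static _Atomic unsigned release_pool_size;

/* Heuristic drain: reset the counter; any increments that raced with the
 * list move are simply lost (acceptable when the value is only a heuristic). */
static void drain_heuristic(void)
{
    atomic_store(&release_pool_size, 0);
}

/* Exact drain: read and zero the counter in one atomic step, so the caller
 * can add the old value to its thread-local counter. */
static unsigned drain_exact(void)
{
    return atomic_fetch_and(&release_pool_size, 0);
}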
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:45 ` Paolo Bonzini @ 2014-11-28 12:49 ` Peter Lieven 2014-11-28 12:56 ` Paolo Bonzini 2014-11-28 13:17 ` Peter Lieven 1 sibling, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:49 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:45 schrieb Paolo Bonzini: > > On 28/11/2014 13:39, Peter Lieven wrote: >> Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >>> On 28/11/2014 12:46, Peter Lieven wrote: >>>>> I get: >>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>>> than one thread in need of a lot of coroutines? >>> Overall the algorithm is expected to adapt. The N threads contribute to >>> the global release pool, so the pool will fill up N times faster than if >>> you had only one thread. There can be some variance, which is why the >>> maximum size of the pool is twice the threshold (and probably could be >>> tuned better). >>> >>> Benchmarks are needed on real I/O too, of course, especially with high >>> queue depth. >> Yes, cool. The atomic operations are a bit tricky at the first glance ;-) >> >> Question: >> Why is the pool_size increment atomic and the set to zero not? > Because the set to zero is not a read-modify-write operation, so it is > always atomic. It's just not sequentially-consistent (see > docs/atomics.txt for some info on what that means). > >> Idea: >> If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) > Because you can only waste 64 coroutines per thread. But numbers cannot > be sneezed at, so it's worth doing it as a separate patch. What do you mean by that? If I use dataplane I will fill the global pool and never use it okay, but then I use thread local storage only. So I get the same numbers as in my thread local storage only version. Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we even won't waste those coroutines oxidating in the global pool. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:49 ` Peter Lieven @ 2014-11-28 12:56 ` Paolo Bonzini 0 siblings, 0 replies; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:56 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 13:49, Peter Lieven wrote: >>> Idea: >>> >> If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) >> > Because you can only waste 64 coroutines per thread. But numbers cannot s/only// >> > be sneezed at, so it's worth doing it as a separate patch. > What do you mean by that? If I use dataplane I will fill the global pool and never use it okay, but > then I use thread local storage only. So I get the same numbers as in my thread local storage only version. Right. I didn't want to waste the coroutines. But it's not 64 coroutines per VCPU thread, it's just 64 coroutines for the global iothread because all the dataplane threads are guaranteed to use ioeventfd. Let's do it. :) Can I add your Signed-off-by to the patch? Paolo > Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we > have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we even won't waste those > coroutines oxidating in the global pool. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:45 ` Paolo Bonzini 2014-11-28 12:49 ` Peter Lieven @ 2014-11-28 13:17 ` Peter Lieven 2014-11-28 14:17 ` Paolo Bonzini 1 sibling, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 13:17 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:45 schrieb Paolo Bonzini: > > On 28/11/2014 13:39, Peter Lieven wrote: >> Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >>> On 28/11/2014 12:46, Peter Lieven wrote: >>>>> I get: >>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>>> than one thread in need of a lot of coroutines? >>> Overall the algorithm is expected to adapt. The N threads contribute to >>> the global release pool, so the pool will fill up N times faster than if >>> you had only one thread. There can be some variance, which is why the >>> maximum size of the pool is twice the threshold (and probably could be >>> tuned better). >>> >>> Benchmarks are needed on real I/O too, of course, especially with high >>> queue depth. >> Yes, cool. The atomic operations are a bit tricky at the first glance ;-) >> >> Question: >> Why is the pool_size increment atomic and the set to zero not? > Because the set to zero is not a read-modify-write operation, so it is > always atomic. It's just not sequentially-consistent (see > docs/atomics.txt for some info on what that means). > >> Idea: >> If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) > Because you can only waste 64 coroutines per thread. But numbers cannot > be sneezed at, so it's worth doing it as a separate patch. > >> Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine >> >> diff --git a/qemu-coroutine.c b/qemu-coroutine.c >> index 6bee354..edea162 100644 >> --- a/qemu-coroutine.c >> +++ b/qemu-coroutine.c >> @@ -25,8 +25,9 @@ enum { >> >> /** Free list to speed up creation */ >> static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); >> -static unsigned int pool_size; >> +static unsigned int release_pool_size; >> static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); >> +static __thread unsigned int alloc_pool_size; >> >> /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ >> static void coroutine_pool_cleanup(void *value); >> @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> if (CONFIG_COROUTINE_POOL) { >> co = QSLIST_FIRST(&alloc_pool); >> if (!co) { >> - if (pool_size > POOL_BATCH_SIZE) { >> - /* This is not exact; there could be a little skew between pool_size >> + if (release_pool_size > POOL_BATCH_SIZE) { >> + /* This is not exact; there could be a little skew between release_pool_size >> * and the actual size of alloc_pool. But it is just a heuristic, >> * it does not need to be perfect. 
>> */ >> - pool_size = 0; >> + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); >> QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); >> co = QSLIST_FIRST(&alloc_pool); >> >> @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> */ >> g_private_set(&dummy_key, &dummy_key); >> } >> + } else { >> + alloc_pool_size--; >> } >> if (co) { >> QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); >> @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> static void coroutine_delete(Coroutine *co) >> { >> if (CONFIG_COROUTINE_POOL) { >> - if (pool_size < POOL_BATCH_SIZE * 2) { >> + if (release_pool_size < POOL_BATCH_SIZE * 2) { >> co->caller = NULL; >> QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); >> - atomic_inc(&pool_size); >> + atomic_inc(&release_pool_size); >> + return; >> + } else if (alloc_pool_size < POOL_BATCH_SIZE) { >> + co->caller = NULL; >> + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); >> + alloc_pool_size++; >> return; >> } >> } >> >> >> Bug?: >> The release_pool is not cleanup up on termination I think. > That's not necessary, it is global. I don't see where you iterate over release_pool and destroy all coroutines? Maybe just add back the old destructor with s/pool/release_pool/g Peter Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 13:17 ` Peter Lieven @ 2014-11-28 14:17 ` Paolo Bonzini 2014-11-28 20:11 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 14:17 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 14:17, Peter Lieven wrote: >>> >> The release_pool is not cleanup up on termination I think. >> > That's not necessary, it is global. > I don't see where you iterate over release_pool and destroy all coroutines? The OS does that for us when we exit. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 14:17 ` Paolo Bonzini @ 2014-11-28 20:11 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 20:11 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 15:17 schrieb Paolo Bonzini: > > On 28/11/2014 14:17, Peter Lieven wrote: >>>>>> The release_pool is not cleanup up on termination I think. >>>> That's not necessary, it is global. >> I don't see where you iterate over release_pool and destroy all coroutines? > The OS does that for us when we exit. Sure, but isn't that considered bad practice? Before this patch the destructor freed the coroutines. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:39 ` Peter Lieven 2014-11-28 12:45 ` Paolo Bonzini @ 2014-11-28 13:13 ` Peter Lieven 1 sibling, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 13:13 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:39 schrieb Peter Lieven: > Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >> On 28/11/2014 12:46, Peter Lieven wrote: >>>> I get: >>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>> than one thread in need of a lot of coroutines? >> Overall the algorithm is expected to adapt. The N threads contribute to >> the global release pool, so the pool will fill up N times faster than if >> you had only one thread. There can be some variance, which is why the >> maximum size of the pool is twice the threshold (and probably could be >> tuned better). >> >> Benchmarks are needed on real I/O too, of course, especially with high >> queue depth. > Yes, cool. The atomic operations are a bit tricky at the first glance ;-) > > Question: > Why is the pool_size increment atomic and the set to zero not? > > Idea: > If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) > > Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine > > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..edea162 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -25,8 +25,9 @@ enum { > > /** Free list to speed up creation */ > static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); > -static unsigned int pool_size; > +static unsigned int release_pool_size; > static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); > +static __thread unsigned int alloc_pool_size; > > /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ > static void coroutine_pool_cleanup(void *value); > @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > if (CONFIG_COROUTINE_POOL) { > co = QSLIST_FIRST(&alloc_pool); > if (!co) { > - if (pool_size > POOL_BATCH_SIZE) { > - /* This is not exact; there could be a little skew between pool_size > + if (release_pool_size > POOL_BATCH_SIZE) { > + /* This is not exact; there could be a little skew between release_pool_size > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. 
> */ > - pool_size = 0; > + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > */ > g_private_set(&dummy_key, &dummy_key); > } > + } else { > + alloc_pool_size--; > } > if (co) { > QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); > @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > static void coroutine_delete(Coroutine *co) > { > if (CONFIG_COROUTINE_POOL) { > - if (pool_size < POOL_BATCH_SIZE * 2) { > + if (release_pool_size < POOL_BATCH_SIZE * 2) { > co->caller = NULL; > QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); > - atomic_inc(&pool_size); > + atomic_inc(&release_pool_size); > + return; > + } else if (alloc_pool_size < POOL_BATCH_SIZE) { > + co->caller = NULL; > + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); > + alloc_pool_size++; > return; > } > } Signed-off-by: Peter Lieven <pl@kamp.de> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:32 ` Peter Lieven 2014-11-28 11:46 ` Peter Lieven @ 2014-11-28 12:21 ` Paolo Bonzini 2014-11-28 12:26 ` Peter Lieven 1 sibling, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:21 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 12:32, Peter Lieven wrote: > Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: >> >> On 28/11/2014 12:21, Peter Lieven wrote: >>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>>> master: >>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>>> >>>>> paolo: >>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>>> Nice. :) >>>> >>>> Can you please try "coroutine: Use __thread … " together, too? I still >>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>>> I apply it here (my times are 191/160/145). >>> indeed: >>> >>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine >> Your perf_master2 uses the ring buffer unconditionally, right? I wonder >> if we can use a similar algorithm but with arrays instead of lists... > > Why do you set pool_size = 0 in the create path? > > When I do the following: > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..c79ee78 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. > */ > - pool_size = 0; > + atomic_dec(&pool_size); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > > I get: > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Because pool_size is the (approximate) number of coroutines in the pool. It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:21 ` Paolo Bonzini @ 2014-11-28 12:26 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:26 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:21 schrieb Paolo Bonzini: > > On 28/11/2014 12:32, Peter Lieven wrote: >> Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: >>> On 28/11/2014 12:21, Peter Lieven wrote: >>>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>>>> master: >>>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>>>> >>>>>> paolo: >>>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>>>> Nice. :) >>>>> >>>>> Can you please try "coroutine: Use __thread … " together, too? I still >>>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>>>> I apply it here (my times are 191/160/145). >>>> indeed: >>>> >>>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine >>> Your perf_master2 uses the ring buffer unconditionally, right? I wonder >>> if we can use a similar algorithm but with arrays instead of lists... >> Why do you set pool_size = 0 in the create path? >> >> When I do the following: >> diff --git a/qemu-coroutine.c b/qemu-coroutine.c >> index 6bee354..c79ee78 100644 >> --- a/qemu-coroutine.c >> +++ b/qemu-coroutine.c >> @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> * and the actual size of alloc_pool. But it is just a heuristic, >> * it does not need to be perfect. >> */ >> - pool_size = 0; >> + atomic_dec(&pool_size); >> QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); >> co = QSLIST_FIRST(&alloc_pool); >> >> >> I get: >> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine > Because pool_size is the (approximate) number of coroutines in the pool. > It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. got it meanwhile. and its not as bad as i thought since you only steal the release_pool if your alloc_pool is empty. Right? Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2014-11-27 16:40 ` Paolo Bonzini @ 2014-11-28 12:40 ` Stefan Hajnoczi 1 sibling, 0 replies; 27+ messages in thread From: Stefan Hajnoczi @ 2014-11-28 12:40 UTC (permalink / raw) To: Peter Lieven Cc: kwolf, famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini, mreitz [-- Attachment #1: Type: text/plain, Size: 809 bytes --] On Thu, Nov 27, 2014 at 11:27:06AM +0100, Peter Lieven wrote: > diff --git a/iothread.c b/iothread.c > index 342a23f..b53529b 100644 > --- a/iothread.c > +++ b/iothread.c > @@ -15,6 +15,7 @@ > #include "qom/object_interfaces.h" > #include "qemu/module.h" > #include "block/aio.h" > +#include "block/coroutine.h" > #include "sysemu/iothread.h" > #include "qmp-commands.h" > #include "qemu/error-report.h" > @@ -47,6 +48,8 @@ static void *iothread_run(void *opaque) > } > aio_context_release(iothread->ctx); > } > + > + coroutine_pool_cleanup(); > return NULL; > } The assumption here is that iothread_run() is the only thread function that uses coroutines. If another thread uses coroutines the pool will leak :(. This is one of the challenges of thread-local storage. [-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
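One way to address the leak Stefan points out, sketched below with illustrative names (not the actual QEMU code): register a per-thread destructor so that whichever thread used the pool gets its entries freed when that thread exits. This is the same trick as the "GPrivate is only used to invoke coroutine_pool_cleanup" comment in the patch quoted earlier, just spelled with raw pthreads.

#include <pthread.h>

static pthread_key_t pool_cleanup_key;

static void pool_cleanup(void *opaque)
{
    (void)opaque;
    /* Free whatever is left in this thread's pool here. */
}

static void pool_global_init(void)
{
    pthread_key_create(&pool_cleanup_key, pool_cleanup);
}

/* Call once in every thread that touches the pool; the stored value is a
 * dummy whose only purpose is to arm the destructor for this thread. */
static void pool_arm_cleanup(void)
{
    pthread_setspecific(pool_cleanup_key, (void *)1);
}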