* [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool @ 2014-11-27 10:27 Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini This series removes the need for locking the coroutine pool. During collection of perf data it showed up that serious time is spent locking the coroutine pool during heavy I/O from the vServer. This also needs serious benchmarking. Using Kevin's qemu-img bench there is a 3-4% speedup. Inside vServers it seems to depend heavily on the individual CPU and SMP configuration. I just wanted to post these patches here to discuss if this is an idea that we should look further into. KNOWN ISSUES: - the I/O thread qobject is not destroyed on qemu termination. This seems to be the case for all qobjects?! Therefore the per-thread coroutine pool is not cleaned up. - without ioeventfd there is effectively no longer a coroutine pool since the coroutines are created in a different thread than the one they are freed in. Peter Lieven (3): Revert "coroutine: make pool size dynamic" block/block-backend.c: remove coroutine pool reservation qemu-coroutine: use a ring per thread for the pool block/block-backend.c | 6 ---- include/block/coroutine.h | 13 +------- iothread.c | 3 ++ qemu-coroutine.c | 78 +++++++++++++++------------------------------ 4 files changed, 30 insertions(+), 70 deletions(-) -- 1.7.9.5 ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" 2014-11-27 10:27 [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven @ 2014-11-27 10:27 ` Peter Lieven 2014-11-28 12:42 ` Stefan Hajnoczi 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 2/3] block/block-backend.c: remove coroutine pool reservation Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. --- include/block/coroutine.h | 11 ----------- qemu-coroutine.c | 26 +++----------------------- 2 files changed, 3 insertions(+), 34 deletions(-) diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 793df0e..1d9343d 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -215,15 +215,4 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type, * Note that this function clobbers the handlers for the file descriptor. */ void coroutine_fn yield_until_fd_readable(int fd); - -/** - * Add or subtract from the coroutine pool size - * - * The coroutine implementation keeps a pool of coroutines to be reused by - * qemu_coroutine_create(). This makes coroutine creation cheap. Heavy - * coroutine users should call this to reserve pool space. Call it again with - * a negative number to release pool space. - */ -void qemu_coroutine_adjust_pool_size(int n); - #endif /* QEMU_COROUTINE_H */ diff --git a/qemu-coroutine.c b/qemu-coroutine.c index bd574aa..4708521 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -19,14 +19,14 @@ #include "block/coroutine_int.h" enum { - POOL_DEFAULT_SIZE = 64, + /* Maximum free pool size prevents holding too many freed coroutines */ + POOL_MAX_SIZE = 64, }; /** Free list to speed up creation */ static QemuMutex pool_lock; static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool); static unsigned int pool_size; -static unsigned int pool_max_size = POOL_DEFAULT_SIZE; Coroutine *qemu_coroutine_create(CoroutineEntry *entry) { @@ -55,7 +55,7 @@ static void coroutine_delete(Coroutine *co) { if (CONFIG_COROUTINE_POOL) { qemu_mutex_lock(&pool_lock); - if (pool_size < pool_max_size) { + if (pool_size < POOL_MAX_SIZE) { QSLIST_INSERT_HEAD(&pool, co, pool_next); co->caller = NULL; pool_size++; @@ -137,23 +137,3 @@ void coroutine_fn qemu_coroutine_yield(void) self->caller = NULL; coroutine_swap(self, to); } - -void qemu_coroutine_adjust_pool_size(int n) -{ - qemu_mutex_lock(&pool_lock); - - pool_max_size += n; - - /* Callers should never take away more than they added */ - assert(pool_max_size >= POOL_DEFAULT_SIZE); - - /* Trim oversized pool down to new max */ - while (pool_size > pool_max_size) { - Coroutine *co = QSLIST_FIRST(&pool); - QSLIST_REMOVE_HEAD(&pool, pool_next); - pool_size--; - qemu_coroutine_delete(co); - } - - qemu_mutex_unlock(&pool_lock); -} -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven @ 2014-11-28 12:42 ` Stefan Hajnoczi 2014-11-28 12:45 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Stefan Hajnoczi @ 2014-11-28 12:42 UTC (permalink / raw) To: Peter Lieven Cc: kwolf, famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini, mreitz [-- Attachment #1: Type: text/plain, Size: 306 bytes --] On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: > This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. Justification? Won't we hit the same problem as the global pool: either pool size is too large or too small. If you exceed the pool size performance drops dramatically. Stefan [-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" 2014-11-28 12:42 ` Stefan Hajnoczi @ 2014-11-28 12:45 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:45 UTC (permalink / raw) To: Stefan Hajnoczi Cc: kwolf, famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini, mreitz Am 28.11.2014 um 13:42 schrieb Stefan Hajnoczi: > On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: >> This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. > Justification? > > Won't we hit the same problem as the global pool: either pool size is > too large or too small. If you exceed the pool size performance drops > dramatically. Each thread has its own pool. But you can ignore this series I think. The idea Paolo came up with seems to be better. With my last improvement it seems to have equal performance without the drawback of different threads creating and destroying the coroutines. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* [Qemu-devel] [RFC PATCH 2/3] block/block-backend.c: remove coroutine pool reservation 2014-11-27 10:27 [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven @ 2014-11-27 10:27 ` Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini Signed-off-by: Peter Lieven <pl@kamp.de> --- block/block-backend.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index d0692b1..cf36ffb 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -15,9 +15,6 @@ #include "sysemu/blockdev.h" #include "qapi-event.h" -/* Number of coroutines to reserve per attached device model */ -#define COROUTINE_POOL_RESERVATION 64 - struct BlockBackend { char *name; int refcnt; @@ -261,8 +258,6 @@ int blk_attach_dev(BlockBackend *blk, void *dev) blk->dev = dev; bdrv_iostatus_reset(blk->bs); - /* We're expecting I/O from the device so bump up coroutine pool size */ - qemu_coroutine_adjust_pool_size(COROUTINE_POOL_RESERVATION); return 0; } @@ -290,7 +285,6 @@ void blk_detach_dev(BlockBackend *blk, void *dev) blk->dev_ops = NULL; blk->dev_opaque = NULL; bdrv_set_guest_block_size(blk->bs, 512); - qemu_coroutine_adjust_pool_size(-COROUTINE_POOL_RESERVATION); blk_unref(blk); } -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 10:27 [Qemu-devel] [RFC PATCH 0/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 1/3] Revert "coroutine: make pool size dynamic" Peter Lieven 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 2/3] block/block-backend.c: remove coroutine pool reservation Peter Lieven @ 2014-11-27 10:27 ` Peter Lieven 2014-11-27 16:40 ` Paolo Bonzini 2014-11-28 12:40 ` Stefan Hajnoczi 2 siblings, 2 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-27 10:27 UTC (permalink / raw) To: qemu-devel Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz, stefanha, pbonzini This patch creates a ring structure for the coroutine pool instead of a linked list. The implementation of the list has the issue that it always throws aways the latest coroutines instead of the oldest ones. This is a drawback since the latest used coroutines are more likely cached than old ones. Furthermore this patch creates a coroutine pool per thread to remove the need for locking. Signed-off-by: Peter Lieven <pl@kamp.de> --- include/block/coroutine.h | 2 +- iothread.c | 3 +++ qemu-coroutine.c | 54 ++++++++++++++++++++------------------------- 3 files changed, 28 insertions(+), 31 deletions(-) diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 1d9343d..f0b3a2d 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -96,7 +96,7 @@ Coroutine *coroutine_fn qemu_coroutine_self(void); */ bool qemu_in_coroutine(void); - +void coroutine_pool_cleanup(void); /** * CoQueues are a mechanism to queue coroutines in order to continue executing diff --git a/iothread.c b/iothread.c index 342a23f..b53529b 100644 --- a/iothread.c +++ b/iothread.c @@ -15,6 +15,7 @@ #include "qom/object_interfaces.h" #include "qemu/module.h" #include "block/aio.h" +#include "block/coroutine.h" #include "sysemu/iothread.h" #include "qmp-commands.h" #include "qemu/error-report.h" @@ -47,6 +48,8 @@ static void *iothread_run(void *opaque) } aio_context_release(iothread->ctx); } + + coroutine_pool_cleanup(); return NULL; } diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 4708521..1900155 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -24,22 +24,22 @@ enum { }; /** Free list to speed up creation */ -static QemuMutex pool_lock; -static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool); -static unsigned int pool_size; +static __thread struct CoRoutinePool { + Coroutine *ptrs[POOL_MAX_SIZE]; + unsigned int size; + unsigned int nextfree; +} CoPool; Coroutine *qemu_coroutine_create(CoroutineEntry *entry) { Coroutine *co = NULL; - if (CONFIG_COROUTINE_POOL) { - qemu_mutex_lock(&pool_lock); - co = QSLIST_FIRST(&pool); - if (co) { - QSLIST_REMOVE_HEAD(&pool, pool_next); - pool_size--; + if (CoPool.size) { + co = CoPool.ptrs[CoPool.nextfree]; + CoPool.size--; + CoPool.nextfree--; + CoPool.nextfree &= (POOL_MAX_SIZE - 1); } - qemu_mutex_unlock(&pool_lock); } if (!co) { @@ -54,36 +54,30 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) static void coroutine_delete(Coroutine *co) { if (CONFIG_COROUTINE_POOL) { - qemu_mutex_lock(&pool_lock); - if (pool_size < POOL_MAX_SIZE) { - QSLIST_INSERT_HEAD(&pool, co, pool_next); - co->caller = NULL; - pool_size++; - qemu_mutex_unlock(&pool_lock); - return; + CoPool.nextfree++; + CoPool.nextfree &= (POOL_MAX_SIZE - 1); + if (CoPool.size == POOL_MAX_SIZE) { + qemu_coroutine_delete(CoPool.ptrs[CoPool.nextfree]); + } else 
{ + CoPool.size++; } - qemu_mutex_unlock(&pool_lock); + co->caller = NULL; + CoPool.ptrs[CoPool.nextfree] = co; + } else { + qemu_coroutine_delete(co); } - - qemu_coroutine_delete(co); } static void __attribute__((constructor)) coroutine_pool_init(void) { - qemu_mutex_init(&pool_lock); } -static void __attribute__((destructor)) coroutine_pool_cleanup(void) +void __attribute__((destructor)) coroutine_pool_cleanup(void) { - Coroutine *co; - Coroutine *tmp; - - QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) { - QSLIST_REMOVE_HEAD(&pool, pool_next); - qemu_coroutine_delete(co); + printf("coroutine_pool_cleanup %lx pool %p\n", pthread_self(), &CoPool); + while (CoPool.size) { + qemu_coroutine_delete(qemu_coroutine_create(NULL)); } - - qemu_mutex_destroy(&pool_lock); } static void coroutine_swap(Coroutine *from, Coroutine *to) -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven @ 2014-11-27 16:40 ` Paolo Bonzini 2014-11-28 8:13 ` Peter Lieven 2014-11-28 12:40 ` Stefan Hajnoczi 1 sibling, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-27 16:40 UTC (permalink / raw) To: Peter Lieven, qemu-devel Cc: kwolf, famz, benoit, ming.lei, armbru, mreitz, stefanha On 27/11/2014 11:27, Peter Lieven wrote: > +static __thread struct CoRoutinePool { > + Coroutine *ptrs[POOL_MAX_SIZE]; > + unsigned int size; > + unsigned int nextfree; > +} CoPool; > The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused. Can you check if this is still the case? Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
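To make the failure mode Paolo describes concrete, here is a minimal standalone sketch (illustrative names, not QEMU code) of a purely thread-local pool. When one thread only allocates and another only frees — as with non-ioeventfd devices, where the VCPU thread creates the coroutine and the iothread destroys it — the allocating thread's pool stays empty and nothing is ever reused:

#include <stdlib.h>

#define POOL_MAX 64

struct obj { struct obj *next; };

static __thread struct obj *local_pool;      /* one free list per thread */
static __thread unsigned local_pool_size;

static struct obj *obj_alloc(void)
{
    struct obj *o = local_pool;
    if (o) {                                  /* reuse from this thread's pool */
        local_pool = o->next;
        local_pool_size--;
        return o;
    }
    return calloc(1, sizeof(*o));             /* pool empty: allocate fresh */
}

static void obj_free(struct obj *o)
{
    if (local_pool_size < POOL_MAX) {         /* return to *this* thread's pool */
        o->next = local_pool;
        local_pool = o;
        local_pool_size++;
    } else {
        free(o);
    }
}

/* If thread A only calls obj_alloc() and thread B only calls obj_free(),
 * A never hits its pool (every allocation is a fresh calloc) and B's pool
 * fills up once and then frees everything -- exactly the "coroutines cannot
 * be reused" situation described above. */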
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 16:40 ` Paolo Bonzini @ 2014-11-28 8:13 ` Peter Lieven [not found] ` <54784E55.6060405@redhat.com> 0 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 8:13 UTC (permalink / raw) To: Paolo Bonzini, qemu-devel Cc: kwolf, famz, benoit, ming.lei, armbru, mreitz, stefanha Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: > > On 27/11/2014 11:27, Peter Lieven wrote: >> +static __thread struct CoRoutinePool { >> + Coroutine *ptrs[POOL_MAX_SIZE]; >> + unsigned int size; >> + unsigned int nextfree; >> +} CoPool; >> > The per-thread ring unfortunately didn't work well last time it was > tested. Devices that do not use ioeventfd (not just the slow ones, even > decently performing ones like ahci, nvme or megasas) will create the > coroutine in the VCPU thread, and destroy it in the iothread. The > result is that coroutines cannot be reused. > > Can you check if this is still the case? I already tested at least for IDE and for ioeventfd=off. The coroutine is created in the vCPU thread and destroyed in the I/O thread. I also have a more complicated version which sets a per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads. For those who want to take a look: https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <54784E55.6060405@redhat.com>]
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool [not found] ` <54784E55.6060405@redhat.com> @ 2014-11-28 10:37 ` Peter Lieven 2014-11-28 11:14 ` Paolo Bonzini 0 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 10:37 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 11:28 schrieb Paolo Bonzini: > > On 28/11/2014 09:13, Peter Lieven wrote: >> Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: >>> On 27/11/2014 11:27, Peter Lieven wrote: >>>> +static __thread struct CoRoutinePool { >>>> + Coroutine *ptrs[POOL_MAX_SIZE]; >>>> + unsigned int size; >>>> + unsigned int nextfree; >>>> +} CoPool; >>>> >>> The per-thread ring unfortunately didn't work well last time it was >>> tested. Devices that do not use ioeventfd (not just the slow ones, even >>> decently performing ones like ahci, nvme or megasas) will create the >>> coroutine in the VCPU thread, and destroy it in the iothread. The >>> result is that coroutines cannot be reused. >>> >>> Can you check if this is still the case? >> I already tested at least for IDE and for ioeventfd=off. The coroutine >> is created in the vCPU thread and destroyed in the I/O thread. >> >> I also have a more complicated version which sets a per-thread coroutine pool only >> for dataplane, avoiding the lock for dedicated iothreads. >> >> For those who want to take a look: >> >> https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e > Can you test it against the patch I just sent in Kevin's linux-aio > coroutine thread? I was already doing it ;-) At least with test-coroutine.c... master: Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine plieven/perf_master2: Run operation 40000000 iterations 9.013785 s, 4437K operations/s, 225ns per coroutine plieven/perf_master: Run operation 40000000 iterations 11.072883 s, 3612K operations/s, 276ns per coroutine However, perf_master and perf_master2 seem to have a regression regarding nesting. @Kevin: Could that be the reason why they perform badly in some scenarios? Regarding the bypass that is being discussed: if it is not just a benchmark thing but really necessary for some people's use cases, why not add a new aio mode like "bypass" and use it only then? If the performance is really needed, the user might trade it in for lost features like I/O throttling, filters, etc. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 10:37 ` Peter Lieven @ 2014-11-28 11:14 ` Paolo Bonzini 2014-11-28 11:21 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 11:14 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster > master: > Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine > > paolo: > Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try "coroutine: Use __thread … " together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
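The "coroutine: Use __thread" patch referenced above is not included in this thread; the following sketch only illustrates, with made-up names, the kind of change being measured: replacing a dynamic TLS lookup for the current-coroutine pointer with compiler-level TLS, so the hot path becomes a plain TLS-relative load instead of a pthread_getspecific() call.

#include <pthread.h>
#include <stddef.h>

typedef struct Coroutine Coroutine;

/* Before: each lookup of the current coroutine goes through
 * pthread_getspecific(), which showed up as ~11% in the profile above. */
static pthread_key_t current_key;

static Coroutine *get_current_dynamic(void)
{
    return pthread_getspecific(current_key);
}

/* After: __thread lets the compiler emit a direct TLS access instead of a
 * library call. */
static __thread Coroutine *current;

static Coroutine *get_current_tls(void)
{
    return current;
}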
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:14 ` Paolo Bonzini @ 2014-11-28 11:21 ` Peter Lieven 2014-11-28 11:23 ` Paolo Bonzini 0 siblings, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:21 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >> master: >> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >> >> paolo: >> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine > Nice. :) > > Can you please try "coroutine: Use __thread … " together, too? I still > see 11% time spent in pthread_getspecific, and I get ~10% more indeed if > I apply it here (my times are 191/160/145). indeed: Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:21 ` Peter Lieven @ 2014-11-28 11:23 ` Paolo Bonzini 2014-11-28 11:27 ` Peter Lieven 2014-11-28 11:32 ` Peter Lieven 0 siblings, 2 replies; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 11:23 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 12:21, Peter Lieven wrote: > Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>> master: >>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>> >>> paolo: >>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >> Nice. :) >> >> Can you please try "coroutine: Use __thread … " together, too? I still >> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >> I apply it here (my times are 191/160/145). > > indeed: > > Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:23 ` Paolo Bonzini @ 2014-11-28 11:27 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:27 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: > > On 28/11/2014 12:21, Peter Lieven wrote: >> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>> master: >>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>> >>>> paolo: >>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>> Nice. :) >>> >>> Can you please try "coroutine: Use __thread … " together, too? I still >>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>> I apply it here (my times are 191/160/145). >> indeed: >> >> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine > Your perf_master2 uses the ring buffer unconditionally, right? I wonder > if we can use a similar algorithm but with arrays instead of lists... You mean an algorithm similar to perf_master2 or to the current implementation? The ring buffer seems to have a drawback when it comes to excessive coroutine nesting. My idea was that you do not throw away hot coroutines when the pool is full. However, I do not know if this is really a problem since the pool is only full when there is not much I/O. Or is this assumption too simplistic? Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:23 ` Paolo Bonzini 2014-11-28 11:27 ` Peter Lieven @ 2014-11-28 11:32 ` Peter Lieven 2014-11-28 11:46 ` Peter Lieven 2014-11-28 12:21 ` Paolo Bonzini 1 sibling, 2 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:32 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: > > On 28/11/2014 12:21, Peter Lieven wrote: >> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>> master: >>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>> >>>> paolo: >>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>> Nice. :) >>> >>> Can you please try "coroutine: Use __thread … " together, too? I still >>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>> I apply it here (my times are 191/160/145). >> indeed: >> >> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine > Your perf_master2 uses the ring buffer unconditionally, right? I wonder > if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following: diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..c79ee78 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ - pool_size = 0; + atomic_dec(&pool_size); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); I get: Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:32 ` Peter Lieven @ 2014-11-28 11:46 ` Peter Lieven 2014-11-28 12:26 ` Paolo Bonzini 2014-11-28 12:21 ` Paolo Bonzini 1 sibling, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 11:46 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 12:32 schrieb Peter Lieven: > Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: >> On 28/11/2014 12:21, Peter Lieven wrote: >>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>>> master: >>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>>> >>>>> paolo: >>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>>> Nice. :) >>>> >>>> Can you please try "coroutine: Use __thread … " together, too? I still >>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>>> I apply it here (my times are 191/160/145). >>> indeed: >>> >>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine >> Your perf_master2 uses the ring buffer unconditionally, right? I wonder >> if we can use a similar algorithm but with arrays instead of lists... > Why do you set pool_size = 0 in the create path? > > When I do the following: > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..c79ee78 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. > */ > - pool_size = 0; > + atomic_dec(&pool_size); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > > I get: > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine > Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:46 ` Peter Lieven @ 2014-11-28 12:26 ` Paolo Bonzini 2014-11-28 12:39 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:26 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 12:46, Peter Lieven wrote: > > I get: > > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine > > Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more > than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:26 ` Paolo Bonzini @ 2014-11-28 12:39 ` Peter Lieven 2014-11-28 12:45 ` Paolo Bonzini 2014-11-28 13:13 ` Peter Lieven 0 siblings, 2 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:39 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: > > On 28/11/2014 12:46, Peter Lieven wrote: >>> I get: >>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >> than one thread in need of a lot of coroutines? > Overall the algorithm is expected to adapt. The N threads contribute to > the global release pool, so the pool will fill up N times faster than if > you had only one thread. There can be some variance, which is why the > maximum size of the pool is twice the threshold (and probably could be > tuned better). > > Benchmarks are needed on real I/O too, of course, especially with high > queue depth. Yes, cool. The atomic operations are a bit tricky at the first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..edea162 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -25,8 +25,9 @@ enum { /** Free list to speed up creation */ static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); -static unsigned int pool_size; +static unsigned int release_pool_size; static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); +static __thread unsigned int alloc_pool_size; /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ static void coroutine_pool_cleanup(void *value); @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) if (CONFIG_COROUTINE_POOL) { co = QSLIST_FIRST(&alloc_pool); if (!co) { - if (pool_size > POOL_BATCH_SIZE) { - /* This is not exact; there could be a little skew between pool_size + if (release_pool_size > POOL_BATCH_SIZE) { + /* This is not exact; there could be a little skew between release_pool_size * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ - pool_size = 0; + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) */ g_private_set(&dummy_key, &dummy_key); } + } else { + alloc_pool_size--; } if (co) { QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) static void coroutine_delete(Coroutine *co) { if (CONFIG_COROUTINE_POOL) { - if (pool_size < POOL_BATCH_SIZE * 2) { + if (release_pool_size < POOL_BATCH_SIZE * 2) { co->caller = NULL; QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); - atomic_inc(&pool_size); + atomic_inc(&release_pool_size); + return; + } else if (alloc_pool_size < POOL_BATCH_SIZE) { + co->caller = NULL; + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); + alloc_pool_size++; return; } } Bug?: The release_pool is not cleanup up on termination I think. 
Peter ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:39 ` Peter Lieven @ 2014-11-28 12:45 ` Paolo Bonzini 2014-11-28 12:49 ` Peter Lieven 2014-11-28 13:17 ` Peter Lieven 2014-11-28 13:13 ` Peter Lieven 1 sibling, 2 replies; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:45 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 13:39, Peter Lieven wrote: > Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >> >> On 28/11/2014 12:46, Peter Lieven wrote: >>>> I get: >>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>> than one thread in need of a lot of coroutines? >> Overall the algorithm is expected to adapt. The N threads contribute to >> the global release pool, so the pool will fill up N times faster than if >> you had only one thread. There can be some variance, which is why the >> maximum size of the pool is twice the threshold (and probably could be >> tuned better). >> >> Benchmarks are needed on real I/O too, of course, especially with high >> queue depth. > > Yes, cool. The atomic operations are a bit tricky at the first glance ;-) > > Question: > Why is the pool_size increment atomic and the set to zero not? Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means). > Idea: > If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch. > Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine > > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..edea162 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -25,8 +25,9 @@ enum { > > /** Free list to speed up creation */ > static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); > -static unsigned int pool_size; > +static unsigned int release_pool_size; > static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); > +static __thread unsigned int alloc_pool_size; > > /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ > static void coroutine_pool_cleanup(void *value); > @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > if (CONFIG_COROUTINE_POOL) { > co = QSLIST_FIRST(&alloc_pool); > if (!co) { > - if (pool_size > POOL_BATCH_SIZE) { > - /* This is not exact; there could be a little skew between pool_size > + if (release_pool_size > POOL_BATCH_SIZE) { > + /* This is not exact; there could be a little skew between release_pool_size > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. 
> */ > - pool_size = 0; > + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > */ > g_private_set(&dummy_key, &dummy_key); > } > + } else { > + alloc_pool_size--; > } > if (co) { > QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); > @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > static void coroutine_delete(Coroutine *co) > { > if (CONFIG_COROUTINE_POOL) { > - if (pool_size < POOL_BATCH_SIZE * 2) { > + if (release_pool_size < POOL_BATCH_SIZE * 2) { > co->caller = NULL; > QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); > - atomic_inc(&pool_size); > + atomic_inc(&release_pool_size); > + return; > + } else if (alloc_pool_size < POOL_BATCH_SIZE) { > + co->caller = NULL; > + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); > + alloc_pool_size++; > return; > } > } > > > Bug?: > The release_pool is not cleanup up on termination I think. That's not necessary, it is global. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
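A short standalone illustration of the distinction Paolo draws above, written with C11 atomics rather than QEMU's own macros: a plain store of zero cannot be torn, so it is "atomic" in that sense, but it silently discards whatever concurrent producers added in the meantime; a read-modify-write such as atomic_fetch_and returns the old value atomically, so the consumer can credit it to its own counter (which is what the atomic_fetch_and(&release_pool_size, 0) variant in the quoted diff does).

#include <stdatomic.h>

static _Atomic unsigned release_pool_size;

/* Heuristic drain: reset the counter; any increments that raced with the
 * list move are simply lost (acceptable when the value is only a heuristic). */
static void drain_heuristic(void)
{
    atomic_store(&release_pool_size, 0);
}

/* Exact drain: read and zero the counter in one atomic step, so the caller
 * can add the old value to its thread-local counter. */
static unsigned drain_exact(void)
{
    return atomic_fetch_and(&release_pool_size, 0);
}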
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:45 ` Paolo Bonzini @ 2014-11-28 12:49 ` Peter Lieven 2014-11-28 12:56 ` Paolo Bonzini 2014-11-28 13:17 ` Peter Lieven 1 sibling, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:49 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:45 schrieb Paolo Bonzini: > > On 28/11/2014 13:39, Peter Lieven wrote: >> Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >>> On 28/11/2014 12:46, Peter Lieven wrote: >>>>> I get: >>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>>> than one thread in need of a lot of coroutines? >>> Overall the algorithm is expected to adapt. The N threads contribute to >>> the global release pool, so the pool will fill up N times faster than if >>> you had only one thread. There can be some variance, which is why the >>> maximum size of the pool is twice the threshold (and probably could be >>> tuned better). >>> >>> Benchmarks are needed on real I/O too, of course, especially with high >>> queue depth. >> Yes, cool. The atomic operations are a bit tricky at the first glance ;-) >> >> Question: >> Why is the pool_size increment atomic and the set to zero not? > Because the set to zero is not a read-modify-write operation, so it is > always atomic. It's just not sequentially-consistent (see > docs/atomics.txt for some info on what that means). > >> Idea: >> If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) > Because you can only waste 64 coroutines per thread. But numbers cannot > be sneezed at, so it's worth doing it as a separate patch. What do you mean by that? If I use dataplane I will fill the global pool and never use it okay, but then I use thread local storage only. So I get the same numbers as in my thread local storage only version. Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we even won't waste those coroutines oxidating in the global pool. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:49 ` Peter Lieven @ 2014-11-28 12:56 ` Paolo Bonzini 0 siblings, 0 replies; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:56 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 13:49, Peter Lieven wrote: >>> Idea: >>> >> If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) >> > Because you can only waste 64 coroutines per thread. But numbers cannot s/only// >> > be sneezed at, so it's worth doing it as a separate patch. > What do you mean by that? If I use dataplane I will fill the global pool and never use it okay, but > then I use thread local storage only. So I get the same numbers as in my thread local storage only version. Right. I didn't want to waste the coroutines. But it's not 64 coroutines per VCPU thread, it's just 64 coroutines for the global iothread because all the dataplane threads are guaranteed to use ioeventfd. Let's do it. :) Can I add your Signed-off-by to the patch? Paolo > Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we > have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we even won't waste those > coroutines oxidating in the global pool. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:45 ` Paolo Bonzini 2014-11-28 12:49 ` Peter Lieven @ 2014-11-28 13:17 ` Peter Lieven 2014-11-28 14:17 ` Paolo Bonzini 1 sibling, 1 reply; 27+ messages in thread From: Peter Lieven @ 2014-11-28 13:17 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:45 schrieb Paolo Bonzini: > > On 28/11/2014 13:39, Peter Lieven wrote: >> Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >>> On 28/11/2014 12:46, Peter Lieven wrote: >>>>> I get: >>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>>> than one thread in need of a lot of coroutines? >>> Overall the algorithm is expected to adapt. The N threads contribute to >>> the global release pool, so the pool will fill up N times faster than if >>> you had only one thread. There can be some variance, which is why the >>> maximum size of the pool is twice the threshold (and probably could be >>> tuned better). >>> >>> Benchmarks are needed on real I/O too, of course, especially with high >>> queue depth. >> Yes, cool. The atomic operations are a bit tricky at the first glance ;-) >> >> Question: >> Why is the pool_size increment atomic and the set to zero not? > Because the set to zero is not a read-modify-write operation, so it is > always atomic. It's just not sequentially-consistent (see > docs/atomics.txt for some info on what that means). > >> Idea: >> If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) > Because you can only waste 64 coroutines per thread. But numbers cannot > be sneezed at, so it's worth doing it as a separate patch. > >> Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine >> >> diff --git a/qemu-coroutine.c b/qemu-coroutine.c >> index 6bee354..edea162 100644 >> --- a/qemu-coroutine.c >> +++ b/qemu-coroutine.c >> @@ -25,8 +25,9 @@ enum { >> >> /** Free list to speed up creation */ >> static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); >> -static unsigned int pool_size; >> +static unsigned int release_pool_size; >> static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); >> +static __thread unsigned int alloc_pool_size; >> >> /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ >> static void coroutine_pool_cleanup(void *value); >> @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> if (CONFIG_COROUTINE_POOL) { >> co = QSLIST_FIRST(&alloc_pool); >> if (!co) { >> - if (pool_size > POOL_BATCH_SIZE) { >> - /* This is not exact; there could be a little skew between pool_size >> + if (release_pool_size > POOL_BATCH_SIZE) { >> + /* This is not exact; there could be a little skew between release_pool_size >> * and the actual size of alloc_pool. But it is just a heuristic, >> * it does not need to be perfect. 
>> */ >> - pool_size = 0; >> + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); >> QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); >> co = QSLIST_FIRST(&alloc_pool); >> >> @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> */ >> g_private_set(&dummy_key, &dummy_key); >> } >> + } else { >> + alloc_pool_size--; >> } >> if (co) { >> QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); >> @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> static void coroutine_delete(Coroutine *co) >> { >> if (CONFIG_COROUTINE_POOL) { >> - if (pool_size < POOL_BATCH_SIZE * 2) { >> + if (release_pool_size < POOL_BATCH_SIZE * 2) { >> co->caller = NULL; >> QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); >> - atomic_inc(&pool_size); >> + atomic_inc(&release_pool_size); >> + return; >> + } else if (alloc_pool_size < POOL_BATCH_SIZE) { >> + co->caller = NULL; >> + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); >> + alloc_pool_size++; >> return; >> } >> } >> >> >> Bug?: >> The release_pool is not cleanup up on termination I think. > That's not necessary, it is global. I don't see where you iterate over release_pool and destroy all coroutines? Maybe just add back the old destructor with s/pool/release_pool/g Peter Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 13:17 ` Peter Lieven @ 2014-11-28 14:17 ` Paolo Bonzini 2014-11-28 20:11 ` Peter Lieven 0 siblings, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 14:17 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 14:17, Peter Lieven wrote: >>> >> The release_pool is not cleanup up on termination I think. >> > That's not necessary, it is global. > I don't see where you iterate over release_pool and destroy all coroutines? The OS does that for us when we exit. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 14:17 ` Paolo Bonzini @ 2014-11-28 20:11 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 20:11 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 15:17 schrieb Paolo Bonzini: > > On 28/11/2014 14:17, Peter Lieven wrote: >>>>>> The release_pool is not cleanup up on termination I think. >>>> That's not necessary, it is global. >> I don't see where you iterate over release_pool and destroy all coroutines? > The OS does that for us when we exit. Sure, but isn't that considered bad practice? Before this patch the destructor freed the coroutines. Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:39 ` Peter Lieven 2014-11-28 12:45 ` Paolo Bonzini @ 2014-11-28 13:13 ` Peter Lieven 1 sibling, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 13:13 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:39 schrieb Peter Lieven: > Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: >> On 28/11/2014 12:46, Peter Lieven wrote: >>>> I get: >>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine >>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if we have more >>> than one thread in need of a lot of coroutines? >> Overall the algorithm is expected to adapt. The N threads contribute to >> the global release pool, so the pool will fill up N times faster than if >> you had only one thread. There can be some variance, which is why the >> maximum size of the pool is twice the threshold (and probably could be >> tuned better). >> >> Benchmarks are needed on real I/O too, of course, especially with high >> queue depth. > Yes, cool. The atomic operations are a bit tricky at the first glance ;-) > > Question: > Why is the pool_size increment atomic and the set to zero not? > > Idea: > If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) > > Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine > > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..edea162 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -25,8 +25,9 @@ enum { > > /** Free list to speed up creation */ > static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); > -static unsigned int pool_size; > +static unsigned int release_pool_size; > static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); > +static __thread unsigned int alloc_pool_size; > > /* The GPrivate is only used to invoke coroutine_pool_cleanup. */ > static void coroutine_pool_cleanup(void *value); > @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > if (CONFIG_COROUTINE_POOL) { > co = QSLIST_FIRST(&alloc_pool); > if (!co) { > - if (pool_size > POOL_BATCH_SIZE) { > - /* This is not exact; there could be a little skew between pool_size > + if (release_pool_size > POOL_BATCH_SIZE) { > + /* This is not exact; there could be a little skew between release_pool_size > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. 
> */ > - pool_size = 0; > + alloc_pool_size = atomic_fetch_and(&release_pool_size, 0); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > */ > g_private_set(&dummy_key, &dummy_key); > } > + } else { > + alloc_pool_size--; > } > if (co) { > QSLIST_REMOVE_HEAD(&alloc_pool, pool_next); > @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > static void coroutine_delete(Coroutine *co) > { > if (CONFIG_COROUTINE_POOL) { > - if (pool_size < POOL_BATCH_SIZE * 2) { > + if (release_pool_size < POOL_BATCH_SIZE * 2) { > co->caller = NULL; > QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); > - atomic_inc(&pool_size); > + atomic_inc(&release_pool_size); > + return; > + } else if (alloc_pool_size < POOL_BATCH_SIZE) { > + co->caller = NULL; > + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); > + alloc_pool_size++; > return; > } > } Signed-off-by: Peter Lieven <pl@kamp.de> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 11:32 ` Peter Lieven 2014-11-28 11:46 ` Peter Lieven @ 2014-11-28 12:21 ` Paolo Bonzini 2014-11-28 12:26 ` Peter Lieven 1 sibling, 1 reply; 27+ messages in thread From: Paolo Bonzini @ 2014-11-28 12:21 UTC (permalink / raw) To: Peter Lieven, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster On 28/11/2014 12:32, Peter Lieven wrote: > Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: >> >> On 28/11/2014 12:21, Peter Lieven wrote: >>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>>> master: >>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>>> >>>>> paolo: >>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>>> Nice. :) >>>> >>>> Can you please try "coroutine: Use __thread … " together, too? I still >>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>>> I apply it here (my times are 191/160/145). >>> indeed: >>> >>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine >> Your perf_master2 uses the ring buffer unconditionally, right? I wonder >> if we can use a similar algorithm but with arrays instead of lists... > > Why do you set pool_size = 0 in the create path? > > When I do the following: > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 6bee354..c79ee78 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) > * and the actual size of alloc_pool. But it is just a heuristic, > * it does not need to be perfect. > */ > - pool_size = 0; > + atomic_dec(&pool_size); > QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); > co = QSLIST_FIRST(&alloc_pool); > > > I get: > Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Because pool_size is the (approximate) number of coroutines in the pool. It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. Paolo ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-28 12:21 ` Paolo Bonzini @ 2014-11-28 12:26 ` Peter Lieven 0 siblings, 0 replies; 27+ messages in thread From: Peter Lieven @ 2014-11-28 12:26 UTC (permalink / raw) To: Paolo Bonzini, ming.lei, Kevin Wolf, Stefan Hajnoczi, qemu-devel@nongnu.org, Markus Armbruster Am 28.11.2014 um 13:21 schrieb Paolo Bonzini: > > On 28/11/2014 12:32, Peter Lieven wrote: >> Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: >>> On 28/11/2014 12:21, Peter Lieven wrote: >>>> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: >>>>>> master: >>>>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine >>>>>> >>>>>> paolo: >>>>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine >>>>> Nice. :) >>>>> >>>>> Can you please try "coroutine: Use __thread … " together, too? I still >>>>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if >>>>> I apply it here (my times are 191/160/145). >>>> indeed: >>>> >>>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine >>> Your perf_master2 uses the ring buffer unconditionally, right? I wonder >>> if we can use a similar algorithm but with arrays instead of lists... >> Why do you set pool_size = 0 in the create path? >> >> When I do the following: >> diff --git a/qemu-coroutine.c b/qemu-coroutine.c >> index 6bee354..c79ee78 100644 >> --- a/qemu-coroutine.c >> +++ b/qemu-coroutine.c >> @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) >> * and the actual size of alloc_pool. But it is just a heuristic, >> * it does not need to be perfect. >> */ >> - pool_size = 0; >> + atomic_dec(&pool_size); >> QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); >> co = QSLIST_FIRST(&alloc_pool); >> >> >> I get: >> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine > Because pool_size is the (approximate) number of coroutines in the pool. > It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. got it meanwhile. and its not as bad as i thought since you only steal the release_pool if your alloc_pool is empty. Right? Peter ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool 2014-11-27 10:27 ` [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool Peter Lieven 2014-11-27 16:40 ` Paolo Bonzini @ 2014-11-28 12:40 ` Stefan Hajnoczi 1 sibling, 0 replies; 27+ messages in thread From: Stefan Hajnoczi @ 2014-11-28 12:40 UTC (permalink / raw) To: Peter Lieven Cc: kwolf, famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini, mreitz [-- Attachment #1: Type: text/plain, Size: 809 bytes --] On Thu, Nov 27, 2014 at 11:27:06AM +0100, Peter Lieven wrote: > diff --git a/iothread.c b/iothread.c > index 342a23f..b53529b 100644 > --- a/iothread.c > +++ b/iothread.c > @@ -15,6 +15,7 @@ > #include "qom/object_interfaces.h" > #include "qemu/module.h" > #include "block/aio.h" > +#include "block/coroutine.h" > #include "sysemu/iothread.h" > #include "qmp-commands.h" > #include "qemu/error-report.h" > @@ -47,6 +48,8 @@ static void *iothread_run(void *opaque) > } > aio_context_release(iothread->ctx); > } > + > + coroutine_pool_cleanup(); > return NULL; > } The assumption here is that iothread_run() is the only thread function that uses coroutines. If another thread uses coroutines the pool will leak :(. This is one of the challenges of thread-local storage. [-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
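One way to address the leak Stefan points out, sketched below with illustrative names (not the actual QEMU code): register a per-thread destructor so that whichever thread used the pool gets its entries freed when that thread exits. This is the same trick as the "GPrivate is only used to invoke coroutine_pool_cleanup" comment in the patch quoted earlier, just spelled with raw pthreads.

#include <pthread.h>

static pthread_key_t pool_cleanup_key;

static void pool_cleanup(void *opaque)
{
    (void)opaque;
    /* Free whatever is left in this thread's pool here. */
}

static void pool_global_init(void)
{
    pthread_key_create(&pool_cleanup_key, pool_cleanup);
}

/* Call once in every thread that touches the pool; the stored value is a
 * dummy whose only purpose is to arm the destructor for this thread. */
static void pool_arm_cleanup(void)
{
    pthread_setspecific(pool_cleanup_key, (void *)1);
}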