Date: Fri, 28 Nov 2014 13:45:06 +0100
From: Paolo Bonzini
To: Peter Lieven, ming.lei@canonical.com, Kevin Wolf, Stefan Hajnoczi,
 qemu-devel@nongnu.org, Markus Armbruster
Subject: Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Message-ID: <54786E52.6050209@redhat.com>
In-Reply-To: <54786CF5.2060705@kamp.de>

On 28/11/2014 13:39, Peter Lieven wrote:
> On 28.11.2014 at 13:26, Paolo Bonzini wrote:
>>
>> On 28/11/2014 12:46, Peter Lieven wrote:
>>>> I get:
>>>> Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
>>> Ok, understood, it "steals" the whole pool, right? Isn't that bad if
>>> we have more than one thread in need of a lot of coroutines?
>> Overall the algorithm is expected to adapt. The N threads contribute to
>> the global release pool, so the pool will fill up N times faster than if
>> you had only one thread. There can be some variance, which is why the
>> maximum size of the pool is twice the threshold (and probably could be
>> tuned better).
>>
>> Benchmarks are needed on real I/O too, of course, especially with high
>> queue depth.
>
> Yes, cool. The atomic operations are a bit tricky at first glance ;-)
>
> Question:
> Why is the pool_size increment atomic, but the set to zero is not?

Because the set to zero is not a read-modify-write operation, it is
always atomic. It is just not sequentially consistent (see
docs/atomics.txt for some info on what that means).

> Idea:
> If the release_pool is full, why not put the coroutine in the thread's
> alloc_pool instead of throwing it away? :-)

Because then you waste at most 64 coroutines per thread. But the
numbers are nothing to sneeze at, so it is worth doing as a separate
patch.
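To make the read-modify-write distinction concrete, here is a minimal
standalone sketch. It uses C11 <stdatomic.h> instead of QEMU's
include/qemu/atomic.h wrappers, and the counter/function names are
invented for the example:

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_uint counter;

    /* The increment must be a read-modify-write: with plain accesses,
     * two racing threads could both read the same old value, and one
     * of the increments would be lost.  This is the role played by
     * atomic_inc in the patch. */
    static void counter_inc(void)
    {
        atomic_fetch_add(&counter, 1);
    }

    /* A plain aligned store such as "counter = 0" cannot be torn, so
     * it is atomic in that sense -- but it is not ordered against
     * concurrent increments, and it does not report which value it
     * overwrote.  An atomic exchange (or an AND with 0, which is what
     * atomic_fetch_and(&release_pool_size, 0) does in the patch)
     * clears the counter and returns the old value in one indivisible
     * step. */
    static unsigned counter_reset(void)
    {
        return atomic_exchange(&counter, 0);
    }

    int main(void)
    {
        counter_inc();
        counter_inc();
        printf("reset returned %u\n", counter_reset());         /* 2 */
        printf("value afterwards %u\n", atomic_load(&counter)); /* 0 */
        return 0;
    }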
> Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine
>
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 6bee354..edea162 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -25,8 +25,9 @@ enum {
>
>  /** Free list to speed up creation */
>  static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int pool_size;
> +static unsigned int release_pool_size;
>  static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
> +static __thread unsigned int alloc_pool_size;
>
>  /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
>  static void coroutine_pool_cleanup(void *value);
> @@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>      if (CONFIG_COROUTINE_POOL) {
>          co = QSLIST_FIRST(&alloc_pool);
>          if (!co) {
> -            if (pool_size > POOL_BATCH_SIZE) {
> -                /* This is not exact; there could be a little skew between pool_size
> +            if (release_pool_size > POOL_BATCH_SIZE) {
> +                /* This is not exact; there could be a little skew between release_pool_size
>                   * and the actual size of alloc_pool. But it is just a heuristic,
>                   * it does not need to be perfect.
>                   */
> -                pool_size = 0;
> +                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
>                  QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
>                  co = QSLIST_FIRST(&alloc_pool);
>
> @@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>                   */
>                  g_private_set(&dummy_key, &dummy_key);
>              }
> +        } else {
> +            alloc_pool_size--;
>          }
>          if (co) {
>              QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
> @@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>  static void coroutine_delete(Coroutine *co)
>  {
>      if (CONFIG_COROUTINE_POOL) {
> -        if (pool_size < POOL_BATCH_SIZE * 2) {
> +        if (release_pool_size < POOL_BATCH_SIZE * 2) {
>              co->caller = NULL;
>              QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
> -            atomic_inc(&pool_size);
> +            atomic_inc(&release_pool_size);
> +            return;
> +        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
> +            co->caller = NULL;
> +            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
> +            alloc_pool_size++;
>              return;
>          }
>      }
>
> Bug?:
> The release_pool is not cleaned up on termination, I think.

That's not necessary, it is global.

Paolo
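P.S. For readers following the thread: the asymmetry is that the global
release_pool lives as long as the process, so leaking it at exit is
harmless, while each thread-local alloc_pool must be emptied when its
thread exits. Below is a rough sketch of the GPrivate trick the quoted
code relies on for that, reconstructed from the fragments above rather
than the complete file; register_thread_cleanup is an invented name:

    #include <glib.h>

    static void coroutine_pool_cleanup(void *value);

    /* glib invokes a GPrivate's destructor in every thread that stored
     * a value into it, when that thread terminates. */
    static GPrivate dummy_key = G_PRIVATE_INIT(coroutine_pool_cleanup);

    static void coroutine_pool_cleanup(void *value)
    {
        /* Runs at thread exit: free whatever is left in this thread's
         * alloc_pool. */
    }

    static void register_thread_cleanup(void)
    {
        /* The stored value is never read; it only needs to be non-NULL
         * so that the destructor fires at thread exit.  This is why
         * qemu_coroutine_create does
         * g_private_set(&dummy_key, &dummy_key). */
        g_private_set(&dummy_key, &dummy_key);
    }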