From: Peter Lieven
To: Paolo Bonzini, qemu-devel@nongnu.org
Cc: kwolf@redhat.com, ming.lei@canonical.com, stefanha@redhat.com
Subject: Re: [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
Date: Tue, 02 Dec 2014 13:09:02 +0100
Message-ID: <547DABDE.2050009@kamp.de>
In-Reply-To: <1417518350-6167-6-git-send-email-pbonzini@redhat.com>

On 02.12.2014 12:05, Paolo Bonzini wrote:
> This patch removes the mutex by using fancy lock-free manipulation of
> the pool. Lock-free stacks and queues are not hard, but they can suffer
> from the ABA problem so they are better avoided unless you have some
> deferred reclamation scheme like RCU. Otherwise you have to stick
> with adding to a list, and emptying it completely. This is what this
> patch does, by coupling a lock-free global list of available coroutines
> with per-CPU lists that are actually used on coroutine creation.
>
> Whenever the destruction pool is big enough, the next thread that runs
> out of coroutines will steal the whole destruction pool. This is positive
> in two ways:
>
> 1) the allocation does not have to do any atomic operation in the fast
> path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE
> allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg
> loop, that hopefully doesn't cause any starvation, and an atomic_inc.
>
> A later patch will also remove atomic operations from the release path,
> and try to avoid the atomic_xchg altogether---succeeding in doing so if
> all devices either use ioeventfd or are not submitting requests actively.
>
> 2) in theory this should be completely adaptive. The number of coroutines
> around should be a little more than POOL_BATCH_SIZE * number of allocating
> threads; so this also empties qemu_coroutine_adjust_pool_size. (The previous
> pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
> more generous. But if you actually have many high-iodepth disks, it's better
> to put them in different iothreads, which will also use separate thread
> pools and aio=native file descriptors).
>
> This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
> No matter if we end with some kind of coroutine bypass scheme or not,
> it cannot hurt to optimize hot code.
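(Aside for readers following along: the scheme described above is a lock-free
"push one, steal all" pattern. Below is a minimal, untested sketch in plain
C11 atomics standing in for the QSLIST_INSERT_HEAD_ATOMIC / QSLIST_MOVE_ATOMIC
macros the patch actually uses; the names pool_release/pool_get are made up
for illustration:

#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
};

/* Global release pool: threads push freed nodes here one at a time. */
static _Atomic(struct node *) release_pool;

/* Per-thread allocation pool: consumed without any atomics. */
static __thread struct node *alloc_pool;

/* Release path: lock-free push, like QSLIST_INSERT_HEAD_ATOMIC.
 * Pushing is ABA-safe because the old head is never dereferenced. */
static void pool_release(struct node *n)
{
    struct node *old = atomic_load(&release_pool);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&release_pool, &old, n));
}

/* Allocation path: thread-local fast path; on exhaustion, steal the
 * entire global list with a single xchg, like QSLIST_MOVE_ATOMIC. */
static struct node *pool_get(void)
{
    struct node *n = alloc_pool;
    if (!n) {
        alloc_pool = atomic_exchange(&release_pool, NULL);
        n = alloc_pool;
    }
    if (n) {
        alloc_pool = n->next;
    }
    return n; /* NULL: caller must allocate a fresh object */
}

The crucial point is that pool_get never pops a single element off the shared
list, so there is no window in which a recycled node can trigger ABA.)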
>
> Signed-off-by: Paolo Bonzini
> ---
> v1->v2: leave personal opinions out of commit messages :) [Kevin]
>
>  qemu-coroutine.c | 93 +++++++++++++++++++++++++-------------------------
>  1 file changed, 42 insertions(+), 51 deletions(-)
>
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index bd574aa..aee1017 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -15,31 +15,57 @@
>  #include "trace.h"
>  #include "qemu-common.h"
>  #include "qemu/thread.h"
> +#include "qemu/atomic.h"
>  #include "block/coroutine.h"
>  #include "block/coroutine_int.h"
>  
>  enum {
> -    POOL_DEFAULT_SIZE = 64,
> +    POOL_BATCH_SIZE = 64,
>  };
>  
>  /** Free list to speed up creation */
> -static QemuMutex pool_lock;
> -static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
> -static unsigned int pool_size;
> -static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
> +static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
> +static unsigned int release_pool_size;
> +static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
> +static __thread Notifier coroutine_pool_cleanup_notifier;
> +
> +static void coroutine_pool_cleanup(Notifier *n, void *value)
> +{
> +    Coroutine *co;
> +    Coroutine *tmp;
> +
> +    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
> +        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
> +        qemu_coroutine_delete(co);
> +    }
> +}
>  
>  Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>  {
>      Coroutine *co = NULL;
>  
>      if (CONFIG_COROUTINE_POOL) {
> -        qemu_mutex_lock(&pool_lock);
> -        co = QSLIST_FIRST(&pool);
> +        co = QSLIST_FIRST(&alloc_pool);
> +        if (!co) {
> +            if (release_pool_size > POOL_BATCH_SIZE) {
> +                /* Slow path; a good place to register the destructor, too. */
> +                if (!coroutine_pool_cleanup_notifier.notify) {
> +                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
> +                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
> +                }
> +
> +                /* This is not exact; there could be a little skew between
> +                 * release_pool_size and the actual size of release_pool. But
> +                 * it is just a heuristic, it does not need to be perfect.
> +                 */
> +                release_pool_size = 0;
> +                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
> +                co = QSLIST_FIRST(&alloc_pool);
> +            }
> +        }
>          if (co) {
> -            QSLIST_REMOVE_HEAD(&pool, pool_next);
> -            pool_size--;
> +            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
>          }
> -        qemu_mutex_unlock(&pool_lock);
>      }
>  
>      if (!co) {
> @@ -53,39 +80,19 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
>  
>  static void coroutine_delete(Coroutine *co)
>  {
> +    co->caller = NULL;
> +
>      if (CONFIG_COROUTINE_POOL) {
> -        qemu_mutex_lock(&pool_lock);
> -        if (pool_size < pool_max_size) {
> -            QSLIST_INSERT_HEAD(&pool, co, pool_next);
> -            co->caller = NULL;
> -            pool_size++;
> -            qemu_mutex_unlock(&pool_lock);
> +        if (release_pool_size < POOL_BATCH_SIZE * 2) {
> +            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
> +            atomic_inc(&release_pool_size);
>              return;
>          }
> -        qemu_mutex_unlock(&pool_lock);
>      }
>  
>      qemu_coroutine_delete(co);
>  }
>  
> -static void __attribute__((constructor)) coroutine_pool_init(void)
> -{
> -    qemu_mutex_init(&pool_lock);
> -}
> -
> -static void __attribute__((destructor)) coroutine_pool_cleanup(void)
> -{
> -    Coroutine *co;
> -    Coroutine *tmp;
> -
> -    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
> -        QSLIST_REMOVE_HEAD(&pool, pool_next);
> -        qemu_coroutine_delete(co);
> -    }
> -
> -    qemu_mutex_destroy(&pool_lock);
> -}
> -

I still feel we should leave this destructor in to clean up the release_pool.

Peter
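P.S.: Something along these lines is what I have in mind. An untested sketch
reusing the names from this patch (the function name is made up); the
per-thread notifier above only drains each thread's own alloc_pool, and by
the time a destructor runs no other thread is pushing, so release_pool can be
walked without atomics:

static void __attribute__((destructor)) coroutine_global_pool_cleanup(void)
{
    Coroutine *co;
    Coroutine *tmp;

    /* Drain whatever is left in the global release_pool at process exit. */
    QSLIST_FOREACH_SAFE(co, &release_pool, pool_next, tmp) {
        QSLIST_REMOVE_HEAD(&release_pool, pool_next);
        qemu_coroutine_delete(co);
    }
}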