[Qemu-devel] [PATCH v2 0/7] coroutine: optimizations

* [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations
@ 2014-12-02 11:05 Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 1/7] coroutine-ucontext: use __thread Paolo Bonzini
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

As discussed in the other thread, this brings speedups from
dropping the coroutine mutex (which serializes multiple iothreads,
too) and using ELF thread-local storage.

The speedup in perf/cost is about 50% (190ns -> 125ns per iteration).
The Windows port was tested with tests/test-coroutine.exe under Wine.

Paolo

v1->v2: include the noinline attribute [many...]
	do not mention SwitchToFiber [Kevin]
	rename run_main_iothread_exit -> run_main_thread_exit
	leave personal opinions out of commit messages :) [Kevin]
	mention gain from patch 7 [Peter]
	change "alloc_pool_size +=" to "alloc_pool_size =" [Peter]

Paolo Bonzini (7):
  coroutine-ucontext: use __thread
  qemu-thread: add per-thread atexit functions
  test-coroutine: avoid overflow on 32-bit systems
  QSLIST: add lock-free operations
  coroutine: rewrite pool to avoid mutex
  coroutine: drop qemu_coroutine_adjust_pool_size
  coroutine: try harder not to delete coroutines

 block/block-backend.c     |   4 --
 coroutine-ucontext.c      |  64 +++++++---------------------
 include/block/coroutine.h |  10 -----
 include/qemu/queue.h      |  15 ++++++-
 include/qemu/thread.h     |   4 ++
 qemu-coroutine.c          | 104 ++++++++++++++++++++++------------------------
 tests/test-coroutine.c    |   2 +-
 util/qemu-thread-posix.c  |  37 +++++++++++++++++
 util/qemu-thread-win32.c  |  48 ++++++++++++++++-----
 9 files changed, 157 insertions(+), 131 deletions(-)

-- 
2.1.0

* [Qemu-devel] [PATCH v2 1/7] coroutine-ucontext: use __thread
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 2/7] qemu-thread: add per-thread atexit functions Paolo Bonzini
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

ELF thread local storage is about 10% faster on tests/test-coroutine's
perf/cost test.  The timing on my machine is 190ns per iteration with
pthread TLS, 170 with ELF TLS.

Based on a patch by Kevin Wolf and Peter Lieven, but redone to follow
the model of coroutine-win32.c (including the important "noinline"
attribute!).

Platforms without thread-local storage (OpenBSD probably?) will need
a new-enough GCC for this to compile, in order to use the same emutls
support that Windows already relies on.
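
For readers less familiar with the two TLS flavours, here is a minimal
sketch of the difference.  It is not the QEMU code: the Coroutine struct
and all helper names are made up for illustration, and only the pthread
calls and the GCC/Clang __thread keyword are real.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread "current coroutine" pointer. */
typedef struct Coroutine {
    struct Coroutine *caller;
} Coroutine;

/* --- pthread TLS: a key created once, every access is a library call --- */
static pthread_key_t current_key;
static pthread_once_t current_key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    int rc = pthread_key_create(&current_key, NULL);
    assert(rc == 0);
    (void)rc;
}

static Coroutine *current_via_pthread(void)
{
    pthread_once(&current_key_once, make_key);
    return pthread_getspecific(current_key);
}

static void set_current_via_pthread(Coroutine *co)
{
    pthread_once(&current_key_once, make_key);
    pthread_setspecific(current_key, co);
}

/* --- ELF TLS: an ordinary variable, one copy per thread --- */
static __thread Coroutine *current;

static Coroutine *current_via_elf_tls(void)
{
    return current;
}

static void set_current_via_elf_tls(Coroutine *co)
{
    current = co;
}
```

With pthread TLS every read goes through pthread_getspecific(), while
with __thread the compiler can usually address the variable directly;
that is presumably where the measured difference comes from.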

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
v1->v2: include the noinline attribute [many...]
	do not mention SwitchToFiber [Kevin]

 coroutine-ucontext.c | 64 +++++++++++++---------------------------------------
 1 file changed, 16 insertions(+), 48 deletions(-)

diff --git a/coroutine-ucontext.c b/coroutine-ucontext.c
index 4bf2cde..d86e3e1 100644
--- a/coroutine-ucontext.c
+++ b/coroutine-ucontext.c
@@ -25,7 +25,6 @@
 #include <stdlib.h>
 #include <setjmp.h>
 #include <stdint.h>
-#include <pthread.h>
 #include <ucontext.h>
 #include "qemu-common.h"
 #include "block/coroutine_int.h"
@@ -48,15 +47,8 @@ typedef struct {
 /**
  * Per-thread coroutine bookkeeping
  */
-typedef struct {
-    /** Currently executing coroutine */
-    Coroutine *current;
-
-    /** The default coroutine */
-    CoroutineUContext leader;
-} CoroutineThreadState;
-
-static pthread_key_t thread_state_key;
+static __thread CoroutineUContext leader;
+static __thread Coroutine *current;
 
 /*
  * va_args to makecontext() must be type 'int', so passing
@@ -68,36 +60,6 @@ union cc_arg {
     int i[2];
 };
 
-static CoroutineThreadState *coroutine_get_thread_state(void)
-{
-    CoroutineThreadState *s = pthread_getspecific(thread_state_key);
-
-    if (!s) {
-        s = g_malloc0(sizeof(*s));
-        s->current = &s->leader.base;
-        pthread_setspecific(thread_state_key, s);
-    }
-    return s;
-}
-
-static void qemu_coroutine_thread_cleanup(void *opaque)
-{
-    CoroutineThreadState *s = opaque;
-
-    g_free(s);
-}
-
-static void __attribute__((constructor)) coroutine_init(void)
-{
-    int ret;
-
-    ret = pthread_key_create(&thread_state_key, qemu_coroutine_thread_cleanup);
-    if (ret != 0) {
-        fprintf(stderr, "unable to create leader key: %s\n", strerror(errno));
-        abort();
-    }
-}
-
 static void coroutine_trampoline(int i0, int i1)
 {
     union cc_arg arg;
@@ -193,15 +155,23 @@ void qemu_coroutine_delete(Coroutine *co_)
     g_free(co);
 }
 
-CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
-                                      CoroutineAction action)
+/* This function is marked noinline to prevent GCC from inlining it
+ * into coroutine_trampoline(). If we allow it to do that then it
+ * hoists the code to get the address of the TLS variable "current"
+ * out of the while() loop. This is an invalid transformation because
+ * the sigsetjmp() call may be called when running thread A but
+ * return in thread B, and so we might be in a different thread
+ * context each time round the loop.
+ */
+CoroutineAction __attribute__((noinline))
+qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
+                      CoroutineAction action)
 {
     CoroutineUContext *from = DO_UPCAST(CoroutineUContext, base, from_);
     CoroutineUContext *to = DO_UPCAST(CoroutineUContext, base, to_);
-    CoroutineThreadState *s = coroutine_get_thread_state();
     int ret;
 
-    s->current = to_;
+    current = to_;
 
     ret = sigsetjmp(from->env, 0);
     if (ret == 0) {
@@ -212,14 +181,13 @@ CoroutineAction qemu_coroutine_switch(Coroutine *from_, Coroutine *to_,
 
 Coroutine *qemu_coroutine_self(void)
 {
-    CoroutineThreadState *s = coroutine_get_thread_state();
-
-    return s->current;
+    if (!current) {
+        current = &leader.base;
+    }
+    return current;
 }
 
 bool qemu_in_coroutine(void)
 {
-    CoroutineThreadState *s = pthread_getspecific(thread_state_key);
-
-    return s && s->current->caller;
+    return current && current->caller;
 }
-- 
2.1.0

* [Qemu-devel] [PATCH v2 2/7] qemu-thread: add per-thread atexit functions
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 1/7] coroutine-ucontext: use __thread Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 3/7] test-coroutine: avoid overflow on 32-bit systems Paolo Bonzini
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

Destructors are the main additional feature of pthread TLS compared
to __thread.  If we were using C++ (hint, hint!) we could have used
thread-local objects with a destructor.  Since we are not, we add a
simple Notifier-based API instead.

Note that the notifier must be per-thread as well.  We can add a
global list as well later, perhaps.

The Win32 implementation has some complications because a) detached
threads previously did not have a QemuThreadData; b) the main thread
does not go through win32_start_routine, so we have to use atexit too.
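
As a rough illustration of the POSIX side of the idea, the sketch below
drives per-thread exit callbacks from a pthread key destructor.  It uses
a hypothetical ExitNotifier list instead of QEMU's Notifier and
NotifierList, and thread_atexit_add(), run_demo() and the other names
are assumptions made for this example only.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical minimal notifier: a callback plus a link to the next one. */
typedef struct ExitNotifier {
    void (*notify)(struct ExitNotifier *n);
    struct ExitNotifier *next;
} ExitNotifier;

static pthread_key_t exit_key;
static pthread_once_t exit_key_once = PTHREAD_ONCE_INIT;

/* Key destructor: runs in the exiting thread with that thread's list head. */
static void run_exit_notifiers(void *head)
{
    ExitNotifier *n = head;

    while (n) {
        ExitNotifier *next = n->next; /* read the link before calling out */
        n->notify(n);
        n = next;
    }
}

static void exit_key_init(void)
{
    int rc = pthread_key_create(&exit_key, run_exit_notifiers);
    assert(rc == 0);
    (void)rc;
}

/* Push a notifier onto the calling thread's private list. */
static void thread_atexit_add(ExitNotifier *n)
{
    pthread_once(&exit_key_once, exit_key_init);
    n->next = pthread_getspecific(exit_key);
    pthread_setspecific(exit_key, n);
}

/* --- tiny demo: one worker registers one notifier, then exits --- */
static int cleanups_run;
static ExitNotifier demo_notifier;

static void count_cleanup(ExitNotifier *n)
{
    (void)n;
    cleanups_run++;
}

static void *worker(void *arg)
{
    (void)arg;
    demo_notifier.notify = count_cleanup;
    thread_atexit_add(&demo_notifier);
    return NULL;
}

/* Start the worker, wait for it, and report how many notifiers ran. */
static int run_demo(void)
{
    pthread_t t;

    if (pthread_create(&t, NULL, worker, NULL) != 0) {
        return -1;
    }
    pthread_join(t, NULL);
    return cleanups_run;
}
```

The destructor only fires for threads whose key value is non-NULL, so
threads that never register a notifier pay nothing at exit.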

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
v1->v2: rename run_main_iothread_exit -> run_main_thread_exit

 include/qemu/thread.h    |  4 ++++
 util/qemu-thread-posix.c | 37 +++++++++++++++++++++++++++++++++++++
 util/qemu-thread-win32.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 3 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f7e3b9b..e89fdc9 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -61,4 +61,8 @@ bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 void qemu_thread_naming(bool enable);
 
+struct Notifier;
+void qemu_thread_atexit_add(struct Notifier *notifier);
+void qemu_thread_atexit_remove(struct Notifier *notifier);
+
 #endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index d05a649..41cb23d 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -26,6 +26,7 @@
 #endif
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
+#include "qemu/notify.h"
 
 static bool name_threads;
 
@@ -401,6 +402,42 @@ void qemu_event_wait(QemuEvent *ev)
     }
 }
 
+static pthread_key_t exit_key;
+
+union NotifierThreadData {
+    void *ptr;
+    NotifierList list;
+};
+QEMU_BUILD_BUG_ON(sizeof(union NotifierThreadData) != sizeof(void *));
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_list_add(&ntd.list, notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_remove(notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+static void qemu_thread_atexit_run(void *arg)
+{
+    union NotifierThreadData ntd = { .ptr = arg };
+    notifier_list_notify(&ntd.list, NULL);
+}
+
+static void __attribute__((constructor)) qemu_thread_atexit_init(void)
+{
+    pthread_key_create(&exit_key, qemu_thread_atexit_run);
+}
+
+
 /* Attempt to set the threads name; note that this is for debug, so
  * we're not going to fail if we can't set it.
  */
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index c405c9b..7bda85b 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -12,6 +12,7 @@
  */
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/notify.h"
 #include <process.h>
 #include <assert.h>
 #include <limits.h>
@@ -268,6 +269,7 @@ struct QemuThreadData {
     void             *(*start_routine)(void *);
     void             *arg;
     short             mode;
+    NotifierList      exit;
 
     /* Only used for joinable threads. */
     bool              exited;
@@ -275,18 +277,40 @@ struct QemuThreadData {
     CRITICAL_SECTION  cs;
 };
 
+static bool atexit_registered;
+static NotifierList main_thread_exit;
+
 static __thread QemuThreadData *qemu_thread_data;
 
+static void run_main_thread_exit(void)
+{
+    notifier_list_notify(&main_thread_exit, NULL);
+}
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    if (!qemu_thread_data) {
+        if (!atexit_registered) {
+            atexit_registered = true;
+            atexit(run_main_thread_exit);
+        }
+        notifier_list_add(&main_thread_exit, notifier);
+    } else {
+        notifier_list_add(&qemu_thread_data->exit, notifier);
+    }
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    notifier_remove(notifier);
+}
+
 static unsigned __stdcall win32_start_routine(void *arg)
 {
     QemuThreadData *data = (QemuThreadData *) arg;
     void *(*start_routine)(void *) = data->start_routine;
     void *thread_arg = data->arg;
 
-    if (data->mode == QEMU_THREAD_DETACHED) {
-        g_free(data);
-        data = NULL;
-    }
     qemu_thread_data = data;
     qemu_thread_exit(start_routine(thread_arg));
     abort();
@@ -296,12 +320,14 @@ void qemu_thread_exit(void *arg)
 {
     QemuThreadData *data = qemu_thread_data;
 
-    if (data) {
-        assert(data->mode != QEMU_THREAD_DETACHED);
+    notifier_list_notify(&data->exit, NULL);
+    if (data->mode == QEMU_THREAD_JOINABLE) {
         data->ret = arg;
         EnterCriticalSection(&data->cs);
         data->exited = true;
         LeaveCriticalSection(&data->cs);
+    } else {
+        g_free(data);
     }
     _endthreadex(0);
 }
@@ -313,9 +339,10 @@ void *qemu_thread_join(QemuThread *thread)
     HANDLE handle;
 
     data = thread->data;
-    if (!data) {
+    if (data->mode == QEMU_THREAD_DETACHED) {
         return NULL;
     }
+
     /*
      * Because multiple copies of the QemuThread can exist via
      * qemu_thread_get_self, we need to store a value that cannot
@@ -329,7 +356,6 @@ void *qemu_thread_join(QemuThread *thread)
         CloseHandle(handle);
     }
     ret = data->ret;
-    assert(data->mode != QEMU_THREAD_DETACHED);
     DeleteCriticalSection(&data->cs);
     g_free(data);
     return ret;
@@ -347,6 +373,7 @@ void qemu_thread_create(QemuThread *thread, const char *name,
     data->arg = arg;
     data->mode = mode;
     data->exited = false;
+    notifier_list_init(&data->exit);
 
     if (data->mode != QEMU_THREAD_DETACHED) {
         InitializeCriticalSection(&data->cs);
@@ -358,7 +385,7 @@ void qemu_thread_create(QemuThread *thread, const char *name,
         error_exit(GetLastError(), __func__);
     }
     CloseHandle(hThread);
-    thread->data = (mode == QEMU_THREAD_DETACHED) ? NULL : data;
+    thread->data = data;
 }
 
 void qemu_thread_get_self(QemuThread *thread)
@@ -373,11 +400,10 @@ HANDLE qemu_thread_get_handle(QemuThread *thread)
     HANDLE handle;
 
     data = thread->data;
-    if (!data) {
+    if (data->mode == QEMU_THREAD_DETACHED) {
         return NULL;
     }
 
-    assert(data->mode != QEMU_THREAD_DETACHED);
     EnterCriticalSection(&data->cs);
     if (!data->exited) {
         handle = OpenThread(SYNCHRONIZE | THREAD_SUSPEND_RESUME, FALSE,
-- 
2.1.0

* [Qemu-devel] [PATCH v2 3/7] test-coroutine: avoid overflow on 32-bit systems
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 1/7] coroutine-ucontext: use __thread Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 2/7] qemu-thread: add per-thread atexit functions Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 4/7] QSLIST: add lock-free operations Paolo Bonzini
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

On 32-bit systems, unsigned long is not large enough to represent
1000000000 * duration.  Do the whole computation in floating point and
convert only the final result.
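
A small sketch of the arithmetic, with made-up numbers: it assumes a
32-bit unsigned long (modelled here with UINT32_MAX) and uses the
hypothetical helpers ns_per_iteration() and intermediate_fits_32bit(),
neither of which exists in the test itself.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds per iteration: divide in floating point, then narrow. */
static unsigned long ns_per_iteration(double duration_s,
                                      unsigned long iterations)
{
    assert(iterations > 0);
    return (unsigned long)(1000000000.0 * duration_s / iterations);
}

/* Would the intermediate 1e9 * duration fit in a 32-bit unsigned long? */
static int intermediate_fits_32bit(double duration_s)
{
    return 1000000000.0 * duration_s <= (double)UINT32_MAX;
}
```

For example, a 5 second run gives an intermediate of 5e9, which is above
the 32-bit limit of 4294967295, while the final quotient for 40 million
iterations is only 125.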

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tests/test-coroutine.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index e22fae1..27d1b6f 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -337,7 +337,7 @@ static void perf_cost(void)
                    "%luns per coroutine",
                    maxcycles,
                    duration, ops,
-                   (unsigned long)(1000000000 * duration) / maxcycles);
+                   (unsigned long)(1000000000.0 * duration / maxcycles));
 }
 
 int main(int argc, char **argv)
-- 
2.1.0

* [Qemu-devel] [PATCH v2 4/7] QSLIST: add lock-free operations
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (2 preceding siblings ...)
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 3/7] test-coroutine: avoid overflow on 32-bit systems Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

These operations are trivial to implement and do not have ABA problems.
They are enough to implement simple multiple-producer, single-consumer
lock-free lists or, as in the next patch, lists from which multiple
consumers can each steal a whole batch of elements and process it at
their leisure.
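
To show why these two operations are safe on their own, here is a sketch
of the same idea written with C11 <stdatomic.h> rather than QEMU's
atomic_cmpxchg/atomic_xchg macros.  Node, AtomicList and the function
names are hypothetical.  No single element is ever popped from the
shared list; consumers only take the whole list, so a stale head pointer
can never be swapped back in behind a consumer's back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    int value;
} Node;

typedef struct {
    _Atomic(Node *) first;
} AtomicList;

/* Push one node; retry if another thread changed the head meanwhile. */
static void list_insert_head_atomic(AtomicList *list, Node *n)
{
    Node *old = atomic_load(&list->first);

    assert(n != NULL);
    do {
        n->next = old; /* 'old' is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak(&list->first, &old, n));
}

/* Detach the entire list in one step; the caller now owns every node. */
static Node *list_take_all(AtomicList *list)
{
    return atomic_exchange(&list->first, (Node *)NULL);
}

/* Walk a detached (private) list; no atomics needed here. */
static int list_sum(const Node *n)
{
    int sum = 0;

    for (; n; n = n->next) {
        sum += n->value;
    }
    return sum;
}
```

Once list_take_all() returns, the batch is private to the caller and can
be traversed with plain loads, as list_sum() does.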

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/queue.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,8 +191,19 @@ struct {                                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_INSERT_HEAD(head, elm, field) do {                        \
-        (elm)->field.sle_next = (head)->slh_first;                      \
-        (head)->slh_first = (elm);                                      \
+        (elm)->field.sle_next = (head)->slh_first;                       \
+        (head)->slh_first = (elm);                                       \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                   \
+        do {                                                               \
+            (elm)->field.sle_next = (head)->slh_first;                     \
+        } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next, \
+                               (elm)) != (elm)->field.sle_next);           \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {                               \
+        (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
-- 
2.1.0

* [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (3 preceding siblings ...)
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 4/7] QSLIST: add lock-free operations Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-02 12:09   ` Peter Lieven
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 6/7] coroutine: drop qemu_coroutine_adjust_pool_size Paolo Bonzini
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-thread lists that are actually used on coroutine creation.

Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:

1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.

A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.

2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).

This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end up with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
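
The scheme described above can be sketched in isolation as follows.
This is a simplified model under stated assumptions, not the patch
itself: Item stands in for Coroutine, BATCH_SIZE is shrunk to 4 so the
behaviour is easy to follow, C11 atomics replace QEMU's atomic macros,
and item_get()/item_put() are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

enum { BATCH_SIZE = 4 }; /* hypothetical; the patch uses 64 */

typedef struct Item {
    struct Item *next;
} Item;

static _Atomic(Item *) release_pool;   /* shared, lock-free */
static atomic_uint release_pool_size;  /* approximate, only a heuristic */
static __thread Item *alloc_pool;      /* private to each thread */

static Item *item_get(void)
{
    Item *it = alloc_pool;

    if (!it && atomic_load(&release_pool_size) > BATCH_SIZE) {
        /* Slow path: steal the whole shared list with one exchange. */
        atomic_store(&release_pool_size, 0);
        alloc_pool = atomic_exchange(&release_pool, (Item *)NULL);
        it = alloc_pool;
    }
    if (it) {
        alloc_pool = it->next; /* fast path: thread-local, no atomics */
        return it;
    }
    return malloc(sizeof(*it)); /* both pools empty: allocate afresh */
}

static void item_put(Item *it)
{
    assert(it != NULL);
    if (atomic_load(&release_pool_size) < BATCH_SIZE * 2) {
        Item *old = atomic_load(&release_pool);

        do {
            it->next = old;
        } while (!atomic_compare_exchange_weak(&release_pool, &old, it));
        atomic_fetch_add(&release_pool_size, 1);
        return;
    }
    free(it); /* shared pool is full enough: really release it */
}
```

As in the patch, the counter and the list are updated separately, so the
size can briefly disagree with the real list length; that only shifts
when a steal or a free happens, it does not affect correctness.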

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
v1->v2: leave personal opinions out of commit messages :) [Kevin]

 qemu-coroutine.c | 93 +++++++++++++++++++++++++-------------------------------
 1 file changed, 42 insertions(+), 51 deletions(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..aee1017 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,31 +15,57 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-    POOL_DEFAULT_SIZE = 64,
+    POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
+static unsigned int release_pool_size;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread Notifier coroutine_pool_cleanup_notifier;
+
+static void coroutine_pool_cleanup(Notifier *n, void *value)
+{
+    Coroutine *co;
+    Coroutine *tmp;
+
+    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+        qemu_coroutine_delete(co);
+    }
+}
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
+        co = QSLIST_FIRST(&alloc_pool);
+        if (!co) {
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* Slow path; a good place to register the destructor, too.  */
+                if (!coroutine_pool_cleanup_notifier.notify) {
+                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+                }
+
+                /* This is not exact; there could be a little skew between
+                 * release_pool_size and the actual size of release_pool.  But
+                 * it is just a heuristic, it does not need to be perfect.
+                 */
+                release_pool_size = 0;
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                co = QSLIST_FIRST(&alloc_pool);
+            }
+        }
         if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
@@ -53,39 +80,19 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 
 static void coroutine_delete(Coroutine *co)
 {
+    co->caller = NULL;
+
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < pool_max_size) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
-            co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
+            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
+            atomic_inc(&release_pool_size);
             return;
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     qemu_coroutine_delete(co);
 }
 
-static void __attribute__((constructor)) coroutine_pool_init(void)
-{
-    qemu_mutex_init(&pool_lock);
-}
-
-static void __attribute__((destructor)) coroutine_pool_cleanup(void)
-{
-    Coroutine *co;
-    Coroutine *tmp;
-
-    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        qemu_coroutine_delete(co);
-    }
-
-    qemu_mutex_destroy(&pool_lock);
-}
-
 static void coroutine_swap(Coroutine *from, Coroutine *to)
 {
     CoroutineAction ret;
@@ -140,20 +147,4 @@ void coroutine_fn qemu_coroutine_yield(void)
 
 void qemu_coroutine_adjust_pool_size(int n)
 {
-    qemu_mutex_lock(&pool_lock);
-
-    pool_max_size += n;
-
-    /* Callers should never take away more than they added */
-    assert(pool_max_size >= POOL_DEFAULT_SIZE);
-
-    /* Trim oversized pool down to new max */
-    while (pool_size > pool_max_size) {
-        Coroutine *co = QSLIST_FIRST(&pool);
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        pool_size--;
-        qemu_coroutine_delete(co);
-    }
-
-    qemu_mutex_unlock(&pool_lock);
 }
-- 
2.1.0

* Re: [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
@ 2014-12-02 12:09   ` Peter Lieven
  2014-12-02 12:13     ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Lieven @ 2014-12-02 12:09 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, ming.lei, stefanha

On 02.12.2014 12:05, Paolo Bonzini wrote:
> -static void __attribute__((constructor)) coroutine_pool_init(void)
> -{
> -    qemu_mutex_init(&pool_lock);
> -}
> -
> -static void __attribute__((destructor)) coroutine_pool_cleanup(void)
> -{
> -    Coroutine *co;
> -    Coroutine *tmp;
> -
> -    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
> -        QSLIST_REMOVE_HEAD(&pool, pool_next);
> -        qemu_coroutine_delete(co);
> -    }
> -
> -    qemu_mutex_destroy(&pool_lock);
> -}
> -

I still feel we should leave this destructor in to clean up the release_pool.

Peter

* Re: [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
  2014-12-02 12:09   ` Peter Lieven
@ 2014-12-02 12:13     ` Paolo Bonzini
  2014-12-02 12:18       ` Peter Lieven
  2014-12-02 13:04       ` Kevin Wolf
  0 siblings, 2 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 12:13 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: kwolf, ming.lei, stefanha



On 02/12/2014 13:09, Peter Lieven wrote:
>>
>> -static void __attribute__((destructor)) coroutine_pool_cleanup(void)
>> -{
>> -    Coroutine *co;
>> -    Coroutine *tmp;
>> -
>> -    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
>> -        QSLIST_REMOVE_HEAD(&pool, pool_next);
>> -        qemu_coroutine_delete(co);
>> -    }
>> -
>> -    qemu_mutex_destroy(&pool_lock);
>> -}
>> -
> 
> I still feel we should leave this destructor in to clean up the
> release_pool.

Why?  If you run QEMU under valgrind, there are thousands of blocks that
we do not free.  Stefan/Kevin, what do you think?

Paolo

* Re: [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
  2014-12-02 12:13     ` Paolo Bonzini
@ 2014-12-02 12:18       ` Peter Lieven
  2014-12-02 12:32         ` Paolo Bonzini
  2014-12-02 13:04       ` Kevin Wolf
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Lieven @ 2014-12-02 12:18 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, ming.lei, stefanha

On 02.12.2014 13:13, Paolo Bonzini wrote:
>
> On 02/12/2014 13:09, Peter Lieven wrote:
>>> -static void __attribute__((destructor)) coroutine_pool_cleanup(void)
>>> -{
>>> -    Coroutine *co;
>>> -    Coroutine *tmp;
>>> -
>>> -    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
>>> -        QSLIST_REMOVE_HEAD(&pool, pool_next);
>>> -        qemu_coroutine_delete(co);
>>> -    }
>>> -
>>> -    qemu_mutex_destroy(&pool_lock);
>>> -}
>>> -
>> I still feel we should leave this destructor in to clean up the
>> release_pool.
> Why?  If you run QEMU under valgrind, there are thousands of blocks that
> we do not free.  Stefan/Kevin, what do you think?

Before this patch, we cleaned up this part at least.
I have learned that it is bad style not to clean up your resources.
Just because other parts of the code do not do it, we should not
introduce new parts that don't do it either.

Peter

* Re: [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
  2014-12-02 12:18       ` Peter Lieven
@ 2014-12-02 12:32         ` Paolo Bonzini
  0 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 12:32 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: kwolf, ming.lei, stefanha



On 02/12/2014 13:18, Peter Lieven wrote:
> On 02.12.2014 13:13, Paolo Bonzini wrote:
>>
>> On 02/12/2014 13:09, Peter Lieven wrote:
>>>> -static void __attribute__((destructor)) coroutine_pool_cleanup(void)
>>>> -{
>>>> -    Coroutine *co;
>>>> -    Coroutine *tmp;
>>>> -
>>>> -    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
>>>> -        QSLIST_REMOVE_HEAD(&pool, pool_next);
>>>> -        qemu_coroutine_delete(co);
>>>> -    }
>>>> -
>>>> -    qemu_mutex_destroy(&pool_lock);
>>>> -}
>>>> -
>>> I still feel we should leave this destructor in to clean up the
>>> release_pool.
>> Why?  If you run QEMU under valgrind, there are thousands of blocks that
>> we do not free.  Stefan/Kevin, what do you think?
> 
> Before this patch, we cleaned up this part at least.
> I have learned that it is bad style not to clean up your resources.
> Just because other parts of the code do not do it, we should not
> introduce new parts that don't do it either.

Which other parts do we clean up?  For example, file descriptors are not
cleaned up, much less most memory; the kernel is there to do it for us.
I think it's up to the maintainers to decide.

Paolo

* Re: [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex
  2014-12-02 12:13     ` Paolo Bonzini
  2014-12-02 12:18       ` Peter Lieven
@ 2014-12-02 13:04       ` Kevin Wolf
  1 sibling, 0 replies; 17+ messages in thread
From: Kevin Wolf @ 2014-12-02 13:04 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: ming.lei, Peter Lieven, qemu-devel, stefanha

Am 02.12.2014 um 13:13 hat Paolo Bonzini geschrieben:
> 
> 
> On 02/12/2014 13:09, Peter Lieven wrote:
> >>
> >> -static void __attribute__((destructor)) coroutine_pool_cleanup(void)
> >> -{
> >> -    Coroutine *co;
> >> -    Coroutine *tmp;
> >> -
> >> -    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
> >> -        QSLIST_REMOVE_HEAD(&pool, pool_next);
> >> -        qemu_coroutine_delete(co);
> >> -    }
> >> -
> >> -    qemu_mutex_destroy(&pool_lock);
> >> -}
> >> -
> > 
> > I still feel we should leave this destructor in to clean up the
> > release_pool.
> 
> Why?  If you run QEMU under valgrind, there are thousands of blocks that
> we do not free.  Stefan/Kevin, what do you think?

The destructor doesn't seem to be doing anything but freeing memory,
which the OS can indeed do for us. I don't mind either way.

Kevin
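
For reference, the kind of exit-time cleanup discussed above can be sketched
as follows. This is only an illustration under stated assumptions, not the
code from the patch: Item, pool_head, pool_push, pool_drain and pool_cleanup
are hypothetical names, and the list is a plain single-threaded one. The
destructor attribute is the same GCC/Clang extension used by the removed
coroutine_pool_cleanup().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cached object; stands in for a pooled coroutine. */
typedef struct Item {
    struct Item *next;
} Item;

static Item *pool_head;      /* hypothetical global pool, not QEMU's */

/* Put an item on the head of the pool. */
static void pool_push(Item *it)
{
    assert(it != NULL);
    it->next = pool_head;
    pool_head = it;
}

/* Free every cached item and return how many were released. */
static unsigned int pool_drain(void)
{
    unsigned int freed = 0;

    while (pool_head) {
        Item *it = pool_head;
        pool_head = it->next;
        free(it);
        freed++;
    }
    return freed;
}

/* GCC/Clang extension: called when the process exits normally. */
static void __attribute__((destructor)) pool_cleanup(void)
{
    (void)pool_drain();
}
```

Such a destructor only releases memory that the OS reclaims at exit anyway,
which is the point made in the replies; its practical benefit is mostly
quieter leak reports. It also covers only a global list. A thread-local pool
would need a per-thread exit hook instead.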

* [Qemu-devel] [PATCH v2 6/7] coroutine: drop qemu_coroutine_adjust_pool_size
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (4 preceding siblings ...)
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 5/7] coroutine: rewrite pool to avoid mutex Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 7/7] coroutine: try harder not to delete coroutines Paolo Bonzini
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

This is not needed anymore.  The new TLS-based algorithm is adaptive.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/block-backend.c     |  4 ----
 include/block/coroutine.h | 10 ----------
 qemu-coroutine.c          |  4 ----
 3 files changed, 18 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d0692b1..abf0cd1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -260,9 +260,6 @@ int blk_attach_dev(BlockBackend *blk, void *dev)
     blk_ref(blk);
     blk->dev = dev;
     bdrv_iostatus_reset(blk->bs);
-
-    /* We're expecting I/O from the device so bump up coroutine pool size */
-    qemu_coroutine_adjust_pool_size(COROUTINE_POOL_RESERVATION);
     return 0;
 }
 
@@ -290,7 +287,6 @@ void blk_detach_dev(BlockBackend *blk, void *dev)
     blk->dev_ops = NULL;
     blk->dev_opaque = NULL;
     bdrv_set_guest_block_size(blk->bs, 512);
-    qemu_coroutine_adjust_pool_size(-COROUTINE_POOL_RESERVATION);
     blk_unref(blk);
 }
 
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 793df0e..20c027a 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -216,14 +216,4 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  */
 void coroutine_fn yield_until_fd_readable(int fd);
 
-/**
- * Add or subtract from the coroutine pool size
- *
- * The coroutine implementation keeps a pool of coroutines to be reused by
- * qemu_coroutine_create().  This makes coroutine creation cheap.  Heavy
- * coroutine users should call this to reserve pool space.  Call it again with
- * a negative number to release pool space.
- */
-void qemu_coroutine_adjust_pool_size(int n);
-
 #endif /* QEMU_COROUTINE_H */
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index aee1017..ca40f4f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -144,7 +144,3 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
-
-void qemu_coroutine_adjust_pool_size(int n)
-{
-}
-- 
2.1.0

* [Qemu-devel] [PATCH v2 7/7] coroutine: try harder not to delete coroutines
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (5 preceding siblings ...)
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 6/7] coroutine: drop qemu_coroutine_adjust_pool_size Paolo Bonzini
@ 2014-12-02 11:05 ` Paolo Bonzini
  2014-12-11 13:55 ` [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Peter Lieven
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-02 11:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, ming.lei, pl, stefanha

From: Peter Lieven <pl@kamp.de>

Placing coroutines on the global pool should be preferable, because it
can help all threads.  But if the global pool is full, we can still
try to save some allocations by stashing completed coroutines on the
local pool.  This is quite cheap too, because it does not require
atomic operations, and provides a gain of 15% in the best case.

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
v1->v2: mention gain from this patch [Peter]
	change "alloc_pool_size +=" to "alloc_pool_size =" [Peter]

 qemu-coroutine.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index da1b961..977f114 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -27,6 +27,7 @@ enum {
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
 static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
 
 static void coroutine_pool_cleanup(Notifier *n, void *value)
@@ -58,13 +59,14 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  * release_pool_size and the actual size of release_pool.  But
                  * it is just a heuristic, it does not need to be perfect.
                  */
-                release_pool_size = 0;
+                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
             }
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+            alloc_pool_size--;
         }
     }
 
@@ -87,6 +89,11 @@ static void coroutine_delete(Coroutine *co)
             atomic_inc(&release_pool_size);
             return;
         }
+        if (alloc_pool_size < POOL_BATCH_SIZE) {
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
+            return;
+        }
     }
 
     qemu_coroutine_delete(co);
-- 
2.1.0

* Re: [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (6 preceding siblings ...)
  2014-12-02 11:05 ` [Qemu-devel] [PATCH v2 7/7] coroutine: try harder not to delete coroutines Paolo Bonzini
@ 2014-12-11 13:55 ` Peter Lieven
  2014-12-15 21:35   ` Paolo Bonzini
  2014-12-18 10:06 ` Fam Zheng
  2015-01-06 15:39 ` Stefan Hajnoczi
  9 siblings, 1 reply; 17+ messages in thread
From: Peter Lieven @ 2014-12-11 13:55 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, ming.lei, stefanha

On 02.12.2014 12:05, Paolo Bonzini wrote:
> As discussed in the other thread, this brings speedups from
> dropping the coroutine mutex (which serializes multiple iothreads,
> too) and using ELF thread-local storage.
>
> The speedup in perf/cost is about 50% (190->125).  Windows port tested
> with tests/test-coroutine.exe under Wine.
>
> Paolo
>
> v1->v2: include the noinline attribute [many...]
> 	do not mention SwitchToFiber [Kevin]
> 	rename run_main_iothread_exit -> run_main_thread_exit
> 	leave personal opinions out of commit messages :) [Kevin]
> 	mention gain from patch 7 [Peter]
> 	change "alloc_pool_size +=" to "alloc_pool_size =" [Peter]
>
> Paolo Bonzini (7):
>    coroutine-ucontext: use __thread
>    qemu-thread: add per-thread atexit functions
>    test-coroutine: avoid overflow on 32-bit systems
>    QSLIST: add lock-free operations
>    coroutine: rewrite pool to avoid mutex
>    coroutine: drop qemu_coroutine_adjust_pool_size
>    coroutine: try harder not to delete coroutines
>
>   block/block-backend.c     |   4 --
>   coroutine-ucontext.c      |  64 +++++++---------------------
>   include/block/coroutine.h |  10 -----
>   include/qemu/queue.h      |  15 ++++++-
>   include/qemu/thread.h     |   4 ++
>   qemu-coroutine.c          | 104 ++++++++++++++++++++++------------------------
>   tests/test-coroutine.c    |   2 +-
>   util/qemu-thread-posix.c  |  37 +++++++++++++++++
>   util/qemu-thread-win32.c  |  48 ++++++++++++++++-----
>   9 files changed, 157 insertions(+), 131 deletions(-)
>

What's the status of this series?

Peter

* Re: [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations
  2014-12-11 13:55 ` [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Peter Lieven
@ 2014-12-15 21:35   ` Paolo Bonzini
  0 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2014-12-15 21:35 UTC (permalink / raw)
  To: Peter Lieven, qemu-devel; +Cc: kwolf, ming.lei, stefanha



On 11/12/2014 14:55, Peter Lieven wrote:
> On 02.12.2014 12:05, Paolo Bonzini wrote:
>> As discussed in the other thread, this brings speedups from
>> dropping the coroutine mutex (which serializes multiple iothreads,
>> too) and using ELF thread-local storage.
>>
>> The speedup in perf/cost is about 50% (190->125).  Windows port tested
>> with tests/test-coroutine.exe under Wine.
>>
>> Paolo
>>
>> v1->v2: include the noinline attribute [many...]
>>     do not mention SwitchToFiber [Kevin]
>>     rename run_main_iothread_exit -> run_main_thread_exit
>>     leave personal opinions out of commit messages :) [Kevin]
>>     mention gain from patch 7 [Peter]
>>     change "alloc_pool_size +=" to "alloc_pool_size =" [Peter]
>>
>> Paolo Bonzini (7):
>>    coroutine-ucontext: use __thread
>>    qemu-thread: add per-thread atexit functions
>>    test-coroutine: avoid overflow on 32-bit systems
>>    QSLIST: add lock-free operations
>>    coroutine: rewrite pool to avoid mutex
>>    coroutine: drop qemu_coroutine_adjust_pool_size
>>    coroutine: try harder not to delete coroutines
>>
>>   block/block-backend.c     |   4 --
>>   coroutine-ucontext.c      |  64 +++++++---------------------
>>   include/block/coroutine.h |  10 -----
>>   include/qemu/queue.h      |  15 ++++++-
>>   include/qemu/thread.h     |   4 ++
>>   qemu-coroutine.c          | 104 ++++++++++++++++++++++------------------------
>>   tests/test-coroutine.c    |   2 +-
>>   util/qemu-thread-posix.c  |  37 +++++++++++++++++
>>   util/qemu-thread-win32.c  |  48 ++++++++++++++++-----
>>   9 files changed, 157 insertions(+), 131 deletions(-)
>>
> 
> What's the status of this series?

Maintainers are probably just waiting for a complete Reviewed-by.  If
you can provide performance numbers, that would help too.

Paolo

* Re: [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (7 preceding siblings ...)
  2014-12-11 13:55 ` [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Peter Lieven
@ 2014-12-18 10:06 ` Fam Zheng
  2015-01-06 15:39 ` Stefan Hajnoczi
  9 siblings, 0 replies; 17+ messages in thread
From: Fam Zheng @ 2014-12-18 10:06 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, ming.lei, pl, qemu-devel, stefanha

On Tue, 12/02 12:05, Paolo Bonzini wrote:
> As discussed in the other thread, this brings speedups from
> dropping the coroutine mutex (which serializes multiple iothreads,
> too) and using ELF thread-local storage.
> 
> The speedup in perf/cost is about 50% (190->125).  Windows port tested
> with tests/test-coroutine.exe under Wine.

Looks quite nice!

Reviewed-by: Fam Zheng <famz@redhat.com>

> 
> Paolo
> 
> v1->v2: include the noinline attribute [many...]
> 	do not mention SwitchToFiber [Kevin]
> 	rename run_main_iothread_exit -> run_main_thread_exit
> 	leave personal opinions out of commit messages :) [Kevin]
> 	mention gain from patch 7 [Peter]
> 	change "alloc_pool_size +=" to "alloc_pool_size =" [Peter]
> 
> Paolo Bonzini (7):
>   coroutine-ucontext: use __thread
>   qemu-thread: add per-thread atexit functions
>   test-coroutine: avoid overflow on 32-bit systems
>   QSLIST: add lock-free operations
>   coroutine: rewrite pool to avoid mutex
>   coroutine: drop qemu_coroutine_adjust_pool_size
>   coroutine: try harder not to delete coroutines
> 
>  block/block-backend.c     |   4 --
>  coroutine-ucontext.c      |  64 +++++++---------------------
>  include/block/coroutine.h |  10 -----
>  include/qemu/queue.h      |  15 ++++++-
>  include/qemu/thread.h     |   4 ++
>  qemu-coroutine.c          | 104 ++++++++++++++++++++++------------------------
>  tests/test-coroutine.c    |   2 +-
>  util/qemu-thread-posix.c  |  37 +++++++++++++++++
>  util/qemu-thread-win32.c  |  48 ++++++++++++++++-----
>  9 files changed, 157 insertions(+), 131 deletions(-)
> 
> -- 
> 2.1.0
> 
> 

* Re: [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations
  2014-12-02 11:05 [Qemu-devel] [PATCH v2 0/7] coroutine: optimizations Paolo Bonzini
                   ` (8 preceding siblings ...)
  2014-12-18 10:06 ` Fam Zheng
@ 2015-01-06 15:39 ` Stefan Hajnoczi
  9 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2015-01-06 15:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, ming.lei, pl, qemu-devel, stefanha

On Tue, Dec 02, 2014 at 12:05:43PM +0100, Paolo Bonzini wrote:
> As discussed in the other thread, this brings speedups from
> dropping the coroutine mutex (which serializes multiple iothreads,
> too) and using ELF thread-local storage.
> 
> The speedup in perf/cost is about 50% (190->125).  Windows port tested
> with tests/test-coroutine.exe under Wine.
> 
> Paolo
> 
> v1->v2: include the noinline attribute [many...]
> 	do not mention SwitchToFiber [Kevin]
> 	rename run_main_iothread_exit -> run_main_thread_exit
> 	leave personal opinions out of commit messages :) [Kevin]
> 	mention gain from patch 7 [Peter]
> 	change "alloc_pool_size +=" to "alloc_pool_size =" [Peter]
> 
> Paolo Bonzini (7):
>   coroutine-ucontext: use __thread
>   qemu-thread: add per-thread atexit functions
>   test-coroutine: avoid overflow on 32-bit systems
>   QSLIST: add lock-free operations
>   coroutine: rewrite pool to avoid mutex
>   coroutine: drop qemu_coroutine_adjust_pool_size
>   coroutine: try harder not to delete coroutines
> 
>  block/block-backend.c     |   4 --
>  coroutine-ucontext.c      |  64 +++++++---------------------
>  include/block/coroutine.h |  10 -----
>  include/qemu/queue.h      |  15 ++++++-
>  include/qemu/thread.h     |   4 ++
>  qemu-coroutine.c          | 104 ++++++++++++++++++++++------------------------
>  tests/test-coroutine.c    |   2 +-
>  util/qemu-thread-posix.c  |  37 +++++++++++++++++
>  util/qemu-thread-win32.c  |  48 ++++++++++++++++-----
>  9 files changed, 157 insertions(+), 131 deletions(-)
> 
> -- 
> 2.1.0
> 
> 

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan
