* [Qemu-devel] [PATCH v2] thread-pool: fix deadlock when callbacks depends on each other
@ 2014-06-02 7:15 Marcin Gibuła
From: Marcin Gibuła @ 2014-06-02 7:15 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Stefan Hajnoczi
When two coroutines submit I/O and the first coroutine depends on the second to
complete (by calling bdrv_drain_all), a deadlock may occur.

This is because both requests may have completed before the thread pool
notifier got called. Then, when the notifier gets executed and the first
coroutine calls aio_poll() to make progress, it will hang forever, as the
notifier's descriptor has already been marked clear.

This patch fixes this by deferring clearing the notifier until no
completions are pending.

Without this patch, I could reproduce this bug with snapshot-commit in
about 1 in 10 tries. With this patch, I couldn't reproduce it any more.
Signed-off-by: Marcin Gibula <m.gibula@beyond.pl>
---
--- thread-pool.c 2014-04-17 15:44:45.000000000 +0200
+++ thread-pool.c 2014-06-02 09:10:25.442260590 +0200
@@ -76,6 +76,8 @@ struct ThreadPool {
int new_threads; /* backlog of threads we need to create */
int pending_threads; /* threads created but not running yet */
int pending_cancellations; /* whether we need a cond_broadcast */
+ int pending_completions; /* whether we need to rearm notifier when
+ executing callback */
bool stopping;
};
@@ -110,6 +112,10 @@ static void *worker_thread(void *opaque)
ret = req->func(req->arg);
req->ret = ret;
+ if (req->common.cb) {
+ atomic_inc(&pool->pending_completions);
+ }
+
/* Write ret before state. */
smp_wmb();
req->state = THREAD_DONE;
@@ -173,7 +179,6 @@ static void event_notifier_ready(EventNo
ThreadPool *pool = container_of(notifier, ThreadPool, notifier);
ThreadPoolElement *elem, *next;
- event_notifier_test_and_clear(notifier);
restart:
QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
if (elem->state != THREAD_CANCELED && elem->state != THREAD_DONE) {
@@ -185,6 +190,8 @@ restart:
}
if (elem->state == THREAD_DONE && elem->common.cb) {
QLIST_REMOVE(elem, all);
+ atomic_dec(&pool->pending_completions);
+
/* Read state before ret. */
smp_rmb();
elem->common.cb(elem->common.opaque, elem->ret);
@@ -196,6 +203,19 @@ restart:
qemu_aio_release(elem);
}
}
+
+ /* Double test of pending_completions is necessary to
+ * ensure that there is no race between testing it and
+ * clearing notifier.
+ */
+ if (atomic_read(&pool->pending_completions)) {
+ goto restart;
+ }
+ event_notifier_test_and_clear(notifier);
+ if (atomic_read(&pool->pending_completions)) {
+ event_notifier_set(notifier);
+ goto restart;
+ }
}
static void thread_pool_cancel(BlockDriverAIOCB *acb)
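The clear-then-recheck sequence at the end of the hunk can be modeled in isolation. The sketch below is a single-threaded stand-in (hypothetical names; C11 atomics in place of QEMU's atomic_* macros, and a plain flag in place of the fd-backed EventNotifier) showing why the notifier is re-armed when pending_completions is still non-zero after the clear:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-ins for QEMU's EventNotifier (illustrative names only). */
static atomic_bool notifier;
static atomic_int pending_completions;

static bool event_notifier_test_and_clear(void)
{
    return atomic_exchange(&notifier, false);
}

static void event_notifier_set(void)
{
    atomic_store(&notifier, true);
}

/* Worker side: record a completion, then kick the notifier. */
static void complete_one(void)
{
    atomic_fetch_add(&pending_completions, 1);
    event_notifier_set();
}

/* Consumer side, mirroring the patched event_notifier_ready(): run
 * callbacks, and only leave the notifier clear once nothing is pending. */
static void notifier_ready(void)
{
restart:
    while (atomic_load(&pending_completions) > 0) {
        atomic_fetch_sub(&pending_completions, 1); /* invoke one callback */
    }
    event_notifier_test_and_clear();
    if (atomic_load(&pending_completions) > 0) {
        event_notifier_set(); /* a completion raced with the clear: rearm */
        goto restart;
    }
}

/* Two requests finish before the event loop runs once; with the patched
 * ordering, no completion is left behind while the notifier is clear. */
static int run_scenario(void)
{
    complete_one();
    complete_one();
    notifier_ready();
    return atomic_load(&pending_completions) == 0 &&
           !atomic_load(&notifier);
}
```

In the real code the "invoke one callback" step is where a nested aio_poll() can run; the invariant the patch maintains is that the notifier is never left clear while a completion is still queued.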
* Re: [Qemu-devel] [PATCH v2] thread-pool: fix deadlock when callbacks depends on each other
From: Stefan Hajnoczi @ 2014-06-04 10:01 UTC (permalink / raw)
To: Marcin Gibuła; +Cc: Paolo Bonzini, qemu-devel
On Mon, Jun 02, 2014 at 09:15:27AM +0200, Marcin Gibuła wrote:
> When two coroutines submit I/O and first coroutine depends on second to
> complete (by calling bdrv_drain_all), deadlock may occur.
bdrv_drain_all() is a very heavy-weight operation. Coroutines should
avoid it if possible. Please post the file/line/function where this
call was made, perhaps there is a better way to wait for the other
coroutine. This isn't a fix for this bug but it's a cleanup.
> This is because both requests may have completed before thread pool notifier
> got called. Then, when notifier gets executed and first coroutine calls
> aio_poll() to make progress, it will hang forever, as notifier's descriptor
> has been already marked clear.
>
> This patch fixes this, by deferring clearing notifier until no completions
> are pending.
>
> Without this patch, I could reproduce this bug with snapshot-commit with
> about 1 per 10 tries. With this patch, I couldn't reproduce it any more.
>
> Signed-off-by: Marcin Gibula <m.gibula@beyond.pl>
> ---
This is an interesting bug that definitely needs a test case to prevent
regressions in the future.
Please take a look at tests/test-thread-pool.c and add a test to it. It
can be reproduced deterministically - just call aio_poll() after the
dummy worker functions have both completed. Then the next aio_poll()
call in the thread pool callback will suffer the problem you described.
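The deterministic repro Stefan describes can be sketched as a single-threaded model of the *pre-patch* ordering (hypothetical names, no real thread pool or tests/test-thread-pool.c API): both dummy workers have already completed when the notifier callback runs, and the single clear consumes the wakeup for both completions.

```c
#include <stdbool.h>

/* Model of the unpatched event_notifier_ready() ordering. */
static int lost_wakeup(void)
{
    int completions = 2;  /* both requests already THREAD_DONE */
    bool notifier = true; /* event_notifier_set() ran for each */

    /* Unpatched code clears the notifier before walking the list... */
    notifier = false;     /* event_notifier_test_and_clear() */

    /* ...then runs the first callback, which calls bdrv_drain_all()
     * and thereby a nested aio_poll() to wait for the second request. */
    completions--;

    /* The second completion is already queued, so no further
     * event_notifier_set() will ever arrive; with the notifier clear,
     * the nested aio_poll() has nothing to wake it: the wakeup is lost. */
    return completions > 0 && !notifier;
}
```

A test built on this ordering hangs without the patch and completes with it, which is what makes it a good regression test.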
Stefan
* Re: [Qemu-devel] [PATCH v2] thread-pool: fix deadlock when callbacks depends on each other
From: Paolo Bonzini @ 2014-06-04 10:18 UTC (permalink / raw)
To: Stefan Hajnoczi, Marcin Gibuła; +Cc: qemu-devel
On 04/06/2014 12:01, Stefan Hajnoczi wrote:
>> > Without this patch, I could reproduce this bug with snapshot-commit with
>> > about 1 per 10 tries. With this patch, I couldn't reproduce it any more.
>> >
>> > Signed-off-by: Marcin Gibula <m.gibula@beyond.pl>
>> > ---
> This is an interesting bug that definitely needs a test case to prevent
> regressions in the future.
>
> Please take a look at tests/test-thread-pool.c and add a test to it. It
> can be reproduced deterministically - just call aio_poll() after the
> dummy worker functions have both completed. Then the next aio_poll()
> call in the thread pool callback will suffer the problem you described.
The question is whether we want to consider this thread-pool.c behavior a
real bug or just a misfeature (the real bug being elsewhere).
Even though this patch avoids the performance problems of v1, we would
have to fix at least two other cases, and it's not obvious (a) that those
two are the only ones or (b) that those two can be fixed without affecting
performance.
If the bottom half code is immune from this event notifier problem,
bdrv_drain/bdrv_drain_all calls in coroutine context can defer the
actual draining to a bottom half and reenter the coroutine afterwards;
we can then audit that all other calls should come from the main loop
rather than aio_poll.
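The deferral Paolo sketches can be illustrated generically. The toy bottom-half queue below uses made-up names (the real QEMU API is qemu_bh_new()/qemu_bh_schedule()); the point is that the drain runs from a main-loop iteration, never from inside a completion callback, so it cannot recurse into aio_poll():

```c
#include <stddef.h>

/* A toy bottom-half queue; names and the fixed size are illustrative. */
typedef void (*bh_func)(void *opaque);

#define MAX_BH 8
static struct { bh_func fn; void *opaque; } bh_queue[MAX_BH];
static size_t bh_count;

static void bh_schedule(bh_func fn, void *opaque)
{
    bh_queue[bh_count].fn = fn;
    bh_queue[bh_count].opaque = opaque;
    bh_count++;
}

/* Bottom halves run from the main loop, outside any callback context. */
static void bh_poll(void)
{
    size_t i;
    for (i = 0; i < bh_count; i++) {
        bh_queue[i].fn(bh_queue[i].opaque);
    }
    bh_count = 0;
}

static int drained;

static void do_drain(void *opaque)
{
    (void)opaque;
    drained = 1; /* the actual bdrv_drain_all() work would happen here */
}

/* The coroutine schedules the drain and yields instead of draining in
 * callback context; the next main-loop iteration runs the bottom half
 * and can then reenter the coroutine. */
static int defer_drain(void)
{
    bh_schedule(do_drain, NULL);
    bh_poll(); /* one main-loop iteration */
    return drained;
}
```

This is only a shape sketch of the proposal, not how the eventual QEMU fix was implemented.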
Paolo
* Re: [Qemu-devel] [PATCH v2] thread-pool: fix deadlock when callbacks depends on each other
From: Marcin Gibuła @ 2014-06-04 10:31 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Paolo Bonzini, qemu-devel
On 04.06.2014 12:01, Stefan Hajnoczi wrote:
> On Mon, Jun 02, 2014 at 09:15:27AM +0200, Marcin Gibuła wrote:
>> When two coroutines submit I/O and first coroutine depends on second to
>> complete (by calling bdrv_drain_all), deadlock may occur.
>
> bdrv_drain_all() is a very heavy-weight operation. Coroutines should
> avoid it if possible. Please post the file/line/function where this
> call was made, perhaps there is a better way to wait for the other
> coroutine. This isn't a fix for this bug but it's a cleanup.
As in the original bug report:
#4 0x00007f699c095c0a in bdrv_drain_all () at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/block.c:1805
#5 0x00007f699c09c87e in bdrv_close (bs=bs@entry=0x7f699f0bc520) at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/block.c:1695
#6 0x00007f699c09c5fa in bdrv_delete (bs=0x7f699f0bc520) at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/block.c:1978
#7 bdrv_unref (bs=0x7f699f0bc520) at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/block.c:5198
#8 0x00007f699c09c812 in bdrv_drop_intermediate
(active=active@entry=0x7f699ebfd330, top=top@entry=0x7f699f0bc520,
base=base@entry=0x7f699eec43d0) at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/block.c:2567
#9 0x00007f699c0a1963 in commit_run (opaque=0x7f699f17dcc0) at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/block/commit.c:144
#10 0x00007f699c0e0dca in coroutine_trampoline (i0=<optimized out>,
i1=<optimized out>) at
/var/tmp/portage/app-emulation/qemu-2.0.0_rc2/work/qemu-2.0.0-rc2/coroutine-ucontext.c:118
mirror_run probably has this as well. I didn't check others.
--
mg