From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:39581) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Ws81H-0001vD-Hv for qemu-devel@nongnu.org; Wed, 04 Jun 2014 06:02:13 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Ws81B-0008Po-LI for qemu-devel@nongnu.org; Wed, 04 Jun 2014 06:02:07 -0400 Received: from mail-wg0-x22c.google.com ([2a00:1450:400c:c00::22c]:60943) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Ws81B-0008Pb-FG for qemu-devel@nongnu.org; Wed, 04 Jun 2014 06:02:01 -0400 Received: by mail-wg0-f44.google.com with SMTP id a1so8024065wgh.3 for ; Wed, 04 Jun 2014 03:02:00 -0700 (PDT) Date: Wed, 4 Jun 2014 12:01:56 +0200 From: Stefan Hajnoczi Message-ID: <20140604100156.GF26902@stefanha-thinkpad.redhat.com> References: <538C248F.2000801@beyond.pl> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <538C248F.2000801@beyond.pl> Subject: Re: [Qemu-devel] [PATCH v2] thread-pool: fix deadlock when callbacks depends on each other List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Marcin =?utf-8?Q?Gibu=C5=82a?= Cc: Paolo Bonzini , qemu-devel On Mon, Jun 02, 2014 at 09:15:27AM +0200, Marcin GibuĊ‚a wrote: > When two coroutines submit I/O and first coroutine depends on second to > complete (by calling bdrv_drain_all), deadlock may occur. bdrv_drain_all() is a very heavy-weight operation. Coroutines should avoid it if possible. Please post the file/line/function where this call was made, perhaps there is a better way to wait for the other coroutine. This isn't a fix for this bug but it's a cleanup. > This is because both requests may have completed before thread pool notifier > got called. Then, when notifier gets executed and first coroutine calls > aio_pool() to make progress, it will hang forever, as notifier's descriptor > has been already marked clear. > > This patch fixes this, by deferring clearing notifier until no completions > are pending. > > Without this patch, I could reproduce this bug with snapshot-commit with > about 1 per 10 tries. With this patch, I couldn't reproduce it any more. > > Signed-off-by: Marcin Gibula > --- This is an interesting bug that definitely needs a test case to prevent regressions in the future. Please take a look at tests/test-thread-pool.c and add a test to it. It can be reproduced deterministically - just call aio_poll() after the dummy worker functions have both completed. Then the next aio_poll() call in the thread pool callback will suffer the problem you described. Stefan