From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <551D759B.2090102@redhat.com>
Date: Thu, 02 Apr 2015 19:00:11 +0200
From: Paolo Bonzini
MIME-Version: 1.0
References: <551D71BF.6050601@redhat.com>
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Subject: Re: [Qemu-devel] thread-pool.c race condition?
To: Stefan Hajnoczi
Cc: Laszlo Ersek, John Snow, qemu-devel

On 02/04/2015 18:47, Stefan Hajnoczi wrote:
> My initial speculation was that the qemu_bh_schedule():
>
>   if (bh->scheduled)
>       return;
>
> Check is causing us to skip BH invocations.
>
> When I look at the code the lack of explicit barriers or atomic
> operations for bh->scheduled itself is a little suspicious.

You may have been onto something.

If bh->scheduled is already 1, we do not execute a memory barrier to
order "any writes needed by the callback [...] before the locations are
read in the aio_bh_poll" (quoting from the comment).  In particular,
req->state might be seen as THREAD_ACTIVE.  This would explain the
failure on aarch64, but not on x86_64.

So, this is probably worth testing:

diff --git a/async.c b/async.c
index 2be88cc..c5d9939 100644
--- a/async.c
+++ b/async.c
@@ -122,19 +122,17 @@
 void qemu_bh_schedule(QEMUBH *bh)
 {
     AioContext *ctx;
 
-    if (bh->scheduled)
-        return;
     ctx = bh->ctx;
     bh->idle = 0;
-    /* Make sure that:
+    /* The memory barrier implicit in atomic_xchg makes sure that:
      * 1. idle & any writes needed by the callback are done before the
      *    locations are read in the aio_bh_poll.
      * 2. ctx is loaded before scheduled is set and the callback has a chance
      *    to execute.
      */
-    smp_mb();
-    bh->scheduled = 1;
-    aio_notify(ctx);
+    if (atomic_xchg(&bh->scheduled, 1) == 0) {
+        aio_notify(ctx);
+    }
 }

> But now I'm focussing more on thread-pool.c since that has its own
> threading constraints.

Making thread-pool.c less clever didn't make the bug go away for
Laszlo.

Paolo
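
P.S.: in case someone wants to poke at the pattern outside QEMU, below
is a small standalone sketch of the same "exchange instead of early
return" idea, written with C11 atomics rather than our atomic_xchg and
aio_notify.  All the names here (ExampleBH, example_bh_schedule,
example_bh_poll, hello_cb) are invented for the sketch and are not the
real async.c identifiers.

/*
 * Standalone sketch (not QEMU code) of the scheduling pattern in the
 * patch above, using C11 atomics.
 */
#include <stdatomic.h>
#include <stdio.h>

typedef struct ExampleBH {
    atomic_int scheduled;        /* 0 = idle, 1 = scheduled */
    void (*cb)(void *opaque);    /* callback run by the "event loop" */
    void *opaque;
} ExampleBH;

/* Producer side: atomic_exchange is a full barrier on every call, so
 * stores done before scheduling (e.g. a request state flipped to
 * "done") are ordered before the consumer's reads even when the BH
 * was already scheduled.  The buggy version returned early when
 * scheduled == 1 and skipped the barrier. */
static void example_bh_schedule(ExampleBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        /* first to schedule: a real event loop would notify here */
    }
}

/* Consumer side: clear 'scheduled' with an exchange before running
 * the callback, so a schedule racing with the callback simply
 * re-arms the BH for the next poll instead of being lost. */
static void example_bh_poll(ExampleBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 0) == 1) {
        bh->cb(bh->opaque);
    }
}

static void hello_cb(void *opaque)
{
    printf("callback ran: %s\n", (const char *)opaque);
}

int main(void)
{
    ExampleBH bh = { .cb = hello_cb, .opaque = "request completed" };

    example_bh_schedule(&bh);    /* schedules, would notify the loop */
    example_bh_schedule(&bh);    /* already scheduled, still a barrier */
    example_bh_poll(&bh);        /* runs the callback exactly once */

    return 0;
}

The point mirrored from the patch is that the exchange acts as a full
barrier on every schedule, not only on the 0->1 transition.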