From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <551D759B.2090102@redhat.com>
Date: Thu, 02 Apr 2015 19:00:11 +0200
From: Paolo Bonzini
MIME-Version: 1.0
References: <551D71BF.6050601@redhat.com>
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Subject: Re: [Qemu-devel] thread-pool.c race condition?
To: Stefan Hajnoczi
Cc: Laszlo Ersek, John Snow, qemu-devel

On 02/04/2015 18:47, Stefan Hajnoczi wrote:
> My initial speculation was that the qemu_bh_schedule():
>
>   if (bh->scheduled)
>       return;
>
> Check is causing us to skip BH invocations.
>
> When I look at the code the lack of explicit barriers or atomic
> operations for bh->scheduled itself is a little suspicious.

You may have been onto something.

If bh->scheduled is already 1, we do not execute a memory barrier to
order "any writes needed by the callback [...] before the locations are
read in the aio_bh_poll" (quoting from the comment).  In particular,
req->state might be seen as THREAD_ACTIVE.  This would explain the
failure on aarch64, but not on x86_64.

So, this is probably worth testing:

diff --git a/async.c b/async.c
index 2be88cc..c5d9939 100644
--- a/async.c
+++ b/async.c
@@ -122,19 +122,17 @@
 void qemu_bh_schedule(QEMUBH *bh)
 {
     AioContext *ctx;
 
-    if (bh->scheduled)
-        return;
     ctx = bh->ctx;
     bh->idle = 0;
-    /* Make sure that:
+    /* The memory barrier implicit in atomic_xchg makes sure that:
      * 1. idle & any writes needed by the callback are done before the
      *    locations are read in the aio_bh_poll.
      * 2. ctx is loaded before scheduled is set and the callback has a chance
      *    to execute.
      */
-    smp_mb();
-    bh->scheduled = 1;
-    aio_notify(ctx);
+    if (atomic_xchg(&bh->scheduled, 1) == 0) {
+        aio_notify(ctx);
+    }
 }

> But now I'm focussing more on thread-pool.c since that has its own
> threading constraints.

Making thread-pool.c less clever didn't make the bug go away for
Laszlo.

Paolo
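
P.S.: in case someone wants to poke at the pattern outside QEMU, below
is a small standalone sketch of the same "exchange instead of early
return" idea, written with C11 atomics rather than our atomic_xchg and
aio_notify.  All the names here (ExampleBH, example_bh_schedule,
example_bh_poll, hello_cb) are invented for the sketch and are not the
real async.c identifiers.

/*
 * Standalone sketch (not QEMU code) of the scheduling pattern in the
 * patch above, using C11 atomics.
 */
#include <stdatomic.h>
#include <stdio.h>

typedef struct ExampleBH {
    atomic_int scheduled;        /* 0 = idle, 1 = scheduled */
    void (*cb)(void *opaque);    /* callback run by the "event loop" */
    void *opaque;
} ExampleBH;

/* Producer side: atomic_exchange is a full barrier on every call, so
 * stores done before scheduling (e.g. a request state flipped to
 * "done") are ordered before the consumer's reads even when the BH
 * was already scheduled.  The buggy version returned early when
 * scheduled == 1 and skipped the barrier. */
static void example_bh_schedule(ExampleBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        /* first to schedule: a real event loop would notify here */
    }
}

/* Consumer side: clear 'scheduled' with an exchange before running
 * the callback, so a schedule racing with the callback simply
 * re-arms the BH for the next poll instead of being lost. */
static void example_bh_poll(ExampleBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 0) == 1) {
        bh->cb(bh->opaque);
    }
}

static void hello_cb(void *opaque)
{
    printf("callback ran: %s\n", (const char *)opaque);
}

int main(void)
{
    ExampleBH bh = { .cb = hello_cb, .opaque = "request completed" };

    example_bh_schedule(&bh);    /* schedules, would notify the loop */
    example_bh_schedule(&bh);    /* already scheduled, still a barrier */
    example_bh_poll(&bh);        /* runs the callback exactly once */

    return 0;
}

The point mirrored from the patch is that the exchange acts as a full
barrier on every schedule, not only on the 0->1 transition.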