From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:33091)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Yg8eH-0005qK-UI
	for qemu-devel@nongnu.org; Thu, 09 Apr 2015 05:21:26 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Yg8eD-0007bj-9y
	for qemu-devel@nongnu.org; Thu, 09 Apr 2015 05:21:21 -0400
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu,  9 Apr 2015 11:21:10 +0200
Message-Id: <1428571270-11723-1-git-send-email-pbonzini@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom
	half scheduling
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org
Cc: jsnow@redhat.com, lersek@redhat.com, stefanha@redhat.com, Paul.Leveille@stratus.com, qemu-stable@nongnu.org

There are two problems with memory barriers in async.c.  The fix is
to use atomic_xchg in order to achieve sequential consistency between
the scheduling of a bottom half and the corresponding execution.

First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
not execute a memory barrier to order any writes needed by the callback
before the read of bh->scheduled.  If the other side sees req->state as
THREAD_ACTIVE, the callback is not invoked and you get deadlock.

Second, the memory barrier in aio_bh_poll is too weak.  Without this
patch, it is possible that bh->scheduled = 0 is not "published" until
after the callback has returned.  Another thread that wants to schedule
the bottom half in the meantime sees bh->scheduled = 1 and does nothing,
which causes a lost wakeup.  The memory barrier should have been changed
to smp_mb()
in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed
that patch?

Both of these involve a store followed by a load, which even x86 is
allowed to reorder, so they are reproducible on x86_64 as well.  Paul
Leveille reported how to trigger the problem within 15 minutes on
x86_64.  His (untested) recipe, reproduced here for reference, is the
following:

   1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.

   2) Use “cache=directsync” rather than the default of
   “cache=none” to make it happen easier.

   3) Use a server with a write-back RAID controller to allow for rapid
   IO rates.

   4) Run a random-access load that (mostly) writes chunks to various
   files on the virtual block device.

      a. I use ‘diskload.exe c:25’, a Microsoft HCT load
         generator, on Windows VMs.

      b. Iometer can probably be configured to generate a similar load.

   5) Run multiple VMs in parallel, against the same storage device,
   to shake the failure out sooner.

   6) IvyBridge and Haswell processors for certain; not sure about others.

A similar patch survived over 12 hours of testing, where an unpatched
QEMU would fail within 15 minutes.

This bug is, most likely, also involved in the failures in the libguestfs
test suite on AArch64 (reported by Laszlo Ersek and Richard Jones).
However, the patch is not enough to fix those failures.

Thanks to Stefan Hajnoczi for suggesting closer examination of
qemu_bh_schedule, and to Paul for providing test input and a prototype
patch.

Cc: qemu-stable@nongnu.org
Reported-by: Laszlo Ersek <lersek@redhat.com>
Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
Reported-by: John Snow <jsnow@redhat.com>
Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Paul Leveille <Paul.Leveille@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)
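
For readers who want to experiment with the protocol outside QEMU, here
is a minimal sketch in portable C11 of the exchange-based scheduling
described in the commit message.  It is not the code in async.c:
SketchBH, sketch_notify, sketch_bh_schedule, sketch_bh_poll and
sketch_demo are hypothetical names, and C11 atomic_exchange (which is
sequentially consistent by default) is assumed to stand in for QEMU's
atomic_xchg.  The demo is single-threaded, so it only checks the
notify/run accounting, not cross-thread ordering.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for QEMUBH; only the fields the sketch needs. */
typedef struct SketchBH SketchBH;
struct SketchBH {
    atomic_int scheduled;          /* 1 while a run is pending */
    void (*cb)(SketchBH *bh);      /* bottom half callback */
    int payload;                   /* written before scheduling, read by cb */
    int seen;                      /* last payload value observed by cb */
    int runs;                      /* number of times cb has run */
};

static int notify_count;           /* counts calls to the aio_notify stand-in */

static void sketch_notify(void)
{
    notify_count++;
}

/* Scheduling side: the exchange is a full barrier, so earlier payload
 * writes are ordered before the flag update, and only the thread that
 * performs the 0 -> 1 transition sends a notification.
 */
static void sketch_bh_schedule(SketchBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        sketch_notify();
    }
}

/* Polling side: the flag is cleared with a full barrier before the
 * callback runs, so a scheduler racing with the callback either sees 0
 * and notifies again, or its writes are visible to this run.
 */
static int sketch_bh_poll(SketchBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 0)) {
        bh->cb(bh);
        return 1;
    }
    return 0;
}

static void sketch_cb(SketchBH *bh)
{
    bh->seen = bh->payload;
    bh->runs++;
    if (bh->runs == 1) {
        /* The flag is already 0 here, so this reschedule is not lost. */
        sketch_bh_schedule(bh);
    }
}

/* Returns 0 if the accounting matches the expectations, 1 otherwise. */
static int sketch_demo(void)
{
    SketchBH bh = { .cb = sketch_cb };
    int ok = 1;

    atomic_init(&bh.scheduled, 0);
    notify_count = 0;
    bh.payload = 42;

    sketch_bh_schedule(&bh);
    sketch_bh_schedule(&bh);                 /* already pending: no notify */
    ok &= (notify_count == 1);

    ok &= (sketch_bh_poll(&bh) == 1);        /* runs cb, which reschedules */
    ok &= (bh.seen == 42);
    ok &= (notify_count == 2);

    ok &= (sketch_bh_poll(&bh) == 1);        /* runs the rescheduled cb */
    ok &= (sketch_bh_poll(&bh) == 0);        /* nothing pending any more */
    ok &= (bh.runs == 2);
    return ok ? 0 : 1;
}
```

Calling sketch_demo() from a small main() and checking for a zero
return exercises the three cases above: a redundant schedule, a
reschedule from inside the callback, and an idle poll.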

diff --git a/async.c b/async.c
index 2be88cc..2b51e87 100644
--- a/async.c
+++ b/async.c
@@ -72,12 +72,13 @@ int aio_bh_poll(AioContext *ctx)
         /* Make sure that fetching bh happens before accessing its members */
         smp_read_barrier_depends();
         next = bh->next;
-        if (!bh->deleted && bh->scheduled) {
-            bh->scheduled = 0;
-            /* Paired with write barrier in bh schedule to ensure reading for
-             * idle & callbacks coming after bh's scheduling.
-             */
-            smp_rmb();
+        /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
+         * implicit memory barrier ensures that the callback sees all writes
+         * done by the scheduling thread.  It also ensures that the scheduling
+         * thread sees the zero before bh->cb has run, and thus will call
+         * aio_notify again if necessary.
+         */
+        if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
             if (!bh->idle)
                 ret = 1;
             bh->idle = 0;
@@ -108,33 +109,28 @@ int aio_bh_poll(AioContext *ctx)
 
 void qemu_bh_schedule_idle(QEMUBH *bh)
 {
-    if (bh->scheduled)
-        return;
     bh->idle = 1;
     /* Make sure that idle & any writes needed by the callback are done
      * before the locations are read in the aio_bh_poll.
      */
-    smp_wmb();
-    bh->scheduled = 1;
+    atomic_mb_set(&bh->scheduled, 1);
 }
 
 void qemu_bh_schedule(QEMUBH *bh)
 {
     AioContext *ctx;
 
-    if (bh->scheduled)
-        return;
     ctx = bh->ctx;
     bh->idle = 0;
-    /* Make sure that:
+    /* The memory barrier implicit in atomic_xchg makes sure that:
      * 1. idle & any writes needed by the callback are done before the
      *    locations are read in the aio_bh_poll.
      * 2. ctx is loaded before scheduled is set and the callback has a chance
      *    to execute.
      */
-    smp_mb();
-    bh->scheduled = 1;
-    aio_notify(ctx);
+    if (atomic_xchg(&bh->scheduled, 1) == 0) {
+        aio_notify(ctx);
+    }
 }
 
 
-- 
2.3.4