From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:36528)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1Yfmfd-0007vm-Ua
	for qemu-devel@nongnu.org; Wed, 08 Apr 2015 05:53:23 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1Yfmfb-0007EE-NJ
	for qemu-devel@nongnu.org; Wed, 08 Apr 2015 05:53:17 -0400
Received: from mx1.redhat.com ([209.132.183.28]:48571)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <lersek@redhat.com>) id 1Yfmfb-0007Dw-EN
	for qemu-devel@nongnu.org; Wed, 08 Apr 2015 05:53:15 -0400
Message-ID: <5524FA84.10007@redhat.com>
Date: Wed, 08 Apr 2015 11:53:08 +0200
From: Laszlo Ersek <lersek@redhat.com>
MIME-Version: 1.0
References: <1428419779-26062-1-git-send-email-pbonzini@redhat.com>
	<84E676B2D50CD441955FB9992BB861809325DD5F@EXHQ1.corp.stratus.com>
In-Reply-To: <84E676B2D50CD441955FB9992BB861809325DD5F@EXHQ1.corp.stratus.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom
	half scheduling
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Leveille, Paul" <Paul.Leveille@stratus.com>, 'Paolo Bonzini' <pbonzini@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: "stefanha@redhat.com" <stefanha@redhat.com>

On 04/07/15 20:20, Leveille, Paul wrote:
> Paolo,
>
> I've applied your patch in place of my prototype patch and, as expected, it's working fine. Thanks!

I also tested the patch (on top of 5a24f20a72), and on aarch64 it
doesn't fix the hang.

If there's interest, I can write up the reproducer using public /
upstream components only. (It's a bit messy because you have to use
Gerd's AAVMF build, or build it yourself.) There are other reproducers
too, of course, but the one with AAVMF is very reliable, and it needs
very little custom stuff to set up.

Thanks
Laszlo

> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Tuesday, April 07, 2015 11:16 AM
> To: qemu-devel@nongnu.org
> Cc: lersek@redhat.com; Leveille, Paul; stefanha@redhat.com
> Subject: [PATCH] aio: strengthen memory barriers for bottom half scheduling
>
> There are two problems with memory barriers in async.c.  The fix is to use atomic_xchg in order to achieve sequential consistency between the scheduling of a bottom half and the corresponding execution.
>
> First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does not execute a memory barrier to order any writes needed by the callback before the read of bh->scheduled.  If the other side sees req->state as THREAD_ACTIVE, the callback is not invoked and you get deadlock.
>
> Second, the memory barrier in aio_bh_poll is too weak.  Without this patch, it is possible that bh->scheduled = 0 is not "published" until after the callback has returned.  Another thread wants to schedule the bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes a lost wakeup.  The memory barrier should have been changed to smp_mb() in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition, 2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed that patch?
>
> Both of these involve a store and a load, so they are reproducible on x86_64 as well.  It is however much easier on aarch64, where the libguestfs test suite triggers the bug fairly easily.  Even there the failure can go away or appear depending on compiler optimization level, tracing options, or even kernel debugging options.
>
> Paul Leveille however reported how to trigger the problem within 15 minutes on x86_64 as well.  His (untested) recipe, reproduced here for reference, is the following:
>
>    1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.
>
>    2) Use “cache=directsync” rather than the default of
>    “cache=none” to make it happen easier.
>
>    3) Use a server with a write-back RAID controller to allow for rapid
>    IO rates.
>
>    4) Run a random-access load that (mostly) writes chunks to various
>    files on the virtual block device.
>
>       a. I use ‘diskload.exe c:25’, a Microsoft HCT load
>          generator, on Windows VMs.
>
>       b. Iometer can probably be configured to generate a similar load.
>
>    5) Run multiple VMs in parallel, against the same storage device,
>    to shake the failure out sooner.
>
>    6) IvyBridge and Haswell processors for certain; not sure about others.
>
> A similar patch survived over 12 hours of testing, where an unpatched QEMU would fail within 15 minutes.
>
> This bug is, most likely, also the cause of failures in the libguestfs testsuite on AArch64.
>
> Thanks to Laszlo Ersek for initially reporting this bug, to Stefan Hajnoczi for suggesting closer examination of qemu_bh_schedule, and to Paul for providing test input and a prototype patch.
>
> Reported-by: Laszlo Ersek <lersek@redhat.com>
> Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
> Reported-by: John Snow <jsnow@redhat.com>
> Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>         Not yet tested on AArch64, will do it tomorrow.  Paul, it would
>         be great if you could test this patch too!
>
>  async.c | 28 ++++++++++++----------------
>  1 file changed, 12 insertions(+), 16 deletions(-)
>
> diff --git a/async.c b/async.c
> index 2be88cc..2b51e87 100644
> --- a/async.c
> +++ b/async.c
> @@ -72,12 +72,13 @@ int aio_bh_poll(AioContext *ctx)
>          /* Make sure that fetching bh happens before accessing its members */
>          smp_read_barrier_depends();
>          next = bh->next;
> -        if (!bh->deleted && bh->scheduled) {
> -            bh->scheduled = 0;
> -            /* Paired with write barrier in bh schedule to ensure reading for
> -             * idle & callbacks coming after bh's scheduling.
> -             */
> -            smp_rmb();
> +        /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
> +         * implicit memory barrier ensures that the callback sees all writes
> +         * done by the scheduling thread.  It also ensures that the scheduling
> +         * thread sees the zero before bh->cb has run, and thus will call
> +         * aio_notify again if necessary.
> +         */
> +        if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
>              if (!bh->idle)
>                  ret = 1;
>              bh->idle = 0;
> @@ -108,33 +109,28 @@ int aio_bh_poll(AioContext *ctx)
>  
>  void qemu_bh_schedule_idle(QEMUBH *bh)
>  {
> -    if (bh->scheduled)
> -        return;
>      bh->idle = 1;
>      /* Make sure that idle & any writes needed by the callback are done
>       * before the locations are read in the aio_bh_poll.
>       */
> -    smp_wmb();
> -    bh->scheduled = 1;
> +    atomic_mb_set(&bh->scheduled, 1);
>  }
>  
>  void qemu_bh_schedule(QEMUBH *bh)
>  {
>      AioContext *ctx;
>  
> -    if (bh->scheduled)
> -        return;
>      ctx = bh->ctx;
>      bh->idle = 0;
> -    /* Make sure that:
> +    /* The memory barrier implicit in atomic_xchg makes sure that:
>       * 1. idle & any writes needed by the callback are done before the
>       *    locations are read in the aio_bh_poll.
>       * 2. ctx is loaded before scheduled is set and the callback has a chance
>       *    to execute.
>       */
> -    smp_mb();
> -    bh->scheduled = 1;
> -    aio_notify(ctx);
> +    if (atomic_xchg(&bh->scheduled, 1) == 0) {
> +        aio_notify(ctx);
> +    }
>  }
>  
>  
> --
> 2.3.4
>
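
For anyone who wants to look at the handshake from the quoted patch in
isolation, here is a minimal sketch in plain C11. It is an illustration
under assumptions, not QEMU code: BottomHalf, bh_init, bh_schedule,
bh_poll, count_runs and the notify_count field are all hypothetical
names, atomic_exchange from <stdatomic.h> stands in for QEMU's
atomic_xchg, and a counter stands in for aio_notify(). It only tries to
show two things: the exchange lets exactly one scheduler wake the poller
per pending run, and the flag is cleared before the callback starts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for QEMU's QEMUBH (not the real struct). */
typedef struct BottomHalf {
    atomic_int scheduled;       /* 1 while a run is pending */
    atomic_int notify_count;    /* stand-in for aio_notify(); illustration only */
    void (*cb)(void *opaque);   /* callback run from the poll loop */
    void *opaque;
} BottomHalf;

static void bh_init(BottomHalf *bh, void (*cb)(void *), void *opaque)
{
    atomic_init(&bh->scheduled, 0);
    atomic_init(&bh->notify_count, 0);
    bh->cb = cb;
    bh->opaque = opaque;
}

/* Scheduling side. atomic_exchange is sequentially consistent by default,
 * so writes made before this call are ordered before the flag update.
 * Only the caller that flips the flag from 0 to 1 wakes the poller. */
static void bh_schedule(BottomHalf *bh)
{
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        atomic_fetch_add(&bh->notify_count, 1);
    }
}

/* Polling side. The flag is cleared by the exchange before the callback
 * runs, so a scheduler racing with the callback observes 0 and notifies
 * again instead of silently dropping its request.
 * Returns 1 if the callback ran, 0 otherwise. */
static int bh_poll(BottomHalf *bh)
{
    assert(bh->cb != NULL);
    if (atomic_exchange(&bh->scheduled, 0)) {
        bh->cb(bh->opaque);
        return 1;
    }
    return 0;
}

/* Example callback for experiments: counts how often it was invoked. */
static void count_runs(void *opaque)
{
    (*(int *)opaque)++;
}
```

Driven from a single thread, two bh_schedule() calls followed by one
bh_poll() should leave notify_count at 1 and run the callback once. A
bh_schedule() issued after that poll bumps notify_count again, which
corresponds to the "call aio_notify again if necessary" remark in the
patch comment. The sketch leaves out the deleted and idle handling from
async.c.
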