[Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling


* [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
@ 2015-04-07 15:16 Paolo Bonzini
  2015-04-07 18:20 ` Leveille, Paul
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Paolo Bonzini @ 2015-04-07 15:16 UTC (permalink / raw)
  To: qemu-devel; +Cc: lersek, stefanha, Paul.Leveille

There are two problems with memory barriers in async.c.  The fix is
to use atomic_xchg in order to achieve sequential consistency between
the scheduling of a bottom half and the corresponding execution.
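
As a rough illustration of the protocol (a sketch only, not the code in
async.c), the following uses C11 atomic_exchange as a stand-in for QEMU's
atomic_xchg.  DemoBH, demo_bh_schedule, demo_bh_poll and the notify counter
are hypothetical names invented for this example; the counter merely stands
in for aio_notify().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for QEMUBH. */
typedef struct DemoBH {
    atomic_int scheduled;      /* 0 = idle, 1 = scheduled */
    atomic_int notify_count;   /* stand-in for calls to aio_notify() */
    void (*cb)(void *opaque);
    void *opaque;
} DemoBH;

/* Scheduling side: the sequentially consistent exchange is both the
 * "already scheduled?" test and a full memory barrier.
 */
static void demo_bh_schedule(DemoBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        atomic_fetch_add(&bh->notify_count, 1);
    }
}

/* Polling side: clear the flag with an exchange, then run the callback. */
static bool demo_bh_poll(DemoBH *bh)
{
    if (atomic_exchange(&bh->scheduled, 0)) {
        bh->cb(bh->opaque);
        return true;
    }
    return false;
}

static void demo_count_cb(void *opaque)
{
    ++*(int *)opaque;
}

/* Schedule twice, poll twice; returns how many times the callback ran. */
static int demo_coalesced_runs(void)
{
    int runs = 0;
    DemoBH bh = { .cb = demo_count_cb, .opaque = &runs };

    atomic_init(&bh.scheduled, 0);
    atomic_init(&bh.notify_count, 0);

    demo_bh_schedule(&bh);
    demo_bh_schedule(&bh);      /* already scheduled: no second notify */
    assert(atomic_load(&bh.notify_count) == 1);

    demo_bh_poll(&bh);          /* runs the callback once */
    demo_bh_poll(&bh);          /* nothing left to run */
    return runs;
}
```

Calling demo_coalesced_runs() exercises the sketch on a single thread: two
schedule calls collapse into one notification and one callback run.  It
says nothing about hardware reordering; it only shows the shape of the
exchange-based handshake.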

First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
not execute a memory barrier to order any writes needed by the callback
before the read of bh->scheduled.  If the other side (for example, the
thread pool's completion bottom half) then sees req->state as
THREAD_ACTIVE, the callback is not invoked and you get a deadlock.
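
One way to convince oneself that the exchange-based scheme closes this
hole is to enumerate every interleaving of the two threads under
sequential consistency, which is what atomic_xchg on both sides is meant
to provide.  The sketch below is only a model: it cannot reproduce the
hardware reordering that breaks the old code, and DEMO_ACTIVE, DEMO_DONE
and all function names are hypothetical stand-ins.  The scheduling thread
performs "write state, then exchange scheduled to 1"; the polling thread
performs "exchange scheduled to 0, then run the callback, which reads
state".

```c
#include <assert.h>
#include <stdbool.h>

enum { DEMO_ACTIVE, DEMO_DONE };   /* hypothetical request states */

/* Replay one interleaving.  Bit i of sched_mask set means that step i is
 * taken by the scheduling thread, otherwise by the polling thread; each
 * thread has exactly two steps.  Returns true if the completion is not
 * lost.
 */
static bool demo_interleaving_ok(unsigned sched_mask)
{
    int state = DEMO_ACTIVE;   /* stand-in for req->state */
    int scheduled = 1;         /* bottom half was already scheduled earlier */
    int s_step = 0, p_step = 0;
    bool polled = false, cb_saw_done = false;

    for (int i = 0; i < 4; i++) {
        if (sched_mask & (1u << i)) {
            if (s_step++ == 0) {
                state = DEMO_DONE;      /* write needed by the callback */
            } else {
                scheduled = 1;          /* models atomic_xchg(&scheduled, 1) */
            }
        } else {
            if (p_step++ == 0) {
                polled = (scheduled != 0);   /* models atomic_xchg(.., 0) */
                scheduled = 0;
            } else if (polled) {
                cb_saw_done = (state == DEMO_DONE);   /* callback body */
            }
        }
    }
    /* Either this callback run saw DONE, or another run is still pending. */
    return cb_saw_done || scheduled == 1;
}

static int demo_count_bits(unsigned v)
{
    int n = 0;
    for (; v != 0; v >>= 1) {
        n += (int)(v & 1u);
    }
    return n;
}

/* Count the interleavings (out of C(4,2) = 6) that lose the completion. */
static int demo_lost_completions(void)
{
    int examined = 0, lost = 0;

    for (unsigned mask = 0; mask < 16; mask++) {
        if (demo_count_bits(mask) != 2) {
            continue;
        }
        examined++;
        if (!demo_interleaving_ok(mask)) {
            lost++;
        }
    }
    assert(examined == 6);
    return lost;
}
```

In this model no interleaving loses the completion: whenever the callback
misses the new state, the flag is left set and the bottom half runs again.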

Second, the memory barrier in aio_bh_poll is too weak.  Without this
patch, it is possible that bh->scheduled = 0 is not "published" until
after the callback has returned.  Another thread wants to schedule the
bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes
a lost wakeup.  The memory barrier should have been changed to smp_mb()
in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed
that patch?
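
The lost wakeup can be imitated on a single thread by treating "the zero
is published late" as "the flag is cleared only after the callback has
returned".  This is a deliberately simplified model, not the real code:
DemoQueue and the demo_* functions are hypothetical, plain ints replace
the atomics, and the "other thread" is simulated by a submission made from
inside the callback.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct DemoQueue {
    int scheduled;      /* models bh->scheduled */
    int notify_count;   /* models calls to aio_notify() */
    int pending;        /* work items not yet seen by the callback */
    int processed;      /* work items the callback has handled */
    bool late_submit;   /* one-shot: "another thread" submits mid-callback */
} DemoQueue;

/* Scheduler: queue one item, notify only if the flag was previously 0. */
static void demo_submit(DemoQueue *q)
{
    q->pending++;
    if (q->scheduled == 0) {    /* models atomic_xchg(&scheduled, 1) == 0 */
        q->scheduled = 1;
        q->notify_count++;
    }
}

/* Callback: drain the queue, then let the simulated other thread submit. */
static void demo_callback(DemoQueue *q)
{
    q->processed += q->pending;
    q->pending = 0;
    if (q->late_submit) {
        q->late_submit = false;
        demo_submit(q);         /* arrives after the drain */
    }
}

static void demo_poll(DemoQueue *q, bool clear_before_cb)
{
    if (!q->scheduled) {
        return;
    }
    if (clear_before_cb) {
        q->scheduled = 0;       /* zero is visible before the callback runs */
        demo_callback(q);
    } else {
        demo_callback(q);
        q->scheduled = 0;       /* zero "published" only after the callback */
    }
}

/* Submit one item, poll twice; returns how many items were processed. */
static int demo_run(bool clear_before_cb)
{
    DemoQueue q = { .late_submit = true };

    demo_submit(&q);
    demo_poll(&q, clear_before_cb);
    demo_poll(&q, clear_before_cb);
    assert(q.processed + q.pending == 2);   /* two items were submitted */
    return q.processed;
}
```

With the flag cleared first, demo_run(true) processes both items.  With
the late clear, demo_run(false) processes only one: the second submission
sees scheduled == 1, skips the notification, and the item stays pending
with nothing left to wake the poller.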

Both of these involve a store and a load, so they are reproducible on
x86_64 as well.  It is however much easier on aarch64, where the
libguestfs test suite triggers the bug fairly easily.  Even there the
failure can go away or appear depending on compiler optimization level,
tracing options, or even kernel debugging options.

Paul Leveille however reported how to trigger the problem within 15
minutes on x86_64 as well.  His (untested) recipe, reproduced here
for reference, is the following:

   1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.

   2) Use “cache=directsync” rather than the default of
   “cache=none” to make it happen more easily.

   3) Use a server with a write-back RAID controller to allow for rapid
   IO rates.

   4) Run a random-access load that (mostly) writes chunks to various
   files on the virtual block device.

      a. I use ‘diskload.exe c:25’, a Microsoft HCT load
         generator, on Windows VMs.

      b. Iometer can probably be configured to generate a similar load.

   5) Run multiple VMs in parallel, against the same storage device,
   to shake the failure out sooner.

   6) IvyBridge and Haswell processors for certain; not sure about others.

A similar patch survived over 12 hours of testing, where an unpatched
QEMU would fail within 15 minutes.

This bug is, most likely, also the cause of failures in the libguestfs
testsuite on AArch64.

Thanks to Laszlo Ersek for initially reporting this bug, to Stefan
Hajnoczi for suggesting closer examination of qemu_bh_schedule, and to
Paul for providing test input and a prototype patch.

Reported-by: Laszlo Ersek <lersek@redhat.com>
Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
Reported-by: John Snow <jsnow@redhat.com>
Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        Not yet tested on AArch64, will do it tomorrow.  Paul, it would
        be great if you could test this patch too!

 async.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/async.c b/async.c
index 2be88cc..2b51e87 100644
--- a/async.c
+++ b/async.c
@@ -72,12 +72,13 @@ int aio_bh_poll(AioContext *ctx)
         /* Make sure that fetching bh happens before accessing its members */
         smp_read_barrier_depends();
         next = bh->next;
-        if (!bh->deleted && bh->scheduled) {
-            bh->scheduled = 0;
-            /* Paired with write barrier in bh schedule to ensure reading for
-             * idle & callbacks coming after bh's scheduling.
-             */
-            smp_rmb();
+        /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
+         * implicit memory barrier ensures that the callback sees all writes
+         * done by the scheduling thread.  It also ensures that the scheduling
+         * thread sees the zero before bh->cb has run, and thus will call
+         * aio_notify again if necessary.
+         */
+        if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
             if (!bh->idle)
                 ret = 1;
             bh->idle = 0;
@@ -108,33 +109,28 @@ int aio_bh_poll(AioContext *ctx)

 void qemu_bh_schedule_idle(QEMUBH *bh)
 {
-    if (bh->scheduled)
-        return;
     bh->idle = 1;
     /* Make sure that idle & any writes needed by the callback are done
      * before the locations are read in the aio_bh_poll.
      */
-    smp_wmb();
-    bh->scheduled = 1;
+    atomic_mb_set(&bh->scheduled, 1);
 }

 void qemu_bh_schedule(QEMUBH *bh)
 {
     AioContext *ctx;

-    if (bh->scheduled)
-        return;
     ctx = bh->ctx;
     bh->idle = 0;
-    /* Make sure that:
+    /* The memory barrier implicit in atomic_xchg makes sure that:
      * 1. idle & any writes needed by the callback are done before the
      *    locations are read in the aio_bh_poll.
      * 2. ctx is loaded before scheduled is set and the callback has a chance
      *    to execute.
      */
-    smp_mb();
-    bh->scheduled = 1;
-    aio_notify(ctx);
+    if (atomic_xchg(&bh->scheduled, 1) == 0) {
+        aio_notify(ctx);
+    }
 }

-- 
2.3.4


* Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
  2015-04-07 15:16 [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling Paolo Bonzini
@ 2015-04-07 18:20 ` Leveille, Paul
  2015-04-08  9:53   ` Laszlo Ersek
  2015-04-08 10:15 ` Richard W.M. Jones
  2015-04-08 15:06 ` Stefan Hajnoczi
  2 siblings, 1 reply; 8+ messages in thread
From: Leveille, Paul @ 2015-04-07 18:20 UTC (permalink / raw)
  To: 'Paolo Bonzini', qemu-devel@nongnu.org
  Cc: lersek@redhat.com, stefanha@redhat.com

Paolo,

I've applied your patch in place of my prototype patch and, as expected, it's working fine. Thanks!



* Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
  2015-04-07 18:20 ` Leveille, Paul
@ 2015-04-08  9:53   ` Laszlo Ersek
  0 siblings, 0 replies; 8+ messages in thread
From: Laszlo Ersek @ 2015-04-08  9:53 UTC (permalink / raw)
  To: Leveille, Paul, 'Paolo Bonzini', qemu-devel@nongnu.org
  Cc: stefanha@redhat.com

On 04/07/15 20:20, Leveille, Paul wrote:
> Paolo,
> 
> I've applied your patch in place of my prototype patch and, as expected, it's working fine. Thanks!

I also tested the patch (on top of 5a24f20a72), and on aarch64 it
doesn't fix the hang.

If there's interest, I can write up the reproducer using public /
upstream components only. (It's a bit messy because you have to use
Gerd's AAVMF build, or build it yourself.) There are other reproducers
too, of course, but the one with AAVMF is very reliable, and it needs
very little custom stuff to set up.

Thanks
Laszlo



* Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
  2015-04-07 15:16 [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling Paolo Bonzini
  2015-04-07 18:20 ` Leveille, Paul
@ 2015-04-08 10:15 ` Richard W.M. Jones
  2015-04-08 10:34   ` Laszlo Ersek
  2015-04-08 15:06 ` Stefan Hajnoczi
  2 siblings, 1 reply; 8+ messages in thread
From: Richard W.M. Jones @ 2015-04-08 10:15 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Paul.Leveille, lersek, qemu-devel, stefanha

On Tue, Apr 07, 2015 at 05:16:19PM +0200, Paolo Bonzini wrote:
> This bug is, most likely, also the cause of failures in the libguestfs
> testsuite on AArch64.

I'm afraid I have to agree with Laszlo, that this patch unfortunately
does not cure the aarch64 race.

The test case I'm using is:
http://git.annexia.org/?p=rhbz1184405.git;a=tree

Note: This does not mean this patch is bad!  Just that it doesn't cure
the long-standing aarch64 race that we've been trying to fix forever,
which is what we were hoping.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


* Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
  2015-04-08 10:15 ` Richard W.M. Jones
@ 2015-04-08 10:34   ` Laszlo Ersek
  0 siblings, 0 replies; 8+ messages in thread
From: Laszlo Ersek @ 2015-04-08 10:34 UTC (permalink / raw)
  To: Richard W.M. Jones, Paolo Bonzini; +Cc: Paul.Leveille, qemu-devel, stefanha

On 04/08/15 12:15, Richard W.M. Jones wrote:
> On Tue, Apr 07, 2015 at 05:16:19PM +0200, Paolo Bonzini wrote:
>> This bug is, most likely, also the cause of failures in the libguestfs
>> testsuite on AArch64.
> 
> I'm afraid I have to agree with Laszlo, that this patch unfortunately
> does not cure the aarch64 race.
> 
> The test case I'm using is:
> http://git.annexia.org/?p=rhbz1184405.git;a=tree
> 
> Note: This does not mean this patch is bad!  Just that it doesn't cure
> the long-standing aarch64 race that we've been trying to fix forever,
> which is what we were hoping.

Precisely. I meant the same; sorry if I wasn't clear about it.

Laszlo


* Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
  2015-04-07 15:16 [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling Paolo Bonzini
  2015-04-07 18:20 ` Leveille, Paul
  2015-04-08 10:15 ` Richard W.M. Jones
@ 2015-04-08 15:06 ` Stefan Hajnoczi
  2 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2015-04-08 15:06 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Paul.Leveille, Laszlo Ersek, qemu-devel, Stefan Hajnoczi

On Tue, Apr 7, 2015 at 4:16 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> There are two problems with memory barriers in async.c.  The fix is
> to use atomic_xchg in order to achieve sequential consistency between
> the scheduling of a bottom half and the corresponding execution.
>
> First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
> not execute a memory barrier to order any writes needed by the callback
> before the read of bh->scheduled.  If the other side sees req->state as
> THREAD_ACTIVE, the callback is not invoked and you get deadlock.
>
> Second, the memory barrier in aio_bh_poll is too weak.  Without this
> patch, it is possible that bh->scheduled = 0 is not "published" until
> after the callback has returned.  Another thread wants to schedule the
> bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes
> a lost wakeup.  The memory barrier should have been changed to smp_mb()
> in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
> 2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed
> that patch?
>
> Both of these involve a store and a load, so they are reproducible on
> x86_64 as well.  It is however much easier on aarch64, where the
> libguestfs test suite triggers the bug fairly easily.  Even there the
> failure can go away or appear depending on compiler optimization level,
> tracing options, or even kernel debugging options.
>
> Paul Leveille however reported how to trigger the problem within 15
> minutes on x86_64 as well.  His (untested) recipe, reproduced here
> for reference, is the following:
>
>    1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.
>
>    2) Use “cache=directsync” rather than the default of
>    “cache=none” to make it happen easier.
>
>    3) Use a server with a write-back RAID controller to allow for rapid
>    IO rates.
>
>    4) Run a random-access load that (mostly) writes chunks to various
>    files on the virtual block device.
>
>       a. I use ‘diskload.exe c:25’, a Microsoft HCT load
>          generator, on Windows VMs.
>
>       b. Iometer can probably be configured to generate a similar load.
>
>    5) Run multiple VMs in parallel, against the same storage device,
>    to shake the failure out sooner.
>
>    6) IvyBridge and Haswell processors for certain; not sure about others.
>
> A similar patch survived over 12 hours of testing, where an unpatched
> QEMU would fail within 15 minutes.
>
> This bug is, most likely, also the cause of failures in the libguestfs
> testsuite on AArch64.
>
> Thanks to Laszlo Ersek for initially reporting this bug, to Stefan
> Hajnoczi for suggesting closer examination of qemu_bh_schedule, and to
> Paul for providing test input and a prototype patch.
>
> Reported-by: Laszlo Ersek <lersek@redhat.com>
> Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
> Reported-by: John Snow <jsnow@redhat.com>
> Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>         Not yet tested on AArch64, will do it tomorrow.  Paul, it would
>         be great if you could test this patch too!

Paolo, please update the commit description as you see fit (e.g. the
aarch64 bug turned out to be unrelated so it probably shouldn't be
mentioned).

If we run out of time for QEMU 2.3-rc3 I will edit the description
myself and merge the patch.

Stefan


* [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
@ 2015-04-09  9:21 Paolo Bonzini
  2015-04-09  9:29 ` Stefan Hajnoczi
  0 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2015-04-09  9:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: jsnow, lersek, stefanha, Paul.Leveille, qemu-stable

There are two problems with memory barriers in async.c.  The fix is
to use atomic_xchg in order to achieve sequential consistency between
the scheduling of a bottom half and the corresponding execution.

First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
not execute a memory barrier to order any writes needed by the callback
before the read of bh->scheduled.  If the other side (for example, the
thread pool's completion bottom half) then sees req->state as
THREAD_ACTIVE, the callback is not invoked and you get a deadlock.

Second, the memory barrier in aio_bh_poll is too weak.  Without this
patch, it is possible that bh->scheduled = 0 is not "published" until
after the callback has returned.  Another thread wants to schedule the
bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes
a lost wakeup.  The memory barrier should have been changed to smp_mb()
in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed
that patch?

Both of these involve a store and a load, so they are reproducible
on x86_64 as well.  Paul Leveille reported how to trigger the problem
within 15 minutes on x86_64.  His (untested) recipe, reproduced here
for reference, is the following:

   1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.

   2) Use “cache=directsync” rather than the default of
   “cache=none” to make it happen more easily.

   3) Use a server with a write-back RAID controller to allow for rapid
   IO rates.

   4) Run a random-access load that (mostly) writes chunks to various
   files on the virtual block device.

      a. I use ‘diskload.exe c:25’, a Microsoft HCT load
         generator, on Windows VMs.

      b. Iometer can probably be configured to generate a similar load.

   5) Run multiple VMs in parallel, against the same storage device,
   to shake the failure out sooner.

   6) IvyBridge and Haswell processors for certain; not sure about others.

A similar patch survived over 12 hours of testing, where an unpatched
QEMU would fail within 15 minutes.

This bug is, most likely, also involved in the failures in the libguestfs
testsuite on AArch64 (reported by Laszlo Ersek and Richard Jones).  However,
the patch is not enough to fix that.

Thanks to Stefan Hajnoczi for suggesting closer examination of
qemu_bh_schedule, and to Paul for providing test input and a prototype
patch.

Cc: qemu-stable@nongnu.org
Reported-by: Laszlo Ersek <lersek@redhat.com>
Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
Reported-by: John Snow <jsnow@redhat.com>
Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Paul Leveille <Paul.Leveille@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/async.c b/async.c
index 2be88cc..2b51e87 100644
--- a/async.c
+++ b/async.c
@@ -72,12 +72,13 @@ int aio_bh_poll(AioContext *ctx)
         /* Make sure that fetching bh happens before accessing its members */
         smp_read_barrier_depends();
         next = bh->next;
-        if (!bh->deleted && bh->scheduled) {
-            bh->scheduled = 0;
-            /* Paired with write barrier in bh schedule to ensure reading for
-             * idle & callbacks coming after bh's scheduling.
-             */
-            smp_rmb();
+        /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
+         * implicit memory barrier ensures that the callback sees all writes
+         * done by the scheduling thread.  It also ensures that the scheduling
+         * thread sees the zero before bh->cb has run, and thus will call
+         * aio_notify again if necessary.
+         */
+        if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
             if (!bh->idle)
                 ret = 1;
             bh->idle = 0;
@@ -108,33 +109,28 @@ int aio_bh_poll(AioContext *ctx)

 void qemu_bh_schedule_idle(QEMUBH *bh)
 {
-    if (bh->scheduled)
-        return;
     bh->idle = 1;
     /* Make sure that idle & any writes needed by the callback are done
      * before the locations are read in the aio_bh_poll.
      */
-    smp_wmb();
-    bh->scheduled = 1;
+    atomic_mb_set(&bh->scheduled, 1);
 }

 void qemu_bh_schedule(QEMUBH *bh)
 {
     AioContext *ctx;

-    if (bh->scheduled)
-        return;
     ctx = bh->ctx;
     bh->idle = 0;
-    /* Make sure that:
+    /* The memory barrier implicit in atomic_xchg makes sure that:
      * 1. idle & any writes needed by the callback are done before the
      *    locations are read in the aio_bh_poll.
      * 2. ctx is loaded before scheduled is set and the callback has a chance
      *    to execute.
      */
-    smp_mb();
-    bh->scheduled = 1;
-    aio_notify(ctx);
+    if (atomic_xchg(&bh->scheduled, 1) == 0) {
+        aio_notify(ctx);
+    }
 }

-- 
2.3.4


* Re: [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling
  2015-04-09  9:21 Paolo Bonzini
@ 2015-04-09  9:29 ` Stefan Hajnoczi
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2015-04-09  9:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: jsnow, qemu-devel, qemu-stable, stefanha, lersek, Paul.Leveille


On Thu, Apr 09, 2015 at 11:21:10AM +0200, Paolo Bonzini wrote:
> There are two problems with memory barriers in async.c.  The fix is
> to use atomic_xchg in order to achieve sequential consistency between
> the scheduling of a bottom half and the corresponding execution.
> 
> First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
> not execute a memory barrier to order any writes needed by the callback
> before the read of bh->scheduled.  If the other side sees req->state as
> THREAD_ACTIVE, the callback is not invoked and you get deadlock.
> 
> Second, the memory barrier in aio_bh_poll is too weak.  Without this
> patch, it is possible that bh->scheduled = 0 is not "published" until
> after the callback has returned.  Another thread wants to schedule the
> bottom half, but it sees bh->scheduled = 1 and does nothing.  This causes
> a lost wakeup.  The memory barrier should have been changed to smp_mb()
> in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
> 2014-06-03) together with qemu_bh_schedule()'s.  Guess who reviewed
> that patch?
> 
> Both of these involve a store and a load, so they are reproducible
> on x86_64 as well.  Paul Leveille however reported how to trigger the
> problem within 15 minutes on x86_64 as well.  His (untested) recipe,
> reproduced here for reference, is the following:
> 
>    1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.
> 
>    2) Use “cache=directsync” rather than the default of
>    “cache=none” to make it happen easier.
> 
>    3) Use a server with a write-back RAID controller to allow for rapid
>    IO rates.
> 
>    4) Run a random-access load that (mostly) writes chunks to various
>    files on the virtual block device.
> 
>       a. I use ‘diskload.exe c:25’, a Microsoft HCT load
>          generator, on Windows VMs.
> 
>       b. Iometer can probably be configured to generate a similar load.
> 
>    5) Run multiple VMs in parallel, against the same storage device,
>    to shake the failure out sooner.
> 
>    6) IvyBridge and Haswell processors for certain; not sure about others.
> 
> A similar patch survived over 12 hours of testing, where an unpatched
> QEMU would fail within 15 minutes.
> 
> This bug is, most likely, also involved in the failures in the libguestfs
> testsuite on AArch64 (reported by Laszlo Ersek and Richard Jones).  However,
> the patch is not enough to fix that.
> 
> Thanks to Stefan Hajnoczi for suggesting closer examination of
> qemu_bh_schedule, and to Paul for providing test input and a prototype
> patch.
> 
> Cc: qemu-stable@nongnu.org
> Reported-by: Laszlo Ersek <lersek@redhat.com>
> Reported-by: Paul Leveille <Paul.Leveille@stratus.com>
> Reported-by: John Snow <jsnow@redhat.com>
> Suggested-by: Paul Leveille <Paul.Leveille@stratus.com>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Tested-by: Paul Leveille <Paul.Leveille@stratus.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  async.c | 28 ++++++++++++----------------
>  1 file changed, 12 insertions(+), 16 deletions(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan



end of thread, other threads:[~2015-04-09  9:29 UTC | newest]

Thread overview: 8+ messages
2015-04-07 15:16 [Qemu-devel] [PATCH] aio: strengthen memory barriers for bottom half scheduling Paolo Bonzini
2015-04-07 18:20 ` Leveille, Paul
2015-04-08  9:53   ` Laszlo Ersek
2015-04-08 10:15 ` Richard W.M. Jones
2015-04-08 10:34   ` Laszlo Ersek
2015-04-08 15:06 ` Stefan Hajnoczi
  -- strict thread matches above, loose matches on Subject: below --
2015-04-09  9:21 Paolo Bonzini
2015-04-09  9:29 ` Stefan Hajnoczi
