[PATCH 0/8] Fix missing memory barriers on ARM

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/8] Fix missing memory barriers on ARM
@ 2023-03-03 17:19 Paolo Bonzini
  2023-03-03 17:19 ` [PATCH 1/8] qatomic: add smp_mb__before/after_rmw() Paolo Bonzini
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

This series fixes more instances of the problem fixed by commits
5710a3e09f9b ("async: use explicit memory barriers", 2020-04-09) and
7455ff1aa015 ("aio_wait_kick: add missing memory barrier", 2022-06-24).
This is an interesting case where ARM's memory ordering is somewhat 
stronger than what you would expect.

On ARM, seqcst loads and stores (which QEMU does not use) are compiled
respectively as LDAR and STLR instructions.  Even though STLR is also
used for store-release operations, STLR followed by LDAR provides
store-against-load ordering, which is stronger than a store-release.
Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
is DMB+STR+DMB.

This means that on ARM a sequence of

  qatomic_store_release(&y, ...);         // STLR
  a = qatomic_load_acquire(&x);           // LDAR

provides stronger ordering at the processor level than the two MOV
instructions you'd get on x86.

Likewise, on ARM sequentially consistent read-modify-write operations only
need to use LDAXR and STLXR respectively for the load and the store, which
is weaker than the LOCK prefix used on x86.

In a strange twist of events, however, the _stronger_ semantics
of the ARM instructions can end up causing bugs on ARM, not on x86.
The problems occur when seqcst atomics are mixed with relaxed atomics.
Here is how the two are compiled on ARM:

                   load              store
   relaxed         LDR               STR
   seqcst          LDAR              STLR

QEMU's atomics try to bridge the Linux API (that most of the developers
are familiar with) and the C11 API, and the two have a substantial
difference:

- in Linux, strongly-ordered atomics such as atomic_add_return() affect
  the global ordering of _all_ memory operations, including for example
  READ_ONCE()/WRITE_ONCE()

- in C11, sequentially consistent atomics (except for seqcst fences)
  only affect the ordering of sequentially consistent operations.
  In particular, since relaxed loads are done with LDR on ARM, they are
  not ordered against seqcst stores (which are done with STLR).

QEMU implements high-level synchronization primitives with the idea that
the primitives contain the necessary memory barriers, and the callers can
use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
This is very much incompatible with the C11 view that seqcst accesses
are only ordered against other seqcst accesses, and requires using
seqcst fences as in the following example:

   qatomic_set(&y, 1);            qatomic_set(&x, 1);
   smp_mb();                      smp_mb();
   ... qatomic_read(&x) ...       ... qatomic_read(&y) ...

Bugs ensue when a qatomic_*() read-modify write operation is used instead
of one or both stores, for example:

   qatomic_<rmw>(&y, ...);
   smp_mb();
   ... qatomic_read(&x) ...

Developers that are more familiar with the Linux API may be tempted
to omit the smp_mb() and that's exactly what yours truly did in
qemu_event_set() and qemu_event_reset().  After a27dd2de68f3 ("KVM:
keep track of running ioctls", 2023-01-11), this caused hangs due to
threads sleeping forever in qemu_event_wait().

This _could_ also have been the cause of occasional hangs of rcutorture,
though I have not observed them personally.

(As an aside, GCC's older __sync_* builtins *did* provide a full barrier
between the RMW operation and loads on the side of the operation.  The
difference between seqcst C11 atomics and __sync_* atomics is exactly
an extra memory barrier after the STLXR instruction).

In order to fix this, while avoiding worse performance from having two
back-to-back memory barriers on x86, patch 1 introduces optimized
memory barriers smp_mb__before_rmw() and smp_mb__after_rmw().  The usage
is similar to Linux's smp_mb__before/after_atomic(), but the name is
different because they affect _all_ RMW operations.  On Linux, instead,
they are not needed around those RMW operations that return the old value.

The remaining patches add them everywhere they are needed.  In the
case of QemuEvent (patches 2-3), I reviewed the algorithm thoroughly,
dropping barriers that were not necessary and killing optimizations that
I wasn't entirely sure about.  For the other cases, instead, the changes
are minimal.

Note: I have a follow-up set of patches that gets rid completely of
atomic_mb_read(); atomic_mb_set() instead can remain and mimic Linux's
smp_store_mb() operation.  A glimpse of these changes is already visible
in patches 7 and 8.

Thanks to Emanuele Esposito and Gavin Shan for help debugging and
testing the changes!

Paolo

Paolo Bonzini (8):
  qatomic: add smp_mb__before/after_rmw()
  qemu-thread-posix: cleanup, fix, document QemuEvent
  qemu-thread-win32: cleanup, fix, document QemuEvent
  edu: add smp_mb__after_rmw()
  util/async: add smp_mb__after_rmw() around BH enqueue/dequeue
  aio-wait: switch to smp_mb__after_rmw()
  qemu-coroutine-lock: add smp_mb__after_rmw()
  physmem: add missing memory barrier

 docs/devel/atomics.rst     | 26 ++++++++++---
 hw/misc/edu.c              |  5 +++
 include/block/aio-wait.h   |  2 +-
 include/qemu/atomic.h      | 17 ++++++++-
 softmmu/physmem.c          |  3 ++
 util/async.c               | 13 ++++---
 util/qemu-coroutine-lock.c |  9 ++++-
 util/qemu-thread-posix.c   | 64 +++++++++++++++++++++----------
 util/qemu-thread-win32.c   | 78 ++++++++++++++++++++++++++------------
 9 files changed, 160 insertions(+), 57 deletions(-)

-- 
2.39.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/8] qatomic: add smp_mb__before/after_rmw()
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 18:57   ` Richard Henderson
  2023-03-06 13:21   ` David Hildenbrand
  2023-03-03 17:19 ` [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent Paolo Bonzini
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On ARM, seqcst loads and stores (which QEMU does not use) are compiled
respectively as LDAR and STLR instructions.  Even though STLR is also
used for store-release operations, STLR followed by LDAR provides
store-against-load ordering, which is stronger than a store-release.
Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
is DMB+STR+DMB.

This means that on ARM a sequence of

  qatomic_store_release(&y, ...);         // STLR
  a = qatomic_load_acquire(&x);           // LDAR

provides stronger ordering at the processor level than the two MOV
instructions you'd get on x86.

Likewise, on ARM sequentially consistent read-modify-write operations only
need to use LDAXR and STLXR respectively for the load and the store, which
is weaker than the LOCK prefix used on x86.

In a strange twist of events, however, the _stronger_ semantics
of the ARM instructions can end up causing bugs on ARM, not on x86.
The problems occur when seqcst atomics are mixed with relaxed atomics.

QEMU's atomics try to bridge the Linux API (that most of the developers
are familiar with) and the C11 API, and the two have a substantial
difference:

- in Linux, strongly-ordered atomics such as atomic_add_return() affect
  the global ordering of _all_ memory operations, including for example
  READ_ONCE()/WRITE_ONCE()

- in C11, sequentially consistent atomics (except for seq-cst fences)
  only affect the ordering of sequentially consistent operations.
  In particular, since relaxed loads are done with LDR on ARM, they are
  not ordered against seqcst stores (which are done with STLR).

QEMU implements high-level synchronization primitives with the idea that
the primitives contain the necessary memory barriers, and the callers can
use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
This is very much incompatible with the C11 view that seqcst accesses
are only ordered against other seqcst accesses, and requires using seqcst
fences as in the following example:

   qatomic_set(&y, 1);            qatomic_set(&x, 1);
   smp_mb();                      smp_mb();
   ... qatomic_read(&x) ...       ... qatomic_read(&y) ...

When a qatomic_*() read-modify write operation is used instead of one
or both stores, developers that are more familiar with the Linux API may
be tempted to omit the smp_mb(), which will work on x86 but not on ARM.

This nasty difference between Linux and C11 read-modify-write operations
has already caused issues in util/async.c and more are being found.
Provide something similar to Linux smp_mb__before/after_atomic(); this
has the double function of documenting clearly why there is a memory
barrier, and avoiding a double barrier on x86 and s390x systems.

The new macro can already be put to use in qatomic_mb_set().

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/devel/atomics.rst | 26 +++++++++++++++++++++-----
 include/qemu/atomic.h  | 17 ++++++++++++++++-
 2 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
index 7957310071d9..898f5393c07a 100644
--- a/docs/devel/atomics.rst
+++ b/docs/devel/atomics.rst
@@ -27,7 +27,8 @@ provides macros that fall in three camps:
 
 - weak atomic access and manual memory barriers: ``qatomic_read()``,
   ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
-  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
+  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
+  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;
 
 - sequentially consistent atomic access: everything else.
 
@@ -472,7 +473,7 @@ and memory barriers, and the equivalents in QEMU:
   sequential consistency.
 
 - in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
-  the total ordering enforced by sequentially-consistent operations.
+  the ordering enforced by read-modify-write operations.
   This is because QEMU uses the C11 memory model.  The following example
   is correct in Linux but not in QEMU:
 
@@ -488,9 +489,24 @@ and memory barriers, and the equivalents in QEMU:
   because the read of ``y`` can be moved (by either the processor or the
   compiler) before the write of ``x``.
 
-  Fixing this requires an ``smp_mb()`` memory barrier between the write
-  of ``x`` and the read of ``y``.  In the common case where only one thread
-  writes ``x``, it is also possible to write it like this:
+  Fixing this requires a full memory barrier between the write of ``x`` and
+  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
+  ``smp_mb__after_rmw()``; they act both as an optimization,
+  avoiding the memory barrier on processors where it is unnecessary,
+  and as a clarification of this corner case of the C11 memory model:
+
+      +--------------------------------+
+      | QEMU (incorrect)               |
+      +================================+
+      | ::                             |
+      |                                |
+      |   a = qatomic_fetch_add(&x, 2);|
+      |   smp_mb__after_rmw();         |
+      |   b = qatomic_read(&y);        |
+      +--------------------------------+
+
+  In the common case where only one thread writes ``x``, it is also possible
+  to write it like this:
 
       +--------------------------------+
       | QEMU (correct)                 |
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 874134fd19ae..f85834ee8b20 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -245,6 +245,20 @@
 #define smp_wmb()   smp_mb_release()
 #define smp_rmb()   smp_mb_acquire()
 
+/*
+ * SEQ_CST is weaker than the older __sync_* builtins and Linux
+ * kernel read-modify-write atomics.  Provide a macro to obtain
+ * the same semantics.
+ */
+#if !defined(QEMU_SANITIZE_THREAD) && \
+    (defined(__i386__) || defined(__x86_64__) || defined(__s390x__))
+# define smp_mb__before_rmw() signal_barrier()
+# define smp_mb__after_rmw() signal_barrier()
+#else
+# define smp_mb__before_rmw() smp_mb()
+# define smp_mb__after_rmw() smp_mb()
+#endif
+
 /* qatomic_mb_read/set semantics map Java volatile variables. They are
  * less expensive on some platforms (notably POWER) than fully
  * sequentially consistent operations.
@@ -259,7 +273,8 @@
 #if !defined(QEMU_SANITIZE_THREAD) && \
     (defined(__i386__) || defined(__x86_64__) || defined(__s390x__))
 /* This is more efficient than a store plus a fence.  */
-# define qatomic_mb_set(ptr, i)  ((void)qatomic_xchg(ptr, i))
+# define qatomic_mb_set(ptr, i) \
+    ({ (void)qatomic_xchg(ptr, i); smp_mb__after_rmw(); })
 #else
 # define qatomic_mb_set(ptr, i) \
    ({ qatomic_store_release(ptr, i); smp_mb(); })
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/8] qatomic: add smp_mb__before/after_rmw()
  2023-03-03 17:19 ` [PATCH 1/8] qatomic: add smp_mb__before/after_rmw() Paolo Bonzini
@ 2023-03-05 18:57   ` Richard Henderson
  2023-03-05 21:00     ` Paolo Bonzini
  2023-03-06 13:21   ` David Hildenbrand
  1 sibling, 1 reply; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 18:57 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> This nasty difference between Linux and C11 read-modify-write operations
> has already caused issues in util/async.c and more are being found.
> Provide something similar to Linux smp_mb__before/after_atomic(); this
> has the double function of documenting clearly why there is a memory
> barrier, and avoiding a double barrier on x86 and s390x systems.

It does make me question our choice to use neither the Linux nor the C11 model.

> +      +--------------------------------+
> +      | QEMU (incorrect)               |
> +      +================================+
> +      | ::                             |
> +      |                                |
> +      |   a = qatomic_fetch_add(&x, 2);|
> +      |   smp_mb__after_rmw();         |
> +      |   b = qatomic_read(&y);        |
> +      +--------------------------------+

Correct, surely.

> +/*
> + * SEQ_CST is weaker than the older __sync_* builtins and Linux
> + * kernel read-modify-write atomics.  Provide a macro to obtain
> + * the same semantics.
> + */
> +#if !defined(QEMU_SANITIZE_THREAD) && \
> +    (defined(__i386__) || defined(__x86_64__) || defined(__s390x__))
> +# define smp_mb__before_rmw() signal_barrier()
> +# define smp_mb__after_rmw() signal_barrier()
> +#else
> +# define smp_mb__before_rmw() smp_mb()
> +# define smp_mb__after_rmw() smp_mb()
> +#endif

I notice you never actually use smp_mb__before_rmw(), but I suppose since we're trying to 
mirror smp_mb__before(), we should keep the interface whole.

Other than the "incorrect" above,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/8] qatomic: add smp_mb__before/after_rmw()
  2023-03-05 18:57   ` Richard Henderson
@ 2023-03-05 21:00     ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-05 21:00 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, Guowen Shan, Emanuele Giuseppe Esposito,
	David Hildenbrand, Hajnoczi, Stefan, Cornelia Huck, Eric Auger

[-- Attachment #1: Type: text/plain, Size: 2625 bytes --]

Il dom 5 mar 2023, 19:57 Richard Henderson <richard.henderson@linaro.org>
ha scritto:

> On 3/3/23 09:19, Paolo Bonzini wrote:
> > This nasty difference between Linux and C11 read-modify-write operations
> > has already caused issues in util/async.c and more are being found.
> > Provide something similar to Linux smp_mb__before/after_atomic(); this
> > has the double function of documenting clearly why there is a memory
> > barrier, and avoiding a double barrier on x86 and s390x systems.
>
> It does make me question our choice to use neither the Linux nor the C11
> model.
>

We do use the C11 model. However, C11's "just use seqcst everywhere" is not
suitable to define higher-level primitives—including those that are part of
the C11 or pthreads specifications.

For example, C11 implementations (such as glibc) have the same issue as
QEMU in their implementations of e.g. pthread_cond_wait or pthread_join.
The primitive itself could in principle be implemented using seqcst
atomics, but it has to assume that the caller uses not even relaxed
atomics, but just plain old non-atomic accesses.

If instead you mean using acqrel memory ordering instead of seqcst for the
RMW operations, honestly I think I understand acq and rel pretty well at
this point but I don't understand acqrel enough to be able to use it.

Paolo


> > +      +--------------------------------+
> > +      | QEMU (incorrect)               |
> > +      +================================+
> > +      | ::                             |
> > +      |                                |
> > +      |   a = qatomic_fetch_add(&x, 2);|
> > +      |   smp_mb__after_rmw();         |
> > +      |   b = qatomic_read(&y);        |
> > +      +--------------------------------+
>
> Correct, surely.
>
> > +/*
> > + * SEQ_CST is weaker than the older __sync_* builtins and Linux
> > + * kernel read-modify-write atomics.  Provide a macro to obtain
> > + * the same semantics.
> > + */
> > +#if !defined(QEMU_SANITIZE_THREAD) && \
> > +    (defined(__i386__) || defined(__x86_64__) || defined(__s390x__))
> > +# define smp_mb__before_rmw() signal_barrier()
> > +# define smp_mb__after_rmw() signal_barrier()
> > +#else
> > +# define smp_mb__before_rmw() smp_mb()
> > +# define smp_mb__after_rmw() smp_mb()
> > +#endif
>
> I notice you never actually use smp_mb__before_rmw(), but I suppose since
> we're trying to
> mirror smp_mb__before(), we should keep the interface whole.
>
> Other than the "incorrect" above,
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
>
>
> r~
>
>

[-- Attachment #2: Type: text/html, Size: 3658 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/8] qatomic: add smp_mb__before/after_rmw()
  2023-03-03 17:19 ` [PATCH 1/8] qatomic: add smp_mb__before/after_rmw() Paolo Bonzini
  2023-03-05 18:57   ` Richard Henderson
@ 2023-03-06 13:21   ` David Hildenbrand
  2023-03-06 13:22     ` David Hildenbrand
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:21 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> On ARM, seqcst loads and stores (which QEMU does not use) are compiled
> respectively as LDAR and STLR instructions.  Even though STLR is also
> used for store-release operations, STLR followed by LDAR provides
> store-against-load ordering, which is stronger than a store-release.
> Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
> is DMB+STR+DMB.
> 
> This means that on ARM a sequence of
> 
>    qatomic_store_release(&y, ...);         // STLR
>    a = qatomic_load_acquire(&x);           // LDAR
> 
> provides stronger ordering at the processor level than the two MOV
> instructions you'd get on x86.
> 
> Likewise, on ARM sequentially consistent read-modify-write operations only
> need to use LDAXR and STLXR respectively for the load and the store, which
> is weaker than the LOCK prefix used on x86.
> 
> In a strange twist of events, however, the _stronger_ semantics
> of the ARM instructions can end up causing bugs on ARM, not on x86.
> The problems occur when seqcst atomics are mixed with relaxed atomics.
> 
> QEMU's atomics try to bridge the Linux API (that most of the developers
> are familiar with) and the C11 API, and the two have a substantial
> difference:
> 
> - in Linux, strongly-ordered atomics such as atomic_add_return() affect
>    the global ordering of _all_ memory operations, including for example
>    READ_ONCE()/WRITE_ONCE()
> 
> - in C11, sequentially consistent atomics (except for seq-cst fences)
>    only affect the ordering of sequentially consistent operations.
>    In particular, since relaxed loads are done with LDR on ARM, they are
>    not ordered against seqcst stores (which are done with STLR).
> 
> QEMU implements high-level synchronization primitives with the idea that
> the primitives contain the necessary memory barriers, and the callers can
> use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
> This is very much incompatible with the C11 view that seqcst accesses
> are only ordered against other seqcst accesses, and requires using seqcst
> fences as in the following example:
> 
>     qatomic_set(&y, 1);            qatomic_set(&x, 1);
>     smp_mb();                      smp_mb();
>     ... qatomic_read(&x) ...       ... qatomic_read(&y) ...
> 
> When a qatomic_*() read-modify write operation is used instead of one
> or both stores, developers that are more familiar with the Linux API may
> be tempted to omit the smp_mb(), which will work on x86 but not on ARM.
> 
> This nasty difference between Linux and C11 read-modify-write operations
> has already caused issues in util/async.c and more are being found.
> Provide something similar to Linux smp_mb__before/after_atomic(); this
> has the double function of documenting clearly why there is a memory
> barrier, and avoiding a double barrier on x86 and s390x systems.
> 

Right, just like smp_mb__before_atomic()/smp_mb__after_atomic().


> The new macro can already be put to use in qatomic_mb_set().
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   docs/devel/atomics.rst | 26 +++++++++++++++++++++-----
>   include/qemu/atomic.h  | 17 ++++++++++++++++-
>   2 files changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
> index 7957310071d9..898f5393c07a 100644
> --- a/docs/devel/atomics.rst
> +++ b/docs/devel/atomics.rst
> @@ -27,7 +27,8 @@ provides macros that fall in three camps:
>   
>   - weak atomic access and manual memory barriers: ``qatomic_read()``,
>     ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
> -  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
> +  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
> +  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;
>   
>   - sequentially consistent atomic access: everything else.
>   
> @@ -472,7 +473,7 @@ and memory barriers, and the equivalents in QEMU:
>     sequential consistency.
>   
>   - in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
> -  the total ordering enforced by sequentially-consistent operations.
> +  the ordering enforced by read-modify-write operations.
>     This is because QEMU uses the C11 memory model.  The following example
>     is correct in Linux but not in QEMU:
>   
> @@ -488,9 +489,24 @@ and memory barriers, and the equivalents in QEMU:
>     because the read of ``y`` can be moved (by either the processor or the
>     compiler) before the write of ``x``.
>   
> -  Fixing this requires an ``smp_mb()`` memory barrier between the write
> -  of ``x`` and the read of ``y``.  In the common case where only one thread
> -  writes ``x``, it is also possible to write it like this:
> +  Fixing this requires a full memory barrier between the write of ``x`` and
> +  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
> +  ``smp_mb__after_rmw()``; they act both as an optimization,
> +  avoiding the memory barrier on processors where it is unnecessary,
> +  and as a clarification of this corner case of the C11 memory model:
> +
> +      +--------------------------------+
> +      | QEMU (incorrect)               |

Just double-checking: shouldn't this be "QEMU (correct)" ?

Or am I confused? :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/8] qatomic: add smp_mb__before/after_rmw()
  2023-03-06 13:21   ` David Hildenbrand
@ 2023-03-06 13:22     ` David Hildenbrand
  0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:22 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 06.03.23 14:21, David Hildenbrand wrote:
> On 03.03.23 18:19, Paolo Bonzini wrote:
>> On ARM, seqcst loads and stores (which QEMU does not use) are compiled
>> respectively as LDAR and STLR instructions.  Even though STLR is also
>> used for store-release operations, STLR followed by LDAR provides
>> store-against-load ordering, which is stronger than a store-release.
>> Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
>> is DMB+STR+DMB.
>>
>> This means that on ARM a sequence of
>>
>>     qatomic_store_release(&y, ...);         // STLR
>>     a = qatomic_load_acquire(&x);           // LDAR
>>
>> provides stronger ordering at the processor level than the two MOV
>> instructions you'd get on x86.
>>
>> Likewise, on ARM sequentially consistent read-modify-write operations only
>> need to use LDAXR and STLXR respectively for the load and the store, which
>> is weaker than the LOCK prefix used on x86.
>>
>> In a strange twist of events, however, the _stronger_ semantics
>> of the ARM instructions can end up causing bugs on ARM, not on x86.
>> The problems occur when seqcst atomics are mixed with relaxed atomics.
>>
>> QEMU's atomics try to bridge the Linux API (that most of the developers
>> are familiar with) and the C11 API, and the two have a substantial
>> difference:
>>
>> - in Linux, strongly-ordered atomics such as atomic_add_return() affect
>>     the global ordering of _all_ memory operations, including for example
>>     READ_ONCE()/WRITE_ONCE()
>>
>> - in C11, sequentially consistent atomics (except for seq-cst fences)
>>     only affect the ordering of sequentially consistent operations.
>>     In particular, since relaxed loads are done with LDR on ARM, they are
>>     not ordered against seqcst stores (which are done with STLR).
>>
>> QEMU implements high-level synchronization primitives with the idea that
>> the primitives contain the necessary memory barriers, and the callers can
>> use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
>> This is very much incompatible with the C11 view that seqcst accesses
>> are only ordered against other seqcst accesses, and requires using seqcst
>> fences as in the following example:
>>
>>      qatomic_set(&y, 1);            qatomic_set(&x, 1);
>>      smp_mb();                      smp_mb();
>>      ... qatomic_read(&x) ...       ... qatomic_read(&y) ...
>>
>> When a qatomic_*() read-modify write operation is used instead of one
>> or both stores, developers that are more familiar with the Linux API may
>> be tempted to omit the smp_mb(), which will work on x86 but not on ARM.
>>
>> This nasty difference between Linux and C11 read-modify-write operations
>> has already caused issues in util/async.c and more are being found.
>> Provide something similar to Linux smp_mb__before/after_atomic(); this
>> has the double function of documenting clearly why there is a memory
>> barrier, and avoiding a double barrier on x86 and s390x systems.
>>
> 
> Right, just like smp_mb__before_atomic()/smp_mb__after_atomic().
> 
> 
>> The new macro can already be put to use in qatomic_mb_set().
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>    docs/devel/atomics.rst | 26 +++++++++++++++++++++-----
>>    include/qemu/atomic.h  | 17 ++++++++++++++++-
>>    2 files changed, 37 insertions(+), 6 deletions(-)
>>
>> diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
>> index 7957310071d9..898f5393c07a 100644
>> --- a/docs/devel/atomics.rst
>> +++ b/docs/devel/atomics.rst
>> @@ -27,7 +27,8 @@ provides macros that fall in three camps:
>>    
>>    - weak atomic access and manual memory barriers: ``qatomic_read()``,
>>      ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
>> -  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
>> +  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
>> +  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;
>>    
>>    - sequentially consistent atomic access: everything else.
>>    
>> @@ -472,7 +473,7 @@ and memory barriers, and the equivalents in QEMU:
>>      sequential consistency.
>>    
>>    - in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
>> -  the total ordering enforced by sequentially-consistent operations.
>> +  the ordering enforced by read-modify-write operations.
>>      This is because QEMU uses the C11 memory model.  The following example
>>      is correct in Linux but not in QEMU:
>>    
>> @@ -488,9 +489,24 @@ and memory barriers, and the equivalents in QEMU:
>>      because the read of ``y`` can be moved (by either the processor or the
>>      compiler) before the write of ``x``.
>>    
>> -  Fixing this requires an ``smp_mb()`` memory barrier between the write
>> -  of ``x`` and the read of ``y``.  In the common case where only one thread
>> -  writes ``x``, it is also possible to write it like this:
>> +  Fixing this requires a full memory barrier between the write of ``x`` and
>> +  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
>> +  ``smp_mb__after_rmw()``; they act both as an optimization,
>> +  avoiding the memory barrier on processors where it is unnecessary,
>> +  and as a clarification of this corner case of the C11 memory model:
>> +
>> +      +--------------------------------+
>> +      | QEMU (incorrect)               |
> 
> Just double-checking: shouldn't this be "QEMU (correct)" ?
> 
> Or am I confused? :)
> 

Oh, noticed Richard also pointed that out. With that fixed LGTM

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
  2023-03-03 17:19 ` [PATCH 1/8] qatomic: add smp_mb__before/after_rmw() Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:11   ` Richard Henderson
  2023-03-06 13:28   ` David Hildenbrand
  2023-03-03 17:19 ` [PATCH 3/8] qemu-thread-win32: " Paolo Bonzini
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

QemuEvent is currently broken on ARM due to missing memory barriers
after qatomic_*().  Apart from adding the memory barrier, a closer look
reveals some unpaired memory barriers too.  Document more clearly what
is going on, and remove optimizations that I couldn't quite prove to
be correct.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/qemu-thread-posix.c | 64 ++++++++++++++++++++++++++++------------
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 93d250579741..06d1bff63bb7 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -384,13 +384,21 @@ void qemu_event_destroy(QemuEvent *ev)
 
 void qemu_event_set(QemuEvent *ev)
 {
-    /* qemu_event_set has release semantics, but because it *loads*
+    assert(ev->initialized);
+
+    /*
+     * Pairs with memory barrier in qemu_event_reset.
+     *
+     * qemu_event_set has release semantics, but because it *loads*
      * ev->value we need a full memory barrier here.
      */
-    assert(ev->initialized);
     smp_mb();
     if (qatomic_read(&ev->value) != EV_SET) {
-        if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
+        int old = qatomic_xchg(&ev->value, EV_SET);
+
+        /* Pairs with memory barrier in kernel futex_wait system call.  */
+        smp_mb__after_rmw();
+        if (old == EV_BUSY) {
             /* There were waiters, wake them up.  */
             qemu_futex_wake(ev, INT_MAX);
         }
@@ -399,18 +407,19 @@ void qemu_event_set(QemuEvent *ev)
 
 void qemu_event_reset(QemuEvent *ev)
 {
-    unsigned value;
-
     assert(ev->initialized);
-    value = qatomic_read(&ev->value);
-    smp_mb_acquire();
-    if (value == EV_SET) {
-        /*
-         * If there was a concurrent reset (or even reset+wait),
-         * do nothing.  Otherwise change EV_SET->EV_FREE.
-         */
-        qatomic_or(&ev->value, EV_FREE);
-    }
+
+    /*
+     * If there was a concurrent reset (or even reset+wait),
+     * do nothing.  Otherwise change EV_SET->EV_FREE.
+     */
+    qatomic_or(&ev->value, EV_FREE);
+
+    /*
+     * Order reset before checking the condition in the caller.
+     * Pairs with the first memory barrier in qemu_event_set().
+     */
+    smp_mb__after_rmw();
 }
 
 void qemu_event_wait(QemuEvent *ev)
@@ -418,20 +427,37 @@ void qemu_event_wait(QemuEvent *ev)
     unsigned value;
 
     assert(ev->initialized);
+
+    /*
+     * This read does not have any particular ordering requirements;
+     * if it moves earlier, we might miss qemu_event_set() and go down the
+     * slow path unnecessarily, but ultimately the memory barrier in
+     * qemu_futex_wait() will ensure the check is done correctly.
+     */
     value = qatomic_read(&ev->value);
-    smp_mb_acquire();
     if (value != EV_SET) {
         if (value == EV_FREE) {
             /*
-             * Leave the event reset and tell qemu_event_set that there
-             * are waiters.  No need to retry, because there cannot be
-             * a concurrent busy->free transition.  After the CAS, the
-             * event will be either set or busy.
+             * Leave the event reset and tell qemu_event_set that there are
+             * waiters.  No need to retry, because there cannot be a concurrent
+             * busy->free transition.  After the CAS, the event will be either
+             * set or busy.
+             *
+             * Neither the load nor the store of this cmpxchg have particular
+             * ordering requirements.  The reasoning for the load is the same
+             * as qatomic_read() above; while moving the store earlier can only
+             * cause qemu_event_set() to issue _more_ wakeups.
              */
             if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
                 return;
             }
         }
+
+        /*
+         * This is the final check for a concurrent set, so it does need
+         * a smp_mb() pairing with the second barrier of qemu_event_set().
+         * The barrier is inside the FUTEX_WAIT system call.
+         */
         qemu_futex_wait(ev, EV_BUSY);
     }
 }
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent
  2023-03-03 17:19 ` [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent Paolo Bonzini
@ 2023-03-05 19:11   ` Richard Henderson
  2023-03-06 13:28   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:11 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> QemuEvent is currently broken on ARM due to missing memory barriers
> after qatomic_*().  Apart from adding the memory barrier, a closer look
> reveals some unpaired memory barriers too.  Document more clearly what
> is going on, and remove optimizations that I couldn't quite prove to
> be correct.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   util/qemu-thread-posix.c | 64 ++++++++++++++++++++++++++++------------
>   1 file changed, 45 insertions(+), 19 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent
  2023-03-03 17:19 ` [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent Paolo Bonzini
  2023-03-05 19:11   ` Richard Henderson
@ 2023-03-06 13:28   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:28 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

> -             * Leave the event reset and tell qemu_event_set that there
> -             * are waiters.  No need to retry, because there cannot be
> -             * a concurrent busy->free transition.  After the CAS, the
> -             * event will be either set or busy.
> +             * Leave the event reset and tell qemu_event_set that there are
> +             * waiters.  No need to retry, because there cannot be a concurrent
> +             * busy->free transition.  After the CAS, the event will be either
> +             * set or busy.
> +             *
> +             * Neither the load nor the store of this cmpxchg have particular
> +             * ordering requirements.  The reasoning for the load is the same
> +             * as qatomic_read() above; while moving the store earlier can only
> +             * cause qemu_event_set() to issue _more_ wakeups.

IIUC, the qatomic_read(&ev->value) is mostly an optimization then, to 
not do an unconditional qatomic_cmpxchg(). That's why we don't care 
about the order in particular.

>                */
>               if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
>                   return;
>               }
>           }
> +
> +        /*
> +         * This is the final check for a concurrent set, so it does need
> +         * a smp_mb() pairing with the second barrier of qemu_event_set().
> +         * The barrier is inside the FUTEX_WAIT system call.
> +         */
>           qemu_futex_wait(ev, EV_BUSY);
>       }
>   }


Skipping back and forth between the Linux and QEMU memory model is a pain :D

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 3/8] qemu-thread-win32: cleanup, fix, document QemuEvent
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
  2023-03-03 17:19 ` [PATCH 1/8] qatomic: add smp_mb__before/after_rmw() Paolo Bonzini
  2023-03-03 17:19 ` [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:14   ` Richard Henderson
  2023-03-06 13:31   ` David Hildenbrand
  2023-03-03 17:19 ` [PATCH 4/8] edu: add smp_mb__after_rmw() Paolo Bonzini
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

QemuEvent is currently broken on ARM due to missing memory barriers
after qatomic_*().  Apart from adding the memory barrier, a closer look
reveals some unpaired memory barriers that are not really needed and
complicated the functions unnecessarily, as well as some optimizations
that I couldn't quite prove to be correct.

Finally, the code is relying on a memory barrier in ResetEvent(); the
barrier _ought_ to be there but there is really no documentation about
it; it only affects the slow path, so make it explicit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/qemu-thread-win32.c | 78 +++++++++++++++++++++++++++-------------
 1 file changed, 53 insertions(+), 25 deletions(-)

diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 69db254ac7c1..eff664ae6b31 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -272,12 +272,20 @@ void qemu_event_destroy(QemuEvent *ev)
 void qemu_event_set(QemuEvent *ev)
 {
     assert(ev->initialized);
-    /* qemu_event_set has release semantics, but because it *loads*
+
+    /*
+     * Pairs with memory barrier in qemu_event_reset.
+     *
+     * qemu_event_set has release semantics, but because it *loads*
      * ev->value we need a full memory barrier here.
      */
     smp_mb();
     if (qatomic_read(&ev->value) != EV_SET) {
-        if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
+        int old = qatomic_xchg(&ev->value, EV_SET);
+
+        /* Pairs with memory barrier after ResetEvent.  */
+        smp_mb__after_rmw();
+        if (old == EV_BUSY) {
             /* There were waiters, wake them up.  */
             SetEvent(ev->event);
         }
@@ -286,17 +294,19 @@ void qemu_event_set(QemuEvent *ev)
 
 void qemu_event_reset(QemuEvent *ev)
 {
-    unsigned value;
-
     assert(ev->initialized);
-    value = qatomic_read(&ev->value);
-    smp_mb_acquire();
-    if (value == EV_SET) {
-        /* If there was a concurrent reset (or even reset+wait),
-         * do nothing.  Otherwise change EV_SET->EV_FREE.
-         */
-        qatomic_or(&ev->value, EV_FREE);
-    }
+
+    /*
+     * If there was a concurrent reset (or even reset+wait),
+     * do nothing.  Otherwise change EV_SET->EV_FREE.
+     */
+    qatomic_or(&ev->value, EV_FREE);
+
+    /*
+     * Order reset before checking the condition in the caller.
+     * Pairs with the first memory barrier in qemu_event_set().
+     */
+    smp_mb__after_rmw();
 }
 
 void qemu_event_wait(QemuEvent *ev)
@@ -304,29 +314,47 @@ void qemu_event_wait(QemuEvent *ev)
     unsigned value;
 
     assert(ev->initialized);
+
+    /*
+     * This read does not have any particular ordering requirements;
+     * if it moves earlier, we might miss qemu_event_set() and go down the
+     * slow path unnecessarily, but ultimately the memory barrier below,
+     * plus the internal synchronization of the kernel event, will ensure
+     * the check is done correctly.
+     */
     value = qatomic_read(&ev->value);
-    smp_mb_acquire();
     if (value != EV_SET) {
         if (value == EV_FREE) {
-            /* qemu_event_set is not yet going to call SetEvent, but we are
-             * going to do another check for EV_SET below when setting EV_BUSY.
-             * At that point it is safe to call WaitForSingleObject.
+            /*
+             * Here the underlying kernel event is reset, but qemu_event_set is
+             * not yet going to call SetEvent.  However, there will be another
+             * check for EV_SET below when setting EV_BUSY.  At that point it
+             * is safe to call WaitForSingleObject.
              */
             ResetEvent(ev->event);
 
-            /* Tell qemu_event_set that there are waiters.  No need to retry
-             * because there cannot be a concurrent busy->free transition.
-             * After the CAS, the event will be either set or busy.
+            /*
+             * It is not clear whether ResetEvent provides this barrier; kernel
+             * APIs (KeResetEvent/KeClearEvent) do not.  Better safe than sorry!
+             */
+            smp_mb();
+
+            /*
+             * Leave the event reset and tell qemu_event_set that there are
+             * waiters.  No need to retry, because there cannot be a concurrent
+             * busy->free transition.  After the CAS, the event will be either
+             * set or busy.
              */
             if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
-                value = EV_SET;
-            } else {
-                value = EV_BUSY;
+                return;
             }
         }
-        if (value == EV_BUSY) {
-            WaitForSingleObject(ev->event, INFINITE);
-        }
+
+        /*
+         * ev->value is now EV_BUSY.  Since we didn't observe EV_SET,
+         * qemu_event_set() must observe EV_BUSY and call SetEvent().
+         */
+        WaitForSingleObject(ev->event, INFINITE);
     }
 }
 
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/8] qemu-thread-win32: cleanup, fix, document QemuEvent
  2023-03-03 17:19 ` [PATCH 3/8] qemu-thread-win32: " Paolo Bonzini
@ 2023-03-05 19:14   ` Richard Henderson
  2023-03-06 13:31   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> QemuEvent is currently broken on ARM due to missing memory barriers
> after qatomic_*().  Apart from adding the memory barrier, a closer look
> reveals some unpaired memory barriers that are not really needed and
> complicated the functions unnecessarily, as well as some optimizations
> that I couldn't quite prove to be correct.
> 
> Finally, the code is relying on a memory barrier in ResetEvent(); the
> barrier_ought_  to be there but there is really no documentation about
> it; it only affects the slow path, so make it explicit.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   util/qemu-thread-win32.c | 78 +++++++++++++++++++++++++++-------------
>   1 file changed, 53 insertions(+), 25 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/8] qemu-thread-win32: cleanup, fix, document QemuEvent
  2023-03-03 17:19 ` [PATCH 3/8] qemu-thread-win32: " Paolo Bonzini
  2023-03-05 19:14   ` Richard Henderson
@ 2023-03-06 13:31   ` David Hildenbrand
  2023-03-06 14:20     ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:31 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> QemuEvent is currently broken on ARM due to missing memory barriers
> after qatomic_*().  Apart from adding the memory barrier, a closer look
> reveals some unpaired memory barriers that are not really needed and
> complicated the functions unnecessarily, as well as some optimizations
> that I couldn't quite prove to be correct.
> 
> Finally, the code is relying on a memory barrier in ResetEvent(); the
> barrier _ought_ to be there but there is really no documentation about
> it; it only affects the slow path, so make it explicit.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

[...]

>   util/qemu-thread-win32.c | 78 +++++++++++++++++++++++++++-------------
>   1 file changed, 53 insertions(+), 25 deletions(-)
> 
> diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
> index 69db254ac7c1..eff664ae6b31 100644
> --- a/util/qemu-thread-win32.c
> +++ b/util/qemu-thread-win32.c
> @@ -272,12 +272,20 @@ void qemu_event_destroy(QemuEvent *ev)
>   void qemu_event_set(QemuEvent *ev)
>   {
>       assert(ev->initialized);
> -    /* qemu_event_set has release semantics, but because it *loads*
> +
> +    /*
> +     * Pairs with memory barrier in qemu_event_reset.
> +     *
> +     * qemu_event_set has release semantics, but because it *loads*
>        * ev->value we need a full memory barrier here.
>        */
>       smp_mb();
>       if (qatomic_read(&ev->value) != EV_SET) {
> -        if (qatomic_xchg(&ev->value, EV_SET) == EV_BUSY) {
> +        int old = qatomic_xchg(&ev->value, EV_SET);
> +
> +        /* Pairs with memory barrier after ResetEvent.  */
> +        smp_mb__after_rmw();
> +        if (old == EV_BUSY) {
>               /* There were waiters, wake them up.  */
>               SetEvent(ev->event);
>           }
> @@ -286,17 +294,19 @@ void qemu_event_set(QemuEvent *ev)
>   
>   void qemu_event_reset(QemuEvent *ev)
>   {
> -    unsigned value;
> -
>       assert(ev->initialized);
> -    value = qatomic_read(&ev->value);
> -    smp_mb_acquire();
> -    if (value == EV_SET) {
> -        /* If there was a concurrent reset (or even reset+wait),
> -         * do nothing.  Otherwise change EV_SET->EV_FREE.
> -         */
> -        qatomic_or(&ev->value, EV_FREE);
> -    }
> +

[had the same thought on patch #2]

IIUC, the "read first" is an optimization to not unconditionally dirty 
the cache-line. But I assume we don't particularly care about that 
optimization on the reset path.

> +    /*
> +     * If there was a concurrent reset (or even reset+wait),
> +     * do nothing.  Otherwise change EV_SET->EV_FREE.
> +     */
> +    qatomic_or(&ev->value, EV_FREE);
> +
> +    /*
> +     * Order reset before checking the condition in the caller.
> +     * Pairs with the first memory barrier in qemu_event_set().
> +     */
> +    smp_mb__after_rmw();
>   }
>   
>   void qemu_event_wait(QemuEvent *ev)
> @@ -304,29 +314,47 @@ void qemu_event_wait(QemuEvent *ev)
>       unsigned value;
>   
>       assert(ev->initialized);
> +
> +    /*
> +     * This read does not have any particular ordering requirements;
> +     * if it moves earlier, we might miss qemu_event_set() and go down the
> +     * slow path unnecessarily, but ultimately the memory barrier below,
> +     * plus the internal synchronization of the kernel event, will ensure
> +     * the check is done correctly.
> +     */
>       value = qatomic_read(&ev->value);
> -    smp_mb_acquire();
>       if (value != EV_SET) {
>           if (value == EV_FREE) {
> -            /* qemu_event_set is not yet going to call SetEvent, but we are
> -             * going to do another check for EV_SET below when setting EV_BUSY.
> -             * At that point it is safe to call WaitForSingleObject.
> +            /*
> +             * Here the underlying kernel event is reset, but qemu_event_set is
> +             * not yet going to call SetEvent.  However, there will be another
> +             * check for EV_SET below when setting EV_BUSY.  At that point it
> +             * is safe to call WaitForSingleObject.
>                */
>               ResetEvent(ev->event);
>   
> -            /* Tell qemu_event_set that there are waiters.  No need to retry
> -             * because there cannot be a concurrent busy->free transition.
> -             * After the CAS, the event will be either set or busy.
> +            /*
> +             * It is not clear whether ResetEvent provides this barrier; kernel
> +             * APIs (KeResetEvent/KeClearEvent) do not.  Better safe than sorry!
> +             */
> +            smp_mb();
> +
> +            /*
> +             * Leave the event reset and tell qemu_event_set that there are
> +             * waiters.  No need to retry, because there cannot be a concurrent
> +             * busy->free transition.  After the CAS, the event will be either
> +             * set or busy.
>                */
>               if (qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET) {
> -                value = EV_SET;
> -            } else {
> -                value = EV_BUSY;
> +                return;
>               }
>           }
> -        if (value == EV_BUSY) {
> -            WaitForSingleObject(ev->event, INFINITE);
> -        }
> +
> +        /*
> +         * ev->value is now EV_BUSY.  Since we didn't observe EV_SET,
> +         * qemu_event_set() must observe EV_BUSY and call SetEvent().
> +         */
> +        WaitForSingleObject(ev->event, INFINITE);
>       }
>   }
>   

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/8] qemu-thread-win32: cleanup, fix, document QemuEvent
  2023-03-06 13:31   ` David Hildenbrand
@ 2023-03-06 14:20     ` Paolo Bonzini
  2023-03-06 14:32       ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-06 14:20 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 3/6/23 14:31, David Hildenbrand wrote:
>>
>> -    smp_mb_acquire();
>> -    if (value == EV_SET) {
>> -        /* If there was a concurrent reset (or even reset+wait),
>> -         * do nothing.  Otherwise change EV_SET->EV_FREE.
>> -         */
>> -        qatomic_or(&ev->value, EV_FREE);
>> -    }
>> +
> 
> [had the same thought on patch #2]
> 
> IIUC, the "read first" is an optimization to not unconditionally dirty 
> the cache-line. But I assume we don't particularly care about that 
> optimization on the reset path.

Thinking more about it, the intended usage of QemuEvent is either

     qemu_event_reset();
     if (!check()) {
          qemu_event_wait());
     }

or

     if (!check()) {
         qemu_event_reset();
         if (!check()) {
              qemu_event_wait());
         }
     }

If we don't care on the reset path we care much less on the wait path. 
Should I remove it and go straight to the cmpxchg, just for peace of mind?

Paolo



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/8] qemu-thread-win32: cleanup, fix, document QemuEvent
  2023-03-06 14:20     ` Paolo Bonzini
@ 2023-03-06 14:32       ` David Hildenbrand
  2023-03-06 15:17         ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 14:32 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 06.03.23 15:20, Paolo Bonzini wrote:
> On 3/6/23 14:31, David Hildenbrand wrote:
>>>
>>> -    smp_mb_acquire();
>>> -    if (value == EV_SET) {
>>> -        /* If there was a concurrent reset (or even reset+wait),
>>> -         * do nothing.  Otherwise change EV_SET->EV_FREE.
>>> -         */
>>> -        qatomic_or(&ev->value, EV_FREE);
>>> -    }
>>> +
>>
>> [had the same thought on patch #2]
>>
>> IIUC, the "read first" is an optimization to not unconditionally dirty
>> the cache-line. But I assume we don't particularly care about that
>> optimization on the reset path.
> 
> Thinking more about it, the intended usage of QemuEvent is either
> 
>       qemu_event_reset();
>       if (!check()) {
>            qemu_event_wait());
>       }
> 
> or
> 
>       if (!check()) {
>           qemu_event_reset();
>           if (!check()) {
>                qemu_event_wait());
>           }
>       }
> 
> If we don't care on the reset path we care much less on the wait path.
> Should I remove it and go straight to the cmpxchg, just for peace of mind?

Sounds reasonable to me at could simplify qemu_event_wait a bit.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/8] qemu-thread-win32: cleanup, fix, document QemuEvent
  2023-03-06 14:32       ` David Hildenbrand
@ 2023-03-06 15:17         ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-06 15:17 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 3/6/23 15:32, David Hildenbrand wrote:
>>
>> Thinking more about it, the intended usage of QemuEvent is either
>>
>>       qemu_event_reset();
>>       if (!check()) {
>>            qemu_event_wait());
>>       }
>>
>> or
>>
>>       if (!check()) {
>>           qemu_event_reset();
>>           if (!check()) {
>>                qemu_event_wait());
>>           }
>>       }
>>
>> If we don't care on the reset path we care much less on the wait path.
>> Should I remove it and go straight to the cmpxchg, just for peace of 
>> mind?
> 
> Sounds reasonable to me at could simplify qemu_event_wait a bit.

Hmm, it does avoid a whole system call in the Windows case, so I prefer 
to keep it.  And I prefer to keep the load-acquire on the fast path too, 
I don't think it's needed in the actual uses of QemuEvent but it's safer 
in case it's used as "just a boolean".

Paolo



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
                   ` (2 preceding siblings ...)
  2023-03-03 17:19 ` [PATCH 3/8] qemu-thread-win32: " Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:14   ` Richard Henderson
                     ` (2 more replies)
  2023-03-03 17:19 ` [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue Paolo Bonzini
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

Ensure ordering between clearing the COMPUTING flag and checking
IRQFACT, and between setting the IRQFACT flag and checking
COMPUTING.  This ensures that no wakeups are lost.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/misc/edu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/hw/misc/edu.c b/hw/misc/edu.c
index e935c418d400..a1f8bc77e770 100644
--- a/hw/misc/edu.c
+++ b/hw/misc/edu.c
@@ -267,6 +267,8 @@ static void edu_mmio_write(void *opaque, hwaddr addr, uint64_t val,
     case 0x20:
         if (val & EDU_STATUS_IRQFACT) {
             qatomic_or(&edu->status, EDU_STATUS_IRQFACT);
+            /* Order check of the COMPUTING flag after setting IRQFACT.  */
+            smp_mb__after_rmw();
         } else {
             qatomic_and(&edu->status, ~EDU_STATUS_IRQFACT);
         }
@@ -349,6 +351,9 @@ static void *edu_fact_thread(void *opaque)
         qemu_mutex_unlock(&edu->thr_mutex);
         qatomic_and(&edu->status, ~EDU_STATUS_COMPUTING);
 
+        /* Clear COMPUTING flag before checking IRQFACT.  */
+        smp_mb__after_rmw();
+
         if (qatomic_read(&edu->status) & EDU_STATUS_IRQFACT) {
             qemu_mutex_lock_iothread();
             edu_raise_irq(edu, FACT_IRQ);
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 4/8] edu: add smp_mb__after_rmw() Paolo Bonzini
@ 2023-03-05 19:14   ` Richard Henderson
  2023-03-06 13:31   ` David Hildenbrand
  2023-03-06 13:38   ` Peter Maydell
  2 siblings, 0 replies; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:14 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> Ensure ordering between clearing the COMPUTING flag and checking
> IRQFACT, and between setting the IRQFACT flag and checking
> COMPUTING.  This ensures that no wakeups are lost.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   hw/misc/edu.c | 5 +++++
>   1 file changed, 5 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 4/8] edu: add smp_mb__after_rmw() Paolo Bonzini
  2023-03-05 19:14   ` Richard Henderson
@ 2023-03-06 13:31   ` David Hildenbrand
  2023-03-06 13:38   ` Peter Maydell
  2 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:31 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> Ensure ordering between clearing the COMPUTING flag and checking
> IRQFACT, and between setting the IRQFACT flag and checking
> COMPUTING.  This ensures that no wakeups are lost.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 4/8] edu: add smp_mb__after_rmw() Paolo Bonzini
  2023-03-05 19:14   ` Richard Henderson
  2023-03-06 13:31   ` David Hildenbrand
@ 2023-03-06 13:38   ` Peter Maydell
  2023-03-06 14:10     ` Paolo Bonzini
  2 siblings, 1 reply; 38+ messages in thread
From: Peter Maydell @ 2023-03-06 13:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, gshan, eesposit, david, stefanha, cohuck, eauger

On Fri, 3 Mar 2023 at 17:21, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Ensure ordering between clearing the COMPUTING flag and checking
> IRQFACT, and between setting the IRQFACT flag and checking
> COMPUTING.  This ensures that no wakeups are lost.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Why is this device even messing around with multiple
threads and atomics anyway ??

I'm quite tempted to suggest we should deprecate-and-drop
this; it's not real hardware, it doesn't do anything
useful, and it's not a good model to follow if you're
implementing some other device.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-06 13:38   ` Peter Maydell
@ 2023-03-06 14:10     ` Paolo Bonzini
  2023-03-06 14:24       ` Peter Maydell
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-06 14:10 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, gshan, eesposit, david, stefanha, cohuck, eauger

On 3/6/23 14:38, Peter Maydell wrote:
> On Fri, 3 Mar 2023 at 17:21, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> Ensure ordering between clearing the COMPUTING flag and checking
>> IRQFACT, and between setting the IRQFACT flag and checking
>> COMPUTING.  This ensures that no wakeups are lost.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Why is this device even messing around with multiple
> threads and atomics anyway ??

Because it is an example of deferring device work to another thread, 
just like on real hardware it may be deferred to an on-device 
microcontroller or CPU.

In particular, in this case I think it is useful to show a pitfall that 
is present in emulated hardware (where all loads and stores ultimately 
go through a store buffer and CPU cache) and not in real hardware (where 
I/O registers are always uncached).

> I'm quite tempted to suggest we should deprecate-and-drop
> this; it's not real hardware, it doesn't do anything
> useful, and it's not a good model to follow if you're
> implementing some other device.

I'm okay with deprecating it but I think it has educational value.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-06 14:10     ` Paolo Bonzini
@ 2023-03-06 14:24       ` Peter Maydell
  2023-03-06 15:06         ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Maydell @ 2023-03-06 14:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, gshan, eesposit, david, stefanha, cohuck, eauger

On Mon, 6 Mar 2023 at 14:10, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/6/23 14:38, Peter Maydell wrote:
> > On Fri, 3 Mar 2023 at 17:21, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>
> >> Ensure ordering between clearing the COMPUTING flag and checking
> >> IRQFACT, and between setting the IRQFACT flag and checking
> >> COMPUTING.  This ensures that no wakeups are lost.
> >>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >
> > Why is this device even messing around with multiple
> > threads and atomics anyway ??
>
> Because it is an example of deferring device work to another thread,
> just like on real hardware it may be deferred to an on-device
> microcontroller or CPU.

If we want to be able to do that, we should probably have
infrastructure and higher-level primitives for it that
don't require device authors to be super-familiar with
QEMU's memory model and barriers... The fact there are only
half a dozen other uses of qemu_thread_create() under hw/
suggests that in practice we don't really need to do this
very often, though.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-06 14:24       ` Peter Maydell
@ 2023-03-06 15:06         ` Paolo Bonzini
  2023-03-06 15:36           ` Peter Maydell
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-06 15:06 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, gshan, eesposit, david, stefanha, cohuck, eauger

On 3/6/23 15:24, Peter Maydell wrote:
>>> Why is this device even messing around with multiple
>>> threads and atomics anyway ??
>> Because it is an example of deferring device work to another thread,
>> just like on real hardware it may be deferred to an on-device
>> microcontroller or CPU.
> If we want to be able to do that, we should probably have
> infrastructure and higher-level primitives for it that
> don't require device authors to be super-familiar with
> QEMU's memory model and barriers... The fact there are only
> half a dozen other uses of qemu_thread_create() under hw/
> suggests that in practice we don't really need to do this
> very often, though.

Yes, you're totally right about that.  I have never needed this kind of 
higher-level primitive so I haven't thought much about what it would 
look like.

The usage of barriers isn't great, but all this patch does is correctness...

Paolo



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/8] edu: add smp_mb__after_rmw()
  2023-03-06 15:06         ` Paolo Bonzini
@ 2023-03-06 15:36           ` Peter Maydell
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Maydell @ 2023-03-06 15:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, gshan, eesposit, david, stefanha, cohuck, eauger

On Mon, 6 Mar 2023 at 15:06, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/6/23 15:24, Peter Maydell wrote:
> >>> Why is this device even messing around with multiple
> >>> threads and atomics anyway ??
> >> Because it is an example of deferring device work to another thread,
> >> just like on real hardware it may be deferred to an on-device
> >> microcontroller or CPU.
> > If we want to be able to do that, we should probably have
> > infrastructure and higher-level primitives for it that
> > don't require device authors to be super-familiar with
> > QEMU's memory model and barriers... The fact there are only
> > half a dozen other uses of qemu_thread_create() under hw/
> > suggests that in practice we don't really need to do this
> > very often, though.
>
> Yes, you're totally right about that.  I have never needed this kind of
> higher-level primitive so I haven't thought much about what it would
> look like.
>
> The usage of barriers isn't great, but all this patch does is correctness...

Yeah, I'm not objecting to this patch, to be clear. It's just
that it brought the edu device to my negative attention :-)

-- PMM


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
                   ` (3 preceding siblings ...)
  2023-03-03 17:19 ` [PATCH 4/8] edu: add smp_mb__after_rmw() Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:32   ` Richard Henderson
  2023-03-03 17:19 ` [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw() Paolo Bonzini
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

There is no implicit memory barrier in qatomic_fetch_or() and
atomic_fetch_and() on ARM systems.  Add an explicit
smp_mb__after_rmw() to match the intended semantics.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/async.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/util/async.c b/util/async.c
index 0657b7539777..6129f2c991cb 100644
--- a/util/async.c
+++ b/util/async.c
@@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
     unsigned old_flags;
 
     /*
-     * The memory barrier implicit in qatomic_fetch_or makes sure that:
-     * 1. idle & any writes needed by the callback are done before the
-     *    locations are read in the aio_bh_poll.
+     * The memory barrier makes sure that:
+     * 1. any writes needed by the callback are visible from the callback
+     *    after aio_bh_dequeue() returns bh.
      * 2. ctx is loaded before the callback has a chance to execute and bh
      *    could be freed.
      */
     old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
+    smp_mb__after_rmw();
+
     if (!(old_flags & BH_PENDING)) {
         QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
     }
@@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
     QSLIST_REMOVE_HEAD(head, next);
 
     /*
-     * The qatomic_and is paired with aio_bh_enqueue().  The implicit memory
-     * barrier ensures that the callback sees all writes done by the scheduling
+     * The memory barrier is paired with aio_bh_enqueue() and it
+     * ensures that the callback sees all writes done by the scheduling
      * thread.  It also ensures that the scheduling thread sees the cleared
      * flag before bh->cb has run, and thus will call aio_notify again if
      * necessary.
      */
     *flags = qatomic_fetch_and(&bh->flags,
                               ~(BH_PENDING | BH_SCHEDULED | BH_IDLE));
+    smp_mb__after_rmw();
     return bh;
 }
 
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue
  2023-03-03 17:19 ` [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue Paolo Bonzini
@ 2023-03-05 19:32   ` Richard Henderson
  2023-03-06  9:55     ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:32 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> There is no implicit memory barrier in qatomic_fetch_or() and
> atomic_fetch_and() on ARM systems.  Add an explicit
> smp_mb__after_rmw() to match the intended semantics.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   util/async.c | 13 ++++++++-----
>   1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 0657b7539777..6129f2c991cb 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -74,13 +74,15 @@ static void aio_bh_enqueue(QEMUBH *bh, unsigned new_flags)
>       unsigned old_flags;
>   
>       /*
> -     * The memory barrier implicit in qatomic_fetch_or makes sure that:
> -     * 1. idle & any writes needed by the callback are done before the
> -     *    locations are read in the aio_bh_poll.
> +     * The memory barrier makes sure that:
> +     * 1. any writes needed by the callback are visible from the callback
> +     *    after aio_bh_dequeue() returns bh.
>        * 2. ctx is loaded before the callback has a chance to execute and bh
>        *    could be freed.
>        */
>       old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
> +    smp_mb__after_rmw();
> +
>       if (!(old_flags & BH_PENDING)) {
>           QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
>       }
> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)
>       QSLIST_REMOVE_HEAD(head, next);
>   
>       /*
> -     * The qatomic_and is paired with aio_bh_enqueue().  The implicit memory
> -     * barrier ensures that the callback sees all writes done by the scheduling
> +     * The memory barrier is paired with aio_bh_enqueue() and it
> +     * ensures that the callback sees all writes done by the scheduling
>        * thread.  It also ensures that the scheduling thread sees the cleared
>        * flag before bh->cb has run, and thus will call aio_notify again if
>        * necessary.
>        */

Is it really paired with aio_bh_enqueue?

Seems like the enqueue barrier really is for aio_bh_poll, and the dequeue barrier is for 
the callback.

While the comments seem off, the code seems correct.  If you agree:
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue
  2023-03-05 19:32   ` Richard Henderson
@ 2023-03-06  9:55     ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-06  9:55 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/5/23 20:32, Richard Henderson wrote:
> On 3/3/23 09:19, Paolo Bonzini wrote:
>>       unsigned old_flags;
>>       /*
>>        * The memory barrier makes sure that:
>>        * 1. any writes needed by the callback are visible from the callback
>>        *    after aio_bh_dequeue() returns bh.
>>        * 2. ctx is loaded before the callback has a chance to execute and bh
>>        *    could be freed.
>>        */
>>       old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
>> +    smp_mb__after_rmw();
>> +
>>       if (!(old_flags & BH_PENDING)) {
>>           QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
>>       }
>> @@ -107,14 +109,15 @@ static QEMUBH *aio_bh_dequeue(BHList *head, 
>> unsigned *flags)
>>       QSLIST_REMOVE_HEAD(head, next);
>>       /*
>>        * The memory barrier is paired with aio_bh_enqueue() and it
>>        * ensures that the callback sees all writes done by the scheduling
>>        * thread.  It also ensures that the scheduling thread sees the cleared
>>        * flag before bh->cb has run, and thus will call aio_notify again if
>>        * necessary.
>>        */
> 
> Is it really paired with aio_bh_enqueue?
> 
> Seems like the enqueue barrier really is for aio_bh_poll, and the 
> dequeue barrier is for the callback.

The documentation has been quite obsolete since the introduction of
bh_list.  The logic is as follows:

aio_bh_enqueue()
   load bh->ctx
   set PENDING flag                                          // (0)
   if flag was not set
     // bh becomes visible to dequeuing thread here:
     insert into bh_list                                     // (1)

   aio_notify
     // Write bh->flags and bh_list before ctx->notified
     smp_wmb()                                               // (2)
     set notified to true
     // Write notified before reading notify_me
     smp_mb()                                                // (3)
     if notify_me then event_notifier_set()

and on the dequeue side it's tricky to see why all barriers are
needed; it boils down to the busy-wait polling done at the beginning
of aio_poll():

aio_poll()
   compute approximate timeout (*unordered* read of bh_list)

   enable notify_me
   // Write notify_me before reading notified
   smp_mb()                                // synchronizes with (3)
   if notified then timeout = 0
   ctx->fdmon_ops->wait(timeout)

   disable notify_me
   aio_notify_accept()
     set notified to false
     // Write ctx->notified before reading e.g. bh_list
     smp_mb()                              // synchronizes with (2)

   aio_bh_poll()
     QSLIST_MOVE_ATOMIC                    // synchronizes with (1)
     aio_bh_dequeue
       remove from head
       clear PENDING/SCHEDULED/IDLE        // synchronizes with (0)
     if flags were set
       aio_bh_call


That is:

for synchronization point (0)
- the main function here is to ensure that aio_bh_dequeue() removes
   from the list before the PENDING flag is cleared, otherwise enqueuers can
   clobber the list, and vice versa for the enqueuers.  For this, it is enough
   that qatomic_fetch_and() is "release" and qatomic_fetch_or() is "acquire"
   (and they are).

for synchronization point (1)
- the bottom half machinery between aio_bh_enqueue and aio_bh_poll only
   needs release-to-acquire synchronization, and that is provided by (1).
   This is also where ordering ensures that bh->ctx is loaded before the
   callback executes (which could free bh).

- between enqueuers, setting the PENDING flag only needs to be done before
   inserting into bh_list to avoid double insertion (and of course needs to
   be done atomically); for this purpose, QSLIST_INSERT_HEAD_ATOMIC needs to
   be "release" (which it is).

Also the call to aio_notify() in aio_bh_enqueue() is unconditional, so
aio_bh_dequeue() need not protect against missed calls to aio_notify().
aio_notify() only synchronizes with aio_poll() and aio_notify_accept().

for synchronization point (2)
- This cannot be just a smp_rmb() because the timeout that was computed
   earlier was not ordered against either notified or notify_me.

- This is a smp_mb() for generality; as far as bh_list is
   concerned it could be smp_mb__before_rmw().

Synchronization point (3) is pretty mundane in comparison.

So I'm dropping this patch; I have prepared a documentation patch instead.

Paolo



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw()
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
                   ` (4 preceding siblings ...)
  2023-03-03 17:19 ` [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:32   ` Richard Henderson
                     ` (2 more replies)
  2023-03-03 17:19 ` [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw() Paolo Bonzini
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

The barrier comes after an atomic increment, so it is enough to use
smp_mb__after_rmw(); this avoids a double barrier on x86 systems.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/block/aio-wait.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index dd9a7f6461ef..da13357bb8cf 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -85,7 +85,7 @@ extern AioWait global_aio_wait;
     /* Increment wait_->num_waiters before evaluating cond. */     \
     qatomic_inc(&wait_->num_waiters);                              \
     /* Paired with smp_mb in aio_wait_kick(). */                   \
-    smp_mb();                                                      \
+    smp_mb__after_rmw();                                           \
     if (ctx_ && in_aio_context_home_thread(ctx_)) {                \
         while ((cond)) {                                           \
             aio_poll(ctx_, true);                                  \
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw() Paolo Bonzini
@ 2023-03-05 19:32   ` Richard Henderson
  2023-03-06 13:32   ` David Hildenbrand
  2023-03-06 14:38   ` Stefan Hajnoczi
  2 siblings, 0 replies; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:32 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> The barrier comes after an atomic increment, so it is enough to use
> smp_mb__after_rmw(); this avoids a double barrier on x86 systems.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   include/block/aio-wait.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw() Paolo Bonzini
  2023-03-05 19:32   ` Richard Henderson
@ 2023-03-06 13:32   ` David Hildenbrand
  2023-03-06 14:38   ` Stefan Hajnoczi
  2 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:32 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> The barrier comes after an atomic increment, so it is enough to use
> smp_mb__after_rmw(); this avoids a double barrier on x86 systems.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   include/block/aio-wait.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index dd9a7f6461ef..da13357bb8cf 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -85,7 +85,7 @@ extern AioWait global_aio_wait;
>       /* Increment wait_->num_waiters before evaluating cond. */     \
>       qatomic_inc(&wait_->num_waiters);                              \
>       /* Paired with smp_mb in aio_wait_kick(). */                   \
> -    smp_mb();                                                      \
> +    smp_mb__after_rmw();                                           \
>       if (ctx_ && in_aio_context_home_thread(ctx_)) {                \
>           while ((cond)) {                                           \
>               aio_poll(ctx_, true);                                  \

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw() Paolo Bonzini
  2023-03-05 19:32   ` Richard Henderson
  2023-03-06 13:32   ` David Hildenbrand
@ 2023-03-06 14:38   ` Stefan Hajnoczi
  2 siblings, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2023-03-06 14:38 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, gshan, eesposit, david, cohuck, eauger

[-- Attachment #1: Type: text/plain, Size: 409 bytes --]

On Fri, Mar 03, 2023 at 06:19:37PM +0100, Paolo Bonzini wrote:
> The barrier comes after an atomic increment, so it is enough to use
> smp_mb__after_rmw(); this avoids a double barrier on x86 systems.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  include/block/aio-wait.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw()
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
                   ` (5 preceding siblings ...)
  2023-03-03 17:19 ` [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw() Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:36   ` Richard Henderson
  2023-03-06 13:33   ` David Hildenbrand
  2023-03-03 17:19 ` [PATCH 8/8] physmem: add missing memory barrier Paolo Bonzini
  2023-03-06 13:35 ` [PATCH 0/8] Fix missing memory barriers on ARM David Hildenbrand
  8 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

mutex->from_push and mutex->handoff in qemu-coroutine-lock implement
the familiar pattern:

   write a                                  write b
   smp_mb()                                 smp_mb()
   read b                                   read a

The memory barrier is required by the C memory model even after a
SEQ_CST read-modify-write operation such as QSLIST_INSERT_HEAD_ATOMIC.
Add it and avoid the unclear qatomic_mb_read() operation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 util/qemu-coroutine-lock.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index 58f3f771817b..84a50a9e9117 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -201,10 +201,16 @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(AioContext *ctx,
     trace_qemu_co_mutex_lock_entry(mutex, self);
     push_waiter(mutex, &w);
 
+    /*
+     * Add waiter before reading mutex->handoff.  Pairs with qatomic_mb_set
+     * in qemu_co_mutex_unlock.
+     */
+    smp_mb__after_rmw();
+
     /* This is the "Responsibility Hand-Off" protocol; a lock() picks from
      * a concurrent unlock() the responsibility of waking somebody up.
      */
-    old_handoff = qatomic_mb_read(&mutex->handoff);
+    old_handoff = qatomic_read(&mutex->handoff);
     if (old_handoff &&
         has_waiters(mutex) &&
         qatomic_cmpxchg(&mutex->handoff, old_handoff, 0) == old_handoff) {
@@ -303,6 +309,7 @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
         }
 
         our_handoff = mutex->sequence;
+        /* Set handoff before checking for waiters.  */
         qatomic_mb_set(&mutex->handoff, our_handoff);
         if (!has_waiters(mutex)) {
             /* The concurrent lock has not added itself yet, so it
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw() Paolo Bonzini
@ 2023-03-05 19:36   ` Richard Henderson
  2023-03-06 13:33   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:36 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> mutex->from_push and mutex->handoff in qemu-coroutine-lock implement
> the familiar pattern:
> 
>     write a                                  write b
>     smp_mb()                                 smp_mb()
>     read b                                   read a
> 
> The memory barrier is required by the C memory model even after a
> SEQ_CST read-modify-write operation such as QSLIST_INSERT_HEAD_ATOMIC.
> Add it and avoid the unclear qatomic_mb_read() operation.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> ---
>   util/qemu-coroutine-lock.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw()
  2023-03-03 17:19 ` [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw() Paolo Bonzini
  2023-03-05 19:36   ` Richard Henderson
@ 2023-03-06 13:33   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:33 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> mutex->from_push and mutex->handoff in qemu-coroutine-lock implement
> the familiar pattern:
> 
>     write a                                  write b
>     smp_mb()                                 smp_mb()
>     read b                                   read a
> 
> The memory barrier is required by the C memory model even after a
> SEQ_CST read-modify-write operation such as QSLIST_INSERT_HEAD_ATOMIC.
> Add it and avoid the unclear qatomic_mb_read() operation.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   util/qemu-coroutine-lock.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
> index 58f3f771817b..84a50a9e9117 100644
> --- a/util/qemu-coroutine-lock.c
> +++ b/util/qemu-coroutine-lock.c
> @@ -201,10 +201,16 @@ static void coroutine_fn qemu_co_mutex_lock_slowpath(AioContext *ctx,
>       trace_qemu_co_mutex_lock_entry(mutex, self);
>       push_waiter(mutex, &w);
>   
> +    /*
> +     * Add waiter before reading mutex->handoff.  Pairs with qatomic_mb_set
> +     * in qemu_co_mutex_unlock.
> +     */
> +    smp_mb__after_rmw();
> +
>       /* This is the "Responsibility Hand-Off" protocol; a lock() picks from
>        * a concurrent unlock() the responsibility of waking somebody up.
>        */
> -    old_handoff = qatomic_mb_read(&mutex->handoff);
> +    old_handoff = qatomic_read(&mutex->handoff);
>       if (old_handoff &&
>           has_waiters(mutex) &&
>           qatomic_cmpxchg(&mutex->handoff, old_handoff, 0) == old_handoff) {
> @@ -303,6 +309,7 @@ void coroutine_fn qemu_co_mutex_unlock(CoMutex *mutex)
>           }
>   
>           our_handoff = mutex->sequence;
> +        /* Set handoff before checking for waiters.  */
>           qatomic_mb_set(&mutex->handoff, our_handoff);
>           if (!has_waiters(mutex)) {
>               /* The concurrent lock has not added itself yet, so it

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 8/8] physmem: add missing memory barrier
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
                   ` (6 preceding siblings ...)
  2023-03-03 17:19 ` [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw() Paolo Bonzini
@ 2023-03-03 17:19 ` Paolo Bonzini
  2023-03-05 19:40   ` Richard Henderson
  2023-03-06 13:34   ` David Hildenbrand
  2023-03-06 13:35 ` [PATCH 0/8] Fix missing memory barriers on ARM David Hildenbrand
  8 siblings, 2 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-03 17:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: gshan, eesposit, david, stefanha, cohuck, eauger

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 softmmu/physmem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 47143edb4f6c..a6efd8e8dd11 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -2927,6 +2927,8 @@ void cpu_register_map_client(QEMUBH *bh)
     qemu_mutex_lock(&map_client_list_lock);
     client->bh = bh;
     QLIST_INSERT_HEAD(&map_client_list, client, link);
+    /* Write map_client_list before reading in_use.  */
+    smp_mb();
     if (!qatomic_read(&bounce.in_use)) {
         cpu_notify_map_clients_locked();
     }
@@ -3116,6 +3118,7 @@ void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
     qemu_vfree(bounce.buffer);
     bounce.buffer = NULL;
     memory_region_unref(bounce.mr);
+    /* Clear in_use before reading map_client_list.  */
     qatomic_mb_set(&bounce.in_use, false);
     cpu_notify_map_clients();
 }
-- 
2.39.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/8] physmem: add missing memory barrier
  2023-03-03 17:19 ` [PATCH 8/8] physmem: add missing memory barrier Paolo Bonzini
@ 2023-03-05 19:40   ` Richard Henderson
  2023-03-06 13:34   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Richard Henderson @ 2023-03-05 19:40 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: gshan, eesposit, david, stefanha, cohuck, eauger

On 3/3/23 09:19, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   softmmu/physmem.c | 3 +++
>   1 file changed, 3 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~

> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 47143edb4f6c..a6efd8e8dd11 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -2927,6 +2927,8 @@ void cpu_register_map_client(QEMUBH *bh)
>       qemu_mutex_lock(&map_client_list_lock);
>       client->bh = bh;
>       QLIST_INSERT_HEAD(&map_client_list, client, link);
> +    /* Write map_client_list before reading in_use.  */
> +    smp_mb();
>       if (!qatomic_read(&bounce.in_use)) {
>           cpu_notify_map_clients_locked();
>       }
> @@ -3116,6 +3118,7 @@ void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
>       qemu_vfree(bounce.buffer);
>       bounce.buffer = NULL;
>       memory_region_unref(bounce.mr);
> +    /* Clear in_use before reading map_client_list.  */
>       qatomic_mb_set(&bounce.in_use, false);
>       cpu_notify_map_clients();
>   }



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 8/8] physmem: add missing memory barrier
  2023-03-03 17:19 ` [PATCH 8/8] physmem: add missing memory barrier Paolo Bonzini
  2023-03-05 19:40   ` Richard Henderson
@ 2023-03-06 13:34   ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:34 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/8] Fix missing memory barriers on ARM
  2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
                   ` (7 preceding siblings ...)
  2023-03-03 17:19 ` [PATCH 8/8] physmem: add missing memory barrier Paolo Bonzini
@ 2023-03-06 13:35 ` David Hildenbrand
  2023-03-06 14:14   ` Paolo Bonzini
  8 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2023-03-06 13:35 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 03.03.23 18:19, Paolo Bonzini wrote:
> This series fixes more instances of the problem fixed by commits
> 5710a3e09f9b ("async: use explicit memory barriers", 2020-04-09) and
> 7455ff1aa015 ("aio_wait_kick: add missing memory barrier", 2022-06-24).
> This is an interesting case where ARM's memory ordering is somewhat
> stronger than what you would expect.
> 
> On ARM, seqcst loads and stores (which QEMU does not use) are compiled
> respectively as LDAR and STLR instructions.  Even though STLR is also
> used for store-release operations, STLR followed by LDAR provides
> store-against-load ordering, which is stronger than a store-release.
> Compare this to ARMv7, where store-release is DMB+STR and store-seqcst
> is DMB+STR+DMB.
> 
> This means that on ARM a sequence of
> 
>    qatomic_store_release(&y, ...);         // STLR
>    a = qatomic_load_acquire(&x);           // LDAR
> 
> provides stronger ordering at the processor level than the two MOV
> instructions you'd get on x86.
> 
> Likewise, on ARM sequentially consistent read-modify-write operations only
> need to use LDAXR and STLXR respectively for the load and the store, which
> is weaker than the LOCK prefix used on x86.
> 
> In a strange twist of events, however, the _stronger_ semantics
> of the ARM instructions can end up causing bugs on ARM, not on x86.
> The problems occur when seqcst atomics are mixed with relaxed atomics.
> Here is how the two are compiled on ARM:
> 
>                     load              store
>     relaxed         LDR               STR
>     seqcst          LDAR              STLR
> 
> QEMU's atomics try to bridge the Linux API (that most of the developers
> are familiar with) and the C11 API, and the two have a substantial
> difference:
> 
> - in Linux, strongly-ordered atomics such as atomic_add_return() affect
>    the global ordering of _all_ memory operations, including for example
>    READ_ONCE()/WRITE_ONCE()
> 
> - in C11, sequentially consistent atomics (except for seqcst fences)
>    only affect the ordering of sequentially consistent operations.
>    In particular, since relaxed loads are done with LDR on ARM, they are
>    not ordered against seqcst stores (which are done with STLR).
> 
> QEMU implements high-level synchronization primitives with the idea that
> the primitives contain the necessary memory barriers, and the callers can
> use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
> This is very much incompatible with the C11 view that seqcst accesses
> are only ordered against other seqcst accesses, and requires using
> seqcst fences as in the following example:
> 
>     qatomic_set(&y, 1);            qatomic_set(&x, 1);
>     smp_mb();                      smp_mb();
>     ... qatomic_read(&x) ...       ... qatomic_read(&y) ...
> 
> Bugs ensue when a qatomic_*() read-modify write operation is used instead
> of one or both stores, for example:
> 
>     qatomic_<rmw>(&y, ...);
>     smp_mb();
>     ... qatomic_read(&x) ...
> 
> Developers that are more familiar with the Linux API may be tempted
> to omit the smp_mb() and that's exactly what yours truly did in
> qemu_event_set() and qemu_event_reset().  After a27dd2de68f3 ("KVM:
> keep track of running ioctls", 2023-01-11), this caused hangs due to
> threads sleeping forever in qemu_event_wait().
> 
> This _could_ also have been the cause of occasional hangs of rcutorture,
> though I have not observed them personally.
> 
> (As an aside, GCC's older __sync_* builtins *did* provide a full barrier
> between the RMW operation and loads on the side of the operation.  The
> difference between seqcst C11 atomics and __sync_* atomics is exactly
> an extra memory barrier after the STLXR instruction).
> 
> In order to fix this, while avoiding worse performance from having two
> back-to-back memory barriers on x86, patch 1 introduces optimized
> memory barriers smp_mb__before_rmw() and smp_mb__after_rmw().  The usage
> is similar to Linux's smp_mb__before/after_atomic(), but the name is
> different because they affect _all_ RMW operations.  On Linux, instead,
> they are not needed around those RMW operations that return the old value.
> 
> The remaining patches add them everywhere they are needed.  In the
> case of QemuEvent (patches 2-3), I reviewed the algorithm thoroughly,
> dropping barriers that were not necessary and killing optimizations that
> I wasn't entirely sure about.  For the other cases, instead, the changes
> are minimal.
> 
> Note: I have a follow-up set of patches that gets rid completely of
> atomic_mb_read(); atomic_mb_set() instead can remain and mimic Linux's
> smp_store_mb() operation.  A glimpse of these changes is already visible
> in patches 7 and 8.

That sounds interesting. I was briefly confused about atomic_mb_* ...


... do we want to add some Fixes: tags or is it too hard to come up with 
something reasonable?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/8] Fix missing memory barriers on ARM
  2023-03-06 13:35 ` [PATCH 0/8] Fix missing memory barriers on ARM David Hildenbrand
@ 2023-03-06 14:14   ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2023-03-06 14:14 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel; +Cc: gshan, eesposit, stefanha, cohuck, eauger

On 3/6/23 14:35, David Hildenbrand wrote:
>> Note: I have a follow-up set of patches that gets rid completely of
>> atomic_mb_read(); atomic_mb_set() instead can remain and mimic Linux's
>> smp_store_mb() operation.  A glimpse of these changes is already visible
>> in patches 7 and 8.
> 
> That sounds interesting. I was briefly confused about atomic_mb_* ...
> 
> ... do we want to add some Fixes: tags or is it too hard to come up with 
> something reasonable?

The Fixes tag would often point to

     commit a0aa44b488b3601415d55041e4619aef5f3a4ba8
     Author: Alex Bennée <alex.bennee@linaro.org>
     Date:   Thu Jan 28 10:15:17 2016 +0000

     include/qemu/atomic.h: default to __atomic functions
     
     The __atomic primitives have been available since GCC 4.7 and provide
     a richer interface for describing memory ordering requirements. As a
     bonus by using the primitives instead of hand-rolled functions we can
     use tools such as the ThreadSanitizer which need the use of well
     defined APIs for its analysis.
     
     If we have __ATOMIC defines we exclusively use the __atomic primitives
     for all our atomic access. Otherwise we fall back to the mixture of
     __sync and hand-rolled barrier cases.

(for which I would have sworn I was the sole perpetrator, not just the
committer).  But it pre-dates some of the code that is being fixed, so
I am not sure it makes sense to add it.

Paolo



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2023-03-06 16:02 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-03 17:19 [PATCH 0/8] Fix missing memory barriers on ARM Paolo Bonzini
2023-03-03 17:19 ` [PATCH 1/8] qatomic: add smp_mb__before/after_rmw() Paolo Bonzini
2023-03-05 18:57   ` Richard Henderson
2023-03-05 21:00     ` Paolo Bonzini
2023-03-06 13:21   ` David Hildenbrand
2023-03-06 13:22     ` David Hildenbrand
2023-03-03 17:19 ` [PATCH 2/8] qemu-thread-posix: cleanup, fix, document QemuEvent Paolo Bonzini
2023-03-05 19:11   ` Richard Henderson
2023-03-06 13:28   ` David Hildenbrand
2023-03-03 17:19 ` [PATCH 3/8] qemu-thread-win32: " Paolo Bonzini
2023-03-05 19:14   ` Richard Henderson
2023-03-06 13:31   ` David Hildenbrand
2023-03-06 14:20     ` Paolo Bonzini
2023-03-06 14:32       ` David Hildenbrand
2023-03-06 15:17         ` Paolo Bonzini
2023-03-03 17:19 ` [PATCH 4/8] edu: add smp_mb__after_rmw() Paolo Bonzini
2023-03-05 19:14   ` Richard Henderson
2023-03-06 13:31   ` David Hildenbrand
2023-03-06 13:38   ` Peter Maydell
2023-03-06 14:10     ` Paolo Bonzini
2023-03-06 14:24       ` Peter Maydell
2023-03-06 15:06         ` Paolo Bonzini
2023-03-06 15:36           ` Peter Maydell
2023-03-03 17:19 ` [PATCH 5/8] util/async: add smp_mb__after_rmw() around BH enqueue/dequeue Paolo Bonzini
2023-03-05 19:32   ` Richard Henderson
2023-03-06  9:55     ` Paolo Bonzini
2023-03-03 17:19 ` [PATCH 6/8] aio-wait: switch to smp_mb__after_rmw() Paolo Bonzini
2023-03-05 19:32   ` Richard Henderson
2023-03-06 13:32   ` David Hildenbrand
2023-03-06 14:38   ` Stefan Hajnoczi
2023-03-03 17:19 ` [PATCH 7/8] qemu-coroutine-lock: add smp_mb__after_rmw() Paolo Bonzini
2023-03-05 19:36   ` Richard Henderson
2023-03-06 13:33   ` David Hildenbrand
2023-03-03 17:19 ` [PATCH 8/8] physmem: add missing memory barrier Paolo Bonzini
2023-03-05 19:40   ` Richard Henderson
2023-03-06 13:34   ` David Hildenbrand
2023-03-06 13:35 ` [PATCH 0/8] Fix missing memory barriers on ARM David Hildenbrand
2023-03-06 14:14   ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).