[Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant
@ 2013-06-16 11:21 Liu Ping Fan
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations Liu Ping Fan
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Liu Ping Fan @ 2013-06-16 11:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Anthony Liguori

When trying out of QBL, we badly require more fine defined barrier and atomic ops, so
I repost Paolo's atomic patch which fetched github.com/bonzini/qemu.git rcu
(thanks Paolo, with your urcu patches, things go more easily)


v1->v2:
  more fine document
  introduce another wmb for qemu_bh_schedule and correspoing rmb
  

Liu Ping Fan (1):
  QEMUBH: make AioContext's bh re-entrant

Paolo Bonzini (1):
  add a header file for atomic operations

 async.c                  |  21 ++++
 docs/atomics.txt         | 322 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/display/qxl.c         |   3 +-
 hw/virtio/vhost.c        |   9 +-
 include/block/aio.h      |   3 +
 include/qemu/atomic.h    | 223 +++++++++++++++++++++++++++-----
 migration.c              |   3 +-
 tests/test-thread-pool.c |   8 +-
 8 files changed, 548 insertions(+), 44 deletions(-)
 create mode 100644 docs/atomics.txt

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations
  2013-06-16 11:21 [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant Liu Ping Fan
@ 2013-06-16 11:21 ` Liu Ping Fan
  2013-06-17 18:57   ` Richard Henderson
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant Liu Ping Fan
  2013-06-17  7:11 ` [Qemu-devel] [PATCH v2 0/2] " Paolo Bonzini
  2 siblings, 1 reply; 41+ messages in thread
From: Liu Ping Fan @ 2013-06-16 11:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Anthony Liguori

From: Paolo Bonzini <pbonzini@redhat.com>

We're already using them in several places, but __sync builtins are just
too ugly to type, and do not provide seqcst load/store operations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/atomics.txt         | 322 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/display/qxl.c         |   3 +-
 hw/virtio/vhost.c        |   9 +-
 include/qemu/atomic.h    | 223 +++++++++++++++++++++++++++-----
 migration.c              |   3 +-
 tests/test-thread-pool.c |   8 +-
 6 files changed, 524 insertions(+), 44 deletions(-)
 create mode 100644 docs/atomics.txt

diff --git a/docs/atomics.txt b/docs/atomics.txt
new file mode 100644
index 0000000..aa2f378
--- /dev/null
+++ b/docs/atomics.txt
@@ -0,0 +1,322 @@
+CPUs perform independent memory operations effectively in random order.
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest).  Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking.  Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization.  Anything else is considerably harder, but it's
+also justified more often than one would like.  The two tools that
+are provided by qemu/atomic.h are memory barriers and atomic operations.
+
+Macros defined by qemu/atomic.h fall in three camps:
+
+- compiler barriers: barrier();
+
+- weak atomic access and manual memory barriers: atomic_read(),
+  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
+
+- sequentially consistent atomic access: everything else.
+
+
+COMPILER MEMORY BARRIER
+=======================
+
+barrier() prevents the compiler from moving the memory accesses either
+side of it to the other side.  The compiler barrier has no direct effect
+on the CPU, which may then reorder things however it wishes.
+
+barrier() is mostly used within qemu/atomic.h itself.  On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution.  In this
+case, qemu/atomic.h will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, barrier() can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+SEQUENTIALLY CONSISTENT ATOMIC ACCESS
+=====================================
+
+Most of the operations in the qemu/atomic.h header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+qemu/atomic.h provides the following set of atomic read-modify-write
+operations:
+
+    typeof(*ptr) atomic_inc(ptr)
+    typeof(*ptr) atomic_dec(ptr)
+    typeof(*ptr) atomic_add(ptr, val)
+    typeof(*ptr) atomic_sub(ptr, val)
+    typeof(*ptr) atomic_and(ptr, val)
+    typeof(*ptr) atomic_or(ptr, val)
+    typeof(*ptr) atomic_xchg(ptr, val
+    typeof(*ptr) atomic_cmpxchg(ptr, old, new)
+
+all of which return the old value of *ptr.  These operations are
+polymorphic; they operate on any type that is as wide as an int.
+
+It also provides the following two operations for sequentially consistent
+loads and stores:
+
+    typeof(*ptr) atomic_mb_read(ptr)
+    void         atomic_mb_set(ptr, val)
+
+These operations operate on any type that is as wide as an int or smaller.
+
+
+WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+In general, if the algorithm you are writing includes both writes
+and reads on the same side, it is generally simpler to use sequentially
+consistent primitives.
+
+When using this model, variables are accessed with atomic_read() and
+atomic_set(), and restrictions to the ordering of accesses is enforced
+using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
+memory barriers.
+
+atomic_read() and atomic_set() prevents the compiler from using
+optimizations that might otherwise optimize accesses out of existence
+on the one hand, or that might create unsolicited accesses on the other.
+In general this should not have any effect, because the same compiler
+barriers are already implied by memory barriers.  However, it is useful
+to do so, because it tells readers which variables are shared with
+other threads, and which are local to the current thread or protected
+by other, more mundane means.
+
+Memory barriers control the order of references to shared memory.
+They come in four kinds:
+
+- smp_rmb() guarantees that all the LOAD operations specified before
+  the barrier will appear to happen before all the LOAD operations
+  specified after the barrier with respect to the other components of
+  the system.
+
+  In other words, smp_rmb() puts a partial ordering on loads, but is not
+  required to have any effect on stores.
+
+- smp_wmb() guarantees that all the STORE operations specified before
+  the barrier will appear to happen before all the STORE operations
+  specified after the barrier with respect to the other components of
+  the system.
+
+  In other words, smp_wmb() puts a partial ordering on stores, but is not
+  required to have any effect on loads.
+
+- smp_mb() guarantees that all the LOAD and STORE operations specified
+  before the barrier will appear to happen before all the LOAD and
+  STORE operations specified after the barrier with respect to the other
+  components of the system.
+
+  smp_mb() puts a partial ordering on both loads and stores.  It is
+  stronger than both a read and a write memory barrier; it implies both
+  smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
+  barrier from overtaking LOADs coming after the barrier and vice versa.
+
+- smp_read_barrier_depends() is a weaker kind of read barrier.  On
+  most processors, whenever two loads are performed such that the
+  second depends on the result of the first (e.g., the first load
+  retrieves the address to which the second load will be directed),
+  the processor will guarantee that the first LOAD will appear to happen
+  before the second with respect to the other components of the system.
+  However, this is not always true---for example, it was not true on
+  Alpha processors.  Whenever this kind of access happens to shared
+  memory (that is not protected by a lock), a read barrier is needed,
+  and smp_read_barrier_depends() can be used instead of smp_rmb().
+
+  Note that the first load really has to have a _data_ dependency and not
+  a control dependency.  If the address for the second load is dependent
+  on the first load, but the dependency is through a conditional rather
+  than actually loading the address itself, then it's a _control_
+  dependency and a full read barrier or better is required.
+
+
+This is the set of barriers that is required *between* two atomic_read()
+and atomic_set() operations to achieve sequential consistency:
+
+                    |               2nd operation             |
+                    |-----------------------------------------|
+     1st operation  | (after last) | atomic_read | atomic_set |
+     ---------------+--------------+-------------+------------|
+     (before first) |              | none        | smp_wmb()  |
+     ---------------+--------------+-------------+------------|
+     atomic_read    | smp_rmb()    | smp_rmb()*  | **         |
+     ---------------+--------------+-------------+------------|
+     atomic_set     | none         | smp_mb()*** | smp_wmb()  |
+     ---------------+--------------+-------------+------------|
+
+       * Or smp_read_barrier_depends().
+
+      ** This requires a load-store barrier.  How to achieve this varies
+         depending on the machine, but in practice smp_rmb()+smp_wmb()
+         should have the desired effect.  For example, on PowerPC the
+         lwsync instruction is a combined load-load, load-store and
+         store-store barrier.
+
+     *** This requires a store-load barrier.  On most machines, the only
+         way to achieve this is a full barrier.
+
+
+You can see that the two possible definitions of atomic_mb_read()
+and atomic_mb_set() are the following:
+
+    1) atomic_mb_read(p)   = atomic_read(p); smp_rmb()
+       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()
+
+    2) atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_rmb()
+       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v);
+
+Usually the former is used, because smp_mb() is expensive and a program
+normally has more reads than writes.  Therefore it makes more sense to
+make atomic_mb_set() the more expensive operation.
+
+There are two common cases in which atomic_mb_read and atomic_mb_set
+generate too many memory barriers, and thus it can be useful to manually
+place barriers instead:
+
+- when a data structure has one thread that is always a writer
+  and one thread that is always a reader, manual placement of
+  memory barriers makes the write side faster.  Furthermore,
+  correctness is easy to check for in this case using the "pairing"
+  trick that is explained below:
+
+     thread 1                                thread 1
+     -------------------------               ------------------------
+     (other writes)
+                                             smp_wmb()
+     atomic_mb_set(&a, x)                    atomic_set(&a, x)
+                                             smp_wmb()
+     atomic_mb_set(&b, y)                    atomic_set(&b, y)
+
+                                       =>
+     thread 2                                thread 2
+     -------------------------               ------------------------
+     y = atomic_mb_read(&b)                  y = atomic_read(&b)
+                                             smp_rmb()
+     x = atomic_mb_read(&a)                  x = atomic_read(&a)
+                                             smp_rmb()
+
+- sometimes, a thread is accessing many variables that are otherwise
+  unrelated to each other (for example because, apart from the current
+  thread, exactly one other thread will read or write each of these
+  variables).  In this case, it is possible to "hoist" the implicit
+  barriers provided by atomic_mb_read() and atomic_mb_set() outside
+  a loop.  For example, the above definition atomic_mb_read() gives
+  the following transformation:
+
+     n = 0;                                  n = 0;
+     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
+       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
+                                             smp_rmb();
+
+  Similarly, atomic_mb_set() can be transformed as follows:
+  smp_mb():
+
+                                             smp_wmb();
+     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
+       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
+                                             smp_mb();
+
+
+The two tricks can be combined.  In this case, splitting a loop in
+two lets you hoist the barriers out of the loops _and_ eliminate the
+expensive smp_mb():
+
+                                             smp_wmb();
+     for (i = 0; i < 10; i++) {        =>    for (i = 0; i < 10; i++)
+       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
+       atomic_mb_set(&b[i], false);          smb_wmb();
+     }                                       for (i = 0; i < 10; i++)
+                                               atomic_set(&a[i], false);
+                                             smp_mb();
+
+  The other thread can still use atomic_mb_read()/atomic_mb_set()
+
+
+Memory barrier pairing
+----------------------
+
+A useful rule of thumb is that memory barriers should always, or almost
+always, be paired with another barrier.  In the case of QEMU, however,
+note that the other barrier may actually be in a driver that runs in
+the guest!
+
+For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
+both count as read barriers.  A read barriers shall pair with a write
+barrier or a full barrier; a write barrier shall pair with a read
+barrier or a full barrier.  A full barrier can pair with anything.
+For example:
+
+        thread 1             thread 2
+        ===============      ===============
+        a = 1;
+        smp_wmb();
+        b = 2;               x = b;
+                             smp_rmb();
+                             y = a;
+
+Note that the "writing" thread are accessing the variables in the
+opposite order as the "reading" thread.  This is expected: stores
+before the write barrier will normally match the loads after the
+read barrier, and vice versa.  The same is true for more than 2
+access and for data dependency barriers:
+
+        thread 1             thread 2
+        ===============      ===============
+        b[2] = 1;
+        smp_wmb();
+        x->i = 2;
+        smp_wmb();
+        a = x;               x = a;
+                             smp_read_barrier_depends();
+                             y = x->i;
+                             smp_read_barrier_depends();
+                             z = b[y];
+
+smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
+with atomic_mb_set().
+
+
+COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
+============================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are always on a 32-bit int type and
+  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
+  and use normal C types.
+
+- atomic_read and atomic_set in Linux give no guarantee at all;
+  atomic_read and atomic_set in QEMU include a compiler barrier
+  (similar to the ACCESS_ONCE macro in Linux).
+
+- most atomic read-modify-write operations in Linux return void;
+  in QEMU, all of them return the old value of the variable.
+
+- different atomic read-modify-write operations in Linux imply
+  a different set of memory barriers; in QEMU, all of them enforce
+  sequential consistency, which means they imply full memory barriers
+  before and after the operation.
+
+- Linux does not have an equivalent of atomic_mb_read() and
+  atomic_mb_set().  In particular, note that set_mb() is a little
+  weaker than atomic_mb_set().
+
+
+SOURCES
+=======
+
+* Documentation/memory-barriers.txt from the Linux kernel
+
+* "The JSR-133 Cookbook for Compiler Writers", available at
+  http://g.oswego.edu/dl/jmm/cookbook.html
diff --git a/hw/display/qxl.c b/hw/display/qxl.c
index c475cb1..2b10d3d 100644
--- a/hw/display/qxl.c
+++ b/hw/display/qxl.c
@@ -23,6 +23,7 @@
 #include "qemu-common.h"
 #include "qemu/timer.h"
 #include "qemu/queue.h"
+#include "qemu/atomic.h"
 #include "monitor/monitor.h"
 #include "sysemu/sysemu.h"
 #include "trace.h"
@@ -1725,7 +1726,7 @@ static void qxl_send_events(PCIQXLDevice *d, uint32_t events)
         trace_qxl_send_events_vm_stopped(d->id, events);
         return;
     }
-    old_pending = __sync_fetch_and_or(&d->ram->int_pending, le_events);
+    old_pending = atomic_or(&d->ram->int_pending, le_events);
     if ((old_pending & le_events) == le_events) {
         return;
     }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index fbabf99..28abe1e 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -16,6 +16,7 @@
 #include <sys/ioctl.h>
 #include "hw/virtio/vhost.h"
 #include "hw/hw.h"
+#include "qemu/atomic.h"
 #include "qemu/range.h"
 #include <linux/vhost.h>
 #include "exec/address-spaces.h"
@@ -47,11 +48,9 @@ static void vhost_dev_sync_region(struct vhost_dev *dev,
             addr += VHOST_LOG_CHUNK;
             continue;
         }
-        /* Data must be read atomically. We don't really
-         * need the barrier semantics of __sync
-         * builtins, but it's easier to use them than
-         * roll our own. */
-        log = __sync_fetch_and_and(from, 0);
+        /* Data must be read atomically. We don't really need barrier semantics
+         * but it's easier to use atomic_* than roll our own. */
+        log = atomic_xchg(from, 0);
         while ((bit = sizeof(log) > sizeof(int) ?
                 ffsll(log) : ffs(log))) {
             hwaddr page_addr;
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 10becb6..18aecaf 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -1,68 +1,227 @@
-#ifndef __QEMU_BARRIER_H
-#define __QEMU_BARRIER_H 1
+/*
+ * Simple interface for atomic operations.
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
 
-/* Compiler barrier */
-#define barrier()   asm volatile("" ::: "memory")
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
 
-#if defined(__i386__)
+#include "qemu/compiler.h"
 
-#include "qemu/compiler.h"        /* QEMU_GNUC_PREREQ */
+/* For C11 atomic ops */
 
-/*
- * Because of the strongly ordered x86 storage model, wmb() and rmb() are nops
- * on x86(well, a compiler barrier only).  Well, at least as long as
- * qemu doesn't do accesses to write-combining memory or non-temporal
- * load/stores from C code.
- */
-#define smp_wmb()   barrier()
-#define smp_rmb()   barrier()
+#if QEMU_GNUC_PREREQ(4, 8)
+#ifndef __ATOMIC_RELAXED
+#define __ATOMIC_RELAXED 0
+#endif
+#ifndef __ATOMIC_CONSUME
+#define __ATOMIC_CONSUME 1
+#endif
+#ifndef __ATOMIC_ACQUIRE
+#define __ATOMIC_ACQUIRE 2
+#endif
+#ifndef __ATOMIC_RELEASE
+#define __ATOMIC_RELEASE 3
+#endif
+#ifndef __ATOMIC_ACQ_REL
+#define __ATOMIC_ACQ_REL 4
+#endif
+#ifndef __ATOMIC_SEQ_CST
+#define __ATOMIC_SEQ_CST 5
+#endif
+#endif
+
+
+/* Compiler barrier */
+#define barrier()   ({ asm volatile("" ::: "memory"); (void)0; })
+
+#if !QEMU_GNUC_PREREQ(4, 8)
 
 /*
- * We use GCC builtin if it's available, as that can use
- * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
- * However, on i386, there seem to be known bugs as recently as 4.3.
- * */
-#if QEMU_GNUC_PREREQ(4, 4)
-#define smp_mb() __sync_synchronize()
+ * We use GCC builtin if it's available, as that can use mfence on
+ * 32-bit as well, e.g. if built with -march=pentium-m. However, on
+ * i386 the spec is buggy, and the implementation followed it until
+ * 4.3 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793).
+ */
+#if defined(__i386__) || defined(__x86_64__)
+#if !QEMU_GNUC_PREREQ(4, 4)
+#if defined __x86_64__
+#define smp_mb()    ({ asm volatile("mfence" ::: "memory"); (void)0; })
 #else
-#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#define smp_mb()    ({ asm volatile("lock; addl $0,0(%%esp) " ::: "memory"); (void)0; })
+#endif
+#endif
+#endif
+
+
+#ifdef __alpha__
+#define smp_read_barrier_depends()   asm volatile("mb":::"memory")
 #endif
 
-#elif defined(__x86_64__)
+#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
 
+/*
+ * Because of the strongly ordered storage model, wmb() and rmb() are nops
+ * here (a compiler barrier only).  QEMU doesn't do accesses to write-combining
+ * qemu memory or non-temporal load/stores from C code.
+ */
 #define smp_wmb()   barrier()
 #define smp_rmb()   barrier()
-#define smp_mb() asm volatile("mfence" ::: "memory")
+
+/*
+ * __sync_lock_test_and_set() is documented to be an acquire barrier only,
+ * but it is a full barrier at the hardware level.  Add a compiler barrier
+ * to make it a full barrier also at the compiler level.
+ */
+#define atomic_xchg(ptr, i)    (barrier(), __sync_lock_test_and_set(ptr, i))
+
+/*
+ * Load/store with Java volatile semantics.
+ */
+#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))
 
 #elif defined(_ARCH_PPC)
 
 /*
  * We use an eieio() for wmb() on powerpc.  This assumes we don't
  * need to order cacheable and non-cacheable stores with respect to
- * each other
+ * each other.
+ *
+ * smp_mb has the same problem as on x86 for not-very-new GCC
+ * (http://patchwork.ozlabs.org/patch/126184/, Nov 2011).
  */
-#define smp_wmb()   asm volatile("eieio" ::: "memory")
-
+#define smp_wmb()   ({ asm volatile("eieio" ::: "memory"); (void)0; })
 #if defined(__powerpc64__)
-#define smp_rmb()   asm volatile("lwsync" ::: "memory")
+#define smp_rmb()   ({ asm volatile("lwsync" ::: "memory"); (void)0; })
 #else
-#define smp_rmb()   asm volatile("sync" ::: "memory")
+#define smp_rmb()   ({ asm volatile("sync" ::: "memory"); (void)0; })
 #endif
+#define smp_mb()    ({ asm volatile("sync" ::: "memory"); (void)0; })
 
-#define smp_mb()   asm volatile("sync" ::: "memory")
+#endif /* _ARCH_PPC */
 
-#else
+#endif /* GCC 4.8 */
 
 /*
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill for wmb() and rmb().
+ * be overkill for smp_wmb() and smp_rmb().
  */
+#ifndef smp_mb
+#define smp_mb()    __sync_synchronize()
+#endif
+
+#ifndef smp_wmb
+#if QEMU_GNUC_PREREQ(4, 8)
+#define smp_wmb()   __atomic_thread_fence(__ATOMIC_RELEASE)
+#else
 #define smp_wmb()   __sync_synchronize()
-#define smp_mb()   __sync_synchronize()
+#endif
+#endif
+
+#ifndef smp_rmb
+#if QEMU_GNUC_PREREQ(4, 8)
+#define smp_rmb()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
+#else
 #define smp_rmb()   __sync_synchronize()
+#endif
+#endif
+
+#ifndef smp_read_barrier_depends
+#if QEMU_GNUC_PREREQ(4, 8)
+#define smp_read_barrier_depends()   __atomic_thread_fence(__ATOMIC_CONSUME)
+#else
+#define smp_read_barrier_depends()   barrier()
+#endif
+#endif
 
+#ifndef atomic_read
+#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
 #endif
 
+#ifndef atomic_set
+#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
+#endif
+
+/* These have the same semantics as Java volatile variables.
+ * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
+ * "1. Issue a StoreStore barrier (wmb) before each volatile store."
+ *  2. Issue a StoreLoad barrier after each volatile store.
+ *     Note that you could instead issue one before each volatile load, but
+ *     this would be slower for typical programs using volatiles in which
+ *     reads greatly outnumber writes. Alternatively, if available, you
+ *     can implement volatile store as an atomic instruction (for example
+ *     XCHG on x86) and omit the barrier. This may be more efficient if
+ *     atomic instructions are cheaper than StoreLoad barriers.
+ *  3. Issue LoadLoad and LoadStore barriers after each volatile load."
+ *
+ * If you prefer to think in terms of "pairing" of memory barriers,
+ * an atomic_mb_read pairs with an atomic_mb_set.
+ *
+ * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
+ * while an atomic_mb_set is a st.rel followed by a memory barrier.
+ */
+#ifndef atomic_mb_read
+#if QEMU_GNUC_PREREQ(4, 8)
+#define atomic_mb_read(ptr)       ({             \
+    typeof(*ptr) _val;                           \
+    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
+    _val;                                        \
+})
+#else
+#define atomic_mb_read(ptr)    ({           \
+    typeof(*ptr) _val = atomic_read(ptr);   \
+    smp_rmb();                              \
+    _val;                                   \
+})
+#endif
+#endif
+
+#ifndef atomic_mb_set
+#if QEMU_GNUC_PREREQ(4, 8)
+#define atomic_mb_set(ptr, i)    do {             \
+    typeof(*ptr) _val = i;                        \
+    __atomic_store(ptr, &_val, __ATOMIC_SEQ_CST); \
+} while(0)
+#else
+#define atomic_mb_set(ptr, i)  do {         \
+    smp_wmb();                              \
+    atomic_set(ptr, i);                     \
+    smp_mb();                               \
+} while (0)
+#endif
+#endif
+
+#ifndef atomic_xchg
+#ifdef __clang__
+#define atomic_xchg(ptr, i)    __sync_exchange(ptr, i)
+#elif QEMU_GNUC_PREREQ(4, 8)
+#define atomic_xchg(ptr, i)    ({                           \
+    typeof(*ptr) _new = (i), _old;                          \
+    __atomic_exchange(ptr, &_new, &_old, __ATOMIC_SEQ_CST); \
+    _old;                                                   \
+})
+#else
+/* __sync_lock_test_and_set() is documented to be an acquire barrier only.  */
+#define atomic_xchg(ptr, i)    (smp_mb(), __sync_lock_test_and_set(ptr, i))
+#endif
+#endif
+
+/* Provide shorter names for GCC atomic builtins.  */
+#define atomic_inc(ptr)        __sync_fetch_and_add(ptr, 1)
+#define atomic_dec(ptr)        __sync_fetch_and_add(ptr, -1)
+#define atomic_add             __sync_fetch_and_add
+#define atomic_sub             __sync_fetch_and_sub
+#define atomic_and             __sync_fetch_and_and
+#define atomic_or              __sync_fetch_and_or
+#define atomic_cmpxchg         __sync_val_compare_and_swap
+
 #endif
diff --git a/migration.c b/migration.c
index bfbc345..a513970 100644
--- a/migration.c
+++ b/migration.c
@@ -290,8 +290,7 @@ static void migrate_fd_cleanup(void *opaque)
 
 static void migrate_finish_set_state(MigrationState *s, int new_state)
 {
-    if (__sync_val_compare_and_swap(&s->state, MIG_STATE_ACTIVE,
-                                    new_state) == new_state) {
+    if (atomic_cmpxchg(&s->state, MIG_STATE_ACTIVE, new_state) == new_state) {
         trace_migrate_set_state(new_state);
     }
 }
diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
index 22915aa..dbf2fc8 100644
--- a/tests/test-thread-pool.c
+++ b/tests/test-thread-pool.c
@@ -17,15 +17,15 @@ typedef struct {
 static int worker_cb(void *opaque)
 {
     WorkerTestData *data = opaque;
-    return __sync_fetch_and_add(&data->n, 1);
+    return atomic_inc(&data->n);
 }
 
 static int long_cb(void *opaque)
 {
     WorkerTestData *data = opaque;
-    __sync_fetch_and_add(&data->n, 1);
+    atomic_inc(&data->n);
     g_usleep(2000000);
-    __sync_fetch_and_add(&data->n, 1);
+    atomic_inc(&data->n);
     return 0;
 }
 
@@ -169,7 +169,7 @@ static void test_cancel(void)
     /* Cancel the jobs that haven't been started yet.  */
     num_canceled = 0;
     for (i = 0; i < 100; i++) {
-        if (__sync_val_compare_and_swap(&data[i].n, 0, 3) == 0) {
+        if (atomic_cmpxchg(&data[i].n, 0, 3) == 0) {
             data[i].ret = -ECANCELED;
             bdrv_aio_cancel(data[i].aiocb);
             active--;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-16 11:21 [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant Liu Ping Fan
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations Liu Ping Fan
@ 2013-06-16 11:21 ` Liu Ping Fan
  2013-06-17 15:28   ` Stefan Hajnoczi
  2013-06-18 15:14   ` mdroth
  2013-06-17  7:11 ` [Qemu-devel] [PATCH v2 0/2] " Paolo Bonzini
  2 siblings, 2 replies; 41+ messages in thread
From: Liu Ping Fan @ 2013-06-16 11:21 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Anthony Liguori

BH will be used outside big lock, so introduce lock to protect
between the writers, ie, bh's adders and deleter.
Note that the lock only affects the writers and bh's callback does
not take this extra lock.

Signed-off-by: Liu Ping Fan <pingfanl@linux.vnet.ibm.com>
---
 async.c             | 21 +++++++++++++++++++++
 include/block/aio.h |  3 +++
 2 files changed, 24 insertions(+)

diff --git a/async.c b/async.c
index 90fe906..6a3269f 100644
--- a/async.c
+++ b/async.c
@@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
     bh->ctx = ctx;
     bh->cb = cb;
     bh->opaque = opaque;
+    qemu_mutex_lock(&ctx->bh_lock);
     bh->next = ctx->first_bh;
+    /* Make sure the memebers ready before putting bh into list */
+    smp_wmb();
     ctx->first_bh = bh;
+    qemu_mutex_unlock(&ctx->bh_lock);
     return bh;
 }
 
@@ -61,12 +65,18 @@ int aio_bh_poll(AioContext *ctx)
 
     ret = 0;
     for (bh = ctx->first_bh; bh; bh = next) {
+        /* Make sure fetching bh before accessing its members */
+        smp_read_barrier_depends();
         next = bh->next;
         if (!bh->deleted && bh->scheduled) {
             bh->scheduled = 0;
             if (!bh->idle)
                 ret = 1;
             bh->idle = 0;
+            /* Paired with write barrier in bh schedule to ensure reading for
+             *  callbacks coming after bh's scheduling.
+             */
+            smp_rmb();
             bh->cb(bh->opaque);
         }
     }
@@ -75,6 +85,7 @@ int aio_bh_poll(AioContext *ctx)
 
     /* remove deleted bhs */
     if (!ctx->walking_bh) {
+        qemu_mutex_lock(&ctx->bh_lock);
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -85,6 +96,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
+        qemu_mutex_unlock(&ctx->bh_lock);
     }
 
     return ret;
@@ -94,6 +106,10 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
 {
     if (bh->scheduled)
         return;
+    /* Make sure any writes that are needed by the callback are done
+     * before the locations are read in the aio_bh_poll.
+     */
+    smp_wmb();
     bh->scheduled = 1;
     bh->idle = 1;
 }
@@ -102,6 +118,10 @@ void qemu_bh_schedule(QEMUBH *bh)
 {
     if (bh->scheduled)
         return;
+    /* Make sure any writes that are needed by the callback are done
+     * before the locations are read in the aio_bh_poll.
+     */
+    smp_wmb();
     bh->scheduled = 1;
     bh->idle = 0;
     aio_notify(bh->ctx);
@@ -211,6 +231,7 @@ AioContext *aio_context_new(void)
     ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
     ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
     ctx->thread_pool = NULL;
+    qemu_mutex_init(&ctx->bh_lock);
     event_notifier_init(&ctx->notifier, false);
     aio_set_event_notifier(ctx, &ctx->notifier, 
                            (EventNotifierHandler *)
diff --git a/include/block/aio.h b/include/block/aio.h
index 1836793..971fbef 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -17,6 +17,7 @@
 #include "qemu-common.h"
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
+#include "qemu/thread.h"
 
 typedef struct BlockDriverAIOCB BlockDriverAIOCB;
 typedef void BlockDriverCompletionFunc(void *opaque, int ret);
@@ -53,6 +54,8 @@ typedef struct AioContext {
      */
     int walking_handlers;
 
+    /* lock to protect between bh's adders and deleter */
+    QemuMutex bh_lock;
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant
  2013-06-16 11:21 [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant Liu Ping Fan
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations Liu Ping Fan
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant Liu Ping Fan
@ 2013-06-17  7:11 ` Paolo Bonzini
  2013-06-18  2:40   ` liu ping fan
  2 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-17  7:11 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Anthony Liguori

Il 16/06/2013 13:21, Liu Ping Fan ha scritto:
> When trying out of QBL, we badly require more fine defined barrier and atomic ops, so
> I repost Paolo's atomic patch which fetched github.com/bonzini/qemu.git rcu

CCing block maintainers.

Almost Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

You're still not documenting that multiple occurrences of aio_bh_poll
cannot be called concurrently.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant Liu Ping Fan
@ 2013-06-17 15:28   ` Stefan Hajnoczi
  2013-06-17 16:41     ` Paolo Bonzini
  2013-06-18  2:19     ` liu ping fan
  2013-06-18 15:14   ` mdroth
  1 sibling, 2 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2013-06-17 15:28 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: Paolo Bonzini, qemu-devel, Anthony Liguori

On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>      bh->ctx = ctx;
>      bh->cb = cb;
>      bh->opaque = opaque;
> +    qemu_mutex_lock(&ctx->bh_lock);
>      bh->next = ctx->first_bh;
> +    /* Make sure the memebers ready before putting bh into list */

s/memebers/members/

> +    smp_wmb();

Why lock bh_lock before assigning bh->next?  Could you lock the mutex
here and then drop the smp_wmb() since the pthread function already does
that?

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-17 15:28   ` Stefan Hajnoczi
@ 2013-06-17 16:41     ` Paolo Bonzini
  2013-06-18  2:19     ` liu ping fan
  1 sibling, 0 replies; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-17 16:41 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

Il 17/06/2013 17:28, Stefan Hajnoczi ha scritto:
>> > +    qemu_mutex_lock(&ctx->bh_lock);
>> >      bh->next = ctx->first_bh;
>> > +    /* Make sure the memebers ready before putting bh into list */
> s/memebers/members/
> 
>> > +    smp_wmb();
> Why lock bh_lock before assigning bh->next?  Could you lock the mutex
> here and then drop the smp_wmb() since the pthread function already does
> that?
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

Not sure I understand, ctx->first_bh is read here and that's what the
lock protects.

    thread 1                           thread 2
    ------------------------------------------------------------------
    bh->next = ctx->first_bh;
                                       bh->next = ctx->first_bh;
                                       lock
                                       ctx->first_bh = bh;
                                       unlock
    lock
    ctx->first_bh = bh;
    unlock

and thread 2's bottom half is gone.  There is also a similar race that
leaves a dangling pointer if aio_bh_new races against aio_bh_poll.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations Liu Ping Fan
@ 2013-06-17 18:57   ` Richard Henderson
  2013-06-18  8:36     ` Paolo Bonzini
  2013-06-18 13:24     ` [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations) Paolo Bonzini
  0 siblings, 2 replies; 41+ messages in thread
From: Richard Henderson @ 2013-06-17 18:57 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: Paolo Bonzini, qemu-devel, Anthony Liguori

On 06/16/2013 04:21 AM, Liu Ping Fan wrote:
> +#if QEMU_GNUC_PREREQ(4, 8)
> +#ifndef __ATOMIC_RELAXED
> +#define __ATOMIC_RELAXED 0
> +#endif

Why all the ifdefs?  If __atomic support is present, then __ATOMIC defines will
exist.

> +/* Compiler barrier */
> +#define barrier()   ({ asm volatile("" ::: "memory"); (void)0; })
> +
> +#if !QEMU_GNUC_PREREQ(4, 8)

With C++11 atomic support we should define barrier as

  __atomic_signal_fence(__ATOMIC_ACQ_REL)

>  #if defined(__powerpc64__)
> -#define smp_rmb()   asm volatile("lwsync" ::: "memory")
> +#define smp_rmb()   ({ asm volatile("lwsync" ::: "memory"); (void)0; })
>  #else
> -#define smp_rmb()   asm volatile("sync" ::: "memory")
> +#define smp_rmb()   ({ asm volatile("sync" ::: "memory"); (void)0; })
>  #endif

Err... lwsync corresponds to ACQ_REL consistency, while sync corresponds to
SEQ_CST consistency.  The newly added document says you want SEQ_CST while the
old code implies ACQ_REL.

Which is true?

> +#ifndef smp_mb
> +#define smp_mb()    __sync_synchronize()
> +#endif

Use __atomic_thread_fence here for completeness?

> +#ifndef atomic_read
> +#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
>  #endif
>  
> +#ifndef atomic_set
> +#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
> +#endif

Use

__atomic_load(..., __ATOMIC_RELAXED)
__atomic_store(..., __ATOMIC_RELAXED)

?

> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
...
> + */
> +#ifndef atomic_mb_read
> +#if QEMU_GNUC_PREREQ(4, 8)
> +#define atomic_mb_read(ptr)       ({             \
> +    typeof(*ptr) _val;                           \
> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
> +    _val;                                        \
> +})
> +#else
> +#define atomic_mb_read(ptr)    ({           \
> +    typeof(*ptr) _val = atomic_read(ptr);   \
> +    smp_rmb();                              \
> +    _val;                                   \

This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
load_acquire, one needs barriers before and after the atomic_read in order to
implement SEQ_CST.

So again I have to ask, what semantics are you actually looking for here?

> +#define atomic_mb_set(ptr, i)  do {         \
> +    smp_wmb();                              \
> +    atomic_set(ptr, i);                     \
> +    smp_mb();                               \
> +} while (0)

Really only a smp_wmb?  What about outstanding loads?


r~

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-17 15:28   ` Stefan Hajnoczi
  2013-06-17 16:41     ` Paolo Bonzini
@ 2013-06-18  2:19     ` liu ping fan
  2013-06-18  9:31       ` Stefan Hajnoczi
  1 sibling, 1 reply; 41+ messages in thread
From: liu ping fan @ 2013-06-18  2:19 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Paolo Bonzini, qemu-devel, Anthony Liguori

On Mon, Jun 17, 2013 at 11:28 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
>> @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>>      bh->ctx = ctx;
>>      bh->cb = cb;
>>      bh->opaque = opaque;
>> +    qemu_mutex_lock(&ctx->bh_lock);
>>      bh->next = ctx->first_bh;
>> +    /* Make sure the memebers ready before putting bh into list */
>
> s/memebers/members/
>
Will fix, thanks.
>> +    smp_wmb();
>
> Why lock bh_lock before assigning bh->next?  Could you lock the mutex
> here and then drop the smp_wmb() since the pthread function already does
> that?
>
As Paolo's explain, it will open race gap for writers.

Thanks and regards,
Pingfan
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant
  2013-06-17  7:11 ` [Qemu-devel] [PATCH v2 0/2] " Paolo Bonzini
@ 2013-06-18  2:40   ` liu ping fan
  2013-06-18  8:36     ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: liu ping fan @ 2013-06-18  2:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Anthony Liguori

On Mon, Jun 17, 2013 at 3:11 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 16/06/2013 13:21, Liu Ping Fan ha scritto:
>> When trying out of QBL, we badly require more fine defined barrier and atomic ops, so
>> I repost Paolo's atomic patch which fetched github.com/bonzini/qemu.git rcu
>
> CCing block maintainers.
>
> Almost Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> You're still not documenting that multiple occurrences of aio_bh_poll
> cannot be called concurrently.
>
Will append this in commit log.
Do we have plans to have aio_bh_poll() in concurrently? And is it
because we have lots of AioContext, but want limited threads?  Or
other benefit?

Thanks and regards,
Pingfan
> Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations
  2013-06-17 18:57   ` Richard Henderson
@ 2013-06-18  8:36     ` Paolo Bonzini
  2013-06-18 11:03       ` Paolo Bonzini
  2013-06-18 14:38       ` Richard Henderson
  2013-06-18 13:24     ` [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations) Paolo Bonzini
  1 sibling, 2 replies; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18  8:36 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

Il 17/06/2013 20:57, Richard Henderson ha scritto:
> On 06/16/2013 04:21 AM, Liu Ping Fan wrote:
>> +#if QEMU_GNUC_PREREQ(4, 8)
>> +#ifndef __ATOMIC_RELAXED
>> +#define __ATOMIC_RELAXED 0
>> +#endif
> 
> Why all the ifdefs?  If __atomic support is present, then __ATOMIC defines will
> exist.

Then I can just use "#ifdef __ATOMIC_RELAXED" instead of
"#if QEMU_GNUC_PREREQ(4, 8)"?

>> +/* Compiler barrier */
>> +#define barrier()   ({ asm volatile("" ::: "memory"); (void)0; })
>> +
>> +#if !QEMU_GNUC_PREREQ(4, 8)
> 
> With C++11 atomic support we should define barrier as
> 
>   __atomic_signal_fence(__ATOMIC_ACQ_REL)

I don't really understand fences other than SEQ_CST, so I was hesitant to use
them.  The asms should work anyway.

>>  #if defined(__powerpc64__)
>> -#define smp_rmb()   asm volatile("lwsync" ::: "memory")
>> +#define smp_rmb()   ({ asm volatile("lwsync" ::: "memory"); (void)0; })
>>  #else
>> -#define smp_rmb()   asm volatile("sync" ::: "memory")
>> +#define smp_rmb()   ({ asm volatile("sync" ::: "memory"); (void)0; })
>>  #endif
> 
> Err... lwsync corresponds to ACQ_REL consistency, while sync corresponds to
> SEQ_CST consistency.  The newly added document says you want SEQ_CST while the
> old code implies ACQ_REL.
> 
> Which is true?

I have no idea, but I can say which semantics I want:

1) Linux kernel memory barrier semantics for smp_*mb (i.e. express the barriers
in terms of read/write/full, not in terms of acq/rel/seqcst);

2) Java volatile semantics for atomic_mb_*.

Basically, I cannot claim I understand this stuff 100%, but at least I could
use sources I trust to implement it.

My rationale was that both have been in use for a while and have proven to be
both efficient and useful.  They are more easily understood than the C11 memory
model, though not as complete.  I wanted something that people could figure out
relatively easily (tested on myself while working on the RCU stuff).

Most important, Java volatile can be described in terms of barriers as in the
JSR-133 cookbook, making it easy to convert one to the other.  This also reflected
in the names.

Finally, Java volatile semantics are reasonably fast for all architectures,
including ARM and PPC.

I think my understanding that Java volatile == SEQ_CST was wrong, though,
and I should rewrite the documentation.  More on this below.

>> +#ifndef smp_mb
>> +#define smp_mb()    __sync_synchronize()
>> +#endif
> 
> Use __atomic_thread_fence here for completeness?
> 
>> +#ifndef atomic_read
>> +#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
>>  #endif
>>  
>> +#ifndef atomic_set
>> +#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
>> +#endif
> 
> Use
> 
> __atomic_load(..., __ATOMIC_RELAXED)
> __atomic_store(..., __ATOMIC_RELAXED)
> 
> ?

Same here, I didn't want proliferation of #ifdefs beyond what is actually required.

>> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
>> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> ...
>> + */
>> +#ifndef atomic_mb_read
>> +#if QEMU_GNUC_PREREQ(4, 8)
>> +#define atomic_mb_read(ptr)       ({             \
>> +    typeof(*ptr) _val;                           \
>> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
>> +    _val;                                        \
>> +})
>> +#else
>> +#define atomic_mb_read(ptr)    ({           \
>> +    typeof(*ptr) _val = atomic_read(ptr);   \
>> +    smp_rmb();                              \
>> +    _val;                                   \
> 
> This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> load_acquire, one needs barriers before and after the atomic_read in order to
> implement SEQ_CST.
> 
> So again I have to ask, what semantics are you actually looking for here?

As mentioned above, I wanted Java volatile semantics.  I misunderstood
that Java volatile == SEQ_CST.  The Java memory model FAQ says:
http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile

     Each read of a volatile will see the last write to that volatile
     by any thread; in effect, they are designated by the programmer as
     fields for which it is never acceptable to see a "stale" value as
     a result of caching or reordering. The compiler and runtime are
     prohibited from allocating them in registers. They must also ensure
     that after they are written, they are flushed out of the cache to
     main memory, so they can immediately become visible to other threads.
     Similarly, before a volatile field is read, the cache must be
     invalidated so that the value in main memory, not the local processor
     cache, is the one seen. There are also additional restrictions on
     reordering accesses to volatile variables. 

(Since almost all architectures are cache coherent, I take that "cache"
really means the store buffers.  The JSR-133 cookbook basically says the
same).  In addition, mentioned by the cookbook but not the FAQ, normal
stores cannot be moved after a volatile stores, and volatile loads
cannot be moved after a normal load.

The FAQ also has an "important note", however:

    Important Note: Note that it is important for both threads to access
    the same volatile variable in order to properly set up the happens-before
    relationship. It is not the case that everything visible to thread A
    when it writes volatile field f becomes visible to thread B after it
    reads volatile field g. The release and acquire have to "match" (i.e.,
    be performed on the same volatile field) to have the right semantics. 

Is this final "important note" the difference between ACQ_REL and SEQ_CST?

>> +#define atomic_mb_set(ptr, i)  do {         \
>> +    smp_wmb();                              \
>> +    atomic_set(ptr, i);                     \
>> +    smp_mb();                               \
>> +} while (0)
> 
> Really only a smp_wmb?  What about outstanding loads?

Probably the same as before.  Also, as mentioned in the documentation
smp_rmb() + smp_wmb() imply a load-store barrier on all machines.

Thanks for reviewing this, you're probably the only person in the community
that understands this stuff.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant
  2013-06-18  2:40   ` liu ping fan
@ 2013-06-18  8:36     ` Paolo Bonzini
  0 siblings, 0 replies; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18  8:36 UTC (permalink / raw)
  To: liu ping fan; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Anthony Liguori

Il 18/06/2013 04:40, liu ping fan ha scritto:
> On Mon, Jun 17, 2013 at 3:11 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 16/06/2013 13:21, Liu Ping Fan ha scritto:
>>> When trying out of QBL, we badly require more fine defined barrier and atomic ops, so
>>> I repost Paolo's atomic patch which fetched github.com/bonzini/qemu.git rcu
>>
>> CCing block maintainers.
>>
>> Almost Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> You're still not documenting that multiple occurrences of aio_bh_poll
>> cannot be called concurrently.
>>
> Will append this in commit log.

No, in the code!

> Do we have plans to have aio_bh_poll() in concurrently?

I think it's too hard.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-18  2:19     ` liu ping fan
@ 2013-06-18  9:31       ` Stefan Hajnoczi
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2013-06-18  9:31 UTC (permalink / raw)
  To: liu ping fan; +Cc: Paolo Bonzini, qemu-devel, Anthony Liguori

On Tue, Jun 18, 2013 at 10:19:40AM +0800, liu ping fan wrote:
> On Mon, Jun 17, 2013 at 11:28 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> > Why lock bh_lock before assigning bh->next?  Could you lock the mutex
> > here and then drop the smp_wmb() since the pthread function already does
> > that?
> >
> As Paolo's explain, it will open race gap for writers.

Right, thanks!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations
  2013-06-18  8:36     ` Paolo Bonzini
@ 2013-06-18 11:03       ` Paolo Bonzini
  2013-06-18 14:38       ` Richard Henderson
  1 sibling, 0 replies; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18 11:03 UTC (permalink / raw)
  Cc: qemu-devel, Liu Ping Fan, Anthony Liguori, Richard Henderson

Il 18/06/2013 10:36, Paolo Bonzini ha scritto:
>     Important Note: Note that it is important for both threads to access
>     the same volatile variable in order to properly set up the happens-before
>     relationship. It is not the case that everything visible to thread A
>     when it writes volatile field f becomes visible to thread B after it
>     reads volatile field g. The release and acquire have to "match" (i.e.,
>     be performed on the same volatile field) to have the right semantics. 
> 
> Is this final "important note" the difference between ACQ_REL and SEQ_CST?

Based on what I read now, I think I want ACQ_REL for
atomic_mb_{read,set}.  One can still get SEQ_CST with atomic_xchg (for
sets) or atomic_add (adding 0, for reads).

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-17 18:57   ` Richard Henderson
  2013-06-18  8:36     ` Paolo Bonzini
@ 2013-06-18 13:24     ` Paolo Bonzini
  2013-06-18 14:50       ` Paul E. McKenney
                         ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18 13:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Torvald Riegel, Andrew Haley, Liu Ping Fan, qemu-devel,
	Anthony Liguori, Paul E. McKenney

Il 17/06/2013 20:57, Richard Henderson ha scritto:
>> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
>> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> ...
>> + */
>> +#ifndef atomic_mb_read
>> +#if QEMU_GNUC_PREREQ(4, 8)
>> +#define atomic_mb_read(ptr)       ({             \
>> +    typeof(*ptr) _val;                           \
>> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
>> +    _val;                                        \
>> +})
>> +#else
>> +#define atomic_mb_read(ptr)    ({           \
>> +    typeof(*ptr) _val = atomic_read(ptr);   \
>> +    smp_rmb();                              \
>> +    _val;                                   \
> 
> This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> load_acquire, one needs barriers before and after the atomic_read in order to
> implement SEQ_CST.

The store-load barrier between atomic_mb_set and atomic_mb_read are
provided by the atomic_mb_set.  The load-load barrier between two
atomic_mb_reads are provided by the first read.

> So again I have to ask, what semantics are you actually looking for here?

So, everything I found points to Java volatile being sequentially
consistent, though I'm still not sure why C11 suggests hwsync for load
seq-cst on POWER instead of lwsync.  Comparing the sequences that my
code (based on the JSR-133 cookbook) generate with the C11 suggestions
you get:

    Load seqcst               hwsync; ld; cmp; bc; isync
    Store seqcst              hwsync; st
    (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)

    Load Java volatile        ld; lwsync
    Store Java volatile       lwsync; st; hwsync
    (source: http://g.oswego.edu/dl/jmm/cookbook.html)

where the lwsync in loads acts as a load-load barrier, while the one in
stores acts as load-store + store-store barrier.

Is the cookbook known to be wrong?

Or is Java volatile somewhere between acq_rel and seq_cst, as the last
paragraph of
http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
seems to suggest?

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations
  2013-06-18  8:36     ` Paolo Bonzini
  2013-06-18 11:03       ` Paolo Bonzini
@ 2013-06-18 14:38       ` Richard Henderson
  2013-06-18 15:04         ` Paolo Bonzini
  1 sibling, 1 reply; 41+ messages in thread
From: Richard Henderson @ 2013-06-18 14:38 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

On 06/18/2013 01:36 AM, Paolo Bonzini wrote:
>> Why all the ifdefs?  If __atomic support is present, then __ATOMIC defines will
>> exist.
> 
> Then I can just use "#ifdef __ATOMIC_RELAXED" instead of
> "#if QEMU_GNUC_PREREQ(4, 8)"?

I'd say so.

> I have no idea, but I can say which semantics I want:
> 
> 1) Linux kernel memory barrier semantics for smp_*mb (i.e. express the barriers
> in terms of read/write/full, not in terms of acq/rel/seqcst);
> 
> 2) Java volatile semantics for atomic_mb_*.
> 
> Basically, I cannot claim I understand this stuff 100%, but at least I could
> use sources I trust to implement it.

Fair enough.  Excellent pointers to have in the documentation, anyhow.

>>> +#ifndef atomic_read
>>> +#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
>>>  #endif
>>>  
>>> +#ifndef atomic_set
>>> +#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
>>> +#endif
>>
>> Use
>>
>> __atomic_load(..., __ATOMIC_RELAXED)
>> __atomic_store(..., __ATOMIC_RELAXED)
>>
>> ?
> 
> Same here, I didn't want proliferation of #ifdefs beyond what is actually required.

Not knowing exactly where these might be used within the code base, I'd be
worried about someone applying them to a uint64_t, somewhere a 32-bit host
might see it.  At which point the above is going to be silently wrong, loaded
with two 32-bit pieces.

Given that we're not requiring gcc 4.8, and cannot guarantee use of
__atomic_load, perhaps we ought to do something like

#define atomic_read(ptr) \
  ({ if (sizeof(*(ptr)) > sizeof(ptr)) invalid_atomic_read(); \
     *(__typeof__(*ptr) *volatile) (ptr)); })

which should generate a link error when reading a size we can't guarantee will
Just Work.


> The FAQ also has an "important note", however:
> 
>     Important Note: Note that it is important for both threads to access
>     the same volatile variable in order to properly set up the happens-before
>     relationship. It is not the case that everything visible to thread A
>     when it writes volatile field f becomes visible to thread B after it
>     reads volatile field g. The release and acquire have to "match" (i.e.,
>     be performed on the same volatile field) to have the right semantics. 
> 
> Is this final "important note" the difference between ACQ_REL and SEQ_CST?

Yes, exactly.


r~

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 13:24     ` [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations) Paolo Bonzini
@ 2013-06-18 14:50       ` Paul E. McKenney
  2013-06-18 15:29         ` Peter Sewell
                           ` (2 more replies)
  2013-06-18 15:26       ` Torvald Riegel
  2013-06-18 17:38       ` Andrew Haley
  2 siblings, 3 replies; 41+ messages in thread
From: Paul E. McKenney @ 2013-06-18 14:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Susmit.Sarkar, Peter.Sewell, Andrew Haley, qemu-devel,
	Liu Ping Fan, Anthony Liguori, luc.maranget, Torvald Riegel,
	Richard Henderson

On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
> Il 17/06/2013 20:57, Richard Henderson ha scritto:
> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> > ...
> >> + */
> >> +#ifndef atomic_mb_read
> >> +#if QEMU_GNUC_PREREQ(4, 8)
> >> +#define atomic_mb_read(ptr)       ({             \
> >> +    typeof(*ptr) _val;                           \
> >> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
> >> +    _val;                                        \
> >> +})
> >> +#else
> >> +#define atomic_mb_read(ptr)    ({           \
> >> +    typeof(*ptr) _val = atomic_read(ptr);   \
> >> +    smp_rmb();                              \
> >> +    _val;                                   \
> > 
> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> > load_acquire, one needs barriers before and after the atomic_read in order to
> > implement SEQ_CST.
> 
> The store-load barrier between atomic_mb_set and atomic_mb_read are
> provided by the atomic_mb_set.  The load-load barrier between two
> atomic_mb_reads are provided by the first read.
> 
> > So again I have to ask, what semantics are you actually looking for here?
> 
> So, everything I found points to Java volatile being sequentially
> consistent, though I'm still not sure why C11 suggests hwsync for load
> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
> code (based on the JSR-133 cookbook) generate with the C11 suggestions
> you get:
> 
>     Load seqcst               hwsync; ld; cmp; bc; isync
>     Store seqcst              hwsync; st
>     (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> 
>     Load Java volatile        ld; lwsync
>     Store Java volatile       lwsync; st; hwsync
>     (source: http://g.oswego.edu/dl/jmm/cookbook.html)
> 
> where the lwsync in loads acts as a load-load barrier, while the one in
> stores acts as load-store + store-store barrier.
> 
> Is the cookbook known to be wrong?
> 
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?

First, I am not a fan of SC, mostly because there don't seem to be many
(any?) production-quality algorithms that need SC.  But if you really
want to take a parallel-programming trip back to the 1980s, let's go!  ;-)

The canonical litmus test for SC is independent reads of independent
writes, or IRIW for short.  It is as follows, with x and y both initially
zero:

	Thread 0	Thread 1	Thread 2	Thread 3
	x=1;		y=1;		r1=x;		r2=y;
					r3=y;		r4=x;

On a sequentially consistent system, r1==1&&r2==1&&r3==0&&&r4==0 is
a forbidden final outcome, where "final" means that we wait for all
four threads to terminate and for their effects on memory to propagate
fully through the system before checking the final values.

The first mapping has been demonstrated to be correct at the URL
you provide.  So let's apply the second mapping:

	Thread 0	Thread 1	Thread 2	Thread 3
	lwsync;		lwsync;		r1=x;		r2=y;
	x=1;		y=1;		lwsync;		lwsync;
	hwsync;		hwsync;		r3=y;		r4=x;
					lwsync;		lwsync;

Now barriers operate by forcing ordering between accesses on either
side of the barrier.  This means that barriers at the very start or
the very end of a given thread's execution have no effect and may
be ignored.  This results in the following:

	Thread 0	Thread 1	Thread 2	Thread 3
	x=1;		y=1;		r1=x;		r2=y;
					lwsync;		lwsync;
					r3=y;		r4=x;

This sequence does -not- result in SC execution, as documented in the
table at the beginning of Section 6.3 on page 23 of:

	http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

See the row with "IRIW+lwsync", which shows that this code sequence does
not preserve SC ordering, neither in theory nor on real hardware.  (And I
strongly recommend this paper, even before having read it thoroughly
myself, hence my taking the precaution of CCing the authors in case I
am misinterpreting something.)  If you want to prove it for yourself,
use the software tool called out at http://lwn.net/Articles/470681/.

So is this a real issue?  Show me a parallel algorithm that relies on
SC, demonstrate that it is the best algorithm for the problem that it
solves, and then demonstrate that your problem statement is the best
match for some important real-world problem.  Until then, no, this is
not a real issue.

							Thanx, Paul

PS:  Nevertheless, I personally prefer the C++ formulation, but that is
     only because I stand with one foot in theory and the other in
     practice.  If I were a pure practitioner, I would probably strongly
     prefer the Java formulation.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations
  2013-06-18 14:38       ` Richard Henderson
@ 2013-06-18 15:04         ` Paolo Bonzini
  0 siblings, 0 replies; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18 15:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

Il 18/06/2013 16:38, Richard Henderson ha scritto:
>>>> +#ifndef atomic_read
>>>> +#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
>>>>  #endif
>>>>  
>>>> +#ifndef atomic_set
>>>> +#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
>>>> +#endif
>>>
>>> Use
>>>
>>> __atomic_load(..., __ATOMIC_RELAXED)
>>> __atomic_store(..., __ATOMIC_RELAXED)
>>>
>>> ?
>>
>> Same here, I didn't want proliferation of #ifdefs beyond what is actually required.
> 
> Not knowing exactly where these might be used within the code base, I'd be
> worried about someone applying them to a uint64_t, somewhere a 32-bit host
> might see it.  At which point the above is going to be silently wrong, loaded
> with two 32-bit pieces.
> 
> Given that we're not requiring gcc 4.8, and cannot guarantee use of
> __atomic_load, perhaps we ought to do something like
> 
> #define atomic_read(ptr) \
>   ({ if (sizeof(*(ptr)) > sizeof(ptr)) invalid_atomic_read(); \
>      *(__typeof__(*ptr) *volatile) (ptr)); })
> 
> which should generate a link error when reading a size we can't guarantee will
> Just Work.

Good idea.

>> The FAQ also has an "important note", however:
>>
>>     Important Note: Note that it is important for both threads to access
>>     the same volatile variable in order to properly set up the happens-before
>>     relationship. It is not the case that everything visible to thread A
>>     when it writes volatile field f becomes visible to thread B after it
>>     reads volatile field g. The release and acquire have to "match" (i.e.,
>>     be performed on the same volatile field) to have the right semantics. 
>>
>> Is this final "important note" the difference between ACQ_REL and SEQ_CST?
> 
> Yes, exactly.

I still have some confusion, so I sent another follow-up with the exact
differences in the generated code.  But in any case, I will augment the
documentation with the text from the FAQ, seems safe.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant Liu Ping Fan
  2013-06-17 15:28   ` Stefan Hajnoczi
@ 2013-06-18 15:14   ` mdroth
  2013-06-18 16:19     ` mdroth
  2013-06-18 19:20     ` Paolo Bonzini
  1 sibling, 2 replies; 41+ messages in thread
From: mdroth @ 2013-06-18 15:14 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: Paolo Bonzini, qemu-devel, Anthony Liguori

On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> BH will be used outside big lock, so introduce lock to protect
> between the writers, ie, bh's adders and deleter.
> Note that the lock only affects the writers and bh's callback does
> not take this extra lock.
> 
> Signed-off-by: Liu Ping Fan <pingfanl@linux.vnet.ibm.com>
> ---
>  async.c             | 21 +++++++++++++++++++++
>  include/block/aio.h |  3 +++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/async.c b/async.c
> index 90fe906..6a3269f 100644
> --- a/async.c
> +++ b/async.c
> @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
>      bh->ctx = ctx;
>      bh->cb = cb;
>      bh->opaque = opaque;
> +    qemu_mutex_lock(&ctx->bh_lock);
>      bh->next = ctx->first_bh;
> +    /* Make sure the memebers ready before putting bh into list */
> +    smp_wmb();
>      ctx->first_bh = bh;
> +    qemu_mutex_unlock(&ctx->bh_lock);
>      return bh;
>  }
> 
> @@ -61,12 +65,18 @@ int aio_bh_poll(AioContext *ctx)
> 
>      ret = 0;
>      for (bh = ctx->first_bh; bh; bh = next) {
> +        /* Make sure fetching bh before accessing its members */
> +        smp_read_barrier_depends();
>          next = bh->next;
>          if (!bh->deleted && bh->scheduled) {
>              bh->scheduled = 0;
>              if (!bh->idle)
>                  ret = 1;
>              bh->idle = 0;
> +            /* Paired with write barrier in bh schedule to ensure reading for
> +             *  callbacks coming after bh's scheduling.
> +             */
> +            smp_rmb();
>              bh->cb(bh->opaque);

Could we possibly simplify this by introducing a recursive mutex that we
could use to protect the whole list loop and hold even during the cb?

I assume we can't hold the lock during the cb currently since we might
try to reschedule, but if it's a recursive mutex would that simplify
things?

I've been doing something similar with IOHandlers for the QContext
stuff, and that's the approach I took. This patch introduces the
recursive mutex:

https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53


>          }
>      }
> @@ -75,6 +85,7 @@ int aio_bh_poll(AioContext *ctx)
> 
>      /* remove deleted bhs */
>      if (!ctx->walking_bh) {
> +        qemu_mutex_lock(&ctx->bh_lock);
>          bhp = &ctx->first_bh;
>          while (*bhp) {
>              bh = *bhp;
> @@ -85,6 +96,7 @@ int aio_bh_poll(AioContext *ctx)
>                  bhp = &bh->next;
>              }
>          }
> +        qemu_mutex_unlock(&ctx->bh_lock);
>      }
> 
>      return ret;
> @@ -94,6 +106,10 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
>  {
>      if (bh->scheduled)
>          return;
> +    /* Make sure any writes that are needed by the callback are done
> +     * before the locations are read in the aio_bh_poll.
> +     */
> +    smp_wmb();
>      bh->scheduled = 1;
>      bh->idle = 1;
>  }
> @@ -102,6 +118,10 @@ void qemu_bh_schedule(QEMUBH *bh)
>  {
>      if (bh->scheduled)
>          return;
> +    /* Make sure any writes that are needed by the callback are done
> +     * before the locations are read in the aio_bh_poll.
> +     */
> +    smp_wmb();
>      bh->scheduled = 1;
>      bh->idle = 0;
>      aio_notify(bh->ctx);
> @@ -211,6 +231,7 @@ AioContext *aio_context_new(void)
>      ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
>      ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
>      ctx->thread_pool = NULL;
> +    qemu_mutex_init(&ctx->bh_lock);
>      event_notifier_init(&ctx->notifier, false);
>      aio_set_event_notifier(ctx, &ctx->notifier, 
>                             (EventNotifierHandler *)
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 1836793..971fbef 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -17,6 +17,7 @@
>  #include "qemu-common.h"
>  #include "qemu/queue.h"
>  #include "qemu/event_notifier.h"
> +#include "qemu/thread.h"
> 
>  typedef struct BlockDriverAIOCB BlockDriverAIOCB;
>  typedef void BlockDriverCompletionFunc(void *opaque, int ret);
> @@ -53,6 +54,8 @@ typedef struct AioContext {
>       */
>      int walking_handlers;
> 
> +    /* lock to protect between bh's adders and deleter */
> +    QemuMutex bh_lock;
>      /* Anchor of the list of Bottom Halves belonging to the context */
>      struct QEMUBH *first_bh;
> 
> -- 
> 1.8.1.4
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 13:24     ` [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations) Paolo Bonzini
  2013-06-18 14:50       ` Paul E. McKenney
@ 2013-06-18 15:26       ` Torvald Riegel
  2013-06-18 17:38       ` Andrew Haley
  2 siblings, 0 replies; 41+ messages in thread
From: Torvald Riegel @ 2013-06-18 15:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori,
	Paul E. McKenney, Richard Henderson

On Tue, 2013-06-18 at 15:24 +0200, Paolo Bonzini wrote:
> Il 17/06/2013 20:57, Richard Henderson ha scritto:
> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
> > ...
> >> + */
> >> +#ifndef atomic_mb_read
> >> +#if QEMU_GNUC_PREREQ(4, 8)
> >> +#define atomic_mb_read(ptr)       ({             \
> >> +    typeof(*ptr) _val;                           \
> >> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
> >> +    _val;                                        \
> >> +})
> >> +#else
> >> +#define atomic_mb_read(ptr)    ({           \
> >> +    typeof(*ptr) _val = atomic_read(ptr);   \
> >> +    smp_rmb();                              \
> >> +    _val;                                   \
> > 
> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
> > load_acquire, one needs barriers before and after the atomic_read in order to
> > implement SEQ_CST.
> 
> The store-load barrier between atomic_mb_set and atomic_mb_read are
> provided by the atomic_mb_set.  The load-load barrier between two
> atomic_mb_reads are provided by the first read.
> 
> > So again I have to ask, what semantics are you actually looking for here?
> 
> So, everything I found points to Java volatile being sequentially
> consistent, though I'm still not sure why C11 suggests hwsync for load
> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
> code (based on the JSR-133 cookbook) generate with the C11 suggestions
> you get:
> 
>     Load seqcst               hwsync; ld; cmp; bc; isync
>     Store seqcst              hwsync; st
>     (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> 
>     Load Java volatile        ld; lwsync
>     Store Java volatile       lwsync; st; hwsync
>     (source: http://g.oswego.edu/dl/jmm/cookbook.html)
> 
> where the lwsync in loads acts as a load-load barrier, while the one in
> stores acts as load-store + store-store barrier.
> 
> Is the cookbook known to be wrong?
> 
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?

That section seems to indeed state that Java volatile accesses have
acquire/release semantics (I don't know too much about the Java model
though).

That would also match the load-acquire code on the C++11 mappings page
(if using an acquire fence after the load).  For the stores, the
cookbook mapping then has an extra hwsync at the end, which would
translate to an extra seqcst C++11 fence.  (Again, I can't say why.)

Perhaps http://www.cl.cam.ac.uk/~pes20/cppppc could help explain parts
of it.


Torvald

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 14:50       ` Paul E. McKenney
@ 2013-06-18 15:29         ` Peter Sewell
  2013-06-18 15:37         ` Torvald Riegel
  2013-06-18 16:08         ` Paolo Bonzini
  2 siblings, 0 replies; 41+ messages in thread
From: Peter Sewell @ 2013-06-18 15:29 UTC (permalink / raw)
  To: paulmck
  Cc: Susmit.Sarkar, Andrew Haley, qemu-devel, Liu Ping Fan,
	Anthony Liguori, luc.maranget, Paolo Bonzini, Torvald Riegel,
	Richard Henderson

On 18 June 2013 15:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Jun 18, 2013 at 03:24:24PM +0200, Paolo Bonzini wrote:
>> Il 17/06/2013 20:57, Richard Henderson ha scritto:
>> >> + * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
>> >> + * while an atomic_mb_set is a st.rel followed by a memory barrier.
>> > ...
>> >> + */
>> >> +#ifndef atomic_mb_read
>> >> +#if QEMU_GNUC_PREREQ(4, 8)
>> >> +#define atomic_mb_read(ptr)       ({             \
>> >> +    typeof(*ptr) _val;                           \
>> >> +    __atomic_load(ptr, &_val, __ATOMIC_SEQ_CST); \
>> >> +    _val;                                        \
>> >> +})
>> >> +#else
>> >> +#define atomic_mb_read(ptr)    ({           \
>> >> +    typeof(*ptr) _val = atomic_read(ptr);   \
>> >> +    smp_rmb();                              \
>> >> +    _val;                                   \
>> >
>> > This latter definition is ACQUIRE not SEQ_CST (except for ia64).  Without
>> > load_acquire, one needs barriers before and after the atomic_read in order to
>> > implement SEQ_CST.
>>
>> The store-load barrier between atomic_mb_set and atomic_mb_read are
>> provided by the atomic_mb_set.  The load-load barrier between two
>> atomic_mb_reads are provided by the first read.
>>
>> > So again I have to ask, what semantics are you actually looking for here?
>>
>> So, everything I found points to Java volatile being sequentially
>> consistent, though I'm still not sure why C11 suggests hwsync for load
>> seq-cst on POWER instead of lwsync.  Comparing the sequences that my
>> code (based on the JSR-133 cookbook) generate with the C11 suggestions
>> you get:
>>
>>     Load seqcst               hwsync; ld; cmp; bc; isync
>>     Store seqcst              hwsync; st
>>     (source: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
>>
>>     Load Java volatile        ld; lwsync
>>     Store Java volatile       lwsync; st; hwsync
>>     (source: http://g.oswego.edu/dl/jmm/cookbook.html)
>>
>> where the lwsync in loads acts as a load-load barrier, while the one in
>> stores acts as load-store + store-store barrier.

(I don't know the context of these mails, but let me add a gloss to
correct a common misconception there:  it's not sufficient to think
about
these things solely in terms of thread-local reordering, as the
tutorial that Paul points to
below explains. One has to think about the lack of multi-copy
atomicity too, as seen in that IRIW example)

>> Is the cookbook known to be wrong?
>>
>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>> paragraph of
>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>> seems to suggest?
>
> First, I am not a fan of SC, mostly because there don't seem to be many
> (any?) production-quality algorithms that need SC.  But if you really
> want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
>
> The canonical litmus test for SC is independent reads of independent
> writes, or IRIW for short.  It is as follows, with x and y both initially
> zero:
>
>         Thread 0        Thread 1        Thread 2        Thread 3
>         x=1;            y=1;            r1=x;           r2=y;
>                                         r3=y;           r4=x;
>
> On a sequentially consistent system, r1==1&&r2==1&&r3==0&&&r4==0 is
> a forbidden final outcome, where "final" means that we wait for all
> four threads to terminate and for their effects on memory to propagate
> fully through the system before checking the final values.
>
> The first mapping has been demonstrated to be correct at the URL
> you provide.  So let's apply the second mapping:
>
>         Thread 0        Thread 1        Thread 2        Thread 3
>         lwsync;         lwsync;         r1=x;           r2=y;
>         x=1;            y=1;            lwsync;         lwsync;
>         hwsync;         hwsync;         r3=y;           r4=x;
>                                         lwsync;         lwsync;
>
> Now barriers operate by forcing ordering between accesses on either
> side of the barrier.  This means that barriers at the very start or
> the very end of a given thread's execution have no effect and may
> be ignored.  This results in the following:
>
>         Thread 0        Thread 1        Thread 2        Thread 3
>         x=1;            y=1;            r1=x;           r2=y;
>                                         lwsync;         lwsync;
>                                         r3=y;           r4=x;
>
> This sequence does -not- result in SC execution, as documented in the
> table at the beginning of Section 6.3 on page 23 of:
>
>         http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>
> See the row with "IRIW+lwsync", which shows that this code sequence does
> not preserve SC ordering, neither in theory nor on real hardware.  (And I
> strongly recommend this paper, even before having read it thoroughly
> myself, hence my taking the precaution of CCing the authors in case I
> am misinterpreting something.)  If you want to prove it for yourself,
> use the software tool called out at http://lwn.net/Articles/470681/.
>
> So is this a real issue?  Show me a parallel algorithm that relies on
> SC, demonstrate that it is the best algorithm for the problem that it
> solves, and then demonstrate that your problem statement is the best
> match for some important real-world problem.  Until then, no, this is
> not a real issue.
>
>                                                         Thanx, Paul
>
> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>      only because I stand with one foot in theory and the other in
>      practice.  If I were a pure practitioner, I would probably strongly
>      prefer the Java formulation.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 14:50       ` Paul E. McKenney
  2013-06-18 15:29         ` Peter Sewell
@ 2013-06-18 15:37         ` Torvald Riegel
  2013-06-19  1:53           ` Paul E. McKenney
  2013-06-18 16:08         ` Paolo Bonzini
  2 siblings, 1 reply; 41+ messages in thread
From: Torvald Riegel @ 2013-06-18 15:37 UTC (permalink / raw)
  To: paulmck
  Cc: Susmit.Sarkar, Andrew Haley, qemu-devel, Liu Ping Fan,
	Anthony Liguori, luc.maranget, Paolo Bonzini, Peter.Sewell,
	Richard Henderson

On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> First, I am not a fan of SC, mostly because there don't seem to be many
> (any?) production-quality algorithms that need SC.  But if you really
> want to take a parallel-programming trip back to the 1980s, let's go!  ;-)

Dekker-style mutual exclusion is useful for things like read-mostly
multiple-reader single-writer locks, or similar "asymmetric" cases of
synchronization.  SC fences are needed for this.

> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>      only because I stand with one foot in theory and the other in
>      practice.  If I were a pure practitioner, I would probably strongly
>      prefer the Java formulation.

That's because you're a practitioner with experience :)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 14:50       ` Paul E. McKenney
  2013-06-18 15:29         ` Peter Sewell
  2013-06-18 15:37         ` Torvald Riegel
@ 2013-06-18 16:08         ` Paolo Bonzini
  2013-06-18 16:38           ` Torvald Riegel
  2 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18 16:08 UTC (permalink / raw)
  To: paulmck
  Cc: Susmit.Sarkar, Peter.Sewell, Andrew Haley, qemu-devel,
	Liu Ping Fan, Anthony Liguori, luc.maranget, Torvald Riegel,
	Richard Henderson

Il 18/06/2013 16:50, Paul E. McKenney ha scritto:
> PS:  Nevertheless, I personally prefer the C++ formulation, but that is
>      only because I stand with one foot in theory and the other in
>      practice.  If I were a pure practitioner, I would probably strongly
>      prefer the Java formulation.

Awesome answer, and this last paragraph sums it up pretty well.

That was basically my understanding, too.  I still do not completely 
get the relationship between Java semantics and ACQ_REL, but I can 
sidestep the issue for adding portable atomics to QEMU.  QEMU 
developers and Linux developers have some overlap, and Java volatiles 
are simple to understand in terms of memory barriers (which Linux 
uses); hence, I'll treat ourselves as pure practitioners.

I will just not use __atomic_load/__atomic_store to implement the 
primitives, and always express them in terms of memory barriers.  Of 
course, the memory barriers themselves are defined using 
__atomic_thread_fence.

I attach v2 of the patch below; Ping Fan, feel free to pick it up.

Thanks again!

Paolo

----------- 8< -------------

>From 5c71a423156d1f0d61f17afa99b099bf139c64ae Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Mon, 13 May 2013 13:29:47 +0200
Subject: [PATCH] add a header file for atomic operations

We're already using them in several places, but __sync builtins are just
too ugly to type, and do not provide seqcst load/store operations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/atomics.txt         | 345 +++++++++++++++++++++++++++++++++++++++++++++++
 hw/display/qxl.c         |   3 +-
 hw/virtio/vhost.c        |   9 +-
 include/qemu/atomic.h    | 190 +++++++++++++++++++++-----
 migration.c              |   3 +-
 tests/test-thread-pool.c |   8 +-
 6 files changed, 514 insertions(+), 44 deletions(-)
 create mode 100644 docs/atomics.txt

diff --git a/docs/atomics.txt b/docs/atomics.txt
new file mode 100644
index 0000000..e2ce04b
--- /dev/null
+++ b/docs/atomics.txt
@@ -0,0 +1,345 @@
+CPUs perform independent memory operations effectively in random order.
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest).  Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking.  Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization.  Anything else is considerably harder, but it's
+also justified more often than one would like.  The two tools that
+are provided by qemu/atomic.h are memory barriers and atomic operations.
+
+Macros defined by qemu/atomic.h fall in three camps:
+
+- compiler barriers: barrier();
+
+- weak atomic access and manual memory barriers: atomic_read(),
+  atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
+
+- sequentially consistent atomic access: everything else.
+
+
+COMPILER MEMORY BARRIER
+=======================
+
+barrier() prevents the compiler from moving the memory accesses either
+side of it to the other side.  The compiler barrier has no direct effect
+on the CPU, which may then reorder things however it wishes.
+
+barrier() is mostly used within qemu/atomic.h itself.  On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution.  In this
+case, qemu/atomic.h will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, barrier() can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+SEQUENTIALLY CONSISTENT ATOMIC ACCESS
+=====================================
+
+Most of the operations in the qemu/atomic.h header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+qemu/atomic.h provides the following set of atomic read-modify-write
+operations:
+
+    typeof(*ptr) atomic_inc(ptr)
+    typeof(*ptr) atomic_dec(ptr)
+    typeof(*ptr) atomic_add(ptr, val)
+    typeof(*ptr) atomic_sub(ptr, val)
+    typeof(*ptr) atomic_and(ptr, val)
+    typeof(*ptr) atomic_or(ptr, val)
+    typeof(*ptr) atomic_xchg(ptr, val
+    typeof(*ptr) atomic_cmpxchg(ptr, old, new)
+
+all of which return the old value of *ptr.  These operations are
+polymorphic; they operate on any type that is as wide as an int.
+
+Sequentially consistent loads and stores can be done using:
+
+    atomic_add(ptr, 0) for loads
+    atomic_xchg(ptr, val) for stores
+
+However, they are quite expensive on some platforms, notably POWER and
+ARM.  Therefore, qemu/atomic.h provides two primitives with slightly
+weaker constraints:
+
+    typeof(*ptr) atomic_mb_read(ptr)
+    void         atomic_mb_set(ptr, val)
+
+The semantics of these primitives map to Java volatile variables,
+and are strongly related to memory barriers as used in the Linux
+kernel (see below).
+
+As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
+be reordered with each other, and it is also not possible to reorder
+"normal" accesses around them.
+
+However, and this is the important difference between
+atomic_mb_read/atomic_mb_set and sequential consistency, it is important
+for both threads to access the same volatile variable.  It is not the
+case that everything visible to thread A when it writes volatile field f
+becomes visible to thread B after it reads volatile field g. The store
+and load have to "match" (i.e., be performed on the same volatile
+field) to achieve the right semantics.
+
+
+These operations operate on any type that is as wide as an int or smaller.
+
+
+WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+In general, if the algorithm you are writing includes both writes
+and reads on the same side, it is generally simpler to use sequentially
+consistent primitives.
+
+When using this model, variables are accessed with atomic_read() and
+atomic_set(), and restrictions to the ordering of accesses is enforced
+using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
+memory barriers.
+
+atomic_read() and atomic_set() prevents the compiler from using
+optimizations that might otherwise optimize accesses out of existence
+on the one hand, or that might create unsolicited accesses on the other.
+In general this should not have any effect, because the same compiler
+barriers are already implied by memory barriers.  However, it is useful
+to do so, because it tells readers which variables are shared with
+other threads, and which are local to the current thread or protected
+by other, more mundane means.
+
+Memory barriers control the order of references to shared memory.
+They come in four kinds:
+
+- smp_rmb() guarantees that all the LOAD operations specified before
+  the barrier will appear to happen before all the LOAD operations
+  specified after the barrier with respect to the other components of
+  the system.
+
+  In other words, smp_rmb() puts a partial ordering on loads, but is not
+  required to have any effect on stores.
+
+- smp_wmb() guarantees that all the STORE operations specified before
+  the barrier will appear to happen before all the STORE operations
+  specified after the barrier with respect to the other components of
+  the system.
+
+  In other words, smp_wmb() puts a partial ordering on stores, but is not
+  required to have any effect on loads.
+
+- smp_mb() guarantees that all the LOAD and STORE operations specified
+  before the barrier will appear to happen before all the LOAD and
+  STORE operations specified after the barrier with respect to the other
+  components of the system.
+
+  smp_mb() puts a partial ordering on both loads and stores.  It is
+  stronger than both a read and a write memory barrier; it implies both
+  smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
+  barrier from overtaking LOADs coming after the barrier and vice versa.
+
+- smp_read_barrier_depends() is a weaker kind of read barrier.  On
+  most processors, whenever two loads are performed such that the
+  second depends on the result of the first (e.g., the first load
+  retrieves the address to which the second load will be directed),
+  the processor will guarantee that the first LOAD will appear to happen
+  before the second with respect to the other components of the system.
+  However, this is not always true---for example, it was not true on
+  Alpha processors.  Whenever this kind of access happens to shared
+  memory (that is not protected by a lock), a read barrier is needed,
+  and smp_read_barrier_depends() can be used instead of smp_rmb().
+
+  Note that the first load really has to have a _data_ dependency and not
+  a control dependency.  If the address for the second load is dependent
+  on the first load, but the dependency is through a conditional rather
+  than actually loading the address itself, then it's a _control_
+  dependency and a full read barrier or better is required.
+
+
+This is the set of barriers that is required *between* two atomic_read()
+and atomic_set() operations to achieve sequential consistency:
+
+                    |               2nd operation             |
+                    |-----------------------------------------|
+     1st operation  | (after last) | atomic_read | atomic_set |
+     ---------------+--------------+-------------+------------|
+     (before first) |              | none        | smp_wmb()  |
+     ---------------+--------------+-------------+------------|
+     atomic_read    | smp_rmb()    | smp_rmb()*  | **         |
+     ---------------+--------------+-------------+------------|
+     atomic_set     | none         | smp_mb()*** | smp_wmb()  |
+     ---------------+--------------+-------------+------------|
+
+       * Or smp_read_barrier_depends().
+
+      ** This requires a load-store barrier.  How to achieve this varies
+         depending on the machine, but in practice smp_rmb()+smp_wmb()
+         should have the desired effect.  For example, on PowerPC the
+         lwsync instruction is a combined load-load, load-store and
+         store-store barrier.
+
+     *** This requires a store-load barrier.  On most machines, the only
+         way to achieve this is a full barrier.
+
+
+You can see that the two possible definitions of atomic_mb_read()
+and atomic_mb_set() are the following:
+
+    1) atomic_mb_read(p)   = atomic_read(p); smp_rmb()
+       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()
+
+    2) atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_rmb()
+       atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v);
+
+Usually the former is used, because smp_mb() is expensive and a program
+normally has more reads than writes.  Therefore it makes more sense to
+make atomic_mb_set() the more expensive operation.
+
+There are two common cases in which atomic_mb_read and atomic_mb_set
+generate too many memory barriers, and thus it can be useful to manually
+place barriers instead:
+
+- when a data structure has one thread that is always a writer
+  and one thread that is always a reader, manual placement of
+  memory barriers makes the write side faster.  Furthermore,
+  correctness is easy to check for in this case using the "pairing"
+  trick that is explained below:
+
+     thread 1                                thread 1
+     -------------------------               ------------------------
+     (other writes)
+                                             smp_wmb()
+     atomic_mb_set(&a, x)                    atomic_set(&a, x)
+                                             smp_wmb()
+     atomic_mb_set(&b, y)                    atomic_set(&b, y)
+
+                                       =>
+     thread 2                                thread 2
+     -------------------------               ------------------------
+     y = atomic_mb_read(&b)                  y = atomic_read(&b)
+                                             smp_rmb()
+     x = atomic_mb_read(&a)                  x = atomic_read(&a)
+                                             smp_rmb()
+
+- sometimes, a thread is accessing many variables that are otherwise
+  unrelated to each other (for example because, apart from the current
+  thread, exactly one other thread will read or write each of these
+  variables).  In this case, it is possible to "hoist" the implicit
+  barriers provided by atomic_mb_read() and atomic_mb_set() outside
+  a loop.  For example, the above definition atomic_mb_read() gives
+  the following transformation:
+
+     n = 0;                                  n = 0;
+     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
+       n += atomic_mb_read(&a[i]);             n += atomic_read(&a[i]);
+                                             smp_rmb();
+
+  Similarly, atomic_mb_set() can be transformed as follows:
+  smp_mb():
+
+                                             smp_wmb();
+     for (i = 0; i < 10; i++)          =>    for (i = 0; i < 10; i++)
+       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
+                                             smp_mb();
+
+
+The two tricks can be combined.  In this case, splitting a loop in
+two lets you hoist the barriers out of the loops _and_ eliminate the
+expensive smp_mb():
+
+                                             smp_wmb();
+     for (i = 0; i < 10; i++) {        =>    for (i = 0; i < 10; i++)
+       atomic_mb_set(&a[i], false);            atomic_set(&a[i], false);
+       atomic_mb_set(&b[i], false);          smb_wmb();
+     }                                       for (i = 0; i < 10; i++)
+                                               atomic_set(&a[i], false);
+                                             smp_mb();
+
+  The other thread can still use atomic_mb_read()/atomic_mb_set()
+
+
+Memory barrier pairing
+----------------------
+
+A useful rule of thumb is that memory barriers should always, or almost
+always, be paired with another barrier.  In the case of QEMU, however,
+note that the other barrier may actually be in a driver that runs in
+the guest!
+
+For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
+both count as read barriers.  A read barriers shall pair with a write
+barrier or a full barrier; a write barrier shall pair with a read
+barrier or a full barrier.  A full barrier can pair with anything.
+For example:
+
+        thread 1             thread 2
+        ===============      ===============
+        a = 1;
+        smp_wmb();
+        b = 2;               x = b;
+                             smp_rmb();
+                             y = a;
+
+Note that the "writing" thread are accessing the variables in the
+opposite order as the "reading" thread.  This is expected: stores
+before the write barrier will normally match the loads after the
+read barrier, and vice versa.  The same is true for more than 2
+access and for data dependency barriers:
+
+        thread 1             thread 2
+        ===============      ===============
+        b[2] = 1;
+        smp_wmb();
+        x->i = 2;
+        smp_wmb();
+        a = x;               x = a;
+                             smp_read_barrier_depends();
+                             y = x->i;
+                             smp_read_barrier_depends();
+                             z = b[y];
+
+smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
+with atomic_mb_set().
+
+
+COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
+============================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are always on a 32-bit int type and
+  use a boxed atomic_t type; atomic operations in QEMU are polymorphic
+  and use normal C types.
+
+- atomic_read and atomic_set in Linux give no guarantee at all;
+  atomic_read and atomic_set in QEMU include a compiler barrier
+  (similar to the ACCESS_ONCE macro in Linux).
+
+- most atomic read-modify-write operations in Linux return void;
+  in QEMU, all of them return the old value of the variable.
+
+- different atomic read-modify-write operations in Linux imply
+  a different set of memory barriers; in QEMU, all of them enforce
+  sequential consistency, which means they imply full memory barriers
+  before and after the operation.
+
+- Linux does not have an equivalent of atomic_mb_read() and
+  atomic_mb_set().  In particular, note that set_mb() is a little
+  weaker than atomic_mb_set().
+
+
+SOURCES
+=======
+
+* Documentation/memory-barriers.txt from the Linux kernel
+
+* "The JSR-133 Cookbook for Compiler Writers", available at
+  http://g.oswego.edu/dl/jmm/cookbook.html
diff --git a/hw/display/qxl.c b/hw/display/qxl.c
index 0c852de..a85e178 100644
--- a/hw/display/qxl.c
+++ b/hw/display/qxl.c
@@ -23,6 +23,7 @@
 #include "qemu-common.h"
 #include "qemu/timer.h"
 #include "qemu/queue.h"
+#include "qemu/atomic.h"
 #include "monitor/monitor.h"
 #include "sysemu/sysemu.h"
 #include "trace.h"
@@ -1725,7 +1726,7 @@ static void qxl_send_events(PCIQXLDevice *d, uint32_t events)
         trace_qxl_send_events_vm_stopped(d->id, events);
         return;
     }
-    old_pending = __sync_fetch_and_or(&d->ram->int_pending, le_events);
+    old_pending = atomic_or(&d->ram->int_pending, le_events);
     if ((old_pending & le_events) == le_events) {
         return;
     }
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 96ab625..8f6ab13 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -16,6 +16,7 @@
 #include <sys/ioctl.h>
 #include "hw/virtio/vhost.h"
 #include "hw/hw.h"
+#include "qemu/atomic.h"
 #include "qemu/range.h"
 #include <linux/vhost.h>
 #include "exec/address-spaces.h"
@@ -47,11 +48,9 @@ static void vhost_dev_sync_region(struct vhost_dev *dev,
             addr += VHOST_LOG_CHUNK;
             continue;
         }
-        /* Data must be read atomically. We don't really
-         * need the barrier semantics of __sync
-         * builtins, but it's easier to use them than
-         * roll our own. */
-        log = __sync_fetch_and_and(from, 0);
+        /* Data must be read atomically. We don't really need barrier semantics
+         * but it's easier to use atomic_* than roll our own. */
+        log = atomic_xchg(from, 0);
         while ((bit = sizeof(log) > sizeof(int) ?
                 ffsll(log) : ffs(log))) {
             hwaddr page_addr;
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 10becb6..04d64d0 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -1,68 +1,194 @@
-#ifndef __QEMU_BARRIER_H
-#define __QEMU_BARRIER_H 1
+/*
+ * Simple interface for atomic operations.
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Author: Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
 
-/* Compiler barrier */
-#define barrier()   asm volatile("" ::: "memory")
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H 1
 
-#if defined(__i386__)
+#include "qemu/compiler.h"
 
-#include "qemu/compiler.h"        /* QEMU_GNUC_PREREQ */
+/* For C11 atomic ops */
 
-/*
- * Because of the strongly ordered x86 storage model, wmb() and rmb() are nops
- * on x86(well, a compiler barrier only).  Well, at least as long as
- * qemu doesn't do accesses to write-combining memory or non-temporal
- * load/stores from C code.
- */
-#define smp_wmb()   barrier()
-#define smp_rmb()   barrier()
+/* Compiler barrier */
+#define barrier()   ({ asm volatile("" ::: "memory"); (void)0; })
+
+#ifndef __ATOMIC_RELAXED
 
 /*
- * We use GCC builtin if it's available, as that can use
- * mfence on 32 bit as well, e.g. if built with -march=pentium-m.
- * However, on i386, there seem to be known bugs as recently as 4.3.
- * */
-#if QEMU_GNUC_PREREQ(4, 4)
-#define smp_mb() __sync_synchronize()
+ * We use GCC builtin if it's available, as that can use mfence on
+ * 32-bit as well, e.g. if built with -march=pentium-m. However, on
+ * i386 the spec is buggy, and the implementation followed it until
+ * 4.3 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36793).
+ */
+#if defined(__i386__) || defined(__x86_64__)
+#if !QEMU_GNUC_PREREQ(4, 4)
+#if defined __x86_64__
+#define smp_mb()    ({ asm volatile("mfence" ::: "memory"); (void)0; })
 #else
-#define smp_mb() asm volatile("lock; addl $0,0(%%esp) " ::: "memory")
+#define smp_mb()    ({ asm volatile("lock; addl $0,0(%%esp) " ::: "memory"); (void)0; })
+#endif
+#endif
 #endif
 
-#elif defined(__x86_64__)
 
+#ifdef __alpha__
+#define smp_read_barrier_depends()   asm volatile("mb":::"memory")
+#endif
+
+#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
+
+/*
+ * Because of the strongly ordered storage model, wmb() and rmb() are nops
+ * here (a compiler barrier only).  QEMU doesn't do accesses to write-combining
+ * qemu memory or non-temporal load/stores from C code.
+ */
 #define smp_wmb()   barrier()
 #define smp_rmb()   barrier()
-#define smp_mb() asm volatile("mfence" ::: "memory")
+
+/*
+ * __sync_lock_test_and_set() is documented to be an acquire barrier only,
+ * but it is a full barrier at the hardware level.  Add a compiler barrier
+ * to make it a full barrier also at the compiler level.
+ */
+#define atomic_xchg(ptr, i)    (barrier(), __sync_lock_test_and_set(ptr, i))
+
+/*
+ * Load/store with Java volatile semantics.
+ */
+#define atomic_mb_set(ptr, i)  ((void)atomic_xchg(ptr, i))
 
 #elif defined(_ARCH_PPC)
 
 /*
  * We use an eieio() for wmb() on powerpc.  This assumes we don't
  * need to order cacheable and non-cacheable stores with respect to
- * each other
+ * each other.
+ *
+ * smp_mb has the same problem as on x86 for not-very-new GCC
+ * (http://patchwork.ozlabs.org/patch/126184/, Nov 2011).
  */
-#define smp_wmb()   asm volatile("eieio" ::: "memory")
-
+#define smp_wmb()   ({ asm volatile("eieio" ::: "memory"); (void)0; })
 #if defined(__powerpc64__)
-#define smp_rmb()   asm volatile("lwsync" ::: "memory")
+#define smp_rmb()   ({ asm volatile("lwsync" ::: "memory"); (void)0; })
 #else
-#define smp_rmb()   asm volatile("sync" ::: "memory")
+#define smp_rmb()   ({ asm volatile("sync" ::: "memory"); (void)0; })
 #endif
+#define smp_mb()    ({ asm volatile("sync" ::: "memory"); (void)0; })
 
-#define smp_mb()   asm volatile("sync" ::: "memory")
+#endif /* _ARCH_PPC */
 
-#else
+#endif /* C11 atomics */
 
 /*
  * For (host) platforms we don't have explicit barrier definitions
  * for, we use the gcc __sync_synchronize() primitive to generate a
  * full barrier.  This should be safe on all platforms, though it may
- * be overkill for wmb() and rmb().
+ * be overkill for smp_wmb() and smp_rmb().
  */
+#ifndef smp_mb
+#define smp_mb()    __sync_synchronize()
+#endif
+
+#ifndef smp_wmb
+#ifdef __ATOMIC_RELEASE
+#define smp_wmb()   __atomic_thread_fence(__ATOMIC_RELEASE)
+#else
 #define smp_wmb()   __sync_synchronize()
-#define smp_mb()   __sync_synchronize()
+#endif
+#endif
+
+#ifndef smp_rmb
+#ifdef __ATOMIC_ACQUIRE
+#define smp_rmb()   __atomic_thread_fence(__ATOMIC_ACQUIRE)
+#else
 #define smp_rmb()   __sync_synchronize()
+#endif
+#endif
+
+#ifndef smp_read_barrier_depends
+#ifdef __ATOMIC_CONSUME
+#define smp_read_barrier_depends()   __atomic_thread_fence(__ATOMIC_CONSUME)
+#else
+#define smp_read_barrier_depends()   barrier()
+#endif
+#endif
 
+#ifndef atomic_read
+#define atomic_read(ptr)       (*(__typeof__(*ptr) *volatile) (ptr))
 #endif
 
+#ifndef atomic_set
+#define atomic_set(ptr, i)     ((*(__typeof__(*ptr) *volatile) (ptr)) = (i))
+#endif
+
+/* These have the same semantics as Java volatile variables.
+ * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
+ * "1. Issue a StoreStore barrier (wmb) before each volatile store."
+ *  2. Issue a StoreLoad barrier after each volatile store.
+ *     Note that you could instead issue one before each volatile load, but
+ *     this would be slower for typical programs using volatiles in which
+ *     reads greatly outnumber writes. Alternatively, if available, you
+ *     can implement volatile store as an atomic instruction (for example
+ *     XCHG on x86) and omit the barrier. This may be more efficient if
+ *     atomic instructions are cheaper than StoreLoad barriers.
+ *  3. Issue LoadLoad and LoadStore barriers after each volatile load."
+ *
+ * If you prefer to think in terms of "pairing" of memory barriers,
+ * an atomic_mb_read pairs with an atomic_mb_set.
+ *
+ * And for the few ia64 lovers that exist, an atomic_mb_read is a ld.acq,
+ * while an atomic_mb_set is a st.rel followed by a memory barrier.
+ *
+ * These are a bit weaker than __atomic_load/store with __ATOMIC_SEQ_CST
+ * (see docs/atomics.txt), and I'm not sure that __ATOMIC_ACQ_REL is enough.
+ * Just always use the barriers manually by the rules above.
+ */
+#ifndef atomic_mb_read
+#define atomic_mb_read(ptr)    ({           \
+    typeof(*ptr) _val = atomic_read(ptr);   \
+    smp_rmb();                              \
+    _val;                                   \
+})
+#endif
+
+#ifndef atomic_mb_set
+#define atomic_mb_set(ptr, i)  do {         \
+    smp_wmb();                              \
+    atomic_set(ptr, i);                     \
+    smp_mb();                               \
+} while (0)
+#endif
+
+#ifndef atomic_xchg
+#ifdef __ATOMIC_SEQ_CST
+#define atomic_xchg(ptr, i)    ({                           \
+    typeof(*ptr) _new = (i), _old;                          \
+    __atomic_exchange(ptr, &_new, &_old, __ATOMIC_SEQ_CST); \
+    _old;                                                   \
+})
+#elif defined __clang__
+#define atomic_xchg(ptr, i)    __sync_exchange(ptr, i)
+#else
+/* __sync_lock_test_and_set() is documented to be an acquire barrier only.  */
+#define atomic_xchg(ptr, i)    (smp_mb(), __sync_lock_test_and_set(ptr, i))
+#endif
+#endif
+
+/* Provide shorter names for GCC atomic builtins.  */
+#define atomic_inc(ptr)        __sync_fetch_and_add(ptr, 1)
+#define atomic_dec(ptr)        __sync_fetch_and_add(ptr, -1)
+#define atomic_add             __sync_fetch_and_add
+#define atomic_sub             __sync_fetch_and_sub
+#define atomic_and             __sync_fetch_and_and
+#define atomic_or              __sync_fetch_and_or
+#define atomic_cmpxchg         __sync_val_compare_and_swap
+
 #endif
diff --git a/migration.c b/migration.c
index 058f9e6..83f5691 100644
--- a/migration.c
+++ b/migration.c
@@ -290,8 +290,7 @@ static void migrate_fd_cleanup(void *opaque)
 
 static void migrate_finish_set_state(MigrationState *s, int new_state)
 {
-    if (__sync_val_compare_and_swap(&s->state, MIG_STATE_ACTIVE,
-                                    new_state) == new_state) {
+    if (atomic_cmpxchg(&s->state, MIG_STATE_ACTIVE, new_state) == new_state) {
         trace_migrate_set_state(new_state);
     }
 }
diff --git a/tests/test-thread-pool.c b/tests/test-thread-pool.c
index 22915aa..dbf2fc8 100644
--- a/tests/test-thread-pool.c
+++ b/tests/test-thread-pool.c
@@ -17,15 +17,15 @@ typedef struct {
 static int worker_cb(void *opaque)
 {
     WorkerTestData *data = opaque;
-    return __sync_fetch_and_add(&data->n, 1);
+    return atomic_inc(&data->n);
 }
 
 static int long_cb(void *opaque)
 {
     WorkerTestData *data = opaque;
-    __sync_fetch_and_add(&data->n, 1);
+    atomic_inc(&data->n);
     g_usleep(2000000);
-    __sync_fetch_and_add(&data->n, 1);
+    atomic_inc(&data->n);
     return 0;
 }
 
@@ -169,7 +169,7 @@ static void test_cancel(void)
     /* Cancel the jobs that haven't been started yet.  */
     num_canceled = 0;
     for (i = 0; i < 100; i++) {
-        if (__sync_val_compare_and_swap(&data[i].n, 0, 3) == 0) {
+        if (atomic_cmpxchg(&data[i].n, 0, 3) == 0) {
             data[i].ret = -ECANCELED;
             bdrv_aio_cancel(data[i].aiocb);
             active--;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-18 15:14   ` mdroth
@ 2013-06-18 16:19     ` mdroth
  2013-06-18 19:20     ` Paolo Bonzini
  1 sibling, 0 replies; 41+ messages in thread
From: mdroth @ 2013-06-18 16:19 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: Paolo Bonzini, qemu-devel, Anthony Liguori

On Tue, Jun 18, 2013 at 10:14:38AM -0500, mdroth wrote:
> On Sun, Jun 16, 2013 at 07:21:21PM +0800, Liu Ping Fan wrote:
> > BH will be used outside big lock, so introduce lock to protect
> > between the writers, ie, bh's adders and deleter.
> > Note that the lock only affects the writers and bh's callback does
> > not take this extra lock.
> > 
> > Signed-off-by: Liu Ping Fan <pingfanl@linux.vnet.ibm.com>
> > ---
> >  async.c             | 21 +++++++++++++++++++++
> >  include/block/aio.h |  3 +++
> >  2 files changed, 24 insertions(+)
> > 
> > diff --git a/async.c b/async.c
> > index 90fe906..6a3269f 100644
> > --- a/async.c
> > +++ b/async.c
> > @@ -47,8 +47,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
> >      bh->ctx = ctx;
> >      bh->cb = cb;
> >      bh->opaque = opaque;
> > +    qemu_mutex_lock(&ctx->bh_lock);
> >      bh->next = ctx->first_bh;
> > +    /* Make sure the memebers ready before putting bh into list */
> > +    smp_wmb();
> >      ctx->first_bh = bh;
> > +    qemu_mutex_unlock(&ctx->bh_lock);
> >      return bh;
> >  }
> > 
> > @@ -61,12 +65,18 @@ int aio_bh_poll(AioContext *ctx)
> > 
> >      ret = 0;
> >      for (bh = ctx->first_bh; bh; bh = next) {
> > +        /* Make sure fetching bh before accessing its members */
> > +        smp_read_barrier_depends();
> >          next = bh->next;
> >          if (!bh->deleted && bh->scheduled) {
> >              bh->scheduled = 0;
> >              if (!bh->idle)
> >                  ret = 1;
> >              bh->idle = 0;
> > +            /* Paired with write barrier in bh schedule to ensure reading for
> > +             *  callbacks coming after bh's scheduling.
> > +             */
> > +            smp_rmb();
> >              bh->cb(bh->opaque);
> 
> Could we possibly simplify this by introducing a recursive mutex that we
> could use to protect the whole list loop and hold even during the cb?
> 
> I assume we can't hold the lock during the cb currently since we might
> try to reschedule, but if it's a recursive mutex would that simplify
> things?
> 
> I've been doing something similar with IOHandlers for the QContext
> stuff, and that's the approach I took. This patch introduces the
> recursive mutex:
> 
> https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53

Doh, missing this fix:

diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 0623eca..a4d8170 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -81,7 +81,7 @@ int qemu_mutex_trylock(QemuMutex *mutex)
 void qemu_mutex_unlock(QemuMutex *mutex)
 {
     assert(mutex->owner == GetCurrentThreadId());
-    if (!mutex->recursive || mutex->recursion_depth == 0) {
+    if (!mutex->recursive || --mutex->recursion_depth == 0) {
         mutex->owner = 0;
     }
     LeaveCriticalSection(&mutex->lock);

> 
> 
> >          }
> >      }
> > @@ -75,6 +85,7 @@ int aio_bh_poll(AioContext *ctx)
> > 
> >      /* remove deleted bhs */
> >      if (!ctx->walking_bh) {
> > +        qemu_mutex_lock(&ctx->bh_lock);
> >          bhp = &ctx->first_bh;
> >          while (*bhp) {
> >              bh = *bhp;
> > @@ -85,6 +96,7 @@ int aio_bh_poll(AioContext *ctx)
> >                  bhp = &bh->next;
> >              }
> >          }
> > +        qemu_mutex_unlock(&ctx->bh_lock);
> >      }
> > 
> >      return ret;
> > @@ -94,6 +106,10 @@ void qemu_bh_schedule_idle(QEMUBH *bh)
> >  {
> >      if (bh->scheduled)
> >          return;
> > +    /* Make sure any writes that are needed by the callback are done
> > +     * before the locations are read in the aio_bh_poll.
> > +     */
> > +    smp_wmb();
> >      bh->scheduled = 1;
> >      bh->idle = 1;
> >  }
> > @@ -102,6 +118,10 @@ void qemu_bh_schedule(QEMUBH *bh)
> >  {
> >      if (bh->scheduled)
> >          return;
> > +    /* Make sure any writes that are needed by the callback are done
> > +     * before the locations are read in the aio_bh_poll.
> > +     */
> > +    smp_wmb();
> >      bh->scheduled = 1;
> >      bh->idle = 0;
> >      aio_notify(bh->ctx);
> > @@ -211,6 +231,7 @@ AioContext *aio_context_new(void)
> >      ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
> >      ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
> >      ctx->thread_pool = NULL;
> > +    qemu_mutex_init(&ctx->bh_lock);
> >      event_notifier_init(&ctx->notifier, false);
> >      aio_set_event_notifier(ctx, &ctx->notifier, 
> >                             (EventNotifierHandler *)
> > diff --git a/include/block/aio.h b/include/block/aio.h
> > index 1836793..971fbef 100644
> > --- a/include/block/aio.h
> > +++ b/include/block/aio.h
> > @@ -17,6 +17,7 @@
> >  #include "qemu-common.h"
> >  #include "qemu/queue.h"
> >  #include "qemu/event_notifier.h"
> > +#include "qemu/thread.h"
> > 
> >  typedef struct BlockDriverAIOCB BlockDriverAIOCB;
> >  typedef void BlockDriverCompletionFunc(void *opaque, int ret);
> > @@ -53,6 +54,8 @@ typedef struct AioContext {
> >       */
> >      int walking_handlers;
> > 
> > +    /* lock to protect between bh's adders and deleter */
> > +    QemuMutex bh_lock;
> >      /* Anchor of the list of Bottom Halves belonging to the context */
> >      struct QEMUBH *first_bh;
> > 
> > -- 
> > 1.8.1.4
> > 
> > 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 16:08         ` Paolo Bonzini
@ 2013-06-18 16:38           ` Torvald Riegel
  2013-06-19  1:57             ` Paul E. McKenney
  2013-06-19  9:31             ` Paolo Bonzini
  0 siblings, 2 replies; 41+ messages in thread
From: Torvald Riegel @ 2013-06-18 16:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

On Tue, 2013-06-18 at 18:08 +0200, Paolo Bonzini wrote:
> Il 18/06/2013 16:50, Paul E. McKenney ha scritto:
> > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> >      only because I stand with one foot in theory and the other in
> >      practice.  If I were a pure practitioner, I would probably strongly
> >      prefer the Java formulation.
> 
> Awesome answer, and this last paragraph sums it up pretty well.

I disagree that for non-Java code the Java model should be better.  Both
C11 and C++11 use the same model, and I don't see a reason to not use it
if you're writing C/C++ code anyway.

The C++ model is definitely useful for practitioners; just because it
uses seq-cst memory order as safe default doesn't mean that programmers
that can deal with weaker ordering guarantees can't make use of those
weaker ones.
I thought Paul was referring to seq-cst as default; if that wasn't the
point he wanted to make, I actually don't understand his theory/practice
comparison (never mind that whenever you need to reason about concurrent
stuff, having a solid formal framework as the one by the Cambridge group
is definitely helpful).  Seq-cst and acq-rel are just different
guarantees -- this doesn't mean that one is better than the other; you
need to understand anyway what you're doing and which one you need.
Often, ensuring a synchronized-with edge by pairing release/acquire will
be sufficient, but that doesn't say anything about the Java vs. C/C++
model.

> That was basically my understanding, too.  I still do not completely 
> get the relationship between Java semantics and ACQ_REL, but I can 
> sidestep the issue for adding portable atomics to QEMU.  QEMU 
> developers and Linux developers have some overlap, and Java volatiles 
> are simple to understand in terms of memory barriers (which Linux 
> uses); hence, I'll treat ourselves as pure practitioners.

I don't think that this is the conclusion here.  I strongly suggest to
just go with the C11/C++11 model, instead of rolling your own or trying
to replicate the Java model.  That would also allow you to just point to
the C11 model and any information / tutorials about it instead of having
to document your own (see the patch), and you can make use of any
(future) tool support (e.g., race detectors).

> I will just not use __atomic_load/__atomic_store to implement the 
> primitives, and always express them in terms of memory barriers.

Why?  (If there's some QEMU-specific reason, just let me know; I know
little about QEMU..)
I would assume that using the __atomic* builtins is just fine if they're
available.

Torvald

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 13:24     ` [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations) Paolo Bonzini
  2013-06-18 14:50       ` Paul E. McKenney
  2013-06-18 15:26       ` Torvald Riegel
@ 2013-06-18 17:38       ` Andrew Haley
  2013-06-19  9:30         ` Paolo Bonzini
  2 siblings, 1 reply; 41+ messages in thread
From: Andrew Haley @ 2013-06-18 17:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Torvald Riegel, qemu-devel, Liu Ping Fan, Anthony Liguori,
	Paul E. McKenney, Richard Henderson

On 06/18/2013 02:24 PM, Paolo Bonzini wrote:
> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
> paragraph of
> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
> seems to suggest?

As far as I know, the Java semantics are acq/rel.  I can't see anything
there that suggests otherwise.  If we'd wanted to know for certain we
should have CC'd Doug lea.

Andrew.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-18 15:14   ` mdroth
  2013-06-18 16:19     ` mdroth
@ 2013-06-18 19:20     ` Paolo Bonzini
  2013-06-18 22:26       ` mdroth
  1 sibling, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-18 19:20 UTC (permalink / raw)
  To: mdroth; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

Il 18/06/2013 17:14, mdroth ha scritto:
> Could we possibly simplify this by introducing a recursive mutex that we
> could use to protect the whole list loop and hold even during the cb?

If it is possible, we should avoid recursive locks.  It makes impossible
to establish a lock hierarchy.  For example:

> I assume we can't hold the lock during the cb currently since we might
> try to reschedule, but if it's a recursive mutex would that simplify
> things?

If you have two callbacks in two different AioContexts, both of which do
bdrv_drain_all(), you get an AB-BA deadlock

Also, the barriers for qemu_bh_schedule are needed anyway.  It's only
those that order bh->next vs. ctx->first_bh that would go away.

Paolo

> I've been doing something similar with IOHandlers for the QContext
> stuff, and that's the approach I took. This patch introduces the
> recursive mutex:
> 
> https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-18 19:20     ` Paolo Bonzini
@ 2013-06-18 22:26       ` mdroth
  2013-06-19  9:27         ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: mdroth @ 2013-06-18 22:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

On Tue, Jun 18, 2013 at 09:20:26PM +0200, Paolo Bonzini wrote:
> Il 18/06/2013 17:14, mdroth ha scritto:
> > Could we possibly simplify this by introducing a recursive mutex that we
> > could use to protect the whole list loop and hold even during the cb?
> 
> If it is possible, we should avoid recursive locks.  It makes impossible
> to establish a lock hierarchy.  For example:
> 
> > I assume we can't hold the lock during the cb currently since we might
> > try to reschedule, but if it's a recursive mutex would that simplify
> > things?
> 
> If you have two callbacks in two different AioContexts, both of which do
> bdrv_drain_all(), you get an AB-BA deadlock

I think I see what you mean. That problem exists regardless of whether we
introduce a recursive mutex though right? I guess the main issue is the
fact that we'd be encouraging sloppy locking practices without
addressing the root problem?

I'm just worried what other subtle problems pop up if we instead rely
heavily on memory barriers and inevitably forget one here or there, but
maybe that's just me not having a good understanding of when to use them.

But doesn't rcu provide higher-level interfaces for these kinds of things?
Is it possible to hide any of this behind our list interfaces?

> 
> Also, the barriers for qemu_bh_schedule are needed anyway.  It's only
> those that order bh->next vs. ctx->first_bh that would go away.

I see, I guess it's unavoidable for some cases.

> 
> Paolo
> 
> > I've been doing something similar with IOHandlers for the QContext
> > stuff, and that's the approach I took. This patch introduces the
> > recursive mutex:
> > 
> > https://github.com/mdroth/qemu/commit/c7ee0844da62283c9466fcb10ddbfadd0b8bfc53
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 15:37         ` Torvald Riegel
@ 2013-06-19  1:53           ` Paul E. McKenney
  2013-06-19  7:11             ` Torvald Riegel
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2013-06-19  1:53 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Susmit.Sarkar, Andrew Haley, qemu-devel, Liu Ping Fan,
	Anthony Liguori, luc.maranget, Paolo Bonzini, Peter.Sewell,
	Richard Henderson

On Tue, Jun 18, 2013 at 05:37:42PM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> > First, I am not a fan of SC, mostly because there don't seem to be many
> > (any?) production-quality algorithms that need SC.  But if you really
> > want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
> 
> Dekker-style mutual exclusion is useful for things like read-mostly
> multiple-reader single-writer locks, or similar "asymmetric" cases of
> synchronization.  SC fences are needed for this.

They definitely need Power hwsync rather than lwsync, but they need
fewer fences than would be emitted by slavishly following either of the
SC recipes for Power.  (Another example needing store-to-load ordering
is hazard pointers.)

> > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> >      only because I stand with one foot in theory and the other in
> >      practice.  If I were a pure practitioner, I would probably strongly
> >      prefer the Java formulation.
> 
> That's because you're a practitioner with experience :)

I knew I had all this grey hair for some reason or another...  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 16:38           ` Torvald Riegel
@ 2013-06-19  1:57             ` Paul E. McKenney
  2013-06-19  9:31             ` Paolo Bonzini
  1 sibling, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2013-06-19  1:57 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori,
	Paolo Bonzini, Richard Henderson

On Tue, Jun 18, 2013 at 06:38:38PM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 18:08 +0200, Paolo Bonzini wrote:
> > Il 18/06/2013 16:50, Paul E. McKenney ha scritto:
> > > PS:  Nevertheless, I personally prefer the C++ formulation, but that is
> > >      only because I stand with one foot in theory and the other in
> > >      practice.  If I were a pure practitioner, I would probably strongly
> > >      prefer the Java formulation.
> > 
> > Awesome answer, and this last paragraph sums it up pretty well.
> 
> I disagree that for non-Java code the Java model should be better.  Both
> C11 and C++11 use the same model, and I don't see a reason to not use it
> if you're writing C/C++ code anyway.
> 
> The C++ model is definitely useful for practitioners; just because it
> uses seq-cst memory order as safe default doesn't mean that programmers
> that can deal with weaker ordering guarantees can't make use of those
> weaker ones.
> I thought Paul was referring to seq-cst as default; if that wasn't the
> point he wanted to make, I actually don't understand his theory/practice
> comparison (never mind that whenever you need to reason about concurrent
> stuff, having a solid formal framework as the one by the Cambridge group
> is definitely helpful).  Seq-cst and acq-rel are just different
> guarantees -- this doesn't mean that one is better than the other; you
> need to understand anyway what you're doing and which one you need.
> Often, ensuring a synchronized-with edge by pairing release/acquire will
> be sufficient, but that doesn't say anything about the Java vs. C/C++
> model.

Having the Cambridge group's model and tooling around has definitely
made my life much easier!

And yes, I do very much see a need for a wide range of ordering/overhead
tradeoffs, at least until such time as someone figures a way around
either the atomic nature of matter or the finite speed of light.  ;-)

							Thanx, Paul

> > That was basically my understanding, too.  I still do not completely 
> > get the relationship between Java semantics and ACQ_REL, but I can 
> > sidestep the issue for adding portable atomics to QEMU.  QEMU 
> > developers and Linux developers have some overlap, and Java volatiles 
> > are simple to understand in terms of memory barriers (which Linux 
> > uses); hence, I'll treat ourselves as pure practitioners.
> 
> I don't think that this is the conclusion here.  I strongly suggest to
> just go with the C11/C++11 model, instead of rolling your own or trying
> to replicate the Java model.  That would also allow you to just point to
> the C11 model and any information / tutorials about it instead of having
> to document your own (see the patch), and you can make use of any
> (future) tool support (e.g., race detectors).
> 
> > I will just not use __atomic_load/__atomic_store to implement the 
> > primitives, and always express them in terms of memory barriers.
> 
> Why?  (If there's some QEMU-specific reason, just let me know; I know
> little about QEMU..)
> I would assume that using the __atomic* builtins is just fine if they're
> available.
> 
> 
> Torvald
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19  1:53           ` Paul E. McKenney
@ 2013-06-19  7:11             ` Torvald Riegel
  2013-06-20 15:18               ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Torvald Riegel @ 2013-06-19  7:11 UTC (permalink / raw)
  To: paulmck
  Cc: Susmit.Sarkar, Andrew Haley, qemu-devel, Liu Ping Fan,
	Anthony Liguori, luc.maranget, Paolo Bonzini, Peter.Sewell,
	Richard Henderson

On Tue, 2013-06-18 at 18:53 -0700, Paul E. McKenney wrote:
> On Tue, Jun 18, 2013 at 05:37:42PM +0200, Torvald Riegel wrote:
> > On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> > > First, I am not a fan of SC, mostly because there don't seem to be many
> > > (any?) production-quality algorithms that need SC.  But if you really
> > > want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
> > 
> > Dekker-style mutual exclusion is useful for things like read-mostly
> > multiple-reader single-writer locks, or similar "asymmetric" cases of
> > synchronization.  SC fences are needed for this.
> 
> They definitely need Power hwsync rather than lwsync, but they need
> fewer fences than would be emitted by slavishly following either of the
> SC recipes for Power.  (Another example needing store-to-load ordering
> is hazard pointers.)

The C++11 seq-cst fence expands to hwsync; combined with a relaxed
store / load, that should be minimal.  Or are you saying that on Power,
there is a weaker HW barrier available that still constrains store-load
reordering sufficiently?


Torvald

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-18 22:26       ` mdroth
@ 2013-06-19  9:27         ` Paolo Bonzini
  2013-06-20  9:11           ` Stefan Hajnoczi
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-19  9:27 UTC (permalink / raw)
  To: mdroth; +Cc: Liu Ping Fan, Anthony Liguori, qemu-devel

Il 19/06/2013 00:26, mdroth ha scritto:
> On Tue, Jun 18, 2013 at 09:20:26PM +0200, Paolo Bonzini wrote:
>> Il 18/06/2013 17:14, mdroth ha scritto:
>>> Could we possibly simplify this by introducing a recursive mutex that we
>>> could use to protect the whole list loop and hold even during the cb?
>>
>> If it is possible, we should avoid recursive locks.  It makes impossible
>> to establish a lock hierarchy.  For example:
>>
>>> I assume we can't hold the lock during the cb currently since we might
>>> try to reschedule, but if it's a recursive mutex would that simplify
>>> things?
>>
>> If you have two callbacks in two different AioContexts, both of which do
>> bdrv_drain_all(), you get an AB-BA deadlock
> 
> I think I see what you mean. That problem exists regardless of whether we
> introduce a recursive mutex though right?

Without a recursive mutex, you only hold one lock at a time in each thread.

> I guess the main issue is the
> fact that we'd be encouraging sloppy locking practices without
> addressing the root problem?

Yeah.  We're basically standing where the Linux kernel stood 10 years
ago (let's say 2.2 timeframe).  If Linux got this far without recursive
mutexes, we can at least try. :)

> I'm just worried what other subtle problems pop up if we instead rely
> heavily on memory barriers and inevitably forget one here or there, but
> maybe that's just me not having a good understanding of when to use them.

That's true.  I hope that the docs in patch 1 help, and (much) more
thorough docs are available in the Linux kernel.

> But doesn't rcu provide higher-level interfaces for these kinds of things?
> Is it possible to hide any of this behind our list interfaces?

My atomics header file provides higher-level interfaces, but in most
cases barriers are not that harder to use and the docs help converting
one style to the other.

So far I've not used RCU for lists, only for entire data structures.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 17:38       ` Andrew Haley
@ 2013-06-19  9:30         ` Paolo Bonzini
  2013-06-19 15:36           ` Andrew Haley
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-19  9:30 UTC (permalink / raw)
  To: Andrew Haley
  Cc: Torvald Riegel, qemu-devel, Liu Ping Fan, Anthony Liguori,
	Paul E. McKenney, Richard Henderson

Il 18/06/2013 19:38, Andrew Haley ha scritto:
>> > Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>> > paragraph of
>> > http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>> > seems to suggest?
> As far as I know, the Java semantics are acq/rel.  I can't see anything
> there that suggests otherwise.  If we'd wanted to know for certain we
> should have CC'd Doug lea.

acq/rel wouldn't have a full store-load barrier between a volatile store
and a volatile load.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-18 16:38           ` Torvald Riegel
  2013-06-19  1:57             ` Paul E. McKenney
@ 2013-06-19  9:31             ` Paolo Bonzini
  2013-06-19 13:15               ` Torvald Riegel
  1 sibling, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-19  9:31 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

Il 18/06/2013 18:38, Torvald Riegel ha scritto:
> I don't think that this is the conclusion here.  I strongly suggest to
> just go with the C11/C++11 model, instead of rolling your own or trying
> to replicate the Java model.  That would also allow you to just point to
> the C11 model and any information / tutorials about it instead of having
> to document your own (see the patch), and you can make use of any
> (future) tool support (e.g., race detectors).

I'm definitely not rolling my own, but I think there is some value in
using the Java model.  Warning: the explanation came out quite
verbose... tl;dr at the end.

One reason is that implementing SC for POWER is quite expensive, while
this is not the case for Java volatile (which I'm still not convinced is
acq-rel, because it also orders volatile stores and volatile loads).
People working on QEMU are often used to manually placed barriers on
Linux, and Linux barriers do not fully give you seq-cst semantics.  They
give you something much more similar to the Java model.

The Java model gives good performance and is easier to understand than
the non-seqcst modes of atomic builtins.  It is pretty much impossible
to understand the latter without a formal model; I see the importance of
a formal model, but at the same time it is hard not to appreciate the
detailed-but-practical style of the Linux documentation.

Second, the Java model has very good "practical" documentation from
sources I trust.  Note the part about trust: I found way too many Java
tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.

Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
(especially now that Richard and Paul completed my understanding of
them).  There are substantially fewer practical documents for C11/C++11
that are similarly authoritative.  I obviously trust Cambridge for
C11/C++11, but their material is very concise or just refers to the
formal model.  The formal model is not what I want when my question is
simply "why is lwsync good for acquire and release, but not for
seqcst?", for example.  And the papers sometime refer to "private
communication" between the authors and other people, which can be
annoying.  Hans Boehm and Herb Sutter have good poster and slide
material, but they do not have the same level of completeness as Paul's
Linux documentation.  Paul _really_ has spoiled us "pure practitioners"...

Third, we must support old GCC (even as old as 4.2), so we need
hand-written assembly for atomics anyway.  This again goes back to
documentation and the JSR-133 cookbook.  It not only gives you
instructions on how to implement the model (which is also true for the
Cambridge web pages on C11/C++11), but is also a good base for writing
our own documentation.  It helped me understanding existing code using
barriers, optimizing it, and putting this knowledge in words.  I just
couldn't find anything as useful for C11/C++11.

In short, the C11/C++11 model is not what most developers are used to
here, hardware is not 100% mature for it (for example ARMv8 has seqcst
load/store; perhaps POWER will grow that in time), is harder to
optimize, and has (as of 2013) less "practical" documentation from
sources I trust.

Besides, since what I'm using is weaker than SC, there's always the
possibility of switching to SC in the future when enough of these issues
are solved.  In case you really need SC _now_, it is easy to do it using
fetch-and-add (for loads) or xchg (for stores).

>> I will just not use __atomic_load/__atomic_store to implement the 
>> primitives, and always express them in terms of memory barriers.
> 
> Why?  (If there's some QEMU-specific reason, just let me know; I know
> little about QEMU..)

I guess I mentioned the QEMU-specific reasons above.

> I would assume that using the __atomic* builtins is just fine if they're
> available.

It would implement slightly different semantics based on the compiler
version, so I think it's dangerous.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19  9:31             ` Paolo Bonzini
@ 2013-06-19 13:15               ` Torvald Riegel
  2013-06-19 15:14                 ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Torvald Riegel @ 2013-06-19 13:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

On Wed, 2013-06-19 at 11:31 +0200, Paolo Bonzini wrote:
> Il 18/06/2013 18:38, Torvald Riegel ha scritto:
> > I don't think that this is the conclusion here.  I strongly suggest to
> > just go with the C11/C++11 model, instead of rolling your own or trying
> > to replicate the Java model.  That would also allow you to just point to
> > the C11 model and any information / tutorials about it instead of having
> > to document your own (see the patch), and you can make use of any
> > (future) tool support (e.g., race detectors).
> 
> I'm definitely not rolling my own, but I think there is some value in
> using the Java model.  Warning: the explanation came out quite
> verbose... tl;dr at the end.

That's fine -- it is a nontrival topic after all.  My reply is equally
lengthy :)

> 
> 
> One reason is that implementing SC for POWER is quite expensive,

Sure, but you don't have to use SC fences or atomics if you don't want
them.  Note that C11/C++11 as well as the __atomic* builtins allow you
to specify a memory order.  It's perfectly fine to use acquire fences or
release fences.  There shouldn't be just one kind of barrier/fence.

> while
> this is not the case for Java volatile (which I'm still not convinced is
> acq-rel, because it also orders volatile stores and volatile loads).

But you have the same choice with C11/C++11.  You seemed to be fine with
using acquire/release in your code, so if that's what you want, just use
it :)

I vaguely remember that the Java model gives stronger guarantees than
C/C++ in the presence of data races; something about no values
out-of-thin-air.  Maybe that's where the stronger-than-acq/rel memory
orders for volatile stores come from.

> People working on QEMU are often used to manually placed barriers on
> Linux, and Linux barriers do not fully give you seq-cst semantics.  They
> give you something much more similar to the Java model.

Note that there is a reason why C11/C++11 don't just have barriers
combined with ordinary memory accesses: The compiler needs to be aware
which accesses are sequential code (so it can assume that they are
data-race-free) and which are potentially concurrent with other accesses
to the same data.  When you just use normal memory accesses with
barriers, you're relying on non-specified behavior in a compiler, and
you have no guarantee that the compiler reads a value just once, for
example; you can try to make this very likely be correct by careful
placement of asm compiler barriers, but this is likely to be more
difficult than just using atomics, which will do the right thing.

Linux folks seem to be doing fine in this area, but they also seem to
mostly use existing concurrency abstractions such as locks or RCU.
Thus, maybe it's not a too good indication of the ease-of-use of their
style of expressing low-level synchronization.

> The Java model gives good performance and is easier to understand than
> the non-seqcst modes of atomic builtins.  It is pretty much impossible
> to understand the latter without a formal model;

I'm certainly biased because I've been looking at this a lot.  But I
believe that there are common synchronization idioms which are
relatively straightforward to understand.  Acquire/release pairs should
be one such case.  Locks are essentially the same everywhere.

> I see the importance of
> a formal model, but at the same time it is hard not to appreciate the
> detailed-but-practical style of the Linux documentation.

I don't think the inherent complexity programmers *need* to deal with is
different because in the end, if you have two ways to express the same
ordering guarantees, you have to reason about the very same ordering
guarantees.  This is what makes up most of the complexity.  For example,
if you want to use acq/rel widely, you need to understand what this
means for your concurrent code; this won't change significantly
depending on how you express it.

I think the C++11 model is pretty compact, meaning no unnecessary fluff
(maybe one would need fewer memory orders, but those aren't too hard to
ignore).  I believe that if you'd want to specify the Linux model
including any implicit assumptions on the compilers, you'd end up with
the same level of detail in the model.

Maybe the issue that you see with C11/C++11 is that it offers more than
you actually need.  If you can summarize what kind of synchronization /
concurrent code you are primarily looking at, I can try to help outline
a subset of it (i.e., something like code style but just for
synchronization).

> Second, the Java model has very good "practical" documentation from
> sources I trust.  Note the part about trust: I found way too many Java
> tutorials, newsgroup posts, and blogs that say Java is SC, when it is not.
> 
> Paul's Linux docs are a source I trust, and the JSR-133 FAQ/cookbook too
> (especially now that Richard and Paul completed my understanding of
> them).  There are substantially fewer practical documents for C11/C++11
> that are similarly authoritative.

That's true.  But both models are younger in the sense of being
available to practitioners.  Boehm's libatomic-ops library was close,
but compiler support for them wasn't available until recently.  But
maybe it's useful to look 1 or 2 years ahead; I suspect that there will
be much more documentation in the near future.

> I obviously trust Cambridge for
> C11/C++11, but their material is very concise or just refers to the
> formal model.

Yes, their publications are really about the model.  It's not a
tutorial, but useful for reference.  BTW, have you read their C++ paper
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
or the POPL paper?  The former has more detail (no page limit).

If you haven't yet, I suggest giving their cppmem tool a try too:
http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
It's a browser-based tool that shows you all the possible interleavings
of small pieces of synchronization code, together with a graph with the
relationships as in their model.  Very helpful to explore whether your
synchronization code does the right thing.  At least, I'm not aware of
anything like that for Java.

> The formal model is not what I want when my question is
> simply "why is lwsync good for acquire and release, but not for
> seqcst?", for example.

But that's a question that is not about the programming-language-level
memory model, but about POWER's memory model.  I suppose that's not
something you'd have to deal with frequently, or are you indeed
primarily interested in emulating a particular architecture's model?
The Cambridge group also has a formal model for POWER's memory model,
perhaps this can help answer this question.

> And the papers sometime refer to "private
> communication" between the authors and other people, which can be
> annoying.  Hans Boehm and Herb Sutter have good poster and slide
> material, but they do not have the same level of completeness as Paul's
> Linux documentation.  Paul _really_ has spoiled us "pure practitioners"...
> 
> 
> Third, we must support old GCC (even as old as 4.2), so we need
> hand-written assembly for atomics anyway.

Perhaps you could make use of GCC's libatomic, or at least take code
from it.  Boehm's libatomic-ops might be another choice.  But I agree
that you can't just switch to the newer atomics support; it will take
some more time until it being available is the norm.

> This again goes back to
> documentation and the JSR-133 cookbook.  It not only gives you
> instructions on how to implement the model (which is also true for the
> Cambridge web pages on C11/C++11), but is also a good base for writing
> our own documentation.  It helped me understanding existing code using
> barriers, optimizing it, and putting this knowledge in words.  I just
> couldn't find anything as useful for C11/C++11.
> 
> 
> In short, the C11/C++11 model is not what most developers are used to
> here

I guess so.  But you also have to consider the legacy that you create.
I do think the C11/C++11 model will used widely, and more and more
people will used to it.  Thus, it's the question of whether you want to
deviate from what will likely be the common approach in the not to far
away future.

> hardware is not 100% mature for it (for example ARMv8 has seqcst
> load/store; perhaps POWER will grow that in time),

I'm not quite sure what you mean, or at least how what you might mean is
related to the choice between the C/C++ and Java models.

> is harder to
> optimize, and has (as of 2013) less "practical" documentation from
> sources I trust.
> 
> Besides, since what I'm using is weaker than SC, there's always the
> possibility of switching to SC in the future when enough of these issues
> are solved.

C11/C++11 is not _equal to_ SC.  It includes SC (because it has to
anyway to enable certain concurrent algorithms), and C++ has SC as
default, but this doesn't mean that the C/C++ model equates to SC
semantics.

> In case you really need SC _now_, it is easy to do it using
> fetch-and-add (for loads) or xchg (for stores).

Not on POWER and other architectures with similarly weak memory
models :)  On x86 and such the atomic read-modify-write ops give you
strong ordering, but that's an artifact of those models.  You can get SC
on most of the archs without having to use atomic read-modify-write ops.

If the fetch-and-add abstractions in your model always give you SC, then
that's more an indication that they don't allow you the fine-grained
control that you want, especially on POWER :)

> 
> >> I will just not use __atomic_load/__atomic_store to implement the 
> >> primitives, and always express them in terms of memory barriers.
> > 
> > Why?  (If there's some QEMU-specific reason, just let me know; I know
> > little about QEMU..)
> 
> I guess I mentioned the QEMU-specific reasons above.
> 
> > I would assume that using the __atomic* builtins is just fine if they're
> > available.
> 
> It would implement slightly different semantics based on the compiler
> version, so I think it's dangerous.

If __atomic is available, the semantics don't change.  The __sync
builtins do provide a certain subset of the semantics you get with the
__atomic builtins, however.

Torvald

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19 13:15               ` Torvald Riegel
@ 2013-06-19 15:14                 ` Paolo Bonzini
  2013-06-19 20:25                   ` Torvald Riegel
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-19 15:14 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

Il 19/06/2013 15:15, Torvald Riegel ha scritto:
>> One reason is that implementing SC for POWER is quite expensive,
> 
> Sure, but you don't have to use SC fences or atomics if you don't want
> them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> to specify a memory order.  It's perfectly fine to use acquire fences or
> release fences.  There shouldn't be just one kind of barrier/fence.

Agreed.  For example Linux uses four: consume (read_barrier_depends),
acquire (rmb), release (wmb), SC (mb).  In addition in Linux loads and
stores are always relaxed, some RMW ops are SC but others are relaxed.

I want to do something similar in QEMU, with as few changes as possible.
 In the end I settled for the following:

(1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
but RMW shouldn't be that bad.  I don't care if reference counting is a
little slower than it could be, for example);

(2) I'd like to have some kind of non-reordering load/store too, either
SC (which I've improperly referred to as C11/C++11 in my previous email)
or Java volatile.

   [An aside: Java guarantees that volatile stores are not reordered
   with volatile loads.  This is not guaranteed by just using release
   stores and acquire stores, and is why IIUC acq_rel < Java < seq_cst].

As long as you only have a producer and a consumer, C11 is fine, because
all you need is load-acquire/store-release.  In fact, if it weren't for
the experience factor, C11 is easier than manually placing acquire and
release barriers.  But as soon as two or more threads are reading _and_
writing the shared memory, it gets complicated and I want to provide
something simple that people can use.  This is the reason for (2) above.

There will still be a few cases that need to be optimized, and here are
where the difficult requirements come:

(R1) the primitives *should* not be alien to people who know Linux.

(R2) those optimizations *must* be easy to do and review; at least as
easy as these things go.

The two are obviously related.  Ease of review is why it is important to
make things familiar to people who know Linux.

In C11, relaxing SC loads and stores is complicated, and more
specifically hard to explain!  I cannot do that myself, and much less
explain that to the community.  I cannot make them do that.
Unfortunately, relaxing SC loads and stores is important on POWER which
has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
fails both requirements. :(

By contrast, Java volatile semantics are easily converted to a sequence
of relaxed loads, relaxed stores, and acq/rel/sc fences.  It's almost an
algorithm; I tried to do that myself and succeeded, I could document it
nicely.  Even better, there are authoritative sources that confirm my
writing and should be accessible to people who did synchronization
"stuff" in Linux (no formal models :)).  In this respect, Java satisfies
both requirements.

And the loss is limited, since things such as Dekker's algorithm are
rare in practice.  (In particular, RCU can be implemented just fine with
Java volatile semantics, but load-acquire/store-release is not enough).

[Nothing really important after this point, I think].

> Note that there is a reason why C11/C++11 don't just have barriers
> combined with ordinary memory accesses: The compiler needs to be aware
> which accesses are sequential code (so it can assume that they are
> data-race-free) and which are potentially concurrent with other accesses
> to the same data.  [...]
> you can try to make this very likely be correct by careful
> placement of asm compiler barriers, but this is likely to be more
> difficult than just using atomics, which will do the right thing.

Note that asm is just for older compilers (and even then I try to use
GCC intrinsics as much as possible).

On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
thread fences) to properly annotate references.  rth also suggested that
I use load/store(relaxed) instead of C volatile.

> Maybe the issue that you see with C11/C++11 is that it offers more than
> you actually need.  If you can summarize what kind of synchronization /
> concurrent code you are primarily looking at, I can try to help outline
> a subset of it (i.e., something like code style but just for
> synchronization).

Is the above explanation clearer?

>> I obviously trust Cambridge for
>> C11/C++11, but their material is very concise or just refers to the
>> formal model.
> 
> Yes, their publications are really about the model.  It's not a
> tutorial, but useful for reference.  BTW, have you read their C++ paper
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> or the POPL paper?  The former has more detail (no page limit).

I know it, but I cannot say I tried hard to understand it.

> If you haven't yet, I suggest giving their cppmem tool a try too:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

I saw and tried the similar tool for POWER.  The problem with these
tools, is that they require you to abstract your program into an input
to the tool.  It works when _writing_ the code, but not when reviewing it.

>> The formal model is not what I want when my question is
>> simply "why is lwsync good for acquire and release, but not for
>> seqcst?", for example.
> 
> But that's a question that is not about the programming-language-level
> memory model, but about POWER's memory model.  I suppose that's not
> something you'd have to deal with frequently, or are you indeed
> primarily interested in emulating a particular architecture's model?
> The Cambridge group also has a formal model for POWER's memory model,
> perhaps this can help answer this question.

I'm more familiar with Cambridge's POWER work than the C++ work, and it
didn't. :)  I care about the POWER memory model because it's roughly the
only one where

    load(seqcst) => load(relaxed) + fence(acquire)
    store(seqcst) => fence(release) + store(relaxed) + fence(seqcst)

is not a valid (though perhaps suboptimal) mapping.  Note that the
primitives on the RHS are basically what they use in the Linux kernel.

>> In short, the C11/C++11 model is not what most developers are used to
>> here
> 
> I guess so.  But you also have to consider the legacy that you create.
> I do think the C11/C++11 model will used widely, and more and more
> people will used to it.

I don't think many people will learn how to use the various non-seqcst
modes...  At least so far I punted. :)

Many idioms will certainly be wrapped within the C++ standard library
(smart pointers and all that), but C starts at a disadvantage.  You have
a point, though.

>> In case you really need SC _now_, it is easy to do it using
>> fetch-and-add (for loads) or xchg (for stores).
> 
> Not on POWER and other architectures with similarly weak memory models :)  
> If the fetch-and-add abstractions in your model always give you SC, then
> that's more an indication that they don't allow you the fine-grained
> control that you want, especially on POWER :)

Indeed, they don't allow fine-grained control.  But I need to strike a
balance between control and ease of use, otherwise C11 would have been
an obvious choice (except for the detail of older compilers, but let's
ignore it).

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19  9:30         ` Paolo Bonzini
@ 2013-06-19 15:36           ` Andrew Haley
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Haley @ 2013-06-19 15:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Torvald Riegel, qemu-devel, Liu Ping Fan, Anthony Liguori,
	Paul E. McKenney, Richard Henderson

On 06/19/2013 10:30 AM, Paolo Bonzini wrote:
> Il 18/06/2013 19:38, Andrew Haley ha scritto:
>>>> Or is Java volatile somewhere between acq_rel and seq_cst, as the last
>>>> paragraph of
>>>> http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
>>>> seems to suggest?
>> As far as I know, the Java semantics are acq/rel.  I can't see anything
>> there that suggests otherwise.  If we'd wanted to know for certain we
>> should have CC'd Doug lea.
> 
> acq/rel wouldn't have a full store-load barrier between a volatile store
> and a volatile load.

Ahhh, okay.  I had to check the C++11 spec to see the difference.  I'm
so deep into the Java world that I hadn't noticed that C++11 was any
different.

Andrew.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19 15:14                 ` Paolo Bonzini
@ 2013-06-19 20:25                   ` Torvald Riegel
  2013-06-20  7:53                     ` Paolo Bonzini
  0 siblings, 1 reply; 41+ messages in thread
From: Torvald Riegel @ 2013-06-19 20:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> Il 19/06/2013 15:15, Torvald Riegel ha scritto:
> >> One reason is that implementing SC for POWER is quite expensive,
> > 
> > Sure, but you don't have to use SC fences or atomics if you don't want
> > them.  Note that C11/C++11 as well as the __atomic* builtins allow you
> > to specify a memory order.  It's perfectly fine to use acquire fences or
> > release fences.  There shouldn't be just one kind of barrier/fence.
> 
> Agreed.  For example Linux uses four: consume (read_barrier_depends),
> acquire (rmb), release (wmb), SC (mb).  In addition in Linux loads and
> stores are always relaxed, some RMW ops are SC but others are relaxed.
> 
> I want to do something similar in QEMU, with as few changes as possible.
>  In the end I settled for the following:
> 
> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> but RMW shouldn't be that bad.  I don't care if reference counting is a
> little slower than it could be, for example);

I doubt relaxed RMW ops are sufficient even for reference counting.
Typically, the reference counter is used conceptually similar to a lock,
so you need the acquire/release (modulo funny optimizations).  The only
use case that comes to my mind right now for relaxed RMW is really just
statistics counters or such, or cases where you can "re-use" another
barrier.

> (2) I'd like to have some kind of non-reordering load/store too, either
> SC (which I've improperly referred to as C11/C++11 in my previous email)
> or Java volatile.

Often you probably don't need more than acq/rel, as Paul pointed out.
SC becomes important once you do something like Dekker-style sync, so
cases where you sync via several separate variables to avoid the cache
misses in some common case.  Once you go through one variable in the
end, acq/rel should be fine.

>    [An aside: Java guarantees that volatile stores are not reordered
>    with volatile loads.  This is not guaranteed by just using release
>    stores and acquire stores, and is why IIUC acq_rel < Java < seq_cst].
   Or maybe Java volatile is acq for loads and seq_cst for stores...
> 
> As long as you only have a producer and a consumer, C11 is fine, because
> all you need is load-acquire/store-release.  In fact, if it weren't for
> the experience factor, C11 is easier than manually placing acquire and
> release barriers.  But as soon as two or more threads are reading _and_
> writing the shared memory, it gets complicated and I want to provide
> something simple that people can use.  This is the reason for (2) above.

I can't quite follow you here.  There is a total order for all
modifications to a single variable, and if you use acq/rel combined with
loads and stores on this variable, then you basically can make use of
the total order.  (All loads that read-from a certain store get a
synchronized-with (and thus happens-before edge) with the store, and the
stores are in a total order.)  This is independent of the number of
readers and writers.  The difference starts once you want to sync with
more than one variable, and need to establish an order between those
accesses.

> There will still be a few cases that need to be optimized, and here are
> where the difficult requirements come:
> 
> (R1) the primitives *should* not be alien to people who know Linux.
> 
> (R2) those optimizations *must* be easy to do and review; at least as
> easy as these things go.
> 
> The two are obviously related.  Ease of review is why it is important to
> make things familiar to people who know Linux.
> 
> In C11, relaxing SC loads and stores is complicated, and more
> specifically hard to explain!

I can't see why that would be harder than reasoning about equally weaker
Java semantics.  But you obviously know your community, and I don't :)

> I cannot do that myself, and much less
> explain that to the community.  I cannot make them do that.
> Unfortunately, relaxing SC loads and stores is important on POWER which
> has efficient acq/rel but inefficient SC (hwsync in the loads).  So, C11
> fails both requirements. :(
> 
> By contrast, Java volatile semantics are easily converted to a sequence
> of relaxed loads, relaxed stores, and acq/rel/sc fences.

The same holds for C11/C++11.  If you look at either the standard or the
Batty model, you'll see that for every pair like store(rel)--load(acq),
there is also store(rel)--fence(acq)+load(relaxed),
store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
giving the same semantics.  Likewise for SC.

> It's almost an
> algorithm; I tried to do that myself and succeeded, I could document it
> nicely.  Even better, there are authoritative sources that confirm my
> writing and should be accessible to people who did synchronization
> "stuff" in Linux (no formal models :)).  In this respect, Java satisfies
> both requirements.
> 
> And the loss is limited, since things such as Dekker's algorithm are
> rare in practice.  (In particular, RCU can be implemented just fine with
> Java volatile semantics, but load-acquire/store-release is not enough).

You can also build Dekker with SC stores and acq loads, if I'm not
mistaken.  Typically one would probably use SC fences and relaxed
stores/loads.

> 
> [Nothing really important after this point, I think].
> 
> > Note that there is a reason why C11/C++11 don't just have barriers
> > combined with ordinary memory accesses: The compiler needs to be aware
> > which accesses are sequential code (so it can assume that they are
> > data-race-free) and which are potentially concurrent with other accesses
> > to the same data.  [...]
> > you can try to make this very likely be correct by careful
> > placement of asm compiler barriers, but this is likely to be more
> > difficult than just using atomics, which will do the right thing.
> 
> Note that asm is just for older compilers (and even then I try to use
> GCC intrinsics as much as possible).
> 
> On newer compilers I do use atomics (SC RMW ops, acq/rel/SC/consume
> thread fences) to properly annotate references.  rth also suggested that
> I use load/store(relaxed) instead of C volatile.

I agree with rth's suggestion.

> > Maybe the issue that you see with C11/C++11 is that it offers more than
> > you actually need.  If you can summarize what kind of synchronization /
> > concurrent code you are primarily looking at, I can try to help outline
> > a subset of it (i.e., something like code style but just for
> > synchronization).
> 
> Is the above explanation clearer?

Yes, thanks.

> 
> >> I obviously trust Cambridge for
> >> C11/C++11, but their material is very concise or just refers to the
> >> formal model.
> > 
> > Yes, their publications are really about the model.  It's not a
> > tutorial, but useful for reference.  BTW, have you read their C++ paper
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3132.pdf
> > or the POPL paper?  The former has more detail (no page limit).
> 
> I know it, but I cannot say I tried hard to understand it.
> 
> > If you haven't yet, I suggest giving their cppmem tool a try too:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
> 
> I saw and tried the similar tool for POWER.  The problem with these
> tools, is that they require you to abstract your program into an input
> to the tool.  It works when _writing_ the code, but not when reviewing it.

I agree that it isn't a model checker for existing programs (but that
would likely be quite slow :)).  But it can help people to learn the
idioms.

> > I guess so.  But you also have to consider the legacy that you create.
> > I do think the C11/C++11 model will used widely, and more and more
> > people will used to it.
> 
> I don't think many people will learn how to use the various non-seqcst
> modes...  At least so far I punted. :)

But you already use similarly weaker orderings that the other
abstractions provide (e.g., Java), so you're half-way there :)

Torvald

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19 20:25                   ` Torvald Riegel
@ 2013-06-20  7:53                     ` Paolo Bonzini
  2013-06-22 10:55                       ` Torvald Riegel
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2013-06-20  7:53 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

Il 19/06/2013 22:25, Torvald Riegel ha scritto:
> On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
>> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
>> but RMW shouldn't be that bad.  I don't care if reference counting is a
>> little slower than it could be, for example);
> 
> I doubt relaxed RMW ops are sufficient even for reference counting.

They are enough on the increment side, or so says boost...

http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters

>>    [An aside: Java guarantees that volatile stores are not reordered
>>    with volatile loads.  This is not guaranteed by just using release
>>    stores and acquire stores, and is why IIUC acq_rel < Java < seq_cst].
>
> Or maybe Java volatile is acq for loads and seq_cst for stores...

Perhaps (but I'm not 100% sure).

>> As long as you only have a producer and a consumer, C11 is fine, because
>> all you need is load-acquire/store-release.  In fact, if it weren't for
>> the experience factor, C11 is easier than manually placing acquire and
>> release barriers.  But as soon as two or more threads are reading _and_
>> writing the shared memory, it gets complicated and I want to provide
>> something simple that people can use.  This is the reason for (2) above.
> 
> I can't quite follow you here.  There is a total order for all
> modifications to a single variable, and if you use acq/rel combined with
> loads and stores on this variable, then you basically can make use of
> the total order.  (All loads that read-from a certain store get a
> synchronized-with (and thus happens-before edge) with the store, and the
> stores are in a total order.)  This is independent of the number of
> readers and writers.  The difference starts once you want to sync with
> more than one variable, and need to establish an order between those
> accesses.

You're right of course.  More specifically when there is a thread where
some variables are stored while others are loaded.

>> There will still be a few cases that need to be optimized, and here are
>> where the difficult requirements come:
>>
>> (R1) the primitives *should* not be alien to people who know Linux.
>>
>> (R2) those optimizations *must* be easy to do and review; at least as
>> easy as these things go.
>>
>> The two are obviously related.  Ease of review is why it is important to
>> make things familiar to people who know Linux.
>>
>> In C11, relaxing SC loads and stores is complicated, and more
>> specifically hard to explain!
> 
> I can't see why that would be harder than reasoning about equally weaker
> Java semantics.  But you obviously know your community, and I don't :)

Because Java semantics are "almost" SC, and as Paul mentioned the
difference doesn't matter in practice (IRIW/RWC is where it matters, WRC
works even on Power; see
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/ppc051.html#toc5, row
WRC+lwsyncs).  It hasn't ever mattered for Linux, at least.

>> By contrast, Java volatile semantics are easily converted to a sequence
>> of relaxed loads, relaxed stores, and acq/rel/sc fences.
> 
> The same holds for C11/C++11.  If you look at either the standard or the
> Batty model, you'll see that for every pair like store(rel)--load(acq),
> there is also store(rel)--fence(acq)+load(relaxed),
> store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
> giving the same semantics.  Likewise for SC.

Do you have a pointer to that?  It would help.

> You can also build Dekker with SC stores and acq loads, if I'm not
> mistaken.  Typically one would probably use SC fences and relaxed
> stores/loads.

Yes.

>>> I guess so.  But you also have to consider the legacy that you create.
>>> I do think the C11/C++11 model will used widely, and more and more
>>> people will used to it.
>>
>> I don't think many people will learn how to use the various non-seqcst
>> modes...  At least so far I punted. :)
> 
> But you already use similarly weaker orderings that the other
> abstractions provide (e.g., Java), so you're half-way there :)

True.  On the other hand you can treat Java like "kinda SC but don't
worry, you won't see the difference".  It is both worrisome and appealing...

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant
  2013-06-19  9:27         ` Paolo Bonzini
@ 2013-06-20  9:11           ` Stefan Hajnoczi
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2013-06-20  9:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, mdroth, Anthony Liguori, Liu Ping Fan

On Wed, Jun 19, 2013 at 11:27:25AM +0200, Paolo Bonzini wrote:
> Il 19/06/2013 00:26, mdroth ha scritto:
> > On Tue, Jun 18, 2013 at 09:20:26PM +0200, Paolo Bonzini wrote:
> >> Il 18/06/2013 17:14, mdroth ha scritto:
> >>> Could we possibly simplify this by introducing a recursive mutex that we
> >>> could use to protect the whole list loop and hold even during the cb?
> >>
> >> If it is possible, we should avoid recursive locks.  It makes impossible
> >> to establish a lock hierarchy.  For example:
> >>
> >>> I assume we can't hold the lock during the cb currently since we might
> >>> try to reschedule, but if it's a recursive mutex would that simplify
> >>> things?
> >>
> >> If you have two callbacks in two different AioContexts, both of which do
> >> bdrv_drain_all(), you get an AB-BA deadlock
> > 
> > I think I see what you mean. That problem exists regardless of whether we
> > introduce a recursive mutex though right?
> 
> Without a recursive mutex, you only hold one lock at a time in each thread.
> 
> > I guess the main issue is the
> > fact that we'd be encouraging sloppy locking practices without
> > addressing the root problem?
> 
> Yeah.  We're basically standing where the Linux kernel stood 10 years
> ago (let's say 2.2 timeframe).  If Linux got this far without recursive
> mutexes, we can at least try. :)

FWIW I was also looking into recursive mutexes for the block layer.
What scared me a little is that they make it tempting to stop thinking
about locks since you know you'll be able to reacquire locks you already
hold.

Especially when converting existing code, I think we need to be rigorous
about exploring every function and thinking about the locks it needs and
which child functions it calls.

Otherwise we'll have code paths hidden away somewhere that were never
truly thought through.

Stefan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-19  7:11             ` Torvald Riegel
@ 2013-06-20 15:18               ` Paul E. McKenney
  0 siblings, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2013-06-20 15:18 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Susmit.Sarkar, Andrew Haley, qemu-devel, Liu Ping Fan,
	Anthony Liguori, luc.maranget, Paolo Bonzini, Peter.Sewell,
	Richard Henderson

On Wed, Jun 19, 2013 at 09:11:36AM +0200, Torvald Riegel wrote:
> On Tue, 2013-06-18 at 18:53 -0700, Paul E. McKenney wrote:
> > On Tue, Jun 18, 2013 at 05:37:42PM +0200, Torvald Riegel wrote:
> > > On Tue, 2013-06-18 at 07:50 -0700, Paul E. McKenney wrote:
> > > > First, I am not a fan of SC, mostly because there don't seem to be many
> > > > (any?) production-quality algorithms that need SC.  But if you really
> > > > want to take a parallel-programming trip back to the 1980s, let's go!  ;-)
> > > 
> > > Dekker-style mutual exclusion is useful for things like read-mostly
> > > multiple-reader single-writer locks, or similar "asymmetric" cases of
> > > synchronization.  SC fences are needed for this.
> > 
> > They definitely need Power hwsync rather than lwsync, but they need
> > fewer fences than would be emitted by slavishly following either of the
> > SC recipes for Power.  (Another example needing store-to-load ordering
> > is hazard pointers.)
> 
> The C++11 seq-cst fence expands to hwsync; combined with a relaxed
> store / load, that should be minimal.  Or are you saying that on Power,
> there is a weaker HW barrier available that still constrains store-load
> reordering sufficiently?

Your example use of seq-cst fence is a very good one for this example.
But most people I have talked to think of C++11 SC as being SC atomic
accesses, and SC atomics would get you a bunch of redundant fences
in this example -- some but not all of which could be easily optimized
away.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations)
  2013-06-20  7:53                     ` Paolo Bonzini
@ 2013-06-22 10:55                       ` Torvald Riegel
  0 siblings, 0 replies; 41+ messages in thread
From: Torvald Riegel @ 2013-06-22 10:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Haley, qemu-devel, Liu Ping Fan, Anthony Liguori, paulmck,
	Richard Henderson

On Thu, 2013-06-20 at 09:53 +0200, Paolo Bonzini wrote:
> Il 19/06/2013 22:25, Torvald Riegel ha scritto:
> > On Wed, 2013-06-19 at 17:14 +0200, Paolo Bonzini wrote:
> >> (1) I don't care about relaxed RMW ops (loads/stores occur in hot paths,
> >> but RMW shouldn't be that bad.  I don't care if reference counting is a
> >> little slower than it could be, for example);
> > 
> > I doubt relaxed RMW ops are sufficient even for reference counting.
> 
> They are enough on the increment side, or so says boost...
> 
> http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters

Oh, right, for this kind of refcounting it's okay on the increment side.
But the explanation on the page you referenced isn't correct I think:
"...passing an existing reference from one thread to another must
already provide any required synchronization." is not sufficient because
that would just create a happens-before from the reference-passing
source to the other thread that gets the reference.
The relaxed RMW increment works because of the modification order being
consistent with happens-before (see 6.17 in the model), so we can never
reach a value of zero for the refcount once we incremented the reference
even with a relaxed RMW.

IMO, the acquire fence in the release is not 100% correct according to
my understanding of the memory model:
    if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) {
      boost::atomic_thread_fence(boost::memory_order_acquire);
      delete x;
    }
"delete x" is unconditional, and I guess not specified to read all of
what x points to.  The acquire fence would only result in a
synchronizes-with edge if there is a reads-from edge between the release
store and a load that reads the stores value and is sequenced after the
acquire fence.
Thus, I think the compiler could be allowed to reorder the fence after
the delete in some case (e.g., when there's no destructor called or it
doesn't have any conditionals in it), but I guess it's likely to not
ever try to do that in practice.
Regarding the hardware fences that this maps, I suppose this just
happens to work fine on most architectures, perhaps just because
"delete" will access some of the memory when releasing the memory.

Changing the release to the following would be correct, and probably
little additional overhead:
    if (x->refcount_.fetch_sub(1, boost::memory_order_release) == 1) { 
      if (x->refcount.load(boost::memory_order_acquire) == 0)
        delete x;
    }

That makes delete conditional and thus having to happen after we ensured
the happens before edge that we need.

> >> By contrast, Java volatile semantics are easily converted to a sequence
> >> of relaxed loads, relaxed stores, and acq/rel/sc fences.
> > 
> > The same holds for C11/C++11.  If you look at either the standard or the
> > Batty model, you'll see that for every pair like store(rel)--load(acq),
> > there is also store(rel)--fence(acq)+load(relaxed),
> > store(relaxed)+fence(rel)--fence(acq)+load(relaxed), etc. defined,
> > giving the same semantics.  Likewise for SC.
> 
> Do you have a pointer to that?  It would help.

In the full model (n3132.pdf), see 6.12 (which then references which
parts in the standard lead to those parts of the model).  SC fences are
also acquire and release fences, so this covers synchronizes-with via
reads-from too.  6.17 has more constraints on SC fences and modification
order, so we get something similar for the ordering of just writes.


Torvald

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2013-06-22 10:55 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-16 11:21 [Qemu-devel] [PATCH v2 0/2] make AioContext's bh re-entrant Liu Ping Fan
2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 1/2] add a header file for atomic operations Liu Ping Fan
2013-06-17 18:57   ` Richard Henderson
2013-06-18  8:36     ` Paolo Bonzini
2013-06-18 11:03       ` Paolo Bonzini
2013-06-18 14:38       ` Richard Henderson
2013-06-18 15:04         ` Paolo Bonzini
2013-06-18 13:24     ` [Qemu-devel] Java volatile vs. C11 seq_cst (was Re: [PATCH v2 1/2] add a header file for atomic operations) Paolo Bonzini
2013-06-18 14:50       ` Paul E. McKenney
2013-06-18 15:29         ` Peter Sewell
2013-06-18 15:37         ` Torvald Riegel
2013-06-19  1:53           ` Paul E. McKenney
2013-06-19  7:11             ` Torvald Riegel
2013-06-20 15:18               ` Paul E. McKenney
2013-06-18 16:08         ` Paolo Bonzini
2013-06-18 16:38           ` Torvald Riegel
2013-06-19  1:57             ` Paul E. McKenney
2013-06-19  9:31             ` Paolo Bonzini
2013-06-19 13:15               ` Torvald Riegel
2013-06-19 15:14                 ` Paolo Bonzini
2013-06-19 20:25                   ` Torvald Riegel
2013-06-20  7:53                     ` Paolo Bonzini
2013-06-22 10:55                       ` Torvald Riegel
2013-06-18 15:26       ` Torvald Riegel
2013-06-18 17:38       ` Andrew Haley
2013-06-19  9:30         ` Paolo Bonzini
2013-06-19 15:36           ` Andrew Haley
2013-06-16 11:21 ` [Qemu-devel] [PATCH v2 2/2] QEMUBH: make AioContext's bh re-entrant Liu Ping Fan
2013-06-17 15:28   ` Stefan Hajnoczi
2013-06-17 16:41     ` Paolo Bonzini
2013-06-18  2:19     ` liu ping fan
2013-06-18  9:31       ` Stefan Hajnoczi
2013-06-18 15:14   ` mdroth
2013-06-18 16:19     ` mdroth
2013-06-18 19:20     ` Paolo Bonzini
2013-06-18 22:26       ` mdroth
2013-06-19  9:27         ` Paolo Bonzini
2013-06-20  9:11           ` Stefan Hajnoczi
2013-06-17  7:11 ` [Qemu-devel] [PATCH v2 0/2] " Paolo Bonzini
2013-06-18  2:40   ` liu ping fan
2013-06-18  8:36     ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).