* [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock
@ 2012-09-11 7:51 Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations Liu Ping Fan
` (10 more replies)
0 siblings, 11 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
v1:
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03312.html
v2:
http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01275.html
changes v2->v3
--remove refcnt on MemoryRegion
--add ref/unref for MemoryRegionOps
--make memory view protected by mem_map_lock; adopting RCU style is deferred
Liu Ping Fan (11):
atomic: introduce atomic operations
qom: apply atomic on object's refcount
hotplug: introduce qdev_unplug_complete() to remove device from views
pci: remove pci device from mem view when unplug
memory: introduce ref,unref interface for MemoryRegionOps
memory: make mmio dispatch able to be out of biglock
memory: implement e1000's MemoryRegionOps ref/unref
qom: introduce reclaimer to release obj in async
vcpu: make QemuThread as tls to store thread-self info
vcpu: introduce lockmap
vcpu: push mmio dispatcher out of big lock
cpus.c | 19 +++++++++
exec.c | 95 +++++++++++++++++++++++++++++++++++++++++++++-
hw/acpi_piix4.c | 2 +-
hw/e1000.c | 17 ++++++++
hw/pci.c | 13 ++++++-
hw/pci.h | 1 +
hw/qdev.c | 26 ++++++++++++
hw/qdev.h | 3 +-
include/qemu/atomic.h | 63 ++++++++++++++++++++++++++++++
include/qemu/object.h | 3 +-
include/qemu/reclaimer.h | 30 ++++++++++++++
kvm-all.c | 3 +
main-loop.c | 5 ++
memory.h | 3 +
qemu-thread-posix.c | 30 ++++++++++++++
qemu-thread-posix.h | 7 +++
qemu-thread.h | 4 ++
qemu-tool.c | 5 ++
qom/Makefile.objs | 2 +-
qom/object.c | 11 ++---
qom/reclaimer.c | 54 ++++++++++++++++++++++++++
vl.c | 4 ++
22 files changed, 388 insertions(+), 12 deletions(-)
create mode 100644 include/qemu/atomic.h
create mode 100644 include/qemu/reclaimer.h
create mode 100644 qom/reclaimer.c
--
1.7.4.4
^ permalink raw reply [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 8:04 ` Avi Kivity
2012-09-11 8:15 ` Peter Maydell
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 02/11] qom: apply atomic on object's refcount Liu Ping Fan
` (9 subsequent siblings)
10 siblings, 2 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Once we run outside the protection of the global lock, we are exposed
to low-level SMP races, so we need atomic operations.
This file is a wrapper around the GCC atomic builtins.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
include/qemu/atomic.h | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 63 insertions(+), 0 deletions(-)
create mode 100644 include/qemu/atomic.h
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
new file mode 100644
index 0000000..f17145d
--- /dev/null
+++ b/include/qemu/atomic.h
@@ -0,0 +1,63 @@
+/*
+ * Simple wrapper of gcc Atomic-Builtins
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#ifndef __QEMU_ATOMIC_H
+#define __QEMU_ATOMIC_H
+
+typedef struct Atomic {
+ int counter;
+} Atomic;
+
+static inline void atomic_set(Atomic *v, int i)
+{
+ v->counter = i;
+}
+
+static inline int atomic_read(Atomic *v)
+{
+ return v->counter;
+}
+
+static inline int atomic_return_and_add(int i, Atomic *v)
+{
+ int ret;
+
+ ret = __sync_fetch_and_add(&v->counter, i);
+ return ret;
+}
+
+static inline int atomic_return_and_sub(int i, Atomic *v)
+{
+ int ret;
+
+ ret = __sync_fetch_and_sub(&v->counter, i);
+ return ret;
+}
+
+/**
+ * atomic_inc - increment atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically increments @v by 1.
+ */
+static inline void atomic_inc(Atomic *v)
+{
+ __sync_fetch_and_add(&v->counter, 1);
+}
+
+/**
+ * atomic_dec - decrement atomic variable
+ * @v: pointer of type Atomic
+ *
+ * Atomically decrements @v by 1.
+ */
+static inline void atomic_dec(Atomic *v)
+{
+ __sync_fetch_and_sub(&v->counter, 1);
+}
+
+#endif
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 02/11] qom: apply atomic on object's refcount
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 03/11] hotplug: introduce qdev_unplug_complete() to remove device from views Liu Ping Fan
` (8 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
include/qemu/object.h | 3 ++-
qom/object.c | 11 +++++------
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/include/qemu/object.h b/include/qemu/object.h
index cc75fee..0c02614 100644
--- a/include/qemu/object.h
+++ b/include/qemu/object.h
@@ -18,6 +18,7 @@
#include <stdint.h>
#include <stdbool.h>
#include "qemu-queue.h"
+#include "qemu/atomic.h"
struct Visitor;
struct Error;
@@ -262,7 +263,7 @@ struct Object
/*< private >*/
ObjectClass *class;
QTAILQ_HEAD(, ObjectProperty) properties;
- uint32_t ref;
+ Atomic ref;
Object *parent;
};
diff --git a/qom/object.c b/qom/object.c
index e3e9242..34ec2a1 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -383,7 +383,7 @@ void object_finalize(void *data)
object_deinit(obj, ti);
object_property_del_all(obj);
- g_assert(obj->ref == 0);
+ g_assert(atomic_read(&obj->ref) == 0);
}
Object *object_new_with_type(Type type)
@@ -410,7 +410,7 @@ Object *object_new(const char *typename)
void object_delete(Object *obj)
{
object_unparent(obj);
- g_assert(obj->ref == 1);
+ g_assert(atomic_read(&obj->ref) == 1);
object_unref(obj);
g_free(obj);
}
@@ -600,16 +600,15 @@ GSList *object_class_get_list(const char *implements_type,
void object_ref(Object *obj)
{
- obj->ref++;
+ atomic_inc(&obj->ref);
}
void object_unref(Object *obj)
{
- g_assert(obj->ref > 0);
- obj->ref--;
+ g_assert(atomic_read(&obj->ref) > 0);
/* parent always holds a reference to its children */
- if (obj->ref == 0) {
+ if (atomic_return_and_sub(1, &obj->ref) == 1) {
object_finalize(obj);
}
}
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 03/11] hotplug: introduce qdev_unplug_complete() to remove device from views
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 02/11] qom: apply atomic on object's refcount Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 04/11] pci: remove pci device from mem view when unplug Liu Ping Fan
` (7 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
When a device unplug has been acked by the guest, we first remove the
device from the memory view to prevent incoming accesses from the
dispatcher. Then we isolate it from the device composition tree.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
hw/qdev.c | 26 ++++++++++++++++++++++++++
hw/qdev.h | 3 ++-
2 files changed, 28 insertions(+), 1 deletions(-)
diff --git a/hw/qdev.c b/hw/qdev.c
index b5a52ac..73df046 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -104,6 +104,14 @@ void qdev_set_parent_bus(DeviceState *dev, BusState *bus)
bus_add_child(bus, dev);
}
+static void qdev_unset_parent(DeviceState *dev)
+{
+ BusState *b = dev->parent_bus;
+
+ object_unparent(OBJECT(dev));
+ bus_remove_child(b, dev);
+}
+
/* Create a new device. This only initializes the device state structure
and allows properties to be set. qdev_init should be called to
initialize the actual device emulation. */
@@ -193,6 +201,24 @@ void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
dev->alias_required_for_version = required_for_version;
}
+static int qdev_unmap(DeviceState *dev)
+{
+ DeviceClass *dc = DEVICE_GET_CLASS(dev);
+ if (dc->unmap) {
+ dc->unmap(dev);
+ }
+ return 0;
+}
+
+void qdev_unplug_complete(DeviceState *dev, Error **errp)
+{
+ /* isolate from mem view */
+ qdev_unmap(dev);
+ /* isolate from device tree */
+ qdev_unset_parent(dev);
+ object_unref(OBJECT(dev));
+}
+
void qdev_unplug(DeviceState *dev, Error **errp)
{
DeviceClass *dc = DEVICE_GET_CLASS(dev);
diff --git a/hw/qdev.h b/hw/qdev.h
index d699194..aeae29e 100644
--- a/hw/qdev.h
+++ b/hw/qdev.h
@@ -47,7 +47,7 @@ typedef struct DeviceClass {
/* callbacks */
void (*reset)(DeviceState *dev);
-
+ void (*unmap)(DeviceState *dev);
/* device state */
const VMStateDescription *vmsd;
@@ -161,6 +161,7 @@ void qdev_init_nofail(DeviceState *dev);
void qdev_set_legacy_instance_id(DeviceState *dev, int alias_id,
int required_for_version);
void qdev_unplug(DeviceState *dev, Error **errp);
+void qdev_unplug_complete(DeviceState *dev, Error **errp);
void qdev_free(DeviceState *dev);
int qdev_simple_unplug_cb(DeviceState *dev);
void qdev_machine_creation_done(void);
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 04/11] pci: remove pci device from mem view when unplug
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (2 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 03/11] hotplug: introduce qdev_unplug_complete() to remove device from views Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 05/11] memory: introduce ref, unref interface for MemoryRegionOps Liu Ping Fan
` (6 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
hw/acpi_piix4.c | 2 +-
hw/pci.c | 13 ++++++++++++-
hw/pci.h | 1 +
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index c56220b..a78b0e3 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -305,7 +305,7 @@ static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
if (pc->no_hotplug) {
slot_free = false;
} else {
- qdev_free(qdev);
+ qdev_unplug_complete(qdev, NULL);
}
}
}
diff --git a/hw/pci.c b/hw/pci.c
index 4d95984..3e2a081 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -850,7 +850,6 @@ static int pci_unregister_device(DeviceState *dev)
PCIDevice *pci_dev = PCI_DEVICE(dev);
PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
- pci_unregister_io_regions(pci_dev);
pci_del_option_rom(pci_dev);
if (pc->exit) {
@@ -861,6 +860,17 @@ static int pci_unregister_device(DeviceState *dev)
return 0;
}
+static void pci_unmap_device(DeviceState *dev)
+{
+ PCIDevice *pci_dev = PCI_DEVICE(dev);
+ PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(pci_dev);
+
+ pci_unregister_io_regions(pci_dev);
+ if (pc->unmap) {
+ pc->unmap(pci_dev);
+ }
+}
+
void pci_register_bar(PCIDevice *pci_dev, int region_num,
uint8_t type, MemoryRegion *memory)
{
@@ -2064,6 +2074,7 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
DeviceClass *k = DEVICE_CLASS(klass);
k->init = pci_qdev_init;
k->unplug = pci_unplug_device;
+ k->unmap = pci_unmap_device;
k->exit = pci_unregister_device;
k->bus_type = TYPE_PCI_BUS;
k->props = pci_props;
diff --git a/hw/pci.h b/hw/pci.h
index 4b6ab3d..09bbe2b 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -154,6 +154,7 @@ typedef struct PCIDeviceClass {
DeviceClass parent_class;
int (*init)(PCIDevice *dev);
+ void (*unmap)(PCIDevice *dev);
PCIUnregisterFunc *exit;
PCIConfigReadFunc *config_read;
PCIConfigWriteFunc *config_write;
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 05/11] memory: introduce ref, unref interface for MemoryRegionOps
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (3 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 04/11] pci: remove pci device from mem view when unplug Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 8:08 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock Liu Ping Fan
` (5 subsequent siblings)
10 siblings, 1 reply; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
This pair of interfaces lets the dispatcher decide whether it can pin
the mr without holding the big lock.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
memory.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/memory.h b/memory.h
index bd1bbae..9039411 100644
--- a/memory.h
+++ b/memory.h
@@ -25,6 +25,7 @@
#include "iorange.h"
#include "ioport.h"
#include "int128.h"
+#include "qemu/object.h"
typedef struct MemoryRegionOps MemoryRegionOps;
typedef struct MemoryRegion MemoryRegion;
@@ -66,6 +67,8 @@ struct MemoryRegionOps {
target_phys_addr_t addr,
uint64_t data,
unsigned size);
+ int (*ref)(MemoryRegion *mr);
+ void (*unref)(MemoryRegion *mr);
enum device_endian endianness;
/* Guest-visible constraints: */
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (4 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 05/11] memory: introduce ref, unref interface for MemoryRegionOps Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 8:25 ` Avi Kivity
2012-09-11 8:47 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 07/11] memory: implement e1000's MemoryRegionOps ref/unref Liu Ping Fan
` (4 subsequent siblings)
10 siblings, 2 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Without the big lock, we try to protect the mr by increasing its
refcount. If we cannot take a reference, we go back and resort to
the big lock.
Another point: the memory radix tree can be flushed by another
thread, so we take a copy of the terminal mr to survive such an
update.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
exec.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 94 insertions(+), 1 deletions(-)
diff --git a/exec.c b/exec.c
index 5834766..93205b1 100644
--- a/exec.c
+++ b/exec.c
@@ -3163,8 +3163,11 @@ static void io_mem_init(void)
"watch", UINT64_MAX);
}
+static QemuMutex mem_map_lock;
+
static void core_begin(MemoryListener *listener)
{
+ qemu_mutex_lock(&mem_map_lock);
destroy_all_mappings();
phys_sections_clear();
phys_map.ptr = PHYS_MAP_NODE_NIL;
@@ -3184,17 +3187,32 @@ static void core_commit(MemoryListener *listener)
for(env = first_cpu; env != NULL; env = env->next_cpu) {
tlb_flush(env, 1);
}
+ qemu_mutex_unlock(&mem_map_lock);
}
static void core_region_add(MemoryListener *listener,
MemoryRegionSection *section)
{
+ MemoryRegion *mr = section->mr;
+
+ if (mr->ops != NULL) {
+ if (mr->ops->ref != NULL) {
+ mr->ops->ref(mr);
+ }
+ }
cpu_register_physical_memory_log(section, section->readonly);
}
static void core_region_del(MemoryListener *listener,
MemoryRegionSection *section)
{
+ MemoryRegion *mr = section->mr;
+
+ if (mr->ops != NULL) {
+ if (mr->ops->unref != NULL) {
+ mr->ops->unref(mr);
+ }
+ }
}
static void core_region_nop(MemoryListener *listener,
@@ -3348,6 +3366,8 @@ static void memory_map_init(void)
memory_region_init(system_io, "io", 65536);
set_system_io_map(system_io);
+ qemu_mutex_init(&mem_map_lock);
+
memory_listener_register(&core_memory_listener, system_memory);
memory_listener_register(&io_memory_listener, system_io);
}
@@ -3406,6 +3426,52 @@ int cpu_memory_rw_debug(CPUArchState *env, target_ulong addr,
}
#else
+
+static MemoryRegionSection *subpage_get_terminal(subpage_t *mmio,
+ target_phys_addr_t addr)
+{
+ MemoryRegionSection *section;
+ unsigned int idx = SUBPAGE_IDX(addr);
+
+ section = &phys_sections[mmio->sub_section[idx]];
+ return section;
+}
+
+static MemoryRegionSection mrs_get_terminal(MemoryRegionSection *mrs,
+ target_phys_addr_t addr)
+{
+ if (mrs->mr->subpage) {
+ mrs = subpage_get_terminal(mrs->mr->opaque, addr);
+ }
+ return *mrs;
+}
+
+static int mrs_ref(MemoryRegionSection *mrs)
+{
+ MemoryRegion *mr;
+ int ret = 0; /* no ref op: force fallback to the big lock */
+
+ mr = mrs->mr;
+ if (mr->ops != NULL) {
+ if (mr->ops->ref != NULL) {
+ ret = mr->ops->ref(mr);
+ }
+ }
+ return ret;
+}
+
+static void mrs_unref(MemoryRegionSection *mrs)
+{
+ MemoryRegion *mr;
+
+ mr = mrs->mr;
+ if (mr->ops != NULL) {
+ if (mr->ops->unref != NULL) {
+ mr->ops->unref(mr);
+ }
+ }
+}
+
void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
int len, int is_write)
{
@@ -3413,14 +3479,35 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
uint8_t *ptr;
uint32_t val;
target_phys_addr_t page;
- MemoryRegionSection *section;
+ MemoryRegionSection *section, obj_mrs;
+ int ret;
+ int need_biglock = 0;
+ /* Will finally be removed once all mr->ops implement ref/unref() */
+try_big_lock:
+ if (need_biglock == 1) {
+ qemu_mutex_lock_iothread();
+ }
while (len > 0) {
page = addr & TARGET_PAGE_MASK;
l = (page + TARGET_PAGE_SIZE) - addr;
if (l > len)
l = len;
+
+ qemu_mutex_lock(&mem_map_lock);
section = phys_page_find(page >> TARGET_PAGE_BITS);
+ if (need_biglock == 0) {
+ obj_mrs = mrs_get_terminal(section, addr);
+ ret = mrs_ref(&obj_mrs);
+ if (ret <= 0) {
+ need_biglock = 1;
+ qemu_mutex_unlock(&mem_map_lock);
+ goto try_big_lock;
+ }
+ /* rely on local variable */
+ section = &obj_mrs;
+ }
+ qemu_mutex_unlock(&mem_map_lock);
if (is_write) {
if (!memory_region_is_ram(section->mr)) {
@@ -3491,10 +3578,16 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
qemu_put_ram_ptr(ptr);
}
}
+
+ mrs_unref(&obj_mrs);
len -= l;
buf += l;
addr += l;
}
+
+ if (need_biglock == 1) {
+ qemu_mutex_unlock_iothread();
+ }
}
/* used for ROM loading : can write in RAM and ROM */
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 07/11] memory: implement e1000's MemoryRegionOps ref/unref
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (5 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 8:37 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async Liu Ping Fan
` (3 subsequent siblings)
10 siblings, 1 reply; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
With this, e1000 signals to the memory core that it can be protected
by refcount.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
hw/e1000.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/hw/e1000.c b/hw/e1000.c
index ae8a6c5..df3a349 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -1037,9 +1037,26 @@ e1000_mmio_read(void *opaque, target_phys_addr_t addr, unsigned size)
return 0;
}
+static int e1000_mmio_ref(MemoryRegion *mr)
+{
+ E1000State *e1000 = container_of(mr, E1000State, mmio);
+
+ object_ref(OBJECT(e1000));
+ return 1;
+}
+
+static void e1000_mmio_unref(MemoryRegion *mr)
+{
+ E1000State *e1000 = container_of(mr, E1000State, mmio);
+
+ object_unref(OBJECT(e1000));
+}
+
static const MemoryRegionOps e1000_mmio_ops = {
.read = e1000_mmio_read,
.write = e1000_mmio_write,
+ .ref = e1000_mmio_ref,
+ .unref = e1000_mmio_unref,
.endianness = DEVICE_LITTLE_ENDIAN,
.impl = {
.min_access_size = 4,
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (6 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 07/11] memory: implement e1000's MemoryRegionOps ref/unref Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 8:32 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 09/11] vcpu: make QemuThread as tls to store thread-self info Liu Ping Fan
` (2 subsequent siblings)
10 siblings, 1 reply; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
DeviceState will be protected by refcount from disappearing during
dispatch. But when the refcount drops to zero, the DeviceState may
still be in use by an iohandler, timer etc. in the main loop, so we
delay its free until there is no reader.
This patch builds that delayed reclaimer.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
include/qemu/reclaimer.h | 30 +++++++++++++++++++++++++
main-loop.c | 5 ++++
qemu-tool.c | 5 ++++
qom/Makefile.objs | 2 +-
qom/reclaimer.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 1 deletions(-)
create mode 100644 include/qemu/reclaimer.h
create mode 100644 qom/reclaimer.c
diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
new file mode 100644
index 0000000..5143c4f
--- /dev/null
+++ b/include/qemu/reclaimer.h
@@ -0,0 +1,30 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_RECLAIMER
+#define QEMU_RECLAIMER
+
+#include "qemu-thread.h"
+
+typedef void ReleaseHandler(void *opaque);
+typedef struct Chunk {
+ QLIST_ENTRY(Chunk) list;
+ void *opaque;
+ ReleaseHandler *release;
+} Chunk;
+
+typedef struct ChunkHead {
+ struct QemuMutex lock;
+ QLIST_HEAD(, Chunk) reclaim_list;
+} ChunkHead;
+
+extern ChunkHead qdev_reclaimer;
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release);
+void reclaimer_worker(ChunkHead *head);
+void qemu_reclaimer(void);
+#endif
diff --git a/main-loop.c b/main-loop.c
index eb3b6e6..be9d095 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -26,6 +26,7 @@
#include "qemu-timer.h"
#include "slirp/slirp.h"
#include "main-loop.h"
+#include "qemu/reclaimer.h"
#ifndef _WIN32
@@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
them. */
qemu_bh_poll();
+ /* refs to a device from iohandler/bh/timer do not obey the rules, so delay
+ * reclaiming until now.
+ */
+ qemu_reclaimer();
return ret;
}
diff --git a/qemu-tool.c b/qemu-tool.c
index 18205ba..f250c87 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -21,6 +21,7 @@
#include "main-loop.h"
#include "qemu_socket.h"
#include "slirp/libslirp.h"
+#include "qemu/reclaimer.h"
#include <sys/time.h>
@@ -100,6 +101,10 @@ void qemu_mutex_unlock_iothread(void)
{
}
+void qemu_reclaimer(void)
+{
+}
+
int use_icount;
void qemu_clock_warp(QEMUClock *clock)
diff --git a/qom/Makefile.objs b/qom/Makefile.objs
index 5ef060a..a579261 100644
--- a/qom/Makefile.objs
+++ b/qom/Makefile.objs
@@ -1,4 +1,4 @@
-qom-obj-y = object.o container.o qom-qobject.o
+qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
qom-obj-twice-y = cpu.o
common-obj-y = $(qom-obj-twice-y)
user-obj-y = $(qom-obj-twice-y)
diff --git a/qom/reclaimer.c b/qom/reclaimer.c
new file mode 100644
index 0000000..b098ad7
--- /dev/null
+++ b/qom/reclaimer.c
@@ -0,0 +1,54 @@
+/*
+ * QEMU reclaimer
+ *
+ * Copyright IBM, Corp. 2012
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu-common.h"
+#include "qemu-thread.h"
+#include "main-loop.h"
+#include "qemu-queue.h"
+#include "qemu/reclaimer.h"
+
+ChunkHead qdev_reclaimer;
+
+static void reclaimer_init(ChunkHead *head)
+{
+ qemu_mutex_init(&head->lock);
+}
+
+void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler *release)
+{
+ Chunk *r = g_malloc0(sizeof(Chunk));
+ r->opaque = opaque;
+ r->release = release;
+ qemu_mutex_lock(&head->lock);
+ QLIST_INSERT_HEAD(&head->reclaim_list, r, list);
+ qemu_mutex_unlock(&head->lock);
+}
+
+void reclaimer_worker(ChunkHead *head)
+{
+ Chunk *cur, *next;
+
+ qemu_mutex_lock(&head->lock);
+ QLIST_FOREACH_SAFE(cur, &head->reclaim_list, list, next) {
+ QLIST_REMOVE(cur, list);
+ cur->release(cur->opaque);
+ g_free(cur);
+ }
+ qemu_mutex_unlock(&head->lock);
+}
+
+void qemu_reclaimer(void)
+{
+ static int init;
+
+ if (init == 0) {
+ init = 1;
+ reclaimer_init(&qdev_reclaimer);
+ }
+ reclaimer_worker(&qdev_reclaimer);
+}
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 09/11] vcpu: make QemuThread as tls to store thread-self info
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (7 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 11/11] vcpu: push mmio dispatcher out of big lock Liu Ping Fan
10 siblings, 0 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
We store the thread-self info in QemuThread so it can be retrieved
later through thread-local storage.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
cpus.c | 1 +
qemu-thread-posix.c | 7 +++++++
qemu-thread-posix.h | 2 ++
qemu-thread.h | 1 +
vl.c | 4 ++++
5 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/cpus.c b/cpus.c
index e476a3c..4cd7f85 100644
--- a/cpus.c
+++ b/cpus.c
@@ -735,6 +735,7 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
CPUState *cpu = ENV_GET_CPU(env);
int r;
+ pthread_setspecific(qemu_thread_key, cpu->thread);
qemu_mutex_lock(&qemu_global_mutex);
qemu_thread_get_self(cpu->thread);
env->thread_id = qemu_get_thread_id();
diff --git a/qemu-thread-posix.c b/qemu-thread-posix.c
index 8fbabda..f448fcb 100644
--- a/qemu-thread-posix.c
+++ b/qemu-thread-posix.c
@@ -19,6 +19,8 @@
#include <string.h>
#include "qemu-thread.h"
+pthread_key_t qemu_thread_key;
+
static void error_exit(int err, const char *msg)
{
fprintf(stderr, "qemu: %s: %s\n", msg, strerror(err));
@@ -151,6 +153,11 @@ void qemu_thread_get_self(QemuThread *thread)
thread->thread = pthread_self();
}
+void qemu_thread_key_create(void)
+{
+ pthread_key_create(&qemu_thread_key, NULL);
+}
+
bool qemu_thread_is_self(QemuThread *thread)
{
return pthread_equal(pthread_self(), thread->thread);
diff --git a/qemu-thread-posix.h b/qemu-thread-posix.h
index ee4618e..2607b1c 100644
--- a/qemu-thread-posix.h
+++ b/qemu-thread-posix.h
@@ -14,4 +14,6 @@ struct QemuThread {
pthread_t thread;
};
+extern pthread_key_t qemu_thread_key;
+
#endif
diff --git a/qemu-thread.h b/qemu-thread.h
index 05fdaaf..4a6427d 100644
--- a/qemu-thread.h
+++ b/qemu-thread.h
@@ -46,4 +46,5 @@ void qemu_thread_get_self(QemuThread *thread);
bool qemu_thread_is_self(QemuThread *thread);
void qemu_thread_exit(void *retval);
+void qemu_thread_key_create(void);
#endif
diff --git a/vl.c b/vl.c
index 7c577fa..44e2783 100644
--- a/vl.c
+++ b/vl.c
@@ -149,6 +149,7 @@ int main(int argc, char **argv)
#include "qemu-options.h"
#include "qmp-commands.h"
#include "main-loop.h"
+#include "qemu-thread.h"
#ifdef CONFIG_VIRTFS
#include "fsdev/qemu-fsdev.h"
#endif
@@ -2342,6 +2343,7 @@ int qemu_init_main_loop(void)
return main_loop_init();
}
+
int main(int argc, char **argv, char **envp)
{
int i;
@@ -3483,6 +3485,8 @@ int main(int argc, char **argv, char **envp)
exit(1);
}
+ qemu_thread_key_create();
+
os_set_line_buffering();
if (init_timer_alarm() < 0) {
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (8 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 09/11] vcpu: make QemuThread as tls to store thread-self info Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
2012-09-11 8:35 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 11/11] vcpu: push mmio dispatcher out of big lock Liu Ping Fan
10 siblings, 1 reply; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
The function call chain can end up recursively holding
qemu_mutex_lock_iothread(). We introduce a lockmap to record the
lock depth so that nested acquisitions are safe.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
cpus.c | 18 ++++++++++++++++++
qemu-thread-posix.c | 23 +++++++++++++++++++++++
qemu-thread-posix.h | 5 +++++
qemu-thread.h | 3 +++
4 files changed, 49 insertions(+), 0 deletions(-)
diff --git a/cpus.c b/cpus.c
index 4cd7f85..09f6670 100644
--- a/cpus.c
+++ b/cpus.c
@@ -736,6 +736,8 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
int r;
pthread_setspecific(qemu_thread_key, cpu->thread);
+ cpu->thread->lockmap.biglock = 0;
+ qemu_big_lockmap_inc();
qemu_mutex_lock(&qemu_global_mutex);
qemu_thread_get_self(cpu->thread);
env->thread_id = qemu_get_thread_id();
@@ -905,6 +907,14 @@ int qemu_cpu_is_self(void *_env)
void qemu_mutex_lock_iothread(void)
{
+ unsigned int map;
+
+ if (!qemu_thread_is_self(&io_thread)) {
+ map = qemu_big_lockmap_inc();
+ if (map > 1) {
+ return;
+ }
+ }
if (!tcg_enabled()) {
qemu_mutex_lock(&qemu_global_mutex);
} else {
@@ -920,6 +930,14 @@ void qemu_mutex_lock_iothread(void)
void qemu_mutex_unlock_iothread(void)
{
+ unsigned int map;
+
+ if (!qemu_thread_is_self(&io_thread)) {
+ map = qemu_big_lockmap_dec();
+ if (map != 0) {
+ return;
+ }
+ }
qemu_mutex_unlock(&qemu_global_mutex);
}
diff --git a/qemu-thread-posix.c b/qemu-thread-posix.c
index f448fcb..1e07dc2 100644
--- a/qemu-thread-posix.c
+++ b/qemu-thread-posix.c
@@ -17,6 +17,7 @@
#include <signal.h>
#include <stdint.h>
#include <string.h>
+#include <glib.h>
#include "qemu-thread.h"
pthread_key_t qemu_thread_key;
@@ -158,6 +159,28 @@ void qemu_thread_key_create(void)
pthread_key_create(&qemu_thread_key, NULL);
}
+int16_t qemu_big_lockmap_inc(void)
+{
+ QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+ return ++t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_dec(void)
+{
+ QemuThread *t = pthread_getspecific(qemu_thread_key);
+ g_assert(t->lockmap.biglock > 0);
+
+ return --t->lockmap.biglock;
+}
+
+int16_t qemu_big_lockmap_get(void)
+{
+ QemuThread *t = pthread_getspecific(qemu_thread_key);
+
+ return t->lockmap.biglock;
+}
+
bool qemu_thread_is_self(QemuThread *thread)
{
return pthread_equal(pthread_self(), thread->thread);
diff --git a/qemu-thread-posix.h b/qemu-thread-posix.h
index 2607b1c..8f9506b 100644
--- a/qemu-thread-posix.h
+++ b/qemu-thread-posix.h
@@ -10,8 +10,13 @@ struct QemuCond {
pthread_cond_t cond;
};
+typedef struct Lockmap {
+ int16_t biglock;
+} Lockmap;
+
struct QemuThread {
pthread_t thread;
+ Lockmap lockmap;
};
extern pthread_key_t qemu_thread_key;
diff --git a/qemu-thread.h b/qemu-thread.h
index 4a6427d..529850b 100644
--- a/qemu-thread.h
+++ b/qemu-thread.h
@@ -47,4 +47,7 @@ bool qemu_thread_is_self(QemuThread *thread);
void qemu_thread_exit(void *retval);
void qemu_thread_key_create(void);
+int16_t qemu_big_lockmap_inc(void);
+int16_t qemu_big_lockmap_dec(void);
+int16_t qemu_big_lockmap_get(void);
#endif
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
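[Editor's note] The lockmap patch above keeps a per-thread recursion counter so that nested callers of qemu_mutex_lock_iothread() do not deadlock. A minimal standalone model of that idea (not QEMU code: the mutex is reduced to a flag, and `big_lock_depth` stands in for `t->lockmap.biglock`):

```c
#include <assert.h>

/* Model of the big-lock recursion scheme from the patch above.
 * big_lock_taken stands for qemu_global_mutex; big_lock_depth is the
 * per-thread counter (t->lockmap.biglock in the patch). */
int big_lock_taken;
__thread int big_lock_depth;

void big_lock_acquire(void)
{
    if (++big_lock_depth > 1) {
        return;                 /* already held by this thread: just count */
    }
    big_lock_taken = 1;         /* qemu_mutex_lock(&qemu_global_mutex) */
}

void big_lock_release(void)
{
    assert(big_lock_depth > 0);
    if (--big_lock_depth > 0) {
        return;                 /* an outer frame still holds the lock */
    }
    big_lock_taken = 0;         /* qemu_mutex_unlock(&qemu_global_mutex) */
}
```

The key property is that a nested acquire only bumps the counter, and the real unlock happens exactly when the outermost frame releases.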
* [Qemu-devel] [PATCH V3 11/11] vcpu: push mmio dispatcher out of big lock
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
` (9 preceding siblings ...)
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap Liu Ping Fan
@ 2012-09-11 7:51 ` Liu Ping Fan
10 siblings, 0 replies; 72+ messages in thread
From: Liu Ping Fan @ 2012-09-11 7:51 UTC (permalink / raw)
To: qemu-devel
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
Paolo Bonzini
From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
kvm-all.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kvm-all.c b/kvm-all.c
index 34b02c1..ef7cd5f 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1562,10 +1562,13 @@ int kvm_cpu_exec(CPUArchState *env)
break;
case KVM_EXIT_MMIO:
DPRINTF("handle_mmio\n");
+ qemu_mutex_unlock_iothread();
cpu_physical_memory_rw(run->mmio.phys_addr,
run->mmio.data,
run->mmio.len,
run->mmio.is_write);
+ qemu_mutex_lock_iothread();
+
ret = 0;
break;
case KVM_EXIT_IRQ_WINDOW_OPEN:
--
1.7.4.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations Liu Ping Fan
@ 2012-09-11 8:04 ` Avi Kivity
2012-09-13 6:54 ` liu ping fan
2012-09-11 8:15 ` Peter Maydell
1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:04 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> If we are out of the global lock, we will be challenged by SMP at a low
> level, so we need atomic ops.
>
> This file is a wrapper around the GCC atomic builtins.
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
> include/qemu/atomic.h | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 63 insertions(+), 0 deletions(-)
> create mode 100644 include/qemu/atomic.h
>
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> new file mode 100644
> index 0000000..f17145d
> --- /dev/null
> +++ b/include/qemu/atomic.h
> @@ -0,0 +1,63 @@
> +/*
> + * Simple wrapper of gcc Atomic-Builtins
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +#ifndef __QEMU_ATOMIC_H
> +#define __QEMU_ATOMIC_H
> +
> +typedef struct Atomic {
> + int counter;
> +} Atomic;
Best to mark counter 'volatile'.
> +
> +static inline void atomic_set(Atomic *v, int i)
> +{
> + v->counter = i;
> +}
> +
> +static inline int atomic_read(Atomic *v)
> +{
> + return v->counter;
> +}
>
So these two operations don't get mangled by the optimizer.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
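[Editor's note] Avi's two remarks — mark `counter` volatile so plain reads and writes are not mangled by the optimizer, and wrap the GCC builtins for the read-modify-write operations — could be combined into a sketch like the following (an assumption of what the revised header might look like, not the actual patch):

```c
#include <assert.h>

/* Sketch of the Atomic wrapper under review, with the suggested
 * 'volatile' so atomic_read()/atomic_set() are not cached in a
 * register by the optimizer. */
typedef struct Atomic {
    volatile int counter;
} Atomic;

static inline void atomic_set(Atomic *v, int i)
{
    v->counter = i;
}

static inline int atomic_read(Atomic *v)
{
    return v->counter;
}

/* The RMW operations use the GCC __sync builtins. */
static inline void atomic_inc(Atomic *v)
{
    __sync_fetch_and_add(&v->counter, 1);
}

/* Returns nonzero if the counter reached zero, the shape a refcount
 * user (e.g. object_unref) would want. */
static inline int atomic_dec_and_test(Atomic *v)
{
    return __sync_sub_and_fetch(&v->counter, 1) == 0;
}
```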
* Re: [Qemu-devel] [PATCH V3 05/11] memory: introduce ref, unref interface for MemoryRegionOps
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 05/11] memory: introduce ref, unref interface for MemoryRegionOps Liu Ping Fan
@ 2012-09-11 8:08 ` Avi Kivity
0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:08 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> This pair of interfaces helps to decide, when dispatching, whether
> we can pin the mr without the big lock or not.
>
> --- a/memory.h
> +++ b/memory.h
> @@ -25,6 +25,7 @@
> #include "iorange.h"
> #include "ioport.h"
> #include "int128.h"
> +#include "qemu/object.h"
Not needed.
>
> typedef struct MemoryRegionOps MemoryRegionOps;
> typedef struct MemoryRegion MemoryRegion;
> @@ -66,6 +67,8 @@ struct MemoryRegionOps {
> target_phys_addr_t addr,
> uint64_t data,
> unsigned size);
> + int (*ref)(MemoryRegion *mr);
> + void (*unref)(MemoryRegion *mr);
>
Please add documentation, esp. that the operations are optional.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
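[Editor's note] Since the review asks for documentation stating that the operations are optional, the dispatch side has to tolerate their absence. A hedged sketch of that convention, with simplified stand-in types and a hypothetical `demo_ops` device (not the real QEMU structures):

```c
#include <assert.h>
#include <stddef.h>

typedef struct MemoryRegion MemoryRegion;

typedef struct MemoryRegionOps {
    /* Optional: take a reference on the device backing @mr before a
     * dispatch that runs outside the big lock.  Returns > 0 on
     * success, <= 0 if the caller must fall back to the big lock. */
    int (*ref)(MemoryRegion *mr);
    /* Optional: drop the reference taken by ref(). */
    void (*unref)(MemoryRegion *mr);
} MemoryRegionOps;

struct MemoryRegion {
    const MemoryRegionOps *ops;
};

/* NULL-tolerant call sites, as the core_region_add/del hunks do. */
int memory_region_try_ref(MemoryRegion *mr)
{
    if (mr->ops && mr->ops->ref) {
        return mr->ops->ref(mr);
    }
    return 0;   /* no ref support: caller resorts to the big lock */
}

void memory_region_try_unref(MemoryRegion *mr)
{
    if (mr->ops && mr->ops->unref) {
        mr->ops->unref(mr);
    }
}

/* Hypothetical device implementing the pair, for illustration. */
int demo_refs;
static int demo_ref(MemoryRegion *mr)   { (void)mr; return ++demo_refs; }
static void demo_unref(MemoryRegion *mr) { (void)mr; --demo_refs; }
const MemoryRegionOps demo_ops = { demo_ref, demo_unref };
```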
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations Liu Ping Fan
2012-09-11 8:04 ` Avi Kivity
@ 2012-09-11 8:15 ` Peter Maydell
2012-09-13 6:53 ` liu ping fan
1 sibling, 1 reply; 72+ messages in thread
From: Peter Maydell @ 2012-09-11 8:15 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Avi Kivity,
Anthony Liguori, Paolo Bonzini
On 11 September 2012 08:51, Liu Ping Fan <qemulist@gmail.com> wrote:
> +
> +/**
> + * * atomic_inc - increment atomic variable
> + * * @v: pointer of type Atomic
> + * *
> + * * Atomically increments @v by 1.
> + * */
Your editor has done something weird with these comments.
-- PMM
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock Liu Ping Fan
@ 2012-09-11 8:25 ` Avi Kivity
2012-09-11 8:47 ` Avi Kivity
1 sibling, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:25 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Without the big lock, we try to protect the mr by increasing its refcnt.
> If we cannot inc the refcnt, we go back and resort to the big lock.
>
> Another point is that the memory radix tree can be flushed by another
> thread, so we keep a copy of the terminal mr to survive such a flush.
>
>
> +static QemuMutex mem_map_lock;
> +
> static void core_begin(MemoryListener *listener)
> {
> + qemu_mutex_lock(&mem_map_lock);
Since we have an unbalanced lock here, please add a comment that
explains the lock holding region.
> destroy_all_mappings();
> phys_sections_clear();
> phys_map.ptr = PHYS_MAP_NODE_NIL;
> @@ -3184,17 +3187,32 @@ static void core_commit(MemoryListener *listener)
> for(env = first_cpu; env != NULL; env = env->next_cpu) {
> tlb_flush(env, 1);
> }
> + qemu_mutex_unlock(&mem_map_lock);
> }
>
> static void core_region_add(MemoryListener *listener,
> MemoryRegionSection *section)
> {
> + MemoryRegion *mr = section->mr;
> +
> + if (mr->ops != NULL) {
> + if (mr->ops->ref != NULL) {
> + mr->ops->ref(mr);
> + }
> + }
Can drop '!= NULL'.
> cpu_register_physical_memory_log(section, section->readonly);
> }
>
> static void core_region_del(MemoryListener *listener,
> MemoryRegionSection *section)
> {
> + MemoryRegion *mr = section->mr;
> +
> + if (mr->ops != NULL) {
> + if (mr->ops->unref != NULL) {
> + mr->ops->unref(mr);
> + }
> + }
> }
Here too.
> @@ -3406,6 +3426,52 @@ int cpu_memory_rw_debug(CPUArchState *env, target_ulong addr,
> }
>
> #else
> +
> +static MemoryRegionSection *subpage_get_terminal(subpage_t *mmio,
> + target_phys_addr_t addr)
> +{
> + MemoryRegionSection *section;
> + unsigned int idx = SUBPAGE_IDX(addr);
> +
> + section = &phys_sections[mmio->sub_section[idx]];
> + return section;
> +}
> +
> +static MemoryRegionSection mrs_get_terminal(MemoryRegionSection *mrs,
> + target_phys_addr_t addr)
> +{
> + if (mrs->mr->subpage) {
> + mrs = subpage_get_terminal(mrs->mr->opaque, addr);
> + }
> + return *mrs;
> +}
> +
Please spell out memory_region_section_ in function names, like
elsewhere in the file.
> +static int mrs_ref(MemoryRegionSection *mrs)
> +{
> + MemoryRegion *mr;
> + int ret;
> +
> + mr = mrs->mr;
> + if (mr->ops != NULL) {
> + if (mr->ops->ref != NULL) {
> + ret = mr->ops->ref(mr);
> + }
> + }
> + return ret;
> +}
What could 'ret' possibly be?
> +
> +static void mrs_unref(MemoryRegionSection *mrs)
> +{
> + MemoryRegion *mr;
> +
> + mr = mrs->mr;
> + if (mr->ops != NULL) {
> + if (mr->ops->unref != NULL) {
> + mr->ops->unref(mr);
> + }
> + }
> +}
> +
> void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> int len, int is_write)
> {
> @@ -3413,14 +3479,35 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> uint8_t *ptr;
> uint32_t val;
> target_phys_addr_t page;
> - MemoryRegionSection *section;
> + MemoryRegionSection *section, obj_mrs;
> + int ret;
> + int need_biglock = 0;
>
> + /* Will finally removed after all mr->ops implement ref/unref() */
> +try_big_lock:
> + if (need_biglock == 1) {
> + qemu_mutex_lock_iothread();
> + }
> while (len > 0) {
> page = addr & TARGET_PAGE_MASK;
> l = (page + TARGET_PAGE_SIZE) - addr;
> if (l > len)
> l = len;
> +
> + qemu_mutex_lock(&mem_map_lock);
> section = phys_page_find(page >> TARGET_PAGE_BITS);
> + if (need_biglock == 0) {
> + obj_mrs = mrs_get_terminal(section, addr);
> + ret = mrs_ref(&obj_mrs);
> + if (ret <= 0) {
> + need_biglock = 1;
> + qemu_mutex_unlock(&mem_map_lock);
> + goto try_big_lock;
> + }
> + /* rely on local variable */
> + section = &obj_mrs;
> + }
> + qemu_mutex_unlock(&mem_map_lock);
Please split out this code to a separate function (can return
MemoryRegionSection by value, with either the big lock held or the
object referenced).
>
> if (is_write) {
> if (!memory_region_is_ram(section->mr)) {
> @@ -3491,10 +3578,16 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> qemu_put_ram_ptr(ptr);
> }
> }
> +
> + mrs_unref(&obj_mrs);
> len -= l;
> buf += l;
> addr += l;
> }
> +
> + if (need_biglock == 1) {
> + qemu_mutex_unlock_iothread();
> + }
Similarly this should be in a separate function, either unrefing or
dropping the big lock as appropriate.
> }
>
> /* used for ROM loading : can write in RAM and ROM */
>
I see other calls to phys_page_find(); they all need to be protected.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async Liu Ping Fan
@ 2012-09-11 8:32 ` Avi Kivity
2012-09-11 9:32 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:32 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> DeviceState will be protected by refcnt from disappearing during
> dispatching. But when the refcnt comes down to zero, the DeviceState may
> still be in use by an iohandler, timer, etc. in the main loop, so we delay
> its free until there are no readers.
>
How can this be? We elevate the refcount while dispatching I/O. If we
have similar problems with the timer, we need to employ a similar solution.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap Liu Ping Fan
@ 2012-09-11 8:35 ` Avi Kivity
2012-09-11 9:44 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:35 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> The function call chain can end up recursively holding
> qemu_mutex_lock_iothread. We introduce a lockmap to record the
> lock depth.
What is the root cause? io handlers initiating I/O?
Perhaps we can have a solution that is local to exec.c.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 07/11] memory: implement e1000's MemoryRegionOps ref/unref
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 07/11] memory: implement e1000's MemoryRegionOps ref/unref Liu Ping Fan
@ 2012-09-11 8:37 ` Avi Kivity
0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:37 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> With this, e1000 shows to the memory core that it can be protected by
> refcnt.
>
> diff --git a/hw/e1000.c b/hw/e1000.c
> index ae8a6c5..df3a349 100644
> --- a/hw/e1000.c
> +++ b/hw/e1000.c
> @@ -1037,9 +1037,26 @@ e1000_mmio_read(void *opaque, target_phys_addr_t addr, unsigned size)
> return 0;
> }
>
> +static int e1000_mmio_ref(MemoryRegion *mr)
> +{
> + E1000State *e1000 = container_of(mr, E1000State, mmio);
> +
> + object_ref(OBJECT(e1000));
> + return 1;
> +}
> +
> +static void e1000_mmio_unref(MemoryRegion *mr)
> +{
> + E1000State *e1000 = container_of(mr, E1000State, mmio);
> +
> + object_unref(OBJECT(e1000));
> +}
> +
> static const MemoryRegionOps e1000_mmio_ops = {
> .read = e1000_mmio_read,
> .write = e1000_mmio_write,
> + .ref = e1000_mmio_ref,
> + .unref = e1000_mmio_unref,
> .endianness = DEVICE_LITTLE_ENDIAN,
> .impl = {
> .min_access_size = 4,
>
Does e1000_mmio_write() never call functions that assume bql protection?
The network layer, timers, and qemu_irq all need protection.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock Liu Ping Fan
2012-09-11 8:25 ` Avi Kivity
@ 2012-09-11 8:47 ` Avi Kivity
1 sibling, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 8:47 UTC (permalink / raw)
To: Liu Ping Fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Without the big lock, we try to protect the mr by increasing its refcnt.
> If we cannot inc the refcnt, we go back and resort to the big lock.
>
> Another point is that the memory radix tree can be flushed by another
> thread, so we keep a copy of the terminal mr to survive such a flush.
>
> void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> int len, int is_write)
> {
> @@ -3413,14 +3479,35 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
> uint8_t *ptr;
> uint32_t val;
> target_phys_addr_t page;
> - MemoryRegionSection *section;
> + MemoryRegionSection *section, obj_mrs;
> + int ret;
> + int need_biglock = 0;
>
> + /* Will finally removed after all mr->ops implement ref/unref() */
> +try_big_lock:
> + if (need_biglock == 1) {
> + qemu_mutex_lock_iothread();
> + }
> while (len > 0) {
> page = addr & TARGET_PAGE_MASK;
> l = (page + TARGET_PAGE_SIZE) - addr;
> if (l > len)
> l = len;
> +
> + qemu_mutex_lock(&mem_map_lock);
> section = phys_page_find(page >> TARGET_PAGE_BITS);
> + if (need_biglock == 0) {
> + obj_mrs = mrs_get_terminal(section, addr);
> + ret = mrs_ref(&obj_mrs);
> + if (ret <= 0) {
> + need_biglock = 1;
> + qemu_mutex_unlock(&mem_map_lock);
> + goto try_big_lock;
> + }
> + /* rely on local variable */
> + section = &obj_mrs;
> + }
> + qemu_mutex_unlock(&mem_map_lock);
>
If on the first pass we find that we need the big lock, we may find on
the second pass that we don't need it since the memory map has changed.
So on the second pass we might actually need to drop the big lock.
Could code it like
section, need_lock = lookup(...)
if need_lock:
lock(big_lock)
section, need_lock = lookup(...)
if not need_lock:
unlock(big_lock)
dispatch(section)
if need_lock:
unlock(big_lock)
It's ugly as hell, so we'll have to apologize to readers in a big comment
explaining what's going on.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
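[Editor's note] The pseudocode above can be rendered in C. This is only a control-flow sketch with hypothetical helpers: `lookup()` stands for the `phys_page_find()` + `mrs_ref()` pair, and its answer is parameterized so the map-changed-between-passes case can be exercised:

```c
#include <assert.h>
#include <stdbool.h>

bool big_lock_held;

static void lock_big(void)   { big_lock_held = true; }
static void unlock_big(void) { big_lock_held = false; }

/* Stand-in for phys_page_find() + mrs_ref(): reports whether the big
 * lock is needed (i.e. the section's device has no ref/unref).  The
 * answer can differ between the two passes because another thread may
 * change the memory map in between. */
static void lookup(bool supports_ref, bool *need_lock)
{
    *need_lock = !supports_ref;
}

/* The two-pass scheme from the pseudocode above: retry the lookup
 * under the big lock, and drop the lock again if the second pass shows
 * it is no longer needed.  Returns whether the big lock is still
 * (wrongly) held at the end, which must always be false. */
bool rw_dispatch(bool ref_ok_pass1, bool ref_ok_pass2)
{
    bool need_lock;

    lookup(ref_ok_pass1, &need_lock);
    if (need_lock) {
        lock_big();
        lookup(ref_ok_pass2, &need_lock);
        if (!need_lock) {
            unlock_big();
        }
    }
    /* ... dispatch(section) would run here, with either the big lock
     * held or the section's device referenced ... */
    if (need_lock) {
        unlock_big();
    }
    return big_lock_held;
}
```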
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-11 8:32 ` Avi Kivity
@ 2012-09-11 9:32 ` liu ping fan
2012-09-11 9:37 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-11 9:32 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> DeviceState will be protected by refcnt from disappearing during
>> dispatching. But when the refcnt comes down to zero, the DeviceState may
>> still be in use by an iohandler, timer, etc. in the main loop, so we delay
>> its free until there are no readers.
>>
>
> How can this be? We elevate the refcount while dispatching I/O. If we
> have similar problems with the timer, we need to employ a similar solution.
>
Yes, as the next step, I plan to convert the iohandler, timer, etc. to use
refcounts the way memory does. This is just a temporary solution.
Regards,
pingfan
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-11 9:32 ` liu ping fan
@ 2012-09-11 9:37 ` Avi Kivity
2012-09-13 6:54 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 9:37 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 12:32 PM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> DeviceState will be protected by refcnt from disappearing during
>>> dispatching. But when the refcnt comes down to zero, the DeviceState may
>>> still be in use by an iohandler, timer, etc. in the main loop, so we delay
>>> its free until there are no readers.
>>>
>>
>> How can this be? We elevate the refcount while dispatching I/O. If we
>> have similar problems with the timer, we need to employ a similar solution.
>>
> Yes, as the next step, I plan to convert the iohandler, timer, etc. to use
> refcounts the way memory does. This is just a temporary solution.
I prefer not to ever introduce it.
What we can do is introduce a sub-region for e1000's mmio that will take
only the device lock, and let original region use the old dispatch path
(and also take the device lock). As we thread the various subsystems
e1000 uses, we can expand the sub-region until it covers all of e1000's
functions, then fold it back into the main region.
To start with the sub-region can only include registers that call no
qemu infrastructure code: simple read/writes.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 8:35 ` Avi Kivity
@ 2012-09-11 9:44 ` liu ping fan
2012-09-11 9:54 ` Avi Kivity
2012-09-11 9:57 ` Jan Kiszka
0 siblings, 2 replies; 72+ messages in thread
From: liu ping fan @ 2012-09-11 9:44 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> The function call chain can end up recursively holding
>> qemu_mutex_lock_iothread. We introduce a lockmap to record the
>> lock depth.
>
> What is the root cause? io handlers initiating I/O?
>
cpu_physical_memory_rw() can be called nested, and when it is called, it can
be protected by no lock, the device lock, or the big lock.
I think that without the big lock, the io-dispatcher faces the same issue.
As to the main loop, I have not considered it carefully, but at least
dma-helper will call cpu_physical_memory_rw() with the big lock held.
Regards,
pingfan
> Perhaps we can have a solution that is local to exec.c.
>
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 9:44 ` liu ping fan
@ 2012-09-11 9:54 ` Avi Kivity
2012-09-11 10:04 ` Jan Kiszka
2012-09-11 9:57 ` Jan Kiszka
1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 9:54 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/11/2012 12:44 PM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> The function call chain can end up recursively holding
>>> qemu_mutex_lock_iothread. We introduce a lockmap to record the
>>> lock depth.
>>
>> What is the root cause? io handlers initiating I/O?
>>
> cpu_physical_memory_rw() can be called nested, and when it is called, it can
> be protected by no lock, the device lock, or the big lock.
Then we should look for a solution that is local to exec.c (and the
nested dispatch problem). I think we can identify and fix all nested
dispatches (converting them to either async dma, or letting the memory
core do a single dispatch without indirection through I/O handlers).
> I think that without the big lock, the io-dispatcher faces the same issue.
> As to the main loop, I have not considered it carefully, but at least
> dma-helper will call cpu_physical_memory_rw() with the big lock held.
DMA is inherently asynchronous, so we already drop the lock between
initiation and completion; we need to find a way to make it easy to use
the API without taking the lock while the transfer takes place.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 9:44 ` liu ping fan
2012-09-11 9:54 ` Avi Kivity
@ 2012-09-11 9:57 ` Jan Kiszka
2012-09-11 12:24 ` Avi Kivity
1 sibling, 1 reply; 72+ messages in thread
From: Jan Kiszka @ 2012-09-11 9:57 UTC (permalink / raw)
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
qemu-devel@nongnu.org
On 2012-09-11 11:44, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>
>>> The function call chain can end up recursively holding
>>> qemu_mutex_lock_iothread. We introduce a lockmap to record the
>>> lock depth.
>>
>> What is the root cause? io handlers initiating I/O?
>>
> cpu_physical_memory_rw() can be called nested, and when it is called, it can
> be protected by no lock, the device lock, or the big lock.
> I think that without the big lock, the io-dispatcher faces the same issue.
> As to the main loop, I have not considered it carefully, but at least
> dma-helper will call cpu_physical_memory_rw() with the big lock held.
That is our core problem: inconsistent invocation of existing services
/wrt locking. For portio, I was lucky that there is no nesting and I was
able to drop the big lock around all (x86) call sites. But MMIO is way
more tricky due to DMA nesting.
We could try to introduce a different version of cpu_physical_memory_rw,
cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
request can trigger the very same access in a nested fashion, and we
will have to detect this to avoid locking up QEMU (locking up the guest
might be OK).
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 9:54 ` Avi Kivity
@ 2012-09-11 10:04 ` Jan Kiszka
2012-09-11 11:03 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: Jan Kiszka @ 2012-09-11 10:04 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 2012-09-11 11:54, Avi Kivity wrote:
> On 09/11/2012 12:44 PM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> The function call chain can end up recursively holding
>>>> qemu_mutex_lock_iothread. We introduce a lockmap to record the
>>>> lock depth.
>>>
>>> What is the root cause? io handlers initiating I/O?
>>>
>> cpu_physical_memory_rw() can be called nested, and when it is called, it can
>> be protected by no lock, the device lock, or the big lock.
>
> Then we should look for a solution that is local to exec.c (and the
> nested dispatch problem). I think we can identify and fix all nested
> dispatches (converting them to either async dma, or letting the memory
> core do a single dispatch without indirection through I/O handlers).
>
>> I think that without the big lock, the io-dispatcher faces the same issue.
>> As to the main loop, I have not considered it carefully, but at least
>> dma-helper will call cpu_physical_memory_rw() with the big lock held.
>
> DMA is inherently asynchronous, so we already drop the lock between
> initiation and completion; we need to find a way to make it easy to use
> the API without taking the lock while the transfer takes place.
We will have to review/rework device models that want to use the new
locking scheme in a way that they can drop their own lock while issuing
DMA. But that is surely non-trivial.
The other option is to keep DMA requests issued by devices synchronous
but let them fail if we are about to lock up. Still requires changes,
but is probably more comprehensible for device model developers.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 10:04 ` Jan Kiszka
@ 2012-09-11 11:03 ` Avi Kivity
2012-09-11 11:08 ` Jan Kiszka
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 11:03 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>> DMA is inherently asynchronous, so we already drop the lock between
>> initiation and completion; we need to find a way to make it easy to use
>> the API without taking the lock while the transfer takes place.
>
> We will have to review/rework device models that want to use the new
> locking scheme in a way that they can drop their own lock while issuing
> DMA. But that is surely non-trivial.
Most DMA today happens without the big qemu lock. We only need to
convert the paths that actually access memory, these are the block and
network layers (for the respective devices).
> The other option is to keep DMA requests issued by devices synchronous
> but let them fail if we are about to lock up. Still requires changes,
> but is probably more comprehensible for device model developers.
How do you handle failures?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 11:03 ` Avi Kivity
@ 2012-09-11 11:08 ` Jan Kiszka
2012-09-11 12:20 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: Jan Kiszka @ 2012-09-11 11:08 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 2012-09-11 13:03, Avi Kivity wrote:
> On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>
>>> DMA is inherently asynchronous, so we already drop the lock between
>>> initiation and completion; we need to find a way to make it easy to use
>>> the API without taking the lock while the transfer takes place.
>>
>> We will have to review/rework device models that want to use the new
>> locking scheme in a way that they can drop their own lock while issuing
>> DMA. But that is surely non-trivial.
>
> Most DMA today happens without the big qemu lock. We only need to
> convert the paths that actually access memory, these are the block and
> network layers (for the respective devices).
...and the devices themselves, of course.
>
>> The other option is to keep DMA requests issued by devices synchronous
>> but let them fail if we are about to lock up. Still requires changes,
>> but is probably more comprehensible for device model developers.
>
> How do you handle failures?
By, e.g., not sending a network frame or dropping an incoming one, and
signaling this in a device-specific way.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 11:08 ` Jan Kiszka
@ 2012-09-11 12:20 ` Avi Kivity
2012-09-11 12:25 ` Jan Kiszka
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 12:20 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 09/11/2012 02:08 PM, Jan Kiszka wrote:
> On 2012-09-11 13:03, Avi Kivity wrote:
>> On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>>
>>>> DMA is inherently asynchronous, so we already drop the lock between
>>>> initiation and completion; we need to find a way to make it easy to use
>>>> the API without taking the lock while the transfer takes place.
>>>
>>> We will have to review/rework device models that want to use the new
>>> locking scheme in a way that they can drop their own lock while issuing
>>> DMA. But that is surely non-trivial.
>>
>> Most DMA today happens without the big qemu lock. We only need to
>> convert the paths that actually access memory, these are the block and
>> network layers (for the respective devices).
>
> ...and the devices themselves, of course.
Right, for descriptors and stuff. So a guest can set a DMA address to
point at its own BAR, and cause qemu to deadlock.
The problem is not limited to locking. A guest can also cause a write
to a BAR to initiate a synchronous write with the same address and data,
triggering infinite recursion.
Perhaps one fix is the mythical DMA API, which will provide each DMA
initiator with its own view of memory (a private MemoryRegion root
node). Even that can be worked around with a pair of devices accessing
each other.
>
>>
>>> The other option is to keep DMA requests issued by devices synchronous
>>> but let them fail if we are about to lock up. Still requires changes,
>>> but is probably more comprehensible for device model developers.
>>
>> How do you handle failures?
>
> By not sending a network frame or dropping an incoming one, e.g., and
> signaling this in a device specific way.
Doesn't work for block devices.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 9:57 ` Jan Kiszka
@ 2012-09-11 12:24 ` Avi Kivity
2012-09-11 12:41 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 12:24 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 09/11/2012 12:57 PM, Jan Kiszka wrote:
> On 2012-09-11 11:44, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> The func call chain can suffer from recursively hold
>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>> lock depth.
>>>
>>> What is the root cause? io handlers initiating I/O?
>>>
>> cpu_physical_memory_rw() can be called nested, and when called, it can
>> be protected by no-lock/device lock/ big-lock.
>> I think without big lock, io-dispatcher should face the same issue.
>> As to main-loop, have not carefully consider, but at least, dma-helper
>> will call cpu_physical_memory_rw() with big-lock.
>
> That is our core problem: inconsistent invocation of existing services
> /wrt locking. For portio, I was lucky that there is no nesting and I was
> able to drop the big lock around all (x86) call sites. But MMIO is way
> more tricky due to DMA nesting.
Maybe we need to switch to a continuation style. Instead of expecting
cpu_physical_memory_rw() to complete synchronously, it becomes an
asynchronous call and you provide it with a completion. That means
devices which use it are forced to drop the lock in between. Block and
network clients will be easy to convert since they already use APIs that
drop the lock (except for accessing the descriptors).
> We could try to introduce a different version of cpu_physical_memory_rw,
> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
> request can trigger the very same access in a nested fashion, and we
> will have to detect this to avoid locking up QEMU (locking up the guest
> might be OK).
An async version of c_p_m_rw() will just cause a completion to bounce
around, consuming cpu but not deadlocking anything. If we can keep a
count of the bounces, we might be able to stall it indefinitely or at
least ratelimit it.
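[Editorial sketch] The bounce-counting idea could look roughly like this; DispatchRequest, dispatch_check_bounce and MAX_BOUNCES are hypothetical names for illustration, not QEMU APIs:

```c
#include <assert.h>

/* Hypothetical sketch of the bounce-counting idea: each asynchronous
 * dispatch carries a counter that is bumped every time its completion
 * re-enters the dispatcher.  Once the budget is exhausted, the request
 * is deferred (e.g. requeued on a timer) instead of being re-dispatched
 * immediately, so a self-referential DMA loop burns bounded CPU. */

#define MAX_BOUNCES 16

typedef struct DispatchRequest {
    unsigned bounces;   /* times this request re-entered dispatch */
    int deferred;       /* set once the bounce budget is exhausted */
} DispatchRequest;

/* Returns 1 if the request may be dispatched now, 0 if it must be
 * deferred to rate-limit the loop. */
static int dispatch_check_bounce(DispatchRequest *req)
{
    if (++req->bounces > MAX_BOUNCES) {
        req->deferred = 1;
        return 0;
    }
    return 1;
}
```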
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:20 ` Avi Kivity
@ 2012-09-11 12:25 ` Jan Kiszka
2012-09-11 12:30 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: Jan Kiszka @ 2012-09-11 12:25 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 2012-09-11 14:20, Avi Kivity wrote:
> On 09/11/2012 02:08 PM, Jan Kiszka wrote:
>> On 2012-09-11 13:03, Avi Kivity wrote:
>>> On 09/11/2012 01:04 PM, Jan Kiszka wrote:
>>>
>>>>> DMA is inherently asynchronous, so we already drop the lock between
>>>>> initiation and completion; we need to find a way to make it easy to use
>>>>> the API without taking the lock while the transfer takes place.
>>>>
>>>> We will have to review/rework device models that want to use the new
>>>> locking scheme in a way that they can drop their own lock while issuing
>>>> DMA. But that is surely non-trivial.
>>>
>>> Most DMA today happens without the big qemu lock. We only need to
>>> convert the paths that actually access memory, these are the block and
>>> network layers (for the respective devices).
>>
>> ...and the devices themselves, of course.
>
> Right, for descriptors and stuff. So a guest can set a DMA address to
> point at its own BAR, and cause qemu to deadlock.
>
> The problem is not limited to locking. A guest can also cause a write
> to a BAR to initiate a synchronous write with the same address and data,
> triggering infinite recursion.
>
> Perhaps one fix is the mythical DMA API, which will provide each DMA
> initiator with its own view of memory (a private MemoryRegion root
> node). Even that can be worked around with a pair of devices accessing
> each other.
Hmm, don't see an alternative to runtime loop detection yet.
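[Editorial sketch] Runtime loop detection might be as simple as a per-thread dispatch depth; mmio_dispatch, dispatch_depth and the demo handler below are hypothetical illustration, not existing code:

```c
#include <assert.h>

/* Hypothetical sketch of runtime loop detection: a per-thread dispatch
 * depth is bumped on every MMIO dispatch.  If DMA issued from inside a
 * handler re-enters dispatch past a small limit, we treat it as a
 * guest-triggered loop and reject the access instead of recursing
 * forever. */

#define MAX_DISPATCH_DEPTH 8

static __thread int dispatch_depth;

/* Returns 0 on success, -1 if the access was rejected as a loop. */
static int mmio_dispatch(void (*handler)(void))
{
    if (dispatch_depth >= MAX_DISPATCH_DEPTH) {
        return -1;              /* loop detected: drop the access */
    }
    dispatch_depth++;
    handler();                  /* may itself call mmio_dispatch() */
    dispatch_depth--;
    return 0;
}

/* Demo handler modelling a device whose DMA targets its own BAR. */
static int self_dma_calls;
static void self_dma_handler(void)
{
    self_dma_calls++;
    mmio_dispatch(self_dma_handler);    /* re-enters dispatch */
}
```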
>
>>
>>>
>>>> The other option is to keep DMA requests issued by devices synchronous
>>>> but let them fail if we are about to lock up. Still requires changes,
>>>> but is probably more comprehensible for device model developers.
>>>
>>> How do you handle failures?
>>
>> By not sending a network frame or dropping an incoming one, e.g., and
>> signaling this in a device specific way.
>
> Doesn't work for block devices.
Because the block layer API cannot report errors to the devices? What
happens if there is a real I/O error?
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:25 ` Jan Kiszka
@ 2012-09-11 12:30 ` Avi Kivity
2012-09-11 12:35 ` Jan Kiszka
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 12:30 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 09/11/2012 03:25 PM, Jan Kiszka wrote:
>>>>
>>>> Most DMA today happens without the big qemu lock. We only need to
>>>> convert the paths that actually access memory, these are the block and
>>>> network layers (for the respective devices).
>>>
>>> ...and the devices themselves, of course.
>>
>> Right, for descriptors and stuff. So a guest can set a DMA address to
>> point at its own BAR, and cause qemu to deadlock.
>>
>> The problem is not limited to locking. A guest can also cause a write
>> to a BAR to initiate a synchronous write with the same address and data,
>> triggering infinite recursion.
>>
>> Perhaps one fix is the mythical DMA API, which will provide each DMA
>> initiator with its own view of memory (a private MemoryRegion root
>> node). Even that can be worked around with a pair of devices accessing
>> each other.
>
> Hmm, don't see an alternative to runtime loop detection yet.
See an expensive one on the other branch of this thread.
>
>>
>>>
>>>>
>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>> but is probably more comprehensible for device model developers.
>>>>
>>>> How do you handle failures?
>>>
>>> By not sending a network frame or dropping an incoming one, e.g., and
>>> signaling this in a device specific way.
>>
>> Doesn't work for block devices.
>
> Because the block layer API cannot report errors to the devices? What
> happens if there is a real I/O error?
We report real I/O errors. But if we report a transient error due to
some lock being taken as an I/O error, the guest will take unwarranted
action.
If the errors are not expected in normal operation (we can avoid them if
all DMA is to real RAM) then this is an acceptable solution. Still it
generates a lot of rarely used code paths and so isn't very good for
security.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:30 ` Avi Kivity
@ 2012-09-11 12:35 ` Jan Kiszka
2012-09-11 12:39 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: Jan Kiszka @ 2012-09-11 12:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 2012-09-11 14:30, Avi Kivity wrote:
>>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>>> but is probably more comprehensible for device model developers.
>>>>>
>>>>> How do you handle failures?
>>>>
>>>> By not sending a network frame or dropping an incoming one, e.g., and
>>>> signaling this in a device specific way.
>>>
>>> Doesn't work for block devices.
>>
>> Because the block layer API cannot report errors to the devices? What
>> happens if there is a real I/O error?
>
> We report real I/O errors. But if we report a transient error due to
> some lock being taken as an I/O error, the guest will take unwarranted
> action.
>
> If the errors are not expected in normal operation (we can avoid them if
> all DMA is to real RAM) then this is an acceptable solution. Still it
> generates a lot of rarely used code paths and so isn't very good for
> security.
I'm not talking about transient errors. Recursions like this are always
guest configuration errors that would cause real devices to lock up or
timeout as well. This is practically about avoiding that a malicious
guest can lock up QEMU, leaving it inoperative even for management tools.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:35 ` Jan Kiszka
@ 2012-09-11 12:39 ` Avi Kivity
2012-09-19 4:25 ` Peter Crosthwaite
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 12:39 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 09/11/2012 03:35 PM, Jan Kiszka wrote:
> On 2012-09-11 14:30, Avi Kivity wrote:
>>>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>>>> but is probably more comprehensible for device model developers.
>>>>>>
>>>>>> How do you handle failures?
>>>>>
>>>>> By not sending a network frame or dropping an incoming one, e.g., and
>>>>> signaling this in a device specific way.
>>>>
>>>> Doesn't work for block devices.
>>>
>>> Because the block layer API cannot report errors to the devices? What
>>> happens if there is a real I/O error?
>>
>> We report real I/O errors. But if we report a transient error due to
>> some lock being taken as an I/O error, the guest will take unwarranted
>> action.
>>
>> If the errors are not expected in normal operation (we can avoid them if
>> all DMA is to real RAM) then this is an acceptable solution. Still it
>> generates a lot of rarely used code paths and so isn't very good for
>> security.
>
> I'm not talking about transient errors. Recursions like this are always
> guest configuration errors that would cause real devices to lock up or
> timeout as well. This is practically about avoiding that a malicious
> guest can lock up QEMU, leaving it inoperative even for management tools.
Ok. That's more palatable. We don't even have to report an error in
that case, we can just perform the operation incorrectly (as I'm sure
real hardware will), log an error, and continue.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:24 ` Avi Kivity
@ 2012-09-11 12:41 ` Avi Kivity
2012-09-11 14:54 ` Marcelo Tosatti
2012-09-13 6:55 ` liu ping fan
0 siblings, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-11 12:41 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel@nongnu.org
On 09/11/2012 03:24 PM, Avi Kivity wrote:
> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>> On 2012-09-11 11:44, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>
>>>>> The func call chain can suffer from recursively hold
>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>> lock depth.
>>>>
>>>> What is the root cause? io handlers initiating I/O?
>>>>
>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>> be protected by no-lock/device lock/ big-lock.
>>> I think without big lock, io-dispatcher should face the same issue.
>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>> will call cpu_physical_memory_rw() with big-lock.
>>
>> That is our core problem: inconsistent invocation of existing services
>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>> able to drop the big lock around all (x86) call sites. But MMIO is way
>> more tricky due to DMA nesting.
>
> Maybe we need to switch to a continuation style. Instead of expecting
> cpu_physical_memory_rw() to complete synchronously, it becomes an
> asynchronous call and you provide it with a completion. That means
> devices which use it are forced to drop the lock in between. Block and
> network clients will be easy to convert since they already use APIs that
> drop the lock (except for accessing the descriptors).
>
>> We could try to introduce a different version of cpu_physical_memory_rw,
>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>> request can trigger the very same access in a nested fashion, and we
>> will have to detect this to avoid locking up QEMU (locking up the guest
>> might be OK).
>
> An async version of c_p_m_rw() will just cause a completion to bounce
> around, consuming cpu but not deadlocking anything. If we can keep a
> count of the bounces, we might be able to stall it indefinitely or at
> least ratelimit it.
>
Another option is to require all users of c_p_m_rw() and related to use
a coroutine or thread. That makes the programming easier (but still
required a revalidation after the dropped lock).
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:41 ` Avi Kivity
@ 2012-09-11 14:54 ` Marcelo Tosatti
2012-09-13 6:55 ` liu ping fan
2012-09-13 6:55 ` liu ping fan
1 sibling, 1 reply; 72+ messages in thread
From: Marcelo Tosatti @ 2012-09-11 14:54 UTC (permalink / raw)
To: Avi Kivity
Cc: Stefan Hajnoczi, liu ping fan, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka, Paolo Bonzini
On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
> On 09/11/2012 03:24 PM, Avi Kivity wrote:
> > On 09/11/2012 12:57 PM, Jan Kiszka wrote:
> >> On 2012-09-11 11:44, liu ping fan wrote:
> >>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
> >>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
> >>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >>>>>
> >>>>> The func call chain can suffer from recursively hold
> >>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
> >>>>> lock depth.
> >>>>
> >>>> What is the root cause? io handlers initiating I/O?
> >>>>
> >>> cpu_physical_memory_rw() can be called nested, and when called, it can
> >>> be protected by no-lock/device lock/ big-lock.
> >>> I think without big lock, io-dispatcher should face the same issue.
> >>> As to main-loop, have not carefully consider, but at least, dma-helper
> >>> will call cpu_physical_memory_rw() with big-lock.
> >>
> >> That is our core problem: inconsistent invocation of existing services
> >> /wrt locking. For portio, I was lucky that there is no nesting and I was
> >> able to drop the big lock around all (x86) call sites. But MMIO is way
> >> more tricky due to DMA nesting.
> >
> > Maybe we need to switch to a continuation style. Instead of expecting
> > cpu_physical_memory_rw() to complete synchronously, it becomes an
> > asynchronous call and you provide it with a completion. That means
> > devices which use it are forced to drop the lock in between. Block and
> > network clients will be easy to convert since they already use APIs that
> > drop the lock (except for accessing the descriptors).
> >
> >> We could try to introduce a different version of cpu_physical_memory_rw,
> >> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
> >> request can trigger the very same access in a nested fashion, and we
> >> will have to detect this to avoid locking up QEMU (locking up the guest
> >> might be OK).
> >
> > An async version of c_p_m_rw() will just cause a completion to bounce
> > around, consuming cpu but not deadlocking anything. If we can keep a
> > count of the bounces, we might be able to stall it indefinitely or at
> > least ratelimit it.
> >
>
> Another option is to require all users of c_p_m_rw() and related to use
> a coroutine or thread. That makes the programming easier (but still
> required a revalidation after the dropped lock).
Unless i am misunderstanding this thread, the iothread flow section of
http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
contains possible solutions for this.
Basically use trylock() and if it fails, then choose a bailout option
(which there are more than one possibilities listed).
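[Editorial sketch] The trylock-and-bail-out flow could be sketched like this; mmio_access_trylock and the demo callback are hypothetical names, and the bailout policy is deliberately left to the caller:

```c
#include <pthread.h>
#include <assert.h>

/* Hypothetical sketch of the trylock idea: a nested dispatch attempts
 * the big lock with trylock() and, on failure, bails out instead of
 * deadlocking.  What "bail out" means (retry later, drop the access,
 * complete with an error, ...) is a policy choice for the caller. */

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static int accesses;
static void demo_access(void)
{
    accesses++;     /* stands in for the real MMIO/memory access */
}

/* Returns 0 if the access was performed, -1 if we bailed out. */
static int mmio_access_trylock(void (*access)(void))
{
    if (pthread_mutex_trylock(&big_lock) != 0) {
        return -1;              /* lock busy: choose a bailout option */
    }
    access();
    pthread_mutex_unlock(&big_lock);
    return 0;
}
```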
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-11 8:15 ` Peter Maydell
@ 2012-09-13 6:53 ` liu ping fan
0 siblings, 0 replies; 72+ messages in thread
From: liu ping fan @ 2012-09-13 6:53 UTC (permalink / raw)
To: Peter Maydell
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Avi Kivity,
Anthony Liguori, Paolo Bonzini
On Tue, Sep 11, 2012 at 4:15 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 11 September 2012 08:51, Liu Ping Fan <qemulist@gmail.com> wrote:
>> +
>> +/**
>> + * * atomic_inc - increment atomic variable
>> + * * @v: pointer of type Atomic
>> + * *
>> + * * Atomically increments @v by 1.
>> + * */
>
> Your editor has done something weird with these comments.
>
Yes, will fix it. Thanx, pingfan
> -- PMM
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-11 8:04 ` Avi Kivity
@ 2012-09-13 6:54 ` liu ping fan
2012-09-13 8:14 ` Avi Kivity
2012-09-19 13:32 ` Jamie Lokier
0 siblings, 2 replies; 72+ messages in thread
From: liu ping fan @ 2012-09-13 6:54 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On Tue, Sep 11, 2012 at 4:04 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> If out of global lock, we will be challenged by SMP in low level,
>> so need atomic ops.
>>
>> This file is a wrapper of GCC atomic builtin.
>>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> ---
>> include/qemu/atomic.h | 63 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 63 insertions(+), 0 deletions(-)
>> create mode 100644 include/qemu/atomic.h
>>
>> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
>> new file mode 100644
>> index 0000000..f17145d
>> --- /dev/null
>> +++ b/include/qemu/atomic.h
>> @@ -0,0 +1,63 @@
>> +/*
>> + * Simple wrapper of gcc Atomic-Builtins
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +#ifndef __QEMU_ATOMIC_H
>> +#define __QEMU_ATOMIC_H
>> +
>> +typedef struct Atomic {
>> + int counter;
>> +} Atomic;
>
> Best to mark counter 'volatile'.
>
>> +
>> +static inline void atomic_set(Atomic *v, int i)
>> +{
>> + v->counter = i;
>> +}
>> +
>> +static inline int atomic_read(Atomic *v)
>> +{
>> + return v->counter;
>> +}
>>
>
> So these two operations don't get mangled by the optimizer.
>
Browsing the Linux code and reading LKML, I found some similar material,
but there they have moved the volatile from ->counter into the accessor
atomic_read(). For atomic_read(), I think it is needed to prevent the
optimizer from refetching, but do we also need it for atomic_set()?
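[Editorial sketch] The Linux-style variant being referred to keeps the counter a plain int and applies volatile via a cast inside the accessors; this is an illustrative sketch, with atomic_inc using the same GCC builtin family as the patch:

```c
#include <assert.h>

/* Hypothetical sketch of the Linux-style variant: only the accessors
 * are volatile-qualified (via a cast), so atomic_read()/atomic_set()
 * are protected from caching and refetching by the optimizer while the
 * struct field itself stays a plain int. */

typedef struct Atomic {
    int counter;
} Atomic;

static inline void atomic_set(Atomic *v, int i)
{
    *(volatile int *)&v->counter = i;
}

static inline int atomic_read(Atomic *v)
{
    return *(volatile int *)&v->counter;
}

static inline int atomic_inc(Atomic *v)
{
    /* GCC atomic builtin, as in the patch under discussion. */
    return __sync_add_and_fetch(&v->counter, 1);
}
```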
Regards,
pingfan
>
>
>
> --
> error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-11 9:37 ` Avi Kivity
@ 2012-09-13 6:54 ` liu ping fan
2012-09-13 8:45 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-13 6:54 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 12:32 PM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>
>>>> DeviceState will be protected by refcnt from disappearing during
>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>> its free untill no reader.
>>>>
>>>
>>> How can this be? We elevate the refcount while dispatching I/O. If we
>>> have similar problems with the timer, we need to employ a similar solution.
>>>
>> Yes, at the next step, plan to covert iohandler, timer etc to use
>> refcount as memory. Here just a temp solution.
>
> I prefer not to ever introduce it.
>
> What we can do is introduce a sub-region for e1000's mmio that will take
> only the device lock, and let original region use the old dispatch path
> (and also take the device lock). As we thread the various subsystems
> e1000 uses, we can expand the sub-region until it covers all of e1000's
> functions, then fold it back into the main region.
>
Introducing a new sub-region for e1000 does not seem to help resolve
this issue: it cannot tell whether the main loop is still using the
device or not.
I think the key point is that the original code synchronously eliminates
all readers of the DeviceState in acpi_piix_eject_slot() via
dev->unit()/exit(), so no subsystem will access it afterwards. But now
we delete the DeviceState asynchronously.
Currently we can just use e1000->unmap() to detach the device from each
subsystem (not implemented in this patch series for the timer, ...) to
achieve that goal, because the readers are still under the protection of
the big lock; but once they are out of the big lock, we need extra
effort, as for the memory system.
Regards,
pingfan
> To start with the sub-region can only include registers that call no
> qemu infrastructure code: simple read/writes.
>
>
> --
> error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 14:54 ` Marcelo Tosatti
@ 2012-09-13 6:55 ` liu ping fan
0 siblings, 0 replies; 72+ messages in thread
From: liu ping fan @ 2012-09-13 6:55 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Stefan Hajnoczi, qemu-devel@nongnu.org, Avi Kivity,
Anthony Liguori, Jan Kiszka, Paolo Bonzini
On Tue, Sep 11, 2012 at 10:54 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Tue, Sep 11, 2012 at 03:41:15PM +0300, Avi Kivity wrote:
>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>> > On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>> >> On 2012-09-11 11:44, liu ping fan wrote:
>> >>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>> >>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>> >>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >>>>>
>> >>>>> The func call chain can suffer from recursively hold
>> >>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>> >>>>> lock depth.
>> >>>>
>> >>>> What is the root cause? io handlers initiating I/O?
>> >>>>
>> >>> cpu_physical_memory_rw() can be called nested, and when called, it can
>> >>> be protected by no-lock/device lock/ big-lock.
>> >>> I think without big lock, io-dispatcher should face the same issue.
>> >>> As to main-loop, have not carefully consider, but at least, dma-helper
>> >>> will call cpu_physical_memory_rw() with big-lock.
>> >>
>> >> That is our core problem: inconsistent invocation of existing services
>> >> /wrt locking. For portio, I was lucky that there is no nesting and I was
>> >> able to drop the big lock around all (x86) call sites. But MMIO is way
>> >> more tricky due to DMA nesting.
>> >
>> > Maybe we need to switch to a continuation style. Instead of expecting
>> > cpu_physical_memory_rw() to complete synchronously, it becomes an
>> > asynchronous call and you provide it with a completion. That means
>> > devices which use it are forced to drop the lock in between. Block and
>> > network clients will be easy to convert since they already use APIs that
>> > drop the lock (except for accessing the descriptors).
>> >
>> >> We could try to introduce a different version of cpu_physical_memory_rw,
>> >> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>> >> request can trigger the very same access in a nested fashion, and we
>> >> will have to detect this to avoid locking up QEMU (locking up the guest
>> >> might be OK).
>> >
>> > An async version of c_p_m_rw() will just cause a completion to bounce
>> > around, consuming cpu but not deadlocking anything. If we can keep a
>> > count of the bounces, we might be able to stall it indefinitely or at
>> > least ratelimit it.
>> >
>>
>> Another option is to require all users of c_p_m_rw() and related to use
>> a coroutine or thread. That makes the programming easier (but still
>> required a revalidation after the dropped lock).
>
> Unless i am misunderstanding this thread, the iothread flow section of
> http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04315.html
> contains possible solutions for this.
>
> Basically use trylock() and if it fails, then choose a bailout option
> (which there are more than one possibilities listed).
>
I think in that thread the original goal of trylock() was to handle a
lock held by another thread, whereas here we want to resolve nested
locking. But yes, there is a similar point: it can also be solved with
async dispatch if we do not allow nested locks.
Regards,
pingfan
>
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:41 ` Avi Kivity
2012-09-11 14:54 ` Marcelo Tosatti
@ 2012-09-13 6:55 ` liu ping fan
2012-09-13 8:19 ` Avi Kivity
1 sibling, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-13 6:55 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Paolo Bonzini
On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>
>>>>>> The func call chain can suffer from recursively hold
>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>> lock depth.
>>>>>
>>>>> What is the root cause? io handlers initiating I/O?
>>>>>
>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>> be protected by no-lock/device lock/ big-lock.
>>>> I think without big lock, io-dispatcher should face the same issue.
>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>> will call cpu_physical_memory_rw() with big-lock.
>>>
>>> That is our core problem: inconsistent invocation of existing services
>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>> more tricky due to DMA nesting.
>>
>> Maybe we need to switch to a continuation style. Instead of expecting
>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>> asynchronous call and you provide it with a completion. That means
>> devices which use it are forced to drop the lock in between. Block and
>> network clients will be easy to convert since they already use APIs that
>> drop the lock (except for accessing the descriptors).
>>
>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>> request can trigger the very same access in a nested fashion, and we
>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>> might be OK).
>>
>> An async version of c_p_m_rw() will just cause a completion to bounce
>> around, consuming cpu but not deadlocking anything. If we can keep a
>> count of the bounces, we might be able to stall it indefinitely or at
>> least ratelimit it.
>>
>
> Another option is to require all users of c_p_m_rw() and related to use
> a coroutine or thread. That makes the programming easier (but still
> required a revalidation after the dropped lock).
>
For the nested cpu_physical_memory_rw(), we change its internals but
keep the API synchronous as it is (wrapping the current
cpu_physical_memory_rw() into cpu_physical_memory_rw_internal()):

LOCK() // can be device lock or big lock or both, depending on caller
..............
cpu_physical_memory_rw()
{
    UNLOCK() // unlock all the locks
    queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
    // cpu_physical_memory_rw_internal can take lock(device,biglock) again
    wait_for_completion(completion)
    LOCK()
}
..................
UNLOCK()

Although cpu_physical_memory_rw_internal() is then free to take locks,
there is a precondition: we still need to trace the lock stack taken
before cpu_physical_memory_rw(), so that it can return to the caller
correctly.
Is that OK?
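[Editorial sketch] The queue-work-and-wait-for-completion pattern in the pseudo-code could be realized with plain pthreads roughly as follows; all names (Completion, run_on_thread_and_wait, demo_rw) are hypothetical and mirror the pseudo-code, not any real QEMU API:

```c
#include <pthread.h>
#include <assert.h>

/* Hypothetical sketch: hand the access to a worker thread and block on
 * a completion until the worker is done.  The caller is assumed to have
 * already dropped its own locks, so the worker may retake them. */

typedef struct Completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
} Completion;

static void completion_init(Completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

static void complete(Completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(Completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done) {
        pthread_cond_wait(&c->cond, &c->lock);
    }
    pthread_mutex_unlock(&c->lock);
}

typedef struct Work {
    void (*fn)(void *);
    void *opaque;
    Completion completion;
} Work;

static void *worker(void *arg)
{
    Work *w = arg;
    w->fn(w->opaque);       /* may retake device/big locks itself */
    complete(&w->completion);
    return NULL;
}

/* Queue fn(opaque) on a fresh thread and wait for it to finish. */
static void run_on_thread_and_wait(void (*fn)(void *), void *opaque)
{
    pthread_t tid;
    Work w = { .fn = fn, .opaque = opaque };

    completion_init(&w.completion);
    pthread_create(&tid, NULL, worker, &w);
    wait_for_completion(&w.completion);
    pthread_join(&tid, NULL);
}

/* Demo payload standing in for the real memory access. */
static void demo_rw(void *opaque)
{
    *(int *)opaque = 42;
}
```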
Regards,
pingfan
> --
> error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 6:54 ` liu ping fan
@ 2012-09-13 8:14 ` Avi Kivity
2012-09-13 8:19 ` Paolo Bonzini
2012-09-19 13:16 ` Jamie Lokier
2012-09-19 13:32 ` Jamie Lokier
1 sibling, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-13 8:14 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/13/2012 09:54 AM, liu ping fan wrote:
>>> +typedef struct Atomic {
>>> + int counter;
>>> +} Atomic;
>>
>> Best to mark counter 'volatile'.
>>
>>> +
>>> +static inline void atomic_set(Atomic *v, int i)
>>> +{
>>> + v->counter = i;
>>> +}
>>> +
>>> +static inline int atomic_read(Atomic *v)
>>> +{
>>> + return v->counter;
>>> +}
>>>
>>
>> So these two operations don't get mangled by the optimizer.
>>
> Browsing linux code and reading lkml, find some similar material. But
> they have moved volatile from ->counter to function - atomic_read().
> As to atomic_read(), I think it need to prevent optimizer from
> refetching issue, but as to atomic_set(), do we need ?
I think so, to prevent reordering.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 8:14 ` Avi Kivity
@ 2012-09-13 8:19 ` Paolo Bonzini
2012-09-13 8:23 ` Avi Kivity
2012-09-13 8:45 ` liu ping fan
2012-09-19 13:16 ` Jamie Lokier
1 sibling, 2 replies; 72+ messages in thread
From: Paolo Bonzini @ 2012-09-13 8:19 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel
Il 13/09/2012 10:14, Avi Kivity ha scritto:
>>>> >>> +static inline void atomic_set(Atomic *v, int i)
>>>> >>> +{
>>>> >>> + v->counter = i;
>>>> >>> +}
>>>> >>> +
>>>> >>> +static inline int atomic_read(Atomic *v)
>>>> >>> +{
>>>> >>> + return v->counter;
>>>> >>> +}
>>>> >>>
>>> >>
>>> >> So these two operations don't get mangled by the optimizer.
>>> >>
>> > Browsing linux code and reading lkml, find some similar material. But
>> > they have moved volatile from ->counter to function - atomic_read().
>> > As to atomic_read(), I think it need to prevent optimizer from
>> > refetching issue, but as to atomic_set(), do we need ?
> I think so, to prevent reordering.
Agreed. Alternatively, stick a barrier() before and after the
assignment and read.
But I don't really see the point in wrapping atomically-accessed
variables in a struct. Are we going to add a variant for long, a
variant for pointers, etc.?
I already proposed my version of this at
http://github.com/bonzini/qemu/commit/1b439343.
Paolo
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-13 6:55 ` liu ping fan
@ 2012-09-13 8:19 ` Avi Kivity
2012-09-17 2:24 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-13 8:19 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Paolo Bonzini
On 09/13/2012 09:55 AM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>
>>>>>>> The func call chain can suffer from recursively hold
>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>> lock depth.
>>>>>>
>>>>>> What is the root cause? io handlers initiating I/O?
>>>>>>
>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>> be protected by no-lock/device lock/ big-lock.
>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>
>>>> That is our core problem: inconsistent invocation of existing services
>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>> more tricky due to DMA nesting.
>>>
>>> Maybe we need to switch to a continuation style. Instead of expecting
>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>> asynchronous call and you provide it with a completion. That means
>>> devices which use it are forced to drop the lock in between. Block and
>>> network clients will be easy to convert since they already use APIs that
>>> drop the lock (except for accessing the descriptors).
>>>
>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>> request can trigger the very same access in a nested fashion, and we
>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>> might be OK).
>>>
>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>> around, consuming cpu but not deadlocking anything. If we can keep a
>>> count of the bounces, we might be able to stall it indefinitely or at
>>> least ratelimit it.
>>>
>>
>> Another option is to require all users of c_p_m_rw() and related to use
>> a coroutine or thread. That makes the programming easier (but still
>> required a revalidation after the dropped lock).
>>
> For the nested cpu_physical_memory_rw(), we change it internal but
> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw()
> into cpu_physical_memory_rw_internal() )
>
>
> LOCK() // can be device or big lock or both, depend on caller.
> ..............
> cpu_physical_memory_rw()
> {
> UNLOCK() //unlock all the locks
> queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
> wait_for_completion(completion)
> LOCK()
> }
> ..................
> UNLOCK()
>
This is dangerous. The caller expects to hold the lock across the call,
and will not re-validate its state.
> Although cpu_physical_memory_rw_internal() can be freed to use lock,
> but with precondition. -- We still need to trace lock stack taken by
> cpu_physical_memory_rw(), so that it can return to caller correctly.
> Is that OK?
I'm convinced that we need a recursive lock if we don't convert
everything at once.
So how about:
- make bql a recursive lock (use pthreads, don't invent our own)
- change c_p_m_rw() calling convention to "caller may hold the BQL, but
no device lock"
this allows devices to DMA each other. Unconverted device models will
work as they used to. Converted device models will need to drop the
device lock, re-acquire it, and revalidate any state. That will cause
them to become more correct wrt recursive invocation.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 8:19 ` Paolo Bonzini
@ 2012-09-13 8:23 ` Avi Kivity
2012-09-13 8:29 ` Paolo Bonzini
2012-09-13 8:45 ` liu ping fan
1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-13 8:23 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel
On 09/13/2012 11:19 AM, Paolo Bonzini wrote:
> Il 13/09/2012 10:14, Avi Kivity ha scritto:
>>>>> >>> +static inline void atomic_set(Atomic *v, int i)
>>>>> >>> +{
>>>>> >>> + v->counter = i;
>>>>> >>> +}
>>>>> >>> +
>>>>> >>> +static inline int atomic_read(Atomic *v)
>>>>> >>> +{
>>>>> >>> + return v->counter;
>>>>> >>> +}
>>>>> >>>
>>>> >>
>>>> >> So these two operations don't get mangled by the optimizer.
>>>> >>
>>> > Browsing linux code and reading lkml, find some similar material. But
>>> > they have moved volatile from ->counter to function - atomic_read().
>>> > As to atomic_read(), I think it need to prevent optimizer from
>>> > refetching issue, but as to atomic_set(), do we need ?
>> I think so, to prevent reordering.
>
> Agreed. Alternatively, stick a barrier() before and after the
> assignment and read.
>
> But I don't really see the point in wrapping atomically-accessed
> variables in a struct.
Preventing accidental naked access (to be reported in patch review as
"wardrobe malfunction").
> Are we going to add a variant for long, a
> variant for pointers, etc.?
template <typename T> ...;
> I already proposed my version of this at
> http://github.com/bonzini/qemu/commit/1b439343.
Atomic operations are sufficiently rare and sufficiently important to
warrant extra effort, so I prefer the explicitly typed variants.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 8:23 ` Avi Kivity
@ 2012-09-13 8:29 ` Paolo Bonzini
0 siblings, 0 replies; 72+ messages in thread
From: Paolo Bonzini @ 2012-09-13 8:29 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori,
qemu-devel
Il 13/09/2012 10:23, Avi Kivity ha scritto:
>> > But I don't really see the point in wrapping atomically-accessed
>> > variables in a struct.
> Preventing accidental naked access (to be reported in patch review as
> "wardrobe malfunction").
Yeah, I understand that, but it's rare enough that I don't think it's
worth the complication. With C++ and templates it would be a different
story.
Paolo
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-13 6:54 ` liu ping fan
@ 2012-09-13 8:45 ` Avi Kivity
2012-09-13 9:59 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-13 8:45 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/13/2012 09:54 AM, liu ping fan wrote:
> On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/11/2012 12:32 PM, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>
>>>>> DeviceState will be protected by refcnt from disappearing during
>>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>>> its free untill no reader.
>>>>>
>>>>
>>>> How can this be? We elevate the refcount while dispatching I/O. If we
>>>> have similar problems with the timer, we need to employ a similar solution.
>>>>
>>> Yes, at the next step, plan to covert iohandler, timer etc to use
>>> refcount as memory. Here just a temp solution.
>>
>> I prefer not to ever introduce it.
>>
>> What we can do is introduce a sub-region for e1000's mmio that will take
>> only the device lock, and let original region use the old dispatch path
>> (and also take the device lock). As we thread the various subsystems
>> e1000 uses, we can expand the sub-region until it covers all of e1000's
>> functions, then fold it back into the main region.
>>
> Introducing new sub-region for e1000 seems no help to resolve this
> issue. It can not tell whether main-loop still use it or not.
What is "it" here? (actually two of them).
> I think the key point is that original code SYNC eliminate all the
> readers of DeviceState at acpi_piix_eject_slot() by
> dev->unit()/exit(), so each subsystem will no access it in future.
> But now, we can delete the DeviceState async.
But deleting happens when we are guaranteed to have no I/O dispatch.
> Currently, we can just use e1000->unmap() to detach itself from each
> subsystem(Not implemented in this series patches for timer,...) to
> achieve the goal, because their readers are still under the protection
> of big lock, but when they are out of big lock, we need extra effort
> like memory system.
I see what you mean. So you defer the deletion to a context where the
big lock is held.
But this solves nothing. The device model accesses the network stack
and timer subsystem without the big lock held. So you either need to
thread those two subsystems, or take the big lock in the I/O handlers.
If you do that, you can also take the big lock in the destructor. If we
make the big lock a recursive lock, then the destructor can be invoked
in any context.
To summarize, I propose:
- drop the reclaimer
- make the bql recursive
- take the bql in the e1000 destructor
- take the bql in the e1000 I/O handlers when it accesses the timer or
network subsystems
(rest for a bit)
- thread the timer subsystem
- drop bql from around timer accesses
- thread the network subsystem
- drop bql from e1000 I/O handlers and destructor
does this work?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 8:19 ` Paolo Bonzini
2012-09-13 8:23 ` Avi Kivity
@ 2012-09-13 8:45 ` liu ping fan
1 sibling, 0 replies; 72+ messages in thread
From: liu ping fan @ 2012-09-13 8:45 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori,
qemu-devel
On Thu, Sep 13, 2012 at 4:19 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 13/09/2012 10:14, Avi Kivity ha scritto:
>>>>> >>> +static inline void atomic_set(Atomic *v, int i)
>>>>> >>> +{
>>>>> >>> + v->counter = i;
>>>>> >>> +}
>>>>> >>> +
>>>>> >>> +static inline int atomic_read(Atomic *v)
>>>>> >>> +{
>>>>> >>> + return v->counter;
>>>>> >>> +}
>>>>> >>>
>>>> >>
>>>> >> So these two operations don't get mangled by the optimizer.
>>>> >>
>>> > Browsing linux code and reading lkml, find some similar material. But
>>> > they have moved volatile from ->counter to function - atomic_read().
>>> > As to atomic_read(), I think it need to prevent optimizer from
>>> > refetching issue, but as to atomic_set(), do we need ?
>> I think so, to prevent reordering.
>
> Agreed. Alternatively, stick a barrier() before and after the
> assignment and read.
>
Yes, Linux just leaves this barrier() as the caller's choice at the
exact point, rather than putting it right in atomic_set(). Embedding
barrier() in atomic_set() would ease the caller's job.
Thanks and regards,
pingfan
> But I don't really see the point in wrapping atomically-accessed
> variables in a struct. Are we going to add a variant for long, a
> variant for pointers, etc.?
>
> I already proposed my version of this at
> http://github.com/bonzini/qemu/commit/1b439343.
>
> Paolo
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-13 8:45 ` Avi Kivity
@ 2012-09-13 9:59 ` liu ping fan
2012-09-13 10:09 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-13 9:59 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On Thu, Sep 13, 2012 at 4:45 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/13/2012 09:54 AM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 12:32 PM, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>
>>>>>> DeviceState will be protected by refcnt from disappearing during
>>>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>>>> its free untill no reader.
>>>>>>
>>>>>
>>>>> How can this be? We elevate the refcount while dispatching I/O. If we
>>>>> have similar problems with the timer, we need to employ a similar solution.
>>>>>
>>>> Yes, at the next step, plan to covert iohandler, timer etc to use
>>>> refcount as memory. Here just a temp solution.
>>>
>>> I prefer not to ever introduce it.
>>>
>>> What we can do is introduce a sub-region for e1000's mmio that will take
>>> only the device lock, and let original region use the old dispatch path
>>> (and also take the device lock). As we thread the various subsystems
>>> e1000 uses, we can expand the sub-region until it covers all of e1000's
>>> functions, then fold it back into the main region.
>>>
>> Introducing new sub-region for e1000 seems no help to resolve this
>> issue. It can not tell whether main-loop still use it or not.
>
> What is "it" here? (actually two of them).
>
It should be expressed as: "The sub-region's dispatcher cannot tell
whether the main loop still uses e1000 or not".
>> I think the key point is that original code SYNC eliminate all the
>> readers of DeviceState at acpi_piix_eject_slot() by
>> dev->unit()/exit(), so each subsystem will no access it in future.
>> But now, we can delete the DeviceState async.
>
> But deleting happens when we are guaranteed to have no I/O dispatch.
>
>> Currently, we can just use e1000->unmap() to detach itself from each
>> subsystem(Not implemented in this series patches for timer,...) to
>> achieve the goal, because their readers are still under the protection
>> of big lock, but when they are out of big lock, we need extra effort
>> like memory system.
>
> I see what you mean. So you defer the deletion to a context where the
> big lock is held.
>
> But this solves nothing. The device model accesses the network stack
> and timer subsystem without the big lock held. So you either need to
> thread those two subsystems, or take the big lock in the I/O handlers.
Yes, at present I tend to take the big lock around the calls into the
subsystems in e1000's I/O handlers, verify the current changes, and
then thread the other subsystems as the next step.
> If you do that, you can also take the big lock in the destructor. If we
We do not call qemu_del_timer() etc. in the destructor; instead, we
call it in qdev_unplug_complete() --> e1000::unmap(). And
e1000::unmap() is the only function definitely called under the BQL.
By the time we come to the destructor, the DeviceState has been
completely isolated from all of the subsystems, so there is no need to
take the big lock in the destructor.
> make the big lock a recursive lock, then the destructor can be invoked
> in any context.
>
> To summarize, I propose:
> - drop the reclaimer
Agree
> - make the bql recursive
> - take the bql in the e1000 destructor
Change to e1000::unmap()
> - take the bql in the e1000 I/O handlers when it accesses the timer or
> network subsystems
Agree
> (rest for a bit)
> - thread the timer subsystem
> - drop bql from around timer accesses
> - thread the network subsystem
> - drop bql from e1000 I/O handlers and destructor
Agree
>
Thanks and regards,
pingfan
> does this work?
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async
2012-09-13 9:59 ` liu ping fan
@ 2012-09-13 10:09 ` Avi Kivity
0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-13 10:09 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori,
Paolo Bonzini
On 09/13/2012 12:59 PM, liu ping fan wrote:
> On Thu, Sep 13, 2012 at 4:45 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/13/2012 09:54 AM, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 5:37 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 12:32 PM, liu ping fan wrote:
>>>>> On Tue, Sep 11, 2012 at 4:32 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>
>>>>>>> DeviceState will be protected by refcnt from disappearing during
>>>>>>> dispatching. But when refcnt comes down to zero, DeviceState may
>>>>>>> be still in use by iohandler, timer etc in main loop, we just delay
>>>>>>> its free untill no reader.
>>>>>>>
>>>>>>
>>>>>> How can this be? We elevate the refcount while dispatching I/O. If we
>>>>>> have similar problems with the timer, we need to employ a similar solution.
>>>>>>
>>>>> Yes, at the next step, plan to covert iohandler, timer etc to use
>>>>> refcount as memory. Here just a temp solution.
>>>>
>>>> I prefer not to ever introduce it.
>>>>
>>>> What we can do is introduce a sub-region for e1000's mmio that will take
>>>> only the device lock, and let original region use the old dispatch path
>>>> (and also take the device lock). As we thread the various subsystems
>>>> e1000 uses, we can expand the sub-region until it covers all of e1000's
>>>> functions, then fold it back into the main region.
>>>>
>>> Introducing new sub-region for e1000 seems no help to resolve this
>>> issue. It can not tell whether main-loop still use it or not.
>>
>> What is "it" here? (actually two of them).
>>
> Should expressed as "The sub-region's dispatcher can not tell whether
> main-loop still use e1000 or not"
The sub-region will not use any unthreaded subsystems, so it need not
care about the main loop.
At first, it would only access registers in device state.
But if we go with the plan below, we can drop it.
>
>> If you do that, you can also take the big lock in the destructor. If we
>
> We do not call qemu_del_timer() etc at the destructor, instead, we
> will call it in qdev_unplug_complete() -->e1000::unmap(). And
> e1000::unmap() is the only function definitely called under bql. When
> coming to destructor, the DeviceState has been completely isolated
> from all of the subsystem. So no need to require big lock in
> destructor.
But between unmap() and the destructor, accesses can still occur (an
access that was started before unmap() was called, but was delayed and
is dispatched after it completes). These accesses will find the timer
deleted, and so must be prepared to check if the timer is there or not.
So we have a choice, either move timer destruction to the destructor, or
add checks in the dispatch code.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-13 8:19 ` Avi Kivity
@ 2012-09-17 2:24 ` liu ping fan
2012-09-19 8:01 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-17 2:24 UTC (permalink / raw)
To: Avi Kivity, Paolo Bonzini, Jan Kiszka, Anthony Liguori,
Marcelo Tosatti
Cc: qemu-devel@nongnu.org
On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/13/2012 09:55 AM, liu ping fan wrote:
>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>>
>>>>>>>> The func call chain can suffer from recursively hold
>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>>> lock depth.
>>>>>>>
>>>>>>> What is the root cause? io handlers initiating I/O?
>>>>>>>
>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>>> be protected by no-lock/device lock/ big-lock.
>>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>>
>>>>> That is our core problem: inconsistent invocation of existing services
>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>>> more tricky due to DMA nesting.
>>>>
>>>> Maybe we need to switch to a continuation style. Instead of expecting
>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>>> asynchronous call and you provide it with a completion. That means
>>>> devices which use it are forced to drop the lock in between. Block and
>>>> network clients will be easy to convert since they already use APIs that
>>>> drop the lock (except for accessing the descriptors).
>>>>
>>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>>> request can trigger the very same access in a nested fashion, and we
>>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>>> might be OK).
>>>>
>>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>>> around, consuming cpu but not deadlocking anything. If we can keep a
>>>> count of the bounces, we might be able to stall it indefinitely or at
>>>> least ratelimit it.
>>>>
>>>
>>> Another option is to require all users of c_p_m_rw() and related to use
>>> a coroutine or thread. That makes the programming easier (but still
>>> required a revalidation after the dropped lock).
>>>
>> For the nested cpu_physical_memory_rw(), we change it internal but
>> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw()
>> into cpu_physical_memory_rw_internal() )
>>
>>
>> LOCK() // can be device or big lock or both, depend on caller.
>> ..............
>> cpu_physical_memory_rw()
>> {
>> UNLOCK() //unlock all the locks
>> queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>> wait_for_completion(completion)
>> LOCK()
>> }
>> ..................
>> UNLOCK()
>>
>
> This is dangerous. The caller expects to hold the lock across the call,
> and will not re-validate its state.
>
>> Although cpu_physical_memory_rw_internal() can be freed to use lock,
>> but with precondition. -- We still need to trace lock stack taken by
>> cpu_physical_memory_rw(), so that it can return to caller correctly.
>> Is that OK?
>
> I'm convinced that we need a recursive lock if we don't convert
> everything at once.
>
> So how about:
>
> - make bql a recursive lock (use pthreads, don't invent our own)
> - change c_p_m_rw() calling convention to "caller may hold the BQL, but
> no device lock"
>
> this allows devices to DMA each other. Unconverted device models will
> work as they used to. Converted device models will need to drop the
> device lock, re-acquire it, and revalidate any state. That will cause
I think we are cornered by devices DMAing each other, which raises
unavoidable nested locking. Also, to avoid deadlock caused by device
lock ordering, we should drop the current device lock before acquiring
another one. The only small divergence is about the "revalidate": do
we need to roll back? Or, for a converted device, we can just tag it
with a busy flag, that is, check-and-set the busy flag at the entry of
the device's mmio-dispatch. So when we re-acquire the device's lock,
the device's state is intact.
Does anybody have another suggestion?
Thanks and regards,
pingfan
> them to become more correct wrt recursive invocation.
>
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-11 12:39 ` Avi Kivity
@ 2012-09-19 4:25 ` Peter Crosthwaite
2012-09-19 4:32 ` Edgar E. Iglesias
0 siblings, 1 reply; 72+ messages in thread
From: Peter Crosthwaite @ 2012-09-19 4:25 UTC (permalink / raw)
To: Avi Kivity, Peter Crosthwaite, Igor Mitsyanko, Peter Maydell
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, liu ping fan,
Anthony Liguori, Paolo Bonzini
Ping for PMM,
This is the root cause of the block on your SDHCI series - this is a
discussion on resolving bogus infinite-looping DMA. For current
participants in this discussion, here's our thread on the same topic
over in SD land:
http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
With the findings here and the acknowledgement that this is a general
problem, we either need to declare this issue out of scope for SDHCI, or
work with these guys (in the immediate future) on the DMA infinite
loop problem flagged here. I don't mind if SDHCI+ADMA is the guinea pig
here, but can we get a decisive plan for going forward with this issue
if it is going to continue to block SDHCI?
Thanks to Igor for identifying the overlap between these discussions.
Regards,
Peter
On Tue, Sep 11, 2012 at 10:39 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/11/2012 03:35 PM, Jan Kiszka wrote:
>> On 2012-09-11 14:30, Avi Kivity wrote:
>>>>>>>> The other option is to keep DMA requests issued by devices synchronous
>>>>>>>> but let them fail if we are about to lock up. Still requires changes,
>>>>>>>> but is probably more comprehensible for device model developers.
>>>>>>>
>>>>>>> How do you handle failures?
>>>>>>
>>>>>> By not sending a network frame or dropping an incoming one, e.g., and
>>>>>> signaling this in a device specific way.
>>>>>
>>>>> Doesn't work for block devices.
>>>>
>>>> Because the block layer API cannot report errors to the devices? What
>>>> happens if there is a real I/O error?
>>>
>>> We report real I/O errors. But if we report a transient error due to
>>> some lock being taken as an I/O error, the guest will take unwarranted
>>> action.
>>>
>>> If the errors are not expected in normal operation (we can avoid them if
>>> all DMA is to real RAM) then this is an acceptable solution. Still it
>>> generates a lot of rarely used code paths and so isn't very good for
>>> security.
>>
>> I'm not talking about transient errors. Recursions like this are always
>> guest configuration errors that would cause real devices to lock up or
>> timeout as well. This is practically about avoiding that a malicious
>> guest can lock up QEMU, leaving it inoperative even for management tools.
>
> Ok. That's more palatable. We don't even have to report an error in
> that case, we can just perform the operation incorrectly (as I'm sure
> real hardware will), log an error, and continue.
>
>
> --
> error compiling committee.c: too many arguments to function
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 4:25 ` Peter Crosthwaite
@ 2012-09-19 4:32 ` Edgar E. Iglesias
2012-09-19 4:40 ` Peter Crosthwaite
0 siblings, 1 reply; 72+ messages in thread
From: Edgar E. Iglesias @ 2012-09-19 4:32 UTC (permalink / raw)
To: Peter Crosthwaite
Cc: Peter Maydell, Igor Mitsyanko, Jan Kiszka, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Avi Kivity, Anthony Liguori,
Paolo Bonzini
On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
> Ping for PMM,
>
> This is the root case of your block on the SDHCI series - this is a
> discussion on resolution to bogus infinite looping DMA. For current
> participants in this discussion, heres our thread on the same topic
> over in SD land:
>
> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>
> With the findings here and acknowledgement that this is a general
> problem, we either need to declare this issue of scope for SDHCI, or
> work with these guys (in the immediate future) on the DMA infinite
> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
> here, but can we get a decisive plan for going forward with this issue
> if it is going to continue to block SDHCI.
>
> Thanks to Igor for identifying the overlap between these discussions.
Hi,
A couple of dumb questions.
What is the reason for the blocker? Is it that a possible guest DoS is
worse than no functionality at all?
Can't you put the DMA/IO processing into the background?
What's the difference between setting up bad descriptors and writing a
while (1) loop on the guest CPU?
Cheers,
E
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 4:32 ` Edgar E. Iglesias
@ 2012-09-19 4:40 ` Peter Crosthwaite
2012-09-19 7:55 ` Avi Kivity
2012-09-19 7:57 ` Jan Kiszka
0 siblings, 2 replies; 72+ messages in thread
From: Peter Crosthwaite @ 2012-09-19 4:40 UTC (permalink / raw)
To: Edgar E. Iglesias
Cc: Peter Maydell, Igor Mitsyanko, Jan Kiszka, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Avi Kivity, Anthony Liguori,
Paolo Bonzini
On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
<edgar.iglesias@gmail.com> wrote:
> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>> Ping for PMM,
>>
>> This is the root case of your block on the SDHCI series - this is a
>> discussion on resolution to bogus infinite looping DMA. For current
>> participants in this discussion, heres our thread on the same topic
>> over in SD land:
>>
>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>
>> With the findings here and acknowledgement that this is a general
>> problem, we either need to declare this issue of scope for SDHCI, or
>> work with these guys (in the immediate future) on the DMA infinite
>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>> here, but can we get a decisive plan for going forward with this issue
>> if it is going to continue to block SDHCI.
>>
>> Thanks to Igor for identifying the overlap between these discussions.
>
> Hi,
>
> A couple of dumb questions.
>
> What is the reason for the blocker? that possible guest dos is worse
> than no functionality at all?
>
> Can't you put the DMA/IO processing into the background?
I don't know a way to do asynchronous DMA, unless I am missing something
here. So what happens is we have a device that walks an SG list
synchronously while holding all the locks and stuff being discussed
here. If that SG list infinite-loops, QEMU hangs.
>
> what's the diff between setup of bad descriptors and writing a
> while (1) loop in the guest CPU?
>
A while (1) loop in the guest doesn't freeze all of QEMU (the monitor
still works and so on), whereas the DMA ops going in circles do, as the
locks are held by an infinite loop.
Regards,
peter
> Cheers,
> E
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 4:40 ` Peter Crosthwaite
@ 2012-09-19 7:55 ` Avi Kivity
2012-09-19 11:46 ` Edgar E. Iglesias
2012-09-19 7:57 ` Jan Kiszka
1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-19 7:55 UTC (permalink / raw)
To: Peter Crosthwaite
Cc: Peter Maydell, Igor Mitsyanko, Jan Kiszka, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Anthony Liguori,
Paolo Bonzini, Edgar E. Iglesias
On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> <edgar.iglesias@gmail.com> wrote:
>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>> Ping for PMM,
>>>
>>> This is the root case of your block on the SDHCI series - this is a
>>> discussion on resolution to bogus infinite looping DMA. For current
>>> participants in this discussion, heres our thread on the same topic
>>> over in SD land:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>
>>> With the findings here and acknowledgement that this is a general
>>> problem, we either need to declare this issue of scope for SDHCI, or
>>> work with these guys (in the immediate future) on the DMA infinite
>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>> here, but can we get a decisive plan for going forward with this issue
>>> if it is going to continue to block SDHCI.
>>>
>>> Thanks to Igor for identifying the overlap between these discussions.
>>
>> Hi,
>>
>> A couple of dumb questions.
>>
>> What is the reason for the blocker? that possible guest dos is worse
>> than no functionality at all?
>>
>> Can't you put the DMA/IO processing into the background?
>
> I dont know a way do do asynchronous DMA unless I am missing something
> here.
You could schedule a bottom half and do the accesses from there. Solves
nothing though.
> So what happens is we have a device that walks a SG list
> synchronously while holding all the locks and stuff being discussed
> here. If that SG list infinite loops then crash.
Did you mean loop, or recursion into the memory region that initiates DMA?
As this problem already exists, I don't think new device models need to
block on it.
Maybe a simple fix is:
void my_device_write(...)
{
    if (s->dma_in_progress) {
        return;
    }
    ...
    s->dma_in_progress = true;
    cpu_physical_memory_rw(...);
    s->dma_in_progress = false;
}
and we could try to figure out how to move this into common code.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 4:40 ` Peter Crosthwaite
2012-09-19 7:55 ` Avi Kivity
@ 2012-09-19 7:57 ` Jan Kiszka
2012-09-19 13:07 ` Igor Mitsyanko
1 sibling, 1 reply; 72+ messages in thread
From: Jan Kiszka @ 2012-09-19 7:57 UTC (permalink / raw)
To: Peter Crosthwaite
Cc: Peter Maydell, Igor Mitsyanko, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Avi Kivity, Anthony Liguori,
Paolo Bonzini, Edgar E. Iglesias
On 2012-09-19 06:40, Peter Crosthwaite wrote:
> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> <edgar.iglesias@gmail.com> wrote:
>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>> Ping for PMM,
>>>
>>> This is the root case of your block on the SDHCI series - this is a
>>> discussion on resolution to bogus infinite looping DMA. For current
>>> participants in this discussion, heres our thread on the same topic
>>> over in SD land:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>
>>> With the findings here and acknowledgement that this is a general
>>> problem, we either need to declare this issue of scope for SDHCI, or
>>> work with these guys (in the immediate future) on the DMA infinite
>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>> here, but can we get a decisive plan for going forward with this issue
>>> if it is going to continue to block SDHCI.
>>>
>>> Thanks to Igor for identifying the overlap between these discussions.
>>
>> Hi,
>>
>> A couple of dumb questions.
>>
>> What is the reason for the blocker? that possible guest dos is worse
>> than no functionality at all?
>>
>> Can't you put the DMA/IO processing into the background?
>
> I dont know a way do do asynchronous DMA unless I am missing something
> here. So what happens is we have a device that walks a SG list
> synchronously while holding all the locks and stuff being discussed
> here. If that SG list infinite loops then crash.
We could switch to async DMA, but most DMA-capable devices aren't
prepared for rescheduling points in the middle of their code. This will
require quite a bit of code review.
>
>>
>> what's the diff between setup of bad descriptors and writing a
>> while (1) loop in the guest CPU?
>>
>
> While(1) in guest doesnt freeze all of QEMU (monitor still works and
> stuff), wheras the DMA ops going in circles does, as locks are taken
> by an infinite loop.
That's the point. If we loop, we need at least a way to stop it by
issuing a device or system reset. But I tend to think that detecting
loops and failing/ignoring requests would be easier to implement.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-17 2:24 ` liu ping fan
@ 2012-09-19 8:01 ` Avi Kivity
2012-09-19 8:36 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-19 8:01 UTC (permalink / raw)
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka
On 09/17/2012 05:24 AM, liu ping fan wrote:
> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/13/2012 09:55 AM, liu ping fan wrote:
>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>>>
>>>>>>>>> The func call chain can suffer from recursively hold
>>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>>>> lock depth.
>>>>>>>>
>>>>>>>> What is the root cause? io handlers initiating I/O?
>>>>>>>>
>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>>>> be protected by no-lock/device lock/ big-lock.
>>>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>>>
>>>>>> That is our core problem: inconsistent invocation of existing services
>>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>>>> more tricky due to DMA nesting.
>>>>>
>>>>> Maybe we need to switch to a continuation style. Instead of expecting
>>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>>>> asynchronous call and you provide it with a completion. That means
>>>>> devices which use it are forced to drop the lock in between. Block and
>>>>> network clients will be easy to convert since they already use APIs that
>>>>> drop the lock (except for accessing the descriptors).
>>>>>
>>>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>>>> request can trigger the very same access in a nested fashion, and we
>>>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>>>> might be OK).
>>>>>
>>>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>>>> around, consuming cpu but not deadlocking anything. If we can keep a
>>>>> count of the bounces, we might be able to stall it indefinitely or at
>>>>> least ratelimit it.
>>>>>
>>>>
>>>> Another option is to require all users of c_p_m_rw() and related to use
>>>> a coroutine or thread. That makes the programming easier (but still
>>>> required a revalidation after the dropped lock).
>>>>
>>> For the nested cpu_physical_memory_rw(), we change it internal but
>>> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw()
>>> into cpu_physical_memory_rw_internal() )
>>>
>>>
>>> LOCK() // can be device or big lock or both, depend on caller.
>>> ..............
>>> cpu_physical_memory_rw()
>>> {
>>> UNLOCK() //unlock all the locks
>>> queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>>> wait_for_completion(completion)
>>> LOCK()
>>> }
>>> ..................
>>> UNLOCK()
>>>
>>
>> This is dangerous. The caller expects to hold the lock across the call,
>> and will not re-validate its state.
>>
>>> Although cpu_physical_memory_rw_internal() can be freed to use lock,
>>> but with precondition. -- We still need to trace lock stack taken by
>>> cpu_physical_memory_rw(), so that it can return to caller correctly.
>>> Is that OK?
>>
>> I'm convinced that we need a recursive lock if we don't convert
>> everything at once.
>>
>> So how about:
>>
>> - make bql a recursive lock (use pthreads, don't invent our own)
>> - change c_p_m_rw() calling convention to "caller may hold the BQL, but
>> no device lock"
>>
>> this allows devices to DMA each other. Unconverted device models will
>> work as they used to. Converted device models will need to drop the
>> device lock, re-acquire it, and revalidate any state. That will cause
>
> I think we are cornered by devices to DMA each other, and it raises
> the unavoidable nested lock. Also to avoid deadlock caused by device's
> lock sequence, we should drop the current device lock to acquire
> another one. The only little diverge is about the "revalidate", do
> we need to rollback?
It basically means you can't hold contents of device state in local
variables. You need to read everything again from the device. That
includes things like DMA enable bits.
(btw: do we respect PCI_COMMAND_MASTER? It seems that we do not. This
is a trivial example of an iommu; we should get that going.)
> Or for converted device, we can just tag it a
> busy flag, that is check&set busy flag at the entry of device's
> mmio-dispatch. So when re-acquire device's lock, the device's state
> is intact.
The state can be changed by a parallel access to another register, which
is valid.
>
> Has anybody other suggestion?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 8:01 ` Avi Kivity
@ 2012-09-19 8:36 ` liu ping fan
2012-09-19 9:05 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-19 8:36 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka
On Wed, Sep 19, 2012 at 4:01 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/17/2012 05:24 AM, liu ping fan wrote:
>> On Thu, Sep 13, 2012 at 4:19 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/13/2012 09:55 AM, liu ping fan wrote:
>>>> On Tue, Sep 11, 2012 at 8:41 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>> On 09/11/2012 03:24 PM, Avi Kivity wrote:
>>>>>> On 09/11/2012 12:57 PM, Jan Kiszka wrote:
>>>>>>> On 2012-09-11 11:44, liu ping fan wrote:
>>>>>>>> On Tue, Sep 11, 2012 at 4:35 PM, Avi Kivity <avi@redhat.com> wrote:
>>>>>>>>> On 09/11/2012 10:51 AM, Liu Ping Fan wrote:
>>>>>>>>>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>>>>>>>>>
>>>>>>>>>> The func call chain can suffer from recursively hold
>>>>>>>>>> qemu_mutex_lock_iothread. We introduce lockmap to record the
>>>>>>>>>> lock depth.
>>>>>>>>>
>>>>>>>>> What is the root cause? io handlers initiating I/O?
>>>>>>>>>
>>>>>>>> cpu_physical_memory_rw() can be called nested, and when called, it can
>>>>>>>> be protected by no-lock/device lock/ big-lock.
>>>>>>>> I think without big lock, io-dispatcher should face the same issue.
>>>>>>>> As to main-loop, have not carefully consider, but at least, dma-helper
>>>>>>>> will call cpu_physical_memory_rw() with big-lock.
>>>>>>>
>>>>>>> That is our core problem: inconsistent invocation of existing services
>>>>>>> /wrt locking. For portio, I was lucky that there is no nesting and I was
>>>>>>> able to drop the big lock around all (x86) call sites. But MMIO is way
>>>>>>> more tricky due to DMA nesting.
>>>>>>
>>>>>> Maybe we need to switch to a continuation style. Instead of expecting
>>>>>> cpu_physical_memory_rw() to complete synchronously, it becomes an
>>>>>> asynchronous call and you provide it with a completion. That means
>>>>>> devices which use it are forced to drop the lock in between. Block and
>>>>>> network clients will be easy to convert since they already use APIs that
>>>>>> drop the lock (except for accessing the descriptors).
>>>>>>
>>>>>>> We could try to introduce a different version of cpu_physical_memory_rw,
>>>>>>> cpu_physical_memory_rw_unlocked. But the problem remains that an MMIO
>>>>>>> request can trigger the very same access in a nested fashion, and we
>>>>>>> will have to detect this to avoid locking up QEMU (locking up the guest
>>>>>>> might be OK).
>>>>>>
>>>>>> An async version of c_p_m_rw() will just cause a completion to bounce
>>>>>> around, consuming cpu but not deadlocking anything. If we can keep a
>>>>>> count of the bounces, we might be able to stall it indefinitely or at
>>>>>> least ratelimit it.
>>>>>>
>>>>>
>>>>> Another option is to require all users of c_p_m_rw() and related to use
>>>>> a coroutine or thread. That makes the programming easier (but still
>>>>> required a revalidation after the dropped lock).
>>>>>
>>>> For the nested cpu_physical_memory_rw(), we change it internal but
>>>> keep it sync API as it is. (Wrapper current cpu_physical_memory_rw()
>>>> into cpu_physical_memory_rw_internal() )
>>>>
>>>>
>>>> LOCK() // can be device or big lock or both, depend on caller.
>>>> ..............
>>>> cpu_physical_memory_rw()
>>>> {
>>>> UNLOCK() //unlock all the locks
>>>> queue_work_on_thread(cpu_physical_memory_rw_internal, completion);
>>>> // cpu_physical_memory_rw_internal can take lock(device,biglock) again
>>>> wait_for_completion(completion)
>>>> LOCK()
>>>> }
>>>> ..................
>>>> UNLOCK()
>>>>
>>>
>>> This is dangerous. The caller expects to hold the lock across the call,
>>> and will not re-validate its state.
>>>
>>>> Although cpu_physical_memory_rw_internal() can be freed to use lock,
>>>> but with precondition. -- We still need to trace lock stack taken by
>>>> cpu_physical_memory_rw(), so that it can return to caller correctly.
>>>> Is that OK?
>>>
>>> I'm convinced that we need a recursive lock if we don't convert
>>> everything at once.
>>>
>>> So how about:
>>>
>>> - make bql a recursive lock (use pthreads, don't invent our own)
>>> - change c_p_m_rw() calling convention to "caller may hold the BQL, but
>>> no device lock"
>>>
>>> this allows devices to DMA each other. Unconverted device models will
>>> work as they used to. Converted device models will need to drop the
>>> device lock, re-acquire it, and revalidate any state. That will cause
>>
>> I think we are cornered by devices to DMA each other, and it raises
>> the unavoidable nested lock. Also to avoid deadlock caused by device's
>> lock sequence, we should drop the current device lock to acquire
>> another one. The only little diverge is about the "revalidate", do
>> we need to rollback?
>
> It basically means you can't hold contents of device state in local
> variables. You need to read everything again from the device. That
> includes things like DMA enable bits.
>
I think that reading everything again from the device cannot work.
Suppose the following scenario: the device's state is changed through a
series of internal registers (say partA+partB; partC+partD). After
partA is changed, the device's lock is dropped. At this point another
access to this device will work on partA+partB to determine C+D, but
partB is not correctly updated yet, so C+D may be decided from a broken
context and be wrong.
> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not. This
> is a trivial example of an iommu, we should get that going).
>
>> Or for converted device, we can just tag it a
>> busy flag, that is check&set busy flag at the entry of device's
>> mmio-dispatch. So when re-acquire device's lock, the device's state
>> is intact.
>
> The state can be changed by a parallel access to another register, which
> is valid.
>
Do you mean that the device can be accessed in parallel? But how? We
use the device's lock.
What my suggestion is:
lock();
if (set_and_test(dev->busy)) {   /* already busy: bail out */
    unlock();
    return;
}
change device registers;
do other things, including calling c_p_m_rw();  /* lock dropped here,
                                                   but set_and_test()
                                                   still guards us */
clear(dev->busy);
unlock();
So the changes to the device registers are protected and unbreakable.
Regards,
pingfan
>>
>> Has anybody other suggestion?
>
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 8:36 ` liu ping fan
@ 2012-09-19 9:05 ` Avi Kivity
2012-09-20 7:51 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-19 9:05 UTC (permalink / raw)
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka
On 09/19/2012 11:36 AM, liu ping fan wrote:
>>
>> It basically means you can't hold contents of device state in local
>> variables. You need to read everything again from the device. That
>> includes things like DMA enable bits.
>>
> I think that read everything again from the device can not work.
> Suppose the following scene: If the device's state contains the change
> of a series of internal registers (supposing partA+partB;
> partC+partD), after partA changed, the device's lock is broken. At
> this point, another access to this device, it will work on partA+partB
> to determine C+D, but since partB is not correctly updated yet. So C+D
> may be decided by broken context and be wrong.
That's the guest's problem. Note it could have happened even before the
change, since the writes to A/B/C/D are unordered wrt the DMA.
>
>> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not. This
>> is a trivial example of an iommu, we should get that going).
>>
>>> Or for converted device, we can just tag it a
>>> busy flag, that is check&set busy flag at the entry of device's
>>> mmio-dispatch. So when re-acquire device's lock, the device's state
>>> is intact.
>>
>> The state can be changed by a parallel access to another register, which
>> is valid.
>>
> Do you mean that the device can be accessed in parallel? But how? we
> use device's lock.
Some DMA is asynchronous already, like networking and IDE DMA. So we
drop the big lock while doing it. With the change to drop device locks
around c_p_m_rw(), we have that problem everywhere.
> What my suggestion is:
> lock();
> set_and_test(dev->busy);
> if busy
> unlock and return;
> changing device registers;
> do other things including calling to c_p_m_rw() //here,lock broken,
> but set_and_test() works
> clear(dev->busy);
> unlock();
>
> So changing device registers is protected, and unbreakable.
But the changes may be legitimate. Maybe you're writing to a completely
unrelated register from a different vcpu; now that write is lost.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 7:55 ` Avi Kivity
@ 2012-09-19 11:46 ` Edgar E. Iglesias
2012-09-19 12:12 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: Edgar E. Iglesias @ 2012-09-19 11:46 UTC (permalink / raw)
To: Avi Kivity
Cc: Peter Maydell, Igor Mitsyanko, Jan Kiszka, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Peter Crosthwaite,
Anthony Liguori, Paolo Bonzini
On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> > <edgar.iglesias@gmail.com> wrote:
> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
> >>> Ping for PMM,
> >>>
> >>> This is the root case of your block on the SDHCI series - this is a
> >>> discussion on resolution to bogus infinite looping DMA. For current
> >>> participants in this discussion, heres our thread on the same topic
> >>> over in SD land:
> >>>
> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
> >>>
> >>> With the findings here and acknowledgement that this is a general
> >>> problem, we either need to declare this issue of scope for SDHCI, or
> >>> work with these guys (in the immediate future) on the DMA infinite
> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
> >>> here, but can we get a decisive plan for going forward with this issue
> >>> if it is going to continue to block SDHCI.
> >>>
> >>> Thanks to Igor for identifying the overlap between these discussions.
> >>
> >> Hi,
> >>
> >> A couple of dumb questions.
> >>
> >> What is the reason for the blocker? that possible guest dos is worse
> >> than no functionality at all?
> >>
> >> Can't you put the DMA/IO processing into the background?
> >
> > I dont know a way do do asynchronous DMA unless I am missing something
> > here.
>
> You could schedule a bottom half and do the accesses from there. Solves
> nothing though.
>
> > So what happens is we have a device that walks a SG list
> > synchronously while holding all the locks and stuff being discussed
> > here. If that SG list infinite loops then crash.
>
> Did you mean loop, or recursion into the memory region that initiates DMA?
I think we were discussing the loop issue (I hadn't thought of recursion
issues before) :)
Jan's point of resetability was interesting.
Processing a finite number of DMA descriptors and scheduling BHs to
continuously get back and continue processing should be OK, no?
That would avoid the lockups and allow the device to be reset at
any time. Or am I missing something?
IIRC the ETRAX DMA does something like that, but only for receive
DMA rings...
Cheers,
Edgar
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 11:46 ` Edgar E. Iglesias
@ 2012-09-19 12:12 ` Avi Kivity
2012-09-19 12:17 ` Edgar E. Iglesias
2012-09-19 13:01 ` Igor Mitsyanko
0 siblings, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-19 12:12 UTC (permalink / raw)
To: Edgar E. Iglesias
Cc: Peter Maydell, Igor Mitsyanko, Jan Kiszka, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Peter Crosthwaite,
Anthony Liguori, Paolo Bonzini
On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
> On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
>> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
>> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
>> > <edgar.iglesias@gmail.com> wrote:
>> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>> >>> Ping for PMM,
>> >>>
>> >>> This is the root case of your block on the SDHCI series - this is a
>> >>> discussion on resolution to bogus infinite looping DMA. For current
>> >>> participants in this discussion, heres our thread on the same topic
>> >>> over in SD land:
>> >>>
>> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>> >>>
>> >>> With the findings here and acknowledgement that this is a general
>> >>> problem, we either need to declare this issue of scope for SDHCI, or
>> >>> work with these guys (in the immediate future) on the DMA infinite
>> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>> >>> here, but can we get a decisive plan for going forward with this issue
>> >>> if it is going to continue to block SDHCI.
>> >>>
>> >>> Thanks to Igor for identifying the overlap between these discussions.
>> >>
>> >> Hi,
>> >>
>> >> A couple of dumb questions.
>> >>
>> >> What is the reason for the blocker? that possible guest dos is worse
>> >> than no functionality at all?
>> >>
>> >> Can't you put the DMA/IO processing into the background?
>> >
>> > I dont know a way do do asynchronous DMA unless I am missing something
>> > here.
>>
>> You could schedule a bottom half and do the accesses from there. Solves
>> nothing though.
>>
>> > So what happens is we have a device that walks a SG list
>> > synchronously while holding all the locks and stuff being discussed
>> > here. If that SG list infinite loops then crash.
>>
>> Did you mean loop, or recursion into the memory region that initiates DMA?
>
> I think we were discussing the loop issue (I hadn't thought of recursion
> issues before) :)
>
> Jan's point of resetability was interesting.
>
> Processing a finite amount of dma descriptors and scheduling bh's
> to continously get back and continue processing should be OK no?
>
Yes.
> That would avoid the lockups and allow the device to be reset at
> any time. Or am I missing something?
>
Not that I can see. If real hardware can be looped, so can qemu. I'm
only worried about recursion and deadlocks (while real hardware can
deadlock, we'd prefer to avoid that).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 12:12 ` Avi Kivity
@ 2012-09-19 12:17 ` Edgar E. Iglesias
2012-09-19 13:01 ` Igor Mitsyanko
1 sibling, 0 replies; 72+ messages in thread
From: Edgar E. Iglesias @ 2012-09-19 12:17 UTC (permalink / raw)
To: Avi Kivity
Cc: Peter Maydell, Igor Mitsyanko, Jan Kiszka, Marcelo Tosatti,
qemu-devel@nongnu.org, liu ping fan, Peter Crosthwaite,
Anthony Liguori, Paolo Bonzini
On Wed, Sep 19, 2012 at 03:12:12PM +0300, Avi Kivity wrote:
> On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
> > On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
> >> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
> >> > On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
> >> > <edgar.iglesias@gmail.com> wrote:
> >> >> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
> >> >>> Ping for PMM,
> >> >>>
> >> >>> This is the root case of your block on the SDHCI series - this is a
> >> >>> discussion on resolution to bogus infinite looping DMA. For current
> >> >>> participants in this discussion, heres our thread on the same topic
> >> >>> over in SD land:
> >> >>>
> >> >>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
> >> >>>
> >> >>> With the findings here and acknowledgement that this is a general
> >> >>> problem, we either need to declare this issue of scope for SDHCI, or
> >> >>> work with these guys (in the immediate future) on the DMA infinite
> >> >>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
> >> >>> here, but can we get a decisive plan for going forward with this issue
> >> >>> if it is going to continue to block SDHCI.
> >> >>>
> >> >>> Thanks to Igor for identifying the overlap between these discussions.
> >> >>
> >> >> Hi,
> >> >>
> >> >> A couple of dumb questions.
> >> >>
> >> >> What is the reason for the blocker? that possible guest dos is worse
> >> >> than no functionality at all?
> >> >>
> >> >> Can't you put the DMA/IO processing into the background?
> >> >
> >> > I dont know a way do do asynchronous DMA unless I am missing something
> >> > here.
> >>
> >> You could schedule a bottom half and do the accesses from there. Solves
> >> nothing though.
> >>
> >> > So what happens is we have a device that walks a SG list
> >> > synchronously while holding all the locks and stuff being discussed
> >> > here. If that SG list infinite loops then crash.
> >>
> >> Did you mean loop, or recursion into the memory region that initiates DMA?
> >
> > I think we were discussing the loop issue (I hadn't thought of recursion
> > issues before) :)
> >
> > Jan's point of resetability was interesting.
> >
> > Processing a finite amount of dma descriptors and scheduling bh's
> > to continously get back and continue processing should be OK no?
> >
>
> Yes.
>
> > That would avoid the lockups and allow the device to be reset at
> > any time. Or am I missing something?
> >
>
> Not that I can see. If real hardware can be looped, so can qemu. I'm
> only worried about recursion and deadlocks (while real hardware can
> deadlock, we'd prefer to avoid that).
Agreed, thanks
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 12:12 ` Avi Kivity
2012-09-19 12:17 ` Edgar E. Iglesias
@ 2012-09-19 13:01 ` Igor Mitsyanko
2012-09-19 13:03 ` Avi Kivity
1 sibling, 1 reply; 72+ messages in thread
From: Igor Mitsyanko @ 2012-09-19 13:01 UTC (permalink / raw)
To: Avi Kivity
Cc: Peter Maydell, Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org,
liu ping fan, Peter Crosthwaite, Anthony Liguori, Paolo Bonzini,
Edgar E. Iglesias
On 09/19/2012 04:12 PM, Avi Kivity wrote:
> On 09/19/2012 02:46 PM, Edgar E. Iglesias wrote:
>> On Wed, Sep 19, 2012 at 10:55:30AM +0300, Avi Kivity wrote:
>>> On 09/19/2012 07:40 AM, Peter Crosthwaite wrote:
>>>> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
>>>> <edgar.iglesias@gmail.com> wrote:
>>>>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>>>>> Ping for PMM,
>>>>>>
>>>>>> This is the root case of your block on the SDHCI series - this is a
>>>>>> discussion on resolution to bogus infinite looping DMA. For current
>>>>>> participants in this discussion, heres our thread on the same topic
>>>>>> over in SD land:
>>>>>>
>>>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>>>>
>>>>>> With the findings here and acknowledgement that this is a general
>>>>>> problem, we either need to declare this issue of scope for SDHCI, or
>>>>>> work with these guys (in the immediate future) on the DMA infinite
>>>>>> loop problem flagged here. I dont mind if SDHCI+ADMA is the guinea pig
>>>>>> here, but can we get a decisive plan for going forward with this issue
>>>>>> if it is going to continue to block SDHCI.
>>>>>>
>>>>>> Thanks to Igor for identifying the overlap between these discussions.
>>>>> Hi,
>>>>>
>>>>> A couple of dumb questions.
>>>>>
>>>>> What is the reason for the blocker? that possible guest dos is worse
>>>>> than no functionality at all?
>>>>>
>>>>> Can't you put the DMA/IO processing into the background?
>>>> I dont know a way do do asynchronous DMA unless I am missing something
>>>> here.
>>> You could schedule a bottom half and do the accesses from there. Solves
>>> nothing though.
>>>
>>>> So what happens is we have a device that walks a SG list
>>>> synchronously while holding all the locks and stuff being discussed
>>>> here. If that SG list infinite loops then crash.
>>> Did you mean loop, or recursion into the memory region that initiates DMA?
>> I think we were discussing the loop issue (I hadn't thought of recursion
>> issues before) :)
>>
>> Jan's point of resetability was interesting.
>>
>> Processing a finite amount of dma descriptors and scheduling bh's
>> to continously get back and continue processing should be OK no?
>>
> Yes.
The bottom half idea sounds good.
Also, for this particular device, rather than limiting the number of
descriptor entries processed per bottom half, it seems reasonable to
interrupt the bh handler whenever we encounter a LINK descriptor, since
only recursive LINKs can generate infinite loops. In that case, a very
long descriptor table without a single LINK entry could still hang QEMU
for a long time, but that time is finite anyway. Long term, this
problem will disappear when/if someone converts the QEMU SD card model
to use async I/O.
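The yield-at-LINK idea above could be sketched roughly as follows. All names here (Desc, DescState, process_until_link) are hypothetical illustrations, not QEMU's real ADMA code; in QEMU, each re-invocation would be a scheduled bottom half so the main loop stays responsive:

```c
#include <assert.h>
#include <stdbool.h>

enum desc_kind { DESC_XFER, DESC_LINK, DESC_END };

typedef struct Desc { enum desc_kind kind; int next; } Desc;

typedef struct DescState {
    Desc *table;
    int   cur;        /* index of next descriptor to process */
    int   processed;  /* transfers completed so far */
    bool  done;
} DescState;

/* One bottom-half invocation: walk until END, or yield at a LINK.
 * Since only LINKs can form a loop, a looping chain spins the bh
 * forever but never blocks the main loop (or the monitor). */
static void process_until_link(DescState *s)
{
    while (!s->done) {
        Desc *d = &s->table[s->cur];
        switch (d->kind) {
        case DESC_XFER:
            s->processed++;        /* the actual DMA would happen here */
            s->cur = d->next;
            break;
        case DESC_LINK:
            s->cur = d->next;      /* follow the link... */
            return;                /* ...but yield: reschedule the bh */
        case DESC_END:
            s->done = true;
            return;
        }
    }
}
```

With a table whose LINK points back to the start, each call makes bounded progress and returns, so a guest-constructed loop degrades into a busy device rather than a frozen QEMU.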
>> That would avoid the lockups and allow the device to be reset at
>> any time. Or am I missing something?
>>
> Not that I can see. If real hardware can be looped, so can qemu. I'm
> only worried about recursion and deadlocks (while real hardware can
> deadlock, we'd prefer to avoid that).
>
>
So, I think the idea here is that if real hardware can be locked up, we
should lock up too, but give the guest CPU a way to abort the locked
operation. For this particular example, the SD card controller can
deadlock on recursive descriptor LINK entries, but the CPU can abort an
ongoing transaction at any time by issuing an ABORT command. And if the
guest CPU busy-waits in an infinite while() loop for a TRANSFER OVER
flag to be set, then it had it coming.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 13:01 ` Igor Mitsyanko
@ 2012-09-19 13:03 ` Avi Kivity
0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2012-09-19 13:03 UTC (permalink / raw)
To: Igor Mitsyanko
Cc: Peter Maydell, Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org,
liu ping fan, Peter Crosthwaite, Anthony Liguori, Paolo Bonzini,
Edgar E. Iglesias
On 09/19/2012 04:01 PM, Igor Mitsyanko wrote:
>>> That would avoid the lockups and allow the device to be reset at
>>> any time. Or am I missing something?
>>>
>> Not that I can see. If real hardware can be looped, so can qemu. I'm
>> only worried about recursion and deadlocks (while real hardware can
>> deadlock, we'd prefer to avoid that).
>>
>>
>
> So, I think the idea here is that if real hardware can be locked up, we
> should lock up too, but give the guest CPU a way to abort the locked
> operation. For this particular example, the SD card controller can
> deadlock on recursive descriptor LINK entries, but the CPU can abort an
> ongoing transaction at any time by issuing an ABORT command. And if the
> guest CPU busy-waits in an infinite while() loop for a TRANSFER OVER
> flag to be set, then it had it coming.
If real hardware would lock up, then so should we (in the guest's eyes),
but the qemu monitor should remain responsive; it is not part of the
emulated hardware.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 7:57 ` Jan Kiszka
@ 2012-09-19 13:07 ` Igor Mitsyanko
0 siblings, 0 replies; 72+ messages in thread
From: Igor Mitsyanko @ 2012-09-19 13:07 UTC (permalink / raw)
To: Jan Kiszka
Cc: Peter Maydell, Marcelo Tosatti, qemu-devel@nongnu.org,
liu ping fan, Peter Crosthwaite, Avi Kivity, Anthony Liguori,
Paolo Bonzini, Edgar E. Iglesias
On 09/19/2012 11:57 AM, Jan Kiszka wrote:
> On 2012-09-19 06:40, Peter Crosthwaite wrote:
>> On Wed, Sep 19, 2012 at 2:32 PM, Edgar E. Iglesias
>> <edgar.iglesias@gmail.com> wrote:
>>> On Wed, Sep 19, 2012 at 02:25:48PM +1000, Peter Crosthwaite wrote:
>>>> Ping for PMM,
>>>>
>>>> This is the root cause of your block on the SDHCI series - this is a
>>>> discussion on resolving bogus infinite-looping DMA. For current
>>>> participants in this discussion, here's our thread on the same topic
>>>> over in SD land:
>>>>
>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-08/msg01017.html
>>>>
>>>> With the findings here and the acknowledgement that this is a general
>>>> problem, we either need to declare this issue out of scope for SDHCI,
>>>> or work with these guys (in the immediate future) on the DMA
>>>> infinite-loop problem flagged here. I don't mind if SDHCI+ADMA is the
>>>> guinea pig here, but can we get a decisive plan for going forward with
>>>> this issue if it is going to continue to block SDHCI?
>>>>
>>>> Thanks to Igor for identifying the overlap between these discussions.
>>> Hi,
>>>
>>> A couple of dumb questions.
>>>
>>> What is the reason for the blocker? Is it that a possible guest DoS is
>>> worse than no functionality at all?
>>>
>>> Can't you put the DMA/IO processing into the background?
>> I don't know a way to do asynchronous DMA unless I am missing something
>> here. So what happens is we have a device that walks an SG list
>> synchronously while holding all the locks and stuff being discussed
>> here. If that SG list infinite-loops, then we crash.
> We could switch to async DMA, but most DMA-capable devices aren't
> prepared for rescheduling points in the middle of their code. This will
> require quite a bit of code review.
From what I understand, we can't switch to async DMA as long as the SD
card model doesn't support it, and introducing that support would
require converting every existing SD card user in QEMU.
I recall that such a conversion was already suggested on the mailing
list a while ago, but no one (including me) volunteered to do it :)
>>> What's the difference between setting up bad descriptors and writing
>>> a while (1) loop in the guest CPU?
>>>
>> While(1) in the guest doesn't freeze all of QEMU (the monitor still
>> works and so on), whereas DMA ops going in circles do, as locks are
>> held by an infinite loop.
> That's the point. If we loop, we need at least a way to stop it by
> issuing a device or system reset. But I tend to think that detecting
> loops and failing/ignoring such requests is easier to implement.
>
> Jan
>
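Jan's loop-detection alternative could be sketched as a bounded descriptor walk. Desc, walk_bounded, and MAX_DESC are hypothetical names for illustration, not real QEMU code; the cap is an arbitrary choice:

```c
#include <assert.h>

enum { DESC_XFER, DESC_END };
typedef struct Desc { int kind; int next; } Desc;

enum { MAX_DESC = 1024 };   /* arbitrary cap on chain length (hypothetical) */

/* Walk at most MAX_DESC entries; return the count processed, or -1 if
 * the cap is hit, which a device model could report as a DMA error and
 * otherwise ignore, rather than spinning forever. */
static int walk_bounded(const Desc *table, int start)
{
    int cur = start, n = 0;
    while (table[cur].kind != DESC_END) {
        if (++n > MAX_DESC)
            return -1;             /* looping / bogus descriptor chain */
        cur = table[cur].next;     /* process the entry, then follow it */
    }
    return n;
}
```

The trade-off against the bh approach above is that a legitimate chain longer than the cap is rejected, so the bound must exceed anything real guests produce.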
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 8:14 ` Avi Kivity
2012-09-13 8:19 ` Paolo Bonzini
@ 2012-09-19 13:16 ` Jamie Lokier
1 sibling, 0 replies; 72+ messages in thread
From: Jamie Lokier @ 2012-09-19 13:16 UTC (permalink / raw)
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, qemu-devel,
Anthony Liguori, Paolo Bonzini
Avi Kivity wrote:
> On 09/13/2012 09:54 AM, liu ping fan wrote:
>
> >>> +typedef struct Atomic {
> >>> + int counter;
> >>> +} Atomic;
> >>
> >> Best to mark counter 'volatile'.
> >>
> >>> +
> >>> +static inline void atomic_set(Atomic *v, int i)
> >>> +{
> >>> + v->counter = i;
> >>> +}
> >>> +
> >>> +static inline int atomic_read(Atomic *v)
> >>> +{
> >>> + return v->counter;
> >>> +}
> >>>
> >>
> >> So these two operations don't get mangled by the optimizer.
> >>
> > Browsing Linux code and reading lkml, I found some similar material. But
> > they have moved volatile from ->counter to the function atomic_read().
> > As to atomic_read(), I think it needs to prevent the optimizer from
> > refetching, but as to atomic_set(), do we need it?
>
> I think so, to prevent reordering.
Hi,
I don't think volatile makes any difference to reordering here.
The compiler is not going to move the atomic_set() store before or
after another instruction on the same atomic variable anyway, just
like it wouldn't do that for an ordinary assignment.
If you're concerned about ordering with respect to other memory, then
volatile wouldn't make much difference. barrier() before and after would.
If you're copying Linux's semantics, Linux's atomic_set() doesn't
include any barriers, nor imply any. atomic_read() uses volatile to
ensure that each call re-reads the value, for example in a loop.
(Same as ACCESS_ONCE().) If there were a call to atomic_set() in a
loop, there would be no guarantee the value is written out each time.
-- Jamie
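The Linux-style semantics described above (volatile on the accessors rather than on the struct field) might look like this sketch. atomic_inc_return() uses a GCC __sync builtin purely for illustration; none of this is the patch under review:

```c
#include <assert.h>

typedef struct Atomic { int counter; } Atomic;

/* Plain store; no barrier or ordering is implied, matching Linux. */
static inline void atomic_set(Atomic *v, int i)
{
    *(volatile int *)&v->counter = i;
}

/* Volatile access forces a fresh load on every call (like ACCESS_ONCE),
 * so a read in a loop cannot be hoisted by the optimizer. */
static inline int atomic_read(Atomic *v)
{
    return *(volatile int *)&v->counter;
}

/* Read-modify-write ops are where real atomicity (and a full barrier)
 * comes in; GCC's __sync builtins were the usual choice in 2012. */
static inline int atomic_inc_return(Atomic *v)
{
    return __sync_add_and_fetch(&v->counter, 1);
}
```

Putting the volatile cast in the accessors rather than on the field keeps ordinary (non-atomic) uses of the struct free to be optimized, which is exactly the reason cited on lkml for the move.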
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-13 6:54 ` liu ping fan
2012-09-13 8:14 ` Avi Kivity
@ 2012-09-19 13:32 ` Jamie Lokier
2012-09-19 14:12 ` Peter Maydell
1 sibling, 1 reply; 72+ messages in thread
From: Jamie Lokier @ 2012-09-19 13:32 UTC (permalink / raw)
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Avi Kivity,
Anthony Liguori, Paolo Bonzini
liu ping fan wrote:
> >> +static inline void atomic_set(Atomic *v, int i)
> >> +{
> >> + v->counter = i;
> >> +}
Hi,
When running on ARM Linux kernels prior to 2.6.32, userspace
atomic_set() needs to use "clrex" or "strex" too.
See Linux commit 200b812d, "Clear the exclusive monitor when returning
from an exception".
You can see that ARM's atomic_set() used to use "strex", with a comment
warning that it's important. The kernel patch allows atomic_set() to be
simplified, including for userspace, by putting clrex/strex in the
exception return path instead.
However, someone may run QEMU on a kernel before 2.6.32, which isn't
that old. (E.g. my phone is running 2.6.28).
Otherwise you can have this situation (initially a = 0):

  Thread:           atomic_inc(&a, 1) = ldrex, add, [strex interrupted]
  Signal handler:   atomic_set(&a, 3) = str
  Signal return,
  thread resumes:   strex (succeeds because the CPU-local
                    exclusive flag is still set)

Result: a = 1, which should be impossible once the signal has
triggered, and the information about the signal is lost.
A more realistic example would use atomic_compare_exchange(), to
atomically read-and-clear or read-and-decrement-if-not-zero a variable
set in a signal handler; however, I've used atomic_inc() to illustrate
because that's what's in your patch.
Best,
-- Jamie
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-19 13:32 ` Jamie Lokier
@ 2012-09-19 14:12 ` Peter Maydell
2012-09-19 15:53 ` Jamie Lokier
0 siblings, 1 reply; 72+ messages in thread
From: Peter Maydell @ 2012-09-19 14:12 UTC (permalink / raw)
To: Jamie Lokier
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, qemu-devel, Avi Kivity,
Anthony Liguori, Paolo Bonzini
On 19 September 2012 14:32, Jamie Lokier <jamie@shareable.org> wrote:
> However, someone may run QEMU on a kernel before 2.6.32, which isn't
> that old. (E.g. my phone is running 2.6.28).
NB that ARM kernels that old have other amusing bugs, such
as not saving the floating point registers when invoking
signal handlers. I would be happy for QEMU to just say "your
kernel is too old!"...
-- PMM
* Re: [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations
2012-09-19 14:12 ` Peter Maydell
@ 2012-09-19 15:53 ` Jamie Lokier
0 siblings, 0 replies; 72+ messages in thread
From: Jamie Lokier @ 2012-09-19 15:53 UTC (permalink / raw)
To: Peter Maydell
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, qemu-devel, Avi Kivity,
Anthony Liguori, Paolo Bonzini
Peter Maydell wrote:
> On 19 September 2012 14:32, Jamie Lokier <jamie@shareable.org> wrote:
> > However, someone may run QEMU on a kernel before 2.6.32, which isn't
> > that old. (E.g. my phone is running 2.6.28).
>
> NB that ARM kernels that old have other amusing bugs, such
> as not saving the floating point registers when invoking
> signal handlers.
Hi Peter,
It's not that old (< 3 years). Granted that's not a nice one, but I'm
under the impression it occurs only when the signal handler uses (VFP
hardware) floating point. I.e. most programs don't do that, they keep
the signal handlers simple (probably including QEMU).
(I've read about other platforms that have similar issues using
floating point in signal handlers; best avoided.)
Anyway, people are running those kernels, and someone will try to run
QEMU on them unless...
> I would be happy for QEMU to just say "your kernel is too old!"...
I'd be quite happy with that as well, if you want to put a check in
and refuse to run (like Glibc does).
I'd be less happy with obscure, rare failures of atomicity that are
practically undebuggable, yet easily fixed.
Cheers,
-- Jamie
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-19 9:05 ` Avi Kivity
@ 2012-09-20 7:51 ` liu ping fan
2012-09-20 9:15 ` Avi Kivity
0 siblings, 1 reply; 72+ messages in thread
From: liu ping fan @ 2012-09-20 7:51 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka
On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/19/2012 11:36 AM, liu ping fan wrote:
>>>
>>> It basically means you can't hold contents of device state in local
>>> variables. You need to read everything again from the device. That
>>> includes things like DMA enable bits.
>>>
>> I think that reading everything again from the device cannot work.
>> Suppose the following scenario: if the device's state comprises a
>> series of internal registers (say partA+partB; partC+partD), then
>> after partA is changed, the device's lock is broken. At that point,
>> another access to this device will work on partA+partB to determine
>> C+D, but partB is not correctly updated yet, so C+D may be decided
>> by a broken context and be wrong.
>
> That's the guest's problem. Note it could have happened even before the
> change, since the writes to A/B/C/D are unordered wrt the DMA.
>
Yes, agreed, it is the guest's problem. So it means that if
ready_of(A+B) is not signaled to the guest, the guest should not launch
operations on (C+D), right? But here comes the question: if ready is
not signaled to the guest, how can the guest launch operations on (A+B)
again?
I.e., although the local lock is broken, (A+B) is still intact when we
re-acquire the local lock, so we need not read everything again from
the device. Wrong?
>>
>>> (btw: to we respect PCI_COMMAND_MASTER? it seems that we do not. This
>>> is a trivial example of an iommu, we should get that going).
>>>
>>>> Or, for a converted device, we can just tag it with a busy flag,
>>>> that is, check-and-set the busy flag at the entry of the device's
>>>> mmio dispatch. So when we re-acquire the device's lock, the device's
>>>> state is intact.
>>>
>>> The state can be changed by a parallel access to another register, which
>>> is valid.
>>>
>> Do you mean that the device can be accessed in parallel? But how? We
>> use the device's lock.
>
> Some DMA is asynchronous already, like networking and IDE DMA. So we
> drop the big lock while doing it. With the change to drop device locks
> around c_p_m_rw(), we have that problem everywhere.
>
>> What my suggestion is:
>>
>>     lock();
>>     if (set_and_test(dev->busy)) {
>>         unlock();
>>         return;
>>     }
>>     /* change device registers */
>>     /* do other things, including calling c_p_m_rw();
>>        the lock is broken here, but set_and_test() guards */
>>     clear(dev->busy);
>>     unlock();
>>
>> So changing device registers is protected, and unbreakable.
>
> But the changes may be legitimate. Maybe you're writing to a completely
> unrelated register, from a different vcpu, now that write is lost.
>
But I think it will mean finer-grained locks for each group of unrelated
registers, and accordingly, the busy flag should be a bitmap with one
bit per group.
Thanks and regards,
pingfan
>
> --
> error compiling committee.c: too many arguments to function
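The busy-flag suggestion in the message above could be sketched as follows. Dev and mmio_dispatch are hypothetical names; the per-device mutex is elided, with a GCC atomic builtin standing in for the flag update that would be done under the lock:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct Dev { int busy; int regs; } Dev;   /* hypothetical device */

static bool mmio_dispatch(Dev *d)
{
    /* lock(); set_and_test(dev->busy); -- here the test-and-set itself
     * is atomic, so a re-entrant dispatch sees busy and bails out. */
    if (__sync_lock_test_and_set(&d->busy, 1))
        return false;              /* busy: "unlock and return" */

    d->regs++;                     /* changing device registers */
    /* do other things, including c_p_m_rw(); the device lock would be
     * dropped around that call, but the busy flag keeps re-entry out,
     * so register state stays intact */

    __sync_lock_release(&d->busy); /* clear(dev->busy); unlock(); */
    return true;
}
```

As Avi points out in the exchange, the cost of this pattern is that a legitimate concurrent write to an unrelated register is also rejected, which is what motivates the per-group bitmap idea.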
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-20 7:51 ` liu ping fan
@ 2012-09-20 9:15 ` Avi Kivity
2012-09-21 7:27 ` liu ping fan
0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2012-09-20 9:15 UTC (permalink / raw)
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka
On 09/20/2012 10:51 AM, liu ping fan wrote:
> On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 09/19/2012 11:36 AM, liu ping fan wrote:
>>>>
>>>> It basically means you can't hold contents of device state in local
>>>> variables. You need to read everything again from the device. That
>>>> includes things like DMA enable bits.
>>>>
>>> I think that reading everything again from the device cannot work.
>>> Suppose the following scenario: if the device's state comprises a
>>> series of internal registers (say partA+partB; partC+partD), then
>>> after partA is changed, the device's lock is broken. At that point,
>>> another access to this device will work on partA+partB to determine
>>> C+D, but partB is not correctly updated yet, so C+D may be decided
>>> by a broken context and be wrong.
>>
>> That's the guest's problem. Note it could have happened even before the
>> change, since the writes to A/B/C/D are unordered wrt the DMA.
>>
> Yes, agree, it is the guest's problem. So it means that ready_of(A+B)
> is not signaled to guest, the guest should not launch operations on
> (C+D). Right? But here comes the question, if ready not signaled to
> guest, how can guest launch operation on (A+B) again?
It may be evil.
> i.e. although local lock is broken, the (A+B) is still intact when
> re-acquire local lock. So need not to read everything again from the
> device. Wrong?
The device needs to perform according to its specifications. If the
specifications allow for this kind of access, we must ensure it works.
If they don't, we must ensure something sane happens, instead of a qemu
crash or exploit. This means that anything dangerous like pointers must
be revalidated. To be on the safe side, I recommend revalidating (or
reloading) everything, but it may not be necessary in all cases.
>>> What my suggestion is:
>>> lock();
>>> set_and_test(dev->busy);
>>> if busy
>>> unlock and return;
>>> changing device registers;
>>> do other things including calling to c_p_m_rw() //here,lock broken,
>>> but set_and_test() works
>>> clear(dev->busy);
>>> unlock();
>>>
>>> So changing device registers is protected, and unbreakable.
>>
>> But the changes may be legitimate. Maybe you're writing to a completely
>> unrelated register, from a different vcpu, now that write is lost.
>>
> But I think it will mean finer-grained locks for each group of unrelated
> registers, and accordingly, the busy flag should be a bitmap with one
> bit per group.
It's possible. Let's defer the discussion until a concrete case is
before us. It may be that different devices will want different
solutions (though there is value in applying one solution everywhere).
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap
2012-09-20 9:15 ` Avi Kivity
@ 2012-09-21 7:27 ` liu ping fan
0 siblings, 0 replies; 72+ messages in thread
From: liu ping fan @ 2012-09-21 7:27 UTC (permalink / raw)
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel@nongnu.org,
Anthony Liguori, Jan Kiszka
On Thu, Sep 20, 2012 at 5:15 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/20/2012 10:51 AM, liu ping fan wrote:
>> On Wed, Sep 19, 2012 at 5:05 PM, Avi Kivity <avi@redhat.com> wrote:
>>> On 09/19/2012 11:36 AM, liu ping fan wrote:
>>>>>
>>>>> It basically means you can't hold contents of device state in local
>>>>> variables. You need to read everything again from the device. That
>>>>> includes things like DMA enable bits.
>>>>>
>>>> I think that read everything again from the device can not work.
>>>> Suppose the following scene: If the device's state contains the change
>>>> of a series of internal registers (supposing partA+partB;
>>>> partC+partD), after partA changed, the device's lock is broken. At
>>>> this point, another access to this device, it will work on partA+partB
>>>> to determine C+D, but since partB is not correctly updated yet. So C+D
>>>> may be decided by broken context and be wrong.
>>>
>>> That's the guest's problem. Note it could have happened even before the
>>> change, since the writes to A/B/C/D are unordered wrt the DMA.
>>>
>> Yes, agree, it is the guest's problem. So it means that ready_of(A+B)
>> is not signaled to guest, the guest should not launch operations on
>> (C+D). Right? But here comes the question, if ready not signaled to
>> guest, how can guest launch operation on (A+B) again?
>
> It may be evil.
>
>> i.e. although local lock is broken, the (A+B) is still intact when
>> re-acquire local lock. So need not to read everything again from the
>> device. Wrong?
>
> The device needs to perform according to its specifications. If the
> specifications allow for this kind of access, we must ensure it works.
> If they don't, we must ensure something sane happens, instead of a qemu
> crash or exploit. This means that anything dangerous like pointers must
> be revalidated. To be on the safe side, I recommend revalidating (or
> reloading) everything, but it may not be necessary in all cases.
>
Yeah, that captures the two points exactly.
>>>> What my suggestion is:
>>>> lock();
>>>> set_and_test(dev->busy);
>>>> if busy
>>>> unlock and return;
>>>> changing device registers;
>>>> do other things including calling to c_p_m_rw() //here,lock broken,
>>>> but set_and_test() works
>>>> clear(dev->busy);
>>>> unlock();
>>>>
>>>> So changing device registers is protected, and unbreakable.
>>>
>>> But the changes may be legitimate. Maybe you're writing to a completely
>>> unrelated register, from a different vcpu, now that write is lost.
>>>
>> But I think it will mean more-fine locks for each groups of unrelated
>> register, and accordingly, the busy should be bitmap for each group.
>
> It's possible. Let's defer the discussion until a concrete case is
> before us. It may be that different devices will want different
> solutions (though there is value in applying one solution everywhere).
>
Okay, I appreciate the detailed explanation.
Regards,
pingfan
>
> --
> error compiling committee.c: too many arguments to function
end of thread, other threads:[~2012-09-21 7:27 UTC | newest]
Thread overview: 72+ messages
-- links below jump to the message on this page --
2012-09-11 7:51 [Qemu-devel] [PATCH V3 0/10] prepare unplug out of protection of global lock Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 01/11] atomic: introduce atomic operations Liu Ping Fan
2012-09-11 8:04 ` Avi Kivity
2012-09-13 6:54 ` liu ping fan
2012-09-13 8:14 ` Avi Kivity
2012-09-13 8:19 ` Paolo Bonzini
2012-09-13 8:23 ` Avi Kivity
2012-09-13 8:29 ` Paolo Bonzini
2012-09-13 8:45 ` liu ping fan
2012-09-19 13:16 ` Jamie Lokier
2012-09-19 13:32 ` Jamie Lokier
2012-09-19 14:12 ` Peter Maydell
2012-09-19 15:53 ` Jamie Lokier
2012-09-11 8:15 ` Peter Maydell
2012-09-13 6:53 ` liu ping fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 02/11] qom: apply atomic on object's refcount Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 03/11] hotplug: introduce qdev_unplug_complete() to remove device from views Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 04/11] pci: remove pci device from mem view when unplug Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 05/11] memory: introduce ref, unref interface for MemoryRegionOps Liu Ping Fan
2012-09-11 8:08 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 06/11] memory: make mmio dispatch able to be out of biglock Liu Ping Fan
2012-09-11 8:25 ` Avi Kivity
2012-09-11 8:47 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 07/11] memory: implement e1000's MemoryRegionOps ref/unref Liu Ping Fan
2012-09-11 8:37 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 08/11] qom: introduce reclaimer to release obj in async Liu Ping Fan
2012-09-11 8:32 ` Avi Kivity
2012-09-11 9:32 ` liu ping fan
2012-09-11 9:37 ` Avi Kivity
2012-09-13 6:54 ` liu ping fan
2012-09-13 8:45 ` Avi Kivity
2012-09-13 9:59 ` liu ping fan
2012-09-13 10:09 ` Avi Kivity
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 09/11] vcpu: make QemuThread as tls to store thread-self info Liu Ping Fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 10/11] vcpu: introduce lockmap Liu Ping Fan
2012-09-11 8:35 ` Avi Kivity
2012-09-11 9:44 ` liu ping fan
2012-09-11 9:54 ` Avi Kivity
2012-09-11 10:04 ` Jan Kiszka
2012-09-11 11:03 ` Avi Kivity
2012-09-11 11:08 ` Jan Kiszka
2012-09-11 12:20 ` Avi Kivity
2012-09-11 12:25 ` Jan Kiszka
2012-09-11 12:30 ` Avi Kivity
2012-09-11 12:35 ` Jan Kiszka
2012-09-11 12:39 ` Avi Kivity
2012-09-19 4:25 ` Peter Crosthwaite
2012-09-19 4:32 ` Edgar E. Iglesias
2012-09-19 4:40 ` Peter Crosthwaite
2012-09-19 7:55 ` Avi Kivity
2012-09-19 11:46 ` Edgar E. Iglesias
2012-09-19 12:12 ` Avi Kivity
2012-09-19 12:17 ` Edgar E. Iglesias
2012-09-19 13:01 ` Igor Mitsyanko
2012-09-19 13:03 ` Avi Kivity
2012-09-19 7:57 ` Jan Kiszka
2012-09-19 13:07 ` Igor Mitsyanko
2012-09-11 9:57 ` Jan Kiszka
2012-09-11 12:24 ` Avi Kivity
2012-09-11 12:41 ` Avi Kivity
2012-09-11 14:54 ` Marcelo Tosatti
2012-09-13 6:55 ` liu ping fan
2012-09-13 6:55 ` liu ping fan
2012-09-13 8:19 ` Avi Kivity
2012-09-17 2:24 ` liu ping fan
2012-09-19 8:01 ` Avi Kivity
2012-09-19 8:36 ` liu ping fan
2012-09-19 9:05 ` Avi Kivity
2012-09-20 7:51 ` liu ping fan
2012-09-20 9:15 ` Avi Kivity
2012-09-21 7:27 ` liu ping fan
2012-09-11 7:51 ` [Qemu-devel] [PATCH V3 11/11] vcpu: push mmio dispatcher out of big lock Liu Ping Fan