* [PULL 0/9] Migration 20240909 patches
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, Peter Xu, Fabiano Rosas
The following changes since commit f2aee60305a1e40374b2fc1093e4d04404e780ee:
Merge tag 'pull-request-2024-09-08' of https://gitlab.com/huth/qemu into staging (2024-09-09 10:47:24 +0100)
are available in the Git repository at:
https://gitlab.com/peterx/qemu.git tags/migration-20240909-pull-request
for you to fetch changes up to 89bccecdda253c9a1a38921cf9266a4f9655c88c:
system: improve migration debug (2024-09-09 10:55:40 -0400)
----------------------------------------------------------------
Migration pull request for 9.2
- Mattias's patch to support concurrent bounce buffers for PCI devices
- David's memory leak fix in dirty_memory_extend()
- Fabiano's CI fix to disable vmstate-static-checker test in compat tests
- Denis's patch that adds one more trace point for cpu throttle changes
- Yichen's multifd qatzip compressor support
----------------------------------------------------------------
Bryan Zhang (4):
meson: Introduce 'qatzip' feature to the build system
migration: Add migration parameters for QATzip
migration: Introduce 'qatzip' compression method
tests/migration: Add integration test for 'qatzip' compression method
David Hildenbrand (1):
softmmu/physmem: fix memory leak in dirty_memory_extend()
Denis V. Lunev (1):
system: improve migration debug
Fabiano Rosas (1):
ci: migration: Don't run python tests in the compat job
Mattias Nissler (1):
softmmu: Support concurrent bounce buffers
Yuan Liu (1):
docs/migration: add qatzip compression feature
docs/devel/migration/features.rst | 1 +
docs/devel/migration/qatzip-compression.rst | 165 ++++++++
meson.build | 10 +
qapi/migration.json | 21 ++
include/exec/memory.h | 14 +-
include/exec/ramlist.h | 1 +
include/hw/pci/pci_device.h | 3 +
migration/multifd.h | 5 +-
migration/options.h | 1 +
hw/core/qdev-properties-system.c | 2 +-
hw/pci/pci.c | 8 +
migration/migration-hmp-cmds.c | 4 +
migration/multifd-qatzip.c | 394 ++++++++++++++++++++
migration/options.c | 34 ++
system/cpu-throttle.c | 3 +
system/memory.c | 5 +-
system/physmem.c | 117 +++---
tests/qtest/migration-test.c | 27 ++
.gitlab-ci.d/buildtest.yml | 8 +
meson_options.txt | 2 +
migration/meson.build | 1 +
scripts/meson-buildoptions.sh | 3 +
system/trace-events | 3 +
23 files changed, 767 insertions(+), 65 deletions(-)
create mode 100644 docs/devel/migration/qatzip-compression.rst
create mode 100644 migration/multifd-qatzip.c
--
2.45.0
* [PULL 1/9] softmmu: Support concurrent bounce buffers
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Mattias Nissler,
Philippe Mathieu-Daudé
From: Mattias Nissler <mnissler@rivosinc.com>
When DMA memory can't be directly accessed, as is the case when
running the device model in a separate process without shareable DMA
file descriptors, bounce buffering is used.
It is not uncommon for device models to request mapping of several DMA
regions at the same time. Examples include:
* net devices, e.g. when transmitting a packet that is split across
several TX descriptors (observed with igb)
* USB host controllers, when handling a packet with multiple data TRBs
(observed with xhci)
Previously, QEMU provided only a single bounce buffer per AddressSpace
and would fail DMA map requests while the buffer was already in use. In
turn, this caused DMA failures that ultimately manifested as hardware
errors from the guest's perspective.
This change allocates DMA bounce buffers dynamically instead of
supporting only a single buffer. Thus, multiple DMA mappings work
correctly even when RAM can't be mmap()-ed.
The total bounce buffer allocation size is limited individually for each
AddressSpace. The default limit is 4096 bytes, matching the previous
maximum buffer size. A new x-max-bounce-buffer-size parameter is
provided to configure the limit for PCI devices.
Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Peter Xu <peterx@redhat.com>
Link: https://lore.kernel.org/r/20240819135455.2957406-1-mnissler@rivosinc.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/exec/memory.h | 14 +++----
include/hw/pci/pci_device.h | 3 ++
hw/pci/pci.c | 8 ++++
system/memory.c | 5 ++-
system/physmem.c | 82 ++++++++++++++++++++++++++-----------
5 files changed, 76 insertions(+), 36 deletions(-)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 296fd068c0..e5e865d1a9 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1084,13 +1084,7 @@ typedef struct AddressSpaceMapClient {
QLIST_ENTRY(AddressSpaceMapClient) link;
} AddressSpaceMapClient;
-typedef struct {
- MemoryRegion *mr;
- void *buffer;
- hwaddr addr;
- hwaddr len;
- bool in_use;
-} BounceBuffer;
+#define DEFAULT_MAX_BOUNCE_BUFFER_SIZE (4096)
/**
* struct AddressSpace: describes a mapping of addresses to #MemoryRegion objects
@@ -1110,8 +1104,10 @@ struct AddressSpace {
QTAILQ_HEAD(, MemoryListener) listeners;
QTAILQ_ENTRY(AddressSpace) address_spaces_link;
- /* Bounce buffer to use for this address space. */
- BounceBuffer bounce;
+ /* Maximum DMA bounce buffer size used for indirect memory map requests */
+ size_t max_bounce_buffer_size;
+ /* Total size of bounce buffers currently allocated, atomically accessed */
+ size_t bounce_buffer_size;
/* List of callbacks to invoke when buffers free up */
QemuMutex map_client_list_lock;
QLIST_HEAD(, AddressSpaceMapClient) map_client_list;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index 15694f2489..91df40f989 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -167,6 +167,9 @@ struct PCIDevice {
/* ID of standby device in net_failover pair */
char *failover_pair_id;
uint32_t acpi_index;
+
+ /* Maximum DMA bounce buffer size used for indirect memory map requests */
+ uint32_t max_bounce_buffer_size;
};
static inline int pci_intx(PCIDevice *pci_dev)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fab86d0567..d2caf3ee8b 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -85,6 +85,8 @@ static Property pci_props[] = {
QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
DEFINE_PROP_BIT("x-pcie-ari-nextfn-1", PCIDevice, cap_present,
QEMU_PCIE_ARI_NEXTFN_1_BITNR, false),
+ DEFINE_PROP_SIZE32("x-max-bounce-buffer-size", PCIDevice,
+ max_bounce_buffer_size, DEFAULT_MAX_BOUNCE_BUFFER_SIZE),
DEFINE_PROP_END_OF_LIST()
};
@@ -1204,6 +1206,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
"bus master container", UINT64_MAX);
address_space_init(&pci_dev->bus_master_as,
&pci_dev->bus_master_container_region, pci_dev->name);
+ pci_dev->bus_master_as.max_bounce_buffer_size =
+ pci_dev->max_bounce_buffer_size;
if (phase_check(PHASE_MACHINE_READY)) {
pci_init_bus_master(pci_dev);
@@ -2633,6 +2637,10 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
k->unrealize = pci_qdev_unrealize;
k->bus_type = TYPE_PCI_BUS;
device_class_set_props(k, pci_props);
+ object_class_property_set_description(
+ klass, "x-max-bounce-buffer-size",
+ "Maximum buffer size allocated for bounce buffers used for mapped "
+ "access to indirect DMA memory");
}
static void pci_device_class_base_init(ObjectClass *klass, void *data)
diff --git a/system/memory.c b/system/memory.c
index 5e6eb459d5..f6f6fee6d8 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3148,7 +3148,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
as->ioeventfds = NULL;
QTAILQ_INIT(&as->listeners);
QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
- as->bounce.in_use = false;
+ as->max_bounce_buffer_size = DEFAULT_MAX_BOUNCE_BUFFER_SIZE;
+ as->bounce_buffer_size = 0;
qemu_mutex_init(&as->map_client_list_lock);
QLIST_INIT(&as->map_client_list);
as->name = g_strdup(name ? name : "anonymous");
@@ -3158,7 +3159,7 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name)
static void do_address_space_destroy(AddressSpace *as)
{
- assert(!qatomic_read(&as->bounce.in_use));
+ assert(qatomic_read(&as->bounce_buffer_size) == 0);
assert(QLIST_EMPTY(&as->map_client_list));
qemu_mutex_destroy(&as->map_client_list_lock);
diff --git a/system/physmem.c b/system/physmem.c
index 94600a33ec..971bfa0855 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3095,6 +3095,20 @@ void cpu_flush_icache_range(hwaddr start, hwaddr len)
NULL, len, FLUSH_CACHE);
}
+/*
+ * A magic value stored in the first 8 bytes of the bounce buffer struct. Used
+ * to detect illegal pointers passed to address_space_unmap.
+ */
+#define BOUNCE_BUFFER_MAGIC 0xb4017ceb4ffe12ed
+
+typedef struct {
+ uint64_t magic;
+ MemoryRegion *mr;
+ hwaddr addr;
+ size_t len;
+ uint8_t buffer[];
+} BounceBuffer;
+
static void
address_space_unregister_map_client_do(AddressSpaceMapClient *client)
{
@@ -3120,9 +3134,9 @@ void address_space_register_map_client(AddressSpace *as, QEMUBH *bh)
QEMU_LOCK_GUARD(&as->map_client_list_lock);
client->bh = bh;
QLIST_INSERT_HEAD(&as->map_client_list, client, link);
- /* Write map_client_list before reading in_use. */
+ /* Write map_client_list before reading bounce_buffer_size. */
smp_mb();
- if (!qatomic_read(&as->bounce.in_use)) {
+ if (qatomic_read(&as->bounce_buffer_size) < as->max_bounce_buffer_size) {
address_space_notify_map_clients_locked(as);
}
}
@@ -3251,28 +3265,40 @@ void *address_space_map(AddressSpace *as,
mr = flatview_translate(fv, addr, &xlat, &l, is_write, attrs);
if (!memory_access_is_direct(mr, is_write)) {
- if (qatomic_xchg(&as->bounce.in_use, true)) {
+ size_t used = qatomic_read(&as->bounce_buffer_size);
+ for (;;) {
+ hwaddr alloc = MIN(as->max_bounce_buffer_size - used, l);
+ size_t new_size = used + alloc;
+ size_t actual =
+ qatomic_cmpxchg(&as->bounce_buffer_size, used, new_size);
+ if (actual == used) {
+ l = alloc;
+ break;
+ }
+ used = actual;
+ }
+
+ if (l == 0) {
*plen = 0;
return NULL;
}
- /* Avoid unbounded allocations */
- l = MIN(l, TARGET_PAGE_SIZE);
- as->bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, l);
- as->bounce.addr = addr;
- as->bounce.len = l;
+ BounceBuffer *bounce = g_malloc0(l + sizeof(BounceBuffer));
+ bounce->magic = BOUNCE_BUFFER_MAGIC;
memory_region_ref(mr);
- as->bounce.mr = mr;
+ bounce->mr = mr;
+ bounce->addr = addr;
+ bounce->len = l;
+
if (!is_write) {
flatview_read(fv, addr, MEMTXATTRS_UNSPECIFIED,
- as->bounce.buffer, l);
+ bounce->buffer, l);
}
*plen = l;
- return as->bounce.buffer;
+ return bounce->buffer;
}
-
memory_region_ref(mr);
*plen = flatview_extend_translation(fv, addr, len, mr, xlat,
l, is_write, attrs);
@@ -3287,12 +3313,11 @@ void *address_space_map(AddressSpace *as,
void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
bool is_write, hwaddr access_len)
{
- if (buffer != as->bounce.buffer) {
- MemoryRegion *mr;
- ram_addr_t addr1;
+ MemoryRegion *mr;
+ ram_addr_t addr1;
- mr = memory_region_from_host(buffer, &addr1);
- assert(mr != NULL);
+ mr = memory_region_from_host(buffer, &addr1);
+ if (mr != NULL) {
if (is_write) {
invalidate_and_set_dirty(mr, addr1, access_len);
}
@@ -3302,15 +3327,22 @@ void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len,
memory_region_unref(mr);
return;
}
+
+
+ BounceBuffer *bounce = container_of(buffer, BounceBuffer, buffer);
+ assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
+
if (is_write) {
- address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
- as->bounce.buffer, access_len);
- }
- qemu_vfree(as->bounce.buffer);
- as->bounce.buffer = NULL;
- memory_region_unref(as->bounce.mr);
- /* Clear in_use before reading map_client_list. */
- qatomic_set_mb(&as->bounce.in_use, false);
+ address_space_write(as, bounce->addr, MEMTXATTRS_UNSPECIFIED,
+ bounce->buffer, access_len);
+ }
+
+ qatomic_sub(&as->bounce_buffer_size, bounce->len);
+ bounce->magic = ~BOUNCE_BUFFER_MAGIC;
+ memory_region_unref(bounce->mr);
+ g_free(bounce);
+ /* Write bounce_buffer_size before reading map_client_list. */
+ smp_mb();
address_space_notify_map_clients(as);
}
--
2.45.0
* [PULL 2/9] softmmu/physmem: fix memory leak in dirty_memory_extend()
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, David Hildenbrand,
Stefan Hajnoczi, qemu-stable, Paolo Bonzini,
Philippe Mathieu-Daudé
From: David Hildenbrand <david@redhat.com>
As reported by Peter, we might be leaking memory when removing the
highest RAMBlock (in the weird ram_addr_t space), and adding a new one.
We will fail to realize that we already allocated bitmaps for more
dirty memory blocks, and effectively discard the pointers to them.
Fix it by getting rid of last_ram_page() and by remembering the number
of dirty memory blocks that have been allocated already.
While at it, let's use "unsigned int" for the number of blocks, which
should be sufficient until we reach ~32 exabytes.
Looks like this leak was introduced as we switched from using a single
bitmap_zero_extend() to allocating multiple bitmaps:
bitmap_zero_extend() relies on g_renew() which should have taken care of
this.
Resolves: https://lkml.kernel.org/r/CAFEAcA-k7a+VObGAfCFNygQNfCKL=AfX6A4kScq=VSSK0peqPg@mail.gmail.com
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Fixes: 5b82b703b69a ("memory: RCU ram_list.dirty_memory[] for safe RAM hotplug")
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Maydell <peter.maydell@linaro.org>
Cc: qemu-stable@nongnu.org
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240828090743.128647-1-david@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/exec/ramlist.h | 1 +
system/physmem.c | 35 +++++++++--------------------------
2 files changed, 10 insertions(+), 26 deletions(-)
diff --git a/include/exec/ramlist.h b/include/exec/ramlist.h
index 2ad2a81acc..d9cfe530be 100644
--- a/include/exec/ramlist.h
+++ b/include/exec/ramlist.h
@@ -50,6 +50,7 @@ typedef struct RAMList {
/* RCU-enabled, writes protected by the ramlist lock. */
QLIST_HEAD(, RAMBlock) blocks;
DirtyMemoryBlocks *dirty_memory[DIRTY_MEMORY_NUM];
+ unsigned int num_dirty_blocks;
uint32_t version;
QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
} RAMList;
diff --git a/system/physmem.c b/system/physmem.c
index 971bfa0855..d71a2b1bbd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1534,18 +1534,6 @@ static ram_addr_t find_ram_offset(ram_addr_t size)
return offset;
}
-static unsigned long last_ram_page(void)
-{
- RAMBlock *block;
- ram_addr_t last = 0;
-
- RCU_READ_LOCK_GUARD();
- RAMBLOCK_FOREACH(block) {
- last = MAX(last, block->offset + block->max_length);
- }
- return last >> TARGET_PAGE_BITS;
-}
-
static void qemu_ram_setup_dump(void *addr, ram_addr_t size)
{
int ret;
@@ -1799,13 +1787,11 @@ void qemu_ram_msync(RAMBlock *block, ram_addr_t start, ram_addr_t length)
}
/* Called with ram_list.mutex held */
-static void dirty_memory_extend(ram_addr_t old_ram_size,
- ram_addr_t new_ram_size)
+static void dirty_memory_extend(ram_addr_t new_ram_size)
{
- ram_addr_t old_num_blocks = DIV_ROUND_UP(old_ram_size,
- DIRTY_MEMORY_BLOCK_SIZE);
- ram_addr_t new_num_blocks = DIV_ROUND_UP(new_ram_size,
- DIRTY_MEMORY_BLOCK_SIZE);
+ unsigned int old_num_blocks = ram_list.num_dirty_blocks;
+ unsigned int new_num_blocks = DIV_ROUND_UP(new_ram_size,
+ DIRTY_MEMORY_BLOCK_SIZE);
int i;
/* Only need to extend if block count increased */
@@ -1837,6 +1823,8 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
g_free_rcu(old_blocks, rcu);
}
}
+
+ ram_list.num_dirty_blocks = new_num_blocks;
}
static void ram_block_add(RAMBlock *new_block, Error **errp)
@@ -1846,11 +1834,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
RAMBlock *block;
RAMBlock *last_block = NULL;
bool free_on_error = false;
- ram_addr_t old_ram_size, new_ram_size;
+ ram_addr_t ram_size;
Error *err = NULL;
- old_ram_size = last_ram_page();
-
qemu_mutex_lock_ramlist();
new_block->offset = find_ram_offset(new_block->max_length);
@@ -1901,11 +1887,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
}
}
- new_ram_size = MAX(old_ram_size,
- (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
- if (new_ram_size > old_ram_size) {
- dirty_memory_extend(old_ram_size, new_ram_size);
- }
+ ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
+ dirty_memory_extend(ram_size);
/* Keep the list sorted from biggest to smallest block. Unlike QTAILQ,
* QLIST (which has an RCU-friendly variant) does not have insertion at
* tail, so save the last element in last_block.
--
2.45.0
* [PULL 3/9] ci: migration: Don't run python tests in the compat job
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Thomas Huth
From: Fabiano Rosas <farosas@suse.de>
The vmstate-checker-script test has a bug that makes it flaky. It was
also committed by mistake and will be removed.
Since the migration-compat job takes the tests from the build-previous
job instead of the current HEAD, neither a fix nor a removal of the
test will take effect for this release.
Disable the faulty/undesirable test by taking advantage of the fact
that it only runs if the PYTHON environment variable is set. This also
disables the
analyze-migration-script test, but this is fine because that test
doesn't have migration compatibility implications.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/r/20240905185445.8179-1-farosas@suse.de
[peterx: Added a TODO to remove the line after 9.2 release, per thuth]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
.gitlab-ci.d/buildtest.yml | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index 1d2afae996..cfc51be08a 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -212,6 +212,14 @@ build-previous-qemu:
# testing an old QEMU against new features/tests that it is not
# compatible with.
- cd build-previous
+ # Don't allow python-based tests to run. The
+ # vmstate-checker-script test has a race that causes it to fail
+ # sometimes. It cannot be fixed because this job runs the test
+ # from the old QEMU version. The test will be removed on master,
+ # but this job will only see the change in the next release.
+ #
+ # TODO: remove this line after 9.2 release
+ - unset PYTHON
# old to new
- QTEST_QEMU_BINARY_SRC=./qemu-system-${TARGET}
QTEST_QEMU_BINARY=../build/qemu-system-${TARGET} ./tests/qtest/migration-test
--
2.45.0
* [PULL 4/9] docs/migration: add qatzip compression feature
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Yuan Liu, Nanhai Zou,
Yichen Wang
From: Yuan Liu <yuan1.liu@intel.com>
Add an introduction to the Intel QATzip compression method.
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Link: https://lore.kernel.org/r/20240830232722.58272-2-yichen.wang@bytedance.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
docs/devel/migration/features.rst | 1 +
docs/devel/migration/qatzip-compression.rst | 165 ++++++++++++++++++++
2 files changed, 166 insertions(+)
create mode 100644 docs/devel/migration/qatzip-compression.rst
diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
index 58f8fd9e16..8f431d52f9 100644
--- a/docs/devel/migration/features.rst
+++ b/docs/devel/migration/features.rst
@@ -14,3 +14,4 @@ Migration has plenty of features to support different use cases.
CPR
qpl-compression
uadk-compression
+ qatzip-compression
diff --git a/docs/devel/migration/qatzip-compression.rst b/docs/devel/migration/qatzip-compression.rst
new file mode 100644
index 0000000000..862b383164
--- /dev/null
+++ b/docs/devel/migration/qatzip-compression.rst
@@ -0,0 +1,165 @@
+==================
+QATzip Compression
+==================
+In scenarios with limited network bandwidth, the ``QATzip`` solution can help
+users save a lot of host CPU resources by accelerating compression and
+decompression through the Intel QuickAssist Technology (``QAT``) hardware.
+
+
+The following test was conducted using 8 multifd channels and 10Gbps network
+bandwidth. The results show that, compared to zstd, ``QATzip`` significantly
+saves CPU resources on the sender and reduces migration time. Compared to the
+uncompressed solution, ``QATzip`` greatly improves the dirty page processing
+capability, indicated by the Pages per Second metric, and also reduces the
+total migration time.
+
+::
+
+ VM Configuration: 16 vCPU and 64G memory
+ VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
+ QAT Devices: 4
+ |-----------|--------|---------|----------|----------|------|------|
+ |8 Channels |Total |down |throughput|pages per | send | recv |
+ | |time(ms)|time(ms) |(mbps) |second | cpu %| cpu% |
+ |-----------|--------|---------|----------|----------|------|------|
+ |qatzip | 16630| 28| 10467| 2940235| 160| 360|
+ |-----------|--------|---------|----------|----------|------|------|
+ |zstd | 20165| 24| 8579| 2391465| 810| 340|
+ |-----------|--------|---------|----------|----------|------|------|
+ |none | 46063| 40| 10848| 330240| 45| 85|
+ |-----------|--------|---------|----------|----------|------|------|
+
+
+QATzip Compression Framework
+============================
+
+``QATzip`` is a user space library which builds on top of the Intel QuickAssist
+Technology to provide extended accelerated compression and decompression
+services.
+
+For more ``QATzip`` introduction, please refer to `QATzip Introduction
+<https://github.com/intel/QATzip?tab=readme-ov-file#introduction>`_
+
+::
+
+ +----------------+
+ | MultiFd Thread |
+ +-------+--------+
+ |
+ | compress/decompress
+ +-------+--------+
+ | QATzip library |
+ +-------+--------+
+ |
+ +-------+--------+
+ | QAT library |
+ +-------+--------+
+ | user space
+ --------+---------------------
+ | kernel space
+ +------+-------+
+ | QAT Driver |
+ +------+-------+
+ |
+ +------+-------+
+ | QAT Devices |
+ +--------------+
+
+
+QATzip Installation
+-------------------
+
+The ``QATzip`` installation package has been integrated into some Linux
+distributions and can be installed directly. For example, on an Ubuntu
+Server 24.04 LTS system it can be installed using the commands below:
+
+.. code-block:: shell
+
+ #apt search qatzip
+ libqatzip-dev/noble 1.2.0-0ubuntu3 amd64
+ Intel QuickAssist user space library development files
+
+ libqatzip3/noble 1.2.0-0ubuntu3 amd64
+ Intel QuickAssist user space library
+
+ qatzip/noble,now 1.2.0-0ubuntu3 amd64 [installed]
+ Compression user-space tool for Intel QuickAssist Technology
+
+ #sudo apt install libqatzip-dev libqatzip3 qatzip
+
+If your system does not support the ``QATzip`` installation package, you can
+use the source code to build and install, please refer to `QATzip source code installation
+<https://github.com/intel/QATzip?tab=readme-ov-file#build-intel-quickassist-technology-driver>`_
+
+QAT Hardware Deployment
+-----------------------
+
+``QAT`` supports physical functions (PFs) and virtual functions (VFs) for
+deployment, and users can configure ``QAT`` resources for migration according
+to actual needs. For more details about ``QAT`` deployment, please refer to
+`Intel QuickAssist Technology Documentation
+<https://intel.github.io/quickassist/index.html>`_
+
+For more ``QAT`` hardware introduction, please refer to `intel-quick-assist-technology-overview
+<https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html>`_
+
+How To Use QATzip Compression
+=============================
+
+1 - Install ``QATzip`` library
+
+2 - Build ``QEMU`` with ``--enable-qatzip`` parameter
+
+ E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-qatzip``
+
+3 - Set ``migrate_set_parameter multifd-compression qatzip``
+
+4 - Set ``migrate_set_parameter multifd-qatzip-level comp_level``, the default
+comp_level value is 1, and it supports levels from 1 to 9
+
+QAT Memory Requirements
+=======================
+
+The user needs to reserve system memory for the QAT memory management to
+allocate DMA memory. The size of the reserved system memory depends on the
+number of devices used for migration and the number of multifd channels.
+
+Because memory usage depends on QAT configuration, please refer to `QAT Memory
+Driver Queries
+<https://intel.github.io/quickassist/PG/infrastructure_debugability.html?highlight=memory>`_
+for memory usage calculation.
+
+.. list-table:: An example of a PF used for migration
+ :header-rows: 1
+
+ * - Number of channels
+ - Sender memory usage
+ - Receiver memory usage
+ * - 2
+ - 10M
+ - 10M
+ * - 4
+ - 12M
+ - 14M
+ * - 8
+ - 16M
+ - 20M
+
+How To Choose Between QATzip and QPL
+====================================
+Starting from the 4th Gen Intel Xeon Scalable processor, codenamed Sapphire
+Rapids (``SPR``), multiple built-in accelerators are supported, including
+``QAT`` and ``IAA``. The former can accelerate ``QATzip`` and the latter is
+used to accelerate ``QPL``.
+
+Here are some suggestions:
+
+1 - If the live migration scenario is limited by network bandwidth and ``QAT``
+hardware resources exceed ``IAA``, use the ``QATzip`` method, which can save a
+lot of host CPU resources for compression.
+
+2 - If the system cannot support shared virtual memory (SVM) technology, use
+the ``QATzip`` method because ``QPL`` performance is not good without SVM
+support.
+
+3 - For other scenarios, use the ``QPL`` method first.
--
2.45.0
* [PULL 5/9] meson: Introduce 'qatzip' feature to the build system
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Bryan Zhang, Hao Xiang,
Yichen Wang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Add a 'qatzip' feature, which is automatically disabled, and which
depends on the QATzip library if enabled.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Link: https://lore.kernel.org/r/20240830232722.58272-3-yichen.wang@bytedance.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
meson.build | 10 ++++++++++
meson_options.txt | 2 ++
scripts/meson-buildoptions.sh | 3 +++
3 files changed, 15 insertions(+)
diff --git a/meson.build b/meson.build
index fbda17c987..b89b713e79 100644
--- a/meson.build
+++ b/meson.build
@@ -1262,6 +1262,14 @@ if not get_option('uadk').auto() or have_system
uadk = declare_dependency(dependencies: [libwd, libwd_comp])
endif
endif
+
+qatzip = not_found
+if not get_option('qatzip').auto() or have_system
+ qatzip = dependency('qatzip', version: '>=1.1.2',
+ required: get_option('qatzip'),
+ method: 'pkg-config')
+endif
+
virgl = not_found
have_vhost_user_gpu = have_tools and host_os == 'linux' and pixman.found()
@@ -2412,6 +2420,7 @@ config_host_data.set('CONFIG_STATX_MNT_ID', has_statx_mnt_id)
config_host_data.set('CONFIG_ZSTD', zstd.found())
config_host_data.set('CONFIG_QPL', qpl.found())
config_host_data.set('CONFIG_UADK', uadk.found())
+config_host_data.set('CONFIG_QATZIP', qatzip.found())
config_host_data.set('CONFIG_FUSE', fuse.found())
config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
config_host_data.set('CONFIG_SPICE_PROTOCOL', spice_protocol.found())
@@ -4535,6 +4544,7 @@ summary_info += {'lzfse support': liblzfse}
summary_info += {'zstd support': zstd}
summary_info += {'Query Processing Library support': qpl}
summary_info += {'UADK Library support': uadk}
+summary_info += {'qatzip support': qatzip}
summary_info += {'NUMA host support': numa}
summary_info += {'capstone': capstone}
summary_info += {'libpmem support': libpmem}
diff --git a/meson_options.txt b/meson_options.txt
index 0269fa0f16..f7b652b30d 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -261,6 +261,8 @@ option('qpl', type : 'feature', value : 'auto',
description: 'Query Processing Library support')
option('uadk', type : 'feature', value : 'auto',
description: 'UADK Library support')
+option('qatzip', type: 'feature', value: 'auto',
+ description: 'QATzip compression support')
option('fuse', type: 'feature', value: 'auto',
description: 'FUSE block device export')
option('fuse_lseek', type : 'feature', value : 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index c97079a38c..5f377a6d81 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -163,6 +163,7 @@ meson_options_help() {
printf "%s\n" ' pixman pixman support'
printf "%s\n" ' plugins TCG plugins via shared library loading'
printf "%s\n" ' png PNG support with libpng'
+ printf "%s\n" ' qatzip QATzip compression support'
printf "%s\n" ' qcow1 qcow1 image format support'
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
@@ -427,6 +428,8 @@ _meson_option_parse() {
--enable-png) printf "%s" -Dpng=enabled ;;
--disable-png) printf "%s" -Dpng=disabled ;;
--prefix=*) quote_sh "-Dprefix=$2" ;;
+ --enable-qatzip) printf "%s" -Dqatzip=enabled ;;
+ --disable-qatzip) printf "%s" -Dqatzip=disabled ;;
--enable-qcow1) printf "%s" -Dqcow1=enabled ;;
--disable-qcow1) printf "%s" -Dqcow1=disabled ;;
--enable-qed) printf "%s" -Dqed=enabled ;;
--
2.45.0
* [PULL 6/9] migration: Add migration parameters for QATzip
2024-09-09 20:11 [PULL 0/9] Migration 20240909 patches Peter Xu
` (4 preceding siblings ...)
2024-09-09 20:11 ` [PULL 5/9] meson: Introduce 'qatzip' feature to the build system Peter Xu
@ 2024-09-09 20:11 ` Peter Xu
2024-09-09 20:11 ` [PULL 7/9] migration: Introduce 'qatzip' compression method Peter Xu
` (3 subsequent siblings)
9 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Bryan Zhang,
Markus Armbruster, Hao Xiang, Yichen Wang, Prasad Pandit
From: Bryan Zhang <bryan.zhang@bytedance.com>
Adds a migration parameter to control the QATzip compression
level.
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20240830232722.58272-4-yichen.wang@bytedance.com
Signed-off-by: Peter Xu <peterx@redhat.com>
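For context (not part of the patch), the new parameter would then be settable at runtime via QMP like the other multifd tunables; a sketch, assuming a QEMU binary built with QATzip support:

```json
{ "execute": "migrate-set-parameters",
  "arguments": { "multifd-qatzip-level": 9 } }
```

The equivalent HMP form is handled by the hmp_migrate_set_parameter() hunk below.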
---
qapi/migration.json | 18 ++++++++++++++++++
migration/options.h | 1 +
migration/migration-hmp-cmds.c | 4 ++++
migration/options.c | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 57 insertions(+)
diff --git a/qapi/migration.json b/qapi/migration.json
index 7324571e92..f4c27426c8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -792,6 +792,11 @@
# speed, and 9 means best compression ratio which will consume
# more CPU. Defaults to 1. (Since 5.0)
#
+# @multifd-qatzip-level: Set the compression level to be used in live
+# migration. The level is an integer between 1 and 9, where 1 means
+# the best compression speed, and 9 means the best compression
+# ratio which will consume more CPU. Defaults to 1. (Since 9.2)
+#
# @multifd-zstd-level: Set the compression level to be used in live
# migration, the compression level is an integer between 0 and 20,
# where 0 means no compression, 1 means the best compression
@@ -852,6 +857,7 @@
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
'multifd-zlib-level', 'multifd-zstd-level',
+ 'multifd-qatzip-level',
'block-bitmap-mapping',
{ 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
'vcpu-dirty-limit',
@@ -967,6 +973,11 @@
# speed, and 9 means best compression ratio which will consume
# more CPU. Defaults to 1. (Since 5.0)
#
+# @multifd-qatzip-level: Set the compression level to be used in live
+# migration. The level is an integer between 1 and 9, where 1 means
+# the best compression speed, and 9 means the best compression
+# ratio which will consume more CPU. Defaults to 1. (Since 9.2)
+#
# @multifd-zstd-level: Set the compression level to be used in live
# migration, the compression level is an integer between 0 and 20,
# where 0 means no compression, 1 means the best compression
@@ -1040,6 +1051,7 @@
'*max-cpu-throttle': 'uint8',
'*multifd-compression': 'MultiFDCompression',
'*multifd-zlib-level': 'uint8',
+ '*multifd-qatzip-level': 'uint8',
'*multifd-zstd-level': 'uint8',
'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
@@ -1171,6 +1183,11 @@
# speed, and 9 means best compression ratio which will consume
# more CPU. Defaults to 1. (Since 5.0)
#
+# @multifd-qatzip-level: Set the compression level to be used in live
+# migration. The level is an integer between 1 and 9, where 1 means
+# the best compression speed, and 9 means the best compression
+# ratio which will consume more CPU. Defaults to 1. (Since 9.2)
+#
# @multifd-zstd-level: Set the compression level to be used in live
# migration, the compression level is an integer between 0 and 20,
# where 0 means no compression, 1 means the best compression
@@ -1241,6 +1258,7 @@
'*max-cpu-throttle': 'uint8',
'*multifd-compression': 'MultiFDCompression',
'*multifd-zlib-level': 'uint8',
+ '*multifd-qatzip-level': 'uint8',
'*multifd-zstd-level': 'uint8',
'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
diff --git a/migration/options.h b/migration/options.h
index a2397026db..a0bd6edc06 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -78,6 +78,7 @@ uint64_t migrate_max_postcopy_bandwidth(void);
int migrate_multifd_channels(void);
MultiFDCompression migrate_multifd_compression(void);
int migrate_multifd_zlib_level(void);
+int migrate_multifd_qatzip_level(void);
int migrate_multifd_zstd_level(void);
uint8_t migrate_throttle_trigger_threshold(void);
const char *migrate_tls_authz(void);
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7d608d26e1..28165cfc9e 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -576,6 +576,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_multifd_zlib_level = true;
visit_type_uint8(v, param, &p->multifd_zlib_level, &err);
break;
+ case MIGRATION_PARAMETER_MULTIFD_QATZIP_LEVEL:
+ p->has_multifd_qatzip_level = true;
+ visit_type_uint8(v, param, &p->multifd_qatzip_level, &err);
+ break;
case MIGRATION_PARAMETER_MULTIFD_ZSTD_LEVEL:
p->has_multifd_zstd_level = true;
visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
diff --git a/migration/options.c b/migration/options.c
index 645f55003d..147cd2b8fd 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -55,6 +55,13 @@
#define DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE
/* 0: means nocompress, 1: best speed, ... 9: best compress ratio */
#define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1
+/*
+ * 1: best speed, ... 9: best compress ratio
+ * There is some nuance here. Refer to QATzip documentation to understand
+ * the mapping of QATzip levels to standard deflate levels.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_QATZIP_LEVEL 1
+
/* 0: means nocompress, 1: best speed, ... 20: best compress ratio */
#define DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL 1
@@ -123,6 +130,9 @@ Property migration_properties[] = {
DEFINE_PROP_UINT8("multifd-zlib-level", MigrationState,
parameters.multifd_zlib_level,
DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL),
+ DEFINE_PROP_UINT8("multifd-qatzip-level", MigrationState,
+ parameters.multifd_qatzip_level,
+ DEFAULT_MIGRATE_MULTIFD_QATZIP_LEVEL),
DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
parameters.multifd_zstd_level,
DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
@@ -787,6 +797,13 @@ int migrate_multifd_zlib_level(void)
return s->parameters.multifd_zlib_level;
}
+int migrate_multifd_qatzip_level(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->parameters.multifd_qatzip_level;
+}
+
int migrate_multifd_zstd_level(void)
{
MigrationState *s = migrate_get_current();
@@ -892,6 +909,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->multifd_compression = s->parameters.multifd_compression;
params->has_multifd_zlib_level = true;
params->multifd_zlib_level = s->parameters.multifd_zlib_level;
+ params->has_multifd_qatzip_level = true;
+ params->multifd_qatzip_level = s->parameters.multifd_qatzip_level;
params->has_multifd_zstd_level = true;
params->multifd_zstd_level = s->parameters.multifd_zstd_level;
params->has_xbzrle_cache_size = true;
@@ -946,6 +965,7 @@ void migrate_params_init(MigrationParameters *params)
params->has_multifd_channels = true;
params->has_multifd_compression = true;
params->has_multifd_zlib_level = true;
+ params->has_multifd_qatzip_level = true;
params->has_multifd_zstd_level = true;
params->has_xbzrle_cache_size = true;
params->has_max_postcopy_bandwidth = true;
@@ -1038,6 +1058,14 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
return false;
}
+ if (params->has_multifd_qatzip_level &&
+ ((params->multifd_qatzip_level > 9) ||
+ (params->multifd_qatzip_level < 1))) {
+ error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "multifd_qatzip_level",
+ "a value between 1 and 9");
+ return false;
+ }
+
if (params->has_multifd_zstd_level &&
(params->multifd_zstd_level > 20)) {
error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "multifd_zstd_level",
@@ -1195,6 +1223,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_multifd_compression) {
dest->multifd_compression = params->multifd_compression;
}
+ if (params->has_multifd_qatzip_level) {
+ dest->multifd_qatzip_level = params->multifd_qatzip_level;
+ }
if (params->has_multifd_zlib_level) {
dest->multifd_zlib_level = params->multifd_zlib_level;
}
@@ -1315,6 +1346,9 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_multifd_compression) {
s->parameters.multifd_compression = params->multifd_compression;
}
+ if (params->has_multifd_qatzip_level) {
+ s->parameters.multifd_qatzip_level = params->multifd_qatzip_level;
+ }
if (params->has_multifd_zlib_level) {
s->parameters.multifd_zlib_level = params->multifd_zlib_level;
}
--
2.45.0
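As a quick illustration (a sketch, not QEMU code), the bounds check this patch adds to migrate_params_check() accepts exactly the 1..9 range documented in the QAPI schema:

```python
def multifd_qatzip_level_valid(level: int) -> bool:
    """Mirror of the 1..9 range check added to migrate_params_check()."""
    return 1 <= level <= 9

# Values outside the range are rejected with QERR_INVALID_PARAMETER_VALUE
# in the real code; here we just return False.
assert multifd_qatzip_level_valid(1)
assert multifd_qatzip_level_valid(9)
assert not multifd_qatzip_level_valid(0)
assert not multifd_qatzip_level_valid(10)
```

Note the contrast with multifd-zstd-level, where 0 (no compression) is a valid value.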
* [PULL 7/9] migration: Introduce 'qatzip' compression method
2024-09-09 20:11 [PULL 0/9] Migration 20240909 patches Peter Xu
` (5 preceding siblings ...)
2024-09-09 20:11 ` [PULL 6/9] migration: Add migration parameters for QATzip Peter Xu
@ 2024-09-09 20:11 ` Peter Xu
2024-09-09 20:11 ` [PULL 8/9] tests/migration: Add integration test for " Peter Xu
` (2 subsequent siblings)
9 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Bryan Zhang,
Markus Armbruster, Prasad Pandit, Hao Xiang, Yichen Wang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Adds support for 'qatzip' as an option for the multifd compression
method parameter, and implements QAT-accelerated 'qatzip' compression
and decompression.
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Link: https://lore.kernel.org/r/20240830232722.58272-5-yichen.wang@bytedance.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
qapi/migration.json | 3 +
migration/multifd.h | 5 +-
hw/core/qdev-properties-system.c | 2 +-
migration/multifd-qatzip.c | 394 +++++++++++++++++++++++++++++++
migration/meson.build | 1 +
5 files changed, 402 insertions(+), 3 deletions(-)
create mode 100644 migration/multifd-qatzip.c
diff --git a/qapi/migration.json b/qapi/migration.json
index f4c27426c8..f1b7103dc8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -561,6 +561,8 @@
#
# @zstd: use zstd compression method.
#
+# @qatzip: use qatzip compression method. (Since 9.2)
+#
# @qpl: use qpl compression method. Query Processing Library(qpl) is
# based on the deflate compression algorithm and use the Intel
# In-Memory Analytics Accelerator(IAA) accelerated compression and
@@ -573,6 +575,7 @@
{ 'enum': 'MultiFDCompression',
'data': [ 'none', 'zlib',
{ 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
+ { 'name': 'qatzip', 'if': 'CONFIG_QATZIP'},
{ 'name': 'qpl', 'if': 'CONFIG_QPL' },
{ 'name': 'uadk', 'if': 'CONFIG_UADK' } ] }
diff --git a/migration/multifd.h b/migration/multifd.h
index 3bb96e9558..50d58c0c9c 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -36,14 +36,15 @@ MultiFDRecvData *multifd_get_recv_data(void);
/* Multifd Compression flags */
#define MULTIFD_FLAG_SYNC (1 << 0)
-/* We reserve 4 bits for compression methods */
-#define MULTIFD_FLAG_COMPRESSION_MASK (0xf << 1)
+/* We reserve 5 bits for compression methods */
+#define MULTIFD_FLAG_COMPRESSION_MASK (0x1f << 1)
/* we need to be compatible. Before compression value was 0 */
#define MULTIFD_FLAG_NOCOMP (0 << 1)
#define MULTIFD_FLAG_ZLIB (1 << 1)
#define MULTIFD_FLAG_ZSTD (2 << 1)
#define MULTIFD_FLAG_QPL (4 << 1)
#define MULTIFD_FLAG_UADK (8 << 1)
+#define MULTIFD_FLAG_QATZIP (16 << 1)
/* This value needs to be a multiple of qemu_target_page_size() */
#define MULTIFD_PACKET_SIZE (512 * 1024)
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index f13350b4fb..a56fbf728d 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -659,7 +659,7 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
const PropertyInfo qdev_prop_multifd_compression = {
.name = "MultiFDCompression",
.description = "multifd_compression values, "
- "none/zlib/zstd/qpl/uadk",
+ "none/zlib/zstd/qpl/uadk/qatzip",
.enum_table = &MultiFDCompression_lookup,
.get = qdev_propinfo_get_enum,
.set = qdev_propinfo_set_enum,
diff --git a/migration/multifd-qatzip.c b/migration/multifd-qatzip.c
new file mode 100644
index 0000000000..3c787ed879
--- /dev/null
+++ b/migration/multifd-qatzip.c
@@ -0,0 +1,394 @@
+/*
+ * Multifd QATzip compression implementation
+ *
+ * Copyright (c) Bytedance
+ *
+ * Authors:
+ * Bryan Zhang <bryan.zhang@bytedance.com>
+ * Hao Xiang <hao.xiang@bytedance.com>
+ * Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "exec/ramblock.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qapi/qapi-types-migration.h"
+#include "options.h"
+#include "multifd.h"
+#include <qatzip.h>
+
+typedef struct {
+ /*
+ * Unique session for use with QATzip API
+ */
+ QzSession_T sess;
+
+ /*
+ * For compression: Buffer for pages to compress
+ * For decompression: Buffer for data to decompress
+ */
+ uint8_t *in_buf;
+ uint32_t in_len;
+
+ /*
+ * For compression: Output buffer of compressed data
+ * For decompression: Output buffer of decompressed data
+ */
+ uint8_t *out_buf;
+ uint32_t out_len;
+} QatzipData;
+
+/**
+ * qatzip_send_setup: Set up QATzip session and private buffers.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_send_setup(MultiFDSendParams *p, Error **errp)
+{
+ QatzipData *q;
+ QzSessionParamsDeflate_T params;
+ const char *err_msg;
+ int ret;
+
+ q = g_new0(QatzipData, 1);
+ p->compress_data = q;
+ /* We need one extra place for the packet header */
+ p->iov = g_new0(struct iovec, 2);
+
+ /*
+ * Initialize QAT device with software fallback by default. This allows
+ * QATzip to use CPU path when QAT hardware reaches maximum throughput.
+ */
+ ret = qzInit(&q->sess, true);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzInit failed";
+ goto err;
+ }
+
+ ret = qzGetDefaultsDeflate(¶ms);
+ if (ret != QZ_OK) {
+ err_msg = "qzGetDefaultsDeflate failed";
+ goto err;
+ }
+
+ /* Make sure to use configured QATzip compression level. */
+ params.common_params.comp_lvl = migrate_multifd_qatzip_level();
+ ret = qzSetupSessionDeflate(&q->sess, ¶ms);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzSetupSessionDeflate failed";
+ goto err;
+ }
+
+ if (MULTIFD_PACKET_SIZE > UINT32_MAX) {
+ err_msg = "packet size too large for QAT";
+ goto err;
+ }
+
+ q->in_len = MULTIFD_PACKET_SIZE;
+ /*
+ * PINNED_MEM is an enum from qatzip headers, which means to use
+ * kzalloc_node() to allocate memory for QAT DMA purposes. When QAT device
+ * is not available or software fallback is used, the malloc flag needs to
+ * be set as COMMON_MEM.
+ */
+ q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
+ if (!q->in_buf) {
+ q->in_buf = qzMalloc(q->in_len, 0, COMMON_MEM);
+ if (!q->in_buf) {
+ err_msg = "qzMalloc failed";
+ goto err;
+ }
+ }
+
+ q->out_len = qzMaxCompressedLength(MULTIFD_PACKET_SIZE, &q->sess);
+ q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
+ if (!q->out_buf) {
+ q->out_buf = qzMalloc(q->out_len, 0, COMMON_MEM);
+ if (!q->out_buf) {
+ err_msg = "qzMalloc failed";
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ error_setg(errp, "multifd %u: [sender] %s", p->id, err_msg);
+ return -1;
+}
+
+/**
+ * qatzip_send_cleanup: Tear down QATzip session and release private buffers.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return None
+ */
+static void qatzip_send_cleanup(MultiFDSendParams *p, Error **errp)
+{
+ QatzipData *q = p->compress_data;
+
+ if (q) {
+ if (q->in_buf) {
+ qzFree(q->in_buf);
+ }
+ if (q->out_buf) {
+ qzFree(q->out_buf);
+ }
+ (void)qzTeardownSession(&q->sess);
+ (void)qzClose(&q->sess);
+ g_free(q);
+ }
+
+ g_free(p->iov);
+ p->iov = NULL;
+ p->compress_data = NULL;
+}
+
+/**
+ * qatzip_send_prepare: Compress pages and update IO channel info.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+ MultiFDPages_t *pages = p->pages;
+ QatzipData *q = p->compress_data;
+ int ret;
+ unsigned int in_len, out_len;
+
+ if (!multifd_send_prepare_common(p)) {
+ goto out;
+ }
+
+ /*
+ * Unlike other multifd compression implementations, we use a non-streaming
+ * API and place all the data into one buffer, rather than sending each
+ * page to the compression API at a time. Based on initial benchmarks, the
+ * non-streaming API outperforms the streaming API. Plus, the logic in QEMU
+ * is friendly to using the non-streaming API anyway. If either of these
+ * statements becomes no longer true, we can revisit adding a streaming
+ * implementation.
+ */
+ for (int i = 0; i < pages->normal_num; i++) {
+ memcpy(q->in_buf + (i * p->page_size),
+ pages->block->host + pages->offset[i],
+ p->page_size);
+ }
+
+ in_len = pages->normal_num * p->page_size;
+ if (in_len > q->in_len) {
+ error_setg(errp, "multifd %u: unexpectedly large input", p->id);
+ return -1;
+ }
+ out_len = q->out_len;
+
+ ret = qzCompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len, 1);
+ if (ret != QZ_OK) {
+ error_setg(errp, "multifd %u: QATzip returned %d instead of QZ_OK",
+ p->id, ret);
+ return -1;
+ }
+ if (in_len != pages->normal_num * p->page_size) {
+ error_setg(errp, "multifd %u: QATzip failed to compress all input",
+ p->id);
+ return -1;
+ }
+
+ p->iov[p->iovs_num].iov_base = q->out_buf;
+ p->iov[p->iovs_num].iov_len = out_len;
+ p->iovs_num++;
+ p->next_packet_size = out_len;
+
+out:
+ p->flags |= MULTIFD_FLAG_QATZIP;
+ multifd_send_fill_packet(p);
+ return 0;
+}
+
+/**
+ * qatzip_recv_setup: Set up QATzip session and allocate private buffers.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_recv_setup(MultiFDRecvParams *p, Error **errp)
+{
+ QatzipData *q;
+ QzSessionParamsDeflate_T params;
+ const char *err_msg;
+ int ret;
+
+ q = g_new0(QatzipData, 1);
+ p->compress_data = q;
+
+ /*
+ * Initialize QAT device with software fallback by default. This allows
+ * QATzip to use CPU path when QAT hardware reaches maximum throughput.
+ */
+ ret = qzInit(&q->sess, true);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzInit failed";
+ goto err;
+ }
+
+ ret = qzGetDefaultsDeflate(¶ms);
+ if (ret != QZ_OK) {
+ err_msg = "qzGetDefaultsDeflate failed";
+ goto err;
+ }
+
+ ret = qzSetupSessionDeflate(&q->sess, ¶ms);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzSetupSessionDeflate failed";
+ goto err;
+ }
+
+ /*
+ * Reserve extra spaces for the incoming packets. Current implementation
+ * doesn't send uncompressed pages in case the compression gets too big.
+ */
+ q->in_len = MULTIFD_PACKET_SIZE * 2;
+ /*
+ * PINNED_MEM is an enum from qatzip headers, which means to use
+ * kzalloc_node() to allocate memory for QAT DMA purposes. When QAT device
+ * is not available or software fallback is used, the malloc flag needs to
+ * be set as COMMON_MEM.
+ */
+ q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
+ if (!q->in_buf) {
+ q->in_buf = qzMalloc(q->in_len, 0, COMMON_MEM);
+ if (!q->in_buf) {
+ err_msg = "qzMalloc failed";
+ goto err;
+ }
+ }
+
+ q->out_len = MULTIFD_PACKET_SIZE;
+ q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
+ if (!q->out_buf) {
+ q->out_buf = qzMalloc(q->out_len, 0, COMMON_MEM);
+ if (!q->out_buf) {
+ err_msg = "qzMalloc failed";
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ error_setg(errp, "multifd %u: [receiver] %s", p->id, err_msg);
+ return -1;
+}
+
+/**
+ * qatzip_recv_cleanup: Tear down QATzip session and release private buffers.
+ *
+ * @param p Multifd channel params
+ * @return None
+ */
+static void qatzip_recv_cleanup(MultiFDRecvParams *p)
+{
+ QatzipData *q = p->compress_data;
+
+ if (q) {
+ if (q->in_buf) {
+ qzFree(q->in_buf);
+ }
+ if (q->out_buf) {
+ qzFree(q->out_buf);
+ }
+ (void)qzTeardownSession(&q->sess);
+ (void)qzClose(&q->sess);
+ g_free(q);
+ }
+ p->compress_data = NULL;
+}
+
+
+/**
+ * qatzip_recv: Decompress pages and copy them to the appropriate
+ * locations.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
+{
+ QatzipData *q = p->compress_data;
+ int ret;
+ unsigned int in_len, out_len;
+ uint32_t in_size = p->next_packet_size;
+ uint32_t expected_size = p->normal_num * p->page_size;
+ uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+
+ if (in_size > q->in_len) {
+ error_setg(errp, "multifd %u: received unexpectedly large packet",
+ p->id);
+ return -1;
+ }
+
+ if (flags != MULTIFD_FLAG_QATZIP) {
+ error_setg(errp, "multifd %u: flags received %x flags expected %x",
+ p->id, flags, MULTIFD_FLAG_QATZIP);
+ return -1;
+ }
+
+ multifd_recv_zero_page_process(p);
+ if (!p->normal_num) {
+ assert(in_size == 0);
+ return 0;
+ }
+
+ ret = qio_channel_read_all(p->c, (void *)q->in_buf, in_size, errp);
+ if (ret != 0) {
+ return ret;
+ }
+
+ in_len = in_size;
+ out_len = q->out_len;
+ ret = qzDecompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len);
+ if (ret != QZ_OK) {
+ error_setg(errp, "multifd %u: qzDecompress failed", p->id);
+ return -1;
+ }
+ if (out_len != expected_size) {
+ error_setg(errp, "multifd %u: packet size received %u size expected %u",
+ p->id, out_len, expected_size);
+ return -1;
+ }
+
+ /* Copy each page to its appropriate location. */
+ for (int i = 0; i < p->normal_num; i++) {
+ memcpy(p->host + p->normal[i],
+ q->out_buf + p->page_size * i,
+ p->page_size);
+ }
+ return 0;
+}
+
+static MultiFDMethods multifd_qatzip_ops = {
+ .send_setup = qatzip_send_setup,
+ .send_cleanup = qatzip_send_cleanup,
+ .send_prepare = qatzip_send_prepare,
+ .recv_setup = qatzip_recv_setup,
+ .recv_cleanup = qatzip_recv_cleanup,
+ .recv = qatzip_recv
+};
+
+static void multifd_qatzip_register(void)
+{
+ multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
+}
+
+migration_init(multifd_qatzip_register);
diff --git a/migration/meson.build b/migration/meson.build
index 77f3abf08e..66d3de86f0 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -42,6 +42,7 @@ system_ss.add(when: rdma, if_true: files('rdma.c'))
system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
system_ss.add(when: uadk, if_true: files('multifd-uadk.c'))
+system_ss.add(when: qatzip, if_true: files('multifd-qatzip.c'))
specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
if_true: files('ram.c',
--
2.45.0
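One detail worth calling out: because the existing compression flag values are powers of two and QATZIP takes value 16, the packet-header compression mask had to grow from 4 to 5 bits. A sketch (not QEMU code) of the flag encoding from migration/multifd.h after this patch:

```python
# On-wire flag layout: bit 0 is SYNC, the next bits carry the method.
MULTIFD_FLAG_SYNC = 1 << 0
MULTIFD_FLAG_COMPRESSION_MASK = 0x1f << 1  # widened from 0xf << 1
MULTIFD_FLAG_NOCOMP = 0 << 1
MULTIFD_FLAG_ZLIB = 1 << 1
MULTIFD_FLAG_ZSTD = 2 << 1
MULTIFD_FLAG_QPL = 4 << 1
MULTIFD_FLAG_UADK = 8 << 1
MULTIFD_FLAG_QATZIP = 16 << 1  # needs a 5th bit, hence the wider mask

def compression_method(flags: int) -> int:
    """Extract the compression method, as qatzip_recv() does."""
    return flags & MULTIFD_FLAG_COMPRESSION_MASK

# The QATZIP bit survives the new mask but would be lost under the old one:
assert compression_method(MULTIFD_FLAG_QATZIP | MULTIFD_FLAG_SYNC) == MULTIFD_FLAG_QATZIP
assert MULTIFD_FLAG_QATZIP & (0xf << 1) == 0
```

Both ends must agree on the mask width, which is why the change lives in the shared header rather than in multifd-qatzip.c.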
* [PULL 8/9] tests/migration: Add integration test for 'qatzip' compression method
2024-09-09 20:11 [PULL 0/9] Migration 20240909 patches Peter Xu
` (6 preceding siblings ...)
2024-09-09 20:11 ` [PULL 7/9] migration: Introduce 'qatzip' compression method Peter Xu
@ 2024-09-09 20:11 ` Peter Xu
2024-09-09 20:11 ` [PULL 9/9] system: improve migration debug Peter Xu
2024-09-10 14:46 ` [PULL 0/9] Migration 20240909 patches Peter Maydell
9 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Bryan Zhang, Hao Xiang,
Yichen Wang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Adds an integration test for 'qatzip'.
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Link: https://lore.kernel.org/r/20240830232722.58272-6-yichen.wang@bytedance.com
Signed-off-by: Peter Xu <peterx@redhat.com>
---
tests/qtest/migration-test.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 9d08101643..d6768d5d71 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2920,6 +2920,18 @@ test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
}
#endif /* CONFIG_ZSTD */
+#ifdef CONFIG_QATZIP
+static void *
+test_migrate_precopy_tcp_multifd_qatzip_start(QTestState *from,
+ QTestState *to)
+{
+ migrate_set_parameter_int(from, "multifd-qatzip-level", 2);
+ migrate_set_parameter_int(to, "multifd-qatzip-level", 2);
+
+ return test_migrate_precopy_tcp_multifd_start_common(from, to, "qatzip");
+}
+#endif
+
#ifdef CONFIG_QPL
static void *
test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
@@ -3017,6 +3029,17 @@ static void test_multifd_tcp_zstd(void)
}
#endif
+#ifdef CONFIG_QATZIP
+static void test_multifd_tcp_qatzip(void)
+{
+ MigrateCommon args = {
+ .listen_uri = "defer",
+ .start_hook = test_migrate_precopy_tcp_multifd_qatzip_start,
+ };
+ test_precopy_common(&args);
+}
+#endif
+
#ifdef CONFIG_QPL
static void test_multifd_tcp_qpl(void)
{
@@ -3922,6 +3945,10 @@ int main(int argc, char **argv)
migration_test_add("/migration/multifd/tcp/plain/zstd",
test_multifd_tcp_zstd);
#endif
+#ifdef CONFIG_QATZIP
+ migration_test_add("/migration/multifd/tcp/plain/qatzip",
+ test_multifd_tcp_qatzip);
+#endif
#ifdef CONFIG_QPL
migration_test_add("/migration/multifd/tcp/plain/qpl",
test_multifd_tcp_qpl);
--
2.45.0
* [PULL 9/9] system: improve migration debug
2024-09-09 20:11 [PULL 0/9] Migration 20240909 patches Peter Xu
` (7 preceding siblings ...)
2024-09-09 20:11 ` [PULL 8/9] tests/migration: Add integration test for " Peter Xu
@ 2024-09-09 20:11 ` Peter Xu
2024-09-10 14:46 ` [PULL 0/9] Migration 20240909 patches Peter Maydell
9 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2024-09-09 20:11 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Maydell, Peter Xu, Fabiano Rosas, Denis V. Lunev,
Paolo Bonzini
From: "Denis V. Lunev" <den@openvz.org>
Right now the migration_throttle() tracepoint lacks very important
information, i.e. no one could easily say how much the guest
is throttled. This makes it difficult to debug guest quality of service
during migration.
This patch adds one more tracepoint into cpu_throttle_set() which is
actually doing this job.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Peter Xu <peterx@redhat.com>
CC: Fabiano Rosas <farosas@suse.de>
CC: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20240905191941.310592-1-den@openvz.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
system/cpu-throttle.c | 3 +++
system/trace-events | 3 +++
2 files changed, 6 insertions(+)
diff --git a/system/cpu-throttle.c b/system/cpu-throttle.c
index c951a6c65e..7632dc6143 100644
--- a/system/cpu-throttle.c
+++ b/system/cpu-throttle.c
@@ -28,6 +28,7 @@
#include "qemu/main-loop.h"
#include "sysemu/cpus.h"
#include "sysemu/cpu-throttle.h"
+#include "trace.h"
/* vcpu throttling controls */
static QEMUTimer *throttle_timer;
@@ -95,6 +96,8 @@ void cpu_throttle_set(int new_throttle_pct)
*/
bool throttle_active = cpu_throttle_active();
+ trace_cpu_throttle_set(new_throttle_pct);
+
/* Ensure throttle percentage is within valid range */
new_throttle_pct = MIN(new_throttle_pct, CPU_THROTTLE_PCT_MAX);
new_throttle_pct = MAX(new_throttle_pct, CPU_THROTTLE_PCT_MIN);
diff --git a/system/trace-events b/system/trace-events
index 2ed1d59b1f..074d001e90 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -44,3 +44,6 @@ dirtylimit_state_finalize(void)
dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
+
+# cpu-throttle.c
+cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%"
--
2.45.0
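Note that the tracepoint fires before the percentage is clamped, so the trace shows the value the caller asked for, not necessarily the value applied. A minimal sketch of that ordering (the 1/99 limits are an assumption here, standing in for CPU_THROTTLE_PCT_MIN/MAX from the real header):

```python
CPU_THROTTLE_PCT_MIN = 1   # assumed limit values
CPU_THROTTLE_PCT_MAX = 99

def cpu_throttle_set(new_throttle_pct: int) -> int:
    # Stand-in for trace_cpu_throttle_set(): logs the *requested* value.
    print(f"cpu_throttle_set {new_throttle_pct}")
    # Clamping happens after the trace, as in the patched function.
    new_throttle_pct = min(new_throttle_pct, CPU_THROTTLE_PCT_MAX)
    new_throttle_pct = max(new_throttle_pct, CPU_THROTTLE_PCT_MIN)
    return new_throttle_pct

assert cpu_throttle_set(150) == 99  # traced as 150, applied as 99
assert cpu_throttle_set(50) == 50
```

Anyone correlating the trace with observed guest slowdown should keep that distinction in mind.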
* Re: [PULL 0/9] Migration 20240909 patches
2024-09-09 20:11 [PULL 0/9] Migration 20240909 patches Peter Xu
` (8 preceding siblings ...)
2024-09-09 20:11 ` [PULL 9/9] system: improve migration debug Peter Xu
@ 2024-09-10 14:46 ` Peter Maydell
9 siblings, 0 replies; 24+ messages in thread
From: Peter Maydell @ 2024-09-10 14:46 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas
On Mon, 9 Sept 2024 at 21:11, Peter Xu <peterx@redhat.com> wrote:
>
> The following changes since commit f2aee60305a1e40374b2fc1093e4d04404e780ee:
>
> Merge tag 'pull-request-2024-09-08' of https://gitlab.com/huth/qemu into staging (2024-09-09 10:47:24 +0100)
>
> are available in the Git repository at:
>
> https://gitlab.com/peterx/qemu.git tags/migration-20240909-pull-request
>
> for you to fetch changes up to 89bccecdda253c9a1a38921cf9266a4f9655c88c:
>
> system: improve migration debug (2024-09-09 10:55:40 -0400)
>
> ----------------------------------------------------------------
> Migration pull request for 9.2
>
> - Mattias's patch to support concurrent bounce buffers for PCI devices
> - David's memory leak fix in dirty_memory_extend()
> - Fabiano's CI fix to disable vmstate-static-checker test in compat tests
> - Denis's patch that adds one more trace point for cpu throttle changes
> - Yichen's multifd qatzip compressor support
>
Applied, thanks.
Please update the changelog at https://wiki.qemu.org/ChangeLog/9.2
for any user-visible changes.
-- PMM
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-09 20:11 ` [PULL 1/9] softmmu: Support concurrent bounce buffers Peter Xu
@ 2024-09-13 14:35 ` Cédric Le Goater
2024-09-13 14:47 ` Peter Xu
0 siblings, 1 reply; 24+ messages in thread
From: Cédric Le Goater @ 2024-09-13 14:35 UTC (permalink / raw)
To: Peter Xu, qemu-devel
Cc: Peter Maydell, Fabiano Rosas, Mattias Nissler,
Philippe Mathieu-Daudé, Mark Cave-Ayland
Hello,
+Mark (for the Mac devices)
On 9/9/24 22:11, Peter Xu wrote:
> From: Mattias Nissler <mnissler@rivosinc.com>
>
> When DMA memory can't be directly accessed, as is the case when
> running the device model in a separate process without shareable DMA
> file descriptors, bounce buffering is used.
>
> It is not uncommon for device models to request mapping of several DMA
> regions at the same time. Examples include:
> * net devices, e.g. when transmitting a packet that is split across
> several TX descriptors (observed with igb)
> * USB host controllers, when handling a packet with multiple data TRBs
> (observed with xhci)
>
> Previously, qemu only provided a single bounce buffer per AddressSpace
> and would fail DMA map requests while the buffer was already in use. In
> turn, this would cause DMA failures that ultimately manifest as hardware
> errors from the guest perspective.
>
> This change allocates DMA bounce buffers dynamically instead of
> supporting only a single buffer. Thus, multiple DMA mappings work
> correctly also when RAM can't be mmap()-ed.
>
> The total bounce buffer allocation size is limited individually for each
> AddressSpace. The default limit is 4096 bytes, matching the previous
> maximum buffer size. A new x-max-bounce-buffer-size parameter is
> provided to configure the limit for PCI devices.
>
> Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Acked-by: Peter Xu <peterx@redhat.com>
> Link: https://lore.kernel.org/r/20240819135455.2957406-1-mnissler@rivosinc.com
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/exec/memory.h | 14 +++----
> include/hw/pci/pci_device.h | 3 ++
> hw/pci/pci.c | 8 ++++
> system/memory.c | 5 ++-
> system/physmem.c | 82 ++++++++++++++++++++++++++-----------
> 5 files changed, 76 insertions(+), 36 deletions(-)
Here is a report of a segv of the ppc64 mac99+cpu970 machine booting debian.
See the stack trace below. Just wanted to let you know. I will dig into it further
next week.
Thanks,
C.
Thread 1 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
address_space_unmap (len=<optimized out>, access_len=0, is_write=false, buffer=0x0,
as=0x5555565d45c0 <address_space_memory>) at ../system/physmem.c:3333
3333 assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
(gdb) bt
#0 address_space_unmap
(len=<optimized out>, access_len=0, is_write=false, buffer=0x0, as=0x5555565d45c0 <address_space_memory>)
at ../system/physmem.c:3333
#1 address_space_unmap
(as=as@entry=0x5555565d45c0 <address_space_memory>, buffer=0x0, len=<optimized out>, is_write=<optimized out>, access_len=0) at ../system/physmem.c:3313
#2 0x000055555595ea48 in dma_memory_unmap
(access_len=<optimized out>, dir=<optimized out>, len=<optimized out>, buffer=<optimized out>, as=<optimized out>) at /home/legoater/work/qemu/qemu.git/include/sysemu/dma.h:236
#3 pmac_ide_atapi_transfer_cb (opaque=0x555556c06470, ret=<optimized out>) at ../hw/ide/macio.c:122
#4 0x00005555559861f3 in DBDMA_run (s=0x555556c04c60) at ../hw/misc/macio/mac_dbdma.c:546
#5 DBDMA_run_bh (opaque=0x555556c04c60) at ../hw/misc/macio/mac_dbdma.c:556
#6 0x0000555555f19f33 in aio_bh_call (bh=bh@entry=0x555556ab5570) at ../util/async.c:171
#7 0x0000555555f1a0f5 in aio_bh_poll (ctx=ctx@entry=0x5555566af150) at ../util/async.c:218
#8 0x0000555555f0269e in aio_dispatch (ctx=0x5555566af150) at ../util/aio-posix.c:423
#9 0x0000555555f19d8e in aio_ctx_dispatch
(source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:360
#10 0x00007ffff7315f4f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#11 0x0000555555f1b488 in glib_pollfds_poll () at ../util/main-loop.c:287
#12 os_host_main_loop_wait (timeout=2143429) at ../util/main-loop.c:310
#13 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:589
#14 0x0000555555abeba3 in qemu_main_loop () at ../system/runstate.c:826
#15 0x0000555555e63787 in qemu_default_main () at ../system/main.c:37
#16 0x00007ffff6e29590 in __libc_start_call_main () at /lib64/libc.so.6
#17 0x00007ffff6e29640 in __libc_start_main_impl () at /lib64/libc.so.6
#18 0x000055555588d4f5 in _start ()
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-13 14:35 ` Cédric Le Goater
@ 2024-09-13 14:47 ` Peter Xu
2024-09-16 8:23 ` Mattias Nissler
0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2024-09-13 14:47 UTC (permalink / raw)
To: Cédric Le Goater
Cc: qemu-devel, Peter Maydell, Fabiano Rosas, Mattias Nissler,
Philippe Mathieu-Daudé, Mark Cave-Ayland
On Fri, Sep 13, 2024 at 04:35:32PM +0200, Cédric Le Goater wrote:
> Hello,
>
> +Mark (for the Mac devices)
>
> On 9/9/24 22:11, Peter Xu wrote:
> > [...]
>
> Here is a report of a segv of the ppc64 mac99+cpu970 machine booting debian.
> See the stack trace below. Just wanted to let you know. I will dig into it further
> next week.
>
> Thanks,
>
> C.
>
>
>
> Thread 1 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
> address_space_unmap (len=<optimized out>, access_len=0, is_write=false, buffer=0x0,
> as=0x5555565d45c0 <address_space_memory>) at ../system/physmem.c:3333
> 3333 assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
> [...]
Thanks for the report!
Mattias,
Would you have time to take a look?
Thanks,
--
Peter Xu
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-13 14:47 ` Peter Xu
@ 2024-09-16 8:23 ` Mattias Nissler
2024-09-16 11:29 ` Mark Cave-Ayland
2024-09-16 12:13 ` Cédric Le Goater
0 siblings, 2 replies; 24+ messages in thread
From: Mattias Nissler @ 2024-09-16 8:23 UTC (permalink / raw)
To: Peter Xu
Cc: Cédric Le Goater, qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé, Mark Cave-Ayland
Thanks for the report, and my apologies for the breakage.
On Fri, Sep 13, 2024 at 4:47 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Sep 13, 2024 at 04:35:32PM +0200, Cédric Le Goater wrote:
> > Hello,
> >
> > +Mark (for the Mac devices)
> >
> > On 9/9/24 22:11, Peter Xu wrote:
> > > [...]
> >
> > Here is a report of a segv of the ppc64 mac99+cpu970 machine booting debian.
> > See the stack trace below. Just wanted to let you know. I will dig into it further
> > next week.
> >
> > Thanks,
> >
> > C.
> >
> >
> >
> > Thread 1 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
> > address_space_unmap (len=<optimized out>, access_len=0, is_write=false, buffer=0x0,
> > as=0x5555565d45c0 <address_space_memory>) at ../system/physmem.c:3333
> > 3333 assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
> > (gdb) bt
> > #0 address_space_unmap
> > (len=<optimized out>, access_len=0, is_write=false, buffer=0x0, as=0x5555565d45c0 <address_space_memory>)
> > at ../system/physmem.c:3333
> > #1 address_space_unmap
> > (as=as@entry=0x5555565d45c0 <address_space_memory>, buffer=0x0, len=<optimized out>, is_write=<optimized out>, access_len=0) at ../system/physmem.c:3313
> > [...]
>
> Thanks for the report!
>
> Mattias,
>
> Would you have time to take a look?
I noticed that the backtrace indicates address_space_unmap is called
with buffer=0x0, len=0. This wasn't really correct before my
concurrent bounce buffering change either, but it looks like the
previous code would have tolerated this to a certain extent (at least
no immediate crashes). Original code in question:
if (is_write) {
address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
as->bounce.buffer, access_len);
}
qemu_vfree(as->bounce.buffer);
as->bounce.buffer = NULL;
memory_region_unref(as->bounce.mr);
/* Clear in_use before reading map_client_list. */
qatomic_set_mb(&as->bounce.in_use, false);
address_space_notify_map_clients(as);
address_space_write and qemu_vfree are safe to call with NULL/0
parameters. as->bounce.buffer = NULL would leak the buffer if one is
allocated, and memory_region_unref(as->bounce.mr) is only OK if the
bounce buffer hasn't been used before, otherwise we'd erroneously drop
a memory region reference.
We have two options here: either fix the caller so it doesn't call
address_space_unmap with buffer=NULL, or make address_space_unmap
NULL-safe by returning immediately when passed a NULL buffer
parameter.
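To make the second option concrete, here is a minimal, self-contained sketch of a NULL-tolerant unmap path. The names (BounceBuffer, model_map, model_unmap, BOUNCE_BUFFER_MAGIC's value) are illustrative stand-ins, not the actual code in system/physmem.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model of the bounce-buffer bookkeeping; layout is illustrative only. */
#define BOUNCE_BUFFER_MAGIC 0xb4b4b4b4u

typedef struct BounceBuffer {
    uint32_t magic;   /* guards against unmapping a stray pointer */
    uint8_t data[];   /* what callers get back from map() */
} BounceBuffer;

static void *model_map(size_t len)
{
    BounceBuffer *bounce = malloc(sizeof(*bounce) + len);
    bounce->magic = BOUNCE_BUFFER_MAGIC;
    return bounce->data;
}

/* Option 2 from above: return early on buffer == NULL instead of
 * computing container_of(NULL, ...) and crashing on the magic check. */
static void model_unmap(void *buffer)
{
    if (buffer == NULL) {
        return;
    }
    BounceBuffer *bounce =
        (BounceBuffer *)((uint8_t *)buffer - offsetof(BounceBuffer, data));
    assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
    free(bounce);
}
```

The magic check still catches genuinely bogus non-NULL pointers; only the explicit "nothing was ever mapped" case is tolerated.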
Looking at the code, the dma_memory_unmap calls in hw/ide/macio.c seem
to be passing buffer=NULL unconditionally, since the dma_mem field in
struct DBDMA_io is never set to anything non-zero. In fact, I believe
after commit be1e343995ef81fc05d9a4e1ec263ca171d842e7 "macio: switch
over to new byte-aligned DMA helpers", the dma_memory_unmap calls in
hw/ide/macio.c aren't doing anything and should probably have been
removed together with the dma_mem, dma_len and dir fields in struct
DBDMA_io. Speculative patch:
diff --git a/hw/ide/macio.c b/hw/ide/macio.c
index e84bf2c9f6..15dd40138e 100644
--- a/hw/ide/macio.c
+++ b/hw/ide/macio.c
@@ -119,9 +119,6 @@ static void pmac_ide_atapi_transfer_cb(void
*opaque, int ret)
return;
done:
- dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
- io->dir, io->dma_len);
-
if (ret < 0) {
block_acct_failed(blk_get_stats(s->blk), &s->acct);
} else {
@@ -202,9 +199,6 @@ static void pmac_ide_transfer_cb(void *opaque, int ret)
return;
done:
- dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
- io->dir, io->dma_len);
-
if (s->dma_cmd == IDE_DMA_READ || s->dma_cmd == IDE_DMA_WRITE) {
if (ret < 0) {
block_acct_failed(blk_get_stats(s->blk), &s->acct);
diff --git a/include/hw/ppc/mac_dbdma.h b/include/hw/ppc/mac_dbdma.h
index 4a3f644516..c774f6bf84 100644
--- a/include/hw/ppc/mac_dbdma.h
+++ b/include/hw/ppc/mac_dbdma.h
@@ -44,10 +44,6 @@ struct DBDMA_io {
DBDMA_end dma_end;
/* DMA is in progress, don't start another one */
bool processing;
- /* DMA request */
- void *dma_mem;
- dma_addr_t dma_len;
- DMADirection dir;
};
/*
Cédric, can you try with the above patch and/or share more details of
your setup so I can verify (I tried booting a ppc64el-pseries dqib
image but didn't see the issue)?
Thanks,
Mattias
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 8:23 ` Mattias Nissler
@ 2024-09-16 11:29 ` Mark Cave-Ayland
2024-09-16 11:44 ` Peter Maydell
2024-09-16 12:13 ` Cédric Le Goater
1 sibling, 1 reply; 24+ messages in thread
From: Mark Cave-Ayland @ 2024-09-16 11:29 UTC (permalink / raw)
To: Mattias Nissler, Peter Xu
Cc: Cédric Le Goater, qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé
On 16/09/2024 09:23, Mattias Nissler wrote:
> Thanks for the report, and my apologies for the breakage.
>
> On Fri, Sep 13, 2024 at 4:47 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Fri, Sep 13, 2024 at 04:35:32PM +0200, Cédric Le Goater wrote:
>>> Hello,
>>>
>>> +Mark (for the Mac devices)
>>>
>>> On 9/9/24 22:11, Peter Xu wrote:
>>>> [...]
>>>
>>> Here is a report of a segv of the ppc64 mac99+cpu970 machine booting debian.
>>> See the stack trace below. Just wanted to let you know. I will dig into it further
>>> next week.
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>
>>> Thread 1 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>>> [...]
>>
>> Thanks for the report!
>>
>> Mattias,
>>
>> Would you have time to take a look?
>
> I noticed that the backtrace indicates address_space_unmap is called
> with buffer=0x0, len=0. This wasn't really correct before my
> concurrent bounce buffering change either, but it looks like the
> previous code would have tolerated this to a certain extent (at least
> no immediate crashes). Original code in question:
>
> if (is_write) {
> address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> as->bounce.buffer, access_len);
> }
> qemu_vfree(as->bounce.buffer);
> as->bounce.buffer = NULL;
> memory_region_unref(as->bounce.mr);
> /* Clear in_use before reading map_client_list. */
> qatomic_set_mb(&as->bounce.in_use, false);
> address_space_notify_map_clients(as);
>
> address_space_write and qemu_vfree are safe to call with NULL/0
> parameters. as->bounce.buffer = NULL would leak the buffer if one is
> allocated, and memory_region_unref(as->bounce.mr) is only OK if the
> bounce buffer hasn't been used before, otherwise we'd erroneously drop
> a memory region reference.
>
> We have two options here: Either we fix the caller to not call
> address_space_unmap with buffer=NULL. Or alternatively we make
> address_space_unmap NULL-safe by putting a check to return immediately
> when being passed a NULL buffer parameter.
>
> Looking at the code, the dma_memory_unmap calls in hw/ide/macio.c seem
> to be passing buffer=NULL unconditionally, since the dma_mem field in
> struct DBDMA_io is never set to anything non-zero. In fact, I believe
> after commit be1e343995ef81fc05d9a4e1ec263ca171d842e7 "macio: switch
> over to new byte-aligned DMA helpers", the dma_memory_unmap calls in
> hw/ide/macio.c aren't doing anything and should probably have been
> removed together with the dma_mem, dma_len and dir fields in struct
> DBDMA_io. Speculative patch:
>
> diff --git a/hw/ide/macio.c b/hw/ide/macio.c
> index e84bf2c9f6..15dd40138e 100644
> --- a/hw/ide/macio.c
> +++ b/hw/ide/macio.c
> @@ -119,9 +119,6 @@ static void pmac_ide_atapi_transfer_cb(void
> *opaque, int ret)
> return;
>
> done:
> - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
> - io->dir, io->dma_len);
> -
> if (ret < 0) {
> block_acct_failed(blk_get_stats(s->blk), &s->acct);
> } else {
> @@ -202,9 +199,6 @@ static void pmac_ide_transfer_cb(void *opaque, int ret)
> return;
>
> done:
> - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
> - io->dir, io->dma_len);
> -
> if (s->dma_cmd == IDE_DMA_READ || s->dma_cmd == IDE_DMA_WRITE) {
> if (ret < 0) {
> block_acct_failed(blk_get_stats(s->blk), &s->acct);
> diff --git a/include/hw/ppc/mac_dbdma.h b/include/hw/ppc/mac_dbdma.h
> index 4a3f644516..c774f6bf84 100644
> --- a/include/hw/ppc/mac_dbdma.h
> +++ b/include/hw/ppc/mac_dbdma.h
> @@ -44,10 +44,6 @@ struct DBDMA_io {
> DBDMA_end dma_end;
> /* DMA is in progress, don't start another one */
> bool processing;
> - /* DMA request */
> - void *dma_mem;
> - dma_addr_t dma_len;
> - DMADirection dir;
> };
>
> /*
>
> Cédric, can you try with the above patch and/or share more details of
> your setup so I can verify (I tried booting a ppc64el-pseries dqib
> image but didn't see the issue)?
I'm fairly sure that this patch would break MacOS 9 which was the reason that
dma_memory_unmap() was added here in the first place: what I was finding was that
without the dma_memory_unmap() the destination RAM wasn't being invalidated (or
marked dirty), causing random crashes during boot.
Would the issue be solved by adding a corresponding dma_memory_map() beforehand at
the relevant places in hw/ide/macio.c? If that's required as part of the setup for
bounce buffers then I can see how not having this present could cause problems.
ATB,
Mark.
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 11:29 ` Mark Cave-Ayland
@ 2024-09-16 11:44 ` Peter Maydell
2024-09-16 12:13 ` Mark Cave-Ayland
0 siblings, 1 reply; 24+ messages in thread
From: Peter Maydell @ 2024-09-16 11:44 UTC (permalink / raw)
To: Mark Cave-Ayland
Cc: Mattias Nissler, Peter Xu, Cédric Le Goater, qemu-devel,
Fabiano Rosas, Philippe Mathieu-Daudé
On Mon, 16 Sept 2024 at 12:29, Mark Cave-Ayland
<mark.cave-ayland@ilande.co.uk> wrote:
>
> On 16/09/2024 09:23, Mattias Nissler wrote:
> > Looking at the code, the dma_memory_unmap calls in hw/ide/macio.c seem
> > to be passing buffer=NULL unconditionally, since the dma_mem field in
> > struct DBDMA_io is never set to anything non-zero. In fact, I believe
> > after commit be1e343995ef81fc05d9a4e1ec263ca171d842e7 "macio: switch
> > over to new byte-aligned DMA helpers", the dma_memory_unmap calls in
> > hw/ide/macio.c aren't doing anything and should probably have been
> > removed together with the dma_mem, dma_len and dir fields in struct
> > DBDMA_io. Speculative patch:
> >
> > diff --git a/hw/ide/macio.c b/hw/ide/macio.c
> > index e84bf2c9f6..15dd40138e 100644
> > --- a/hw/ide/macio.c
> > +++ b/hw/ide/macio.c
> > @@ -119,9 +119,6 @@ static void pmac_ide_atapi_transfer_cb(void
> > *opaque, int ret)
> > return;
> >
> > done:
> > - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
> > - io->dir, io->dma_len);
> > -
> > if (ret < 0) {
> > block_acct_failed(blk_get_stats(s->blk), &s->acct);
> > } else {
> > @@ -202,9 +199,6 @@ static void pmac_ide_transfer_cb(void *opaque, int ret)
> > return;
> >
> > done:
> > - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
> > - io->dir, io->dma_len);
> > -
> > if (s->dma_cmd == IDE_DMA_READ || s->dma_cmd == IDE_DMA_WRITE) {
> > if (ret < 0) {
> > block_acct_failed(blk_get_stats(s->blk), &s->acct);
> > diff --git a/include/hw/ppc/mac_dbdma.h b/include/hw/ppc/mac_dbdma.h
> > index 4a3f644516..c774f6bf84 100644
> > --- a/include/hw/ppc/mac_dbdma.h
> > +++ b/include/hw/ppc/mac_dbdma.h
> > @@ -44,10 +44,6 @@ struct DBDMA_io {
> > DBDMA_end dma_end;
> > /* DMA is in progress, don't start another one */
> > bool processing;
> > - /* DMA request */
> > - void *dma_mem;
> > - dma_addr_t dma_len;
> > - DMADirection dir;
> > };
> >
> > /*
> >
> > Cédric, can you try with the above patch and/or share more details of
> > your setup so I can verify (I tried booting a ppc64el-pseries dqib
> > image but didn't see the issue)?
>
> I'm fairly sure that this patch would break MacOS 9 which was the reason that
> dma_memory_unmap() was added here in the first place: what I was finding was that
> without the dma_memory_unmap() the destination RAM wasn't being invalidated (or
> marked dirty), causing random crashes during boot.
dma_memory_unmap() of something you never mapped is
definitely wrong. Whatever is going on here, leaving the unmap
call in after you removed the dma_memory_map() call is just
papering over the actual cause of the crashes.
> Would the issue be solved by adding a corresponding dma_memory_map() beforehand at
> the relevant places in hw/ide/macio.c? If that's required as part of the setup for
> bounce buffers then I can see how not having this present could cause problems.
The only purpose of this API is sequences of:
host_ptr = dma_memory_map(...);
access the host_ptr directly;
dma_memory_unmap(...);
The bounce-buffer stuff is an internal implementation detail
of making this API work when the DMA is going to a device.
We need to find whatever the actual cause of the macos failure is.
Mattias' suggested change looks right to me.
I do wonder if something needs the memory barrier that
unmap does as part of its operation, e.g. in the
implementation of the dma_blk_* functions.
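The map/access/unmap discipline described above can be modelled in a few lines. This is a toy sketch with invented helper names (toy_map/toy_unmap/live_mappings), not QEMU's real dma_memory_map()/dma_memory_unmap() signatures; the point is only that every successful map is balanced by exactly one unmap, and unmap is never called for a mapping that never happened:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Models the accounting an AddressSpace keeps for outstanding mappings. */
static int live_mappings;

static void *toy_map(size_t len)
{
    live_mappings++;
    return calloc(1, len);
}

static void toy_unmap(void *host_ptr)
{
    /* Unmapping something that was never mapped is a caller bug. */
    assert(host_ptr != NULL);
    live_mappings--;
    free(host_ptr);
}

/* The only intended use of the API: map, touch the host pointer, unmap. */
static void toy_transfer(const char *src, size_t len)
{
    void *host_ptr = toy_map(len);  /* 1. map */
    memcpy(host_ptr, src, len);     /* 2. access the host pointer directly */
    toy_unmap(host_ptr);            /* 3. balanced unmap */
}
```

Under this discipline the macio unmap-without-map calls fail the assertion, which is exactly the class of bug the new bounce-buffer magic check surfaced.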
-- PMM
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 8:23 ` Mattias Nissler
2024-09-16 11:29 ` Mark Cave-Ayland
@ 2024-09-16 12:13 ` Cédric Le Goater
2024-09-16 12:28 ` Cédric Le Goater
1 sibling, 1 reply; 24+ messages in thread
From: Cédric Le Goater @ 2024-09-16 12:13 UTC (permalink / raw)
To: Mattias Nissler, Peter Xu
Cc: qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé, Mark Cave-Ayland
On 9/16/24 10:23, Mattias Nissler wrote:
> Thanks for the report, and my apologies for the breakage.
>
> On Fri, Sep 13, 2024 at 4:47 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Fri, Sep 13, 2024 at 04:35:32PM +0200, Cédric Le Goater wrote:
>>> Hello,
>>>
>>> +Mark (for the Mac devices)
>>>
>>> On 9/9/24 22:11, Peter Xu wrote:
>>>> [...]
>>>
>>> Here is a report of a segv of the ppc64 mac99+cpu970 machine booting debian.
>>> See the stack trace below. Just wanted to let you know. I will dig further
>>> next week.
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>
>>> Thread 1 "qemu-system-ppc" received signal SIGSEGV, Segmentation fault.
>>> address_space_unmap (len=<optimized out>, access_len=0, is_write=false, buffer=0x0,
>>> as=0x5555565d45c0 <address_space_memory>) at ../system/physmem.c:3333
>>> 3333 assert(bounce->magic == BOUNCE_BUFFER_MAGIC);
>>> (gdb) bt
>>> #0 address_space_unmap
>>> (len=<optimized out>, access_len=0, is_write=false, buffer=0x0, as=0x5555565d45c0 <address_space_memory>)
>>> at ../system/physmem.c:3333
>>> #1 address_space_unmap
>>> (as=as@entry=0x5555565d45c0 <address_space_memory>, buffer=0x0, len=<optimized out>, is_write=<optimized out>, access_len=0) at ../system/physmem.c:3313
>>> #2 0x000055555595ea48 in dma_memory_unmap
>>> (access_len=<optimized out>, dir=<optimized out>, len=<optimized out>, buffer=<optimized out>, as=<optimized out>) at /home/legoater/work/qemu/qemu.git/include/sysemu/dma.h:236
>>> #3 pmac_ide_atapi_transfer_cb (opaque=0x555556c06470, ret=<optimized out>) at ../hw/ide/macio.c:122
>>> #4 0x00005555559861f3 in DBDMA_run (s=0x555556c04c60) at ../hw/misc/macio/mac_dbdma.c:546
>>> #5 DBDMA_run_bh (opaque=0x555556c04c60) at ../hw/misc/macio/mac_dbdma.c:556
>>> #6 0x0000555555f19f33 in aio_bh_call (bh=bh@entry=0x555556ab5570) at ../util/async.c:171
>>> #7 0x0000555555f1a0f5 in aio_bh_poll (ctx=ctx@entry=0x5555566af150) at ../util/async.c:218
>>> #8 0x0000555555f0269e in aio_dispatch (ctx=0x5555566af150) at ../util/aio-posix.c:423
>>> #9 0x0000555555f19d8e in aio_ctx_dispatch
>>> (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:360
>>> #10 0x00007ffff7315f4f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
>>> #11 0x0000555555f1b488 in glib_pollfds_poll () at ../util/main-loop.c:287
>>> #12 os_host_main_loop_wait (timeout=2143429) at ../util/main-loop.c:310
>>> #13 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:589
>>> #14 0x0000555555abeba3 in qemu_main_loop () at ../system/runstate.c:826
>>> #15 0x0000555555e63787 in qemu_default_main () at ../system/main.c:37
>>> #16 0x00007ffff6e29590 in __libc_start_call_main () at /lib64/libc.so.6
>>> #17 0x00007ffff6e29640 in __libc_start_main_impl () at /lib64/libc.so.6
>>> #18 0x000055555588d4f5 in _start ()
>>
>> Thanks for the report!
>>
>> Mattias,
>>
>> Would you have time to take a look?
>
> I noticed that the backtrace indicates address_space_unmap is called
> with buffer=0x0, len=0. This wasn't really correct before my
> concurrent bounce buffering change either, but it looks like the
> previous code would have tolerated this to a certain extent (at least
> no immediate crashes). Original code in question:
>
> if (is_write) {
> address_space_write(as, as->bounce.addr, MEMTXATTRS_UNSPECIFIED,
> as->bounce.buffer, access_len);
> }
> qemu_vfree(as->bounce.buffer);
> as->bounce.buffer = NULL;
> memory_region_unref(as->bounce.mr);
> /* Clear in_use before reading map_client_list. */
> qatomic_set_mb(&as->bounce.in_use, false);
> address_space_notify_map_clients(as);
>
> address_space_write and qemu_vfree are safe to call with NULL/0
> parameters. as->bounce.buffer = NULL would leak the buffer if one is
> allocated, and memory_region_unref(as->bounce.mr) is only OK if the
> bounce buffer hasn't been used before, otherwise we'd erroneously drop
> a memory region reference.
>
> We have two options here: Either we fix the caller to not call
> address_space_unmap with buffer=NULL. Or alternatively we make
> address_space_unmap NULL-safe by putting a check to return immediately
> when being passed a NULL buffer parameter.
>
> Looking at the code, the dma_memory_unmap calls in hw/ide/macio.c seem
> to be passing buffer=NULL unconditionally, since the dma_mem field in
> struct DBDMA_io is never set to anything non-zero. In fact, I believe
> after commit be1e343995ef81fc05d9a4e1ec263ca171d842e7 "macio: switch
> over to new byte-aligned DMA helpers", the dma_memory_unmap calls in
> hw/ide/macio.c aren't doing anything and should probably have been
> removed together with the dma_mem, dma_len and dir fields in struct
> DBDMA_io. Speculative patch:
>
> diff --git a/hw/ide/macio.c b/hw/ide/macio.c
> index e84bf2c9f6..15dd40138e 100644
> --- a/hw/ide/macio.c
> +++ b/hw/ide/macio.c
> @@ -119,9 +119,6 @@ static void pmac_ide_atapi_transfer_cb(void
> *opaque, int ret)
> return;
>
> done:
> - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
> - io->dir, io->dma_len);
> -
> if (ret < 0) {
> block_acct_failed(blk_get_stats(s->blk), &s->acct);
> } else {
> @@ -202,9 +199,6 @@ static void pmac_ide_transfer_cb(void *opaque, int ret)
> return;
>
> done:
> - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
> - io->dir, io->dma_len);
> -
> if (s->dma_cmd == IDE_DMA_READ || s->dma_cmd == IDE_DMA_WRITE) {
> if (ret < 0) {
> block_acct_failed(blk_get_stats(s->blk), &s->acct);
> diff --git a/include/hw/ppc/mac_dbdma.h b/include/hw/ppc/mac_dbdma.h
> index 4a3f644516..c774f6bf84 100644
> --- a/include/hw/ppc/mac_dbdma.h
> +++ b/include/hw/ppc/mac_dbdma.h
> @@ -44,10 +44,6 @@ struct DBDMA_io {
> DBDMA_end dma_end;
> /* DMA is in progress, don't start another one */
> bool processing;
> - /* DMA request */
> - void *dma_mem;
> - dma_addr_t dma_len;
> - DMADirection dir;
> };
>
> /*
>
> Cédric, can you try with the above patch and/or
The crash seems gone.
> share more details of your setup so I can verify
You will need a Linux powerpc or powerpc64 image for Mac machines,
which are not common nowadays, or a MacOS image. My Debian images
are big. I will try to build you a small one for more tests.
> (I tried booting a ppc64el-pseries dqib
> image but didn't see the issue)?
pseries is a very different type of machine, the equivalent of the virt
machine on ARM and RISC-V. The HW is completely different.
Thanks,
C.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 11:44 ` Peter Maydell
@ 2024-09-16 12:13 ` Mark Cave-Ayland
2024-09-16 12:28 ` Peter Maydell
0 siblings, 1 reply; 24+ messages in thread
From: Mark Cave-Ayland @ 2024-09-16 12:13 UTC (permalink / raw)
To: Peter Maydell
Cc: Mattias Nissler, Peter Xu, Cédric Le Goater, qemu-devel,
Fabiano Rosas, Philippe Mathieu-Daudé
On 16/09/2024 12:44, Peter Maydell wrote:
> On Mon, 16 Sept 2024 at 12:29, Mark Cave-Ayland
> <mark.cave-ayland@ilande.co.uk> wrote:
>>
>> On 16/09/2024 09:23, Mattias Nissler wrote:
>>> Looking at the code, the dma_memory_unmap calls in hw/ide/macio.c seem
>>> to be passing buffer=NULL unconditionally, since the dma_mem field in
>>> struct DBDMA_io is never set to anything non-zero. In fact, I believe
>>> after commit be1e343995ef81fc05d9a4e1ec263ca171d842e7 "macio: switch
>>> over to new byte-aligned DMA helpers", the dma_memory_unmap calls in
>>> hw/ide/macio.c aren't doing anything and should probably have been
>>> removed together with the dma_mem, dma_len and dir fields in struct
>>> DBDMA_io. Speculative patch:
>>>
>>> diff --git a/hw/ide/macio.c b/hw/ide/macio.c
>>> index e84bf2c9f6..15dd40138e 100644
>>> --- a/hw/ide/macio.c
>>> +++ b/hw/ide/macio.c
>>> @@ -119,9 +119,6 @@ static void pmac_ide_atapi_transfer_cb(void
>>> *opaque, int ret)
>>> return;
>>>
>>> done:
>>> - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
>>> - io->dir, io->dma_len);
>>> -
>>> if (ret < 0) {
>>> block_acct_failed(blk_get_stats(s->blk), &s->acct);
>>> } else {
>>> @@ -202,9 +199,6 @@ static void pmac_ide_transfer_cb(void *opaque, int ret)
>>> return;
>>>
>>> done:
>>> - dma_memory_unmap(&address_space_memory, io->dma_mem, io->dma_len,
>>> - io->dir, io->dma_len);
>>> -
>>> if (s->dma_cmd == IDE_DMA_READ || s->dma_cmd == IDE_DMA_WRITE) {
>>> if (ret < 0) {
>>> block_acct_failed(blk_get_stats(s->blk), &s->acct);
>>> diff --git a/include/hw/ppc/mac_dbdma.h b/include/hw/ppc/mac_dbdma.h
>>> index 4a3f644516..c774f6bf84 100644
>>> --- a/include/hw/ppc/mac_dbdma.h
>>> +++ b/include/hw/ppc/mac_dbdma.h
>>> @@ -44,10 +44,6 @@ struct DBDMA_io {
>>> DBDMA_end dma_end;
>>> /* DMA is in progress, don't start another one */
>>> bool processing;
>>> - /* DMA request */
>>> - void *dma_mem;
>>> - dma_addr_t dma_len;
>>> - DMADirection dir;
>>> };
>>>
>>> /*
>>>
>>> Cédric, can you try with the above patch and/or share more details of
>>> your setup so I can verify (I tried booting a ppc64el-pseries dqib
>>> image but didn't see the issue)?
>>
>> I'm fairly sure that this patch would break MacOS 9 which was the reason that
>> dma_memory_unmap() was added here in the first place: what I was finding was that
>> without the dma_memory_unmap() the destination RAM wasn't being invalidated (or
>> marked dirty), causing random crashes during boot.
>
> dma_memory_unmap() of something you never mapped is
> definitely wrong. Whatever is going on here, leaving the unmap
> call in after you removed the dma_memory_map() call is just
> papering over the actual cause of the crashes.
>
>> Would the issue be solved by adding a corresponding dma_memory_map() beforehand at
>> the relevant places in hw/ide/macio.c? If that's required as part of the setup for
>> bounce buffers then I can see how not having this present could cause problems.
>
> The only purpose of this API is sequences of:
> host_ptr = dma_memory_map(...);
> access the host_ptr directly;
> dma_memory_unmap(...);
>
> The bounce-buffer stuff is an internal implementation detail
> of making this API work when the DMA is going to a device.
>
> We need to find whatever the actual cause of the macos failure is.
> Mattias' suggested change looks right to me.
>
> I do wonder if something needs the memory barrier that
> unmap does as part of its operation, e.g. in the
> implementation of the dma_blk_* functions.
It has been a few years now, but I'm fairly sure the issue was that dma_blk_read()
didn't mark RAM containing code as dirty/invalid, and since MacOS 9 used overlays,
it would crash randomly trying to execute stale memory. dma_memory_unmap()
checks whether the direction was to RAM, and then marks the memory dirty, allowing
the new code to be picked up after an MMU fault.
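In other words (a toy model with made-up names, not the real QEMU dirty-bitmap code), the behaviour being relied on is roughly:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: unmapping a device-to-RAM mapping flags the written pages
 * dirty, so stale cached translations of guest code get invalidated. */
enum sketch_dir { SKETCH_TO_DEVICE, SKETCH_FROM_DEVICE };

#define SKETCH_PAGES 8
static bool sketch_dirty[SKETCH_PAGES];

static void sketch_unmap(unsigned first_page, unsigned npages,
                         enum sketch_dir dir)
{
    if (dir == SKETCH_FROM_DEVICE) {   /* DMA wrote into guest RAM */
        for (unsigned i = 0; i < npages; i++) {
            sketch_dirty[first_page + i] = true;
        }
    }
}
```

If that dirty-marking only ever happened in unmap, dropping the unmap call without a replacement would explain the stale-code crashes.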
If the memory barriers are already in place for the dma_blk_*() functions then the
analysis could be correct, in which case the bug is a misunderstanding I made in
be1e343995 ("macio: switch over to new byte-aligned DMA helpers") back in 2016.
ATB,
Mark.
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 12:13 ` Mark Cave-Ayland
@ 2024-09-16 12:28 ` Peter Maydell
2024-09-16 12:44 ` Mattias Nissler
0 siblings, 1 reply; 24+ messages in thread
From: Peter Maydell @ 2024-09-16 12:28 UTC (permalink / raw)
To: Mark Cave-Ayland
Cc: Mattias Nissler, Peter Xu, Cédric Le Goater, qemu-devel,
Fabiano Rosas, Philippe Mathieu-Daudé
On Mon, 16 Sept 2024 at 13:14, Mark Cave-Ayland
<mark.cave-ayland@ilande.co.uk> wrote:
>
> On 16/09/2024 12:44, Peter Maydell wrote:
>
> > On Mon, 16 Sept 2024 at 12:29, Mark Cave-Ayland
> > <mark.cave-ayland@ilande.co.uk> wrote:
> >> I'm fairly sure that this patch would break MacOS 9 which was the reason that
> >> dma_memory_unmap() was added here in the first place: what I was finding was that
> >> without the dma_memory_unmap() the destination RAM wasn't being invalidated (or
> >> marked dirty), causing random crashes during boot.
> >
> > dma_memory_unmap() of something you never mapped is
> > definitely wrong. Whatever is going on here, leaving the unmap
> > call in after you removed the dma_memory_map() call is just
> > papering over the actual cause of the crashes.
> >
> >> Would the issue be solved by adding a corresponding dma_memory_map() beforehand at
> >> the relevant places in hw/ide/macio.c? If that's required as part of the setup for
> >> bounce buffers then I can see how not having this present could cause problems.
> >
> > The only purpose of this API is sequences of:
> > host_ptr = dma_memory_map(...);
> > access the host_ptr directly;
> > dma_memory_unmap(...);
> >
> > The bounce-buffer stuff is an internal implementation detail
> > of making this API work when the DMA is going to a device.
> >
> > We need to find whatever the actual cause of the macos failure is.
> > Mattias' suggested change looks right to me.
> >
> > I do wonder if something needs the memory barrier that
> > unmap does as part of its operation, e.g. in the
> > implementation of the dma_blk_* functions.
>
> It has been a few years now, but I'm fairly sure the issue was that dma_blk_read()
> didn't mark RAM containing code as dirty/invalid, and since MacOS 9 used overlays,
> it would crash randomly trying to execute stale memory. dma_memory_unmap()
> checks whether the direction was to RAM, and then marks the memory dirty, allowing
> the new code to be picked up after an MMU fault.
dma_blk_io() does its writes into guest memory by doing
a dma_memory_map()/write-to-host-pointer/dma_memory_unmap()
sequence, though (this is done in dma_blk_cb()).
More generally there should be *no* path for doing writes to
guest memory that does not handle the dirty-memory case:
so if there is one we need to find and fix it.
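As a sketch of that invariant (illustrative stand-ins only, not the real dma_memory_map()/dma_memory_unmap() signatures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy guest RAM plus a single dirty flag standing in for the bitmap. */
static unsigned char guest_ram[64];
static bool write_marked_dirty;

static void *map_sketch(size_t addr, size_t len)
{
    return (addr + len <= sizeof(guest_ram)) ? &guest_ram[addr] : NULL;
}

static void unmap_sketch(void *host_ptr, size_t len, bool is_write)
{
    (void)len;
    if (host_ptr && is_write) {
        write_marked_dirty = true;   /* dirty tracking lives in unmap */
    }
}

/* Every write to guest memory goes through map/write/unmap, so unmap
 * is the single place that handles dirty-memory bookkeeping. */
static bool dma_write_sketch(size_t addr, const void *buf, size_t len)
{
    void *p = map_sketch(addr, len);
    if (!p) {
        return false;
    }
    memcpy(p, buf, len);
    unmap_sketch(p, len, true);
    return true;
}
```

A write path that bypassed this pairing would leave the dirty state untouched, which is the class of bug to hunt for.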
thanks
-- PMM
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 12:13 ` Cédric Le Goater
@ 2024-09-16 12:28 ` Cédric Le Goater
2024-09-16 12:41 ` Mattias Nissler
0 siblings, 1 reply; 24+ messages in thread
From: Cédric Le Goater @ 2024-09-16 12:28 UTC (permalink / raw)
To: Mattias Nissler, Peter Xu
Cc: qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé, Mark Cave-Ayland
Mattias,
> Cédric, can you try with the above patch and/or
>
> The crash seems gone.
>
>> share more details of your setup so I can verify
>
> You will need a Linux powerpc or powerpc64 image for Mac machines,
> which are not common nowadays, or a MacOS image. My Debian images
> are big. I will try to build you a small one for more tests.
Grab :
https://cdimage.debian.org/cdimage/ports/10.0/powerpc/iso-cd/debian-10.0-powerpc-NETINST-1.iso
and run :
qemu-system-ppc -M mac99 -cpu g4 -cdrom debian-10.0.0-powerpc-NETINST-1.iso -nographic -boot d
Thanks,
C.
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 12:28 ` Cédric Le Goater
@ 2024-09-16 12:41 ` Mattias Nissler
2024-09-16 13:06 ` Cédric Le Goater
0 siblings, 1 reply; 24+ messages in thread
From: Mattias Nissler @ 2024-09-16 12:41 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Peter Xu, qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé, Mark Cave-Ayland
Thanks Cédric, I can reproduce now, and my proposed patch avoids
the crash as expected.
On Mon, Sep 16, 2024 at 2:28 PM Cédric Le Goater <clg@kaod.org> wrote:
>
> Mattias,
>
>
> > Cédric, can you try with the above patch and/or
> >
> > The crash seems gone.
> >
> >> share more details of your setup so I can verify
> >
> > You will need a Linux powerpc or powerpc64 image for Mac machines,
> > which are not common nowadays, or a MacOS image. My Debian images
> > are big. I will try to build you a small one for more tests.
>
> Grab :
>
> https://cdimage.debian.org/cdimage/ports/10.0/powerpc/iso-cd/debian-10.0-powerpc-NETINST-1.iso
>
> and run :
>
> qemu-system-ppc -M mac99 -cpu g4 -cdrom debian-10.0.0-powerpc-NETINST-1.iso -nographic -boot d
>
> Thanks,
>
> C.
>
>
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 12:28 ` Peter Maydell
@ 2024-09-16 12:44 ` Mattias Nissler
0 siblings, 0 replies; 24+ messages in thread
From: Mattias Nissler @ 2024-09-16 12:44 UTC (permalink / raw)
To: Peter Maydell
Cc: Mark Cave-Ayland, Peter Xu, Cédric Le Goater, qemu-devel,
Fabiano Rosas, Philippe Mathieu-Daudé
On Mon, Sep 16, 2024 at 2:28 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 16 Sept 2024 at 13:14, Mark Cave-Ayland
> <mark.cave-ayland@ilande.co.uk> wrote:
> >
> > On 16/09/2024 12:44, Peter Maydell wrote:
> >
> > > On Mon, 16 Sept 2024 at 12:29, Mark Cave-Ayland
> > > <mark.cave-ayland@ilande.co.uk> wrote:
> > >> I'm fairly sure that this patch would break MacOS 9 which was the reason that
> > >> dma_memory_unmap() was added here in the first place: what I was finding was that
> > >> without the dma_memory_unmap() the destination RAM wasn't being invalidated (or
> > >> marked dirty), causing random crashes during boot.
> > >
> > > dma_memory_unmap() of something you never mapped is
> > > definitely wrong. Whatever is going on here, leaving the unmap
> > > call in after you removed the dma_memory_map() call is just
> > > papering over the actual cause of the crashes.
> > >
> > >> Would the issue be solved by adding a corresponding dma_memory_map() beforehand at
> > >> the relevant places in hw/ide/macio.c? If that's required as part of the setup for
> > >> bounce buffers then I can see how not having this present could cause problems.
> > >
> > > The only purpose of this API is sequences of:
> > > host_ptr = dma_memory_map(...);
> > > access the host_ptr directly;
> > > dma_memory_unmap(...);
> > >
> > > The bounce-buffer stuff is an internal implementation detail
> > > of making this API work when the DMA is going to a device.
> > >
> > > We need to find whatever the actual cause of the macos failure is.
> > > Mattias' suggested change looks right to me.
> > >
> > > I do wonder if something needs the memory barrier that
> > > unmap does as part of its operation, e.g. in the
> > > implementation of the dma_blk_* functions.
> >
> > It has been a few years now, but I'm fairly sure the issue was that dma_blk_read()
> > didn't mark RAM containing code as dirty/invalid, and since MacOS 9 used overlays,
> > it would crash randomly trying to execute stale memory. dma_memory_unmap()
> > checks whether the direction was to RAM, and then marks the memory dirty, allowing
> > the new code to be picked up after an MMU fault.
>
> dma_blk_io() does its writes into guest memory by doing
> a dma_memory_map()/write-to-host-pointer/dma_memory_unmap()
> sequence, though (this is done in dma_blk_cb()).
>
> More generally there should be *no* path for doing writes to
> guest memory that does not handle the dirty-memory case:
> so if there is one we need to find and fix it.
I concur that it should be the responsibility of the code performing
the DMA write to make sure any invalidation side effects take place
rather than relying on ad-hoc calls taking place later.
Regardless, in the interest of reaching a conclusion here: Mark, can
you provide instructions on how to verify MacOS 9 or alternatively
kindly do a quick test?
Thanks,
Mattias
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 12:41 ` Mattias Nissler
@ 2024-09-16 13:06 ` Cédric Le Goater
2024-09-16 17:47 ` Mattias Nissler
0 siblings, 1 reply; 24+ messages in thread
From: Cédric Le Goater @ 2024-09-16 13:06 UTC (permalink / raw)
To: Mattias Nissler
Cc: Peter Xu, qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé, Mark Cave-Ayland
On 9/16/24 14:41, Mattias Nissler wrote:
> Thanks Cédric, I can reproduce now, and my proposed patch avoids
> the crash as expected.
Disk images for MacOS 9 and MacOS X 10 all boot.
C.
* Re: [PULL 1/9] softmmu: Support concurrent bounce buffers
2024-09-16 13:06 ` Cédric Le Goater
@ 2024-09-16 17:47 ` Mattias Nissler
0 siblings, 0 replies; 24+ messages in thread
From: Mattias Nissler @ 2024-09-16 17:47 UTC (permalink / raw)
To: Cédric Le Goater
Cc: Peter Xu, qemu-devel, Peter Maydell, Fabiano Rosas,
Philippe Mathieu-Daudé, Mark Cave-Ayland
On Mon, Sep 16, 2024 at 3:06 PM Cédric Le Goater <clg@kaod.org> wrote:
>
> On 9/16/24 14:41, Mattias Nissler wrote:
> > Thanks Cédric, I can reproduce now, and my proposed patch avoids
> > the crash as expected.
> Disk images for MacOS 9 and MacOS X 10 all boot.
Thanks for testing, happy to hear!
I will go ahead and send the change proposed earlier as a patch to the
list then.
>
> C.
>
>
>
end of thread, other threads:[~2024-09-16 17:48 UTC | newest]
Thread overview: 24+ messages
2024-09-09 20:11 [PULL 0/9] Migration 20240909 patches Peter Xu
2024-09-09 20:11 ` [PULL 1/9] softmmu: Support concurrent bounce buffers Peter Xu
2024-09-13 14:35 ` Cédric Le Goater
2024-09-13 14:47 ` Peter Xu
2024-09-16 8:23 ` Mattias Nissler
2024-09-16 11:29 ` Mark Cave-Ayland
2024-09-16 11:44 ` Peter Maydell
2024-09-16 12:13 ` Mark Cave-Ayland
2024-09-16 12:28 ` Peter Maydell
2024-09-16 12:44 ` Mattias Nissler
2024-09-16 12:13 ` Cédric Le Goater
2024-09-16 12:28 ` Cédric Le Goater
2024-09-16 12:41 ` Mattias Nissler
2024-09-16 13:06 ` Cédric Le Goater
2024-09-16 17:47 ` Mattias Nissler
2024-09-09 20:11 ` [PULL 2/9] softmmu/physmem: fix memory leak in dirty_memory_extend() Peter Xu
2024-09-09 20:11 ` [PULL 3/9] ci: migration: Don't run python tests in the compat job Peter Xu
2024-09-09 20:11 ` [PULL 4/9] docs/migration: add qatzip compression feature Peter Xu
2024-09-09 20:11 ` [PULL 5/9] meson: Introduce 'qatzip' feature to the build system Peter Xu
2024-09-09 20:11 ` [PULL 6/9] migration: Add migration parameters for QATzip Peter Xu
2024-09-09 20:11 ` [PULL 7/9] migration: Introduce 'qatzip' compression method Peter Xu
2024-09-09 20:11 ` [PULL 8/9] tests/migration: Add integration test for " Peter Xu
2024-09-09 20:11 ` [PULL 9/9] system: improve migration debug Peter Xu
2024-09-10 14:46 ` [PULL 0/9] Migration 20240909 patches Peter Maydell