qemu-devel.nongnu.org archive mirror
* [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory()
@ 2019-09-24 14:47 Igor Mammedov
  2019-09-24 14:47 ` [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot Igor Mammedov
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Igor Mammedov @ 2019-09-24 14:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: thuth, david, cohuck, peterx, borntraeger, qemu-s390x, pbonzini

Changelog:
  since v6:
    - include and rebase on top of
       [PATCH 0/2] kvm: clear dirty bitmaps from all overlapping memslots
        https://www.mail-archive.com/qemu-devel@nongnu.org/msg646200.html
    - minor fixups suggested during v6 review
    - more testing incl. hacked x86
  since v5:
    - [1/2] fix migration that wasn't starting and make sure that KVM part
      is able to handle 1:n MemorySection:memslot arrangement
  since v3:
    - fix compilation issue
    - advance HVA along with GPA in kvm_set_phys_mem()
  since v2:
    - break migration from old QEMU (since 2.12-4.1) for guests with >8TB RAM
      and drop the migratable aliases patch, as was agreed during v2 review
    - drop 4.2 machines patch as it's not prerequisite anymore
  since v1:
    - include 4.2 machines patch for adding compat RAM layout on top
    - 2/4 add patch missing in v1 for splitting a too-big MemorySection over
          several memslots
    - 3/4 amend code path on alias destruction to ensure that RAMBlock is
          cleaned properly
    - 4/4 add compat machine code to keep old layout (migration-wise) for
          4.1 and older machines 


While looking into unifying guest RAM allocation to use hostmem backends
for initial RAM (especially when -mem-path is used) and retiring the
memory_region_allocate_system_memory() API, leaving only a single hostmem backend,
I was inspecting how it is currently used by boards, and it turns out several
boards abuse it by calling the function several times (despite the documented
contract forbidding it).

s390 is one such board, where the KVM limitation on memslot size got propagated
into the board design and memory_region_allocate_system_memory() was abused to
satisfy the KVM requirement for a maximum RAM chunk size, where a memory region
alias would have sufficed.

Unfortunately, this memory_region_allocate_system_memory() usage created a migration
dependency where guest RAM is transferred in the migration stream as several RAMBlocks
if it's larger than KVM_SLOT_MAX_BYTES. During v2 review it was agreed to accept the
migration breakage (documenting it in the release notes) and keep only the KVM fix.

In order to replace these several RAM chunks with a single memdev and keep it
working with the KVM memslot size limit, the latter was modified to deal with
a memory section split across several KVMSlots, and the manual RAM splitting in
s390 was replaced by a single memory_region_allocate_system_memory() call.
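
The KVM-side change boils down to looping over a section in chunks of at most
kvm_max_slot_size wherever a 1:1 section:memslot relation used to be assumed.
A simplified sketch of the registration path from patch 3/4 (illustrative only,
error handling and locking elided):

    do {
        slot_size = MIN(kvm_max_slot_size, size);
        mem = kvm_alloc_slot(kml);
        mem->memory_size = slot_size;
        mem->start_addr = start_addr;
        mem->ram = ram;
        mem->flags = kvm_mem_flags(mr);
        err = kvm_set_user_memory_region(kml, mem, true);

        start_addr += slot_size;
        ram += slot_size;       /* advance HVA along with GPA */
        size -= slot_size;
    } while (size);

So a guest RAM region bigger than the limit ends up as several memslots backed
by one RAMBlock, and the migration stream still sees a single "s390.ram" block.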

Tested:
  * s390 with hacked KVM_SLOT_MAX_BYTES = 128MB
      - guest reboot cycle in ping-pong migration
  * x86 with hacked max memslot = 128 and manual_dirty_log_protect enabled
      - ping-pong migration with workload dirtying RAM around a split area



Igor Mammedov (2):
  kvm: split too big memory section on several memslots
  s390: do not call memory_region_allocate_system_memory() multiple
    times

Paolo Bonzini (2):
  kvm: extract kvm_log_clear_one_slot
  kvm: clear dirty bitmaps from all overlapping memslots

 include/sysemu/kvm_int.h   |   1 +
 accel/kvm/kvm-all.c        | 238 +++++++++++++++++++++++--------------
 hw/s390x/s390-virtio-ccw.c |  30 +----
 target/s390x/kvm.c         |  11 ++
 4 files changed, 161 insertions(+), 119 deletions(-)

-- 
2.18.1




* [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot
  2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
@ 2019-09-24 14:47 ` Igor Mammedov
  2019-09-30 10:25   ` Christian Borntraeger
  2019-09-24 14:47 ` [PATCH v7 2/4] kvm: clear dirty bitmaps from all overlapping memslots Igor Mammedov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-24 14:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: thuth, david, cohuck, peterx, borntraeger, qemu-s390x, pbonzini

From: Paolo Bonzini <pbonzini@redhat.com>

We may need to clear the dirty bitmap for more than one KVM memslot.
First do some code movement with no semantic change.
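
The extracted helper ends up with the shape below (taken from the diff that
follows); later patches in the series call it once per overlapping memslot:

    /* Clear (part of) the kernel's dirty bitmap of one memslot via
     * KVM_CLEAR_DIRTY_LOG; carved out of kvm_physical_log_clear() unchanged. */
    static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id,
                                      uint64_t start, uint64_t size);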

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 102 ++++++++++++++++++++++++--------------------
 1 file changed, 56 insertions(+), 46 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b09bad0804..e9e6086c09 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -575,55 +575,13 @@ out:
 #define KVM_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << KVM_CLEAR_LOG_SHIFT)
 #define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)
 
-/**
- * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
- *
- * NOTE: this will be a no-op if we haven't enabled manual dirty log
- * protection in the host kernel because in that case this operation
- * will be done within log_sync().
- *
- * @kml:     the kvm memory listener
- * @section: the memory range to clear dirty bitmap
- */
-static int kvm_physical_log_clear(KVMMemoryListener *kml,
-                                  MemoryRegionSection *section)
+static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start, uint64_t size)
 {
     KVMState *s = kvm_state;
+    uint64_t end, bmap_start, start_delta, bmap_npages;
     struct kvm_clear_dirty_log d;
-    uint64_t start, end, bmap_start, start_delta, bmap_npages, size;
     unsigned long *bmap_clear = NULL, psize = qemu_real_host_page_size;
-    KVMSlot *mem = NULL;
-    int ret, i;
-
-    if (!s->manual_dirty_log_protect) {
-        /* No need to do explicit clear */
-        return 0;
-    }
-
-    start = section->offset_within_address_space;
-    size = int128_get64(section->size);
-
-    if (!size) {
-        /* Nothing more we can do... */
-        return 0;
-    }
-
-    kvm_slots_lock(kml);
-
-    /* Find any possible slot that covers the section */
-    for (i = 0; i < s->nr_slots; i++) {
-        mem = &kml->slots[i];
-        if (mem->start_addr <= start &&
-            start + size <= mem->start_addr + mem->memory_size) {
-            break;
-        }
-    }
-
-    /*
-     * We should always find one memslot until this point, otherwise
-     * there could be something wrong from the upper layer
-     */
-    assert(mem && i != s->nr_slots);
+    int ret;
 
     /*
      * We need to extend either the start or the size or both to
@@ -694,7 +652,7 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
     /* It should never overflow.  If it happens, say something */
     assert(bmap_npages <= UINT32_MAX);
     d.num_pages = bmap_npages;
-    d.slot = mem->slot | (kml->as_id << 16);
+    d.slot = mem->slot | (as_id << 16);
 
     if (kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d) == -1) {
         ret = -errno;
@@ -717,6 +675,58 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
                  size / psize);
     /* This handles the NULL case well */
     g_free(bmap_clear);
+    return ret;
+}
+
+
+/**
+ * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
+ *
+ * NOTE: this will be a no-op if we haven't enabled manual dirty log
+ * protection in the host kernel because in that case this operation
+ * will be done within log_sync().
+ *
+ * @kml:     the kvm memory listener
+ * @section: the memory range to clear dirty bitmap
+ */
+static int kvm_physical_log_clear(KVMMemoryListener *kml,
+                                  MemoryRegionSection *section)
+{
+    KVMState *s = kvm_state;
+    uint64_t start, size;
+    KVMSlot *mem = NULL;
+    int ret, i;
+
+    if (!s->manual_dirty_log_protect) {
+        /* No need to do explicit clear */
+        return 0;
+    }
+
+    start = section->offset_within_address_space;
+    size = int128_get64(section->size);
+
+    if (!size) {
+        /* Nothing more we can do... */
+        return 0;
+    }
+
+    kvm_slots_lock(kml);
+
+    /* Find any possible slot that covers the section */
+    for (i = 0; i < s->nr_slots; i++) {
+        mem = &kml->slots[i];
+        if (mem->start_addr <= start &&
+            start + size <= mem->start_addr + mem->memory_size) {
+            break;
+        }
+    }
+
+    /*
+     * We should always find one memslot until this point, otherwise
+     * there could be something wrong from the upper layer
+     */
+    assert(mem && i != s->nr_slots);
+    ret = kvm_log_clear_one_slot(mem, kml->as_id, start, size);
 
     kvm_slots_unlock(kml);
 
-- 
2.18.1




* [PATCH v7 2/4] kvm: clear dirty bitmaps from all overlapping memslots
  2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
  2019-09-24 14:47 ` [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot Igor Mammedov
@ 2019-09-24 14:47 ` Igor Mammedov
  2019-09-24 14:47 ` [PATCH v7 3/4] kvm: split too big memory section on several memslots Igor Mammedov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Igor Mammedov @ 2019-09-24 14:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: thuth, david, cohuck, peterx, borntraeger, qemu-s390x, pbonzini

From: Paolo Bonzini <pbonzini@redhat.com>

Currently a MemoryRegionSection has a 1:1 mapping to a KVMSlot.
However, the next patch will allow splitting a MemoryRegionSection into
several KVMSlots, so make sure that kvm_physical_log_clear()
is able to handle such a 1:N mapping.
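
A worked example of the per-slot arithmetic this introduces (the numbers are
purely illustrative):

    /*
     * Clear request: start = 3G, size = 3G, i.e. the range [3G, 6G).
     * Slot A: start_addr = 0,  memory_size = 4G   -> overlaps
     * Slot B: start_addr = 4G, memory_size = 4G   -> overlaps
     *
     * Slot A (start >= mem->start_addr):
     *     offset = 3G - 0 = 3G
     *     count  = MIN(4G - 3G, 3G) = 1G          -> clears [3G, 4G)
     * Slot B (start < mem->start_addr):
     *     offset = 0
     *     count  = MIN(4G, 3G - (4G - 3G)) = 2G   -> clears [4G, 6G)
     */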

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e9e6086c09..315a91557f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -588,8 +588,8 @@ static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start, uint6
      * satisfy the KVM interface requirement.  Firstly, do the start
      * page alignment on 64 host pages
      */
-    bmap_start = (start - mem->start_addr) & KVM_CLEAR_LOG_MASK;
-    start_delta = start - mem->start_addr - bmap_start;
+    bmap_start = start & KVM_CLEAR_LOG_MASK;
+    start_delta = start - bmap_start;
     bmap_start /= psize;
 
     /*
@@ -693,8 +693,8 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
                                   MemoryRegionSection *section)
 {
     KVMState *s = kvm_state;
-    uint64_t start, size;
-    KVMSlot *mem = NULL;
+    uint64_t start, size, offset, count;
+    KVMSlot *mem;
     int ret, i;
 
     if (!s->manual_dirty_log_protect) {
@@ -712,22 +712,30 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
 
     kvm_slots_lock(kml);
 
-    /* Find any possible slot that covers the section */
     for (i = 0; i < s->nr_slots; i++) {
         mem = &kml->slots[i];
-        if (mem->start_addr <= start &&
-            start + size <= mem->start_addr + mem->memory_size) {
+        /* Discard slots that are empty or do not overlap the section */
+        if (!mem->memory_size ||
+            mem->start_addr > start + size - 1 ||
+            start > mem->start_addr + mem->memory_size - 1) {
+            continue;
+        }
+
+        if (start >= mem->start_addr) {
+            /* The slot starts before section or is aligned to it.  */
+            offset = start - mem->start_addr;
+            count = MIN(mem->memory_size - offset, size);
+        } else {
+            /* The slot starts after section.  */
+            offset = 0;
+            count = MIN(mem->memory_size, size - (mem->start_addr - start));
+        }
+        ret = kvm_log_clear_one_slot(mem, kml->as_id, offset, count);
+        if (ret < 0) {
             break;
         }
     }
 
-    /*
-     * We should always find one memslot until this point, otherwise
-     * there could be something wrong from the upper layer
-     */
-    assert(mem && i != s->nr_slots);
-    ret = kvm_log_clear_one_slot(mem, kml->as_id, start, size);
-
     kvm_slots_unlock(kml);
 
     return ret;
-- 
2.18.1




* [PATCH v7 3/4] kvm: split too big memory section on several memslots
  2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
  2019-09-24 14:47 ` [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot Igor Mammedov
  2019-09-24 14:47 ` [PATCH v7 2/4] kvm: clear dirty bitmaps from all overlapping memslots Igor Mammedov
@ 2019-09-24 14:47 ` Igor Mammedov
  2019-09-25  3:12   ` Peter Xu
  2019-09-24 14:47 ` [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times Igor Mammedov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-24 14:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: thuth, david, cohuck, peterx, borntraeger, qemu-s390x, pbonzini

The maximum memslot size supported by KVM on s390 is 8TB;
move the logic of splitting RAM into chunks of up to 8TB into the KVM code.

This way KVM-specific restrictions are hidden in the KVM code
and don't affect board-level design decisions, which allows
us to avoid misusing the memory_region_allocate_system_memory() API
and eventually use a single hostmem backend for guest RAM.
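
The limit stays opt-in: an arch sets it from its init code before memslots are
registered, roughly as the s390 patch later in this series does:

    #include "sysemu/kvm_int.h"

    /* e.g. in the arch's kvm_arch_init(); the value must be a multiple of
     * qemu_real_host_page_size (enforced by a g_assert in the setter) */
    kvm_set_max_memslot_size(KVM_SLOT_MAX_BYTES);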

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
v7:
  * rebase on top of Paolo's series
     "kvm: clear dirty bitmaps from all overlapping memslots"
  * use MIN() instead of open coding it
  * keep KVM_SLOT_MAX_BYTES at original place, and deal with
    it in the next s390 specific patch
  * s/slot_size/range_size/ in kvm_physical_log_clear()
  * use temporary MemoryRegionSection variable and keep
    original kvm_get_dirty_pages_log_range() untouched
v6:
  * KVM's migration code was assuming a 1:1 relation between
    memslot and MemorySection, which no longer holds true for s390x
    with this patch. As a result migration was broken and
    dirty logging wasn't even started when a MemorySection
    was split over several memslots. Amend the related KVM dirty
    log tracking code to account for a split MemorySection.
v5:
  * move computation 'size -= slot_size' inside of loop body
          (David Hildenbrand <david@redhat.com>)
v4:
  * fix compilation issue
          (Christian Borntraeger <borntraeger@de.ibm.com>)
  * advance HVA along with GPA in kvm_set_phys_mem()
          (Christian Borntraeger <borntraeger@de.ibm.com>)

This patch prepares only the KVM side for switching to a single RAM memory region;
another patch will take care of dropping the manual RAM partitioning in
the s390 code.
---
 include/sysemu/kvm_int.h |   1 +
 accel/kvm/kvm-all.c      | 124 +++++++++++++++++++++++++--------------
 2 files changed, 81 insertions(+), 44 deletions(-)

diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 72b2d1b3ae..ac2d1f8b56 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -41,4 +41,5 @@ typedef struct KVMMemoryListener {
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
                                   AddressSpace *as, int as_id);
 
+void kvm_set_max_memslot_size(hwaddr max_slot_size);
 #endif
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 315a91557f..f3848e7e75 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -140,6 +140,7 @@ bool kvm_direct_msi_allowed;
 bool kvm_ioeventfd_any_length_allowed;
 bool kvm_msi_use_devid;
 static bool kvm_immediate_exit;
+static hwaddr kvm_max_slot_size = ~0;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
     KVM_CAP_INFO(USER_MEMORY),
@@ -437,7 +438,7 @@ static int kvm_slot_update_flags(KVMMemoryListener *kml, KVMSlot *mem,
 static int kvm_section_update_flags(KVMMemoryListener *kml,
                                     MemoryRegionSection *section)
 {
-    hwaddr start_addr, size;
+    hwaddr start_addr, size, slot_size;
     KVMSlot *mem;
     int ret = 0;
 
@@ -448,13 +449,18 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,
 
     kvm_slots_lock(kml);
 
-    mem = kvm_lookup_matching_slot(kml, start_addr, size);
-    if (!mem) {
-        /* We don't have a slot if we want to trap every access. */
-        goto out;
-    }
+    while (size && !ret) {
+        slot_size = MIN(kvm_max_slot_size, size);
+        mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
+        if (!mem) {
+            /* We don't have a slot if we want to trap every access. */
+            goto out;
+        }
 
-    ret = kvm_slot_update_flags(kml, mem, section->mr);
+        ret = kvm_slot_update_flags(kml, mem, section->mr);
+        start_addr += slot_size;
+        size -= slot_size;
+    }
 
 out:
     kvm_slots_unlock(kml);
@@ -527,11 +533,15 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
     struct kvm_dirty_log d = {};
     KVMSlot *mem;
     hwaddr start_addr, size;
+    hwaddr slot_size, slot_offset = 0;
     int ret = 0;
 
     size = kvm_align_section(section, &start_addr);
-    if (size) {
-        mem = kvm_lookup_matching_slot(kml, start_addr, size);
+    while (size) {
+        MemoryRegionSection subsection = *section;
+
+        slot_size = MIN(kvm_max_slot_size, size);
+        mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
         if (!mem) {
             /* We don't have a slot if we want to trap every access. */
             goto out;
@@ -549,11 +559,11 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
          * So for now, let's align to 64 instead of HOST_LONG_BITS here, in
          * a hope that sizeof(long) won't become >8 any time soon.
          */
-        size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
-                     /*HOST_LONG_BITS*/ 64) / 8;
         if (!mem->dirty_bmap) {
+            hwaddr bitmap_size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS),
+                                        /*HOST_LONG_BITS*/ 64) / 8;
             /* Allocate on the first log_sync, once and for all */
-            mem->dirty_bmap = g_malloc0(size);
+            mem->dirty_bmap = g_malloc0(bitmap_size);
         }
 
         d.dirty_bitmap = mem->dirty_bmap;
@@ -564,7 +574,13 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
             goto out;
         }
 
-        kvm_get_dirty_pages_log_range(section, d.dirty_bitmap);
+        subsection.offset_within_region += slot_offset;
+        subsection.size = int128_make64(slot_size);
+        kvm_get_dirty_pages_log_range(&subsection, d.dirty_bitmap);
+
+        slot_offset += slot_size;
+        start_addr += slot_size;
+        size -= slot_size;
     }
 out:
     return ret;
@@ -971,6 +987,14 @@ kvm_check_extension_list(KVMState *s, const KVMCapabilityInfo *list)
     return NULL;
 }
 
+void kvm_set_max_memslot_size(hwaddr max_slot_size)
+{
+    g_assert(
+        ROUND_UP(max_slot_size, qemu_real_host_page_size) == max_slot_size
+    );
+    kvm_max_slot_size = max_slot_size;
+}
+
 static void kvm_set_phys_mem(KVMMemoryListener *kml,
                              MemoryRegionSection *section, bool add)
 {
@@ -978,7 +1002,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     int err;
     MemoryRegion *mr = section->mr;
     bool writeable = !mr->readonly && !mr->rom_device;
-    hwaddr start_addr, size;
+    hwaddr start_addr, size, slot_size;
     void *ram;
 
     if (!memory_region_is_ram(mr)) {
@@ -1003,41 +1027,52 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
     kvm_slots_lock(kml);
 
     if (!add) {
-        mem = kvm_lookup_matching_slot(kml, start_addr, size);
-        if (!mem) {
-            goto out;
-        }
-        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
-            kvm_physical_sync_dirty_bitmap(kml, section);
-        }
+        do {
+            slot_size = MIN(kvm_max_slot_size, size);
+            mem = kvm_lookup_matching_slot(kml, start_addr, slot_size);
+            if (!mem) {
+                goto out;
+            }
+            if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+                kvm_physical_sync_dirty_bitmap(kml, section);
+            }
 
-        /* unregister the slot */
-        g_free(mem->dirty_bmap);
-        mem->dirty_bmap = NULL;
-        mem->memory_size = 0;
-        mem->flags = 0;
-        err = kvm_set_user_memory_region(kml, mem, false);
-        if (err) {
-            fprintf(stderr, "%s: error unregistering slot: %s\n",
-                    __func__, strerror(-err));
-            abort();
-        }
+            /* unregister the slot */
+            g_free(mem->dirty_bmap);
+            mem->dirty_bmap = NULL;
+            mem->memory_size = 0;
+            mem->flags = 0;
+            err = kvm_set_user_memory_region(kml, mem, false);
+            if (err) {
+                fprintf(stderr, "%s: error unregistering slot: %s\n",
+                        __func__, strerror(-err));
+                abort();
+            }
+            start_addr += slot_size;
+            size -= slot_size;
+        } while (size);
         goto out;
     }
 
     /* register the new slot */
-    mem = kvm_alloc_slot(kml);
-    mem->memory_size = size;
-    mem->start_addr = start_addr;
-    mem->ram = ram;
-    mem->flags = kvm_mem_flags(mr);
-
-    err = kvm_set_user_memory_region(kml, mem, true);
-    if (err) {
-        fprintf(stderr, "%s: error registering slot: %s\n", __func__,
-                strerror(-err));
-        abort();
-    }
+    do {
+        slot_size = MIN(kvm_max_slot_size, size);
+        mem = kvm_alloc_slot(kml);
+        mem->memory_size = slot_size;
+        mem->start_addr = start_addr;
+        mem->ram = ram;
+        mem->flags = kvm_mem_flags(mr);
+
+        err = kvm_set_user_memory_region(kml, mem, true);
+        if (err) {
+            fprintf(stderr, "%s: error registering slot: %s\n", __func__,
+                    strerror(-err));
+            abort();
+        }
+        start_addr += slot_size;
+        ram += slot_size;
+        size -= slot_size;
+    } while (size);
 
 out:
     kvm_slots_unlock(kml);
@@ -2877,6 +2912,7 @@ static bool kvm_accel_has_memory(MachineState *ms, AddressSpace *as,
 
     for (i = 0; i < kvm->nr_as; ++i) {
         if (kvm->as[i].as == as && kvm->as[i].ml) {
+            size = MIN(kvm_max_slot_size, size);
             return NULL != kvm_lookup_matching_slot(kvm->as[i].ml,
                                                     start_addr, size);
         }
-- 
2.18.1




* [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
                   ` (2 preceding siblings ...)
  2019-09-24 14:47 ` [PATCH v7 3/4] kvm: split too big memory section on several memslots Igor Mammedov
@ 2019-09-24 14:47 ` Igor Mammedov
  2019-09-25  3:27   ` Peter Xu
  2019-09-25  7:47 ` [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Christian Borntraeger
  2019-09-30 11:00 ` Christian Borntraeger
  5 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-24 14:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: thuth, david, cohuck, peterx, borntraeger, qemu-s390x, pbonzini

s390 was trying to work around the limited KVM memslot size by abusing
memory_region_allocate_system_memory(), which breaks the API contract
that the function may be called only once.

Besides being an invalid use of the API, the approach also introduced a
migration issue, since RAM chunks of KVM_SLOT_MAX_BYTES each are transferred
in the migration stream as separate RAMBlocks.

After discussion [1], it was agreed to break migration from older
QEMU for guests with RAM >8TB (as such support was relatively new (since 2.12)
and considered to be not actually used downstream).
Migration should keep working for guests with less than 8TB, and for
more than 8TB with a QEMU 4.2 or newer binary.
In case a user tries to migrate a more than 8TB guest between incompatible
QEMU versions, migration should fail gracefully due to a non-existing
RAMBlock ID or a RAMBlock size mismatch.

Taking into account the above, and that the KVM code is now able to split a
too-big MemorySection into several memslots, partially revert commit
 (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
and use kvm_set_max_memslot_size() to set the KVMSlot size to
KVM_SLOT_MAX_BYTES.

1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
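
For reference, the value being reinstated works out as follows (plain
arithmetic, not part of the patch):

    /* With TARGET_PAGE_SIZE = 4 KiB on s390x:
     *   KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE = (2^31 - 1) * 4 KiB
     *                                           = 0x7ffffffff000
     *   & SEG_MSK (1 MiB segment alignment)     = 0x7fffff00000
     * i.e. KVM_SLOT_MAX_BYTES is 1 MiB short of 8 TiB, hence the
     * "larger that 7.999TB" wording in bb223055b.
     */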

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
v7:
  - move KVM_SLOT_MAX_BYTES movement from kvm specific patch to here
    (Peter Xu <peterx@redhat.com>)
v3:
  - drop migration compat code

PS:
I don't have access to a suitable system to test it with the 8TB split,
so I've tested only a hacked-up KVM_SLOT_MAX_BYTES = 1GB variant
---
 hw/s390x/s390-virtio-ccw.c | 30 +++---------------------------
 target/s390x/kvm.c         | 11 +++++++++++
 2 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 8bfb6684cb..18ad279a00 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -154,39 +154,15 @@ static void virtio_ccw_register_hcalls(void)
                                    virtio_ccw_hcall_early_printk);
 }
 
-/*
- * KVM does only support memory slots up to KVM_MEM_MAX_NR_PAGES pages
- * as the dirty bitmap must be managed by bitops that take an int as
- * position indicator. If we have a guest beyond that we will split off
- * new subregions. The split must happen on a segment boundary (1MB).
- */
-#define KVM_MEM_MAX_NR_PAGES ((1ULL << 31) - 1)
-#define SEG_MSK (~0xfffffULL)
-#define KVM_SLOT_MAX_BYTES ((KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE) & SEG_MSK)
 static void s390_memory_init(ram_addr_t mem_size)
 {
     MemoryRegion *sysmem = get_system_memory();
-    ram_addr_t chunk, offset = 0;
-    unsigned int number = 0;
+    MemoryRegion *ram = g_new(MemoryRegion, 1);
     Error *local_err = NULL;
-    gchar *name;
 
     /* allocate RAM for core */
-    name = g_strdup_printf("s390.ram");
-    while (mem_size) {
-        MemoryRegion *ram = g_new(MemoryRegion, 1);
-        uint64_t size = mem_size;
-
-        /* KVM does not allow memslots >= 8 TB */
-        chunk = MIN(size, KVM_SLOT_MAX_BYTES);
-        memory_region_allocate_system_memory(ram, NULL, name, chunk);
-        memory_region_add_subregion(sysmem, offset, ram);
-        mem_size -= chunk;
-        offset += chunk;
-        g_free(name);
-        name = g_strdup_printf("s390.ram.%u", ++number);
-    }
-    g_free(name);
+    memory_region_allocate_system_memory(ram, NULL, "s390.ram", mem_size);
+    memory_region_add_subregion(sysmem, 0, ram);
 
     /*
      * Configure the maximum page size. As no memory devices were created
diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index 97a662ad0e..54864c259c 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -28,6 +28,7 @@
 #include "cpu.h"
 #include "internal.h"
 #include "kvm_s390x.h"
+#include "sysemu/kvm_int.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "qemu/timer.h"
@@ -122,6 +123,15 @@
  */
 #define VCPU_IRQ_BUF_SIZE(max_cpus) (sizeof(struct kvm_s390_irq) * \
                                      (max_cpus + NR_LOCAL_IRQS))
+/*
+ * KVM does only support memory slots up to KVM_MEM_MAX_NR_PAGES pages
+ * as the dirty bitmap must be managed by bitops that take an int as
+ * position indicator. If we have a guest beyond that we will split off
+ * new subregions. The split must happen on a segment boundary (1MB).
+ */
+#define KVM_MEM_MAX_NR_PAGES ((1ULL << 31) - 1)
+#define SEG_MSK (~0xfffffULL)
+#define KVM_SLOT_MAX_BYTES ((KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE) & SEG_MSK)
 
 static CPUWatchpoint hw_watchpoint;
 /*
@@ -355,6 +365,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
      */
     /* kvm_vm_enable_cap(s, KVM_CAP_S390_AIS, 0); */
 
+    kvm_set_max_memslot_size(KVM_SLOT_MAX_BYTES);
     return 0;
 }
 
-- 
2.18.1




* Re: [PATCH v7 3/4] kvm: split too big memory section on several memslots
  2019-09-24 14:47 ` [PATCH v7 3/4] kvm: split too big memory section on several memslots Igor Mammedov
@ 2019-09-25  3:12   ` Peter Xu
  2019-09-25 12:09     ` Igor Mammedov
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2019-09-25  3:12 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Tue, Sep 24, 2019 at 10:47:50AM -0400, Igor Mammedov wrote:

[...]

> @@ -2877,6 +2912,7 @@ static bool kvm_accel_has_memory(MachineState *ms, AddressSpace *as,
>  
>      for (i = 0; i < kvm->nr_as; ++i) {
>          if (kvm->as[i].as == as && kvm->as[i].ml) {
> +            size = MIN(kvm_max_slot_size, size);
>              return NULL != kvm_lookup_matching_slot(kvm->as[i].ml,
>                                                      start_addr, size);
>          }

Ideally we could also check that the whole (start_addr, size) region
is covered by KVM memslots here, but with the current code I can't think
of a case where the result doesn't match when only checking the 1st
memslot. So I assume it's fine.

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-24 14:47 ` [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times Igor Mammedov
@ 2019-09-25  3:27   ` Peter Xu
  2019-09-25 11:51     ` Igor Mammedov
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2019-09-25  3:27 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:
> s390 was trying to solve limited KVM memslot size issue by abusing
> memory_region_allocate_system_memory(), which breaks API contract
> where the function might be called only once.
> 
> Beside an invalid use of API, the approach also introduced migration
> issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are transferred in
> migration stream as separate RAMBlocks.
> 
> After discussion [1], it was agreed to break migration from older
> QEMU for guest with RAM >8Tb (as it was relatively new (since 2.12)
> and considered to be not actually used downstream).
> Migration should keep working for guests with less than 8TB and for
> more than 8TB with QEMU 4.2 and newer binary.
> In case user tries to migrate more than 8TB guest, between incompatible
> QEMU versions, migration should fail gracefully due to non-exiting
> RAMBlock ID or RAMBlock size mismatch.
> 
> Taking in account above and that now KVM code is able to split too
> big MemorySection into several memslots, partially revert commit
>  (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
> and use kvm_set_max_memslot_size() to set KVMSlot size to
> KVM_SLOT_MAX_BYTES.
> 
> 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
> 
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>

Acked-by: Peter Xu <peterx@redhat.com>

IMHO it would be good to at least mention bb223055b9 in the commit
message even if not with a "Fixed:" tag.  May be amended during commit
if anyone prefers.

Also, this only applies the split limitation to s390.  Would that be a
good thing for some other archs as well?

Thanks,

-- 
Peter Xu



* Re: [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory()
  2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
                   ` (3 preceding siblings ...)
  2019-09-24 14:47 ` [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times Igor Mammedov
@ 2019-09-25  7:47 ` Christian Borntraeger
  2019-09-30 11:00 ` Christian Borntraeger
  5 siblings, 0 replies; 20+ messages in thread
From: Christian Borntraeger @ 2019-09-25  7:47 UTC (permalink / raw)
  To: Igor Mammedov, qemu-devel
  Cc: thuth, david, cohuck, peterx, qemu-s390x, pbonzini

On 24.09.19 16:47, Igor Mammedov wrote:
> Changelog:
>   since v6:
>     - include and rebase on top of
>        [PATCH 0/2] kvm: clear dirty bitmaps from all overlapping memslots
>         https://www.mail-archive.com/qemu-devel@nongnu.org/msg646200.html
>     - minor fixups suggested during v6 review
>     - more testing incl. hacked x86
>   since v5:
>     - [1/2] fix migration that wasn't starting and make sure that KVM part
>       is able to handle 1:n MemorySection:memslot arrangement
>   since v3:
>     - fix compilation issue
>     - advance HVA along with GPA in kvm_set_phys_mem()
>   since v2:
>     - break migration from old QEMU (since 2.12-4.1) for guest with >8TB RAM
>       and drop migratable aliases patch as was agreed during v2 review
>     - drop 4.2 machines patch as it's not prerequisite anymore
>   since v1:
>     - include 4.2 machines patch for adding compat RAM layout on top
>     - 2/4 add missing in v1 patch for splitting too big MemorySection on
>           several memslots
>     - 3/4 amend code path on alias destruction to ensure that RAMBlock is
>           cleaned properly
>     - 4/4 add compat machine code to keep old layout (migration-wise) for
>           4.1 and older machines 
> 
> 
> While looking into unifying guest RAM allocation to use hostmem backends
> for initial RAM (especially when -mempath is used) and retiring
> memory_region_allocate_system_memory() API, leaving only single hostmem backend,
> I was inspecting how currently it is used by boards and it turns out several
> boards abuse it by calling the function several times (despite documented contract
> forbiding it).
> 
> s390 is one of such boards where KVM limitation on memslot size got propagated
> to board design and memory_region_allocate_system_memory() was abused to satisfy
> KVM requirement for max RAM chunk where memory region alias would suffice.
> 
> Unfortunately, memory_region_allocate_system_memory() usage created migration
> dependency where guest RAM is transferred in migration stream as several RAMBlocks
> if it's more than KVM_SLOT_MAX_BYTES. During v2 review it was agreed to ignore
> migration breakage (documenting it in release notes) and leaving only KVM fix.
> 
> In order to replace these several RAM chunks with a single memdev and keep it
> working with KVM memslot size limit, the later was modified to deal with 
> memory section split on several KVMSlots and manual RAM splitting in s390
> was replace by single memory_region_allocate_system_memory() call.
> 
> Tested:
>   * s390 with hacked KVM_SLOT_MAX_BYTES = 128Mb
>       - guest reboot cycle in ping-pong migration
>   * x86 with hacke max memslot = 128 and manual_dirty_log_protect enabled
>       - ping-pong migration with workload dirtying RAM around a split area
> 
> 
> 
> Igor Mammedov (2):
>   kvm: split too big memory section on several memslots
>   s390: do not call memory_region_allocate_system_memory() multiple
>     times
> 
> Paolo Bonzini (2):
>   kvm: extract kvm_log_clear_one_slot
>   kvm: clear dirty bitmaps from all overlapping memslots
> 
>  include/sysemu/kvm_int.h   |   1 +
>  accel/kvm/kvm-all.c        | 238 +++++++++++++++++++++++--------------
>  hw/s390x/s390-virtio-ccw.c |  30 +----
>  target/s390x/kvm.c         |  11 ++
>  4 files changed, 161 insertions(+), 119 deletions(-)
> 

Series
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>




FWIW, I think I would like to add something like the following later on.


Subject: [PATCH 1/1] s390/kvm: split kvm mem slots at 4TB

Instead of splitting at an unaligned address, we can simply split at
4TB.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 target/s390x/kvm.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index ad2dd14f7e78..611f56f4b5ac 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -126,12 +126,11 @@
 /*
  * KVM does only support memory slots up to KVM_MEM_MAX_NR_PAGES pages
  * as the dirty bitmap must be managed by bitops that take an int as
- * position indicator. If we have a guest beyond that we will split off
- * new subregions. The split must happen on a segment boundary (1MB).
+ * position indicator. This would end at an unaligned address
+ * (0x7fffff00000). As future variants might provide larger pages
+ * and to make all addresses properly aligned, let us split at 4TB.
  */
-#define KVM_MEM_MAX_NR_PAGES ((1ULL << 31) - 1)
-#define SEG_MSK (~0xfffffULL)
-#define KVM_SLOT_MAX_BYTES ((KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE) & SEG_MSK)
+#define KVM_SLOT_MAX_BYTES 4096UL*1024*1024*1024
 
 static CPUWatchpoint hw_watchpoint;
 /*
-- 
2.21.0




* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-25  3:27   ` Peter Xu
@ 2019-09-25 11:51     ` Igor Mammedov
  2019-09-25 23:52       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-25 11:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Wed, 25 Sep 2019 11:27:00 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:
> > s390 was trying to solve limited KVM memslot size issue by abusing
> > memory_region_allocate_system_memory(), which breaks API contract
> > where the function might be called only once.
> > 
> > Beside an invalid use of API, the approach also introduced migration
> > issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are transferred in
> > migration stream as separate RAMBlocks.
> > 
> > After discussion [1], it was agreed to break migration from older
> > QEMU for guest with RAM >8Tb (as it was relatively new (since 2.12)
> > and considered to be not actually used downstream).
> > Migration should keep working for guests with less than 8TB and for
> > more than 8TB with QEMU 4.2 and newer binary.
> > In case user tries to migrate more than 8TB guest, between incompatible
> > QEMU versions, migration should fail gracefully due to non-exiting
> > RAMBlock ID or RAMBlock size mismatch.
> > 
> > Taking in account above and that now KVM code is able to split too
> > big MemorySection into several memslots, partially revert commit
> >  (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
> > and use kvm_set_max_memslot_size() to set KVMSlot size to
> > KVM_SLOT_MAX_BYTES.
> > 
> > 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
> > 
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>  
> 
> Acked-by: Peter Xu <peterx@redhat.com>
> 
> IMHO it would be good to at least mention bb223055b9 in the commit
> message even if not with a "Fixed:" tag.  May be amended during commit
> if anyone prefers.

/me confused, bb223055b9 is mentioned in commit message
 
> Also, this only applies the split limitation to s390.  Would that be a
> good thing to some other archs as well?

Don't we have the similar bitmap size issue in KVM for other archs?

> 
> Thanks,
> 




* Re: [PATCH v7 3/4] kvm: split too big memory section on several memslots
  2019-09-25  3:12   ` Peter Xu
@ 2019-09-25 12:09     ` Igor Mammedov
  2019-09-25 23:45       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-25 12:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Wed, 25 Sep 2019 11:12:11 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Sep 24, 2019 at 10:47:50AM -0400, Igor Mammedov wrote:
> 
> [...]
> 
> > @@ -2877,6 +2912,7 @@ static bool kvm_accel_has_memory(MachineState *ms, AddressSpace *as,
> >  
> >      for (i = 0; i < kvm->nr_as; ++i) {
> >          if (kvm->as[i].as == as && kvm->as[i].ml) {
> > +            size = MIN(kvm_max_slot_size, size);
> >              return NULL != kvm_lookup_matching_slot(kvm->as[i].ml,
> >                                                      start_addr, size);
> >          }  
> 
> Ideally we could also check that the whole (start_addr, size) region
> is covered by KVM memslots here, but with current code I can't think
> of a case where the result doesn't match with only checking the 1st
> memslot. So I assume it's fine.
Yep, it's a micro-optimization that works on the assumption that the whole memory
section is always covered by memslots; the original semantics were
working only if start_addr/size covered the whole memory section.

The sole user, mtree_print_flatview(), is not performance sensitive,
so if you'd like I can post an additional patch that iterates
over the whole range (see the sketch below).
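
Such a follow-up could look roughly like this (a hypothetical sketch, not a
patch actually posted in this thread), mirroring the chunking used elsewhere
in the series:

    /* in kvm_accel_has_memory(): require every chunk of the range to be
     * backed by a memslot instead of only probing the first one */
    while (size) {
        hwaddr slot_size = MIN(kvm_max_slot_size, size);

        if (!kvm_lookup_matching_slot(kvm->as[i].ml, start_addr, slot_size)) {
            return false;
        }
        start_addr += slot_size;
        size -= slot_size;
    }
    return true;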

> Reviewed-by: Peter Xu <peterx@redhat.com>
> 




* Re: [PATCH v7 3/4] kvm: split too big memory section on several memslots
  2019-09-25 12:09     ` Igor Mammedov
@ 2019-09-25 23:45       ` Peter Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Xu @ 2019-09-25 23:45 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Wed, Sep 25, 2019 at 02:09:15PM +0200, Igor Mammedov wrote:
> On Wed, 25 Sep 2019 11:12:11 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Sep 24, 2019 at 10:47:50AM -0400, Igor Mammedov wrote:
> > 
> > [...]
> > 
> > > @@ -2877,6 +2912,7 @@ static bool kvm_accel_has_memory(MachineState *ms, AddressSpace *as,
> > >  
> > >      for (i = 0; i < kvm->nr_as; ++i) {
> > >          if (kvm->as[i].as == as && kvm->as[i].ml) {
> > > +            size = MIN(kvm_max_slot_size, size);
> > >              return NULL != kvm_lookup_matching_slot(kvm->as[i].ml,
> > >                                                      start_addr, size);
> > >          }  
> > 
> > Ideally we could also check that the whole (start_addr, size) region
> > is covered by KVM memslots here, but with current code I can't think
> > of a case where the result doesn't match with only checking the 1st
> > memslot. So I assume it's fine.
> yep, it's micro-optimization that works on assumption that whole memory
> section always is covered by memslots and original semantics where
> working only for if start_addr/size where covering whole memory section.
> 
> Sole user mtree_print_flatview() is not performance sensitive,
> so if you'd like I can post an additional patch that iterates
> over whole range.

No need it's fine, thanks!

-- 
Peter Xu



* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-25 11:51     ` Igor Mammedov
@ 2019-09-25 23:52       ` Peter Xu
  2019-09-27 13:33         ` Igor Mammedov
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2019-09-25 23:52 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Wed, Sep 25, 2019 at 01:51:05PM +0200, Igor Mammedov wrote:
> On Wed, 25 Sep 2019 11:27:00 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:
> > > s390 was trying to solve limited KVM memslot size issue by abusing
> > > memory_region_allocate_system_memory(), which breaks API contract
> > > where the function might be called only once.
> > > 
> > > Beside an invalid use of API, the approach also introduced migration
> > > issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are transferred in
> > > migration stream as separate RAMBlocks.
> > > 
> > > After discussion [1], it was agreed to break migration from older
> > > QEMU for guest with RAM >8Tb (as it was relatively new (since 2.12)
> > > and considered to be not actually used downstream).
> > > Migration should keep working for guests with less than 8TB and for
> > > more than 8TB with QEMU 4.2 and newer binary.
> > > In case user tries to migrate more than 8TB guest, between incompatible
> > > QEMU versions, migration should fail gracefully due to non-exiting
> > > RAMBlock ID or RAMBlock size mismatch.
> > > 
> > > Taking in account above and that now KVM code is able to split too
> > > big MemorySection into several memslots, partially revert commit
> > >  (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
> > > and use kvm_set_max_memslot_size() to set KVMSlot size to
> > > KVM_SLOT_MAX_BYTES.
> > > 
> > > 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
> > > 
> > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>  
> > 
> > Acked-by: Peter Xu <peterx@redhat.com>
> > 
> > IMHO it would be good to at least mention bb223055b9 in the commit
> > message even if not with a "Fixed:" tag.  May be amended during commit
> > if anyone prefers.
> 
> /me confused, bb223055b9 is mentioned in commit message

I'm sorry, I overlooked that.

>  
> > Also, this only applies the split limitation to s390.  Would that be a
> > good thing to some other archs as well?
> 
> Don't we have the similar bitmap size issue in KVM for other archs?

Yes I thought we had.  So I feel like it would be good to also allow
other archs to support >8TB mem as well.  Thanks,

-- 
Peter Xu



* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-25 23:52       ` Peter Xu
@ 2019-09-27 13:33         ` Igor Mammedov
  2019-09-28  1:28           ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-27 13:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Thu, 26 Sep 2019 07:52:35 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Sep 25, 2019 at 01:51:05PM +0200, Igor Mammedov wrote:
> > On Wed, 25 Sep 2019 11:27:00 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:  
> > > > s390 was trying to solve limited KVM memslot size issue by abusing
> > > > memory_region_allocate_system_memory(), which breaks API contract
> > > > where the function might be called only once.
> > > > 
> > > > Beside an invalid use of API, the approach also introduced migration
> > > > issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are transferred in
> > > > migration stream as separate RAMBlocks.
> > > > 
> > > > After discussion [1], it was agreed to break migration from older
> > > > QEMU for guest with RAM >8Tb (as it was relatively new (since 2.12)
> > > > and considered to be not actually used downstream).
> > > > Migration should keep working for guests with less than 8TB and for
> > > > more than 8TB with QEMU 4.2 and newer binary.
> > > > In case user tries to migrate more than 8TB guest, between incompatible
> > > > QEMU versions, migration should fail gracefully due to non-exiting
> > > > RAMBlock ID or RAMBlock size mismatch.
> > > > 
> > > > Taking in account above and that now KVM code is able to split too
> > > > big MemorySection into several memslots, partially revert commit
> > > >  (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
> > > > and use kvm_set_max_memslot_size() to set KVMSlot size to
> > > > KVM_SLOT_MAX_BYTES.
> > > > 
> > > > 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
> > > > 
> > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>    
> > > 
> > > Acked-by: Peter Xu <peterx@redhat.com>
> > > 
> > > IMHO it would be good to at least mention bb223055b9 in the commit
> > > message even if not with a "Fixed:" tag.  May be amended during commit
> > > if anyone prefers.  
> > 
> > /me confused, bb223055b9 is mentioned in commit message  
> 
> I'm sorry, I overlooked that.
> 
> >    
> > > Also, this only applies the split limitation to s390.  Would that be a
> > > good thing to some other archs as well?  
> > 
> > Don't we have the similar bitmap size issue in KVM for other archs?  
> 
> Yes I thought we had.  So I feel like it would be good to also allow
> other archs to support >8TB mem as well.  Thanks,
Another question: are there other archs with that much RAM
available/used in real life (if not, I'd wait for demand to arise first)?

If we are to generalize it to other targets, then instead of using an
arbitrary max memslot size per target, we could just hardcode, or get
from KVM, the max supported bitmap size and use that to calculate
kvm_max_slot_size depending on the target page size.

Then there wouldn't be a need for machine-specific code
to care about it and pick/set arbitrary values.
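
For illustration only, such a generic derivation might look like the sketch
below; it assumes the limiting factor is the same "dirty bitmap indexed by an
int" constraint as on s390, and the helper name and placement are hypothetical:

    #define KVM_MEM_MAX_NR_PAGES ((1ULL << 31) - 1)

    static hwaddr kvm_default_max_slot_size(void)
    {
        /* keep the result a multiple of the host page size, as
         * kvm_set_max_memslot_size() asserts */
        return QEMU_ALIGN_DOWN(KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE,
                               qemu_real_host_page_size);
    }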

Another aspect to think about if we are to enable it for
other targets is memslot accounting. It doesn't affect s390,
but other targets that support memory hotplug now assume a 1:1
relation between memory region and memslot, which currently holds
true but would need to be amended in case splitting is enabled there.



* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-27 13:33         ` Igor Mammedov
@ 2019-09-28  1:28           ` Peter Xu
  2019-09-30  7:09             ` Christian Borntraeger
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2019-09-28  1:28 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: thuth, david, cohuck, qemu-devel, borntraeger, qemu-s390x,
	pbonzini

On Fri, Sep 27, 2019 at 03:33:20PM +0200, Igor Mammedov wrote:
> On Thu, 26 Sep 2019 07:52:35 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Wed, Sep 25, 2019 at 01:51:05PM +0200, Igor Mammedov wrote:
> > > On Wed, 25 Sep 2019 11:27:00 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >   
> > > > On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:  
> > > > > s390 was trying to solve limited KVM memslot size issue by abusing
> > > > > memory_region_allocate_system_memory(), which breaks API contract
> > > > > where the function might be called only once.
> > > > > 
> > > > > Beside an invalid use of API, the approach also introduced migration
> > > > > issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are transferred in
> > > > > migration stream as separate RAMBlocks.
> > > > > 
> > > > > After discussion [1], it was agreed to break migration from older
> > > > > QEMU for guest with RAM >8Tb (as it was relatively new (since 2.12)
> > > > > and considered to be not actually used downstream).
> > > > > Migration should keep working for guests with less than 8TB and for
> > > > > more than 8TB with QEMU 4.2 and newer binary.
> > > > > In case user tries to migrate more than 8TB guest, between incompatible
> > > > > QEMU versions, migration should fail gracefully due to non-exiting
> > > > > RAMBlock ID or RAMBlock size mismatch.
> > > > > 
> > > > > Taking in account above and that now KVM code is able to split too
> > > > > big MemorySection into several memslots, partially revert commit
> > > > >  (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
> > > > > and use kvm_set_max_memslot_size() to set KVMSlot size to
> > > > > KVM_SLOT_MAX_BYTES.
> > > > > 
> > > > > 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
> > > > > 
> > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>    
> > > > 
> > > > Acked-by: Peter Xu <peterx@redhat.com>
> > > > 
> > > > IMHO it would be good to at least mention bb223055b9 in the commit
> > > > message even if not with a "Fixed:" tag.  May be amended during commit
> > > > if anyone prefers.  
> > > 
> > > /me confused, bb223055b9 is mentioned in commit message  
> > 
> > I'm sorry, I overlooked that.
> > 
> > >    
> > > > Also, this only applies the split limitation to s390.  Would that be a
> > > > good thing to some other archs as well?  
> > > 
> > > Don't we have the similar bitmap size issue in KVM for other archs?  
> > 
> > Yes I thought we had.  So I feel like it would be good to also allow
> > other archs to support >8TB mem as well.  Thanks,
> Another question, Is there another archs with that much RAM that are
> available/used in real life (if not I'd wait for demand to arise first)?

I don't know, so it was a pure question besides the series.  Sorry if
that holds up your series somehow; it was not my intention.

> 
> If we are to generalize it to other targets, then instead of using
> arbitrary memslot max size per target, we could just hardcode or get
> from KVM, max supported size of bitmap and use that to calculate
> kvm_max_slot_size depending on target page size.

Right, I think in that case hardcoding would be fine for now, probably
with the smallest value across all archs (it should depend on the smallest
page size, I guess).

> 
> Then there wouldn't be need for having machine specific code
> to care about it and pick/set arbitrary values.
> 
> Another aspect to think about if we are to enable it for
> other targets is memslot accounting. It doesn't affect s390
> but other targets that support memory hotplug now assume 1:1
> relation between memoryregion:memslot, which currently holds
> true but would need to amended in case split is enabled there.

I didn't know this.  So maybe it makes more sense to have s390 only
here.  Thanks,

-- 
Peter Xu



* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-28  1:28           ` Peter Xu
@ 2019-09-30  7:09             ` Christian Borntraeger
  2019-09-30  9:33               ` Igor Mammedov
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Borntraeger @ 2019-09-30  7:09 UTC (permalink / raw)
  To: Peter Xu, Igor Mammedov
  Cc: thuth, david, cohuck, qemu-devel, qemu-s390x, pbonzini

On 28.09.19 03:28, Peter Xu wrote:
> On Fri, Sep 27, 2019 at 03:33:20PM +0200, Igor Mammedov wrote:
>> On Thu, 26 Sep 2019 07:52:35 +0800
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Wed, Sep 25, 2019 at 01:51:05PM +0200, Igor Mammedov wrote:
>>>> On Wed, 25 Sep 2019 11:27:00 +0800
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>   
>>>>> On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:  
>>>>>> s390 was trying to solve limited KVM memslot size issue by abusing
>>>>>> memory_region_allocate_system_memory(), which breaks API contract
>>>>>> where the function might be called only once.
>>>>>>
>>>>>> Beside an invalid use of API, the approach also introduced migration
>>>>>> issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are transferred in
>>>>>> migration stream as separate RAMBlocks.
>>>>>>
>>>>>> After discussion [1], it was agreed to break migration from older
>>>>>> QEMU for guest with RAM >8Tb (as it was relatively new (since 2.12)
>>>>>> and considered to be not actually used downstream).
>>>>>> Migration should keep working for guests with less than 8TB and for
>>>>>> more than 8TB with QEMU 4.2 and newer binary.
>>>>>> In case user tries to migrate more than 8TB guest, between incompatible
>>>>>> QEMU versions, migration should fail gracefully due to non-exiting
>>>>>> RAMBlock ID or RAMBlock size mismatch.
>>>>>>
>>>>>> Taking in account above and that now KVM code is able to split too
>>>>>> big MemorySection into several memslots, partially revert commit
>>>>>>  (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
>>>>>> and use kvm_set_max_memslot_size() to set KVMSlot size to
>>>>>> KVM_SLOT_MAX_BYTES.
>>>>>>
>>>>>> 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
>>>>>>
>>>>>> Signed-off-by: Igor Mammedov <imammedo@redhat.com>    
>>>>>
>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>
>>>>> IMHO it would be good to at least mention bb223055b9 in the commit
>>>>> message even if not with a "Fixed:" tag.  May be amended during commit
>>>>> if anyone prefers.  
>>>>
>>>> /me confused, bb223055b9 is mentioned in commit message  
>>>
>>> I'm sorry, I overlooked that.
>>>
>>>>    
>>>>> Also, this only applies the split limitation to s390.  Would that be a
>>>>> good thing for some other archs as well?  
>>>>
>>>> Don't we have a similar bitmap size issue in KVM for other archs?  
>>>
>>> Yes I thought we had.  So I feel like it would be good to also allow
>>> other archs to support >8TB mem as well.  Thanks,
>> Another question: are there other archs with that much RAM that are
>> available/used in real life (if not, I'd wait for demand to arise first)?
> 
> I don't know, so it was a pure question aside from the series.  Sorry if
> that holds up your series somehow; it was not my intention.
> 
>>
>> If we are to generalize it to other targets, then instead of using an
>> arbitrary memslot max size per target, we could just hardcode, or get
>> from KVM, the max supported bitmap size and use that to calculate
>> kvm_max_slot_size depending on the target page size.
> 
> Right, I think if so hard-coding would be fine for now, and probably we
> can go with the smallest one across all archs (it should depend on the
> smallest page size, I guess).
> 
>>
>> Then there wouldn't be a need for machine-specific code
>> to care about it and pick/set arbitrary values.
>>
>> Another aspect to think about, if we are to enable it for
>> other targets, is memslot accounting. It doesn't affect s390,
>> but other targets that support memory hotplug now assume a 1:1
>> relation between memoryregion:memslot, which currently holds
>> true but would need to be amended in case the split is enabled there.
> 
> I didn't know this.  So maybe it makes more sense to have s390 only
> here.  Thanks,

OK. So shall I take the series as is via the s390 tree?
I would like to add the following patch on top if nobody minds:

Subject: [PATCH 1/1] s390/kvm: split kvm mem slots at 4TB

Instead of splitting at an unaligned address, we can simply split at
4TB.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 target/s390x/kvm.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
index ad2dd14f7e78..611f56f4b5ac 100644
--- a/target/s390x/kvm.c
+++ b/target/s390x/kvm.c
@@ -126,12 +126,11 @@
 /*
  * KVM does only support memory slots up to KVM_MEM_MAX_NR_PAGES pages
  * as the dirty bitmap must be managed by bitops that take an int as
- * position indicator. If we have a guest beyond that we will split off
- * new subregions. The split must happen on a segment boundary (1MB).
+ * position indicator. This would end at an unaligned address
+ * (0x7fffff00000). As future variants might provide larger pages
+ * and to make all addresses properly aligned, let us split at 4TB.
  */
-#define KVM_MEM_MAX_NR_PAGES ((1ULL << 31) - 1)
-#define SEG_MSK (~0xfffffULL)
-#define KVM_SLOT_MAX_BYTES ((KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE) & SEG_MSK)
+#define KVM_SLOT_MAX_BYTES 4096UL*1024*1024*1024
 
 static CPUWatchpoint hw_watchpoint;
 /*
-- 
2.21.0
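
For context, with patch 3/4 of this series the board side is expected to make
a single call during KVM init to cap the slot size; a minimal sketch (the
exact call site in target/s390x/kvm.c is an assumption here, only the call
itself is taken from the commit message):

    /* somewhere in the s390 KVM init path, e.g. kvm_arch_init() */
    kvm_set_max_memslot_size(KVM_SLOT_MAX_BYTES);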






^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-30  7:09             ` Christian Borntraeger
@ 2019-09-30  9:33               ` Igor Mammedov
  2019-09-30 10:04                 ` Christian Borntraeger
  0 siblings, 1 reply; 20+ messages in thread
From: Igor Mammedov @ 2019-09-30  9:33 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: thuth, david, cohuck, qemu-devel, Peter Xu, qemu-s390x, pbonzini

On Mon, 30 Sep 2019 09:09:59 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 28.09.19 03:28, Peter Xu wrote:
> > On Fri, Sep 27, 2019 at 03:33:20PM +0200, Igor Mammedov wrote:  
> >> On Thu, 26 Sep 2019 07:52:35 +0800
> >> Peter Xu <peterx@redhat.com> wrote:
> >>  
> >>> On Wed, Sep 25, 2019 at 01:51:05PM +0200, Igor Mammedov wrote:  
> >>>> On Wed, 25 Sep 2019 11:27:00 +0800
> >>>> Peter Xu <peterx@redhat.com> wrote:
> >>>>     
> >>>>> On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:    
> >>>>>> s390 was trying to solve the limited KVM memslot size issue by abusing
> >>>>>> memory_region_allocate_system_memory(), which breaks the API contract
> >>>>>> that the function may be called only once.
> >>>>>>
> >>>>>> Besides being an invalid use of the API, the approach also introduced a
> >>>>>> migration issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are
> >>>>>> transferred in the migration stream as separate RAMBlocks.
> >>>>>>
> >>>>>> After discussion [1], it was agreed to break migration from older
> >>>>>> QEMU for guests with RAM >8TB (as it was relatively new (since 2.12)
> >>>>>> and considered to be not actually used downstream).
> >>>>>> Migration should keep working for guests with less than 8TB, and for
> >>>>>> more than 8TB with a QEMU 4.2 or newer binary.
> >>>>>> In case a user tries to migrate a guest with more than 8TB between
> >>>>>> incompatible QEMU versions, migration should fail gracefully due to a
> >>>>>> non-existing RAMBlock ID or a RAMBlock size mismatch.
> >>>>>>
> >>>>>> Taking into account the above, and that the KVM code is now able to
> >>>>>> split a too-big MemorySection into several memslots, partially revert
> >>>>>> commit (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
> >>>>>> and use kvm_set_max_memslot_size() to set the KVMSlot size to
> >>>>>> KVM_SLOT_MAX_BYTES.
> >>>>>>
> >>>>>> 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
> >>>>>>
> >>>>>> Signed-off-by: Igor Mammedov <imammedo@redhat.com>      
> >>>>>
> >>>>> Acked-by: Peter Xu <peterx@redhat.com>
> >>>>>
> >>>>> IMHO it would be good to at least mention bb223055b9 in the commit
> >>>>> message even if not with a "Fixed:" tag.  May be amended during commit
> >>>>> if anyone prefers.    
> >>>>
> >>>> /me confused, bb223055b9 is mentioned in commit message    
> >>>
> >>> I'm sorry, I overlooked that.
> >>>  
> >>>>      
> >>>>> Also, this only applies the split limitation to s390.  Would that be a
> >>>>> good thing for some other archs as well?    
> >>>>
> >>>> Don't we have a similar bitmap size issue in KVM for other archs?    
> >>>
> >>> Yes I thought we had.  So I feel like it would be good to also allow
> >>> other archs to support >8TB mem as well.  Thanks,  
> >> Another question: are there other archs with that much RAM that are
> >> available/used in real life (if not, I'd wait for demand to arise first)?  
> > 
> > I don't know, so it was a pure question aside from the series.  Sorry if
> > that holds up your series somehow; it was not my intention.
> >   
> >>
> >> If we are to generalize it to other targets, then instead of using an
> >> arbitrary memslot max size per target, we could just hardcode, or get
> >> from KVM, the max supported bitmap size and use that to calculate
> >> kvm_max_slot_size depending on the target page size.  
> > 
> > Right, I think if so hard-coding would be fine for now, and probably we
> > can go with the smallest one across all archs (it should depend on the
> > smallest page size, I guess).
> >   
> >>
> >> Then there wouldn't be a need for machine-specific code
> >> to care about it and pick/set arbitrary values.
> >>
> >> Another aspect to think about, if we are to enable it for
> >> other targets, is memslot accounting. It doesn't affect s390,
> >> but other targets that support memory hotplug now assume a 1:1
> >> relation between memoryregion:memslot, which currently holds
> >> true but would need to be amended in case the split is enabled there.  
> > 
> > I didn't know this.  So maybe it makes more sense to have s390 only
> > here.  Thanks,  
> 
> OK. So shall I take the series as is via the s390 tree?
Yes, I'd appreciate it.

> I would like to add the following patch on top if nobody minds:
> 
> Subject: [PATCH 1/1] s390/kvm: split kvm mem slots at 4TB
> 
> Instead of splitting at an unaligned address, we can simply split at
> 4TB.
> 
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

Looks fine to me

Acked-by: Igor Mammedov <imammedo@redhat.com>

> ---
>  target/s390x/kvm.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
> index ad2dd14f7e78..611f56f4b5ac 100644
> --- a/target/s390x/kvm.c
> +++ b/target/s390x/kvm.c
> @@ -126,12 +126,11 @@
>  /*
>   * KVM does only support memory slots up to KVM_MEM_MAX_NR_PAGES pages
>   * as the dirty bitmap must be managed by bitops that take an int as
> - * position indicator. If we have a guest beyond that we will split off
> - * new subregions. The split must happen on a segment boundary (1MB).
> + * position indicator. This would end at an unaligned address
> + * (0x7fffff00000). As future variants might provide larger pages
> + * and to make all addresses properly aligned, let us split at 4TB.
>   */
> -#define KVM_MEM_MAX_NR_PAGES ((1ULL << 31) - 1)
> -#define SEG_MSK (~0xfffffULL)
> -#define KVM_SLOT_MAX_BYTES ((KVM_MEM_MAX_NR_PAGES * TARGET_PAGE_SIZE) & SEG_MSK)
> +#define KVM_SLOT_MAX_BYTES 4096UL*1024*1024*1024

I'd use TiB instead of 1024*1024*1024

>  
>  static CPUWatchpoint hw_watchpoint;
>  /*
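
As a hedged illustration of the TiB suggestion above (assuming "qemu/units.h",
which defines the KiB through EiB constants, is available in
target/s390x/kvm.c), the macro could read:

    #include "qemu/units.h"

    /* Split KVM memslots at an aligned 4 TiB boundary. */
    #define KVM_SLOT_MAX_BYTES (4UL * TiB)

This keeps the same value as the open-coded 4096UL*1024*1024*1024 while making
the intended 4 TiB alignment obvious at a glance.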



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-30  9:33               ` Igor Mammedov
@ 2019-09-30 10:04                 ` Christian Borntraeger
  2019-09-30 10:35                   ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Christian Borntraeger @ 2019-09-30 10:04 UTC (permalink / raw)
  To: pbonzini
  Cc: thuth, david, cohuck, qemu-devel, Peter Xu, qemu-s390x,
	Igor Mammedov



On 30.09.19 11:33, Igor Mammedov wrote:
> On Mon, 30 Sep 2019 09:09:59 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 28.09.19 03:28, Peter Xu wrote:
>>> On Fri, Sep 27, 2019 at 03:33:20PM +0200, Igor Mammedov wrote:  
>>>> On Thu, 26 Sep 2019 07:52:35 +0800
>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>  
>>>>> On Wed, Sep 25, 2019 at 01:51:05PM +0200, Igor Mammedov wrote:  
>>>>>> On Wed, 25 Sep 2019 11:27:00 +0800
>>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>>>     
>>>>>>> On Tue, Sep 24, 2019 at 10:47:51AM -0400, Igor Mammedov wrote:    
>>>>>>>> s390 was trying to solve the limited KVM memslot size issue by abusing
>>>>>>>> memory_region_allocate_system_memory(), which breaks the API contract
>>>>>>>> that the function may be called only once.
>>>>>>>>
>>>>>>>> Besides being an invalid use of the API, the approach also introduced a
>>>>>>>> migration issue, since RAM chunks for each KVM_SLOT_MAX_BYTES are
>>>>>>>> transferred in the migration stream as separate RAMBlocks.
>>>>>>>>
>>>>>>>> After discussion [1], it was agreed to break migration from older
>>>>>>>> QEMU for guests with RAM >8TB (as it was relatively new (since 2.12)
>>>>>>>> and considered to be not actually used downstream).
>>>>>>>> Migration should keep working for guests with less than 8TB, and for
>>>>>>>> more than 8TB with a QEMU 4.2 or newer binary.
>>>>>>>> In case a user tries to migrate a guest with more than 8TB between
>>>>>>>> incompatible QEMU versions, migration should fail gracefully due to a
>>>>>>>> non-existing RAMBlock ID or a RAMBlock size mismatch.
>>>>>>>>
>>>>>>>> Taking into account the above, and that the KVM code is now able to
>>>>>>>> split a too-big MemorySection into several memslots, partially revert
>>>>>>>> commit (bb223055b s390-ccw-virtio: allow for systems larger that 7.999TB)
>>>>>>>> and use kvm_set_max_memslot_size() to set the KVMSlot size to
>>>>>>>> KVM_SLOT_MAX_BYTES.
>>>>>>>>
>>>>>>>> 1) [PATCH RFC v2 4/4] s390: do not call  memory_region_allocate_system_memory() multiple times
>>>>>>>>
>>>>>>>> Signed-off-by: Igor Mammedov <imammedo@redhat.com>      
>>>>>>>
>>>>>>> Acked-by: Peter Xu <peterx@redhat.com>
>>>>>>>
>>>>>>> IMHO it would be good to at least mention bb223055b9 in the commit
>>>>>>> message even if not with a "Fixed:" tag.  May be amended during commit
>>>>>>> if anyone prefers.    
>>>>>>
>>>>>> /me confused, bb223055b9 is mentioned in commit message    
>>>>>
>>>>> I'm sorry, I overlooked that.
>>>>>  
>>>>>>      
>>>>>>> Also, this only applies the split limitation to s390.  Would that be a
>>>>>>> good thing for some other archs as well?    
>>>>>>
>>>>>> Don't we have a similar bitmap size issue in KVM for other archs?    
>>>>>
>>>>> Yes I thought we had.  So I feel like it would be good to also allow
>>>>> other archs to support >8TB mem as well.  Thanks,  
>>>> Another question: are there other archs with that much RAM that are
>>>> available/used in real life (if not, I'd wait for demand to arise first)?  
>>>
>>> I don't know, so it was a pure question aside from the series.  Sorry if
>>> that holds up your series somehow; it was not my intention.
>>>   
>>>>
>>>> If we are to generalize it to other targets, then instead of using an
>>>> arbitrary memslot max size per target, we could just hardcode, or get
>>>> from KVM, the max supported bitmap size and use that to calculate
>>>> kvm_max_slot_size depending on the target page size.  
>>>
>>> Right, I think if so hard-coding would be fine for now, and probably we
>>> can go with the smallest one across all archs (it should depend on the
>>> smallest page size, I guess).
>>>   
>>>>
>>>> Then there wouldn't be a need for machine-specific code
>>>> to care about it and pick/set arbitrary values.
>>>>
>>>> Another aspect to think about, if we are to enable it for
>>>> other targets, is memslot accounting. It doesn't affect s390,
>>>> but other targets that support memory hotplug now assume a 1:1
>>>> relation between memoryregion:memslot, which currently holds
>>>> true but would need to be amended in case the split is enabled there.  
>>>
>>> I didn't know this.  So maybe it makes more sense to have s390 only
>>> here.  Thanks,  
>>
>> OK. So shall I take the series as is via the s390 tree?
> Yes, I'd appreciate it.


Paolo, OK if I pick the first 3 patches as well? Can you ack?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot
  2019-09-24 14:47 ` [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot Igor Mammedov
@ 2019-09-30 10:25   ` Christian Borntraeger
  0 siblings, 0 replies; 20+ messages in thread
From: Christian Borntraeger @ 2019-09-30 10:25 UTC (permalink / raw)
  To: Igor Mammedov, qemu-devel
  Cc: thuth, david, cohuck, peterx, qemu-s390x, pbonzini



On 24.09.19 16:47, Igor Mammedov wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> We may need to clear the dirty bitmap for more than one KVM memslot.
> First do some code movement with no semantic change.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c | 102 ++++++++++++++++++++++++--------------------
>  1 file changed, 56 insertions(+), 46 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index b09bad0804..e9e6086c09 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -575,55 +575,13 @@ out:
>  #define KVM_CLEAR_LOG_ALIGN  (qemu_real_host_page_size << KVM_CLEAR_LOG_SHIFT)
>  #define KVM_CLEAR_LOG_MASK   (-KVM_CLEAR_LOG_ALIGN)
>  
> -/**
> - * kvm_physical_log_clear - Clear the kernel's dirty bitmap for range
> - *
> - * NOTE: this will be a no-op if we haven't enabled manual dirty log
> - * protection in the host kernel because in that case this operation
> - * will be done within log_sync().
> - *
> - * @kml:     the kvm memory listener
> - * @section: the memory range to clear dirty bitmap
> - */
> -static int kvm_physical_log_clear(KVMMemoryListener *kml,
> -                                  MemoryRegionSection *section)
> +static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id, uint64_t start, uint64_t size)

When applying, I will split this to fit within 80 chars. ok?
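
For reference, one plausible way the declaration could be wrapped to stay
within 80 columns; this is only a formatting sketch (KVMSlot is a
QEMU-internal type), not necessarily the maintainer's final form:

    static int kvm_log_clear_one_slot(KVMSlot *mem, int as_id,
                                      uint64_t start, uint64_t size);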



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times
  2019-09-30 10:04                 ` Christian Borntraeger
@ 2019-09-30 10:35                   ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2019-09-30 10:35 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: thuth, david, cohuck, qemu-devel, Peter Xu, qemu-s390x,
	Igor Mammedov

On 30/09/19 12:04, Christian Borntraeger wrote:
> Paolo, OK if I pick the first 3 patches as well? Can you ack?

Yes, please.

Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory()
  2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
                   ` (4 preceding siblings ...)
  2019-09-25  7:47 ` [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Christian Borntraeger
@ 2019-09-30 11:00 ` Christian Borntraeger
  5 siblings, 0 replies; 20+ messages in thread
From: Christian Borntraeger @ 2019-09-30 11:00 UTC (permalink / raw)
  To: Igor Mammedov, qemu-devel
  Cc: thuth, david, cohuck, peterx, qemu-s390x, pbonzini

On 24.09.19 16:47, Igor Mammedov wrote:
> Changelog:
>   since v6:
>     - include and rebase on top of
>        [PATCH 0/2] kvm: clear dirty bitmaps from all overlapping memslots
>         https://www.mail-archive.com/qemu-devel@nongnu.org/msg646200.html
>     - minor fixups suggested during v6 review
>     - more testing incl. hacked x86
>   since v5:
>     - [1/2] fix migration that wasn't starting and make sure that KVM part
>       is able to handle 1:n MemorySection:memslot arrangement
>   since v3:
>     - fix compilation issue
>     - advance HVA along with GPA in kvm_set_phys_mem()
>   since v2:
>     - break migration from old QEMU (since 2.12-4.1) for guest with >8TB RAM
>       and drop migratable aliases patch as was agreed during v2 review
>     - drop 4.2 machines patch as it's not prerequisite anymore
>   since v1:
>     - include 4.2 machines patch for adding compat RAM layout on top
>     - 2/4 add missing in v1 patch for splitting too big MemorySection on
>           several memslots
>     - 3/4 amend code path on alias destruction to ensure that RAMBlock is
>           cleaned properly
>     - 4/4 add compat machine code to keep old layout (migration-wise) for
>           4.1 and older machines 
> 
> 
> While looking into unifying guest RAM allocation to use hostmem backends
> for initial RAM (especially when -mempath is used) and retiring
> memory_region_allocate_system_memory() API, leaving only single hostmem backend,
> I was inspecting how currently it is used by boards and it turns out several
> boards abuse it by calling the function several times (despite documented contract
> forbiding it).
> 
> s390 is one of such boards where KVM limitation on memslot size got propagated
> to board design and memory_region_allocate_system_memory() was abused to satisfy
> KVM requirement for max RAM chunk where memory region alias would suffice.
> 
> Unfortunately, memory_region_allocate_system_memory() usage created migration
> dependency where guest RAM is transferred in migration stream as several RAMBlocks
> if it's more than KVM_SLOT_MAX_BYTES. During v2 review it was agreed to ignore
> migration breakage (documenting it in release notes) and leaving only KVM fix.
> 
> In order to replace these several RAM chunks with a single memdev and keep it
> working with KVM memslot size limit, the later was modified to deal with 
> memory section split on several KVMSlots and manual RAM splitting in s390
> was replace by single memory_region_allocate_system_memory() call.
> 
> Tested:
>   * s390 with hacked KVM_SLOT_MAX_BYTES = 128Mb
>       - guest reboot cycle in ping-pong migration
>   * x86 with hacke max memslot = 128 and manual_dirty_log_protect enabled
>       - ping-pong migration with workload dirtying RAM around a split area
> 

Thanks, v7 applied to s390-next.




^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-09-30 11:03 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-24 14:47 [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Igor Mammedov
2019-09-24 14:47 ` [PATCH v7 1/4] kvm: extract kvm_log_clear_one_slot Igor Mammedov
2019-09-30 10:25   ` Christian Borntraeger
2019-09-24 14:47 ` [PATCH v7 2/4] kvm: clear dirty bitmaps from all overlapping memslots Igor Mammedov
2019-09-24 14:47 ` [PATCH v7 3/4] kvm: split too big memory section on several memslots Igor Mammedov
2019-09-25  3:12   ` Peter Xu
2019-09-25 12:09     ` Igor Mammedov
2019-09-25 23:45       ` Peter Xu
2019-09-24 14:47 ` [PATCH v7 4/4] s390: do not call memory_region_allocate_system_memory() multiple times Igor Mammedov
2019-09-25  3:27   ` Peter Xu
2019-09-25 11:51     ` Igor Mammedov
2019-09-25 23:52       ` Peter Xu
2019-09-27 13:33         ` Igor Mammedov
2019-09-28  1:28           ` Peter Xu
2019-09-30  7:09             ` Christian Borntraeger
2019-09-30  9:33               ` Igor Mammedov
2019-09-30 10:04                 ` Christian Borntraeger
2019-09-30 10:35                   ` Paolo Bonzini
2019-09-25  7:47 ` [PATCH v7 0/4] s390: stop abusing memory_region_allocate_system_memory() Christian Borntraeger
2019-09-30 11:00 ` Christian Borntraeger
