From: Paolo Bonzini <pbonzini@redhat.com>
To: qemu-devel@nongnu.org
Cc: Peter Xu <peterx@redhat.com>,
qemu-stable <qemu-stable@nongnu.org>,
Zhiyi Guo <zhguo@redhat.com>,
David Hildenbrand <david@redhat.com>,
Fabiano Rosas <farosas@suse.de>
Subject: [PULL 17/25] KVM: Dynamic sized kvm memslots array
Date: Tue, 15 Oct 2024 16:17:03 +0200 [thread overview]
Message-ID: <20241015141711.528342-18-pbonzini@redhat.com> (raw)
In-Reply-To: <20241015141711.528342-1-pbonzini@redhat.com>
From: Peter Xu <peterx@redhat.com>
Zhiyi reported an infinite loop issue in VFIO use case. The cause of that
was a separate discussion, however during that I found a regression of
dirty sync slowness when profiling.
Each KVMMemoryListerner maintains an array of kvm memslots. Currently it's
statically allocated to be the max supported by the kernel. However after
Linux commit 4fc096a99e ("KVM: Raise the maximum number of user memslots"),
the max supported memslots reported now grows to some number large enough
so that it may not be wise to always statically allocate with the max
reported.
What's worse, QEMU kvm code still walks all the allocated memslots entries
to do any form of lookups. It can drastically slow down all memslot
operations because each of such loop can run over 32K times on the new
kernels.
Fix this issue by making the memslots to be allocated dynamically.
Here the initial size was set to 16 because it should cover the basic VM
usages, so that the hope is the majority VM use case may not even need to
grow at all (e.g. if one starts a VM with ./qemu-system-x86_64 by default
it'll consume 9 memslots), however not too large to waste memory.
There can also be even better way to address this, but so far this is the
simplest and should be already better even than before we grow the max
supported memslots. For example, in the case of above issue when VFIO was
attached on a 32GB system, there are only ~10 memslots used. So it could
be good enough as of now.
In the above VFIO context, measurement shows that the precopy dirty sync
shrinked from ~86ms to ~3ms after this patch applied. It should also apply
to any KVM enabled VM even without VFIO.
NOTE: we don't have a FIXES tag for this patch because there's no real
commit that regressed this in QEMU. Such behavior existed for a long time,
but only start to be a problem when the kernel reports very large
nr_slots_max value. However that's pretty common now (the kernel change
was merged in 2021) so we attached cc:stable because we'll want this change
to be backported to stable branches.
Cc: qemu-stable <qemu-stable@nongnu.org>
Reported-by: Zhiyi Guo <zhguo@redhat.com>
Tested-by: Zhiyi Guo <zhguo@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Link: https://lore.kernel.org/r/20240917163835.194664-2-peterx@redhat.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
include/sysemu/kvm_int.h | 1 +
accel/kvm/kvm-all.c | 87 +++++++++++++++++++++++++++++++++-------
accel/kvm/trace-events | 1 +
3 files changed, 74 insertions(+), 15 deletions(-)
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 17483ff53bd..2304537b93c 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -46,6 +46,7 @@ typedef struct KVMMemoryListener {
MemoryListener listener;
KVMSlot *slots;
unsigned int nr_used_slots;
+ unsigned int nr_slots_allocated;
int as_id;
QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_add;
QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_del;
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 905fb844e46..f84413b7954 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -69,6 +69,9 @@
#define KVM_GUESTDBG_BLOCKIRQ 0
#endif
+/* Default num of memslots to be allocated when VM starts */
+#define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
+
struct KVMParkedVcpu {
unsigned long vcpu_id;
int kvm_fd;
@@ -165,6 +168,57 @@ void kvm_resample_fd_notify(int gsi)
}
}
+/**
+ * kvm_slots_grow(): Grow the slots[] array in the KVMMemoryListener
+ *
+ * @kml: The KVMMemoryListener* to grow the slots[] array
+ * @nr_slots_new: The new size of slots[] array
+ *
+ * Returns: True if the array grows larger, false otherwise.
+ */
+static bool kvm_slots_grow(KVMMemoryListener *kml, unsigned int nr_slots_new)
+{
+ unsigned int i, cur = kml->nr_slots_allocated;
+ KVMSlot *slots;
+
+ if (nr_slots_new > kvm_state->nr_slots) {
+ nr_slots_new = kvm_state->nr_slots;
+ }
+
+ if (cur >= nr_slots_new) {
+ /* Big enough, no need to grow, or we reached max */
+ return false;
+ }
+
+ if (cur == 0) {
+ slots = g_new0(KVMSlot, nr_slots_new);
+ } else {
+ assert(kml->slots);
+ slots = g_renew(KVMSlot, kml->slots, nr_slots_new);
+ /*
+ * g_renew() doesn't initialize extended buffers, however kvm
+ * memslots require fields to be zero-initialized. E.g. pointers,
+ * memory_size field, etc.
+ */
+ memset(&slots[cur], 0x0, sizeof(slots[0]) * (nr_slots_new - cur));
+ }
+
+ for (i = cur; i < nr_slots_new; i++) {
+ slots[i].slot = i;
+ }
+
+ kml->slots = slots;
+ kml->nr_slots_allocated = nr_slots_new;
+ trace_kvm_slots_grow(cur, nr_slots_new);
+
+ return true;
+}
+
+static bool kvm_slots_double(KVMMemoryListener *kml)
+{
+ return kvm_slots_grow(kml, kml->nr_slots_allocated * 2);
+}
+
unsigned int kvm_get_max_memslots(void)
{
KVMState *s = KVM_STATE(current_accel());
@@ -193,15 +247,26 @@ unsigned int kvm_get_free_memslots(void)
/* Called with KVMMemoryListener.slots_lock held */
static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
{
- KVMState *s = kvm_state;
+ unsigned int n;
int i;
- for (i = 0; i < s->nr_slots; i++) {
+ for (i = 0; i < kml->nr_slots_allocated; i++) {
if (kml->slots[i].memory_size == 0) {
return &kml->slots[i];
}
}
+ /*
+ * If no free slots, try to grow first by doubling. Cache the old size
+ * here to avoid another round of search: if the grow succeeded, it
+ * means slots[] now must have the existing "n" slots occupied,
+ * followed by one or more free slots starting from slots[n].
+ */
+ n = kml->nr_slots_allocated;
+ if (kvm_slots_double(kml)) {
+ return &kml->slots[n];
+ }
+
return NULL;
}
@@ -222,10 +287,9 @@ static KVMSlot *kvm_lookup_matching_slot(KVMMemoryListener *kml,
hwaddr start_addr,
hwaddr size)
{
- KVMState *s = kvm_state;
int i;
- for (i = 0; i < s->nr_slots; i++) {
+ for (i = 0; i < kml->nr_slots_allocated; i++) {
KVMSlot *mem = &kml->slots[i];
if (start_addr == mem->start_addr && size == mem->memory_size) {
@@ -267,7 +331,7 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
int i, ret = 0;
kvm_slots_lock();
- for (i = 0; i < s->nr_slots; i++) {
+ for (i = 0; i < kml->nr_slots_allocated; i++) {
KVMSlot *mem = &kml->slots[i];
if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
@@ -1071,7 +1135,7 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
kvm_slots_lock();
- for (i = 0; i < s->nr_slots; i++) {
+ for (i = 0; i < kml->nr_slots_allocated; i++) {
mem = &kml->slots[i];
/* Discard slots that are empty or do not overlap the section */
if (!mem->memory_size ||
@@ -1715,12 +1779,8 @@ static void kvm_log_sync_global(MemoryListener *l, bool last_stage)
/* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
kvm_dirty_ring_flush();
- /*
- * TODO: make this faster when nr_slots is big while there are
- * only a few used slots (small VMs).
- */
kvm_slots_lock();
- for (i = 0; i < s->nr_slots; i++) {
+ for (i = 0; i < kml->nr_slots_allocated; i++) {
mem = &kml->slots[i];
if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
kvm_slot_sync_dirty_pages(mem);
@@ -1835,12 +1895,9 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
{
int i;
- kml->slots = g_new0(KVMSlot, s->nr_slots);
kml->as_id = as_id;
- for (i = 0; i < s->nr_slots; i++) {
- kml->slots[i].slot = i;
- }
+ kvm_slots_grow(kml, KVM_MEMSLOTS_NR_ALLOC_DEFAULT);
QSIMPLEQ_INIT(&kml->transaction_add);
QSIMPLEQ_INIT(&kml->transaction_del);
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 82c65fd2ab8..e43d18a8692 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -36,3 +36,4 @@ kvm_io_window_exit(void) ""
kvm_run_exit_system_event(int cpu_index, uint32_t event_type) "cpu_index %d, system_even_type %"PRIu32
kvm_convert_memory(uint64_t start, uint64_t size, const char *msg) "start 0x%" PRIx64 " size 0x%" PRIx64 " %s"
kvm_memory_fault(uint64_t start, uint64_t size, uint64_t flags) "start 0x%" PRIx64 " size 0x%" PRIx64 " flags 0x%" PRIx64
+kvm_slots_grow(unsigned int old, unsigned int new) "%u -> %u"
--
2.46.2
next prev parent reply other threads:[~2024-10-15 14:18 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-15 14:16 [PULL 00/25] x86 and KVM patches for 2024-10-15 Paolo Bonzini
2024-10-15 14:16 ` [PULL 01/25] target/i386: Don't construct a all-zero entry for CPUID[0xD 0x3f] Paolo Bonzini
2024-10-15 14:16 ` [PULL 02/25] target/i386: Enable fdp-excptn-only and zero-fcs-fds Paolo Bonzini
2024-10-15 14:16 ` [PULL 03/25] target/i386: Construct CPUID 2 as stateful iff times > 1 Paolo Bonzini
2024-10-15 14:16 ` [PULL 04/25] target/i386: Make invtsc migratable when user sets tsc-khz explicitly Paolo Bonzini
2024-10-15 14:16 ` [PULL 05/25] target/i386: Add more features enumerated by CPUID.7.2.EDX Paolo Bonzini
2024-10-15 14:16 ` [PULL 06/25] target/i386: Add support save/load HWCR MSR Paolo Bonzini
2024-10-15 14:16 ` [PULL 07/25] target/i386: Fix conditional CONFIG_SYNDBG enablement Paolo Bonzini
2024-10-15 14:16 ` [PULL 08/25] target/i386: Exclude 'hv-syndbg' from 'hv-passthrough' Paolo Bonzini
2024-10-15 14:16 ` [PULL 09/25] target/i386: Make sure SynIC state is really updated before KVM_RUN Paolo Bonzini
2024-10-15 14:16 ` [PULL 10/25] docs/system: Add recommendations to Hyper-V enlightenments doc Paolo Bonzini
2024-10-15 14:16 ` [PULL 11/25] target/i386: convert bit test instructions to new decoder Paolo Bonzini
2024-10-15 14:16 ` [PULL 12/25] target/i386: decode address before going back to translate.c Paolo Bonzini
2024-10-15 14:16 ` [PULL 13/25] target/i386: convert CMPXCHG8B/CMPXCHG16B to new decoder Paolo Bonzini
2024-10-16 16:37 ` Philippe Mathieu-Daudé
2024-10-17 9:14 ` Paolo Bonzini
2024-10-15 14:17 ` [PULL 14/25] target/i386: do not check PREFIX_LOCK in old-style decoder Paolo Bonzini
2024-10-15 14:17 ` [PULL 15/25] target/i386: list instructions still in translate.c Paolo Bonzini
2024-10-15 14:17 ` [PULL 16/25] target/i386: assert that cc_op* and pc_save are preserved Paolo Bonzini
2024-10-15 14:17 ` Paolo Bonzini [this message]
2024-10-15 14:17 ` [PULL 18/25] KVM: Define KVM_MEMSLOTS_NUM_MAX_DEFAULT Paolo Bonzini
2024-10-15 14:17 ` [PULL 19/25] KVM: Rename KVMMemoryListener.nr_used_slots to nr_slots_used Paolo Bonzini
2024-10-15 14:17 ` [PULL 20/25] KVM: Rename KVMState->nr_slots to nr_slots_max Paolo Bonzini
2024-10-15 14:17 ` [PULL 21/25] target/i386/tcg: Use DPL-level accesses for interrupts and call gates Paolo Bonzini
2024-10-15 14:17 ` [PULL 22/25] accel/kvm: check for KVM_CAP_MULTI_ADDRESS_SPACE on vm Paolo Bonzini
2024-10-15 14:17 ` [PULL 23/25] accel/kvm: check for KVM_CAP_MEMORY_ATTRIBUTES " Paolo Bonzini
2024-10-15 14:17 ` [PULL 24/25] accel/kvm: check for KVM_CAP_READONLY_MEM on VM Paolo Bonzini
2024-10-15 14:17 ` [PULL 25/25] target/i386: Use only 16 and 32-bit operands for IN/OUT Paolo Bonzini
2024-10-17 10:32 ` [PULL 00/25] x86 and KVM patches for 2024-10-15 Peter Maydell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241015141711.528342-18-pbonzini@redhat.com \
--to=pbonzini@redhat.com \
--cc=david@redhat.com \
--cc=farosas@suse.de \
--cc=peterx@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-stable@nongnu.org \
--cc=zhguo@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).