From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>,
David Hildenbrand <david@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
peterx@redhat.com, Prasad Pandit <ppandit@redhat.com>,
Juraj Marcin <jmarcin@redhat.com>,
Julia Suvorova <jusual@redhat.com>,
qemu-stable <qemu-stable@nongnu.org>,
Zhiyi Guo <zhguo@redhat.com>
Subject: Re: [PATCH v3 1/4] KVM: Dynamic sized kvm memslots array
Date: Mon, 16 Sep 2024 14:52:04 -0300
Message-ID: <87y13rzdtn.fsf@suse.de>
In-Reply-To: <871q1j1of1.fsf@suse.de>
Fabiano Rosas <farosas@suse.de> writes:
> Peter Xu <peterx@redhat.com> writes:
>
>> Zhiyi reported an infinite loop issue in a VFIO use case. The cause of
>> that is a separate discussion; however, while profiling it I found a
>> dirty sync slowness regression.
>>
>> Each KVMMemoryListener maintains an array of kvm memslots. Currently it
>> is statically allocated to the max supported by the kernel. However,
>> after Linux commit 4fc096a99e ("KVM: Raise the maximum number of user
>> memslots"), the reported max has grown large enough that always
>> statically allocating the full amount is no longer wise.
>>
>> What's worse, QEMU's kvm code still walks all allocated memslot entries
>> for any form of lookup. This can drastically slow down all memslot
>> operations, because each such loop can run more than 32K times on new
>> kernels.
>>
>> Fix this issue by allocating the memslots array dynamically.
>>
>> The initial size is set to 16 because that should cover basic VM usage;
>> the hope is that the majority of VMs will not need to grow at all (e.g.
>> a VM started with ./qemu-system-x86_64 and default options consumes 9
>> memslots), while not being so large as to waste memory.
>>
>> There may be even better ways to address this, but so far this is the
>> simplest, and it is already an improvement even over the behavior
>> before the kernel raised the max supported memslots. For example, in
>> the case of the issue above, with VFIO attached on a 32GB system, only
>> ~10 memslots are used, so this should be good enough for now.
>>
>> In the above VFIO context, measurements show that the precopy dirty
>> sync shrank from ~86ms to ~3ms with this patch applied. The improvement
>> should apply to any KVM-enabled VM, even without VFIO.
>>
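For scale: the ~86ms -> ~3ms drop implies that roughly 83ms of every
sync was spent just walking the ~32K mostly-empty preallocated entries;
with only 16 entries allocated, that walk becomes noise.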
>> NOTE: there is no Fixes tag for this patch because no specific QEMU
>> commit regressed this. The behavior has existed for a long time, but it
>> only became a problem once the kernel started reporting a very large
>> nr_slots_max value. Since that is pretty common now (the kernel change
>> was merged in 2021), cc:stable is attached because we want this change
>> backported to stable branches.
>>
>> Cc: qemu-stable <qemu-stable@nongnu.org>
>> Reported-by: Zhiyi Guo <zhguo@redhat.com>
>> Tested-by: Zhiyi Guo <zhguo@redhat.com>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>> include/sysemu/kvm_int.h | 1 +
>> accel/kvm/kvm-all.c | 99 ++++++++++++++++++++++++++++++++++------
>> accel/kvm/trace-events | 1 +
>> 3 files changed, 86 insertions(+), 15 deletions(-)
>>
>> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
>> index 1d8fb1473b..48e496b3d4 100644
>> --- a/include/sysemu/kvm_int.h
>> +++ b/include/sysemu/kvm_int.h
>> @@ -46,6 +46,7 @@ typedef struct KVMMemoryListener {
>> MemoryListener listener;
>> KVMSlot *slots;
>> unsigned int nr_used_slots;
>> + unsigned int nr_slots_allocated;
>> int as_id;
>> QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_add;
>> QSIMPLEQ_HEAD(, KVMMemoryUpdate) transaction_del;
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 75d11a07b2..c51a3f18db 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -69,6 +69,9 @@
>> #define KVM_GUESTDBG_BLOCKIRQ 0
>> #endif
>>
>> +/* Default num of memslots to be allocated when VM starts */
>> +#define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16
>> +
>> struct KVMParkedVcpu {
>> unsigned long vcpu_id;
>> int kvm_fd;
>> @@ -165,6 +168,57 @@ void kvm_resample_fd_notify(int gsi)
>> }
>> }
>>
>> +/**
>> + * kvm_slots_grow(): Grow the slots[] array in the KVMMemoryListener
>> + *
>> + * @kml: The KVMMemoryListener* to grow the slots[] array
>> + * @nr_slots_new: The new size of slots[] array
>> + *
>> + * Returns: True if the array grows larger, false otherwise.
>> + */
>> +static bool kvm_slots_grow(KVMMemoryListener *kml, unsigned int nr_slots_new)
>> +{
>> + unsigned int i, cur = kml->nr_slots_allocated;
>> + KVMSlot *slots;
>> +
>> + if (nr_slots_new > kvm_state->nr_slots) {
>> + nr_slots_new = kvm_state->nr_slots;
>> + }
>> +
>> + if (cur >= nr_slots_new) {
>> + /* Big enough, no need to grow, or we reached max */
>> + return false;
>> + }
>> +
>> + if (cur == 0) {
>> + slots = g_new0(KVMSlot, nr_slots_new);
>> + } else {
>> + assert(kml->slots);
>> + slots = g_renew(KVMSlot, kml->slots, nr_slots_new);
>> + /*
>> + * g_renew() doesn't initialize extended buffers, however kvm
>> + * memslots require fields to be zero-initialized. E.g. pointers,
>> + * memory_size field, etc.
>> + */
>> + memset(&slots[cur], 0x0, sizeof(slots[0]) * (nr_slots_new - cur));
>> + }
>> +
>> + for (i = cur; i < nr_slots_new; i++) {
>> + slots[i].slot = i;
>> + }
>> +
>> + kml->slots = slots;
>> + kml->nr_slots_allocated = nr_slots_new;
>> + trace_kvm_slots_grow(cur, nr_slots_new);
>> +
>> + return true;
>> +}
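As an aside, for readers less familiar with the GLib allocators:
g_new0() zeroes the whole array, but g_renew() leaves the grown tail
uninitialized, hence the memset() above. A minimal standalone sketch
(not QEMU code; the Slot type here is made up for illustration):

  #include <glib.h>
  #include <string.h>

  typedef struct { int slot; size_t memory_size; } Slot;

  int main(void)
  {
      unsigned int cur = 4, next = 8;
      Slot *slots = g_new0(Slot, cur);     /* all fields zeroed */

      slots = g_renew(Slot, slots, next);  /* slots[4..7] hold garbage */
      memset(&slots[cur], 0, sizeof(slots[0]) * (next - cur));

      for (unsigned int i = cur; i < next; i++) {
          slots[i].slot = i;               /* mirrors kvm_slots_grow() */
      }

      g_free(slots);
      return 0;
  }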
>> +
>> +static bool kvm_slots_double(KVMMemoryListener *kml)
>> +{
>> + return kvm_slots_grow(kml, kml->nr_slots_allocated * 2);
>> +}
>> +
>> unsigned int kvm_get_max_memslots(void)
>> {
>> KVMState *s = KVM_STATE(current_accel());
>> @@ -193,15 +247,26 @@ unsigned int kvm_get_free_memslots(void)
>> /* Called with KVMMemoryListener.slots_lock held */
>> static KVMSlot *kvm_get_free_slot(KVMMemoryListener *kml)
>> {
>> - KVMState *s = kvm_state;
>> + unsigned int n;
>> int i;
>>
>> - for (i = 0; i < s->nr_slots; i++) {
>> + for (i = 0; i < kml->nr_slots_allocated; i++) {
>> if (kml->slots[i].memory_size == 0) {
>> return &kml->slots[i];
>> }
>> }
>>
>> + /*
>> + * If no free slots, try to grow first by doubling. Cache the old size
>> + * here to avoid another round of search: if the grow succeeded, it
>> + * means slots[] now must have the existing "n" slots occupied,
>> + * followed by one or more free slots starting from slots[n].
>> + */
>> + n = kml->nr_slots_allocated;
>> + if (kvm_slots_double(kml)) {
>> + return &kml->slots[n];
>> + }
>> +
>> return NULL;
>> }
>>
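One nice property of the grow-on-demand path: combined with the clamp
at the top of kvm_slots_grow(), the doubling gives a short, bounded
growth sequence. A throwaway sketch (the 32764 kernel-reported max is
just an assumed example value):

  #include <stdio.h>

  int main(void)
  {
      unsigned int alloc = 16, max = 32764; /* assumed kernel max */

      while (alloc < max) {
          unsigned int next = alloc * 2 < max ? alloc * 2 : max;
          printf("kvm_slots_grow: %u -> %u\n", alloc, next);
          alloc = next;
      }
      return 0;
  }

i.e. even a pathological VM hits the kernel max after 11 doublings, so
the O(n) re-scan of slots[] on each grow is a non-issue.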
>> @@ -222,10 +287,9 @@ static KVMSlot *kvm_lookup_matching_slot(KVMMemoryListener *kml,
>> hwaddr start_addr,
>> hwaddr size)
>> {
>> - KVMState *s = kvm_state;
>> int i;
>>
>> - for (i = 0; i < s->nr_slots; i++) {
>> + for (i = 0; i < kml->nr_slots_allocated; i++) {
>> KVMSlot *mem = &kml->slots[i];
>>
>> if (start_addr == mem->start_addr && size == mem->memory_size) {
>> @@ -267,7 +331,7 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
>> int i, ret = 0;
>>
>> kvm_slots_lock();
>> - for (i = 0; i < s->nr_slots; i++) {
>> + for (i = 0; i < kml->nr_slots_allocated; i++) {
>> KVMSlot *mem = &kml->slots[i];
>>
>> if (ram >= mem->ram && ram < mem->ram + mem->memory_size) {
>> @@ -1071,7 +1135,7 @@ static int kvm_physical_log_clear(KVMMemoryListener *kml,
>>
>> kvm_slots_lock();
>>
>> - for (i = 0; i < s->nr_slots; i++) {
>> + for (i = 0; i < kml->nr_slots_allocated; i++) {
>> mem = &kml->slots[i];
>> /* Discard slots that are empty or do not overlap the section */
>> if (!mem->memory_size ||
>> @@ -1719,12 +1783,8 @@ static void kvm_log_sync_global(MemoryListener *l, bool last_stage)
>> /* Flush all kernel dirty addresses into KVMSlot dirty bitmap */
>> kvm_dirty_ring_flush();
>>
>> - /*
>> - * TODO: make this faster when nr_slots is big while there are
>> - * only a few used slots (small VMs).
>> - */
>> kvm_slots_lock();
>> - for (i = 0; i < s->nr_slots; i++) {
>> + for (i = 0; i < kml->nr_slots_allocated; i++) {
>> mem = &kml->slots[i];
>> if (mem->memory_size && mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>> kvm_slot_sync_dirty_pages(mem);
>> @@ -1839,12 +1899,9 @@ void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
>> {
>> int i;
>>
>> - kml->slots = g_new0(KVMSlot, s->nr_slots);
>> kml->as_id = as_id;
>>
>> - for (i = 0; i < s->nr_slots; i++) {
>> - kml->slots[i].slot = i;
>> - }
>> + kvm_slots_grow(kml, KVM_MEMSLOTS_NR_ALLOC_DEFAULT);
>>
>> QSIMPLEQ_INIT(&kml->transaction_add);
>> QSIMPLEQ_INIT(&kml->transaction_del);
>> @@ -2461,6 +2518,18 @@ static int kvm_init(MachineState *ms)
>> s->nr_slots = 32;
>> }
>
> If !s->nr_slots, this 32 here^ will always be smaller than 16, so the
> code below will always trigger.
Hehe, nevermind, crossed wires in my brain.
>>
>> + /*
>> + * A VM will at least require a few memslots to work, or it can even
>> + * fail to boot. Make sure the supported value is always at least
>> + * larger than what we will initially allocate.
>
> The commit message says 16 was chosen to cover basic usage, which is
> fine. But here we're disallowing anything smaller. Shouldn't QEMU always
> respect what KVM decides (setting aside, of course, bugs or other
> scenarios that could result in the ioctl returning 0)? Could some kernel
> implementation at some point want to reduce the max number of memslots,
> only to be effectively denied because QEMU thinks otherwise?
>
>> + */
>> + if (s->nr_slots < KVM_MEMSLOTS_NR_ALLOC_DEFAULT) {
>> + ret = -EINVAL;
>> + fprintf(stderr, "KVM max supported number of slots (%d) too small\n",
>> + s->nr_slots);
>> + goto err;
>> + }
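To make the question above concrete, here is the alternative I had in
mind, as a tiny standalone sketch (hypothetical, not what the patch
does): clamp the initial allocation to whatever the kernel reports
instead of refusing to start.

  #include <stdio.h>

  #define KVM_MEMSLOTS_NR_ALLOC_DEFAULT 16

  /* Hypothetical: pick the initial allocation instead of failing. */
  static unsigned int initial_alloc(unsigned int kernel_max)
  {
      return kernel_max < KVM_MEMSLOTS_NR_ALLOC_DEFAULT
             ? kernel_max : KVM_MEMSLOTS_NR_ALLOC_DEFAULT;
  }

  int main(void)
  {
      printf("%u %u %u\n", initial_alloc(8), initial_alloc(16),
             initial_alloc(32764));
      return 0;
  }

Whether a VM can actually boot with fewer than 16 slots is a separate
question, of course.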
>> +
>> s->nr_as = kvm_check_extension(s, KVM_CAP_MULTI_ADDRESS_SPACE);
>> if (s->nr_as <= 1) {
>> s->nr_as = 1;
>> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
>> index 37626c1ac5..ad2ae6fca5 100644
>> --- a/accel/kvm/trace-events
>> +++ b/accel/kvm/trace-events
>> @@ -36,3 +36,4 @@ kvm_io_window_exit(void) ""
>> kvm_run_exit_system_event(int cpu_index, uint32_t event_type) "cpu_index %d, system_even_type %"PRIu32
>> kvm_convert_memory(uint64_t start, uint64_t size, const char *msg) "start 0x%" PRIx64 " size 0x%" PRIx64 " %s"
>> kvm_memory_fault(uint64_t start, uint64_t size, uint64_t flags) "start 0x%" PRIx64 " size 0x%" PRIx64 " flags 0x%" PRIx64
>> +kvm_slots_grow(unsigned int old, unsigned int new) "%u -> %u"