From: Jing Zhang <jingzhangos@google.com>
To: KVM <kvm@vger.kernel.org>, KVMARM <kvmarm@lists.cs.columbia.edu>,
Marc Zyngier <maz@kernel.org>, Will Deacon <will@kernel.org>,
Paolo Bonzini <pbonzini@redhat.com>,
David Matlack <dmatlack@google.com>,
Oliver Upton <oupton@google.com>,
Reiji Watanabe <reijiw@google.com>,
Ricardo Koller <ricarkol@google.com>,
Raghavendra Rao Ananta <rananta@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Subject: [PATCH v1 0/3] ARM64: Guest performance improvement during dirty
Date: Thu, 13 Jan 2022 22:18:26 +0000
Message-ID: <20220113221829.2785604-1-jingzhangos@google.com>
This patch series reduces the performance degradation of guest workloads during
dirty logging on ARM64. A fast path is added to handle permission relaxation
during dirty logging. The MMU lock is replaced with an rwlock, so that all
permission relaxations on leaf PTEs can be performed under the read lock. This
greatly reduces MMU lock contention during dirty logging. With this
solution, the performance degradation of the source guest workload is reduced
by more than 60%.
Problem:
* A Google internal live migration test shows that, on ARM64, the source guest
workload performance has >99% degradation for about 105 seconds, >50%
degradation for about 112 seconds and >10% degradation for about 112 seconds.
This shows that most of the time, the guest workload degradation is above
99%, which clearly needs improvement compared to the test result
on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
* Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB, PageSize: 4K
* VM spec: #vCPU: 48, #Mem/vCPU: 4GB, PageSize: 4K, 2M hugepage backed
Analysis:
* We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to get
the number of MMU lock contentions and the "dirty memory time" for
various VM specs. The "dirty memory time" is the time vCPU threads spend
in KVM after a fault. A higher "dirty memory time" means higher degradation
of the guest workload.
'-m 2' specifies the mode "PA-bits:48, VA-bits:48, 4K pages".
The test command is:
./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
Below are the results:
+-------+------------------------+-----------------------+
| #vCPU | dirty memory time (ms) | number of contentions |
+-------+------------------------+-----------------------+
| 1     | 926                    | 0                     |
+-------+------------------------+-----------------------+
| 2     | 1189                   | 4732558               |
+-------+------------------------+-----------------------+
| 4     | 2503                   | 11527185              |
+-------+------------------------+-----------------------+
| 8     | 5069                   | 24881677              |
+-------+------------------------+-----------------------+
| 16    | 10340                  | 50347956              |
+-------+------------------------+-----------------------+
| 32    | 20351                  | 100605720             |
+-------+------------------------+-----------------------+
| 64    | 40994                  | 201442478             |
+-------+------------------------+-----------------------+
* From the test results above, the "dirty memory time" and the number of
MMU lock contentions scale with the number of vCPUs. That means the
dirty memory operations from all vCPU threads are serialized by
the MMU lock. Further analysis also shows that the permission relaxation
during dirty logging is where the vCPU threads get serialized.
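The scaling claim can be checked directly from the table above. The following
is a minimal Python sketch (not part of the series) that computes the ratio
between consecutive rows. The names `baseline` and `doubling_ratios` are made
up for illustration. Each row doubles the vCPU count, so a ratio close to 2
means the metric grows roughly linearly with the number of vCPUs.

```python
# Baseline measurements copied from the table above:
# vCPU count -> (dirty memory time in ms, number of MMU lock contentions)
baseline = {
    1: (926, 0),
    2: (1189, 4732558),
    4: (2503, 11527185),
    8: (5069, 24881677),
    16: (10340, 50347956),
    32: (20351, 100605720),
    64: (40994, 201442478),
}


def doubling_ratios(values):
    """Return values[i + 1] / values[i] for each consecutive pair."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


vcpus = sorted(baseline)
times = [baseline[n][0] for n in vcpus]
# Skip the 1-vCPU row: it has zero contentions, so no ratio is defined.
contentions = [baseline[n][1] for n in vcpus[1:]]

time_ratios = doubling_ratios(times)
contention_ratios = doubling_ratios(contentions)

for n, ratio in zip(vcpus[1:], time_ratios):
    print(f"{n:2d} vCPUs: dirty memory time x{ratio:.2f} vs. previous row")
for n, ratio in zip(vcpus[2:], contention_ratios):
    print(f"{n:2d} vCPUs: contentions x{ratio:.2f} vs. previous row")
```

Beyond the first couple of rows, each doubling of the vCPU count roughly
doubles both numbers, which is consistent with the vCPU threads being
serialized on a single lock.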
Solution:
* On ARM64, there is no mechanism such as PML (Page Modification Logging), and
the dirty-bit solution for dirty logging is much more complicated than
the write-protection solution. The straightforward way to reduce the guest
performance degradation is to increase the concurrency of the permission
fault path during dirty logging.
* In this series, only leaf PTE permission relaxation for dirty
logging is done under the read lock; everything else goes under the write lock.
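The locking rule in the last bullet can be written down as a small decision
function. This is only an illustrative Python model of the rule as stated in
this cover letter, not the kernel code from the patches. The `Fault` type, its
fields and `mmu_lock_mode` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """Hypothetical summary of a stage-2 fault, for illustration only."""
    is_permission_fault: bool  # permission fault rather than a translation fault
    targets_leaf_pte: bool     # the faulting entry is a leaf PTE
    dirty_logging: bool        # dirty logging is enabled for the memory


def mmu_lock_mode(fault: Fault) -> str:
    """Return "read" if the fault qualifies for the fast path, else "write"."""
    fast_path = (
        fault.is_permission_fault
        and fault.targets_leaf_pte
        and fault.dirty_logging
    )
    # Only leaf PTE permission relaxation during dirty logging may run
    # concurrently under the read lock; everything else stays exclusive.
    return "read" if fast_path else "write"


print(mmu_lock_mode(Fault(True, True, True)))    # fast path candidate
print(mmu_lock_mode(Fault(False, True, True)))   # translation fault
```

The point of the split is that many vCPU threads can hold the read lock at the
same time, so the common fault during dirty logging no longer serializes them.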
Below are the results based on the fast path solution:
+-------+------------------------+
| #vCPU | dirty memory time (ms) |
+-------+------------------------+
| 1     | 965                    |
+-------+------------------------+
| 2     | 1006                   |
+-------+------------------------+
| 4     | 1128                   |
+-------+------------------------+
| 8     | 2005                   |
+-------+------------------------+
| 16    | 3903                   |
+-------+------------------------+
| 32    | 7595                   |
+-------+------------------------+
| 64    | 15783                  |
+-------+------------------------+
* Further analysis shows that there is another bottleneck caused by the
setup of the test code itself. The 3rd commit fixes that by
setting up the vgic in the test code. With the test code fix, the results
below show a further improvement.
+-------+------------------------+
| #vCPU | dirty memory time (ms) |
+-------+------------------------+
| 1     | 803                    |
+-------+------------------------+
| 2     | 843                    |
+-------+------------------------+
| 4     | 942                    |
+-------+------------------------+
| 8     | 1458                   |
+-------+------------------------+
| 16    | 2853                   |
+-------+------------------------+
| 32    | 5886                   |
+-------+------------------------+
| 64    | 12190                  |
+-------+------------------------+
With 4 or more vCPUs, the "dirty memory time" is reduced by more than 60%
compared to the baseline results.
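For reference, the per-row reduction can be recomputed from the baseline table
and the final table above. This is a small Python sketch, with `baseline_ms`,
`final_ms` and `reduction_percent` being names chosen for illustration.

```python
# "dirty memory time" in ms, copied from the tables above.
baseline_ms = {1: 926, 2: 1189, 4: 2503, 8: 5069, 16: 10340, 32: 20351, 64: 40994}
final_ms = {1: 803, 2: 843, 4: 942, 8: 1458, 16: 2853, 32: 5886, 64: 12190}


def reduction_percent(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


reductions = {
    n: reduction_percent(baseline_ms[n], final_ms[n]) for n in baseline_ms
}

for n, pct in reductions.items():
    print(f"{n:2d} vCPUs: {baseline_ms[n]:6d} ms -> {final_ms[n]:6d} ms "
          f"({pct:.1f}% lower)")
```

The 1 and 2 vCPU rows see little lock contention to begin with, so the
reduction there is much smaller than in the rows with 4 or more vCPUs.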
* With this solution, the results of the Google internal live migration
test also show more than 60% improvement: >99% degradation for 30s,
>50% for 58s and >10% for 76s.
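The live migration numbers can be compared against the ones in the Problem
section in the same way. The sketch below takes the approximate durations
quoted in this letter at face value. `before_s`, `after_s` and
`shrink_percent` are illustrative names.

```python
# Seconds spent above each degradation threshold on ARM64, as quoted in
# this letter (the "before" values are approximate).
before_s = {">99%": 105, ">50%": 112, ">10%": 112}
after_s = {">99%": 30, ">50%": 58, ">10%": 76}


def shrink_percent(before, after):
    """Percentage by which a degradation window got shorter."""
    return 100.0 * (before - after) / before


shrink = {k: shrink_percent(before_s[k], after_s[k]) for k in before_s}

for threshold, pct in shrink.items():
    print(f"{threshold} degradation: {before_s[threshold]}s -> "
          f"{after_s[threshold]}s ({pct:.1f}% shorter)")
```

With these inputs, the >99% window is the one that shrinks by more than 60%.
The >50% and >10% windows also get shorter, but by a smaller fraction.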
---
* RFC -> v1
- Rebased to kvm/queue, commit fea31d169094
(KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event)
- Moved the fast path in user_mem_abort, as suggested by Marc.
- Addressed other comments from Marc.
[RFC] https://lore.kernel.org/all/20220110210441.2074798-1-jingzhangos@google.com
---
Jing Zhang (3):
KVM: arm64: Use read/write spin lock for MMU protection
KVM: arm64: Add fast path to handle permission relaxation during dirty
logging
KVM: selftests: Add vgic initialization for dirty log perf test for
ARM
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/kvm/mmu.c | 52 ++++++++++++-------
.../selftests/kvm/dirty_log_perf_test.c | 10 ++++
3 files changed, 46 insertions(+), 18 deletions(-)
base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
--
2.34.1.703.g22d0c6ccf7-goog