* [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-24 0:51 ` Ben Gardon
2023-01-13 3:49 ` [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees Ricardo Koller
` (8 subsequent siblings)
9 siblings, 1 reply; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
indicate that the walk is on a removed table not accessible to the HW
page-table walker. Then use it to avoid doing break-before-make or
performing CMOs (Cache Maintenance Operations) when mapping a removed
table. This is safe as these removed tables are not visible to the HW
page-table walker. This will be used in a subsequent commit for
replacing huge-page block PTEs with tables of 4K PTEs.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++++
arch/arm64/kvm/hyp/pgtable.c | 27 ++++++++++++++++-----------
2 files changed, 24 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 63f81b27a4e3..84a271647007 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -188,12 +188,15 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
* children.
* @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
* with other software walkers.
+ * @KVM_PGTABLE_WALK_REMOVED: Indicates the page-tables are
+ * removed: not visible to the HW walker.
*/
enum kvm_pgtable_walk_flags {
KVM_PGTABLE_WALK_LEAF = BIT(0),
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
KVM_PGTABLE_WALK_SHARED = BIT(3),
+ KVM_PGTABLE_WALK_REMOVED = BIT(4),
};
struct kvm_pgtable_visit_ctx {
@@ -215,6 +218,11 @@ static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *c
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
}
+static inline bool kvm_pgtable_walk_removed(const struct kvm_pgtable_visit_ctx *ctx)
+{
+ return ctx->flags & KVM_PGTABLE_WALK_REMOVED;
+}
+
/**
* struct kvm_pgtable_walker - Hook into a page-table walk.
* @cb: Callback function to invoke during the walk.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b11cf2c618a6..87fd40d09056 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -717,14 +717,17 @@ static bool stage2_try_break_pte(const struct kvm_pgtable_visit_ctx *ctx,
if (!stage2_try_set_pte(ctx, KVM_INVALID_PTE_LOCKED))
return false;
- /*
- * Perform the appropriate TLB invalidation based on the evicted pte
- * value (if any).
- */
- if (kvm_pte_table(ctx->old, ctx->level))
- kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
- else if (kvm_pte_valid(ctx->old))
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);
+ if (!kvm_pgtable_walk_removed(ctx)) {
+ /*
+ * Perform the appropriate TLB invalidation based on the
+ * evicted pte value (if any).
+ */
+ if (kvm_pte_table(ctx->old, ctx->level))
+ kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);
+ else if (kvm_pte_valid(ctx->old))
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu,
+ ctx->addr, ctx->level);
+ }
if (stage2_pte_is_counted(ctx->old))
mm_ops->put_page(ctx->ptep);
@@ -808,11 +811,13 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
return -EAGAIN;
/* Perform CMOs before installation of the guest stage-2 PTE */
- if (mm_ops->dcache_clean_inval_poc && stage2_pte_cacheable(pgt, new))
+ if (!kvm_pgtable_walk_removed(ctx) && mm_ops->dcache_clean_inval_poc &&
+ stage2_pte_cacheable(pgt, new))
mm_ops->dcache_clean_inval_poc(kvm_pte_follow(new, mm_ops),
- granule);
+ granule);
- if (mm_ops->icache_inval_pou && stage2_pte_executable(new))
+ if (!kvm_pgtable_walk_removed(ctx) && mm_ops->icache_inval_pou &&
+ stage2_pte_executable(new))
mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
stage2_make_pte(ctx, new);
--
2.39.0.314.g84b9a713c41-goog
* Re: [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-13 3:49 ` [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags Ricardo Koller
@ 2023-01-24 0:51 ` Ben Gardon
2023-01-24 0:56 ` Oliver Upton
2023-01-24 16:30 ` Ricardo Koller
0 siblings, 2 replies; 47+ messages in thread
From: Ben Gardon @ 2023-01-24 0:51 UTC (permalink / raw)
To: Ricardo Koller
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
> indicate that the walk is on a removed table not accesible to the HW
> page-table walker. Then use it to avoid doing break-before-make or
> performing CMOs (Cache Maintenance Operations) when mapping a removed
Nit: Should this say unmapping? Or are we actually going to use this
to map memory?
> table. This is safe as these removed tables are not visible to the HW
> page-table walker. This will be used in a subsequent commit for
...
* Re: [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-24 0:51 ` Ben Gardon
@ 2023-01-24 0:56 ` Oliver Upton
2023-01-24 16:32 ` Ricardo Koller
2023-01-24 16:30 ` Ricardo Koller
1 sibling, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2023-01-24 0:56 UTC (permalink / raw)
To: Ben Gardon
Cc: Ricardo Koller, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Mon, Jan 23, 2023 at 04:51:16PM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
> > indicate that the walk is on a removed table not accesible to the HW
> > page-table walker. Then use it to avoid doing break-before-make or
> > performing CMOs (Cache Maintenance Operations) when mapping a removed
>
> Nit: Should this say unmapping? Or are we actually going to use this
> to map memory ?
I think the *_REMOVED term feels weird as it relates to constructing a
page table. It'd be better if we instead added flags to describe the
operations we intend to elide (i.e. CMOs and TLBIs). That way the
implementation is generic enough that we can repurpose it for other use
cases.
--
Thanks,
Oliver
* Re: [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-24 0:56 ` Oliver Upton
@ 2023-01-24 16:32 ` Ricardo Koller
2023-01-24 18:00 ` Ben Gardon
0 siblings, 1 reply; 47+ messages in thread
From: Ricardo Koller @ 2023-01-24 16:32 UTC (permalink / raw)
To: Oliver Upton
Cc: Ben Gardon, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 12:56:11AM +0000, Oliver Upton wrote:
> On Mon, Jan 23, 2023 at 04:51:16PM -0800, Ben Gardon wrote:
> > On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> > >
> > > Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
> > > indicate that the walk is on a removed table not accesible to the HW
> > > page-table walker. Then use it to avoid doing break-before-make or
> > > performing CMOs (Cache Maintenance Operations) when mapping a removed
> >
> > Nit: Should this say unmapping? Or are we actually going to use this
> > to map memory ?
>
> I think the *_REMOVED term feels weird as it relates to constructing a
> page table. It'd be better if we instead added flags to describe the
> operations we intend to elide (i.e. CMOs and TLBIs).
What about KVM_PGTABLE_WALK_ELIDE_BBM and KVM_PGTABLE_WALK_ELIDE_CMO?
> That way the
> implementation is generic enough that we can repurpose it for other use
> cases.
Aha, good point. I actually have a use case for it (FEAT_BBM).
>
> --
> Thanks,
> Oliver
* Re: [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-24 16:32 ` Ricardo Koller
@ 2023-01-24 18:00 ` Ben Gardon
2023-01-26 18:48 ` Ricardo Koller
0 siblings, 1 reply; 47+ messages in thread
From: Ben Gardon @ 2023-01-24 18:00 UTC (permalink / raw)
To: Ricardo Koller
Cc: Oliver Upton, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 8:32 AM Ricardo Koller <ricarkol@google.com> wrote:
>
> On Tue, Jan 24, 2023 at 12:56:11AM +0000, Oliver Upton wrote:
> > On Mon, Jan 23, 2023 at 04:51:16PM -0800, Ben Gardon wrote:
> > > On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> > > >
> > > > Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
> > > > indicate that the walk is on a removed table not accesible to the HW
> > > > page-table walker. Then use it to avoid doing break-before-make or
> > > > performing CMOs (Cache Maintenance Operations) when mapping a removed
> > >
> > > Nit: Should this say unmapping? Or are we actually going to use this
> > > to map memory ?
> >
> > I think the *_REMOVED term feels weird as it relates to constructing a
> > page table. It'd be better if we instead added flags to describe the
> > operations we intend to elide (i.e. CMOs and TLBIs).
>
> What about KVM_PGTABLE_WALK_ELIDE_BBM and KVM_PGTABLE_WALK_ELIDE_CMO?
I like this, but please don't use elide in the code. I'm all for
vocabulary, but that's not a common enough word to expect everyone to
know. Perhaps just SKIP?
>
> > That way the
> > implementation is generic enough that we can repurpose it for other use
> > cases.
>
> Aha, good point. I actually have a use case for it (FEAT_BBM).
>
> >
> > --
> > Thanks,
> > Oliver
* Re: [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-24 18:00 ` Ben Gardon
@ 2023-01-26 18:48 ` Ricardo Koller
0 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-26 18:48 UTC (permalink / raw)
To: Ben Gardon
Cc: Oliver Upton, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 10:00 AM Ben Gardon <bgardon@google.com> wrote:
>
> On Tue, Jan 24, 2023 at 8:32 AM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > On Tue, Jan 24, 2023 at 12:56:11AM +0000, Oliver Upton wrote:
> > > On Mon, Jan 23, 2023 at 04:51:16PM -0800, Ben Gardon wrote:
> > > > On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> > > > >
> > > > > Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
> > > > > indicate that the walk is on a removed table not accesible to the HW
> > > > > page-table walker. Then use it to avoid doing break-before-make or
> > > > > performing CMOs (Cache Maintenance Operations) when mapping a removed
> > > >
> > > > Nit: Should this say unmapping? Or are we actually going to use this
> > > > to map memory ?
> > >
> > > I think the *_REMOVED term feels weird as it relates to constructing a
> > > page table. It'd be better if we instead added flags to describe the
> > > operations we intend to elide (i.e. CMOs and TLBIs).
> >
> > What about KVM_PGTABLE_WALK_ELIDE_BBM and KVM_PGTABLE_WALK_ELIDE_CMO?
>
> I like this, but please don't use elide in the code. I'm all for
> vocabulary, but that's not a common enough word to expect everyone to
> know. Perhaps just SKIP?
No problem, SKIP should be fine.
>
> >
> > > That way the
> > > implementation is generic enough that we can repurpose it for other use
> > > cases.
> >
> > Aha, good point. I actually have a use case for it (FEAT_BBM).
> >
> > >
> > > --
> > > Thanks,
> > > Oliver
* Re: [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
2023-01-24 0:51 ` Ben Gardon
2023-01-24 0:56 ` Oliver Upton
@ 2023-01-24 16:30 ` Ricardo Koller
1 sibling, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-24 16:30 UTC (permalink / raw)
To: Ben Gardon
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Mon, Jan 23, 2023 at 04:51:16PM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Add a flag to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_REMOVED, to
> > indicate that the walk is on a removed table not accesible to the HW
> > page-table walker. Then use it to avoid doing break-before-make or
> > performing CMOs (Cache Maintenance Operations) when mapping a removed
>
> Nit: Should this say unmapping? Or are we actually going to use this
> to map memory ?
As you mentioned in the next commit, this will be clearer if I use
"unlinked" instead of "removed". Might end up rewriting this message as
well, as I will be using Oliver's suggestion of using multiple flags,
one for each operation to elide.
>
> > table. This is safe as these removed tables are not visible to the HW
> > page-table walker. This will be used in a subsequent commit for
> ...
* [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
2023-01-13 3:49 ` [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-24 0:55 ` Ben Gardon
2023-01-13 3:49 ` [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
` (7 subsequent siblings)
9 siblings, 1 reply; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Add a stage2 helper, kvm_pgtable_stage2_create_removed(), for creating
removed tables (the opposite of kvm_pgtable_stage2_free_removed()).
Creating a removed table is useful for splitting block PTEs into
subtrees of 4K PTEs. For example, a 1G block PTE can be split into 4K
PTEs by first creating a fully populated tree, and then using it to
replace the 1G PTE in a single step. This will be used in a
subsequent commit for eager huge-page splitting (a dirty-logging
optimization).
No functional change intended. This new function will be used in a
subsequent commit.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 25 +++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 47 ++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 84a271647007..8ad78d61af7f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -450,6 +450,31 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
*/
void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
+/**
+ * kvm_pgtable_stage2_create_removed() - Create a removed stage-2 paging structure.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @new: Unlinked stage-2 paging structure to be created.
+ * @phys: Physical address of the memory to map.
+ * @level: Level of the stage-2 paging structure to be created.
+ * @prot: Permissions and attributes for the mapping.
+ * @mc: Cache of pre-allocated and zeroed memory from which to allocate
+ * page-table pages.
+ *
+ * Create a removed page-table tree of PAGE_SIZE leaf PTEs under *new.
+ * This new page-table tree is not reachable (i.e., it is removed) from the
+ * root pgd and it's therefore unreachable by the hardware page-table
+ * walker. No TLB invalidation or CMOs are performed.
+ *
+ * If device attributes are not explicitly requested in @prot, then the
+ * mapping will be normal, cacheable.
+ *
+ * Return: 0 only if a fully populated tree was created, negative error
+ * code on failure. No partially-populated table can be returned.
+ */
+int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
+ kvm_pte_t *new, u64 phys, u32 level,
+ enum kvm_pgtable_prot prot, void *mc);
+
/**
* kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 87fd40d09056..0dee13007776 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1181,6 +1181,53 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
return kvm_pgtable_walk(pgt, addr, size, &walker);
}
+/*
+ * map_data->force_pte is true in order to force creating PAGE_SIZE PTEs.
+ * data->addr is 0 because the IPA is irrelevant for a removed table.
+ */
+int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
+ kvm_pte_t *new, u64 phys, u32 level,
+ enum kvm_pgtable_prot prot, void *mc)
+{
+ struct stage2_map_data map_data = {
+ .phys = phys,
+ .mmu = pgt->mmu,
+ .memcache = mc,
+ .force_pte = true,
+ };
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_map_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_REMOVED,
+ .arg = &map_data,
+ };
+ struct kvm_pgtable_walk_data data = {
+ .walker = &walker,
+ .addr = 0,
+ .end = kvm_granule_size(level),
+ };
+ struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
+ kvm_pte_t *pgtable;
+ int ret;
+
+ ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
+ if (ret)
+ return ret;
+
+ pgtable = mm_ops->zalloc_page(mc);
+ if (!pgtable)
+ return -ENOMEM;
+
+ ret = __kvm_pgtable_walk(&data, mm_ops, pgtable, level + 1);
+ if (ret) {
+ kvm_pgtable_stage2_free_removed(mm_ops, pgtable, level);
+ mm_ops->put_page(pgtable);
+ return ret;
+ }
+
+ *new = kvm_init_table_pte(pgtable, mm_ops);
+ return 0;
+}
int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops,
--
2.39.0.314.g84b9a713c41-goog
* Re: [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees
2023-01-13 3:49 ` [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees Ricardo Koller
@ 2023-01-24 0:55 ` Ben Gardon
2023-01-24 16:35 ` Ricardo Koller
0 siblings, 1 reply; 47+ messages in thread
From: Ben Gardon @ 2023-01-24 0:55 UTC (permalink / raw)
To: Ricardo Koller
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Add a stage2 helper, kvm_pgtable_stage2_create_removed(), for creating
> removed tables (the opposite of kvm_pgtable_stage2_free_removed()).
> Creating a removed table is useful for splitting block PTEs into
> subtrees of 4K PTEs. For example, a 1G block PTE can be split into 4K
> PTEs by first creating a fully populated tree, and then use it to
> replace the 1G PTE in a single step. This will be used in a
> subsequent commit for eager huge-page splitting (a dirty-logging
> optimization).
>
> No functional change intended. This new function will be used in a
> subsequent commit.
>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 25 +++++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 47 ++++++++++++++++++++++++++++
> 2 files changed, 72 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 84a271647007..8ad78d61af7f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -450,6 +450,31 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> */
> void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
>
> +/**
> + * kvm_pgtable_stage2_free_removed() - Create a removed stage-2 paging structure.
> + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @new: Unlinked stage-2 paging structure to be created.
Oh, I see so the "removed" page table is actually a new page table
that has never been part of the paging structure. In that case I would
find it much more intuitive to call it "unlinked" or similar.
> + * @phys: Physical address of the memory to map.
> + * @level: Level of the stage-2 paging structure to be created.
> + * @prot: Permissions and attributes for the mapping.
> + * @mc: Cache of pre-allocated and zeroed memory from which to allocate
> + * page-table pages.
> + *
> + * Create a removed page-table tree of PAGE_SIZE leaf PTEs under *new.
> + * This new page-table tree is not reachable (i.e., it is removed) from the
> + * root pgd and it's therefore unreachableby the hardware page-table
> + * walker. No TLB invalidation or CMOs are performed.
> + *
> + * If device attributes are not explicitly requested in @prot, then the
> + * mapping will be normal, cacheable.
> + *
> + * Return: 0 only if a fully populated tree was created, negative error
> + * code on failure. No partially-populated table can be returned.
> + */
> +int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> + kvm_pte_t *new, u64 phys, u32 level,
> + enum kvm_pgtable_prot prot, void *mc);
> +
> /**
> * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
> * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 87fd40d09056..0dee13007776 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1181,6 +1181,53 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
> return kvm_pgtable_walk(pgt, addr, size, &walker);
> }
>
> +/*
> + * map_data->force_pte is true in order to force creating PAGE_SIZE PTEs.
> + * data->addr is 0 because the IPA is irrelevant for a removed table.
> + */
> +int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> + kvm_pte_t *new, u64 phys, u32 level,
> + enum kvm_pgtable_prot prot, void *mc)
> +{
> + struct stage2_map_data map_data = {
> + .phys = phys,
> + .mmu = pgt->mmu,
> + .memcache = mc,
> + .force_pte = true,
> + };
> + struct kvm_pgtable_walker walker = {
> + .cb = stage2_map_walker,
> + .flags = KVM_PGTABLE_WALK_LEAF |
> + KVM_PGTABLE_WALK_REMOVED,
> + .arg = &map_data,
> + };
> + struct kvm_pgtable_walk_data data = {
> + .walker = &walker,
> + .addr = 0,
> + .end = kvm_granule_size(level),
> + };
> + struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> + kvm_pte_t *pgtable;
> + int ret;
> +
> + ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
> + if (ret)
> + return ret;
> +
> + pgtable = mm_ops->zalloc_page(mc);
> + if (!pgtable)
> + return -ENOMEM;
> +
> + ret = __kvm_pgtable_walk(&data, mm_ops, pgtable, level + 1);
> + if (ret) {
> + kvm_pgtable_stage2_free_removed(mm_ops, pgtable, level);
> + mm_ops->put_page(pgtable);
> + return ret;
> + }
> +
> + *new = kvm_init_table_pte(pgtable, mm_ops);
> + return 0;
> +}
>
> int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> struct kvm_pgtable_mm_ops *mm_ops,
> --
> 2.39.0.314.g84b9a713c41-goog
>
* Re: [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees
2023-01-24 0:55 ` Ben Gardon
@ 2023-01-24 16:35 ` Ricardo Koller
2023-01-24 17:07 ` Oliver Upton
0 siblings, 1 reply; 47+ messages in thread
From: Ricardo Koller @ 2023-01-24 16:35 UTC (permalink / raw)
To: Ben Gardon
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Mon, Jan 23, 2023 at 04:55:40PM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Add a stage2 helper, kvm_pgtable_stage2_create_removed(), for creating
> > removed tables (the opposite of kvm_pgtable_stage2_free_removed()).
> > Creating a removed table is useful for splitting block PTEs into
> > subtrees of 4K PTEs. For example, a 1G block PTE can be split into 4K
> > PTEs by first creating a fully populated tree, and then use it to
> > replace the 1G PTE in a single step. This will be used in a
> > subsequent commit for eager huge-page splitting (a dirty-logging
> > optimization).
> >
> > No functional change intended. This new function will be used in a
> > subsequent commit.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 25 +++++++++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 47 ++++++++++++++++++++++++++++
> > 2 files changed, 72 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 84a271647007..8ad78d61af7f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -450,6 +450,31 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> > */
> > void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, u32 level);
> >
> > +/**
> > + * kvm_pgtable_stage2_free_removed() - Create a removed stage-2 paging structure.
> > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> > + * @new: Unlinked stage-2 paging structure to be created.
>
> Oh, I see so the "removed" page table is actually a new page table
> that has never been part of the paging structure. In that case I would
> find it much more intuitive to call it "unlinked" or similar.
>
Sounds good, I like "unlinked".
Oliver, are you OK if I rename free_removed() as well? just to keep them
symmetric.
> > + * @phys: Physical address of the memory to map.
> > + * @level: Level of the stage-2 paging structure to be created.
> > + * @prot: Permissions and attributes for the mapping.
> > + * @mc: Cache of pre-allocated and zeroed memory from which to allocate
> > + * page-table pages.
> > + *
> > + * Create a removed page-table tree of PAGE_SIZE leaf PTEs under *new.
> > + * This new page-table tree is not reachable (i.e., it is removed) from the
> > + * root pgd and it's therefore unreachableby the hardware page-table
> > + * walker. No TLB invalidation or CMOs are performed.
> > + *
> > + * If device attributes are not explicitly requested in @prot, then the
> > + * mapping will be normal, cacheable.
> > + *
> > + * Return: 0 only if a fully populated tree was created, negative error
> > + * code on failure. No partially-populated table can be returned.
> > + */
> > +int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> > + kvm_pte_t *new, u64 phys, u32 level,
> > + enum kvm_pgtable_prot prot, void *mc);
> > +
> > /**
> > * kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
> > * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 87fd40d09056..0dee13007776 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1181,6 +1181,53 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > return kvm_pgtable_walk(pgt, addr, size, &walker);
> > }
> >
> > +/*
> > + * map_data->force_pte is true in order to force creating PAGE_SIZE PTEs.
> > + * data->addr is 0 because the IPA is irrelevant for a removed table.
> > + */
> > +int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> > + kvm_pte_t *new, u64 phys, u32 level,
> > + enum kvm_pgtable_prot prot, void *mc)
> > +{
> > + struct stage2_map_data map_data = {
> > + .phys = phys,
> > + .mmu = pgt->mmu,
> > + .memcache = mc,
> > + .force_pte = true,
> > + };
> > + struct kvm_pgtable_walker walker = {
> > + .cb = stage2_map_walker,
> > + .flags = KVM_PGTABLE_WALK_LEAF |
> > + KVM_PGTABLE_WALK_REMOVED,
> > + .arg = &map_data,
> > + };
> > + struct kvm_pgtable_walk_data data = {
> > + .walker = &walker,
> > + .addr = 0,
> > + .end = kvm_granule_size(level),
> > + };
> > + struct kvm_pgtable_mm_ops *mm_ops = pgt->mm_ops;
> > + kvm_pte_t *pgtable;
> > + int ret;
> > +
> > + ret = stage2_set_prot_attr(pgt, prot, &map_data.attr);
> > + if (ret)
> > + return ret;
> > +
> > + pgtable = mm_ops->zalloc_page(mc);
> > + if (!pgtable)
> > + return -ENOMEM;
> > +
> > + ret = __kvm_pgtable_walk(&data, mm_ops, pgtable, level + 1);
> > + if (ret) {
> > + kvm_pgtable_stage2_free_removed(mm_ops, pgtable, level);
> > + mm_ops->put_page(pgtable);
> > + return ret;
> > + }
> > +
> > + *new = kvm_init_table_pte(pgtable, mm_ops);
> > + return 0;
> > +}
> >
> > int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> > struct kvm_pgtable_mm_ops *mm_ops,
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
* Re: [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees
2023-01-24 16:35 ` Ricardo Koller
@ 2023-01-24 17:07 ` Oliver Upton
0 siblings, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-24 17:07 UTC (permalink / raw)
To: Ricardo Koller
Cc: Ben Gardon, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 08:35:40AM -0800, Ricardo Koller wrote:
> On Mon, Jan 23, 2023 at 04:55:40PM -0800, Ben Gardon wrote:
[...]
> > > +/**
> > > + * kvm_pgtable_stage2_free_removed() - Create a removed stage-2 paging structure.
> > > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> > > + * @new: Unlinked stage-2 paging structure to be created.
> >
> > Oh, I see so the "removed" page table is actually a new page table
> > that has never been part of the paging structure. In that case I would
> > find it much more intuitive to call it "unlinked" or similar.
> >
>
> Sounds good, I like "unlinked".
>
> Oliver, are you OK if I rename free_removed() as well? just to keep them
> symmetric.
Fine by me, and sorry for the silly naming :)
--
Thanks,
Oliver
* [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
2023-01-13 3:49 ` [PATCH 1/9] KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags Ricardo Koller
2023-01-13 3:49 ` [PATCH 2/9] KVM: arm64: Add helper for creating removed stage2 subtrees Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-24 1:03 ` Ben Gardon
2023-02-06 9:20 ` Zheng Chuan
2023-01-13 3:49 ` [PATCH 4/9] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
` (6 subsequent siblings)
9 siblings, 2 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
range of huge pages. This will be used for eager-splitting huge pages
into PAGE_SIZE pages. The goal is to avoid having to split huge pages
on write-protection faults, and instead use this function to do it
ahead of time for large ranges (e.g., all guest memory in 1G chunks at
a time).
No functional change intended. This new function will be used in a
subsequent commit.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 29 ++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 67 ++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 8ad78d61af7f..5fbdc1f259fd 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -644,6 +644,35 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
*/
int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
+/**
+ * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
+ * to PAGE_SIZE guest pages.
+ * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
+ * @addr: Intermediate physical address from which to split.
+ * @size: Size of the range.
+ * @mc: Cache of pre-allocated and zeroed memory from which to allocate
+ * page-table pages.
+ *
+ * @addr and the end (@addr + @size) are effectively aligned down and up to
+ * the top level huge-page block size. This is an exampe using 1GB
+ * huge-pages and 4KB granules.
+ *
+ * [---input range---]
+ * : :
+ * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
+ * : :
+ * [--2MB--][--2MB--][--2MB--][--2MB--]
+ * : :
+ * [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
+ * : :
+ *
+ * Return: 0 on success, negative error code on failure. Note that
+ * kvm_pgtable_stage2_split() is best effort: it tries to break as many
+ * blocks in the input range as allowed by the size of the memcache. It
+ * will fail if it wasn't able to break any block.
+ */
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
+
/**
* kvm_pgtable_walk() - Walk a page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_*_init().
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0dee13007776..db9d1a28769b 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1229,6 +1229,73 @@ int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
return 0;
}
+struct stage2_split_data {
+ struct kvm_s2_mmu *mmu;
+ void *memcache;
+};
+
+static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ struct stage2_split_data *data = ctx->arg;
+ struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+ kvm_pte_t pte = ctx->old, new, *childp;
+ enum kvm_pgtable_prot prot;
+ void *mc = data->memcache;
+ u32 level = ctx->level;
+ u64 phys;
+ int ret;
+
+ /* Nothing to split at the last level */
+ if (level == KVM_PGTABLE_MAX_LEVELS - 1)
+ return 0;
+
+ /* We only split valid block mappings */
+ if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
+ return 0;
+
+ phys = kvm_pte_to_phys(pte);
+ prot = kvm_pgtable_stage2_pte_prot(pte);
+
+ ret = kvm_pgtable_stage2_create_removed(data->mmu->pgt, &new, phys,
+ level, prot, mc);
+ if (ret)
+ return ret;
+
+ if (!stage2_try_break_pte(ctx, data->mmu)) {
+ childp = kvm_pte_follow(new, mm_ops);
+ kvm_pgtable_stage2_free_removed(mm_ops, childp, level);
+ mm_ops->put_page(childp);
+ return -EAGAIN;
+ }
+
+ /*
+ * Note, the contents of the page table are guaranteed to be
+ * made visible before the new PTE is assigned because
+ * stage2_make_pte() writes the PTE using smp_store_release().
+ */
+ stage2_make_pte(ctx, new);
+ dsb(ishst);
+ return 0;
+}
+
+int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt,
+ u64 addr, u64 size, void *mc)
+{
+ struct stage2_split_data split_data = {
+ .mmu = pgt->mmu,
+ .memcache = mc,
+ };
+
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_split_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ .arg = &split_data,
+ };
+
+ return kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops,
enum kvm_pgtable_stage2_flags flags,
--
2.39.0.314.g84b9a713c41-goog
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-13 3:49 ` [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
@ 2023-01-24 1:03 ` Ben Gardon
2023-01-24 16:46 ` Ricardo Koller
2023-02-06 9:20 ` Zheng Chuan
1 sibling, 1 reply; 47+ messages in thread
From: Ben Gardon @ 2023-01-24 1:03 UTC (permalink / raw)
To: Ricardo Koller
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
> range of huge pages. This will be used for eager-splitting huge pages
> into PAGE_SIZE pages. The goal is to avoid having to split huge pages
> on write-protection faults, and instead use this function to do it
> ahead of time for large ranges (e.g., all guest memory in 1G chunks at
> a time).
>
> No functional change intended. This new function will be used in a
> subsequent commit.
>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 29 ++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 67 ++++++++++++++++++++++++++++
> 2 files changed, 96 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 8ad78d61af7f..5fbdc1f259fd 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -644,6 +644,35 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
> */
> int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
>
> +/**
> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> + * to PAGE_SIZE guest pages.
> + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @addr: Intermediate physical address from which to split.
> + * @size: Size of the range.
> + * @mc: Cache of pre-allocated and zeroed memory from which to allocate
> + * page-table pages.
> + *
> + * @addr and the end (@addr + @size) are effectively aligned down and up to
> + * the top level huge-page block size. This is an exampe using 1GB
Nit: example
> + * huge-pages and 4KB granules.
> + *
> + * [---input range---]
> + * : :
> + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> + * : :
> + * [--2MB--][--2MB--][--2MB--][--2MB--]
> + * : :
> + * [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> + * : :
> + *
> + * Return: 0 on success, negative error code on failure. Note that
> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> + * blocks in the input range as allowed by the size of the memcache. It
> + * will fail it wasn't able to break any block.
> + */
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> +
> /**
> * kvm_pgtable_walk() - Walk a page-table.
> * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0dee13007776..db9d1a28769b 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1229,6 +1229,73 @@ int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> return 0;
> }
>
> +struct stage2_split_data {
> + struct kvm_s2_mmu *mmu;
> + void *memcache;
> +};
> +
> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags visit)
> +{
> + struct stage2_split_data *data = ctx->arg;
> + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> + kvm_pte_t pte = ctx->old, new, *childp;
> + enum kvm_pgtable_prot prot;
> + void *mc = data->memcache;
> + u32 level = ctx->level;
> + u64 phys;
> + int ret;
> +
> + /* Nothing to split at the last level */
Would it be accurate to say:
/* No huge pages can exist at the root level, so there's nothing to
split here. */
I think of "last level" as the lowest/leaf/4k level but
KVM_PGTABLE_MAX_LEVELS - 1 is 3? Does ARM do the level numbering in
reverse order to x86?
> + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> + return 0;
> +
...
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-24 1:03 ` Ben Gardon
@ 2023-01-24 16:46 ` Ricardo Koller
2023-01-24 17:11 ` Oliver Upton
0 siblings, 1 reply; 47+ messages in thread
From: Ricardo Koller @ 2023-01-24 16:46 UTC (permalink / raw)
To: Ben Gardon
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Mon, Jan 23, 2023 at 05:03:23PM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
> > range of huge pages. This will be used for eager-splitting huge pages
> > into PAGE_SIZE pages. The goal is to avoid having to split huge pages
> > on write-protection faults, and instead use this function to do it
> > ahead of time for large ranges (e.g., all guest memory in 1G chunks at
> > a time).
> >
> > No functional change intended. This new function will be used in a
> > subsequent commit.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 29 ++++++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 67 ++++++++++++++++++++++++++++
> > 2 files changed, 96 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 8ad78d61af7f..5fbdc1f259fd 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -644,6 +644,35 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
> > */
> > int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
> >
> > +/**
> > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > + * to PAGE_SIZE guest pages.
> > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> > + * @addr: Intermediate physical address from which to split.
> > + * @size: Size of the range.
> > + * @mc: Cache of pre-allocated and zeroed memory from which to allocate
> > + * page-table pages.
> > + *
> > + * @addr and the end (@addr + @size) are effectively aligned down and up to
> > + * the top level huge-page block size. This is an exampe using 1GB
>
> Nit: example
>
> > + * huge-pages and 4KB granules.
> > + *
> > + * [---input range---]
> > + * : :
> > + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> > + * : :
> > + * [--2MB--][--2MB--][--2MB--][--2MB--]
> > + * : :
> > + * [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> > + * : :
> > + *
> > + * Return: 0 on success, negative error code on failure. Note that
> > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > + * blocks in the input range as allowed by the size of the memcache. It
> > + * will fail it wasn't able to break any block.
> > + */
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> > +
> > /**
> > * kvm_pgtable_walk() - Walk a page-table.
> > * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0dee13007776..db9d1a28769b 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1229,6 +1229,73 @@ int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> > return 0;
> > }
> >
> > +struct stage2_split_data {
> > + struct kvm_s2_mmu *mmu;
> > + void *memcache;
> > +};
> > +
> > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags visit)
> > +{
> > + struct stage2_split_data *data = ctx->arg;
> > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > + kvm_pte_t pte = ctx->old, new, *childp;
> > + enum kvm_pgtable_prot prot;
> > + void *mc = data->memcache;
> > + u32 level = ctx->level;
> > + u64 phys;
> > + int ret;
> > +
> > + /* Nothing to split at the last level */
>
> Would it be accurate to say:
> /* No huge pages can exist at the root level, so there's nothing to
> split here. */
>
> I think of "last level" as the lowest/leaf/4k level but
> KVM_PGTABLE_MAX_LEVELS - 1 is 3?
Right, this is the 4k level.
> Does ARM do the level numbering in
> reverse order to x86?
Yes, it does. Interesting, x86 does
iter->level--;
while arm does:
ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
I don't think this numbering scheme is encoded anywhere in the PTEs, so
either architecture could use the other.
>
> > + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > + return 0;
> > +
> ...
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-24 16:46 ` Ricardo Koller
@ 2023-01-24 17:11 ` Oliver Upton
2023-01-24 17:18 ` Ricardo Koller
0 siblings, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2023-01-24 17:11 UTC (permalink / raw)
To: Ricardo Koller
Cc: Ben Gardon, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 08:46:58AM -0800, Ricardo Koller wrote:
> On Mon, Jan 23, 2023 at 05:03:23PM -0800, Ben Gardon wrote:
[...]
> > Would it be accurate to say:
> > /* No huge pages can exist at the root level, so there's nothing to
> > split here. */
> >
> > I think of "last level" as the lowest/leaf/4k level but
> > KVM_PGTABLE_MAX_LEVELS - 1 is 3?
>
> Right, this is the 4k level.
>
> > Does ARM do the level numbering in
> > reverse order to x86?
>
> Yes, it does. Interesting, x86 does
>
> iter->level--;
>
> while arm does:
>
> ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
>
> I don't think this numbering scheme is encoded anywhere in the PTEs, so
> either architecture could use the other.
The numbering we use in the page table walkers is deliberate, as it
directly matches the Arm ARM. While we can certainly use either scheme
I'd prefer we keep aligned with the architecture.
--
Thanks,
Oliver
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-24 17:11 ` Oliver Upton
@ 2023-01-24 17:18 ` Ricardo Koller
2023-01-24 17:48 ` David Matlack
0 siblings, 1 reply; 47+ messages in thread
From: Ricardo Koller @ 2023-01-24 17:18 UTC (permalink / raw)
To: Oliver Upton
Cc: Ben Gardon, pbonzini, maz, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 9:11 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Jan 24, 2023 at 08:46:58AM -0800, Ricardo Koller wrote:
> > On Mon, Jan 23, 2023 at 05:03:23PM -0800, Ben Gardon wrote:
>
> [...]
>
> > > Would it be accurate to say:
> > > /* No huge pages can exist at the root level, so there's nothing to
> > > split here. */
> > >
> > > I think of "last level" as the lowest/leaf/4k level but
> > > KVM_PGTABLE_MAX_LEVELS - 1 is 3?
> >
> > Right, this is the 4k level.
> >
> > > Does ARM do the level numbering in
> > > reverse order to x86?
> >
> > Yes, it does. Interesting, x86 does
> >
> > iter->level--;
> >
> > while arm does:
> >
> > ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
> >
> > I don't think this numbering scheme is encoded anywhere in the PTEs, so
> > either architecture could use the other.
>
> The numbering we use in the page table walkers is deliberate, as it
> directly matches the Arm ARM. While we can certainly use either scheme
> I'd prefer we keep aligned with the architecture.
hehe, I was actually subtly suggesting our x86 friends to change their side.
>
> --
> Thanks,
> Oliver
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-24 17:18 ` Ricardo Koller
@ 2023-01-24 17:48 ` David Matlack
2023-01-24 20:28 ` Oliver Upton
0 siblings, 1 reply; 47+ messages in thread
From: David Matlack @ 2023-01-24 17:48 UTC (permalink / raw)
To: Ricardo Koller
Cc: Oliver Upton, Ben Gardon, pbonzini, maz, yuzenghui, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 09:18:33AM -0800, Ricardo Koller wrote:
> On Tue, Jan 24, 2023 at 9:11 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Tue, Jan 24, 2023 at 08:46:58AM -0800, Ricardo Koller wrote:
> > > On Mon, Jan 23, 2023 at 05:03:23PM -0800, Ben Gardon wrote:
> >
> > [...]
> >
> > > > Would it be accurate to say:
> > > > /* No huge pages can exist at the root level, so there's nothing to
> > > > split here. */
> > > >
> > > > I think of "last level" as the lowest/leaf/4k level but
> > > > KVM_PGTABLE_MAX_LEVELS - 1 is 3?
> > >
> > > Right, this is the 4k level.
> > >
> > > > Does ARM do the level numbering in
> > > > reverse order to x86?
> > >
> > > Yes, it does. Interesting, x86 does
> > >
> > > iter->level--;
> > >
> > > while arm does:
> > >
> > > ret = __kvm_pgtable_walk(data, mm_ops, childp, level + 1);
> > >
> > > I don't think this numbering scheme is encoded anywhere in the PTEs, so
> > > either architecture could use the other.
> >
> > The numbering we use in the page table walkers is deliberate, as it
> > directly matches the Arm ARM. While we can certainly use either scheme
> > I'd prefer we keep aligned with the architecture.
>
> hehe, I was actually subtly suggesting our x86 friends to change their side.
Yeah KVM/x86 and KVM/ARM use basically opposite numbering schemes for
page table levels.
Level | KVM/ARM | KVM/x86
----- | ------- | ---------------
pte | 3 | 1 (PG_LEVEL_4K)
pmd | 2 | 2 (PG_LEVEL_2M)
pud | 1 | 3 (PG_LEVEL_1G)
p4d | 0 | 4
| -1 | 5
The ARM levels come from the architecture, whereas the x86 levels are
arbitrary.
I do think it would be valuable to standardize on one leveling scheme at
some point. Otherwise, mixing level schemes is bound to be a source of
bugs if and when we are sharing more MMU code across architectures.
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-24 17:48 ` David Matlack
@ 2023-01-24 20:28 ` Oliver Upton
0 siblings, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-24 20:28 UTC (permalink / raw)
To: David Matlack
Cc: Ricardo Koller, Ben Gardon, pbonzini, maz, yuzenghui, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Tue, Jan 24, 2023 at 09:48:28AM -0800, David Matlack wrote:
> On Tue, Jan 24, 2023 at 09:18:33AM -0800, Ricardo Koller wrote:
> > On Tue, Jan 24, 2023 at 9:11 AM Oliver Upton <oliver.upton@linux.dev> wrote:
[...]
> > > The numbering we use in the page table walkers is deliberate, as it
> > > directly matches the Arm ARM. While we can certainly use either scheme
> > > I'd prefer we keep aligned with the architecture.
> >
> > hehe, I was actually subtly suggesting our x86 friends to change their side.
>
> Yeah KVM/x86 and KVM/ARM use basically opposite numbering schemes for
> page table levels.
>
> Level | KVM/ARM | KVM/x86
> ----- | ------- | ---------------
> pte | 3 | 1 (PG_LEVEL_4K)
> pmd | 2 | 2 (PG_LEVEL_2M)
> pud | 1 | 3 (PG_LEVEL_1G)
> p4d | 0 | 4
> | -1 | 5
>
> The ARM levels come from the architecture, whereas the x86 levels are
> arbitrary.
>
> I do think it would be valuable to standardize on one leveling scheme at
> some point. Otherwise, mixing level schemes is bound to be a source of
> bugs if and when we are sharing more MMU code across architectures.
That could work, so long as the respective ISAs don't depend on any
specific numbering scheme. For arm64 the level hints encoded in TLBIs
match our numbering scheme, which is quite valuable. We are definitely
at odds with RISC-V's numbering scheme (descending order, starting from
root), but AFAICT there doesn't appear to be any portion of the ISA that
depends on it (yet).
Sure, we could add some glue code to transform KVM's common leveling
scheme into an architecture-specific one for these use cases, but I
worry that'll be incredibly error-prone.
In any case I'd prefer we not make any changes at this point, as they'd
be purely cosmetic.
--
Thanks,
Oliver
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-01-13 3:49 ` [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
2023-01-24 1:03 ` Ben Gardon
@ 2023-02-06 9:20 ` Zheng Chuan
2023-02-06 16:28 ` Ricardo Koller
1 sibling, 1 reply; 47+ messages in thread
From: Zheng Chuan @ 2023-02-06 9:20 UTC (permalink / raw)
To: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Xiexiangyou, yezhenyu2
Hi, Ricardo
On 2023/1/13 11:49, Ricardo Koller wrote:
> Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
> range of huge pages. This will be used for eager-splitting huge pages
> into PAGE_SIZE pages. The goal is to avoid having to split huge pages
> on write-protection faults, and instead use this function to do it
> ahead of time for large ranges (e.g., all guest memory in 1G chunks at
> a time).
>
> No functional change intended. This new function will be used in a
> subsequent commit.
>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 29 ++++++++++++
> arch/arm64/kvm/hyp/pgtable.c | 67 ++++++++++++++++++++++++++++
> 2 files changed, 96 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 8ad78d61af7f..5fbdc1f259fd 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -644,6 +644,35 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
> */
> int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
>
> +/**
> + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> + * to PAGE_SIZE guest pages.
> + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> + * @addr: Intermediate physical address from which to split.
> + * @size: Size of the range.
> + * @mc: Cache of pre-allocated and zeroed memory from which to allocate
> + * page-table pages.
> + *
> + * @addr and the end (@addr + @size) are effectively aligned down and up to
> + * the top level huge-page block size. This is an exampe using 1GB
> + * huge-pages and 4KB granules.
> + *
> + * [---input range---]
> + * : :
> + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> + * : :
> + * [--2MB--][--2MB--][--2MB--][--2MB--]
> + * : :
> + * [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> + * : :
> + *
> + * Return: 0 on success, negative error code on failure. Note that
> + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> + * blocks in the input range as allowed by the size of the memcache. It
> + * will fail it wasn't able to break any block.
> + */
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> +
> /**
> * kvm_pgtable_walk() - Walk a page-table.
> * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0dee13007776..db9d1a28769b 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -1229,6 +1229,73 @@ int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> return 0;
> }
>
> +struct stage2_split_data {
> + struct kvm_s2_mmu *mmu;
> + void *memcache;
> +};
> +
> +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags visit)
> +{
> + struct stage2_split_data *data = ctx->arg;
> + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> + kvm_pte_t pte = ctx->old, new, *childp;
> + enum kvm_pgtable_prot prot;
> + void *mc = data->memcache;
> + u32 level = ctx->level;
> + u64 phys;
> + int ret;
> +
> + /* Nothing to split at the last level */
> + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> + return 0;
> +
> + /* We only split valid block mappings */
> + if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> + return 0;
> +
IIUC, It should be !kvm_pte_table(pte, ctx->level)?
also, the kvm_pte_table includes the level check and kvm_pte_valid, so, it just be like:
- /* Nothing to split at the last level */
- if (level == KVM_PGTABLE_MAX_LEVELS - 1)
- return 0;
-
- /* We only split valid block mappings */
+ if (!kvm_pte_table(pte, ctx->level))
+ return 0;
> + phys = kvm_pte_to_phys(pte);
> + prot = kvm_pgtable_stage2_pte_prot(pte);
> +
> + ret = kvm_pgtable_stage2_create_removed(data->mmu->pgt, &new, phys,
> + level, prot, mc);
> + if (ret)
> + return ret;
> +
> + if (!stage2_try_break_pte(ctx, data->mmu)) {
> + childp = kvm_pte_follow(new, mm_ops);
> + kvm_pgtable_stage2_free_removed(mm_ops, childp, level);
> + mm_ops->put_page(childp);
> + return -EAGAIN;
> + }
> +
> + /*
> + * Note, the contents of the page table are guaranteed to be
> + * made visible before the new PTE is assigned because
> + * stage2_make_pte() writes the PTE using smp_store_release().
> + */
> + stage2_make_pte(ctx, new);
> + dsb(ishst);
> + return 0;
> +}
> +
> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt,
> + u64 addr, u64 size, void *mc)
> +{
> + struct stage2_split_data split_data = {
> + .mmu = pgt->mmu,
> + .memcache = mc,
> + };
> +
> + struct kvm_pgtable_walker walker = {
> + .cb = stage2_split_walker,
> + .flags = KVM_PGTABLE_WALK_LEAF,
> + .arg = &split_data,
> + };
> +
> + return kvm_pgtable_walk(pgt, addr, size, &walker);
> +}
> +
> int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> struct kvm_pgtable_mm_ops *mm_ops,
> enum kvm_pgtable_stage2_flags flags,
>
--
Regards.
Chuan
* Re: [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split()
2023-02-06 9:20 ` Zheng Chuan
@ 2023-02-06 16:28 ` Ricardo Koller
0 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-02-06 16:28 UTC (permalink / raw)
To: Zheng Chuan
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol, Xiexiangyou, yezhenyu2
Hi Zheng,
On Mon, Feb 6, 2023 at 1:21 AM Zheng Chuan <zhengchuan@huawei.com> wrote:
>
> Hi, Ricardo
>
> On 2023/1/13 11:49, Ricardo Koller wrote:
> > Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
> > range of huge pages. This will be used for eager-splitting huge pages
> > into PAGE_SIZE pages. The goal is to avoid having to split huge pages
> > on write-protection faults, and instead use this function to do it
> > ahead of time for large ranges (e.g., all guest memory in 1G chunks at
> > a time).
> >
> > No functional change intended. This new function will be used in a
> > subsequent commit.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 29 ++++++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 67 ++++++++++++++++++++++++++++
> > 2 files changed, 96 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 8ad78d61af7f..5fbdc1f259fd 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -644,6 +644,35 @@ bool kvm_pgtable_stage2_is_young(struct kvm_pgtable *pgt, u64 addr);
> > */
> > int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
> >
> > +/**
> > + * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
> > + * to PAGE_SIZE guest pages.
> > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> > + * @addr: Intermediate physical address from which to split.
> > + * @size: Size of the range.
> > + * @mc: Cache of pre-allocated and zeroed memory from which to allocate
> > + * page-table pages.
> > + *
> > + * @addr and the end (@addr + @size) are effectively aligned down and up to
> > + * the top level huge-page block size. This is an exampe using 1GB
> > + * huge-pages and 4KB granules.
> > + *
> > + * [---input range---]
> > + * : :
> > + * [--1G block pte--][--1G block pte--][--1G block pte--][--1G block pte--]
> > + * : :
> > + * [--2MB--][--2MB--][--2MB--][--2MB--]
> > + * : :
> > + * [ ][ ][:][ ][ ][ ][ ][ ][:][ ][ ][ ]
> > + * : :
> > + *
> > + * Return: 0 on success, negative error code on failure. Note that
> > + * kvm_pgtable_stage2_split() is best effort: it tries to break as many
> > + * blocks in the input range as allowed by the size of the memcache. It
> > + * will fail it wasn't able to break any block.
> > + */
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size, void *mc);
> > +
> > /**
> > * kvm_pgtable_walk() - Walk a page-table.
> > * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0dee13007776..db9d1a28769b 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1229,6 +1229,73 @@ int kvm_pgtable_stage2_create_removed(struct kvm_pgtable *pgt,
> > return 0;
> > }
> >
> > +struct stage2_split_data {
> > + struct kvm_s2_mmu *mmu;
> > + void *memcache;
> > +};
> > +
> > +static int stage2_split_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags visit)
> > +{
> > + struct stage2_split_data *data = ctx->arg;
> > + struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > + kvm_pte_t pte = ctx->old, new, *childp;
> > + enum kvm_pgtable_prot prot;
> > + void *mc = data->memcache;
> > + u32 level = ctx->level;
> > + u64 phys;
> > + int ret;
> > +
> > + /* Nothing to split at the last level */
> > + if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> > + return 0;
> > +
> > + /* We only split valid block mappings */
> > + if (!kvm_pte_valid(pte) || kvm_pte_table(pte, ctx->level))
> > + return 0;
> > +
> IIUC, It should be !kvm_pte_table(pte, ctx->level)?
It's the other way around; if the current PTE is a table then there's
nothing to split
(it's already a table), and we leave early.
> also, the kvm_pte_table includes the level check and kvm_pte_valid, so, it just be like:
> - /* Nothing to split at the last level */
> - if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> - return 0;
> -
> - /* We only split valid block mappings */
> + if (!kvm_pte_table(pte, ctx->level))
> + return 0;
Still need the kvm_pte_valid() check because !kvm_pte_table() does not imply
that the pte is valid. Same with the level check.
Thanks for taking a look at the patch,
Ricardo
>
> > + phys = kvm_pte_to_phys(pte);
> > + prot = kvm_pgtable_stage2_pte_prot(pte);
> > +
> > + ret = kvm_pgtable_stage2_create_removed(data->mmu->pgt, &new, phys,
> > + level, prot, mc);
> > + if (ret)
> > + return ret;
> > +
> > + if (!stage2_try_break_pte(ctx, data->mmu)) {
> > + childp = kvm_pte_follow(new, mm_ops);
> > + kvm_pgtable_stage2_free_removed(mm_ops, childp, level);
> > + mm_ops->put_page(childp);
> > + return -EAGAIN;
> > + }
> > +
> > + /*
> > + * Note, the contents of the page table are guaranteed to be
> > + * made visible before the new PTE is assigned because
> > + * stage2_make_pte() writes the PTE using smp_store_release().
> > + */
> > + stage2_make_pte(ctx, new);
> > + dsb(ishst);
> > + return 0;
> > +}
> > +
> > +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt,
> > + u64 addr, u64 size, void *mc)
> > +{
> > + struct stage2_split_data split_data = {
> > + .mmu = pgt->mmu,
> > + .memcache = mc,
> > + };
> > +
> > + struct kvm_pgtable_walker walker = {
> > + .cb = stage2_split_walker,
> > + .flags = KVM_PGTABLE_WALK_LEAF,
> > + .arg = &split_data,
> > + };
> > +
> > + return kvm_pgtable_walk(pgt, addr, size, &walker);
> > +}
> > +
> > int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> > struct kvm_pgtable_mm_ops *mm_ops,
> > enum kvm_pgtable_stage2_flags flags,
> >
>
> --
> Regards.
> Chuan
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 4/9] KVM: arm64: Refactor kvm_arch_commit_memory_region()
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (2 preceding siblings ...)
2023-01-13 3:49 ` [PATCH 3/9] KVM: arm64: Add kvm_pgtable_stage2_split() Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-13 3:49 ` [PATCH 5/9] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
` (5 subsequent siblings)
9 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Refactor kvm_arch_commit_memory_region() in preparation for a future
commit, to make it cleaner and easier to understand. It also ends up
looking more like its x86 counterpart, kvm_mmu_slot_apply_flags().
No functional change intended.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/kvm/mmu.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 31d7fa4c7c14..dbcd5d9bc260 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1758,20 +1758,27 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
const struct kvm_memory_slot *new,
enum kvm_mr_change change)
{
+ bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+
/*
* At this point memslot has been committed and there is an
* allocated dirty_bitmap[], dirty pages will be tracked while the
* memory slot is write protected.
*/
- if (change != KVM_MR_DELETE && new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ if (log_dirty_pages) {
+
+ if (change == KVM_MR_DELETE)
+ return;
+
/*
* If we're with initial-all-set, we don't need to write
* protect any pages because they're all reported as dirty.
* Huge pages and normal pages will be write protect gradually.
*/
- if (!kvm_dirty_log_manual_protect_and_init_set(kvm)) {
- kvm_mmu_wp_memory_region(kvm, new->id);
- }
+ if (kvm_dirty_log_manual_protect_and_init_set(kvm))
+ return;
+
+ kvm_mmu_wp_memory_region(kvm, new->id);
}
}
--
2.39.0.314.g84b9a713c41-goog
* [PATCH 5/9] KVM: arm64: Add kvm_uninit_stage2_mmu()
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (3 preceding siblings ...)
2023-01-13 3:49 ` [PATCH 4/9] KVM: arm64: Refactor kvm_arch_commit_memory_region() Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-13 3:49 ` [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
` (4 subsequent siblings)
9 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Add kvm_uninit_stage2_mmu() and move kvm_free_stage2_pgd() into it. A
future commit will add more work to kvm_uninit_stage2_mmu().
No functional change intended.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/include/asm/kvm_mmu.h | 1 +
arch/arm64/kvm/mmu.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e4a7e6369499..058f3ae5bc26 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -167,6 +167,7 @@ void free_hyp_pgds(void);
void stage2_unmap_vm(struct kvm *kvm);
int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
+void kvm_uninit_stage2_mmu(struct kvm *kvm);
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu);
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
phys_addr_t pa, unsigned long size, bool writable);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dbcd5d9bc260..700c5774b50d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -766,6 +766,11 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
return err;
}
+void kvm_uninit_stage2_mmu(struct kvm *kvm)
+{
+ kvm_free_stage2_pgd(&kvm->arch.mmu);
+}
+
static void stage2_unmap_memslot(struct kvm *kvm,
struct kvm_memory_slot *memslot)
{
@@ -1852,7 +1857,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
- kvm_free_stage2_pgd(&kvm->arch.mmu);
+ kvm_uninit_stage2_mmu(kvm);
}
void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
--
2.39.0.314.g84b9a713c41-goog
* [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (4 preceding siblings ...)
2023-01-13 3:49 ` [PATCH 5/9] KVM: arm64: Add kvm_uninit_stage2_mmu() Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-24 17:52 ` Ben Gardon
2023-01-24 22:45 ` Oliver Upton
2023-01-13 3:49 ` [PATCH 7/9] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
` (3 subsequent siblings)
9 siblings, 2 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Split huge pages eagerly when enabling dirty logging. The goal is to
avoid doing it while faulting on write-protected pages, which
negatively impacts guest performance.
A memslot marked for dirty logging is split in 1GB pieces at a time.
This is in order to release the mmu_lock and give other kernel threads
the opportunity to run, and also in order to allocate enough pages to
split a 1GB range worth of huge pages (or a single 1GB huge page).
Note that these page allocations can fail, so eager page splitting is
best-effort. This is not a correctness issue though, as huge pages
can still be split on write-faults.
The benefits of eager page splitting are the same as in x86, added
with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
the TDP MMU when dirty logging is enabled"). For example, when running
dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
all of their memory after dirty logging is enabled decreased by 44%
from 2.58s to 1.42s.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/include/asm/kvm_host.h | 30 ++++++++
arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
2 files changed, 138 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..6ab37209b1d1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -153,6 +153,36 @@ struct kvm_s2_mmu {
/* The last vcpu id that ran on each physical CPU */
int __percpu *last_vcpu_ran;
+ /*
+ * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
+ * pages. It is used to allocate stage2 page tables while splitting
+ * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
+ * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
+ * the capacity of the split page cache (CACHE_CAPACITY), and how often
+ * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
+ *
+ * A good heuristic to pick CHUNK_SIZE is that it should be larger than
+ * all the available huge-page sizes, and be a multiple of all the
+ * other ones; for example, 1GB when all the available huge-page sizes
+ * are (1GB, 2MB, 32MB, 512MB).
+ *
+ * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
+ * example, 1GB requires the following number of PAGE_SIZE-pages:
+ * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
+ * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
+ * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
+ * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
+ * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
+ * granules.
+ *
+ * Protected by kvm->slots_lock.
+ */
+#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
+#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
+ (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
+ DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
+ struct kvm_mmu_memory_cache split_page_cache;
+
struct kvm_arch *arch;
};
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 700c5774b50d..41ee330edae3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
static unsigned long io_map_base;
-static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+bool __read_mostly eager_page_split = true;
+module_param(eager_page_split, bool, 0644);
+
+static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
+ phys_addr_t size)
{
- phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
return (boundary - 1 < end - 1) ? boundary : end;
}
+static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
+{
+ phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+
+ return __stage2_range_addr_end(addr, end, size);
+}
+
/*
* Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
* we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
@@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
return ret;
}
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+ return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
+{
+ struct kvm_mmu_memory_cache *cache;
+
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+ return true;
+
+ cache = &kvm->arch.mmu.split_page_cache;
+ return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
+}
+
+static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
+ phys_addr_t end)
+{
+ struct kvm_mmu_memory_cache *cache;
+ struct kvm_pgtable *pgt;
+ int ret;
+ u64 next;
+ int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+ cache = &kvm->arch.mmu.split_page_cache;
+
+ do {
+ if (need_topup_split_page_cache_or_resched(kvm)) {
+ write_unlock(&kvm->mmu_lock);
+ cond_resched();
+ /* Eager page splitting is best-effort. */
+ ret = __kvm_mmu_topup_memory_cache(cache,
+ cache_capacity,
+ cache_capacity);
+ write_lock(&kvm->mmu_lock);
+ if (ret)
+ break;
+ }
+
+ pgt = kvm->arch.mmu.pgt;
+ if (!pgt)
+ return -EINVAL;
+
+ next = __stage2_range_addr_end(addr, end,
+ EAGER_PAGE_SPLIT_CHUNK_SIZE);
+ ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
+ if (ret)
+ break;
+ } while (addr = next, addr != end);
+
+ return ret;
+}
+
#define stage2_apply_range_resched(kvm, addr, end, fn) \
stage2_apply_range(kvm, addr, end, fn, true)
@@ -755,6 +823,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
for_each_possible_cpu(cpu)
*per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
+ mmu->split_page_cache.gfp_zero = __GFP_ZERO;
+
mmu->pgt = pgt;
mmu->pgd_phys = __pa(pgt->pgd);
return 0;
@@ -769,6 +839,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
void kvm_uninit_stage2_mmu(struct kvm *kvm)
{
kvm_free_stage2_pgd(&kvm->arch.mmu);
+ kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
}
static void stage2_unmap_memslot(struct kvm *kvm,
@@ -996,6 +1067,29 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
stage2_wp_range(&kvm->arch.mmu, start, end);
}
+/**
+ * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
+ * pages for memory slot
+ * @kvm: The KVM pointer
+ * @slot: The memory slot to split
+ *
+ * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
+ * serializing operations for VM memory regions.
+ */
+static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
+ phys_addr_t start, end;
+
+ start = memslot->base_gfn << PAGE_SHIFT;
+ end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+
+ write_lock(&kvm->mmu_lock);
+ kvm_mmu_split_huge_pages(kvm, start, end);
+ write_unlock(&kvm->mmu_lock);
+}
+
/*
* kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
* dirty pages.
@@ -1783,7 +1877,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
if (kvm_dirty_log_manual_protect_and_init_set(kvm))
return;
+ if (READ_ONCE(eager_page_split))
+ kvm_mmu_split_memory_region(kvm, new->id);
+
kvm_mmu_wp_memory_region(kvm, new->id);
+ } else {
+ /*
+ * Free any leftovers from the eager page splitting cache. Do
+ * this when deleting, moving, disabling dirty logging, or
+ * creating the memslot (a nop). Doing it for deletes makes
+ * sure we don't leak memory, and there's no need to keep the
+ * cache around for any of the other cases.
+ */
+ kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
}
}
--
2.39.0.314.g84b9a713c41-goog
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-13 3:49 ` [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
@ 2023-01-24 17:52 ` Ben Gardon
2023-01-24 22:19 ` Oliver Upton
2023-01-24 22:45 ` Oliver Upton
1 sibling, 1 reply; 47+ messages in thread
From: Ben Gardon @ 2023-01-24 17:52 UTC (permalink / raw)
To: Ricardo Koller
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Split huge pages eagerly when enabling dirty logging. The goal is to
> avoid doing it while faulting on write-protected pages, which
> negatively impacts guest performance.
>
> A memslot marked for dirty logging is split in 1GB pieces at a time.
> This is in order to release the mmu_lock and give other kernel threads
> the opportunity to run, and also in order to allocate enough pages to
> split a 1GB range worth of huge pages (or a single 1GB huge page).
> Note that these page allocations can fail, so eager page splitting is
> best-effort. This is not a correctness issue though, as huge pages
> can still be split on write-faults.
>
> The benefits of eager page splitting are the same as in x86, added
> with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> the TDP MMU when dirty logging is enabled"). For example, when running
> dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> all of their memory after dirty logging is enabled decreased by 44%
> from 2.58s to 1.42s.
>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
> arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> 2 files changed, 138 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..6ab37209b1d1 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> /* The last vcpu id that ran on each physical CPU */
> int __percpu *last_vcpu_ran;
>
> + /*
> + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> + * pages. It is used to allocate stage2 page tables while splitting
> + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> + *
> + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> + * all the available huge-page sizes, and be a multiple of all the
> + * other ones; for example, 1GB when all the available huge-page sizes
> + * are (1GB, 2MB, 32MB, 512MB).
This feels a little fragile to link scheduling decisions into the
batch size. (I'm not saying we don't do the same thing on x86, but
it's a mistake in either case.) I'd prefer if we could yield more
granularly in a way that's not tied to splitting a whole
EAGER_PAGE_SPLIT_CHUNK_SIZE worth of pages.
Tuning the chunk size to balance memory overhead with batch
performance makes sense, but it becomes more difficult to also balance
with the needs of the scheduler. Could we add some yield point in
kvm_pgtable_stage2_split or give it an early return path or something?
> + *
> + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> + * example, 1GB requires the following number of PAGE_SIZE-pages:
> + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> + * granules.
> + *
> + * Protected by kvm->slots_lock.
> + */
> +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
> +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
> + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
> + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> + struct kvm_mmu_memory_cache split_page_cache;
> +
> struct kvm_arch *arch;
> };
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 700c5774b50d..41ee330edae3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
>
> static unsigned long io_map_base;
>
> -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +bool __read_mostly eager_page_split = true;
> +module_param(eager_page_split, bool, 0644);
> +
> +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> + phys_addr_t size)
> {
> - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
>
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
> +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +
> + return __stage2_range_addr_end(addr, end, size);
> +}
> +
> /*
> * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> return ret;
> }
>
> +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> +{
> + return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}
> +
> +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> +{
> + struct kvm_mmu_memory_cache *cache;
> +
> + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> + return true;
> +
> + cache = &kvm->arch.mmu.split_page_cache;
> + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> +}
> +
> +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> + phys_addr_t end)
> +{
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm_pgtable *pgt;
> + int ret;
> + u64 next;
> + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> +
> + lockdep_assert_held_write(&kvm->mmu_lock);
> +
> + lockdep_assert_held(&kvm->slots_lock);
> +
> + cache = &kvm->arch.mmu.split_page_cache;
> +
> + do {
> + if (need_topup_split_page_cache_or_resched(kvm)) {
> + write_unlock(&kvm->mmu_lock);
> + cond_resched();
> + /* Eager page splitting is best-effort. */
> + ret = __kvm_mmu_topup_memory_cache(cache,
> + cache_capacity,
> + cache_capacity);
> + write_lock(&kvm->mmu_lock);
> + if (ret)
> + break;
> + }
> +
> + pgt = kvm->arch.mmu.pgt;
> + if (!pgt)
> + return -EINVAL;
> +
> + next = __stage2_range_addr_end(addr, end,
> + EAGER_PAGE_SPLIT_CHUNK_SIZE);
> + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> + if (ret)
> + break;
> + } while (addr = next, addr != end);
> +
> + return ret;
> +}
> +
> #define stage2_apply_range_resched(kvm, addr, end, fn) \
> stage2_apply_range(kvm, addr, end, fn, true)
>
> @@ -755,6 +823,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> for_each_possible_cpu(cpu)
> *per_cpu_ptr(mmu->last_vcpu_ran, cpu) = -1;
>
> + mmu->split_page_cache.gfp_zero = __GFP_ZERO;
> +
> mmu->pgt = pgt;
> mmu->pgd_phys = __pa(pgt->pgd);
> return 0;
> @@ -769,6 +839,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> void kvm_uninit_stage2_mmu(struct kvm *kvm)
> {
> kvm_free_stage2_pgd(&kvm->arch.mmu);
> + kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> }
>
> static void stage2_unmap_memslot(struct kvm *kvm,
> @@ -996,6 +1067,29 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> stage2_wp_range(&kvm->arch.mmu, start, end);
> }
>
> +/**
> + * kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
> + * pages for memory slot
> + * @kvm: The KVM pointer
> + * @slot: The memory slot to split
> + *
> + * Acquires kvm->mmu_lock. Called with kvm->slots_lock mutex acquired,
> + * serializing operations for VM memory regions.
> + */
> +static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
> +{
> + struct kvm_memslots *slots = kvm_memslots(kvm);
> + struct kvm_memory_slot *memslot = id_to_memslot(slots, slot);
> + phys_addr_t start, end;
> +
> + start = memslot->base_gfn << PAGE_SHIFT;
> + end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
> +
> + write_lock(&kvm->mmu_lock);
> + kvm_mmu_split_huge_pages(kvm, start, end);
> + write_unlock(&kvm->mmu_lock);
> +}
> +
> /*
> * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
> * dirty pages.
> @@ -1783,7 +1877,19 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> if (kvm_dirty_log_manual_protect_and_init_set(kvm))
> return;
>
> + if (READ_ONCE(eager_page_split))
> + kvm_mmu_split_memory_region(kvm, new->id);
> +
> kvm_mmu_wp_memory_region(kvm, new->id);
> + } else {
> + /*
> + * Free any leftovers from the eager page splitting cache. Do
> + * this when deleting, moving, disabling dirty logging, or
> + * creating the memslot (a nop). Doing it for deletes makes
> + * sure we don't leak memory, and there's no need to keep the
> + * cache around for any of the other cases.
> + */
> + kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> }
> }
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-24 17:52 ` Ben Gardon
@ 2023-01-24 22:19 ` Oliver Upton
0 siblings, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-24 22:19 UTC (permalink / raw)
To: Ben Gardon
Cc: Ricardo Koller, pbonzini, maz, oupton, yuzenghui, dmatlack, kvm,
kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, ricarkol
On Tue, Jan 24, 2023 at 09:52:49AM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Split huge pages eagerly when enabling dirty logging. The goal is to
> > avoid doing it while faulting on write-protected pages, which
> > negatively impacts guest performance.
> >
> > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > This is in order to release the mmu_lock and give other kernel threads
> > the opportunity to run, and also in order to allocate enough pages to
> > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > Note that these page allocations can fail, so eager page splitting is
> > best-effort. This is not a correctness issue though, as huge pages
> > can still be split on write-faults.
> >
> > The benefits of eager page splitting are the same as in x86, added
> > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > the TDP MMU when dirty logging is enabled"). For example, when running
> > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > all of their memory after dirty logging is enabled decreased by 44%
> > from 2.58s to 1.42s.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> > 2 files changed, 138 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..6ab37209b1d1 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > /* The last vcpu id that ran on each physical CPU */
> > int __percpu *last_vcpu_ran;
> >
> > + /*
> > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > + * pages. It is used to allocate stage2 page tables while splitting
> > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > + *
> > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > + * all the available huge-page sizes, and be a multiple of all the
> > + * other ones; for example, 1GB when all the available huge-page sizes
> > + * are (1GB, 2MB, 32MB, 512MB).
>
> This feels a little fragile to link scheduling decisions into the
> batch size.
Completely agree.
> (I'm not saying we don't do the same thing on x86, but
> it's a mistake in either case.) I'd prefer if we could yield more
> granularly in a way that's not tied to splitting a whole
> EAGER_PAGE_SPLIT_CHUNK_SIZE worth of pages.
> Tuning the chunk size to balance memory overhead with batch
> performance makes sense, but it becomes more difficult to also balance
> with the needs of the scheduler. Could we add some yield point in
> kvm_pgtable_stage2_split or give it an early return path or something?
We definitely need to move in the direction of early returns, but I'd
rather not build it purely for the sake of eager page spliting. A better
approach would likely be to add it to the core page table walker code
such that all walkers that operate on a swath of memory can use it.
In the interest of keeping this series manageable, I'd prefer yielding
walkers be done in a separate series.
--
Thanks,
Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-13 3:49 ` [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
2023-01-24 17:52 ` Ben Gardon
@ 2023-01-24 22:45 ` Oliver Upton
2023-01-26 18:45 ` Ricardo Koller
1 sibling, 1 reply; 47+ messages in thread
From: Oliver Upton @ 2023-01-24 22:45 UTC (permalink / raw)
To: Ricardo Koller
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
Hi Ricardo,
On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> Split huge pages eagerly when enabling dirty logging. The goal is to
> avoid doing it while faulting on write-protected pages, which
> negatively impacts guest performance.
>
> A memslot marked for dirty logging is split in 1GB pieces at a time.
> This is in order to release the mmu_lock and give other kernel threads
> the opportunity to run, and also in order to allocate enough pages to
> split a 1GB range worth of huge pages (or a single 1GB huge page).
> Note that these page allocations can fail, so eager page splitting is
> best-effort. This is not a correctness issue though, as huge pages
> can still be split on write-faults.
>
> The benefits of eager page splitting are the same as in x86, added
> with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> the TDP MMU when dirty logging is enabled"). For example, when running
> dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> all of their memory after dirty logging is enabled decreased by 44%
> from 2.58s to 1.42s.
>
> Signed-off-by: Ricardo Koller <ricarkol@google.com>
> ---
> arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> 2 files changed, 138 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..6ab37209b1d1 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> /* The last vcpu id that ran on each physical CPU */
> int __percpu *last_vcpu_ran;
>
> + /*
> + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> + * pages. It is used to allocate stage2 page tables while splitting
> + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> + *
> + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> + * all the available huge-page sizes, and be a multiple of all the
> + * other ones; for example, 1GB when all the available huge-page sizes
> + * are (1GB, 2MB, 32MB, 512MB).
> + *
> + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> + * example, 1GB requires the following number of PAGE_SIZE-pages:
> + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> + * - 32 when using 32MB hugepages with 16KB granule (1GB / 32MB).
> + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> + * granules.
> + *
> + * Protected by kvm->slots_lock.
> + */
> +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
> +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
> + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
> + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
as the batch size? 513 pages across all page sizes is a non-negligible
amount of memory that goes largely unused when PAGE_SIZE != 4K.
With that change it is a lot easier to correctly match the cache
capacity to the selected page size. Additionally, we continue to have a
single set of batching logic that we can improve later on.
> + struct kvm_mmu_memory_cache split_page_cache;
> +
> struct kvm_arch *arch;
> };
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 700c5774b50d..41ee330edae3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
>
> static unsigned long io_map_base;
>
> -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +bool __read_mostly eager_page_split = true;
> +module_param(eager_page_split, bool, 0644);
> +
Unless someone is really begging for it I'd prefer we not add a module
parameter for this.
> +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> + phys_addr_t size)
> {
> - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
>
> return (boundary - 1 < end - 1) ? boundary : end;
> }
>
> +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> +{
> + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> +
> + return __stage2_range_addr_end(addr, end, size);
> +}
> +
> /*
> * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> return ret;
> }
>
> +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> +{
> + return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> +}
I don't think the helper is adding much here.
> +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> +{
> + struct kvm_mmu_memory_cache *cache;
> +
> + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> + return true;
> +
> + cache = &kvm->arch.mmu.split_page_cache;
> + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> +}
> +
> +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> + phys_addr_t end)
> +{
> + struct kvm_mmu_memory_cache *cache;
> + struct kvm_pgtable *pgt;
> + int ret;
> + u64 next;
> + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> +
> + lockdep_assert_held_write(&kvm->mmu_lock);
Rather than having the caller acquire the lock, can you instead do it
here? It would appear that the entire critical section is enclosed
within this function.
> + lockdep_assert_held(&kvm->slots_lock);
This function doesn't depend on anything guarded by the slots_lock, can
you move this to kvm_mmu_split_memory_region()?
> + cache = &kvm->arch.mmu.split_page_cache;
> +
> + do {
> + if (need_topup_split_page_cache_or_resched(kvm)) {
> + write_unlock(&kvm->mmu_lock);
> + cond_resched();
> + /* Eager page splitting is best-effort. */
> + ret = __kvm_mmu_topup_memory_cache(cache,
> + cache_capacity,
> + cache_capacity);
> + write_lock(&kvm->mmu_lock);
> + if (ret)
> + break;
> + }
> +
> + pgt = kvm->arch.mmu.pgt;
> + if (!pgt)
> + return -EINVAL;
> +
> + next = __stage2_range_addr_end(addr, end,
> + EAGER_PAGE_SPLIT_CHUNK_SIZE);
> + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> + if (ret)
> + break;
> + } while (addr = next, addr != end);
> +
> + return ret;
> +}
--
Thanks,
Oliver
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-24 22:45 ` Oliver Upton
@ 2023-01-26 18:45 ` Ricardo Koller
2023-01-26 19:25 ` Ricardo Koller
2023-01-26 20:10 ` Marc Zyngier
0 siblings, 2 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-26 18:45 UTC (permalink / raw)
To: Oliver Upton
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Ricardo,
>
> On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > Split huge pages eagerly when enabling dirty logging. The goal is to
> > avoid doing it while faulting on write-protected pages, which
> > negatively impacts guest performance.
> >
> > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > This is in order to release the mmu_lock and give other kernel threads
> > the opportunity to run, and also in order to allocate enough pages to
> > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > Note that these page allocations can fail, so eager page splitting is
> > best-effort. This is not a correctness issue though, as huge pages
> > can still be split on write-faults.
> >
> > The benefits of eager page splitting are the same as in x86, added
> > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > the TDP MMU when dirty logging is enabled"). For example, when running
> > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > all of their memory after dirty logging is enabled decreased by 44%
> > from 2.58s to 1.42s.
> >
> > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> > 2 files changed, 138 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 35a159d131b5..6ab37209b1d1 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > /* The last vcpu id that ran on each physical CPU */
> > int __percpu *last_vcpu_ran;
> >
> > + /*
> > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > + * pages. It is used to allocate stage2 page tables while splitting
> > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > + *
> > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > + * all the available huge-page sizes, and be a multiple of all the
> > + * other ones; for example, 1GB when all the available huge-page sizes
> > + * are (1GB, 2MB, 32MB, 512MB).
> > + *
> > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > + * example, 1GB requires the following number of PAGE_SIZE-pages:
> > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > + * - 32 when using 32MB hugepages with 16KB granules (1GB / 32MB).
> > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > + * granules.
> > + *
> > + * Protected by kvm->slots_lock.
> > + */
> > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
> > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
> > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
> > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
>
> Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> as the batch size? 513 pages across all page sizes is a non-negligible
> amount of memory that goes largely unused when PAGE_SIZE != 4K.
>
Sounds good, will refine this for v2.
> With that change it is a lot easier to correctly match the cache
> capacity to the selected page size. Additionally, we continue to have a
> single set of batching logic that we can improve later on.
>
> > + struct kvm_mmu_memory_cache split_page_cache;
> > +
> > struct kvm_arch *arch;
> > };
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 700c5774b50d..41ee330edae3 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> >
> > static unsigned long io_map_base;
> >
> > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > +bool __read_mostly eager_page_split = true;
> > +module_param(eager_page_split, bool, 0644);
> > +
>
> Unless someone is really begging for it I'd prefer we not add a module
> parameter for this.
It was mainly to match x86 and because it makes perf testing a bit
simpler. What do others think?
>
> > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> > + phys_addr_t size)
> > {
> > - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
> >
> > return (boundary - 1 < end - 1) ? boundary : end;
> > }
> >
> > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > +{
> > + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > +
> > + return __stage2_range_addr_end(addr, end, size);
> > +}
> > +
> > /*
> > * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> > return ret;
> > }
> >
> > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> > +{
> > + return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> > +}
>
> I don't think the helper is adding much here.
I'll see how it looks without it.
>
> > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> > +{
> > + struct kvm_mmu_memory_cache *cache;
> > +
> > + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > + return true;
> > +
> > + cache = &kvm->arch.mmu.split_page_cache;
> > + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> > +}
> > +
> > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> > + phys_addr_t end)
> > +{
> > + struct kvm_mmu_memory_cache *cache;
> > + struct kvm_pgtable *pgt;
> > + int ret;
> > + u64 next;
> > + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> > +
> > + lockdep_assert_held_write(&kvm->mmu_lock);
>
> Rather than having the caller acquire the lock, can you instead do it
> here? It would appear that the entire critical section is enclosed
> within this function.
Sure. I'll first double-check the performance and correctness
implications, just in case.
I'm not sure whether the extra acquire/release pairs make any measurable
difference to performance. I'm also not sure whether releasing the lock
between write-protection and splitting introduces a correctness issue
(I think it should be fine, but I'm not 100% certain).
>
> > + lockdep_assert_held(&kvm->slots_lock);
>
> This function doesn't depend on anything guarded by the slots_lock, can
> you move this to kvm_mmu_split_memory_region()?
kvm_mmu_split_memory_region() takes a memslot.
That works in this case, eager splitting when enabling dirty logging, but won't
work in the next commit when splitting on the CLEAR ioctl.
>
> > + cache = &kvm->arch.mmu.split_page_cache;
> > +
> > + do {
> > + if (need_topup_split_page_cache_or_resched(kvm)) {
> > + write_unlock(&kvm->mmu_lock);
> > + cond_resched();
> > + /* Eager page splitting is best-effort. */
> > + ret = __kvm_mmu_topup_memory_cache(cache,
> > + cache_capacity,
> > + cache_capacity);
> > + write_lock(&kvm->mmu_lock);
> > + if (ret)
> > + break;
> > + }
> > +
> > + pgt = kvm->arch.mmu.pgt;
> > + if (!pgt)
> > + return -EINVAL;
> > +
> > + next = __stage2_range_addr_end(addr, end,
> > + EAGER_PAGE_SPLIT_CHUNK_SIZE);
> > + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> > + if (ret)
> > + break;
> > + } while (addr = next, addr != end);
> > +
> > + return ret;
> > +}
>
> --
> Thanks,
> Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-26 18:45 ` Ricardo Koller
@ 2023-01-26 19:25 ` Ricardo Koller
2023-01-26 20:10 ` Marc Zyngier
1 sibling, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-26 19:25 UTC (permalink / raw)
To: Oliver Upton
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Thu, Jan 26, 2023 at 10:45 AM Ricardo Koller <ricarkol@google.com> wrote:
>
> On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > Hi Ricardo,
> >
> > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > avoid doing it while faulting on write-protected pages, which
> > > negatively impacts guest performance.
> > >
> > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > This is in order to release the mmu_lock and give other kernel threads
> > > the opportunity to run, and also in order to allocate enough pages to
> > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > Note that these page allocations can fail, so eager page splitting is
> > > best-effort. This is not a correctness issue though, as huge pages
> > > can still be split on write-faults.
> > >
> > > The benefits of eager page splitting are the same as in x86, added
> > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > all of their memory after dirty logging is enabled decreased by 44%
> > > from 2.58s to 1.42s.
> > >
> > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > ---
> > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> > > 2 files changed, 138 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..6ab37209b1d1 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > > /* The last vcpu id that ran on each physical CPU */
> > > int __percpu *last_vcpu_ran;
> > >
> > > + /*
> > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > + * pages. It is used to allocate stage2 page tables while splitting
> > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > + *
> > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > + * all the available huge-page sizes, and be a multiple of all the
> > > + * other ones; for example, 1GB when all the available huge-page sizes
> > > + * are (1GB, 2MB, 32MB, 512MB).
> > > + *
> > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > + * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > + * - 32 when using 32MB hugepages with 16KB granules (1GB / 32MB).
> > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > + * granules.
> > > + *
> > > + * Protected by kvm->slots_lock.
> > > + */
> > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
> > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
> > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
> > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> >
> > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > as the batch size? 513 pages across all page sizes is a non-negligible
> > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> >
>
> Sounds good, will refine this for v2.
>
> > With that change it is a lot easier to correctly match the cache
> > capacity to the selected page size. Additionally, we continue to have a
> > single set of batching logic that we can improve later on.
> >
> > > + struct kvm_mmu_memory_cache split_page_cache;
> > > +
> > > struct kvm_arch *arch;
> > > };
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 700c5774b50d..41ee330edae3 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > > static unsigned long io_map_base;
> > >
> > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +bool __read_mostly eager_page_split = true;
> > > +module_param(eager_page_split, bool, 0644);
> > > +
> >
> > Unless someone is really begging for it I'd prefer we not add a module
> > parameter for this.
>
> It was mainly to match x86 and because it makes perf testing a bit
> simpler. What do others think?
>
> >
> > > +static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
> > > + phys_addr_t size)
> > > {
> > > - phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > > phys_addr_t boundary = ALIGN_DOWN(addr + size, size);
> > >
> > > return (boundary - 1 < end - 1) ? boundary : end;
> > > }
> > >
> > > +static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +{
> > > + phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
> > > +
> > > + return __stage2_range_addr_end(addr, end, size);
> > > +}
> > > +
> > > /*
> > > * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > > * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
> > > @@ -71,6 +81,64 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> > > return ret;
> > > }
> > >
> > > +static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
> > > +{
> > > + return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
> > > +}
> >
> > I don't think the helper is adding much here.
>
> I'll see how it looks without it.
>
> >
> > > +static bool need_topup_split_page_cache_or_resched(struct kvm *kvm)
> > > +{
> > > + struct kvm_mmu_memory_cache *cache;
> > > +
> > > + if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> > > + return true;
> > > +
> > > + cache = &kvm->arch.mmu.split_page_cache;
> > > + return need_topup(cache, EAGER_PAGE_SPLIT_CACHE_CAPACITY);
> > > +}
> > > +
> > > +static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
> > > + phys_addr_t end)
> > > +{
> > > + struct kvm_mmu_memory_cache *cache;
> > > + struct kvm_pgtable *pgt;
> > > + int ret;
> > > + u64 next;
> > > + int cache_capacity = EAGER_PAGE_SPLIT_CACHE_CAPACITY;
> > > +
> > > + lockdep_assert_held_write(&kvm->mmu_lock);
> >
> > Rather than having the caller acquire the lock, can you instead do it
> > here? It would appear that the entire critical section is enclosed
> > within this function.
>
> Sure. I'll first double-check the performance and correctness
> implications, just in case.
> I'm not sure whether the extra acquire/release pairs make any measurable
> difference to performance. I'm also not sure whether releasing the lock
> between write-protection and splitting introduces a correctness issue
> (I think it should be fine, but I'm not 100% certain).
>
> >
> > > + lockdep_assert_held(&kvm->slots_lock);
> >
> > This function doesn't depend on anything guarded by the slots_lock, can
> > you move this to kvm_mmu_split_memory_region()?
>
> kvm_mmu_split_memory_region() takes a memslot.
> That works in this case, eager splitting when enabling dirty logging, but won't
> work in the next commit when splitting on the CLEAR ioctl.
>
Ahh, you meant just the "lockdep" line. Yes, that makes sense. Will do.
> >
> > > + cache = &kvm->arch.mmu.split_page_cache;
> > > +
> > > + do {
> > > + if (need_topup_split_page_cache_or_resched(kvm)) {
> > > + write_unlock(&kvm->mmu_lock);
> > > + cond_resched();
> > > + /* Eager page splitting is best-effort. */
> > > + ret = __kvm_mmu_topup_memory_cache(cache,
> > > + cache_capacity,
> > > + cache_capacity);
> > > + write_lock(&kvm->mmu_lock);
> > > + if (ret)
> > > + break;
> > > + }
> > > +
> > > + pgt = kvm->arch.mmu.pgt;
> > > + if (!pgt)
> > > + return -EINVAL;
> > > +
> > > + next = __stage2_range_addr_end(addr, end,
> > > + EAGER_PAGE_SPLIT_CHUNK_SIZE);
> > > + ret = kvm_pgtable_stage2_split(pgt, addr, next - addr, cache);
> > > + if (ret)
> > > + break;
> > > + } while (addr = next, addr != end);
> > > +
> > > + return ret;
> > > +}
> >
> > --
> > Thanks,
> > Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-26 18:45 ` Ricardo Koller
2023-01-26 19:25 ` Ricardo Koller
@ 2023-01-26 20:10 ` Marc Zyngier
2023-01-27 15:45 ` Ricardo Koller
1 sibling, 1 reply; 47+ messages in thread
From: Marc Zyngier @ 2023-01-26 20:10 UTC (permalink / raw)
To: Ricardo Koller
Cc: Oliver Upton, pbonzini, oupton, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Thu, 26 Jan 2023 18:45:43 +0000,
Ricardo Koller <ricarkol@google.com> wrote:
>
> On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > Hi Ricardo,
> >
> > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > avoid doing it while faulting on write-protected pages, which
> > > negatively impacts guest performance.
> > >
> > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > This is in order to release the mmu_lock and give other kernel threads
> > > the opportunity to run, and also in order to allocate enough pages to
> > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > Note that these page allocations can fail, so eager page splitting is
> > > best-effort. This is not a correctness issue though, as huge pages
> > > can still be split on write-faults.
> > >
> > > The benefits of eager page splitting are the same as in x86, added
> > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > all of their memory after dirty logging is enabled decreased by 44%
> > > from 2.58s to 1.42s.
> > >
> > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > ---
> > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> > > 2 files changed, 138 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index 35a159d131b5..6ab37209b1d1 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > > /* The last vcpu id that ran on each physical CPU */
> > > int __percpu *last_vcpu_ran;
> > >
> > > + /*
> > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > + * pages. It is used to allocate stage2 page tables while splitting
> > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > + *
> > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > + * all the available huge-page sizes, and be a multiple of all the
> > > + * other ones; for example, 1GB when all the available huge-page sizes
> > > + * are (1GB, 2MB, 32MB, 512MB).
> > > + *
> > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > + * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > + * - 32 when using 32MB hugepages with 16KB granules (1GB / 32MB).
> > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > + * granules.
> > > + *
> > > + * Protected by kvm->slots_lock.
> > > + */
> > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
> > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
> > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
> > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> >
> > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > as the batch size? 513 pages across all page sizes is a non-negligible
> > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> >
>
> Sounds good, will refine this for v2.
>
> > With that change it is a lot easier to correctly match the cache
> > capacity to the selected page size. Additionally, we continue to have a
> > single set of batching logic that we can improve later on.
> >
> > > + struct kvm_mmu_memory_cache split_page_cache;
> > > +
> > > struct kvm_arch *arch;
> > > };
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 700c5774b50d..41ee330edae3 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > > static unsigned long io_map_base;
> > >
> > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > +bool __read_mostly eager_page_split = true;
> > > +module_param(eager_page_split, bool, 0644);
> > > +
> >
> > Unless someone is really begging for it I'd prefer we not add a module
> > parameter for this.
>
> It was mainly to match x86 and because it makes perf testing a bit
> simpler. What do others think?
From my PoV this is a no.
If you have a flag because this is an experimental feature (like NV),
then this is a kernel option, and you taint the kernel when it is set.
If you have a flag because this is a modal option that makes different
use of the HW which cannot be exposed to userspace (like GICv4), then
this also is a kernel option.
This is neither.
The one thing that would convince me to make it an option is the
amount of memory this thing consumes. 512+ pages is a huge amount, and
I'm not overly happy about that. Why can't this be a userspace-visible
option, selectable on a per-VM (or per-memslot) basis?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-26 20:10 ` Marc Zyngier
@ 2023-01-27 15:45 ` Ricardo Koller
2023-01-30 21:18 ` Oliver Upton
2023-01-31 10:28 ` Marc Zyngier
0 siblings, 2 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-27 15:45 UTC (permalink / raw)
To: Marc Zyngier
Cc: Oliver Upton, pbonzini, oupton, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
Hi Marc,
On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 26 Jan 2023 18:45:43 +0000,
> Ricardo Koller <ricarkol@google.com> wrote:
> >
> > On Tue, Jan 24, 2023 at 2:45 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > Hi Ricardo,
> > >
> > > On Fri, Jan 13, 2023 at 03:49:57AM +0000, Ricardo Koller wrote:
> > > > Split huge pages eagerly when enabling dirty logging. The goal is to
> > > > avoid doing it while faulting on write-protected pages, which
> > > > negatively impacts guest performance.
> > > >
> > > > A memslot marked for dirty logging is split in 1GB pieces at a time.
> > > > This is in order to release the mmu_lock and give other kernel threads
> > > > the opportunity to run, and also in order to allocate enough pages to
> > > > split a 1GB range worth of huge pages (or a single 1GB huge page).
> > > > Note that these page allocations can fail, so eager page splitting is
> > > > best-effort. This is not a correctness issue though, as huge pages
> > > > can still be split on write-faults.
> > > >
> > > > The benefits of eager page splitting are the same as in x86, added
> > > > with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by
> > > > the TDP MMU when dirty logging is enabled"). For example, when running
> > > > dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU,
> > > > 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access
> > > > all of their memory after dirty logging is enabled decreased by 44%
> > > > from 2.58s to 1.42s.
> > > >
> > > > Signed-off-by: Ricardo Koller <ricarkol@google.com>
> > > > ---
> > > > arch/arm64/include/asm/kvm_host.h | 30 ++++++++
> > > > arch/arm64/kvm/mmu.c | 110 +++++++++++++++++++++++++++++-
> > > > 2 files changed, 138 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > index 35a159d131b5..6ab37209b1d1 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -153,6 +153,36 @@ struct kvm_s2_mmu {
> > > > /* The last vcpu id that ran on each physical CPU */
> > > > int __percpu *last_vcpu_ran;
> > > >
> > > > + /*
> > > > + * Memory cache used to split EAGER_PAGE_SPLIT_CHUNK_SIZE worth of huge
> > > > + * pages. It is used to allocate stage2 page tables while splitting
> > > > + * huge pages. Its capacity should be EAGER_PAGE_SPLIT_CACHE_CAPACITY.
> > > > + * Note that the choice of EAGER_PAGE_SPLIT_CHUNK_SIZE influences both
> > > > + * the capacity of the split page cache (CACHE_CAPACITY), and how often
> > > > + * KVM reschedules. Be wary of raising CHUNK_SIZE too high.
> > > > + *
> > > > + * A good heuristic to pick CHUNK_SIZE is that it should be larger than
> > > > + * all the available huge-page sizes, and be a multiple of all the
> > > > + * other ones; for example, 1GB when all the available huge-page sizes
> > > > + * are (1GB, 2MB, 32MB, 512MB).
> > > > + *
> > > > + * CACHE_CAPACITY should have enough pages to cover CHUNK_SIZE; for
> > > > + * example, 1GB requires the following number of PAGE_SIZE-pages:
> > > > + * - 512 when using 2MB hugepages with 4KB granules (1GB / 2MB).
> > > > + * - 513 when using 1GB hugepages with 4KB granules (1 + (1GB / 2MB)).
> > > > + * - 32 when using 32MB hugepages with 16KB granules (1GB / 32MB).
> > > > + * - 2 when using 512MB hugepages with 64KB granules (1GB / 512MB).
> > > > + * CACHE_CAPACITY below assumes the worst case: 1GB hugepages with 4KB
> > > > + * granules.
> > > > + *
> > > > + * Protected by kvm->slots_lock.
> > > > + */
> > > > +#define EAGER_PAGE_SPLIT_CHUNK_SIZE SZ_1G
> > > > +#define EAGER_PAGE_SPLIT_CACHE_CAPACITY \
> > > > + (DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_1G) + \
> > > > + DIV_ROUND_UP_ULL(EAGER_PAGE_SPLIT_CHUNK_SIZE, SZ_2M))
> > >
> > > Could you instead make use of the existing KVM_PGTABLE_MIN_BLOCK_LEVEL
> > > as the batch size? 513 pages across all page sizes is a non-negligible
> > > amount of memory that goes largely unused when PAGE_SIZE != 4K.
> > >
> >
> > Sounds good, will refine this for v2.
> >
> > > With that change it is a lot easier to correctly match the cache
> > > capacity to the selected page size. Additionally, we continue to have a
> > > single set of batching logic that we can improve later on.
> > >
> > > > + struct kvm_mmu_memory_cache split_page_cache;
> > > > +
> > > > struct kvm_arch *arch;
> > > > };
> > > >
> > > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > > index 700c5774b50d..41ee330edae3 100644
> > > > --- a/arch/arm64/kvm/mmu.c
> > > > +++ b/arch/arm64/kvm/mmu.c
> > > > @@ -31,14 +31,24 @@ static phys_addr_t hyp_idmap_vector;
> > > >
> > > > static unsigned long io_map_base;
> > > >
> > > > -static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
> > > > +bool __read_mostly eager_page_split = true;
> > > > +module_param(eager_page_split, bool, 0644);
> > > > +
> > >
> > > Unless someone is really begging for it I'd prefer we not add a module
> > > parameter for this.
> >
> > It was mainly to match x86 and because it makes perf testing a bit
> > simpler. What do others think?
>
> From my PoV this is a no.
>
> If you have a flag because this is an experimental feature (like NV),
> then this is a kernel option, and you taint the kernel when it is set.
>
> If you have a flag because this is a modal option that makes different
> use of the HW which cannot be exposed to userspace (like GICv4), then
> this also is a kernel option.
>
> This is neither.
Ah, I see. Thanks for the explanation.
>
> The one thing that would convince me to make it an option is the
> amount of memory this thing consumes. 512+ pages is a huge amount, and
> I'm not overly happy about that. Why can't this be a userspace visible
> option, selectable on a per VM (or memslot) basis?
>
It should be possible. I am exploring a couple of ideas that could
help when the hugepages are not 1G (e.g., 2M). However, they add
complexity and I'm not sure they help much.

(will be using PAGE_SIZE=4K to make things simpler)

This feature pre-allocates 513 pages before splitting every 1G range.
For example, it converts 1G block PTEs into trees made of 513 pages.
When not using this feature, the same 513 pages would be allocated,
but lazily over a longer period of time.

Eager-splitting pre-allocates those pages in order to split huge-pages
into fully populated trees, which is needed in order to use FEAT_BBM
and skip the expensive TLBI broadcasts. 513 is just the number of
pages needed to break a 1G huge-page.
We could optimize for smaller huge-pages, like 2M by splitting 1
huge-page at a time: only preallocate one 4K page at a time. The
trick is how to know that we are splitting 2M huge-pages. We could
either get the vma pagesize or use hints from userspace. I'm not sure
that this is worth it though. The user will most likely want to split
big ranges of memory (>1G), so optimizing for smaller huge-pages only
converts the left into the right:
alloc 1 page | | alloc 512 pages
split 2M huge-page | | split 2M huge-page
alloc 1 page | | split 2M huge-page
split 2M huge-page | => | split 2M huge-page
...
alloc 1 page | | split 2M huge-page
split 2M huge-page | | split 2M huge-page
Still thinking of what else to do.
Thanks!
Ricardo
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-27 15:45 ` Ricardo Koller
@ 2023-01-30 21:18 ` Oliver Upton
2023-01-31 1:18 ` Sean Christopherson
2023-01-31 10:31 ` Marc Zyngier
2023-01-31 10:28 ` Marc Zyngier
1 sibling, 2 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-30 21:18 UTC (permalink / raw)
To: Ricardo Koller
Cc: Marc Zyngier, pbonzini, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Fri, Jan 27, 2023 at 07:45:15AM -0800, Ricardo Koller wrote:
> Hi Marc,
>
> On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote:
[...]
> >
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace visible
> > option, selectable on a per VM (or memslot) basis?
> >
>
> It should be possible. I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M). However, they add
> complexity and I'm not sure they help much.
>
> (will be using PAGE_SIZE=4K to make things simpler)
>
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.
>
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees. Which is needed in order to use FEAT_BBM
> and skipping the expensive TLBI broadcasts. 513 is just the number of
> pages needed to break a 1G huge-page.
>
> We could optimize for smaller huge-pages, like 2M by splitting 1
> huge-page at a time: only preallocate one 4K page at a time. The
> trick is how to know that we are splitting 2M huge-pages. We could
> either get the vma pagesize or use hints from userspace. I'm not sure
> that this is worth it though. The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
>
> alloc 1 page | | alloc 512 pages
> split 2M huge-page | | split 2M huge-page
> alloc 1 page | | split 2M huge-page
> split 2M huge-page | => | split 2M huge-page
> ...
> alloc 1 page | | split 2M huge-page
> split 2M huge-page | | split 2M huge-page
>
> Still thinking of what else to do.
I think that Marc's suggestion of having userspace configure this is
sound. After all, userspace _should_ know the granularity of the backing
source it chose for guest memory.
We could also interpret a cache size of 0 to signal that userspace wants
to disable eager page split for a VM altogether. It is entirely possible
that the user will want a differing QoS between slice-of-hardware and
overcommitted VMs.
--
Thanks,
Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-30 21:18 ` Oliver Upton
@ 2023-01-31 1:18 ` Sean Christopherson
2023-01-31 17:45 ` Oliver Upton
2023-01-31 10:31 ` Marc Zyngier
1 sibling, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2023-01-31 1:18 UTC (permalink / raw)
To: Oliver Upton
Cc: Ricardo Koller, Marc Zyngier, pbonzini, yuzenghui, dmatlack, kvm,
kvmarm, qperret, catalin.marinas, andrew.jones, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Mon, Jan 30, 2023, Oliver Upton wrote:
> I think that Marc's suggestion of having userspace configure this is
> sound. After all, userspace _should_ know the granularity of the backing
> source it chose for guest memory.
>
> We could also interpret a cache size of 0 to signal that userspace wants
> to disable eager page split for a VM altogether. It is entirely possible that
> the user will want a differing QoS between slice-of-hardware and
> overcommitted VMs.
Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
guarantees for all VMs on a system are better met by enabling eager splitting
across the board.
There are other reasons to use module/kernel params beyond what Marc listed, e.g.
to let the user opt out even when something is on by default. x86's TDP MMU has
benefited greatly from downstream users being able to do A/B performance testing
this way. I suspect x86's eager_page_split knob was added largely for this
reason, e.g. to easily see how a specific workload is affected by eager splitting.
That seems like a reasonable fit on the ARM side as well.
IMO, we should try to avoid new uAPI without a proven need, especially if we're
considering throwing something into memslots. AFAIK, module params, at least on
the x86 side of things, aren't considered immutable ABI (or maybe it's just that
we haven't modified/removed one recently that folks care about).
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 1:18 ` Sean Christopherson
@ 2023-01-31 17:45 ` Oliver Upton
2023-01-31 17:54 ` Sean Christopherson
2023-01-31 18:01 ` David Matlack
0 siblings, 2 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-31 17:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ricardo Koller, Marc Zyngier, pbonzini, yuzenghui, dmatlack, kvm,
kvmarm, qperret, catalin.marinas, andrew.jones, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> On Mon, Jan 30, 2023, Oliver Upton wrote:
> > I think that Marc's suggestion of having userspace configure this is
> > sound. After all, userspace _should_ know the granularity of the backing
> > source it chose for guest memory.
> >
> > We could also interpret a cache size of 0 to signal that userspace wants
> > to disable eager page split for a VM altogether. It is entirely possible that
> > the user will want a differing QoS between slice-of-hardware and
> > overcommitted VMs.
>
> Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
> guarantees for all VMs on a system are better met by enabling eager splitting
> across the board.
>
> There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> to let the user opt out even when something is on by default. x86's TDP MMU has
> benefited greatly from downstream users being able to do A/B performance testing
> this way. I suspect x86's eager_page_split knob was added largely for this
> reason, e.g. to easily see how a specific workload is affected by eager splitting.
> That seems like a reasonable fit on the ARM side as well.
There's a rather important distinction here in that we'd allow userspace
to select the page split cache size, which should be correctly sized for
the backing memory source. Considering the break-before-make rules of
the architecture, the only way eager split is performant on arm64 is by
replacing a block entry with a fully populated table hierarchy in one
operation. AFAICT, you don't have this problem on x86, as the
architecture generally permits a direct valid->valid transformation
without an intermediate invalidation. Well, ignoring iTLB multihit :)
So, the largest transformation we need to do right now is on a PUD w/
PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
that configuration option in a module parameter is presumptive that all
VMs on a host use the exact same memory configuration, which doesn't
feel right to me.
--
Thanks,
Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 17:45 ` Oliver Upton
@ 2023-01-31 17:54 ` Sean Christopherson
2023-01-31 19:06 ` Oliver Upton
2023-01-31 18:01 ` David Matlack
1 sibling, 1 reply; 47+ messages in thread
From: Sean Christopherson @ 2023-01-31 17:54 UTC (permalink / raw)
To: Oliver Upton
Cc: Ricardo Koller, Marc Zyngier, pbonzini, yuzenghui, dmatlack, kvm,
kvmarm, qperret, catalin.marinas, andrew.jones, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Tue, Jan 31, 2023, Oliver Upton wrote:
> On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > I think that Marc's suggestion of having userspace configure this is
> > > sound. After all, userspace _should_ know the granularity of the backing
> > > source it chose for guest memory.
> > >
> > > We could also interpret a cache size of 0 to signal that userspace wants
> > > to disable eager page split for a VM altogether. It is entirely possible that
> > > the user will want a differing QoS between slice-of-hardware and
> > > overcommitted VMs.
> >
> > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
> > guarantees for all VMs on a system are better met by enabling eager splitting
> > across the board.
> >
> > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > to let the user opt out even when something is on by default. x86's TDP MMU has
> > benefited greatly from downstream users being able to do A/B performance testing
> > this way. I suspect x86's eager_page_split knob was added largely for this
> > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > That seems like a reasonable fit on the ARM side as well.
>
> There's a rather important distinction here in that we'd allow userspace
> to select the page split cache size, which should be correctly sized for
> the backing memory source. Considering the break-before-make rules of
> the architecture, the only way eager split is performant on arm64 is by
> replacing a block entry with a fully populated table hierarchy in one
> operation. AFAICT, you don't have this problem on x86, as the
> architecture generally permits a direct valid->valid transformation
> without an intermediate invalidation. Well, ignoring iTLB multihit :)
>
> So, the largest transformation we need to do right now is on a PUD w/
> PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> that configuration option in a module parameter is presumptive that all
> VMs on a host use the exact same memory configuration, which doesn't
> feel right to me.
Can you elaborate on the cache size needing to be tied to the backing source?
Do the issues arise if you get to a point where KVM can have PGD-sized hugepages
with PAGE_SIZE=4KiB? Or do you want to let userspace optimize _now_ for PMD+4KiB?
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 17:54 ` Sean Christopherson
@ 2023-01-31 19:06 ` Oliver Upton
0 siblings, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-31 19:06 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ricardo Koller, Marc Zyngier, pbonzini, yuzenghui, dmatlack, kvm,
kvmarm, qperret, catalin.marinas, andrew.jones, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Tue, Jan 31, 2023 at 05:54:45PM +0000, Sean Christopherson wrote:
> On Tue, Jan 31, 2023, Oliver Upton wrote:
> > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > > I think that Marc's suggestion of having userspace configure this is
> > > > sound. After all, userspace _should_ know the granularity of the backing
> > > > source it chose for guest memory.
> > > >
> > > > We could also interpret a cache size of 0 to signal that userspace wants
> > > > to disable eager page split for a VM altogether. It is entirely possible that
> > > > the user will want a differing QoS between slice-of-hardware and
> > > > overcommitted VMs.
> > >
> > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
> > > guarantees for all VMs on a system are better met by enabling eager splitting
> > > across the board.
> > >
> > > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > > to let the user opt out even when something is on by default. x86's TDP MMU has
> > > benefited greatly from downstream users being able to do A/B performance testing
> > > this way. I suspect x86's eager_page_split knob was added largely for this
> > > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > > That seems like a reasonable fit on the ARM side as well.
> >
> > There's a rather important distinction here in that we'd allow userspace
> > to select the page split cache size, which should be correctly sized for
> > the backing memory source. Considering the break-before-make rules of
> > the architecture, the only way eager split is performant on arm64 is by
> > replacing a block entry with a fully populated table hierarchy in one
> > operation. AFAICT, you don't have this problem on x86, as the
> > architecture generally permits a direct valid->valid transformation
> > without an intermediate invalidation. Well, ignoring iTLB multihit :)
> >
> > So, the largest transformation we need to do right now is on a PUD w/
> > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> > that configuration option in a module parameter is presumptive that all
> > VMs on a host use the exact same memory configuration, which doesn't
> > feel right to me.
>
> Can you elaborate on the cache size needing to be tied to the backing source?
The proposed eager split mechanism attempts to replace a block with a
fully populated page table hierarchy (i.e. mapped at PTE granularity)
in order to avoid successive break-before-make invalidations. The cache
size must be >= the number of pages required to build out that fully
mapped page table hierarchy.
> Do the issues arise if you get to a point where KVM can have PGD-sized hugepages
> with PAGE_SIZE=4KiB?
Those problems arise when splitting any hugepage larger than a PMD. It
just so happens that the only configuration that supports larger
mappings is 4K at the moment.
If we were to take the step-down approach to eager page splitting,
there would be a lot of knock-on break-before-make operations as we go
PUD -> PMD -> PTE.
> Or do you want to let userspace optimize _now_ for PMD+4KiB?
The default cache value should probably optimize for PMD splitting and
give userspace the option to scale that up for PUD or greater if it sees
fit.
--
Thanks,
Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 17:45 ` Oliver Upton
2023-01-31 17:54 ` Sean Christopherson
@ 2023-01-31 18:01 ` David Matlack
2023-01-31 18:19 ` Ricardo Koller
2023-01-31 18:35 ` Oliver Upton
1 sibling, 2 replies; 47+ messages in thread
From: David Matlack @ 2023-01-31 18:01 UTC (permalink / raw)
To: Oliver Upton
Cc: Sean Christopherson, Ricardo Koller, Marc Zyngier, pbonzini,
yuzenghui, kvm, kvmarm, qperret, catalin.marinas, andrew.jones,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol
On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > I think that Marc's suggestion of having userspace configure this is
> > > sound. After all, userspace _should_ know the granularity of the backing
> > > source it chose for guest memory.
> > >
> > > We could also interpret a cache size of 0 to signal that userspace wants
> > > to disable eager page split for a VM altogether. It is entirely possible that
> > > the user will want a differing QoS between slice-of-hardware and
> > > overcommitted VMs.
> >
> > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
> > guarantees for all VMs on a system are better met by enabling eager splitting
> > across the board.
> >
> > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > to let the user opt out even when something is on by default. x86's TDP MMU has
> > benefited greatly from downstream users being able to do A/B performance testing
> > this way. I suspect x86's eager_page_split knob was added largely for this
> > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > That seems like a reasonable fit on the ARM side as well.
>
> There's a rather important distinction here in that we'd allow userspace
> to select the page split cache size, which should be correctly sized for
> the backing memory source. Considering the break-before-make rules of
> the architecture, the only way eager split is performant on arm64 is by
> replacing a block entry with a fully populated table hierarchy in one
> operation.
I don't see how this can be true if we are able to tolerate splitting
2M pages. Splitting 2M pages inherently means 512 Break-Before-Make
operations per 1GiB region of guest physical memory.
If we had a cache size of 1 and were splitting a 1GiB region, we would
then need to do 512+1 BBM per 1GiB region. If we can tolerate 512 per
1GiB, why not 513?
It seems more like the 513 cache size is there to optimize splitting
1GiB pages. I agree it can turn those 513 into 1, but future versions
of the architecture also elide BBM requirements, which is another way
to optimize 1GiB pages.
> AFAICT, you don't have this problem on x86, as the
> architecture generally permits a direct valid->valid transformation
> without an intermediate invalidation. Well, ignoring iTLB multihit :)
>
> So, the largest transformation we need to do right now is on a PUD w/
> PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> that configuration option in a module parameter is presumptive that all
> VMs on a host use the exact same memory configuration, which doesn't
> feel right to me.
>
> --
> Thanks,
> Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 18:01 ` David Matlack
@ 2023-01-31 18:19 ` Ricardo Koller
2023-01-31 18:35 ` Oliver Upton
1 sibling, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-31 18:19 UTC (permalink / raw)
To: David Matlack
Cc: Oliver Upton, Sean Christopherson, Marc Zyngier, pbonzini,
yuzenghui, kvm, kvmarm, qperret, catalin.marinas, andrew.jones,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol
On Tue, Jan 31, 2023 at 10:02 AM David Matlack <dmatlack@google.com> wrote:
>
> On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > > I think that Marc's suggestion of having userspace configure this is
> > > > sound. After all, userspace _should_ know the granularity of the backing
> > > > source it chose for guest memory.
> > > >
> > > > We could also interpret a cache size of 0 to signal that userspace wants
> > > > to disable eager page split for a VM altogether. It is entirely possible that
> > > > the user will want a differing QoS between slice-of-hardware and
> > > > overcommitted VMs.
> > >
> > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
> > > guarantees for all VMs on a system are better met by enabling eager splitting
> > > across the board.
> > >
> > > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > > to let the user opt out even when something is on by default. x86's TDP MMU has
> > > benefited greatly from downstream users being able to do A/B performance testing
> > > this way. I suspect x86's eager_page_split knob was added largely for this
> > > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > > That seems like a reasonable fit on the ARM side as well.
> >
> > There's a rather important distinction here in that we'd allow userspace
> > to select the page split cache size, which should be correctly sized for
> > the backing memory source. Considering the break-before-make rules of
> > the architecture, the only way eager split is performant on arm64 is by
> > replacing a block entry with a fully populated table hierarchy in one
> > operation.
>
> I don't see how this can be true if we are able to tolerate splitting
> 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make
> operations per 1GiB region of guest physical memory.
>
> If we had a cache size of 1 and were splitting a 1GiB region, we would
> then need to do 512+1 BBM per 1GiB region. If we can tolerate 512 per
> 1GiB, why not 513?
>
> It seems more like the 513 cache size is more to optimize splitting
> 1GiB pages.
That's correct, although there's also a slight benefit in the 2M
huge-page case, as the 512 huge-pages can be split without having to
walk down the tree from the root every time.
> I agree it can turn those 513 int 1, but future versions
> of the architecture also elide BBM requirements which is another way
> to optimize 1GiB pages.
There's also the CPU cost needed to walk down the tree from the root
513 times (same as above).
>
>
> > AFAICT, you don't have this problem on x86, as the
> > architecture generally permits a direct valid->valid transformation
> > without an intermediate invalidation. Well, ignoring iTLB multihit :)
> >
> > So, the largest transformation we need to do right now is on a PUD w/
> > PAGE_SIZE=4K, leading to 513 pages as proposed in the series. Exposing
> > that configuration option in a module parameter is presumptive that all
> > VMs on a host use the exact same memory configuration, which doesn't
> > feel right to me.
> >
> > --
> > Thanks,
> > Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 18:01 ` David Matlack
2023-01-31 18:19 ` Ricardo Koller
@ 2023-01-31 18:35 ` Oliver Upton
1 sibling, 0 replies; 47+ messages in thread
From: Oliver Upton @ 2023-01-31 18:35 UTC (permalink / raw)
To: David Matlack
Cc: Sean Christopherson, Ricardo Koller, Marc Zyngier, pbonzini,
yuzenghui, kvm, kvmarm, qperret, catalin.marinas, andrew.jones,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol
On Tue, Jan 31, 2023 at 10:01:45AM -0800, David Matlack wrote:
> On Tue, Jan 31, 2023 at 9:46 AM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Tue, Jan 31, 2023 at 01:18:15AM +0000, Sean Christopherson wrote:
> > > On Mon, Jan 30, 2023, Oliver Upton wrote:
> > > > I think that Marc's suggestion of having userspace configure this is
> > > > sound. After all, userspace _should_ know the granularity of the backing
> > > > source it chose for guest memory.
> > > >
> > > > We could also interpret a cache size of 0 to signal that userspace wants
> > > > to disable eager page split for a VM altogether. It is entirely possible that
> > > > the user will want a differing QoS between slice-of-hardware and
> > > > overcommitted VMs.
> > >
> > > Maybe. It's also entirely possible that QoS is never factored in, e.g. if QoS
> > > guarantees for all VMs on a system are better met by enabling eager splitting
> > > across the board.
> > >
> > > There are other reasons to use module/kernel params beyond what Marc listed, e.g.
> > > to let the user opt out even when something is on by default. x86's TDP MMU has
> > > benefited greatly from downstream users being able to do A/B performance testing
> > > this way. I suspect x86's eager_page_split knob was added largely for this
> > > reason, e.g. to easily see how a specific workload is affected by eager splitting.
> > > That seems like a reasonable fit on the ARM side as well.
> >
> > There's a rather important distinction here in that we'd allow userspace
> > to select the page split cache size, which should be correctly sized for
> > the backing memory source. Considering the break-before-make rules of
> > the architecture, the only way eager split is performant on arm64 is by
> > replacing a block entry with a fully populated table hierarchy in one
> > operation.
>
> I don't see how this can be true if we are able to tolerate splitting
> 2M pages. Splitting 2M pages inherently means 512 Break-Before-Make
> operations per 1GiB region of guest physical memory.
'Only' was definitely not the right word here.
I fear that there is a rather limited degree of portability for any
observation that we derive from a particular system. Assuming the
absolute worst of hardware, TLBIs + serialization will become even more
of a bounding issue as the number of affected entities on the
interconnect scales.
So in that vein I don't believe it is easy to glean what may or may
not be tolerable.
> It seems more like the 513 cache size is more to optimize splitting
> 1GiB pages.
I don't believe anyone is particularly jazzed about this detail, so I
would be surprised if we accepted this as the default configuration.
> I agree it can turn those 513 int 1, but future versions
> of the architecture also elide BBM requirements which is another way
> to optimize 1GiB pages.
Level 2 break-before-make behavior is helpful but certainly not a
panacea. Most worrying is that implementations may generate TLB
conflict aborts, at which point the only remediation is to invalidate
the entire affected context.
--
Thanks,
Oliver
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-30 21:18 ` Oliver Upton
2023-01-31 1:18 ` Sean Christopherson
@ 2023-01-31 10:31 ` Marc Zyngier
1 sibling, 0 replies; 47+ messages in thread
From: Marc Zyngier @ 2023-01-31 10:31 UTC (permalink / raw)
To: Oliver Upton
Cc: Ricardo Koller, pbonzini, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Mon, 30 Jan 2023 21:18:32 +0000,
Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, Jan 27, 2023 at 07:45:15AM -0800, Ricardo Koller wrote:
> > Hi Marc,
> >
> > On Thu, Jan 26, 2023 at 12:10 PM Marc Zyngier <maz@kernel.org> wrote:
>
> [...]
>
> > >
> > > The one thing that would convince me to make it an option is the
> > > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > > I'm not overly happy about that. Why can't this be a userspace visible
> > > option, selectable on a per VM (or memslot) basis?
> > >
> >
> > It should be possible. I am exploring a couple of ideas that could
> > help when the hugepages are not 1G (e.g., 2M). However, they add
> > complexity and I'm not sure they help much.
> >
> > (will be using PAGE_SIZE=4K to make things simpler)
> >
> > This feature pre-allocates 513 pages before splitting every 1G range.
> > For example, it converts 1G block PTEs into trees made of 513 pages.
> > When not using this feature, the same 513 pages would be allocated,
> > but lazily over a longer period of time.
> >
> > Eager-splitting pre-allocates those pages in order to split huge-pages
> > into fully populated trees. Which is needed in order to use FEAT_BBM
> > and skipping the expensive TLBI broadcasts. 513 is just the number of
> > pages needed to break a 1G huge-page.
> >
> > We could optimize for smaller huge-pages, like 2M by splitting 1
> > huge-page at a time: only preallocate one 4K page at a time. The
> > trick is how to know that we are splitting 2M huge-pages. We could
> > either get the vma pagesize or use hints from userspace. I'm not sure
> > that this is worth it though. The user will most likely want to split
> > big ranges of memory (>1G), so optimizing for smaller huge-pages only
> > converts the left into the right:
> >
> > alloc 1 page | | alloc 512 pages
> > split 2M huge-page | | split 2M huge-page
> > alloc 1 page | | split 2M huge-page
> > split 2M huge-page | => | split 2M huge-page
> > ...
> > alloc 1 page | | split 2M huge-page
> > split 2M huge-page | | split 2M huge-page
> >
> > Still thinking of what else to do.
>
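[The 513-page figure quoted above follows from the shape of the page
tables. A purely illustrative sketch (not KVM code), assuming 4K base
pages and 8-byte descriptors:]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch (not KVM code): number of page-table pages needed
 * to eagerly split one 1G block into 4K PTEs, with 4K base pages.
 *
 * A 1G block covers 512 2M ranges; each 2M range needs one page holding
 * 512 PTEs, plus one page for the PMD-level table that points at them:
 * 512 + 1 = 513 pages.
 */
#define PTRS_PER_TABLE	512	/* 4K page / 8-byte descriptors */

static size_t pages_to_split_1g_block(void)
{
	size_t pte_tables = PTRS_PER_TABLE;	/* one per 2M range */
	size_t pmd_tables = 1;			/* table holding the 512 PMDs */

	return pmd_tables + pte_tables;
}
```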
> I think that Marc's suggestion of having userspace configure this is
> sound. After all, userspace _should_ know the granularity of the backing
> source it chose for guest memory.
Only if it is not using anonymous memory. That's the important distinction.
>
> We could also interpret a cache size of 0 to signal that userspace wants
> to disable eager page split for a VM altogether. It is entirely possible
> that the user will want a differing QoS between slice-of-hardware and
> overcommitted VMs.
Absolutely. The overcommitted case would suffer from the upfront
allocation (these systems are usually very densely packed).
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-27 15:45 ` Ricardo Koller
2023-01-30 21:18 ` Oliver Upton
@ 2023-01-31 10:28 ` Marc Zyngier
2023-02-06 16:35 ` Ricardo Koller
1 sibling, 1 reply; 47+ messages in thread
From: Marc Zyngier @ 2023-01-31 10:28 UTC (permalink / raw)
To: Ricardo Koller
Cc: Oliver Upton, pbonzini, oupton, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
On Fri, 27 Jan 2023 15:45:15 +0000,
Ricardo Koller <ricarkol@google.com> wrote:
>
> > The one thing that would convince me to make it an option is the
> > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > I'm not overly happy about that. Why can't this be a userspace visible
> > option, selectable on a per VM (or memslot) basis?
> >
>
> It should be possible. I am exploring a couple of ideas that could
> help when the hugepages are not 1G (e.g., 2M). However, they add
> complexity and I'm not sure they help much.
>
> (will be using PAGE_SIZE=4K to make things simpler)
>
> This feature pre-allocates 513 pages before splitting every 1G range.
> For example, it converts 1G block PTEs into trees made of 513 pages.
> When not using this feature, the same 513 pages would be allocated,
> but lazily over a longer period of time.
This is an important difference. It avoids the upfront allocation
"thermal shock", giving time to the kernel to reclaim memory from
somewhere else. Doing it upfront means you *must* have 2MB+ of
immediately available memory for each GB of RAM your guest uses.
>
> Eager-splitting pre-allocates those pages in order to split huge-pages
> into fully populated trees, which is needed in order to use FEAT_BBM
> and skip the expensive TLBI broadcasts. 513 is just the number of
> pages needed to break a 1G huge-page.
I understand that. But it is also clear that 1GB huge pages are unlikely
to be THPs, and I wonder if we should treat the two differently. Using
HugeTLBFS pages is significant here.
>
> We could optimize for smaller huge-pages, like 2M by splitting 1
> huge-page at a time: only preallocate one 4K page at a time. The
> trick is how to know that we are splitting 2M huge-pages. We could
> either get the vma pagesize or use hints from userspace. I'm not sure
> that this is worth it though. The user will most likely want to split
> big ranges of memory (>1G), so optimizing for smaller huge-pages only
> converts the left into the right:
>
> alloc 1 page | | alloc 512 pages
> split 2M huge-page | | split 2M huge-page
> alloc 1 page | | split 2M huge-page
> split 2M huge-page | => | split 2M huge-page
> ...
> alloc 1 page | | split 2M huge-page
> split 2M huge-page | | split 2M huge-page
>
> Still thinking of what else to do.
I think the 1G case fits your own use case, but I doubt this covers
the majority of users. Most people rely on the kernel's ability to
use THPs, which are capped at the first level of block mapping.
2MB (and 32MB for 16kB base pages) are the most likely mappings in my
experience (512MB with 64kB pages are vanishingly rare).
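[The block sizes above follow directly from the base page size: a table
page holds page_size / 8 eight-byte descriptors, so a first-level block
mapping covers page_size * (page_size / 8) bytes. A small illustrative
sketch of that arithmetic (not KVM code):]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative arithmetic (not KVM code): the block size at the first
 * level of block mapping for a given base page size.  A table page holds
 * page_size / 8 descriptors, so each block descriptor at the level above
 * the page level maps page_size * (page_size / 8) bytes.
 */
static uint64_t first_level_block_size(uint64_t page_size)
{
	uint64_t ptrs_per_table = page_size / 8;	/* 8-byte descriptors */

	return page_size * ptrs_per_table;
}
```

This reproduces the 2MB (4kB pages), 32MB (16kB pages), and 512MB (64kB
pages) figures quoted above.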
Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
fits the model. For THPs, where everything is opportunistic and the
user not involved, this is a lot more debatable.
This is why I'd like this behaviour to be a buy-in, either directly (a
first class userspace API) or indirectly (the provenance of the
memory).
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled
2023-01-31 10:28 ` Marc Zyngier
@ 2023-02-06 16:35 ` Ricardo Koller
0 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-02-06 16:35 UTC (permalink / raw)
To: Marc Zyngier
Cc: Oliver Upton, pbonzini, oupton, yuzenghui, dmatlack, kvm, kvmarm,
qperret, catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, bgardon,
ricarkol
Hi Marc,
On Tue, Jan 31, 2023 at 2:28 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 27 Jan 2023 15:45:15 +0000,
> Ricardo Koller <ricarkol@google.com> wrote:
> >
> > > The one thing that would convince me to make it an option is the
> > > amount of memory this thing consumes. 512+ pages is a huge amount, and
> > > I'm not overly happy about that. Why can't this be a userspace visible
> > > option, selectable on a per VM (or memslot) basis?
> > >
> >
> > It should be possible. I am exploring a couple of ideas that could
> > help when the hugepages are not 1G (e.g., 2M). However, they add
> > complexity and I'm not sure they help much.
> >
> > (will be using PAGE_SIZE=4K to make things simpler)
> >
> > This feature pre-allocates 513 pages before splitting every 1G range.
> > For example, it converts 1G block PTEs into trees made of 513 pages.
> > When not using this feature, the same 513 pages would be allocated,
> > but lazily over a longer period of time.
>
> This is an important difference. It avoids the upfront allocation
> "thermal shock", giving time to the kernel to reclaim memory from
> somewhere else. Doing it upfront means you *must* have 2MB+ of
> immediately available memory for each GB of RAM your guest uses.
>
> >
> > Eager-splitting pre-allocates those pages in order to split huge-pages
> > into fully populated trees, which is needed in order to use FEAT_BBM
> > and skip the expensive TLBI broadcasts. 513 is just the number of
> > pages needed to break a 1G huge-page.
>
> I understand that. But it is also clear that 1GB huge pages are unlikely
> to be THPs, and I wonder if we should treat the two differently. Using
> HugeTLBFS pages is significant here.
>
> >
> > We could optimize for smaller huge-pages, like 2M by splitting 1
> > huge-page at a time: only preallocate one 4K page at a time. The
> > trick is how to know that we are splitting 2M huge-pages. We could
> > either get the vma pagesize or use hints from userspace. I'm not sure
> > that this is worth it though. The user will most likely want to split
> > big ranges of memory (>1G), so optimizing for smaller huge-pages only
> > converts the left into the right:
> >
> > alloc 1 page | | alloc 512 pages
> > split 2M huge-page | | split 2M huge-page
> > alloc 1 page | | split 2M huge-page
> > split 2M huge-page | => | split 2M huge-page
> > ...
> > alloc 1 page | | split 2M huge-page
> > split 2M huge-page | | split 2M huge-page
> >
> > Still thinking of what else to do.
>
> I think the 1G case fits your own use case, but I doubt this covers
> the majority of users. Most people rely on the kernel's ability to
> use THPs, which are capped at the first level of block mapping.
>
> 2MB (and 32MB for 16kB base pages) are the most likely mappings in my
> experience (512MB with 64kB pages are vanishingly rare).
>
> Having to pay an upfront cost for HugeTLBFS doesn't shock me, and it
> fits the model. For THPs, where everything is opportunistic and the
> user not involved, this is a lot more debatable.
>
> This is why I'd like this behaviour to be a buy-in, either directly (a
> first class userspace API) or indirectly (the provenance of the
> memory).
This all makes sense, thanks for the explanation. I decided to implement
something for both cases: small caches (~1 page) where the PUDs are
split one PMD at a time, and bigger caches (>513) where the PUDs can
be split with a single replacement. The user specifies the size of the cache
via a capability, and a size of 0 implies no eager splitting (the feature is off).
Thanks,
Ricardo
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
* [PATCH 7/9] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (5 preceding siblings ...)
2023-01-13 3:49 ` [PATCH 6/9] KVM: arm64: Split huge pages when dirty logging is enabled Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-13 3:49 ` [PATCH 8/9] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
` (2 subsequent siblings)
9 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
Move the functionality of kvm_mmu_write_protect_pt_masked() into its
caller, kvm_arch_mmu_enable_log_dirty_pt_masked(). This will be used
in a subsequent commit in order to share some of the code in
kvm_arch_mmu_enable_log_dirty_pt_masked().
No functional change intended.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/kvm/mmu.c | 42 +++++++++++++++---------------------------
1 file changed, 15 insertions(+), 27 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 41ee330edae3..009468822bca 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1045,28 +1045,6 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
kvm_flush_remote_tlbs(kvm);
}
-/**
- * kvm_mmu_write_protect_pt_masked() - write protect dirty pages
- * @kvm: The KVM pointer
- * @slot: The memory slot associated with mask
- * @gfn_offset: The gfn offset in memory slot
- * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory
- * slot to be write protected
- *
- * Walks bits set in mask write protects the associated pte's. Caller must
- * acquire kvm_mmu_lock.
- */
-static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
- struct kvm_memory_slot *slot,
- gfn_t gfn_offset, unsigned long mask)
-{
- phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
- phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
- phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
-
- stage2_wp_range(&kvm->arch.mmu, start, end);
-}
-
/**
* kvm_mmu_split_memory_region() - split the stage 2 blocks into PAGE_SIZE
* pages for memory slot
@@ -1091,17 +1069,27 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
}
/*
- * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
- * dirty pages.
+ * kvm_arch_mmu_enable_log_dirty_pt_masked() - enable dirty logging for selected pages.
+ * @kvm: The KVM pointer
+ * @slot: The memory slot associated with mask
+ * @gfn_offset: The gfn offset in memory slot
+ * @mask: The mask of pages at offset 'gfn_offset' in this memory
+ * slot to enable dirty logging on
*
- * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
- * enable dirty logging for them.
+ * Writes protect selected pages to enable dirty logging for them. Caller must
+ * acquire kvm->mmu_lock.
*/
void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn_offset, unsigned long mask)
{
- kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+ phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+ phys_addr_t start = (base_gfn + __ffs(mask)) << PAGE_SHIFT;
+ phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ stage2_wp_range(&kvm->arch.mmu, start, end);
}
static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
--
2.39.0.314.g84b9a713c41-goog
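[The start/end computation moved into
kvm_arch_mmu_enable_log_dirty_pt_masked() above can be illustrated with
a small userspace sketch. Assumptions: 4K pages and a 64-bit dirty
mask; __builtin_ctzll()/__builtin_clzll() stand in for the kernel's
__ffs()/__fls():]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Userspace sketch of the mask-to-range math: the lowest and highest set
 * bits of the dirty mask give the gfn range (relative to base_gfn) that
 * needs write protection.  __builtin_ctzll() plays the role of __ffs()
 * (index of lowest set bit), and 63 - __builtin_clzll() plays the role
 * of __fls() (index of highest set bit).  mask must be non-zero.
 */
static void mask_to_range(uint64_t base_gfn, uint64_t mask,
			  uint64_t *start, uint64_t *end)
{
	uint64_t first = __builtin_ctzll(mask);		/* __ffs(mask) */
	uint64_t last = 63 - __builtin_clzll(mask);	/* __fls(mask) */

	*start = (base_gfn + first) << PAGE_SHIFT;
	*end = (base_gfn + last + 1) << PAGE_SHIFT;
}
```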
* [PATCH 8/9] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (6 preceding siblings ...)
2023-01-13 3:49 ` [PATCH 7/9] KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() Ricardo Koller
@ 2023-01-13 3:49 ` Ricardo Koller
2023-01-13 3:50 ` [PATCH 9/9] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
2023-01-24 0:48 ` [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ben Gardon
9 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:49 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
This is the arm64 counterpart of commit cb00a70bd4b7 ("KVM: x86/mmu:
Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG"),
which has the benefit of spreading the cost of splitting a memslot
across multiple ioctls.
Split huge pages on the range specified using KVM_CLEAR_DIRTY_LOG,
and do not split when enabling dirty logging if
KVM_DIRTY_LOG_INITIALLY_SET is set.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/kvm/mmu.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 009468822bca..08f28140c4a9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1076,8 +1076,8 @@ static void kvm_mmu_split_memory_region(struct kvm *kvm, int slot)
* @mask: The mask of pages at offset 'gfn_offset' in this memory
* slot to enable dirty logging on
*
- * Writes protect selected pages to enable dirty logging for them. Caller must
- * acquire kvm->mmu_lock.
+ * Splits selected pages to PAGE_SIZE and then writes protect them to enable
+ * dirty logging for them. Caller must acquire kvm->mmu_lock.
*/
void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
@@ -1090,6 +1090,14 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
lockdep_assert_held_write(&kvm->mmu_lock);
stage2_wp_range(&kvm->arch.mmu, start, end);
+
+ /*
+ * If initially-all-set mode is not set, then huge-pages were already
+ * split when enabling dirty logging: no need to do it again.
+ */
+ if (kvm_dirty_log_manual_protect_and_init_set(kvm) &&
+ READ_ONCE(eager_page_split))
+ kvm_mmu_split_huge_pages(kvm, start, end);
}
static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
@@ -1875,7 +1883,9 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
* this when deleting, moving, disabling dirty logging, or
* creating the memslot (a nop). Doing it for deletes makes
* sure we don't leak memory, and there's no need to keep the
- * cache around for any of the other cases.
+ * cache around for any of the other cases. Keeping the cache
+ * is useful for successive KVM_CLEAR_DIRTY_LOG calls, which is
+ * not handled in this function.
*/
kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
}
--
2.39.0.314.g84b9a713c41-goog
* [PATCH 9/9] KVM: arm64: Use local TLBI on permission relaxation
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (7 preceding siblings ...)
2023-01-13 3:49 ` [PATCH 8/9] KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG Ricardo Koller
@ 2023-01-13 3:50 ` Ricardo Koller
2023-01-24 0:48 ` [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ben Gardon
9 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-13 3:50 UTC (permalink / raw)
To: pbonzini, maz, oupton, yuzenghui, dmatlack
Cc: kvm, kvmarm, qperret, catalin.marinas, andrew.jones, seanjc,
alexandru.elisei, suzuki.poulose, eric.auger, gshan, reijiw,
rananta, bgardon, ricarkol, Ricardo Koller
From: Marc Zyngier <maz@kernel.org>
Broadcasted TLB invalidations (TLBI) are usually less performant than their
local variant. In particular, we observed some implementations that take
milliseconds to complete parallel broadcasted TLBIs.
It's safe to use local, non-shareable, TLBIs when relaxing permissions on a
PTE in the KVM case for a couple of reasons. First, according to the ARM
Arm (DDI 0487H.a D5-4913), permission relaxation does not need
break-before-make. Second, KVM does not set the VTTBR_EL2.CnP bit, so each
PE has its own TLB entry for the same page. KVM could tolerate that when
doing permission relaxation (i.e., not having changes broadcasted to all
PEs).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
---
arch/arm64/include/asm/kvm_asm.h | 4 +++
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++++++
arch/arm64/kvm/hyp/nvhe/tlb.c | 54 ++++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 2 +-
arch/arm64/kvm/hyp/vhe/tlb.c | 32 ++++++++++++++++++
5 files changed, 101 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 43c3bc0f9544..bb17b2ead4c7 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -68,6 +68,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa,
+ __KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid_ipa_nsh,
__KVM_HOST_SMCCC_FUNC___kvm_tlb_flush_vmid,
__KVM_HOST_SMCCC_FUNC___kvm_flush_cpu_context,
__KVM_HOST_SMCCC_FUNC___kvm_timer_set_cntvoff,
@@ -225,6 +226,9 @@ extern void __kvm_flush_vm_context(void);
extern void __kvm_flush_cpu_context(struct kvm_s2_mmu *mmu);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu, phys_addr_t ipa,
int level);
+extern void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+ phys_addr_t ipa,
+ int level);
extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
extern void __kvm_timer_set_cntvoff(u64 cntvoff);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 728e01d4536b..c6bf1e49ca93 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -125,6 +125,15 @@ static void handle___kvm_tlb_flush_vmid_ipa(struct kvm_cpu_context *host_ctxt)
__kvm_tlb_flush_vmid_ipa(kern_hyp_va(mmu), ipa, level);
}
+static void handle___kvm_tlb_flush_vmid_ipa_nsh(struct kvm_cpu_context *host_ctxt)
+{
+ DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
+ DECLARE_REG(phys_addr_t, ipa, host_ctxt, 2);
+ DECLARE_REG(int, level, host_ctxt, 3);
+
+ __kvm_tlb_flush_vmid_ipa_nsh(kern_hyp_va(mmu), ipa, level);
+}
+
static void handle___kvm_tlb_flush_vmid(struct kvm_cpu_context *host_ctxt)
{
DECLARE_REG(struct kvm_s2_mmu *, mmu, host_ctxt, 1);
@@ -315,6 +324,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__kvm_vcpu_run),
HANDLE_FUNC(__kvm_flush_vm_context),
HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa),
+ HANDLE_FUNC(__kvm_tlb_flush_vmid_ipa_nsh),
HANDLE_FUNC(__kvm_tlb_flush_vmid),
HANDLE_FUNC(__kvm_flush_cpu_context),
HANDLE_FUNC(__kvm_timer_set_cntvoff),
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index d296d617f589..ef2b70587f93 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -109,6 +109,60 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+ phys_addr_t ipa, int level)
+{
+ struct tlb_inv_context cxt;
+
+ dsb(nshst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ /*
+ * We could do so much better if we had the VA as well.
+ * Instead, we invalidate Stage-2 for this IPA, and the
+ * whole of Stage-1. Weep...
+ */
+ ipa >>= 12;
+ __tlbi_level(ipas2e1, ipa, level);
+
+ /*
+ * We have to ensure completion of the invalidation at Stage-2,
+ * since a table walk on another CPU could refill a TLB with a
+ * complete (S1 + S2) walk based on the old Stage-2 mapping if
+ * the Stage-1 invalidation happened first.
+ */
+ dsb(nsh);
+ __tlbi(vmalle1);
+ dsb(nsh);
+ isb();
+
+ /*
+ * If the host is running at EL1 and we have a VPIPT I-cache,
+ * then we must perform I-cache maintenance at EL2 in order for
+ * it to have an effect on the guest. Since the guest cannot hit
+ * I-cache lines allocated with a different VMID, we don't need
+ * to worry about junk out of guest reset (we nuke the I-cache on
+ * VMID rollover), but we do need to be careful when remapping
+ * executable pages for the same guest. This can happen when KSM
+ * takes a CoW fault on an executable page, copies the page into
+ * a page that was previously mapped in the guest and then needs
+ * to invalidate the guest view of the I-cache for that page
+ * from EL1. To solve this, we invalidate the entire I-cache when
+ * unmapping a page from a guest if we have a VPIPT I-cache but
+ * the host is running at EL1. As above, we could do better if
+ * we had the VA.
+ *
+ * The moral of this story is: if you have a VPIPT I-cache, then
+ * you should be running with VHE enabled.
+ */
+ if (icache_is_vpipt())
+ icache_inval_all_pou();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index db9d1a28769b..7d694d12b5c4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1148,7 +1148,7 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level,
KVM_PGTABLE_WALK_SHARED);
if (!ret)
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, pgt->mmu, addr, level);
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
return ret;
}
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 24cef9b87f9e..e69da550cdc5 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -111,6 +111,38 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
__tlb_switch_to_host(&cxt);
}
+void __kvm_tlb_flush_vmid_ipa_nsh(struct kvm_s2_mmu *mmu,
+ phys_addr_t ipa, int level)
+{
+ struct tlb_inv_context cxt;
+
+ dsb(nshst);
+
+ /* Switch to requested VMID */
+ __tlb_switch_to_guest(mmu, &cxt);
+
+ /*
+ * We could do so much better if we had the VA as well.
+ * Instead, we invalidate Stage-2 for this IPA, and the
+ * whole of Stage-1. Weep...
+ */
+ ipa >>= 12;
+ __tlbi_level(ipas2e1, ipa, level);
+
+ /*
+ * We have to ensure completion of the invalidation at Stage-2,
+ * since a table walk on another CPU could refill a TLB with a
+ * complete (S1 + S2) walk based on the old Stage-2 mapping if
+ * the Stage-1 invalidation happened first.
+ */
+ dsb(nsh);
+ __tlbi(vmalle1);
+ dsb(nsh);
+ isb();
+
+ __tlb_switch_to_host(&cxt);
+}
+
void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
{
struct tlb_inv_context cxt;
--
2.39.0.314.g84b9a713c41-goog
* Re: [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging
2023-01-13 3:49 [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ricardo Koller
` (8 preceding siblings ...)
2023-01-13 3:50 ` [PATCH 9/9] KVM: arm64: Use local TLBI on permission relaxation Ricardo Koller
@ 2023-01-24 0:48 ` Ben Gardon
2023-01-24 16:50 ` Ricardo Koller
9 siblings, 1 reply; 47+ messages in thread
From: Ben Gardon @ 2023-01-24 0:48 UTC (permalink / raw)
To: Ricardo Koller
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Implement Eager Page Splitting for ARM.
>
> Eager Page Splitting improves the performance of dirty-logging (used
> in live migrations) when guest memory is backed by huge-pages. It's
> an optimization used in Google Cloud since 2016 on x86, and for the
> last couple of months on ARM.
>
> Background and motivation
> =========================
> Dirty logging is typically used for live-migration iterative copying.
> KVM implements dirty-logging at the PAGE_SIZE granularity (will refer
> to 4K pages from now on). It does it by faulting on write-protected
> 4K pages. Therefore, enabling dirty-logging on a huge-page requires
> breaking it into 4K pages in the first place. KVM does this breaking
> on fault, and because it's in the critical path it only maps the 4K
> page that faulted; every other 4K page is left unmapped. This is not
> great for performance on ARM for a couple of reasons:
>
> - Splitting on fault can halt vcpus for milliseconds in some
> implementations. Splitting a block PTE requires using a broadcasted
> TLB invalidation (TLBI) for every huge-page (due to the
> break-before-make requirement). Note that x86 doesn't need this. We
> observed some implementations that take milliseconds to complete
> broadcasted TLBIs when done in parallel from multiple vcpus. And
> that's exactly what happens when doing it on fault: multiple vcpus
> fault at the same time triggering TLBIs in parallel.
>
> - Read intensive guest workloads end up paying for dirty-logging.
> Only mapping the faulting 4K page means that all the other pages
> that were part of the huge-page will now be unmapped. The effect is
> that any access, including reads, now has to fault.
>
> Eager Page Splitting (on ARM)
> =============================
> Eager Page Splitting fixes the above two issues by eagerly splitting
> huge-pages when enabling dirty logging. The goal is to avoid doing it
> while faulting on write-protected pages. This is what the TDP MMU does
> for x86 [0], except that x86 does it for different reasons: to avoid
> grabbing the MMU lock on fault. Note that taking care of
> write-protection faults still requires grabbing the MMU lock on ARM,
> but not on x86 (with the fast_page_fault path).
>
> An additional benefit of eagerly splitting huge-pages is that it can
> be done in a controlled way (e.g., via an IOCTL). This series provides
> two knobs for doing it, just like its x86 counterpart: when enabling
> dirty logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The
> benefit of doing it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes
> ranges, and not complete memslots like when enabling dirty logging.
> This means that the cost of splitting (mainly broadcasted TLBIs) can
> be throttled: split a range, wait for a bit, split another range, etc.
> The benefits of this approach were presented by Oliver Upton at KVM
> Forum 2022 [1].
>
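[The throttling idea described above (split a range, wait for a bit,
split another range) amounts to walking a memslot in fixed-size chunks
and pausing between them. A minimal, purely illustrative sketch of the
chunking logic; in practice each chunk would be a KVM_CLEAR_DIRTY_LOG
ioctl issued by userspace:]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch (not KVM code): walk nr_pages of a memslot in
 * fixed-size chunks so the caller can throttle the broadcasted TLBIs by
 * pausing between chunks.  Returns how many chunks were processed.
 */
static unsigned int count_split_chunks(uint64_t nr_pages, uint64_t chunk)
{
	unsigned int chunks = 0;
	uint64_t done;

	for (done = 0; done < nr_pages; done += chunk) {
		/* split [done, min(done + chunk, nr_pages)) here,
		 * then optionally sleep to throttle ... */
		chunks++;
	}
	return chunks;
}
```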
> Implementation
> ==============
> Patches 1-3 add a pgtable utility function for splitting huge block
> PTEs: kvm_pgtable_stage2_split(). Patches 4-8 add support for eagerly
> splitting huge-pages when enabling dirty-logging and when using the
> KVM_CLEAR_DIRTY_LOG ioctl. Note that this is just like what x86 does,
> and the code is actually based on it. And finally, patch 9:
>
> KVM: arm64: Use local TLBI on permission relaxation
>
> adds support for using local TLBIs instead of broadcasts when doing
> permission relaxation. This last patch is key to achieving good
> performance during dirty-logging, as eagerly breaking huge-pages
> replaces mapping new pages with permission relaxation. Got this patch
> (indirectly) from Marc Z. and took the liberty of adding a commit
> message.
>
> Note: this applies on top of 6.2-rc3.
>
> Performance evaluation
> ======================
> The performance benefits were tested using the dirty_log_perf_test
> selftest with 2M huge-pages.
>
> The first test uses a write-only sequential workload where the stride
> is 2M instead of 4K [2]. The idea with this experiment is to emulate a
> random access pattern writing a different huge-page at every access.
> Observe that the benefit increases with the number of vcpus: up to
> 5.76x for 152 vcpus.
>
> ./dirty_log_perf_test_sparse -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2
>
> +-------+----------+------------------+
> | vCPUs | 6.2-rc3 | 6.2-rc3 + series |
> | | (ms) | (ms) |
> +-------+----------+------------------+
> | 1 | 2.63 | 1.66 |
> | 2 | 2.95 | 1.70 |
> | 4 | 3.21 | 1.71 |
> | 8 | 4.97 | 1.78 |
> | 16 | 9.51 | 1.82 |
> | 32 | 20.15 | 3.03 |
> | 64 | 40.09 | 5.80 |
> | 128 | 80.08 | 12.24 |
> | 152 | 109.81 | 15.14 |
> +-------+----------+------------------+
Is this the average pass time, the total, the last pass, or the whole
test runtime?
Was this run with MANUAL_PROTECT (doing the split during the CLEAR
ioctl), or with the splitting done at enable time?
>
> This second test measures the benefit of eager page splitting on read
> intensive workloads (1 write for every 10 reads). As in the other
> test, the benefit increases with the number of vcpus, up to 8.82x for
> 152 vcpus.
>
> ./dirty_log_perf_test -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2 -f 10
>
> +-------+----------+------------------+
> | vCPUs | 6.2-rc3 | 6.2-rc3 + series |
> | | (sec) | (sec) |
> +-------+----------+------------------+
> | 1 | 0.65 | 0.07 |
> | 2 | 0.70 | 0.08 |
> | 4 | 0.71 | 0.08 |
> | 8 | 0.72 | 0.08 |
> | 16 | 0.76 | 0.08 |
> | 32 | 1.61 | 0.14 |
> | 64 | 3.46 | 0.30 |
> | 128 | 5.49 | 0.64 |
> | 152 | 6.44 | 0.63 |
> +-------+----------+------------------+
>
> Changes from the RFC:
> https://lore.kernel.org/kvmarm/20221112081714.2169495-1-ricarkol@google.com/
> - dropped the changes to split on POST visits. No visible perf
> benefit.
> - changed the kvm_pgtable_stage2_free_removed() implementation to
> reuse the stage2 mapper.
> - dropped the FEAT_BBM changes and optimization. Will send this on a
> different series.
>
> Thanks,
> Ricardo
>
> [0] https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
> [1] https://kvmforum2022.sched.com/event/15jJq/kvmarm-at-scale-improvements-to-the-mmu-in-the-face-of-hardware-growing-pains-oliver-upton-google
> [2] https://github.com/ricarkol/linux/commit/f78e9102b2bff4fb7f30bee810d7d611a537b46d
> [3] https://lore.kernel.org/kvmarm/20221107215644.1895162-1-oliver.upton@linux.dev/
>
> Marc Zyngier (1):
> KVM: arm64: Use local TLBI on permission relaxation
>
> Ricardo Koller (8):
> KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
> KVM: arm64: Add helper for creating removed stage2 subtrees
> KVM: arm64: Add kvm_pgtable_stage2_split()
> KVM: arm64: Refactor kvm_arch_commit_memory_region()
> KVM: arm64: Add kvm_uninit_stage2_mmu()
> KVM: arm64: Split huge pages when dirty logging is enabled
> KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
> KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
>
> arch/arm64/include/asm/kvm_asm.h | 4 +
> arch/arm64/include/asm/kvm_host.h | 30 +++++
> arch/arm64/include/asm/kvm_mmu.h | 1 +
> arch/arm64/include/asm/kvm_pgtable.h | 62 ++++++++++
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++
> arch/arm64/kvm/hyp/nvhe/tlb.c | 54 +++++++++
> arch/arm64/kvm/hyp/pgtable.c | 143 +++++++++++++++++++++--
> arch/arm64/kvm/hyp/vhe/tlb.c | 32 +++++
> arch/arm64/kvm/mmu.c | 168 ++++++++++++++++++++++-----
> 9 files changed, 466 insertions(+), 38 deletions(-)
>
> --
> 2.39.0.314.g84b9a713c41-goog
>
* Re: [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging
2023-01-24 0:48 ` [PATCH 0/9] KVM: arm64: Eager Huge-page splitting for dirty-logging Ben Gardon
@ 2023-01-24 16:50 ` Ricardo Koller
0 siblings, 0 replies; 47+ messages in thread
From: Ricardo Koller @ 2023-01-24 16:50 UTC (permalink / raw)
To: Ben Gardon
Cc: pbonzini, maz, oupton, yuzenghui, dmatlack, kvm, kvmarm, qperret,
catalin.marinas, andrew.jones, seanjc, alexandru.elisei,
suzuki.poulose, eric.auger, gshan, reijiw, rananta, ricarkol
On Mon, Jan 23, 2023 at 04:48:02PM -0800, Ben Gardon wrote:
> On Thu, Jan 12, 2023 at 7:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Implement Eager Page Splitting for ARM.
> >
> > Eager Page Splitting improves the performance of dirty-logging (used
> > in live migrations) when guest memory is backed by huge-pages. It's
> > an optimization used in Google Cloud since 2016 on x86, and for the
> > last couple of months on ARM.
> >
> > Background and motivation
> > =========================
> > Dirty logging is typically used for live-migration iterative copying.
> > KVM implements dirty-logging at the PAGE_SIZE granularity (will refer
> > to 4K pages from now on). It does so by faulting on write-protected
> > 4K pages. Therefore, enabling dirty-logging on a huge-page requires
> > breaking it into 4K pages in the first place. KVM does this breaking
> > on fault, and because it's in the critical path it only maps the 4K
> > page that faulted; every other 4K page is left unmapped. This is not
> > great for performance on ARM for a couple of reasons:
> >
> > - Splitting on fault can halt vcpus for milliseconds in some
> > implementations. Splitting a block PTE requires using a broadcasted
> > TLB invalidation (TLBI) for every huge-page (due to the
> > break-before-make requirement). Note that x86 doesn't need this. We
> > observed some implementations that take milliseconds to complete
> > broadcasted TLBIs when done in parallel from multiple vcpus. And
> > that's exactly what happens when doing it on fault: multiple vcpus
> > fault at the same time triggering TLBIs in parallel.
> >
> > - Read intensive guest workloads end up paying for dirty-logging.
> > Only mapping the faulting 4K page means that all the other pages
> > that were part of the huge-page will now be unmapped. The effect is
> > that any access, including reads, now has to fault.
> >
> > Eager Page Splitting (on ARM)
> > =============================
> > Eager Page Splitting fixes the above two issues by eagerly splitting
> > huge-pages when enabling dirty logging. The goal is to avoid doing it
> > while faulting on write-protected pages. This is what the TDP MMU does
> > for x86 [0], except that x86 does it for different reasons: to avoid
> > grabbing the MMU lock on fault. Note that taking care of
> > write-protection faults still requires grabbing the MMU lock on ARM,
> > but not on x86 (with the fast_page_fault path).
> >
> > An additional benefit of eagerly splitting huge-pages is that it can
> > be done in a controlled way (e.g., via an IOCTL). This series provides
> > two knobs for doing it, just like its x86 counterpart: when enabling
> > dirty logging, and when using the KVM_CLEAR_DIRTY_LOG ioctl. The
> > benefit of doing it on KVM_CLEAR_DIRTY_LOG is that this ioctl takes
> > ranges, and not complete memslots like when enabling dirty logging.
> > This means that the cost of splitting (mainly broadcasted TLBIs) can
> > be throttled: split a range, wait for a bit, split another range, etc.
> > The benefits of this approach were presented by Oliver Upton at KVM
> > Forum 2022 [1].
> >
> > Implementation
> > ==============
> > Patches 1-3 add a pgtable utility function for splitting huge block
> > PTEs: kvm_pgtable_stage2_split(). Patches 4-8 add support for eagerly
> > splitting huge-pages when enabling dirty-logging and when using the
> > KVM_CLEAR_DIRTY_LOG ioctl. Note that this is just like what x86 does,
> > and the code is actually based on it. And finally, patch 9:
> >
> > KVM: arm64: Use local TLBI on permission relaxation
> >
> > adds support for using local TLBIs instead of broadcasts when doing
> > permission relaxation. This last patch is key to achieving good
> > performance during dirty-logging, as eager splitting means that
> > write-protection faults are handled with permission relaxation
> > instead of by mapping new pages. Got this patch (indirectly) from
> > Marc Z. and took the liberty of adding a commit message.
> >
> > Note: this applies on top of 6.2-rc3.
> >
> > Performance evaluation
> > ======================
> > The performance benefits were tested using the dirty_log_perf_test
> > selftest with 2M huge-pages.
> >
> > The first test uses a write-only sequential workload where the stride
> > is 2M instead of 4K [2]. The idea with this experiment is to emulate a
> > random access pattern writing a different huge-page at every access.
> > Observe that the benefit increases with the number of vcpus: up to
> > 5.76x for 152 vcpus.
> >
> > ./dirty_log_perf_test_sparse -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2
> >
> > +-------+----------+------------------+
> > | vCPUs | 6.2-rc3 | 6.2-rc3 + series |
> > | | (ms) | (ms) |
> > +-------+----------+------------------+
> > | 1 | 2.63 | 1.66 |
> > | 2 | 2.95 | 1.70 |
> > | 4 | 3.21 | 1.71 |
> > | 8 | 4.97 | 1.78 |
> > | 16 | 9.51 | 1.82 |
> > | 32 | 20.15 | 3.03 |
> > | 64 | 40.09 | 5.80 |
> > | 128 | 80.08 | 12.24 |
> > | 152 | 109.81 | 15.14 |
> > +-------+----------+------------------+
>
> This is the average pass time I assume? Or the total or last? Or whole
> test runtime?
Oops, forgot to provide any detail, sorry for that. This is the length
of the first guest pass ("Iteration 2 dirty memory time").
> Was this run with MANUAL_PROTECT or do the split during the CLEAR
> IOCTL or with the splitting at enable time?
The CLEAR IOCTL (with MANUAL_PROTECT). Just to double check, without the
'-g' argument.
>
> >
> > This second test measures the benefit of eager page splitting on read
> > intensive workloads (1 write for every 10 reads). As in the other
> > test, the benefit increases with the number of vcpus, up to 8.82x for
> > 152 vcpus.
> >
> > ./dirty_log_perf_test -s anonymous_hugetlb_2mb -b 1G -v $i -i 3 -m 2 -f 10
> >
> > +-------+----------+------------------+
> > | vCPUs | 6.2-rc3 | 6.2-rc3 + series |
> > | | (sec) | (sec) |
> > +-------+----------+------------------+
> > | 1 | 0.65 | 0.07 |
> > | 2 | 0.70 | 0.08 |
> > | 4 | 0.71 | 0.08 |
> > | 8 | 0.72 | 0.08 |
> > | 16 | 0.76 | 0.08 |
> > | 32 | 1.61 | 0.14 |
> > | 64 | 3.46 | 0.30 |
> > | 128 | 5.49 | 0.64 |
> > | 152 | 6.44 | 0.63 |
> > +-------+----------+------------------+
> >
> > Changes from the RFC:
> > https://lore.kernel.org/kvmarm/20221112081714.2169495-1-ricarkol@google.com/
> > - dropped the changes to split on POST visits. No visible perf
> > benefit.
> > - changed the kvm_pgtable_stage2_free_removed() implementation to
> > reuse the stage2 mapper.
> > - dropped the FEAT_BBM changes and optimization. Will send this on a
> > different series.
> >
> > Thanks,
> > Ricardo
> >
> > [0] https://lore.kernel.org/kvm/20220119230739.2234394-1-dmatlack@google.com/
> > [1] https://kvmforum2022.sched.com/event/15jJq/kvmarm-at-scale-improvements-to-the-mmu-in-the-face-of-hardware-growing-pains-oliver-upton-google
> > [2] https://github.com/ricarkol/linux/commit/f78e9102b2bff4fb7f30bee810d7d611a537b46d
> > [3] https://lore.kernel.org/kvmarm/20221107215644.1895162-1-oliver.upton@linux.dev/
> >
> > Marc Zyngier (1):
> > KVM: arm64: Use local TLBI on permission relaxation
> >
> > Ricardo Koller (8):
> > KVM: arm64: Add KVM_PGTABLE_WALK_REMOVED into ctx->flags
> > KVM: arm64: Add helper for creating removed stage2 subtrees
> > KVM: arm64: Add kvm_pgtable_stage2_split()
> > KVM: arm64: Refactor kvm_arch_commit_memory_region()
> > KVM: arm64: Add kvm_uninit_stage2_mmu()
> > KVM: arm64: Split huge pages when dirty logging is enabled
> > KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
> > KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
> >
> > arch/arm64/include/asm/kvm_asm.h | 4 +
> > arch/arm64/include/asm/kvm_host.h | 30 +++++
> > arch/arm64/include/asm/kvm_mmu.h | 1 +
> > arch/arm64/include/asm/kvm_pgtable.h | 62 ++++++++++
> > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 10 ++
> > arch/arm64/kvm/hyp/nvhe/tlb.c | 54 +++++++++
> > arch/arm64/kvm/hyp/pgtable.c | 143 +++++++++++++++++++++--
> > arch/arm64/kvm/hyp/vhe/tlb.c | 32 +++++
> > arch/arm64/kvm/mmu.c | 168 ++++++++++++++++++++++-----
> > 9 files changed, 466 insertions(+), 38 deletions(-)
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >
^ permalink raw reply [flat|nested] 47+ messages in thread