linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
@ 2023-08-25  9:35 Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 1/8] arm64: cpufeature: Add API to report system support of HWDBM Shameer Kolothum
                   ` (8 more replies)
  0 siblings, 9 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

Hi,

This revives the RFC series[1] sent out by Zhu Keqian some time back, which
makes use of the hardware dirty bit modifier (DBM) feature (FEAT_HAFDBS)
for dirty page tracking.

One of the main drawbacks of using the hardware DBM feature for dirty
page tracking is the additional overhead of scanning the PTEs for dirty
pages[2]. Also, there are no vCPU page faults once we set the DBM bit,
which may result in a higher convergence time during guest migration.

This series tries to reduce these overheads by not setting the DBM bit
for all writeable pages during migration, and instead uses a combined
software (current page fault mechanism) and hardware (set DBM) approach
for dirty page tracking.

As noted in RFC v1[1],
"The core idea is that we do not enable hardware dirty at start (do not
add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
PTs with hardware dirty enabled, so we do not need to scan all PTs."
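
Concretely, with the 64-page granule used later in this series (patch #8):
a stage-2 write fault on a page marks that page dirty through the normal
software path and, in addition, sets the DBM bit for the aligned 64-page
window containing it, recording that window in a per-memslot bitmap. When
the dirty log is synced, only the windows recorded in the bitmap need to be
scanned for hardware-dirtied (writeable) PTEs, rather than the whole
stage-2 table.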

Major changes from the RFC v1 are:

1. Rebased to 6.5-rc5 + FEAT_TLBIRANGE series[3].
   The original RFC v1 was based on 5.11, and there have been multiple changes
   in KVM/arm64 since then that fundamentally changed the way the page tables
   are updated. I am not 100% sure that I got all the locking mechanisms
   right during page table traversal here, but I haven't seen any
   regressions or memory corruption so far in my test setup.

2. Use of ctx->flags for handling DBM updates (patch #2).

3. During migration, we can only set DBM for pages that are already
   writeable. But the CLEAR_LOG path will mark all the pages as write
   protected, and there isn't any easy way to distinguish previously
   read-only pages from these write-protected pages. Hence, one of the
   "Reserved for Software use" bits in the page descriptor is used to
   mark "writeable-clean" pages. See patch #4.

4. Introduced KVM_CAP_ARM_HW_DBM for enabling this feature from userspace.

Testing
----------
Hardware: HiSilicon ARM64 platform (without FEAT_TLBIRANGE)
Kernel: 6.5-rc5 based, with eager page split explicitly
        enabled (chunksize=2MB)

Tests with dirty_log_perf_test using anonymous THP pages show a significant
improvement in "dirty memory time", as expected, but with a hit on
"get dirty log time".

./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp

+---------------------------+----------------+------------------+
|                           |   6.5-rc5      | 6.5-rc5 + series |
|                           |     (s)        |       (s)        |
+---------------------------+----------------+------------------+
|    dirty memory time      |    4.22        |          0.41    |
|    get dirty log time     |    0.00047     |          3.25    |
|    clear dirty log time   |    0.48        |          0.98    |
+---------------------------------------------------------------+
       
In order to get some idea of actual live migration performance,
I created a VM (96 vCPUs, 1GB), ran a redis-benchmark test and,
while the test was in progress, initiated a (local) live migration.

redis-benchmark -t set -c 900 -n 5000000 --threads 96

The average of 5 runs shows that the benchmark finishes ~10% faster, with
a ~8% increase in "total time" for the migration.

+---------------------------+----------------+------------------+
|                           |   6.5-rc5      | 6.5-rc5 + series |
|                           |     (s)        |    (s)           |
+---------------------------+----------------+------------------+
| [redis]5000000 requests in|    79.428      |      71.49       |
| [info migrate]total time  |    8438        |      9097        |
+---------------------------------------------------------------+
       
Also ran extensive VM migrations with a QEMU that calculates an MD5
checksum for the RAM. No regressions or memory corruption observed
so far.

It looks like this series will benefit VMs with write-intensive
workloads by improving guest uptime during migration.

Please take a look and let me know your feedback. Any help with further
tests and verification is really appreciated.

Thanks,
Shameer

1. https://lore.kernel.org/linux-arm-kernel/20210126124444.27136-1-zhukeqian1@huawei.com/
2. https://lore.kernel.org/linux-arm-kernel/20200525112406.28224-1-zhukeqian1@huawei.com/
3. https://lore.kernel.org/kvm/20230811045127.3308641-1-rananta@google.com/


Keqian Zhu (5):
  arm64: cpufeature: Add API to report system support of HWDBM
  KVM: arm64: Add some HW_DBM related pgtable interfaces
  KVM: arm64: Add some HW_DBM related mmu interfaces
  KVM: arm64: Only write protect selected PTE
  KVM: arm64: Start up SW/HW combined dirty log

Shameer Kolothum (3):
  KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support
  KVM: arm64: Set DBM for writeable-clean pages
  KVM: arm64: Add KVM_CAP_ARM_HW_DBM

 arch/arm64/include/asm/cpufeature.h  |  15 +++
 arch/arm64/include/asm/kvm_host.h    |   8 ++
 arch/arm64/include/asm/kvm_mmu.h     |   7 ++
 arch/arm64/include/asm/kvm_pgtable.h |  53 ++++++++++
 arch/arm64/kernel/image-vars.h       |   2 +
 arch/arm64/kvm/arm.c                 | 138 ++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c         | 139 +++++++++++++++++++++++++--
 arch/arm64/kvm/mmu.c                 |  50 +++++++++-
 include/uapi/linux/kvm.h             |   1 +
 9 files changed, 403 insertions(+), 10 deletions(-)

-- 
2.34.1



* [RFC PATCH v2 1/8] arm64: cpufeature: Add API to report system support of HWDBM
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support Shameer Kolothum
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

From: Keqian Zhu <zhukeqian1@huawei.com>

Though we already have a CPU capability named ARM64_HW_DBM, it is a
LOCAL_CPU cap and is conditionally compiled under CONFIG_ARM64_HW_AFDBM.

This adds an API that reports system-wide support of HW_DBM.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/cpufeature.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 96e50227f940..edb04e45e030 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -733,6 +733,21 @@ static inline bool system_supports_mixed_endian(void)
 	return val == 0x1;
 }
 
+static inline bool system_supports_hw_dbm(void)
+{
+	u64 mmfr1;
+	u32 val;
+
+	if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
+		return false;
+
+	mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+	val = cpuid_feature_extract_unsigned_field(mmfr1,
+						ID_AA64MMFR1_EL1_HAFDBS_SHIFT);
+
+	return val == ID_AA64MMFR1_EL1_HAFDBS_DBM;
+}
+
 static __always_inline bool system_supports_fpsimd(void)
 {
 	return !cpus_have_const_cap(ARM64_HAS_NO_FPSIMD);
-- 
2.34.1



* [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 1/8] arm64: cpufeature: Add API to report system support of HWDBM Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-09-15 22:05   ` Oliver Upton
  2023-08-25  9:35 ` [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces Shameer Kolothum
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

KVM_PGTABLE_WALK_HW_DBM - Indicates that the page table walk is for
HW DBM related updates.

No functional changes here. Any HW DBM bit updates are applied to the
last level only. This will be used by a future commit where we add
support for HW DBM.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  3 +++
 arch/arm64/kvm/hyp/pgtable.c         | 10 ++++++++++
 2 files changed, 13 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d3e354bb8351..3f96bdd2086f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -219,6 +219,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
  *					without Cache maintenance
  *					operations required.
+ * @KVM_PGTABLE_WALK_HW_DBM:		Indicates that the attribute update is
+ *					HW DBM related.
  */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
@@ -228,6 +230,7 @@ enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
 	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
 	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
+	KVM_PGTABLE_WALK_HW_DBM			= BIT(7),
 };
 
 struct kvm_pgtable_visit_ctx {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index f155b8c9e98c..1e65b8c97059 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -67,6 +67,11 @@ struct kvm_pgtable_walk_data {
 	const u64			end;
 };
 
+static bool kvm_pgtable_walk_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx)
+{
+	return ctx->flags & KVM_PGTABLE_WALK_HW_DBM;
+}
+
 static bool kvm_pgtable_walk_skip_bbm_tlbi(const struct kvm_pgtable_visit_ctx *ctx)
 {
 	return unlikely(ctx->flags & KVM_PGTABLE_WALK_SKIP_BBM_TLBI);
@@ -1164,6 +1169,11 @@ static int stage2_attr_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	if (!kvm_pte_valid(ctx->old))
 		return -EAGAIN;
 
+	/* Only apply HW DBM for last level */
+	if (kvm_pgtable_walk_hw_dbm(ctx) &&
+	    ctx->level != (KVM_PGTABLE_MAX_LEVELS - 1))
+		return 0;
+
 	data->level = ctx->level;
 	data->pte = pte;
 	pte &= ~data->attr_clr;
-- 
2.34.1



* [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 1/8] arm64: cpufeature: Add API to report system support of HWDBM Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-09-15 22:22   ` Oliver Upton
  2023-09-22 15:24   ` Catalin Marinas
  2023-08-25  9:35 ` [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages Shameer Kolothum
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

From: Keqian Zhu <zhukeqian1@huawei.com>

This adds set_dbm, clear_dbm and sync_dirty interfaces at the pgtable
layer. (1) set_dbm: Set the DBM bit for the last-level PTEs of a
specified range; TLB invalidation is performed. (2) clear_dbm: Clear
the DBM bit for the last-level PTEs of a specified range; no TLB
invalidation is performed. (3) sync_dirty: Scan the last-level PTEs of
a specified range and log a page as dirty if its PTE is writeable.

Besides, save the dirty state of a PTE if it is invalidated by map or
unmap.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/kvm_pgtable.h | 45 +++++++++++++
 arch/arm64/kernel/image-vars.h       |  2 +
 arch/arm64/kvm/hyp/pgtable.c         | 98 ++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 3f96bdd2086f..a12add002b89 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -578,6 +578,51 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
  */
 int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
 
+/**
+ * kvm_pgtable_stage2_clear_dbm() - Clear DBM of guest stage-2 address range
+ *                                  without TLB invalidation (only last level).
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to clear DBM.
+ * @size:	Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Note that it is the caller's responsibility to invalidate the TLB after
+ * calling this function, to ensure that the cleared DBM bits are visible
+ * to all CPUs.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_set_dbm() - Set DBM of guest stage-2 address range to
+ *                                enable HW dirty (only last level).
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to set DBM.
+ * @size:	Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
+/**
+ * kvm_pgtable_stage2_sync_dirty() - Sync HW dirty state into memslot.
+ * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
+ * @addr:	Intermediate physical address from which to sync.
+ * @size:	Size of the range.
+ *
+ * The offset of @addr within a page is ignored and @size is rounded-up to
+ * the next page boundary.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size);
+
 /**
  * kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2 address range
  *                                  without TLB invalidation.
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 35f3c7959513..2ca600e3d637 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -68,6 +68,8 @@ KVM_NVHE_ALIAS(__hyp_stub_vectors);
 KVM_NVHE_ALIAS(vgic_v2_cpuif_trap);
 KVM_NVHE_ALIAS(vgic_v3_cpuif_trap);
 
+KVM_NVHE_ALIAS(mark_page_dirty);
+
 #ifdef CONFIG_ARM64_PSEUDO_NMI
 /* Static key checked in GIC_PRIO_IRQOFF. */
 KVM_NVHE_ALIAS(gic_nonsecure_priorities);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 1e65b8c97059..d7a46a00a7f6 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -9,6 +9,7 @@
 
 #include <linux/bitfield.h>
 #include <asm/kvm_pgtable.h>
+#include <asm/kvm_mmu.h>
 #include <asm/stage2_pgtable.h>
 
 
@@ -42,6 +43,7 @@
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)
 
+#define KVM_PTE_LEAF_ATTR_HI_S2_DBM	BIT(51)
 #define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)
 
 #define KVM_PTE_LEAF_ATTR_HI_S1_GP	BIT(50)
@@ -764,8 +766,44 @@ static bool stage2_pte_is_locked(kvm_pte_t pte)
 	return !kvm_pte_valid(pte) && (pte & KVM_INVALID_PTE_LOCKED);
 }
 
+static bool stage2_pte_writeable(kvm_pte_t pte)
+{
+	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
+}
+
+static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx,
+			      kvm_pte_t new)
+{
+	kvm_pte_t old_pte, pte = ctx->old;
+
+	/* Only set DBM if page is writeable */
+	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM) && !stage2_pte_writeable(pte))
+		return;
+
+	/* Clear DBM walk is not shared, update */
+	if (!kvm_pgtable_walk_shared(ctx)) {
+		WRITE_ONCE(*ctx->ptep, new);
+		return;
+	}
+
+	do {
+		old_pte = pte;
+		pte = new;
+
+		if (old_pte == pte)
+			break;
+
+		pte = cmpxchg_relaxed(ctx->ptep, old_pte, pte);
+	} while (pte != old_pte);
+}
+
 static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
 {
+	if (kvm_pgtable_walk_hw_dbm(ctx)) {
+		kvm_update_hw_dbm(ctx, new);
+		return true;
+	}
+
 	if (!kvm_pgtable_walk_shared(ctx)) {
 		WRITE_ONCE(*ctx->ptep, new);
 		return true;
@@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
 	    stage2_pte_executable(new))
 		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
 
+	/* Save the possible hardware dirty info */
+	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
+	    stage2_pte_writeable(ctx->old))
+		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu), ctx->addr >> PAGE_SHIFT);
+
 	stage2_make_pte(ctx, new);
 
 	return 0;
@@ -1125,6 +1168,11 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 	 */
 	stage2_unmap_put_pte(ctx, mmu, mm_ops);
 
+	/* Save the possible hardware dirty info */
+	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
+	    stage2_pte_writeable(ctx->old))
+		mark_page_dirty(kvm_s2_mmu_to_kvm(mmu), ctx->addr >> PAGE_SHIFT);
+
 	if (need_flush && mm_ops->dcache_clean_inval_poc)
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
 					       kvm_granule_size(ctx->level));
@@ -1230,6 +1278,30 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
 	return 0;
 }
 
+int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	int ret;
+	u64 offset;
+
+	ret = stage2_update_leaf_attrs(pgt, addr, size, KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0,
+				       NULL, NULL, KVM_PGTABLE_WALK_HW_DBM |
+				       KVM_PGTABLE_WALK_SHARED);
+	if (!ret)
+		return ret;
+
+	for (offset = 0; offset < size; offset += PAGE_SIZE)
+		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr + offset, 3);
+
+	return 0;
+}
+
+int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	return stage2_update_leaf_attrs(pgt, addr, size,
+					0, KVM_PTE_LEAF_ATTR_HI_S2_DBM,
+					NULL, NULL, KVM_PGTABLE_WALK_HW_DBM);
+}
+
 int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
 	return stage2_update_leaf_attrs(pgt, addr, size, 0,
@@ -1329,6 +1401,32 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
 	return ret;
 }
 
+static int stage2_sync_dirty_walker(const struct kvm_pgtable_visit_ctx *ctx,
+				    enum kvm_pgtable_walk_flags visit)
+{
+	kvm_pte_t pte = READ_ONCE(*ctx->ptep);
+	struct kvm *kvm = ctx->arg;
+
+	if (!kvm_pte_valid(pte))
+		return 0;
+
+	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) && stage2_pte_writeable(pte))
+		mark_page_dirty(kvm, ctx->addr >> PAGE_SHIFT);
+
+	return 0;
+}
+
+int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	struct kvm_pgtable_walker walker = {
+		.cb	= stage2_sync_dirty_walker,
+		.flags	= KVM_PGTABLE_WALK_LEAF,
+		.arg	= kvm_s2_mmu_to_kvm(pgt->mmu),
+	};
+
+	return kvm_pgtable_walk(pgt, addr, size, &walker);
+}
+
 static int stage2_flush_walker(const struct kvm_pgtable_visit_ctx *ctx,
 			       enum kvm_pgtable_walk_flags visit)
 {
-- 
2.34.1



* [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
                   ` (2 preceding siblings ...)
  2023-08-25  9:35 ` [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-09-15 22:54   ` Oliver Upton
  2023-09-22 15:40   ` Catalin Marinas
  2023-08-25  9:35 ` [RFC PATCH v2 5/8] KVM: arm64: Add some HW_DBM related mmu interfaces Shameer Kolothum
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

We only set DBM if the page is writeable (S2AP[1] == 1). But once migration
starts, the CLEAR_LOG path will write protect the pages (S2AP[1] = 0) and there
isn't an easy way to differentiate the writeable pages that get write
protected from read-only pages, as we only have the S2AP[1] bit to check.

Introduce a ctx->flag KVM_PGTABLE_WALK_WC_HINT to identify the dirty page
tracking related write-protect page table walk, and use one of the "Reserved
for software use" bits in the page descriptor to mark a page as
"writeable-clean".

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  5 +++++
 arch/arm64/kvm/hyp/pgtable.c         | 25 ++++++++++++++++++++++---
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index a12add002b89..67bcbc5984f9 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -190,6 +190,8 @@ enum kvm_pgtable_prot {
 #define KVM_PGTABLE_PROT_RW	(KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
 #define KVM_PGTABLE_PROT_RWX	(KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
 
+#define KVM_PGTABLE_PROT_WC	KVM_PGTABLE_PROT_SW0  /*write-clean*/
+
 #define PKVM_HOST_MEM_PROT	KVM_PGTABLE_PROT_RWX
 #define PKVM_HOST_MMIO_PROT	KVM_PGTABLE_PROT_RW
 
@@ -221,6 +223,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
  *					operations required.
  * @KVM_PGTABLE_WALK_HW_DBM:		Indicates that the attribute update is
  *					HW DBM related.
+ * @KVM_PGTABLE_WALK_WC_HINT:		Update the page as writeable-clean(software attribute)
+ *					if we are write protecting a writeable page.
  */
 enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_LEAF			= BIT(0),
@@ -231,6 +235,7 @@ enum kvm_pgtable_walk_flags {
 	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
 	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
 	KVM_PGTABLE_WALK_HW_DBM			= BIT(7),
+	KVM_PGTABLE_WALK_WC_HINT		= BIT(8),
 };
 
 struct kvm_pgtable_visit_ctx {
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index d7a46a00a7f6..4552bfb1f274 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -69,6 +69,11 @@ struct kvm_pgtable_walk_data {
 	const u64			end;
 };
 
+static bool kvm_pgtable_walk_wc_hint(const struct kvm_pgtable_visit_ctx *ctx)
+{
+	return ctx->flags & KVM_PGTABLE_WALK_WC_HINT;
+}
+
 static bool kvm_pgtable_walk_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx)
 {
 	return ctx->flags & KVM_PGTABLE_WALK_HW_DBM;
@@ -771,13 +776,24 @@ static bool stage2_pte_writeable(kvm_pte_t pte)
 	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
 }
 
+static bool stage2_pte_is_write_clean(kvm_pte_t pte)
+{
+	return kvm_pte_valid(pte) && (pte & KVM_PGTABLE_PROT_WC);
+}
+
+static bool stage2_pte_can_be_write_clean(const struct kvm_pgtable_visit_ctx *ctx,
+					  kvm_pte_t new)
+{
+	return (stage2_pte_writeable(ctx->old) && !stage2_pte_writeable(new));
+}
+
 static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx,
 			      kvm_pte_t new)
 {
 	kvm_pte_t old_pte, pte = ctx->old;
 
-	/* Only set DBM if page is writeable */
-	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM) && !stage2_pte_writeable(pte))
+	/* Only set DBM if page is writeable-clean */
+	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM) && !stage2_pte_is_write_clean(pte))
 		return;
 
 	/* Clear DBM walk is not shared, update */
@@ -805,6 +821,9 @@ static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_
 	}
 
 	if (!kvm_pgtable_walk_shared(ctx)) {
+		if (kvm_pgtable_walk_wc_hint(ctx) &&
+		    stage2_pte_can_be_write_clean(ctx, new))
+			new |= KVM_PGTABLE_PROT_WC;
 		WRITE_ONCE(*ctx->ptep, new);
 		return true;
 	}
@@ -1306,7 +1325,7 @@ int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
 {
 	return stage2_update_leaf_attrs(pgt, addr, size, 0,
 					KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
-					NULL, NULL, 0);
+					NULL, NULL, KVM_PGTABLE_WALK_WC_HINT);
 }
 
 kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
-- 
2.34.1



* [RFC PATCH v2 5/8] KVM: arm64: Add some HW_DBM related mmu interfaces
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
                   ` (3 preceding siblings ...)
  2023-08-25  9:35 ` [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE Shameer Kolothum
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

From: Keqian Zhu <zhukeqian1@huawei.com>

This adds set_dbm, clear_dbm and sync_dirty interfaces at the mmu layer.
They simply wrap the corresponding pgtable layer interfaces.

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/kvm_mmu.h |  7 +++++++
 arch/arm64/kvm/mmu.c             | 30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 0e1e1ab17b4d..86e1e074337b 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -170,6 +170,13 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 			     void **haddr);
 void __init free_hyp_pgds(void);
 
+void kvm_stage2_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+			  gfn_t gfn_offset, unsigned long npages);
+void kvm_stage2_set_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+			gfn_t gfn_offset, unsigned long npages);
+void kvm_stage2_sync_dirty(struct kvm *kvm, struct kvm_memory_slot *slot,
+			   gfn_t gfn_offset, unsigned long npages);
+
 void stage2_unmap_vm(struct kvm *kvm);
 int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long type);
 void kvm_uninit_stage2_mmu(struct kvm *kvm);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b16aff3f65f6..f5ae4b97df4d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1149,6 +1149,36 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 		kvm_mmu_split_huge_pages(kvm, start, end);
 }
 
+void kvm_stage2_clear_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+			  gfn_t gfn_offset, unsigned long npages)
+{
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t addr = base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+	stage2_apply_range_resched(&kvm->arch.mmu, addr, end, kvm_pgtable_stage2_clear_dbm);
+}
+
+void kvm_stage2_set_dbm(struct kvm *kvm, struct kvm_memory_slot *slot,
+			gfn_t gfn_offset, unsigned long npages)
+{
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t addr = base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+	stage2_apply_range(&kvm->arch.mmu, addr, end, kvm_pgtable_stage2_set_dbm, false);
+}
+
+void kvm_stage2_sync_dirty(struct kvm *kvm, struct kvm_memory_slot *slot,
+			   gfn_t gfn_offset, unsigned long npages)
+{
+	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+	phys_addr_t addr = base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (base_gfn + npages) << PAGE_SHIFT;
+
+	stage2_apply_range(&kvm->arch.mmu, addr, end, kvm_pgtable_stage2_sync_dirty, false);
+}
+
 static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 {
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
-- 
2.34.1



* [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
                   ` (4 preceding siblings ...)
  2023-08-25  9:35 ` [RFC PATCH v2 5/8] KVM: arm64: Add some HW_DBM related mmu interfaces Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-09-22 16:00   ` Catalin Marinas
  2023-08-25  9:35 ` [RFC PATCH v2 7/8] KVM: arm64: Add KVM_CAP_ARM_HW_DBM Shameer Kolothum
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

From: Keqian Zhu <zhukeqian1@huawei.com>

kvm_arch_mmu_enable_log_dirty_pt_masked() write protects all PTEs between
the ffs and fls of mask. There may be unset bits within that range. This
works well under pure software dirty logging, as software dirty logging is
not active during this process.

But it will unexpectedly clear the dirty status of PTEs when hardware dirty
logging is enabled. So change it to only write protect the selected PTEs.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/kvm/mmu.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f5ae4b97df4d..34251932560e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1132,10 +1132,17 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
 	phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
 	phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+	int rs, re;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	stage2_wp_range(&kvm->arch.mmu, start, end);
+	for_each_set_bitrange(rs, re, &mask, BITS_PER_LONG) {
+		phys_addr_t addr_s, addr_e;
+
+		addr_s = (base_gfn + rs) << PAGE_SHIFT;
+		addr_e = (base_gfn + re) << PAGE_SHIFT;
+		stage2_wp_range(&kvm->arch.mmu, addr_s, addr_e);
+	}
 
 	/*
 	 * Eager-splitting is done when manual-protect is set.  We
-- 
2.34.1



* [RFC PATCH v2 7/8] KVM: arm64: Add KVM_CAP_ARM_HW_DBM
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
                   ` (5 preceding siblings ...)
  2023-08-25  9:35 ` [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-08-25  9:35 ` [RFC PATCH v2 8/8] KVM: arm64: Start up SW/HW combined dirty log Shameer Kolothum
  2023-09-13 17:30 ` [RFC PATCH v2 0/8] KVM: arm64: Implement " Oliver Upton
  8 siblings, 0 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

Add a capability for userspace to enable hardware DBM support
for live migration.

ToDo: Update documentation.
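
For illustration only (not part of this patch), userspace would be expected
to probe and enable the capability on the VM fd before starting dirty page
tracking, roughly along these lines (error handling trimmed):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  static int enable_hw_dbm(int vm_fd)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_ARM_HW_DBM,
          };

          /* KVM advertises the cap only if the system supports HW DBM */
          if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_HW_DBM) <= 0)
                  return -1;

          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }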

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/arm.c              | 13 +++++++++++++
 include/uapi/linux/kvm.h          |  1 +
 3 files changed, 16 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f623b989ddd1..17ac53150a1d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -175,6 +175,8 @@ struct kvm_s2_mmu {
 	struct kvm_mmu_memory_cache split_page_cache;
 	uint64_t split_page_chunk_size;
 
+	bool hwdbm_enabled;  /* KVM_CAP_ARM_HW_DBM enabled */
+
 	struct kvm_arch *arch;
 };
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fd2af63d788d..0dbf2cda40d7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -115,6 +115,16 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->slots_lock);
 		break;
+	case KVM_CAP_ARM_HW_DBM:
+		mutex_lock(&kvm->slots_lock);
+		if (!system_supports_hw_dbm()) {
+			r = -EINVAL;
+		} else {
+			r = 0;
+			kvm->arch.mmu.hwdbm_enabled = true;
+		}
+		mutex_unlock(&kvm->slots_lock);
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -316,6 +326,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES:
 		r = kvm_supported_block_sizes();
 		break;
+	case KVM_CAP_ARM_HW_DBM:
+		r = system_supports_hw_dbm();
+		break;
 	default:
 		r = 0;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..99bd5c0420ba 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1192,6 +1192,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_ARM_HW_DBM 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.34.1



* [RFC PATCH v2 8/8] KVM: arm64: Start up SW/HW combined dirty log
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
                   ` (6 preceding siblings ...)
  2023-08-25  9:35 ` [RFC PATCH v2 7/8] KVM: arm64: Add KVM_CAP_ARM_HW_DBM Shameer Kolothum
@ 2023-08-25  9:35 ` Shameer Kolothum
  2023-09-13 17:30 ` [RFC PATCH v2 0/8] KVM: arm64: Implement " Oliver Upton
  8 siblings, 0 replies; 33+ messages in thread
From: Shameer Kolothum @ 2023-08-25  9:35 UTC (permalink / raw)
  To: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	oliver.upton
  Cc: james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

From: Keqian Zhu <zhukeqian1@huawei.com>

When a user has enabled HW DBM support for live migration, set the
HW DBM bit for the nearby pages (64 pages) of a write-faulting page.
We track the DBM-set pages in a separate bitmap and use that during
dirty log sync, avoiding a full scan of the PTEs.
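
As a rough illustration of the bitmap overhead (assuming 4KiB pages): with
HWDBM_GRANULE_SHIFT = 6, each bit covers 64 pages (256KiB of guest memory),
so a 1GiB memslot needs 262144 / 64 = 4096 bits, i.e. 512 bytes; twice that
(1KiB) is allocated, since dirty log sync keeps a second copy of the ranges
it clears from the bitmap so that they can be scanned again after the TLB
flush.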

Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 arch/arm64/include/asm/kvm_host.h |   6 ++
 arch/arm64/kvm/arm.c              | 125 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/pgtable.c      |  10 +--
 arch/arm64/kvm/mmu.c              |  11 ++-
 4 files changed, 144 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 17ac53150a1d..5f0be57eebc4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvm_s2_mmu {
 };
 
 struct kvm_arch_memory_slot {
+	#define HWDBM_GRANULE_SHIFT 6  /* 64 pages per bit */
+	unsigned long *hwdbm_bitmap;
 };
 
 /**
@@ -901,6 +903,10 @@ struct kvm_vcpu_stat {
 	u64 exits;
 };
 
+int kvm_arm_init_hwdbm_bitmap(struct kvm *kvm, struct kvm_memory_slot *memslot);
+void kvm_arm_destroy_hwdbm_bitmap(struct kvm_memory_slot *memslot);
+void kvm_arm_enable_nearby_hwdbm(struct kvm *kvm, gfn_t gfn);
+
 void kvm_vcpu_preferred_target(struct kvm_vcpu_init *init);
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 0dbf2cda40d7..ab1e2da3bf0d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1540,9 +1540,134 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	return r;
 }
 
+static unsigned long kvm_hwdbm_bitmap_bytes(struct kvm_memory_slot *memslot)
+{
+	unsigned long nbits = DIV_ROUND_UP(memslot->npages, 1 << HWDBM_GRANULE_SHIFT);
+
+	return ALIGN(nbits, BITS_PER_LONG) / 8;
+}
+
+static unsigned long *kvm_second_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+	unsigned long len = kvm_hwdbm_bitmap_bytes(memslot);
+
+	return (void *)memslot->arch.hwdbm_bitmap + len;
+}
+
+/*
+ * Allocate twice space. Refer kvm_arch_sync_dirty_log() to see why the
+ * second space is needed.
+ */
+int kvm_arm_init_hwdbm_bitmap(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+	unsigned long bytes = 2 * kvm_hwdbm_bitmap_bytes(memslot);
+
+	if (!kvm->arch.mmu.hwdbm_enabled)
+		return 0;
+
+	if (memslot->arch.hwdbm_bitmap) {
+		/* Inherited from old memslot */
+		bitmap_clear(memslot->arch.hwdbm_bitmap, 0, bytes * 8);
+	} else {
+		memslot->arch.hwdbm_bitmap = kvzalloc(bytes, GFP_KERNEL_ACCOUNT);
+		if (!memslot->arch.hwdbm_bitmap)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void kvm_arm_destroy_hwdbm_bitmap(struct kvm_memory_slot *memslot)
+{
+	if (!memslot->arch.hwdbm_bitmap)
+		return;
+
+	kvfree(memslot->arch.hwdbm_bitmap);
+	memslot->arch.hwdbm_bitmap = NULL;
+}
+
+/* Add DBM for nearby page tables, but do not cross the memslot boundary */
+void kvm_arm_enable_nearby_hwdbm(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	if (memslot && kvm_slot_dirty_track_enabled(memslot) &&
+	    memslot->arch.hwdbm_bitmap) {
+		unsigned long rel_gfn = gfn - memslot->base_gfn;
+		unsigned long dbm_idx = rel_gfn >> HWDBM_GRANULE_SHIFT;
+		unsigned long start_page, npages;
+
+		if (!test_and_set_bit(dbm_idx, memslot->arch.hwdbm_bitmap)) {
+			start_page = dbm_idx << HWDBM_GRANULE_SHIFT;
+			npages = 1 << HWDBM_GRANULE_SHIFT;
+			npages = min(memslot->npages - start_page, npages);
+			kvm_stage2_set_dbm(kvm, memslot, start_page, npages);
+		}
+	}
+}
+
+/*
+ * We have to find a place to clear hwdbm_bitmap, and clearing it means
+ * clearing the DBM bit of all related pgtables. Note that between clearing
+ * the DBM bits and flushing the TLB, HW dirty logging may still occur, so we
+ * must scan all related pgtables after the TLB flush. Given the above, the
+ * best choice is to clear hwdbm_bitmap before syncing the HW dirty log.
+ */
 void kvm_arch_sync_dirty_log(struct kvm *kvm, struct kvm_memory_slot *memslot)
 {
+	unsigned long *second_bitmap = kvm_second_hwdbm_bitmap(memslot);
+	unsigned long start_page, npages;
+	unsigned int end, rs, re;
+	bool has_hwdbm = false;
+
+	if (!memslot->arch.hwdbm_bitmap)
+		return;
+
+	end = kvm_hwdbm_bitmap_bytes(memslot) * 8;
+	bitmap_clear(second_bitmap, 0, end);
+
+	write_lock(&kvm->mmu_lock);
+	for_each_set_bitrange(rs, re, memslot->arch.hwdbm_bitmap, end) {
+		has_hwdbm = true;
 
+		/*
+		 * Must clear the bitmap before clearing the DBM bits. While we
+		 * clear DBM (the mmu spinlock is released periodically), SW dirty
+		 * tracking may set DBM bits that overlap what we are clearing. So
+		 * if we cleared the bitmap after clearing DBM, the bitmap could be
+		 * cleared with DBM bits left behind, and we would have no chance
+		 * to scan those pgtables anymore.
+		 */
+		bitmap_clear(memslot->arch.hwdbm_bitmap, rs, re - rs);
+
+		/* Record the bitmap cleared */
+		bitmap_set(second_bitmap, rs, re - rs);
+
+		start_page = rs << HWDBM_GRANULE_SHIFT;
+		npages = (re - rs) << HWDBM_GRANULE_SHIFT;
+		npages = min(memslot->npages - start_page, npages);
+		kvm_stage2_clear_dbm(kvm, memslot, start_page, npages);
+	}
+	write_unlock(&kvm->mmu_lock);
+
+	if (!has_hwdbm)
+		return;
+
+	/*
+	 * Ensure vCPU writes that occur after we clear hwdbm_bitmap can be
+	 * caught by the guest memory abort handler.
+	 */
+	kvm_flush_remote_tlbs_memslot(kvm, memslot);
+
+	read_lock(&kvm->mmu_lock);
+	for_each_set_bitrange(rs, re, second_bitmap, end) {
+		start_page = rs << HWDBM_GRANULE_SHIFT;
+		npages = (re - rs) << HWDBM_GRANULE_SHIFT;
+		npages = min(memslot->npages - start_page, npages);
+		kvm_stage2_sync_dirty(kvm, memslot, start_page, npages);
+	}
+	read_unlock(&kvm->mmu_lock);
 }
 
 static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 4552bfb1f274..330912d647c7 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -651,10 +651,10 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 
 #ifdef CONFIG_ARM64_HW_AFDBM
 	/*
-	 * Enable the Hardware Access Flag management, unconditionally
-	 * on all CPUs. In systems that have asymmetric support for the feature
-	 * this allows KVM to leverage hardware support on the subset of cores
-	 * that implement the feature.
+	 * Enable the Hardware Access Flag management and Dirty State management,
+	 * unconditionally on all CPUs. In systems that have asymmetric support for
+	 * the feature this allows KVM to leverage hardware support on the subset of
+	 * cores that implement the feature.
 	 *
 	 * The architecture requires VTCR_EL2.HA to be RES0 (thus ignored by
 	 * hardware) on implementations that do not advertise support for the
@@ -663,7 +663,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	 * HAFDBS. Here be dragons.
 	 */
 	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_CPU_38))
-		vtcr |= VTCR_EL2_HA;
+		vtcr |= VTCR_EL2_HA | VTCR_EL2_HD;
 #endif /* CONFIG_ARM64_HW_AFDBM */
 
 	/* Set the vmid bits */
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 34251932560e..b2fdcd762d70 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1569,14 +1569,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * permissions only if vma_pagesize equals fault_granule. Otherwise,
 	 * kvm_pgtable_stage2_map() should be called to change block size.
 	 */
-	if (fault_status == ESR_ELx_FSC_PERM && vma_pagesize == fault_granule)
+	if (fault_status == ESR_ELx_FSC_PERM && vma_pagesize == fault_granule) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
-	else
+		/* Try to enable HW DBM for nearby pages */
+		if (!ret && vma_pagesize == PAGE_SIZE && writable)
+			kvm_arm_enable_nearby_hwdbm(kvm, gfn);
+	} else {
 		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     memcache,
 					     KVM_PGTABLE_WALK_HANDLE_FAULT |
 					     KVM_PGTABLE_WALK_SHARED);
+	}
 
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret) {
@@ -2046,11 +2050,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	} while (hva < reg_end);
 
 	mmap_read_unlock(current->mm);
-	return ret;
+	return ret ? : kvm_arm_init_hwdbm_bitmap(kvm, new);
 }
 
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 {
+	kvm_arm_destroy_hwdbm_bitmap(slot);
 }
 
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
-- 
2.34.1



* Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
  2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
                   ` (7 preceding siblings ...)
  2023-08-25  9:35 ` [RFC PATCH v2 8/8] KVM: arm64: Start up SW/HW combined dirty log Shameer Kolothum
@ 2023-09-13 17:30 ` Oliver Upton
  2023-09-14  9:47   ` Shameerali Kolothum Thodi
  8 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-13 17:30 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This is to revive the RFC series[1], which makes use of hardware dirty
> bit modifier(DBM) feature(FEAT_HAFDBS) for dirty page tracking, sent
> out by Zhu Keqian sometime back.
> 
> One of the main drawbacks in using the hardware DBM feature for dirty
> page tracking is the additional overhead in scanning the PTEs for dirty
> pages[2]. Also there are no vCPU page faults when we set the DBM bit,
> which may result in higher convergence time during guest migration. 
> 
> This series tries to reduce these overheads by not setting the
> DBM for all the writeable pages during migration and instead uses a
> combined software(current page fault mechanism) and hardware approach
> (set DBM) for dirty page tracking.
> 
> As noted in RFC v1[1],
> "The core idea is that we do not enable hardware dirty at start (do not
> add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> PTs with hardware dirty enabled, so we do not need to scan all PTs."

I'm unconvinced of the value of such a change.

What you're proposing here is complicated and I fear not easily
maintainable. Keeping the *two* sources of dirty state seems likely to
fail (eventually) with some very unfortunate consequences.

The optimization of enabling DBM on neighboring PTEs is presumptive of
the guest access pattern and could incur unnecessary scans of the
stage-2 page table w/ a sufficiently sparse guest access pattern.

> Tests with dirty_log_perf_test with anonymous THP pages shows significant
> improvement in "dirty memory time" as expected but with a hit on
> "get dirty time" .
> 
> ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |       (s)        |
> +---------------------------+----------------+------------------+
> |    dirty memory time      |    4.22        |          0.41    |
> |    get dirty log time     |    0.00047     |          3.25    |
> |    clear dirty log time   |    0.48        |          0.98    |
> +---------------------------------------------------------------+

The vCPU:memory ratio you're testing doesn't seem representative of what
a typical cloud provider would be configuring, and the dirty log
collection is going to scale linearly with the size of guest memory.

Slow dirty log collection is going to matter a lot for VM blackout,
which from experience tends to be the most sensitive period of live
migration for guest workloads.

At least in our testing, the split GET/CLEAR dirty log ioctls
dramatically improved the performance of a write-protection based dirty
tracking scheme, as the false positive rate for dirtied pages is
significantly reduced. FWIW, this is what we use for doing LM on arm64 as
opposed to the D-bit implementation that we use on x86.
       
> In order to get some idea on actual live migration performance,
> I created a VM (96vCPUs, 1GB), ran a redis-benchmark test and
> while the test was in progress initiated live migration(local).
> 
> redis-benchmark -t set -c 900 -n 5000000 --threads 96
> 
> Average of 5 runs shows that benchmark finishes ~10% faster with
> a ~8% increase in "total time" for migration.
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |    (s)           |
> +---------------------------+----------------+------------------+
> | [redis]5000000 requests in|    79.428      |      71.49       |
> | [info migrate]total time  |    8438        |      9097        |
> +---------------------------------------------------------------+

Faster pre-copy performance would help the benchmark complete faster,
but the goal for a live migration should be to minimize the lost
computation for the entire operation. You'd need to test with a
continuous workload rather than one with a finite amount of work.

Also, do you know what live migration scheme you're using here?

-- 
Thanks,
Oliver


* RE: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
  2023-09-13 17:30 ` [RFC PATCH v2 0/8] KVM: arm64: Implement " Oliver Upton
@ 2023-09-14  9:47   ` Shameerali Kolothum Thodi
  2023-09-15  0:36     ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-14  9:47 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm

Hi Oliver,

> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 13 September 2023 18:30
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
> 
> Hi Shameer,
> 
> On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> > Hi,
> >
> > This is to revive the RFC series[1], which makes use of hardware dirty
> > bit modifier(DBM) feature(FEAT_HAFDBS) for dirty page tracking, sent
> > out by Zhu Keqian sometime back.
> >
> > One of the main drawbacks in using the hardware DBM feature for dirty
> > page tracking is the additional overhead in scanning the PTEs for dirty
> > pages[2]. Also there are no vCPU page faults when we set the DBM bit,
> > which may result in higher convergence time during guest migration.
> >
> > This series tries to reduce these overheads by not setting the
> > DBM for all the writeable pages during migration and instead uses a
> > combined software(current page fault mechanism) and hardware
> approach
> > (set DBM) for dirty page tracking.
> >
> > As noted in RFC v1[1],
> > "The core idea is that we do not enable hardware dirty at start (do not
> > add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> > for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> > DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> > PTs with hardware dirty enabled, so we do not need to scan all PTs."
> 
> I'm unconvinced of the value of such a change.
> 
> What you're proposing here is complicated and I fear not easily
> maintainable. Keeping the *two* sources of dirty state seems likely to
> fail (eventually) with some very unfortunate consequences.

It does add complexity to the dirty state management code. I have tried
to separate the code paths using appropriate flags etc. to make it more
manageable. But this is probably one area we can work on if the overall
approach does have some benefits.

> The optimization of enabling DBM on neighboring PTEs is presumptive of
> the guest access pattern and could incur unnecessary scans of the
> stage-2 page table w/ a sufficiently sparse guest access pattern.

Agree. This may not work as intended for all workloads, especially
if the access pattern is sparse. But I am still hopeful that it will be
beneficial for workloads that have continuous write patterns. And we do
have a knob to turn it on or off.

> > Tests with dirty_log_perf_test with anonymous THP pages shows significant
> > improvement in "dirty memory time" as expected but with a hit on
> > "get dirty time" .
> >
> > ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> >
> > +---------------------------+----------------+------------------+
> > |                           |   6.5-rc5      | 6.5-rc5 + series |
> > |                           |     (s)        |       (s)        |
> > +---------------------------+----------------+------------------+
> > |    dirty memory time      |    4.22        |          0.41    |
> > |    get dirty log time     |    0.00047     |          3.25    |
> > |    clear dirty log time   |    0.48        |          0.98    |
> > +---------------------------------------------------------------+
> 
> The vCPU:memory ratio you're testing doesn't seem representative of what
> a typical cloud provider would be configuring, and the dirty log
> collection is going to scale linearly with the size of guest memory.

I was limited by the test setup I had. I will give it a go on a system
with more memory.
 
> Slow dirty log collection is going to matter a lot for VM blackout,
> which from experience tends to be the most sensitive period of live
> migration for guest workloads.
> 
> At least in our testing, the split GET/CLEAR dirty log ioctls
> dramatically improved the performance of a write-protection based ditry
> tracking scheme, as the false positive rate for dirtied pages is
> significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> opposed to the D-bit implemenation that we use on x86.

I guess by D-bit on x86 you mean the PML feature. Unfortunately, that is
something we still lack on ARM.
 
> > In order to get some idea on actual live migration performance,
> > I created a VM (96vCPUs, 1GB), ran a redis-benchmark test and
> > while the test was in progress initiated live migration(local).
> >
> > redis-benchmark -t set -c 900 -n 5000000 --threads 96
> >
> > Average of 5 runs shows that benchmark finishes ~10% faster with
> > a ~8% increase in "total time" for migration.
> >
> > +---------------------------+----------------+------------------+
> > |                           |   6.5-rc5      | 6.5-rc5 + series |
> > |                           |     (s)        |    (s)           |
> > +---------------------------+----------------+------------------+
> > | [redis]5000000 requests in|    79.428      |      71.49       |
> > | [info migrate]total time  |    8438        |      9097        |
> > +---------------------------------------------------------------+
> 
> Faster pre-copy performance would help the benchmark complete faster,
> but the goal for a live migration should be to minimize the lost
> computation for the entire operation. You'd need to test with a
> continuous workload rather than one with a finite amount of work.

Ok. Though the above is not representative of a real workload, I thought
it gives some idea of how the "guest uptime improvement" benefits the
overall availability of the workload during migration. I will check within our
wider team to see if I can set up a more suitable test/workload to show some
improvement with this approach.

Please let me know if there is a specific workload you have in mind.

> Also, do you know what live migration scheme you're using here?

The above is the default one (pre-copy).

Thanks for getting back on this. I would appreciate it if you could take a quick
glance through the rest of the patches as well for any gross errors, especially
with respect to page table walk locking, usage of the DBM flags, etc.

Thanks,
Shameer 


* Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
  2023-09-14  9:47   ` Shameerali Kolothum Thodi
@ 2023-09-15  0:36     ` Oliver Upton
  2023-09-18  9:55       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-15  0:36 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm

On Thu, Sep 14, 2023 at 09:47:48AM +0000, Shameerali Kolothum Thodi wrote:

[...]

> > What you're proposing here is complicated and I fear not easily
> > maintainable. Keeping the *two* sources of dirty state seems likely to
> > fail (eventually) with some very unfortunate consequences.
> 
> It does add complexity to the dirty state management code. I have tried
> to separate the code path using appropriate FLAGS etc to make it more
> manageable. But this is probably one area we can work on if the overall
> approach does have some benefits.

I'd be a bit more amenable to a solution that would select either
write-protection or dirty state management, but not both.

> > The vCPU:memory ratio you're testing doesn't seem representative of what
> > a typical cloud provider would be configuring, and the dirty log
> > collection is going to scale linearly with the size of guest memory.
> 
> I was limited by the test setup I had. I will give it a go with a higher mem
> system. 

Thanks. Dirty log collection needn't be single threaded, but the
fundamental concern of dirty log collection time scaling linearly w.r.t.
the size of memory remains. Write-protection helps spread the cost of
collecting dirty state out across all the vCPU threads.

There could be some value in giving userspace the ability to parallelize
calls to dirty log ioctls to work on non-intersecting intervals.

> > Slow dirty log collection is going to matter a lot for VM blackout,
> > which from experience tends to be the most sensitive period of live
> > migration for guest workloads.
> > 
> > At least in our testing, the split GET/CLEAR dirty log ioctls
> > dramatically improved the performance of a write-protection based dirty
> > tracking scheme, as the false positive rate for dirtied pages is
> > significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> > opposed to the D-bit implementation that we use on x86.
> 
> Guess, by D-bit on x86 you mean the PML feature. Unfortunately that is
> something we lack on ARM yet.

Sorry, this was rather nonspecific. I was describing the pre-copy
strategies we're using at Google (out of tree). We're carrying patches
to use EPT D-bit for exitless dirty tracking.

> > Faster pre-copy performance would help the benchmark complete faster,
> > but the goal for a live migration should be to minimize the lost
> > computation for the entire operation. You'd need to test with a
> > continuous workload rather than one with a finite amount of work.
> 
> Ok. Though the above is not representative of a real workload, I thought
> it gives some idea on how "Guest up time improvement" is benefitting the
> overall availability of the workload during migration. I will check within our
> wider team to see if I can setup a more suitable test/workload to show some
> improvement with this approach. 
> 
> Please let me know if there is a specific workload you have in mind.

No objection to the workload you've chosen, I'm more concerned about the
benchmark finishing before live migration completes.

What I'm looking for is something like this:

 - Calculate the ops/sec your benchmark completes in steady state

 - Do a live migration and sample the rate throughout the benchmark,
   accounting for VM blackout time

 - Calculate the area under the curve of:

     y = steady_state_rate - live_migration_rate(t)

 - Compare the area under the curve for write-protection and your DBM
   approach.
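
Roughly, something like this (an illustrative userspace sketch only; all the
names and numbers are made up, it just integrates the sampled rate deficit
over the migration window):

/*
 * Illustration only: estimate lost computation by integrating the sampled
 * rate deficit over the migration window. All names and numbers are made up.
 */
#include <stddef.h>
#include <stdio.h>

static double lost_work(double steady_rate, const double *rate, size_t n, double dt)
{
	double lost = 0.0;
	size_t i;

	for (i = 0; i < n; i++)		/* one sample every dt seconds */
		lost += (steady_rate - rate[i]) * dt;

	return lost;			/* total ops lost across the migration */
}

int main(void)
{
	/* sampled benchmark rate in ops/sec, including the blackout window */
	double samples[] = { 95000, 40000, 0, 60000, 90000 };

	printf("lost ops: %.0f\n", lost_work(100000, samples, 5, 1.0));
	return 0;
}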

> Thanks for getting back on this. Appreciate if you can do a quick glance
> through the rest of the patches as well for any gross errors especially with
> respect to page table walk locking, usage of DBM FLAGS etc.

I'll give it a read when I have some spare cycles. To be entirely clear,
I don't have any fundamental objections to using DBM for dirty tracking.
I just want to make sure that all alternatives have been considered
in the current scheme before we seriously consider a new approach with
its own set of tradeoffs.

-- 
Thanks,
Oliver


* Re: [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support
  2023-08-25  9:35 ` [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support Shameer Kolothum
@ 2023-09-15 22:05   ` Oliver Upton
  2023-09-18  9:52     ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-15 22:05 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:22AM +0100, Shameer Kolothum wrote:
> KVM_PGTABLE_WALK_HW_DBM - Indicates page table walk is for HW DBM
>  related updates.
> 
> No functional changes here. Only apply any HW DBM bit updates to last
> level only. These will be used by a future commit where we will add
> support for HW DBM.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  3 +++
>  arch/arm64/kvm/hyp/pgtable.c         | 10 ++++++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index d3e354bb8351..3f96bdd2086f 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -219,6 +219,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
>   * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table entries
>   *					without Cache maintenance
>   *					operations required.
> + * @KVM_PGTABLE_WALK_HW_DBM:		Indicates that the attribute update is
> + *					HW DBM related.
>   */
>  enum kvm_pgtable_walk_flags {
>  	KVM_PGTABLE_WALK_LEAF			= BIT(0),
> @@ -228,6 +230,7 @@ enum kvm_pgtable_walk_flags {
>  	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
>  	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
>  	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
> +	KVM_PGTABLE_WALK_HW_DBM			= BIT(7),
>  };

Rather than making this DBM specific, call it KVM_PGTABLE_WALK_FORCE_PTE
and get rid of stage2_map_data::force_pte. Then it becomes immediately
obvious what this flag implies.
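
i.e. something along these lines (a sketch only, mirroring the enum in this
patch; the FORCE_PTE name and helper are just placeholders for whatever we
settle on):

enum kvm_pgtable_walk_flags {
	KVM_PGTABLE_WALK_LEAF			= BIT(0),
	/* ... existing flags ... */
	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
	KVM_PGTABLE_WALK_FORCE_PTE		= BIT(7),	/* restrict updates to last-level entries */
};

static bool kvm_pgtable_walk_force_pte(const struct kvm_pgtable_visit_ctx *ctx)
{
	return ctx->flags & KVM_PGTABLE_WALK_FORCE_PTE;
}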

-- 
Thanks,
Oliver


* Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-08-25  9:35 ` [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces Shameer Kolothum
@ 2023-09-15 22:22   ` Oliver Upton
  2023-09-18  9:53     ` Shameerali Kolothum Thodi
  2023-09-22 15:24   ` Catalin Marinas
  1 sibling, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-15 22:22 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:23AM +0100, Shameer Kolothum wrote:
> From: Keqian Zhu <zhukeqian1@huawei.com>
> 
> This adds set_dbm, clear_dbm and sync_dirty interfaces in pgtable
> layer. (1) set_dbm: Set DBM bit for last level PTE of a specified
> range. TLBI is completed. (2) clear_dbm: Clear DBM bit for last
> level PTE of a specified range. TLBI is not acted. (3) sync_dirty:
> Scan last level PTE of a specific range. Log dirty if PTE is writeable.
> 
> Besides, save the dirty state of PTE if it's invalided by map or
> unmap.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 45 +++++++++++++
>  arch/arm64/kernel/image-vars.h       |  2 +
>  arch/arm64/kvm/hyp/pgtable.c         | 98 ++++++++++++++++++++++++++++
>  3 files changed, 145 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 3f96bdd2086f..a12add002b89 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -578,6 +578,51 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
>   */
>  int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
>  
> +/**
> + * kvm_pgtable_stage2_clear_dbm() - Clear DBM of guest stage-2 address range
> + *                                  without TLB invalidation (only last level).
> + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
> + * @addr:	Intermediate physical address from which to clear DBM,
> + * @size:	Size of the range.
> + *
> + * The offset of @addr within a page is ignored and @size is rounded-up to
> + * the next page boundary.
> + *
> + * Note that it is the caller's responsibility to invalidate the TLB after
> + * calling this function to ensure that the disabled HW dirty are visible
> + * to the CPUs.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
> +
> +/**
> + * kvm_pgtable_stage2_set_dbm() - Set DBM of guest stage-2 address range to
> + *                                enable HW dirty (only last level).
> + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
> + * @addr:	Intermediate physical address from which to set DBM.
> + * @size:	Size of the range.
> + *
> + * The offset of @addr within a page is ignored and @size is rounded-up to
> + * the next page boundary.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size);
> +
> +/**
> + * kvm_pgtable_stage2_sync_dirty() - Sync HW dirty state into memslot.
> + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
> + * @addr:	Intermediate physical address from which to sync.
> + * @size:	Size of the range.
> + *
> + * The offset of @addr within a page is ignored and @size is rounded-up to
> + * the next page boundary.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr, u64 size);
> +
>  /**
>   * kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2 address range
>   *                                  without TLB invalidation.
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index 35f3c7959513..2ca600e3d637 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -68,6 +68,8 @@ KVM_NVHE_ALIAS(__hyp_stub_vectors);
>  KVM_NVHE_ALIAS(vgic_v2_cpuif_trap);
>  KVM_NVHE_ALIAS(vgic_v3_cpuif_trap);
>  
> +KVM_NVHE_ALIAS(mark_page_dirty);
> +

This doesn't make any sense besides satisfying the linker. The hyp page
table walkers will *never* need to mark a page as dirty.

Consider adding a new function pointer to kvm_pgtable_mm_ops and only
set it for the 'normal' stage-2 mm ops in mmu.c.

> +static bool stage2_pte_writeable(kvm_pte_t pte)
> +{
> +	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
> +}
> +
> +static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx,
> +			      kvm_pte_t new)
> +{
> +	kvm_pte_t old_pte, pte = ctx->old;
> +
> +	/* Only set DBM if page is writeable */
> +	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM) && !stage2_pte_writeable(pte))
> +		return;
> +
> +	/* Clear DBM walk is not shared, update */
> +	if (!kvm_pgtable_walk_shared(ctx)) {
> +		WRITE_ONCE(*ctx->ptep, new);
> +		return;
> +	}
> +
> +	do {
> +		old_pte = pte;
> +		pte = new;
> +
> +		if (old_pte == pte)
> +			break;
> +
> +		pte = cmpxchg_relaxed(ctx->ptep, old_pte, pte);
> +	} while (pte != old_pte);
> +}
> +

We don't need a separate walker for updating the DBM state in the
page tables. Can't we treat it like any other permission bit and use
something like kvm_pgtable_stage2_relax_perms()?

Also, what is the purpose of retrying the update on a descriptor if the
cmpxchg() fails? Everywhere else in the page table walker code we bail
and retry execution since that is a clear indication of a race. In all
likelihood the other vCPU thread was either trying to update DBM or
service a permission fault.

>  static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx, kvm_pte_t new)
>  {
> +	if (kvm_pgtable_walk_hw_dbm(ctx)) {
> +		kvm_update_hw_dbm(ctx, new);
> +		return true;
> +	}
> +

Why are you trying to circumvent the primitive for setting stage-2 PTEs?

>  	if (!kvm_pgtable_walk_shared(ctx)) {
>  		WRITE_ONCE(*ctx->ptep, new);
>  		return true;
> @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>  	    stage2_pte_executable(new))
>  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>  
> +	/* Save the possible hardware dirty info */
> +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> +	    stage2_pte_writeable(ctx->old))
> +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu), ctx->addr >> PAGE_SHIFT);
> +
>  	stage2_make_pte(ctx, new);
>  
>  	return 0;
> @@ -1125,6 +1168,11 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
>  	 */
>  	stage2_unmap_put_pte(ctx, mmu, mm_ops);
>  
> +	/* Save the possible hardware dirty info */
> +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> +	    stage2_pte_writeable(ctx->old))
> +		mark_page_dirty(kvm_s2_mmu_to_kvm(mmu), ctx->addr >> PAGE_SHIFT);
> +
>  	if (need_flush && mm_ops->dcache_clean_inval_poc)
>  		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
>  					       kvm_granule_size(ctx->level));
> @@ -1230,6 +1278,30 @@ static int stage2_update_leaf_attrs(struct kvm_pgtable *pgt, u64 addr,
>  	return 0;
>  }
>  
> +int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
> +{
> +	int ret;
> +	u64 offset;
> +
> +	ret = stage2_update_leaf_attrs(pgt, addr, size, KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0,
> +				       NULL, NULL, KVM_PGTABLE_WALK_HW_DBM |
> +				       KVM_PGTABLE_WALK_SHARED);
> +	if (!ret)
> +		return ret;
> +
> +	for (offset = 0; offset < size; offset += PAGE_SIZE)
> +		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr + offset, 3);
> +
> +	return 0;
> +}
> +
> +int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr, u64 size)
> +{
> +	return stage2_update_leaf_attrs(pgt, addr, size,
> +					0, KVM_PTE_LEAF_ATTR_HI_S2_DBM,
> +					NULL, NULL, KVM_PGTABLE_WALK_HW_DBM);
> +}
> +
>  int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size)
>  {
>  	return stage2_update_leaf_attrs(pgt, addr, size, 0,
> @@ -1329,6 +1401,32 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
>  	return ret;
>  }
>  
> +static int stage2_sync_dirty_walker(const struct kvm_pgtable_visit_ctx *ctx,
> +				    enum kvm_pgtable_walk_flags visit)
> +{
> +	kvm_pte_t pte = READ_ONCE(*ctx->ptep);

ctx->old is fetched in __kvm_pgtable_visit(), and the whole
parallelization scheme for page table walkers depends on us being
careful to read the PTE once per level. Do you need to reread it here?
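
Something like the below should be enough (a sketch only; it assumes struct
kvm is passed through as the walker argument, which may not match how the
rest of the series is wired up):

static int stage2_sync_dirty_walker(const struct kvm_pgtable_visit_ctx *ctx,
				    enum kvm_pgtable_walk_flags visit)
{
	struct kvm *kvm = ctx->arg;	/* assumption: kvm passed as the walker argument */

	/* ctx->old is the single PTE read taken by __kvm_pgtable_visit() */
	if (ctx->level == (KVM_PGTABLE_MAX_LEVELS - 1) &&
	    stage2_pte_writeable(ctx->old))
		mark_page_dirty(kvm, ctx->addr >> PAGE_SHIFT);

	return 0;
}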

-- 
Thanks,
Oliver


* Re: [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages
  2023-08-25  9:35 ` [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages Shameer Kolothum
@ 2023-09-15 22:54   ` Oliver Upton
  2023-09-18  9:54     ` Shameerali Kolothum Thodi
  2023-09-22 15:40   ` Catalin Marinas
  1 sibling, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-15 22:54 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, catalin.marinas,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Aug 25, 2023 at 10:35:24AM +0100, Shameer Kolothum wrote:
> We only set DBM if the page is writeable (S2AP[1] == 1). But once migration
> starts, CLEAR_LOG path will write protect the pages (S2AP[1] = 0) and there
> isn't an easy way to differentiate the writeable pages that gets write
> protected from read-only pages as we only have S2AP[1] bit to check.
> 
> Introduced a ctx->flag KVM_PGTABLE_WALK_WC_HINT to identify the dirty page
> tracking related write-protect page table walk and used one of the "Reserved
> for software use" bit in page descriptor to mark a page as "writeable-clean". 
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  5 +++++
>  arch/arm64/kvm/hyp/pgtable.c         | 25 ++++++++++++++++++++++---
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index a12add002b89..67bcbc5984f9 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -190,6 +190,8 @@ enum kvm_pgtable_prot {
>  #define KVM_PGTABLE_PROT_RW	(KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
>  #define KVM_PGTABLE_PROT_RWX	(KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
>  
> +#define KVM_PGTABLE_PROT_WC	KVM_PGTABLE_PROT_SW0  /*write-clean*/
> +
>  #define PKVM_HOST_MEM_PROT	KVM_PGTABLE_PROT_RWX
>  #define PKVM_HOST_MMIO_PROT	KVM_PGTABLE_PROT_RW
>  
> @@ -221,6 +223,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
>   *					operations required.
>   * @KVM_PGTABLE_WALK_HW_DBM:		Indicates that the attribute update is
>   *					HW DBM related.
> + * @KVM_PGTABLE_WALK_WC_HINT:		Update the page as writeable-clean(software attribute)
> + *					if we are write protecting a writeable page.

This really looks like a permission bit, not a walker flag. This should
be defined in kvm_pgtable_prot and converted to the hardware definition
in stage2_set_prot_attr(). Also, the first time I saw 'WC' I read it as
'write-combine', not writable-clean.

As I understand it, the only need for an additional software bit here is
to identify neighboring PTEs that can have DBM set while we're in the
middle of the walk right?

-- 
Thanks,
Oliver


* RE: [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support
  2023-09-15 22:05   ` Oliver Upton
@ 2023-09-18  9:52     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-18  9:52 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm



> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 15 September 2023 23:06
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 2/8] KVM: arm64: Add
> KVM_PGTABLE_WALK_HW_DBM for HW DBM support
> 
> Hi Shameer,
> 
> On Fri, Aug 25, 2023 at 10:35:22AM +0100, Shameer Kolothum wrote:
> > KVM_PGTABLE_WALK_HW_DBM - Indicates page table walk is for HW
> DBM
> >  related updates.
> >
> > No functional changes here. Only apply any HW DBM bit updates to last
> > level only. These will be used by a future commit where we will add
> > support for HW DBM.
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  3 +++
> >  arch/arm64/kvm/hyp/pgtable.c         | 10 ++++++++++
> >  2 files changed, 13 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h
> b/arch/arm64/include/asm/kvm_pgtable.h
> > index d3e354bb8351..3f96bdd2086f 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -219,6 +219,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64
> addr, u64 end,
> >   * @KVM_PGTABLE_WALK_SKIP_CMO:		Visit and update table
> entries
> >   *					without Cache maintenance
> >   *					operations required.
> > + * @KVM_PGTABLE_WALK_HW_DBM:		Indicates that the attribute
> update is
> > + *					HW DBM related.
> >   */
> >  enum kvm_pgtable_walk_flags {
> >  	KVM_PGTABLE_WALK_LEAF			= BIT(0),
> > @@ -228,6 +230,7 @@ enum kvm_pgtable_walk_flags {
> >  	KVM_PGTABLE_WALK_HANDLE_FAULT		= BIT(4),
> >  	KVM_PGTABLE_WALK_SKIP_BBM_TLBI		= BIT(5),
> >  	KVM_PGTABLE_WALK_SKIP_CMO		= BIT(6),
> > +	KVM_PGTABLE_WALK_HW_DBM			= BIT(7),
> >  };
> 
> Rather than making this DBM specific, call it
> KVM_PGTABLE_WALK_FORCE_PTE
> and get rid of stage2_map_data::force_pte. Then it becomes immediately
> obvious what this flag implies.

Ok. Will change accordingly.

Thanks,
Shameer


* RE: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-09-15 22:22   ` Oliver Upton
@ 2023-09-18  9:53     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-18  9:53 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm



> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 15 September 2023 23:23
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related
> pgtable interfaces
> 
> Hi Shameer,
> 
> On Fri, Aug 25, 2023 at 10:35:23AM +0100, Shameer Kolothum wrote:
> > From: Keqian Zhu <zhukeqian1@huawei.com>
> >
> > This adds set_dbm, clear_dbm and sync_dirty interfaces in pgtable
> > layer. (1) set_dbm: Set DBM bit for last level PTE of a specified
> > range. TLBI is completed. (2) clear_dbm: Clear DBM bit for last
> > level PTE of a specified range. TLBI is not acted. (3) sync_dirty:
> > Scan last level PTE of a specific range. Log dirty if PTE is writeable.
> >
> > Besides, save the dirty state of PTE if it's invalided by map or
> > unmap.
> >
> > Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h | 45 +++++++++++++
> >  arch/arm64/kernel/image-vars.h       |  2 +
> >  arch/arm64/kvm/hyp/pgtable.c         | 98
> ++++++++++++++++++++++++++++
> >  3 files changed, 145 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h
> b/arch/arm64/include/asm/kvm_pgtable.h
> > index 3f96bdd2086f..a12add002b89 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -578,6 +578,51 @@ int kvm_pgtable_stage2_set_owner(struct
> kvm_pgtable *pgt, u64 addr, u64 size,
> >   */
> >  int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64
> size);
> >
> > +/**
> > + * kvm_pgtable_stage2_clear_dbm() - Clear DBM of guest stage-2 address
> range
> > + *                                  without TLB invalidation (only
> last level).
> > + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
> > + * @addr:	Intermediate physical address from which to clear DBM,
> > + * @size:	Size of the range.
> > + *
> > + * The offset of @addr within a page is ignored and @size is rounded-up
> to
> > + * the next page boundary.
> > + *
> > + * Note that it is the caller's responsibility to invalidate the TLB after
> > + * calling this function to ensure that the disabled HW dirty are visible
> > + * to the CPUs.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> > + */
> > +int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr,
> u64 size);
> > +
> > +/**
> > + * kvm_pgtable_stage2_set_dbm() - Set DBM of guest stage-2 address
> range to
> > + *                                enable HW dirty (only last level).
> > + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
> > + * @addr:	Intermediate physical address from which to set DBM.
> > + * @size:	Size of the range.
> > + *
> > + * The offset of @addr within a page is ignored and @size is rounded-up
> to
> > + * the next page boundary.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> > + */
> > +int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64
> size);
> > +
> > +/**
> > + * kvm_pgtable_stage2_sync_dirty() - Sync HW dirty state into memslot.
> > + * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init().
> > + * @addr:	Intermediate physical address from which to sync.
> > + * @size:	Size of the range.
> > + *
> > + * The offset of @addr within a page is ignored and @size is rounded-up
> to
> > + * the next page boundary.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> > + */
> > +int kvm_pgtable_stage2_sync_dirty(struct kvm_pgtable *pgt, u64 addr,
> u64 size);
> > +
> >  /**
> >   * kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2
> address range
> >   *                                  without TLB invalidation.
> > diff --git a/arch/arm64/kernel/image-vars.h
> b/arch/arm64/kernel/image-vars.h
> > index 35f3c7959513..2ca600e3d637 100644
> > --- a/arch/arm64/kernel/image-vars.h
> > +++ b/arch/arm64/kernel/image-vars.h
> > @@ -68,6 +68,8 @@ KVM_NVHE_ALIAS(__hyp_stub_vectors);
> >  KVM_NVHE_ALIAS(vgic_v2_cpuif_trap);
> >  KVM_NVHE_ALIAS(vgic_v3_cpuif_trap);
> >
> > +KVM_NVHE_ALIAS(mark_page_dirty);
> > +
> 
> This doesn't make any sense besides satisfying the linker. The hyp page
> table walkers will *never* need to mark a page as dirty.
> 
> Consider adding a new function pointer to kvm_pgtable_mm_ops and only
> set it for the 'normal' stage-2 mm ops in mmu.c.

Yes, this is only to satisfy the linker for NVHE. I will add the function pointer
to handle this.

> 
> > +static bool stage2_pte_writeable(kvm_pte_t pte)
> > +{
> > +	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
> > +}
> > +
> > +static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx,
> > +			      kvm_pte_t new)
> > +{
> > +	kvm_pte_t old_pte, pte = ctx->old;
> > +
> > +	/* Only set DBM if page is writeable */
> > +	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM)
> && !stage2_pte_writeable(pte))
> > +		return;
> > +
> > +	/* Clear DBM walk is not shared, update */
> > +	if (!kvm_pgtable_walk_shared(ctx)) {
> > +		WRITE_ONCE(*ctx->ptep, new);
> > +		return;
> > +	}
> > +
> > +	do {
> > +		old_pte = pte;
> > +		pte = new;
> > +
> > +		if (old_pte == pte)
> > +			break;
> > +
> > +		pte = cmpxchg_relaxed(ctx->ptep, old_pte, pte);
> > +	} while (pte != old_pte);
> > +}
> > +
> 
> We don't need a separate walker for updating the DBM state in the
> page tables. Can't we treat it like any other permission bit and use
> something like kvm_pgtable_stage2_relax_perms()?

I treated it separately to isolate it from the rest of the PTE updates for ease of
debugging/testing initially. But yes, it looks like we can treat it generically.

> Also, what is the purpose of retrying the update on a descriptor if the
> cmpxchg() fails? Everywhere else in the page table walker code we bail
> and retry execution since that is a clear indication of a race. In all
> likelihood the other vCPU thread was either trying to update DBM or
> service a permission fault.

The retry was added to avoid any situation where we return without setting
the DBM and end up scanning for dirty pages regardless. I will add
some debug info to see the number of races we are getting and their impact
on performance.

> 
> >  static bool stage2_try_set_pte(const struct kvm_pgtable_visit_ctx *ctx,
> kvm_pte_t new)
> >  {
> > +	if (kvm_pgtable_walk_hw_dbm(ctx)) {
> > +		kvm_update_hw_dbm(ctx, new);
> > +		return true;
> > +	}
> > +
> 
> Why are you trying to circumvent the primitive for setting stage-2 PTEs?

I am not sure it actually circumvents it from a functional point of view, as the
_WALK_HW_DBM flag is only set from the set_dbm/clear_dbm paths. But yes, we
can probably do better if we get rid of the above retry logic.

> 
> >  	if (!kvm_pgtable_walk_shared(ctx)) {
> >  		WRITE_ONCE(*ctx->ptep, new);
> >  		return true;
> > @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const
> struct kvm_pgtable_visit_ctx *ctx,
> >  	    stage2_pte_executable(new))
> >  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops),
> granule);
> >
> > +	/* Save the possible hardware dirty info */
> > +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> > +	    stage2_pte_writeable(ctx->old))
> > +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu), ctx->addr >>
> PAGE_SHIFT);
> > +
> >  	stage2_make_pte(ctx, new);
> >
> >  	return 0;
> > @@ -1125,6 +1168,11 @@ static int stage2_unmap_walker(const struct
> kvm_pgtable_visit_ctx *ctx,
> >  	 */
> >  	stage2_unmap_put_pte(ctx, mmu, mm_ops);
> >
> > +	/* Save the possible hardware dirty info */
> > +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> > +	    stage2_pte_writeable(ctx->old))
> > +		mark_page_dirty(kvm_s2_mmu_to_kvm(mmu), ctx->addr >>
> PAGE_SHIFT);
> > +
> >  	if (need_flush && mm_ops->dcache_clean_inval_poc)
> >  		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old,
> mm_ops),
> >  					       kvm_granule_size(ctx->level));
> > @@ -1230,6 +1278,30 @@ static int stage2_update_leaf_attrs(struct
> kvm_pgtable *pgt, u64 addr,
> >  	return 0;
> >  }
> >
> > +int kvm_pgtable_stage2_set_dbm(struct kvm_pgtable *pgt, u64 addr, u64
> size)
> > +{
> > +	int ret;
> > +	u64 offset;
> > +
> > +	ret = stage2_update_leaf_attrs(pgt, addr, size,
> KVM_PTE_LEAF_ATTR_HI_S2_DBM, 0,
> > +				       NULL, NULL, KVM_PGTABLE_WALK_HW_DBM |
> > +				       KVM_PGTABLE_WALK_SHARED);
> > +	if (!ret)
> > +		return ret;
> > +
> > +	for (offset = 0; offset < size; offset += PAGE_SIZE)
> > +		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr +
> offset, 3);
> > +
> > +	return 0;
> > +}
> > +
> > +int kvm_pgtable_stage2_clear_dbm(struct kvm_pgtable *pgt, u64 addr,
> u64 size)
> > +{
> > +	return stage2_update_leaf_attrs(pgt, addr, size,
> > +					0, KVM_PTE_LEAF_ATTR_HI_S2_DBM,
> > +					NULL, NULL, KVM_PGTABLE_WALK_HW_DBM);
> > +}
> > +
> >  int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr,
> u64 size)
> >  {
> >  	return stage2_update_leaf_attrs(pgt, addr, size, 0,
> > @@ -1329,6 +1401,32 @@ int kvm_pgtable_stage2_relax_perms(struct
> kvm_pgtable *pgt, u64 addr,
> >  	return ret;
> >  }
> >
> > +static int stage2_sync_dirty_walker(const struct kvm_pgtable_visit_ctx
> *ctx,
> > +				    enum kvm_pgtable_walk_flags visit)
> > +{
> > +	kvm_pte_t pte = READ_ONCE(*ctx->ptep);
> 
> ctx->old is fetched in __kvm_pgtable_visit(), and the whole
> parallelization scheme for page table walkers depends on us being
> careful to read the PTE once per level. Do you need to reread it here?

Nope. That's a mistake. Will change that.

Thanks,
Shameer



* RE: [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages
  2023-09-15 22:54   ` Oliver Upton
@ 2023-09-18  9:54     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-18  9:54 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm



> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 15 September 2023 23:54
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously
> writeable pages
> 
> On Fri, Aug 25, 2023 at 10:35:24AM +0100, Shameer Kolothum wrote:
> > We only set DBM if the page is writeable (S2AP[1] == 1). But once
> migration
> > starts, CLEAR_LOG path will write protect the pages (S2AP[1] = 0) and
> there
> > isn't an easy way to differentiate the writeable pages that gets write
> > protected from read-only pages as we only have S2AP[1] bit to check.
> >
> > Introduced a ctx->flag KVM_PGTABLE_WALK_WC_HINT to identify the
> dirty page
> > tracking related write-protect page table walk and used one of the
> "Reserved
> > for software use" bit in page descriptor to mark a page as
> "writeable-clean".
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h |  5 +++++
> >  arch/arm64/kvm/hyp/pgtable.c         | 25
> ++++++++++++++++++++++---
> >  2 files changed, 27 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h
> b/arch/arm64/include/asm/kvm_pgtable.h
> > index a12add002b89..67bcbc5984f9 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -190,6 +190,8 @@ enum kvm_pgtable_prot {
> >  #define KVM_PGTABLE_PROT_RW	(KVM_PGTABLE_PROT_R |
> KVM_PGTABLE_PROT_W)
> >  #define KVM_PGTABLE_PROT_RWX	(KVM_PGTABLE_PROT_RW |
> KVM_PGTABLE_PROT_X)
> >
> > +#define KVM_PGTABLE_PROT_WC	KVM_PGTABLE_PROT_SW0
> /*write-clean*/
> > +
> >  #define PKVM_HOST_MEM_PROT	KVM_PGTABLE_PROT_RWX
> >  #define PKVM_HOST_MMIO_PROT	KVM_PGTABLE_PROT_RW
> >
> > @@ -221,6 +223,8 @@ typedef bool (*kvm_pgtable_force_pte_cb_t)(u64
> addr, u64 end,
> >   *					operations required.
> >   * @KVM_PGTABLE_WALK_HW_DBM:		Indicates that the attribute
> update is
> >   *					HW DBM related.
> > + * @KVM_PGTABLE_WALK_WC_HINT:		Update the page as
> writeable-clean(software attribute)
> > + *					if we are write protecting a writeable page.
> 
> This really looks like a permission bit, not a walker flag. This should
> be defined in kvm_pgtable_prot and converted to the hardware definition
> in stage2_set_prot_attr(). Also, the first time I saw 'WC' I read it as
> 'write-combine', not writable-clean.

Ok. I will have a go as suggested above. In fact, "writeable-clean" is an
ARM ARM term; I will make it more specific.

> 
> As I understand it, the only need for an additional software bit here is
> to identify neighboring PTEs that can have DBM set while we're in the
> middle of the walk right?

Yes, that is the purpose.

Thanks,
Shameer


* RE: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
  2023-09-15  0:36     ` Oliver Upton
@ 2023-09-18  9:55       ` Shameerali Kolothum Thodi
  2023-09-20 21:12         ` Oliver Upton
  2023-10-12  7:51         ` Shameerali Kolothum Thodi
  0 siblings, 2 replies; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-18  9:55 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm



> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 15 September 2023 01:36
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
> 
> On Thu, Sep 14, 2023 at 09:47:48AM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> [...]
> 
> > > What you're proposing here is complicated and I fear not easily
> > > maintainable. Keeping the *two* sources of dirty state seems likely to
> > > fail (eventually) with some very unfortunate consequences.
> >
> > It does add complexity to the dirty state management code. I have tried
> > to separate the code path using appropriate FLAGS etc to make it more
> > manageable. But this is probably one area we can work on if the overall
> > approach does have some benefits.
> 
> I'd be a bit more amenable to a solution that would select either
> write-protection or dirty state management, but not both.
> 
> > > The vCPU:memory ratio you're testing doesn't seem representative of
> what
> > > a typical cloud provider would be configuring, and the dirty log
> > > collection is going to scale linearly with the size of guest memory.
> >
> > I was limited by the test setup I had. I will give it a go with a higher mem
> > system.
> 
> Thanks. Dirty log collection needn't be single threaded, but the
> fundamental concern of dirty log collection time scaling linearly w.r.t.
> the size to memory remains. Write-protection helps spread the cost of
> collecting dirty state out across all the vCPU threads.
> 
> There could be some value in giving userspace the ability to parallelize
> calls to dirty log ioctls to work on non-intersecting intervals.
> 
> > > Slow dirty log collection is going to matter a lot for VM blackout,
> > > which from experience tends to be the most sensitive period of live
> > > migration for guest workloads.
> > >
> > > At least in our testing, the split GET/CLEAR dirty log ioctls
> > > dramatically improved the performance of a write-protection based dirty
> > > tracking scheme, as the false positive rate for dirtied pages is
> > > significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> > > opposed to the D-bit implementation that we use on x86.
> >
> > Guess, by D-bit on x86 you mean the PML feature. Unfortunately that is
> > something we lack on ARM yet.
> 
> Sorry, this was rather nonspecific. I was describing the pre-copy
> strategies we're using at Google (out of tree). We're carrying patches
> to use EPT D-bit for exitless dirty tracking.

Just curious: how does it handle the overheads associated with scanning for
dirty pages, and convergence with a high rate of dirtying, in exitless mode?
 
> > > Faster pre-copy performance would help the benchmark complete faster,
> > > but the goal for a live migration should be to minimize the lost
> > > computation for the entire operation. You'd need to test with a
> > > continuous workload rather than one with a finite amount of work.
> >
> > Ok. Though the above is not representative of a real workload, I thought
> > it gives some idea on how "Guest up time improvement" is benefitting the
> > overall availability of the workload during migration. I will check within our
> > wider team to see if I can setup a more suitable test/workload to show
> some
> > improvement with this approach.
> >
> > Please let me know if there is a specific workload you have in mind.
> 
> No objection to the workload you've chosen, I'm more concerned about the
> benchmark finishing before live migration completes.
> 
> What I'm looking for is something like this:
> 
>  - Calculate the ops/sec your benchmark completes in steady state
> 
>  - Do a live migration and sample the rate throughout the benchmark,
>    accounting for VM blackout time
> 
>  - Calculate the area under the curve of:
> 
>      y = steady_state_rate - live_migration_rate(t)
> 
>  - Compare the area under the curve for write-protection and your DBM
>    approach.

Ok. Got it.

> > Thanks for getting back on this. Appreciate if you can do a quick glance
> > through the rest of the patches as well for any gross errors especially with
> > respect to page table walk locking, usage of DBM FLAGS etc.
> 
> I'll give it a read when I have some spare cycles. To be entirely clear,
> I don't have any fundamental objections to using DBM for dirty tracking.
> I just want to make sure that all alternatives have been considered
> in the current scheme before we seriously consider a new approach with
> its own set of tradeoffs.

Thanks for taking a look.

Shameer


* Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
  2023-09-18  9:55       ` Shameerali Kolothum Thodi
@ 2023-09-20 21:12         ` Oliver Upton
  2023-10-12  7:51         ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 33+ messages in thread
From: Oliver Upton @ 2023-09-20 21:12 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm

On Mon, Sep 18, 2023 at 09:55:22AM +0000, Shameerali Kolothum Thodi wrote:

[...]

> > Sorry, this was rather nonspecific. I was describing the pre-copy
> > strategies we're using at Google (out of tree). We're carrying patches
> > to use EPT D-bit for exitless dirty tracking.
> 
> Just curious, how does it handle the overheads associated with scanning for
> dirty pages and the convergence w.r.t high rate of dirtying in exitless mode? 

A pool of kthreads, which really isn't a good solution at all. The
'better' way to do it would be to add some back pressure to the guest
such that your pre-copy transfer can converge with the guest and use the
freed up CPU time to manage the dirty state.

But hopefully we can make that a userspace issue.

-- 
Thanks,
Oliver


* Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-08-25  9:35 ` [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces Shameer Kolothum
  2023-09-15 22:22   ` Oliver Upton
@ 2023-09-22 15:24   ` Catalin Marinas
  2023-09-22 17:49     ` Oliver Upton
  1 sibling, 1 reply; 33+ messages in thread
From: Catalin Marinas @ 2023-09-22 15:24 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, oliver.upton,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Aug 25, 2023 at 10:35:23AM +0100, Shameer Kolothum wrote:
> +static bool stage2_pte_writeable(kvm_pte_t pte)
> +{
> +	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
> +}
> +
> +static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx,
> +			      kvm_pte_t new)
> +{
> +	kvm_pte_t old_pte, pte = ctx->old;
> +
> +	/* Only set DBM if page is writeable */
> +	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM) && !stage2_pte_writeable(pte))
> +		return;
> +
> +	/* Clear DBM walk is not shared, update */
> +	if (!kvm_pgtable_walk_shared(ctx)) {
> +		WRITE_ONCE(*ctx->ptep, new);
> +		return;
> +	}

I was wondering if this interferes with the OS dirty tracking (not the
KVM one) but I think that's ok, at least at this point, since the PTE is
already writeable and a fault would have marked the underlying page as
dirty (user_mem_abort() -> kvm_set_pfn_dirty()).

I'm not particularly fond of relying on this but I need to see how it
fits with the rest of the series. IIRC KVM doesn't go around and make
Stage 2 PTEs read-only but rather unmaps them when it changes the
permission of the corresponding Stage 1 VMM mapping.

My personal preference would be to track dirty/clean properly as we do
for stage 1 (e.g. DBM means writeable PTE) but it has some downsides
like the try_to_unmap() code having to retrieve the dirty state via
notifiers.

Anyway, assuming this works correctly, it means that live migration via
DBM is only tracked for PTEs already made dirty/writeable by some guest
write.

> @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
>  	    stage2_pte_executable(new))
>  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
>  
> +	/* Save the possible hardware dirty info */
> +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> +	    stage2_pte_writeable(ctx->old))
> +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu), ctx->addr >> PAGE_SHIFT);
> +
>  	stage2_make_pte(ctx, new);

Isn't this racy and potentially losing the dirty state? Or is the 'new'
value guaranteed to have the S2AP[1] bit? For stage 1 we normally make
the page genuinely read-only (clearing DBM) in a cmpxchg loop to
preserve the dirty state (see ptep_set_wrprotect()).
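
For illustration, the stage-2 analogue of that loop might look roughly like
this (a sketch only, reusing the attribute names from this patch; the helper
name and the gfn plumbing are hypothetical):

static void stage2_wrprotect_preserve_dirty(struct kvm *kvm, kvm_pte_t *ptep, u64 gfn)
{
	kvm_pte_t old, new;

	do {
		old = READ_ONCE(*ptep);
		new = old & ~(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W |
			      KVM_PTE_LEAF_ATTR_HI_S2_DBM);
	} while (cmpxchg_relaxed(ptep, old, new) != old);

	/* transfer any hardware dirty state into the dirty log before it is lost */
	if (stage2_pte_writeable(old))
		mark_page_dirty(kvm, gfn);
}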

-- 
Catalin


* Re: [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages
  2023-08-25  9:35 ` [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages Shameer Kolothum
  2023-09-15 22:54   ` Oliver Upton
@ 2023-09-22 15:40   ` Catalin Marinas
  2023-09-25  8:04     ` Shameerali Kolothum Thodi
  1 sibling, 1 reply; 33+ messages in thread
From: Catalin Marinas @ 2023-09-22 15:40 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, oliver.upton,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Aug 25, 2023 at 10:35:24AM +0100, Shameer Kolothum wrote:
> We only set DBM if the page is writeable (S2AP[1] == 1). But once migration
> starts, CLEAR_LOG path will write protect the pages (S2AP[1] = 0) and there
> isn't an easy way to differentiate the writeable pages that gets write
> protected from read-only pages as we only have S2AP[1] bit to check.

Don't we have enough bits without an additional one?

writeable: DBM == 1 || S2AP[1] == 1
  clean: S2AP[1] == 0
  dirty: S2AP[1] == 1 (irrespective of DBM)

read-only: DBM == 0 && S2AP[1] == 0
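
In code, that decoding would be something like (a sketch only, using the
attribute names from the patch; these helpers don't exist as such):

static bool s2_pte_dirty(kvm_pte_t pte)
{
	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;		/* S2AP[1] set */
}

static bool s2_pte_writeable(kvm_pte_t pte)
{
	return (pte & KVM_PTE_LEAF_ATTR_HI_S2_DBM) || s2_pte_dirty(pte);
}

static bool s2_pte_readonly(kvm_pte_t pte)
{
	return !s2_pte_writeable(pte);				/* DBM == 0 && S2AP[1] == 0 */
}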

For S1 we use a software dirty bit as well to track read-only dirty
mappings but I don't think it is necessary for S2 since KVM unmaps the
PTE when changing the VMM permission.

-- 
Catalin


* Re: [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE
  2023-08-25  9:35 ` [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE Shameer Kolothum
@ 2023-09-22 16:00   ` Catalin Marinas
  2023-09-22 16:59     ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Catalin Marinas @ 2023-09-22 16:00 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: kvmarm, kvm, linux-arm-kernel, maz, will, oliver.upton,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Aug 25, 2023 at 10:35:26AM +0100, Shameer Kolothum wrote:
> From: Keqian Zhu <zhukeqian1@huawei.com>
> 
> This function write protects all PTEs between the ffs and fls of mask.
> There may be unset bits between this range. It works well under pure
> software dirty log, as software dirty log is not working during this
> process.
> 
> But it will unexpectedly clear dirty status of PTE when hardware dirty
> log is enabled. So change it to only write protect selected PTE.

Ah, I did wonder about losing the dirty status. The equivalent to S1
would be for kvm_pgtable_stage2_wrprotect() to set a software dirty bit.

I'm only superficially familiar with how KVM does dirty tracking for
live migration. Does it need to first write-protect the pages and
disable DBM? Is DBM re-enabled later? Or does stage2_wp_range() with
your patches leave the DBM on? If the latter, the 'wp' aspect is a bit
confusing since DBM basically means writeable (and maybe clean). So
better to have something like stage2_clean_range().

-- 
Catalin


* Re: [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE
  2023-09-22 16:00   ` Catalin Marinas
@ 2023-09-22 16:59     ` Oliver Upton
  2023-09-26 15:58       ` Catalin Marinas
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-22 16:59 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Shameer Kolothum, kvmarm, kvm, linux-arm-kernel, maz, will,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Sep 22, 2023 at 05:00:40PM +0100, Catalin Marinas wrote:
> On Fri, Aug 25, 2023 at 10:35:26AM +0100, Shameer Kolothum wrote:
> > From: Keqian Zhu <zhukeqian1@huawei.com>
> > 
> > This function write protects all PTEs between the ffs and fls of mask.
> > There may be unset bits between this range. It works well under pure
> > software dirty log, as software dirty log is not working during this
> > process.
> > 
> > But it will unexpectly clear dirty status of PTE when hardware dirty
> > log is enabled. So change it to only write protect selected PTE.
> 
> Ah, I did wonder about losing the dirty status. The equivalent to S1
> would be for kvm_pgtable_stage2_wrprotect() to set a software dirty bit.
> 
> I'm only superficially familiar with how KVM does dirty tracking for
> live migration. Does it need to first write-protect the pages and
> disable DBM? Is DBM re-enabled later? Or does stage2_wp_range() with
> your patches leave the DBM on? If the latter, the 'wp' aspect is a bit
> confusing since DBM basically means writeable (and maybe clean). So
> better to have something like stage2_clean_range().

KVM has never enabled DBM and we solely rely on write-protection faults
for dirty tracking. IOW, we do not have a writable-clean state for
stage-2 PTEs (yet).

-- 
Thanks,
Oliver


* Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-09-22 15:24   ` Catalin Marinas
@ 2023-09-22 17:49     ` Oliver Upton
  2023-09-25  8:04       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2023-09-22 17:49 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Shameer Kolothum, kvmarm, kvm, linux-arm-kernel, maz, will,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Sep 22, 2023 at 04:24:11PM +0100, Catalin Marinas wrote:
> On Fri, Aug 25, 2023 at 10:35:23AM +0100, Shameer Kolothum wrote:
> > +static bool stage2_pte_writeable(kvm_pte_t pte)
> > +{
> > +	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
> > +}
> > +
> > +static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx *ctx,
> > +			      kvm_pte_t new)
> > +{
> > +	kvm_pte_t old_pte, pte = ctx->old;
> > +
> > +	/* Only set DBM if page is writeable */
> > +	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM) && !stage2_pte_writeable(pte))
> > +		return;
> > +
> > +	/* Clear DBM walk is not shared, update */
> > +	if (!kvm_pgtable_walk_shared(ctx)) {
> > +		WRITE_ONCE(*ctx->ptep, new);
> > +		return;
> > +	}
> 
> I was wondering if this interferes with the OS dirty tracking (not the
> KVM one) but I think that's ok, at least at this point, since the PTE is
> already writeable and a fault would have marked the underlying page as
> dirty (user_mem_abort() -> kvm_set_pfn_dirty()).
> 
> I'm not particularly fond of relying on this but I need to see how it
> fits with the rest of the series. IIRC KVM doesn't go around and make
> Stage 2 PTEs read-only but rather unmaps them when it changes the
> permission of the corresponding Stage 1 VMM mapping.
> 
> My personal preference would be to track dirty/clean properly as we do
> for stage 1 (e.g. DBM means writeable PTE) but it has some downsides
> like the try_to_unmap() code having to retrieve the dirty state via
> notifiers.

KVM's usage of DBM is complicated by the fact that the dirty log
interface w/ userspace is at PTE granularity. We only want the page
table walker to relax PTEs, but take faults on hugepages so we can do
page splitting.

> Anyway, assuming this works correctly, it means that live migration via
> DBM is only tracked for PTEs already made dirty/writeable by some guest
> write.

I'm hoping that we move away from this combined write-protection and DBM
scheme and only use a single dirty tracking strategy at a time.

> > @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
> >  	    stage2_pte_executable(new))
> >  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
> >  
> > +	/* Save the possible hardware dirty info */
> > +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> > +	    stage2_pte_writeable(ctx->old))
> > +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu), ctx->addr >> PAGE_SHIFT);
> > +
> >  	stage2_make_pte(ctx, new);
> 
> Isn't this racy and potentially losing the dirty state? Or is the 'new'
> value guaranteed to have the S2AP[1] bit? For stage 1 we normally make
> the page genuinely read-only (clearing DBM) in a cmpxchg loop to
> preserve the dirty state (see ptep_set_wrprotect()).

stage2_try_break_pte() a few lines up does a cmpxchg() and full
break-before-make, so at this point there shouldn't be a race with
either software or hardware table walkers.
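
For reference, that sequence amounts to the usual break-before-make, sketched
below with made-up names (the real code is stage2_try_break_pte() followed by
stage2_make_pte()):

/*
 * Simplified break-before-make: the cmpxchg() both detects racing
 * walkers and removes the old mapping before the TLB invalidation,
 * so nobody can observe a half-updated pte.
 */
static bool stage2_try_replace_pte(kvm_pte_t *ptep, kvm_pte_t old, kvm_pte_t new)
{
	/* 1) Atomically knock out the old pte; bail if someone raced us */
	if (cmpxchg(ptep, old, 0) != old)
		return false;

	/* 2) Invalidate any cached translation for the old mapping here */

	/* 3) Only then install the new pte */
	WRITE_ONCE(*ptep, new);
	return true;
}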

But I'm still confused by this one. KVM only goes down the map
walker path (in the context of dirty tracking) if:

 - We took a translation fault

 - We took a write permission fault on a hugepage and need to split

In both cases the 'old' translation should have DBM cleared. Even if the
PTE were dirty, this is wasted work since we need to do a final scan of
the stage-2 when userspace collects the dirty log.

Am I missing something?

-- 
Thanks,
Oliver


* RE: [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages
  2023-09-22 15:40   ` Catalin Marinas
@ 2023-09-25  8:04     ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-25  8:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, oliver.upton@linux.dev, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm



> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.marinas@arm.com]
> Sent: 22 September 2023 16:40
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> oliver.upton@linux.dev; james.morse@arm.com; suzuki.poulose@arm.com;
> yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously
> writeable pages
> 
> On Fri, Aug 25, 2023 at 10:35:24AM +0100, Shameer Kolothum wrote:
> > We only set DBM if the page is writeable (S2AP[1] == 1). But once
> migration
> > starts, CLEAR_LOG path will write protect the pages (S2AP[1] = 0) and
> there
> > isn't an easy way to differentiate the writeable pages that gets write
> > protected from read-only pages as we only have S2AP[1] bit to check.
> 
> Don't we have enough bits without an additional one?
> 
> writeable: DBM == 1 || S2AP[1] == 1
>   clean: S2AP[1] == 0
>   dirty: S2AP[1] == 1 (irrespective of DBM)
> 
> read-only: DBM == 0 && S2AP[1] == 0
> 
> For S1 we use a software dirty bit as well to track read-only dirty
> mappings but I don't think it is necessary for S2 since KVM unmaps the
> PTE when changing the VMM permission.
> 

We don't set DBM for all of guest memory. To reduce the overhead of scanning
PTEs, this series only sets DBM for the nearby pages on a page fault during
the migration phase.

See patch #8,
  user_mem_abort()
     kvm_arm_enable_nearby_hwdbm()

But once migration starts, the CLEAR_LOG path does:
   kvm_arch_mmu_enable_log_dirty_pt_masked()
    stage2_wp_range()  --> makes the pages read-only
    kvm_mmu_split_huge_pages() --> splits huge pages; the resulting pages are read-only too.

In effect this means there are no writeable-clean nearby pages left for
kvm_arm_enable_nearby_hwdbm() to set DBM on.

To identify the pages that can have DBM set, we pass a hint from
stage2_wp_range() down to the kvm_pgtable_stage2_wrprotect() table walker and
use a new software bit to mark those PTEs as writeable-clean.
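
Roughly, the hint does something like the sketch below in the wrprotect walker
(names made up here for illustration, not the exact code in patch #4):

/*
 * Mark a previously-writeable pte as "writeable-clean" while
 * write-protecting it, using one of the Reserved-for-Software
 * descriptor bits.
 */
#define KVM_PTE_S2_SW_WR_CLEAN	BIT(55)		/* software-use bit, hypothetical */

static kvm_pte_t stage2_wrprotect_pte(kvm_pte_t pte, bool dbm_hint)
{
	/* Remember that this page used to be writeable */
	if (dbm_hint && (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W))
		pte |= KVM_PTE_S2_SW_WR_CLEAN;

	/* Write-protect as before; kvm_arm_enable_nearby_hwdbm() can later
	 * pick out ptes carrying the software bit and set DBM on them. */
	return pte & ~KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
}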

Hope that makes it clearer.

Thanks,
Shameer

* RE: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-09-22 17:49     ` Oliver Upton
@ 2023-09-25  8:04       ` Shameerali Kolothum Thodi
  2023-09-26 15:20         ` Catalin Marinas
  0 siblings, 1 reply; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-25  8:04 UTC (permalink / raw)
  To: Oliver Upton, Catalin Marinas
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, james.morse@arm.com, suzuki.poulose@arm.com,
	yuzenghui, zhukeqian, Jonathan Cameron, Linuxarm



> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> Sent: 22 September 2023 18:49
> To: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> james.morse@arm.com; suzuki.poulose@arm.com; yuzenghui
> <yuzenghui@huawei.com>; zhukeqian <zhukeqian1@huawei.com>; Jonathan
> Cameron <jonathan.cameron@huawei.com>; Linuxarm
> <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related
> pgtable interfaces
> 
> On Fri, Sep 22, 2023 at 04:24:11PM +0100, Catalin Marinas wrote:
> > On Fri, Aug 25, 2023 at 10:35:23AM +0100, Shameer Kolothum wrote:
> > > +static bool stage2_pte_writeable(kvm_pte_t pte)
> > > +{
> > > +	return pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;
> > > +}
> > > +
> > > +static void kvm_update_hw_dbm(const struct kvm_pgtable_visit_ctx
> *ctx,
> > > +			      kvm_pte_t new)
> > > +{
> > > +	kvm_pte_t old_pte, pte = ctx->old;
> > > +
> > > +	/* Only set DBM if page is writeable */
> > > +	if ((new & KVM_PTE_LEAF_ATTR_HI_S2_DBM)
> && !stage2_pte_writeable(pte))
> > > +		return;
> > > +
> > > +	/* Clear DBM walk is not shared, update */
> > > +	if (!kvm_pgtable_walk_shared(ctx)) {
> > > +		WRITE_ONCE(*ctx->ptep, new);
> > > +		return;
> > > +	}
> >
> > I was wondering if this interferes with the OS dirty tracking (not the
> > KVM one) but I think that's ok, at least at this point, since the PTE is
> > already writeable and a fault would have marked the underlying page as
> > dirty (user_mem_abort() -> kvm_set_pfn_dirty()).
> >
> > I'm not particularly fond of relying on this but I need to see how it
> > fits with the rest of the series. IIRC KVM doesn't go around and make
> > Stage 2 PTEs read-only but rather unmaps them when it changes the
> > permission of the corresponding Stage 1 VMM mapping.
> >
> > My personal preference would be to track dirty/clean properly as we do
> > for stage 1 (e.g. DBM means writeable PTE) but it has some downsides
> > like the try_to_unmap() code having to retrieve the dirty state via
> > notifiers.
> 
> KVM's usage of DBM is complicated by the fact that the dirty log
> interface w/ userspace is at PTE granularity. We only want the page
> table walker to relax PTEs, but take faults on hugepages so we can do
> page splitting.
> 
> > Anyway, assuming this works correctly, it means that live migration via
> > DBM is only tracked for PTEs already made dirty/writeable by some guest
> > write.
> 
> I'm hoping that we move away from this combined write-protection and
> DBM
> scheme and only use a single dirty tracking strategy at a time.

Yes. As mentioned in the cover letter, this is a combined approach where we only set
DBM for the nearby pages (64) on a page fault once migration has started.
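
For illustration, the shape of the idea is roughly the standalone model below
(not the actual kvm_arm_enable_nearby_hwdbm() from patch #8):

/*
 * On a dirty-tracking fault, visit the 64 last-level ptes in the same
 * block as the faulting one and set DBM on those previously marked
 * writeable-clean, so hardware tracks further writes there without
 * additional faults.
 */
#include <stddef.h>
#include <stdint.h>

#define PTE_DBM		(UINT64_C(1) << 51)
#define PTE_SW_WR_CLEAN	(UINT64_C(1) << 55)	/* hypothetical software bit */
#define NEARBY_PTES	64

static void enable_nearby_hwdbm(uint64_t *ptes, size_t nr_ptes, size_t fault_idx)
{
	size_t start = fault_idx & ~((size_t)NEARBY_PTES - 1);
	size_t end = start + NEARBY_PTES;

	if (end > nr_ptes)
		end = nr_ptes;

	for (size_t i = start; i < end; i++) {
		/* Leave S2AP[1] clear: HW sets it on the first guest write */
		if (ptes[i] & PTE_SW_WR_CLEAN)
			ptes[i] |= PTE_DBM;
	}
}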

> 
> > > @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const
> struct kvm_pgtable_visit_ctx *ctx,
> > >  	    stage2_pte_executable(new))
> > >  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops),
> granule);
> > >
> > > +	/* Save the possible hardware dirty info */
> > > +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> > > +	    stage2_pte_writeable(ctx->old))
> > > +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu),
> ctx->addr >> PAGE_SHIFT);
> > > +
> > >  	stage2_make_pte(ctx, new);
> >
> > Isn't this racy and potentially losing the dirty state? Or is the 'new'
> > value guaranteed to have the S2AP[1] bit? For stage 1 we normally make
> > the page genuinely read-only (clearing DBM) in a cmpxchg loop to
> > preserve the dirty state (see ptep_set_wrprotect()).
> 
> stage2_try_break_pte() a few lines up does a cmpxchg() and full
> break-before-make, so at this point there shouldn't be a race with
> either software or hardware table walkers.
> 
> But I'm still confused by this one. KVM only goes down the map
> walker path (in the context of dirty tracking) if:
> 
>  - We took a translation fault
> 
>  - We took a write permission fault on a hugepage and need to split

Agree.

> In both cases the 'old' translation should have DBM cleared. Even if the
> PTE were dirty, this is wasted work since we need to do a final scan of
> the stage-2 when userspace collects the dirty log.
> 
> Am I missing something?

I think we can get rid of the above mark_page_dirty(). I will test it to confirm
we are not missing anything here.
 
Thanks,
Shameer


* Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-09-25  8:04       ` Shameerali Kolothum Thodi
@ 2023-09-26 15:20         ` Catalin Marinas
  2023-09-26 15:52           ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 33+ messages in thread
From: Catalin Marinas @ 2023-09-26 15:20 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Oliver Upton, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, james.morse@arm.com, suzuki.poulose@arm.com,
	yuzenghui, zhukeqian, Jonathan Cameron, Linuxarm

On Mon, Sep 25, 2023 at 08:04:39AM +0000, Shameerali Kolothum Thodi wrote:
> From: Oliver Upton [mailto:oliver.upton@linux.dev]
> > On Fri, Sep 22, 2023 at 04:24:11PM +0100, Catalin Marinas wrote:
> > > I was wondering if this interferes with the OS dirty tracking (not the
> > > KVM one) but I think that's ok, at least at this point, since the PTE is
> > > already writeable and a fault would have marked the underlying page as
> > > dirty (user_mem_abort() -> kvm_set_pfn_dirty()).
> > >
> > > I'm not particularly fond of relying on this but I need to see how it
> > > fits with the rest of the series. IIRC KVM doesn't go around and make
> > > Stage 2 PTEs read-only but rather unmaps them when it changes the
> > > permission of the corresponding Stage 1 VMM mapping.
> > >
> > > My personal preference would be to track dirty/clean properly as we do
> > > for stage 1 (e.g. DBM means writeable PTE) but it has some downsides
> > > like the try_to_unmap() code having to retrieve the dirty state via
> > > notifiers.
> > 
> > KVM's usage of DBM is complicated by the fact that the dirty log
> > interface w/ userspace is at PTE granularity. We only want the page
> > table walker to relax PTEs, but take faults on hugepages so we can do
> > page splitting.

Thanks for the clarification.

> > > > @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const struct kvm_pgtable_visit_ctx *ctx,
> > > >  	    stage2_pte_executable(new))
> > > >  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops), granule);
> > > >
> > > > +	/* Save the possible hardware dirty info */
> > > > +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> > > > +	    stage2_pte_writeable(ctx->old))
> > > > +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu), ctx->addr >> PAGE_SHIFT);
> > > > +
> > > >  	stage2_make_pte(ctx, new);
> > >
> > > Isn't this racy and potentially losing the dirty state? Or is the 'new'
> > > value guaranteed to have the S2AP[1] bit? For stage 1 we normally make
> > > the page genuinely read-only (clearing DBM) in a cmpxchg loop to
> > > preserve the dirty state (see ptep_set_wrprotect()).
> > 
> > stage2_try_break_pte() a few lines up does a cmpxchg() and full
> > break-before-make, so at this point there shouldn't be a race with
> > either software or hardware table walkers.

Ah, I missed this. Also it was unrelated to this patch (or rather not
introduced by this patch).

> > In both cases the 'old' translation should have DBM cleared. Even if the
> > PTE were dirty, this is wasted work since we need to do a final scan of
> > the stage-2 when userspace collects the dirty log.
> > 
> > Am I missing something?
> 
> I think we can get rid of the above mark_page_dirty(). I will test it to confirm
> we are not missing anything here.

Is this the case for the other places of mark_page_dirty() in your
patches? If stage2_pte_writeable() is true, it must have been made
writeable earlier by a fault and the underlying page marked as dirty.

-- 
Catalin


* RE: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-09-26 15:20         ` Catalin Marinas
@ 2023-09-26 15:52           ` Shameerali Kolothum Thodi
  2023-09-26 16:37             ` Catalin Marinas
  0 siblings, 1 reply; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-09-26 15:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Oliver Upton, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, james.morse@arm.com, suzuki.poulose@arm.com,
	yuzenghui, zhukeqian, Jonathan Cameron, Linuxarm



> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.marinas@arm.com]
> Sent: 26 September 2023 16:20
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Oliver Upton <oliver.upton@linux.dev>; kvmarm@lists.linux.dev;
> kvm@vger.kernel.org; linux-arm-kernel@lists.infradead.org; maz@kernel.org;
> will@kernel.org; james.morse@arm.com; suzuki.poulose@arm.com;
> yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related
> pgtable interfaces
> 
> On Mon, Sep 25, 2023 at 08:04:39AM +0000, Shameerali Kolothum Thodi
> wrote:
> > From: Oliver Upton [mailto:oliver.upton@linux.dev]
> > > On Fri, Sep 22, 2023 at 04:24:11PM +0100, Catalin Marinas wrote:
> > > > I was wondering if this interferes with the OS dirty tracking (not the
> > > > KVM one) but I think that's ok, at least at this point, since the PTE is
> > > > already writeable and a fault would have marked the underlying page
> as
> > > > dirty (user_mem_abort() -> kvm_set_pfn_dirty()).
> > > >
> > > > I'm not particularly fond of relying on this but I need to see how it
> > > > fits with the rest of the series. IIRC KVM doesn't go around and make
> > > > Stage 2 PTEs read-only but rather unmaps them when it changes the
> > > > permission of the corresponding Stage 1 VMM mapping.
> > > >
> > > > My personal preference would be to track dirty/clean properly as we do
> > > > for stage 1 (e.g. DBM means writeable PTE) but it has some downsides
> > > > like the try_to_unmap() code having to retrieve the dirty state via
> > > > notifiers.
> > >
> > > KVM's usage of DBM is complicated by the fact that the dirty log
> > > interface w/ userspace is at PTE granularity. We only want the page
> > > table walker to relax PTEs, but take faults on hugepages so we can do
> > > page splitting.
> 
> Thanks for the clarification.
> 
> > > > > @@ -952,6 +990,11 @@ static int stage2_map_walker_try_leaf(const
> struct kvm_pgtable_visit_ctx *ctx,
> > > > >  	    stage2_pte_executable(new))
> > > > >  		mm_ops->icache_inval_pou(kvm_pte_follow(new, mm_ops),
> granule);
> > > > >
> > > > > +	/* Save the possible hardware dirty info */
> > > > > +	if ((ctx->level == KVM_PGTABLE_MAX_LEVELS - 1) &&
> > > > > +	    stage2_pte_writeable(ctx->old))
> > > > > +		mark_page_dirty(kvm_s2_mmu_to_kvm(pgt->mmu),
> ctx->addr >> PAGE_SHIFT);
> > > > > +
> > > > >  	stage2_make_pte(ctx, new);
> > > >
> > > > Isn't this racy and potentially losing the dirty state? Or is the 'new'
> > > > value guaranteed to have the S2AP[1] bit? For stage 1 we normally
> make
> > > > the page genuinely read-only (clearing DBM) in a cmpxchg loop to
> > > > preserve the dirty state (see ptep_set_wrprotect()).
> > >
> > > stage2_try_break_pte() a few lines up does a cmpxchg() and full
> > > break-before-make, so at this point there shouldn't be a race with
> > > either software or hardware table walkers.
> 
> Ah, I missed this. Also it was unrelated to this patch (or rather not
> introduced by this patch).
> 
> > > In both cases the 'old' translation should have DBM cleared. Even if the
> > > PTE were dirty, this is wasted work since we need to do a final scan of
> > > the stage-2 when userspace collects the dirty log.
> > >
> > > Am I missing something?
> >
> > I think we can get rid of the above mark_page_dirty(). I will test it to
> confirm
> > we are not missing anything here.
> 
> Is this the case for the other places of mark_page_dirty() in your
> patches? If stage2_pte_writeable() is true, it must have been made
> writeable earlier by a fault and the underlying page marked as dirty.
> 

One of the other places we call mark_page_dirty() is stage2_unmap_walker().
During testing of this series I tried removing that and found that it
actually causes memory corruption during VM migration.

From my old debug logs:

[  399.288076]  stage2_unmap_walker+0x270/0x284
[  399.288078]  __kvm_pgtable_walk+0x1ec/0x2cc
[  399.288081]  __kvm_pgtable_walk+0xec/0x2cc
[  399.288084]  __kvm_pgtable_walk+0xec/0x2cc
[  399.288086]  kvm_pgtable_walk+0xcc/0x160
[  399.288088]  kvm_pgtable_stage2_unmap+0x4c/0xbc
[  399.288091]  stage2_apply_range+0xd0/0xec
[  399.288094]  __unmap_stage2_range+0x2c/0x60
[  399.288096]  kvm_unmap_gfn_range+0x30/0x48
[  399.288099]  kvm_mmu_notifier_invalidate_range_start+0xe0/0x264
[  399.288103]  __mmu_notifier_invalidate_range_start+0xa4/0x23c
[  399.288106]  change_protection+0x638/0x900
[  399.288109]  change_prot_numa+0x64/0xfc
[  399.288113]  task_numa_work+0x2ac/0x450
[  399.288117]  task_work_run+0x78/0xd0

It looks like the unmap path gets triggered from the NUMA page migration code
path, so we may need it there.

Thanks,
Shameer


* Re: [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE
  2023-09-22 16:59     ` Oliver Upton
@ 2023-09-26 15:58       ` Catalin Marinas
  2023-09-26 16:10         ` Catalin Marinas
  0 siblings, 1 reply; 33+ messages in thread
From: Catalin Marinas @ 2023-09-26 15:58 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Shameer Kolothum, kvmarm, kvm, linux-arm-kernel, maz, will,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Fri, Sep 22, 2023 at 04:59:08PM +0000, Oliver Upton wrote:
> On Fri, Sep 22, 2023 at 05:00:40PM +0100, Catalin Marinas wrote:
> > On Fri, Aug 25, 2023 at 10:35:26AM +0100, Shameer Kolothum wrote:
> > > From: Keqian Zhu <zhukeqian1@huawei.com>
> > > 
> > > This function write protects all PTEs between the ffs and fls of mask.
> > > There may be unset bits between this range. It works well under pure
> > > software dirty log, as software dirty log is not working during this
> > > process.
> > > 
> > > But it will unexpectly clear dirty status of PTE when hardware dirty
> > > log is enabled. So change it to only write protect selected PTE.
> > 
> > Ah, I did wonder about losing the dirty status. The equivalent to S1
> > would be for kvm_pgtable_stage2_wrprotect() to set a software dirty bit.
> > 
> > I'm only superficially familiar with how KVM does dirty tracking for
> > live migration. Does it need to first write-protect the pages and
> > disable DBM? Is DBM re-enabled later? Or does stage2_wp_range() with
> > your patches leave the DBM on? If the latter, the 'wp' aspect is a bit
> > confusing since DBM basically means writeable (and maybe clean). So
> > better to have something like stage2_clean_range().
> 
> KVM has never enabled DBM and we solely rely on write-protection faults
> for dirty tracking. IOW, we do not have a writable-clean state for
> stage-2 PTEs (yet).

When I did the stage 2 AF support I left out DBM as it was unlikely
to be of any use in the real world. Now with dirty tracking for
migration, we may have a better use for this feature.

What I find confusing with these patches is that stage2_wp_range() is
supposed to make a stage 2 pte read-only, as the name implies. However,
if the pte was writeable, it leaves it writeable, clean with DBM
enabled. Doesn't the change to kvm_pgtable_stage2_wrprotect() in patch 4
break other uses of stage2_wp_range()? E.g. kvm_mmu_wp_memory_region()?

Unless I misunderstood, I'd rather change
kvm_arch_mmu_enable_log_dirty_pt_masked() to call a new function,
stage2_clean_range(), which clears S2AP[1] together with setting DBM if
previously writeable. But we should not confuse this with
write-protecting or change the write-protecting functions to mark a pte
writeable+clean.
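
Something along these lines is what I have in mind (hypothetical helper, for
illustration only):

/*
 * Turn a writeable pte into writeable-clean: the next guest write is
 * redirtied by hardware (DBM) rather than taking a permission fault.
 */
static kvm_pte_t stage2_clean_pte(kvm_pte_t pte)
{
	if (!(pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W))
		return pte;				/* read-only or already clean */

	pte &= ~KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W;	/* clear S2AP[1] */
	pte |= KVM_PTE_LEAF_ATTR_HI_S2_DBM;		/* still writeable via DBM */
	return pte;
}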

-- 
Catalin


* Re: [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE
  2023-09-26 15:58       ` Catalin Marinas
@ 2023-09-26 16:10         ` Catalin Marinas
  0 siblings, 0 replies; 33+ messages in thread
From: Catalin Marinas @ 2023-09-26 16:10 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Shameer Kolothum, kvmarm, kvm, linux-arm-kernel, maz, will,
	james.morse, suzuki.poulose, yuzenghui, zhukeqian1,
	jonathan.cameron, linuxarm

On Tue, Sep 26, 2023 at 04:58:03PM +0100, Catalin Marinas wrote:
> On Fri, Sep 22, 2023 at 04:59:08PM +0000, Oliver Upton wrote:
> > On Fri, Sep 22, 2023 at 05:00:40PM +0100, Catalin Marinas wrote:
> > > On Fri, Aug 25, 2023 at 10:35:26AM +0100, Shameer Kolothum wrote:
> > > > From: Keqian Zhu <zhukeqian1@huawei.com>
> > > > 
> > > > This function write protects all PTEs between the ffs and fls of mask.
> > > > There may be unset bits between this range. It works well under pure
> > > > software dirty log, as software dirty log is not working during this
> > > > process.
> > > > 
> > > > But it will unexpectly clear dirty status of PTE when hardware dirty
> > > > log is enabled. So change it to only write protect selected PTE.
> > > 
> > > Ah, I did wonder about losing the dirty status. The equivalent to S1
> > > would be for kvm_pgtable_stage2_wrprotect() to set a software dirty bit.
> > > 
> > > I'm only superficially familiar with how KVM does dirty tracking for
> > > live migration. Does it need to first write-protect the pages and
> > > disable DBM? Is DBM re-enabled later? Or does stage2_wp_range() with
> > > your patches leave the DBM on? If the latter, the 'wp' aspect is a bit
> > > confusing since DBM basically means writeable (and maybe clean). So
> > > better to have something like stage2_clean_range().
> > 
> > KVM has never enabled DBM and we solely rely on write-protection faults
> > for dirty tracking. IOW, we do not have a writable-clean state for
> > stage-2 PTEs (yet).
> 
> When I did the stage 2 AF support I left out DBM as it was unlikely
> to be of any use in the real world. Now with dirty tracking for
> migration, we may have a better use for this feature.
> 
> What I find confusing with these patches is that stage2_wp_range() is
> supposed to make a stage 2 pte read-only, as the name implies. However,
> if the pte was writeable, it leaves it writeable, clean with DBM
> enabled. Doesn't the change to kvm_pgtable_stage2_wrprotect() in patch 4
> break other uses of stage2_wp_range()? E.g. kvm_mmu_wp_memory_region()?

Ah, that's also used for dirty tracking, so maybe it's ok.

AFAICT KVM doesn't do any form of stage 2 pte change from writeable to
read-only other than dirty tracking (all other cases triggered via MMU
notifier end up unmapping at stage 2).

> Unless I misunderstood, I'd rather change
> kvm_arch_mmu_enable_log_dirty_pt_masked() to call a new function,
> stage2_clean_range(), which clears S2AP[1] together with setting DBM if
> previously writeable. But we should not confuse this with
> write-protecting or change the write-protecting functions to mark a pte
> writeable+clean.

I think it's still good to rename stage2_wp_range() to make it clear
that it's about clean ptes rather than read-only.

-- 
Catalin


* Re: [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces
  2023-09-26 15:52           ` Shameerali Kolothum Thodi
@ 2023-09-26 16:37             ` Catalin Marinas
  0 siblings, 0 replies; 33+ messages in thread
From: Catalin Marinas @ 2023-09-26 16:37 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Oliver Upton, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, james.morse@arm.com, suzuki.poulose@arm.com,
	yuzenghui, zhukeqian, Jonathan Cameron, Linuxarm

On Tue, Sep 26, 2023 at 03:52:19PM +0000, Shameerali Kolothum Thodi wrote:
> From: Catalin Marinas [mailto:catalin.marinas@arm.com]
> > On Mon, Sep 25, 2023 at 08:04:39AM +0000, Shameerali Kolothum Thodi wrote:
> > > From: Oliver Upton [mailto:oliver.upton@linux.dev]
> > > > In both cases the 'old' translation should have DBM cleared. Even if the
> > > > PTE were dirty, this is wasted work since we need to do a final scan of
> > > > the stage-2 when userspace collects the dirty log.
> > > >
> > > > Am I missing something?
> > >
> > > I think we can get rid of the above mark_page_dirty(). I will test it to confirm
> > > we are not missing anything here.
> > 
> > Is this the case for the other places of mark_page_dirty() in your
> > patches? If stage2_pte_writeable() is true, it must have been made
> > writeable earlier by a fault and the underlying page marked as dirty.
> > 
> 
> One of the other place we have mark_page_dirty() is in the stage2_unmap_walker().
> And during the testing of this series, I have tried to remove that and
> found out that it actually causes memory corruption during VM migration.
> 
> From my old debug logs:
> 
> [  399.288076]  stage2_unmap_walker+0x270/0x284
> [  399.288078]  __kvm_pgtable_walk+0x1ec/0x2cc
> [  399.288081]  __kvm_pgtable_walk+0xec/0x2cc
> [  399.288084]  __kvm_pgtable_walk+0xec/0x2cc
> [  399.288086]  kvm_pgtable_walk+0xcc/0x160
> [  399.288088]  kvm_pgtable_stage2_unmap+0x4c/0xbc
> [  399.288091]  stage2_apply_range+0xd0/0xec
> [  399.288094]  __unmap_stage2_range+0x2c/0x60
> [  399.288096]  kvm_unmap_gfn_range+0x30/0x48
> [  399.288099]  kvm_mmu_notifier_invalidate_range_start+0xe0/0x264
> [  399.288103]  __mmu_notifier_invalidate_range_start+0xa4/0x23c
> [  399.288106]  change_protection+0x638/0x900
> [  399.288109]  change_prot_numa+0x64/0xfc
> [  399.288113]  task_numa_work+0x2ac/0x450
> [  399.288117]  task_work_run+0x78/0xd0
> 
> It looks like that the unmap path gets triggered from Numa page migration code
> path, so we may need it there.

I think I confused things (should have looked at the code).
mark_page_dirty() is actually about marking the bitmap for dirty
tracking rather than the actual struct page. The latter is hidden
somewhere deep down the get_user_pages() code.
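
Conceptually it boils down to setting a bit in the memslot's dirty bitmap that
userspace later harvests, roughly like the simplified model below (not the
actual KVM code):

/*
 * Record the gfn in the memslot's dirty bitmap for the dirty log,
 * rather than dirtying the backing struct page itself.
 */
#include <stdint.h>

struct memslot_model {
	uint64_t base_gfn;
	unsigned long *dirty_bitmap;	/* one bit per page in the slot */
};

static void mark_gfn_dirty(struct memslot_model *slot, uint64_t gfn)
{
	uint64_t rel = gfn - slot->base_gfn;
	unsigned long bits = 8 * sizeof(unsigned long);

	slot->dirty_bitmap[rel / bits] |= 1UL << (rel % bits);
}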

So yeah, I think we still need mark_page_dirty() in some places as to
avoid losing the dirty information with DBM being turned on.

-- 
Catalin


* RE: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
  2023-09-18  9:55       ` Shameerali Kolothum Thodi
  2023-09-20 21:12         ` Oliver Upton
@ 2023-10-12  7:51         ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 33+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-10-12  7:51 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui, zhukeqian, Jonathan Cameron,
	Linuxarm

Hi,

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-bounces@lists.infradead.org] On Behalf Of
> Shameerali Kolothum Thodi
> Sent: 18 September 2023 10:55
> To: Oliver Upton <oliver.upton@linux.dev>
> Cc: kvmarm@lists.linux.dev; kvm@vger.kernel.org;
> linux-arm-kernel@lists.infradead.org; maz@kernel.org; will@kernel.org;
> catalin.marinas@arm.com; james.morse@arm.com;
> suzuki.poulose@arm.com; yuzenghui <yuzenghui@huawei.com>; zhukeqian
> <zhukeqian1@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>
> Subject: RE: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
 
[...]

> > > Please let me know if there is a specific workload you have in mind.
> >
> > No objection to the workload you've chosen, I'm more concerned about
> the
> > benchmark finishing before live migration completes.
> >
> > What I'm looking for is something like this:
> >
> >  - Calculate the ops/sec your benchmark completes in steady state
> >
> >  - Do a live migration and sample the rate throughout the benchmark,
> >    accounting for VM blackout time
> >
> >  - Calculate the area under the curve of:
> >
> >      y = steady_state_rate - live_migration_rate(t)
> >
> >  - Compare the area under the curve for write-protection and your DBM
> >    approach.
> 
> Ok. Got it.

I attempted to benchmark the performance of this series more thoroughly, as suggested above.

I used memcached/memaslap instead of redis-benchmark, as this tool seems to dirty
memory at a faster rate in my setup.

./memaslap -s 127.0.0.1:11211 -S 1s  -F ./memslap.cnf -T 96 -c 96 -t 20m

Please find the Google Sheet link below for the charts comparing the average
throughput rates during the migration time window for the 6.5-org and
6.5-kvm-dbm branches.

https://docs.google.com/spreadsheets/d/1T2F94Lsjpx080hW8OSxwbTJXihbXDNlTE1HjWCC0J_4/edit?usp=sharing

Sheet #1: autoconverge=on with default settings (initial-throttle 20 & increment 10).

As you can see from the charts, the kvm-dbm branch throughput during the
original branch's migration window is considerably higher. But the time to
converge and finish migration increases at almost the same rate for KVM-DBM.
In effect, this results in a decreased overall average throughput when
compared over the same time window as the original branch.

Sheet #2: autoconverge=on with throttle-increment set to 15 for the kvm-dbm branch run.

However, if we increase the migration throttling rate for the kvm-dbm branch,
it looks like we can still get better throughput during the migration
window and also a higher overall throughput rate with the KVM-DBM solution.
 
Sheet #3: captures the dirty_log_perf_test times vs. memory per vCPU.

This is also in line with the above results. KVM-DBM has a better, roughly
constant dirty memory time compared with the linear increase seen for the
original, but it is just the opposite for the get-dirty-log time.

From the above, it looks to me like there is value in using HW DBM for
write-intensive workloads if we adjust the CPU throttling in userspace.
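
For reference, the comparisons above follow the area-under-curve approach
suggested earlier, computed roughly as in the sketch below (illustrative only,
trapezoidal integration of the sampled throughput):

/*
 * Integrate the throughput deficit (steady-state rate minus the rate
 * sampled during migration) over time. A smaller area means less
 * impact on the workload while migration runs.
 */
#include <stddef.h>

static double throughput_deficit_area(double steady_state_rate,
				      const double *rate, const double *t,
				      size_t nr_samples)
{
	double area = 0.0;

	for (size_t i = 1; i < nr_samples; i++) {
		double y0 = steady_state_rate - rate[i - 1];
		double y1 = steady_state_rate - rate[i];

		area += 0.5 * (y0 + y1) * (t[i] - t[i - 1]);
	}

	return area;
}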

Please take a look and let me know your feedback/thoughts.

Thanks,
Shameer



Thread overview: 33+ messages
2023-08-25  9:35 [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log Shameer Kolothum
2023-08-25  9:35 ` [RFC PATCH v2 1/8] arm64: cpufeature: Add API to report system support of HWDBM Shameer Kolothum
2023-08-25  9:35 ` [RFC PATCH v2 2/8] KVM: arm64: Add KVM_PGTABLE_WALK_HW_DBM for HW DBM support Shameer Kolothum
2023-09-15 22:05   ` Oliver Upton
2023-09-18  9:52     ` Shameerali Kolothum Thodi
2023-08-25  9:35 ` [RFC PATCH v2 3/8] KVM: arm64: Add some HW_DBM related pgtable interfaces Shameer Kolothum
2023-09-15 22:22   ` Oliver Upton
2023-09-18  9:53     ` Shameerali Kolothum Thodi
2023-09-22 15:24   ` Catalin Marinas
2023-09-22 17:49     ` Oliver Upton
2023-09-25  8:04       ` Shameerali Kolothum Thodi
2023-09-26 15:20         ` Catalin Marinas
2023-09-26 15:52           ` Shameerali Kolothum Thodi
2023-09-26 16:37             ` Catalin Marinas
2023-08-25  9:35 ` [RFC PATCH v2 4/8] KVM: arm64: Set DBM for previously writeable pages Shameer Kolothum
2023-09-15 22:54   ` Oliver Upton
2023-09-18  9:54     ` Shameerali Kolothum Thodi
2023-09-22 15:40   ` Catalin Marinas
2023-09-25  8:04     ` Shameerali Kolothum Thodi
2023-08-25  9:35 ` [RFC PATCH v2 5/8] KVM: arm64: Add some HW_DBM related mmu interfaces Shameer Kolothum
2023-08-25  9:35 ` [RFC PATCH v2 6/8] KVM: arm64: Only write protect selected PTE Shameer Kolothum
2023-09-22 16:00   ` Catalin Marinas
2023-09-22 16:59     ` Oliver Upton
2023-09-26 15:58       ` Catalin Marinas
2023-09-26 16:10         ` Catalin Marinas
2023-08-25  9:35 ` [RFC PATCH v2 7/8] KVM: arm64: Add KVM_CAP_ARM_HW_DBM Shameer Kolothum
2023-08-25  9:35 ` [RFC PATCH v2 8/8] KVM: arm64: Start up SW/HW combined dirty log Shameer Kolothum
2023-09-13 17:30 ` [RFC PATCH v2 0/8] KVM: arm64: Implement " Oliver Upton
2023-09-14  9:47   ` Shameerali Kolothum Thodi
2023-09-15  0:36     ` Oliver Upton
2023-09-18  9:55       ` Shameerali Kolothum Thodi
2023-09-20 21:12         ` Oliver Upton
2023-10-12  7:51         ` Shameerali Kolothum Thodi
