* [PATCH v5 0/4] live migration dirty bitmap support for ARMv7
@ 2014-05-08 0:40 Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Mario Smarduch @ 2014-05-08 0:40 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
This is the v5 patchset of live migration support for ARMv7.
- Tested on two 4-way A15 hardware platforms, QEMU 2-way/4-way SMP guest up to 2GB
- Various dirty data rates tested - 2GB/1s ... 2048 pgs/5ms
- Validated source/destination memory image integrity
- No issues; the v4 time skip was due to the test harness.
Changes since v1:
- added unlock of VM mmu_lock to prevent a deadlock
- moved migration-active flag setting inside the mmu_lock acquire for visibility
in the 2nd stage data abort handler
- added comments
Changes since v2:
- moved initial VM write protect to the memory region architecture prepare
function (needed to make the dirty logging function generic)
- added stage2_mark_pte_ro() to mark ptes read-only - Marc's comment
- optimized initial VM memory region write protect to do fewer table lookups -
applied Marc's comment for walking the dirty bitmap mask
- added pud_addr_end() for stage2 tables, to make the walk 4-level
- added kvm_flush_remote_tlbs() to use ARM TLB invalidation and made the
generic one weak - Marc's comment, likewise for the generic dirty bitmap log
function
- optimized walking the dirty bitmap mask to skip upper tables - Marc's comment
- deleted the x86 and ARM kvm_vm_ioctl_get_dirty_log(), moved it to kvm_main.c
and tagged the function weak - Marc's comment
- changed Data Abort handler pte index handling - Marc's comment
Changes since v3:
- changed pte updates to reset the write bit instead of setting a default
value for existing ptes - Steve's comment
- in addition to PUD, added 2nd stage >4GB range functions - Steve's
suggestion
- restructured the initial memory slot write protect function for PGD, PUD, PMD
table walking - Steve's suggestion
- renamed variable types to resemble their use - Steve's suggestions
- added a couple of pte helpers for 2nd stage tables - Steve's suggestion
- updated unmap_range(), which handles 2nd stage tables and identity mappings,
to handle 2nd stage addresses >4GB. Left ARMv8 unchanged.
Changes since v4:
- rebased to 3.15.0-rc1 'next' to pick up the p*addr_end patches - Gavin's comment
- updated the PUD address end function to support a 4-level page table walk
- eliminated the 5th patch of the series, which fixed unmap_range(), since that
was fixed by Marc's patches.
Mario Smarduch (4):
add ARMv7 HYP API to flush VM TLBs without address param
live migration support for initial write protect of VM
live migration support for VM dirty log management
add 2nd stage page fault handling during live migration
arch/arm/include/asm/kvm_asm.h | 1 +
arch/arm/include/asm/kvm_host.h | 12 ++
arch/arm/include/asm/kvm_mmu.h | 16 +-
arch/arm/kvm/arm.c | 8 +-
arch/arm/kvm/interrupts.S | 5 +
arch/arm/kvm/mmu.c | 318 ++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 86 -----------
virt/kvm/kvm_main.c | 89 ++++++++++-
8 files changed, 440 insertions(+), 95 deletions(-)
--
1.7.9.5
* [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param
2014-05-08 0:40 [PATCH v5 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
@ 2014-05-08 0:40 ` Mario Smarduch
2014-05-14 16:47 ` Christoffer Dall
2014-05-08 0:40 ` [PATCH v5 2/4] live migration support for initial write protect of VM Mario Smarduch
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Mario Smarduch @ 2014-05-08 0:40 UTC (permalink / raw)
To: linux-arm-kernel
Patch adds HYP interface for global VM TLB invalidation without address
parameter.
- Added ARM version of kvm_flush_remote_tlbs()
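For context (illustrative sketch, not part of this patch): the ARM definition
below only links cleanly once the generic kvm_flush_remote_tlbs() in
virt/kvm/kvm_main.c is marked __weak (done later in this series), so the
strong arch-specific definition overrides it at link time. Bodies here are
placeholders only:
/* virt/kvm/kvm_main.c - generic fallback, overridable by arch code */
void __weak kvm_flush_remote_tlbs(struct kvm *kvm)
{
	/* default flush of all VCPUs' TLBs */
}
/* arch/arm/kvm/mmu.c - strong definition replaces the weak one */
void kvm_flush_remote_tlbs(struct kvm *kvm)
{
	if (kvm)
		kvm_call_hyp(__kvm_tlb_flush_vmid, kvm);
}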
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
arch/arm/include/asm/kvm_asm.h | 1 +
arch/arm/include/asm/kvm_host.h | 2 ++
arch/arm/kvm/interrupts.S | 5 +++++
arch/arm/kvm/mmu.c | 10 ++++++++++
4 files changed, 18 insertions(+)
diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
index 53b3c4a..21bc519 100644
--- a/arch/arm/include/asm/kvm_asm.h
+++ b/arch/arm/include/asm/kvm_asm.h
@@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[];
extern void __kvm_flush_vm_context(void);
extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
+extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
#endif
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 193ceaf..ac3bb65 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -231,4 +231,6 @@ int kvm_perf_teardown(void);
u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
+void kvm_tlb_flush_vmid(struct kvm *kvm);
+
#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
index 0d68d40..8620280 100644
--- a/arch/arm/kvm/interrupts.S
+++ b/arch/arm/kvm/interrupts.S
@@ -45,8 +45,12 @@ __kvm_hyp_code_start:
*
* As v7 does not support flushing per IPA, just nuke the whole TLB
* instead, ignoring the ipa value.
+ *
+ * void __kvm_tlb_flush_vm(struct kvm *kvm) - alias on ARMv7 for global VM TLB
+ * flush with no address parameters.
*/
ENTRY(__kvm_tlb_flush_vmid_ipa)
+ENTRY(__kvm_tlb_flush_vmid)
push {r2, r3}
dsb ishst
@@ -65,6 +69,7 @@ ENTRY(__kvm_tlb_flush_vmid_ipa)
pop {r2, r3}
bx lr
ENDPROC(__kvm_tlb_flush_vmid_ipa)
+ENDPROC(__kvm_tlb_flush_vmid)
/********************************************************************
* Flush TLBs and instruction caches of all CPUs inside the inner-shareable
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 80bb1e6..95c172a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -56,6 +56,16 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
}
+/* Function reuses __kvm_tlb_flush_vmid_ipa() HYP interface without additional
+ * address argument to flush entire VM TLBs.
+ */
+void kvm_flush_remote_tlbs(struct kvm *kvm)
+{
+ if (kvm)
+ kvm_call_hyp(__kvm_tlb_flush_vmid, kvm);
+}
+
+
static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
int min, int max)
{
--
1.7.9.5
* [PATCH v5 2/4] live migration support for initial write protect of VM
2014-05-08 0:40 [PATCH v5 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
@ 2014-05-08 0:40 ` Mario Smarduch
2014-05-15 18:53 ` Christoffer Dall
2014-05-08 0:40 ` [PATCH v5 3/4] live migration support for VM dirty log management Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 4/4] add 2nd stage page fault handling during live migration Mario Smarduch
3 siblings, 1 reply; 13+ messages in thread
From: Mario Smarduch @ 2014-05-08 0:40 UTC (permalink / raw)
To: linux-arm-kernel
Patch adds support for live migration initial split up of huge pages
in memory slot and write protection of all pages in memory slot.
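For illustration only (not part of this patch): the initial write protect is
triggered when user space (re)registers a memslot with the
KVM_MEM_LOG_DIRTY_PAGES flag. A minimal user-space sketch; vm_fd, slot and
addresses are assumed to come from earlier VM setup:
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdio.h>
/* Re-register an existing memslot with dirty logging enabled. */
static int enable_dirty_logging(int vm_fd, unsigned int slot,
				unsigned long gpa, unsigned long size,
				void *host_va)
{
	struct kvm_userspace_memory_region mem = {
		.slot            = slot,
		.flags           = KVM_MEM_LOG_DIRTY_PAGES, /* start logging */
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (unsigned long)host_va,
	};
	/* Lands in kvm_arch_prepare_memory_region(), which with this patch
	 * write protects the whole slot via kvm_mmu_slot_remove_write_access().
	 */
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem) < 0) {
		perror("KVM_SET_USER_MEMORY_REGION");
		return -1;
	}
	return 0;
}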
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
arch/arm/include/asm/kvm_host.h | 7 ++
arch/arm/include/asm/kvm_mmu.h | 16 +++-
arch/arm/kvm/arm.c | 3 +
arch/arm/kvm/mmu.c | 179 +++++++++++++++++++++++++++++++++++++++
virt/kvm/kvm_main.c | 6 +-
5 files changed, 209 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index ac3bb65..91744c3 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -67,6 +67,11 @@ struct kvm_arch {
/* Interrupt controller */
struct vgic_dist vgic;
+ /* Marks start of migration, used to handle 2nd stage page faults
+ * during migration, prevent installing huge pages and split huge pages
+ * to small pages.
+ */
+ int migration_in_progress;
};
#define KVM_NR_MEM_OBJS 40
@@ -233,4 +238,6 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
void kvm_tlb_flush_vmid(struct kvm *kvm);
+int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+
#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 5c7aa3c..b339fa9 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -114,13 +114,27 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
pmd_val(*pmd) |= L_PMD_S2_RDWR;
}
+static inline void kvm_set_s2pte_readonly(pte_t *pte)
+{
+ pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
+}
+
+static inline bool kvm_s2pte_readonly(pte_t *pte)
+{
+ return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
+}
+
/* Open coded p*d_addr_end that can deal with 64bit addresses */
#define kvm_pgd_addr_end(addr, end) \
({ u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \
(__boundary - 1 < (end) - 1)? __boundary: (end); \
})
-#define kvm_pud_addr_end(addr,end) (end)
+/* For - 4-level table walk return PUD range end if end > 1GB */
+#define kvm_pud_addr_end(addr, end) \
+({ u64 __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \
+ (__boundary - 1 < (end) - 1) ? __boundary : (end); \
+})
#define kvm_pmd_addr_end(addr, end) \
({ u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 3c82b37..1055266 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -234,6 +234,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
enum kvm_mr_change change)
{
+ /* Request for migration issued by user, write protect memory slot */
+ if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
+ return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
return 0;
}
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 95c172a..85145d8 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -743,6 +743,185 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
return false;
}
+/* kvm_split_pmd - splits huge pages to small pages, required to keep a dirty
+ * log of smaller memory granules, otherwise huge pages would need to be
+ * migrated. Practically an idle system has problems migrating with
+ * huge pages. Called during WP of entire VM address space, done
+ * initially when migration thread isses the KVM_MEM_LOG_DIRTY_PAGES
+ * ioctl.
+ * The mmu_lock is held during splitting.
+ *
+ * @kvm: The KVM pointer
+ * @pmd: Pmd to 2nd stage huge page
+ * @addr: Guest Physical Address
+ */
+static int kvm_split_pmd(struct kvm *kvm, pmd_t *pmd, u64 addr)
+{
+ struct page *page;
+ pfn_t pfn = pmd_pfn(*pmd);
+ pte_t *pte;
+ int i;
+
+ page = alloc_page(GFP_KERNEL);
+ if (page == NULL)
+ return -ENOMEM;
+
+ pte = page_address(page);
+ /* cycle through ptes first, use pmd pfn */
+ for (i = 0; i < PTRS_PER_PMD; i++)
+ pte[i] = pfn_pte(pfn+i, PAGE_S2);
+
+ kvm_clean_pte(pte);
+ /* After page table setup set pmd */
+ pmd_populate_kernel(NULL, pmd, pte);
+
+ /* get reference on pte page */
+ get_page(virt_to_page(pte));
+ return 0;
+}
+
+/* Walks PMD page table range and write protects it. Called with
+ * 'kvm->mmu_lock' * held
+ */
+static void stage2_wp_pmd_range(phys_addr_t addr, phys_addr_t end, pmd_t *pmd)
+{
+ pte_t *pte;
+
+ while (addr < end) {
+ pte = pte_offset_kernel(pmd, addr);
+ addr += PAGE_SIZE;
+ if (!pte_present(*pte))
+ continue;
+ /* skip write protected pages */
+ if (kvm_s2pte_readonly(pte))
+ continue;
+ kvm_set_s2pte_readonly(pte);
+ }
+}
+
+/* Walks PUD page table range to write protects it , if necessary spluts up
+ * huge pages to small pages. Called with 'kvm->mmu_lock' held.
+ */
+static int stage2_wp_pud_range(struct kvm *kvm, phys_addr_t addr,
+ phys_addr_t end, pud_t *pud)
+{
+ pmd_t *pmd;
+ phys_addr_t pmd_end;
+ int ret;
+
+ while (addr < end) {
+ /* If needed give up CPU during PUD page table walk */
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+ cond_resched_lock(&kvm->mmu_lock);
+
+ pmd = pmd_offset(pud, addr);
+ if (!pmd_present(*pmd)) {
+ addr = kvm_pmd_addr_end(addr, end);
+ continue;
+ }
+
+ if (kvm_pmd_huge(*pmd)) {
+ ret = kvm_split_pmd(kvm, pmd, addr);
+ /* Failed to split up huge page abort. */
+ if (ret < 0)
+ return ret;
+
+ addr = kvm_pmd_addr_end(addr, end);
+ continue;
+ }
+
+ pmd_end = kvm_pmd_addr_end(addr, end);
+ stage2_wp_pmd_range(addr, pmd_end, pmd);
+ addr = pmd_end;
+ }
+ return 0;
+}
+
+/* Walks PGD page table range to write protect it. Called with 'kvm->mmu_lock'
+ * held.
+ */
+static int stage2_wp_pgd_range(struct kvm *kvm, phys_addr_t addr,
+ phys_addr_t end, pgd_t *pgd)
+{
+ phys_addr_t pud_end;
+ pud_t *pud;
+ int ret;
+
+ while (addr < end) {
+ /* give up CPU if mmu_lock is needed by other vCPUs */
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+ cond_resched_lock(&kvm->mmu_lock);
+
+ pud = pud_offset(pgd, addr);
+ if (!pud_present(*pud)) {
+ addr = kvm_pud_addr_end(addr, end);
+ continue;
+ }
+
+ /* Fail if PUD is huge, splitting PUDs not supported */
+ if (pud_huge(*pud))
+ return -EFAULT;
+
+ /* By default 'nopud' is supported which fails with guests
+ * larger then 1GB. Added to support 4-level page tables.
+ */
+ pud_end = kvm_pud_addr_end(addr, end);
+ ret = stage2_wp_pud_range(kvm, addr, pud_end, pud);
+ if (ret < 0)
+ return ret;
+ addr = pud_end;
+ }
+ return 0;
+}
+
+/**
+ * kvm_mmu_slot_remove_access - write protects entire memslot address space.
+ * Called at start of migration when KVM_MEM_LOG_DIRTY_PAGES ioctl is
+ * issued. After this function returns all pages (minus the ones faulted
+ * in or released when mmu_lock is give nup) must be write protected to
+ * keep track of dirty pages to migrate on subsequent dirty log read.
+ * mmu_lock is held during write protecting, released on contention.
+ *
+ * @kvm: The KVM pointer
+ * @slot: The memory slot the dirty log is retrieved for
+ */
+int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+{
+ pgd_t *pgd;
+ pgd_t *pgdp = kvm->arch.pgd;
+ struct kvm_memory_slot *memslot = id_to_memslot(kvm->memslots, slot);
+ phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
+ phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+ phys_addr_t pgdir_end;
+ int ret = -ENOMEM;
+
+ spin_lock(&kvm->mmu_lock);
+ /* set start of migration, sychronize with Data Abort handler */
+ kvm->arch.migration_in_progress = 1;
+
+ /* Walk range, split up huge pages as needed and write protect ptes */
+ while (addr < end) {
+ pgd = pgdp + pgd_index(addr);
+ if (!pgd_present(*pgd)) {
+ addr = kvm_pgd_addr_end(addr, end);
+ continue;
+ }
+
+ pgdir_end = kvm_pgd_addr_end(addr, end);
+ ret = stage2_wp_pgd_range(kvm, addr, pgdir_end, pgd);
+ /* Failed to WP a pgd range abort */
+ if (ret < 0)
+ goto out;
+ addr = pgdir_end;
+ }
+ ret = 0;
+ /* Flush TLBs, >= ARMv7 variant uses hardware broadcast not IPIs */
+ kvm_flush_remote_tlbs(kvm);
+out:
+ spin_unlock(&kvm->mmu_lock);
+ return ret;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot,
unsigned long fault_status)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa70c6e..e49f976 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -184,7 +184,11 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
return called;
}
-void kvm_flush_remote_tlbs(struct kvm *kvm)
+/*
+ * Architectures like >= ARMv7 hardware broadcast TLB invalidations and don't
+ * use IPIs, to flush TLBs.
+ */
+void __weak kvm_flush_remote_tlbs(struct kvm *kvm)
{
long dirty_count = kvm->tlbs_dirty;
--
1.7.9.5
* [PATCH v5 3/4] live migration support for VM dirty log management
2014-05-08 0:40 [PATCH v5 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 2/4] live migration support for initial write protect of VM Mario Smarduch
@ 2014-05-08 0:40 ` Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 4/4] add 2nd stage page fault handling during live migration Mario Smarduch
3 siblings, 0 replies; 13+ messages in thread
From: Mario Smarduch @ 2014-05-08 0:40 UTC (permalink / raw)
To: linux-arm-kernel
This patch adds support for keeping track of VM dirty pages by updating the
per-memslot dirty bitmap and write protecting the pages again.
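For illustration only (not part of this patch): the dirty bitmap is consumed
from user space with the KVM_GET_DIRTY_LOG ioctl. A minimal sketch; vm_fd,
slot and the bitmap buffer are assumed to be set up elsewhere:
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <string.h>
/* Fetch (and implicitly clear) the dirty bitmap of one memslot. The buffer
 * must hold one bit per page in the slot, rounded up to a multiple of
 * sizeof(long).
 */
static int get_dirty_log(int vm_fd, unsigned int slot,
			 unsigned long *bitmap, size_t bitmap_bytes)
{
	struct kvm_dirty_log log;
	memset(bitmap, 0, bitmap_bytes);
	memset(&log, 0, sizeof(log));
	log.slot = slot;
	log.dirty_bitmap = bitmap;
	/* Pages reported dirty here have been write protected again by
	 * kvm_mmu_write_protect_pt_masked() in preparation for the next call.
	 */
	return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
}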
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
arch/arm/include/asm/kvm_host.h | 3 ++
arch/arm/kvm/arm.c | 5 --
arch/arm/kvm/mmu.c | 99 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 86 ----------------------------------
virt/kvm/kvm_main.c | 83 ++++++++++++++++++++++++++++++++
5 files changed, 185 insertions(+), 91 deletions(-)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 91744c3..e2db1b5 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -239,5 +239,8 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
void kvm_tlb_flush_vmid(struct kvm *kvm);
int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask);
#endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 1055266..0b847b5 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -777,11 +777,6 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
}
}
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
- return -EINVAL;
-}
-
static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
struct kvm_arm_device_addr *dev_addr)
{
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 85145d8..1458b6e 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -922,6 +922,105 @@ out:
return ret;
}
+/**
+ * kvm_mmu_write_protect_pt_masked - after the migration thread write protects
+ * the entire VM address space, iterative calls are made to get dirty pages
+ * as the VM pages are being migrated. New dirty pages may be a subset of the
+ * initially write protected VM or new writes faulted in. Here the new dirty
+ * pages are write protected again in preparation for the next dirty log read.
+ * This function is called as a result of the KVM_GET_DIRTY_LOG ioctl, to
+ * determine what pages need to be migrated.
+ * 'kvm->mmu_lock' must be held to protect against concurrent modification
+ * of page tables (2nd stage fault, mmu modifiers, ...)
+ *
+ * @kvm: The KVM pointer
+ * @slot: The memory slot the dirty log is retrieved for
+ * @gfn_offset: The gfn offset in the memory slot
+ * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory
+ * slot to be write protected
+ */
+
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
+{
+ phys_addr_t ipa, next, offset_ipa;
+ pgd_t *pgdp = kvm->arch.pgd, *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ gfn_t gfnofst = slot->base_gfn + gfn_offset;
+ bool crosses_pmd;
+
+ ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
+ offset_ipa = gfnofst << PAGE_SHIFT;
+ next = (gfnofst + (BITS_PER_LONG - 1)) << PAGE_SHIFT;
+
+ /* Check if the mask width crosses a 2nd level page table range, and
+ * possibly the 3rd, 4th. If not, skip the upper table lookups. Unlikely
+ * to be true; machine memory regions tend to start on at least a PMD
+ * boundary and the mask is a power of 2.
+ */
+ crosses_pmd = ((offset_ipa & PMD_MASK) ^ (next & PMD_MASK)) ? true :
+ false;
+
+ /* If the pgd, pud or pmd is not present and the mask crosses a pmd range,
+ * check the next index. Unlikely that the pgd and pud would not be present.
+ * Page tables may have been altered between dirty page marking and now.
+ */
+ pgd = pgdp + pgd_index(ipa);
+ if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
+ pgd = pgdp + pgd_index(next);
+ if (!pgd_present(*pgd))
+ return;
+ }
+
+ pud = pud_offset(pgd, ipa);
+ if (unlikely(crosses_pmd && !pud_present(*pud))) {
+ pud = pud_offset(pgd, next);
+ if (!pud_present(*pud))
+ return;
+ }
+
+ pmd = pmd_offset(pud, ipa);
+ if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
+ pmd = pmd_offset(pud, next);
+ if (!pmd_present(*pmd))
+ return;
+ }
+
+ for (;;) {
+ pte = pte_offset_kernel(pmd, ipa);
+ if (!pte_present(*pte))
+ goto next_ipa;
+
+ if (kvm_s2pte_readonly(pte))
+ goto next_ipa;
+ kvm_set_s2pte_readonly(pte);
+next_ipa:
+ mask &= mask - 1;
+ if (!mask)
+ break;
+
+ /* find next page */
+ ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
+
+ /* skip upper page table lookups */
+ if (!crosses_pmd)
+ continue;
+
+ pgd = pgdp + pgd_index(ipa);
+ if (unlikely(!pgd_present(*pgd)))
+ goto next_ipa;
+ pud = pud_offset(pgd, ipa);
+ if (unlikely(!pud_present(*pud)))
+ goto next_ipa;
+ pmd = pmd_offset(pud, ipa);
+ if (unlikely(!pmd_present(*pmd)))
+ goto next_ipa;
+ }
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot,
unsigned long fault_status)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5582c3..a603ca3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3569,92 +3569,6 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
return 0;
}
-/**
- * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
- * @kvm: kvm instance
- * @log: slot id and address to which we copy the log
- *
- * We need to keep it in mind that VCPU threads can write to the bitmap
- * concurrently. So, to avoid losing data, we keep the following order for
- * each bit:
- *
- * 1. Take a snapshot of the bit and clear it if needed.
- * 2. Write protect the corresponding page.
- * 3. Flush TLB's if needed.
- * 4. Copy the snapshot to the userspace.
- *
- * Between 2 and 3, the guest may write to the page using the remaining TLB
- * entry. This is not a problem because the page will be reported dirty at
- * step 4 using the snapshot taken before and step 3 ensures that successive
- * writes will be logged for the next call.
- */
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
- int r;
- struct kvm_memory_slot *memslot;
- unsigned long n, i;
- unsigned long *dirty_bitmap;
- unsigned long *dirty_bitmap_buffer;
- bool is_dirty = false;
-
- mutex_lock(&kvm->slots_lock);
-
- r = -EINVAL;
- if (log->slot >= KVM_USER_MEM_SLOTS)
- goto out;
-
- memslot = id_to_memslot(kvm->memslots, log->slot);
-
- dirty_bitmap = memslot->dirty_bitmap;
- r = -ENOENT;
- if (!dirty_bitmap)
- goto out;
-
- n = kvm_dirty_bitmap_bytes(memslot);
-
- dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
- memset(dirty_bitmap_buffer, 0, n);
-
- spin_lock(&kvm->mmu_lock);
-
- for (i = 0; i < n / sizeof(long); i++) {
- unsigned long mask;
- gfn_t offset;
-
- if (!dirty_bitmap[i])
- continue;
-
- is_dirty = true;
-
- mask = xchg(&dirty_bitmap[i], 0);
- dirty_bitmap_buffer[i] = mask;
-
- offset = i * BITS_PER_LONG;
- kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
- }
-
- spin_unlock(&kvm->mmu_lock);
-
- /* See the comments in kvm_mmu_slot_remove_write_access(). */
- lockdep_assert_held(&kvm->slots_lock);
-
- /*
- * All the TLBs can be flushed out of mmu lock, see the comments in
- * kvm_mmu_slot_remove_write_access().
- */
- if (is_dirty)
- kvm_flush_remote_tlbs(kvm);
-
- r = -EFAULT;
- if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
- goto out;
-
- r = 0;
-out:
- mutex_unlock(&kvm->slots_lock);
- return r;
-}
-
int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
bool line_status)
{
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e49f976..7d95700 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -433,6 +433,89 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
}
+
+/**
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
+ *
+ * Shared by x86 and ARM.
+ *
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently. So, to avoid losing data, we keep the following order for
+ * each bit:
+ *
+ * 1. Take a snapshot of the bit and clear it if needed.
+ * 2. Write protect the corresponding page.
+ * 3. Flush TLB's if needed.
+ * 4. Copy the snapshot to the userspace.
+ *
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry. This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
+ */
+
+int __weak kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
+ struct kvm_dirty_log *log)
+{
+ int r;
+ struct kvm_memory_slot *memslot;
+ unsigned long n, i;
+ unsigned long *dirty_bitmap;
+ unsigned long *dirty_bitmap_buffer;
+ bool is_dirty = false;
+
+ mutex_lock(&kvm->slots_lock);
+
+ r = -EINVAL;
+ if (log->slot >= KVM_USER_MEM_SLOTS)
+ goto out;
+
+ memslot = id_to_memslot(kvm->memslots, log->slot);
+
+ dirty_bitmap = memslot->dirty_bitmap;
+ r = -ENOENT;
+ if (!dirty_bitmap)
+ goto out;
+
+ n = kvm_dirty_bitmap_bytes(memslot);
+
+ dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+ memset(dirty_bitmap_buffer, 0, n);
+
+ spin_lock(&kvm->mmu_lock);
+
+ for (i = 0; i < n / sizeof(long); i++) {
+ unsigned long mask;
+ gfn_t offset;
+
+ if (!dirty_bitmap[i])
+ continue;
+
+ is_dirty = true;
+
+ mask = xchg(&dirty_bitmap[i], 0);
+ dirty_bitmap_buffer[i] = mask;
+
+ offset = i * BITS_PER_LONG;
+ kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+ }
+ if (is_dirty)
+ kvm_flush_remote_tlbs(kvm);
+
+ spin_unlock(&kvm->mmu_lock);
+
+ r = -EFAULT;
+ if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+ goto out;
+
+ r = 0;
+out:
+ mutex_unlock(&kvm->slots_lock);
+ return r;
+}
+
#else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
static int kvm_init_mmu_notifier(struct kvm *kvm)
--
1.7.9.5
* [PATCH v5 4/4] add 2nd stage page fault handling during live migration
2014-05-08 0:40 [PATCH v5 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
` (2 preceding siblings ...)
2014-05-08 0:40 ` [PATCH v5 3/4] live migration support for VM dirty log management Mario Smarduch
@ 2014-05-08 0:40 ` Mario Smarduch
3 siblings, 0 replies; 13+ messages in thread
From: Mario Smarduch @ 2014-05-08 0:40 UTC (permalink / raw)
To: linux-arm-kernel
This patch adds support for handling 2nd stage page faults during migration;
it disables faulting in huge pages and splits up existing huge pages.
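A worked example of the 'pfn += pte_index(fault_ipa)' adjustment done below
(illustrative numbers only; LPAE stage 2 with 4K pages, 512 ptes per pmd):
/*
 * pte_index(addr) = (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)
 *
 *   fault_ipa            = 0x08123000
 *   pte_index(fault_ipa) = (0x08123000 >> 12) & 0x1ff = 0x123
 *   huge page base pfn   = 0x40000   (pfn of the former 2MB mapping)
 *   pfn actually mapped  = 0x40000 + 0x123 = 0x40123
 *
 * so the pte installed by stage2_set_pte() points at the 4K page that
 * faulted inside the former 2MB mapping, rather than its first page.
 */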
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
arch/arm/kvm/mmu.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1458b6e..b0633dc 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1034,6 +1034,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
struct vm_area_struct *vma;
pfn_t pfn;
+ bool migration_active;
write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
if (fault_status == FSC_PERM && !write_fault) {
@@ -1085,12 +1086,22 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
return -EFAULT;
spin_lock(&kvm->mmu_lock);
+
+ /* Placed inside the lock to prevent a race condition while the whole VM is
+ * being write protected. Also prevents racing a huge page install while
+ * migration is active.
+ */
+ migration_active = vcpu->kvm->arch.migration_in_progress;
+
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
- if (!hugetlb && !force_pte)
+
+ /* During migration there is no need to rebuild huge pages */
+ if (!hugetlb && !force_pte && !migration_active)
hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
- if (hugetlb) {
+ /* During migration don't install new huge pages */
+ if (hugetlb && !migration_active) {
pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
new_pmd = pmd_mkhuge(new_pmd);
if (writable) {
@@ -1102,6 +1113,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
} else {
pte_t new_pte = pfn_pte(pfn, PAGE_S2);
if (writable) {
+ /* First, while migration is in progress, convert the huge page
+ * pfn to the pfn of the small page that actually faulted.
+ * Second, if the pmd maps a huge page, clear the pmd so that
+ * stage2_set_pte() can split it.
+ */
+ if (migration_active && hugetlb) {
+ pmd_t *pmd;
+ pfn += pte_index(fault_ipa);
+ new_pte = pfn_pte(pfn, PAGE_S2);
+ pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
+ if (pmd && kvm_pmd_huge(*pmd))
+ clear_pmd_entry(kvm, pmd, fault_ipa);
+ }
kvm_set_s2pte_writable(&new_pte);
kvm_set_pfn_dirty(pfn);
}
@@ -1109,6 +1133,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
}
+ if (writable)
+ mark_page_dirty(kvm, gfn);
out_unlock:
spin_unlock(&kvm->mmu_lock);
--
1.7.9.5
* [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param
2014-05-08 0:40 ` [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
@ 2014-05-14 16:47 ` Christoffer Dall
2014-05-15 2:00 ` Mario Smarduch
0 siblings, 1 reply; 13+ messages in thread
From: Christoffer Dall @ 2014-05-14 16:47 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, May 07, 2014 at 05:40:13PM -0700, Mario Smarduch wrote:
> Patch adds HYP interface for global VM TLB invalidation without address
> parameter.
>
> - Added ARM version of kvm_flush_remote_tlbs()
>
>
> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
> ---
> arch/arm/include/asm/kvm_asm.h | 1 +
> arch/arm/include/asm/kvm_host.h | 2 ++
> arch/arm/kvm/interrupts.S | 5 +++++
> arch/arm/kvm/mmu.c | 10 ++++++++++
> 4 files changed, 18 insertions(+)
>
> diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
> index 53b3c4a..21bc519 100644
> --- a/arch/arm/include/asm/kvm_asm.h
> +++ b/arch/arm/include/asm/kvm_asm.h
> @@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[];
>
> extern void __kvm_flush_vm_context(void);
> extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
> +extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
>
> extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
> #endif
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 193ceaf..ac3bb65 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -231,4 +231,6 @@ int kvm_perf_teardown(void);
> u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
> int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>
> +void kvm_tlb_flush_vmid(struct kvm *kvm);
> +
> #endif /* __ARM_KVM_HOST_H__ */
> diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
> index 0d68d40..8620280 100644
> --- a/arch/arm/kvm/interrupts.S
> +++ b/arch/arm/kvm/interrupts.S
> @@ -45,8 +45,12 @@ __kvm_hyp_code_start:
> *
> * As v7 does not support flushing per IPA, just nuke the whole TLB
> * instead, ignoring the ipa value.
> + *
> + * void __kvm_tlb_flush_vm(struct kvm *kvm) - alias on ARMv7 for global VM TLB
> + * flush with no address parameters.
> */
> ENTRY(__kvm_tlb_flush_vmid_ipa)
> +ENTRY(__kvm_tlb_flush_vmid)
> push {r2, r3}
>
> dsb ishst
> @@ -65,6 +69,7 @@ ENTRY(__kvm_tlb_flush_vmid_ipa)
> pop {r2, r3}
> bx lr
> ENDPROC(__kvm_tlb_flush_vmid_ipa)
> +ENDPROC(__kvm_tlb_flush_vmid)
yikes, can you please make this a separate function that calls the other
one?
>
> /********************************************************************
> * Flush TLBs and instruction caches of all CPUs inside the inner-shareable
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 80bb1e6..95c172a 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -56,6 +56,16 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
> }
>
> +/* Function reuses __kvm_tlb_flush_vmid_ipa() HYP interface without additional
> + * address argument to flush entire VM TLBs.
> + */
This is not proper kernel commenting formatting, please see
Documentation/CodingStyle.
For new exported functions in the KVM/ARM code, please add kdocs style
documentation to the functions.
> +void kvm_flush_remote_tlbs(struct kvm *kvm)
This doesn't build?:
arch/arm/kvm/mmu.o: In function `kvm_flush_remote_tlbs':
mmu.c:(.text+0xc7c): multiple definition of `kvm_flush_remote_tlbs'
> +{
> + if (kvm)
> + kvm_call_hyp(__kvm_tlb_flush_vmid, kvm);
> +}
> +
> +
> static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
> int min, int max)
> {
> --
> 1.7.9.5
>
* [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param
2014-05-14 16:47 ` Christoffer Dall
@ 2014-05-15 2:00 ` Mario Smarduch
2014-05-15 18:50 ` Christoffer Dall
0 siblings, 1 reply; 13+ messages in thread
From: Mario Smarduch @ 2014-05-15 2:00 UTC (permalink / raw)
To: linux-arm-kernel
On 05/14/2014 09:47 AM, Christoffer Dall wrote:
> On Wed, May 07, 2014 at 05:40:13PM -0700, Mario Smarduch wrote:
>> Patch adds HYP interface for global VM TLB invalidation without address
>> parameter.
>>
>> - Added ARM version of kvm_flush_remote_tlbs()
>>
>>
>> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
>> ---
>> arch/arm/include/asm/kvm_asm.h | 1 +
>> arch/arm/include/asm/kvm_host.h | 2 ++
>> arch/arm/kvm/interrupts.S | 5 +++++
>> arch/arm/kvm/mmu.c | 10 ++++++++++
>> 4 files changed, 18 insertions(+)
>>
>> diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
>> index 53b3c4a..21bc519 100644
>> --- a/arch/arm/include/asm/kvm_asm.h
>> +++ b/arch/arm/include/asm/kvm_asm.h
>> @@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[];
>>
>> extern void __kvm_flush_vm_context(void);
>> extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
>> +extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
>>
>> extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
>> #endif
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index 193ceaf..ac3bb65 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -231,4 +231,6 @@ int kvm_perf_teardown(void);
>> u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
>> int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>>
>> +void kvm_tlb_flush_vmid(struct kvm *kvm);
>> +
>> #endif /* __ARM_KVM_HOST_H__ */
>> diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
>> index 0d68d40..8620280 100644
>> --- a/arch/arm/kvm/interrupts.S
>> +++ b/arch/arm/kvm/interrupts.S
>> @@ -45,8 +45,12 @@ __kvm_hyp_code_start:
>> *
>> * As v7 does not support flushing per IPA, just nuke the whole TLB
>> * instead, ignoring the ipa value.
>> + *
>> + * void __kvm_tlb_flush_vm(struct kvm *kvm) - alias on ARMv7 for global VM TLB
>> + * flush with no address parameters.
>> */
>> ENTRY(__kvm_tlb_flush_vmid_ipa)
>> +ENTRY(__kvm_tlb_flush_vmid)
>> push {r2, r3}
>>
>> dsb ishst
>> @@ -65,6 +69,7 @@ ENTRY(__kvm_tlb_flush_vmid_ipa)
>> pop {r2, r3}
>> bx lr
>> ENDPROC(__kvm_tlb_flush_vmid_ipa)
>> +ENDPROC(__kvm_tlb_flush_vmid)
>
> yikes, can you please make this a separate function that calls the other
> one?
Done - made it a separate function. I got the idea from entry-common.S:
ENTRY(ret_to_user), ENTRY(ret_to_user_from_irq) and others.
>
>>
>> /********************************************************************
>> * Flush TLBs and instruction caches of all CPUs inside the inner-shareable
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 80bb1e6..95c172a 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -56,6 +56,16 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>> }
>>
>> +/* Function reuses __kvm_tlb_flush_vmid_ipa() HYP interface without additional
>> + * address argument to flush entire VM TLBs.
>> + */
>
> This is not proper kernel commenting formatting, please see
> Documentation/CodingStyle.
>
> For new exported functions in the KVM/ARM code, please add kdocs style
> documentation to the functions.
Done.
>
>> +void kvm_flush_remote_tlbs(struct kvm *kvm)
>
> This doesn't build?:
I reworked the patch series to build successfully after
applying each patch. This patch was missing a weak
declaration of the function in virt/kvm/kvm_main.c.
I simplified some related code for PMD splitting
reusing current mmu.c code, instead of reinventing.
I'll email new patch series tomorrow, you might not want
to waste your time on 2-4.
Thanks.
- Mario
>
> arch/arm/kvm/mmu.o: In function `kvm_flush_remote_tlbs':
> mmu.c:(.text+0xc7c): multiple definition of `kvm_flush_remote_tlbs'
>
>> +{
>> + if (kvm)
>> + kvm_call_hyp(__kvm_tlb_flush_vmid, kvm);
>> +}
>> +
>> +
>> static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>> int min, int max)
>> {
>> --
>> 1.7.9.5
>>
* [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param
2014-05-15 2:00 ` Mario Smarduch
@ 2014-05-15 18:50 ` Christoffer Dall
0 siblings, 0 replies; 13+ messages in thread
From: Christoffer Dall @ 2014-05-15 18:50 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, May 14, 2014 at 07:00:18PM -0700, Mario Smarduch wrote:
> On 05/14/2014 09:47 AM, Christoffer Dall wrote:
> > On Wed, May 07, 2014 at 05:40:13PM -0700, Mario Smarduch wrote:
[...]
> >> +void kvm_flush_remote_tlbs(struct kvm *kvm)
> >
> > This doesn't build?:
>
> I reworked the patch series to build successfully after
> applying each patch. This patch was missing a weak
> declaration of the function in virt/kvm/kvm_main.c.
>
> I simplified some related code for PMD splitting
> reusing current mmu.c code, instead of reinventing.
> I'll email new patch series tomorrow, you might not want
> to waste your time on 2-4.
>
ok, I'll review the new series.
Thanks,
-Christoffer
* [PATCH v5 2/4] live migration support for initial write protect of VM
2014-05-08 0:40 ` [PATCH v5 2/4] live migration support for initial write protect of VM Mario Smarduch
@ 2014-05-15 18:53 ` Christoffer Dall
2014-05-15 22:51 ` Mario Smarduch
2014-05-30 16:48 ` Mario Smarduch
0 siblings, 2 replies; 13+ messages in thread
From: Christoffer Dall @ 2014-05-15 18:53 UTC (permalink / raw)
To: linux-arm-kernel
[I know you sent out a newer version but I already reviewed some of this
patch on the plane today but couldn't send it out before I got home.
Anyway, here it is:]
On Wed, May 07, 2014 at 05:40:14PM -0700, Mario Smarduch wrote:
> Patch adds support for live migration initial split up of huge pages
> in memory slot and write protection of all pages in memory slot.
>
> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
> ---
> arch/arm/include/asm/kvm_host.h | 7 ++
> arch/arm/include/asm/kvm_mmu.h | 16 +++-
> arch/arm/kvm/arm.c | 3 +
> arch/arm/kvm/mmu.c | 179 +++++++++++++++++++++++++++++++++++++++
> virt/kvm/kvm_main.c | 6 +-
> 5 files changed, 209 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index ac3bb65..91744c3 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -67,6 +67,11 @@ struct kvm_arch {
>
> /* Interrupt controller */
> struct vgic_dist vgic;
> + /* Marks start of migration, used to handle 2nd stage page faults
> + * during migration, prevent installing huge pages and split huge pages
> + * to small pages.
> + */
commenting style
this is a bit verbose for a field in a struct, perhaps moving the longer
version to where you set this?
> + int migration_in_progress;
> };
>
> #define KVM_NR_MEM_OBJS 40
> @@ -233,4 +238,6 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>
> void kvm_tlb_flush_vmid(struct kvm *kvm);
>
> +int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
> +
> #endif /* __ARM_KVM_HOST_H__ */
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 5c7aa3c..b339fa9 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -114,13 +114,27 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
> pmd_val(*pmd) |= L_PMD_S2_RDWR;
> }
>
> +static inline void kvm_set_s2pte_readonly(pte_t *pte)
> +{
> + pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
This relies on the pte already having been set as RDONLY or RDWR, if you
are creating a new pte and calling this function it could be easy to
miss that distinction, I would prefer:
pte_val(*pte) &= L_PTE_S2_RDWR;
pte_val(*pte) |= L_PTE_S2_RDONLY;
> +}
> +
> +static inline bool kvm_s2pte_readonly(pte_t *pte)
> +{
> + return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
> +}
> +
> /* Open coded p*d_addr_end that can deal with 64bit addresses */
> #define kvm_pgd_addr_end(addr, end) \
> ({ u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \
> (__boundary - 1 < (end) - 1)? __boundary: (end); \
> })
>
> -#define kvm_pud_addr_end(addr,end) (end)
> +/* For - 4-level table walk return PUD range end if end > 1GB */
not sure you need this comment, the scheme is very common all over the
kernel.
> +#define kvm_pud_addr_end(addr, end) \
> +({ u64 __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \
> + (__boundary - 1 < (end) - 1) ? __boundary : (end); \
> +})
why do we need this? We should only ever have 3 levels of page tables,
right?
>
> #define kvm_pmd_addr_end(addr, end) \
> ({ u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 3c82b37..1055266 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -234,6 +234,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> struct kvm_userspace_memory_region *mem,
> enum kvm_mr_change change)
> {
> + /* Request for migration issued by user, write protect memory slot */
Does this necessarily only happen when there's a request for migration?
Isn't it just a log call that could be used for other things
(potentially)?
> + if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
> + return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
> return 0;
> }
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 95c172a..85145d8 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -743,6 +743,185 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
> return false;
> }
>
> +/* kvm_split_pmd - splits huge pages to small pages, required to keep a dirty
> + * log of smaller memory granules, otherwise huge pages would need to be
> + * migrated. Practically an idle system has problems migrating with
This seems abrupt. Why can't we just represent a 2M huge page as 512 4K
bits and write protect the huge pages, if you take a write fault on a 2M
page, then split it then.
If your use case is HA, then you will be doing this a lot, and you don't
want to hurt performance of your main live system more than necessary.
> + * huge pages. Called during WP of entire VM address space, done
s/WP/write protect/
> + * initially when migration thread isses the KVM_MEM_LOG_DIRTY_PAGES
"migration thread"? I don't think this is a KVM term. Please keep
documentation to concepts that can be understood from the KVM
perspective without knowing anything specific about user space
implementation.
> + * ioctl.
This is not an ioctl but a flag in an ioctl. There's another ioctl to
get the dirty log.
> + * The mmu_lock is held during splitting.
> + *
> + * @kvm: The KVM pointer
> + * @pmd: Pmd to 2nd stage huge page
> + * @addr: Guest Physical Address
> + */
Thanks for commenting the function, however, a few nits:
again, check the commenting style, checkpatch should complain, kdocs
usually look like this:
/*
* kvm_split_pmd - <one line description of function>
*
* Some more description of the function which can be on multiple lines.
*/
can you also comment on the return value?
> +static int kvm_split_pmd(struct kvm *kvm, pmd_t *pmd, u64 addr)
> +{
> + struct page *page;
> + pfn_t pfn = pmd_pfn(*pmd);
> + pte_t *pte;
> + int i;
> +
> + page = alloc_page(GFP_KERNEL);
> + if (page == NULL)
> + return -ENOMEM;
> +
> + pte = page_address(page);
> + /* cycle through ptes first, use pmd pfn */
> + for (i = 0; i < PTRS_PER_PMD; i++)
> + pte[i] = pfn_pte(pfn+i, PAGE_S2);
> +
> + kvm_clean_pte(pte);
> + /* After page table setup set pmd */
> + pmd_populate_kernel(NULL, pmd, pte);
> +
> + /* get reference on pte page */
> + get_page(virt_to_page(pte));
> + return 0;
> +}
> +
> +/* Walks PMD page table range and write protects it. Called with
> + * 'kvm->mmu_lock' * held
> + */
> +static void stage2_wp_pmd_range(phys_addr_t addr, phys_addr_t end, pmd_t *pmd)
> +{
> + pte_t *pte;
> +
> + while (addr < end) {
> + pte = pte_offset_kernel(pmd, addr);
> + addr += PAGE_SIZE;
> + if (!pte_present(*pte))
> + continue;
> + /* skip write protected pages */
> + if (kvm_s2pte_readonly(pte))
> + continue;
> + kvm_set_s2pte_readonly(pte);
> + }
> +}
> +
> +/* Walks PUD page table range to write protects it , if necessary spluts up
> + * huge pages to small pages. Called with 'kvm->mmu_lock' held.
> + */
> +static int stage2_wp_pud_range(struct kvm *kvm, phys_addr_t addr,
> + phys_addr_t end, pud_t *pud)
> +{
> + pmd_t *pmd;
> + phys_addr_t pmd_end;
> + int ret;
> +
> + while (addr < end) {
> + /* If needed give up CPU during PUD page table walk */
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> + cond_resched_lock(&kvm->mmu_lock);
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd)) {
> + addr = kvm_pmd_addr_end(addr, end);
> + continue;
> + }
> +
> + if (kvm_pmd_huge(*pmd)) {
> + ret = kvm_split_pmd(kvm, pmd, addr);
> + /* Failed to split up huge page abort. */
> + if (ret < 0)
> + return ret;
> +
> + addr = kvm_pmd_addr_end(addr, end);
> + continue;
> + }
> +
> + pmd_end = kvm_pmd_addr_end(addr, end);
> + stage2_wp_pmd_range(addr, pmd_end, pmd);
> + addr = pmd_end;
> + }
> + return 0;
> +}
> +
> +/* Walks PGD page table range to write protect it. Called with 'kvm->mmu_lock'
> + * held.
> + */
> +static int stage2_wp_pgd_range(struct kvm *kvm, phys_addr_t addr,
> + phys_addr_t end, pgd_t *pgd)
> +{
> + phys_addr_t pud_end;
> + pud_t *pud;
> + int ret;
> +
> + while (addr < end) {
> + /* give up CPU if mmu_lock is needed by other vCPUs */
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> + cond_resched_lock(&kvm->mmu_lock);
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud)) {
> + addr = kvm_pud_addr_end(addr, end);
> + continue;
> + }
> +
> + /* Fail if PUD is huge, splitting PUDs not supported */
> + if (pud_huge(*pud))
> + return -EFAULT;
> +
> + /* By default 'nopud' is supported which fails with guests
> + * larger then 1GB. Added to support 4-level page tables.
> + */
> + pud_end = kvm_pud_addr_end(addr, end);
> + ret = stage2_wp_pud_range(kvm, addr, pud_end, pud);
> + if (ret < 0)
> + return ret;
> + addr = pud_end;
> + }
> + return 0;
> +}
> +
> +/**
> + * kvm_mmu_slot_remove_access - write protects entire memslot address space.
> + * Called at start of migration when KVM_MEM_LOG_DIRTY_PAGES ioctl is
> + * issued. After this function returns all pages (minus the ones faulted
> + * in or released when mmu_lock is give nup) must be write protected to
> + * keep track of dirty pages to migrate on subsequent dirty log read.
> + * mmu_lock is held during write protecting, released on contention.
> + *
> + * @kvm: The KVM pointer
> + * @slot: The memory slot the dirty log is retrieved for
> + */
your indentation is weird here and you also need a new line after the
first description. Also missing return value.
> +int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> +{
> + pgd_t *pgd;
> + pgd_t *pgdp = kvm->arch.pgd;
> + struct kvm_memory_slot *memslot = id_to_memslot(kvm->memslots, slot);
> + phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
> + phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
> + phys_addr_t pgdir_end;
> + int ret = -ENOMEM;
> +
> + spin_lock(&kvm->mmu_lock);
> + /* set start of migration, sychronize with Data Abort handler */
s/sychronize/synchronize/
What is Data Abort handler? user_mem_abort()? Can you use a concrete
function name?
> + kvm->arch.migration_in_progress = 1;
> +
> + /* Walk range, split up huge pages as needed and write protect ptes */
> + while (addr < end) {
I think this should be rewritten to use the scheme used in
stage2_flush_memslot, I will submit a patch one of these days that
changes unmap_range() as well, you can wait for that and take a look.
> + pgd = pgdp + pgd_index(addr);
> + if (!pgd_present(*pgd)) {
> + addr = kvm_pgd_addr_end(addr, end);
> + continue;
> + }
> +
> + pgdir_end = kvm_pgd_addr_end(addr, end);
> + ret = stage2_wp_pgd_range(kvm, addr, pgdir_end, pgd);
> + /* Failed to WP a pgd range abort */
> + if (ret < 0)
> + goto out;
> + addr = pgdir_end;
> + }
> + ret = 0;
> + /* Flush TLBs, >= ARMv7 variant uses hardware broadcast not IPIs */
> + kvm_flush_remote_tlbs(kvm);
> +out:
> + spin_unlock(&kvm->mmu_lock);
> + return ret;
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot,
> unsigned long fault_status)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fa70c6e..e49f976 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -184,7 +184,11 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
> return called;
> }
>
> -void kvm_flush_remote_tlbs(struct kvm *kvm)
> +/*
> + * Architectures like >= ARMv7 hardware broadcast TLB invalidations and don't
> + * use IPIs, to flush TLBs.
> + */
I don't think this comment is appropriate in generic code. If you want
to say anything here, you should say that arch code can override this
function.
> +void __weak kvm_flush_remote_tlbs(struct kvm *kvm)
> {
> long dirty_count = kvm->tlbs_dirty;
>
> --
> 1.7.9.5
>
* [PATCH v5 2/4] live migration support for initial write protect of VM
2014-05-15 18:53 ` Christoffer Dall
@ 2014-05-15 22:51 ` Mario Smarduch
2014-05-16 21:39 ` Mario Smarduch
2014-05-30 16:48 ` Mario Smarduch
1 sibling, 1 reply; 13+ messages in thread
From: Mario Smarduch @ 2014-05-15 22:51 UTC (permalink / raw)
To: linux-arm-kernel
On 05/15/2014 11:53 AM, Christoffer Dall wrote:
>
> [I know you sent out a newer version but I already reviewed some of this
> patch on the plane today but couldn't send it out before I got home.
> Anyway, here it is:]
>
> On Wed, May 07, 2014 at 05:40:14PM -0700, Mario Smarduch wrote:
>> Patch adds support for live migration initial split up of huge pages
>> in memory slot and write protection of all pages in memory slot.
>>
>> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
>> ---
>> arch/arm/include/asm/kvm_host.h | 7 ++
>> arch/arm/include/asm/kvm_mmu.h | 16 +++-
>> arch/arm/kvm/arm.c | 3 +
>> arch/arm/kvm/mmu.c | 179 +++++++++++++++++++++++++++++++++++++++
>> virt/kvm/kvm_main.c | 6 +-
>> 5 files changed, 209 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index ac3bb65..91744c3 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -67,6 +67,11 @@ struct kvm_arch {
>>
>> /* Interrupt controller */
>> struct vgic_dist vgic;
>> + /* Marks start of migration, used to handle 2nd stage page faults
>> + * during migration, prevent installing huge pages and split huge pages
>> + * to small pages.
>> + */
>
> commenting style
>
> this is a bit verbose for a field in a struct, perhaps moving the longer
> version to where you set this?
Will do.
>
>> + int migration_in_progress;
>> };
>>
>> #define KVM_NR_MEM_OBJS 40
>> @@ -233,4 +238,6 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>>
>> void kvm_tlb_flush_vmid(struct kvm *kvm);
>>
>> +int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
>> +
>> #endif /* __ARM_KVM_HOST_H__ */
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 5c7aa3c..b339fa9 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -114,13 +114,27 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
>> pmd_val(*pmd) |= L_PMD_S2_RDWR;
>> }
>>
>> +static inline void kvm_set_s2pte_readonly(pte_t *pte)
>> +{
>> + pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
>
> This relies on the pte already having been set as RDONLY or RDWR, if you
> are creating a new pte and calling this function it could be easy to
> miss that distinction, I would prefer:
>
> pte_val(*pte) &= L_PTE_S2_RDWR;
> pte_val(*pte) |= L_PTE_S2_RDONLY;
Currently it's called only on set, i.e. live ptes; I'll change
it so it's applicable to all cases.
>
>> +}
>> +
>> +static inline bool kvm_s2pte_readonly(pte_t *pte)
>> +{
>> + return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
>> +}
>> +
>> /* Open coded p*d_addr_end that can deal with 64bit addresses */
>> #define kvm_pgd_addr_end(addr, end) \
>> ({ u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK; \
>> (__boundary - 1 < (end) - 1)? __boundary: (end); \
>> })
>>
>> -#define kvm_pud_addr_end(addr,end) (end)
>> +/* For - 4-level table walk return PUD range end if end > 1GB */
>
> not sure you need this comment, the scheme is very common all over the
> kernel.
Yes.
>
>> +#define kvm_pud_addr_end(addr, end) \
>> +({ u64 __boundary = ((addr) + PUD_SIZE) & PUD_MASK; \
>> + (__boundary - 1 < (end) - 1) ? __boundary : (end); \
>> +})
>
> why do we need this? We should only ever have 3 levels of page tables,
> right?
I removed in v6 patch.
>
>>
>> #define kvm_pmd_addr_end(addr, end) \
>> ({ u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK; \
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 3c82b37..1055266 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -234,6 +234,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>> struct kvm_userspace_memory_region *mem,
>> enum kvm_mr_change change)
>> {
>> + /* Request for migration issued by user, write protect memory slot */
>
> Does this necessarily only happen when there's a request for migration?
> Isn't it just a log call that could be used for other things
> (potentially)?
From the QEMU view, the migration thread calls the KVM memory listener
kvm_log_global_start, and that kicks off dirty log tracking for each memslot.
There are other operations, like region add (kvm_region_add), that start
kvm_log_start for that memslot; in another odd case, if you add a region that
overlaps existing regions you may start logging the whole region.
But in either case it appears you're migrating already.
No, I don't see any other feature that triggers this.
>
>> + if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
>> + return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
>> return 0;
>> }
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 95c172a..85145d8 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -743,6 +743,185 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
>> return false;
>> }
>>
>> +/* kvm_split_pmd - splits huge pages to small pages, required to keep a dirty
>> + * log of smaller memory granules, otherwise huge pages would need to be
>> + * migrated. Practically an idle system has problems migrating with
>
> This seems abrupt. Why can't we just represent a 2M huge page as 512 4K
> bits and write protect the huge pages, if you take a write fault on a 2M
> page, then split it then.
That's one alternative; the one I put into v6 is to clear the PMD
and force user_mem_abort() to fault in 4k pages, mark the
dirty_bitmap[] for that page, and reuse the current code. I have not
checked the impact on performance; it takes a few seconds longer
to converge for the tests I'm running.
>
> If your use case is HA, then you will be doing this a lot, and you don't
> want to hurt performance of your main live system more than necessary.
>
>> + * huge pages. Called during WP of entire VM address space, done
>
> s/WP/write protect/
Ok.
>
>> + * initially when migration thread isses the KVM_MEM_LOG_DIRTY_PAGES
>
> "migration thread"? I don't think this is a KVM term. Please keep
> documentation to concepts that can be understood from the KVM
> perspective without knowing anything specific about user space
> implementation.
Ok.
>
>> + * ioctl.
>
> This is not an ioctl but a flag in an ioctl. There's another ioctl to
> get the dirty log.
Ok.
>
>> + * The mmu_lock is held during splitting.
>> + *
>> + * @kvm: The KVM pointer
>> + * @pmd: Pmd to 2nd stage huge page
>> + * @addr: Guest Physical Address
>> + */
>
> Thanks for commenting the function, however, a few nits:
>
> again, check the commenting style, checkpatch should complain, kdocs
> usually look like this:
checkpatch didn't complain, and kernel-doc seemed to describe this style; I will
check again.
>
> /*
> * kvm_split_pmd - <one line description of function>
> *
> * Some more description of the function which can be on multiple lines.
> */
>
> can you also comment on the return value?
Ok.
>
>> +static int kvm_split_pmd(struct kvm *kvm, pmd_t *pmd, u64 addr)
>> +{
>> + struct page *page;
>> + pfn_t pfn = pmd_pfn(*pmd);
>> + pte_t *pte;
>> + int i;
>> +
>> + page = alloc_page(GFP_KERNEL);
>> + if (page == NULL)
>> + return -ENOMEM;
>> +
>> + pte = page_address(page);
>> + /* cycle through ptes first, use pmd pfn */
>> + for (i = 0; i < PTRS_PER_PMD; i++)
>> + pte[i] = pfn_pte(pfn+i, PAGE_S2);
>> +
>> + kvm_clean_pte(pte);
>> + /* After page table setup set pmd */
>> + pmd_populate_kernel(NULL, pmd, pte);
>> +
>> + /* get reference on pte page */
>> + get_page(virt_to_page(pte));
>> + return 0;
>> +}
>> +
>> +/* Walks PMD page table range and write protects it. Called with
>> + * 'kvm->mmu_lock' * held
>> + */
>> +static void stage2_wp_pmd_range(phys_addr_t addr, phys_addr_t end, pmd_t *pmd)
>> +{
>> + pte_t *pte;
>> +
>> + while (addr < end) {
>> + pte = pte_offset_kernel(pmd, addr);
>> + addr += PAGE_SIZE;
>> + if (!pte_present(*pte))
>> + continue;
>> + /* skip write protected pages */
>> + if (kvm_s2pte_readonly(pte))
>> + continue;
>> + kvm_set_s2pte_readonly(pte);
>> + }
>> +}
>> +
>> +/* Walks PUD page table range to write protects it , if necessary spluts up
>> + * huge pages to small pages. Called with 'kvm->mmu_lock' held.
>> + */
>> +static int stage2_wp_pud_range(struct kvm *kvm, phys_addr_t addr,
>> + phys_addr_t end, pud_t *pud)
>> +{
>> + pmd_t *pmd;
>> + phys_addr_t pmd_end;
>> + int ret;
>> +
>> + while (addr < end) {
>> + /* If needed give up CPU during PUD page table walk */
>> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> + cond_resched_lock(&kvm->mmu_lock);
>> +
>> + pmd = pmd_offset(pud, addr);
>> + if (!pmd_present(*pmd)) {
>> + addr = kvm_pmd_addr_end(addr, end);
>> + continue;
>> + }
>> +
>> + if (kvm_pmd_huge(*pmd)) {
>> + ret = kvm_split_pmd(kvm, pmd, addr);
>> + /* Failed to split up huge page abort. */
>> + if (ret < 0)
>> + return ret;
>> +
>> + addr = kvm_pmd_addr_end(addr, end);
>> + continue;
>> + }
>> +
>> + pmd_end = kvm_pmd_addr_end(addr, end);
>> + stage2_wp_pmd_range(addr, pmd_end, pmd);
>> + addr = pmd_end;
>> + }
>> + return 0;
>> +}
>> +
>> +/* Walks PGD page table range to write protect it. Called with 'kvm->mmu_lock'
>> + * held.
>> + */
>> +static int stage2_wp_pgd_range(struct kvm *kvm, phys_addr_t addr,
>> + phys_addr_t end, pgd_t *pgd)
>> +{
>> + phys_addr_t pud_end;
>> + pud_t *pud;
>> + int ret;
>> +
>> + while (addr < end) {
>> + /* give up CPU if mmu_lock is needed by other vCPUs */
>> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> + cond_resched_lock(&kvm->mmu_lock);
>> +
>> + pud = pud_offset(pgd, addr);
>> + if (!pud_present(*pud)) {
>> + addr = kvm_pud_addr_end(addr, end);
>> + continue;
>> + }
>> +
>> + /* Fail if PUD is huge, splitting PUDs not supported */
>> + if (pud_huge(*pud))
>> + return -EFAULT;
>> +
>> + /* By default 'nopud' is supported which fails with guests
>> + * larger then 1GB. Added to support 4-level page tables.
>> + */
>> + pud_end = kvm_pud_addr_end(addr, end);
>> + ret = stage2_wp_pud_range(kvm, addr, pud_end, pud);
>> + if (ret < 0)
>> + return ret;
>> + addr = pud_end;
>> + }
>> + return 0;
>> +}
>> +
>> +/**
>> + * kvm_mmu_slot_remove_access - write protects entire memslot address space.
>> + * Called at start of migration when KVM_MEM_LOG_DIRTY_PAGES ioctl is
>> + * issued. After this function returns all pages (minus the ones faulted
>> + * in or released when mmu_lock is give nup) must be write protected to
>> + * keep track of dirty pages to migrate on subsequent dirty log read.
>> + * mmu_lock is held during write protecting, released on contention.
>> + *
>> + * @kvm: The KVM pointer
>> + * @slot: The memory slot the dirty log is retrieved for
>> + */
>
> your indentation is weird here and you also need a new line after the
> first description. Also missing return value.
Ok
>
>> +int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>> +{
>> + pgd_t *pgd;
>> + pgd_t *pgdp = kvm->arch.pgd;
>> + struct kvm_memory_slot *memslot = id_to_memslot(kvm->memslots, slot);
>> + phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
>> + phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
>> + phys_addr_t pgdir_end;
>> + int ret = -ENOMEM;
>> +
>> + spin_lock(&kvm->mmu_lock);
>> + /* set start of migration, sychronize with Data Abort handler */
>
> s/sychronize/synchronize/
Ok
>
> What is Data Abort handler? user_mem_abort()? Can you use a concrete
> function name?
Yes, user_mem_abort(); I will change it.
>
>> + kvm->arch.migration_in_progress = 1;
>> +
>> + /* Walk range, split up huge pages as needed and write protect ptes */
>> + while (addr < end) {
>
> I think this should be rewritten to use the scheme used in
> stage2_flush_memslot, I will submit a patch one of these days that
> changes unmap_range() as well, you can wait for that and take a look.
Yes, I saw your unmap_range() and Marc's approach; I will change it.
>
>> + pgd = pgdp + pgd_index(addr);
>> + if (!pgd_present(*pgd)) {
>> + addr = kvm_pgd_addr_end(addr, end);
>> + continue;
>> + }
>> +
>> + pgdir_end = kvm_pgd_addr_end(addr, end);
>> + ret = stage2_wp_pgd_range(kvm, addr, pgdir_end, pgd);
>> + /* Failed to WP a pgd range abort */
>> + if (ret < 0)
>> + goto out;
>> + addr = pgdir_end;
>> + }
>> + ret = 0;
>> + /* Flush TLBs, >= ARMv7 variant uses hardware broadcast not IPIs */
>> + kvm_flush_remote_tlbs(kvm);
>> +out:
>> + spin_unlock(&kvm->mmu_lock);
>> + return ret;
>> +}
>> +
>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> struct kvm_memory_slot *memslot,
>> unsigned long fault_status)
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index fa70c6e..e49f976 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -184,7 +184,11 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>> return called;
>> }
>>
>> -void kvm_flush_remote_tlbs(struct kvm *kvm)
>> +/*
>> + * Architectures like >= ARMv7 hardware broadcast TLB invalidations and don't
>> + * use IPIs, to flush TLBs.
>> + */
>
> I don't think this comment is appropriate in generic code. If you want
> to say anything here, you should say that arch code can override this
> function.
Ok.
>
>> +void __weak kvm_flush_remote_tlbs(struct kvm *kvm)
>> {
>> long dirty_count = kvm->tlbs_dirty;
>>
>> --
>> 1.7.9.5
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 2/4] live migration support for initial write protect of VM
2014-05-15 22:51 ` Mario Smarduch
@ 2014-05-16 21:39 ` Mario Smarduch
2014-05-19 17:56 ` Christoffer Dall
0 siblings, 1 reply; 13+ messages in thread
From: Mario Smarduch @ 2014-05-16 21:39 UTC (permalink / raw)
To: linux-arm-kernel
Hi Christoffer,
a few more comments:
>>> struct vgic_dist vgic;
>>> + /* Marks start of migration, used to handle 2nd stage page faults
>>> + * during migration, prevent installing huge pages and split huge pages
>>> + * to small pages.
>>> + */
>>
>> commenting style
>>
>> this is a bit verbose for a field in a struct, perhaps moving the longer
>> version to where you set this?
> Will do.
>>
>>> + int migration_in_progress;
>>> };
I think this flag could be removed altogether. Migration can be
stopped (or started) at any time, through a user request or other events.
When that happens (e.g. migrate_cancel), the migration cleanup bottom half runs and eventually calls
the KVM memory listener kvm_log_global_start() (the cancel handler),
which stops logging and clears KVM_MEM_LOG_DIRTY_PAGES, and the region ops ioctl
clears the dirty_bitmap. In either case the memslot's dirty_bitmap is set or
unset during migration to track dirty pages, so following that field seems to be
a better way to keep track of migration. Again, this is the QEMU view, but it appears
all these policies are driven from user space.
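As a sketch of what I mean (the helper name is made up, it is not in this series),
the fault path could key off the memslot itself rather than a global flag:

        static bool memslot_is_logging(struct kvm_memory_slot *memslot)
        {
                /* logging is on iff user space asked for it and the
                 * bitmap has been allocated for this slot
                 */
                return memslot->dirty_bitmap &&
                       (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
        }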
>>>
>>> +/* kvm_split_pmd - splits huge pages to small pages, required to keep a dirty
>>> + * log of smaller memory granules, otherwise huge pages would need to be
>>> + * migrated. Practically an idle system has problems migrating with
>>
>> This seems abrupt. Why can't we just represent a 2M huge page as 512 4K
>> bits and write protect the huge pages, if you take a write fault on a 2M
>> page, then split it then.
>
> That's one alternative; the one I put into v6 clears the PMD
> and forces user_mem_abort() to fault in 4k pages, marking the
> dirty_bitmap[] for each page and reusing the current code. I have not
> checked the impact on performance; it takes a few seconds longer
> to converge for the tests I'm running.
I was thinking about this, and if PMD attributes need to be passed
on to the PTEs then it appears what you recommend is required.
But at run time I don't see how 2nd stage attributes can
change; could the guest do anything to change them (SH, MemAttr)?
Performance may be another reason, but that always depends
on the load; clearing a PMD seems easier and reuses the current code.
Several load tests/benchmarks could probably help here.
I also noticed the hardware PMD/PTE attribute fields differ a little, which
is not significant now, but moving forward different page sizes
and any new revisions to the fields may require additional maintenance.
I'll be out next week and back on the 26th. I'll create a link with
details on the test environment and tests; the cover letter will
go through a general overview only.
Thanks,
Mario
>
>>
>> If your use case is HA, then you will be doing this a lot, and you don't
>> want to hurt performance of your main live system more than necessary.
>
>>
>>> + * huge pages. Called during WP of entire VM address space, done
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 2/4] live migration support for initial write protect of VM
2014-05-16 21:39 ` Mario Smarduch
@ 2014-05-19 17:56 ` Christoffer Dall
0 siblings, 0 replies; 13+ messages in thread
From: Christoffer Dall @ 2014-05-19 17:56 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, May 16, 2014 at 02:39:16PM -0700, Mario Smarduch wrote:
> Hi Christoffer,
> a few more comments:
> >>> struct vgic_dist vgic;
> >>> + /* Marks start of migration, used to handle 2nd stage page faults
> >>> + * during migration, prevent installing huge pages and split huge pages
> >>> + * to small pages.
> >>> + */
> >>
> >> commenting style
> >>
> >> this is a bit verbose for a field in a struct, perhaps moving the longer
> >> version to where you set this?
> > Will do.
> >>
> >>> + int migration_in_progress;
> >>> };
>
> I think this flag could be removed altogether. Migration can be
> stopped (or started) at any time, through a user request or other events.
> When that happens (e.g. migrate_cancel), the migration cleanup bottom half runs and eventually calls
> the KVM memory listener kvm_log_global_start() (the cancel handler),
> which stops logging and clears KVM_MEM_LOG_DIRTY_PAGES, and the region ops ioctl
> clears the dirty_bitmap. In either case the memslot's dirty_bitmap is set or
> unset during migration to track dirty pages, so following that field seems to be
> a better way to keep track of migration. Again, this is the QEMU view, but it appears
> all these policies are driven from user space.
>
ok, I need to look more closely at the whole thing to properly comment
on this.
>
>
> >>>
> >>> +/* kvm_split_pmd - splits huge pages to small pages, required to keep a dirty
> >>> + * log of smaller memory granules, otherwise huge pages would need to be
> >>> + * migrated. Practically an idle system has problems migrating with
> >>
> >> This seems abrupt. Why can't we just represent a 2M huge page as 512 4K
> >> bits and write protect the huge pages, if you take a write fault on a 2M
> >> page, then split it then.
> >
> > That's one alternative; the one I put into v6 clears the PMD
> > and forces user_mem_abort() to fault in 4k pages, marking the
> > dirty_bitmap[] for each page and reusing the current code. I have not
> > checked the impact on performance; it takes a few seconds longer
> > to converge for the tests I'm running.
>
> I was thinking about this, and if PMD attributes need to be passed
> on to the PTEs then it appears what you recommend is required.
> But at run time I don't see how 2nd stage attributes can
> change; could the guest do anything to change them (SH, MemAttr)?
You should be able to just grab the kvm_mmu lock, update the stage-2
page tables to remove all writable bits, flush all Stage-2 TLBs for that
VMID, and you should be all set.
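Roughly (a sketch only; stage2_wp_range here is a placeholder name for whatever
walker ends up clearing the stage-2 write bits, not an existing function):

        spin_lock(&kvm->mmu_lock);
        stage2_wp_range(kvm, start_ipa, end_ipa);   /* clear the S2 write permission
                                                     * over the memslot's IPA range */
        kvm_flush_remote_tlbs(kvm);                 /* one broadcast flush for the
                                                     * whole VMID afterwards */
        spin_unlock(&kvm->mmu_lock);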
>
>
> Performance may be another reason, but that always depends
> on the load; clearing a PMD seems easier and reuses the current code.
> Several load tests/benchmarks could probably help here.
> I also noticed the hardware PMD/PTE attribute fields differ a little, which
> is not significant now, but moving forward different page sizes
> and any new revisions to the fields may require additional maintenance.
I think clearing out all PMD mappings will carry a significant
performance degradation on the source VM, and in the case where you keep it
running, that will be quite unfortunate. Hint: page faults are expensive, and
huge pages have been shown to give about a 10-15% performance increase on ARMv7
for CPU/memory intensive benchmarks.
>
> I'll be out next week and back on the 26th. I'll create a link with
> details on the test environment and tests; the cover letter will
> go through a general overview only.
>
ok, I have some time then.
-Christoffer
>
> >
> >>
> >> If your use case is HA, then you will be doing this a lot, and you don't
> >> want to hurt performance of your main live system more than necessary.
> >
> >>
> >>> + * huge pages. Called during WP of entire VM address space, done
> >>
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v5 2/4] live migration support for initial write protect of VM
2014-05-15 18:53 ` Christoffer Dall
2014-05-15 22:51 ` Mario Smarduch
@ 2014-05-30 16:48 ` Mario Smarduch
1 sibling, 0 replies; 13+ messages in thread
From: Mario Smarduch @ 2014-05-30 16:48 UTC (permalink / raw)
To: linux-arm-kernel
>>
>> +static inline void kvm_set_s2pte_readonly(pte_t *pte)
>> +{
>> + pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
>
> This relies on the pte already having been set as RDONLY or RDWR, if you
> are creating a new pte and calling this function it could be easy to
> miss that distinction, I would prefer:
>
> pte_val(*pte) &= L_PTE_S2_RDWR;
> pte_val(*pte) |= L_PTE_S2_RDONLY;
>
I'm confused by this comment; this appears to just add the read-only
permission, but won't it leave only the permission bits intact and
clear out the rest of the pte?
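My guess at what was intended (just a sketch of my reading, not the actual
suggestion) would only downgrade the HAP permission and leave the other pte
bits alone:

        static inline void kvm_set_s2pte_readonly(pte_t *pte)
        {
                /* assumed intent: clear the write bit of HAP[2:1], then set
                 * read-only, keeping all other pte bits untouched
                 */
                pte_val(*pte) = (pte_val(*pte) & ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY;
        }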
- Mario
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2014-05-30 16:48 UTC | newest]
Thread overview: 13+ messages
2014-05-08 0:40 [PATCH v5 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
2014-05-14 16:47 ` Christoffer Dall
2014-05-15 2:00 ` Mario Smarduch
2014-05-15 18:50 ` Christoffer Dall
2014-05-08 0:40 ` [PATCH v5 2/4] live migration support for initial write protect of VM Mario Smarduch
2014-05-15 18:53 ` Christoffer Dall
2014-05-15 22:51 ` Mario Smarduch
2014-05-16 21:39 ` Mario Smarduch
2014-05-19 17:56 ` Christoffer Dall
2014-05-30 16:48 ` Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 3/4] live migration support for VM dirty log management Mario Smarduch
2014-05-08 0:40 ` [PATCH v5 4/4] add 2nd stage page fault handling during live migration Mario Smarduch