linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v6 0/4] live migration dirty bitmap support for ARMv7
@ 2014-05-15 18:27 Mario Smarduch
  2014-05-15 18:27 ` [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-15 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

This is v6 patcheset of live mgiration support for ARMv7.

- Tested on two 4-way A15 hardware platforms, QEMU 2-way/4-way SMP guest up to 2GB
- Various dirty data rates tested - 2GB/1s ... 2048 pgs/5ms
- validated source/destination memory image integrity

Changes since v1:
- add unlock of VM mmu_lock to prevent a deadlock
- moved migration active flag inside mmu_lock acquire for visibility in 2nd stage
  data abort handler
- Added comments

Changes since v2: 
- move initial VM write protect to memory region architecture prepare function
  (needed to make dirty logging function generic) 
- added stage2_mark_pte_ro() - to mark ptes ro - Marc's comment
- optimized initial VM memory region write protect to do fewer table lookups -
  applied Marc's comment for walking dirty bitmap mask
- added pud_addr_end() for stage2 tables, to make the walk 4-level
- added kvm_flush_remote_tlbs() to use ARM TLB invalidation, made the generic
  one weak - Marc's comment, needed for the generic dirty bitmap log function
- optimized walking dirty bit map mask to skip upper tables - Marc's comment
- deleted x86, arm kvm_vm_ioctl_get_dirty_log(), moved it to kvm_main.c and
  tagged the function weak - Marc's comment
- changed Data Abort handler pte index handling - Marc's comment

Changes since v3:
- changed pte updates to reset write bit instead of setting default 
  value for existing pte's - Steve's comment
- In addition to PUD add 2nd stage >4GB range functions - Steve's
  suggestion
- Restructured initial memory slot write protect function for PGD, PUD, PMD
  table walking - Steve's suggestion
- Renamed variable types to resemble their use - Steve's suggestions
- Added a couple of pte helpers for 2nd stage tables - Steve's suggestion
- Updated unmap_range() that handles 2nd stage tables and identity mappings
  to handle 2nd stage addresses >4GB. Left ARMv8 unchanged.

Changes since v4:
- rebased to 3.15.0-rc1 - 'next' to pick up p*addr_end patches - Gavin's comment
- Update PUD address end function to support 4-level page table walk
- Eliminated 5th patch of the series that fixed unmap_range(), since it was
  fixed by Marc's patches.

Changes since v5:
- Created separate entry point for VMID TLB flush with no param - Christoffer's
  comment
- Updated documentation for kvm_flush_remote_tlbs() - Christoffer's comment
- Simplified splitting of huge pages - initial WP and 2nd stage DABT handler
  clear the huge page PMD, and use current code to fault in small pages.
  Removed kvm_split_pmd().

Mario Smarduch (4):
  add ARMv7 HYP API to flush VM TLBs without address param
  live migration support for initial write protect of VM
  live migration support for VM dirty log management
  add 2nd stage page fault handling during live migration

 arch/arm/include/asm/kvm_asm.h  |    1 +
 arch/arm/include/asm/kvm_host.h |   11 ++
 arch/arm/include/asm/kvm_mmu.h  |   10 ++
 arch/arm/kvm/arm.c              |    8 +-
 arch/arm/kvm/interrupts.S       |   11 ++
 arch/arm/kvm/mmu.c              |  292 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   86 ------------
 virt/kvm/kvm_main.c             |   84 ++++++++++-
 8 files changed, 409 insertions(+), 94 deletions(-)

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param
  2014-05-15 18:27 [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
@ 2014-05-15 18:27 ` Mario Smarduch
  2014-05-27 19:51   ` Christoffer Dall
  2014-05-15 18:27 ` [PATCH v6 2/4] live migration support for initial write protect of VM Mario Smarduch
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-15 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

Patch adds HYP interface for global VM TLB invalidation without address
parameter. Added ARM version of kvm_flush_remote_tlbs(), made generic one weak.

Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
 arch/arm/include/asm/kvm_asm.h |    1 +
 arch/arm/kvm/interrupts.S      |   11 +++++++++++
 arch/arm/kvm/mmu.c             |   15 +++++++++++++++
 virt/kvm/kvm_main.c            |    2 +-
 4 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
index 53b3c4a..21bc519 100644
--- a/arch/arm/include/asm/kvm_asm.h
+++ b/arch/arm/include/asm/kvm_asm.h
@@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[];
 
 extern void __kvm_flush_vm_context(void);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
+extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
index 0d68d40..bddc66b 100644
--- a/arch/arm/kvm/interrupts.S
+++ b/arch/arm/kvm/interrupts.S
@@ -66,6 +66,17 @@ ENTRY(__kvm_tlb_flush_vmid_ipa)
 	bx	lr
 ENDPROC(__kvm_tlb_flush_vmid_ipa)
 
+/**
+ * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs
+ *
+ * Reuses __kvm_tlb_flush_vmid_ipa() for ARMv7, without passing address
+ * parameter
+ */
+
+ENTRY(__kvm_tlb_flush_vmid)
+	b	__kvm_tlb_flush_vmid_ipa
+ENDPROC(__kvm_tlb_flush_vmid)
+
 /********************************************************************
  * Flush TLBs and instruction caches of all CPUs inside the inner-shareable
  * domain, for all VMIDs
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 80bb1e6..eea3f0a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -56,6 +56,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
 }
 
+/**
+ * kvm_flush_remote_tlbs() - flush all VM TLB entries
+ *
+ * Interface to HYP function to flush all VM TLB entries without address
+ * parameter. In HYP mode reuses __kvm_tlb_flush_vmid_ipa() function used by
+ * kvm_tlb_flush_vmid_ipa().
+ *
+ * @kvm:       pointer to kvm structure.
+ */
+void kvm_flush_remote_tlbs(struct kvm *kvm)
+{
+	if (kvm)
+		kvm_call_hyp(__kvm_tlb_flush_vmid, kvm);
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 				  int min, int max)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa70c6e..ba25765 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -184,7 +184,7 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	return called;
 }
 
-void kvm_flush_remote_tlbs(struct kvm *kvm)
+void __weak kvm_flush_remote_tlbs(struct kvm *kvm)
 {
 	long dirty_count = kvm->tlbs_dirty;
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 2/4] live migration support for initial write protect of VM
  2014-05-15 18:27 [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
  2014-05-15 18:27 ` [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
@ 2014-05-15 18:27 ` Mario Smarduch
  2014-05-27 19:58   ` Christoffer Dall
  2014-05-15 18:27 ` [PATCH v6 3/4] live migration support for VM dirty log management Mario Smarduch
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-15 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

Patch adds memslot support for initial write protection and splitting up of
huge pages.

Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
 arch/arm/include/asm/kvm_host.h |    8 +++
 arch/arm/include/asm/kvm_mmu.h  |   10 +++
 arch/arm/kvm/arm.c              |    3 +
 arch/arm/kvm/mmu.c              |  143 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 164 insertions(+)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 193ceaf..0e55b17 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -67,6 +67,12 @@ struct kvm_arch {
 
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
+	/*
+	 * Marks start of migration, used to handle 2nd stage page faults
+	 * during migration, to prevent installing huge pages and to split huge
+	 * pages into small pages.
+	 */
+	int migration_in_progress;
 };
 
 #define KVM_NR_MEM_OBJS     40
@@ -231,4 +237,6 @@ int kvm_perf_teardown(void);
 u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
 int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
 
+int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 5c7aa3c..7f9d9d3 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -114,6 +114,16 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
 	pmd_val(*pmd) |= L_PMD_S2_RDWR;
 }
 
+static inline void kvm_set_s2pte_readonly(pte_t *pte)
+{
+	pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
+}
+
+static inline bool kvm_s2pte_readonly(pte_t *pte)
+{
+	return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
+}
+
 /* Open coded p*d_addr_end that can deal with 64bit addresses */
 #define kvm_pgd_addr_end(addr, end)					\
 ({	u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;		\
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 3c82b37..1055266 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -234,6 +234,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				   struct kvm_userspace_memory_region *mem,
 				   enum kvm_mr_change change)
 {
+	/* Request for migration issued by user, write protect memory slot */
+	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
+		return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
 	return 0;
 }
 
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index eea3f0a..b71ad27 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -748,6 +748,149 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap)
 	return false;
 }
 
+
+/*
+ * Walks PMD page table range and write protects it. Called with
+ * 'kvm->mmu_lock' held.
+ */
+static void stage2_wp_pmd_range(phys_addr_t addr, phys_addr_t end, pmd_t *pmd)
+{
+	pte_t *pte;
+
+	while (addr < end) {
+		pte = pte_offset_kernel(pmd, addr);
+		addr += PAGE_SIZE;
+		if (!pte_present(*pte))
+			continue;
+		/* skip write protected pages */
+		if (kvm_s2pte_readonly(pte))
+			continue;
+		kvm_set_s2pte_readonly(pte);
+	}
+}
+
+/*
+ * Walks PUD page table range and write protects it; if necessary splits up
+ * huge pages into small pages. Called with 'kvm->mmu_lock' held.
+ */
+static void stage2_wp_pud_range(struct kvm *kvm, phys_addr_t addr,
+				phys_addr_t end, pud_t *pud)
+{
+	pmd_t *pmd;
+	phys_addr_t pmd_end;
+
+	while (addr < end) {
+		/* If needed give up CPU during PUD page table walk */
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+			cond_resched_lock(&kvm->mmu_lock);
+
+		pmd = pmd_offset(pud, addr);
+		if (!pmd_present(*pmd)) {
+			addr = kvm_pmd_addr_end(addr, end);
+			continue;
+		}
+
+		if (kvm_pmd_huge(*pmd)) {
+			/*
+			 * Clear the pmd entry; the DABT handler will install
+			 * smaller pages.
+			 */
+			clear_pmd_entry(kvm, pmd, addr);
+			addr = kvm_pmd_addr_end(addr, end);
+			continue;
+		}
+
+		pmd_end = kvm_pmd_addr_end(addr, end);
+		stage2_wp_pmd_range(addr, pmd_end, pmd);
+		addr = pmd_end;
+	}
+}
+
+/*
+ * Walks PGD page table range to write protect it. Called with 'kvm->mmu_lock'
+ * held.
+ */
+static int stage2_wp_pgd_range(struct kvm *kvm, phys_addr_t addr,
+				phys_addr_t end, pgd_t *pgd)
+{
+	phys_addr_t pud_end;
+	pud_t *pud;
+
+	while (addr < end) {
+		/* give up CPU if mmu_lock is needed by other vCPUs */
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+			cond_resched_lock(&kvm->mmu_lock);
+
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud)) {
+			addr = kvm_pud_addr_end(addr, end);
+			continue;
+		}
+
+		/* Fail if PUD is huge, splitting PUDs not supported */
+		if (pud_huge(*pud))
+			return -EFAULT;
+
+		/*
+		 * By default 'nopud' folds third level page tables.
+		 * Implement for future support of 4-level tables
+		 */
+		pud_end = kvm_pud_addr_end(addr, end);
+		stage2_wp_pud_range(kvm, addr, pud_end, pud);
+		addr = pud_end;
+	}
+	return 0;
+}
+
+/**
+ * kvm_mmu_slot_remove_write_access() - write protects entire memslot address space.
+ *
+ *      Called at start of live migration when KVM_MEM_LOG_DIRTY_PAGES ioctl is
+ *      issued. After this function returns, all pages (minus the ones faulted
+ *      in or released when mmu_lock is given up) must be write protected to
+ *      keep track of dirty pages to migrate on subsequent dirty log read.
+ *      mmu_lock is held during write protecting, released on contention.
+ *
+ * @kvm:        The KVM pointer
+ * @slot:       The memory slot the dirty log is retrieved for
+ */
+int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+{
+	pgd_t *pgd;
+	pgd_t *pgdp = kvm->arch.pgd;
+	struct kvm_memory_slot *memslot = id_to_memslot(kvm->memslots, slot);
+	phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
+	phys_addr_t end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
+	phys_addr_t pgdir_end;
+	int ret = -ENOMEM;
+
+	spin_lock(&kvm->mmu_lock);
+	/* set start of migration, synchronize with Data Abort handler */
+	kvm->arch.migration_in_progress = 1;
+
+	/* Walk range, split up huge pages as needed and write protect ptes */
+	while (addr < end) {
+		pgd = pgdp + pgd_index(addr);
+		if (!pgd_present(*pgd)) {
+			addr = kvm_pgd_addr_end(addr, end);
+			continue;
+		}
+
+		pgdir_end = kvm_pgd_addr_end(addr, end);
+		ret = stage2_wp_pgd_range(kvm, addr, pgdir_end, pgd);
+		/* Failed to WP a pgd range, abort */
+		if (ret < 0)
+			goto out;
+		addr = pgdir_end;
+	}
+	ret = 0;
+	/* Flush TLBs, >= ARMv7 variant uses hardware broadcast not IPIs */
+	kvm_flush_remote_tlbs(kvm);
+out:
+	spin_unlock(&kvm->mmu_lock);
+	return ret;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot,
 			  unsigned long fault_status)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 3/4] live migration support for VM dirty log management
  2014-05-15 18:27 [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
  2014-05-15 18:27 ` [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
  2014-05-15 18:27 ` [PATCH v6 2/4] live migration support for initial write protect of VM Mario Smarduch
@ 2014-05-15 18:27 ` Mario Smarduch
  2014-05-27 20:12   ` Christoffer Dall
  2014-05-15 18:27 ` [PATCH v6 4/4] add 2nd stage page fault handling during live migration Mario Smarduch
  2014-05-15 18:51 ` [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Christoffer Dall
  4 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-15 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

This patch adds support for keeping track of VM dirty pages by updating the
per-memslot dirty bitmap and write protecting the pages again.

Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
 arch/arm/include/asm/kvm_host.h |    3 ++
 arch/arm/kvm/arm.c              |    5 --
 arch/arm/kvm/mmu.c              |   98 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |   86 ----------------------------------
 virt/kvm/kvm_main.c             |   82 ++++++++++++++++++++++++++++++++
 5 files changed, 183 insertions(+), 91 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 0e55b17..4fef77d 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -238,5 +238,8 @@ u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
 int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
 
 int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+	struct kvm_memory_slot *slot,
+	gfn_t gfn_offset, unsigned long mask);
 
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 1055266..0b847b5 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -777,11 +777,6 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	}
 }
 
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
-	return -EINVAL;
-}
-
 static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
 					struct kvm_arm_device_addr *dev_addr)
 {
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index b71ad27..b939312 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -891,6 +891,104 @@ out:
 	return ret;
 }
 
+
+/**
+ * kvm_mmu_write_protect_pt_masked - walk mask relative to start of memslot and
+ *	write protect again for next dirty log read.
+ *
+ *  After the migration thread write protects the entire VM, iterative calls
+ *  are made to get the dirty page log. The log is returned and dirty pages
+ *  are write protected again. This function is called as a result of the
+ *  KVM_GET_DIRTY_LOG ioctl.
+ *  'kvm->mmu_lock' must be held to protect against concurrent modification
+ *  of page tables (2nd stage fault, mmu modifiers, ...)
+ *
+ * @kvm:        The KVM pointer
+ * @slot:       The memory slot the dirty log is retrieved for
+ * @gfn_offset: The gfn offset in memory slot
+ * @mask:       The mask of dirty pages at offset 'gfn_offset' in this memory
+ *              slot to be write protected
+ */
+
+void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+		struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long mask)
+{
+	phys_addr_t ipa, next, offset_ipa;
+	pgd_t *pgdp = kvm->arch.pgd, *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	gfn_t gfnofst = slot->base_gfn + gfn_offset;
+	bool crosses_pmd;
+
+	ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
+	offset_ipa  = gfnofst << PAGE_SHIFT;
+	next = (gfnofst + (BITS_PER_LONG - 1)) << PAGE_SHIFT;
+
+	/* check if mask width crosses 2nd level page table range, and
+	 * possibly 3rd, 4th. If not skip upper table lookups. Unlikely
+	 * to be true.
+	 */
+	crosses_pmd = ((offset_ipa & PMD_MASK) ^ (next & PMD_MASK)) ? true :
+									false;
+
+	/* If pgd, pud, pmd not present and you cross pmd range check next
+	 * index.
+	 */
+	pgd = pgdp + pgd_index(ipa);
+	if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
+		pgd = pgdp + pgd_index(next);
+		if (!pgd_present(*pgd))
+			return;
+	}
+
+	pud = pud_offset(pgd, ipa);
+	if (unlikely(crosses_pmd && !pud_present(*pud))) {
+		pud = pud_offset(pgd, next);
+		if (!pud_present(*pud))
+			return;
+	}
+
+	pmd = pmd_offset(pud, ipa);
+	if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
+		pmd = pmd_offset(pud, next);
+		if (!pmd_present(*pmd))
+			return;
+	}
+
+	for (;;) {
+		pte = pte_offset_kernel(pmd, ipa);
+		if (!pte_present(*pte))
+			goto next_ipa;
+
+		if (kvm_s2pte_readonly(pte))
+			goto next_ipa;
+		kvm_set_s2pte_readonly(pte);
+next_ipa:
+		mask &= mask - 1;
+		if (!mask)
+			break;
+
+		/* find next page */
+		ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
+
+		/* skip upper page table lookups */
+		if (!crosses_pmd)
+			continue;
+
+		pgd = pgdp + pgd_index(ipa);
+		if (unlikely(!pgd_present(*pgd)))
+			goto next_ipa;
+		pud = pud_offset(pgd, ipa);
+		if (unlikely(!pud_present(*pud)))
+			goto next_ipa;
+		pmd = pmd_offset(pud, ipa);
+		if (unlikely(!pmd_present(*pmd)))
+			goto next_ipa;
+	}
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot,
 			  unsigned long fault_status)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5582c3..a603ca3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3569,92 +3569,6 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 	return 0;
 }
 
-/**
- * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
- * @kvm: kvm instance
- * @log: slot id and address to which we copy the log
- *
- * We need to keep it in mind that VCPU threads can write to the bitmap
- * concurrently.  So, to avoid losing data, we keep the following order for
- * each bit:
- *
- *   1. Take a snapshot of the bit and clear it if needed.
- *   2. Write protect the corresponding page.
- *   3. Flush TLB's if needed.
- *   4. Copy the snapshot to the userspace.
- *
- * Between 2 and 3, the guest may write to the page using the remaining TLB
- * entry.  This is not a problem because the page will be reported dirty at
- * step 4 using the snapshot taken before and step 3 ensures that successive
- * writes will be logged for the next call.
- */
-int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
-{
-	int r;
-	struct kvm_memory_slot *memslot;
-	unsigned long n, i;
-	unsigned long *dirty_bitmap;
-	unsigned long *dirty_bitmap_buffer;
-	bool is_dirty = false;
-
-	mutex_lock(&kvm->slots_lock);
-
-	r = -EINVAL;
-	if (log->slot >= KVM_USER_MEM_SLOTS)
-		goto out;
-
-	memslot = id_to_memslot(kvm->memslots, log->slot);
-
-	dirty_bitmap = memslot->dirty_bitmap;
-	r = -ENOENT;
-	if (!dirty_bitmap)
-		goto out;
-
-	n = kvm_dirty_bitmap_bytes(memslot);
-
-	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
-	memset(dirty_bitmap_buffer, 0, n);
-
-	spin_lock(&kvm->mmu_lock);
-
-	for (i = 0; i < n / sizeof(long); i++) {
-		unsigned long mask;
-		gfn_t offset;
-
-		if (!dirty_bitmap[i])
-			continue;
-
-		is_dirty = true;
-
-		mask = xchg(&dirty_bitmap[i], 0);
-		dirty_bitmap_buffer[i] = mask;
-
-		offset = i * BITS_PER_LONG;
-		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
-	}
-
-	spin_unlock(&kvm->mmu_lock);
-
-	/* See the comments in kvm_mmu_slot_remove_write_access(). */
-	lockdep_assert_held(&kvm->slots_lock);
-
-	/*
-	 * All the TLBs can be flushed out of mmu lock, see the comments in
-	 * kvm_mmu_slot_remove_write_access().
-	 */
-	if (is_dirty)
-		kvm_flush_remote_tlbs(kvm);
-
-	r = -EFAULT;
-	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
-		goto out;
-
-	r = 0;
-out:
-	mutex_unlock(&kvm->slots_lock);
-	return r;
-}
-
 int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
 			bool line_status)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ba25765..d33ac9c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -429,6 +429,88 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
 }
 
+/**
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm: kvm instance
+ * @log: slot id and address to which we copy the log
+ *
+ * Shared by x86 and ARM.
+ *
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
+ *
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Flush TLB's if needed.
+ *   4. Copy the snapshot to the userspace.
+ *
+ * Between 2 and 3, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page will be reported dirty at
+ * step 4 using the snapshot taken before and step 3 ensures that successive
+ * writes will be logged for the next call.
+ */
+
+int __weak kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
+						struct kvm_dirty_log *log)
+{
+	int r;
+	struct kvm_memory_slot *memslot;
+	unsigned long n, i;
+	unsigned long *dirty_bitmap;
+	unsigned long *dirty_bitmap_buffer;
+	bool is_dirty = false;
+
+	mutex_lock(&kvm->slots_lock);
+
+	r = -EINVAL;
+	if (log->slot >= KVM_USER_MEM_SLOTS)
+		goto out;
+
+	memslot = id_to_memslot(kvm->memslots, log->slot);
+
+	dirty_bitmap = memslot->dirty_bitmap;
+	r = -ENOENT;
+	if (!dirty_bitmap)
+		goto out;
+
+	n = kvm_dirty_bitmap_bytes(memslot);
+
+	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+	memset(dirty_bitmap_buffer, 0, n);
+
+	spin_lock(&kvm->mmu_lock);
+
+	for (i = 0; i < n / sizeof(long); i++) {
+		unsigned long mask;
+		gfn_t offset;
+
+		if (!dirty_bitmap[i])
+			continue;
+
+		is_dirty = true;
+
+		mask = xchg(&dirty_bitmap[i], 0);
+		dirty_bitmap_buffer[i] = mask;
+
+		offset = i * BITS_PER_LONG;
+		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+	}
+	if (is_dirty)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+
+	r = -EFAULT;
+	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+		goto out;
+
+	r = 0;
+out:
+	mutex_unlock(&kvm->slots_lock);
+	return r;
+}
+
 #else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
 
 static int kvm_init_mmu_notifier(struct kvm *kvm)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-15 18:27 [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
                   ` (2 preceding siblings ...)
  2014-05-15 18:27 ` [PATCH v6 3/4] live migration support for VM dirty log management Mario Smarduch
@ 2014-05-15 18:27 ` Mario Smarduch
  2014-05-27 20:19   ` Christoffer Dall
  2014-05-15 18:51 ` [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Christoffer Dall
  4 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-15 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

This patch adds support for handling 2nd stage page faults during migration;
it disables faulting in huge pages and splits up existing huge pages.

Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
---
 arch/arm/kvm/mmu.c |   36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index b939312..10e7bf6 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
 	struct vm_area_struct *vma;
 	pfn_t pfn;
+	bool migration_active;
 
 	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
 	if (fault_status == FSC_PERM && !write_fault) {
@@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 
 	spin_lock(&kvm->mmu_lock);
+
+	/*
+	 * Place inside lock to prevent race condition when whole VM is being
+	 * write protected. Prevent race of huge page install when migration is
+	 * active.
+	 */
+	migration_active = vcpu->kvm->arch.migration_in_progress;
+
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
-	if (!hugetlb && !force_pte)
+
+	/* When migrating don't spend cycles coalescing huge pages */
+	if (!hugetlb && !force_pte && !migration_active)
 		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
-	if (hugetlb) {
+	/* During migration don't install huge pages */
+	if (hugetlb && !migration_active) {
 		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
 		new_pmd = pmd_mkhuge(new_pmd);
 		if (writable) {
@@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
 		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
+
+		/*
+		 * If the pmd is mapping a huge page then split it up into
+		 * small pages when doing live migration.
+		 */
+		if (migration_active) {
+			pmd_t *pmd;
+			if (hugetlb) {
+				pfn += pte_index(fault_ipa);
+				gfn = fault_ipa >> PAGE_SHIFT;
+			}
+			new_pte = pfn_pte(pfn, PAGE_S2);
+			pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
+			if (pmd && kvm_pmd_huge(*pmd))
+				clear_pmd_entry(kvm, pmd, fault_ipa);
+		}
+
 		if (writable) {
 			kvm_set_s2pte_writable(&new_pte);
 			kvm_set_pfn_dirty(pfn);
@@ -1077,6 +1106,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
 	}
 
+	/* Assuming 4k pages, set one bit/page in memslot dirty_bitmap[] */
+	if (writable)
+		mark_page_dirty(kvm, gfn);
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v6 0/4] live migration dirty bitmap support for ARMv7
  2014-05-15 18:27 [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Mario Smarduch
                   ` (3 preceding siblings ...)
  2014-05-15 18:27 ` [PATCH v6 4/4] add 2nd stage page fault handling during live migration Mario Smarduch
@ 2014-05-15 18:51 ` Christoffer Dall
  2014-05-15 22:53   ` Mario Smarduch
  4 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-15 18:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 15, 2014 at 11:27:27AM -0700, Mario Smarduch wrote:
> This is v6 patcheset of live mgiration support for ARMv7.

migration

This is an extremely terse cover letter.  It would have been nice with a
few sentences of which existing features this leverages, which support
was missing, what the preferred approach is, etc.  Also, links to a wiki
page or just a few notes on how you did the testing below with which
user space tools etc. would also have been great.

> 
> - Tested on two 4-way A15 hardware, QEMU 2-way/4-way SMP guest upto 2GB
> - Various dirty data rates tested - 2GB/1s ... 2048 pgs/5ms
> - validated source/destination memory image integrity
> 
> Changes since v1:
> - add unlock of VM mmu_lock to prevent a deadlock
> - moved migratiion active inside mmu_lock acquire for visibility in 2nd stage
>   data abort handler
> - Added comments
> 
> Changes since v2: 
> - move initial VM write protect to memory region architecture prepare function
>   (needed to make dirty logging function generic) 
> - added stage2_mark_pte_ro() - to mark ptes ro - Marc's comment
> - optimized initial VM memory region write protect to do fewer table lookups -
>   applied Marc's comment for walking dirty bitmap mask
> - added pud_addr_end() for stage2 tables, to make the walk 4-level
> - added kvm_flush_remote_tlbs() to use ARM TLB invalidation, made the generic
>   one weak, Marc's comment to for generic dirty bitmap log function
> - optimized walking dirty bit map mask to skip upper tables - Marc's comment
> - deleted x86,arm kvm_vm_ioctl_get_dirty_log(), moved to kvm_main.c tagged 
>   the function weak - Marc's comment
> - changed Data Abort handler pte index handling - Marc's comment
> 
> Changes since v3:
> - changed pte updates to reset write bit instead of setting default 
>   value for existing pte's - Steve's comment
> - In addition to PUD add 2nd stage >4GB range functions - Steves
>   suggestion
> - Restructured initial memory slot write protect function for PGD, PUD, PMD
>   table walking - Steves suggestion
> - Renamed variable types to resemble their use - Steves suggestions
> - Added couple pte helpers for 2nd stage tables - Steves suggestion
> - Updated unmap_range() that handles 2nd stage tables and identity mappings
>   to handle 2nd stage addresses >4GB. Left ARMv8 unchanged.
> 
> Changes since v4:
> - rebased to 3.15.0-rc1 - 'next' to pickup p*addr_end patches - Gavins comment
> - Update PUD address end function to support 4-level page table walk
> - Elimiated 5th patch of the series that fixed unmap_range(), since it was
>   fixed by Marcs patches.
> 
> Changes since v5:
> - Created seperate entry point for VMID TLB flush with no param - Christoffers
>   comment
> - Update documentation for kvm_flush_remote_tlbs() - Christoffers comment
> - Simplified splitting of huge pages - inittial WP and 2nd stage DABT handler
>   clear the huge page PMD, and use current code to fault in small pages.
>   Removed kvm_split_pmd().
> 
> Mario Smarduch (4):
>   add ARMv7 HYP API to flush VM TLBs without address param
>   live migration support for initial write protect of VM
>   live migration support for VM dirty log management
>   add 2nd stage page fault handling during live migration
> 
>  arch/arm/include/asm/kvm_asm.h  |    1 +
>  arch/arm/include/asm/kvm_host.h |   11 ++
>  arch/arm/include/asm/kvm_mmu.h  |   10 ++
>  arch/arm/kvm/arm.c              |    8 +-
>  arch/arm/kvm/interrupts.S       |   11 ++
>  arch/arm/kvm/mmu.c              |  292 ++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   86 ------------
>  virt/kvm/kvm_main.c             |   84 ++++++++++-
>  8 files changed, 409 insertions(+), 94 deletions(-)
> 
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 0/4] live migration dirty bitmap support for ARMv7
  2014-05-15 18:51 ` [PATCH v6 0/4] live migration dirty bitmap support for ARMv7 Christoffer Dall
@ 2014-05-15 22:53   ` Mario Smarduch
  0 siblings, 0 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-15 22:53 UTC (permalink / raw)
  To: linux-arm-kernel

Will do that, I'm sure there will be another iteration :).

On 05/15/2014 11:51 AM, Christoffer Dall wrote:
> On Thu, May 15, 2014 at 11:27:27AM -0700, Mario Smarduch wrote:
>> This is v6 patcheset of live mgiration support for ARMv7.
> 
> migration
> 
> This is an extremely terse cover letter.  It would have been nice with a
> few sentences of which existing features this leverages, which support
> was missing, what the preferred approach is, etc.  Also, links to a wiki
> page or just a few notes on how you did the testing below with which
> user space tools etc. would also have been great.
> 
>>
>> - Tested on two 4-way A15 hardware, QEMU 2-way/4-way SMP guest upto 2GB
>> - Various dirty data rates tested - 2GB/1s ... 2048 pgs/5ms
>> - validated source/destination memory image integrity
>>
>> Changes since v1:
>> - add unlock of VM mmu_lock to prevent a deadlock
>> - moved migratiion active inside mmu_lock acquire for visibility in 2nd stage
>>   data abort handler
>> - Added comments
>>
>> Changes since v2: 
>> - move initial VM write protect to memory region architecture prepare function
>>   (needed to make dirty logging function generic) 
>> - added stage2_mark_pte_ro() - to mark ptes ro - Marc's comment
>> - optimized initial VM memory region write protect to do fewer table lookups -
>>   applied Marc's comment for walking dirty bitmap mask
>> - added pud_addr_end() for stage2 tables, to make the walk 4-level
>> - added kvm_flush_remote_tlbs() to use ARM TLB invalidation, made the generic
>>   one weak, Marc's comment to for generic dirty bitmap log function
>> - optimized walking dirty bit map mask to skip upper tables - Marc's comment
>> - deleted x86,arm kvm_vm_ioctl_get_dirty_log(), moved to kvm_main.c tagged 
>>   the function weak - Marc's comment
>> - changed Data Abort handler pte index handling - Marc's comment
>>
>> Changes since v3:
>> - changed pte updates to reset write bit instead of setting default 
>>   value for existing pte's - Steve's comment
>> - In addition to PUD add 2nd stage >4GB range functions - Steves
>>   suggestion
>> - Restructured initial memory slot write protect function for PGD, PUD, PMD
>>   table walking - Steves suggestion
>> - Renamed variable types to resemble their use - Steves suggestions
>> - Added couple pte helpers for 2nd stage tables - Steves suggestion
>> - Updated unmap_range() that handles 2nd stage tables and identity mappings
>>   to handle 2nd stage addresses >4GB. Left ARMv8 unchanged.
>>
>> Changes since v4:
>> - rebased to 3.15.0-rc1 - 'next' to pickup p*addr_end patches - Gavins comment
>> - Update PUD address end function to support 4-level page table walk
>> - Elimiated 5th patch of the series that fixed unmap_range(), since it was
>>   fixed by Marcs patches.
>>
>> Changes since v5:
>> - Created seperate entry point for VMID TLB flush with no param - Christoffers
>>   comment
>> - Update documentation for kvm_flush_remote_tlbs() - Christoffers comment
>> - Simplified splitting of huge pages - inittial WP and 2nd stage DABT handler
>>   clear the huge page PMD, and use current code to fault in small pages.
>>   Removed kvm_split_pmd().
>>
>> Mario Smarduch (4):
>>   add ARMv7 HYP API to flush VM TLBs without address param
>>   live migration support for initial write protect of VM
>>   live migration support for VM dirty log management
>>   add 2nd stage page fault handling during live migration
>>
>>  arch/arm/include/asm/kvm_asm.h  |    1 +
>>  arch/arm/include/asm/kvm_host.h |   11 ++
>>  arch/arm/include/asm/kvm_mmu.h  |   10 ++
>>  arch/arm/kvm/arm.c              |    8 +-
>>  arch/arm/kvm/interrupts.S       |   11 ++
>>  arch/arm/kvm/mmu.c              |  292 ++++++++++++++++++++++++++++++++++++++-
>>  arch/x86/kvm/x86.c              |   86 ------------
>>  virt/kvm/kvm_main.c             |   84 ++++++++++-
>>  8 files changed, 409 insertions(+), 94 deletions(-)
>>
>> -- 
>> 1.7.9.5
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param
  2014-05-15 18:27 ` [PATCH v6 1/4] add ARMv7 HYP API to flush VM TLBs without address param Mario Smarduch
@ 2014-05-27 19:51   ` Christoffer Dall
  0 siblings, 0 replies; 26+ messages in thread
From: Christoffer Dall @ 2014-05-27 19:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 15, 2014 at 11:27:28AM -0700, Mario Smarduch wrote:
> Patch adds HYP interface for global VM TLB invalidation without address
> parameter. Added ARM version of kvm_flush_remote_tlbs(), made generic one weak.

nit: made the generic implementation a weak symbol.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 2/4] live migration support for initial write protect of VM
  2014-05-15 18:27 ` [PATCH v6 2/4] live migration support for initial write protect of VM Mario Smarduch
@ 2014-05-27 19:58   ` Christoffer Dall
  2014-05-27 20:15     ` Mario Smarduch
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-27 19:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 15, 2014 at 11:27:29AM -0700, Mario Smarduch wrote:
> Patch adds memslot support for initial write protection and split up of huge 
> pages

I lost track of where we are with these patches, but I see a lot of
issues in this patch that I believe I already commented on (but I may
not have had time to comment before you sent out v6).

In any case, I'm going to wait with reviewing things carefully until you
send out a v7, but for v7:
 - Please document the rationale and design behind what you're doing in
   the commit text of each patch.  Each of these patches are quite
   large, but the commit messages are barely two lines.  I suggest you
   take a look at 'git log arch/arm/kvm' for example to get a feel for
   what I'm looking for.

 - There is nothing specific in the interface to KVM discussing
   migration or live migration, it is only used as an example for
   features in trying to stay generic.  Please use similar generic
   concepts in the kernel to make things coherent.  'git grep
   migration arch/x86/kvm' also tells you that x86 gets away with full
   support for live migration without referring to migration except as
   examples of how features might be useful.
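   For instance, the arch code could key off a dirty-logging predicate rather
   than a migration flag, along the lines of (sketch only, illustrative name):

	/* sketch: a memslot is being logged iff it has a dirty bitmap */
	static bool kvm_memslot_is_logging(struct kvm_memory_slot *memslot)
	{
		return !!memslot->dirty_bitmap;
	}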

Thanks for the work, looking forward to seeing a new revision.

-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 3/4] live migration support for VM dirty log management
  2014-05-15 18:27 ` [PATCH v6 3/4] live migration support for VM dirty log management Mario Smarduch
@ 2014-05-27 20:12   ` Christoffer Dall
  2014-05-27 21:55     ` Mario Smarduch
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-27 20:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:
> This patch adds support for keeping track of VM dirty pages, by updating
> per memslot dirty bitmap and write protecting the page again.
> 
> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
> ---
>  arch/arm/include/asm/kvm_host.h |    3 ++
>  arch/arm/kvm/arm.c              |    5 --
>  arch/arm/kvm/mmu.c              |   98 +++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              |   86 ----------------------------------
>  virt/kvm/kvm_main.c             |   82 ++++++++++++++++++++++++++++++++
>  5 files changed, 183 insertions(+), 91 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 0e55b17..4fef77d 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -238,5 +238,8 @@ u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
>  int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>  
>  int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
> +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> +	struct kvm_memory_slot *slot,
> +	gfn_t gfn_offset, unsigned long mask);
>  
>  #endif /* __ARM_KVM_HOST_H__ */
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 1055266..0b847b5 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -777,11 +777,6 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  	}
>  }
>  
> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
> -{
> -	return -EINVAL;
> -}
> -
>  static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>  					struct kvm_arm_device_addr *dev_addr)
>  {
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index b71ad27..b939312 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -891,6 +891,104 @@ out:
>  	return ret;
>  }
>  
> +
> +/**
> + * kvm_mmu_write_protected_pt_masked - walk mask relative start of memslot and
> + *	write protect again for next dirty log read.
> + *
> + *  After migration thread write protects entire VM iterative calls are made
> + *  to get diry page log. The log is returned and dirty pages are write
> + *  protected again. This function is called as a result KVM_GET_DIRTY_LOG
> + *  ioctl.
> + *  'kvm->mmu_lock' must be  held to protect against concurrent modification
> + *  of page tables (2nd stage fault, mmu modifiers, ...)
> + *
> + * @kvm:        The KVM pointer
> + * @slot:       The memory slot the dirty log is retrieved for
> + * @gfn_offset: The gfn offset in memory slot
> + * @mask:       The mask of dirty pages at offset 'gnf_offset in this memory
> + *              slot to be writ protect
> + */
> +
> +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> +		struct kvm_memory_slot *slot,
> +		gfn_t gfn_offset, unsigned long mask)
> +{
> +	phys_addr_t ipa, next, offset_ipa;
> +	pgd_t *pgdp = kvm->arch.pgd, *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	gfn_t gfnofst = slot->base_gfn + gfn_offset;
> +	bool crosses_pmd;
> +
> +	ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
> +	offset_ipa  = gfnofst << PAGE_SHIFT;
> +	next = (gfnofst + (BITS_PER_LONG - 1)) << PAGE_SHIFT;
> +
> +	/* check if mask width crosses 2nd level page table range, and
> +	 * possibly 3rd, 4th. If not skip upper table lookups. Unlikely
> +	 * to be true.
> +	 */
> +	crosses_pmd = ((offset_ipa & PMD_MASK) ^ (next & PMD_MASK)) ? true :
> +									false;

you can just assign the value, no need for the ternary operator, a bool
will always be true or false.  (Marc wanted to make this explicit
elsewhere in the code, and uses the 'val = !!(expression)' syntax).
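
For example, something like (just a sketch):

	crosses_pmd = !!((offset_ipa & PMD_MASK) ^ (next & PMD_MASK));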

> +
> +	/* If pgd, pud, pmd not present and you cross pmd range check next
> +	 * index.
> +	 */
> +	pgd = pgdp + pgd_index(ipa);
> +	if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
> +		pgd = pgdp + pgd_index(next);
> +		if (!pgd_present(*pgd))
> +			return;
> +	}
> +
> +	pud = pud_offset(pgd, ipa);
> +	if (unlikely(crosses_pmd && !pud_present(*pud))) {
> +		pud = pud_offset(pgd, next);
> +		if (!pud_present(*pud))
> +			return;
> +	}
> +
> +	pmd = pmd_offset(pud, ipa);
> +	if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
> +		pmd = pmd_offset(pud, next);
> +		if (!pmd_present(*pmd))
> +			return;
> +	}
> +
> +	for (;;) {
> +		pte = pte_offset_kernel(pmd, ipa);
> +		if (!pte_present(*pte))
> +			goto next_ipa;
> +
> +		if (kvm_s2pte_readonly(pte))
> +			goto next_ipa;
> +		kvm_set_s2pte_readonly(pte);
> +next_ipa:
> +		mask &= mask - 1;
> +		if (!mask)
> +			break;
> +
> +		/* find next page */
> +		ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
> +
> +		/* skip upper page table lookups */
> +		if (!crosses_pmd)
> +			continue;
> +
> +		pgd = pgdp + pgd_index(ipa);
> +		if (unlikely(!pgd_present(*pgd)))
> +			goto next_ipa;
> +		pud = pud_offset(pgd, ipa);
> +		if (unlikely(!pud_present(*pud)))
> +			goto next_ipa;
> +		pmd = pmd_offset(pud, ipa);
> +		if (unlikely(!pmd_present(*pmd)))
> +			goto next_ipa;
> +	}

So I think the reason this is done separately on x86 is that they have
an rmap structure for their gfn mappings so that they can quickly look up
ptes based on a gfn and write-protect them without having to walk the
stage-2 page tables.

Unless you want to introduce this on ARM, I think you will be much
better off just having a single (properly written) iterating
write-protect function, that takes a start and end IPA and a bitmap for
which pages to actually write-protect, which can then handle the generic
case (either NULL or all-ones bitmap) or a specific case, which just
traverses the IPA range given as input.  Such a function should follow
the model of page table walk functions discussed previously
(separate functions: wp_pgd_entries(), wp_pud_entries(),
wp_pmd_entries(), wp_pte_entries()).
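
Something along those lines, as a rough sketch only (the names below are
illustrative and reuse the stage-2 helpers already introduced in this series;
the pud/pgd levels and the bitmap argument are omitted for brevity):

	/* sketch: write protect all present ptes in [addr, end) of one pmd */
	static void stage2_wp_pte_entries(pmd_t *pmd, phys_addr_t addr,
					  phys_addr_t end)
	{
		pte_t *pte = pte_offset_kernel(pmd, addr);

		do {
			if (pte_present(*pte) && !kvm_s2pte_readonly(pte))
				kvm_set_s2pte_readonly(pte);
		} while (pte++, addr += PAGE_SIZE, addr != end);
	}

	/* sketch: descend into each present, non-huge pmd of a pud range */
	static void stage2_wp_pmd_entries(pud_t *pud, phys_addr_t addr,
					  phys_addr_t end)
	{
		pmd_t *pmd = pmd_offset(pud, addr);
		phys_addr_t next;

		do {
			next = kvm_pmd_addr_end(addr, end);
			if (pmd_present(*pmd) && !kvm_pmd_huge(*pmd))
				stage2_wp_pte_entries(pmd, addr, next);
		} while (pmd++, addr = next, addr != end);
	}

	/* wp_pud_entries()/wp_pgd_entries() would follow the same pattern. */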

However, you may want to verify my assumption above with the x86 people
and look at sharing the rmap logic between architectures.

In any case, this code is very difficult to read and understand, and it
doesn't look at all like the other code we have to walk page tables.  I
understand you are trying to optimize for performance (by skipping some
intermediate page table level lookups), but you never declare that goal
anywhere in the code or in the commit message.

> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot,
>  			  unsigned long fault_status)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c5582c3..a603ca3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3569,92 +3569,6 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
>  	return 0;
>  }
>  
> -/**
> - * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> - * @kvm: kvm instance
> - * @log: slot id and address to which we copy the log
> - *
> - * We need to keep it in mind that VCPU threads can write to the bitmap
> - * concurrently.  So, to avoid losing data, we keep the following order for
> - * each bit:
> - *
> - *   1. Take a snapshot of the bit and clear it if needed.
> - *   2. Write protect the corresponding page.
> - *   3. Flush TLB's if needed.
> - *   4. Copy the snapshot to the userspace.
> - *
> - * Between 2 and 3, the guest may write to the page using the remaining TLB
> - * entry.  This is not a problem because the page will be reported dirty at
> - * step 4 using the snapshot taken before and step 3 ensures that successive
> - * writes will be logged for the next call.
> - */
> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
> -{
> -	int r;
> -	struct kvm_memory_slot *memslot;
> -	unsigned long n, i;
> -	unsigned long *dirty_bitmap;
> -	unsigned long *dirty_bitmap_buffer;
> -	bool is_dirty = false;
> -
> -	mutex_lock(&kvm->slots_lock);
> -
> -	r = -EINVAL;
> -	if (log->slot >= KVM_USER_MEM_SLOTS)
> -		goto out;
> -
> -	memslot = id_to_memslot(kvm->memslots, log->slot);
> -
> -	dirty_bitmap = memslot->dirty_bitmap;
> -	r = -ENOENT;
> -	if (!dirty_bitmap)
> -		goto out;
> -
> -	n = kvm_dirty_bitmap_bytes(memslot);
> -
> -	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> -	memset(dirty_bitmap_buffer, 0, n);
> -
> -	spin_lock(&kvm->mmu_lock);
> -
> -	for (i = 0; i < n / sizeof(long); i++) {
> -		unsigned long mask;
> -		gfn_t offset;
> -
> -		if (!dirty_bitmap[i])
> -			continue;
> -
> -		is_dirty = true;
> -
> -		mask = xchg(&dirty_bitmap[i], 0);
> -		dirty_bitmap_buffer[i] = mask;
> -
> -		offset = i * BITS_PER_LONG;
> -		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
> -	}
> -
> -	spin_unlock(&kvm->mmu_lock);
> -
> -	/* See the comments in kvm_mmu_slot_remove_write_access(). */
> -	lockdep_assert_held(&kvm->slots_lock);
> -
> -	/*
> -	 * All the TLBs can be flushed out of mmu lock, see the comments in
> -	 * kvm_mmu_slot_remove_write_access().
> -	 */
> -	if (is_dirty)
> -		kvm_flush_remote_tlbs(kvm);
> -
> -	r = -EFAULT;
> -	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
> -		goto out;
> -
> -	r = 0;
> -out:
> -	mutex_unlock(&kvm->slots_lock);
> -	return r;
> -}
> -
>  int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
>  			bool line_status)
>  {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ba25765..d33ac9c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -429,6 +429,88 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>  	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
>  }
>  
> +/**
> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> + * @kvm: kvm instance
> + * @log: slot id and address to which we copy the log
> + *
> + * Shared by x86 and ARM.

probably an unnecessary comment

> + *
> + * We need to keep it in mind that VCPU threads can write to the bitmap
> + * concurrently.  So, to avoid losing data, we keep the following order for
> + * each bit:
> + *
> + *   1. Take a snapshot of the bit and clear it if needed.
> + *   2. Write protect the corresponding page.
> + *   3. Flush TLB's if needed.
> + *   4. Copy the snapshot to the userspace.
> + *
> + * Between 2 and 3, the guest may write to the page using the remaining TLB
> + * entry.  This is not a problem because the page will be reported dirty at
> + * step 4 using the snapshot taken before and step 3 ensures that successive
> + * writes will be logged for the next call.
> + */
> +
> +int __weak kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> +						struct kvm_dirty_log *log)
> +{
> +	int r;
> +	struct kvm_memory_slot *memslot;
> +	unsigned long n, i;
> +	unsigned long *dirty_bitmap;
> +	unsigned long *dirty_bitmap_buffer;
> +	bool is_dirty = false;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	r = -EINVAL;
> +	if (log->slot >= KVM_USER_MEM_SLOTS)
> +		goto out;
> +
> +	memslot = id_to_memslot(kvm->memslots, log->slot);
> +
> +	dirty_bitmap = memslot->dirty_bitmap;
> +	r = -ENOENT;
> +	if (!dirty_bitmap)
> +		goto out;
> +
> +	n = kvm_dirty_bitmap_bytes(memslot);
> +
> +	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> +	memset(dirty_bitmap_buffer, 0, n);
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	for (i = 0; i < n / sizeof(long); i++) {
> +		unsigned long mask;
> +		gfn_t offset;
> +
> +		if (!dirty_bitmap[i])
> +			continue;
> +
> +		is_dirty = true;
> +
> +		mask = xchg(&dirty_bitmap[i], 0);
> +		dirty_bitmap_buffer[i] = mask;
> +
> +		offset = i * BITS_PER_LONG;
> +		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
> +	}
> +	if (is_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	r = -EFAULT;
> +	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
> +		goto out;
> +
> +	r = 0;
> +out:
> +	mutex_unlock(&kvm->slots_lock);
> +	return r;
> +}
> +
>  #else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
>  
>  static int kvm_init_mmu_notifier(struct kvm *kvm)
> -- 
> 1.7.9.5
> 

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 2/4] live migration support for initial write protect of VM
  2014-05-27 19:58   ` Christoffer Dall
@ 2014-05-27 20:15     ` Mario Smarduch
  2014-05-27 20:20       ` Christoffer Dall
  0 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-27 20:15 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Christoffer,
 I was out traveling last week + holiday.

You had lots of comments on the last version (incl. below); I'm reworking to
submit a new series, un-cluttering the basic issues and updating the current
logic. In the next couple of days I'll submit the new series.

Also looking into a wiki to document the test env (but it may wind up as a github link).

Thanks,
  Mario



On 05/27/2014 12:58 PM, Christoffer Dall wrote:
> On Thu, May 15, 2014 at 11:27:29AM -0700, Mario Smarduch wrote:
>> Patch adds memslot support for initial write protection and split up of huge 
>> pages
> 
> I lost track of where we are with these patches, but I see a lot of
> issues in this patch that I believe I already commented on (but I may
> not have had time to comment before you sent out v6).
> 
> In any case, I'm going to wait with reviewing things carefully until you
> send out a v7, but for v7:
>  - Please document the rationale and design behind what you're doing in
>    the commit text of each patch.  Each of these patches are quite
>    large, but the commit messages are barely two lines.  I suggest you
>    take a look at 'git log arch/arm/kvm' for example to get a feel for
>    what I'm looking for.
> 
>  - There is nothing specific in the interface to KVM discussing
>    migration or live migration, it is only used as an example for
>    features in trying to stay generic.  Please use similar generic
>    concepts in the kernel to make things coherent.  'git grep
>    migration arch/x86/kvm' also tells you that x86 gets away with full
>    support for live migration without referring to migration except as
>    examples of how features might be useful.
> 
> Thanks for the work, looking forward to seeing a new revision.
> 
> -Christoffer
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-15 18:27 ` [PATCH v6 4/4] add 2nd stage page fault handling during live migration Mario Smarduch
@ 2014-05-27 20:19   ` Christoffer Dall
  2014-05-28  1:30     ` Mario Smarduch
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-27 20:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 15, 2014 at 11:27:31AM -0700, Mario Smarduch wrote:
> This patch adds support for handling 2nd stage page faults during migration,
> it disables faulting in huge pages, and splits up existing huge pages.
> 
> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
> ---
>  arch/arm/kvm/mmu.c |   36 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index b939312..10e7bf6 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>  	struct vm_area_struct *vma;
>  	pfn_t pfn;
> +	bool migration_active;
>  
>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
>  	if (fault_status == FSC_PERM && !write_fault) {
> @@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		return -EFAULT;
>  
>  	spin_lock(&kvm->mmu_lock);
> +
> +	/*
> +	 * Place inside lock to prevent race condition when whole VM is being
> +	 * write proteced. Prevent race of huge page install when migration is
> +	 * active.
> +	 */
> +	migration_active = vcpu->kvm->arch.migration_in_progress;
> +
>  	if (mmu_notifier_retry(kvm, mmu_seq))
>  		goto out_unlock;
> -	if (!hugetlb && !force_pte)
> +
> +	/* When migrating don't spend cycles coalescing huge pages */
> +	if (!hugetlb && !force_pte && !migration_active)
>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>  
> -	if (hugetlb) {
> +	/* During migration don't install huge pages */

again, all this is not about migration per se, it's about when we are logging
dirty pages (which may commonly be used for migration).

> +	if (hugetlb && !migration_active) {
>  		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
>  		new_pmd = pmd_mkhuge(new_pmd);
>  		if (writable) {
> @@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>  	} else {
>  		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
> +
> +		/*
> +		 * If pmd is  mapping a huge page then split it up into
> +		 * small pages, when doing live migration.
> +		 */
> +		if (migration_active) {
> +			pmd_t *pmd;
> +			if (hugetlb) {
> +				pfn += pte_index(fault_ipa);
> +				gfn = fault_ipa >> PAGE_SHIFT;
> +			}

how can you have hugetlb when we entered this else-clause conditional on
having !hugetlb?

> +			new_pte = pfn_pte(pfn, PAGE_S2);
> +			pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
> +			if (pmd && kvm_pmd_huge(*pmd))
> +				clear_pmd_entry(kvm, pmd, fault_ipa);

If we have a huge pmd entry, how did we take a fault on there?  Would
that be if a different CPU inserted a huge page entry since we got here,
is this what you're trying to handle?

I'm confused.

> +		}
> +
>  		if (writable) {
>  			kvm_set_s2pte_writable(&new_pte);
>  			kvm_set_pfn_dirty(pfn);
> @@ -1077,6 +1106,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
>  	}
>  
> +	/* Assuming 4k pages, set one bit/page in memslot dirty_bitmap[] */

Assuming? this makes me nervous.  The point is probably that it's
harmless if we're not logging dirty pages, because then nobody reads the
data structure, and if we are logging, then we are mapping everything
using 4K pages?

It's probably clearer code-wise to condition this on whether or not we
are logging dirty pages, and the branch is also likely to be much faster
than the function call to mark_page_dirty.
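
i.e. something along these lines (sketch only; memslot is the slot already
passed into user_mem_abort(), and whether such a check is race-free against
concurrent memslot updates is a separate question):

	/* only bother with the log when the slot has dirty logging enabled */
	if (writable && memslot->dirty_bitmap)
		mark_page_dirty(kvm, gfn);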

> +	if (writable)
> +		mark_page_dirty(kvm, gfn);
>  
>  out_unlock:
>  	spin_unlock(&kvm->mmu_lock);
> -- 
> 1.7.9.5
> 

-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 2/4] live migration support for initial write protect of VM
  2014-05-27 20:15     ` Mario Smarduch
@ 2014-05-27 20:20       ` Christoffer Dall
  0 siblings, 0 replies; 26+ messages in thread
From: Christoffer Dall @ 2014-05-27 20:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 27, 2014 at 01:15:46PM -0700, Mario Smarduch wrote:
> Hi Christoffer,
>  I was out traveling last week + holiday.
> 
> You had lots of comments on the last version (incl. below); I'm reworking them
> into a new series to un-clutter it from the basic issues and update the current
> logic. I'll submit the new series in the next couple of days.
> 
> Also looking into a wiki to document the test env (but may wind up with a github link).
> 
Just explaining how you tested it in the cover letter, giving a minimum
of information on how someone may go about reproducing your results
would be really useful.

Thanks!
-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 3/4] live migration support for VM dirty log management
  2014-05-27 20:12   ` Christoffer Dall
@ 2014-05-27 21:55     ` Mario Smarduch
  2014-05-28  9:08       ` Christoffer Dall
  0 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-27 21:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/27/2014 01:12 PM, Christoffer Dall wrote:
> On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:
>> This patch adds support for keeping track of VM dirty pages, by updating
>> per memslot dirty bitmap and write protecting the page again.
>>
>> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
>> ---
>>  arch/arm/include/asm/kvm_host.h |    3 ++
>>  arch/arm/kvm/arm.c              |    5 --
>>  arch/arm/kvm/mmu.c              |   98 +++++++++++++++++++++++++++++++++++++++
>>  arch/x86/kvm/x86.c              |   86 ----------------------------------
>>  virt/kvm/kvm_main.c             |   82 ++++++++++++++++++++++++++++++++
>>  5 files changed, 183 insertions(+), 91 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index 0e55b17..4fef77d 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -238,5 +238,8 @@ u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid);
>>  int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>>  
>>  int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
>> +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>> +	struct kvm_memory_slot *slot,
>> +	gfn_t gfn_offset, unsigned long mask);
>>  
>>  #endif /* __ARM_KVM_HOST_H__ */
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 1055266..0b847b5 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -777,11 +777,6 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>  	}
>>  }
>>  
>> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>> -{
>> -	return -EINVAL;
>> -}
>> -
>>  static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>>  					struct kvm_arm_device_addr *dev_addr)
>>  {
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index b71ad27..b939312 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -891,6 +891,104 @@ out:
>>  	return ret;
>>  }
>>  
>> +
>> +/**
>> + * kvm_mmu_write_protect_pt_masked - walk mask relative to start of memslot and
>> + *	write protect again for next dirty log read.
>> + *
>> + *  After the migration thread write protects the entire VM, iterative calls
>> + *  are made to get the dirty page log. The log is returned and dirty pages
>> + *  are write protected again. This function is called as a result of the
>> + *  KVM_GET_DIRTY_LOG ioctl.
>> + *  'kvm->mmu_lock' must be held to protect against concurrent modification
>> + *  of page tables (2nd stage fault, mmu modifiers, ...)
>> + *
>> + * @kvm:        The KVM pointer
>> + * @slot:       The memory slot the dirty log is retrieved for
>> + * @gfn_offset: The gfn offset in the memory slot
>> + * @mask:       The mask of dirty pages at offset 'gfn_offset' in this memory
>> + *              slot to be write protected
>> + */
>> +
>> +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>> +		struct kvm_memory_slot *slot,
>> +		gfn_t gfn_offset, unsigned long mask)
>> +{
>> +	phys_addr_t ipa, next, offset_ipa;
>> +	pgd_t *pgdp = kvm->arch.pgd, *pgd;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	pte_t *pte;
>> +	gfn_t gfnofst = slot->base_gfn + gfn_offset;
>> +	bool crosses_pmd;
>> +
>> +	ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
>> +	offset_ipa  = gfnofst << PAGE_SHIFT;
>> +	next = (gfnofst + (BITS_PER_LONG - 1)) << PAGE_SHIFT;
>> +
>> +	/* check if mask width crosses 2nd level page table range, and
>> +	 * possibly 3rd, 4th. If not skip upper table lookups. Unlikely
>> +	 * to be true.
>> +	 */
>> +	crosses_pmd = ((offset_ipa & PMD_MASK) ^ (next & PMD_MASK)) ? true :
>> +									false;
> 
> you can just assign the value, no need for the ternary operator, a bool
> will always be true or false.  (Marc wanted to make this explicit
> elsewhere in the code, an uses the 'val = !!(expression)' syntax).
> 
Ah ok.
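That is, the assignment above would just become something like:

	crosses_pmd = !!((offset_ipa & PMD_MASK) ^ (next & PMD_MASK));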
>> +
>> +	/* If pgd, pud, pmd not present and you cross pmd range check next
>> +	 * index.
>> +	 */
>> +	pgd = pgdp + pgd_index(ipa);
>> +	if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
>> +		pgd = pgdp + pgd_index(next);
>> +		if (!pgd_present(*pgd))
>> +			return;
>> +	}
>> +
>> +	pud = pud_offset(pgd, ipa);
>> +	if (unlikely(crosses_pmd && !pud_present(*pud))) {
>> +		pud = pud_offset(pgd, next);
>> +		if (!pud_present(*pud))
>> +			return;
>> +	}
>> +
>> +	pmd = pmd_offset(pud, ipa);
>> +	if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
>> +		pmd = pmd_offset(pud, next);
>> +		if (!pmd_present(*pmd))
>> +			return;
>> +	}
>> +
>> +	for (;;) {
>> +		pte = pte_offset_kernel(pmd, ipa);
>> +		if (!pte_present(*pte))
>> +			goto next_ipa;
>> +
>> +		if (kvm_s2pte_readonly(pte))
>> +			goto next_ipa;
>> +		kvm_set_s2pte_readonly(pte);
>> +next_ipa:
>> +		mask &= mask - 1;
>> +		if (!mask)
>> +			break;
>> +
>> +		/* find next page */
>> +		ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
>> +
>> +		/* skip upper page table lookups */
>> +		if (!crosses_pmd)
>> +			continue;
>> +
>> +		pgd = pgdp + pgd_index(ipa);
>> +		if (unlikely(!pgd_present(*pgd)))
>> +			goto next_ipa;
>> +		pud = pud_offset(pgd, ipa);
>> +		if (unlikely(!pud_present(*pud)))
>> +			goto next_ipa;
>> +		pmd = pmd_offset(pud, ipa);
>> +		if (unlikely(!pmd_present(*pmd)))
>> +			goto next_ipa;
>> +	}
> 
> So I think the reason this is done separately on x86 is that they have
> an rmap structure for their gfn mappings so that they can quickly lookup
> ptes based on a gfn and write-protect it without having to walk the
> stage-2 page tables.

Yes, they also use rmaps for mmu notifiers; invalidations on huge VMs and
large ranges resulted in excessive times.
> 
> Unless you want to introduce this on ARM, I think you will be much

Eventually yes, but that would also require reworking mmu notifiers. I had
a two-step approach in mind. Initially get the dirty page marking to work,
TLB flushing, GIC/arch-timer migration, validate migration under various 
stress loads (page reclaim) with mmu notifiers, test several VMs and migration 
times. 

Then get rmap (or something similar) working - eventually it's needed for huge
VMs. In short, two phases.

> better off just having a single (properly written) iterating
> write-protect function, that takes a start and end IPA and a bitmap for
> which pages to actually write-protect, which can then handle the generic
> case (either NULL or all-ones bitmap) or a specific case, which just
> traverses the IPA range given as input.  Such a function should follow
> the model of page table walk functions discussed previously
> (separate functions: wp_pgd_enties(), wp_pud_entries(),
> wp_pmd_entries(), wp_pte_entries()).
> 
> However, you may want to verify my assumption above with the x86 people
> and look at sharing the rmap logic between architectures.
> 
> In any case, this code is very difficult to read and understand, and it
> doesn't look at all like the other code we have to walk page tables.  I
> understand you are trying to optimize for performance (by skipping some
> intermediate page table level lookups), but you never declare that goal
> anywhere in the code or in the commit message.

Marc noticed I was walking a small range (128k) using upper table iterations
that covered 1G and 2MB ranges. As you mention, the code tries to optimize the
upper table lookups. Yes, the function is too bulky, but I'm not sure how to
remove the upper table checks, since page tables may change between the time
pages are marked dirty and the time the log is retrieved. And if a memory slot
is very dirty, walking the upper tables will impact performance. I'll think
some more on this function.
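
FWIW, stripped of the upper-table caching, the bit walk itself is just the
following (stage2_wp_page() is only a placeholder here for write protecting a
single pte, not an existing function):

	while (mask) {
		phys_addr_t ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;

		stage2_wp_page(kvm, ipa);	/* WP the pte for this IPA */
		mask &= mask - 1;		/* clear the lowest set bit */
	}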

> 
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  			  struct kvm_memory_slot *memslot,
>>  			  unsigned long fault_status)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c5582c3..a603ca3 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3569,92 +3569,6 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
>>  	return 0;
>>  }
>>  
>> -/**
>> - * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
>> - * @kvm: kvm instance
>> - * @log: slot id and address to which we copy the log
>> - *
>> - * We need to keep it in mind that VCPU threads can write to the bitmap
>> - * concurrently.  So, to avoid losing data, we keep the following order for
>> - * each bit:
>> - *
>> - *   1. Take a snapshot of the bit and clear it if needed.
>> - *   2. Write protect the corresponding page.
>> - *   3. Flush TLB's if needed.
>> - *   4. Copy the snapshot to the userspace.
>> - *
>> - * Between 2 and 3, the guest may write to the page using the remaining TLB
>> - * entry.  This is not a problem because the page will be reported dirty at
>> - * step 4 using the snapshot taken before and step 3 ensures that successive
>> - * writes will be logged for the next call.
>> - */
>> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>> -{
>> -	int r;
>> -	struct kvm_memory_slot *memslot;
>> -	unsigned long n, i;
>> -	unsigned long *dirty_bitmap;
>> -	unsigned long *dirty_bitmap_buffer;
>> -	bool is_dirty = false;
>> -
>> -	mutex_lock(&kvm->slots_lock);
>> -
>> -	r = -EINVAL;
>> -	if (log->slot >= KVM_USER_MEM_SLOTS)
>> -		goto out;
>> -
>> -	memslot = id_to_memslot(kvm->memslots, log->slot);
>> -
>> -	dirty_bitmap = memslot->dirty_bitmap;
>> -	r = -ENOENT;
>> -	if (!dirty_bitmap)
>> -		goto out;
>> -
>> -	n = kvm_dirty_bitmap_bytes(memslot);
>> -
>> -	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
>> -	memset(dirty_bitmap_buffer, 0, n);
>> -
>> -	spin_lock(&kvm->mmu_lock);
>> -
>> -	for (i = 0; i < n / sizeof(long); i++) {
>> -		unsigned long mask;
>> -		gfn_t offset;
>> -
>> -		if (!dirty_bitmap[i])
>> -			continue;
>> -
>> -		is_dirty = true;
>> -
>> -		mask = xchg(&dirty_bitmap[i], 0);
>> -		dirty_bitmap_buffer[i] = mask;
>> -
>> -		offset = i * BITS_PER_LONG;
>> -		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>> -	}
>> -
>> -	spin_unlock(&kvm->mmu_lock);
>> -
>> -	/* See the comments in kvm_mmu_slot_remove_write_access(). */
>> -	lockdep_assert_held(&kvm->slots_lock);
>> -
>> -	/*
>> -	 * All the TLBs can be flushed out of mmu lock, see the comments in
>> -	 * kvm_mmu_slot_remove_write_access().
>> -	 */
>> -	if (is_dirty)
>> -		kvm_flush_remote_tlbs(kvm);
>> -
>> -	r = -EFAULT;
>> -	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
>> -		goto out;
>> -
>> -	r = 0;
>> -out:
>> -	mutex_unlock(&kvm->slots_lock);
>> -	return r;
>> -}
>> -
>>  int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
>>  			bool line_status)
>>  {
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index ba25765..d33ac9c 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -429,6 +429,88 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
>>  	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
>>  }
>>  
>> +/**
>> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
>> + * @kvm: kvm instance
>> + * @log: slot id and address to which we copy the log
>> + *
>> + * Shared by x86 and ARM.
> 
> probably an unnecessary comment
> 
Sure.
>> + *
>> + * We need to keep it in mind that VCPU threads can write to the bitmap
>> + * concurrently.  So, to avoid losing data, we keep the following order for
>> + * each bit:
>> + *
>> + *   1. Take a snapshot of the bit and clear it if needed.
>> + *   2. Write protect the corresponding page.
>> + *   3. Flush TLB's if needed.
>> + *   4. Copy the snapshot to the userspace.
>> + *
>> + * Between 2 and 3, the guest may write to the page using the remaining TLB
>> + * entry.  This is not a problem because the page will be reported dirty at
>> + * step 4 using the snapshot taken before and step 3 ensures that successive
>> + * writes will be logged for the next call.
>> + */
>> +
>> +int __weak kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>> +						struct kvm_dirty_log *log)
>> +{
>> +	int r;
>> +	struct kvm_memory_slot *memslot;
>> +	unsigned long n, i;
>> +	unsigned long *dirty_bitmap;
>> +	unsigned long *dirty_bitmap_buffer;
>> +	bool is_dirty = false;
>> +
>> +	mutex_lock(&kvm->slots_lock);
>> +
>> +	r = -EINVAL;
>> +	if (log->slot >= KVM_USER_MEM_SLOTS)
>> +		goto out;
>> +
>> +	memslot = id_to_memslot(kvm->memslots, log->slot);
>> +
>> +	dirty_bitmap = memslot->dirty_bitmap;
>> +	r = -ENOENT;
>> +	if (!dirty_bitmap)
>> +		goto out;
>> +
>> +	n = kvm_dirty_bitmap_bytes(memslot);
>> +
>> +	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
>> +	memset(dirty_bitmap_buffer, 0, n);
>> +
>> +	spin_lock(&kvm->mmu_lock);
>> +
>> +	for (i = 0; i < n / sizeof(long); i++) {
>> +		unsigned long mask;
>> +		gfn_t offset;
>> +
>> +		if (!dirty_bitmap[i])
>> +			continue;
>> +
>> +		is_dirty = true;
>> +
>> +		mask = xchg(&dirty_bitmap[i], 0);
>> +		dirty_bitmap_buffer[i] = mask;
>> +
>> +		offset = i * BITS_PER_LONG;
>> +		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>> +	}
>> +	if (is_dirty)
>> +		kvm_flush_remote_tlbs(kvm);
>> +
>> +	spin_unlock(&kvm->mmu_lock);
>> +
>> +	r = -EFAULT;
>> +	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
>> +		goto out;
>> +
>> +	r = 0;
>> +out:
>> +	mutex_unlock(&kvm->slots_lock);
>> +	return r;
>> +}
>> +
>>  #else  /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
>>  
>>  static int kvm_init_mmu_notifier(struct kvm *kvm)
>> -- 
>> 1.7.9.5
>>
> 
> Thanks,
> -Christoffer
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-27 20:19   ` Christoffer Dall
@ 2014-05-28  1:30     ` Mario Smarduch
  2014-05-28  8:09       ` Christoffer Dall
  0 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-28  1:30 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/27/2014 01:19 PM, Christoffer Dall wrote:
> On Thu, May 15, 2014 at 11:27:31AM -0700, Mario Smarduch wrote:
>> This patch adds support for handling 2nd stage page faults during migration,
>> it disables faulting in huge pages, and splits up existing huge pages.
>>
>> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
>> ---
>>  arch/arm/kvm/mmu.c |   36 ++++++++++++++++++++++++++++++++++--
>>  1 file changed, 34 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index b939312..10e7bf6 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>  	struct vm_area_struct *vma;
>>  	pfn_t pfn;
>> +	bool migration_active;
>>  
>>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
>>  	if (fault_status == FSC_PERM && !write_fault) {
>> @@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		return -EFAULT;
>>  
>>  	spin_lock(&kvm->mmu_lock);
>> +
>> +	/*
>> +	 * Place inside lock to prevent race condition when whole VM is being
>> +	 * write proteced. Prevent race of huge page install when migration is
>> +	 * active.
>> +	 */
>> +	migration_active = vcpu->kvm->arch.migration_in_progress;
>> +
>>  	if (mmu_notifier_retry(kvm, mmu_seq))
>>  		goto out_unlock;
>> -	if (!hugetlb && !force_pte)
>> +
>> +	/* When migrating don't spend cycles coalescing huge pages */
>> +	if (!hugetlb && !force_pte && !migration_active)
>>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>>  
>> -	if (hugetlb) {
>> +	/* During migration don't install huge pages */
> 
> again, all this is not about migration per se, it's about when logging
> dirty pages, (which may be commonly used for migration).
> 

Yes that's true, I'll update. Until recently (a new RFC on the qemu list where
dirty logging is used for getting VM RSS or hot memory regions), I didn't see
any other use case.

>> +	if (hugetlb && !migration_active) {
>>  		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
>>  		new_pmd = pmd_mkhuge(new_pmd);
>>  		if (writable) {
>> @@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>>  	} else {
>>  		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
>> +
>> +		/*
>> +		 * If pmd is  mapping a huge page then split it up into
>> +		 * small pages, when doing live migration.
>> +		 */
>> +		if (migration_active) {
>> +			pmd_t *pmd;
>> +			if (hugetlb) {
>> +				pfn += pte_index(fault_ipa);
>> +				gfn = fault_ipa >> PAGE_SHIFT;
>> +			}
> 
> how can you have hugetlb when we entered this else-clause conditional on
> having !hugetlb?
> 
- if(hugetlb && !migration_active)

forces all page faults to enter here while in migration. Huge page entries
are cleared and stage2_set_pte() splits the huge page and installs the pte
for the fault_ipa. I placed that there since it flows with installing
a pte as well as splitting a huge page. But your comment on the performance of
splitting up huge pages vs. deferred page faulting suggests moving it out of here.


>> +			new_pte = pfn_pte(pfn, PAGE_S2);
>> +			pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
>> +			if (pmd && kvm_pmd_huge(*pmd))
>> +				clear_pmd_entry(kvm, pmd, fault_ipa);
> 
> If we have a huge pmd entry, how did we take a fault on there?  Would
> that be if a different CPU inserted a huge page entry since we got here,
> is this what you're trying to handle?
> 
> I'm confused.
> 

I think this is related to the above.

>> +		}
>> +
>>  		if (writable) {
>>  			kvm_set_s2pte_writable(&new_pte);
>>  			kvm_set_pfn_dirty(pfn);
>> @@ -1077,6 +1106,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
>>  	}
>>  
>> +	/* Assuming 4k pages, set one bit/page in memslot dirty_bitmap[] */
> 
> Assuming? this makes me nervous.  The point is probably that it's
> harmless if we're not logging dirty pages, because then nobody reads teh
> data structure, and if we are logging, then we are mapping everything
> using 4K pages?
> 
> It's probably clearer code-wise to condition this on whether or not we
> are logging dirty page, and the branch is also likely to be much faster
> than the function call to mark_page_dirty.
> 

I'm not sure I get the point. The call is always safe: you either
have the old copy or the new copy of the memory slot, with dirty_bitmap
allocated or not. The log read is done while holding the kvm slots_lock.

Is the comment related to performance, to not supporting multiple page sizes,
to it being unsafe to call mark_page_dirty() under all circumstances, or to
something else?


>> +	if (writable)
>> +		mark_page_dirty(kvm, gfn);
>>  
>>  out_unlock:
>>  	spin_unlock(&kvm->mmu_lock);
>> -- 
>> 1.7.9.5
>>
> 
> -Christoffer
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-28  1:30     ` Mario Smarduch
@ 2014-05-28  8:09       ` Christoffer Dall
  2014-05-28 17:55         ` Mario Smarduch
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-28  8:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 27, 2014 at 06:30:23PM -0700, Mario Smarduch wrote:
> On 05/27/2014 01:19 PM, Christoffer Dall wrote:
> > On Thu, May 15, 2014 at 11:27:31AM -0700, Mario Smarduch wrote:
> >> This patch adds support for handling 2nd stage page faults during migration,
> >> it disables faulting in huge pages, and splits up existing huge pages.
> >>
> >> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
> >> ---
> >>  arch/arm/kvm/mmu.c |   36 ++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 34 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> index b939312..10e7bf6 100644
> >> --- a/arch/arm/kvm/mmu.c
> >> +++ b/arch/arm/kvm/mmu.c
> >> @@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> >>  	struct vm_area_struct *vma;
> >>  	pfn_t pfn;
> >> +	bool migration_active;
> >>  
> >>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
> >>  	if (fault_status == FSC_PERM && !write_fault) {
> >> @@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  		return -EFAULT;
> >>  
> >>  	spin_lock(&kvm->mmu_lock);
> >> +
> >> +	/*
> >> +	 * Place inside lock to prevent race condition when whole VM is being
> >> +	 * write proteced. Prevent race of huge page install when migration is
> >> +	 * active.
> >> +	 */
> >> +	migration_active = vcpu->kvm->arch.migration_in_progress;
> >> +
> >>  	if (mmu_notifier_retry(kvm, mmu_seq))
> >>  		goto out_unlock;
> >> -	if (!hugetlb && !force_pte)
> >> +
> >> +	/* When migrating don't spend cycles coalescing huge pages */
> >> +	if (!hugetlb && !force_pte && !migration_active)
> >>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> >>  
> >> -	if (hugetlb) {
> >> +	/* During migration don't install huge pages */
> > 
> > again, all this is not about migration per se, it's about when logging
> > dirty pages, (which may be commonly used for migration).
> > 
> 
> Yes that's true , I'll update but until recently (new RFC on qemu list) where
> dirty logging is used for getting VM RSS or hot memory regions, I don't see any
> other use case.
> 

That doesn't really matter.  KVM doesn't really know (or care) what user
space is doing with its features; it implements a certain functionality
behind an ABI, and that's it.  For things to be consistent and make
sense in the kernel, you can only refer to concepts defined by KVM, not
by how QEMU or kvmtools (or some other user space client) may use it.

> >> +	if (hugetlb && !migration_active) {
> >>  		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
> >>  		new_pmd = pmd_mkhuge(new_pmd);
> >>  		if (writable) {
> >> @@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
> >>  	} else {
> >>  		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
> >> +
> >> +		/*
> >> +		 * If pmd is  mapping a huge page then split it up into
> >> +		 * small pages, when doing live migration.
> >> +		 */
> >> +		if (migration_active) {
> >> +			pmd_t *pmd;
> >> +			if (hugetlb) {
> >> +				pfn += pte_index(fault_ipa);
> >> +				gfn = fault_ipa >> PAGE_SHIFT;
> >> +			}
> > 
> > how can you have hugetlb when we entered this else-clause conditional on
> > having !hugetlb?
> > 
> - if(hugetlb && !migration_active)

ah, you changed that, sorry, my bad.

> 
> forces all page faults to enter here while in migration. Huge page entries
> are cleared and stage2_set_pte() splits the huge page, and installs the pte
> for the fault_ipa. I placed that there since it flows with installing 
> a pte as well as splitting a huge page. But your comment on performance
> split up huge page vs. deferred  page faulting should move it out of here. 
> 

Why do you need to make that change?  I would think that just not
setting hugetlb when you have dirty page logging activated should take
care of all that you need.

Wrt. my comments on performance, I had a look at the x86 code, and they
seem to take your approach.  We should probably talk more closely to
them about their experiences.

> 
> >> +			new_pte = pfn_pte(pfn, PAGE_S2);
> >> +			pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
> >> +			if (pmd && kvm_pmd_huge(*pmd))
> >> +				clear_pmd_entry(kvm, pmd, fault_ipa);
> > 
> > If we have a huge pmd entry, how did we take a fault on there?  Would
> > that be if a different CPU inserted a huge page entry since we got here,
> > is this what you're trying to handle?
> > 
> > I'm confused.
> > 
> 
> I thing this related to the above.
> 

Well, if you're taking a fault, it means that you either don't have a
PMD or you don't have a pte.  If you have kvm_pmd_huge() you have a pmd
and you don't need a pte, so this should never happen. Ever.

> >> +		}
> >> +
> >>  		if (writable) {
> >>  			kvm_set_s2pte_writable(&new_pte);
> >>  			kvm_set_pfn_dirty(pfn);
> >> @@ -1077,6 +1106,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
> >>  	}
> >>  
> >> +	/* Assuming 4k pages, set one bit/page in memslot dirty_bitmap[] */
> > 
> > Assuming? this makes me nervous.  The point is probably that it's
> > harmless if we're not logging dirty pages, because then nobody reads teh
> > data structure, and if we are logging, then we are mapping everything
> > using 4K pages?
> > 
> > It's probably clearer code-wise to condition this on whether or not we
> > are logging dirty page, and the branch is also likely to be much faster
> > than the function call to mark_page_dirty.
> > 
> 
> I'm not sure I get the point. The call is always safe, you either 
> have old copy or new copy of memory slot with dirty_bitmap set or not set.
> The log read is done while holding kvm slots_lock.
> 
> Is the comment related to performance, not supporting multiple page sizes,
> or it's unsafe to call mark_page_dirty() under all circumstances, or 
> something else? 
> 
> 
You're always calling this, regardless of whether you have dirty page logging
activated or not.  I think this is weird.

-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 3/4] live migration support for VM dirty log management
  2014-05-27 21:55     ` Mario Smarduch
@ 2014-05-28  9:08       ` Christoffer Dall
  2014-05-28 17:59         ` Mario Smarduch
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-28  9:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 27, 2014 at 02:55:21PM -0700, Mario Smarduch wrote:
> On 05/27/2014 01:12 PM, Christoffer Dall wrote:
> > On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:

[...]

> >> +
> >> +	/* If pgd, pud, pmd not present and you cross pmd range check next
> >> +	 * index.
> >> +	 */
> >> +	pgd = pgdp + pgd_index(ipa);
> >> +	if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
> >> +		pgd = pgdp + pgd_index(next);
> >> +		if (!pgd_present(*pgd))
> >> +			return;
> >> +	}
> >> +
> >> +	pud = pud_offset(pgd, ipa);
> >> +	if (unlikely(crosses_pmd && !pud_present(*pud))) {
> >> +		pud = pud_offset(pgd, next);
> >> +		if (!pud_present(*pud))
> >> +			return;
> >> +	}
> >> +
> >> +	pmd = pmd_offset(pud, ipa);
> >> +	if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
> >> +		pmd = pmd_offset(pud, next);
> >> +		if (!pmd_present(*pmd))
> >> +			return;
> >> +	}
> >> +
> >> +	for (;;) {
> >> +		pte = pte_offset_kernel(pmd, ipa);
> >> +		if (!pte_present(*pte))
> >> +			goto next_ipa;
> >> +
> >> +		if (kvm_s2pte_readonly(pte))
> >> +			goto next_ipa;
> >> +		kvm_set_s2pte_readonly(pte);
> >> +next_ipa:
> >> +		mask &= mask - 1;
> >> +		if (!mask)
> >> +			break;
> >> +
> >> +		/* find next page */
> >> +		ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
> >> +
> >> +		/* skip upper page table lookups */
> >> +		if (!crosses_pmd)
> >> +			continue;
> >> +
> >> +		pgd = pgdp + pgd_index(ipa);
> >> +		if (unlikely(!pgd_present(*pgd)))
> >> +			goto next_ipa;
> >> +		pud = pud_offset(pgd, ipa);
> >> +		if (unlikely(!pud_present(*pud)))
> >> +			goto next_ipa;
> >> +		pmd = pmd_offset(pud, ipa);
> >> +		if (unlikely(!pmd_present(*pmd)))
> >> +			goto next_ipa;
> >> +	}
> > 
> > So I think the reason this is done separately on x86 is that they have
> > an rmap structure for their gfn mappings so that they can quickly lookup
> > ptes based on a gfn and write-protect it without having to walk the
> > stage-2 page tables.
> 
> Yes, they also use rmapps for mmu notifiers, invalidations on huge VMs and 
> large ranges resulted in excessive times. 
> > 
> > Unless you want to introduce this on ARM, I think you will be much
> 
> Eventually yes but that would also require reworking mmu notifiers.  I had 
> two step approach in mind. Initially get the dirty page marking to work, 
> TLB flushing, GIC/arch-timer migration, validate migration under various 
> stress loads (page reclaim) with mmu notifiers, test several VMs and migration 
> times. 
> 
> Then get rmapp (or something similar) working - eventually for huge VMs it's
> needed. In short two phases.
> 
> > better off just having a single (properly written) iterating
> > write-protect function, that takes a start and end IPA and a bitmap for
> > which pages to actually write-protect, which can then handle the generic
> > case (either NULL or all-ones bitmap) or a specific case, which just
> > traverses the IPA range given as input.  Such a function should follow
> > the model of page table walk functions discussed previously
> > (separate functions: wp_pgd_enties(), wp_pud_entries(),
> > wp_pmd_entries(), wp_pte_entries()).
> > 
> > However, you may want to verify my assumption above with the x86 people
> > and look at sharing the rmap logic between architectures.
> > 
> > In any case, this code is very difficult to read and understand, and it
> > doesn't look at all like the other code we have to walk page tables.  I
> > understand you are trying to optimize for performance (by skipping some
> > intermediate page table level lookups), but you never declare that goal
> > anywhere in the code or in the commit message.
> 
> Marc's comment noticed I was walking a small range (128k), using upper table
> iterations that covered 1G, 2MB ranges. As you mention the code tries to
> optimize upper table lookups. Yes the function is too bulky, but I'm not sure how 
> to remove the upper table checks since page tables may change between the 
> time pages are marked dirty and the log is retrieved. And if a memory slot 
> is very dirty walking upper tables will impact performance. I'll think some 
> more on this function.
> 
I think you should aim at the simplest possible implementation that
functionally works, first.  Let's verify that this thing works, have
clean working code that implementation-wise is as minimal as possible.

Then we can run perf on that and see if our migrations are very slow,
where we are actually spending time, and only then optimize it.

The solution to this specific problem for the time being appears quite
clear to me: Follow the exact same scheme as for unmap_range (the one I
sent out here:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009592.html, the
diff is hard to read, so I recommend you apply the patch and look at the
resulting code).  Have a similar scheme, call it wp_ipa_range() or
something like that, and use that for now.
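
Very roughly, such a walker could be structured along these lines (sketch
only - the function names are made up here, and the generic *_addr_end()
helpers may need to be the stage-2 kvm_*_addr_end() variants to handle IPAs
above 4GB):

static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
{
	pte_t *pte = pte_offset_kernel(pmd, addr);

	do {
		if (pte_present(*pte) && !kvm_s2pte_readonly(pte))
			kvm_set_s2pte_readonly(pte);
	} while (pte++, addr += PAGE_SIZE, addr != end);
}

static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	phys_addr_t next;

	do {
		next = pmd_addr_end(addr, end);
		/* huge pmds need splitting or their own WP handling */
		if (pmd_present(*pmd) && !kvm_pmd_huge(*pmd))
			stage2_wp_ptes(pmd, addr, next);
	} while (pmd++, addr = next, addr != end);
}

static void stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
{
	pud_t *pud = pud_offset(pgd, addr);
	phys_addr_t next;

	do {
		next = pud_addr_end(addr, end);
		if (pud_present(*pud))
			stage2_wp_pmds(pud, addr, next);
	} while (pud++, addr = next, addr != end);
}

/* caller holds kvm->mmu_lock and takes care of TLB flushing */
static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
{
	pgd_t *pgd = kvm->arch.pgd + pgd_index(addr);
	phys_addr_t next;

	do {
		next = pgd_addr_end(addr, end);
		if (pgd_present(*pgd))
			stage2_wp_puds(pgd, addr, next);
	} while (pgd++, addr = next, addr != end);
}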

-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-28  8:09       ` Christoffer Dall
@ 2014-05-28 17:55         ` Mario Smarduch
  2014-05-28 18:42           ` Mario Smarduch
  2014-05-29  8:51           ` Christoffer Dall
  0 siblings, 2 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-28 17:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/28/2014 01:09 AM, Christoffer Dall wrote:
> On Tue, May 27, 2014 at 06:30:23PM -0700, Mario Smarduch wrote:
>> On 05/27/2014 01:19 PM, Christoffer Dall wrote:
>>> On Thu, May 15, 2014 at 11:27:31AM -0700, Mario Smarduch wrote:
>>>> This patch adds support for handling 2nd stage page faults during migration,
>>>> it disables faulting in huge pages, and splits up existing huge pages.
>>>>
>>>> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
>>>> ---
>>>>  arch/arm/kvm/mmu.c |   36 ++++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 34 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index b939312..10e7bf6 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>>>  	struct vm_area_struct *vma;
>>>>  	pfn_t pfn;
>>>> +	bool migration_active;
>>>>  
>>>>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
>>>>  	if (fault_status == FSC_PERM && !write_fault) {
>>>> @@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  		return -EFAULT;
>>>>  
>>>>  	spin_lock(&kvm->mmu_lock);
>>>> +
>>>> +	/*
>>>> +	 * Place inside lock to prevent race condition when whole VM is being
>>>> +	 * write proteced. Prevent race of huge page install when migration is
>>>> +	 * active.
>>>> +	 */
>>>> +	migration_active = vcpu->kvm->arch.migration_in_progress;
>>>> +
>>>>  	if (mmu_notifier_retry(kvm, mmu_seq))
>>>>  		goto out_unlock;
>>>> -	if (!hugetlb && !force_pte)
>>>> +
>>>> +	/* When migrating don't spend cycles coalescing huge pages */
>>>> +	if (!hugetlb && !force_pte && !migration_active)
>>>>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>>>>  
>>>> -	if (hugetlb) {
>>>> +	/* During migration don't install huge pages */
>>>
>>> again, all this is not about migration per se, it's about when logging
>>> dirty pages, (which may be commonly used for migration).
>>>
>>
>> Yes that's true , I'll update but until recently (new RFC on qemu list) where
>> dirty logging is used for getting VM RSS or hot memory regions, I don't see any
>> other use case.
>>
> 
> That doesn't really matter.  KVM doesn't really know (or care) what user
> space is doing with its features; it implements a certain functionality
> behind an ABI, and that's it.  For things to be consistent and make
> sense in the kernel, you can only refer to concepts defined by KVM, not
> by how QEMU or kvmtools (or some other user space client) may use it.
> 
 
Ah yes, only the exported kernel interface matters, otherwise you would
be revising forever. I got too focused on a specific use case, and it narrowed
my perspective.

>>>> +	if (hugetlb && !migration_active) {
>>>>  		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
>>>>  		new_pmd = pmd_mkhuge(new_pmd);
>>>>  		if (writable) {
>>>> @@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>>>>  	} else {
>>>>  		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
>>>> +
>>>> +		/*
>>>> +		 * If pmd is  mapping a huge page then split it up into
>>>> +		 * small pages, when doing live migration.
>>>> +		 */
>>>> +		if (migration_active) {
>>>> +			pmd_t *pmd;
>>>> +			if (hugetlb) {
>>>> +				pfn += pte_index(fault_ipa);
>>>> +				gfn = fault_ipa >> PAGE_SHIFT;
>>>> +			}
>>>
>>> how can you have hugetlb when we entered this else-clause conditional on
>>> having !hugetlb?
>>>
>> - if(hugetlb && !migration_active)
> 
> ah, you changed that, sorry, my bad.
> 
>>
>> forces all page faults to enter here while in migration. Huge page entries
>> are cleared and stage2_set_pte() splits the huge page, and installs the pte
>> for the fault_ipa. I placed that there since it flows with installing 
>> a pte as well as splitting a huge page. But your comment on performance
>> split up huge page vs. deferred  page faulting should move it out of here. 
>>
> 
> Why do you need to make that change?  I would think that just not
> setting hugetlb when you have dirty page logging activated should take
> care of all that you need.

During the initial write protect you occasionally release kvm->mmu_lock,
so you may get a write protect fault on a huge page that didn't get cleared
yet; here the intent is to clear the pmd and let stage2_set_pte() create a
page table and install the pte for fault_ipa.

> 
> Wrt. my comments on performance, I had a look at the x86 code, and they
> seem to take your approach.  We should probably talk more closely to
> them about their experiences.

Yes, some answers are needed for that and a few other issues; I'm seeing
migration on ARM complete for much higher dirty rates than x86, which is a bit
fishy.

>>
>>>> +			new_pte = pfn_pte(pfn, PAGE_S2);
>>>> +			pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
>>>> +			if (pmd && kvm_pmd_huge(*pmd))
>>>> +				clear_pmd_entry(kvm, pmd, fault_ipa);
>>>
>>> If we have a huge pmd entry, how did we take a fault on there?  Would
>>> that be if a different CPU inserted a huge page entry since we got here,
>>> is this what you're trying to handle?
>>>
>>> I'm confused.
>>>
>>
>> I thing this related to the above.
>>
> 
> Well, if you're taking a fault, it means that you either don't have a
> PMD or you don't have a pte.  If you have kvm_pmd_huge() you have a pmd
> and you don't need a pte, so this should never happen. Ever.
> 

So this needs to be cleared up, given it is key to logging.
Cases this code handles during migration -
1. Huge page fault, described above - a write protect fault, so you break up
   the huge page.
2. All other faults - first-time access or pte write protect; you again wind up
   in stage2_set_pte().

Am I missing something here?

>>>> +		}
>>>> +
>>>>  		if (writable) {
>>>>  			kvm_set_s2pte_writable(&new_pte);
>>>>  			kvm_set_pfn_dirty(pfn);
>>>> @@ -1077,6 +1106,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>  		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
>>>>  	}
>>>>  
>>>> +	/* Assuming 4k pages, set one bit/page in memslot dirty_bitmap[] */
>>>
>>> Assuming? this makes me nervous.  The point is probably that it's
>>> harmless if we're not logging dirty pages, because then nobody reads teh
>>> data structure, and if we are logging, then we are mapping everything
>>> using 4K pages?
>>>
>>> It's probably clearer code-wise to condition this on whether or not we
>>> are logging dirty page, and the branch is also likely to be much faster
>>> than the function call to mark_page_dirty.
>>>
>>
>> I'm not sure I get the point. The call is always safe, you either 
>> have old copy or new copy of memory slot with dirty_bitmap set or not set.
>> The log read is done while holding kvm slots_lock.
>>
>> Is the comment related to performance, not supporting multiple page sizes,
>> or it's unsafe to call mark_page_dirty() under all circumstances, or 
>> something else? 
>>
>>
> You're always calling this, regardless if you have dirty page logging
> activated or not.  I think this is weird.

RCU is used to set kvm->memslots; for memslot logging updates like disabling
logging, you could be checking the old memslot with a dirty bitmap allocated or
the new one with dirty_bitmap NULL. A conditional check would not always be
accurate on the state of logging until the srcu synchronize completes, so you
may do a few logs to the old memslot dirty_bitmap even after the user disables
logging. But this way logging is stateless.

Also, determining whether a memslot has logging on/off is the majority of the
function, and unlike user_mem_abort(), some places it's called from don't have
the memslot handy.

In the other direction after you turn on logging the write protect executes 
after rcu synchronize has completed. Any calls to mark_page_dirty() will be
using the newly installed memslot dirty_bitmap during and after write protect.
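
For reference, the generic mark_page_dirty() in virt/kvm/kvm_main.c boils down
to roughly the following, so the call is a no-op whenever the slot the gfn
resolves to has no dirty_bitmap:

void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_memory_slot *memslot = gfn_to_memslot(kvm, gfn);

	/* only log when the (RCU-visible) slot has a bitmap */
	if (memslot && memslot->dirty_bitmap) {
		unsigned long rel_gfn = gfn - memslot->base_gfn;

		set_bit_le(rel_gfn, memslot->dirty_bitmap);
	}
}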

> 
> -Christoffer
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 3/4] live migration support for VM dirty log management
  2014-05-28  9:08       ` Christoffer Dall
@ 2014-05-28 17:59         ` Mario Smarduch
  0 siblings, 0 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-28 17:59 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/28/2014 02:08 AM, Christoffer Dall wrote:
> On Tue, May 27, 2014 at 02:55:21PM -0700, Mario Smarduch wrote:
>> On 05/27/2014 01:12 PM, Christoffer Dall wrote:
>>> On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:
> 
> [...]
> 
>>>> +
>>>> +	/* If pgd, pud, pmd not present and you cross pmd range check next
>>>> +	 * index.
>>>> +	 */
>>>> +	pgd = pgdp + pgd_index(ipa);
>>>> +	if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
>>>> +		pgd = pgdp + pgd_index(next);
>>>> +		if (!pgd_present(*pgd))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	pud = pud_offset(pgd, ipa);
>>>> +	if (unlikely(crosses_pmd && !pud_present(*pud))) {
>>>> +		pud = pud_offset(pgd, next);
>>>> +		if (!pud_present(*pud))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	pmd = pmd_offset(pud, ipa);
>>>> +	if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
>>>> +		pmd = pmd_offset(pud, next);
>>>> +		if (!pmd_present(*pmd))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	for (;;) {
>>>> +		pte = pte_offset_kernel(pmd, ipa);
>>>> +		if (!pte_present(*pte))
>>>> +			goto next_ipa;
>>>> +
>>>> +		if (kvm_s2pte_readonly(pte))
>>>> +			goto next_ipa;
>>>> +		kvm_set_s2pte_readonly(pte);
>>>> +next_ipa:
>>>> +		mask &= mask - 1;
>>>> +		if (!mask)
>>>> +			break;
>>>> +
>>>> +		/* find next page */
>>>> +		ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
>>>> +
>>>> +		/* skip upper page table lookups */
>>>> +		if (!crosses_pmd)
>>>> +			continue;
>>>> +
>>>> +		pgd = pgdp + pgd_index(ipa);
>>>> +		if (unlikely(!pgd_present(*pgd)))
>>>> +			goto next_ipa;
>>>> +		pud = pud_offset(pgd, ipa);
>>>> +		if (unlikely(!pud_present(*pud)))
>>>> +			goto next_ipa;
>>>> +		pmd = pmd_offset(pud, ipa);
>>>> +		if (unlikely(!pmd_present(*pmd)))
>>>> +			goto next_ipa;
>>>> +	}
>>>
>>> So I think the reason this is done separately on x86 is that they have
>>> an rmap structure for their gfn mappings so that they can quickly lookup
>>> ptes based on a gfn and write-protect it without having to walk the
>>> stage-2 page tables.
>>
>> Yes, they also use rmapps for mmu notifiers, invalidations on huge VMs and 
>> large ranges resulted in excessive times. 
>>>
>>> Unless you want to introduce this on ARM, I think you will be much
>>
>> Eventually yes but that would also require reworking mmu notifiers.  I had 
>> two step approach in mind. Initially get the dirty page marking to work, 
>> TLB flushing, GIC/arch-timer migration, validate migration under various 
>> stress loads (page reclaim) with mmu notifiers, test several VMs and migration 
>> times. 
>>
>> Then get rmapp (or something similar) working - eventually for huge VMs it's
>> needed. In short two phases.
>>
>>> better off just having a single (properly written) iterating
>>> write-protect function, that takes a start and end IPA and a bitmap for
>>> which pages to actually write-protect, which can then handle the generic
>>> case (either NULL or all-ones bitmap) or a specific case, which just
>>> traverses the IPA range given as input.  Such a function should follow
>>> the model of page table walk functions discussed previously
>>> (separate functions: wp_pgd_enties(), wp_pud_entries(),
>>> wp_pmd_entries(), wp_pte_entries()).
>>>
>>> However, you may want to verify my assumption above with the x86 people
>>> and look at sharing the rmap logic between architectures.
>>>
>>> In any case, this code is very difficult to read and understand, and it
>>> doesn't look at all like the other code we have to walk page tables.  I
>>> understand you are trying to optimize for performance (by skipping some
>>> intermediate page table level lookups), but you never declare that goal
>>> anywhere in the code or in the commit message.
>>
>> Marc's comment noticed I was walking a small range (128k), using upper table
>> iterations that covered 1G, 2MB ranges. As you mention the code tries to
>> optimize upper table lookups. Yes the function is too bulky, but I'm not sure how 
>> to remove the upper table checks since page tables may change between the 
>> time pages are marked dirty and the log is retrieved. And if a memory slot 
>> is very dirty walking upper tables will impact performance. I'll think some 
>> more on this function.
>>
> I think you should aim at the simplest possible implementation that
> functionally works, first.  Let's verify that this thing works, have
> clean working code that implementation-wise is as minimal as possible.
> 
> Then we can run perf on that and see if our migrations are very slow,
> where we are actually spending time, and only then optimize it.
> 
> The solution to this specific problem for the time being appears quite
> clear to me: Follow the exact same scheme as for unmap_range (the one I
> sent out here:
> https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009592.html, the
> diff is hard to read, so I recommend you apply the patch and look at the
> resulting code).  Have a similar scheme, call it wp_ipa_range() or
> something like that, and use that for now.

Ok I'll reuse that code. I'll need to apply that patch given I need
to test various host conditions during migration.

> 
> -Christoffer
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-28 17:55         ` Mario Smarduch
@ 2014-05-28 18:42           ` Mario Smarduch
  2014-05-29  2:02             ` Mario Smarduch
  2014-05-29  8:42             ` Christoffer Dall
  2014-05-29  8:51           ` Christoffer Dall
  1 sibling, 2 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-28 18:42 UTC (permalink / raw)
  To: linux-arm-kernel


>emslot dirty_bitmap during and after write protect.
> 
>>
>> -Christoffer

Regarding huge PUDs, which are causing some design problems: should huge PUD
pages be considered at all?

Thanks,
  Mario
>>
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm at lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-28 18:42           ` Mario Smarduch
@ 2014-05-29  2:02             ` Mario Smarduch
  2014-05-29  8:42             ` Christoffer Dall
  1 sibling, 0 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-29  2:02 UTC (permalink / raw)
  To: linux-arm-kernel


A little more detail on this question -

For 2nd stage 3-level tables PUD blocks don't exist - although it appears
you can have a PGD block, I don't see any support for that. But should the
code still work as if PUDs (4-level table) are used and check for pud_huge()?

Looking at ARMv8 there are several block formats; I don't know which one
will be used for 2nd stage (4KB, 16K, ...) but one of them supports a 4-level
table (I have not looked at this in detail, could be wrong here).

Should pud_huge() be supported for future compatibility?

This impacts logging -
 - Some decisions are needed: either clear the PUD entry and
   force it to pages, or mark the dirty bitmap for each 4k page
   in the PUD Block range; IA64 appears to do that in mark_pages_dirty().

 - If you assume pud_huge() then you probably have to support
   the logic for the PUD Block descriptor even though
   it's not used in a 3-level table at this time.

I think until PUD Blocks are actually used it's maybe better to
ignore them.

- Mario



On 05/28/2014 11:42 AM, Mario Smarduch wrote:
> 
>> emslot dirty_bitmap during and after write protect.
>>
>>>
>>> -Christoffer
> 
> Regarding huge pud that's causing some design problems, should huge PUD
> pages be considered at all?
> 
> Thanks,
>   Mario
>>>
>>
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm at lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>>
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm at lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-28 18:42           ` Mario Smarduch
  2014-05-29  2:02             ` Mario Smarduch
@ 2014-05-29  8:42             ` Christoffer Dall
  1 sibling, 0 replies; 26+ messages in thread
From: Christoffer Dall @ 2014-05-29  8:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, May 28, 2014 at 11:42:00AM -0700, Mario Smarduch wrote:
> 
> >emslot dirty_bitmap during and after write protect.
> > 
> >>
> >> -Christoffer
> 
> Regarding huge pud that's causing some design problems, should huge PUD
> pages be considered at all?
> 
I'm not sure if there is any support for this on aarch64, or if anyone is
planning to work on this.

Steve?

-Christoffer

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-28 17:55         ` Mario Smarduch
  2014-05-28 18:42           ` Mario Smarduch
@ 2014-05-29  8:51           ` Christoffer Dall
  2014-05-29 17:08             ` Mario Smarduch
  1 sibling, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-29  8:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, May 28, 2014 at 10:55:54AM -0700, Mario Smarduch wrote:
> On 05/28/2014 01:09 AM, Christoffer Dall wrote:
> > On Tue, May 27, 2014 at 06:30:23PM -0700, Mario Smarduch wrote:
> >> On 05/27/2014 01:19 PM, Christoffer Dall wrote:
> >>> On Thu, May 15, 2014 at 11:27:31AM -0700, Mario Smarduch wrote:
> >>>> This patch adds support for handling 2nd stage page faults during migration,
> >>>> it disables faulting in huge pages, and splits up existing huge pages.
> >>>>
> >>>> Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
> >>>> ---
> >>>>  arch/arm/kvm/mmu.c |   36 ++++++++++++++++++++++++++++++++++--
> >>>>  1 file changed, 34 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >>>> index b939312..10e7bf6 100644
> >>>> --- a/arch/arm/kvm/mmu.c
> >>>> +++ b/arch/arm/kvm/mmu.c
> >>>> @@ -1002,6 +1002,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>>>  	struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
> >>>>  	struct vm_area_struct *vma;
> >>>>  	pfn_t pfn;
> >>>> +	bool migration_active;
> >>>>  
> >>>>  	write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
> >>>>  	if (fault_status == FSC_PERM && !write_fault) {
> >>>> @@ -1053,12 +1054,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>>>  		return -EFAULT;
> >>>>  
> >>>>  	spin_lock(&kvm->mmu_lock);
> >>>> +
> >>>> +	/*
> >>>> +	 * Place inside lock to prevent race condition when whole VM is being
> >>>> +	 * write proteced. Prevent race of huge page install when migration is
> >>>> +	 * active.
> >>>> +	 */
> >>>> +	migration_active = vcpu->kvm->arch.migration_in_progress;
> >>>> +
> >>>>  	if (mmu_notifier_retry(kvm, mmu_seq))
> >>>>  		goto out_unlock;
> >>>> -	if (!hugetlb && !force_pte)
> >>>> +
> >>>> +	/* When migrating don't spend cycles coalescing huge pages */
> >>>> +	if (!hugetlb && !force_pte && !migration_active)
> >>>>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> >>>>  
> >>>> -	if (hugetlb) {
> >>>> +	/* During migration don't install huge pages */
> >>>
> >>> again, all this is not about migration per se, it's about when logging
> >>> dirty pages, (which may be commonly used for migration).
> >>>
> >>
> >> Yes that's true , I'll update but until recently (new RFC on qemu list) where
> >> dirty logging is used for getting VM RSS or hot memory regions, I don't see any
> >> other use case.
> >>
> > 
> > That doesn't really matter.  KVM doesn't really know (or care) what user
> > space is doing with its features; it implements a certain functionality
> > behind an ABI, and that's it.  For things to be consistent and make
> > sense in the kernel, you can only refer to concepts defined by KVM, not
> > by how QEMU or kvmtools (or some other user space client) may use it.
> > 
>  
> Ah yes only the exported kernel interface matters, otherwise you would
> be revising forever. Got too focused on specific use case, it narrowed my 
> perspective.
> 
> >>>> +	if (hugetlb && !migration_active) {
> >>>>  		pmd_t new_pmd = pfn_pmd(pfn, PAGE_S2);
> >>>>  		new_pmd = pmd_mkhuge(new_pmd);
> >>>>  		if (writable) {
> >>>> @@ -1069,6 +1081,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
> >>>>  	} else {
> >>>>  		pte_t new_pte = pfn_pte(pfn, PAGE_S2);
> >>>> +
> >>>> +		/*
> >>>> +		 * If pmd is  mapping a huge page then split it up into
> >>>> +		 * small pages, when doing live migration.
> >>>> +		 */
> >>>> +		if (migration_active) {
> >>>> +			pmd_t *pmd;
> >>>> +			if (hugetlb) {
> >>>> +				pfn += pte_index(fault_ipa);
> >>>> +				gfn = fault_ipa >> PAGE_SHIFT;
> >>>> +			}
> >>>
> >>> how can you have hugetlb when we entered this else-clause conditional on
> >>> having !hugetlb?
> >>>
> >> - if(hugetlb && !migration_active)
> > 
> > ah, you changed that, sorry, my bad.
> > 
> >>
> >> forces all page faults to enter here while in migration. Huge page entries
> >> are cleared and stage2_set_pte() splits the huge page, and installs the pte
> >> for the fault_ipa. I placed that there since it flows with installing 
> >> a pte as well as splitting a huge page. But your comment on performance
> >> split up huge page vs. deferred  page faulting should move it out of here. 
> >>
> > 
> > Why do you need to make that change?  I would think that just not
> > setting hugetlb when you have dirty page logging activated should take
> > care of all that you need.
> 
> During the initial write protect you occasionally release
> kvm_mmu_lock you may get a write protect fault on a huge page that didn't
> get cleared yet, so here the intent is to clear the pmd and let  
> stage2_set_pte() create a page table and install the pte for fault_ipa.
> 
> > 
> > Wrt. my comments on performance, I had a look at the x86 code, and they
> > seem to take your approach.  We should probably talk more closely to
> > them about their experiences.
> 
> Yes some answers are needed for that and few other issues, I'm seeing migration
> on ARM complete for much higher dirty rates then x86 which is a bit fishy.
> 
> >>
> >>>> +			new_pte = pfn_pte(pfn, PAGE_S2);
> >>>> +			pmd = stage2_get_pmd(kvm, NULL, fault_ipa);
> >>>> +			if (pmd && kvm_pmd_huge(*pmd))
> >>>> +				clear_pmd_entry(kvm, pmd, fault_ipa);
> >>>
> >>> If we have a huge pmd entry, how did we take a fault there?  Would
> >>> that be because a different CPU inserted a huge page entry since we got
> >>> here - is that what you're trying to handle?
> >>>
> >>> I'm confused.
> >>>
> >>
> >> I think this is related to the above.
> >>
> > 
> > Well, if you're taking a fault, it means that you either don't have a
> > PMD or you don't have a pte.  If you have kvm_pmd_huge() you have a pmd
> > and you don't need a pte, so this should never happen. Ever.
> > 
> 
> So this needs to be cleared up, given it is key to logging.
> Cases this code handles during migration -
> 1. huge page fault described above - a write protect fault, so you break up
>    the huge page.
> 2. All other faults - first-time access or pte write protect; you again wind
>    up in stage2_set_pte().
> 
> Am I missing something here?
> 

no, I forgot about the fact that we can take the permission fault now.
Hmm, ok, so either we need to use the original approach of always
splitting up huge pages, or we need to just follow the regular huge page
path here and mark all 512 4K pages dirty in the log, or handle it
in stage2_set_pte().

I would say go with the simplest approach for now (which may be going
back to splitting all pmd_huge() entries into regular ptes), and we can
take a more careful look in the next patch iteration.
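
Completely untested, but the "mark all 512 pages" option could be as
small as the fragment below inside user_mem_abort(); "logging_active" is
just a placeholder for whatever flag/check we end up with:

	if (hugetlb && logging_active) {
		/* Keep the huge mapping, but mark every 4K page it
		 * covers dirty so the bitmap stays conservative. */
		gfn_t base_gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
		int i;

		for (i = 0; i < PTRS_PER_PMD; i++)
			mark_page_dirty(kvm, base_gfn + i);
	}

That keeps the huge mapping in place and trades bitmap precision for
simplicity.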

> >>>> +		}
> >>>> +
> >>>>  		if (writable) {
> >>>>  			kvm_set_s2pte_writable(&new_pte);
> >>>>  			kvm_set_pfn_dirty(pfn);
> >>>> @@ -1077,6 +1106,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>>>  		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false);
> >>>>  	}
> >>>>  
> >>>> +	/* Assuming 4k pages, set one bit/page in memslot dirty_bitmap[] */
> >>>
> >>> Assuming? This makes me nervous.  The point is probably that it's
> >>> harmless if we're not logging dirty pages, because then nobody reads the
> >>> data structure, and if we are logging, then we are mapping everything
> >>> using 4K pages?
> >>>
> >>> It's probably clearer code-wise to condition this on whether or not we
> >>> are logging dirty pages, and the branch is also likely to be much faster
> >>> than the function call to mark_page_dirty.
> >>>
> >>
> >> I'm not sure I get the point. The call is always safe: you either have
> >> the old or the new copy of the memory slot, with dirty_bitmap set or not.
> >> The log read is done while holding the kvm slots_lock.
> >>
> >> Is the comment related to performance, to not supporting multiple page
> >> sizes, to it being unsafe to call mark_page_dirty() under all
> >> circumstances, or to something else?
> >>
> >>
> > You're always calling this, regardless of whether you have dirty page
> > logging activated or not.  I think this is weird.
> 
> RCU is used to set kvm->memslots, so for memslot logging updates like
> disabling logging you could be checking the old memslot with dirty_bitmap
> allocated, or the new one with dirty_bitmap NULL. A conditional check would
> not always be accurate on the state of logging until the srcu synchronize
> completes. So you may do a few logs to the old memslot dirty_bitmap even
> after the user disables logging. But this way logging is stateless.
> 
> Also, determining whether the memslot has logging on or off is the majority
> of the function; unlike user_mem_abort(), some places it's called from don't
> have the accessed memslot handy.
> 
> In the other direction, after you turn on logging, the write protect
> executes after the rcu synchronize has completed. Any calls to
> mark_page_dirty() will be using the newly installed memslot dirty_bitmap
> during and after the write protect.
> 
Fair enough, didn't think about the locking.  It seems x86 calls it
unconditionally too.
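
(For completeness: if I remember the generic code right, the unconditional
call is also harmless on the non-logging path because the helper only
touches the bitmap when the slot actually has one allocated, roughly along
these lines - paraphrased, not the exact code:

	/* dirty_bitmap is only allocated when logging is enabled */
	memslot = gfn_to_memslot(kvm, gfn);
	if (memslot && memslot->dirty_bitmap) {
		unsigned long rel_gfn = gfn - memslot->base_gfn;

		set_bit_le(rel_gfn, memslot->dirty_bitmap);
	}

so with logging off it boils down to a slot lookup and a branch.)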

-Christoffer


* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-29  8:51           ` Christoffer Dall
@ 2014-05-29 17:08             ` Mario Smarduch
  2014-05-29 17:57               ` Christoffer Dall
  0 siblings, 1 reply; 26+ messages in thread
From: Mario Smarduch @ 2014-05-29 17:08 UTC (permalink / raw)
  To: linux-arm-kernel


>> So this needs to be cleared up, given it is key to logging.
>> Cases this code handles during migration -
>> 1. huge page fault described above - a write protect fault, so you break up
>>    the huge page.
>> 2. All other faults - first-time access or pte write protect; you again wind
>>    up in stage2_set_pte().
>>
>> Am I missing something here?
>>
> 
> no, I forgot about the fact that we can take the permission fault now.
> Hmm, ok, so either we need to use the original approach of always
> splitting up huge pages, or we need to just follow the regular huge page
> path here and mark all 512 4K pages dirty in the log, or handle it
> in stage2_set_pte().
> 
> I would say go with the simplest approach for now (which may be going
> back to splitting all pmd_huge() entries into regular ptes), and we can
> take a more careful look in the next patch iteration.
> 

Looking at the overall memslot update architecture and the various
failure scenarios, user_mem_abort() appears to be the optimal and most
reliable place: first write protect huge pages after the memslots are
committed, then deal with the rest in user_mem_abort().
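
Roughly what I have in mind for the user_mem_abort() side - an untested
sketch, with "logging_active" standing in for whatever the final
flag/check ends up being:

	if (logging_active) {
		pmd_t *pmd = stage2_get_pmd(kvm, NULL, fault_ipa);

		/* Dissolve any huge mapping still covering fault_ipa and
		 * let stage2_set_pte() fault in 4K pages instead. */
		if (pmd && kvm_pmd_huge(*pmd))
			clear_pmd_entry(kvm, pmd, fault_ipa);
		hugetlb = false;
	}

i.e. clear a remaining huge entry and force the 4K path below, as in the
hunk being discussed above.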

I still need some feedback on the pud_huge() question before revising
for the next iteration.

- Mario


* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-29 17:08             ` Mario Smarduch
@ 2014-05-29 17:57               ` Christoffer Dall
  2014-05-29 19:10                 ` Mario Smarduch
  0 siblings, 1 reply; 26+ messages in thread
From: Christoffer Dall @ 2014-05-29 17:57 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 29, 2014 at 10:08:07AM -0700, Mario Smarduch wrote:
> 
> >> So this needs to be cleared up, given it is key to logging.
> >> Cases this code handles during migration -
> >> 1. huge page fault described above - a write protect fault, so you break up
> >>    the huge page.
> >> 2. All other faults - first-time access or pte write protect; you again wind
> >>    up in stage2_set_pte().
> >>
> >> Am I missing something here?
> >>
> > 
> > no, I forgot about the fact that we can take the permission fault now.
> > Hmm, ok, so either we need to use the original approach of always
> > splitting up huge pages, or we need to just follow the regular huge page
> > path here and mark all 512 4K pages dirty in the log, or handle it
> > in stage2_set_pte().
> > 
> > I would say go with the simplest approach for now (which may be going
> > back to splitting all pmd_huge() entries into regular ptes), and we can
> > take a more careful look in the next patch iteration.
> > 
> 
> Looking at the overall memslot update architecture and the various
> failure scenarios, user_mem_abort() appears to be the optimal and most
> reliable place: first write protect huge pages after the memslots are
> committed, then deal with the rest in user_mem_abort().
> 
> I still need some feedback on the pud_huge() question before revising
> for the next iteration.
> 
Just assume it's not used for now, and that you don't have to consider
it, and make that assumption clear in the commit message, so it doesn't
block this work.  I have a feeling we need to go through a few
iterations here, so let's get that rolling.

Thanks.


* [PATCH v6 4/4] add 2nd stage page fault handling during live migration
  2014-05-29 17:57               ` Christoffer Dall
@ 2014-05-29 19:10                 ` Mario Smarduch
  0 siblings, 0 replies; 26+ messages in thread
From: Mario Smarduch @ 2014-05-29 19:10 UTC (permalink / raw)
  To: linux-arm-kernel

On 05/29/2014 10:57 AM, Christoffer Dall wrote:
> On Thu, May 29, 2014 at 10:08:07AM -0700, Mario Smarduch wrote:
>>
>>>> So this needs to be cleared up, given it is key to logging.
>>>> Cases this code handles during migration -
>>>> 1. huge page fault described above - a write protect fault, so you break up
>>>>    the huge page.
>>>> 2. All other faults - first-time access or pte write protect; you again wind
>>>>    up in stage2_set_pte().
>>>>
>>>> Am I missing something here?
>>>>
>>>
>>> no, I forgot about the fact that we can take the permission fault now.
>>> Hmm, ok, so either we need to use the original approach of always
>>> splitting up huge pages, or we need to just follow the regular huge page
>>> path here and mark all 512 4K pages dirty in the log, or handle it
>>> in stage2_set_pte().
>>>
>>> I would say go with the simplest approach for now (which may be going
>>> back to splitting all pmd_huge() entries into regular ptes), and we can
>>> take a more careful look in the next patch iteration.
>>>
>>
>> Looking at the overall memslot update architecture and the various
>> failure scenarios, user_mem_abort() appears to be the optimal and most
>> reliable place: first write protect huge pages after the memslots are
>> committed, then deal with the rest in user_mem_abort().
>>
>> I still need some feedback on the pud_huge() question before revising
>> for the next iteration.
>>
> Just assume it's not used for now, and that you don't have to consider
> it, and make that assumption clear in the commit message, so it doesn't
> block this work.  I have a feeling we need to go through a few
> iterations here, so let's get that rolling.
> 
> Thanks.
> 
Ok, thanks - I'm on it now.

- Mario

