linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/3] arm64: KVM: High memory guest fixes
@ 2015-02-25 16:55 Marc Zyngier
  2015-02-25 16:55 ` [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting Marc Zyngier
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Marc Zyngier @ 2015-02-25 16:55 UTC (permalink / raw)
  To: linux-arm-kernel

I've been playing with a hacked-up kvmtool that places the memory much
higher than usual, in an effort to emulate what some arm64 platforms
do (typically AMD Seattle).

Unexpectedly (hey, what could possibly go wrong...), I've discovered a
couple of mm-related issues, ranging from the host exploding to the
guest being stuck on a page fault. Additionally, I noticed a minor
documentation issue.

With these fixes in, I'm able to run a guest that has its memory
located just under the 40-bit limit.

Marc Zyngier (3):
  arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting
  arm64: KVM: Do not use pgd_index to index stage-2 pgd
  arm64: KVM: Fix outdated comment about VTCR_EL2.PS

 arch/arm/include/asm/kvm_mmu.h   | 10 +++++++++-
 arch/arm/kvm/mmu.c               | 10 +++++-----
 arch/arm64/include/asm/kvm_arm.h |  5 +++--
 arch/arm64/include/asm/kvm_mmu.h | 11 ++++++++++-
 4 files changed, 27 insertions(+), 9 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting
  2015-02-25 16:55 [PATCH 0/3] arm64: KVM: High memory guest fixes Marc Zyngier
@ 2015-02-25 16:55 ` Marc Zyngier
  2015-03-02 18:27   ` Christoffer Dall
  2015-02-25 16:55 ` [PATCH 2/3] arm64: KVM: Do not use pgd_index to index stage-2 pgd Marc Zyngier
  2015-02-25 16:55 ` [PATCH 3/3] arm64: KVM: Fix outdated comment about VTCR_EL2.PS Marc Zyngier
  2 siblings, 1 reply; 8+ messages in thread
From: Marc Zyngier @ 2015-02-25 16:55 UTC (permalink / raw)
  To: linux-arm-kernel

We're using __get_free_pages to allocate the guest's stage-2
PGD. The standard behaviour of this function is to return a set of
pages where only the head page has a valid refcount.

This behaviour gets us into trouble when we're trying to increment
the refcount on a non-head page:

page:ffff7c00cfb693c0 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x4000000000000000()
page dumped because: VM_BUG_ON_PAGE((*({ __attribute__((unused)) typeof((&page->_count)->counter) __var = ( typeof((&page->_count)->counter)) 0; (volatile typeof((&page->_count)->counter) *)&((&page->_count)->counter); })) <= 0)
BUG: failure at include/linux/mm.h:548/get_page()!
Kernel panic - not syncing: BUG!
CPU: 1 PID: 1695 Comm: kvm-vcpu-0 Not tainted 4.0.0-rc1+ #3825
Hardware name: APM X-Gene Mustang board (DT)
Call trace:
[<ffff80000008a09c>] dump_backtrace+0x0/0x13c
[<ffff80000008a1e8>] show_stack+0x10/0x1c
[<ffff800000691da8>] dump_stack+0x74/0x94
[<ffff800000690d78>] panic+0x100/0x240
[<ffff8000000a0bc4>] stage2_get_pmd+0x17c/0x2bc
[<ffff8000000a1dc4>] kvm_handle_guest_abort+0x4b4/0x6b0
[<ffff8000000a420c>] handle_exit+0x58/0x180
[<ffff80000009e7a4>] kvm_arch_vcpu_ioctl_run+0x114/0x45c
[<ffff800000099df4>] kvm_vcpu_ioctl+0x2e0/0x754
[<ffff8000001c0a18>] do_vfs_ioctl+0x424/0x5c8
[<ffff8000001c0bfc>] SyS_ioctl+0x40/0x78
CPU0: stopping

Passing the (unintuitively named) __GFP_COMP flag to __get_free_pages
forces the allocator to maintain a per-page refcount, which is exactly
what we need.
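
For illustration, here is a minimal sketch of the pattern that blows up
(simplified and hypothetical, not the exact KVM code paths):

	/* Illustration only -- simplified from the stage-2 table handling. */
	pud_t *table, *pud;

	/*
	 * Two contiguous pages; without __GFP_COMP only the head page
	 * carries a valid refcount.
	 */
	table = (pud_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);

	/* An entry living in the second (tail) page of the allocation... */
	pud = table + PAGE_SIZE / sizeof(pud_t);

	/*
	 * ...sits in a page whose count is 0, so taking a reference on it
	 * (as stage2_get_pmd() does when populating an entry) hits the
	 * VM_BUG_ON_PAGE() shown above. With __GFP_COMP the two pages form
	 * a compound page and get_page() on the tail page is safe.
	 */
	get_page(virt_to_page(pud));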

This has been tested on an X-Gene platform with a 4kB/48bit-VA host
kernel, and kvmtool hacked to place memory in the second page of
the hardware PGD (PUD for the host kernel).

Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 7 +++++++
 arch/arm/kvm/mmu.c               | 2 +-
 arch/arm64/include/asm/kvm_mmu.h | 9 ++++++++-
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 37ca2a4..1cac89b 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -162,6 +162,13 @@ static inline bool kvm_page_empty(void *ptr)
 
 #define KVM_PREALLOC_LEVEL	0
 
+/*
+ * We need to ensure that stage-2 PGDs are allocated with a per-page
+ * refcount, as we fiddle with the refcounts of non-head pages.
+ * __GFP_COMP forces the allocator to do what we want.
+ */
+#define KVM_GFP_S2_PGD	(GFP_KERNEL | __GFP_ZERO | __GFP_COMP)
+
 static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
 {
 	return 0;
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 3e6859b..a6a8252 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -666,7 +666,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
 		 * Allocate actual first-level Stage-2 page table used by the
 		 * hardware for Stage-2 page table walks.
 		 */
-		pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, S2_PGD_ORDER);
+		pgd = (pgd_t *)__get_free_pages(KVM_GFP_S2_PGD, S2_PGD_ORDER);
 	}
 
 	if (!pgd)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 6458b53..06c733a 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -171,6 +171,13 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
 #define KVM_PREALLOC_LEVEL	(0)
 #endif
 
+/*
+ * We need to ensure that stage-2 PGDs are allocated with a per-page
+ * refcount, as we fiddle with the refcounts of non-head pages.
+ * __GFP_COMP forces the allocator to do what we want.
+ */
+#define KVM_GFP_S2_PGD	(GFP_KERNEL | __GFP_ZERO | __GFP_COMP)
+
 /**
  * kvm_prealloc_hwpgd - allocate inital table for VTTBR
  * @kvm:	The KVM struct pointer for the VM.
@@ -192,7 +199,7 @@ static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
 	if (KVM_PREALLOC_LEVEL == 0)
 		return 0;
 
-	hwpgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, PTRS_PER_S2_PGD_SHIFT);
+	hwpgd = __get_free_pages(KVM_GFP_S2_PGD, PTRS_PER_S2_PGD_SHIFT);
 	if (!hwpgd)
 		return -ENOMEM;
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] arm64: KVM: Do not use pgd_index to index stage-2 pgd
  2015-02-25 16:55 [PATCH 0/3] arm64: KVM: High memory guest fixes Marc Zyngier
  2015-02-25 16:55 ` [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting Marc Zyngier
@ 2015-02-25 16:55 ` Marc Zyngier
  2015-03-02 18:45   ` Christoffer Dall
  2015-02-25 16:55 ` [PATCH 3/3] arm64: KVM: Fix outdated comment about VTCR_EL2.PS Marc Zyngier
  2 siblings, 1 reply; 8+ messages in thread
From: Marc Zyngier @ 2015-02-25 16:55 UTC (permalink / raw)
  To: linux-arm-kernel

The kernel's pgd_index macro is designed to index a normal, page-sized
array. KVM is a bit different, as we can use concatenated pages to
have a bigger address space (for example, a 40-bit IPA with 4kB pages
gives us an 8kB PGD).

In the above case, the use of pgd_index will always return an index
inside the first 4kB, which makes a guest that has memory above
0x8000000000 rather unhappy, as it spins forever in a page fault,
whilst the host happily corrupts the lower pgd.
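
As a purely illustrative example, assume a 4kB/39-bit-VA host, so
PGDIR_SHIFT is 30 and PTRS_PER_PGD is 512:

	/* Illustration only -- values hardcoded for this configuration. */
	#define PGDIR_SHIFT		30
	#define PTRS_PER_PGD		512	/* host first level */
	#define PTRS_PER_S2_PGD		1024	/* 40bit IPA: 8kB stage-2 PGD */

	#define pgd_index(addr)	    (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
	#define kvm_pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))

	/*
	 * IPA 0x8000000000 is entry 512 of the 1024-entry stage-2 PGD:
	 *   pgd_index(0x8000000000)     == 0   (wraps into the first 4kB page)
	 *   kvm_pgd_index(0x8000000000) == 512 (the entry we actually want)
	 */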

The obvious fix is to get our own kvm_pgd_index that does the right
thing(tm).

Tested on X-Gene with a hacked kvmtool that put memory at a stupidly
high address.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 3 ++-
 arch/arm/kvm/mmu.c               | 8 ++++----
 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 1cac89b..ec669a6 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -149,13 +149,14 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
 	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
 })
 
+#define kvm_pgd_index(addr)			pgd_index(addr)
+
 static inline bool kvm_page_empty(void *ptr)
 {
 	struct page *ptr_page = virt_to_page(ptr);
 	return page_count(ptr_page) == 1;
 }
 
-
 #define kvm_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
 #define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
 #define kvm_pud_table_empty(kvm, pudp) (0)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index a6a8252..39a0903 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -290,7 +290,7 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
 	phys_addr_t addr = start, end = start + size;
 	phys_addr_t next;
 
-	pgd = pgdp + pgd_index(addr);
+	pgd = pgdp + kvm_pgd_index(addr);
 	do {
 		next = kvm_pgd_addr_end(addr, end);
 		if (!pgd_none(*pgd))
@@ -355,7 +355,7 @@ static void stage2_flush_memslot(struct kvm *kvm,
 	phys_addr_t next;
 	pgd_t *pgd;
 
-	pgd = kvm->arch.pgd + pgd_index(addr);
+	pgd = kvm->arch.pgd + kvm_pgd_index(addr);
 	do {
 		next = kvm_pgd_addr_end(addr, end);
 		stage2_flush_puds(kvm, pgd, addr, next);
@@ -799,7 +799,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
 	pgd_t *pgd;
 	pud_t *pud;
 
-	pgd = kvm->arch.pgd + pgd_index(addr);
+	pgd = kvm->arch.pgd + kvm_pgd_index(addr);
 	if (WARN_ON(pgd_none(*pgd))) {
 		if (!cache)
 			return NULL;
@@ -1089,7 +1089,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
 	pgd_t *pgd;
 	phys_addr_t next;
 
-	pgd = kvm->arch.pgd + pgd_index(addr);
+	pgd = kvm->arch.pgd + kvm_pgd_index(addr);
 	do {
 		/*
 		 * Release kvm_mmu_lock periodically if the memory region is
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 06c733a..c6300fd 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -158,6 +158,8 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
 #define PTRS_PER_S2_PGD		(1 << PTRS_PER_S2_PGD_SHIFT)
 #define S2_PGD_ORDER		get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
 
+#define kvm_pgd_index(addr)	(((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
+
 /*
  * If we are concatenating first level stage-2 page tables, we would have less
  * than or equal to 16 pointers in the fake PGD, because that's what the
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] arm64: KVM: Fix outdated comment about VTCR_EL2.PS
  2015-02-25 16:55 [PATCH 0/3] arm64: KVM: High memory guest fixes Marc Zyngier
  2015-02-25 16:55 ` [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting Marc Zyngier
  2015-02-25 16:55 ` [PATCH 2/3] arm64: KVM: Do not use pgd_index to index stage-2 pgd Marc Zyngier
@ 2015-02-25 16:55 ` Marc Zyngier
  2015-03-02 18:52   ` Christoffer Dall
  2 siblings, 1 reply; 8+ messages in thread
From: Marc Zyngier @ 2015-02-25 16:55 UTC (permalink / raw)
  To: linux-arm-kernel

Commit 87366d8cf7b3 ("arm64: Add boot time configuration of
Intermediate Physical Address size") removed the hardcoded setting
of VTCR_EL2.PS to use ID_AA64MMFR0_EL1.PARange instead, but didn't
remove the (now rather misleading) comment.

Fix the comments to match reality (at least for the next few minutes).

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm64/include/asm/kvm_arm.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 94674eb..54bb4ba 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -129,6 +129,9 @@
  * 40 bits wide (T0SZ = 24).  Systems with a PARange smaller than 40 bits are
  * not known to exist and will break with this configuration.
  *
+ * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
+ * (see hyp-init.S).
+ *
  * Note that when using 4K pages, we concatenate two first level page tables
  * together.
  *
@@ -138,7 +141,6 @@
 #ifdef CONFIG_ARM64_64K_PAGES
 /*
  * Stage2 translation configuration:
- * 40bits output (PS = 2)
  * 40bits input  (T0SZ = 24)
  * 64kB pages (TG0 = 1)
  * 2 level page tables (SL = 1)
@@ -150,7 +152,6 @@
 #else
 /*
  * Stage2 translation configuration:
- * 40bits output (PS = 2)
  * 40bits input  (T0SZ = 24)
  * 4kB pages (TG0 = 0)
  * 3 level page tables (SL = 1)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting
  2015-02-25 16:55 ` [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting Marc Zyngier
@ 2015-03-02 18:27   ` Christoffer Dall
  2015-03-05 13:58     ` Marc Zyngier
  0 siblings, 1 reply; 8+ messages in thread
From: Christoffer Dall @ 2015-03-02 18:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 25, 2015 at 04:55:38PM +0000, Marc Zyngier wrote:
> We're using __get_free_pages to allocate the guest's stage-2
> PGD. The standard behaviour of this function is to return a set of
> pages where only the head page has a valid refcount.
> 
> This behaviour gets us into trouble when we're trying to increment
> the refcount on a non-head page:
> 
> page:ffff7c00cfb693c0 count:0 mapcount:0 mapping:          (null) index:0x0
> flags: 0x4000000000000000()
> page dumped because: VM_BUG_ON_PAGE((*({ __attribute__((unused)) typeof((&page->_count)->counter) __var = ( typeof((&page->_count)->counter)) 0; (volatile typeof((&page->_count)->counter) *)&((&page->_count)->counter); })) <= 0)
> BUG: failure at include/linux/mm.h:548/get_page()!
> Kernel panic - not syncing: BUG!
> CPU: 1 PID: 1695 Comm: kvm-vcpu-0 Not tainted 4.0.0-rc1+ #3825
> Hardware name: APM X-Gene Mustang board (DT)
> Call trace:
> [<ffff80000008a09c>] dump_backtrace+0x0/0x13c
> [<ffff80000008a1e8>] show_stack+0x10/0x1c
> [<ffff800000691da8>] dump_stack+0x74/0x94
> [<ffff800000690d78>] panic+0x100/0x240
> [<ffff8000000a0bc4>] stage2_get_pmd+0x17c/0x2bc
> [<ffff8000000a1dc4>] kvm_handle_guest_abort+0x4b4/0x6b0
> [<ffff8000000a420c>] handle_exit+0x58/0x180
> [<ffff80000009e7a4>] kvm_arch_vcpu_ioctl_run+0x114/0x45c
> [<ffff800000099df4>] kvm_vcpu_ioctl+0x2e0/0x754
> [<ffff8000001c0a18>] do_vfs_ioctl+0x424/0x5c8
> [<ffff8000001c0bfc>] SyS_ioctl+0x40/0x78
> CPU0: stopping
> 
> Passing the (unintuitively named) __GFP_COMP flag to __get_free_pages
> forces the allocator to maintain a per-page refcount, which is exactly
> what we need.

There's a concerning comment with what seems to be the same kind of
scenario in arch/tile/mm/pgtable.c:248.

I'm a little confused here, because it looks like prep_new_page() calls
prep_compound_page(), which is the effect you're looking for, but it
sets the page count of all tail pages to 0, and I think our code expects
all page table pages with all entries clear to have a page count of 1,
i.e. the allocation count itself, so don't we still have a discrepancy
between head and tail pages for compound allocations?

So I'm wondering if we should simply call split_page() on the allocation
or if there's some other way of doing individual multiple contiguous page
allocations?  Am I seeing ghosts?
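
For concreteness, a completely untested sketch of what I mean for
kvm_alloc_stage2_pgd() (illustration only; freeing the PGD would need a
matching change):

	struct page *page = alloc_pages(GFP_KERNEL | __GFP_ZERO, S2_PGD_ORDER);
	if (!page)
		return -ENOMEM;

	/*
	 * split_page() turns the order-N block into 2^N independent
	 * order-0 pages, each with a refcount of 1, which is what the rest
	 * of the stage-2 code appears to expect. Teardown then has to free
	 * the pages one by one instead of as a single higher-order block.
	 */
	split_page(page, S2_PGD_ORDER);
	pgd = page_address(page);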

> 
> This has been tested on an X-Gene platform with a 4kB/48bit-VA host
> kernel, and kvmtool hacked to place memory in the second page of
> the hardware PGD (PUD for the host kernel).
> 
> Reported-by: Mark Rutland <mark.rutland@arm.com>
> Tested-by: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h   | 7 +++++++
>  arch/arm/kvm/mmu.c               | 2 +-
>  arch/arm64/include/asm/kvm_mmu.h | 9 ++++++++-
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 37ca2a4..1cac89b 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -162,6 +162,13 @@ static inline bool kvm_page_empty(void *ptr)
>  
>  #define KVM_PREALLOC_LEVEL	0
>  
> +/*
> + * We need to ensure that stage-2 PGDs are allocated with a per-page
> + * refcount, as we fiddle with the refcounts of non-head pages.
> + * __GFP_COMP forces the allocator to do what we want.
> + */
> +#define KVM_GFP_S2_PGD	(GFP_KERNEL | __GFP_ZERO | __GFP_COMP)
> +
>  static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
>  {

what about the case for kvm_prealloc_hwpgd, why doesn't it need to use
the new GFP mask?

>  	return 0;
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 3e6859b..a6a8252 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -666,7 +666,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>  		 * Allocate actual first-level Stage-2 page table used by the
>  		 * hardware for Stage-2 page table walks.
>  		 */
> -		pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, S2_PGD_ORDER);
> +		pgd = (pgd_t *)__get_free_pages(KVM_GFP_S2_PGD, S2_PGD_ORDER);
>  	}
>  
>  	if (!pgd)
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 6458b53..06c733a 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -171,6 +171,13 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>  #define KVM_PREALLOC_LEVEL	(0)
>  #endif
>  
> +/*
> + * We need to ensure that stage-2 PGDs are allocated with a per-page
> + * refcount, as we fiddle with the refcounts of non-head pages.
> + * __GFP_COMP forces the allocator to do what we want.
> + */
> +#define KVM_GFP_S2_PGD	(GFP_KERNEL | __GFP_ZERO | __GFP_COMP)
> +
>  /**
>   * kvm_prealloc_hwpgd - allocate inital table for VTTBR
>   * @kvm:	The KVM struct pointer for the VM.
> @@ -192,7 +199,7 @@ static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
>  	if (KVM_PREALLOC_LEVEL == 0)
>  		return 0;
>  
> -	hwpgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, PTRS_PER_S2_PGD_SHIFT);
> +	hwpgd = __get_free_pages(KVM_GFP_S2_PGD, PTRS_PER_S2_PGD_SHIFT);
>  	if (!hwpgd)
>  		return -ENOMEM;
>  
> -- 
> 2.1.4
> 

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/3] arm64: KVM: Do not use pgd_index to index stage-2 pgd
  2015-02-25 16:55 ` [PATCH 2/3] arm64: KVM: Do not use pgd_index to index stage-2 pgd Marc Zyngier
@ 2015-03-02 18:45   ` Christoffer Dall
  0 siblings, 0 replies; 8+ messages in thread
From: Christoffer Dall @ 2015-03-02 18:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 25, 2015 at 04:55:39PM +0000, Marc Zyngier wrote:
> The kernel's pgd_index macro is designed to index a normal, page-sized
> array. KVM is a bit different, as we can use concatenated pages to
> have a bigger address space (for example, a 40-bit IPA with 4kB pages
> gives us an 8kB PGD).
> 
> In the above case, the use of pgd_index will always return an index
> inside the first 4kB, which makes a guest that has memory above
> 0x8000000000 rather unhappy, as it spins forever in a page fault,
> whilst the host happily corrupts the lower pgd.
> 
> The obvious fix is to get our own kvm_pgd_index that does the right
> thing(tm).
> 
> Tested on X-Gene with a hacked kvmtool that put memory at a stupidly
> high address.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h   | 3 ++-
>  arch/arm/kvm/mmu.c               | 8 ++++----
>  arch/arm64/include/asm/kvm_mmu.h | 2 ++
>  3 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 1cac89b..ec669a6 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -149,13 +149,14 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>  	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
>  })
>  
> +#define kvm_pgd_index(addr)			pgd_index(addr)
> +
>  static inline bool kvm_page_empty(void *ptr)
>  {
>  	struct page *ptr_page = virt_to_page(ptr);
>  	return page_count(ptr_page) == 1;
>  }
>  
> -
>  #define kvm_pte_table_empty(kvm, ptep) kvm_page_empty(ptep)
>  #define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp)
>  #define kvm_pud_table_empty(kvm, pudp) (0)
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index a6a8252..39a0903 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -290,7 +290,7 @@ static void unmap_range(struct kvm *kvm, pgd_t *pgdp,
>  	phys_addr_t addr = start, end = start + size;
>  	phys_addr_t next;
>  
> -	pgd = pgdp + pgd_index(addr);
> +	pgd = pgdp + kvm_pgd_index(addr);
>  	do {
>  		next = kvm_pgd_addr_end(addr, end);
>  		if (!pgd_none(*pgd))
> @@ -355,7 +355,7 @@ static void stage2_flush_memslot(struct kvm *kvm,
>  	phys_addr_t next;
>  	pgd_t *pgd;
>  
> -	pgd = kvm->arch.pgd + pgd_index(addr);
> +	pgd = kvm->arch.pgd + kvm_pgd_index(addr);
>  	do {
>  		next = kvm_pgd_addr_end(addr, end);
>  		stage2_flush_puds(kvm, pgd, addr, next);
> @@ -799,7 +799,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	pgd_t *pgd;
>  	pud_t *pud;
>  
> -	pgd = kvm->arch.pgd + pgd_index(addr);
> +	pgd = kvm->arch.pgd + kvm_pgd_index(addr);
>  	if (WARN_ON(pgd_none(*pgd))) {
>  		if (!cache)
>  			return NULL;
> @@ -1089,7 +1089,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  	pgd_t *pgd;
>  	phys_addr_t next;
>  
> -	pgd = kvm->arch.pgd + pgd_index(addr);
> +	pgd = kvm->arch.pgd + kvm_pgd_index(addr);
>  	do {
>  		/*
>  		 * Release kvm_mmu_lock periodically if the memory region is
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 06c733a..c6300fd 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -158,6 +158,8 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>  #define PTRS_PER_S2_PGD		(1 << PTRS_PER_S2_PGD_SHIFT)
>  #define S2_PGD_ORDER		get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))
>  
> +#define kvm_pgd_index(addr)	(((addr) >> PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +
>  /*
>   * If we are concatenating first level stage-2 page tables, we would have less
>   * than or equal to 16 pointers in the fake PGD, because that's what the
> -- 
> 2.1.4
> 

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 3/3] arm64: KVM: Fix outdated comment about VTCR_EL2.PS
  2015-02-25 16:55 ` [PATCH 3/3] arm64: KVM: Fix outdated comment about VTCR_EL2.PS Marc Zyngier
@ 2015-03-02 18:52   ` Christoffer Dall
  0 siblings, 0 replies; 8+ messages in thread
From: Christoffer Dall @ 2015-03-02 18:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Feb 25, 2015 at 04:55:40PM +0000, Marc Zyngier wrote:
> Commit 87366d8cf7b3 ("arm64: Add boot time configuration of
> Intermediate Physical Address size") removed the hardcoded setting
> of VTCR_EL2.PS to use ID_AA64MMFR0_EL1.PARange instead, but didn't
> remove the (now rather misleading) comment.
> 
> Fix the comments to match reality (at least for the next few minutes).
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>

Acked-by: Christoffer Dall <christoffer.dall@linaro.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting
  2015-03-02 18:27   ` Christoffer Dall
@ 2015-03-05 13:58     ` Marc Zyngier
  0 siblings, 0 replies; 8+ messages in thread
From: Marc Zyngier @ 2015-03-05 13:58 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/03/15 18:27, Christoffer Dall wrote:
> On Wed, Feb 25, 2015 at 04:55:38PM +0000, Marc Zyngier wrote:
>> We're using __get_free_pages to allocate the guest's stage-2
>> PGD. The standard behaviour of this function is to return a set of
>> pages where only the head page has a valid refcount.
>>
>> This behaviour gets us into trouble when we're trying to increment
>> the refcount on a non-head page:
>>
>> page:ffff7c00cfb693c0 count:0 mapcount:0 mapping:          (null) index:0x0
>> flags: 0x4000000000000000()
>> page dumped because: VM_BUG_ON_PAGE((*({ __attribute__((unused)) typeof((&page->_count)->counter) __var = ( typeof((&page->_count)->counter)) 0; (volatile typeof((&page->_count)->counter) *)&((&page->_count)->counter); })) <= 0)
>> BUG: failure at include/linux/mm.h:548/get_page()!
>> Kernel panic - not syncing: BUG!
>> CPU: 1 PID: 1695 Comm: kvm-vcpu-0 Not tainted 4.0.0-rc1+ #3825
>> Hardware name: APM X-Gene Mustang board (DT)
>> Call trace:
>> [<ffff80000008a09c>] dump_backtrace+0x0/0x13c
>> [<ffff80000008a1e8>] show_stack+0x10/0x1c
>> [<ffff800000691da8>] dump_stack+0x74/0x94
>> [<ffff800000690d78>] panic+0x100/0x240
>> [<ffff8000000a0bc4>] stage2_get_pmd+0x17c/0x2bc
>> [<ffff8000000a1dc4>] kvm_handle_guest_abort+0x4b4/0x6b0
>> [<ffff8000000a420c>] handle_exit+0x58/0x180
>> [<ffff80000009e7a4>] kvm_arch_vcpu_ioctl_run+0x114/0x45c
>> [<ffff800000099df4>] kvm_vcpu_ioctl+0x2e0/0x754
>> [<ffff8000001c0a18>] do_vfs_ioctl+0x424/0x5c8
>> [<ffff8000001c0bfc>] SyS_ioctl+0x40/0x78
>> CPU0: stopping
>>
>> Passing the (unintuitively named) __GFP_COMP flag to __get_free_pages
>> forces the allocator to maintain a per-page refcount, which is exactly
>> what we need.
> 
> There's a concerning comment with what seems to be the same kind of
> scenario in arch/tile/mm/pgtable.c:248.
> 
> I'm a little confused here, because it looks like prep_new_page() calls
> prep_compound_page(), which is the effect you're looking for, but it
> sets the page count of all tail pages to 0, and I think our code expects
> all page table pages with all entries clear to have a page count of 1,
> i.e. the allocation count itself, so don't we still have a discrepancy
> between head and tail pages for compound allocations?

I wondered about that, but couldn't decide either way. If that was the
case, we'd trigger some nasty warnings when tearing down a VM, and I
couldn't see any.

> So I'm wondering if we should simply call split_page() on the allocation
> or if there's some other way of doing individual multiple contiguous page
> allocations?  Am I seeing ghosts?

Could well be a nicer solution. At least I can understand the code.

I'll give it a go.

>>
>> This has been tested on an X-Gene platform with a 4kB/48bit-VA host
>> kernel, and kvmtool hacked to place memory in the second page of
>> the hardware PGD (PUD for the host kernel).
>>
>> Reported-by: Mark Rutland <mark.rutland@arm.com>
>> Tested-by: Mark Rutland <mark.rutland@arm.com>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>  arch/arm/include/asm/kvm_mmu.h   | 7 +++++++
>>  arch/arm/kvm/mmu.c               | 2 +-
>>  arch/arm64/include/asm/kvm_mmu.h | 9 ++++++++-
>>  3 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 37ca2a4..1cac89b 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -162,6 +162,13 @@ static inline bool kvm_page_empty(void *ptr)
>>  
>>  #define KVM_PREALLOC_LEVEL	0
>>  
>> +/*
>> + * We need to ensure that stage-2 PGDs are allocated with a per-page
>> + * refcount, as we fiddle with the refcounts of non-head pages.
>> + * __GFP_COMP forces the allocator to do what we want.
>> + */
>> +#define KVM_GFP_S2_PGD	(GFP_KERNEL | __GFP_ZERO | __GFP_COMP)
>> +
>>  static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
>>  {
> 
> what about the case for kvm_prealloc_hwpgd, why doesn't it need to use
> the new GFP mask?

I don't think so. It is always at most a page.

>>  	return 0;
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 3e6859b..a6a8252 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -666,7 +666,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>>  		 * Allocate actual first-level Stage-2 page table used by the
>>  		 * hardware for Stage-2 page table walks.
>>  		 */
>> -		pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, S2_PGD_ORDER);
>> +		pgd = (pgd_t *)__get_free_pages(KVM_GFP_S2_PGD, S2_PGD_ORDER);
>>  	}
>>  
>>  	if (!pgd)
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index 6458b53..06c733a 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -171,6 +171,13 @@ static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
>>  #define KVM_PREALLOC_LEVEL	(0)
>>  #endif
>>  
>> +/*
>> + * We need to ensure that stage-2 PGDs are allocated with a per-page
>> + * refcount, as we fiddle with the refcounts of non-head pages.
>> + * __GFP_COMP forces the allocator to do what we want.
>> + */
>> +#define KVM_GFP_S2_PGD	(GFP_KERNEL | __GFP_ZERO | __GFP_COMP)
>> +
>>  /**
>>   * kvm_prealloc_hwpgd - allocate inital table for VTTBR
>>   * @kvm:	The KVM struct pointer for the VM.
>> @@ -192,7 +199,7 @@ static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
>>  	if (KVM_PREALLOC_LEVEL == 0)
>>  		return 0;
>>  
>> -	hwpgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, PTRS_PER_S2_PGD_SHIFT);
>> +	hwpgd = __get_free_pages(KVM_GFP_S2_PGD, PTRS_PER_S2_PGD_SHIFT);
>>  	if (!hwpgd)
>>  		return -ENOMEM;
>>  
>> -- 
>> 2.1.4
>>

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-03-05 13:58 UTC | newest]

Thread overview: 8+ messages
2015-02-25 16:55 [PATCH 0/3] arm64: KVM: High memory guest fixes Marc Zyngier
2015-02-25 16:55 ` [PATCH 1/3] arm64: KVM: Fix stage-2 PGD allocation to have per-page refcounting Marc Zyngier
2015-03-02 18:27   ` Christoffer Dall
2015-03-05 13:58     ` Marc Zyngier
2015-02-25 16:55 ` [PATCH 2/3] arm64: KVM: Do not use pgd_index to index stage-2 pgd Marc Zyngier
2015-03-02 18:45   ` Christoffer Dall
2015-02-25 16:55 ` [PATCH 3/3] arm64: KVM: Fix outdated comment about VTCR_EL2.PS Marc Zyngier
2015-03-02 18:52   ` Christoffer Dall
