From: David Gibson <david@gibson.dropbear.id.au>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: kvm-ppc@vger.kernel.org, Paul Mackerras <paulus@ozlabs.org>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size
Date: Wed, 05 Sep 2018 03:59:52 +0000
Message-ID: <20180905035952.GJ2679@umbus.fritz.box>
In-Reply-To: <20180904081601.32703-1-npiggin@gmail.com>
On Tue, Sep 04, 2018 at 06:16:01PM +1000, Nicholas Piggin wrote:
> THP paths can defer splitting compound pages until after the actual
> remap and TLB flushes to split a huge PMD/PUD. This causes radix
> partition scope page table mappings to get out of synch with the host
> qemu page table mappings.
>
> This results in random memory corruption in the guest when running
> with THP. The easiest way to reproduce is to use the KVM balloon to free up
> a lot of memory in the guest and then shrink the balloon to give the
> memory back, while some work is being done in the guest.
>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Seems to fix the problem on my test case.
Tested-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 88 ++++++++++----------------
> 1 file changed, 34 insertions(+), 54 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 0af1c0aea1fe..d8792445d95a 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -525,8 +525,8 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
> unsigned long ea, unsigned long dsisr)
> {
> struct kvm *kvm = vcpu->kvm;
> - unsigned long mmu_seq, pte_size;
> - unsigned long gpa, gfn, hva, pfn;
> + unsigned long mmu_seq;
> + unsigned long gpa, gfn, hva;
> struct kvm_memory_slot *memslot;
> struct page *page = NULL;
> long ret;
> @@ -623,9 +623,10 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
> */
> hva = gfn_to_hva_memslot(memslot, gfn);
> if (upgrade_p && __get_user_pages_fast(hva, 1, 1, &page) == 1) {
> - pfn = page_to_pfn(page);
> upgrade_write = true;
> } else {
> + unsigned long pfn;
> +
> /* Call KVM generic code to do the slow-path check */
> pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
> writing, upgrade_p);
> @@ -639,63 +640,42 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
> }
> }
>
> - /* See if we can insert a 1GB or 2MB large PTE here */
> - level = 0;
> - if (page && PageCompound(page)) {
> - pte_size = PAGE_SIZE << compound_order(compound_head(page));
> - if (pte_size >= PUD_SIZE &&
> - (gpa & (PUD_SIZE - PAGE_SIZE)) ==
> - (hva & (PUD_SIZE - PAGE_SIZE))) {
> - level = 2;
> - pfn &= ~((PUD_SIZE >> PAGE_SHIFT) - 1);
> - } else if (pte_size >= PMD_SIZE &&
> - (gpa & (PMD_SIZE - PAGE_SIZE)) ==
> - (hva & (PMD_SIZE - PAGE_SIZE))) {
> - level = 1;
> - pfn &= ~((PMD_SIZE >> PAGE_SHIFT) - 1);
> - }
> - }
> -
> /*
> - * Compute the PTE value that we need to insert.
> + * Read the PTE from the process' radix tree and use that
> + * so we get the shift and attribute bits.
> */
> - if (page) {
> - pgflags = _PAGE_READ | _PAGE_EXEC | _PAGE_PRESENT | _PAGE_PTE |
> - _PAGE_ACCESSED;
> - if (writing || upgrade_write)
> - pgflags |= _PAGE_WRITE | _PAGE_DIRTY;
> - pte = pfn_pte(pfn, __pgprot(pgflags));
> + local_irq_disable();
> + ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
> + pte = *ptep;
> + local_irq_enable();
> +
> + /* Get pte level from shift/size */
> + if (shift == PUD_SHIFT &&
> + (gpa & (PUD_SIZE - PAGE_SIZE)) ==
> + (hva & (PUD_SIZE - PAGE_SIZE))) {
> + level = 2;
> + } else if (shift == PMD_SHIFT &&
> + (gpa & (PMD_SIZE - PAGE_SIZE)) ==
> + (hva & (PMD_SIZE - PAGE_SIZE))) {
> + level = 1;
> } else {
> - /*
> - * Read the PTE from the process' radix tree and use that
> - * so we get the attribute bits.
> - */
> - local_irq_disable();
> - ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
> - pte = *ptep;
> - local_irq_enable();
> - if (shift == PUD_SHIFT &&
> - (gpa & (PUD_SIZE - PAGE_SIZE)) ==
> - (hva & (PUD_SIZE - PAGE_SIZE))) {
> - level = 2;
> - } else if (shift == PMD_SHIFT &&
> - (gpa & (PMD_SIZE - PAGE_SIZE)) ==
> - (hva & (PMD_SIZE - PAGE_SIZE))) {
> - level = 1;
> - } else if (shift && shift != PAGE_SHIFT) {
> - /* Adjust PFN */
> - unsigned long mask = (1ul << shift) - PAGE_SIZE;
> - pte = __pte(pte_val(pte) | (hva & mask));
> - }
> - pte = __pte(pte_val(pte) | _PAGE_EXEC | _PAGE_ACCESSED);
> - if (writing || upgrade_write) {
> - if (pte_val(pte) & _PAGE_WRITE)
> - pte = __pte(pte_val(pte) | _PAGE_DIRTY);
> - } else {
> - pte = __pte(pte_val(pte) & ~(_PAGE_WRITE | _PAGE_DIRTY));
> + level = 0;
> +
> + /* Can not cope with unknown page shift */
> + if (shift && shift != PAGE_SHIFT) {
> + WARN_ON_ONCE(1);
> + return -EFAULT;
> }
> }
>
> + pte = __pte(pte_val(pte) | _PAGE_EXEC | _PAGE_ACCESSED);
> + if (writing || upgrade_write) {
> + if (pte_val(pte) & _PAGE_WRITE)
> + pte = __pte(pte_val(pte) | _PAGE_DIRTY);
> + } else {
> + pte = __pte(pte_val(pte) & ~(_PAGE_WRITE | _PAGE_DIRTY));
> + }
> +
> /* Allocate space in the tree and write the PTE */
> ret = kvmppc_create_pte(kvm, pte, gpa, level, mmu_seq);
>
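
To make the level selection in the last hunk easier to follow outside the
kernel, here is a minimal userspace sketch of the same decision. It is not
the kernel code: pick_level() is a hypothetical helper, and the shift values
below are assumptions (64K base pages, 2M PMD, 1G PUD) chosen only for
illustration. As in the quoted hunk, the mapping size comes from the shift
reported by the page table walk, not from compound_order().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration only: 64K base page, 2M PMD, 1G PUD. */
#define PAGE_SHIFT 16
#define PMD_SHIFT  21
#define PUD_SHIFT  30
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)
#define PMD_SIZE   (UINT64_C(1) << PMD_SHIFT)
#define PUD_SIZE   (UINT64_C(1) << PUD_SHIFT)

/*
 * Hypothetical helper mirroring the hunk's logic.
 * shift: mapping shift found by the host page table walk (0 = base page).
 * Returns 2 for a PUD-sized mapping, 1 for PMD-sized, 0 for a base page,
 * and -1 where the quoted code would WARN and return -EFAULT.
 */
static int pick_level(unsigned int shift, uint64_t gpa, uint64_t hva)
{
    /* A large PTE is only usable if gpa and hva share the same offset
     * within the large page (compared at base-page granularity). */
    if (shift == PUD_SHIFT &&
        (gpa & (PUD_SIZE - PAGE_SIZE)) == (hva & (PUD_SIZE - PAGE_SIZE)))
        return 2;
    if (shift == PMD_SHIFT &&
        (gpa & (PMD_SIZE - PAGE_SIZE)) == (hva & (PMD_SIZE - PAGE_SIZE)))
        return 1;

    /* Anything that is neither a base page nor a matched huge mapping
     * is rejected. Note that, as written in the hunk, a PMD or PUD
     * mapping whose offsets do not match also ends up here. */
    if (shift && shift != PAGE_SHIFT)
        return -1;
    return 0;
}

/* Small self-check of the cases discussed above. */
static void pick_level_selfcheck(void)
{
    assert(pick_level(PMD_SHIFT, 0x40200000, 0x7fff00200000) == 1);
    assert(pick_level(PAGE_SHIFT, 0x40210000, 0x7fff00200000) == 0);
}
```

Calling pick_level_selfcheck() from a test program exercises the two
common cases; the -1 result stands in for the WARN_ON_ONCE()/-EFAULT path.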
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson