All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marc Zyngier <maz@kernel.org>
To: Gavin Shan <gshan@redhat.com>
Cc: shan.gavin@gmail.com, Will Deacon <will@kernel.org>,
	kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH] arm64/kvm: Fix zapping stage2 page table wrongly
Date: Tue, 01 Sep 2020 17:50:57 +0100	[thread overview]
Message-ID: <361e3092fc04063eeced68648e8c8994@kernel.org> (raw)
In-Reply-To: <cadec9ec-3d56-a86d-0da1-f17208221692@redhat.com>

Hi Gavin,

On 2020-08-23 00:59, Gavin Shan wrote:
> Hi Marc,
> 
> On 8/22/20 8:01 PM, Marc Zyngier wrote:
>> On Sat, 22 Aug 2020 03:44:44 +0100,
>> Gavin Shan <gshan@redhat.com> wrote:
>>> 
>>> Depending on the kernel configuration, PUD_SIZE could be equal to
>>> PMD_SIZE. For example, both of them are 512MB with the following
>>> kernel configuration. In this case, both PUD and PMD are folded
>>> to PGD.
>>> 
>>>     CONFIG_ARM64_64K_PAGES   y
>>>     CONFIG_ARM64_VA_BITS     42
>>>     CONFIG_PGTABLE_LEVELS    2
>>> 
>>> With the above configuration, the stage2 PUD is used to backup the
>>> 512MB huge page when the stage2 mapping is built. During the mapping,
>>> the PUD and its subordinate levels of page table entries are unmapped
>>> if the PUD is present and not huge page sensitive in 
>>> stage2_set_pud_huge().
>>> Unfornately, the @addr isn't aligned to S2_PUD_SIZE and wrong page 
>>> table
>>> entries are zapped. It eventually leads to PUD's present bit can't be
>>> cleared successfully and infinite loop in stage2_set_pud_huge().
>>> 
>>> This fixes the issue by checking with S2_{PUD, PMD}_SIZE instead of
>>> {PUD, PMD}_SIZE to determine if stage2 PUD or PMD is used to back the
>>> huge page. For this particular case, the stage2 PMD entry should be
>>> used to backup the 512MB huge page with stage2_set_pmd_huge().
>> 
>> It isn't obvious to me how S2_PMD_SIZE can be different from PMD_SIZE,
>> and the current code certainly expects both to be equal (just look at
>> how often S2_*_SIZE is used in the current code to convince yourself).
>> 
>> My guess is that some lesser tested configurations (such as 64k pages)
>> break that assumption, and result in the wrong thing happening. Could
>> you please shed some light on it?
>> 
> 
> With the following kernel configuration, PUD_SIZE and PMD_SIZE are 
> equal
> and both of them are 512MB because P4D/PUD/PMD are folded into PGD.
> 
>    CONFIG_ARM64_64K_PAGES   y
>    CONFIG_ARM64_VA_BITS     42
>    CONFIG_PGTABLE_LEVELS    2
>    PMD_SIZE                 512MB
> (include/asm-generic/pgtable-no-pud.h)
>    PUD_SIZE                 512MB
> (include/asm-generic/pgtable-no-pmd.h)
>    P4D_SIZE                 512MB
> (include/asm-generic/pgtable-nop4d.h)
>    S2_PMD_SIZE              512MB               (stage2_pgtable.h)
>    S2_PUD_SIZE                4TB               (stage2_pgtable.h)

For a start, this last value makes no sense at all. If that we only
have 2 levels, S2_PUD_SIZE is wrong, and should either be ignored or
be the same as S2_PMD_SIZE (4TB represents the whole of the guest's
IPA space).

> For this particular case, S2_PMD_SIZE and PMD_SIZE are equal and both
> of them are 512MB. However, the issue is wrong size (PMD_SIZE/PUD_SIZE)
> is checked to call stage2_set_pud_huge() or stage2_set_pmd_huge() in
> user_mem_abort(). It causes stage2_set_pud_huge() is called to map the
> 512MB huge page, meaning stage2 PUD is used to back the 512MB huge 
> page.

Which should be fine, because that's all we can map. 2 levels, remember?
The real bug is that we consider that mapping a PUD is valid, while 
there
is no real PUD in this context.

> 
> In stage2_set_pud_huge(), the S2 page table entries are zapped for the
> range ((addr & S2_PUD_MASK), S2_PUD_SIZE), whose size is 4TB. However,
> we're mapping 512MB huge page (block). It means unrelated page table
> entries are cleared.

We can't map 4TB. That's not possible for such configuration (and I
doubt you have more than 4TB of contiguous memory on your system).

> 
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_memory_slot *memslot, unsigned 
> long hva,
>                           unsigned long fault_status)
> {
>     :
>     if (vma_pagesize == PUD_SIZE) {           /* PUD_SIZE == 512MB */
>         ret = stage2_set_pud_huge(mmu, memcache, fault_ipa, &new_pud);
>     } else if (vma_pagesize == PMD_SIZE) {    /* PMD_SIZE == 512MB */
>         ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
>     } else {
>         ret = stage2_set_pte(mmu, memcache, fault_ipa, &new_pte, 
> flags);
>     }
>     :
> }
> 
> The issue was initially reported by Eric and it can be reproduced on
> upstream kernel/qemu with the configurations to enable 64KB page size
> and 42-bits VA bits (CONFIG_ARM64_VA_BITS). Here is the command I used
> to reproduce the issue. Note that the IPA limit reported from the 
> machine
> where I reproduced the issue is 44-bits, but qemu just uses 40-bits. It
> means the stage2 pagetable has 3 levels.

No. KVM, in its current form, cannot have more levels at stage-2 than it
has at stage-1 (that's one the things we are fixing with Will's 
patches).

> 
> start_vm_aarch64_hugetlbfs() {
>    echo 32 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages

Ah, hugetlb. I wonder if we have something funky here. Alexandru (cc'd)
was looking at something in this area (see [1]).

> 
>    /home/gavin/sandbox/qemu.main/aarch64-softmmu/qemu-system-aarch64    
>        \
>    --enable-kvm -machine virt,gic-version=host                          
>        \
>    -cpu host -smp 8,sockets=8,cores=1,threads=1                         
>        \
>    -m 8G -mem-prealloc -mem-path /dev/hugepages                         
>        \
>    -monitor none -serial mon:stdio -nographic -s                        
>        \
>    -bios /home/gavin/sandbox/qemu.main/pc-bios/edk2-aarch64-code.fd     
>        \
>    -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image        
>        \
>    -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                    
>        \
>    -append "earlycon=pl011,mmio,0x9000000"                              
>        \
>    -device virtio-net-pci,netdev=unet,mac=52:54:00:f1:26:a6             
>        \
>    -netdev user,id=unet,hostfwd=tcp::50959-:22                          
>        \
>    -drive 
> file=/home/gavin/sandbox/images/vm.img,if=none,format=raw,id=nvme0   \
>    -device nvme,drive=nvme0,serial=foo                                  
>        \
>    -drive 
> file=/home/gavin/sandbox/images/vm1.img,if=none,format=raw,id=nvme1  \
>    -device nvme,drive=nvme1,serial=foo1
> }

I'll try to reproduce this.

         M.

[1] 
https://lore.kernel.org/r/20200901133357.52640-1-alexandru.elisei@arm.com
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

  reply	other threads:[~2020-09-01 16:51 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-22  2:44 [PATCH] arm64/kvm: Fix zapping stage2 page table wrongly Gavin Shan
2020-08-22 10:01 ` Marc Zyngier
2020-08-22 23:59   ` Gavin Shan
2020-09-01 16:50     ` Marc Zyngier [this message]
2020-09-02 10:59 ` Alexandru Elisei
2020-09-02 11:10   ` Marc Zyngier
2020-09-02 11:53     ` Alexandru Elisei
2020-09-02 11:56       ` Alexandru Elisei
2020-09-02 12:04       ` Marc Zyngier
2020-09-02 13:58         ` Alexandru Elisei
2020-09-02 17:38     ` Auger Eric
2020-09-02 23:55     ` Gavin Shan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=361e3092fc04063eeced68648e8c8994@kernel.org \
    --to=maz@kernel.org \
    --cc=gshan@redhat.com \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=shan.gavin@gmail.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.