From: Marc Zyngier <maz@kernel.org>
To: Gavin Shan <gshan@redhat.com>
Cc: shan.gavin@gmail.com, Will Deacon <will@kernel.org>,
kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH] arm64/kvm: Fix zapping stage2 page table wrongly
Date: Tue, 01 Sep 2020 17:50:57 +0100 [thread overview]
Message-ID: <361e3092fc04063eeced68648e8c8994@kernel.org> (raw)
In-Reply-To: <cadec9ec-3d56-a86d-0da1-f17208221692@redhat.com>
Hi Gavin,
On 2020-08-23 00:59, Gavin Shan wrote:
> Hi Marc,
>
> On 8/22/20 8:01 PM, Marc Zyngier wrote:
>> On Sat, 22 Aug 2020 03:44:44 +0100,
>> Gavin Shan <gshan@redhat.com> wrote:
>>>
>>> Depending on the kernel configuration, PUD_SIZE can be equal to
>>> PMD_SIZE. For example, both of them are 512MB with the following
>>> kernel configuration, where both PUD and PMD are folded into the
>>> PGD.
>>>
>>> CONFIG_ARM64_64K_PAGES y
>>> CONFIG_ARM64_VA_BITS 42
>>> CONFIG_PGTABLE_LEVELS 2
>>>
>>> With the above configuration, a stage2 PUD is used to back the
>>> 512MB huge page when the stage2 mapping is built. During the
>>> mapping, the PUD and its subordinate levels of page table entries
>>> are unmapped if the PUD is present and not a huge page entry in
>>> stage2_set_pud_huge(). Unfortunately, @addr isn't aligned to
>>> S2_PUD_SIZE, so the wrong page table entries are zapped. This
>>> eventually means the PUD's present bit can never be cleared,
>>> leading to an infinite loop in stage2_set_pud_huge().
>>>
>>> This fixes the issue by checking against S2_{PUD, PMD}_SIZE instead
>>> of {PUD, PMD}_SIZE to determine whether a stage2 PUD or PMD is used
>>> to back the huge page. For this particular case, a stage2 PMD entry
>>> should be used to back the 512MB huge page via stage2_set_pmd_huge().
>>
>> It isn't obvious to me how S2_PMD_SIZE can be different from PMD_SIZE,
>> and the current code certainly expects both to be equal (just look at
>> how often S2_*_SIZE is used in the current code to convince yourself).
>>
>> My guess is that some lesser tested configurations (such as 64k pages)
>> break that assumption, and result in the wrong thing happening. Could
>> you please shed some light on it?
>>
>
> With the following kernel configuration, PUD_SIZE and PMD_SIZE are
> equal, and both of them are 512MB, because P4D/PUD/PMD are folded
> into the PGD.
>
>     CONFIG_ARM64_64K_PAGES    y
>     CONFIG_ARM64_VA_BITS      42
>     CONFIG_PGTABLE_LEVELS     2
>     PMD_SIZE                  512MB  (include/asm-generic/pgtable-nopmd.h)
>     PUD_SIZE                  512MB  (include/asm-generic/pgtable-nopud.h)
>     P4D_SIZE                  512MB  (include/asm-generic/pgtable-nop4d.h)
>     S2_PMD_SIZE               512MB  (stage2_pgtable.h)
>     S2_PUD_SIZE               4TB    (stage2_pgtable.h)
For a start, this last value makes no sense at all. Given that we only
have 2 levels, S2_PUD_SIZE is wrong, and should either be ignored or
be the same as S2_PMD_SIZE (4TB represents the whole of the guest's
IPA space).
> For this particular case, S2_PMD_SIZE and PMD_SIZE are equal and both
> of them are 512MB. However, the issue is that the wrong sizes
> (PMD_SIZE/PUD_SIZE) are checked in user_mem_abort() to decide between
> stage2_set_pud_huge() and stage2_set_pmd_huge(). As a result,
> stage2_set_pud_huge() is called to map the 512MB huge page, meaning a
> stage2 PUD is used to back the 512MB huge page.
Which should be fine, because that's all we can map. 2 levels, remember?
The real bug is that we consider that mapping a PUD is valid, while
there is no real PUD in this context.
>
> In stage2_set_pud_huge(), the S2 page table entries are zapped for the
> range ((addr & S2_PUD_MASK), S2_PUD_SIZE), whose size is 4TB. However,
> we're mapping a 512MB huge page (block), so unrelated page table
> entries are cleared.
We can't map 4TB. That's not possible for such configuration (and I
doubt you have more than 4TB of contiguous memory on your system).
>
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_memory_slot *memslot,
>                           unsigned long hva, unsigned long fault_status)
> {
>         :
>         if (vma_pagesize == PUD_SIZE) {        /* PUD_SIZE == 512MB */
>                 ret = stage2_set_pud_huge(mmu, memcache, fault_ipa, &new_pud);
>         } else if (vma_pagesize == PMD_SIZE) { /* PMD_SIZE == 512MB */
>                 ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
>         } else {
>                 ret = stage2_set_pte(mmu, memcache, fault_ipa, &new_pte, flags);
>         }
>         :
> }
>
> The issue was initially reported by Eric, and it can be reproduced on
> an upstream kernel/qemu configured with a 64KB page size and 42-bit
> VA bits (CONFIG_ARM64_VA_BITS=42). Here is the command I used to
> reproduce the issue. Note that the IPA limit reported by the machine
> where I reproduced the issue is 44 bits, but qemu just uses 40 bits,
> meaning the stage2 page table has 3 levels.
No. KVM, in its current form, cannot have more levels at stage-2 than
it has at stage-1 (that's one of the things we are fixing with Will's
patches).
>
> start_vm_aarch64_hugetlbfs() {
> echo 32 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages
Ah, hugetlb. I wonder if we have something funky here. Alexandru (cc'd)
was looking at something in this area (see [1]).
>
> /home/gavin/sandbox/qemu.main/aarch64-softmmu/qemu-system-aarch64 \
>     --enable-kvm -machine virt,gic-version=host \
>     -cpu host -smp 8,sockets=8,cores=1,threads=1 \
>     -m 8G -mem-prealloc -mem-path /dev/hugepages \
>     -monitor none -serial mon:stdio -nographic -s \
>     -bios /home/gavin/sandbox/qemu.main/pc-bios/edk2-aarch64-code.fd \
>     -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image \
>     -initrd /home/gavin/sandbox/images/rootfs.cpio.xz \
>     -append "earlycon=pl011,mmio,0x9000000" \
>     -device virtio-net-pci,netdev=unet,mac=52:54:00:f1:26:a6 \
>     -netdev user,id=unet,hostfwd=tcp::50959-:22 \
>     -drive file=/home/gavin/sandbox/images/vm.img,if=none,format=raw,id=nvme0 \
>     -device nvme,drive=nvme0,serial=foo \
>     -drive file=/home/gavin/sandbox/images/vm1.img,if=none,format=raw,id=nvme1 \
>     -device nvme,drive=nvme1,serial=foo1
> }
I'll try to reproduce this.
M.
[1]
https://lore.kernel.org/r/20200901133357.52640-1-alexandru.elisei@arm.com
--
Jazz is not dead. It just smells funny...