From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Reiji Watanabe <reijiw@google.com>
Subject: Re: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()
Date: Thu, 16 Sep 2021 09:19:12 +0200 [thread overview]
Message-ID: <87wnnhyolr.fsf@vitty.brq.redhat.com> (raw)
In-Reply-To: <YUIunxwjea/wq3gd@google.com>
Sean Christopherson <seanjc@google.com> writes:
> On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>> > +{
>> > + struct vcpu_vmx *vmx = to_vmx(vcpu);
>> > +
>> > + init_vmcs(vmx);
>> > +
>> > + if (nested)
>> > + memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
>> > +
>> > + vcpu_setup_sgx_lepubkeyhash(vcpu);
>> > +
>> > + vmx->nested.posted_intr_nv = -1;
>> > + vmx->nested.current_vmptr = -1ull;
>> > + vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
>>
>> What would happen in (hypothetical) case when enlightened VMCS is
>> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
>> nested_release_evmcs() (called from
>> nested_vmx_handle_enlightened_vmptrld(), for example) will not do
>> kvm_vcpu_unmap() while it should.
>
> The short answer is that there's a lot of stuff that needs to be addressed before
> KVM can expose a RESET ioctl(). My goal with these patches is to carve out the
> stubs and move the few bits of RESET emulation into the "stubs". This is the same
> answer for the MSR question/comment at the end.
>
>> This, however, got me thinking: should we free all-things-nested with
>> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
>> find us doing that... (I do remember that INIT is blocked in VMX-root
>> mode and nobody else besides kvm_arch_vcpu_create()/
>> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
>> should at least add a WARN_ON() guardian here?
>
> I think that makes sense. Maybe use CR0 as a sentinel since it has a non-zero
> RESET value? E.g. WARN if CR0 is non-zero at RESET.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..3ac074376821 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> unsigned long new_cr0;
> u32 eax, dummy;
>
> + /*
> + * <comment about KVM not supporting arbitrary RESET>
> + */
> + WARN_ON_ONCE(!init_event && old_cr0);
> +
> kvm_lapic_reset(vcpu, init_event);
>
> vcpu->arch.hflags = 0;
>
Should work (assuming nothing in VMX would want to call vmx_vcpu_reset()/
__vmx_vcpu_reset() directly).
> Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly
> reset MMU context at vCPU RESET/INIT") technically introduced a bug. kvm_vcpu_reset()
> does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is
> (correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset(). init_vmcs()
> doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and
> so VMREAD(GUEST_CR0) is technically consuming garbage. In practice, it's consuming
> '0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating
> the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE.
>
> And staring more at kvm_vcpu_reset(), this code is terrifying for INIT
>
> memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> vcpu->arch.regs_avail = ~0;
> vcpu->arch.regs_dirty = ~0;
>
> because it means cr0 and cr4 are marked available+dirty without immediately writing
> vcpu->arch.cr0/cr4. And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG
> via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow.
> Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in
> vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in
> kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh.
>
> Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is
> consuming stale data. It doesn't matter in practice because the old value is
> only used to re-evaluate vmx->emulation_required, which is guaranteed to be up
> refreshed by vmx_set_cr0() and friends. PDPTRs and EXIT_INFO are in a similar
> boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but
> marking them dirty without actually updating the cached values is wrong.
>
> There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT.
> That bug has gone unnoticed because no real word BIOS/kernel is going to rely on
> INIT to set CR3=0.
>
> I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(),
> and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where
> x86 code writes a vcpu->arch register directly. That would fix the CR0 read bug
> and also prevent subtle bugs from sneaking in. Adding new EXREGS would be slightly
> more costly, but IMO that's a good thing as would force us to actually think about
> how to handle each register.
>
> E.g. implement this over 2-3 patches:
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 114847253e0a..743146ac8307 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> kvm_set_cr8(vcpu, 0);
>
> vmx_segment_cache_clear(vmx);
> + kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
>
> seg_setup(VCPU_SREG_CS);
> vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..ab907a0b9eeb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> int r;
>
> vcpu->arch.last_vmentry_cpu = -1;
> + vcpu->arch.regs_avail = ~0;
> + vcpu->arch.regs_dirty = ~0;
>
> if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> @@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> vcpu->arch.xcr0 = XFEATURE_MASK_FP;
> }
>
> + /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
> memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> - vcpu->arch.regs_avail = ~0;
> - vcpu->arch.regs_dirty = ~0;
> + kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
>
> /*
> * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
> @@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> kvm_rip_write(vcpu, 0xfff0);
>
> + vcpu->arch.cr3 = 0;
> + kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> +
> /*
> * CR0.CD/NW are set on RESET, preserved on INIT. Note, some versions
> * of Intel's SDM list CD/NW as being set on INIT, but they contradict
>
A selftest for vCPU create/reset would be really helpful. I can even
volunteer to [eventually] write one :-)
--
Vitaly
next prev parent reply other threads:[~2021-09-16 7:19 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-09-14 23:08 [PATCH 0/3] KVM: x86: Clean up RESET "emulation" Sean Christopherson
2021-09-14 23:08 ` [PATCH 1/3] KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation Sean Christopherson
2021-09-14 23:08 ` [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset() Sean Christopherson
2021-09-15 10:30 ` Vitaly Kuznetsov
2021-09-15 17:34 ` Sean Christopherson
2021-09-16 7:19 ` Vitaly Kuznetsov [this message]
2021-09-16 19:01 ` Sean Christopherson
2021-09-14 23:08 ` [PATCH 3/3] KVM: SVM: Move RESET emulation to svm_vcpu_reset() Sean Christopherson
2021-09-17 16:15 ` [PATCH 0/3] KVM: x86: Clean up RESET "emulation" Paolo Bonzini
2021-09-17 17:34 ` Sean Christopherson
2021-09-17 17:37 ` Paolo Bonzini
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87wnnhyolr.fsf@vitty.brq.redhat.com \
--to=vkuznets@redhat.com \
--cc=jmattson@google.com \
--cc=joro@8bytes.org \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=reijiw@google.com \
--cc=seanjc@google.com \
--cc=wanpengli@tencent.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox