public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Reiji Watanabe <reijiw@google.com>
Subject: Re: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()
Date: Thu, 16 Sep 2021 09:19:12 +0200	[thread overview]
Message-ID: <87wnnhyolr.fsf@vitty.brq.redhat.com> (raw)
In-Reply-To: <YUIunxwjea/wq3gd@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>> > +{
>> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> > +
>> > +	init_vmcs(vmx);
>> > +
>> > +	if (nested)
>> > +		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
>> > +
>> > +	vcpu_setup_sgx_lepubkeyhash(vcpu);
>> > +
>> > +	vmx->nested.posted_intr_nv = -1;
>> > +	vmx->nested.current_vmptr = -1ull;
>> > +	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
>> 
>> What would happen in (hypothetical) case when enlightened VMCS is
>> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
>> nested_release_evmcs() (called from
>> nested_vmx_handle_enlightened_vmptrld(), for example) will not do 
>> kvm_vcpu_unmap() while it should.
>
> The short answer is that there's a lot of stuff that needs to be addressed before
> KVM can expose a RESET ioctl().  My goal with these patches is to carve out the
> stubs and move the few bits of RESET emulation into the "stubs".  This is the same
> answer for the MSR question/comment at the end.
>
>> This, however, got me thinking: should we free all-things-nested with
>> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
>> find us doing that... (I do remember that INIT is blocked in VMX-root
>> mode and nobody else besides kvm_arch_vcpu_create()/
>> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
>> should at least add a WARN_ON() guardian here?
>
> I think that makes sense.  Maybe use CR0 as a sentinel since it has a non-zero
> RESET value?  E.g. WARN if CR0 is non-zero at RESET.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..3ac074376821 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         unsigned long new_cr0;
>         u32 eax, dummy;
>
> +       /*
> +        * <comment about KVM not supporting arbitrary RESET>
> +        */
> +       WARN_ON_ONCE(!init_event && old_cr0);
> +
>         kvm_lapic_reset(vcpu, init_event);
>
>         vcpu->arch.hflags = 0;
>

Should work (assuming nothing in VMX would want to call vmx_vcpu_reset()/
__vmx_vcpu_reset() directly).

> Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly
> reset MMU context at vCPU RESET/INIT") technically introduced a bug.  kvm_vcpu_reset()
> does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is
> (correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset().  init_vmcs()
> doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and
> so VMREAD(GUEST_CR0) is technically consuming garbage.  In practice, it's consuming
> '0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating
> the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE.
>
> And staring more at kvm_vcpu_reset(), this code is terrifying for INIT
>
> 	memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> 	vcpu->arch.regs_avail = ~0;
> 	vcpu->arch.regs_dirty = ~0;
>
> because it means cr0 and cr4 are marked available+dirty without immediately writing
> vcpu->arch.cr0/cr4.  And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG
> via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow.
> Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in
> vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in
> kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh.
>
> Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is
> consuming stale data.  It doesn't matter in practice because the old value is
> only used to re-evaluate vmx->emulation_required, which is guaranteed to be up
> refreshed by vmx_set_cr0() and friends.  PDPTRs and EXIT_INFO are in a similar
> boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but
> marking them dirty without actually updating the cached values is wrong.
>
> There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT.
> That bug has gone unnoticed because no real word BIOS/kernel is going to rely on
> INIT to set CR3=0.  
>
> I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(),
> and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where
> x86 code writes a vcpu->arch register directly.  That would fix the CR0 read bug
> and also prevent subtle bugs from sneaking in.  Adding new EXREGS would be slightly
> more costly, but IMO that's a good thing as would force us to actually think about
> how to handle each register.
>
> E.g. implement this over 2-3 patches:
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 114847253e0a..743146ac8307 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         kvm_set_cr8(vcpu, 0);
>
>         vmx_segment_cache_clear(vmx);
> +       kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
>
>         seg_setup(VCPU_SREG_CS);
>         vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..ab907a0b9eeb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         int r;
>
>         vcpu->arch.last_vmentry_cpu = -1;
> +       vcpu->arch.regs_avail = ~0;
> +       vcpu->arch.regs_dirty = ~0;
>
>         if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
>                 vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> @@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>                 vcpu->arch.xcr0 = XFEATURE_MASK_FP;
>         }
>
> +       /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
>         memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> -       vcpu->arch.regs_avail = ~0;
> -       vcpu->arch.regs_dirty = ~0;
> +       kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
>
>         /*
>          * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
> @@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
>         kvm_rip_write(vcpu, 0xfff0);
>
> +       vcpu->arch.cr3 = 0;
> +       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> +
>         /*
>          * CR0.CD/NW are set on RESET, preserved on INIT.  Note, some versions
>          * of Intel's SDM list CD/NW as being set on INIT, but they contradict
>

A selftest for vCPU create/reset would be really helpful. I can even
volunteer to [eventually] write one :-)

-- 
Vitaly


  reply	other threads:[~2021-09-16  7:19 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-14 23:08 [PATCH 0/3] KVM: x86: Clean up RESET "emulation" Sean Christopherson
2021-09-14 23:08 ` [PATCH 1/3] KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation Sean Christopherson
2021-09-14 23:08 ` [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset() Sean Christopherson
2021-09-15 10:30   ` Vitaly Kuznetsov
2021-09-15 17:34     ` Sean Christopherson
2021-09-16  7:19       ` Vitaly Kuznetsov [this message]
2021-09-16 19:01         ` Sean Christopherson
2021-09-14 23:08 ` [PATCH 3/3] KVM: SVM: Move RESET emulation to svm_vcpu_reset() Sean Christopherson
2021-09-17 16:15 ` [PATCH 0/3] KVM: x86: Clean up RESET "emulation" Paolo Bonzini
2021-09-17 17:34   ` Sean Christopherson
2021-09-17 17:37     ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wnnhyolr.fsf@vitty.brq.redhat.com \
    --to=vkuznets@redhat.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=reijiw@google.com \
    --cc=seanjc@google.com \
    --cc=wanpengli@tencent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox