From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paolo Bonzini
Subject: Re: [PATCH v2 4/4] KVM: x86: get CPL from SS.DPL
Date: Mon, 26 May 2014 14:38:37 +0200
Message-ID: <538335CD.2000101@redhat.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
To: Wei Huang, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 jan.kiszka@siemens.com, gleb@kernel.org, avi.kivity@gmail.com
Return-path:
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 26/05/2014 01:21, Wei Huang wrote:
>> CS.RPL is not equal to the CPL in the few instructions between
>> setting CR0.PE and reloading CS.  And CS.DPL is also not equal
>> to the CPL for conforming code segments.
>
> Out of curiosity, could you elaborate on the problem with this
> CPL gap window, such as whether it breaks any VMs or tests? From the
> Linux kernel code, it seems the kernel enables CR0.PE and immediately
> does an ljmpl to reload CS. This window is very small and I am curious
> how severe the problem could be.

It is almost guaranteed to happen if you run iPXE and meanwhile
kick QEMU continuously:

# qemu-system-x86_64 -monitor unix:/tmp/m1,server,nowait -net ... &
# yes 'info cpus' | nc -U /tmp/m1

>> However, SS.DPL *is* always equal to the CPL except for the weird
>
> Is this specified anywhere in the SDM as a requirement for x86 OSes?
> If so, maybe provide a pointer to support this.

The Intel manuals mention in several places that SS.DPL=CPL.  All the
mentions are in the VMX sections of the manual, though I've found
non-Intel material saying that system-management mode also uses SS.DPL
as the CPL:

* "SS.DPL corresponds to the logical processor's current privilege
level (CPL)" (footnote in 26.3.1.5 Checks on Guest Non-Register State).

* "SS.DPL is always loaded from the SS access-rights field.
This will be the current privilege level (CPL) after the VM entry
completes" (26.3.2.2 Loading Guest Segment Registers and
Descriptor-Table Registers).

* "VMX-critical state [...] consists of the following: (1) SS.DPL (the
current privilege level);" (34.14.1 Default Treatment of SMI Delivery
[in VMX mode]).

AMD, instead, says that "the SS segment base, limit and attributes are
not modified" by SYSRET.  It almost looks as if AMD processors never use
SS.DPL; almost, because searching for "SS.attr" in the AMD manuals
shows that the processor does write to SS.attr sometimes.  The SVM
documentation says "The processor reads the current privilege level
from the CPL field in the VMCB, not from SS.DPL.  However, SS.DPL should
match the CPL field" and sneakily leaves out what happens if they do
not match...

>> case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
>> value in the STAR MSR, but forces CPL=3 (Intel instead forces
>> SS.DPL=SS.RPL=CPL=3).
>
> Thinking out loud here... Should we force SYSRET's SS.RPL to be 3 when
> the VM updates the STAR MSR? Following the same thought, does it make
> sense to check (and force) SS.DPL==3 when the STAR MSR is being
> updated? Will forcing SYSRET SS.DPL=3 break any OS? I think any
> reasonable OS would probably set SS.RPL=SS.DPL=3.

Yes, in fact I have wondered how much of the AMD behavior is a bug.

We could emulate the Intel behavior on AMD by shadowing the STAR MSR:
the guest would read back the intended SS.DPL and SS.RPL, but the
processor would actually always run with bits 49-48 of STAR set to 3.
This should ensure that CPL=SS.DPL always, even on AMD.  I'm not sure
this is worth much, though...
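To make the layout concrete, here is a stand-alone user-space sketch of
that shadowing idea (the helper names are made up; this is not KVM
code).  SYSRET builds the SS selector from STAR[63:48] plus 8, so bits
49-48 of STAR end up as the RPL bits of SS:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the SS selector that SYSRET will load, derived
 * from the STAR MSR.  STAR[63:48] is the SYSRET CS selector base; SS
 * is that value + 8, so STAR bits 49-48 become SS.RPL. */
static uint16_t sysret_ss_selector(uint64_t star)
{
	return (uint16_t)((star >> 48) + 8);
}

/* The shadowing idea: keep the guest-written STAR value for MSR reads,
 * but load the hardware MSR with bits 49-48 forced to 3, so SYSRET
 * always produces SS.RPL = SS.DPL = 3, matching the Intel behavior. */
static uint64_t star_shadow_for_hw(uint64_t guest_star)
{
	return guest_star | (3ULL << 48);
}
```

Whether the guest-visible copy and the hardware value can really be
kept separate this cheaply is exactly the open question above.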
Paolo

>> So this patch:
>>
>> - modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
>>   the above case with SYSRET is not broken further, and the way
>>   to fix it would be to pass the CPL to userspace and back
>>
>> - modifies VMX to always return the CPL from SS.DPL (except
>>   forcing it to 0 if we are emulating real mode via vm86 mode;
>>   in vm86 mode all DPLs have to be 3, but real mode does allow
>>   privileged instructions).  It also removes the CPL cache,
>>   which becomes a duplicate of the SS access rights cache.
>>
>> This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
>
> nit-picking: s/KVM_IOCTL_SET_SREGS/KVM_SET_SREGS ioctl/, to
> match the ioctl name exactly.
>
>> CR0.PE=1 but before CS has been reloaded.
>>
>> Signed-off-by: Paolo Bonzini
>> ---
>>  arch/x86/include/asm/kvm_host.h |  1 -
>>  arch/x86/kvm/svm.c              | 35 ++++++++++++++---------------------
>>  arch/x86/kvm/vmx.c              | 24 ++++--------------------
>>  3 files changed, 18 insertions(+), 42 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index e21aee98a5c2..49314155b66c 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -130,7 +130,6 @@ enum kvm_reg_ex {
>>  	VCPU_EXREG_PDPTR = NR_VCPU_REGS,
>>  	VCPU_EXREG_CR3,
>>  	VCPU_EXREG_RFLAGS,
>> -	VCPU_EXREG_CPL,
>>  	VCPU_EXREG_SEGMENTS,
>>  };
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 0b7d58d0c5fb..ec8366c5cfea 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -1338,21 +1338,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
>>  		wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
>>  }
>>
>> -static void svm_update_cpl(struct kvm_vcpu *vcpu)
>> -{
>> -	struct vcpu_svm *svm = to_svm(vcpu);
>> -	int cpl;
>> -
>> -	if (!is_protmode(vcpu))
>> -		cpl = 0;
>> -	else if (svm->vmcb->save.rflags & X86_EFLAGS_VM)
>> -		cpl = 3;
>> -	else
>> -		cpl = svm->vmcb->save.cs.selector & 0x3;
>> -
>> -	svm->vmcb->save.cpl = cpl;
>> -}
>> -
>>  static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>>  {
>>  	return to_svm(vcpu)->vmcb->save.rflags;
>> @@ -1360,11 +1345,12 @@ static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>>
>>  static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>>  {
>> -	unsigned long old_rflags = to_svm(vcpu)->vmcb->save.rflags;
>> -
>> +	/*
>> +	 * Any change of EFLAGS.VM is accompanied by a reload of SS
>> +	 * (caused by either a task switch or an inter-privilege IRET),
>> +	 * so we do not need to update the CPL here.
>> +	 */
>>  	to_svm(vcpu)->vmcb->save.rflags = rflags;
>> -	if ((old_rflags ^ rflags) & X86_EFLAGS_VM)
>> -		svm_update_cpl(vcpu);
>>  }
>>
>>  static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>> @@ -1631,8 +1617,15 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>>  		s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
>>  		s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
>>  	}
>> -	if (seg == VCPU_SREG_CS)
>> -		svm_update_cpl(vcpu);
>> +
>> +	/*
>> +	 * This is always accurate, except if SYSRET returned to a segment
>> +	 * with SS.DPL != 3.  Intel does not have this quirk, and always
>> +	 * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
>> +	 * would entail passing the CPL to userspace and back.
>> +	 */
>> +	if (seg == VCPU_SREG_SS)
>> +		svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
>>
>>  	mark_dirty(svm->vmcb, VMCB_SEG);
>>  }
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 6f7463f53ed9..a267108403f5 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -414,7 +414,6 @@ struct vcpu_vmx {
>>  	struct kvm_vcpu vcpu;
>>  	unsigned long host_rsp;
>>  	u8 fail;
>> -	u8 cpl;
>>  	bool nmi_known_unmasked;
>>  	u32 exit_intr_info;
>>  	u32 idt_vectoring_info;
>> @@ -3150,10 +3149,6 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
>>  	fix_pmode_seg(vcpu, VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]);
>>  	fix_pmode_seg(vcpu, VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]);
>>  	fix_pmode_seg(vcpu, VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]);
>> -
>> -	/* CPL is always 0 when CPU enters protected mode */
>> -	__set_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
>> -	vmx->cpl = 0;
>>  }
>>
>>  static void fix_rmode_seg(int seg, struct kvm_segment *save)
>> @@ -3555,22 +3550,14 @@ static int vmx_get_cpl(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>
>> -	if (!is_protmode(vcpu))
>> +	if (unlikely(vmx->rmode.vm86_active))
>>  		return 0;
>> -
>> -	if (!is_long_mode(vcpu)
>> -	    && (kvm_get_rflags(vcpu) & X86_EFLAGS_VM)) /* if virtual 8086 */
>> -		return 3;
>> -
>> -	if (!test_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail)) {
>> -		__set_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
>> -		vmx->cpl = vmx_read_guest_seg_selector(vmx, VCPU_SREG_CS) & 3;
>> +	else {
>> +		int ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
>> +		return AR_DPL(ar);
>>  	}
>> -
>> -	return vmx->cpl;
>>  }
>>
>> -
>>  static u32 vmx_segment_access_rights(struct kvm_segment *var)
>>  {
>>  	u32 ar;
>> @@ -3598,8 +3585,6 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
>>  	const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
>>
>>  	vmx_segment_cache_clear(vmx);
>> -	if (seg == VCPU_SREG_CS)
>> -		__clear_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
>>
>>  	if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
>>  		vmx->rmode.segs[seg] = *var;
>> @@ -7471,7 +7456,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>
>>  	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
>>  				  | (1 << VCPU_EXREG_RFLAGS)
>> -				  | (1 << VCPU_EXREG_CPL)
>>  				  | (1 << VCPU_EXREG_PDPTR)
>>  				  | (1 << VCPU_EXREG_SEGMENTS)
>>  				  | (1 << VCPU_EXREG_CR3));
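For reference, the CPL derivation the patch settles on can be modeled
in a few lines of stand-alone C (a sketch only; the struct and function
names are invented and this is not the kernel code): CPL comes from
SS.DPL, except when real mode is emulated via vm86 mode, where all DPLs
are 3 but the guest must still be allowed privileged instructions.

```c
#include <assert.h>

/* Stand-alone model of the new CPL derivation (names invented). */
struct cpl_model {
	int ss_dpl;       /* DPL field of the cached SS access rights */
	int vm86_active;  /* emulating real mode through vm86 mode? */
};

static int model_get_cpl(const struct cpl_model *m)
{
	if (m->vm86_active)
		return 0;	/* real mode behaves as CPL 0 */
	return m->ss_dpl & 3;	/* otherwise CPL == SS.DPL */
}
```

Because SS.DPL fully determines the answer outside the vm86 case, the
separate CPL cache the old code maintained really is redundant with the
SS access-rights cache, which is why the patch can delete it.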