From mboxrd@z Thu Jan 1 00:00:00 1970
From: Paolo Bonzini
Subject: Re: [PATCH v2 4/4] KVM: x86: get CPL from SS.DPL
Date: Mon, 26 May 2014 14:38:37 +0200
Message-ID: <538335CD.2000101@redhat.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
To: Wei Huang, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 jan.kiszka@siemens.com, gleb@kernel.org, avi.kivity@gmail.com
Return-path:
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

On 26/05/2014 01:21, Wei Huang wrote:
>> CS.RPL is not equal to the CPL in the few instructions between
>> setting CR0.PE and reloading CS.  And CS.DPL is also not equal
>> to the CPL for conforming code segments.
>
> Out of curiosity, could you elaborate on the problem with this
> CPL gap window, such as whether it breaks any VMs or tests? From the
> Linux kernel code, it seems the kernel enables CR0.PE and immediately
> does an ljmpl to reload CS. This window is very small and I am curious
> how severe the problem could be.

It is almost guaranteed to happen if you run iPXE and meanwhile
kick QEMU continuously:

# qemu-system-x86_64 -monitor unix:/tmp/m1,server,nowait -net ... &
# yes 'info cpus' | nc -U /tmp/m1

>> However, SS.DPL *is* always equal to the CPL except for the weird
>
> Is this specified anywhere in the SDM as a requirement for x86 OSes?
> If so, maybe provide a pointer to support this.

The Intel manuals mention in several places that SS.DPL=CPL.  All the
mentions are in the VMX sections of the manual, though I've found
non-Intel material saying that system-management mode also uses SS.DPL
as the CPL:

* "SS.DPL corresponds to the logical processor's current privilege
level (CPL)" (footnote in 26.3.1.5 Checks on Guest Non-Register State).

* "SS.DPL is always loaded from the SS access-rights field.
This will be the current privilege level (CPL) after the VM entry
completes" (26.3.2.2 Loading Guest Segment Registers and
Descriptor-Table Registers).

* "VMX-critical state [...] consists of the following: (1) SS.DPL (the
current privilege level);" (34.14.1 Default Treatment of SMI Delivery
[in VMX mode]).

AMD, instead, says that "the SS segment base, limit and attributes are
not modified" by SYSRET.  It almost looks as if AMD processors never use
SS.DPL; almost, because searching for "SS.attr" in the AMD manuals
shows that the processor does write to SS.attr sometimes.  The SVM
documentation says "The processor reads the current privilege level
from the CPL field in the VMCB, not from SS.DPL.  However, SS.DPL should
match the CPL field" and sneakily leaves out what happens if they do
not match...

>> case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
>> value in the STAR MSR, but forces CPL=3 (Intel instead forces
>> SS.DPL=SS.RPL=CPL=3).
>
> Thinking out loud here... Should we force SYSRET's SS.RPL to be 3 when
> the VM updates the STAR MSR? Following the same thought, does it make
> sense to check (and force) SS.DPL==3 when the STAR MSR is being
> updated? Will forcing SYSRET SS.DPL=3 break any OS? I think any
> reasonable OS would probably set SS.RPL=SS.DPL=3.

Yes, in fact I have wondered how much of the AMD behavior is a bug.

We could emulate the Intel behavior on AMD by shadowing the STAR MSR:
the guest would read back the intended SS.DPL and SS.RPL, but the
processor would actually always run with bits 49-48 of STAR set to 3.
This should ensure that CPL=SS.DPL always, even on AMD.  I'm not sure
this is worth much, though...
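To make the layout concrete, here is a stand-alone user-space sketch of
that shadowing idea (the helper names are made up; this is not KVM
code).  SYSRET builds the SS selector from STAR[63:48] plus 8, so bits
49-48 of STAR end up as the RPL bits of SS:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the SS selector that SYSRET will load, derived
 * from the STAR MSR.  STAR[63:48] is the SYSRET CS selector base; SS
 * is that value + 8, so STAR bits 49-48 become SS.RPL. */
static uint16_t sysret_ss_selector(uint64_t star)
{
	return (uint16_t)((star >> 48) + 8);
}

/* The shadowing idea: keep the guest-written STAR value for MSR reads,
 * but load the hardware MSR with bits 49-48 forced to 3, so SYSRET
 * always produces SS.RPL = SS.DPL = 3, matching the Intel behavior. */
static uint64_t star_shadow_for_hw(uint64_t guest_star)
{
	return guest_star | (3ULL << 48);
}
```

Whether the guest-visible copy and the hardware value can really be
kept separate this cheaply is exactly the open question above.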
Paolo

>> So this patch:
>>
>> - modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
>>   the above case with SYSRET is not broken further, and the way
>>   to fix it would be to pass the CPL to userspace and back
>>
>> - modifies VMX to always return the CPL from SS.DPL (except
>>   forcing it to 0 if we are emulating real mode via vm86 mode;
>>   in vm86 mode all DPLs have to be 3, but real mode does allow
>>   privileged instructions).  It also removes the CPL cache,
>>   which becomes a duplicate of the SS access rights cache.
>>
>> This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
>
> nit-picking: s/KVM_IOCTL_SET_SREGS/KVM_SET_SREGS ioctl/, to
> match the ioctl name exactly.
>
>> CR0.PE=1 but before CS has been reloaded.
>>
>> Signed-off-by: Paolo Bonzini
>> ---
>>  arch/x86/include/asm/kvm_host.h |  1 -
>>  arch/x86/kvm/svm.c              | 35 ++++++++++++++---------------------
>>  arch/x86/kvm/vmx.c              | 24 ++++--------------------
>>  3 files changed, 18 insertions(+), 42 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index e21aee98a5c2..49314155b66c 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -130,7 +130,6 @@ enum kvm_reg_ex {
>>  	VCPU_EXREG_PDPTR = NR_VCPU_REGS,
>>  	VCPU_EXREG_CR3,
>>  	VCPU_EXREG_RFLAGS,
>> -	VCPU_EXREG_CPL,
>>  	VCPU_EXREG_SEGMENTS,
>>  };
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 0b7d58d0c5fb..ec8366c5cfea 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -1338,21 +1338,6 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
>>  		wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
>>  }
>>
>> -static void svm_update_cpl(struct kvm_vcpu *vcpu)
>> -{
>> -	struct vcpu_svm *svm = to_svm(vcpu);
>> -	int cpl;
>> -
>> -	if (!is_protmode(vcpu))
>> -		cpl = 0;
>> -	else if (svm->vmcb->save.rflags & X86_EFLAGS_VM)
>> -		cpl = 3;
>> -	else
>> -		cpl = svm->vmcb->save.cs.selector & 0x3;
>> -
>> -	svm->vmcb->save.cpl = cpl;
>> -}
>> -
>>  static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>>  {
>>  	return to_svm(vcpu)->vmcb->save.rflags;
>> @@ -1360,11 +1345,12 @@ static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
>>
>>  static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>>  {
>> -	unsigned long old_rflags = to_svm(vcpu)->vmcb->save.rflags;
>> -
>> +	/*
>> +	 * Any change of EFLAGS.VM is accompanied by a reload of SS
>> +	 * (caused by either a task switch or an inter-privilege IRET),
>> +	 * so we do not need to update the CPL here.
>> +	 */
>>  	to_svm(vcpu)->vmcb->save.rflags = rflags;
>> -	if ((old_rflags ^ rflags) & X86_EFLAGS_VM)
>> -		svm_update_cpl(vcpu);
>>  }
>>
>>  static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>> @@ -1631,8 +1617,15 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
>>  		s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
>>  		s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
>>  	}
>> -	if (seg == VCPU_SREG_CS)
>> -		svm_update_cpl(vcpu);
>> +
>> +	/*
>> +	 * This is always accurate, except if SYSRET returned to a segment
>> +	 * with SS.DPL != 3.  Intel does not have this quirk, and always
>> +	 * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
>> +	 * would entail passing the CPL to userspace and back.
>> +	 */
>> +	if (seg == VCPU_SREG_SS)
>> +		svm->vmcb->save.cpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
>>
>>  	mark_dirty(svm->vmcb, VMCB_SEG);
>>  }
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 6f7463f53ed9..a267108403f5 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -414,7 +414,6 @@ struct vcpu_vmx {
>>  	struct kvm_vcpu vcpu;
>>  	unsigned long host_rsp;
>>  	u8 fail;
>> -	u8 cpl;
>>  	bool nmi_known_unmasked;
>>  	u32 exit_intr_info;
>>  	u32 idt_vectoring_info;
>> @@ -3150,10 +3149,6 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
>>  	fix_pmode_seg(vcpu, VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]);
>>  	fix_pmode_seg(vcpu, VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]);
>>  	fix_pmode_seg(vcpu, VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]);
>> -
>> -	/* CPL is always 0 when CPU enters protected mode */
>> -	__set_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
>> -	vmx->cpl = 0;
>>  }
>>
>>  static void fix_rmode_seg(int seg, struct kvm_segment *save)
>> @@ -3555,22 +3550,14 @@ static int vmx_get_cpl(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>
>> -	if (!is_protmode(vcpu))
>> +	if (unlikely(vmx->rmode.vm86_active))
>>  		return 0;
>> -
>> -	if (!is_long_mode(vcpu)
>> -	    && (kvm_get_rflags(vcpu) & X86_EFLAGS_VM)) /* if virtual 8086 */
>> -		return 3;
>> -
>> -	if (!test_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail)) {
>> -		__set_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
>> -		vmx->cpl = vmx_read_guest_seg_selector(vmx, VCPU_SREG_CS) & 3;
>> +	else {
>> +		int ar = vmx_read_guest_seg_ar(vmx, VCPU_SREG_SS);
>> +		return AR_DPL(ar);
>>  	}
>> -
>> -	return vmx->cpl;
>>  }
>>
>> -
>>  static u32 vmx_segment_access_rights(struct kvm_segment *var)
>>  {
>>  	u32 ar;
>> @@ -3598,8 +3585,6 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
>>  	const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
>>
>>  	vmx_segment_cache_clear(vmx);
>> -	if (seg == VCPU_SREG_CS)
>> -		__clear_bit(VCPU_EXREG_CPL, (ulong *)&vcpu->arch.regs_avail);
>>
>>  	if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
>>  		vmx->rmode.segs[seg] = *var;
>> @@ -7471,7 +7456,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>
>>  	vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)
>>  				  | (1 << VCPU_EXREG_RFLAGS)
>> -				  | (1 << VCPU_EXREG_CPL)
>>  				  | (1 << VCPU_EXREG_PDPTR)
>>  				  | (1 << VCPU_EXREG_SEGMENTS)
>>  				  | (1 << VCPU_EXREG_CR3));
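For reference, the CPL derivation the patch settles on can be modeled
in a few lines of stand-alone C (a sketch only; the struct and function
names are invented and this is not the kernel code): CPL comes from
SS.DPL, except when real mode is emulated via vm86 mode, where all DPLs
are 3 but the guest must still be allowed privileged instructions.

```c
#include <assert.h>

/* Stand-alone model of the new CPL derivation (names invented). */
struct cpl_model {
	int ss_dpl;       /* DPL field of the cached SS access rights */
	int vm86_active;  /* emulating real mode through vm86 mode? */
};

static int model_get_cpl(const struct cpl_model *m)
{
	if (m->vm86_active)
		return 0;	/* real mode behaves as CPL 0 */
	return m->ss_dpl & 3;	/* otherwise CPL == SS.DPL */
}
```

Because SS.DPL fully determines the answer outside the vm86 case, the
separate CPL cache the old code maintained really is redundant with the
SS access-rights cache, which is why the patch can delete it.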