(Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
@ 2024-12-06  2:42 Xiaoyao Li
  2024-12-06 18:41 ` Edgecombe, Rick P
  0 siblings, 1 reply; 13+ messages in thread
From: Xiaoyao Li @ 2024-12-06  2:42 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, Huang, Kai, Tony Lindgren, Xiaoyao Li, Zhao, Yan Y,
	Edgecombe, Rick P, Adrian Hunter, Reinette Chatre, Binbin Wu,
	Yamahata, Isaku, QEMU

This is a proposal for a potential future TDX Module feature to assist 
QEMU/KVM in configuring CPUID leafs for TD guests. It is only in the 
idea stage and not currently being implemented. We are looking for 
comments on the suitability for QEMU/KVM.

# Background

To correctly virtualize CPUID for TD, the VMM needs to understand the 
behavior of CPUID configuration for each CPUID bit, including whether 
the bit can be configured by the VMM and what the allowed value is.

There is an interface to query the CPUID bit information after the TD 
has been configured. However, this interface does not work before the TD 
is configured. The TDX module, along with its release, provides a 
separate JSON format file, cpuid_virtualization.json, for CPUID 
virtualization information. This file can be used by the VMM even before 
the TD is configured. The TDX module also provides an interface to query 
some limited CPUID information, including:

  - The configurability of a subset of CPUIDs via global metadata 
CPUID_CONFIG_VALUES.

  - The 'fixed0' and 'fixed1' bits of ATTRIBUTES and XFAM via global 
metadata. The VMM can infer the 'configurable' bits related to 
ATTRIBUTES/XFAM indirectly (the bits that are neither 'fixed0' nor 
'fixed1' are 'configurable').

For the remaining CPUID bits not covered by the above two categories, no 
TDX module query interface exists.

# Problem

While the VMM can use the JSON format CPUID information and may embed or 
translate that information into the code, it may face several challenges:

  - The JSON file varies with each TDX module release, which can 
complicate the VMM code. Additionally, depending on its own needs, the 
VMM may require more information than what is provided in the JSON file.

  - The JSON format cannot be easily parsed with low-level programming 
languages like C, which is typically used to write VMMs.

There was objection from KVM community for parsing the JSON and requests 
for a more friendly interface to query CPUID information for each 
specific TDX module.[0][1]

# Analysis

There are many virtualization types defined for single bit or bitfields 
in JSON file, e.g., 12 types in TDX 1.5.06:

   - fixed
   - configured
   - configured & native
   - XFAM & native
   - XFAM & configured & native
   - attributes & native
   - attributes & configured & native
   - CPUID_enabled & native
   - attributes & CPUID_enabled & native
   - attributes & CPUID_enabled & configured & native
   - calculated
   - special

And more types are getting added as TDX evolves.

Though so many types defined, for a single bit, it can only be one of three:
   - fixed0
   - fixed1
   - configurable

For example:
1. For type "configured & native", the bit is “fixed0” bit if the native 
value is 0, and the “configurable” bit if native value is 1.

2. For type "XFAM & native",
    a) the CPUID is “fixed0” if the corresponding XFAM bit is reported 
in XFAM_FIXED0, or the native value is 0;

    b) the CPUID bit is ‘fixed1’ if the corresponding XFAM bit is set in 
XFAM_FIXED1;

    c) otherwise, the CPUID is ‘configurable’ (indirectly by TD_PRRAMS.XFAM)

# Proposal

Current TDX module provides interface to report the “configurable” bits 
via global metadata CPUID_CONFIG_VALUES directly or via global metadata 
ATTRIBUTES/XFAM_fixed0/1 indirectly. But it lacks the interface to 
report the “fixed0” and “fixed1” bits generally (it only reports the 
fixed bits for ATTRIBUTES and XFAM).

We propose to add two new global metadata fields, CPUID_FIXED0_BITS and 
CPUID_FIXED1_BITS, for “fixed0” and “fixed1” bits information respectively.

The encoding of the two fields uses the same format as TDCS field 
CPUID_VALUES:

   Field code is composed as follows:
     - Bits 31:17  Reserved, must be 0
     - Bit  16     Leaf number bit 31
     - Bits 15:9   Leaf number bit 6:0
     - Bit 8       Sub-leaf not applicable flag
     - Bits 7:1    Sub-leaf number bits 6:0
     - Bit 0       Element index within field

   The same for returned result:
     - Element 0[31:0]:   EAX
     - Element 0[63:32]:  EBX
     - Element 1[31:0]:   ECX
     - Element 1[63:32]:  EDX

For CPUID_FIXED0_BITS, any bit in E[A,B,C,D]X is 0, means the bit is fixed0.
For CPUID_FIXED1_BITS, any bit in E[A,B,C,D]X is 1, means the bit is fixed1.

# Interaction with TDX_FEATURES0.VE_REDUCTION

TDX introduces a new feature VE_REDUCTION[2]. From the perspective of 
host VMM, VE_REDUCTION turns several CPUID bits from fixed1 to 
configurable, e.g., MTRR, MCA, MCE, etc. However, from the perspective 
of TD guest, it’s an opt-in feature. The actual value seen by TD guest 
depends on multiple factors: 1). If TD guest enables REDUCE_VE in 
TDCS.TD_CTLS, 2) TDCS.FEATURE_PARAVIRT_CTRL, 3) CPUID value configured 
by host VMM via TD_PARAMS.CPUID_CONFIG[]. (Please refer to latest TDX 
1.5 spec for more details.)

Since host VMM has no idea on the setting of 1) and 2) when creating the 
TD. We make the design to treat them as configurable bits and the global 
metadata interface doesn’t report them as fixed1 bits for simplicity.

Host VMM must be aware itself that the value of these VE_REDUCTION 
related CPUID bits might not be what it configures. The actual value 
seen by TD guest also depends on the guest enabling and configuration of 
VE_REDUCTION.

# POC

We did a POC in QEMU to verify the fixed0/1 data by such an interface is 
enough for userspace to validate and generate a supported vcpu model for 
TD guest.[3]

It retrieves the “fixed” type in JSON file and hardcodes them into two 
arrays, tdx_fixed0_bits and tdx_fixed1_bits. Note, it doesn’t handle the 
other types than “fixed” because 1) just a few of them falls into fixed0 
or fixed1 and 2) turning them into fixed0 or fixed0 needs to check 
various condition which complicates the POC. And in the POC it uses 
value 1 in tdx_fixed0_bits for fixed0 bits, while the proposed metadata 
interface uses value 0 to indicate fixed0 bits.

With the hardcoded information, VMM can validate the TD configuration 
requested from user early by checking whether a feature requested from 
users is allowed to be enabled and is allowed to be disabled.

When TDX module provides fixed0 and fixed1 via global metadata, QEMU can 
change to requested them from KVM to replace the hardcoded one.

[0] https://lore.kernel.org/all/ZhVdh4afvTPq5ssx@google.com/
[1] https://lore.kernel.org/all/ZhVsHVqaff7AKagu@google.com/
[2] https://cdrdv2.intel.com/v1/dl/getContent/733575
[3] 
https://lore.kernel.org/qemu-devel/20241105062408.3533704-49-xiaoyao.li@intel.com/ 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-06  2:42 (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits Xiaoyao Li
@ 2024-12-06 18:41 ` Edgecombe, Rick P
  2024-12-10  3:22   ` Xiaoyao Li
  0 siblings, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2024-12-06 18:41 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Chatre, Reinette,
	Zhao, Yan Y, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	Hunter, Adrian, Yamahata, Isaku, qemu-devel@nongnu.org

On Fri, 2024-12-06 at 10:42 +0800, Xiaoyao Li wrote:
> # Interaction with TDX_FEATURES0.VE_REDUCTION
> 
> TDX introduces a new feature VE_REDUCTION[2]. From the perspective of 
> host VMM, VE_REDUCTION turns several CPUID bits from fixed1 to 
> configurable, e.g., MTRR, MCA, MCE, etc. However, from the perspective 
> of TD guest, it’s an opt-in feature. The actual value seen by TD guest 
> depends on multiple factors: 1). If TD guest enables REDUCE_VE in 
> TDCS.TD_CTLS, 2) TDCS.FEATURE_PARAVIRT_CTRL, 3) CPUID value configured 
> by host VMM via TD_PARAMS.CPUID_CONFIG[]. (Please refer to latest TDX 
> 1.5 spec for more details.)
> 
> Since host VMM has no idea on the setting of 1) and 2) when creating the 
> TD. We make the design to treat them as configurable bits and the global 
> metadata interface doesn’t report them as fixed1 bits for simplicity.
> 
> Host VMM must be aware itself that the value of these VE_REDUCTION 
> related CPUID bits might not be what it configures. The actual value 
> seen by TD guest also depends on the guest enabling and configuration of 
> VE_REDUCTION.

As we've been working on this, I've started to wonder whether this is a halfway
solution that is not worth it. Today there are directly configurable bits,
XFAM/attribute controlled bits, other opt-ins (like #VE reduction). And this has
only gotten more complicated as time has gone on.

If we really want to fully solve the problem of userspace understanding which
configurations are possible, the TDX module would almost need to expose some
sort of CPUID logic DSL that could be used to evaluate user configuration.

On the other extreme we could just say, this kind of logic is just going to need
to be hand coded somewhere, like is currently done in the QEMU patches.

The solution in this proposal decreases the work the VMM has to do, but in the
long term won't remove hand coding completely. As long as we are designing
something, what kind of bar should we target?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-06 18:41 ` Edgecombe, Rick P
@ 2024-12-10  3:22   ` Xiaoyao Li
  2024-12-10 17:45     ` Edgecombe, Rick P
  0 siblings, 1 reply; 13+ messages in thread
From: Xiaoyao Li @ 2024-12-10  3:22 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Chatre, Reinette,
	Zhao, Yan Y, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	Hunter, Adrian, Yamahata, Isaku, qemu-devel@nongnu.org

On 12/7/2024 2:41 AM, Edgecombe, Rick P wrote:
> On Fri, 2024-12-06 at 10:42 +0800, Xiaoyao Li wrote:
>> # Interaction with TDX_FEATURES0.VE_REDUCTION
>>
>> TDX introduces a new feature VE_REDUCTION[2]. From the perspective of
>> host VMM, VE_REDUCTION turns several CPUID bits from fixed1 to
>> configurable, e.g., MTRR, MCA, MCE, etc. However, from the perspective
>> of TD guest, it’s an opt-in feature. The actual value seen by TD guest
>> depends on multiple factors: 1). If TD guest enables REDUCE_VE in
>> TDCS.TD_CTLS, 2) TDCS.FEATURE_PARAVIRT_CTRL, 3) CPUID value configured
>> by host VMM via TD_PARAMS.CPUID_CONFIG[]. (Please refer to latest TDX
>> 1.5 spec for more details.)
>>
>> Since host VMM has no idea on the setting of 1) and 2) when creating the
>> TD. We make the design to treat them as configurable bits and the global
>> metadata interface doesn’t report them as fixed1 bits for simplicity.
>>
>> Host VMM must be aware itself that the value of these VE_REDUCTION
>> related CPUID bits might not be what it configures. The actual value
>> seen by TD guest also depends on the guest enabling and configuration of
>> VE_REDUCTION.
> 
> As we've been working on this, I've started to wonder whether this is a halfway
> solution that is not worth it. Today there are directly configurable bits,
> XFAM/attribute controlled bits, other opt-ins (like #VE reduction). And this has
> only gotten more complicated as time has gone on.
> 
> If we really want to fully solve the problem of userspace understanding which
> configurations are possible, the TDX module would almost need to expose some
> sort of CPUID logic DSL that could be used to evaluate user configuration.
> 
> On the other extreme we could just say, this kind of logic is just going to need
> to be hand coded somewhere, like is currently done in the QEMU patches.

I think hand coded some specific handling for special case is acceptable 
when it's unavoidable. However, an auto-adaptive interface for general 
cases is always better than hand code/hard code something. E.g., current 
QEMU implementation hardcodes the fixed0 and fixed1 information based on 
TDX 1.5.06 spec. When different versions of TDX module have different 
fixed0 and fixed1 information, QEMU will needs interface to get the 
version of TDX module and maintain different information for each 
version of TDX module. It's a disaster IMHO.

> The solution in this proposal decreases the work the VMM has to do, but in the
> long term won't remove hand coding completely. As long as we are designing
> something, what kind of bar should we target?

For this specific #VE reduction case, I think userspace doesn't need to 
do any hand coding. Userspace just treats the bits related to #VE 
reduction as configurable as reported by TDX module/KVM. And userspace 
doesn't care if the value seen by TD guest is matched with what gets 
configured by it because they are out of control of userspace.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-10  3:22   ` Xiaoyao Li
@ 2024-12-10 17:45     ` Edgecombe, Rick P
  2024-12-17  1:53       ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2024-12-10 17:45 UTC (permalink / raw)
  To: Li, Xiaoyao, pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Chatre, Reinette,
	Zhao, Yan Y, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	Hunter, Adrian, Yamahata, Isaku, qemu-devel@nongnu.org

On Tue, 2024-12-10 at 11:22 +0800, Xiaoyao Li wrote:
> > The solution in this proposal decreases the work the VMM has to do, but in
> > the
> > long term won't remove hand coding completely. As long as we are designing
> > something, what kind of bar should we target?
> 
> For this specific #VE reduction case, I think userspace doesn't need to 
> do any hand coding. Userspace just treats the bits related to #VE 
> reduction as configurable as reported by TDX module/KVM. And userspace 
> doesn't care if the value seen by TD guest is matched with what gets 
> configured by it because they are out of control of userspace.

Besides a specific problem, here reduced #VE is also an example of increasing
complexity for TD CPUID. If we have more things like it, it could make this
interface too rigid. But another way to address it would be to get this
interface *and* understanding of "no more CPUID config complexity". And that may
be a good idea anyway.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-10 17:45     ` Edgecombe, Rick P
@ 2024-12-17  1:53       ` Sean Christopherson
  2024-12-17  4:27         ` Xiaoyao Li
  2024-12-17 21:31         ` Edgecombe, Rick P
  0 siblings, 2 replies; 13+ messages in thread
From: Sean Christopherson @ 2024-12-17  1:53 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Xiaoyao Li, pbonzini@redhat.com, Kai Huang,
	binbin.wu@linux.intel.com, Reinette Chatre, Yan Y Zhao,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org, Adrian Hunter,
	Isaku Yamahata, qemu-devel@nongnu.org

On Tue, Dec 10, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-12-10 at 11:22 +0800, Xiaoyao Li wrote:
> > > The solution in this proposal decreases the work the VMM has to do, but
> > > in the long term won't remove hand coding completely. As long as we are
> > > designing something, what kind of bar should we target?
> > 
> > For this specific #VE reduction case, I think userspace doesn't need to 
> > do any hand coding. Userspace just treats the bits related to #VE 
> > reduction as configurable as reported by TDX module/KVM. And userspace 
> > doesn't care if the value seen by TD guest is matched with what gets 
> > configured by it because they are out of control of userspace.
>
> Besides a specific problem, here reduced #VE is also an example of increasing
> complexity for TD CPUID. If we have more things like it, it could make this
> interface too rigid.

I agree with Rick in that having QEMU treat them as configurable is going to be
a disaster.  But I don't think it's actually problematic in practice.

If QEMU (or KVM) has no visibility into the state of the guest's view of the
affected features, then it doesn't matter whether they are fixed or configurable.
They're effectively Schrödinger's bits: until QEMU/KVM actually looks at them,
they're neither dead nor alive, and since QEMU/KVM *can't* look at them, who cares?

So, if the TDX Module *requires* them to be set/cleared when the TD is created,
then they should be reported as fixed.  If the TDX module doesn't care, then they
should be reported as configurable.  The fact that the guest can muck with things
under the hood doesn't factor into that logic.

If TDX pulls something like this for features that KVM cares about, then we have
problems, but that's already true today.  If a feature requires KVM support, it
doesn't really matter if the feature is fixed or configurable.  What matters is
that KVM has a chance to enforce that the feature can be used by the guest if
and only if KVM has the proper support in place.  Because if KVM is completely
unaware of a feature, it's impossible for KVM to know that the feature needs to
be rejected.

This isn't unique to TDX, CoCo, or firmware.  Every new feature that lands in
hardware needs to either be "benign" or have the appropriate virtualization
controls.  KVM already has to deal with cases where features can effectively be
used without KVM's knowledge.  E.g. there are plenty of instruction-level
virtualization holes, and SEV-ES doubled down by essentially forcing KVM to let
the guest write XCR0 and XSS directly.

It all works, so long as the hardware vendor doesn't screw up and let the guest
use a feature that impacts host safety and/or functionality, without the hypervisor's
knowledge.

So, just don't screw up :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-17  1:53       ` Sean Christopherson
@ 2024-12-17  4:27         ` Xiaoyao Li
  2024-12-17 21:31         ` Edgecombe, Rick P
  1 sibling, 0 replies; 13+ messages in thread
From: Xiaoyao Li @ 2024-12-17  4:27 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: pbonzini@redhat.com, Kai Huang, binbin.wu@linux.intel.com,
	Reinette Chatre, Yan Y Zhao, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, Adrian Hunter, Isaku Yamahata,
	qemu-devel@nongnu.org

On 12/17/2024 9:53 AM, Sean Christopherson wrote:
> On Tue, Dec 10, 2024, Rick P Edgecombe wrote:
>> On Tue, 2024-12-10 at 11:22 +0800, Xiaoyao Li wrote:
>>>> The solution in this proposal decreases the work the VMM has to do, but
>>>> in the long term won't remove hand coding completely. As long as we are
>>>> designing something, what kind of bar should we target?
>>>
>>> For this specific #VE reduction case, I think userspace doesn't need to
>>> do any hand coding. Userspace just treats the bits related to #VE
>>> reduction as configurable as reported by TDX module/KVM. And userspace
>>> doesn't care if the value seen by TD guest is matched with what gets
>>> configured by it because they are out of control of userspace.
>>
>> Besides a specific problem, here reduced #VE is also an example of increasing
>> complexity for TD CPUID. If we have more things like it, it could make this
>> interface too rigid.
> 
> I agree with Rick in that having QEMU treat them as configurable is going to be
> a disaster.  But I don't think it's actually problematic in practice.

Correct the proposal. It should be QEMU treats them as what KVM reports.

TDX module reports these #VE reduction related CPUIDs as configurable 
because it allows VMM to paravirt them. If KVM doesn't support the 
paravirt of them, KVM can clear them from configurable bits and add them 
to fixed0 bits when KVM reports to userspace.

> If QEMU (or KVM) has no visibility into the state of the guest's view of the
> affected features, then it doesn't matter whether they are fixed or configurable.
> They're effectively Schrödinger's bits: until QEMU/KVM actually looks at them,
> they're neither dead nor alive, and since QEMU/KVM *can't* look at them, who cares?

To some degree, I think it matters. As I explained above, if KVM reports 
it as configurable to userspace, it mean TDX module allows it to be 
configured and KVM allows it to be paravirtualized as well. So userspace 
can configure it as 1 when users wants it. This is how VMM is going to 
present the feature to TD guest.

However, how TD guest is going to use it depends on itself.
1) when TD guest doesn't enable #VE reduction: the configuration from 
VMM doesn't matter. The CPUIDs are fixed1 and related operation leads to 
#VE.

2) When TD guest enables #VE reduction and doesn't enable 
TDCS.FEATURE_PARAVIRT_CTRL of the related bit: the configuration from 
VMM doesn't matter. The CPUIDs are fixed0 and related operation leads to 
#GP.

3) When TD guest enables #VE reduction and enable 
TDCS.FEATURE_PARAVIRT_CTRL of the related bit: the configuration from 
VMM matters.
   - When VMM configures the bits to 1, the related operation leads to 
#VE (for paravirtualization).
   - When VMM configures the bits to 0, the related operation leads to #GP.

So for case 3), it does matters.

> So, if the TDX Module *requires* them to be set/cleared when the TD is created,
> then they should be reported as fixed.  If the TDX module doesn't care, then they
> should be reported as configurable.  The fact that the guest can muck with things
> under the hood doesn't factor into that logic.

yes, I agree on it.

> If TDX pulls something like this for features that KVM cares about, then we have
> problems, but that's already true today.  If a feature requires KVM support, it
> doesn't really matter if the feature is fixed or configurable.  What matters is
> that KVM has a chance to enforce that the feature can be used by the guest if
> and only if KVM has the proper support in place.  Because if KVM is completely
> unaware of a feature, it's impossible for KVM to know that the feature needs to
> be rejected.

I agree.

With the proposed fixed/fixed1 information, and in addition to the 
configurable bits, KVM can fully validate the TDX module against its 
capabilities. When violation occurs (e.g., some KVM unsupported bit 
being reported as fixed1 by TDX module), KVM can just refuse to enable TDX.

> This isn't unique to TDX, CoCo, or firmware.  Every new feature that lands in
> hardware needs to either be "benign" or have the appropriate virtualization
> controls.  KVM already has to deal with cases where features can effectively be
> used without KVM's knowledge.  E.g. there are plenty of instruction-level
> virtualization holes, and SEV-ES doubled down by essentially forcing KVM to let
> the guest write XCR0 and XSS directly.
> 
> It all works, so long as the hardware vendor doesn't screw up and let the guest
> use a feature that impacts host safety and/or functionality, without the hypervisor's
> knowledge.
> 
> So, just don't screw up :-)



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-17  1:53       ` Sean Christopherson
  2024-12-17  4:27         ` Xiaoyao Li
@ 2024-12-17 21:31         ` Edgecombe, Rick P
  2024-12-18  0:08           ` Sean Christopherson
  1 sibling, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2024-12-17 21:31 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Chatre, Reinette, Zhao, Yan Y, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	qemu-devel@nongnu.org, Hunter, Adrian

On Mon, 2024-12-16 at 17:53 -0800, Sean Christopherson wrote:
> Every new feature that lands in hardware needs to either be "benign" or have the
> appropriate virtualization controls.  KVM already has to deal with cases where
> features can effectively be used without KVM's knowledge.  E.g. there are plenty
> of instruction-level virtualization holes, and SEV-ES doubled down by essentially
> forcing KVM to let the guest write XCR0 and XSS directly.

We discussed this in the PUCK call.

It turns out there were two different ideas on how this fixed bit API would be
used. One is to help userspace understand which configurations are possible. For
this one, I'm not sure how helpful this proposal will be in the long run. I'll
respond on the other branch of the thread.

The other usage people were thinking of, which I didn't realize before, was to
prevent the TDX module from setting fixed bits that might require VMMs support
(i.e. save/restoring something that could affect the host). The rest of the mail
is about this issue.

Due to the steps involved in resolving this confusion, and that we didn't really
reach a conclusion, the discussion is hard to summarize. So instead I'll try to
re-kick it off with an idea which has bits and pieces of what people said...

I think we can't have the TDX module setting new fixed bits that require any VMM
enabling. When we finally have settled upstream TDX support, the TDX module
needs to understand what things KVM relies on so it doesn't break them with
updates. But new fixed CPUID bits that require VMM enabling to prevent host
issues seems like the kind of thing in general that just shouldn't happen.

As for new configurable bits that require VMM enabling. Adrian was suggesting
that the TDX module currently only has two guest CPUID bits that are problematic
for KVM today (and the next vcpu enter/exit series has a patch to forbid them).
But a re-check of this assertion is warranted.

It seems like an anti-pattern to have KVM maintaining any code to defend against
TDX module changes that could instead be handled with a promise. However, KVM
having code to defend against userspace prodding the TDX module to do something
bad to the host seems valid. So fixed bit issues should be handled with a
promise, but issues related to new configurable bits seems open.

Some options discussed on the call:

1. If we got a promise to require any new CPUID bits that clobber host state to
require an opt-in (attributes bit, etc) then we could get by with a promise for
that too. The current situation was basically to assume TDX module wouldn't open
up the issue with new CPUID bits (only attributes/xfam).
2. If we required any new configurable CPUID bits to save/restore host state
automatically then we could also get by, but then KVM's code that does host
save/restore would either be redundant or need a TDX branch.
3. If we prevent setting any CPUID bits not supported by KVM, we would need to
track these bits in KVM. The data backing GET_SUPPORTED_CPUID is not sufficient
for this purpose since it is actually more like "default values" then a mask of
supported bits. A patch to try to do this filtering was dropped after upstream
discussion.[0]

Other idea
----------
Previously we tried to maintain an allow list of KVM supported configurable bits
[0]. It was do-able, but not ideal. It would be smaller for KVM to protect
itself with a deny list of bits, or rather a list of bits that needs to be in
KVM_GET_SUPPORTED_CPUID, or they should not be allowed to be configured. But KVM
can't keep a list of bits that it doesn't know about.

But the TDX module does know which bits that it supports result in host state
getting clobbered. So we could ask TDX module to expose a list of bits that have
an effect on host state. We could check those against KVM_GET_SUPPORTED_CPUID.
That check could be expected to fit better than when we tried to massage
KVM_GET_SUPPORTED_CPUID to be a mask that includes all possible configurable
bits (multi-bit fields, etc).

In the meantime we could keep a list of all of today's host affecting bits. TDX
module would need to gate any new bits that effect host state behind a new sys-
wide opt-in that comes with the "clobber bits" metadata. Before entering a TD,
KVM would check the clobber bits in KVM's copy of CPUID against the TD's copy to
make sure everyone knows what they have to do.

(and also this opt-in stuff would need to be run by the TDX module team of
course)

It leaves open the possibility that there is some other bits KVM cares about
that don't have to do with clobbering host state. Not sure about it.

[0]
https://lore.kernel.org/kvm/20240812224820.34826-26-rick.p.edgecombe@intel.com/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-17 21:31         ` Edgecombe, Rick P
@ 2024-12-18  0:08           ` Sean Christopherson
  2024-12-19  1:56             ` Edgecombe, Rick P
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2024-12-18  0:08 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Kai Huang, binbin.wu@linux.intel.com, Xiaoyao Li, Reinette Chatre,
	Yan Y Zhao, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Isaku Yamahata, qemu-devel@nongnu.org,
	Adrian Hunter

On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-12-16 at 17:53 -0800, Sean Christopherson wrote:
> > Every new feature that lands in hardware needs to either be "benign" or have the
> > appropriate virtualization controls.  KVM already has to deal with cases where
> > features can effectively be used without KVM's knowledge.  E.g. there are plenty
> > of instruction-level virtualization holes, and SEV-ES doubled down by essentially
> > forcing KVM to let the guest write XCR0 and XSS directly.
> 
> We discussed this in the PUCK call.

Argh, I had a response to Xiaoyao all typed up and didn't hit "send" earlier today.

> It turns out there were two different ideas on how this fixed bit API would be
> used. One is to help userspace understand which configurations are possible. For
> this one, I'm not sure how helpful this proposal will be in the long run. I'll
> respond on the other branch of the thread.
> 
> The other usage people were thinking of, which I didn't realize before, was to
> prevent the TDX module from setting fixed bits that might require VMMs support
> (i.e. save/restoring something that could affect the host). The rest of the mail
> is about this issue.
> 
> Due to the steps involved in resolving this confusion, and that we didn't really
> reach a conclusion, the discussion is hard to summarize. So instead I'll try to
> re-kick it off with an idea which has bits and pieces of what people said...
> 
> I think we can't have the TDX module setting new fixed bits that require any VMM
> enabling. When we finally have settled upstream TDX support, the TDX module
> needs to understand what things KVM relies on so it doesn't break them with
> updates. But new fixed CPUID bits that require VMM enabling to prevent host
> issues seems like the kind of thing in general that just shouldn't happen.
> 
> As for new configurable bits that require VMM enabling. Adrian was suggesting
> that the TDX module currently only has two guest CPUID bits that are problematic
> for KVM today (and the next vcpu enter/exit series has a patch to forbid them).
> But a re-check of this assertion is warranted.
> 
> It seems like an anti-pattern to have KVM maintaining any code to defend against
> TDX module changes that could instead be handled with a promise. 

I disagree, sanity checking hardware and firmware is a good thing.  E.g. see KVM's
VMCS checks, the sanity checks for features SEV depends on, etc.

That said, I'm not terribly concerned about more features that are unconditionally
exposed to the guest, because that will cause problems for other reasons, i.e.
Intel should already be heavily incentivized to not do silly things.

> However, KVM having code to defend against userspace prodding the TDX module
> to do something bad to the host seems valid. So fixed bit issues should be
> handled with a promise, but issues related to new configurable bits seems
> open.
> 
> Some options discussed on the call:
> 
> 1. If we got a promise to require any new CPUID bits that clobber host state to
> require an opt-in (attributes bit, etc) then we could get by with a promise for
> that too. The current situation was basically to assume TDX module wouldn't open
> up the issue with new CPUID bits (only attributes/xfam).
> 2. If we required any new configurable CPUID bits to save/restore host state
> automatically then we could also get by, but then KVM's code that does host
> save/restore would either be redundant or need a TDX branch.
> 3. If we prevent setting any CPUID bits not supported by KVM, we would need to
> track these bits in KVM. The data backing GET_SUPPORTED_CPUID is not sufficient
> for this purpose since it is actually more like "default values" then a mask of
> supported bits. A patch to try to do this filtering was dropped after upstream
> discussion.[0]

The only CPUID bits that truly matter are those that are associated with hardware
features the TDX module allows the guest to use directly.  And for those, KVM
*must* know if they are fixed0 (inverted polarity only?), fixed1, or configurable.
As Adrian asserted, there probably aren't many of them.

For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
is completely irrelevant, at least as far as KVM cares, and to some extent as far
as QEMU cares.  This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
is asinine and should be ignored.  IMO, the TDX Module spec is entirely off the
mark in its assessment of paravirtualization.  Injecting a #VE instead of a #GP
isn't "paravirtualization".

Take TSC_DEADLINE as an example.  "Disabling" the feature from the guest's side
simply means that WRMSR #GPs instead of #VEs.  *Nothing* has changed from KVM's
perspective.  If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
no idea if the guest has opted in/out of #VE vs #GP.  And IMO, a sane guest will
never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
make a direct TDVMCALL and save itself a pointless exception.

  Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
  Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
  CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
  TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
  if the CPU’s native value is 1.

  If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
  (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
  value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE. 

  • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
    as non-existent. WRMSR or RDMSR attempts result in a #GP(0).

  • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
    in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.

Ditto for TME, MKTME.

FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
KVM or QEMU to care if it's fixed or configurable.  There's some crazy logic for
whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
once again, the TDX Module's view of MCA is irrelevant when it comes to handling
TDVMCALL for the machine check MSRs.

So I think this again purely comes to back to KVM correctness and safety.  More
specifically, the TDX Module needs to report features that are unconditionally
enabled or disabled and can't be emulated by KVM.  For everything else, I don't
see any reason to care what the TDX module does.

I'm pretty sure that gives us a way forward.  If there only a handful of features
that are unconditionally exposed to the guest, then KVM forces those features in
cpu_caps[*].  I.e. treat them kinda like XSAVES on AMD, where KVM assumes the
guest can use XSAVES if it's supported in hardware and XSAVE is exposed to the
guest, because AMD didn't provide an interception knob for XSAVES.

If the list is "too" long (sujbective) for KVM to hardcode, then we revisit and
get the TDX module to provide a list.

This probably doesn't solve Xiaoyao's UX problem in QEMU, but I think it gives
us a sane approach for KVM.

[*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@google.com

> Other idea
> ----------
> Previously we tried to maintain an allow list of KVM supported configurable bits
> [0]. It was do-able, but not ideal. It would be smaller for KVM to protect
> itself with a deny list of bits, or rather a list of bits that needs to be in
> KVM_GET_SUPPORTED_CPUID, or they should not be allowed to be configured. But KVM
> can't keep a list of bits that it doesn't know about.
> 
> But the TDX module does know which bits that it supports result in host state
> getting clobbered. So we could ask TDX module to expose a list of bits that have
> an effect on host state. We could check those against KVM_GET_SUPPORTED_CPUID.
> That check could be expected to fit better than when we tried to massage
> KVM_GET_SUPPORTED_CPUID to be a mask that includes all possible configurable
> bits (multi-bit fields, etc).
> 
> In the meantime we could keep a list of all of today's host affecting bits. TDX
> module would need to gate any new bits that effect host state behind a new sys-
> wide opt-in that comes with the "clobber bits" metadata. Before entering a TD,
> KVM would check the clobber bits in KVM's copy of CPUID against the TD's copy to
> make sure everyone knows what they have to do.
> 
> (and also this opt-in stuff would need to be run by the TDX module team of
> course)
> 
> It leaves open the possibility that there is some other bits KVM cares about
> that don't have to do with clobbering host state. Not sure about it.
> 
> [0]
> https://lore.kernel.org/kvm/20240812224820.34826-26-rick.p.edgecombe@intel.com/

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-18  0:08           ` Sean Christopherson
@ 2024-12-19  1:56             ` Edgecombe, Rick P
  2024-12-19  2:33               ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2024-12-19  1:56 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Chatre, Reinette, Zhao, Yan Y, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	qemu-devel@nongnu.org, Hunter, Adrian

On Tue, 2024-12-17 at 16:08 -0800, Sean Christopherson wrote:
> On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> > It seems like an anti-pattern to have KVM maintaining any code to defend against
> > TDX module changes that could instead be handled with a promise. 
> 
> I disagree, sanity checking hardware and firmware is a good thing.  E.g. see KVM's
> VMCS checks, the sanity checks for features SEV depends on, etc.

Ok. More softly, KVM should not do something overly complicated if it could
instead be handled by TDX module adopting reasonable policies. Sanity checks are
supposed to be easy. The reason I'm probing further is it feeds into how we
position things with the TDX module team.

> 
> That said, I'm not terribly concerned about more features that are unconditionally
> exposed to the guest, because that will cause problems for other reasons, i.e.
> Intel should already be heavily incentivized to not do silly things.

Agree. That was sort of the expectation until these two bits came up in the TD
enter/exit discussion.

> 
> > However, KVM having code to defend against userspace prodding the TDX module
> > to do something bad to the host seems valid. So fixed bit issues should be
> > handled with a promise, but issues related to new configurable bits seems
> > open.
> > 
> > Some options discussed on the call:
> > 
> > 1. If we got a promise to require any new CPUID bits that clobber host state to
> > require an opt-in (attributes bit, etc) then we could get by with a promise for
> > that too. The current situation was basically to assume TDX module wouldn't open
> > up the issue with new CPUID bits (only attributes/xfam).
> > 2. If we required any new configurable CPUID bits to save/restore host state
> > automatically then we could also get by, but then KVM's code that does host
> > save/restore would either be redundant or need a TDX branch.
> > 3. If we prevent setting any CPUID bits not supported by KVM, we would need to
> > track these bits in KVM. The data backing GET_SUPPORTED_CPUID is not sufficient
> > for this purpose since it is actually more like "default values" then a mask of
> > supported bits. A patch to try to do this filtering was dropped after upstream
> > discussion.[0]
> 
> The only CPUID bits that truly matter are those that are associated with hardware
> features the TDX module allows the guest to use directly.  And for those, KVM
> *must* know if they are fixed0 (inverted polarity only?), fixed1, or configurable.
> As Adrian asserted, there probably aren't many of them.

I don't follow why the fixed bits are special here. If any configuration can
affect the host, KVM needs to know about it. Whether it is fixed or
configurable.

I wonder if there could be some confusion about how much KVM can trust that its
view of the CPUID bits is the same as the TDX Modules? In the current patches
userspace is responsible for assembling KVM's CPUID data which it sets with
KVM_SET_CPUID2 like normal. It fetches all the set bits from the TDX module,
massages them, and pass them to KVM. So if a host affecting configurable bit is
set in the TDX module, but not in KVM then it could be a problem. I think we
need to reassess which bits could affect host state, and make sure we re-check
them before entering the TD. But we can't simply check that all bits match
because there are some bits that are set in KVM, but not TDX module (real PV
leafs, guestmaxpa, etc).

So that is how I arrived at that we need some list of host affecting bits to
verify match in the TD.

> 
> For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
> is completely irrelevant, at least as far as KVM cares, and to some extent as far
> as QEMU cares.  This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
> is asinine and should be ignored.  IMO, the TDX Module spec is entirely off the
> mark in its assessment of paravirtualization.  Injecting a #VE instead of a #GP
> isn't "paravirtualization".
>  
> Take TSC_DEADLINE as an example.  "Disabling" the feature from the guest's side
> simply means that WRMSR #GPs instead of #VEs.  *Nothing* has changed from KVM's
> perspective.  If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
> no idea if the guest has opted in/out of #VE vs #GP.  And IMO, a sane guest will
> never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
> make a direct TDVMCALL and save itself a pointless exception.
> 
>   Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
>   Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
>   CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
>   TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
>   if the CPU’s native value is 1.
> 
>   If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
>   (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
>   value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE. 
> 
>   • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
>     as non-existent. WRMSR or RDMSR attempts result in a #GP(0).
> 
>   • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
>     in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.
> 
> Ditto for TME, MKTME.
> 
> FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
> KVM or QEMU to care if it's fixed or configurable.  There's some crazy logic for
> whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
> once again, the TDX Module's view of MCA is irrelevant when it comes to handling
> TDVMCALL for the machine check MSRs.
> 
> So I think this again purely comes to back to KVM correctness and safety.  More
> specifically, the TDX Module needs to report features that are unconditionally
> enabled or disabled and can't be emulated by KVM.  For everything else, I don't
> see any reason to care what the TDX module does.
> 
> I'm pretty sure that gives us a way forward.  If there only a handful of features
> that are unconditionally exposed to the guest, then KVM forces those features in
> cpu_caps[*].

I see. Hmm. We could use this new interface to double check the fixed bits. It
seems like a relatively cheap sanity check.

There already is an interface to get CPUID bits (fixed and dynamic). But it only
works after a TD is configured. So if we want to do extra verification or
adjustments, we could use it before entering the TD. Basically, if we delay this
logic we don't need to wait for the fixed bit interface.

What is special about the new proposed fixed bit interface is that it can run
before a TD runs (when QEMU wants to do it's checking of the users args).

> 
>   I.e. treat them kinda like XSAVES on AMD, where KVM assumes the
> guest can use XSAVES if it's supported in hardware and XSAVE is exposed to the
> guest, because AMD didn't provide an interception knob for XSAVES.
> 
> If the list is "too" long (sujbective) for KVM to hardcode, then we revisit and
> get the TDX module to provide a list.
> 
> This probably doesn't solve Xiaoyao's UX problem in QEMU, but I think it gives
> us a sane approach for KVM.
> 
> [*] https://lore.kernel.org/all/20241128013424.4096668-1-seanjc@google.com




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-19  1:56             ` Edgecombe, Rick P
@ 2024-12-19  2:33               ` Sean Christopherson
  2024-12-19 17:52                 ` Edgecombe, Rick P
  2024-12-20  2:40                 ` Xiaoyao Li
  0 siblings, 2 replies; 13+ messages in thread
From: Sean Christopherson @ 2024-12-19  2:33 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Kai Huang, binbin.wu@linux.intel.com, Xiaoyao Li, Reinette Chatre,
	Yan Y Zhao, tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Isaku Yamahata, qemu-devel@nongnu.org,
	Adrian Hunter

On Thu, Dec 19, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-12-17 at 16:08 -0800, Sean Christopherson wrote:
> > On Tue, Dec 17, 2024, Rick P Edgecombe wrote:
> > > Some options discussed on the call:
> > > 
> > > 1. If we got a promise to require any new CPUID bits that clobber host state to
> > > require an opt-in (attributes bit, etc) then we could get by with a promise for
> > > that too. The current situation was basically to assume TDX module wouldn't open
> > > up the issue with new CPUID bits (only attributes/xfam).
> > > 2. If we required any new configurable CPUID bits to save/restore host state
> > > automatically then we could also get by, but then KVM's code that does host
> > > save/restore would either be redundant or need a TDX branch.
> > > 3. If we prevent setting any CPUID bits not supported by KVM, we would need to
> > > track these bits in KVM. The data backing GET_SUPPORTED_CPUID is not sufficient
> > > for this purpose since it is actually more like "default values" then a mask of
> > > supported bits. A patch to try to do this filtering was dropped after upstream
> > > discussion.[0]
> > 
> > The only CPUID bits that truly matter are those that are associated with hardware
> > features the TDX module allows the guest to use directly.  And for those, KVM
> > *must* know if they are fixed0 (inverted polarity only?), fixed1, or configurable.
> > As Adrian asserted, there probably aren't many of them.
> 
> I don't follow why the fixed bits are special here. If any configuration can
> affect the host, KVM needs to know about it. Whether it is fixed or
> configurable.

I think we're in violent agreement.  Ignore the last line about Adrian's assertion,
which was very poorly phrased, and the above boils down to:

  The only CPUID bits that truly matter are those that are associated with hardware
  features the TDX module allows the guest to use directly, and KVM *must* know
  if the bits are fixed0, fixed1, or configurable.

> I wonder if there could be some confusion about how much KVM can trust that its
> view of the CPUID bits is the same as the TDX Modules? In the current patches
> userspace is responsible for assembling KVM's CPUID data which it sets with
> KVM_SET_CPUID2 like normal. It fetches all the set bits from the TDX module,
> massages them, and pass them to KVM. So if a host affecting configurable bit is
> set in the TDX module, but not in KVM then it could be a problem. I think we
> need to reassess which bits could affect host state, and make sure we re-check
> them before entering the TD. But we can't simply check that all bits match
> because there are some bits that are set in KVM, but not TDX module (real PV
> leafs, guestmaxpa, etc).
> 
> So that is how I arrived at that we need some list of host affecting bits to
> verify match in the TD.

At the end of the day, the list is going to be human-generated.  For the UX side
of things, it makes sense to push that responsibility to the TDX Module, because
it should be trivially easy to derive from the source code.

But identifying CPUID bits that control features requires human intervention (or
I suppose AI that can cross-reference specs with code).

Again, I think we're in violent agreement.

> > For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
> > is completely irrelevant, at least as far as KVM cares, and to some extent as far
> > as QEMU cares.  This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
> > is asinine and should be ignored.  IMO, the TDX Module spec is entirely off the
> > mark in its assessment of paravirtualization.  Injecting a #VE instead of a #GP
> > isn't "paravirtualization".
> >  
> > Take TSC_DEADLINE as an example.  "Disabling" the feature from the guest's side
> > simply means that WRMSR #GPs instead of #VEs.  *Nothing* has changed from KVM's
> > perspective.  If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
> > no idea if the guest has opted in/out of #VE vs #GP.  And IMO, a sane guest will
> > never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
> > make a direct TDVMCALL and save itself a pointless exception.
> > 
> >   Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
> >   Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
> >   CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
> >   TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
> >   if the CPU’s native value is 1.
> > 
> >   If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
> >   (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
> >   value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE. 
> > 
> >   • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
> >     as non-existent. WRMSR or RDMSR attempts result in a #GP(0).
> > 
> >   • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
> >     in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.
> > 
> > Ditto for TME, MKTME.
> > 
> > FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
> > KVM or QEMU to care if it's fixed or configurable.  There's some crazy logic for
> > whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
> > once again, the TDX Module's view of MCA is irrelevant when it comes to handling
> > TDVMCALL for the machine check MSRs.
> > 
> > So I think this again purely comes to back to KVM correctness and safety.  More
> > specifically, the TDX Module needs to report features that are unconditionally
> > enabled or disabled and can't be emulated by KVM.  For everything else, I don't
> > see any reason to care what the TDX module does.
> > 
> > I'm pretty sure that gives us a way forward.  If there only a handful of features
> > that are unconditionally exposed to the guest, then KVM forces those features in
> > cpu_caps[*].
> 
> I see. Hmm. We could use this new interface to double check the fixed bits. It
> seems like a relatively cheap sanity check.
> 
> There already is an interface to get CPUID bits (fixed and dynamic). But it only
> works after a TD is configured. So if we want to do extra verification or
> adjustments, we could use it before entering the TD. Basically, if we delay this
> logic we don't need to wait for the fixed bit interface.

Oh, yeah, that'd work.  Grab the guest CPUID and then verify that bits KVM needs
to be 0 (or 1) are set according, and WARN+kill if there's a mismatch.

Honestly, I'd probably prefer that over using the fixed bit interface, as my gut
says it's less likely for the TDX Module to misreport what CPUID it has created
for the guest, than it is for the TDX module to generate a "fixed bits" list that
doesn't match the code.  E.g. KVM has (had?) more than a few CPUID features that
KVM emulates without actually reporting support to userspace.

> What is special about the new proposed fixed bit interface is that it can run
> before a TD runs (when QEMU wants to do it's checking of the users args).


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-19  2:33               ` Sean Christopherson
@ 2024-12-19 17:52                 ` Edgecombe, Rick P
  2024-12-20  2:40                 ` Xiaoyao Li
  1 sibling, 0 replies; 13+ messages in thread
From: Edgecombe, Rick P @ 2024-12-19 17:52 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, Li, Xiaoyao,
	Chatre, Reinette, Zhao, Yan Y, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
	qemu-devel@nongnu.org, Hunter, Adrian

On Wed, 2024-12-18 at 18:33 -0800, Sean Christopherson wrote:
> > So that is how I arrived at that we need some list of host affecting bits to
> > verify match in the TD.
> 
> At the end of the day, the list is going to be human-generated.  For the UX
> side
> of things, it makes sense to push that responsibility to the TDX Module,
> because
> it should be trivially easy to derive from the source code.

The other reason to push it to the TDX module is because newly invented bits on
future CPUs can only be known by TDX Modules that start to use them.

[snip]

> > 
> > There already is an interface to get CPUID bits (fixed and dynamic). But it
> > only
> > works after a TD is configured. So if we want to do extra verification or
> > adjustments, we could use it before entering the TD. Basically, if we delay
> > this
> > logic we don't need to wait for the fixed bit interface.
> 
> Oh, yeah, that'd work.  Grab the guest CPUID and then verify that bits KVM
> needs
> to be 0 (or 1) are set according, and WARN+kill if there's a mismatch.
> 
> Honestly, I'd probably prefer that over using the fixed bit interface, as my
> gut
> says it's less likely for the TDX Module to misreport what CPUID it has
> created
> for the guest, than it is for the TDX module to generate a "fixed bits" list
> that
> doesn't match the code.  E.g. KVM has (had?) more than a few CPUID features
> that
> KVM emulates without actually reporting support to userspace.

Ok, so we have a plan to:
1. Verify if there are more bits that affect host state
2. Encode this list in KVM for now
3. Check the bits match via the existing CPUID bit interface before the vCPU
enters the TD

Still to decide (not today):
I guess KVM=0,TDX=1 is the one to worry about, but might as well also check for
KVM=1,TDX=0. When KVM finds a conflict it must prevent entering the TD and
return an error to userspace. We could do this enforcement either on the first
enter, or with an additional per-vcpu "verify" ioctl.

We can take a look at the list of bits to decide. The current solution of
preventing configuration of the two known troublesome bits seems ok if that's
all.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-19  2:33               ` Sean Christopherson
  2024-12-19 17:52                 ` Edgecombe, Rick P
@ 2024-12-20  2:40                 ` Xiaoyao Li
  2024-12-20 16:59                   ` Sean Christopherson
  1 sibling, 1 reply; 13+ messages in thread
From: Xiaoyao Li @ 2024-12-20  2:40 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: Kai Huang, binbin.wu@linux.intel.com, Reinette Chatre, Yan Y Zhao,
	tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
	pbonzini@redhat.com, Isaku Yamahata, qemu-devel@nongnu.org,
	Adrian Hunter

On 12/19/2024 10:33 AM, Sean Christopherson wrote:
>>> For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
>>> is completely irrelevant, at least as far as KVM cares, and to some extent as far
>>> as QEMU cares.  This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
>>> is asinine and should be ignored.  IMO, the TDX Module spec is entirely off the
>>> mark in its assessment of paravirtualization.  Injecting a #VE instead of a #GP
>>> isn't "paravirtualization".
>>>   
>>> Take TSC_DEADLINE as an example.  "Disabling" the feature from the guest's side
>>> simply means that WRMSR #GPs instead of #VEs.*Nothing* has changed from KVM's
>>> perspective.  If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
>>> no idea if the guest has opted in/out of #VE vs #GP.  And IMO, a sane guest will
>>> never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
>>> make a direct TDVMCALL and save itself a pointless exception.
>>>
>>>    Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
>>>    Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
>>>    CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
>>>    TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
>>>    if the CPU’s native value is 1.
>>>
>>>    If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
>>>    (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
>>>    value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE.
>>>
>>>    • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
>>>      as non-existent. WRMSR or RDMSR attempts result in a #GP(0).
>>>
>>>    • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
>>>      in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.
>>>
>>> Ditto for TME, MKTME.
>>>
>>> FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
>>> KVM or QEMU to care if it's fixed or configurable.  There's some crazy logic for
>>> whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
>>> once again, the TDX Module's view of MCA is irrelevant when it comes to handling
>>> TDVMCALL for the machine check MSRs.
>>>
>>> So I think this again purely comes to back to KVM correctness and safety.  More
>>> specifically, the TDX Module needs to report features that are unconditionally
>>> enabled or disabled and can't be emulated by KVM.  For everything else, I don't
>>> see any reason to care what the TDX module does.
>>>
>>> I'm pretty sure that gives us a way forward.  If there only a handful of features
>>> that are unconditionally exposed to the guest, then KVM forces those features in
>>> cpu_caps[*].
>> I see. Hmm. We could use this new interface to double check the fixed bits. It
>> seems like a relatively cheap sanity check.
>>
>> There already is an interface to get CPUID bits (fixed and dynamic). But it only
>> works after a TD is configured. So if we want to do extra verification or
>> adjustments, we could use it before entering the TD. Basically, if we delay this
>> logic we don't need to wait for the fixed bit interface.
> Oh, yeah, that'd work.  Grab the guest CPUID and then verify that bits KVM needs
> to be 0 (or 1) are set according, and WARN+kill if there's a mismatch.
> 
> Honestly, I'd probably prefer that over using the fixed bit interface, as my gut
> says it's less likely for the TDX Module to misreport what CPUID it has created
> for the guest, than it is for the TDX module to generate a "fixed bits" list that
> doesn't match the code.  E.g. KVM has (had?) more than a few CPUID features that
> KVM emulates without actually reporting support to userspace.

The original motivation of the proposed fxied0 and fixed1 data is to 
complement the CPUID configurability report, which is important for 
userspace. E.g., Currently, what QEMU is doing is hardcoding the fixed0 
and fixed1 information of a specific TDX release to adjust the 
KVM_GET_SUPPORTED_CPUID result and gets a final "supported" CPUID set 
for TDX. Hardcoding is not a good idea and it's better that KVM can get 
the data from TDX module, then pass to userspace (of course KVM can 
tweak the data based on its own requirement). So, do you think it's 
useful to have TDX module report the fixed0/fixed1 data for this purpose?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits
  2024-12-20  2:40                 ` Xiaoyao Li
@ 2024-12-20 16:59                   ` Sean Christopherson
  0 siblings, 0 replies; 13+ messages in thread
From: Sean Christopherson @ 2024-12-20 16:59 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Rick P Edgecombe, Kai Huang, binbin.wu@linux.intel.com,
	Reinette Chatre, Yan Y Zhao, tony.lindgren@linux.intel.com,
	kvm@vger.kernel.org, pbonzini@redhat.com, Isaku Yamahata,
	qemu-devel@nongnu.org, Adrian Hunter

On Fri, Dec 20, 2024, Xiaoyao Li wrote:
> On 12/19/2024 10:33 AM, Sean Christopherson wrote:
> > > > For all other CPUID bits, what the TDX Module thinks and/or presents to the guest
> > > > is completely irrelevant, at least as far as KVM cares, and to some extent as far
> > > > as QEMU cares.  This includes the TDX Module's FEATURE_PARAVIRT_CTRL, which frankly
> > > > is asinine and should be ignored.  IMO, the TDX Module spec is entirely off the
> > > > mark in its assessment of paravirtualization.  Injecting a #VE instead of a #GP
> > > > isn't "paravirtualization".
> > > > Take TSC_DEADLINE as an example.  "Disabling" the feature from the guest's side
> > > > simply means that WRMSR #GPs instead of #VEs.*Nothing* has changed from KVM's
> > > > perspective.  If the guest makes a TDVMCALL to write IA32_TSC_DEADLINE, KVM has
> > > > no idea if the guest has opted in/out of #VE vs #GP.  And IMO, a sane guest will
> > > > never take a #VE or #GP if it wants to use TSC_DEADLINE; the kernel should instead
> > > > make a direct TDVMCALL and save itself a pointless exception.
> > > > 
> > > >    Enabling Guest TDs are not allowed to access the IA32_TSC_DEADLINE MSR directly.
> > > >    Virtualization of IA32_TSC_DEADLINE depends on the virtual value of
> > > >    CPUID(1).ECX[24] bit (TSC Deadline). The host VMM may configure (as an input to
> > > >    TDH.MNG.INIT) virtual CPUID(1).ECX[24] to be a constant 0 or allow it to be 1
> > > >    if the CPU’s native value is 1.
> > > > 
> > > >    If the TDX module supports #VE reduction, as enumerated by TDX_FEATURES0.VE_REDUCTION
> > > >    (bit 30), and the guest TD has set TD_CTLS.REDUCE_VE to 1, it may control the
> > > >    value of virtual CPUID(1).ECX[24] by writing TDCS.FEATURE_PARAVIRT_CTRL.TSC_DEADLINE.
> > > > 
> > > >    • If the virtual value of CPUID(1).ECX[24] is 0, IA32_TSC_DEADLINE is virtualized
> > > >      as non-existent. WRMSR or RDMSR attempts result in a #GP(0).
> > > > 
> > > >    • If the virtual value of CPUID(1).ECX[24] is 1, WRMSR or RDMSR attempts result
> > > >      in a #VE(CONFIG_PARAVIRT). This enables the TD’s #VE handler.
> > > > 
> > > > Ditto for TME, MKTME.
> > > > 
> > > > FEATURE_PARAVIRT_CTRL.MCA is even weirder, but I still don't see any reason for
> > > > KVM or QEMU to care if it's fixed or configurable.  There's some crazy logic for
> > > > whether or not CR4.MCE can be cleared, but the host can't see guest CR4, and so
> > > > once again, the TDX Module's view of MCA is irrelevant when it comes to handling
> > > > TDVMCALL for the machine check MSRs.
> > > > 
> > > > So I think this again purely comes to back to KVM correctness and safety.  More
> > > > specifically, the TDX Module needs to report features that are unconditionally
> > > > enabled or disabled and can't be emulated by KVM.  For everything else, I don't
> > > > see any reason to care what the TDX module does.
> > > > 
> > > > I'm pretty sure that gives us a way forward.  If there only a handful of features
> > > > that are unconditionally exposed to the guest, then KVM forces those features in
> > > > cpu_caps[*].
> > > I see. Hmm. We could use this new interface to double check the fixed bits. It
> > > seems like a relatively cheap sanity check.
> > > 
> > > There already is an interface to get CPUID bits (fixed and dynamic). But it only
> > > works after a TD is configured. So if we want to do extra verification or
> > > adjustments, we could use it before entering the TD. Basically, if we delay this
> > > logic we don't need to wait for the fixed bit interface.
> > Oh, yeah, that'd work.  Grab the guest CPUID and then verify that bits KVM needs
> > to be 0 (or 1) are set according, and WARN+kill if there's a mismatch.
> > 
> > Honestly, I'd probably prefer that over using the fixed bit interface, as my gut
> > says it's less likely for the TDX Module to misreport what CPUID it has created
> > for the guest, than it is for the TDX module to generate a "fixed bits" list that
> > doesn't match the code.  E.g. KVM has (had?) more than a few CPUID features that
> > KVM emulates without actually reporting support to userspace.
> 
> The original motivation of the proposed fxied0 and fixed1 data is to
> complement the CPUID configurability report, which is important for
> userspace. E.g., Currently, what QEMU is doing is hardcoding the fixed0 and
> fixed1 information of a specific TDX release to adjust the
> KVM_GET_SUPPORTED_CPUID result and gets a final "supported" CPUID set for
> TDX. Hardcoding is not a good idea and it's better that KVM can get the data
> from TDX module, then pass to userspace (of course KVM can tweak the data
> based on its own requirement). So, do you think it's useful to have TDX
> module report the fixed0/fixed1 data for this purpose?

I'm definitely supportive of KVM passing on accurate information, so long as KVM's
ABI isn't too crazy.

That said, I'm starting to agree with Rick's assessment that trying to enumerate
fixed CPUID feature bits is becoming a fool's errand as the TDX architecture gets
more and more complex.

But _that_ said, if userspace ever needs to pivot on the TDX Module *version*,
then IMO that's a non-starter.  E.g. QEMU shouldn't have to hardcode fixed0/fixed1
bits based on TDX 1.5.whatever vs. TDX 2.0.whatever.

One alternative idea to trying to enumerate every fixed bit would be to mimic what
the VMX architecture does for fixed CR0 bits.  Use TDX Module spec 1.5.06 (or whatever
version makes the most sense) as the base, and then adjust fixed/configurable
information based on *features*.  E.g. when unrestricted guest is enabled, CR0.PG
and CR0.PE switch from fixed0 to configurable.  At a glance, I think the whole #VE
reduction madness could follow a similar path.

And then if the TDX tries to add fixed bits in the future that aren't explicitly
and clearly tied to some feature enablement, we collectively reject that TDX spec.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-12-20 17:00 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-06  2:42 (Proposal) New TDX Global Metadata To Report FIXED0 and FIXED1 CPUID Bits Xiaoyao Li
2024-12-06 18:41 ` Edgecombe, Rick P
2024-12-10  3:22   ` Xiaoyao Li
2024-12-10 17:45     ` Edgecombe, Rick P
2024-12-17  1:53       ` Sean Christopherson
2024-12-17  4:27         ` Xiaoyao Li
2024-12-17 21:31         ` Edgecombe, Rick P
2024-12-18  0:08           ` Sean Christopherson
2024-12-19  1:56             ` Edgecombe, Rick P
2024-12-19  2:33               ` Sean Christopherson
2024-12-19 17:52                 ` Edgecombe, Rick P
2024-12-20  2:40                 ` Xiaoyao Li
2024-12-20 16:59                   ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).