* [PATCH 00/21] TDX MMU Part 2
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

Hi,

This series picks up where “TDX MMU prep series part 1”[0] left off in 
implementing the parts of TDX support that deal with shared and private 
memory. Part 1 focused on changes to the generic x86 parts of the KVM MMU 
code that will be needed by TDX. This series focuses on the parts of the 
TDX support that will live in the Intel side of KVM. These parts include 
actually manipulating the S-EPT (private memory), and other special 
handling around the Shared EPT. 

There is a larger team working on TDX KVM base enabling. The patches were
originally authored by Sean Christopherson and Isaku Yamahata, but this
series in particular also represents the work of Yan Y Zhao, Isaku and
myself.

I think the series is in ok shape at this point, but not quite ready to 
move upstream. However, when it seems to be in generally good shape, we 
might think about whether TDX MMU part 1 is ready for promotion.

Base of this series
===================
The changes required for TDX support are too large to effectively move
upstream as one series. As a result, they have been broken into a bunch of
smaller series to be applied sequentially. Based on PUCK discussion we are
going to be pipelining the review of these series, such that series are
posted before their pre-reqs land in a maintainer branch. While the first
breakout series (MMU prep) could be applied to kvm-coco-queue directly,
this one is based on some pre-req series that have not landed upstream. The
order of pre-reqs is:

1. Commit 909f9d422f59 in kvm-coco-queue
   This commit includes the "TDX MMU Prep" series, but not "TDX vCPU/VM
   creation". The following pre-reqs depend on Sean's VMX initialization
   changes[1], which are currently in kvm/queue.
2. Kai’s host metadata series v3 [2]
3. KVM/TDX Module init series [3]
4. Binbin's "Check hypercall's exit to userspace generically" [4]
5. vCPU/VM creation series [5]

This is quite a few pre-reqs at this point. 1-4 are fairly mature so 
hopefully those will fall off soon.

Per offline discussion with Dave Hansen, the current plan is for the 
seamcall export patch to be expanded into a series that implements and 
exports each seamcall needed by KVM in arch/x86 code. Both this series and 
(5) rely on the export of the raw seamcall procedure. So future revisions 
of those two series will include patches to add the needed seamcall 
implementations into arch/x86 code. The current thought is to send them 
through the KVM tree with their respective breakout series, and with acks
from x86 maintainers.

Private/shared memory in TDX background 
======================================= 
Confidential computing solutions have concepts of private and shared 
memory. Often the guest accesses either private or shared memory via a bit 
in the guest PTE. Solutions like SEV treat this bit more like a permission 
bit, whereas solutions like TDX and ARM CCA treat it more like a GPA bit. In
the latter case, the host maps private memory in one half of the address
space and shared memory in the other. For TDX these two halves are mapped by
different EPT roots. The private half (also called Secure EPT in Intel 
documentation) gets managed by the privileged TDX Module. The shared half 
is managed by the untrusted part of the VMM (KVM).
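
To make the GPA-bit model concrete, here is a small standalone
illustration (not code from this series) of how the shared bit splits the
GPA space. The bit position follows the GPAW rules referenced later in
this letter (bit 47, or bit 51 with the larger GPAW):

#include <linux/bits.h>
#include <linux/types.h>

/* Illustrative only: pick the shared bit position based on GPAW. */
static inline u64 tdx_shared_gpa_bit(bool max_gpaw)
{
        return max_gpaw ? BIT_ULL(51) : BIT_ULL(47);
}

/* A GPA with the shared bit clear is private; with it set, shared. */
static inline bool gpa_is_private(u64 gpa, u64 shared_bit)
{
        return !(gpa & shared_bit);
}

/* The same guest memory appears at two aliases; only the alias matching
 * the page's current shared/private state is backed. */
static inline u64 gpa_to_shared_alias(u64 gpa, u64 shared_bit)
{
        return gpa | shared_bit;
}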

In addition to the separate roots for private and shared, there are 
limitations on what operations can be done on the private side. Like SNP, 
TDX wants to protect against protected memory being reset or otherwise 
scrambled by the host. In order to prevent this, the guest has to take 
specific action to “accept” memory after changes are made by the VMM to 
the private EPT. This prevents the VMM from performing many of the usual 
memory management operations that involve zapping and refaulting memory. 
The private memory is also always RWX and cannot have VMM-specified cache
attributes applied.

TDX memory implementation
=========================
The following describes how TDX memory management is implemented in KVM.

Creating shared EPT 
-------------------- 
Shared EPT handling is relatively simple compared to private memory. It is 
managed from within KVM. The main differences between shared EPT and EPT 
in a normal VM are that the root is set with a TDVMCS field (via 
SEAMCALL), and that a GFN from a memslot perspective needs to be mapped at 
an offset in the EPT. For the former, this series plumbs in the 
load_mmu_pgd() operation to the correct field for the shared EPT. For the 
latter, previous patches have laid the groundwork for roots managed
directly by KVM (called direct roots) to be mapped at an offset based on
the VM-scoped gfn_direct_bits field. So this series sets gfn_direct_bits
to the proper value.

Creating private EPT 
--------------------
In previous patches, the concept of “mirrored roots” was introduced. Such
roots maintain a KVM side “mirror” of the “external” EPT by keeping an 
unmapped EPT tree within the KVM MMU code. When changing these mirror 
EPTs, the KVM MMU code calls out via x86_ops to update the external EPT. 
This series adds implementations for these “external” ops for TDX to 
create and manage “private” memory via TDX module APIs.

Managing S-EPT with the TDX Module 
----------------------------------
The TDX module allows the TD’s private memory to be managed via SEAMCALLs. 
This management consists of operating on two internal elements:

1. The private EPT, which the TDX module calls the S-EPT. It maps the
   private half of the GPA space using an EPT tree.

2. The HKID, which represents private encryption keys used for encrypting 
   TD memory. The CPU doesn’t guarantee cache coherency between these
   encryption keys, so memory that is encrypted with one of these keys
   needs to be reclaimed for use on the host in special ways.

This series will primarily focus on the SEAMCALLs for managing the private 
EPT. Consideration of the HKID is needed for when the TD is torn down.

Populating TDX Private memory 
----------------------------- 
TDX allows the EPT mapping the TD’s private memory to be modified in 
limited ways. There are SEAMCALLs for building and tearing down the EPT 
tree, as well as mapping pages into the private EPT.

As for building and tearing down the EPT page tables, it is relatively 
simple. There are SEAMCALLs for installing and removing them. However, the 
current implementation only supports adding private EPT page tables, and 
leaves them installed for the lifetime of the TD. For teardown, the 
details are discussed in a later section.

As for populating and zapping private SPTEs, there are SEAMCALLs for this
as well. The zapping case will be described in detail later. As for the
populating case, there are two categories: before the TD is finalized and
after the TD is finalized. Both of these scenarios go through the TDP MMU map
path. The changes done previously to introduce “mirror” and “external” 
page tables handle directing SPTE installation operations through the 
set_external_spte() op.

In the “after” case, the TDX set_external_spte() handler simply calls a 
SEAMCALL (TDH.MEM.PAGE.AUG).
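
As a rough sketch (not the exact code in this series; the hook signature,
error plumbing, and the tdh_mem_page_aug() output operands are simplified
assumptions), the "after" handler boils down to:

/* Sketch: map a 4KB private page after the TD is finalized via
 * TDH.MEM.PAGE.AUG. The real set_external_spte() implementation also
 * handles the mapping level, BUSY retries and premap accounting. */
static int tdx_mem_page_aug_sketch(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);  /* assumed, like to_tdx() */
        hpa_t hpa = pfn_to_hpa(pfn);
        u64 entry, level_state;                     /* assumed output operands */
        u64 err;

        err = tdh_mem_page_aug(kvm_tdx, gfn_to_gpa(gfn), hpa, &entry, &level_state);
        if (err == TDX_ERROR_SEPT_BUSY)
                return -EBUSY;                      /* let the TDP MMU retry */
        if (KVM_BUG_ON(err, kvm))
                return -EIO;

        return 0;
}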

For the before case, it is a bit more complicated as it requires both 
setting the private SPTE *and* copying in the initial contents of the page 
at the same time. For TDX this is done via the KVM_TDX_INIT_MEM_REGION 
ioctl, which is effectively the kvm_gmem_populate() operation.

For SNP, the private memory can be pre-populated first, and faulted in 
later like normal. But for TDX these need to both happen at the same
time, and the setting of the private SPTE needs to happen in a different
way than the “after” case described above. It needs to use the
TDH.MEM.PAGE.ADD SEAMCALL, which does both the copying in of the data and
the setting of the SPTE.

Without extensive modification to the fault path, it’s not possible to
utilize this callback from the set_external_spte() handler because the
source page for the data to be copied in is not known deep down in this
callchain. So instead the post-populate callback does the three-step
process below (a rough code sketch follows the list).

1. Pre-fault the memory into the mirror EPT, but have the 
   set_external_spte() not make any SEAMCALLs.

2. Check that the page is still faulted into the mirror EPT under read
   mmu_lock that is held over this and the following step.

3. Call TDH.MEM.PAGE.ADD with the HPA of the page to copy data from, and 
   the private page installed in the mirror EPT to use for the private 
   mapping.
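
A condensed sketch of those three steps, using the kvm_tdp_map_page() and
kvm_tdp_mmu_gpa_is_mapped() helpers exported later in this series; the
tdx_mem_page_add() helper, the error-code flags and the surrounding
kvm_gmem_populate() plumbing are simplified assumptions:

/* Sketch of populating one initial page for KVM_TDX_INIT_MEM_REGION.
 * Measurement (TDH.MR.EXTEND) and the premap counter are omitted;
 * tdx_mem_page_add() stands in for the TDH.MEM.PAGE.ADD wrapper. */
static int tdx_populate_page_sketch(struct kvm_vcpu *vcpu, gpa_t gpa,
                                    hpa_t dest_hpa, hpa_t src_hpa)
{
        struct kvm *kvm = vcpu->kvm;
        u8 level;
        int r;

        /* 1. Pre-fault into the mirror EPT; no SEAMCALL is made here. */
        r = kvm_tdp_map_page(vcpu, gpa, PFERR_WRITE_MASK | PFERR_PRIVATE_ACCESS,
                             &level);
        if (r)
                return r;

        read_lock(&kvm->mmu_lock);

        /* 2. Re-check the mapping is still present under mmu_lock. */
        if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa)) {
                r = -EAGAIN;
                goto out;
        }

        /* 3. TDH.MEM.PAGE.ADD copies from src_hpa and installs the private
         * mapping at dest_hpa in one shot. */
        r = tdx_mem_page_add(kvm, gpa, dest_hpa, src_hpa);
out:
        read_unlock(&kvm->mmu_lock);
        return r;
}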

The scheme involves some assumptions about the operations that might 
operate on the mirrored EPT before the VM is finalized. It assumes that no 
other memory will be faulted into the mirror EPT that is not also added
via TDH.MEM.PAGE.ADD. If this is violated, the KVM MMU may not see private
memory faulted in there later and so not make the proper external SPTE
callbacks. A similar problem also exists for SNP, and there was discussion
of enforcing this in a more general way. In this series a counter is used
to enforce that the number of pre-faulted pages is the same as the number
of pages added via KVM_TDX_INIT_MEM_REGION. It is probably worth discussing
whether this serves any additional error handling benefits.

TDX TLB flushing 
----------------
For TDX, TLB flushing needs to happen in different ways depending on 
whether private and/or shared EPT needs to be flushed. Shared EPT can be 
flushed like normal EPT with INVEPT. To avoid reading the TD's EPTP out
from the TDX module, this series flushes the shared EPT with a type 2
INVEPT. Private TLB entries can be flushed this way too (via type 2).
However, since the TDX
module needs to enforce some guarantees around which private memory is 
mapped in the TD, it requires these operations to be done in special ways 
for private memory.

For flushing private memory, three methods will be used. First, it can be
flushed directly via the TDH.VP.FLUSH SEAMCALL. This flush is of the INVEPT
type 1 variety (i.e. mappings associated with the TD).

The second method is part of a sequence of SEAMCALLs for removing a guest 
page. The sequence looks like:

1. TDH.MEM.RANGE.BLOCK - Remove RWX bits from entry (similar to KVM’s zap). 

2. TDH.MEM.TRACK - Increment the TD TLB epoch, which is a per-TD counter 

3. Kick off all vCPUs - In order to force them to have to re-enter.

4. TDH.MEM.PAGE.REMOVE - Actually remove the page and make it available for
   other use.

5. TDH.VP.ENTER - On re-entering TDX module will see the epoch is
   incremented and flush the TLB.

The third method is that during TDX module init, TDH.SYS.LP.INIT is
used to online a CPU for TDX usage. It invokes an INVEPT type 2 to flush
all mappings in the TLB.
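
A sketch of how KVM wraps steps 2 and 3 of the removal sequence follows.
The tdh_mem_track() wrapper is assumed to follow the tdx_ops.h style, and
BUSY retries are omitted:

/* Sketch: bump the TD's TLB epoch and kick every vCPU out of the guest so
 * the next TDH.VP.ENTER flushes stale private translations. */
static void tdx_track_sketch(struct kvm *kvm)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);  /* assumed helper */
        u64 err;

        /* Callers hold the MMU write lock, which serializes trackers (see
         * "Removal of tdh_mem_track counter" below). */
        lockdep_assert_held_write(&kvm->mmu_lock);

        err = tdh_mem_track(kvm_tdx);               /* TDH.MEM.TRACK */
        KVM_BUG_ON(err, kvm);

        /* Force all vCPUs out of the guest; they re-enter with the new epoch. */
        kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}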

TDX TLB flushing in KVM 
----------------------- 
During runtime, for normal (TDP MMU, non-nested) guests, KVM will do TLB
flushes in 4 scenarios:

(1) kvm_mmu_load()

    After EPT is loaded, call kvm_x86_flush_tlb_current() to invalidate
    TLBs for current vCPU loaded EPT on current pCPU.

(2) Loading vCPU to a new pCPU

    Send request KVM_REQ_TLB_FLUSH to the current vCPU. The request handler
    will call kvm_x86_flush_tlb_all() to flush all EPTs associated with the
    new pCPU.

(3) When EPT mapping has changed (after removing or permission reduction) 
    (e.g. in kvm_flush_remote_tlbs())

    Send request KVM_REQ_TLB_FLUSH to all vCPUs by kicking them all off.
    The request handler on each vCPU will call kvm_x86_flush_tlb_all() to
    invalidate TLBs for all EPTs associated with the pCPU.

(4) When an EPT change only affects the current vCPU, e.g. the virtual APIC
    mode changed.

    Send request KVM_REQ_TLB_FLUSH_CURRENT, the request handler will call 
    kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded 
    EPT on current pCPU.

Only the first 3 are relevant to TDX. They are implemented as follows. 

(1) kvm_mmu_load() 

    Only the shared EPT root is loaded in this path. The TDX module does 
    not require any assurances about the operation, so the 
    flush_tlb_current()->ept_sync_global() can be called as normal. 

(2) vCPU load 

    When a vCPU migrates to a new logical processor, it has to be flushed
    on the old pCPU. This is different from normal VMs, where the INVEPT
    is executed on the new pCPU. The TDX behavior comes from a requirement
    that a vCPU can only be associated with one pCPU at a time. This
    flush happens via the TDH.VP.FLUSH SEAMCALL, invoked from the
    vcpu_load op callback on the old CPU via IPI.
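
A sketch of the flush-on-the-old-pCPU step is below; the tdh_vp_flush()
wrapper and the associated-vCPU bookkeeping are assumptions, only the IPI
shape is shown:

/* Sketch: TDH.VP.FLUSH must run on the pCPU the vCPU is currently
 * associated with, so send an IPI to that CPU rather than flushing on the
 * new pCPU as is done for non-TDX VMs. */
static void tdx_flush_vp_fn(void *arg)
{
        struct kvm_vcpu *vcpu = arg;

        tdh_vp_flush(to_tdx(vcpu));        /* assumed tdx_ops.h-style wrapper */
}

static void tdx_flush_vp_on_cpu_sketch(struct kvm_vcpu *vcpu)
{
        int cpu = vcpu->cpu;               /* last pCPU the vCPU was loaded on */

        if (cpu == -1)
                return;

        smp_call_function_single(cpu, tdx_flush_vp_fn, vcpu, 1);
}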

(3) Removing a private SPTE 

    This is the more complicated flow. It is done in a simple way for now 
    and is especially inefficient during VM teardown. The plan is to get a 
    basic functional version working and optimize some of these flows 
    later.

    When a private page mapping is removed, the core MMU code calls the 
    newly added remove_external_spte() op, and flushes the TLB on all vCPUs.
    But TDX can’t rely on doing that for private memory, so it has its own
    process for making sure the private page is removed. This flow
    (TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, TDH.MEM.PAGE.REMOVE) is done
    within the remove_external_spte() implementation as described in the
    “TDX TLB flushing” section above.

    After that, back in the core MMU code, KVM will call 
    kvm_flush_remote_tlbs*() resulting in an INVEPT. Despite that, when 
    the vCPUs re-enter (TDH.VP.ENTER) the TD, the TDX module will do 
    another INVEPT for its own reassurance.
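
Put together, a sketch of the removal path looks like the below; wrapper
names and argument lists are simplified assumptions based on the tdx_ops.h
style, and the page level and post-removal cache flush are omitted:

/* Sketch of the remove_external_spte() flow for one private 4KB page:
 * BLOCK -> TRACK/kick (tdx_track_sketch() above) -> PAGE.REMOVE. */
static int tdx_remove_private_spte_sketch(struct kvm *kvm, gfn_t gfn)
{
        struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);  /* assumed helper */
        gpa_t gpa = gfn_to_gpa(gfn);
        u64 err;

        /* 1. TDH.MEM.RANGE.BLOCK: no new TLB entries can be created. */
        err = tdh_mem_range_block(kvm_tdx, gpa);
        if (KVM_BUG_ON(err, kvm))
                return -EIO;

        /* 2-3. Increment the TD epoch and kick vCPUs out of the guest. */
        tdx_track_sketch(kvm);

        /* 4. TDH.MEM.PAGE.REMOVE: the page is no longer mapped in the TD
         * and can be reclaimed for other host use. */
        err = tdh_mem_page_remove(kvm_tdx, gpa);
        if (KVM_BUG_ON(err, kvm))
                return -EIO;

        return 0;
}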

Private memory teardown 
----------------------- 
Tearing down private memory involves reclaiming three types of resources 
from the TDX module: 

 1. TD’s HKID 

    To reclaim the TD’s HKID, no mappings may be mapped with it. 

 2. Private guest pages (mapped with HKID) 
 3. Private page tables that map private pages (mapped with HKID) 

    From the TDX module’s perspective, to reclaim guest private pages they
    need to be prevented from being accessed via the HKID (unmapped and TLB
    flushed), their HKID-associated cachelines need to be flushed, and
    they need to be marked as no longer in use by the TD in the TDX module's
    internal tracking (PAMT).

During runtime private PTEs can be zapped as part of memslot deletion or 
when memory converts from shared to private, but private page tables and
HKIDs are not torn down until the TD is being destructed. This means the
operation to zap private guest mapped pages needs to do the required cache
writeback under the assumption that other vCPUs may be active, but the
PTs do not.

TD teardown resource reclamation
--------------------------------
The code that does the TD teardown is organized such that when an HKID is 
reclaimed:
1. vCPUs will no longer enter the TD
2. The TLB is flushed on all CPUs
3. The HKID associated cachelines have been flushed.

So at that point most of the steps needed to reclaim TD private pages and 
page tables have already been done and the reclaim operation only needs to 
update the TDX module’s tracking of page ownership. For simplicity each 
operation only supports one scenario: before or after HKID reclaim. Since 
zapping and reclaiming private pages has to function during runtime for 
memslot deletion and converting from shared to private, the TD teardown is 
arranged so this happens before HKID reclaim. Since private page tables 
are never torn down during TD runtime, their reclaim can happen in a
simpler and more efficient way after HKID reclaim. The private page reclaim
is
initiated from the kvm fd release. The callchain looks like this:

do_exit 
  |->exit_mm --> tdx_mmu_release_hkid() was called here previously in v19 
  |->exit_files
      |->1.release vcpu fd
      |->2.kvm_gmem_release
      |     |->kvm_gmem_invalidate_begin --> unmap all leaf entries, causing 
      |                                      zapping of private guest pages
      |->3.release kvmfd
            |->kvm_destroy_vm
                |->kvm_arch_destroy_vm
                    |->kvm_unload_vcpu_mmus
                    |  kvm_x86_call(vm_destroy)(kvm) -->tdx_mmu_release_hkid()
                    |  kvm_destroy_vcpus(kvm)
                    |   |->kvm_arch_vcpu_destroy
                    |   |->kvm_x86_call(vcpu_free)(vcpu)
                    |   |  kvm_mmu_destroy(vcpu) -->unref mirror root
                    |  kvm_mmu_uninit_vm(kvm) --> mirror root ref is 1 here, 
                    |                             zap private page tables
                    | static_call_cond(kvm_x86_vm_free)(kvm);

Notable changes since v19
=========================
As usual there is a smattering of small changes across the patches. A few
more structural changes are highlighted below.

Removal of TDX flush_remote_tlbs() and flush_remote_tlbs_range() hooks
----------------------------------------------------------------------
Since only the remove_external_spte() callback needs to flush remote TLBs 
for private memory, it is ok to let these have the default behavior of 
flushing with a plain INVEPT. This change also resulted in all callers 
doing the TDH.MEM.TRACK flow being inside an MMU write lock, leading to 
the next removal.

Removal of tdh_mem_track counter
--------------------------------
One change of note in this area since the v19 series is how
synchronization works between the incrementing of the TD epoch and the
kicking of all the vCPUs. Previously a counter was used to synchronize
these. The use of a raw counter instead of more common synchronization
primitives made it a bit hard to follow, and a new lock was considered.
After the separation of private and shared flushing, it was realized that
all the callers of tdx_track() hold the MMU write lock. So this revision
relies on that for this synchronization.

Change of pre-populate flow
---------------------------
Previously the pre-populate flow required userspace to 
pre-fault the private pages into the mirror EPT explicitly with a call to 
KVM_PRE_FAULT_MEMORY before KVM_TDX_INIT_MEM_REGION. After discussion [6], 
it was changed to the current design.

Moving of tdx_mmu_release_hkid()
--------------------------------
Yan pointed out [7] some oddities related to private memory being 
reclaimed from an MMU notifier callback. This was weird on the face of it,
and it turned out Sean had already NAKed the approach. In fixing this,
HKID reclaim was moved to after the calls that could zap/reclaim private
pages. The fix however meant that the zap/reclaim of the private pages is
slower. Previously, Kai had suggested [8] starting with something simpler
and optimizing it later (which is aligned with what we are trying to do in
general for TDX support). So several insights were all leading us in this
direction. After the move, about 80 lines of architecturally thorny logic
that branched on whether the HKID was assigned could be dropped.

Split "KVM: TDX: TDP MMU TDX support"
-------------------------------------
This patch was a big monolith that did a bunch of changes at once. It was
split apart for easier reviewing.

Repos
=====
The series is extracted from this KVM tree:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-09-03

The KVM tree has some workaround patches removed to resemble more closely 
what will eventually make it upstream. It requires:

    EDK2: 9389b9a208 ("MdePkg/Tdx.h: Fix the order of NumVcpus and MaxVcpus") 

    TDX Module: 1.5.06.00.744

A matching QEMU is here:
https://github.com/intel-staging/qemu-tdx/tree/tdx-qemu-wip-2024-08-27
 
Testing 
======= 
As mentioned earlier, this series is not ready for upstream. All the same, 
it has been tested as part of the development branch for the TDX base
series. The testing consisted of TDX kvm-unit-tests, booting a Linux
TD, and TDX-enhanced KVM selftests.

There is a recently discovered bug in the TDX MMU part 1 patches. We will 
be posting a fix soon. This fix would allow for a small amount of 
code to be removed from this series, but otherwise wouldn't interfere.

[0] https://lore.kernel.org/kvm/20240718211230.1492011-1-rick.p.edgecombe@intel.com/ 
[1] https://lore.kernel.org/kvm/20240608000639.3295768-1-seanjc@google.com/#t  
[2] https://lore.kernel.org/kvm/cover.1721186590.git.kai.huang@intel.com/  
[3] https://github.com/intel/tdx/commits/kvm-tdxinit/  
[4] https://lore.kernel.org/kvm/20240826022255.361406-1-binbin.wu@linux.intel.com/
[5] https://lore.kernel.org/kvm/20240812224820.34826-1-rick.p.edgecombe@intel.com/  
[6] https://lore.kernel.org/kvm/73c62e76d83fe4e5990b640582da933ff3862cb1.camel@intel.com/ 
[7] https://lore.kernel.org/kvm/ZnUncmMSJ3Vbn1Fx@yzhao56-desk.sh.intel.com/ 
[8] https://lore.kernel.org/lkml/65a1a35e0a3b9a6f0a123e50ec9ddb755f70da52.camel@intel.com/ 

Isaku Yamahata (14):
  KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
  KVM: TDX: Add accessors VMX VMCS helpers
  KVM: TDX: Set gfn_direct_bits to shared bit
  KVM: TDX: Require TDP MMU and mmio caching for TDX
  KVM: x86/mmu: Add setter for shadow_mmio_value
  KVM: TDX: Set per-VM shadow_mmio_value to 0
  KVM: TDX: Handle TLB tracking for TDX
  KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page
    table
  KVM: TDX: Implement hook to get max mapping level of private pages
  KVM: TDX: Premap initial guest memory
  KVM: TDX: MTRR: implement get_mt_mask() for TDX
  KVM: TDX: Add an ioctl to create initial guest memory
  KVM: TDX: Finalize VM initialization
  KVM: TDX: Handle vCPU dissociation

Rick Edgecombe (3):
  KVM: x86/mmu: Implement memslot deletion for TDX
  KVM: VMX: Teach EPT violation helper about private mem
  KVM: x86/mmu: Export kvm_tdp_map_page()

Sean Christopherson (2):
  KVM: VMX: Split out guts of EPT violation to common/exposed function
  KVM: TDX: Add load_mmu_pgd method for TDX

Yan Zhao (1):
  KVM: x86/mmu: Do not enable page track for TD guest

Yuan Yao (1):
  KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

 arch/x86/include/asm/vmx.h      |   1 +
 arch/x86/include/uapi/asm/kvm.h |  10 +
 arch/x86/kvm/mmu.h              |   4 +
 arch/x86/kvm/mmu/mmu.c          |   7 +-
 arch/x86/kvm/mmu/page_track.c   |   3 +
 arch/x86/kvm/mmu/spte.c         |   8 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  37 +-
 arch/x86/kvm/vmx/common.h       |  47 +++
 arch/x86/kvm/vmx/main.c         | 122 +++++-
 arch/x86/kvm/vmx/tdx.c          | 674 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h          |  91 ++++-
 arch/x86/kvm/vmx/tdx_arch.h     |  23 ++
 arch/x86/kvm/vmx/tdx_ops.h      |  54 ++-
 arch/x86/kvm/vmx/vmx.c          |  25 +-
 arch/x86/kvm/vmx/x86_ops.h      |  51 +++
 virt/kvm/kvm_main.c             |   1 +
 16 files changed, 1097 insertions(+), 61 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/common.h

-- 
2.34.1



* [PATCH 01/21] KVM: x86/mmu: Implement memslot deletion for TDX
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

Force TDX VMs to use the KVM_X86_QUIRK_SLOT_ZAP_ALL behavior.

TDs cannot use the fast zapping operation to implement memslot deletion for
a couple reasons:
1. KVM cannot fully zap and re-build TDX private PTEs without coordinating
   with the guest. This is due to the TDs needing to "accept" memory. So
   an operation to delete a memslot needs to limit the private zapping to
   the range of the memslot.
2. For reason (1), kvm_mmu_zap_all_fast() is limited to direct (shared)
   roots. This means it will not zap the mirror (private) PTEs. If a
   memslot is deleted with private memory mapped, the private memory would
   remain mapped in the TD. Then if later the gmem fd was hole punched,
   the pages could be freed on the host while still mapped in the TD. This
   is because that operation would no longer have the memslot to map the
   pgoff to the gfn.

To handle the first case, userspace could simply disable the
KVM_X86_QUIRK_SLOT_ZAP_ALL quirk for TDs. This would prevent the issue in
(1), but it is not sufficient to resolve (2) because the problems there
extend beyond the userspace's TD, to affecting the rest of the host. So the
zap-leafs-only behavior is required for both.

A couple of options were considered, including forcing
KVM_X86_QUIRK_SLOT_ZAP_ALL to always be disabled for TDs, however due to the
currently limited quirks interface (no way to query quirks, or force them
to be disabled), this would require developing additional interfaces. So
instead just do the simple thing and make TDs always do the zap-leafs
behavior like when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled.

While at it, have the new behavior apply to all non-KVM_X86_DEFAULT_VM VMs,
as the previous behavior was not ideal (see [0]). It is assumed until
proven otherwise that the other VM types will not be exposed to the bug[1]
that derailed that effort.

Memslot deletion needs to zap both the private and shared mappings of a
GFN, so update the attr_filter field in kvm_mmu_zap_memslot_leafs() to
include both.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/kvm/20190205205443.1059-1-sean.j.christopherson@intel.com/ [0]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
---
TDX MMU part 2 v1:
 - Clarify TDX limits on zapping private memory (Sean)

Memslot quirk series:
 - New patch
---
 arch/x86/kvm/mmu/mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a8d91cf11761..7e66d7c426c1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7104,6 +7104,7 @@ static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *s
 		.start = slot->base_gfn,
 		.end = slot->base_gfn + slot->npages,
 		.may_block = true,
+		.attr_filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED,
 	};
 	bool flush = false;
 
-- 
2.34.1



* [PATCH 02/21] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Export a function to walk down the TDP without modifying it and simply
check if a GPA is mapped.

Future changes will support pre-populating TDX private memory. In order to
implement this KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT. [1]

There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.

Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.

kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.

Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Change exported function to just return if the GPA is mapped because "You
   are executing with the filemap_invalidate_lock() taken, and therefore
   cannot race with kvm_gmem_punch_hole()" (Paolo)
   https://lore.kernel.org/kvm/CABgObfbpNN842noAe77WYvgi5MzK2SAA_FYw-=fGa+PcT_Z22w@mail.gmail.com/
 - Take root hpa instead of enum (Paolo)

TDX MMU Prep v2:
 - Rename function with "mirror" and use root enum

TDX MMU Prep:
 - New patch
---
 arch/x86/kvm/mmu.h         |  3 +++
 arch/x86/kvm/mmu/mmu.c     |  3 +--
 arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++++-----
 3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 8f289222b353..5faa416ac874 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -254,6 +254,9 @@ extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
 	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7e66d7c426c1..01808cdf8627 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4713,8 +4713,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
-static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
-			    u8 *level)
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
 {
 	int r;
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 37b3769a5d32..019b43723d90 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1911,16 +1911,13 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
  *
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+				  struct kvm_mmu_page *root)
 {
-	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
 	struct tdp_iter iter;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
 
-	*root_level = vcpu->arch.mmu->root_role.level;
-
 	tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
@@ -1929,6 +1926,36 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	return leaf;
 }
 
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+			 int *root_level)
+{
+	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+	*root_level = vcpu->arch.mmu->root_role.level;
+
+	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
+}
+
+bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool is_direct = kvm_is_addr_direct(kvm, gpa);
+	hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
+				 vcpu->arch.mmu->mirror_root_hpa;
+	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+	int leaf;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	rcu_read_lock();
+	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
+	rcu_read_unlock();
+	if (leaf < 0)
+		return false;
+
+	spte = sptes[leaf];
+	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
+
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
-- 
2.34.1



* [PATCH 03/21] KVM: x86/mmu: Do not enable page track for TD guest
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel, Yuan Yao, Binbin Wu

From: Yan Zhao <yan.y.zhao@intel.com>

TDX does not support write protection and hence page tracking.
Though !tdp_enabled and kvm_shadow_root_allocated(kvm) are always false
for a TD guest, kvm_page_track_write_tracking_enabled() should also return
false when external write tracking is enabled.

Cc: Yuan Yao <yuan.yao@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
v19:
- drop TDX: from the short log
- Added reviewed-by: BinBin
---
 arch/x86/kvm/mmu/page_track.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 561c331fd6ec..26436113103a 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -35,6 +35,9 @@ static bool kvm_external_write_tracking_enabled(struct kvm *kvm)
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
 {
+	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+		return false;
+
 	return kvm_external_write_tracking_enabled(kvm) ||
 	       kvm_shadow_root_allocated(kvm) || !tdp_enabled;
 }
-- 
2.34.1



* [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel, Binbin Wu

From: Sean Christopherson <sean.j.christopherson@intel.com>

The difference for a TDX EPT violation is how the information (GPA and
exit qualification) is retrieved.  To share the code that handles EPT
violations, split out the guts of the EPT violation handler so that the
VMX/TDX exit handlers can call it after retrieving the GPA and exit
qualification.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
 arch/x86/kvm/vmx/common.h | 34 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 25 +++----------------------
 2 files changed, 37 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/common.h

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..78ae39b6cdcd
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+					     unsigned long exit_qualification)
+{
+	u64 error_code;
+
+	/* Is it a read fault? */
+	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+		     ? PFERR_USER_MASK : 0;
+	/* Is it a write fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+		      ? PFERR_WRITE_MASK : 0;
+	/* Is it a fetch fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+		      ? PFERR_FETCH_MASK : 0;
+	/* ept page table entry is present? */
+	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+		      ? PFERR_PRESENT_MASK : 0;
+
+	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
+		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
+			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5e7b5732f35d..ade7666febe9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -53,6 +53,7 @@
 #include <trace/events/ipi.h>
 
 #include "capabilities.h"
+#include "common.h"
 #include "cpuid.h"
 #include "hyperv.h"
 #include "kvm_onhyperv.h"
@@ -5771,11 +5772,8 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 
 static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
-	unsigned long exit_qualification;
+	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
 	gpa_t gpa;
-	u64 error_code;
-
-	exit_qualification = vmx_get_exit_qual(vcpu);
 
 	/*
 	 * EPT violation happened while executing iret from NMI,
@@ -5791,23 +5789,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(vcpu, gpa, exit_qualification);
 
-	/* Is it a read fault? */
-	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
-		     ? PFERR_USER_MASK : 0;
-	/* Is it a write fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
-		      ? PFERR_WRITE_MASK : 0;
-	/* Is it a fetch fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
-		      ? PFERR_FETCH_MASK : 0;
-	/* ept page table entry is present? */
-	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
-		      ? PFERR_PRESENT_MASK : 0;
-
-	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
-		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
-			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
 	/*
 	 * Check that the GPA doesn't exceed physical memory limits, as that is
 	 * a guest page fault.  We have to emulate the instruction here, because
@@ -5819,7 +5800,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
-- 
2.34.1



* [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

Teach the EPT violation helper to check the shared mask of a GPA to find
out whether the GPA is for private memory.

When an EPT violation is triggered after a TD accesses a private GPA, KVM
will exit to user space if the corresponding GFN's attribute is not
private. User space will then update the GFN's attribute during its memory
conversion process. After that, the TD will re-access the private GPA and
trigger the EPT violation again. Only when the GFN's attribute matches
private will KVM fault in the private page, map it in the mirrored TDP
root, and propagate changes to the private EPT to resolve the EPT
violation.

Relying on the GFN's attribute-tracking xarray to determine if a GFN is
private, as is done for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
violations.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Split from "KVM: TDX: handle ept violation/misconfig exit"
---
 arch/x86/kvm/vmx/common.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 78ae39b6cdcd..10aa12d45097 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -6,6 +6,12 @@
 
 #include "mmu.h"
 
+static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+	/* For TDX the direct mask is the shared mask. */
+	return !kvm_is_addr_direct(kvm, gpa);
+}
+
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 					     unsigned long exit_qualification)
 {
@@ -28,6 +34,13 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
 			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
+	/*
+	 * Don't rely on GFN's attribute tracking xarray to prevent EPT violation
+	 * loops.
+	 */
+	if (kvm_is_private_gpa(vcpu->kvm, gpa))
+		error_code |= PFERR_PRIVATE_ACCESS;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
-- 
2.34.1



* [PATCH 06/21] KVM: TDX: Add accessors VMX VMCS helpers
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS.  Introduce helper accessors to hide its SEAMCALL ABI details.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Update for the wrapper functions for SEAMCALLs. (Sean)
 - Eliminate kvm_mmu_free_private_spt() and open code it.
 - Fix bisectability issues in headers (Kai)
 - Updates from seamcall overhaul (Kai)

v19:
 - deleted unnecessary stub functions,
   tdvps_state_non_arch_check() and tdvps_management_check().
---
 arch/x86/kvm/vmx/tdx.h | 87 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 7eeb54fbcae1..66540c57ed61 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -76,6 +76,93 @@ static __always_inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
  */
 #include "tdx_ops.h"
 
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+#define VMCS_ENC_ACCESS_TYPE_MASK	0x1UL
+#define VMCS_ENC_ACCESS_TYPE_FULL	0x0UL
+#define VMCS_ENC_ACCESS_TYPE_HIGH	0x1UL
+#define VMCS_ENC_ACCESS_TYPE(field)	((field) & VMCS_ENC_ACCESS_TYPE_MASK)
+
+	/* TDX is 64bit only.  HIGH field isn't supported. */
+	BUILD_BUG_ON_MSG(__builtin_constant_p(field) &&
+			 VMCS_ENC_ACCESS_TYPE(field) == VMCS_ENC_ACCESS_TYPE_HIGH,
+			 "Read/Write to TD VMCS *_HIGH fields not supported");
+
+	BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+#define VMCS_ENC_WIDTH_MASK	GENMASK(14, 13)
+#define VMCS_ENC_WIDTH_16BIT	(0UL << 13)
+#define VMCS_ENC_WIDTH_64BIT	(1UL << 13)
+#define VMCS_ENC_WIDTH_32BIT	(2UL << 13)
+#define VMCS_ENC_WIDTH_NATURAL	(3UL << 13)
+#define VMCS_ENC_WIDTH(field)	((field) & VMCS_ENC_WIDTH_MASK)
+
+	/* TDX is 64bit only.  i.e. natural width = 64bit. */
+	BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+			 (VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_64BIT ||
+			  VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_NATURAL),
+			 "Invalid TD VMCS access for 64-bit field");
+	BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+			 VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_32BIT,
+			 "Invalid TD VMCS access for 32-bit field");
+	BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+			 VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_16BIT,
+			 "Invalid TD VMCS access for 16-bit field");
+}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
+							u32 field)		\
+{										\
+	u64 err, data;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_rd(tdx, TDVPS_##uclass(field), &data);			\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm)) {					\
+		pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n",		\
+		       field, err);						\
+		return 0;							\
+	}									\
+	return (u##bits)data;							\
+}										\
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx,	\
+						      u32 field, u##bits val)	\
+{										\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx, TDVPS_##uclass(field), val,			\
+		      GENMASK_ULL(bits - 1, 0));				\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
+		pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n",	\
+		       field, (u64)val, err);					\
+}										\
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx,	\
+						       u32 field, u64 bit)	\
+{										\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx, TDVPS_##uclass(field), bit, bit);			\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
+		pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n",	\
+		       field, bit, err);					\
+}										\
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx,	\
+							 u32 field, u64 bit)	\
+{										\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx, TDVPS_##uclass(field), 0, bit);			\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
+		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n",	\
+		       field, bit,  err);					\
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
 #else
 static inline void tdx_bringup(void) {}
 static inline void tdx_cleanup(void) {}
-- 
2.34.1



* [PATCH 07/21] KVM: TDX: Add load_mmu_pgd method for TDX
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX uses two EPT pointers, one for the private half of the GPA space and
one for the shared half. The private half uses the normal EPT_POINTER vmcs
field, which is managed in a special way by the TDX module. For TDX, KVM is
not allowed to operate on it directly. The shared half uses a new
SHARED_EPT_POINTER field and will be managed by the conventional MMU
management operations that operate directly on the EPT root. This means for
TDX the .load_mmu_pgd() operation will need to know to use the
SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in
x86 ops for load_mmu_pgd() that either directs the write to the existing
vmx implementation or a TDX one.

tdx_load_mmu_pgd() is so much simpler than vmx_load_mmu_pgd() since for the
TDX mode of operation, EPT will always be used and KVM does not need to be
involved in virtualization of CR3 behavior. So tdx_load_mmu_pgd() can
simply write to SHARED_EPT_POINTER.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX MMU part 2 v1:
- update the commit msg with the version rephrased by Rick.
  https://lore.kernel.org/all/78b1024ec3f5868e228baf797c6be98c5397bd49.camel@intel.com/

v19:
- Add WARN_ON_ONCE() to tdx_load_mmu_pgd() and drop unconditional mask
---
 arch/x86/include/asm/vmx.h |  1 +
 arch/x86/kvm/vmx/main.c    | 13 ++++++++++++-
 arch/x86/kvm/vmx/tdx.c     |  5 +++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index d77a31039f24..3e003183a4f7 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -237,6 +237,7 @@ enum vmcs_field {
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
+	SHARED_EPT_POINTER		= 0x0000203C,
 	PID_POINTER_TABLE		= 0x00002042,
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d63685ea95ce..c9dfa3aa866c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -100,6 +100,17 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+			int pgd_level)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+		return;
+	}
+
+	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -229,7 +240,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.write_tsc_offset = vmx_write_tsc_offset,
 	.write_tsc_multiplier = vmx_write_tsc_multiplier,
 
-	.load_mmu_pgd = vmx_load_mmu_pgd,
+	.load_mmu_pgd = vt_load_mmu_pgd,
 
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2ef95c84ee5b..8f43977ef4c6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -428,6 +428,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 */
 }
 
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index debc6877729a..dcf2b36efbb9 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -130,6 +130,8 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
@@ -142,6 +144,8 @@ static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.34.1



* [PATCH 08/21] KVM: TDX: Set gfn_direct_bits to shared bit
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Make the direct root handle memslot GFNs at an alias with the TDX shared
bit set.

For TDX shared memory, the memslot GFNs need to be mapped at an alias with
the shared bit set. These shared mappings will be mapped on the KVM
MMU's "direct" root. The direct root has its mappings shifted by
applying "gfn_direct_bits" as a mask. The concept of "GPAW" (guest
physical address width) determines the location of the shared bit. So set
gfn_direct_bits based on this, to map shared memory at the proper GPA.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Move setting of gfn_direct_bits to separate patch (Yan)
---
 arch/x86/kvm/vmx/tdx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8f43977ef4c6..25c24901061b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -921,6 +921,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	kvm_tdx->attributes = td_params->attributes;
 	kvm_tdx->xfam = td_params->xfam;
 
+	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
+		kvm->arch.gfn_direct_bits = gpa_to_gfn(BIT_ULL(51));
+	else
+		kvm->arch.gfn_direct_bits = gpa_to_gfn(BIT_ULL(47));
+
 out:
 	/* kfree() accepts NULL. */
 	kfree(init_vm);
-- 
2.34.1



* [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel, Yuan Yao

From: Yuan Yao <yuan.yao@intel.com>

The TDX module internally uses locks to protect internal resources.  It
tries to acquire the locks.  If it fails to obtain a lock, it returns a
TDX_OPERAND_BUSY error without spinning, because of its execution time
limitation.

The TDX SEAMCALL API reference describes what resources are used.  It's
known which TDX SEAMCALLs can cause contention with which resources.  The
VMM can avoid contention inside the TDX module by guarding contentious TDX
SEAMCALLs with, for example, a spinlock.  Because the OS knows its process
scheduling and its scalability better, a lock at the OS/VMM layer would
work better than simply retrying TDX SEAMCALLs.

The TDH.MEM.* APIs, except for TDH.MEM.TRACK, operate on the secure EPT tree,
and the TDX module internally tries to acquire the secure EPT tree's lock.
They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT when they fail to get the
lock.  TDX KVM allows the SEPT callbacks to return an error so that the TDP
MMU layer can retry.

Retry the TDH.MEM.* APIs on this error because it is a rare event caused by
zero-step attack mitigation.
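
For illustration, a minimal sketch of how a SEPT callback later in this
series is expected to consume the busy status (simplified, not the final
code; the real callbacks arrive in a later patch):

  static int tdx_link_spt_sketch(struct kvm *kvm, gpa_t gpa, int tdx_level, hpa_t hpa)
  {
  	u64 entry, level_state, err;

  	/* tdx_seamcall_sept() below already retried a bounded number of times. */
  	err = tdh_mem_sept_add(to_kvm_tdx(kvm), gpa, tdx_level, hpa, &entry, &level_state);
  	if (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT))
  		return -EAGAIN;	/* let the TDP MMU layer retry the whole operation */
  	if (KVM_BUG_ON(err, kvm))
  		return -EIO;
  	return 0;
  }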

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Updates from seamcall overhaul (Kai)

v19:
 - fix typo TDG.VP.ENTER => TDH.VP.ENTER,
   TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY
 - drop the description on TDH.VP.ENTER as this patch doesn't touch
   TDH.VP.ENTER
---
 arch/x86/kvm/vmx/tdx_ops.h | 48 ++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 0363d8544f42..8ca3e252a6ed 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -31,6 +31,40 @@
 #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
 	pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
 
+/*
+ * TDX module acquires its internal lock for resources.  It doesn't spin to get
+ * locks because of its restrictions of allowed execution time.  Instead, it
+ * returns TDX_OPERAND_BUSY with an operand id.
+ *
+ * Multiple VCPUs can operate on SEPT.  Also with zero-step attack mitigation,
+ * TDH.VP.ENTER may rarely acquire SEPT lock and release it when zero-step
+ * attack is suspected.  It results in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT
+ * with TDH.MEM.* operation.  Note: TDH.MEM.TRACK is an exception.
+ *
+ * Because TDP MMU uses read lock for scalability, spin lock around SEAMCALL
+ * spoils TDP MMU effort.  Retry several times with the assumption that SEPT
+ * lock contention is rare.  But don't loop forever to avoid lockup.  Let TDP
+ * MMU retry.
+ */
+#define TDX_ERROR_SEPT_BUSY    (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
+{
+#define SEAMCALL_RETRY_MAX     16
+	struct tdx_module_args args_in;
+	int retry = SEAMCALL_RETRY_MAX;
+	u64 ret;
+
+	do {
+		args_in = *in;
+		ret = seamcall_ret(op, in);
+	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
+
+	*in = args_in;
+
+	return ret;
+}
+
 static inline u64 tdh_mng_addcx(struct kvm_tdx *kvm_tdx, hpa_t addr)
 {
 	struct tdx_module_args in = {
@@ -55,7 +89,7 @@ static inline u64 tdh_mem_page_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	u64 ret;
 
 	clflush_cache_range(__va(hpa), PAGE_SIZE);
-	ret = seamcall_ret(TDH_MEM_PAGE_ADD, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -76,7 +110,7 @@ static inline u64 tdh_mem_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 
 	clflush_cache_range(__va(page), PAGE_SIZE);
 
-	ret = seamcall_ret(TDH_MEM_SEPT_ADD, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -93,7 +127,7 @@ static inline u64 tdh_mem_sept_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_SEPT_REMOVE, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -123,7 +157,7 @@ static inline u64 tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa
 	u64 ret;
 
 	clflush_cache_range(__va(hpa), PAGE_SIZE);
-	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -140,7 +174,7 @@ static inline u64 tdh_mem_range_block(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -335,7 +369,7 @@ static inline u64 tdh_mem_page_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
@@ -361,7 +395,7 @@ static inline u64 tdh_mem_range_unblock(struct kvm_tdx *kvm_tdx, gpa_t gpa,
 	};
 	u64 ret;
 
-	ret = seamcall_ret(TDH_MEM_RANGE_UNBLOCK, &in);
+	ret = tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in);
 
 	*rcx = in.rcx;
 	*rdx = in.rdx;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (8 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-09 15:26   ` Paolo Bonzini
  2024-09-12  0:15   ` Huang, Kai
  2024-09-04  3:07 ` [PATCH 11/21] KVM: x86/mmu: Add setter for shadow_mmio_value Rick Edgecombe
                   ` (10 subsequent siblings)
  20 siblings, 2 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Disable TDX support when the TDP MMU or MMIO caching aren't supported.

As the TDP MMU has become mainstream, replacing the legacy MMU, legacy MMU
support for TDX isn't implemented.

TDX requires KVM MMIO caching. Without MMIO caching, KVM falls back to MMIO
emulation without installing SPTEs for MMIOs. However, a TDX guest is
protected, and KVM would hit errors when trying to decode instructions to
emulate MMIOs for it. So a TDX guest relies on SPTEs being installed for
MMIOs, with no RWX bits set and the VE suppress bit cleared, to inject #VE
into the guest. The TDX guest then issues a TDVMCALL in its #VE handler to
perform instruction decoding and have the host do the MMIO emulation.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Addressed Binbin's comment by massaging Isaku's updated comments and
   adding more explanations about introducing MMIO caching.
 - Addressed Sean's comments of v19 according to Isaku's update but
   kept the warning for MOVDIR64B.
 - Move code change in tdx_hardware_setup() to __tdx_bringup() since the
   former has been removed.
---
 arch/x86/kvm/mmu/mmu.c  | 1 +
 arch/x86/kvm/vmx/main.c | 1 +
 arch/x86/kvm/vmx/tdx.c  | 8 +++-----
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 01808cdf8627..d26b235d8f84 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -110,6 +110,7 @@ static bool __ro_after_init tdp_mmu_allowed;
 #ifdef CONFIG_X86_64
 bool __read_mostly tdp_mmu_enabled = true;
 module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
+EXPORT_SYMBOL_GPL(tdp_mmu_enabled);
 #endif
 
 static int max_huge_page_level __read_mostly;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c9dfa3aa866c..2cc29d0fc279 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -3,6 +3,7 @@
 
 #include "x86_ops.h"
 #include "vmx.h"
+#include "mmu.h"
 #include "nested.h"
 #include "pmu.h"
 #include "posted_intr.h"
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 25c24901061b..0c08062ef99f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1474,16 +1474,14 @@ static int __init __tdx_bringup(void)
 	const struct tdx_sys_info_td_conf *td_conf;
 	int r;
 
+	if (!tdp_mmu_enabled || !enable_mmio_caching)
+		return -EOPNOTSUPP;
+
 	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
 		pr_warn("MOVDIR64B is reqiured for TDX\n");
 		return -EOPNOTSUPP;
 	}
 
-	if (!enable_ept) {
-		pr_err("Cannot enable TDX with EPT disabled.\n");
-		return -EINVAL;
-	}
-
 	/*
 	 * Enabling TDX requires enabling hardware virtualization first,
 	 * as making SEAMCALLs requires CPU being in post-VMXON state.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 11/21] KVM: x86/mmu: Add setter for shadow_mmio_value
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (9 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-09 15:33   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 12/21] KVM: TDX: Set per-VM shadow_mmio_value to 0 Rick Edgecombe
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Future changes will want to set shadow_mmio_value from TDX code. Add a
setter helper with a name that makes more sense in that context.
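
A later patch in this series uses the new setter from TDX VM init roughly as:

  	/* TDX wants MMIO SPTEs with value 0 (RWX = 0, suppress-VE clear). */
  	kvm_mmu_set_mmio_spte_value(kvm, 0);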

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[split into new patch]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Split into new patch
---
 arch/x86/kvm/mmu.h      | 1 +
 arch/x86/kvm/mmu/spte.c | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 5faa416ac874..72035154a23a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -78,6 +78,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
 u8 kvm_mmu_get_max_tdp_level(void);
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index d4527965e48c..46a26be0245b 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -409,6 +409,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+	kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
+
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
 {
 	/* shadow_me_value must be a subset of shadow_me_mask */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 12/21] KVM: TDX: Set per-VM shadow_mmio_value to 0
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (10 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 11/21] KVM: x86/mmu: Add setter for shadow_mmio_value Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-09 15:33   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX Rick Edgecombe
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Set per-VM shadow_mmio_value to 0 for TDX.

With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly
configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set
to 0. This overrides the default value of the suppress-VE bit in the SPTE,
which is 1, and ensures the RWX bits are 0.

For an MMIO SPTE, the SPTE value changes as follows:
1. Initial value (suppress-VE bit is set)
2. Guest issues MMIO and triggers EPT violation
3. KVM updates the SPTE value to the MMIO value (suppress-VE bit is cleared)
4. Guest resumes the MMIO access.  It triggers a #VE exception in the guest TD
5. Guest #VE handler issues TDG.VP.VMCALL<MMIO>
6. KVM handles the MMIO
7. Guest #VE handler resumes its execution after the MMIO instruction
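
As an illustrative check (not in the patch), with shadow_mmio_value set to 0
the SPTE built by make_mmio_spte() for a TD is expected to have no RWX bits
and the suppress-VE bit clear, so the access EPT-violates and #VE is injected
instead of hitting EPT misconfig:

  	u64 spte = make_mmio_spte(vcpu, gfn, ACC_ALL);

  	WARN_ON_ONCE(spte & VMX_EPT_RWX_MASK);		/* RWX = 0 */
  	WARN_ON_ONCE(spte & VMX_EPT_SUPPRESS_VE_BIT);	/* suppress-VE cleared */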

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Split from the big patch "KVM: TDX: TDP MMU TDX support".
 - Remove warning for shadow_mmio_value
---
 arch/x86/kvm/mmu/spte.c |  2 --
 arch/x86/kvm/vmx/tdx.c  | 15 ++++++++++++++-
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 46a26be0245b..4ab6d2a87032 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -94,8 +94,6 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 	u64 spte = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
 
-	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
-
 	access &= shadow_mmio_access_mask;
 	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
 	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0c08062ef99f..9da71782660f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,7 +6,7 @@
 #include "mmu.h"
 #include "tdx.h"
 #include "tdx_ops.h"
-
+#include "mmu/spte.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -344,6 +344,19 @@ int tdx_vm_init(struct kvm *kvm)
 {
 	kvm->arch.has_private_mem = true;
 
+	/*
+	 * Because guest TD is protected, VMM can't parse the instruction in TD.
+	 * Instead, guest uses MMIO hypercall.  For unmodified device driver,
+	 * #VE needs to be injected for MMIO and #VE handler in TD converts MMIO
+	 * instruction into MMIO hypercall.
+	 *
+	 * SPTE value for MMIO needs to be setup so that #VE is injected into
+	 * TD instead of triggering EPT MISCONFIG.
+	 * - RWX=0 so that EPT violation is triggered.
+	 * - suppress #VE bit is cleared to inject #VE.
+	 */
+	kvm_mmu_set_mmio_spte_value(kvm, 0);
+
 	/*
 	 * This function initializes only KVM software construct.  It doesn't
 	 * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (11 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 12/21] KVM: TDX: Set per-VM shadow_mmio_value to 0 Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-10  8:16   ` Paolo Bonzini
  2024-09-11  6:25   ` Xu Yilun
  2024-09-04  3:07 ` [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Rick Edgecombe
                   ` (7 subsequent siblings)
  20 siblings, 2 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Handle TLB tracking for TDX by introducing function tdx_track() for private
memory TLB tracking and implementing flush_tlb* hooks to flush TLBs for
shared memory.

Introduce function tdx_track() to do TLB tracking on private memory, which
basically does two things: calling TDH.MEM.TRACK to increase the TD epoch
and kicking off all vCPUs. The private EPT will then be flushed when each
vCPU re-enters the TD. This function is temporarily unused in this patch and
will be called on a page-by-page basis on removal of a private guest page in
a later patch.

In earlier revisions, tdx_track() relied on an atomic counter to coordinate
the synchronization between the actions of kicking off vCPUs, incrementing
the TD epoch, and the vCPUs waiting for the incremented TD epoch after
being kicked off.

However, the core MMU only actually needs to call tdx_track() while already
holding mmu_lock for write. So this synchronization can be made unnecessary.
vCPUs are kicked off only after the successful execution of TDH.MEM.TRACK,
eliminating the need for vCPUs to wait for TDH.MEM.TRACK completion after
being kicked off. tdx_track() is therefore able to send
KVM_REQ_OUTSIDE_GUEST_MODE requests rather than KVM_REQ_TLB_FLUSH.

Hooks flush_remote_tlbs and flush_remote_tlbs_range are not necessary for
TDX, as tdx_track() will handle TLB tracking of private memory on a
page-by-page basis when private guest pages are removed. There is no need
to invoke tdx_track() again in kvm_flush_remote_tlbs(), even after changes
to the mirrored page table.

For the hooks flush_tlb_current and flush_tlb_all, which are invoked during
kvm_mmu_load() and vCPU load for normal VMs, let the VMM flush all EPTs in
the two hooks for simplicity, since TDX does not depend on them to notify
the TDX module to flush the private EPT in those cases.
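
To make the intended usage concrete, the private page removal path added
later in this series calls tdx_track() roughly like this (sketch, error
handling omitted):

  	lockdep_assert_held_write(&kvm->mmu_lock);

  	tdx_sept_zap_private_spte(kvm, gfn, level);		/* TDH.MEM.RANGE.BLOCK */
  	tdx_track(kvm);						/* TDH.MEM.TRACK + kick vCPUs */
  	tdx_sept_drop_private_spte(kvm, gfn, level, pfn);	/* TDH.MEM.PAGE.REMOVE */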

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Split from the big patch "KVM: TDX: TDP MMU TDX support".
 - Modification of synchronization mechanism in tdx_track().
 - Dropped hooks flush_remote_tlbs and flush_remote_tlbs_range.
 - Let the VMM flush all EPTs in hooks flush_tlb_all and flush_tlb_current.
 - Dropped KVM_BUG_ON() in vt_flush_tlb_gva(). (Rick)
---
 arch/x86/kvm/vmx/main.c    | 52 ++++++++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/tdx.c     | 55 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 105 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2cc29d0fc279..1c86849680a3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -101,6 +101,50 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
+	 * private EPT will be flushed on the next TD enter.
+	 * No need to call tdx_track() here again even when this callback is as
+	 * a result of zapping private EPT.
+	 * Just invoke invept() directly here to work for both shared EPT and
+	 * private EPT.
+	 */
+	if (is_td_vcpu(vcpu)) {
+		ept_sync_global();
+		return;
+	}
+
+	vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_flush_tlb_current(vcpu);
+		return;
+	}
+
+	vmx_flush_tlb_current(vcpu);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_flush_tlb_guest(vcpu);
+}
+
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			int pgd_level)
 {
@@ -190,10 +234,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_rflags = vmx_set_rflags,
 	.get_if_flag = vmx_get_if_flag,
 
-	.flush_tlb_all = vmx_flush_tlb_all,
-	.flush_tlb_current = vmx_flush_tlb_current,
-	.flush_tlb_gva = vmx_flush_tlb_gva,
-	.flush_tlb_guest = vmx_flush_tlb_guest,
+	.flush_tlb_all = vt_flush_tlb_all,
+	.flush_tlb_current = vt_flush_tlb_current,
+	.flush_tlb_gva = vt_flush_tlb_gva,
+	.flush_tlb_guest = vt_flush_tlb_guest,
 
 	.vcpu_pre_run = vmx_vcpu_pre_run,
 	.vcpu_run = vmx_vcpu_run,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9da71782660f..6feb3ab96926 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
 #include "mmu.h"
 #include "tdx.h"
 #include "tdx_ops.h"
+#include "vmx.h"
 #include "mmu/spte.h"
 
 #undef pr_fmt
@@ -446,6 +447,51 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+/*
+ * Ensure shared and private EPTs to be flushed on all vCPUs.
+ * tdh_mem_track() is the only caller that increases TD epoch. An increase in
+ * the TD epoch (e.g., to value "N + 1") is successful only if no vCPUs are
+ * running in guest mode with the value "N - 1".
+ *
+ * A successful execution of tdh_mem_track() ensures that vCPUs can only run in
+ * guest mode with TD epoch value "N" if no TD exit occurs after the TD epoch
+ * being increased to "N + 1".
+ *
+ * Kicking off all vCPUs after that further results in no vCPUs can run in guest
+ * mode with TD epoch value "N", which unblocks the next tdh_mem_track() (e.g.
+ * to increase TD epoch to "N + 2").
+ *
+ * TDX module will flush EPT on the next TD enter and make vCPUs to run in
+ * guest mode with TD epoch value "N + 1".
+ *
+ * kvm_make_all_cpus_request() guarantees all vCPUs are out of guest mode by
+ * waiting empty IPI handler ack_kick().
+ *
+ * No action is required to the vCPUs being kicked off since the kicking off
+ * occurs certainly after TD epoch increment and before the next
+ * tdh_mem_track().
+ */
+static void __always_unused tdx_track(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+
+	/* If TD isn't finalized, it's before any vcpu running. */
+	if (unlikely(!is_td_finalized(kvm_tdx)))
+		return;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	do {
+		err = tdh_mem_track(kvm_tdx);
+	} while (unlikely((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY));
+
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_TRACK, err);
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
@@ -947,6 +993,15 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * flush_tlb_current() is used only the first time for the vcpu to run.
+	 * As it isn't performance critical, keep this function simple.
+	 */
+	ept_sync_global();
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index dcf2b36efbb9..28fda93f0b27 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -131,6 +131,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
+void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
@@ -145,6 +146,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
+static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (12 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-06  2:10   ` Huang, Kai
  2024-10-30  3:03   ` Binbin Wu
  2024-09-04  3:07 ` [PATCH 15/21] KVM: TDX: Implement hook to get max mapping level of private pages Rick Edgecombe
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement hooks in TDX to propagate changes of the mirror page table to the
private EPT, including adding/removing page table pages and adding/removing
guest pages.

TDX invokes corresponding SEAMCALLs in the hooks.

- Hook link_external_spt
  propagates adding page table page into private EPT.

- Hook set_external_spte
  tdx_sept_set_private_spte() in this patch only handles adding of guest
  private page when TD is finalized.
  Later patches will handle the case of adding guest private pages before
  TD finalization.

- Hook free_external_spt
  It is invoked when a page table page is removed from the mirror page
  table, which currently must occur at the TD tear-down phase, after the
  HKID is freed.

- Hook remove_external_spte
  It is invoked when a guest private page is removed from the mirror page
  table, which can occur while the TD is active, e.g. during shared <->
  private conversion and slot move/deletion.
  This hook is guaranteed to be triggered before the HKID is freed, because
  the gmem fd is released, with all private leaf mappings zapped, before
  the HKID is freed at VM destroy.

  TDX invokes below SEAMCALLs sequentially:
  1) TDH.MEM.RANGE.BLOCK (remove RWX bits from a private EPT entry),
  2) TDH.MEM.TRACK (increases TD epoch)
  3) TDH.MEM.PAGE.REMOVE (remove the private EPT entry and untrack the
     guest page).

  TDH.MEM.PAGE.REMOVE can't succeed without TDH.MEM.RANGE.BLOCK and
  TDH.MEM.TRACK being called successfully.
  SEAMCALL TDH.MEM.TRACK is called in function tdx_track() to enforce that
  TLB tracking will be performed by TDX module for private EPT.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Split from the big patch "KVM: TDX: TDP MMU TDX support".
 - Move setting up the 4 callbacks (kvm_x86_ops::link_external_spt etc)
   from tdx_hardware_setup() (which doesn't exist anymore) to
   vt_hardware_setup() directly.  Make tdx_sept_link_private_spt() and the
   other 3 callbacks global and add declarations to x86_ops.h so they can
   be set up in vt_hardware_setup().
 - Updated the KVM_BUG_ON() in tdx_sept_free_private_spt(). (Isaku, Binbin)
 - Removed the unused tdx_post_mmu_map_page().
 - Removed WARN_ON_ONCE() in tdh_mem_page_aug() according to Isaku's
   feedback:
   "This WARN_ON_ONCE() is a guard for buggy TDX module. It shouldn't return
   (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX)) when
   SEPT_VE_DISABLED cleared.  Maybe we should remove this WARN_ON_ONCE()
   because the TDX module is mature."
 - Update for the wrapper functions for SEAMCALLs. (Sean)
 - Add preparation for KVM_TDX_INIT_MEM_REGION to make
   tdx_sept_set_private_spte() callback nop when the guest isn't finalized.
 - use unlikely(err) in  tdx_reclaim_td_page().
 - Updates from seamcall overhaul (Kai)
 - Move header definitions from "KVM: TDX: Define TDX architectural
   definitions" (Sean)
 - Drop ugly unions (Sean)
 - Remove tdx_mng_key_config_lock cleanup after dropped in "KVM: TDX:
   create/destroy VM structure" (Chao)
 - Since HKID is freed on vm_destroy() zapping only happens when HKID is
   allocated. Remove relevant code in zapping handlers that assume the
   opposite, and add some KVM_BUG_ON() to assert this where it was
   missing. (Isaku)
---
 arch/x86/kvm/vmx/main.c     |  14 ++-
 arch/x86/kvm/vmx/tdx.c      | 222 +++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx_arch.h |  23 ++++
 arch/x86/kvm/vmx/tdx_ops.h  |   6 +
 arch/x86/kvm/vmx/x86_ops.h  |  37 ++++++
 5 files changed, 300 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 1c86849680a3..bf6fd5cca1d6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -36,9 +36,21 @@ static __init int vt_hardware_setup(void)
 	 * is KVM may allocate couple of more bytes than needed for
 	 * each VM.
 	 */
-	if (enable_tdx)
+	if (enable_tdx) {
 		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
 				sizeof(struct kvm_tdx));
+		/*
+		 * Note, TDX may fail to initialize in a later time in
+		 * vt_init(), in which case it is not necessary to setup
+		 * those callbacks.  But making them valid here even
+		 * when TDX fails to init later is fine because those
+		 * callbacks won't be called if the VM isn't TDX guest.
+		 */
+		vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
+		vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
+		vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
+		vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+	}
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6feb3ab96926..b8cd5a629a80 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -447,6 +447,177 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	put_page(page);
+}
+
+static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
+			    enum pg_level level, kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u64 entry, level_state;
+	u64 err;
+
+	err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
+		tdx_unpin(kvm, pfn);
+		return -EAGAIN;
+	}
+	if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
+		if (tdx_get_sept_level(level_state) == tdx_level &&
+		    tdx_get_sept_state(level_state) == TDX_SEPT_PENDING &&
+		    is_last_spte(entry, level) &&
+		    spte_to_pfn(entry) == pfn &&
+		    entry & VMX_EPT_SUPPRESS_VE_BIT) {
+			tdx_unpin(kvm, pfn);
+			return -EAGAIN;
+		}
+	}
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
+		tdx_unpin(kvm, pfn);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+			      enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	/*
+	 * Because guest_memfd doesn't support page migration with
+	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
+	 * migration.  Until guest_memfd supports page migration, prevent page
+	 * migration.
+	 * TODO: Once guest_memfd introduces callback on page migration,
+	 * implement it and remove get_page/put_page().
+	 */
+	get_page(pfn_to_page(pfn));
+
+	if (likely(is_td_finalized(kvm_tdx)))
+		return tdx_mem_page_aug(kvm, gfn, level, pfn);
+
+	/*
+	 * TODO: KVM_MAP_MEMORY support to populate before finalize comes
+	 * here for the initial memory.
+	 */
+	return 0;
+}
+
+static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	hpa_t hpa_with_hkid;
+	u64 err, entry, level_state;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	if (KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), kvm))
+		return -EINVAL;
+
+	do {
+		/*
+		 * When zapping private page, write lock is held. So no race
+		 * condition with other vcpu sept operation.  Race only with
+		 * TDH.VP.ENTER.
+		 */
+		err = tdh_mem_page_remove(kvm_tdx, gpa, tdx_level, &entry,
+					  &level_state);
+	} while (unlikely(err == TDX_ERROR_SEPT_BUSY));
+	if (unlikely(!is_td_finalized(kvm_tdx) &&
+		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
+		/*
+		 * This page was mapped with KVM_MAP_MEMORY, but
+		 * KVM_TDX_INIT_MEM_REGION is not issued yet.
+		 */
+		if (!is_last_spte(entry, level) || !(entry & VMX_EPT_RWX_MASK)) {
+			tdx_unpin(kvm, pfn);
+			return 0;
+		}
+	}
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_PAGE_REMOVE, err, entry, level_state);
+		return -EIO;
+	}
+
+	hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+	do {
+		/*
+		 * TDX_OPERAND_BUSY can happen on locking PAMT entry.  Because
+		 * this page was removed above, other thread shouldn't be
+		 * repeatedly operating on this page.  Just retry loop.
+		 */
+		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+	} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err);
+		return -EIO;
+	}
+	tdx_clear_page(hpa);
+	tdx_unpin(kvm, pfn);
+	return 0;
+}
+
+int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+			      enum pg_level level, void *private_spt)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = __pa(private_spt);
+	u64 err, entry, level_state;
+
+	err = tdh_mem_sept_add(to_kvm_tdx(kvm), gpa, tdx_level, hpa, &entry,
+			       &level_state);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_SEPT_ADD, err, entry, level_state);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn) & KVM_HPAGE_MASK(level);
+	u64 err, entry, level_state;
+
+	/* For now large page isn't supported yet. */
+	WARN_ON_ONCE(level != PG_LEVEL_4K);
+
+	err = tdh_mem_range_block(kvm_tdx, gpa, tdx_level, &entry, &level_state);
+	if (unlikely(err == TDX_ERROR_SEPT_BUSY))
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error_2(TDH_MEM_RANGE_BLOCK, err, entry, level_state);
+		return -EIO;
+	}
+	return 0;
+}
+
 /*
  * Ensure shared and private EPTs to be flushed on all vCPUs.
  * tdh_mem_track() is the only caller that increases TD epoch. An increase in
@@ -471,7 +642,7 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
  * occurs certainly after TD epoch increment and before the next
  * tdh_mem_track().
  */
-static void __always_unused tdx_track(struct kvm *kvm)
+static void tdx_track(struct kvm *kvm)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	u64 err;
@@ -492,6 +663,55 @@ static void __always_unused tdx_track(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
 }
 
+int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+			      enum pg_level level, void *private_spt)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/*
+	 * free_external_spt() is only called after hkid is freed when TD is
+	 * tearing down.
+	 * KVM doesn't (yet) zap page table pages in mirror page table while
+	 * TD is active, though guest pages mapped in mirror page table could be
+	 * zapped during TD is active, e.g. for shared <-> private conversion
+	 * and slot move/deletion.
+	 */
+	if (KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm))
+		return -EINVAL;
+
+	/*
+	 * The HKID assigned to this TD was already freed and cache was
+	 * already flushed. We don't have to flush again.
+	 */
+	return tdx_reclaim_page(__pa(private_spt));
+}
+
+int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+				 enum pg_level level, kvm_pfn_t pfn)
+{
+	int ret;
+
+	/*
+	 * HKID is released when vm_free() which is after closing gmem_fd
+	 * which causes gmem invalidation to zap all spte.
+	 * Population is only allowed after KVM_TDX_INIT_VM.
+	 */
+	if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
+		return -EINVAL;
+
+	ret = tdx_sept_zap_private_spte(kvm, gfn, level);
+	if (ret)
+		return ret;
+
+	/*
+	 * TDX requires TLB tracking before dropping private page.  Do
+	 * it here, although it is also done later.
+	 */
+	tdx_track(kvm);
+
+	return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
+}
+
 static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
 {
 	const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
index 815e74408a34..634ed76db26a 100644
--- a/arch/x86/kvm/vmx/tdx_arch.h
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -155,6 +155,29 @@ struct td_params {
 #define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
 #define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
 
+/* Additional Secure EPT entry information */
+#define TDX_SEPT_LEVEL_MASK		GENMASK_ULL(2, 0)
+#define TDX_SEPT_STATE_MASK		GENMASK_ULL(15, 8)
+#define TDX_SEPT_STATE_SHIFT		8
+
+enum tdx_sept_entry_state {
+	TDX_SEPT_FREE = 0,
+	TDX_SEPT_BLOCKED = 1,
+	TDX_SEPT_PENDING = 2,
+	TDX_SEPT_PENDING_BLOCKED = 3,
+	TDX_SEPT_PRESENT = 4,
+};
+
+static inline u8 tdx_get_sept_level(u64 sept_entry_info)
+{
+	return sept_entry_info & TDX_SEPT_LEVEL_MASK;
+}
+
+static inline u8 tdx_get_sept_state(u64 sept_entry_info)
+{
+	return (sept_entry_info & TDX_SEPT_STATE_MASK) >> TDX_SEPT_STATE_SHIFT;
+}
+
 #define MD_FIELD_ID_FEATURES0_TOPOLOGY_ENUM	BIT_ULL(20)
 
 /*
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 8ca3e252a6ed..73ffd80223b0 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -31,6 +31,12 @@
 #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
 	pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
 
+static inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+	WARN_ON_ONCE(level == PG_LEVEL_NONE);
+	return level - 1;
+}
+
 /*
  * TDX module acquires its internal lock for resources.  It doesn't spin to get
  * locks because of its restrictions of allowed execution time.  Instead, it
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 28fda93f0b27..d1db807b793a 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -131,6 +131,15 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
+int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+			      enum pg_level level, void *private_spt);
+int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+			      enum pg_level level, void *private_spt);
+int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+			      enum pg_level level, kvm_pfn_t pfn);
+int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+				 enum pg_level level, kvm_pfn_t pfn);
+
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
@@ -146,6 +155,34 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
+static inline int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+					    enum pg_level level,
+					    void *private_spt)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+					    enum pg_level level,
+					    void *private_spt)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+					    enum pg_level level,
+					    kvm_pfn_t pfn)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+					       enum pg_level level,
+					       kvm_pfn_t pfn)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 15/21] KVM: TDX: Implement hook to get max mapping level of private pages
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (13 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-10 10:17   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 16/21] KVM: TDX: Premap initial guest memory Rick Edgecombe
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the hook private_max_mapping_level for TDX to let the TDP MMU core
get the max mapping level of private pages.

The value is hard-coded to 4K since huge pages are not supported for now.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Split from the big patch "KVM: TDX: TDP MMU TDX support".
 - Fix missing tdx_gmem_private_max_mapping_level() implementation for
   !CONFIG_INTEL_TDX_HOST

v19:
 - Use gmem_max_level callback, delete tdp_max_page_level.
---
 arch/x86/kvm/vmx/main.c    | 10 ++++++++++
 arch/x86/kvm/vmx/tdx.c     |  5 +++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index bf6fd5cca1d6..5d43b44e2467 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -184,6 +184,14 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return tdx_vcpu_ioctl(vcpu, argp);
 }
 
+static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+{
+	if (is_td(kvm))
+		return tdx_gmem_private_max_mapping_level(kvm, pfn);
+
+	return 0;
+}
+
 #define VMX_REQUIRED_APICV_INHIBITS				\
 	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
 	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
@@ -337,6 +345,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
 	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
+
+	.private_max_mapping_level = vt_gmem_private_max_mapping_level
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b8cd5a629a80..59b627b45475 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1582,6 +1582,11 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return ret;
 }
 
+int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+{
+	return PG_LEVEL_4K;
+}
+
 #define KVM_SUPPORTED_TD_ATTRS (TDX_TD_ATTR_SEPT_VE_DISABLE)
 
 static int __init setup_kvm_tdx_caps(void)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index d1db807b793a..66829413797d 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -142,6 +142,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
 #else
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
@@ -185,6 +186,7 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 
 static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
+static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (14 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 15/21] KVM: TDX: Implement hook to get max mapping level of private pages Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-10 10:24   ` Paolo Bonzini
  2024-09-10 10:49   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX Rick Edgecombe
                   ` (4 subsequent siblings)
  20 siblings, 2 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Update TDX's set_external_spte() hook to record the pre-mapping count
instead of doing nothing and returning when the TD is not finalized.

TDX uses the ioctl KVM_TDX_INIT_MEM_REGION to initialize its initial guest
memory. This ioctl calls kvm_gmem_populate() to get guest pages, and in
tdx_gmem_post_populate() it will
(1) Map page table pages into the KVM mirror page table and private EPT.
(2) Map guest pages into the KVM mirror page table. In the propagation hook,
    just record the pre-mapping count without mapping the guest page into
    private EPT.
(3) Map guest pages into private EPT and decrease the pre-mapping count.

Do not map guest pages into private EPT directly in step (2), because TDX
requires TDH.MEM.PAGE.ADD() to add a guest page before the TD is finalized,
and that SEAMCALL copies page content from a user-provided source page into
the target guest page being added. However, the source page is not available
via the common interface kvm_tdp_map_page() used in step (2).

Therefore, just pre-map the guest page into the KVM mirror page table and
record the pre-mapping count in TDX's propagation hook. The pre-mapping
count will be decreased in ioctl KVM_TDX_INIT_MEM_REGION when the guest page
is mapped into private EPT.

Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Update the code comment and patch log according to latest gmem update.
   https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
 - Rename tdx_mem_page_add() to tdx_mem_page_record_premap_cnt() to avoid
   confusion.
 - Change the patch title to "KVM: TDX: Premap initial guest memory".
 - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
 - Drop issuing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
   KVM_TDX_INIT_MEM_REGION. (Sean)
 - Added nr_premapped to track the number of premapped pages
 - Drop tdx_post_mmu_map_page().

v19:
 - Switched to use KVM_MEMORY_MAPPING
 - Dropped measurement extension
 - updated commit message. private_page_add() => set_private_spte()
---
 arch/x86/kvm/vmx/tdx.c | 40 +++++++++++++++++++++++++++++++++-------
 arch/x86/kvm/vmx/tdx.h |  2 +-
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 59b627b45475..435112562954 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -488,6 +488,34 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
+/*
+ * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to get guest pages and
+ * tdx_gmem_post_populate() to premap page table pages into private EPT.
+ * Mapping guest pages into private EPT before TD is finalized should use a
+ * seamcall TDH.MEM.PAGE.ADD(), which copies page content from a source page
+ * from user to target guest pages to be added. This source page is not
+ * available via common interface kvm_tdp_map_page(). So, currently,
+ * kvm_tdp_map_page() only premaps guest pages into KVM mirrored root.
+ * A counter nr_premapped is increased here to record status. The counter will
+ * be decreased after TDH.MEM.PAGE.ADD() is called after the kvm_tdp_map_page()
+ * in tdx_gmem_post_populate().
+ */
+static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
+					  enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/* Returning error here to let TDP MMU bail out early. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) {
+		tdx_unpin(kvm, pfn);
+		return -EINVAL;
+	}
+
+	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
+	atomic64_inc(&kvm_tdx->nr_premapped);
+	return 0;
+}
+
 int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 			      enum pg_level level, kvm_pfn_t pfn)
 {
@@ -510,11 +538,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (likely(is_td_finalized(kvm_tdx)))
 		return tdx_mem_page_aug(kvm, gfn, level, pfn);
 
-	/*
-	 * TODO: KVM_MAP_MEMORY support to populate before finalize comes
-	 * here for the initial memory.
-	 */
-	return 0;
+	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
 }
 
 static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -546,10 +570,12 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (unlikely(!is_td_finalized(kvm_tdx) &&
 		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
 		/*
-		 * This page was mapped with KVM_MAP_MEMORY, but
-		 * KVM_TDX_INIT_MEM_REGION is not issued yet.
+		 * Page is mapped by KVM_TDX_INIT_MEM_REGION, but hasn't called
+		 * tdh_mem_page_add().
 		 */
 		if (!is_last_spte(entry, level) || !(entry & VMX_EPT_RWX_MASK)) {
+			WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
+			atomic64_dec(&kvm_tdx->nr_premapped);
 			tdx_unpin(kvm, pfn);
 			return 0;
 		}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 66540c57ed61..25a4aaede2ba 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -26,7 +26,7 @@ struct kvm_tdx {
 
 	u64 tsc_offset;
 
-	/* For KVM_MAP_MEMORY and KVM_TDX_INIT_MEM_REGION. */
+	/* For KVM_TDX_INIT_MEM_REGION. */
 	atomic64_t nr_premapped;
 
 	struct kvm_cpuid2 *cpuid;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (15 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 16/21] KVM: TDX: Premap initial guest memory Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-10 10:04   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 18/21] KVM: x86/mmu: Export kvm_tdp_map_page() Rick Edgecombe
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Although TDX supports only WB for private GPAs, it's desirable to support
MTRR for shared GPAs.  Always honor guest PAT for the shared EPT, as is done
for normal VMs.

Suggested-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Align with latest vmx code in kvm/queue.
 - Updated patch log.
 - Dropped KVM_BUG_ON() in vt_get_mt_mask(). (Rick)

v19:
 - typo in the commit message
 - Deleted stale paragraph in the commit message
---
 arch/x86/kvm/vmx/main.c    | 10 +++++++++-
 arch/x86/kvm/vmx/tdx.c     |  8 ++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5d43b44e2467..8f5dbab9099f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -168,6 +168,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
 }
 
+static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_mt_mask(vcpu, gfn, is_mmio);
+
+	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -292,7 +300,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.set_identity_map_addr = vmx_set_identity_map_addr,
-	.get_mt_mask = vmx_get_mt_mask,
+	.get_mt_mask = vt_get_mt_mask,
 
 	.get_exit_info = vmx_get_exit_info,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 435112562954..50ce24905062 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -374,6 +374,14 @@ int tdx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+	if (is_mmio)
+		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+
+	return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+}
+
 int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 66829413797d..d8a00ab4651c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -128,6 +128,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -153,6 +154,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 18/21] KVM: x86/mmu: Export kvm_tdp_map_page()
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (16 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-10 10:02   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory Rick Edgecombe
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

In future changes, CoCo-specific code will need to call kvm_tdp_map_page()
from within their respective gmem_post_populate() callbacks. Export it so
this can be done from vendor-specific code. Since kvm_mmu_reload() (which
wraps kvm_mmu_load()) will be needed for this operation, export
kvm_mmu_load() as well.
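
A sketch of the expected call site (callback name, its signature, and the
error_code flags are assumptions for illustration; the real TDX user arrives
in the next patch):

  static int example_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
  				      void __user *src, int order, void *opaque)
  {
  	struct kvm_vcpu *vcpu = opaque;	/* assumed to be passed by the ioctl caller */
  	u8 level = PG_LEVEL_4K;
  	int ret;

  	ret = kvm_mmu_reload(vcpu);	/* needs kvm_mmu_load() exported */
  	if (ret)
  		return ret;

  	/* Pre-fault the private GPA into the mirror root. */
  	return kvm_tdp_map_page(vcpu, gfn_to_gpa(gfn), PFERR_PRIVATE_ACCESS, &level);
  }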

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - New patch
---
 arch/x86/kvm/mmu/mmu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d26b235d8f84..1a7965cfa08e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4754,6 +4754,7 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 		return -EIO;
 	}
 }
+EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
@@ -5776,6 +5777,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (17 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 18/21] KVM: x86/mmu: Export kvm_tdp_map_page() Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-04  4:53   ` Yan Zhao
                     ` (2 more replies)
  2024-09-04  3:07 ` [PATCH 20/21] KVM: TDX: Finalize VM initialization Rick Edgecombe
  2024-09-04  3:07 ` [PATCH 21/21] KVM: TDX: Handle vCPU dissociation Rick Edgecombe
  20 siblings, 3 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a new ioctl for the user space VMM to initialize guest memory with the
specified memory contents.

Because TDX protects the guest's memory, the creation of the initial guest
memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
directly copying the memory contents into the guest's memory as is done for
the default VM type.

Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
the guest page in Secure-EPT, copy the initial memory contents into the
guest memory, and encrypt the guest memory.  Optionally, extend the memory
measurement of the TDX guest.
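
For reference, a minimal userspace sketch of driving the new subcommand could
look like the below.  This is only an illustration and is not part of the
series; it assumes the usual struct kvm_tdx_cmd wrapper (id/flags/data fields)
used by the other subcommands, a page-aligned source buffer src_buf, and an
initial_image_gpa that is private for the TD:

	struct kvm_tdx_init_mem_region region = {
		.source_addr = (__u64)(unsigned long)src_buf,	/* page-aligned */
		.gpa = initial_image_gpa,			/* private GPA */
		.nr_pages = nr_pages,
	};
	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_INIT_MEM_REGION,
		.flags = KVM_TDX_MEASURE_MEMORY_REGION,	/* optional measurement */
		.data = (__u64)(unsigned long)&region,
	};

	/* vCPU-scoped: issued on an initialized TD vCPU fd, before finalization. */
	if (ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
		err(1, "KVM_TDX_INIT_MEM_REGION");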

Discussion history:
- Originally, KVM_TDX_INIT_MEM_REGION used the TDP MMU callbacks of the
  KVM page fault handler.  It issued the TDX SEAMCALL deep in the call
  stack, with the ioctl passing down the necessary parameters.  This was
  rejected in [2].  [3] suggests that the call to the TDX module should
  be invoked in a shallow call stack.

- Instead, introduce guest memory pre-population [1], which doesn't update
  the vendor-specific part (Secure-EPT in the TDX case), and have the
  vendor-specific code (KVM_TDX_INIT_MEM_REGION) update only the
  vendor-specific parts without modifying the KVM TDP MMU, as suggested
  at [4]:

    Crazy idea.  For TDX S-EPT, what if KVM_MAP_MEMORY does all of the
    SEPT.ADD stuff, which doesn't affect the measurement, and even fills in
    KVM's copy of the leaf EPTE, but tdx_sept_set_private_spte() doesn't do
    anything if the TD isn't finalized?

    Then KVM provides a dedicated TDX ioctl(), i.e. what is/was
    KVM_TDX_INIT_MEM_REGION, to do PAGE.ADD.  KVM_TDX_INIT_MEM_REGION
    wouldn't need to map anything, it would simply need to verify that the
    pfn from guest_memfd() is the same as what's in the TDP MMU.

- Use the common guest_memfd population function, kvm_gmem_populate()
  instead of a custom function.  It should check whether the PFN
  from TDP MMU is the same as the one from guest_memfd. [1]

- Instead of forcing userspace to do two passes, pre-map the guest
  initial memory in tdx_gmem_post_populate. [5]

Link: https://lore.kernel.org/kvm/20240419085927.3648704-1-pbonzini@redhat.com/ [1]
Link: https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/ [2]
Link: https://lore.kernel.org/kvm/Zh8DHbb8FzoVErgX@google.com/ [3]
Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ [4]
Link: https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/ [5]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Update the code according to latest gmem update.
   https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
 - Fix up an alignment bug reported by Binbin.
 - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
 - Drop issuing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
   KVM_TDX_INIT_MEM_REGION. (Sean)
 - Added nr_premapped to track the number of premapped pages
 - Drop tdx_post_mmu_map_page().
 - Drop kvm_slot_can_be_private() check (Paolo)
 - Use kvm_tdp_mmu_gpa_is_mapped() (Paolo)

v19:
 - Switched to use KVM_MEMORY_MAPPING
 - Dropped measurement extension
 - updated commit message. private_page_add() => set_private_spte()
---
 arch/x86/include/uapi/asm/kvm.h |   9 ++
 arch/x86/kvm/vmx/tdx.c          | 150 ++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c             |   1 +
 3 files changed, 160 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 39636be5c891..789d1d821b4f 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
 	KVM_TDX_GET_CPUID,
 
 	KVM_TDX_CMD_NR_MAX,
@@ -996,4 +997,12 @@ struct kvm_tdx_init_vm {
 	struct kvm_cpuid2 cpuid;
 };
 
+#define KVM_TDX_MEASURE_MEMORY_REGION   _BITULL(0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 50ce24905062..796d1a495a66 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,6 +8,7 @@
 #include "tdx_ops.h"
 #include "vmx.h"
 #include "mmu/spte.h"
+#include "common.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -1586,6 +1587,152 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+struct tdx_gmem_post_populate_arg {
+	struct kvm_vcpu *vcpu;
+	__u32 flags;
+};
+
+static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+				  void __user *src, int order, void *_arg)
+{
+	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_gmem_post_populate_arg *arg = _arg;
+	struct kvm_vcpu *vcpu = arg->vcpu;
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u8 level = PG_LEVEL_4K;
+	struct page *page;
+	int ret, i;
+	u64 err, entry, level_state;
+
+	/*
+	 * Get the source page if it has been faulted in. Return failure if the
+	 * source page has been swapped out or unmapped in primary memory.
+	 */
+	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
+	if (ret < 0)
+		return ret;
+	if (ret != 1)
+		return -ENOMEM;
+
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		ret = -EFAULT;
+		goto out_put_page;
+	}
+
+	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	if (ret < 0)
+		goto out_put_page;
+
+	read_lock(&kvm->mmu_lock);
+
+	if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = 0;
+	do {
+		err = tdh_mem_page_add(kvm_tdx, gpa, pfn_to_hpa(pfn),
+				       pfn_to_hpa(page_to_pfn(page)),
+				       &entry, &level_state);
+	} while (err == TDX_ERROR_SEPT_BUSY);
+	if (err) {
+		ret = -EIO;
+		goto out;
+	}
+
+	WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
+	atomic64_dec(&kvm_tdx->nr_premapped);
+
+	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
+		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+			err = tdh_mr_extend(kvm_tdx, gpa + i, &entry,
+					&level_state);
+			if (err) {
+				ret = -EIO;
+				break;
+			}
+		}
+	}
+
+out:
+	read_unlock(&kvm->mmu_lock);
+out_put_page:
+	put_page(page);
+	return ret;
+}
+
+static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_mem_region region;
+	struct tdx_gmem_post_populate_arg arg;
+	long gmem_ret;
+	int ret;
+
+	if (!to_tdx(vcpu)->initialized)
+		return -EINVAL;
+
+	/* Once TD is finalized, the initial guest memory is fixed. */
+	if (is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
+		return -EINVAL;
+
+	if (copy_from_user(&region, u64_to_user_ptr(cmd->data), sizeof(region)))
+		return -EFAULT;
+
+	if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
+	    !region.nr_pages ||
+	    region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+	    !kvm_is_private_gpa(kvm, region.gpa) ||
+	    !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_mmu_reload(vcpu);
+	ret = 0;
+	while (region.nr_pages) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		arg = (struct tdx_gmem_post_populate_arg) {
+			.vcpu = vcpu,
+			.flags = cmd->flags,
+		};
+		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
+					     u64_to_user_ptr(region.source_addr),
+					     1, tdx_gmem_post_populate, &arg);
+		if (gmem_ret < 0) {
+			ret = gmem_ret;
+			break;
+		}
+
+		if (gmem_ret != 1) {
+			ret = -EIO;
+			break;
+		}
+
+		region.source_addr += PAGE_SIZE;
+		region.gpa += PAGE_SIZE;
+		region.nr_pages--;
+
+		cond_resched();
+	}
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (copy_to_user(u64_to_user_ptr(cmd->data), &region, sizeof(region)))
+		ret = -EFAULT;
+	return ret;
+}
+
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -1605,6 +1752,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	case KVM_TDX_INIT_VCPU:
 		ret = tdx_vcpu_init(vcpu, &cmd);
 		break;
+	case KVM_TDX_INIT_MEM_REGION:
+		ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
+		break;
 	case KVM_TDX_GET_CPUID:
 		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
 		break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 73fc3334721d..0822db480719 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2639,6 +2639,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
 
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (18 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-04 15:37   ` Adrian Hunter
  2024-09-10 10:25   ` Paolo Bonzini
  2024-09-04  3:07 ` [PATCH 21/21] KVM: TDX: Handle vCPU dissociation Rick Edgecombe
  20 siblings, 2 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel, Adrian Hunter

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.

Documentation for the API is added in another patch:
"Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"

For the purpose of attestation, a measurement must be made of the TDX VM
initial state. This is referred to as TD Measurement Finalization, and
uses SEAMCALL TDH.MR.FINALIZE, after which:
1. The VMM adding TD private pages with arbitrary content is no longer
   allowed
2. The TDX VM is runnable
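
As an illustration only (assuming the same struct kvm_tdx_cmd wrapper used by
the other subcommands, and that all KVM_TDX_INIT_MEM_REGION calls have already
completed), userspace usage could look roughly like:

	struct kvm_tdx_cmd cmd = { .id = KVM_TDX_FINALIZE_VM };
	int ret;

	/* VM-scoped ioctl; -EAGAIN corresponds to the TDX_OPERAND_BUSY case. */
	do {
		ret = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	} while (ret && errno == EAGAIN);

	if (ret)
		err(1, "KVM_TDX_FINALIZE_VM, hw_error=0x%llx", cmd.hw_error);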

Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Added premapped check.
 - Update for the wrapper functions for SEAMCALLs. (Sean)
 - Add check if nr_premapped is zero.  If not, return error.
 - Use KVM_BUG_ON() in tdx_td_finalizer() for consistency.
 - Change tdx_td_finalizemr() to take struct kvm_tdx_cmd *cmd and return error
   (Adrian)
 - Handle TDX_OPERAND_BUSY case (Adrian)
 - Updates from seamcall overhaul (Kai)
 - Rename error->hw_error

v18:
 - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.

v15:
 - removed unconditional tdx_track() by tdx_flush_tlb_current() that
   does tdx_track().
---
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/vmx/tdx.c          | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 789d1d821b4f..0b4827e39458 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -932,6 +932,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
 	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
 	KVM_TDX_GET_CPUID,
 
 	KVM_TDX_CMD_NR_MAX,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 796d1a495a66..3083a66bb895 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1257,6 +1257,31 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
 	ept_sync_global();
 }
 
+static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+	/*
+	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
+	 * TDH.MEM.PAGE.ADD().
+	 */
+	if (atomic64_read(&kvm_tdx->nr_premapped))
+		return -EINVAL;
+
+	cmd->hw_error = tdh_mr_finalize(kvm_tdx);
+	if ((cmd->hw_error & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY)
+		return -EAGAIN;
+	if (KVM_BUG_ON(cmd->hw_error, kvm)) {
+		pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
+		return -EIO;
+	}
+
+	kvm_tdx->finalized = true;
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -1281,6 +1306,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_INIT_VM:
 		r = tdx_td_init(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_FINALIZE_VM:
+		r = tdx_td_finalizemr(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* [PATCH 21/21] KVM: TDX: Handle vCPU dissociation
  2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
                   ` (19 preceding siblings ...)
  2024-09-04  3:07 ` [PATCH 20/21] KVM: TDX: Finalize VM initialization Rick Edgecombe
@ 2024-09-04  3:07 ` Rick Edgecombe
  2024-09-09 15:41   ` Paolo Bonzini
  2024-09-10 10:45   ` Paolo Bonzini
  20 siblings, 2 replies; 139+ messages in thread
From: Rick Edgecombe @ 2024-09-04  3:07 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	rick.p.edgecombe, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
the address translation caches and cached TD VMCS of a TD vCPU in its
associated pCPU.

In TDX, a vCPU can only be associated with one pCPU at a time, and the
association is established by invoking SEAMCALL TDH.VP.ENTER. For a
successful association, the vCPU must first be dissociated from its
previously associated pCPU.

To facilitate vCPU dissociation, introduce a per-pCPU list
associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
pCPU).

vCPU dissociation can happen under the following conditions:
- When the op hardware_disable is called.
  This op is called when virtualization is disabled on a given pCPU, e.g.
  when hot-unplugging a pCPU or during machine shutdown/suspend.
  In this case, dissociate all vCPUs from the pCPU by iterating its
  per-pCPU list associated_tdvcpus.

- On vCPU migration to a new pCPU.
  Before adding a vCPU into the associated_tdvcpus list of the new pCPU,
  dissociation from its old pCPU is required, which is performed by issuing
  an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
  On a successful dissociation, the vCPU will be removed from the
  associated_tdvcpus list of its previously associated pCPU.

- When tdx_mmu_release_hkid() is called.
  TDX mandates that all vCPUs must be dissociated prior to the release of
  an HKID. Therefore, dissociating all vCPUs is a must before executing
  the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the HKID.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Changed title to "KVM: TDX: Handle vCPU dissociation" .
 - Updated commit log.
 - Removed calling tdx_disassociate_vp_on_cpu() in tdx_vcpu_free() since
   no new TD enter would be called for vCPU association after
   tdx_mmu_release_hkid(), which is now called in vt_vm_destroy(), i.e.
   after releasing vcpu fd and kvm_unload_vcpu_mmus(), and before
   tdx_vcpu_free().
 - TODO: include Isaku's fix
   https://eclists.intel.com/sympa/arc/kvm-qemu-review/2024-07/msg00359.html
 - Update for the wrapper functions for SEAMCALLs. (Sean)
 - Removed unnecessary pr_err() in tdx_flush_vp_on_cpu().
 - Use KVM_BUG_ON() in tdx_flush_vp_on_cpu() for consistency.
 - Capitalize the first word of title. (Binbin)
 - Minor fixes in changelog. (Binbin, Reinette(internal))
 - Fix some comments. (Binbin, Reinette(internal))
 - Rename arg_ to _arg (Binbin)
 - Updates from seamcall overhaul (Kai)
 - Remove lockdep_assert_preemption_disabled() in tdx_hardware_setup()
   since now hardware_enable() is not called via SMP func call anymore,
   but (per-cpu) CPU hotplug thread
 - Use KVM_BUG_ON() for SEAMCALLs in tdx_mmu_release_hkid() (Kai)
 - Update based on upstream commit "KVM: x86: Fold kvm_arch_sched_in()
   into kvm_arch_vcpu_load()"
 - Eliminate TDX_FLUSHVP_NOT_DONE error check because vCPUs were all freed.
   So the error won't happen. (Sean)
---
 arch/x86/kvm/vmx/main.c    |  22 +++++-
 arch/x86/kvm/vmx/tdx.c     | 151 +++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h     |   2 +
 arch/x86/kvm/vmx/x86_ops.h |   4 +
 4 files changed, 169 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8f5dbab9099f..8171c1412c3b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -10,6 +10,14 @@
 #include "tdx.h"
 #include "tdx_arch.h"
 
+static void vt_hardware_disable(void)
+{
+	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+	if (enable_tdx)
+		tdx_hardware_disable();
+	vmx_hardware_disable();
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -113,6 +121,16 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_vcpu_load(vcpu, cpu);
+		return;
+	}
+
+	vmx_vcpu_load(vcpu, cpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	/*
@@ -217,7 +235,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.hardware_unsetup = vmx_hardware_unsetup,
 
 	.hardware_enable = vmx_hardware_enable,
-	.hardware_disable = vmx_hardware_disable,
+	.hardware_disable = vt_hardware_disable,
 	.emergency_disable = vmx_emergency_disable,
 
 	.has_emulated_msr = vmx_has_emulated_msr,
@@ -234,7 +252,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_reset = vt_vcpu_reset,
 
 	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
-	.vcpu_load = vmx_vcpu_load,
+	.vcpu_load = vt_vcpu_load,
 	.vcpu_put = vmx_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3083a66bb895..554154d3dd58 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -57,6 +57,14 @@ static DEFINE_MUTEX(tdx_lock);
 /* Maximum number of retries to attempt for SEAMCALLs. */
 #define TDX_SEAMCALL_RETRIES	10000
 
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUS.
+ * Protected by interrupt mask.  This list is manipulated in process context
+ * of vCPU and IPI callback.  See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
 static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 {
 	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
@@ -88,6 +96,22 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
 	return kvm_tdx->finalized;
 }
 
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+	lockdep_assert_irqs_disabled();
+
+	list_del(&to_tdx(vcpu)->cpu_list);
+
+	/*
+	 * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
+	 * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+	 * to its list before it's deleted from this CPU's list.
+	 */
+	smp_wmb();
+
+	vcpu->cpu = -1;
+}
+
 static void tdx_clear_page(unsigned long page_pa)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -168,6 +192,83 @@ static void tdx_reclaim_control_page(unsigned long ctrl_page_pa)
 	free_page((unsigned long)__va(ctrl_page_pa));
 }
 
+struct tdx_flush_vp_arg {
+	struct kvm_vcpu *vcpu;
+	u64 err;
+};
+
+static void tdx_flush_vp(void *_arg)
+{
+	struct tdx_flush_vp_arg *arg = _arg;
+	struct kvm_vcpu *vcpu = arg->vcpu;
+	u64 err;
+
+	arg->err = 0;
+	lockdep_assert_irqs_disabled();
+
+	/* Task migration can race with CPU offlining. */
+	if (unlikely(vcpu->cpu != raw_smp_processor_id()))
+		return;
+
+	/*
+	 * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized.  The
+	 * list tracking still needs to be updated so that it's correct if/when
+	 * the vCPU does get initialized.
+	 */
+	if (is_td_vcpu_created(to_tdx(vcpu))) {
+		/*
+		 * No need to retry.  TDX Resources needed for TDH.VP.FLUSH are:
+		 * TDVPR as exclusive, TDR as shared, and TDCS as shared.  This
+		 * vp flush function is called when destructing vCPU/TD or vCPU
+		 * migration.  No other thread uses TDVPR in those cases.
+		 */
+		err = tdh_vp_flush(to_tdx(vcpu));
+		if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+			/*
+			 * This function is called in IPI context. Do not use
+			 * printk to avoid console semaphore.
+			 * The caller prints out the error message, instead.
+			 */
+			if (err)
+				arg->err = err;
+		}
+	}
+
+	tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+	struct tdx_flush_vp_arg arg = {
+		.vcpu = vcpu,
+	};
+	int cpu = vcpu->cpu;
+
+	if (unlikely(cpu == -1))
+		return;
+
+	smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
+	if (KVM_BUG_ON(arg.err, vcpu->kvm))
+		pr_tdx_error(TDH_VP_FLUSH, arg.err);
+}
+
+void tdx_hardware_disable(void)
+{
+	int cpu = raw_smp_processor_id();
+	struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+	struct tdx_flush_vp_arg arg;
+	struct vcpu_tdx *tdx, *tmp;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	/* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+	list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
+		arg.vcpu = &tdx->vcpu;
+		tdx_flush_vp(&arg);
+	}
+	local_irq_restore(flags);
+}
+
 static void smp_func_do_phymem_cache_wb(void *unused)
 {
 	u64 err = 0;
@@ -204,22 +305,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	bool packages_allocated, targets_allocated;
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	cpumask_var_t packages, targets;
-	u64 err;
+	struct kvm_vcpu *vcpu;
+	unsigned long j;
 	int i;
+	u64 err;
 
 	if (!is_hkid_assigned(kvm_tdx))
 		return;
 
-	/* KeyID has been allocated but guest is not yet configured */
-	if (!is_td_created(kvm_tdx)) {
-		tdx_hkid_free(kvm_tdx);
-		return;
-	}
-
 	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
 	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
 	cpus_read_lock();
 
+	kvm_for_each_vcpu(j, vcpu, kvm)
+		tdx_flush_vp_on_cpu(vcpu);
+
 	/*
 	 * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global lock
 	 * and can fail with TDX_OPERAND_BUSY when it fails to get the lock.
@@ -233,6 +333,16 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	 * After the above flushing vps, there should be no more vCPU
 	 * associations, as all vCPU fds have been released at this stage.
 	 */
+	err = tdh_mng_vpflushdone(kvm_tdx);
+	if (err == TDX_FLUSHVP_NOT_DONE)
+		goto out;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err);
+		pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
+		       kvm_tdx->hkid);
+		goto out;
+	}
+
 	for_each_online_cpu(i) {
 		if (packages_allocated &&
 		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -258,6 +368,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 		tdx_hkid_free(kvm_tdx);
 	}
 
+out:
 	mutex_unlock(&tdx_lock);
 	cpus_read_unlock();
 	free_cpumask_var(targets);
@@ -409,6 +520,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (vcpu->cpu == cpu)
+		return;
+
+	tdx_flush_vp_on_cpu(vcpu);
+
+	local_irq_disable();
+	/*
+	 * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+	 * vcpu->cpu is read before tdx->cpu_list.
+	 */
+	smp_rmb();
+
+	list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+	local_irq_enable();
+}
+
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1977,7 +2108,7 @@ static int __init __do_tdx_bringup(void)
 static int __init __tdx_bringup(void)
 {
 	const struct tdx_sys_info_td_conf *td_conf;
-	int r;
+	int r, i;
 
 	if (!tdp_mmu_enabled || !enable_mmio_caching)
 		return -EOPNOTSUPP;
@@ -1987,6 +2118,10 @@ static int __init __tdx_bringup(void)
 		return -EOPNOTSUPP;
 	}
 
+	/* tdx_hardware_disable() uses associated_tdvcpus. */
+	for_each_possible_cpu(i)
+		INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
+
 	/*
 	 * Enabling TDX requires enabling hardware virtualization first,
 	 * as making SEAMCALLs requires CPU being in post-VMXON state.
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 25a4aaede2ba..4b6fc25feeb6 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -39,6 +39,8 @@ struct vcpu_tdx {
 	unsigned long *tdcx_pa;
 	bool td_vcpu_created;
 
+	struct list_head cpu_list;
+
 	bool initialized;
 
 	/*
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index d8a00ab4651c..f4aa0ec16980 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -119,6 +119,7 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_INTEL_TDX_HOST
+void tdx_hardware_disable(void);
 int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_release_hkid(struct kvm *kvm);
 void tdx_vm_free(struct kvm *kvm);
@@ -128,6 +129,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -145,6 +147,7 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
 #else
+static inline void tdx_hardware_disable(void) {}
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
 static inline void tdx_vm_free(struct kvm *kvm) {}
@@ -154,6 +157,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-04  3:07 ` [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory Rick Edgecombe
@ 2024-09-04  4:53   ` Yan Zhao
  2024-09-04 14:01     ` Edgecombe, Rick P
  2024-09-04 13:56   ` Edgecombe, Rick P
  2024-09-10 10:16   ` Paolo Bonzini
  2 siblings, 1 reply; 139+ messages in thread
From: Yan Zhao @ 2024-09-04  4:53 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel

On Tue, Sep 03, 2024 at 08:07:49PM -0700, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a new ioctl for the user space VMM to initialize guest memory with the
> specified memory contents.
> 
> Because TDX protects the guest's memory, the creation of the initial guest
> memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
> directly copying the memory contents into the guest's memory in the case of
> the default VM type.
> 
> Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
> KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
> the guest page in Secure-EPT, copy the initial memory contents into the
> guest memory, and encrypt the guest memory.  Optionally, extend the memory
> measurement of the TDX guest.
> 
> Discussion history:
> - Originally, KVM_TDX_INIT_MEM_REGION used the callback of the TDP MMU of
>   the KVM page fault handler.  It issues TDX SEAMCALL deep in the call
>   stack, and the ioctl passes down the necessary parameters.  [2] rejected
>   it.  [3] suggests that the call to the TDX module should be invoked in a
>   shallow call stack.
> 
> - Instead, introduce guest memory pre-population [1] that doesn't update
>   vendor-specific part (Secure-EPT in TDX case) and the vendor-specific
>   code (KVM_TDX_INIT_MEM_REGION) updates only vendor-specific parts without
>   modifying the KVM TDP MMU suggested at [4]
> 
>     Crazy idea.  For TDX S-EPT, what if KVM_MAP_MEMORY does all of the
>     SEPT.ADD stuff, which doesn't affect the measurement, and even fills in
>     KVM's copy of the leaf EPTE, but tdx_sept_set_private_spte() doesn't do
>     anything if the TD isn't finalized?
> 
>     Then KVM provides a dedicated TDX ioctl(), i.e. what is/was
>     KVM_TDX_INIT_MEM_REGION, to do PAGE.ADD.  KVM_TDX_INIT_MEM_REGION
>     wouldn't need to map anything, it would simply need to verify that the
>     pfn from guest_memfd() is the same as what's in the TDP MMU.
> 
> - Use the common guest_memfd population function, kvm_gmem_populate()
>   instead of a custom function.  It should check whether the PFN
>   from TDP MMU is the same as the one from guest_memfd. [1]
> 
> - Instead of forcing userspace to do two passes, pre-map the guest
>   initial memory in tdx_gmem_post_populate. [5]
> 
> Link: https://lore.kernel.org/kvm/20240419085927.3648704-1-pbonzini@redhat.com/ [1]
> Link: https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/ [2]
> Link: https://lore.kernel.org/kvm/Zh8DHbb8FzoVErgX@google.com/ [3]
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ [4]
> Link: https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/ [5]
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>  - Update the code according to latest gmem update.
>    https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
>  - Fixup a aligment bug reported by Binbin.
>  - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
>  - Drop issueing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
>    KVM_TDX_INIT_MEM_REGION. (Sean)
>  - Added nr_premapped to track the number of premapped pages
>  - Drop tdx_post_mmu_map_page().
>  - Drop kvm_slot_can_be_private() check (Paolo)
>  - Use kvm_tdp_mmu_gpa_is_mapped() (Paolo)
> 
> v19:
>  - Switched to use KVM_MEMORY_MAPPING
>  - Dropped measurement extension
>  - updated commit message. private_page_add() => set_private_spte()
> ---
>  arch/x86/include/uapi/asm/kvm.h |   9 ++
>  arch/x86/kvm/vmx/tdx.c          | 150 ++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c             |   1 +
>  3 files changed, 160 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 39636be5c891..789d1d821b4f 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
>  	KVM_TDX_INIT_VM,
>  	KVM_TDX_INIT_VCPU,
> +	KVM_TDX_INIT_MEM_REGION,
>  	KVM_TDX_GET_CPUID,
>  
>  	KVM_TDX_CMD_NR_MAX,
> @@ -996,4 +997,12 @@ struct kvm_tdx_init_vm {
>  	struct kvm_cpuid2 cpuid;
>  };
>  
> +#define KVM_TDX_MEASURE_MEMORY_REGION   _BITULL(0)
> +
> +struct kvm_tdx_init_mem_region {
> +	__u64 source_addr;
> +	__u64 gpa;
> +	__u64 nr_pages;
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 50ce24905062..796d1a495a66 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -8,6 +8,7 @@
>  #include "tdx_ops.h"
>  #include "vmx.h"
>  #include "mmu/spte.h"
> +#include "common.h"
>  
>  #undef pr_fmt
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> @@ -1586,6 +1587,152 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>  	return 0;
>  }
>  
> +struct tdx_gmem_post_populate_arg {
> +	struct kvm_vcpu *vcpu;
> +	__u32 flags;
> +};
> +
> +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +				  void __user *src, int order, void *_arg)
> +{
> +	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct tdx_gmem_post_populate_arg *arg = _arg;
> +	struct kvm_vcpu *vcpu = arg->vcpu;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	u8 level = PG_LEVEL_4K;
> +	struct page *page;
> +	int ret, i;
> +	u64 err, entry, level_state;
> +
> +	/*
> +	 * Get the source page if it has been faulted in. Return failure if the
> +	 * source page has been swapped out or unmapped in primary memory.
> +	 */
> +	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
> +	if (ret < 0)
> +		return ret;
> +	if (ret != 1)
> +		return -ENOMEM;
> +
> +	if (!kvm_mem_is_private(kvm, gfn)) {
> +		ret = -EFAULT;
> +		goto out_put_page;
> +	}
> +
> +	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +	if (ret < 0)
> +		goto out_put_page;
> +
> +	read_lock(&kvm->mmu_lock);
Although the mirrored root can't be zapped with the shared lock currently,
would it be better to hold write_lock() here?

It should bring no extra overhead under normal conditions when
tdx_gmem_post_populate() is called.

> [...]

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-04  3:07 ` [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory Rick Edgecombe
  2024-09-04  4:53   ` Yan Zhao
@ 2024-09-04 13:56   ` Edgecombe, Rick P
  2024-09-10 10:16   ` Paolo Bonzini
  2 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-04 13:56 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-03 at 20:07 -0700, Rick Edgecombe wrote:
> +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +                                 void __user *src, int order, void *_arg)
> +{
> +       u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +       struct tdx_gmem_post_populate_arg *arg = _arg;
> +       struct kvm_vcpu *vcpu = arg->vcpu;
> +       gpa_t gpa = gfn_to_gpa(gfn);
> +       u8 level = PG_LEVEL_4K;
> +       struct page *page;
> +       int ret, i;
> +       u64 err, entry, level_state;
> +
> +       /*
> +        * Get the source page if it has been faulted in. Return failure if
> the
> +        * source page has been swapped out or unmapped in primary memory.
> +        */
> +       ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
> +       if (ret < 0)
> +               return ret;
> +       if (ret != 1)
> +               return -ENOMEM;
> +
> +       if (!kvm_mem_is_private(kvm, gfn)) {
> +               ret = -EFAULT;
> +               goto out_put_page;
> +       }

Paolo had said he was going to add this check in the gmem code. I thought it
was not added, but it actually is. So we can drop this check.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-04  4:53   ` Yan Zhao
@ 2024-09-04 14:01     ` Edgecombe, Rick P
  2024-09-06 16:30       ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-04 14:01 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: seanjc@google.com, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com, pbonzini@redhat.com

On Wed, 2024-09-04 at 12:53 +0800, Yan Zhao wrote:
> > +       if (!kvm_mem_is_private(kvm, gfn)) {
> > +               ret = -EFAULT;
> > +               goto out_put_page;
> > +       }
> > +
> > +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> > +       if (ret < 0)
> > +               goto out_put_page;
> > +
> > +       read_lock(&kvm->mmu_lock);
> Although mirrored root can't be zapped with shared lock currently, is it
> better to hold write_lock() here?
> 
> It should bring no extra overhead in a normal condition when the
> tdx_gmem_post_populate() is called.

I think we should hold the weakest lock we can. Otherwise someday someone could
run into it and think the write_lock() is required. It will add confusion.

What was the benefit of a write lock? Just in case we got it wrong?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-04  3:07 ` [PATCH 20/21] KVM: TDX: Finalize VM initialization Rick Edgecombe
@ 2024-09-04 15:37   ` Adrian Hunter
  2024-09-04 16:09     ` Edgecombe, Rick P
  2024-09-10 10:33     ` Paolo Bonzini
  2024-09-10 10:25   ` Paolo Bonzini
  1 sibling, 2 replies; 139+ messages in thread
From: Adrian Hunter @ 2024-09-04 15:37 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 4/09/24 06:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
> KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
> 
> Documentation for the API is added in another patch:
> "Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
> 
> For the purpose of attestation, a measurement must be made of the TDX VM
> initial state. This is referred to as TD Measurement Finalization, and
> uses SEAMCALL TDH.MR.FINALIZE, after which:
> 1. The VMM adding TD private pages with arbitrary content is no longer
>    allowed
> 2. The TDX VM is runnable
> 
> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>  - Added premapped check.
>  - Update for the wrapper functions for SEAMCALLs. (Sean)
>  - Add check if nr_premapped is zero.  If not, return error.
>  - Use KVM_BUG_ON() in tdx_td_finalizer() for consistency.
>  - Change tdx_td_finalizemr() to take struct kvm_tdx_cmd *cmd and return error
>    (Adrian)
>  - Handle TDX_OPERAND_BUSY case (Adrian)
>  - Updates from seamcall overhaul (Kai)
>  - Rename error->hw_error
> 
> v18:
>  - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
> 
> v15:
>  - removed unconditional tdx_track() by tdx_flush_tlb_current() that
>    does tdx_track().
> ---
>  arch/x86/include/uapi/asm/kvm.h |  1 +
>  arch/x86/kvm/vmx/tdx.c          | 28 ++++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 789d1d821b4f..0b4827e39458 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -932,6 +932,7 @@ enum kvm_tdx_cmd_id {
>  	KVM_TDX_INIT_VM,
>  	KVM_TDX_INIT_VCPU,
>  	KVM_TDX_INIT_MEM_REGION,
> +	KVM_TDX_FINALIZE_VM,
>  	KVM_TDX_GET_CPUID,
>  
>  	KVM_TDX_CMD_NR_MAX,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 796d1a495a66..3083a66bb895 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1257,6 +1257,31 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
>  	ept_sync_global();
>  }
>  
> +static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> +		return -EINVAL;
> +	/*
> +	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> +	 * TDH.MEM.PAGE.ADD().
> +	 */
> +	if (atomic64_read(&kvm_tdx->nr_premapped))
> +		return -EINVAL;
> +
> +	cmd->hw_error = tdh_mr_finalize(kvm_tdx);
> +	if ((cmd->hw_error & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY)
> +		return -EAGAIN;
> +	if (KVM_BUG_ON(cmd->hw_error, kvm)) {
> +		pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
> +		return -EIO;
> +	}
> +
> +	kvm_tdx->finalized = true;
> +	return 0;
> +}

Isaku was going to lock the mmu.  Seems like the change got lost.
It is needed to protect against racing with KVM_PRE_FAULT_MEMORY,
KVM_TDX_INIT_MEM_REGION, tdx_sept_set_private_spte(), etc.
E.g. rename tdx_td_finalizemr() to __tdx_td_finalizemr() and add:

static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
	int ret;

	write_lock(&kvm->mmu_lock);
	ret = __tdx_td_finalizemr(kvm, cmd);
	write_unlock(&kvm->mmu_lock);

	return ret;
}

> +
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_tdx_cmd tdx_cmd;
> @@ -1281,6 +1306,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  	case KVM_TDX_INIT_VM:
>  		r = tdx_td_init(kvm, &tdx_cmd);
>  		break;
> +	case KVM_TDX_FINALIZE_VM:
> +		r = tdx_td_finalizemr(kvm, &tdx_cmd);
> +		break;
>  	default:
>  		r = -EINVAL;
>  		goto out;


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-04 15:37   ` Adrian Hunter
@ 2024-09-04 16:09     ` Edgecombe, Rick P
  2024-09-10 10:33     ` Paolo Bonzini
  1 sibling, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-04 16:09 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Wed, 2024-09-04 at 18:37 +0300, Adrian Hunter wrote:
> 
> Isaku was going to lock the mmu.  Seems like the change got lost.
> To protect against racing with KVM_PRE_FAULT_MEMORY,
> KVM_TDX_INIT_MEM_REGION, tdx_sept_set_private_spte() etc
> e.g. Rename tdx_td_finalizemr to __tdx_td_finalizemr and add:
> 
> static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> {
>         int ret;
> 
>         write_lock(&kvm->mmu_lock);
>         ret = __tdx_td_finalizemr(kvm, cmd);
>         write_unlock(&kvm->mmu_lock);
> 
>         return ret;
> }

Makes sense. Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-04  3:07 ` [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Rick Edgecombe
@ 2024-09-06  1:41   ` Huang, Kai
  2024-09-09 20:25     ` Edgecombe, Rick P
  2024-09-09 15:25   ` Paolo Bonzini
  1 sibling, 1 reply; 139+ messages in thread
From: Huang, Kai @ 2024-09-06  1:41 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, pbonzini, kvm
  Cc: dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov, linux-kernel,
	Yuan Yao



On 4/09/2024 3:07 pm, Rick Edgecombe wrote:
> From: Yuan Yao <yuan.yao@intel.com>
> 
> TDX module internally uses locks to protect internal resources.  It tries
> to acquire the locks.  If it fails to obtain the lock, it returns
> TDX_OPERAND_BUSY error without spin because its execution time limitation.
> 
> TDX SEAMCALL API reference describes what resources are used.  It's known
> which TDX SEAMCALL can cause contention with which resources.  VMM can
> avoid contention inside the TDX module by avoiding contentious TDX SEAMCALL
> with, for example, spinlock.  Because OS knows better its process
> scheduling and its scalability, a lock at OS/VMM layer would work better
> than simply retrying TDX SEAMCALLs.
> 
> TDH.MEM.* API except for TDH.MEM.TRACK operates on a secure EPT tree and
> the TDX module internally tries to acquire the lock of the secure EPT tree.
> They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT in case of failure to
> get the lock.  TDX KVM allows sept callbacks to return error so that TDP
> MMU layer can retry.
> 
> Retry TDX TDH.MEM.* API on the error because the error is a rare event
> caused by zero-step attack mitigation.

The last paragraph seems like it can be improved:

It seems to say that "TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT" can only be
caused by zero-step attack detection/mitigation, which isn't true based on
the previous paragraph.

In fact, I think this patch can be dropped:

1) The TDH_MEM_xx()s can return BUSY due to the nature of the TDP MMU, but
looking at the patch "KVM: TDX: Implement hooks to propagate changes of TDP
MMU mirror page table", all the callers of TDH_MEM_xx()s are already
explicitly retrying -- they either return PF_RETRY to let the fault happen
again or explicitly loop until no BUSY is returned.  So I am not sure why
we need to loop SEAMCALL_RETRY_MAX (16) times in the common code.

2) TDH_VP_ENTER explicitly retries immediately for such case:

         /* See the comment of tdx_seamcall_sept(). */
         if (unlikely(vp_enter_ret == TDX_ERROR_SEPT_BUSY))
                 return EXIT_FASTPATH_REENTER_GUEST;


3) That means the _ONLY_ reason to retry in the common code for
TDH_MEM_xx()s is to mitigate zero-step attacks by reducing the number of
times the guest faults on the same instruction.

I don't think we need to handle zero-step attack mitigation in the first 
TDX support submission.  So I think we can just remove this patch.

> 
> Signed-off-by: Yuan Yao <yuan.yao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Updates from seamcall overhaul (Kai)
> 
> v19:
>   - fix typo TDG.VP.ENTER => TDH.VP.ENTER,
>     TDX_OPRRAN_BUSY => TDX_OPERAND_BUSY
>   - drop the description on TDH.VP.ENTER as this patch doesn't touch
>     TDH.VP.ENTER
> ---
>   arch/x86/kvm/vmx/tdx_ops.h | 48 ++++++++++++++++++++++++++++++++------
>   1 file changed, 41 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> index 0363d8544f42..8ca3e252a6ed 100644
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -31,6 +31,40 @@
>   #define pr_tdx_error_3(__fn, __err, __rcx, __rdx, __r8)	\
>   	pr_tdx_error_N(__fn, __err, "rcx 0x%llx, rdx 0x%llx, r8 0x%llx\n", __rcx, __rdx, __r8)
>   
> +/*
> + * TDX module acquires its internal lock for resources.  It doesn't spin to get
> + * locks because of its restrictions of allowed execution time.  Instead, it
> + * returns TDX_OPERAND_BUSY with an operand id.
> + *
> + * Multiple VCPUs can operate on SEPT.  Also with zero-step attack mitigation,
> + * TDH.VP.ENTER may rarely acquire SEPT lock and release it when zero-step
> + * attack is suspected.  It results in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT
> + * with TDH.MEM.* operation.  Note: TDH.MEM.TRACK is an exception.
> + *
> + * Because TDP MMU uses read lock for scalability, spin lock around SEAMCALL
> + * spoils TDP MMU effort.  Retry several times with the assumption that SEPT
> + * lock contention is rare.  But don't loop forever to avoid lockup.  Let TDP
> + * MMU retry.
> + */
> +#define TDX_ERROR_SEPT_BUSY    (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
> +
> +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> +{
> +#define SEAMCALL_RETRY_MAX     16
> +	struct tdx_module_args args_in;
> +	int retry = SEAMCALL_RETRY_MAX;
> +	u64 ret;
> +
> +	do {
> +		args_in = *in;
> +		ret = seamcall_ret(op, in);
> +	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
> +
> +	*in = args_in;
> +
> +	return ret;
> +}
> +
>   static inline u64 tdh_mng_addcx(struct kvm_tdx *kvm_tdx, hpa_t addr)
>   {
>   	struct tdx_module_args in = {
> @@ -55,7 +89,7 @@ static inline u64 tdh_mem_page_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	u64 ret;
>   
>   	clflush_cache_range(__va(hpa), PAGE_SIZE);
> -	ret = seamcall_ret(TDH_MEM_PAGE_ADD, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_PAGE_ADD, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -76,7 +110,7 @@ static inline u64 tdh_mem_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   
>   	clflush_cache_range(__va(page), PAGE_SIZE);
>   
> -	ret = seamcall_ret(TDH_MEM_SEPT_ADD, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_SEPT_ADD, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -93,7 +127,7 @@ static inline u64 tdh_mem_sept_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_SEPT_REMOVE, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_SEPT_REMOVE, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -123,7 +157,7 @@ static inline u64 tdh_mem_page_aug(struct kvm_tdx *kvm_tdx, gpa_t gpa, hpa_t hpa
>   	u64 ret;
>   
>   	clflush_cache_range(__va(hpa), PAGE_SIZE);
> -	ret = seamcall_ret(TDH_MEM_PAGE_AUG, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_PAGE_AUG, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -140,7 +174,7 @@ static inline u64 tdh_mem_range_block(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_RANGE_BLOCK, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_RANGE_BLOCK, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -335,7 +369,7 @@ static inline u64 tdh_mem_page_remove(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_PAGE_REMOVE, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;
> @@ -361,7 +395,7 @@ static inline u64 tdh_mem_range_unblock(struct kvm_tdx *kvm_tdx, gpa_t gpa,
>   	};
>   	u64 ret;
>   
> -	ret = seamcall_ret(TDH_MEM_RANGE_UNBLOCK, &in);
> +	ret = tdx_seamcall_sept(TDH_MEM_RANGE_UNBLOCK, &in);
>   
>   	*rcx = in.rcx;
>   	*rdx = in.rdx;


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-04  3:07 ` [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Rick Edgecombe
@ 2024-09-06  2:10   ` Huang, Kai
  2024-09-09 21:03     ` Edgecombe, Rick P
  2024-10-30  3:03   ` Binbin Wu
  1 sibling, 1 reply; 139+ messages in thread
From: Huang, Kai @ 2024-09-06  2:10 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, pbonzini, kvm
  Cc: dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov, linux-kernel


> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -36,9 +36,21 @@ static __init int vt_hardware_setup(void)
>   	 * is KVM may allocate couple of more bytes than needed for
>   	 * each VM.
>   	 */
> -	if (enable_tdx)
> +	if (enable_tdx) {
>   		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
>   				sizeof(struct kvm_tdx));
> +		/*
> +		 * Note, TDX may fail to initialize in a later time in
> +		 * vt_init(), in which case it is not necessary to setup
> +		 * those callbacks.  But making them valid here even
> +		 * when TDX fails to init later is fine because those
> +		 * callbacks won't be called if the VM isn't TDX guest.
> +		 */
> +		vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
> +		vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
> +		vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
> +		vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;

Nit:  The callbacks in 'struct kvm_x86_ops' are named "external", but the
TDX callbacks are named "private".  Should we rename the TDX callbacks to
make them aligned?
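
E.g., something like the below (hypothetical names, only to illustrate the
suggested rename):

		vt_x86_ops.link_external_spt = tdx_sept_link_external_spt;
		vt_x86_ops.set_external_spte = tdx_sept_set_external_spte;
		vt_x86_ops.free_external_spt = tdx_sept_free_external_spt;
		vt_x86_ops.remove_external_spte = tdx_sept_remove_external_spte;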

> +	}
>   
>   	return 0;
>   }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 6feb3ab96926..b8cd5a629a80 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -447,6 +447,177 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>   }
>   
> +static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +
> +	put_page(page);
> +}
> +
> +static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> +			    enum pg_level level, kvm_pfn_t pfn)
> +{
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	hpa_t hpa = pfn_to_hpa(pfn);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 entry, level_state;
> +	u64 err;
> +
> +	err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
> +	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> +		tdx_unpin(kvm, pfn);
> +		return -EAGAIN;
> +	}

Nit: Here (and in the other non-fatal error cases) I think we should return 
-EBUSY to make it consistent with the non-TDX case.  E.g., the non-TDX case has:

                 if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
                         return -EBUSY;

And the comment of tdp_mmu_set_spte_atomic() currently says it can only 
return 0 or -EBUSY.  It needs to be patched to reflect that it can also 
return other non-zero errors like -EIO, but those are fatal.  In terms of 
non-fatal errors I don't think we need another -EAGAIN.
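
I.e., at the AUG site above it would be roughly (same code as the hunk, 
only the errno changed):

	err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
		tdx_unpin(kvm, pfn);
		return -EBUSY;	/* instead of -EAGAIN */
	}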

/*
  * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically

[...]

  * Return:
  * * 0      - If the SPTE was set.
  * * -EBUSY - If the SPTE cannot be set. In this case this function will
  *	      have no side-effects other than setting iter->old_spte to
  *            the last known value of the spte.
  */

[...]

> +
> +static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> +				      enum pg_level level, kvm_pfn_t pfn)
> +{
>
[...]

> +
> +	hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> +	do {
> +		/*
> +		 * TDX_OPERAND_BUSY can happen on locking PAMT entry.  Because
> +		 * this page was removed above, other thread shouldn't be
> +		 * repeatedly operating on this page.  Just retry loop.
> +		 */
> +		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> +	} while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));

In what case(s) can other threads concurrently lock the PAMT entry, 
leading to the BUSY above?

[...]

> +
> +int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> +				 enum pg_level level, kvm_pfn_t pfn)
> +{
> +	int ret;
> +
> +	/*
> +	 * HKID is released when vm_free() which is after closing gmem_fd

From the latest dev branch, the HKID is freed from vt_vm_destroy(), not 
vm_free() (which should be tdx_vm_free(), btw).

static void vt_vm_destroy(struct kvm *kvm)
{
         if (is_td(kvm))
                 return tdx_mmu_release_hkid(kvm);

         vmx_vm_destroy(kvm);
}

Btw, why not have a tdx_vm_destroy() wrapper?  It seems all the other 
vt_xx()s have a corresponding tdx_xx(), but only this one calls 
tdx_mmu_release_hkid() directly.
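
E.g., just a sketch of what I mean (is_td()/vmx_vm_destroy() as in the 
snippet above):

static void tdx_vm_destroy(struct kvm *kvm)
{
	tdx_mmu_release_hkid(kvm);
}

static void vt_vm_destroy(struct kvm *kvm)
{
	if (is_td(kvm))
		return tdx_vm_destroy(kvm);

	vmx_vm_destroy(kvm);
}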

> +	 * which causes gmem invalidation to zap all spte.
> +	 * Population is only allowed after KVM_TDX_INIT_VM.
> +	 */

What does the second sentence ("Population ...") mean?  Why is it 
relevant here?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-04 14:01     ` Edgecombe, Rick P
@ 2024-09-06 16:30       ` Edgecombe, Rick P
  2024-09-09  1:29         ` Yan Zhao
  2024-09-10 10:13         ` Paolo Bonzini
  0 siblings, 2 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-06 16:30 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: seanjc@google.com, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com, pbonzini@redhat.com

On Wed, 2024-09-04 at 07:01 -0700, Rick Edgecombe wrote:
> On Wed, 2024-09-04 at 12:53 +0800, Yan Zhao wrote:
> > > +       if (!kvm_mem_is_private(kvm, gfn)) {
> > > +               ret = -EFAULT;
> > > +               goto out_put_page;
> > > +       }
> > > +
> > > +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> > > +       if (ret < 0)
> > > +               goto out_put_page;
> > > +
> > > +       read_lock(&kvm->mmu_lock);
> > Although mirrored root can't be zapped with shared lock currently, is it
> > better to hold write_lock() here?
> > 
> > It should bring no extra overhead in a normal condition when the
> > tdx_gmem_post_populate() is called.
> 
> I think we should hold the weakest lock we can. Otherwise someday someone
> could
> run into it and think the write_lock() is required. It will add confusion.
> 
> What was the benefit of a write lock? Just in case we got it wrong?

I just tried to draft a comment to make it look less weird, but I think actually
even the mmu_read lock is technically unnecessary because we hold both
filemap_invalidate_lock() and slots_lock. The cases we care about:
 memslot deletion - slots_lock protects
 gmem hole punch - filemap_invalidate_lock() protects
 set attributes - slots_lock protects
 others?

So I guess all the mirror zapping cases that could execute concurrently are
already covered by other locks. If we skipped grabbing the mmu lock completely
it would trigger the assertion in kvm_tdp_mmu_gpa_is_mapped(). Removing the
assert would probably make kvm_tdp_mmu_gpa_is_mapped() a bit dangerous. Hmm. 

Maybe a comment like this:
/*
 * The case to care about here is a PTE getting zapped concurrently and 
 * this function erroneously thinking a page is mapped in the mirror EPT.
 * The private mem zapping paths are already covered by other locks held
 * here, but grab an mmu read_lock to not trigger the assert in
 * kvm_tdp_mmu_gpa_is_mapped().
 */
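
For context, it would sit right above the lock in tdx_gmem_post_populate(), 
i.e. roughly (surrounding lines taken from the hunk quoted above):

	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
	if (ret < 0)
		goto out_put_page;

	/* <the comment above goes here> */
	read_lock(&kvm->mmu_lock);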

Yan, do you think it is sufficient?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-06 16:30       ` Edgecombe, Rick P
@ 2024-09-09  1:29         ` Yan Zhao
  2024-09-10 10:13         ` Paolo Bonzini
  1 sibling, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-09-09  1:29 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com, pbonzini@redhat.com

On Sat, Sep 07, 2024 at 12:30:00AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-09-04 at 07:01 -0700, Rick Edgecombe wrote:
> > On Wed, 2024-09-04 at 12:53 +0800, Yan Zhao wrote:
> > > > +       if (!kvm_mem_is_private(kvm, gfn)) {
> > > > +               ret = -EFAULT;
> > > > +               goto out_put_page;
> > > > +       }
> > > > +
> > > > +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> > > > +       if (ret < 0)
> > > > +               goto out_put_page;
> > > > +
> > > > +       read_lock(&kvm->mmu_lock);
> > > Although mirrored root can't be zapped with shared lock currently, is it
> > > better to hold write_lock() here?
> > > 
> > > It should bring no extra overhead in a normal condition when the
> > > tdx_gmem_post_populate() is called.
> > 
> > I think we should hold the weakest lock we can. Otherwise someday someone
> > could
> > run into it and think the write_lock() is required. It will add confusion.
> > 
> > What was the benefit of a write lock? Just in case we got it wrong?
> 
> I just tried to draft a comment to make it look less weird, but I think actually
> even the mmu_read lock is technically unnecessary because we hold both
> filemap_invalidate_lock() and slots_lock. The cases we care about:
>  memslot deletion - slots_lock protects
>  gmem hole punch - filemap_invalidate_lock() protects
>  set attributes - slots_lock protects
>  others?
> 
> So I guess all the mirror zapping cases that could execute concurrently are
> already covered by other locks. If we skipped grabbing the mmu lock completely
> it would trigger the assertion in kvm_tdp_mmu_gpa_is_mapped(). Removing the
> assert would probably make kvm_tdp_mmu_gpa_is_mapped() a bit dangerous. Hmm. 
> 
> Maybe a comment like this:
> /*
>  * The case to care about here is a PTE getting zapped concurrently and 
>  * this function erroneously thinking a page is mapped in the mirror EPT.
>  * The private mem zapping paths are already covered by other locks held
>  * here, but grab an mmu read_lock to not trigger the assert in
>  * kvm_tdp_mmu_gpa_is_mapped().
>  */
> 
> Yan, do you think it is sufficient?
Yes, with the current code base, I think it's sufficient. Thanks!

I asked that question just to confirm whether we need to guard against the
potential removal of an SPTE under a shared lock, given the change is small and
KVM_TDX_INIT_MEM_REGION() is not on a performance-critical path.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 01/21] KVM: x86/mmu: Implement memslot deletion for TDX
  2024-09-04  3:07 ` [PATCH 01/21] KVM: x86/mmu: Implement memslot deletion for TDX Rick Edgecombe
@ 2024-09-09 13:44   ` Paolo Bonzini
  2024-09-09 21:06     ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 13:44 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> Force TDX VMs to use the KVM_X86_QUIRK_SLOT_ZAP_ALL behavior.
> 
> TDs cannot use the fast zapping operation to implement memslot deletion for
> a couple reasons:
> 1. KVM cannot fully zap and re-build TDX private PTEs without coordinating
>     with the guest. This is due to the TDs needing to "accept" memory. So
>     an operation to delete a memslot needs to limit the private zapping to
>     the range of the memslot.
> 2. For reason (1), kvm_mmu_zap_all_fast() is limited to direct (shared)
>     roots. This means it will not zap the mirror (private) PTEs. If a
>     memslot is deleted with private memory mapped, the private memory would
>     remain mapped in the TD. Then if later the gmem fd was whole punched,
>     the pages could be freed on the host while still mapped in the TD. This
>     is because that operation would no longer have the memslot to map the
>     pgoff to the gfn.
> 
> To handle the first case, userspace could simply set the
> KVM_X86_QUIRK_SLOT_ZAP_ALL quirk for TDs. This would prevent the issue in
> (1), but it is not sufficient to resolve (2) because the problems there
> extend beyond the userspace's TD to affect the rest of the host. So the
> zap-leafs-only behavior is required for both reasons.
> 
> A couple options were considered, including forcing
> KVM_X86_QUIRK_SLOT_ZAP_ALL to always be on for TDs, however due to the
> currently limited quirks interface (no way to query quirks, or force them
> to be disabled), this would require developing additional interfaces. So
> instead just do the simple thing and make TDs always do the zap-leafs
> behavior like when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled.
> 
> While at it, have the new behavior apply to all non-KVM_X86_DEFAULT_VM VMs,
> as the previous behavior was not ideal (see [0]). It is assumed until
> proven otherwise that the other VM types will not be exposed to the bug[1]
> that derailed that effort.
> 
> Memslot deletion needs to zap both the private and shared mappings of a
> GFN, so update the attr_filter field in kvm_mmu_zap_memslot_leafs() to
> include both.
> 
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Link: https://lore.kernel.org/kvm/20190205205443.1059-1-sean.j.christopherson@intel.com/ [0]
> Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
> ---
> TDX MMU part 2 v1:
>   - Clarify TDX limits on zapping private memory (Sean)
> 
> Memslot quirk series:
>   - New patch
> ---
>   arch/x86/kvm/mmu/mmu.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a8d91cf11761..7e66d7c426c1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7104,6 +7104,7 @@ static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *s
>   		.start = slot->base_gfn,
>   		.end = slot->base_gfn + slot->npages,
>   		.may_block = true,
> +		.attr_filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED,
>   	};
>   	bool flush = false;
>   

Stale commit message, I guess.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 02/21] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU
  2024-09-04  3:07 ` [PATCH 02/21] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU Rick Edgecombe
@ 2024-09-09 13:51   ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 13:51 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Export a function to walk down the TDP MMU without modifying it and simply
> check if a GPA is mapped.
> 
> Future changes will support pre-populating TDX private memory. In order to
> implement this KVM will need to check if a given GFN is already
> pre-populated in the mirrored EPT. [1]
> 
> There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
> the KVM MMU that almost does what is required. However, to make sense of
> the results, MMU internal PTE helpers are needed. Refactor the code to
> provide a helper that can be used outside of the KVM MMU code.
> 
> Refactoring the KVM page fault handler to support this lookup usage was
> also considered, but it was an awkward fit.
> 
> kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.
> 
> Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Change exported function to just return of GPA is mapped because "You
>     are executing with the filemap_invalidate_lock() taken, and therefore
>     cannot race with kvm_gmem_punch_hole()" (Paolo)
>     https://lore.kernel.org/kvm/CABgObfbpNN842noAe77WYvgi5MzK2SAA_FYw-=fGa+PcT_Z22w@mail.gmail.com/
>   - Take root hpa instead of enum (Paolo)
> 
> TDX MMU Prep v2:
>   - Rename function with "mirror" and use root enum
> 
> TDX MMU Prep:
>   - New patch
> ---
>   arch/x86/kvm/mmu.h         |  3 +++
>   arch/x86/kvm/mmu/mmu.c     |  3 +--
>   arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++++-----
>   3 files changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 8f289222b353..5faa416ac874 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -254,6 +254,9 @@ extern bool tdp_mmu_enabled;
>   #define tdp_mmu_enabled false
>   #endif
>   
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
> +
>   static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
>   {
>   	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7e66d7c426c1..01808cdf8627 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4713,8 +4713,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	return direct_page_fault(vcpu, fault);
>   }
>   
> -static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> -			    u8 *level)
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
>   {
>   	int r;
>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 37b3769a5d32..019b43723d90 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1911,16 +1911,13 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>    *
>    * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
>    */
> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> -			 int *root_level)
> +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> +				  struct kvm_mmu_page *root)
>   {
> -	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
>   	struct tdp_iter iter;
>   	gfn_t gfn = addr >> PAGE_SHIFT;
>   	int leaf = -1;
>   
> -	*root_level = vcpu->arch.mmu->root_role.level;
> -
>   	tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
>   		leaf = iter.level;
>   		sptes[leaf] = iter.old_spte;
> @@ -1929,6 +1926,36 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>   	return leaf;
>   }
>   
> +int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> +			 int *root_level)
> +{
> +	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> +	*root_level = vcpu->arch.mmu->root_role.level;
> +
> +	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root);
> +}
> +
> +bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	bool is_direct = kvm_is_addr_direct(kvm, gpa);
> +	hpa_t root = is_direct ? vcpu->arch.mmu->root.hpa :
> +				 vcpu->arch.mmu->mirror_root_hpa;
> +	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> +	int leaf;
> +
> +	lockdep_assert_held(&kvm->mmu_lock);
> +	rcu_read_lock();
> +	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
> +	rcu_read_unlock();
> +	if (leaf < 0)
> +		return false;
> +
> +	spte = sptes[leaf];
> +	return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
> +}
> +EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
> +
>   /*
>    * Returns the last level spte pointer of the shadow page walk for the given
>    * gpa, and sets *spte to the spte value. This spte may be non-preset. If no

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

I will take another look at the locking after I see some callers.
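
For reference, I'd expect a caller to be roughly of this shape -- just a 
sketch, presumably the real one shows up with the KVM_TDX_INIT_MEM_REGION 
patch:

	bool mapped;

	read_lock(&kvm->mmu_lock);
	mapped = kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa);
	read_unlock(&kvm->mmu_lock);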

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 03/21] KVM: x86/mmu: Do not enable page track for TD guest
  2024-09-04  3:07 ` [PATCH 03/21] KVM: x86/mmu: Do not enable page track for TD guest Rick Edgecombe
@ 2024-09-09 13:53   ` Paolo Bonzini
  2024-09-09 21:07     ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 13:53 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel, Yuan Yao, Binbin Wu

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
> 
> TDX does not support write protection and hence page tracking.
> Though !tdp_enabled and kvm_shadow_root_allocated(kvm) are always false
> for a TD guest, kvm_page_track_write_tracking_enabled() should also return
> false when external write tracking is enabled.
> 
> Cc: Yuan Yao <yuan.yao@linux.intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> v19:
> - drop TDX: from the short log
> - Added reviewed-by: BinBin
> ---
>   arch/x86/kvm/mmu/page_track.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 561c331fd6ec..26436113103a 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -35,6 +35,9 @@ static bool kvm_external_write_tracking_enabled(struct kvm *kvm)
>   
>   bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
>   {
> +	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
> +		return false;
> +
>   	return kvm_external_write_tracking_enabled(kvm) ||
>   	       kvm_shadow_root_allocated(kvm) || !tdp_enabled;
>   }

You should instead return an error from 
kvm_enable_external_write_tracking().

This will cause kvm_page_track_register_notifier() and therefore 
intel_vgpu_open_device() to fail.
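
Something like this sketch (the exact errno is just a placeholder):

static int kvm_enable_external_write_tracking(struct kvm *kvm)
{
	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
		return -EOPNOTSUPP;

	/* ... existing enabling logic ... */
}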

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2024-09-04  3:07 ` [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function Rick Edgecombe
@ 2024-09-09 13:57   ` Paolo Bonzini
  2024-09-09 16:07   ` Sean Christopherson
  1 sibling, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 13:57 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel, Binbin Wu

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> The difference for a TDX EPT violation is how the information, i.e. the GPA
> and exit qualification, is retrieved.  To share the EPT violation handling
> code, split out the guts of the EPT violation handler so that the VMX/TDX
> exit handlers can call it after retrieving the GPA and exit qualification.

Already has my RB but, for what it's worth, I'm not sure it's necessary 
to put this in a header as opposed to main.c.  Otherwise no comments, as 
there isn't much going on here.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-04  3:07 ` [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem Rick Edgecombe
@ 2024-09-09 13:59   ` Paolo Bonzini
  2024-09-11  8:52   ` Chao Gao
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 13:59 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> Teach the EPT violation helper to check the shared mask of a GPA to find
> out whether the GPA is for private memory.
> 
> When an EPT violation is triggered by a TD accessing a private GPA, KVM will
> exit to user space if the corresponding GFN's attribute is not private.
> User space will then update the GFN's attribute during its memory conversion
> process. After that, the TD will re-access the private GPA and trigger an
> EPT violation again. Only when the GFN's attribute matches private will KVM
> fault in the private page, map it in the mirrored TDP root, and propagate
> changes to the private EPT to resolve the EPT violation.
> 
> Relying on the GFN's attribute tracking xarray to determine if a GFN is
> private, as is done for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
> violations.
> 
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Split from "KVM: TDX: handle ept violation/misconfig exit"
> ---
>   arch/x86/kvm/vmx/common.h | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 78ae39b6cdcd..10aa12d45097 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -6,6 +6,12 @@
>   
>   #include "mmu.h"
>   
> +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> +	/* For TDX the direct mask is the shared mask. */
> +	return !kvm_is_addr_direct(kvm, gpa);
> +}
> +
>   static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>   					     unsigned long exit_qualification)
>   {
> @@ -28,6 +34,13 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>   		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
>   			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>   
> +	/*
> +	 * Don't rely on GFN's attribute tracking xarray to prevent EPT violation
> +	 * loops.
> +	 */
> +	if (kvm_is_private_gpa(vcpu->kvm, gpa))
> +		error_code |= PFERR_PRIVATE_ACCESS;
> +
>   	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>   

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 06/21] KVM: TDX: Add accessors VMX VMCS helpers
  2024-09-04  3:07 ` [PATCH 06/21] KVM: TDX: Add accessors VMX VMCS helpers Rick Edgecombe
@ 2024-09-09 14:19   ` Paolo Bonzini
  2024-09-09 21:29     ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 14:19 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> +static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx,	\
> +							 u32 field, u64 bit)	\
> +{										\
> +	u64 err;								\
> +										\
> +	tdvps_##lclass##_check(field, bits);					\
> +	err = tdh_vp_wr(tdx, TDVPS_##uclass(field), 0, bit);			\
> +	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
> +		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n",	\
> +		       field, bit,  err);					\

Maybe a bit large when inlined?  Maybe

	if (unlikely(err))
		tdh_vp_wr_failed(tdx, field, bit, err);

and add tdh_vp_wr_failed to tdx.c.
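
The out-of-line helper in tdx.c could then be something like (just a 
sketch; keeping the exact current message means either losing the #uclass 
string or passing it in as well):

void tdh_vp_wr_failed(struct vcpu_tdx *tdx, u32 field, u64 bit, u64 err)
{
	/* presumably the KVM_BUG_ON() moves here too */
	KVM_BUG_ON(err, tdx->vcpu.kvm);
	pr_err("TDH_VP_WR[0x%x] &= ~0x%llx failed: 0x%llx\n",
	       field, bit, err);
}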

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 08/21] KVM: TDX: Set gfn_direct_bits to shared bit
  2024-09-04  3:07 ` [PATCH 08/21] KVM: TDX: Set gfn_direct_bits to shared bit Rick Edgecombe
@ 2024-09-09 15:21   ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 15:21 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Make the direct root handle memslot GFNs at an alias with the TDX shared
> bit set.
> 
> For TDX shared memory, the memslot GFNs need to be mapped at an alias with
> the shared bit set. These shared mappings will be mapped on the KVM
> MMU's "direct" root. The direct root has its mappings shifted by
> applying "gfn_direct_bits" as a mask. The concept of "GPAW" (guest
> physical address width) determines the location of the shared bit. So set
> gfn_direct_bits based on this, to map shared memory at the proper GPA.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Move setting of gfn_direct_bits to separate patch (Yan)
> ---
>   arch/x86/kvm/vmx/tdx.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8f43977ef4c6..25c24901061b 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -921,6 +921,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   	kvm_tdx->attributes = td_params->attributes;
>   	kvm_tdx->xfam = td_params->xfam;
>   
> +	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
> +		kvm->arch.gfn_direct_bits = gpa_to_gfn(BIT_ULL(51));
> +	else
> +		kvm->arch.gfn_direct_bits = gpa_to_gfn(BIT_ULL(47));
> +
>   out:
>   	/* kfree() accepts NULL. */
>   	kfree(init_vm);

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-04  3:07 ` [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Rick Edgecombe
  2024-09-06  1:41   ` Huang, Kai
@ 2024-09-09 15:25   ` Paolo Bonzini
  2024-09-09 20:22     ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 15:25 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel, Yuan Yao

On 9/4/24 05:07, Rick Edgecombe wrote:
> +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> +{
> +#define SEAMCALL_RETRY_MAX     16

How is the 16 determined?  Also, is the lock per-VM or global?

Thanks,

Paolo

> +	struct tdx_module_args args_in;
> +	int retry = SEAMCALL_RETRY_MAX;
> +	u64 ret;
> +
> +	do {
> +		args_in = *in;
> +		ret = seamcall_ret(op, in);
> +	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
> +
> +	*in = args_in;
> +
> +	return ret;
> +}
> +


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX
  2024-09-04  3:07 ` [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX Rick Edgecombe
@ 2024-09-09 15:26   ` Paolo Bonzini
  2024-09-12  0:15   ` Huang, Kai
  1 sibling, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 15:26 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Disable TDX support when TDP MMU or mmio caching aren't supported.
> 
> As the TDP MMU is becoming more mainstream than the legacy MMU, legacy MMU
> support for TDX isn't implemented.
> 
> TDX requires KVM MMIO caching. Without MMIO caching, KVM will fall back to
> MMIO emulation without installing SPTEs for MMIOs. However, a TDX guest is
> protected and KVM would hit errors when trying to emulate MMIOs for a TDX
> guest during instruction decoding. So, a TDX guest relies on SPTEs being
> installed for MMIOs, with no RWX bits and with the VE suppress bit unset,
> to inject a #VE into the TDX guest. The TDX guest would then issue a
> TDVMCALL in the #VE handler to perform instruction decoding and have the
> host do MMIO emulation.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Addressed Binbin's comment by massaging Isaku's updated comments and
>     adding more explanations about introducing mmio caching.
>   - Addressed Sean's comments of v19 according to Isaku's update but
>     kept the warning for MOVDIR64B.
>   - Move code change in tdx_hardware_setup() to __tdx_bringup() since the
>     former has been removed.
> ---
>   arch/x86/kvm/mmu/mmu.c  | 1 +
>   arch/x86/kvm/vmx/main.c | 1 +
>   arch/x86/kvm/vmx/tdx.c  | 8 +++-----
>   3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 01808cdf8627..d26b235d8f84 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -110,6 +110,7 @@ static bool __ro_after_init tdp_mmu_allowed;
>   #ifdef CONFIG_X86_64
>   bool __read_mostly tdp_mmu_enabled = true;
>   module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
> +EXPORT_SYMBOL_GPL(tdp_mmu_enabled);
>   #endif
>   
>   static int max_huge_page_level __read_mostly;
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index c9dfa3aa866c..2cc29d0fc279 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -3,6 +3,7 @@
>   
>   #include "x86_ops.h"
>   #include "vmx.h"
> +#include "mmu.h"
>   #include "nested.h"
>   #include "pmu.h"
>   #include "posted_intr.h"
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 25c24901061b..0c08062ef99f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1474,16 +1474,14 @@ static int __init __tdx_bringup(void)
>   	const struct tdx_sys_info_td_conf *td_conf;
>   	int r;
>   
> +	if (!tdp_mmu_enabled || !enable_mmio_caching)
> +		return -EOPNOTSUPP;
> +
>   	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
>   		pr_warn("MOVDIR64B is reqiured for TDX\n");
>   		return -EOPNOTSUPP;
>   	}
>   
> -	if (!enable_ept) {
> -		pr_err("Cannot enable TDX with EPT disabled.\n");
> -		return -EINVAL;
> -	}
> -
>   	/*
>   	 * Enabling TDX requires enabling hardware virtualization first,
>   	 * as making SEAMCALLs requires CPU being in post-VMXON state.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 12/21] KVM: TDX: Set per-VM shadow_mmio_value to 0
  2024-09-04  3:07 ` [PATCH 12/21] KVM: TDX: Set per-VM shadow_mmio_value to 0 Rick Edgecombe
@ 2024-09-09 15:33   ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 15:33 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 46a26be0245b..4ab6d2a87032 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -94,8 +94,6 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
>   	u64 spte = generation_mmio_spte_mask(gen);
>   	u64 gpa = gfn << PAGE_SHIFT;
>   
> -	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
> -
>   	access &= shadow_mmio_access_mask;
>   	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
>   	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 0c08062ef99f..9da71782660f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,7 +6,7 @@
>   #include "mmu.h"
>   #include "tdx.h"
>   #include "tdx_ops.h"
> -
> +#include "mmu/spte.h"
>   
>   #undef pr_fmt
>   #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> @@ -344,6 +344,19 @@ int tdx_vm_init(struct kvm *kvm)
>   {
>   	kvm->arch.has_private_mem = true;
>   
> +	/*
> +	 * Because guest TD is protected, VMM can't parse the instruction in TD.
> +	 * Instead, guest uses MMIO hypercall.  For unmodified device driver,
> +	 * #VE needs to be injected for MMIO and #VE handler in TD converts MMIO
> +	 * instruction into MMIO hypercall.
> +	 *
> +	 * SPTE value for MMIO needs to be setup so that #VE is injected into
> +	 * TD instead of triggering EPT MISCONFIG.
> +	 * - RWX=0 so that EPT violation is triggered.
> +	 * - suppress #VE bit is cleared to inject #VE.
> +	 */
> +	kvm_mmu_set_mmio_spte_value(kvm, 0);
> +
>   	/*
>   	 * This function initializes only KVM software construct.  It doesn't
>   	 * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 11/21] KVM: x86/mmu: Add setter for shadow_mmio_value
  2024-09-04  3:07 ` [PATCH 11/21] KVM: x86/mmu: Add setter for shadow_mmio_value Rick Edgecombe
@ 2024-09-09 15:33   ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 15:33 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Future changes will want to set shadow_mmio_value from TDX code. Add a
> setter helper with a name that makes more sense from that context.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> [split into new patch]
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Split into new patch
> ---
>   arch/x86/kvm/mmu.h      | 1 +
>   arch/x86/kvm/mmu/spte.c | 6 ++++++
>   2 files changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 5faa416ac874..72035154a23a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -78,6 +78,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
>   u8 kvm_mmu_get_max_tdp_level(void);
>   
>   void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> +void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>   void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
>   
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index d4527965e48c..46a26be0245b 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -409,6 +409,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
>   }
>   EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>   
> +void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
> +{
> +	kvm->arch.shadow_mmio_value = mmio_value;
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
> +
>   void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
>   {
>   	/* shadow_me_value must be a subset of shadow_me_mask */

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 21/21] KVM: TDX: Handle vCPU dissociation
  2024-09-04  3:07 ` [PATCH 21/21] KVM: TDX: Handle vCPU dissociation Rick Edgecombe
@ 2024-09-09 15:41   ` Paolo Bonzini
  2024-09-09 23:30     ` Edgecombe, Rick P
  2024-09-10 10:45   ` Paolo Bonzini
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-09 15:41 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
> the address translation caches and cached TD VMCS of a TD vCPU in its
> associated pCPU.
> 
> In TDX, a vCPU can only be associated with one pCPU at a time, and the
> association is done by invoking SEAMCALL TDH.VP.ENTER. For a successful
> association, the vCPU must be dissociated from its previously associated
> pCPU.
> 
> To facilitate vCPU dissociation, introduce a per-pCPU list
> associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
> pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
> pCPU).
> 
> vCPU dissociations can happen under below conditions:
> - On the op hardware_disable is called.
>    This op is called when virtualization is disabled on a given pCPU, e.g.
>    when hot-unplug a pCPU or machine shutdown/suspend.
>    In this case, dissociate all vCPUs from the pCPU by iterating its
>    per-pCPU list associated_tdvcpus.
> 
> - On vCPU migration to a new pCPU.
>    Before adding a vCPU into associated_tdvcpus list of the new pCPU,
>    dissociation from its old pCPU is required, which is performed by issuing
>    an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
>    On a successful dissociation, the vCPU will be removed from the
>    associated_tdvcpus list of its previously associated pCPU.
> 
> - When tdx_mmu_release_hkid() is called.
>    TDX mandates that all vCPUs must be disassociated prior to the release of
>    an hkid. Therefore, dissociation of all vCPUs is a must before executing
>    the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the hkid.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

I think this didn't apply correctly to kvm-coco-queue, but I'll wait for 
further instructions on next postings.

Paolo

> ---
> TDX MMU part 2 v1:
>   - Changed title to "KVM: TDX: Handle vCPU dissociation" .
>   - Updated commit log.
>   - Removed calling tdx_disassociate_vp_on_cpu() in tdx_vcpu_free() since
>     no new TD enter would be called for vCPU association after
>     tdx_mmu_release_hkid(), which is now called in vt_vm_destroy(), i.e.
>     after releasing vcpu fd and kvm_unload_vcpu_mmus(), and before
>     tdx_vcpu_free().
>   - TODO: include Isaku's fix
>     https://eclists.intel.com/sympa/arc/kvm-qemu-review/2024-07/msg00359.html
>   - Update for the wrapper functions for SEAMCALLs. (Sean)
>   - Removed unnecessary pr_err() in tdx_flush_vp_on_cpu().
>   - Use KVM_BUG_ON() in tdx_flush_vp_on_cpu() for consistency.
>   - Capitalize the first word of the title. (Binbin)
>   - Minor fixes in changelog. (Binbin, Reinette(internal))
>   - Fix some comments. (Binbin, Reinette(internal))
>   - Rename arg_ to _arg (Binbin)
>   - Updates from seamcall overhaul (Kai)
>   - Remove lockdep_assert_preemption_disabled() in tdx_hardware_setup()
>     since now hardware_enable() is not called via SMP func call anymore,
>     but (per-cpu) CPU hotplug thread
>   - Use KVM_BUG_ON() for SEAMCALLs in tdx_mmu_release_hkid() (Kai)
>   - Update based on upstream commit "KVM: x86: Fold kvm_arch_sched_in()
>     into kvm_arch_vcpu_load()"
>   - Eliminate TDX_FLUSHVP_NOT_DONE error check because vCPUs were all freed.
>     So the error won't happen. (Sean)
> ---
>   arch/x86/kvm/vmx/main.c    |  22 +++++-
>   arch/x86/kvm/vmx/tdx.c     | 151 +++++++++++++++++++++++++++++++++++--
>   arch/x86/kvm/vmx/tdx.h     |   2 +
>   arch/x86/kvm/vmx/x86_ops.h |   4 +
>   4 files changed, 169 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 8f5dbab9099f..8171c1412c3b 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -10,6 +10,14 @@
>   #include "tdx.h"
>   #include "tdx_arch.h"
>   
> +static void vt_hardware_disable(void)
> +{
> +	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
> +	if (enable_tdx)
> +		tdx_hardware_disable();
> +	vmx_hardware_disable();
> +}
> +
>   static __init int vt_hardware_setup(void)
>   {
>   	int ret;
> @@ -113,6 +121,16 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	vmx_vcpu_reset(vcpu, init_event);
>   }
>   
> +static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_vcpu_load(vcpu, cpu);
> +		return;
> +	}
> +
> +	vmx_vcpu_load(vcpu, cpu);
> +}
> +
>   static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
>   {
>   	/*
> @@ -217,7 +235,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.hardware_unsetup = vmx_hardware_unsetup,
>   
>   	.hardware_enable = vmx_hardware_enable,
> -	.hardware_disable = vmx_hardware_disable,
> +	.hardware_disable = vt_hardware_disable,
>   	.emergency_disable = vmx_emergency_disable,
>   
>   	.has_emulated_msr = vmx_has_emulated_msr,
> @@ -234,7 +252,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.vcpu_reset = vt_vcpu_reset,
>   
>   	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
> -	.vcpu_load = vmx_vcpu_load,
> +	.vcpu_load = vt_vcpu_load,
>   	.vcpu_put = vmx_vcpu_put,
>   
>   	.update_exception_bitmap = vmx_update_exception_bitmap,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3083a66bb895..554154d3dd58 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -57,6 +57,14 @@ static DEFINE_MUTEX(tdx_lock);
>   /* Maximum number of retries to attempt for SEAMCALLs. */
>   #define TDX_SEAMCALL_RETRIES	10000
>   
> +/*
> + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUS.
> + * Protected by interrupt mask.  This list is manipulated in process context
> + * of vCPU and IPI callback.  See tdx_flush_vp_on_cpu().
> + */
> +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> +
>   static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
>   {
>   	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> @@ -88,6 +96,22 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
>   	return kvm_tdx->finalized;
>   }
>   
> +static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	list_del(&to_tdx(vcpu)->cpu_list);
> +
> +	/*
> +	 * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
> +	 * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> +	 * to its list before it's deleted from this CPU's list.
> +	 */
> +	smp_wmb();
> +
> +	vcpu->cpu = -1;
> +}
> +
>   static void tdx_clear_page(unsigned long page_pa)
>   {
>   	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -168,6 +192,83 @@ static void tdx_reclaim_control_page(unsigned long ctrl_page_pa)
>   	free_page((unsigned long)__va(ctrl_page_pa));
>   }
>   
> +struct tdx_flush_vp_arg {
> +	struct kvm_vcpu *vcpu;
> +	u64 err;
> +};
> +
> +static void tdx_flush_vp(void *_arg)
> +{
> +	struct tdx_flush_vp_arg *arg = _arg;
> +	struct kvm_vcpu *vcpu = arg->vcpu;
> +	u64 err;
> +
> +	arg->err = 0;
> +	lockdep_assert_irqs_disabled();
> +
> +	/* Task migration can race with CPU offlining. */
> +	if (unlikely(vcpu->cpu != raw_smp_processor_id()))
> +		return;
> +
> +	/*
> +	 * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized.  The
> +	 * list tracking still needs to be updated so that it's correct if/when
> +	 * the vCPU does get initialized.
> +	 */
> +	if (is_td_vcpu_created(to_tdx(vcpu))) {
> +		/*
> +		 * No need to retry.  TDX Resources needed for TDH.VP.FLUSH are:
> +		 * TDVPR as exclusive, TDR as shared, and TDCS as shared.  This
> +		 * vp flush function is called when destructing vCPU/TD or vCPU
> +		 * migration.  No other thread uses TDVPR in those cases.
> +		 */
> +		err = tdh_vp_flush(to_tdx(vcpu));
> +		if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
> +			/*
> +			 * This function is called in IPI context. Do not use
> +			 * printk to avoid console semaphore.
> +			 * The caller prints out the error message, instead.
> +			 */
> +			if (err)
> +				arg->err = err;
> +		}
> +	}
> +
> +	tdx_disassociate_vp(vcpu);
> +}
> +
> +static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
> +{
> +	struct tdx_flush_vp_arg arg = {
> +		.vcpu = vcpu,
> +	};
> +	int cpu = vcpu->cpu;
> +
> +	if (unlikely(cpu == -1))
> +		return;
> +
> +	smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
> +	if (KVM_BUG_ON(arg.err, vcpu->kvm))
> +		pr_tdx_error(TDH_VP_FLUSH, arg.err);
> +}
> +
> +void tdx_hardware_disable(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> +	struct tdx_flush_vp_arg arg;
> +	struct vcpu_tdx *tdx, *tmp;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	/* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> +	list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
> +		arg.vcpu = &tdx->vcpu;
> +		tdx_flush_vp(&arg);
> +	}
> +	local_irq_restore(flags);
> +}
> +
>   static void smp_func_do_phymem_cache_wb(void *unused)
>   {
>   	u64 err = 0;
> @@ -204,22 +305,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>   	bool packages_allocated, targets_allocated;
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>   	cpumask_var_t packages, targets;
> -	u64 err;
> +	struct kvm_vcpu *vcpu;
> +	unsigned long j;
>   	int i;
> +	u64 err;
>   
>   	if (!is_hkid_assigned(kvm_tdx))
>   		return;
>   
> -	/* KeyID has been allocated but guest is not yet configured */
> -	if (!is_td_created(kvm_tdx)) {
> -		tdx_hkid_free(kvm_tdx);
> -		return;
> -	}
> -
>   	packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
>   	targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
>   	cpus_read_lock();
>   
> +	kvm_for_each_vcpu(j, vcpu, kvm)
> +		tdx_flush_vp_on_cpu(vcpu);
> +
>   	/*
>   	 * TDH.PHYMEM.CACHE.WB tries to acquire the TDX module global lock
>   	 * and can fail with TDX_OPERAND_BUSY when it fails to get the lock.
> @@ -233,6 +333,16 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>   	 * After the above flushing vps, there should be no more vCPU
>   	 * associations, as all vCPU fds have been released at this stage.
>   	 */
> +	err = tdh_mng_vpflushdone(kvm_tdx);
> +	if (err == TDX_FLUSHVP_NOT_DONE)
> +		goto out;
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err);
> +		pr_err("tdh_mng_vpflushdone() failed. HKID %d is leaked.\n",
> +		       kvm_tdx->hkid);
> +		goto out;
> +	}
> +
>   	for_each_online_cpu(i) {
>   		if (packages_allocated &&
>   		    cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -258,6 +368,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
>   		tdx_hkid_free(kvm_tdx);
>   	}
>   
> +out:
>   	mutex_unlock(&tdx_lock);
>   	cpus_read_unlock();
>   	free_cpumask_var(targets);
> @@ -409,6 +520,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   	return 0;
>   }
>   
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (vcpu->cpu == cpu)
> +		return;
> +
> +	tdx_flush_vp_on_cpu(vcpu);
> +
> +	local_irq_disable();
> +	/*
> +	 * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
> +	 * vcpu->cpu is read before tdx->cpu_list.
> +	 */
> +	smp_rmb();
> +
> +	list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
> +	local_irq_enable();
> +}
> +
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -1977,7 +2108,7 @@ static int __init __do_tdx_bringup(void)
>   static int __init __tdx_bringup(void)
>   {
>   	const struct tdx_sys_info_td_conf *td_conf;
> -	int r;
> +	int r, i;
>   
>   	if (!tdp_mmu_enabled || !enable_mmio_caching)
>   		return -EOPNOTSUPP;
> @@ -1987,6 +2118,10 @@ static int __init __tdx_bringup(void)
>   		return -EOPNOTSUPP;
>   	}
>   
> +	/* tdx_hardware_disable() uses associated_tdvcpus. */
> +	for_each_possible_cpu(i)
> +		INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
> +
>   	/*
>   	 * Enabling TDX requires enabling hardware virtualization first,
>   	 * as making SEAMCALLs requires CPU being in post-VMXON state.
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 25a4aaede2ba..4b6fc25feeb6 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -39,6 +39,8 @@ struct vcpu_tdx {
>   	unsigned long *tdcx_pa;
>   	bool td_vcpu_created;
>   
> +	struct list_head cpu_list;
> +
>   	bool initialized;
>   
>   	/*
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index d8a00ab4651c..f4aa0ec16980 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -119,6 +119,7 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
>   void vmx_setup_mce(struct kvm_vcpu *vcpu);
>   
>   #ifdef CONFIG_INTEL_TDX_HOST
> +void tdx_hardware_disable(void);
>   int tdx_vm_init(struct kvm *kvm);
>   void tdx_mmu_release_hkid(struct kvm *kvm);
>   void tdx_vm_free(struct kvm *kvm);
> @@ -128,6 +129,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>   
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -145,6 +147,7 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
>   void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
>   int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
>   #else
> +static inline void tdx_hardware_disable(void) {}
>   static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
>   static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
>   static inline void tdx_vm_free(struct kvm *kvm) {}
> @@ -154,6 +157,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
>   static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>   static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>   static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
>   
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2024-09-04  3:07 ` [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function Rick Edgecombe
  2024-09-09 13:57   ` Paolo Bonzini
@ 2024-09-09 16:07   ` Sean Christopherson
  2024-09-10  7:36     ` Paolo Bonzini
  1 sibling, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-09-09 16:07 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: pbonzini, kvm, kai.huang, dmatlack, isaku.yamahata, yan.y.zhao,
	nik.borisov, linux-kernel, Binbin Wu

On Tue, Sep 03, 2024, Rick Edgecombe wrote:
> +static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> +					     unsigned long exit_qualification)
> +{
> +	u64 error_code;
> +
> +	/* Is it a read fault? */
> +	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
> +		     ? PFERR_USER_MASK : 0;
> +	/* Is it a write fault? */
> +	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
> +		      ? PFERR_WRITE_MASK : 0;
> +	/* Is it a fetch fault? */
> +	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
> +		      ? PFERR_FETCH_MASK : 0;
> +	/* ept page table entry is present? */
> +	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
> +		      ? PFERR_PRESENT_MASK : 0;
> +
> +	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
> +		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
> +			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> +
> +	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +}
> +
> +#endif /* __KVM_X86_VMX_COMMON_H */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5e7b5732f35d..ade7666febe9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -53,6 +53,7 @@
>  #include <trace/events/ipi.h>
>  
>  #include "capabilities.h"
> +#include "common.h"
>  #include "cpuid.h"
>  #include "hyperv.h"
>  #include "kvm_onhyperv.h"
> @@ -5771,11 +5772,8 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
>  
>  static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  {
> -	unsigned long exit_qualification;
> +	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
>  	gpa_t gpa;
> -	u64 error_code;
> -
> -	exit_qualification = vmx_get_exit_qual(vcpu);
>  
>  	/*
>  	 * EPT violation happened while executing iret from NMI,
> @@ -5791,23 +5789,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>  	trace_kvm_page_fault(vcpu, gpa, exit_qualification);
>  
> -	/* Is it a read fault? */
> -	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
> -		     ? PFERR_USER_MASK : 0;
> -	/* Is it a write fault? */
> -	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
> -		      ? PFERR_WRITE_MASK : 0;
> -	/* Is it a fetch fault? */
> -	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
> -		      ? PFERR_FETCH_MASK : 0;
> -	/* ept page table entry is present? */
> -	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
> -		      ? PFERR_PRESENT_MASK : 0;
> -
> -	if (error_code & EPT_VIOLATION_GVA_IS_VALID)
> -		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
> -			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> -

Paolo, are you planning on queueing these for 6.12, or for a later kernel?  I ask
because this will conflict with a bug fix[*] that I am planning on taking through
kvm-x86/mmu.  If you anticipate merging these in 6.12, then it'd probably be best
for you to grab that one patch directly, as I don't think it has semantic conflicts
with anything else in that series.

[*] https://lore.kernel.org/all/20240831001538.336683-2-seanjc@google.com

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 15:25   ` Paolo Bonzini
@ 2024-09-09 20:22     ` Edgecombe, Rick P
  2024-09-09 21:11       ` Sean Christopherson
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 20:22 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, Yao, Yuan, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> > +{
> > +#define SEAMCALL_RETRY_MAX     16
> 
> How is the 16 determined?  Also, is the lock per-VM or global?

The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be
for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can
share the history. In any case, there seem to be some problems with this patch
or its justification.

Regarding zero-step, the TDX Module has a mitigation for an attack where a
malicious VMM causes repeated private EPT violations for the same GPA. When this
happens, TDH.VP.ENTER will fail to enter the guest. Regardless of zero-step
detection, these SEPT-related SEAMCALLs will exit with the checked error code if
they contend the mentioned lock. If there is some other (non-zero-step related)
contention for this lock and KVM tries to re-enter the TD too many times without
resolving an EPT violation, it might inadvertently trigger the zero-step
mitigation. I *think* this patch is trying to say not to worry about this case,
and to do a simple retry loop instead to handle the contention.

But I can't find a reason why 16 retries would be sufficient. Getting the
required retry logic right is important because some failures
(TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s.

Per the docs, in general the VMM is supposed to retry SEAMCALLs that return
TDX_OPERAND_BUSY. I think we need to revisit the general question of which
SEAMCALLs we should be retrying and how many times/how long. The other
consideration is that KVM already has per-VM locking that would prevent
contention for some of the locks. So depending on internal details, KVM may not
need to do any retries in some cases.
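
For anyone reading along without the patch handy, the helper being discussed is
roughly shaped like this (a simplified sketch; the inner seamcall wrapper and
the exact bookkeeping are approximations, the bounded retry is the point):

static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
{
#define SEAMCALL_RETRY_MAX	16
	int retry = SEAMCALL_RETRY_MAX;
	u64 ret;

	do {
		/* Retry a fixed number of times while the SEPT lock is busy. */
		ret = seamcall_ret(op, in);
	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);

	return ret;
}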



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-06  1:41   ` Huang, Kai
@ 2024-09-09 20:25     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 20:25 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai
  Cc: Yao, Yuan, nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, Zhao, Yan Y,
	linux-kernel@vger.kernel.org

On Fri, 2024-09-06 at 13:41 +1200, Huang, Kai wrote:
> 3) That means the _ONLY_ reason to retry in the common code for 
> TDH_MEM_xx()s is to mitigate zero-step attack by reducing the times of 
> letting guest to fault on the same instruction.

My read of the zero-step mitigation is that it is implemented in the TDX module
(which makes sense, since it is defending against VMMs). There is an optional
ability for the guest to request notification, but the host-side defense is
always in place. Is that your understanding?

> 
> I don't think we need to handle zero-step attack mitigation in the first 
> TDX support submission.  So I think we can just remove this patch.

Thanks for highlighting the weirdness here. I think it needs more investigation.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-06  2:10   ` Huang, Kai
@ 2024-09-09 21:03     ` Edgecombe, Rick P
  2024-09-10  1:52       ` Yan Zhao
  2024-09-10  9:33       ` Paolo Bonzini
  0 siblings, 2 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 21:03 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai
  Cc: nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, Zhao, Yan Y,
	linux-kernel@vger.kernel.org

On Fri, 2024-09-06 at 14:10 +1200, Huang, Kai wrote:
> 
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -36,9 +36,21 @@ static __init int vt_hardware_setup(void)
> >          * is KVM may allocate couple of more bytes than needed for
> >          * each VM.
> >          */
> > -       if (enable_tdx)
> > +       if (enable_tdx) {
> >                 vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> >                                 sizeof(struct kvm_tdx));
> > +               /*
> > +                * Note, TDX may fail to initialize in a later time in
> > +                * vt_init(), in which case it is not necessary to setup
> > +                * those callbacks.  But making them valid here even
> > +                * when TDX fails to init later is fine because those
> > +                * callbacks won't be called if the VM isn't TDX guest.
> > +                */
> > +               vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
> > +               vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
> > +               vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
> > +               vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
> 
> Nit:  The callbacks in 'struct kvm_x86_ops' have name "external", but 
> TDX callbacks have name "private".  Should we rename TDX callbacks to 
> make them aligned?

"external" is the core MMU naming abstraction. I think you were part of the
discussion to purge special TDX private naming from the core MMU to avoid
confusion with AMD private memory in the last MMU series.

So external page tables ended up being a general concept, and private mem is the
TDX use. In practice of course it will likely only be used for TDX. So I thought
the external<->private connection here was nice to have.

> 
> > +       }
> >   
> >         return 0;
> >   }
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 6feb3ab96926..b8cd5a629a80 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -447,6 +447,177 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t
> > root_hpa, int pgd_level)
> >         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> >   }
> >   
> > +static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
> > +{
> > +       struct page *page = pfn_to_page(pfn);
> > +
> > +       put_page(page);
> > +}
> > +
> > +static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > +                           enum pg_level level, kvm_pfn_t pfn)
> > +{
> > +       int tdx_level = pg_level_to_tdx_sept_level(level);
> > +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +       hpa_t hpa = pfn_to_hpa(pfn);
> > +       gpa_t gpa = gfn_to_gpa(gfn);
> > +       u64 entry, level_state;
> > +       u64 err;
> > +
> > +       err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
> > +       if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> > +               tdx_unpin(kvm, pfn);
> > +               return -EAGAIN;
> > +       }
> 
> Nit: Here (and other non-fatal error cases) I think we should return 
> -EBUSY to make it consistent with non-TDX case?  E.g., the non-TDX case has:
> 
>                  if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
>                          return -EBUSY;
> 
> And the comment of tdp_mmu_set_spte_atomic() currently says it can only 
> return 0 or -EBUSY.  It needs to be patched to reflect it can also 
> return other non-0 errors like -EIO but those are fatal.  In terms of 
> non-fatal error I don't think we need another -EAGAIN.

Yes, good point.
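
Concretely, something like this for the hunk quoted above (sketch):

	err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
		tdx_unpin(kvm, pfn);
		/* Match tdp_mmu_set_spte_atomic()'s 0/-EBUSY contract. */
		return -EBUSY;
	}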

> 
> /*
>   * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
> 
> [...]
> 
>   * Return:
>   * * 0      - If the SPTE was set.
>   * * -EBUSY - If the SPTE cannot be set. In this case this function will
>   *           have no side-effects other than setting iter->old_spte to
>   *            the last known value of the spte.
>   */
> 
> [...]
> 
> > +
> > +static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> > +                                     enum pg_level level, kvm_pfn_t pfn)
> > +{
> > 
> [...]
> 
> > +
> > +       hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> > +       do {
> > +               /*
> > +                * TDX_OPERAND_BUSY can happen on locking PAMT entry. 
> > Because
> > +                * this page was removed above, other thread shouldn't be
> > +                * repeatedly operating on this page.  Just retry loop.
> > +                */
> > +               err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> > +       } while (unlikely(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX)));
> 
> In what case(s) other threads can concurrently lock the PAMT entry, 
> leading to the above BUSY?

Good question, lets add this to the seamcall retry research.

> 
> [...]
> 
> > +
> > +int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> > +                                enum pg_level level, kvm_pfn_t pfn)
> > +{
> > +       int ret;
> > +
> > +       /*
> > +        * HKID is released when vm_free() which is after closing gmem_fd
> 
>  From latest dev branch HKID is freed from vt_vm_destroy(), but not 
> vm_free() (which should be tdx_vm_free() btw).

Oh, yes, we should update the comment.

> 
> static void vt_vm_destroy(struct kvm *kvm)
> {
>          if (is_td(kvm))
>                  return tdx_mmu_release_hkid(kvm);
> 
>          vmx_vm_destroy(kvm);
> }
> 
> Btw, why not have a tdx_vm_destroy() wrapper?  Seems all other vt_xx()s 
> have a tdx_xx() but only this one calls tdx_mmu_release_hkid() directly.

No strong reason. It's asymmetric with the other tdx callbacks, but KVM code
tends to be less wrapped and a tdx_vm_destroy() would be a one-line function.
So I think it fits in other ways.
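
For reference, the wrapper in question would just be a one-liner plus the
existing dispatch, something like (sketch):

static void tdx_vm_destroy(struct kvm *kvm)
{
	tdx_mmu_release_hkid(kvm);
}

static void vt_vm_destroy(struct kvm *kvm)
{
	if (is_td(kvm))
		return tdx_vm_destroy(kvm);

	vmx_vm_destroy(kvm);
}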

> 
> > +        * which causes gmem invalidation to zap all spte.
> > +        * Population is only allowed after KVM_TDX_INIT_VM.
> > +        */
> 
> What does the second sentence ("Population ...")  meaning?  Why is it 
> relevant here?
> 
How about:
/*
 * HKID is released after all private pages have been removed,
 * and set before any might be populated. Warn if zapping is
 * attempted when there can't be anything populated in the private
 * EPT.
 */

But actually, I wonder if we need to remove the KVM_BUG_ON(). I think if you did
a KVM_PRE_FAULT_MEMORY and then deleted the memslot you could hit it?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 01/21] KVM: x86/mmu: Implement memslot deletion for TDX
  2024-09-09 13:44   ` Paolo Bonzini
@ 2024-09-09 21:06     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 21:06 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Mon, 2024-09-09 at 15:44 +0200, Paolo Bonzini wrote:
> 
> Stale commit message, I guess.

Oof, yes. We can lose the KVM_X86_QUIRK_SLOT_ZAP_ALL discussion.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 03/21] KVM: x86/mmu: Do not enable page track for TD guest
  2024-09-09 13:53   ` Paolo Bonzini
@ 2024-09-09 21:07     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 21:07 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Huang, Kai, binbin.wu@linux.intel.com, isaku.yamahata@gmail.com,
	Zhao, Yan Y, dmatlack@google.com, yuan.yao@linux.intel.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org

On Mon, 2024-09-09 at 15:53 +0200, Paolo Bonzini wrote:
> 
> You should instead return an error from 
> kvm_enable_external_write_tracking().
> 
> This will cause kvm_page_track_register_notifier() and therefore 
> intel_vgpu_open_device() to fail.

Makes sense, thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 20:22     ` Edgecombe, Rick P
@ 2024-09-09 21:11       ` Sean Christopherson
  2024-09-09 21:23         ` Sean Christopherson
  2024-09-10 13:15         ` Paolo Bonzini
  0 siblings, 2 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-09-09 21:11 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Yan Y Zhao, Yuan Yao,
	nik.borisov@suse.com, dmatlack@google.com, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote:
> > On 9/4/24 05:07, Rick Edgecombe wrote:
> > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> > > +{
> > > +#define SEAMCALL_RETRY_MAX     16
> > 
> > How is the 16 determined?  Also, is the lock per-VM or global?
> 
> The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be
> for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can
> share the history. In any case, there seems to be some problems with this patch
> or justification.
> 
> Regarding the zero-step mitigation, the TDX Module has a mitigation for an
> attack where a malicious VMM causes repeated private EPT violations for the same
> GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of
> zero-step detection, these SEPT related SEAMCALLs will exit with the checked
> error code if they contend the mentioned lock. If there was some other (non-
> zero-step related) contention for this lock and KVM tries to re-enter the TD too
> many times without resolving an EPT violation, it might inadvertently trigger
> the zero-step mitigation. I *think* this patch is trying to say not to worry
> about this case, and do a simple retry loop instead to handle the contention.
> 
> But why 16 retries would be sufficient, I can't find a reason for. Getting this
> required retry logic right is important because some failures
> (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s.

I (somewhat indirectly) raised this as an issue in v11, and at a (very quick)
glance, nothing has changed to alleviate my concerns.

In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever.  For
its operations, I'm pretty sure the only sane approach is for KVM to ensure there
will be no contention.  And if the TDX module's single-step protection spuriously
kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
that it's mitigating single-step, e.g. so that KVM can forward the information
to userspace, then that's a TDX module problem to solve.

> Per the docs, in general the VMM is supposed to retry SEAMCALLs that return
> TDX_OPERAND_BUSY.

IMO, that's terrible advice.  SGX has similar behavior, where the xucode "module"
signals #GP if there's a conflict.  #GP is obviously far, far worse as it lacks
the precision that would help software understand exactly what went wrong, but I
think one of the better decisions we made with the SGX driver was to have a
"zero tolerance" policy where the driver would _never_ retry due to a potential
resource conflict, i.e. that any conflict in the module would be treated as a
kernel bug.

> I think we need to revisit the general question of which
> SEAMCALLs we should be retrying and how many times/how long. The other
> consideration is that KVM already has per-VM locking, that would prevent
> contention for some of the locks. So depending on internal details KVM may not
> need to do any retries in some cases.

Yes, and if KVM can't avoid conflict/retry, then before we go any further, I want
to know exactly why that is the case.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 21:11       ` Sean Christopherson
@ 2024-09-09 21:23         ` Sean Christopherson
  2024-09-09 22:34           ` Edgecombe, Rick P
  2024-09-10 13:15         ` Paolo Bonzini
  1 sibling, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-09-09 21:23 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Yan Y Zhao, Yuan Yao,
	nik.borisov@suse.com, dmatlack@google.com, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Mon, Sep 09, 2024, Sean Christopherson wrote:
> On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> > On Mon, 2024-09-09 at 17:25 +0200, Paolo Bonzini wrote:
> > > On 9/4/24 05:07, Rick Edgecombe wrote:
> > > > +static inline u64 tdx_seamcall_sept(u64 op, struct tdx_module_args *in)
> > > > +{
> > > > +#define SEAMCALL_RETRY_MAX     16
> > > 
> > > How is the 16 determined?  Also, is the lock per-VM or global?
> > 
> > The lock being considered here is per-TD, but TDX_OPERAND_BUSY in general can be
> > for other locks. I'm not sure where the 16 came from, maybe Yuan or Isaku can
> > share the history. In any case, there seems to be some problems with this patch
> > or justification.
> > 
> > Regarding the zero-step mitigation, the TDX Module has a mitigation for an
> > attack where a malicious VMM causes repeated private EPT violations for the same
> > GPA. When this happens TDH.VP.ENTER will fail to enter the guest. Regardless of
> > zero-step detection, these SEPT related SEAMCALLs will exit with the checked
> > error code if they contend the mentioned lock. If there was some other (non-
> > zero-step related) contention for this lock and KVM tries to re-enter the TD too
> > many times without resolving an EPT violation, it might inadvertently trigger
> > the zero-step mitigation. I *think* this patch is trying to say not to worry
> > about this case, and do a simple retry loop instead to handle the contention.
> > 
> > But why 16 retries would be sufficient, I can't find a reason for. Getting this
> > required retry logic right is important because some failures
> > (TDH.MEM.RANGE.BLOCK) can lead to KVM_BUG_ON()s.
> 
> I (somewhat indirectly) raised this as an issue in v11, and at a (very quick)
> glance, nothing has changed to alleviate my concerns.

Gah, went out of my way to find the thread and then forgot to post the link:

https://lore.kernel.org/all/Y8m34OEVBfL7Q4Ns@google.com

> In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> will be no contention.  And if the TDX module's single-step protection spuriously
> kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> that it's mitigating single-step, e.g. so that KVM can forward the information
> to userspace, then that's a TDX module problem to solve.
> 
> > Per the docs, in general the VMM is supposed to retry SEAMCALLs that return
> > TDX_OPERAND_BUSY.
> 
> IMO, that's terrible advice.  SGX has similar behavior, where the xucode "module"
> signals #GP if there's a conflict.  #GP is obviously far, far worse as it lacks
> the precision that would help software understand exactly what went wrong, but I
> think one of the better decisions we made with the SGX driver was to have a
> "zero tolerance" policy where the driver would _never_ retry due to a potential
> resource conflict, i.e. that any conflict in the module would be treated as a
> kernel bug.
> 
> > I think we need to revisit the general question of which
> > SEAMCALLs we should be retrying and how many times/how long. The other
> > consideration is that KVM already has per-VM locking, that would prevent
> > contention for some of the locks. So depending on internal details KVM may not
> > need to do any retries in some cases.
> 
> Yes, and if KVM can't avoid conflict/retry, then before we go any further, I want
> to know exactly why that is the case.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 06/21] KVM: TDX: Add accessors VMX VMCS helpers
  2024-09-09 14:19   ` Paolo Bonzini
@ 2024-09-09 21:29     ` Edgecombe, Rick P
  2024-09-10 10:48       ` Paolo Bonzini
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 21:29 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Mon, 2024-09-09 at 16:19 +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx,	\
> > +							 u32 field, u64 bit)		\
> > +{											\
> > +	u64 err;									\
> > +											\
> > +	tdvps_##lclass##_check(field, bits);						\
> > +	err = tdh_vp_wr(tdx, TDVPS_##uclass(field), 0, bit);				\
> > +	if (KVM_BUG_ON(err, tdx->vcpu.kvm))						\
> > +		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n",		\
> > +		       field, bit, err);						\
> 
> Maybe a bit large when inlined?  Maybe
> 
>         if (unlikely(err))
>                 tdh_vp_wr_failed(tdx, field, bit, err);
> 
> and add tdh_vp_wr_failed to tdx.c.

There is a tiny bit of difference between the messages:
pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n", ...
pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n", ...
pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n", ...

We can parameterize that part of the message, but it gets a bit tortured. Or
just lose that bit of detail. We can take a look. Thanks.
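
If we do keep it, one option is to pass the operator as a string to an
out-of-line helper in tdx.c, roughly (a sketch, names hypothetical):

void tdh_vp_wr_failed(struct vcpu_tdx *tdx, const char *uclass, const char *op,
		      u32 field, u64 bit, u64 err)
{
	pr_err("TDH_VP_WR[%s.0x%x]%s0x%llx failed: 0x%llx\n",
	       uclass, field, op, bit, err);
}

with the write/setbit/clearbit macros passing " = ", " |= " and " &= ~"
respectively.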

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 21:23         ` Sean Christopherson
@ 2024-09-09 22:34           ` Edgecombe, Rick P
  2024-09-09 23:58             ` Sean Christopherson
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 22:34 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: linux-kernel@vger.kernel.org, Yao, Yuan, Huang, Kai,
	isaku.yamahata@gmail.com, Zhao, Yan Y, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com, pbonzini@redhat.com

On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL, ever. 
> > For
> > its operations, I'm pretty sure the only sane approach is for KVM to ensure
> > there
> > will be no contention.  And if the TDX module's single-step protection
> > spuriously
> > kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't
> > communicate
> > that it's mitigating single-step, e.g. so that KVM can forward the
> > information
> > to userspace, then that's a TDX module problem to solve.
> > 
> > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > return
> > > TDX_OPERAND_BUSY.
> > 
> > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > "module"
> > signals #GP if there's a conflict.  #GP is obviously far, far worse as it
> > lacks
> > the precision that would help software understand exactly what went wrong,
> > but I
> > think one of the better decisions we made with the SGX driver was to have a
> > "zero tolerance" policy where the driver would _never_ retry due to a
> > potential
> > resource conflict, i.e. that any conflict in the module would be treated as
> > a
> > kernel bug.

Thanks for the analysis. The direction seems reasonable to me for this lock in
particular. We need to do some analysis on how much the existing mmu_lock can
protect us. Maybe sprinkle some asserts for documentation purposes.

For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The guest
side operations can take the locks too. From "Base Architecture Specification":
"
Host-Side (SEAMCALL) Operation
------------------------------
The host VMM is expected to retry host-side operations that fail with a
TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
most after a limited time (the longest guest-side TDX module flow) there will be
no contention with a guest TD attempting to acquire access to the same resource.

Lock operations process the HOST_PRIORITY bit as follows:
   - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s
   HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
   the host VMM’s responsibility to re-attempt the SEAMCALL function until is
   succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
   from acquiring the lock.
   - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
   lock’s HOST_PRIORITY bit.
   
Guest-Side (TDCALL) Operation
-----------------------------
A TDCALL (guest-side) function that attempt to acquire a lock fails if
HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
The guest is expected to retry the operation.

Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
the host-side latency for that lock; once a lock is acquired, the flow either
releases within a fixed upper time bound, or periodically monitor the
HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
"

So KVM can't fully prevent TDX_OPERAND_BUSY with KVM-side locks, because the
locking also sorts out contention with the guest. We need to double check this,
but I *think* this HOST_PRIORITY bit doesn't come into play for the
functionality we need to exercise for base support.

The thing that makes me nervous about a retry-based solution is the potential
for some kind of deadlock-like pattern. Just to gather your opinion: if there
was some SEAMCALL contention that couldn't be locked around from KVM, but came
with strong, well-described guarantees, would a retry loop still be a hard NAK?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 21/21] KVM: TDX: Handle vCPU dissociation
  2024-09-09 15:41   ` Paolo Bonzini
@ 2024-09-09 23:30     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-09 23:30 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Mon, 2024-09-09 at 17:41 +0200, Paolo Bonzini wrote:
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> I think this didn't apply correctly to kvm-coco-queue, but I'll wait for 
> further instructions on next postings.

There was some feedback integrated into the preceding VM/vCPU creation patches
before this was generated. So that might have been it.

Thanks for the review.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 22:34           ` Edgecombe, Rick P
@ 2024-09-09 23:58             ` Sean Christopherson
  2024-09-10  0:50               ` Edgecombe, Rick P
  2024-09-11  1:17               ` Huang, Kai
  0 siblings, 2 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-09-09 23:58 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: linux-kernel@vger.kernel.org, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, Yan Y Zhao, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com, pbonzini@redhat.com

On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > KVM to ensure there will be no contention.  And if the TDX module's
> > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > single-step, e.g. so that KVM can forward the information to userspace,
> > > then that's a TDX module problem to solve.
> > > 
> > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > return TDX_OPERAND_BUSY.
> > > 
> > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > worse as it lacks the precision that would help software understand
> > > exactly what went wrong, but I think one of the better decisions we made
> > > with the SGX driver was to have a "zero tolerance" policy where the
> > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > that any conflict in the module would be treated as a kernel bug.
> 
> Thanks for the analysis. The direction seems reasonable to me for this lock in
> particular. We need to do some analysis on how much the existing mmu_lock can
> protects us. 

I would operate under the assumption that it provides SEPT no meaningful protection.
I think I would even go so far as to say that it is a _requirement_ that mmu_lock
does NOT provide the ordering required by SEPT, because I do not want to take on
any risk (due to SEPT constraints) that would limit KVM's ability to do things
while holding mmu_lock for read.

> Maybe sprinkle some asserts for documentation purposes.

Not sure I understand, assert on what?

> For the general case of TDX_OPERAND_BUSY, there might be one wrinkle. The guest
> side operations can take the locks too. From "Base Architecture Specification":
> "
> Host-Side (SEAMCALL) Operation
> ------------------------------
> The host VMM is expected to retry host-side operations that fail with a
> TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
> most after a limited time (the longest guest-side TDX module flow) there will be
> no contention with a guest TD attempting to acquire access to the same resource.
> 
> Lock operations process the HOST_PRIORITY bit as follows:
>    - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s
>    HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
>    the host VMM’s responsibility to re-attempt the SEAMCALL function until is
>    succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
>    from acquiring the lock.
>    - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
>    lock’s HOST_PRIORITY bit.

*sigh*

> Guest-Side (TDCALL) Operation
> -----------------------------
> A TDCALL (guest-side) function that attempt to acquire a lock fails if
> HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
> The guest is expected to retry the operation.
> 
> Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
> the host-side latency for that lock; once a lock is acquired, the flow either
> releases within a fixed upper time bound, or periodically monitor the
> HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
> "
> 
> So KVM can't fully prevent TDX_OPERAND_BUSY with KVM side locks, because it is
> involved in sorting out contention between the guest as well. We need to double
> check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the
> functionality we need to exercise for base support.
> 
> The thing that makes me nervous about retry based solution is the potential for
> some kind deadlock like pattern. Just to gather your opinion, if there was some
> SEAMCALL contention that couldn't be locked around from KVM, but came with some
> strong well described guarantees, would a retry loop be hard NAK still?

I don't know.  It would depend on what operations can hit BUSY, and what the
alternatives are.  E.g. if we can narrow down the retry paths to a few select
cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
deadlock, then maybe that's the least awful option.

What I don't think KVM should do is blindly retry N number of times, because
then there are effectively no rules whatsoever.  E.g. if KVM is tearing down a
VM then KVM should assert on immediate success.  And if KVM is handling a fault
on behalf of a vCPU, then KVM can and should resume the guest and let it retry.
Ugh, but that would likely trigger the annoying "zero-step mitigation" crap.

What does this actually mean in practice?  What's the threshold, is the VM-Enter
error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be set if KVM
runs afoul of the zero-step mitigation?

  After a pre-determined number of such EPT violations occur on the same instruction,
  the TDX module starts tracking the GPAs that caused Secure EPT faults and fails
  further host VMM attempts to enter the TD VCPU unless previously faulting private
  GPAs are properly mapped in the Secure EPT.

If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
SEPT contention on a fault, and then _if_ the zero-step mitigation is triggered,
kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked and
can't be re-locked by the guest.  That would allow KVM to guarantee forward
progress without an arbitrary retry loop in the TDP MMU.

Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure the
one and only retry is guaranteed to succeed.
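
Very roughly, for the zap path, something like this (a sketch only; the
tdh_mem_range_block() signature is assumed from the aug helper quoted earlier
in the thread, and KVM_REQ_OUTSIDE_GUEST_MODE is the existing no-action kick):

	err = tdh_mem_range_block(kvm_tdx, gpa, tdx_level, &entry, &level_state);
	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
		/*
		 * Kick all vCPUs out of the guest so that no guest-side flow
		 * can hold (or re-take) the SEPT lock, then the one and only
		 * retry must succeed.
		 */
		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
		err = tdh_mem_range_block(kvm_tdx, gpa, tdx_level, &entry, &level_state);
	}
	if (KVM_BUG_ON(err, kvm))
		return -EIO;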

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 23:58             ` Sean Christopherson
@ 2024-09-10  0:50               ` Edgecombe, Rick P
  2024-09-10  1:46                 ` Sean Christopherson
  2024-09-11  1:17               ` Huang, Kai
  1 sibling, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-10  0:50 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: pbonzini@redhat.com, Yao, Yuan, Huang, Kai,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, kvm@vger.kernel.org,
	dmatlack@google.com, nik.borisov@suse.com,
	isaku.yamahata@gmail.com

On Mon, 2024-09-09 at 16:58 -0700, Sean Christopherson wrote:
> On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > > KVM to ensure there will be no contention.  And if the TDX module's
> > > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > > single-step, e.g. so that KVM can forward the information to userspace,
> > > > then that's a TDX module problem to solve.
> > > > 
> > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > > return TDX_OPERAND_BUSY.
> > > > 
> > > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > > worse as it lacks the precision that would help software understand
> > > > exactly what went wrong, but I think one of the better decisions we made
> > > > with the SGX driver was to have a "zero tolerance" policy where the
> > > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > > that any conflict in the module would be treated as a kernel bug.
> > 
> > Thanks for the analysis. The direction seems reasonable to me for this lock
> > in
> > particular. We need to do some analysis on how much the existing mmu_lock
> > can
> > protects us. 
> 
> I would operate under the assumption that it provides SEPT no meaningful
> protection.
> I think I would even go so far as to say that it is a _requirement_ that
> mmu_lock
> does NOT provide the ordering required by SEPT, because I do not want to take
> on
> any risk (due to SEPT constraints) that would limit KVM's ability to do things
> while holding mmu_lock for read.

Ok. Not sure, but I think you are saying not to add any extra acquisitions of
mmu_lock.

> > 
> > Maybe sprinkle some asserts for documentation purposes.
> 
> Not sure I understand, assert on what?

Please ignore. For the asserts, I was imagining mmu_lock acquisitions in core
MMU code might already protect the non-zero-step TDX_OPERAND_BUSY cases, and we
could somehow explain this in code. But it seems less likely.

[snip]
> 
> I don't know.  It would depend on what operations can hit BUSY, and what the
> alternatives are.  E.g. if we can narrow down the retry paths to a few select
> cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
> deadlock, then maybe that's the least awful option.
> 
> What I don't think KVM should do is blindly retry N number of times, because
> then there are effectively no rules whatsoever.

Complete agreement.

>   E.g. if KVM is tearing down a
> VM then KVM should assert on immediate success.  And if KVM is handling a
> fault
> on behalf of a vCPU, then KVM can and should resume the guest and let it
> retry.
> Ugh, but that would likely trigger the annoying "zero-step mitigation" crap.
> 
> What does this actually mean in practice?  What's the threshold, is the VM-
> Enter
> error uniquely identifiable, and can KVM rely on HOST_PRIORITY to be set if
> KVM
> runs afoul of the zero-step mitigation?
> 
>   After a pre-determined number of such EPT violations occur on the same
> instruction,
>   the TDX module starts tracking the GPAs that caused Secure EPT faults and
> fails
>   further host VMM attempts to enter the TD VCPU unless previously faulting
> private
>   GPAs are properly mapped in the Secure EPT.
> 
> If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
> SEPT contention on a fault, and then _if_ the zero-step mitigation is
> triggered,
> kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked
> and
> can't be re-locked by the guest.  That would allow KVM to guarantee forward
> progress without an arbitrary retry loop in the TDP MMU.
> 
> Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure
> the
> one and only retry is guaranteed to succeed.

Ok so not against retry loops, just against magic number retry loops with no
explanation that can be found. Makes sense.

Until we answer some of the questions (i.e. HOST_PRIORITY exposure), it's hard
to say. We need to check some stuff on our end.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10  0:50               ` Edgecombe, Rick P
@ 2024-09-10  1:46                 ` Sean Christopherson
  0 siblings, 0 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-09-10  1:46 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, Yuan Yao, Kai Huang,
	linux-kernel@vger.kernel.org, Yan Y Zhao, kvm@vger.kernel.org,
	dmatlack@google.com, nik.borisov@suse.com,
	isaku.yamahata@gmail.com

On Tue, Sep 10, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-09-09 at 16:58 -0700, Sean Christopherson wrote:
> > On Mon, Sep 09, 2024, Rick P Edgecombe wrote:
> > > On Mon, 2024-09-09 at 14:23 -0700, Sean Christopherson wrote:
> > > > > In general, I am _very_ opposed to blindly retrying an SEPT SEAMCALL,
> > > > > ever.  For its operations, I'm pretty sure the only sane approach is for
> > > > > KVM to ensure there will be no contention.  And if the TDX module's
> > > > > single-step protection spuriously kicks in, KVM exits to userspace.  If
> > > > > the TDX module can't/doesn't/won't communicate that it's mitigating
> > > > > single-step, e.g. so that KVM can forward the information to userspace,
> > > > > then that's a TDX module problem to solve.
> > > > > 
> > > > > > Per the docs, in general the VMM is supposed to retry SEAMCALLs that
> > > > > > return TDX_OPERAND_BUSY.
> > > > > 
> > > > > IMO, that's terrible advice.  SGX has similar behavior, where the xucode
> > > > > "module" signals #GP if there's a conflict.  #GP is obviously far, far
> > > > > worse as it lacks the precision that would help software understand
> > > > > exactly what went wrong, but I think one of the better decisions we made
> > > > > with the SGX driver was to have a "zero tolerance" policy where the
> > > > > driver would _never_ retry due to a potential resource conflict, i.e.
> > > > > that any conflict in the module would be treated as a kernel bug.
> > > 
> > > Thanks for the analysis. The direction seems reasonable to me for this lock
> > > in
> > > particular. We need to do some analysis on how much the existing mmu_lock
> > > can
> > > protects us. 
> > 
> > I would operate under the assumption that it provides SEPT no meaningful
> > protection.
> > I think I would even go so far as to say that it is a _requirement_ that
> > mmu_lock
> > does NOT provide the ordering required by SEPT, because I do not want to take
> > on
> > any risk (due to SEPT constraints) that would limit KVM's ability to do things
> > while holding mmu_lock for read.
> 
> Ok. Not sure, but I think you are saying not to add any extra acquisitions of
> mmu_lock.

No new write_lock.  If read_lock is truly needed, no worries.  But SEPT needing
a write_lock is likely a hard "no", as the TDP MMU's locking model depends
heavily on vCPUs being readers.  E.g. the TDP MMU has _much_ coarser granularity
than core MM, but it works because almost everything is done while holding
mmu_lock for read.
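
For reference, the vCPU fault path is already just (heavily trimmed):

	read_lock(&vcpu->kvm->mmu_lock);
	r = kvm_tdp_mmu_map(vcpu, fault);
	read_unlock(&vcpu->kvm->mmu_lock);

and I don't want SEPT ordering requirements to force any of that back under
write_lock.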

> Until we answer some of the questions (i.e. HOST_PRIORITY exposure), it's hard
> to say. We need to check some stuff on our end.

Ya, agreed.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-09 21:03     ` Edgecombe, Rick P
@ 2024-09-10  1:52       ` Yan Zhao
  2024-09-10  9:33       ` Paolo Bonzini
  1 sibling, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-09-10  1:52 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai, nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Tue, Sep 10, 2024 at 05:03:57AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2024-09-06 at 14:10 +1200, Huang, Kai wrote:
...
> > > +        * which causes gmem invalidation to zap all spte.
> > > +        * Population is only allowed after KVM_TDX_INIT_VM.
> > > +        */
> > 
> > What does the second sentence ("Population ...")  meaning?  Why is it 
> > relevant here?
> > 
> How about:
> /*
>  * HKID is released after all private pages have been removed,
>  * and set before any might be populated. Warn if zapping is
>  * attempted when there can't be anything populated in the private
>  * EPT.
>  */
> 
> But actually, I wonder if we need to remove the KVM_BUG_ON(). I think if you did
> a KVM_PRE_FAULT_MEMORY and then deleted the memslot you could hit it?
If we disallow vCPU creation before TD initialization, as discussed in [1],
the BUG_ON should not be hit.

[1] https://lore.kernel.org/all/ZtAU7FIV2Xkw+L3O@yzhao56-desk.sh.intel.com/

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2024-09-09 16:07   ` Sean Christopherson
@ 2024-09-10  7:36     ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10  7:36 UTC (permalink / raw)
  To: Sean Christopherson, Rick Edgecombe
  Cc: kvm, kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel, Binbin Wu

On 9/9/24 18:07, Sean Christopherson wrote:
> Paolo, are you planning on queueing these for 6.12, or for a later kernel?  I ask
> because this will conflict with a bug fix[*] that I am planning on taking through
> kvm-x86/mmu.  If you anticipate merging these in 6.12, then it'd probably be best
> for you to grab that one patch directly, as I don't think it has semantic conflicts
> with anything else in that series.
> 
> [*]https://lore.kernel.org/all/20240831001538.336683-2-seanjc@google.com

No, this one is independent of TDX but the patches need not be rushed 
into 6.12.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-04  3:07 ` [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX Rick Edgecombe
@ 2024-09-10  8:16   ` Paolo Bonzini
  2024-09-10 23:49     ` Edgecombe, Rick P
  2024-10-14  6:34     ` Yan Zhao
  2024-09-11  6:25   ` Xu Yilun
  1 sibling, 2 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10  8:16 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> +	 * private EPT will be flushed on the next TD enter.
> +	 * No need to call tdx_track() here again even when this callback is as
> +	 * a result of zapping private EPT.
> +	 * Just invoke invept() directly here to work for both shared EPT and
> +	 * private EPT.
> +	 */
> +	if (is_td_vcpu(vcpu)) {
> +		ept_sync_global();
> +		return;
> +	}
> +
> +	vmx_flush_tlb_all(vcpu);
> +}
> +
> +static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_flush_tlb_current(vcpu);
> +		return;
> +	}
> +
> +	vmx_flush_tlb_current(vcpu);
> +}
> +

I'd do it slightly different:

static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
	if (is_td_vcpu(vcpu)) {
		tdx_flush_tlb_all(vcpu);
		return;
	}

	vmx_flush_tlb_all(vcpu);
}

static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	if (is_td_vcpu(vcpu)) {
		/*
		 * flush_tlb_current() is used only the first time
		 * the vCPU runs, since TDX supports neither shadow
		 * nested paging nor SMM.  Keep this function simple.
		 */
		tdx_flush_tlb_all(vcpu);
		return;
	}

	vmx_flush_tlb_current(vcpu);
}

and put the implementation details close to tdx_track:

void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
	/*
	 * TDX calls tdx_track() in tdx_sept_remove_private_spte() to
	 * ensure private EPT will be flushed on the next TD enter.
	 * No need to call tdx_track() here again, even when this
	 * callback is a result of zapping private EPT.  Just
	 * invoke invept() directly here, which works for both shared
	 * EPT and private EPT.
	 */
	ept_sync_global();
}


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-09 21:03     ` Edgecombe, Rick P
  2024-09-10  1:52       ` Yan Zhao
@ 2024-09-10  9:33       ` Paolo Bonzini
  2024-09-10 23:58         ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10  9:33 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org, seanjc@google.com,
	Huang, Kai
  Cc: nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, Zhao, Yan Y,
	linux-kernel@vger.kernel.org

On 9/9/24 23:03, Edgecombe, Rick P wrote:
> KVM code tends to be less wrapped and a tdx_vm_destory would be a
> oneline function. So I think it fits in other ways.

Yes, no problem there.  Sometimes one-line functions are ok (see 
ept_sync_global() case elsewhere in the series), sometimes they're 
overkill, especially if they wrap a function defined in the same file as 
the wrapper.

>>> +        * which causes gmem invalidation to zap all spte.
>>> +        * Population is only allowed after KVM_TDX_INIT_VM.
>>> +        */
>> What does the second sentence ("Population ...")  meaning?  Why is it
>> relevant here?
>>
> How about:
> /*
>   * HKID is released after all private pages have been removed,
>   * and set before any might be populated. Warn if zapping is
>   * attempted when there can't be anything populated in the private
>   * EPT.
>   */
> 
> But actually, I wonder if we need to remove the KVM_BUG_ON(). I think if you did
> a KVM_PRE_FAULT_MEMORY and then deleted the memslot you could hit it?

I think all paths to handle_removed_pt() are safe:

__tdp_mmu_zap_root
         tdp_mmu_zap_root
                 kvm_tdp_mmu_zap_all
                         kvm_arch_flush_shadow_all
                                 kvm_flush_shadow_all
                                         kvm_destroy_vm (*)
                                         kvm_mmu_notifier_release (*)
                 kvm_tdp_mmu_zap_invalidated_roots
                         kvm_mmu_zap_all_fast (**)
kvm_tdp_mmu_zap_sp
         kvm_recover_nx_huge_pages (***)


(*) only called at destroy time
(**) only invalidates direct roots
(***) shouldn't apply to TDX I hope?

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 18/21] KVM: x86/mmu: Export kvm_tdp_map_page()
  2024-09-04  3:07 ` [PATCH 18/21] KVM: x86/mmu: Export kvm_tdp_map_page() Rick Edgecombe
@ 2024-09-10 10:02   ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:02 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> In future changes coco specific code will need to call kvm_tdp_map_page()
> from within their respective gmem_post_populate() callbacks. Export it
> so this can be done from vendor specific code. Since kvm_mmu_reload()
> will be needed for this operation, export it as well.

You can just squash this into patch 19; if you don't, s/it/its callee 
kvm_mmu_load()/ in the last line.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2024-09-04  3:07 ` [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX Rick Edgecombe
@ 2024-09-10 10:04   ` Paolo Bonzini
  2024-09-10 14:05     ` Sean Christopherson
  0 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:04 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Although TDX supports only WB for private GPA, it's desirable to support
> MTRR for shared GPA.  Always honor guest PAT for shared EPT as what's done
> for normal VMs.
> 
> Suggested-by: Kai Huang <kai.huang@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Align with latest vmx code in kvm/queue.
>   - Updated patch log.
>   - Dropped KVM_BUG_ON() in vt_get_mt_mask(). (Rick)

The only difference at this point is

         if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
             !kvm_arch_has_noncoherent_dma(vcpu->kvm))
                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;


which should never be true.  I think this patch can simply be dropped.

Paolo

> +static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_mt_mask(vcpu, gfn, is_mmio);
> +
> +	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
> +}
> +
>   static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>   {
>   	if (!is_td(kvm))
> @@ -292,7 +300,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   
>   	.set_tss_addr = vmx_set_tss_addr,
>   	.set_identity_map_addr = vmx_set_identity_map_addr,
> -	.get_mt_mask = vmx_get_mt_mask,
> +	.get_mt_mask = vt_get_mt_mask,
>   
>   	.get_exit_info = vmx_get_exit_info,
>   
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 435112562954..50ce24905062 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -374,6 +374,14 @@ int tdx_vm_init(struct kvm *kvm)
>   	return 0;
>   }
>   
> +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	if (is_mmio)
> +		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> +
> +	return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> +}
> +
>   int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 66829413797d..d8a00ab4651c 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -128,6 +128,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>   
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>   
> @@ -153,6 +154,7 @@ static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOP
>   static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>   static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
>   
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>   


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-06 16:30       ` Edgecombe, Rick P
  2024-09-09  1:29         ` Yan Zhao
@ 2024-09-10 10:13         ` Paolo Bonzini
  2024-09-11  0:11           ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:13 UTC (permalink / raw)
  To: Edgecombe, Rick P, Zhao, Yan Y
  Cc: seanjc@google.com, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com

On 9/6/24 18:30, Edgecombe, Rick P wrote:
> /*
>   * The case to care about here is a PTE getting zapped concurrently and
>   * this function erroneously thinking a page is mapped in the mirror EPT.
>   * The private mem zapping paths are already covered by other locks held
>   * here, but grab an mmu read_lock to not trigger the assert in
>   * kvm_tdp_mmu_gpa_is_mapped().
>   */
> 
> Yan, do you think it is sufficient?

If you're actually requiring that the other locks are sufficient, then 
there can be no ENOENT.

Maybe:

	/*
	 * The private mem cannot be zapped after kvm_tdp_map_page()
	 * because all paths are covered by slots_lock and the
	 * filemap invalidate lock.  Check that they are indeed enough.
	 */
	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
		scoped_guard(read_lock, &kvm->mmu_lock) {
			if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa),
				       kvm)) {
				ret = -EIO;
				goto out;
			}
		}
	}

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-04  3:07 ` [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory Rick Edgecombe
  2024-09-04  4:53   ` Yan Zhao
  2024-09-04 13:56   ` Edgecombe, Rick P
@ 2024-09-10 10:16   ` Paolo Bonzini
  2024-09-11  0:12     ` Edgecombe, Rick P
  2 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:16 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a new ioctl for the user space VMM to initialize guest memory with the
> specified memory contents.
> 
> Because TDX protects the guest's memory, the creation of the initial guest
> memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
> directly copying the memory contents into the guest's memory in the case of
> the default VM type.
> 
> Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
> KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
> the guest page in Secure-EPT, copy the initial memory contents into the
> guest memory, and encrypt the guest memory.  Optionally, extend the memory
> measurement of the TDX guest.
> 
> Discussion history:

While useful for the reviewers, in the end this is the simplest possible 
userspace API (the one that we started with) and the objections just 
went away because it reuses the infrastructure that was introduced for 
pre-faulting memory.

So I'd replace everything with:

---
The ioctl uses the vCPU file descriptor because of the TDX module's 
requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD) 
prior to initialization (TDH.MEM.PAGE.ADD).  Accessing the MMU in turn 
requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY.  In 
fact, the post-populate callback is able to reuse the same logic used by 
KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single 
ioctl.

Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD 
is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that
point.  This ensures that there cannot be pages in the S-EPT awaiting 
TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by 
tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT, 
but the corresponding S-EPT entry will be !PRESENT).
---

Part of the second paragraph comes from your link [4], 
https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/, but updated 
for recent changes to KVM_PRE_FAULT_MEMORY.

This drops the historical information that is not particularly relevant 
for the future, updates what's relevant to mention the changes done for 
SEV-SNP, and preserves most of the other information:

* why the vCPU file descriptor

* the desirability of a single ioctl for userspace

* the relationship between KVM_TDX_INIT_MEM_REGION and KVM_PRE_FAULT_MEMORY
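
For reference, the single-ioctl flow looks roughly like this from 
userspace (a sketch only; src, gpa and vcpu_fd are assumed to be set up 
by the VMM, the struct and flag names follow the uapi as posted in this 
series):

	struct kvm_tdx_init_mem_region region = {
		.source_addr = (__u64)src,	/* page of initial contents in userspace */
		.gpa = gpa,			/* private GPA to populate */
		.nr_pages = 1,
	};
	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_INIT_MEM_REGION,
		.flags = KVM_TDX_MEASURE_MEMORY_REGION,	/* optionally extend the measurement */
		.data = (__u64)&region,
	};

	/* One vCPU-scoped ioctl maps the S-EPT and adds+measures the page. */
	ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);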

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 15/21] KVM: TDX: Implement hook to get max mapping level of private pages
  2024-09-04  3:07 ` [PATCH 15/21] KVM: TDX: Implement hook to get max mapping level of private pages Rick Edgecombe
@ 2024-09-10 10:17   ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:17 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Implement hook private_max_mapping_level for TDX to let TDP MMU core get
> max mapping level of private pages.
> 
> The value is hard coded to 4K for no huge page support for now.
> 
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Split from the big patch "KVM: TDX: TDP MMU TDX support".
>   - Fix missing tdx_gmem_private_max_mapping_level() implementation for
>     !CONFIG_INTEL_TDX_HOST
> 
> v19:
>   - Use gmem_max_level callback, delete tdp_max_page_level.
> ---
>   arch/x86/kvm/vmx/main.c    | 10 ++++++++++
>   arch/x86/kvm/vmx/tdx.c     |  5 +++++
>   arch/x86/kvm/vmx/x86_ops.h |  2 ++
>   3 files changed, 17 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index bf6fd5cca1d6..5d43b44e2467 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -184,6 +184,14 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>   	return tdx_vcpu_ioctl(vcpu, argp);
>   }
>   
> +static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> +{
> +	if (is_td(kvm))
> +		return tdx_gmem_private_max_mapping_level(kvm, pfn);
> +
> +	return 0;
> +}
> +
>   #define VMX_REQUIRED_APICV_INHIBITS				\
>   	(BIT(APICV_INHIBIT_REASON_DISABLED) |			\
>   	 BIT(APICV_INHIBIT_REASON_ABSENT) |			\
> @@ -337,6 +345,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   
>   	.mem_enc_ioctl = vt_mem_enc_ioctl,
>   	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
> +
> +	.private_max_mapping_level = vt_gmem_private_max_mapping_level
>   };
>   
>   struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b8cd5a629a80..59b627b45475 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1582,6 +1582,11 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>   	return ret;
>   }
>   
> +int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> +{
> +	return PG_LEVEL_4K;
> +}
> +
>   #define KVM_SUPPORTED_TD_ATTRS (TDX_TD_ATTR_SEPT_VE_DISABLE)
>   
>   static int __init setup_kvm_tdx_caps(void)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index d1db807b793a..66829413797d 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -142,6 +142,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   
>   void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
>   void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> +int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
>   #else
>   static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
>   static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
> @@ -185,6 +186,7 @@ static inline int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
>   
>   static inline void tdx_flush_tlb_current(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
> +static inline int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn) { return 0; }
>   #endif
>   
>   #endif /* __KVM_X86_VMX_X86_OPS_H */

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-04  3:07 ` [PATCH 16/21] KVM: TDX: Premap initial guest memory Rick Edgecombe
@ 2024-09-10 10:24   ` Paolo Bonzini
  2024-09-11  0:19     ` Edgecombe, Rick P
  2024-09-10 10:49   ` Paolo Bonzini
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:24 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> +					  enum pg_level level, kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +	/* Returning error here to let TDP MMU bail out early. */
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) {
> +		tdx_unpin(kvm, pfn);
> +		return -EINVAL;
> +	}

Should this "if" already be part of patch 14, and in 
tdx_sept_set_private_spte() rather than tdx_mem_page_record_premap_cnt()?
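
Something along these lines, i.e. reject the unsupported level for both 
the pre- and post-finalization paths (a sketch only; the page pinning 
that the real function does before these checks is elided here):

	int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
				      enum pg_level level, kvm_pfn_t pfn)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

		/* Only 4K mappings are supported for now; bail out early. */
		if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
			return -EINVAL;

		if (likely(is_td_finalized(kvm_tdx)))
			return tdx_mem_page_aug(kvm, gfn, level, pfn);

		return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
	}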

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-04  3:07 ` [PATCH 20/21] KVM: TDX: Finalize VM initialization Rick Edgecombe
  2024-09-04 15:37   ` Adrian Hunter
@ 2024-09-10 10:25   ` Paolo Bonzini
  2024-09-10 11:54     ` Adrian Hunter
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:25 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel, Adrian Hunter

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
> KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
> 
> Documentation for the API is added in another patch:
> "Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
> 
> For the purpose of attestation, a measurement must be made of the TDX VM
> initial state. This is referred to as TD Measurement Finalization, and
> uses SEAMCALL TDH.MR.FINALIZE, after which:
> 1. The VMM adding TD private pages with arbitrary content is no longer
>     allowed
> 2. The TDX VM is runnable
> 
> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Added premapped check.
>   - Update for the wrapper functions for SEAMCALLs. (Sean)
>   - Add check if nr_premapped is zero.  If not, return error.
>   - Use KVM_BUG_ON() in tdx_td_finalizer() for consistency.
>   - Change tdx_td_finalizemr() to take struct kvm_tdx_cmd *cmd and return error
>     (Adrian)
>   - Handle TDX_OPERAND_BUSY case (Adrian)
>   - Updates from seamcall overhaul (Kai)
>   - Rename error->hw_error
> 
> v18:
>   - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
> 
> v15:
>   - removed unconditional tdx_track() by tdx_flush_tlb_current() that
>     does tdx_track().
> ---
>   arch/x86/include/uapi/asm/kvm.h |  1 +
>   arch/x86/kvm/vmx/tdx.c          | 28 ++++++++++++++++++++++++++++
>   2 files changed, 29 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 789d1d821b4f..0b4827e39458 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -932,6 +932,7 @@ enum kvm_tdx_cmd_id {
>   	KVM_TDX_INIT_VM,
>   	KVM_TDX_INIT_VCPU,
>   	KVM_TDX_INIT_MEM_REGION,
> +	KVM_TDX_FINALIZE_VM,
>   	KVM_TDX_GET_CPUID,
>   
>   	KVM_TDX_CMD_NR_MAX,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 796d1a495a66..3083a66bb895 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1257,6 +1257,31 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
>   	ept_sync_global();
>   }
>   
> +static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> +		return -EINVAL;
> +	/*
> +	 * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
> +	 * TDH.MEM.PAGE.ADD().
> +	 */
> +	if (atomic64_read(&kvm_tdx->nr_premapped))
> +		return -EINVAL;

I suggest moving all of patch 16, plus the

+	WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
+	atomic64_dec(&kvm_tdx->nr_premapped);

lines of patch 19, into this patch.

> +	cmd->hw_error = tdh_mr_finalize(kvm_tdx);
> +	if ((cmd->hw_error & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY)
> +		return -EAGAIN;
> +	if (KVM_BUG_ON(cmd->hw_error, kvm)) {
> +		pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
> +		return -EIO;
> +	}
> +
> +	kvm_tdx->finalized = true;
> +	return 0;

This should also set pre_fault_allowed to true.
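Something like this at the tail of the function (a sketch; whatever lock 
ends up protecting kvm_tdx->finalized should also cover this, per the 
ordering concern below):

	kvm_tdx->finalized = true;
	/*
	 * Allow KVM_PRE_FAULT_MEMORY only once the TD is finalized, and
	 * make sure other CPUs never observe pre_fault_allowed == true
	 * while finalized is still false.
	 */
	kvm->arch.pre_fault_allowed = true;
	return 0;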

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-04 15:37   ` Adrian Hunter
  2024-09-04 16:09     ` Edgecombe, Rick P
@ 2024-09-10 10:33     ` Paolo Bonzini
  2024-09-10 11:15       ` Adrian Hunter
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:33 UTC (permalink / raw)
  To: Adrian Hunter, Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 17:37, Adrian Hunter wrote:
> Isaku was going to lock the mmu.  Seems like the change got lost.
> To protect against racing with KVM_PRE_FAULT_MEMORY,
> KVM_TDX_INIT_MEM_REGION, tdx_sept_set_private_spte() etc
> e.g. Rename tdx_td_finalizemr to __tdx_td_finalizemr and add:
> 
> static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> {
> 	int ret;
> 
> 	write_lock(&kvm->mmu_lock);
> 	ret = __tdx_td_finalizemr(kvm, cmd);
> 	write_unlock(&kvm->mmu_lock);
> 
> 	return ret;
> }

kvm->slots_lock is better.  In tdx_vcpu_init_mem_region() you can take 
it before the is_td_finalized() so that there is a lock that is clearly 
protecting kvm_tdx->finalized between the two.  (I also suggest 
switching to guard() in tdx_vcpu_init_mem_region()).
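
For example, the locking side could look roughly like this (a sketch; 
only the lock taking and the finalized check are shown, the rest of the 
population loop stays as in patch 19):

	static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu,
					    struct kvm_tdx_cmd *cmd)
	{
		struct kvm *kvm = vcpu->kvm;
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

		/* Serialize against KVM_TDX_FINALIZE_VM and other callers. */
		guard(mutex)(&kvm->slots_lock);

		if (is_td_finalized(kvm_tdx))
			return -EINVAL;

		/* ... populate the region as before ... */
		return 0;
	}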

Also, I think that in patch 16 (whether merged or not) nr_premapped 
should not be incremented once kvm_tdx->finalized has been set?

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 21/21] KVM: TDX: Handle vCPU dissociation
  2024-09-04  3:07 ` [PATCH 21/21] KVM: TDX: Handle vCPU dissociation Rick Edgecombe
  2024-09-09 15:41   ` Paolo Bonzini
@ 2024-09-10 10:45   ` Paolo Bonzini
  2024-09-11  0:17     ` Edgecombe, Rick P
  2024-11-04  9:45     ` Yan Zhao
  1 sibling, 2 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:45 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> +/*
> + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUS.

... or when a vCPU is migrated.

> + * Protected by interrupt mask.  This list is manipulated in process context
> + * of vCPU and IPI callback.  See tdx_flush_vp_on_cpu().
> + */
> +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);

It may be a bit more modern, or cleaner, to use a local_lock here 
instead of just relying on local_irq_disable/enable.
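
A sketch of the local_lock variant (the struct and function names here 
are illustrative, not from the series; it needs <linux/local_lock.h>, 
assumes the vcpu_tdx::cpu_list member from this series, and the per-CPU 
list heads still need INIT_LIST_HEAD() at init time as in the current 
code):

	struct tdvcpu_list {
		local_lock_t lock;
		struct list_head list;
	};

	static DEFINE_PER_CPU(struct tdvcpu_list, associated_tdvcpus) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void tdx_add_to_associated_list(struct vcpu_tdx *tdx)
	{
		/* Same protection as today (IRQs off), but now documented by the lock. */
		local_lock_irq(&associated_tdvcpus.lock);
		list_add(&tdx->cpu_list, this_cpu_ptr(&associated_tdvcpus.list));
		local_unlock_irq(&associated_tdvcpus.lock);
	}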

Another, more organizational, question is whether to put this in the 
VM/vCPU series, but I might be missing something obvious.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 06/21] KVM: TDX: Add accessors VMX VMCS helpers
  2024-09-09 21:29     ` Edgecombe, Rick P
@ 2024-09-10 10:48       ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:48 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On 9/9/24 23:29, Edgecombe, Rick P wrote:
>> Maybe a bit large when inlined?  Maybe
>>
>>          if (unlikely(err))
>>                  tdh_vp_wr_failed(tdx, field, bit, err);
>>
>> and add tdh_vp_wr_failed to tdx.c.
> There is a tiny bit of difference between the messages:
> pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n", ...
> pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n", ...
> pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n", ...
> 
> We can parameterize that part of the message, but it gets a bit tortured. Or
> just lose that bit of detail. We can take a look. Thanks.

Yes, you can:

1) have three different functions for the failure

2) leave out the value part

3) pass the mask as well to tdh_vp_wr_failed() and use it to deduce the 
=/|=/&= part, like

	if (!~mask)
		op = "=";
	else if (!value)
		op = "&= ~", value = mask;
	else if (value == mask)
		op = "|=";
	else
		op = "??, value = ";

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-04  3:07 ` [PATCH 16/21] KVM: TDX: Premap initial guest memory Rick Edgecombe
  2024-09-10 10:24   ` Paolo Bonzini
@ 2024-09-10 10:49   ` Paolo Bonzini
  2024-09-11  0:30     ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 10:49 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Update TDX's hook of set_external_spte() to record pre-mapping cnt instead
> of doing nothing and returning when TD is not finalized.
> 
> TDX uses ioctl KVM_TDX_INIT_MEM_REGION to initialize its initial guest
> memory. This ioctl calls kvm_gmem_populate() to get guest pages and in
> tdx_gmem_post_populate(), it will
> (1) Map page table pages into KVM mirror page table and private EPT.
> (2) Map guest pages into KVM mirror page table. In the propagation hook,
>      just record pre-mapping cnt without mapping the guest page into private
>      EPT.
> (3) Map guest pages into private EPT and decrease pre-mapping cnt.
> 
> Do not map guest pages into private EPT directly in step (2), because TDX
> requires TDH.MEM.PAGE.ADD() to add a guest page before TD is finalized,
> which copies page content from a source page from user to target guest page
> to be added. However, source page is not available via common interface
> kvm_tdp_map_page() in step (2).
> 
> Therefore, just pre-map the guest page into KVM mirror page table and
> record the pre-mapping cnt in TDX's propagation hook. The pre-mapping cnt
> would be decreased in ioctl KVM_TDX_INIT_MEM_REGION when the guest page is
> mapped into private EPT.

Stale commit message; squashing all of it into patch 20 is an easy cop 
out...

Paolo

> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Update the code comment and patch log according to latest gmem update.
>     https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
>   - Rename tdx_mem_page_add() to tdx_mem_page_record_premap_cnt() to avoid
>     confusion.
>   - Change the patch title to "KVM: TDX: Premap initial guest memory".
>   - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
>   - Drop issueing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
>     KVM_TDX_INIT_MEM_REGION. (Sean)
>   - Added nr_premapped to track the number of premapped pages
>   - Drop tdx_post_mmu_map_page().
> 
> v19:
>   - Switched to use KVM_MEMORY_MAPPING
>   - Dropped measurement extension
>   - updated commit message. private_page_add() => set_private_spte()
> ---
>   arch/x86/kvm/vmx/tdx.c | 40 +++++++++++++++++++++++++++++++++-------
>   arch/x86/kvm/vmx/tdx.h |  2 +-
>   2 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 59b627b45475..435112562954 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -488,6 +488,34 @@ static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
>   	return 0;
>   }
>   
> +/*
> + * KVM_TDX_INIT_MEM_REGION calls kvm_gmem_populate() to get guest pages and
> + * tdx_gmem_post_populate() to premap page table pages into private EPT.
> + * Mapping guest pages into private EPT before TD is finalized should use a
> + * seamcall TDH.MEM.PAGE.ADD(), which copies page content from a source page
> + * from user to target guest pages to be added. This source page is not
> + * available via common interface kvm_tdp_map_page(). So, currently,
> + * kvm_tdp_map_page() only premaps guest pages into KVM mirrored root.
> + * A counter nr_premapped is increased here to record status. The counter will
> + * be decreased after TDH.MEM.PAGE.ADD() is called after the kvm_tdp_map_page()
> + * in tdx_gmem_post_populate().
> + */
> +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> +					  enum pg_level level, kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +	/* Returning error here to let TDP MMU bail out early. */
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) {
> +		tdx_unpin(kvm, pfn);
> +		return -EINVAL;
> +	}
> +
> +	/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
> +	atomic64_inc(&kvm_tdx->nr_premapped);
> +	return 0;
> +}
> +
>   int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   			      enum pg_level level, kvm_pfn_t pfn)
>   {
> @@ -510,11 +538,7 @@ int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (likely(is_td_finalized(kvm_tdx)))
>   		return tdx_mem_page_aug(kvm, gfn, level, pfn);
>   
> -	/*
> -	 * TODO: KVM_MAP_MEMORY support to populate before finalize comes
> -	 * here for the initial memory.
> -	 */
> -	return 0;
> +	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
>   }
>   
>   static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
> @@ -546,10 +570,12 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
>   	if (unlikely(!is_td_finalized(kvm_tdx) &&
>   		     err == (TDX_EPT_WALK_FAILED | TDX_OPERAND_ID_RCX))) {
>   		/*
> -		 * This page was mapped with KVM_MAP_MEMORY, but
> -		 * KVM_TDX_INIT_MEM_REGION is not issued yet.
> +		 * Page is mapped by KVM_TDX_INIT_MEM_REGION, but hasn't called
> +		 * tdh_mem_page_add().
>   		 */
>   		if (!is_last_spte(entry, level) || !(entry & VMX_EPT_RWX_MASK)) {
> +			WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
> +			atomic64_dec(&kvm_tdx->nr_premapped);
>   			tdx_unpin(kvm, pfn);
>   			return 0;
>   		}
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 66540c57ed61..25a4aaede2ba 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -26,7 +26,7 @@ struct kvm_tdx {
>   
>   	u64 tsc_offset;
>   
> -	/* For KVM_MAP_MEMORY and KVM_TDX_INIT_MEM_REGION. */
> +	/* For KVM_TDX_INIT_MEM_REGION. */
>   	atomic64_t nr_premapped;
>   
>   	struct kvm_cpuid2 *cpuid;


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-10 10:33     ` Paolo Bonzini
@ 2024-09-10 11:15       ` Adrian Hunter
  2024-09-10 11:28         ` Paolo Bonzini
  2024-09-10 11:31         ` Adrian Hunter
  0 siblings, 2 replies; 139+ messages in thread
From: Adrian Hunter @ 2024-09-10 11:15 UTC (permalink / raw)
  To: Paolo Bonzini, Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 10/09/24 13:33, Paolo Bonzini wrote:
> On 9/4/24 17:37, Adrian Hunter wrote:
>> Isaku was going to lock the mmu.  Seems like the change got lost.
>> To protect against racing with KVM_PRE_FAULT_MEMORY,
>> KVM_TDX_INIT_MEM_REGION, tdx_sept_set_private_spte() etc
>> e.g. Rename tdx_td_finalizemr to __tdx_td_finalizemr and add:
>>
>> static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>> {
>>     int ret;
>>
>>     write_lock(&kvm->mmu_lock);
>>     ret = __tdx_td_finalizemr(kvm, cmd);
>>     write_unlock(&kvm->mmu_lock);
>>
>>     return ret;
>> }
> 
> kvm->slots_lock is better.  In tdx_vcpu_init_mem_region() you can take it before the is_td_finalized() so that there is a lock that is clearly protecting kvm_tdx->finalized between the two.  (I also suggest switching to guard() in tdx_vcpu_init_mem_region()).

Doesn't KVM_PRE_FAULT_MEMORY also need to be protected?

> 
> Also, I think that in patch 16 (whether merged or not) nr_premapped should not be incremented, once kvm_tdx->finalized has been set?

tdx_sept_set_private_spte() checks is_td_finalized() to decide
whether to call tdx_mem_page_aug() or tdx_mem_page_record_premap_cnt()
Refer patch 14 "KVM: TDX: Implement hooks to propagate changes
of TDP MMU mirror page table" for the addition of
tdx_sept_set_private_spte()



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-10 11:15       ` Adrian Hunter
@ 2024-09-10 11:28         ` Paolo Bonzini
  2024-09-10 11:31         ` Adrian Hunter
  1 sibling, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 11:28 UTC (permalink / raw)
  To: Adrian Hunter, Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 9/10/24 13:15, Adrian Hunter wrote:
>> kvm->slots_lock is better.  In tdx_vcpu_init_mem_region() you can
>> take it before the is_td_finalized() so that there is a lock that
>> is clearly protecting kvm_tdx->finalized between the two.  (I also
>> suggest switching to guard() in tdx_vcpu_init_mem_region()).
>
> Doesn't KVM_PRE_FAULT_MEMORY also need to be protected?

KVM_PRE_FAULT_MEMORY is forbidden until kvm->arch.pre_fault_allowed is set.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-10 11:15       ` Adrian Hunter
  2024-09-10 11:28         ` Paolo Bonzini
@ 2024-09-10 11:31         ` Adrian Hunter
  1 sibling, 0 replies; 139+ messages in thread
From: Adrian Hunter @ 2024-09-10 11:31 UTC (permalink / raw)
  To: Paolo Bonzini, Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 10/09/24 14:15, Adrian Hunter wrote:
> On 10/09/24 13:33, Paolo Bonzini wrote:
>> On 9/4/24 17:37, Adrian Hunter wrote:
>>> Isaku was going to lock the mmu.  Seems like the change got lost.
>>> To protect against racing with KVM_PRE_FAULT_MEMORY,
>>> KVM_TDX_INIT_MEM_REGION, tdx_sept_set_private_spte() etc
>>> e.g. Rename tdx_td_finalizemr to __tdx_td_finalizemr and add:
>>>
>>> static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>>> {
>>>     int ret;
>>>
>>>     write_lock(&kvm->mmu_lock);
>>>     ret = __tdx_td_finalizemr(kvm, cmd);
>>>     write_unlock(&kvm->mmu_lock);
>>>
>>>     return ret;
>>> }
>>
>> kvm->slots_lock is better.  In tdx_vcpu_init_mem_region() you can take it before the is_td_finalized() so that there is a lock that is clearly protecting kvm_tdx->finalized between the two.  (I also suggest switching to guard() in tdx_vcpu_init_mem_region()).
> 
> Doesn't KVM_PRE_FAULT_MEMORY also need to be protected?

Ah, but not if pre_fault_allowed is false.

> 
>>
>> Also, I think that in patch 16 (whether merged or not) nr_premapped should not be incremented, once kvm_tdx->finalized has been set?
> 
> tdx_sept_set_private_spte() checks is_td_finalized() to decide
> whether to call tdx_mem_page_aug() or tdx_mem_page_record_premap_cnt()
> Refer patch 14 "KVM: TDX: Implement hooks to propagate changes
> of TDP MMU mirror page table" for the addition of
> tdx_sept_set_private_spte()
> 
> 


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 20/21] KVM: TDX: Finalize VM initialization
  2024-09-10 10:25   ` Paolo Bonzini
@ 2024-09-10 11:54     ` Adrian Hunter
  0 siblings, 0 replies; 139+ messages in thread
From: Adrian Hunter @ 2024-09-10 11:54 UTC (permalink / raw)
  To: Paolo Bonzini, Rick Edgecombe, seanjc, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel

On 10/09/24 13:25, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
>> KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
>>
>> Documentation for the API is added in another patch:
>> "Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
>>
>> For the purpose of attestation, a measurement must be made of the TDX VM
>> initial state. This is referred to as TD Measurement Finalization, and
>> uses SEAMCALL TDH.MR.FINALIZE, after which:
>> 1. The VMM adding TD private pages with arbitrary content is no longer
>>     allowed
>> 2. The TDX VM is runnable
>>
>> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>> ---
>> TDX MMU part 2 v1:
>>   - Added premapped check.
>>   - Update for the wrapper functions for SEAMCALLs. (Sean)
>>   - Add check if nr_premapped is zero.  If not, return error.
>>   - Use KVM_BUG_ON() in tdx_td_finalizer() for consistency.
>>   - Change tdx_td_finalizemr() to take struct kvm_tdx_cmd *cmd and return error
>>     (Adrian)
>>   - Handle TDX_OPERAND_BUSY case (Adrian)
>>   - Updates from seamcall overhaul (Kai)
>>   - Rename error->hw_error
>>
>> v18:
>>   - Remove the change of tools/arch/x86/include/uapi/asm/kvm.h.
>>
>> v15:
>>   - removed unconditional tdx_track() by tdx_flush_tlb_current() that
>>     does tdx_track().
>> ---
>>   arch/x86/include/uapi/asm/kvm.h |  1 +
>>   arch/x86/kvm/vmx/tdx.c          | 28 ++++++++++++++++++++++++++++
>>   2 files changed, 29 insertions(+)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 789d1d821b4f..0b4827e39458 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -932,6 +932,7 @@ enum kvm_tdx_cmd_id {
>>       KVM_TDX_INIT_VM,
>>       KVM_TDX_INIT_VCPU,
>>       KVM_TDX_INIT_MEM_REGION,
>> +    KVM_TDX_FINALIZE_VM,
>>       KVM_TDX_GET_CPUID,
>>         KVM_TDX_CMD_NR_MAX,
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 796d1a495a66..3083a66bb895 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -1257,6 +1257,31 @@ void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
>>       ept_sync_global();
>>   }
>>   +static int tdx_td_finalizemr(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>> +{
>> +    struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>> +
>> +    if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
>> +        return -EINVAL;
>> +    /*
>> +     * Pages are pending for KVM_TDX_INIT_MEM_REGION to issue
>> +     * TDH.MEM.PAGE.ADD().
>> +     */
>> +    if (atomic64_read(&kvm_tdx->nr_premapped))
>> +        return -EINVAL;
> 
> I suggest moving all of patch 16, plus the
> 
> +    WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
> +    atomic64_dec(&kvm_tdx->nr_premapped);
> 
> lines of patch 19, into this patch.
> 
>> +    cmd->hw_error = tdh_mr_finalize(kvm_tdx);
>> +    if ((cmd->hw_error & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY)
>> +        return -EAGAIN;
>> +    if (KVM_BUG_ON(cmd->hw_error, kvm)) {
>> +        pr_tdx_error(TDH_MR_FINALIZE, cmd->hw_error);
>> +        return -EIO;
>> +    }
>> +
>> +    kvm_tdx->finalized = true;
>> +    return 0;
> 
> This should also set pre_fault_allowed to true.

Ideally, we need to ensure it is not possible for another CPU
to see kvm_tdx->finalized == false and pre_fault_allowed == true.

Perhaps also, to document the dependency, return an error if
pre_fault_allowed is true in tdx_mem_page_record_premap_cnt().
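
A sketch of that extra check, on top of the function as posted in patch
16 (the pre_fault_allowed test is the new, hypothetical part):

	static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
						  enum pg_level level, kvm_pfn_t pfn)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

		/* Returning error here to let TDP MMU bail out early. */
		if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) {
			tdx_unpin(kvm, pfn);
			return -EINVAL;
		}

		/* Premapping is only legal before pre-faulting is allowed. */
		if (kvm->arch.pre_fault_allowed) {
			tdx_unpin(kvm, pfn);
			return -EINVAL;
		}

		/* nr_premapped will be decreased when tdh_mem_page_add() is called. */
		atomic64_inc(&kvm_tdx->nr_premapped);
		return 0;
	}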


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 21:11       ` Sean Christopherson
  2024-09-09 21:23         ` Sean Christopherson
@ 2024-09-10 13:15         ` Paolo Bonzini
  2024-09-10 13:57           ` Sean Christopherson
  1 sibling, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 13:15 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: kvm@vger.kernel.org, Yan Y Zhao, Yuan Yao, nik.borisov@suse.com,
	dmatlack@google.com, Kai Huang, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On 9/9/24 23:11, Sean Christopherson wrote:
> In general, I am_very_  opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> will be no contention.  And if the TDX module's single-step protection spuriously
> kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> that it's mitigating single-step, e.g. so that KVM can forward the information
> to userspace, then that's a TDX module problem to solve.

In principle I agree but we also need to be pragmatic.  Exiting to 
userspace may not be practical in all flows, for example.

First of all, we can add a spinlock around affected seamcalls.  This way 
we know that "busy" errors must come from the guest and have set 
HOST_PRIORITY.  It is still kinda bad that guests can force the VMM to 
loop, but the VMM can always say enough is enough.  In other words, 
let's assume that a limit of 16 is probably appropriate but we can also 
increase the limit and crash the VM if things become ridiculous.

Something like this:

	static u32 max = 16;
	int retry = 0;

	spin_lock(&kvm->arch.seamcall_lock);
	args_in = *in;
	for (;;) {
		ret = seamcall_ret(op, in);
		if ((ret & TDX_SEAMCALL_STATUS_MASK) != TDX_OPERAND_BUSY)
			break;
		/* seamcall_ret() clobbers the args, restore them before retrying. */
		*in = args_in;
		if (++retry == 1) {
			/* protected by the same seamcall_lock */
			kvm->stat.retried_seamcalls++;
		} else if (retry == READ_ONCE(max)) {
			pr_warn("Exceeded %d retries for S-EPT operation\n", max);
			if (KVM_BUG_ON(retry == 1024, kvm)) {
				pr_err("Crashing due to lock contention in the TDX module\n");
				break;
			}
			cmpxchg(&max, retry, retry * 2);
		}
	}
	spin_unlock(&kvm->arch.seamcall_lock);

This way we can do some testing and figure out a useful limit.

For zero step detection, my reading is that it's TDH.VP.ENTER that 
fails; not any of the MEM seamcalls.  For that one to be resolved, it 
should be enough to take and release the mmu_lock back to back, which 
ensures that all pending critical sections have completed (that is, 
"write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then 
loop.  Adding a vCPU stat for that one is a good idea, too.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10 13:15         ` Paolo Bonzini
@ 2024-09-10 13:57           ` Sean Christopherson
  2024-09-10 15:16             ` Paolo Bonzini
  0 siblings, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-09-10 13:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, Yan Y Zhao, Yuan Yao,
	nik.borisov@suse.com, dmatlack@google.com, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On 9/9/24 23:11, Sean Christopherson wrote:
> > In general, I am_very_  opposed to blindly retrying an SEPT SEAMCALL, ever.  For
> > its operations, I'm pretty sure the only sane approach is for KVM to ensure there
> > will be no contention.  And if the TDX module's single-step protection spuriously
> > kicks in, KVM exits to userspace.  If the TDX module can't/doesn't/won't communicate
> > that it's mitigating single-step, e.g. so that KVM can forward the information
> > to userspace, then that's a TDX module problem to solve.
> 
> In principle I agree but we also need to be pragmatic.  Exiting to userspace
> may not be practical in all flows, for example.
> 
> First of all, we can add a spinlock around affected seamcalls.

No, because that defeats the purpose of having mmu_lock be a rwlock.

> This way we know that "busy" errors must come from the guest and have set
> HOST_PRIORITY.
 
We should be able to achieve that without a VM-wide spinlock.  My thought (from
v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
it set until the SEAMCALL completes.

> It is still kinda bad that guests can force the VMM to loop, but the VMM can
> always say enough is enough.  In other words, let's assume that a limit of
> 16 is probably appropriate but we can also increase the limit and crash the
> VM if things become ridiculous.
> 
> Something like this:
> 
> 	static u32 max = 16;
> 	int retry = 0;
> 	spin_lock(&kvm->arch.seamcall_lock);
> 	for (;;) {
> 		args_in = *in;
> 		ret = seamcall_ret(op, in);
> 		if (++retry == 1) {
> 			/* protected by the same seamcall_lock */
> 			kvm->stat.retried_seamcalls++;
> 		} else if (retry == READ_ONCE(max)) {
> 			pr_warn("Exceeded %d retries for S-EPT operation\n", max);
> 			if (KVM_BUG_ON(kvm, retry == 1024)) {
> 				pr_err("Crashing due to lock contention in the TDX module\n");
> 				break;
> 			}
> 			cmpxchg(&max, retry, retry * 2);
> 		}
> 	}
> 	spin_unlock(&kvm->arch.seamcall_lock);
> 
> This way we can do some testing and figure out a useful limit.

2 :-)

One try that guarantees no other host task is accessing the S-EPT entry, and a
second try after blasting IPI to kick vCPUs to ensure no guest-side task has
locked the S-EPT entry.

My concern with an arbitrary retry loop is that we'll essentially propagate the
TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
etc...

> For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> not any of the MEM seamcalls.  For that one to be resolved, it should be
> enough to do take and release the mmu_lock back to back, which ensures that
> all pending critical sections have completed (that is,
> "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> loop.  Adding a vCPU stat for that one is a good idea, too.

As above and in my discussion with Rick, I would prefer to kick vCPUs to force
forward progress, especially for the zero-step case.  If KVM gets to the point
where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
kicks in, then it's time to kick and wait, not keep retrying blindly.

There is still risk of a hang, e.g. if a CPU fails to respond to the IPI, but
that's a possibility that always exists.  Kicking vCPUs allows KVM to know with
100% certainty that a SEAMCALL should succeed.

Hrm, the wrinkle is that if we want to guarantee success, the vCPU kick would
need to happen when the SPTE is frozen, to ensure some other host task doesn't
"steal" the lock.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2024-09-10 10:04   ` Paolo Bonzini
@ 2024-09-10 14:05     ` Sean Christopherson
  0 siblings, 0 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-09-10 14:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rick Edgecombe, kvm, kai.huang, dmatlack, isaku.yamahata,
	yan.y.zhao, nik.borisov, linux-kernel

On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Although TDX supports only WB for private GPA, it's desirable to support
> > MTRR for shared GPA.  Always honor guest PAT for shared EPT as what's done
> > for normal VMs.
> > 
> > Suggested-by: Kai Huang <kai.huang@intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > TDX MMU part 2 v1:
> >   - Align with latest vmx code in kvm/queue.
> >   - Updated patch log.
> >   - Dropped KVM_BUG_ON() in vt_get_mt_mask(). (Rick)
> 
> The only difference at this point is
> 
>         if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
>             !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
> VMX_EPT_IPAT_BIT;
> 
> 
> which should never be true.  I think this patch can simply be dropped.

And we can/should do what we've done for SEV, and make it a hard dependency to
enable TDX, e.g. similar to this:

	/*
	 * SEV must obviously be supported in hardware.  Sanity check that the
	 * CPU supports decode assists, which is mandatory for SEV guests to
	 * support instruction emulation.  Ditto for flushing by ASID, as SEV
	 * guests are bound to a single ASID, i.e. KVM can't rotate to a new
	 * ASID to effect a TLB flush.
	 */
	if (!boot_cpu_has(X86_FEATURE_SEV) ||
	    WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_DECODEASSISTS)) ||
	    WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_FLUSHBYASID)))
		goto out;
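
The TDX-side equivalent could be as simple as the following (a sketch; 
where exactly it lands in the TDX init path is not decided here):

	/*
	 * TDX forces WB for private memory and KVM honors guest PAT for
	 * shared memory, which is only safe on self-snooping CPUs.
	 */
	if (WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_SELFSNOOP)))
		return -EOPNOTSUPP;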

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10 13:57           ` Sean Christopherson
@ 2024-09-10 15:16             ` Paolo Bonzini
  2024-09-10 15:57               ` Sean Christopherson
  0 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-10 15:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, Yan Y Zhao, Yuan Yao,
	nik.borisov@suse.com, dmatlack@google.com, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote:
> On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> No, because that defeates the purpose of having mmu_lock be a rwlock.

But if this part of the TDX module is wrapped in a single big
try_lock, there's no difference in spinning around busy seamcalls, or
doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention
in the same way.  With respect to FROZEN_SPTE...

> > This way we know that "busy" errors must come from the guest and have set
> > HOST_PRIORITY.
>
> We should be able to achieve that without a VM-wide spinlock.  My thought (from
> v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> it set until the SEAMCALL completes.

Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
which documents that the TDX module returns TDX_OPERAND_BUSY on a
CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
to prevent contention in the TDX module.

If we want to be a bit more optimistic, let's do something more
sophisticated, like only take the lock after the first busy reply. But
the spinlock is the easiest way to completely remove host-induced
TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.

> > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > always say enough is enough.  In other words, let's assume that a limit of
> > 16 is probably appropriate but we can also increase the limit and crash the
> > VM if things become ridiculous.
>
> 2 :-)
>
> One try that guarantees no other host task is accessing the S-EPT entry, and a
> second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> locked the S-EPT entry.

Fair enough. Though in principle it is possible to race and have the
vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
So I would make it 5 or so just to be safe.

> My concern with an arbitrary retry loop is that we'll essentially propagate the
> TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
> retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> etc...

How slow are the failed ones? The number of retries is essentially the
cost of successful seamcall / cost of busy seamcall.

If HOST_PRIORITY works, even a not-small-but-not-huge number of
retries would be better than the IPIs. IPIs are not cheap either.

> > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > not any of the MEM seamcalls.  For that one to be resolved, it should be
> > enough to do take and release the mmu_lock back to back, which ensures that
> > all pending critical sections have completed (that is,
> > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > loop.  Adding a vCPU stat for that one is a good idea, too.
>
> As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> forward progress, especially for the zero-step case.  If KVM gets to the point
> where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> kicks in, then it's time to kick and wait, not keep retrying blindly.

Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
TDH.VP.ENTER is delayed. If it is delayed to the point of failing, we
can do write_lock/write_unlock() in the vCPU entry path.

My issue is that, even if we could make it a bit better by looking at
the TDX module source code, we don't have enough information to make a
good choice.  For now we should start with something _easy_, even if
it may not be the greatest.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10 15:16             ` Paolo Bonzini
@ 2024-09-10 15:57               ` Sean Christopherson
  2024-09-10 16:28                 ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-09-10 15:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rick P Edgecombe, kvm@vger.kernel.org, Yan Y Zhao, Yuan Yao,
	nik.borisov@suse.com, dmatlack@google.com, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> On Tue, Sep 10, 2024 at 3:58 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Tue, Sep 10, 2024, Paolo Bonzini wrote:
> > No, because that defeates the purpose of having mmu_lock be a rwlock.
> 
> But if this part of the TDX module is wrapped in a single big
> try_lock, there's no difference in spinning around busy seamcalls, or
> doing spin_lock(&kvm->arch.seamcall_lock). All of them hit contention
> in the same way.  With respect to FROZEN_SPTE...
>
> > > This way we know that "busy" errors must come from the guest and have set
> > > HOST_PRIORITY.
> >
> > We should be able to achieve that without a VM-wide spinlock.  My thought (from
> > v11?) was to effectively use the FROZEN_SPTE bit as a per-SPTE spinlock, i.e. keep
> > it set until the SEAMCALL completes.
> 
> Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> which documents that the TDX module returns TDX_OPERAND_BUSY on a
> CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> to prevent contention in the TDX module.

Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read mode.

So for the operations that KVM can do in parallel, the locking should effectively
be per-entry.  Because KVM will never throw away an entire S-EPT root, zapping
SPTEs will need to be done while holding mmu_lock for write, i.e. KVM shouldn't
have problems with host tasks competing for the TDX module's VM-wide lock.

> If we want to be a bit more optimistic, let's do something more
> sophisticated, like only take the lock after the first busy reply. But
> the spinlock is the easiest way to completely remove host-induced
> TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.

I am not convinced that's necessary or a good idea.  I worry that doing so would
just kick the can down the road, and potentially make the problems harder to solve,
e.g. because we'd have to worry about regressing existing setups.

> > > It is still kinda bad that guests can force the VMM to loop, but the VMM can
> > > always say enough is enough.  In other words, let's assume that a limit of
> > > 16 is probably appropriate but we can also increase the limit and crash the
> > > VM if things become ridiculous.
> >
> > 2 :-)
> >
> > One try that guarantees no other host task is accessing the S-EPT entry, and a
> > second try after blasting IPI to kick vCPUs to ensure no guest-side task has
> > locked the S-EPT entry.
> 
> Fair enough. Though in principle it is possible to race and have the
> vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.

My limit of '2' is predicated on the lock being a "host priority" lock, i.e. that
kicking vCPUs would ensure the lock has been dropped and can't be re-acquired by
the guest.

> So I would make it 5 or so just to be safe.
> 
> > My concern with an arbitrary retry loop is that we'll essentially propagate the
> > TDX module issues to the broader kernel.  Each of those SEAMCALLs is slooow, so
> > retrying even ~20 times could exceed the system's tolerances for scheduling, RCU,
> > etc...
> 
> How slow are the failed ones? The number of retries is essentially the
> cost of successful seamcall / cost of busy seamcall.

I haven't measured, but would be surprised if it's less than 2000 cycles.

> If HOST_PRIORITY works, even a not-small-but-not-huge number of
> retries would be better than the IPIs. IPIs are not cheap either.

Agreed, but we also need to account for the operations that are conflicting.
E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy waiting
for the to-be-zapped S-EPT entry to be available doesn't make much sense.

> > > For zero step detection, my reading is that it's TDH.VP.ENTER that fails;
> > > not any of the MEM seamcalls.  For that one to be resolved, it should be
> > > enough to do take and release the mmu_lock back to back, which ensures that
> > > all pending critical sections have completed (that is,
> > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > > loop.  Adding a vCPU stat for that one is a good idea, too.
> >
> > As above and in my discussion with Rick, I would prefer to kick vCPUs to force
> > forward progress, especially for the zero-step case.  If KVM gets to the point
> > where it has retried TDH.VP.ENTER on the same fault so many times that zero-step
> > kicks in, then it's time to kick and wait, not keep retrying blindly.
> 
> Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
> TDH.VP.ENTER is delayed.

Blocked, not delayed.  Yes, it's TDH.VP.ENTER that "fails", but to get past
TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to guarantee
forward progress for TDH.MEM (or whatever the operations are called).

Though I wonder, are there any guest/host operations that can conflict
if the vCPU is faulting?  Maybe this particular scenario is a complete non-issue.

> If it is delayed to the point of failing, we can do write_lock/write_unlock()
> in the vCPU entry path.

I was thinking that KVM could set a flag (another synthetic error code bit?) to
tell the page fault handler that it needs to kick vCPUs.  But as above, it might
be unnecessary.

> My issue is that, even if we could make it a bit better by looking at
> the TDX module source code, we don't have enough information to make a
> good choice.  For now we should start with something _easy_, even if
> it may not be the greatest.

I am not opposed to an easy/simple solution, but I am very much opposed to
implementing a retry loop without understanding _exactly_ when and why it's
needed.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10 15:57               ` Sean Christopherson
@ 2024-09-10 16:28                 ` Edgecombe, Rick P
  2024-09-10 17:42                   ` Sean Christopherson
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-10 16:28 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: Yao, Yuan, Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, kvm@vger.kernel.org,
	dmatlack@google.com, nik.borisov@suse.com

On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote:
> > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> > which documents that the TDX module returns TDX_OPERAND_BUSY on a
> > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> > to prevent contention in the TDX module.
> 
> Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
> lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read
> mode.

AUG does take other locks as exclusive:
https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c

I count 5 locks in total as well. I think trying to mirror the locking in KVM
will be an uphill battle.

> 
> So for the operations that KVM can do in parallel, the locking should
> effectively
> be per-entry.  Because KVM will never throw away an entire S-EPT root, zapping
> SPTEs will need to be done while holding mmu_lock for write, i.e. KVM
> shouldn't
> have problems with host tasks competing for the TDX module's VM-wide lock.
> 
> > If we want to be a bit more optimistic, let's do something more
> > sophisticated, like only take the lock after the first busy reply. But
> > the spinlock is the easiest way to completely remove host-induced
> > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.
> 
> I am not convinced that's necessary or a good idea.  I worry that doing so
> would
> just kick the can down the road, and potentially make the problems harder to
> solve,
> e.g. because we'd have to worry about regressing existing setups.
> 
> > > > It is still kinda bad that guests can force the VMM to loop, but the VMM
> > > > can
> > > > always say enough is enough.  In other words, let's assume that a limit
> > > > of
> > > > 16 is probably appropriate but we can also increase the limit and crash
> > > > the
> > > > VM if things become ridiculous.
> > > 
> > > 2 :-)
> > > 
> > > One try that guarantees no other host task is accessing the S-EPT entry,
> > > and a
> > > second try after blasting IPI to kick vCPUs to ensure no guest-side task
> > > has
> > > locked the S-EPT entry.
> > 
> > Fair enough. Though in principle it is possible to race and have the
> > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
> 
> My limit of '2' is predicated on the lock being a "host priority" lock, i.e.
> that
> kicking vCPUs would ensure the lock has been dropped and can't be re-acquired
> by
> the guest.

So kicking would be to try to break loose any deadlock we encountered? It sounds
like the kind of kludge that could be hard to remove.

> 
> > So I would make it 5 or so just to be safe.
> > 
> > > My concern with an arbitrary retry loop is that we'll essentially
> > > propagate the
> > > TDX module issues to the broader kernel.  Each of those SEAMCALLs is
> > > slooow, so
> > > retrying even ~20 times could exceed the system's tolerances for
> > > scheduling, RCU,
> > > etc...
> > 
> > How slow are the failed ones? The number of retries is essentially the
> > cost of successful seamcall / cost of busy seamcall.
> 
> I haven't measured, but would be surprised if it's less than 2000 cycles.
> 
> > If HOST_PRIORITY works, even a not-small-but-not-huge number of
> > retries would be better than the IPIs. IPIs are not cheap either.
> 
> Agreed, but we also need to account for the operations that are conflicting.
> E.g. if KVM is trying to zap a S-EPT that the guest is accessing, then busy
> waiting
> for the to-be-zapped S-EPT entry to be available doesn't make much sense.
> 
> > > > For zero step detection, my reading is that it's TDH.VP.ENTER that
> > > > fails;
> > > > not any of the MEM seamcalls.  For that one to be resolved, it should be
> > > > enough to do take and release the mmu_lock back to back, which ensures
> > > > that
> > > > all pending critical sections have completed (that is,
> > > > "write_lock(&kvm->mmu_lock); write_unlock(&kvm->mmu_lock);").  And then
> > > > loop.  Adding a vCPU stat for that one is a good idea, too.
> > > 
> > > As above and in my discussion with Rick, I would prefer to kick vCPUs to
> > > force
> > > forward progress, especially for the zero-step case.  If KVM gets to the
> > > point
> > > where it has retried TDH.VP.ENTER on the same fault so many times that
> > > zero-step
> > > kicks in, then it's time to kick and wait, not keep retrying blindly.
> > 
> > Wait, zero-step detection should _not_ affect TDH.MEM latency. Only
> > TDH.VP.ENTER is delayed.
> 
> Blocked, not delayed.  Yes, it's TDH.VP.ENTER that "fails", but to get past
> TDH.VP.ENTER, KVM needs to resolve the underlying fault, i.e. needs to
> guarantee
> forward progress for TDH.MEM (or whatever the operations are called).
> 
> Though I wonder, are there any operations guest/host operations that can
> conflict
> if the vCPU is faulting?  Maybe this particular scenario is a complete non-
> issue.
> 
> > If it is delayed to the point of failing, we can do
> > write_lock/write_unlock()
> > in the vCPU entry path.
> 
> I was thinking that KVM could set a flag (another synthetic error code bit?)
> to
> tell the page fault handler that it needs to kick vCPUs.  But as above, it
> might
> be unnecessary.
> 
> > My issue is that, even if we could make it a bit better by looking at
> > the TDX module source code, we don't have enough information to make a
> > good choice.  For now we should start with something _easy_, even if
> > it may not be the greatest.
> 
> I am not opposed to an easy/simple solution, but I am very much opposed to
> implementing a retry loop without understanding _exactly_ when and why it's
> needed.

I'd like to explore letting KVM do the retries (i.e. EPT fault loop) a bit more.
We can verify that we can survive zero-step in this case. After all, zero-step
doesn't kill the TD, just generates an EPT violation exit. So we would just need
to verify that the EPT violation getting generated would result in KVM
eventually fixing whatever zero-step is requiring.

Then we would have to handle BUSY in each SEAMCALL call chain, which currently
we don't. Like the zapping case. If we ended up needing a retry loop for limited
cases like that, at least it would be more limited.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10 16:28                 ` Edgecombe, Rick P
@ 2024-09-10 17:42                   ` Sean Christopherson
  2024-09-13  8:36                     ` Yan Zhao
  0 siblings, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-09-10 17:42 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	Yan Y Zhao, kvm@vger.kernel.org, dmatlack@google.com,
	nik.borisov@suse.com

On Tue, Sep 10, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-09-10 at 08:57 -0700, Sean Christopherson wrote:
> > > Only if the TDX module returns BUSY per-SPTE (as suggested by 18.1.3,
> > > which documents that the TDX module returns TDX_OPERAND_BUSY on a
> > > CMPXCHG failure). If it returns BUSY per-VM, FROZEN_SPTE is not enough
> > > to prevent contention in the TDX module.
> > 
> > Looking at the TDX module code, things like (UN)BLOCK and REMOVE take a per-VM
> > lock in write mode, but ADD, AUG, and PROMOTE/DEMOTE take the lock in read
> > mode.
> 
> AUG does take other locks as exclusive:
> https://github.com/intel/tdx-module/blob/tdx_1.5/src/vmm_dispatcher/api_calls/tdh_mem_page_aug.c

Only a lock on the underlying physical page.  guest_memfd should prevent mapping
the same HPA into multiple GPAs, and FROZEN_SPTE should prevent two vCPUs from
concurrently AUGing the same GPA+HPA.

> I count 5 locks in total as well. I think trying to mirror the locking in KVM
> will be an uphill battle.

I don't want to mirror the locking, I want to understand and document the
expectations and rules.  "Throw 16 noodles and hope one sticks" is not a recipe
for success.

> > So for the operations that KVM can do in parallel, the locking should
> > effectively be per-entry.  Because KVM will never throw away an entire S-EPT
> > root, zapping SPTEs will need to be done while holding mmu_lock for write,
> > i.e. KVM shouldn't have problems with host tasks competing for the TDX
> > module's VM-wide lock.
> > 
> > > If we want to be a bit more optimistic, let's do something more
> > > sophisticated, like only take the lock after the first busy reply. But
> > > the spinlock is the easiest way to completely remove host-induced
> > > TDX_OPERAND_BUSY, and only have to deal with guest-induced ones.
> > 
> > I am not convinced that's necessary or a good idea.  I worry that doing so
> > would just kick the can down the road, and potentially make the problems
> > harder to solve, e.g. because we'd have to worry about regressing existing
> > setups.
> > 
> > > > > It is still kinda bad that guests can force the VMM to loop, but the
> > > > > VMM can always say enough is enough.  In other words, let's assume
> > > > > that a limit of 16 is probably appropriate but we can also increase
> > > > > the limit and crash the VM if things become ridiculous.
> > > > 
> > > > 2 :-)
> > > > 
> > > > One try that guarantees no other host task is accessing the S-EPT entry,
> > > > and a second try after blasting IPI to kick vCPUs to ensure no guest-side
> > > > task has locked the S-EPT entry.
> > > 
> > > Fair enough. Though in principle it is possible to race and have the
> > > vCPU re-run and re-issue a TDG call before KVM re-issues the TDH call.
> > 
> > My limit of '2' is predicated on the lock being a "host priority" lock,
> > i.e.  that kicking vCPUs would ensure the lock has been dropped and can't
> > be re-acquired by the guest.
> 
> So kicking would be to try to break loose any deadlock we encountered? It sounds
> like the kind of kludge that could be hard to remove.

No, the intent of the kick would be to wait for vCPUs to exit, which in turn
guarantees that any locks held by vCPUs have been dropped.  Again, this idea is
predicated on the lock being "host priority", i.e. that vCPUs can't re-take the
lock before KVM.
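
To sketch what that could look like for a zap path (again assumptions, not the
posted code: arguments are elided, tdx_operand_busy() is a placeholder status
check, and KVM_REQ_OUTSIDE_GUEST_MODE is just one way to force vCPUs out of the
guest):

        err = tdh_mem_range_block(/* tdr, gpa, level, ... */);
        if (unlikely(tdx_operand_busy(err))) {
                /*
                 * Force all vCPUs out of the guest so that any guest-side lock
                 * is dropped; with a "host priority" lock the guest can't
                 * retake it, so the single retry below is expected to succeed.
                 */
                kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
                err = tdh_mem_range_block(/* tdr, gpa, level, ... */);
        }
        if (KVM_BUG_ON(err, kvm))
                return -EIO;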

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-10  8:16   ` Paolo Bonzini
@ 2024-09-10 23:49     ` Edgecombe, Rick P
  2024-10-14  6:34     ` Yan Zhao
  1 sibling, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-10 23:49 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 10:16 +0200, Paolo Bonzini wrote:
> 
> I'd do it slightly different:

Fair enough, thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-10  9:33       ` Paolo Bonzini
@ 2024-09-10 23:58         ` Edgecombe, Rick P
  2024-09-11  1:05           ` Yan Zhao
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-10 23:58 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai
  Cc: nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, Zhao, Yan Y,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 11:33 +0200, Paolo Bonzini wrote:
> > But actually, I wonder if we need to remove the KVM_BUG_ON(). I think if you
> > did a KVM_PRE_FAULT_MEMORY and then deleted the memslot you could hit it?
> 
> I think all paths to handle_removed_pt() are safe:
> 
> __tdp_mmu_zap_root
>          tdp_mmu_zap_root
>                  kvm_tdp_mmu_zap_all
>                          kvm_arch_flush_shadow_all
>                                  kvm_flush_shadow_all
>                                          kvm_destroy_vm (*)
>                                          kvm_mmu_notifier_release (*)
>                  kvm_tdp_mmu_zap_invalidated_roots
>                          kvm_mmu_zap_all_fast (**)
> kvm_tdp_mmu_zap_sp
>          kvm_recover_nx_huge_pages (***)

But not all paths to remove_external_spte():
kvm_arch_flush_shadow_memslot()
  kvm_mmu_zap_memslot_leafs()
    kvm_tdp_mmu_unmap_gfn_range()
      tdp_mmu_zap_leafs()
        tdp_mmu_iter_set_spte()
          tdp_mmu_set_spte()
            remove_external_spte()
              tdx_sept_remove_private_spte()

But we can probably keep the warning if we prevent KVM_PRE_FAULT_MEMORY as you
pointed out earlier. I didn't see that kvm->arch.pre_fault_allowed got added.
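
For reference, the kind of gating being referred to is just a flag check near
the top of the pre-fault path, something like the below. A sketch only; exactly
where the check sits, and when TDX would set the flag, are assumptions here.

        /* In the x86 KVM_PRE_FAULT_MEMORY handler: */
        if (!vcpu->kvm->arch.pre_fault_allowed)
                return -EOPNOTSUPP;

TDX could then leave the flag clear until the TD is far enough along that
pre-faulting can no longer trip the KVM_BUG_ON() discussed above.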

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-10 10:13         ` Paolo Bonzini
@ 2024-09-11  0:11           ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  0:11 UTC (permalink / raw)
  To: pbonzini@redhat.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com,
	seanjc@google.com, Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 12:13 +0200, Paolo Bonzini wrote:
> > Yan, do you think it is sufficient?
> 
> If you're actually requiring that the other locks are sufficient, then 
> there can be no ENOENT.
> 
> Maybe:
> 
>         /*
>          * The private mem cannot be zapped after kvm_tdp_map_page()
>          * because all paths are covered by slots_lock and the
>          * filemap invalidate lock.  Check that they are indeed enough.
>          */
>         if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
>                 scoped_guard(read_lock, &kvm->mmu_lock) {
>                         if (KVM_BUG_ON(!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa),
>                                        kvm)) {
>                                 ret = -EIO;
>                                 goto out;
>                         }
>                 }
>         }

True. We can put it behind CONFIG_KVM_PROVE_MMU.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory
  2024-09-10 10:16   ` Paolo Bonzini
@ 2024-09-11  0:12     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  0:12 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 12:16 +0200, Paolo Bonzini wrote:
> While useful for the reviewers, in the end this is the simplest possible 
> userspace API (the one that we started with) and the objections just 
> went away because it reuses the infrastructure that was introduced for 
> pre-faulting memory.
> 
> So I'd replace everything with:

Sure, thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 21/21] KVM: TDX: Handle vCPU dissociation
  2024-09-10 10:45   ` Paolo Bonzini
@ 2024-09-11  0:17     ` Edgecombe, Rick P
  2024-11-04  9:45     ` Yan Zhao
  1 sibling, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  0:17 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 12:45 +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +/*
> > + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> > + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUS.
> 
> ... or when a vCPU is migrated.

Right, that would be better.

> 
> > + * Protected by interrupt mask.  This list is manipulated in process
> > context
> > + * of vCPU and IPI callback.  See tdx_flush_vp_on_cpu().
> > + */
> > +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> 
> It may be a bit more modern, or cleaner, to use a local_lock here 
> instead of just relying on local_irq_disable/enable.

Hmm, yes. That is weird. If there is some reason for it, it at least deserves a comment.
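
For what it's worth, the local_lock variant would look roughly like the below
(a sketch; the lock name is made up and the list manipulation is only
paraphrased from the patch):

        static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
        static DEFINE_PER_CPU(local_lock_t, associated_tdvcpus_lock) =
                INIT_LOCAL_LOCK(associated_tdvcpus_lock);

        /* e.g. when associating a vCPU with the current CPU: */
        local_lock_irq(&associated_tdvcpus_lock);
        list_add(&to_tdx(vcpu)->cpu_list, this_cpu_ptr(&associated_tdvcpus));
        local_unlock_irq(&associated_tdvcpus_lock);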

> 
> Another more organizational question is whether to put this in the 
> VM/vCPU series but I might be missing something obvious.

I moved it into this series because it intersected with the TLB flushing
functionality. In our internal analysis we considered all the TLB flushing
scenarios together. But yes, it kind of straddles two areas. If we think that
bit is discussed enough, we can move it back to its original series.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-10 10:24   ` Paolo Bonzini
@ 2024-09-11  0:19     ` Edgecombe, Rick P
  2024-09-13 13:33       ` Adrian Hunter
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  0:19 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 12:24 +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
> > +                                         enum pg_level level, kvm_pfn_t
> > pfn)
> > +{
> > +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > +       /* Returning error here to let TDP MMU bail out early. */
> > +       if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) {
> > +               tdx_unpin(kvm, pfn);
> > +               return -EINVAL;
> > +       }
> 
> Should this "if" already be part of patch 14, and in 
> tdx_sept_set_private_spte() rather than tdx_mem_page_record_premap_cnt()?

Hmm, makes sense to me. Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-10 10:49   ` Paolo Bonzini
@ 2024-09-11  0:30     ` Edgecombe, Rick P
  2024-09-11 10:39       ` Paolo Bonzini
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  0:30 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On Tue, 2024-09-10 at 12:49 +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Update TDX's hook of set_external_spte() to record pre-mapping cnt instead
> > of doing nothing and returning when TD is not finalized.
> > 
> > TDX uses ioctl KVM_TDX_INIT_MEM_REGION to initialize its initial guest
> > memory. This ioctl calls kvm_gmem_populate() to get guest pages and in
> > tdx_gmem_post_populate(), it will
> > (1) Map page table pages into KVM mirror page table and private EPT.
> > (2) Map guest pages into KVM mirror page table. In the propagation hook,
> >       just record pre-mapping cnt without mapping the guest page into
> >       private EPT.
> > (3) Map guest pages into private EPT and decrease pre-mapping cnt.
> > 
> > Do not map guest pages into private EPT directly in step (2), because TDX
> > requires TDH.MEM.PAGE.ADD() to add a guest page before TD is finalized,
> > which copies page content from a source page from user to target guest page
> > to be added. However, source page is not available via common interface
> > kvm_tdp_map_page() in step (2).
> > 
> > Therefore, just pre-map the guest page into KVM mirror page table and
> > record the pre-mapping cnt in TDX's propagation hook. The pre-mapping cnt
> > would be decreased in ioctl KVM_TDX_INIT_MEM_REGION when the guest page is
> > mapped into private EPT.
> 
> Stale commit message; squashing all of it into patch 20 is an easy cop 
> out...

Arh, yes this has details that are not relevant to the patch.

Squashing it seems fine, but I wasn't sure about whether we actually needed this
nr_premapped. It was one of the things we decided to punt a decision on in order
to continue our debates on the list. So we need to pick up the debate again.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-10 23:58         ` Edgecombe, Rick P
@ 2024-09-11  1:05           ` Yan Zhao
  0 siblings, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-09-11  1:05 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
	Huang, Kai, nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Wed, Sep 11, 2024 at 07:58:01AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2024-09-10 at 11:33 +0200, Paolo Bonzini wrote:
> > > But actually, I wonder if we need to remove the KVM_BUG_ON(). I think if
> > > you did a KVM_PRE_FAULT_MEMORY and then deleted the memslot you could hit it?
> > 
> > I think all paths to handle_removed_pt() are safe:
> > 
> > __tdp_mmu_zap_root
> >          tdp_mmu_zap_root
> >                  kvm_tdp_mmu_zap_all
> >                          kvm_arch_flush_shadow_all
> >                                  kvm_flush_shadow_all
> >                                          kvm_destroy_vm (*)
> >                                          kvm_mmu_notifier_release (*)
> >                  kvm_tdp_mmu_zap_invalidated_roots
> >                          kvm_mmu_zap_all_fast (**)
> > kvm_tdp_mmu_zap_sp
> >          kvm_recover_nx_huge_pages (***)
> 
> But not all paths to remove_external_spte():
> kvm_arch_flush_shadow_memslot()
>   kvm_mmu_zap_memslot_leafs()
>     kvm_tdp_mmu_unmap_gfn_range()
>       tdp_mmu_zap_leafs()
>         tdp_mmu_iter_set_spte()
>           tdp_mmu_set_spte()
>             remove_external_spte()
>               tdx_sept_remove_private_spte()
> 
> But we can probably keep the warning if we prevent KVM_PRE_FAULT_MEMORY as you
> pointed out earlier. I didn't see that kvm->arch.pre_fault_allowed got added.
Note:
If we disallow vCPU creation before the VM ioctl KVM_TDX_INIT_VM is done,
the vCPU ioctl KVM_PRE_FAULT_MEMORY can't be executed.
Then we can't hit the 
"if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))"
in tdx_sept_remove_private_spte().

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-09 23:58             ` Sean Christopherson
  2024-09-10  0:50               ` Edgecombe, Rick P
@ 2024-09-11  1:17               ` Huang, Kai
  2024-09-11  2:48                 ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Huang, Kai @ 2024-09-11  1:17 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: linux-kernel@vger.kernel.org, Yuan Yao, isaku.yamahata@gmail.com,
	Yan Y Zhao, dmatlack@google.com, kvm@vger.kernel.org,
	nik.borisov@suse.com, pbonzini@redhat.com


>> Host-Side (SEAMCALL) Operation
>> ------------------------------
>> The host VMM is expected to retry host-side operations that fail with a
>> TDX_OPERAND_BUSY status. The host priority mechanism helps guarantee that at
>> most after a limited time (the longest guest-side TDX module flow) there will be
>> no contention with a guest TD attempting to acquire access to the same resource.
>>
>> Lock operations process the HOST_PRIORITY bit as follows:
>>     - A SEAMCALL (host-side) function that fails to acquire a lock sets the lock’s
>>     HOST_PRIORITY bit and returns a TDX_OPERAND_BUSY status to the host VMM. It is
>>     the host VMM’s responsibility to re-attempt the SEAMCALL function until it
>>     succeeds; otherwise, the HOST_PRIORITY bit remains set, preventing the guest TD
>>     from acquiring the lock.
>>     - A SEAMCALL (host-side) function that succeeds to acquire a lock clears the
>>     lock’s HOST_PRIORITY bit.
> 
> *sigh*
> 
>> Guest-Side (TDCALL) Operation
>> -----------------------------
>> A TDCALL (guest-side) function that attempts to acquire a lock fails if
>> HOST_PRIORITY is set to 1; a TDX_OPERAND_BUSY status is returned to the guest.
>> The guest is expected to retry the operation.
>>
>> Guest-side TDCALL flows that acquire a host priority lock have an upper bound on
>> the host-side latency for that lock; once a lock is acquired, the flow either
>> releases within a fixed upper time bound, or periodically monitors the
>> HOST_PRIORITY flag to see if the host is attempting to acquire the lock.
>> "
>>
>> So KVM can't fully prevent TDX_OPERAND_BUSY with KVM-side locks, because the
>> TDX module is also arbitrating contention with the guest. We need to double
>> check this, but I *think* this HOST_PRIORITY bit doesn't come into play for the
>> functionality we need to exercise for base support.
>>
>> The thing that makes me nervous about a retry-based solution is the potential
>> for some kind of deadlock-like pattern. Just to gather your opinion, if there
>> was some SEAMCALL contention that couldn't be locked around from KVM, but came
>> with some strong, well-described guarantees, would a retry loop still be a hard NAK?
> 
> I don't know.  It would depend on what operations can hit BUSY, and what the
> alternatives are.  E.g. if we can narrow down the retry paths to a few select
> cases where it's (a) expected, (b) unavoidable, and (c) has minimal risk of
> deadlock, then maybe that's the least awful option.
> 
> What I don't think KVM should do is blindly retry N number of times, because
> then there are effectively no rules whatsoever.  E.g. if KVM is tearing down a
> VM then KVM should assert on immediate success.  And if KVM is handling a fault
> on behalf of a vCPU, then KVM can and should resume the guest and let it retry.
> Ugh, but that would likely trigger the annoying "zero-step mitigation" crap.
> 
> What does this actually mean in practice?  What's the threshold, 

FWIW, the limit in the public TDX module code is 6:

   #define STEPPING_EPF_THRESHOLD 6   // Threshold of confidence in detecting
                                      // EPT fault-based stepping in progress

We might be able to change it to a larger value, but we would need to 
understand why that is necessary.

> is the VM-Enter
> error uniquely identifiable, 

When zero-step mitigation is active in the module, TDH.VP.ENTER tries to 
grab the SEPT lock thus it can fail with SEPT BUSY error.  But if it 
does grab the lock successfully, it exits to VMM with EPT violation on 
that GPA immediately.

In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step 
mitigation" must have been active.  A normal EPT violation _COULD_ mean 
mitigation is already active, but AFAICT we don't have a way to tell 
that in the EPT violation.

> and can KVM rely on HOST_PRIORITY to be set if KVM
> runs afoul of the zero-step mitigation?

I think HOST_PRIORITY is always set if SEPT SEAMCALLs fail with BUSY.

> 
>    After a pre-determined number of such EPT violations occur on the same instruction,
>    the TDX module starts tracking the GPAs that caused Secure EPT faults and fails
>    further host VMM attempts to enter the TD VCPU unless previously faulting private
>    GPAs are properly mapped in the Secure EPT.
> 
> If HOST_PRIORITY is set, then one idea would be to resume the guest if there's
> SEPT contention on a fault, and then _if_ the zero-step mitigation is triggered,
> kick all vCPUs (via IPI) to ensure that the contended SEPT entry is unlocked and
> can't be re-locked by the guest.  That would allow KVM to guarantee forward
> progress without an arbitrary retry loop in the TDP MMU.

I think this should work.

It doesn't seem we can tell whether the zero-step mitigation is active 
in the EPT violation TDEXIT, or when a SEPT SEAMCALL fails with SEPT BUSY. 
But when any SEPT SEAMCALL fails with SEPT BUSY, if we just kick all 
vCPUs and make them wait until the next retry is done (which must be 
successful, otherwise it is an illegal error), then this should handle both 
contention from the guest and the zero-step mitigation.

> 
> Similarly, if KVM needs to zap a SPTE and hits BUSY, kick all vCPUs to ensure the
> one and only retry is guaranteed to succeed.

Yeah seems so.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 07/21] KVM: TDX: Add load_mmu_pgd method for TDX
  2024-09-04  3:07 ` [PATCH 07/21] KVM: TDX: Add load_mmu_pgd method for TDX Rick Edgecombe
@ 2024-09-11  2:48   ` Chao Gao
  2024-09-11  2:49     ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Chao Gao @ 2024-09-11  2:48 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, isaku.yamahata,
	yan.y.zhao, nik.borisov, linux-kernel

On Tue, Sep 03, 2024 at 08:07:37PM -0700, Rick Edgecombe wrote:
>From: Sean Christopherson <sean.j.christopherson@intel.com>
>
>TDX uses two EPT pointers, one for the private half of the GPA space and
>one for the shared half. The private half uses the normal EPT_POINTER vmcs
>field, which is managed in a special way by the TDX module. For TDX, KVM is
>not allowed to operate on it directly. The shared half uses a new
>SHARED_EPT_POINTER field and will be managed by the conventional MMU
>management operations that operate directly on the EPT root. This means for
>TDX the .load_mmu_pgd() operation will need to know to use the
>SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in
>x86 ops for load_mmu_pgd() that either directs the write to the existing
>vmx implementation or a TDX one.
>
>tdx_load_mmu_pgd() is so much simpler than vmx_load_mmu_pgd() since for the
>TDX mode of operation, EPT will always be used and KVM does not need to be
>involved in virtualization of CR3 behavior. So tdx_load_mmu_pgd() can
>simply write to SHARED_EPT_POINTER.
>
>Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>---
>TDX MMU part 2 v1:
>- update the commit msg with the version rephrased by Rick.
>  https://lore.kernel.org/all/78b1024ec3f5868e228baf797c6be98c5397bd49.camel@intel.com/
>
>v19:
>- Add WARN_ON_ONCE() to tdx_load_mmu_pgd() and drop unconditional mask
>---
> arch/x86/include/asm/vmx.h |  1 +
> arch/x86/kvm/vmx/main.c    | 13 ++++++++++++-
> arch/x86/kvm/vmx/tdx.c     |  5 +++++
> arch/x86/kvm/vmx/x86_ops.h |  4 ++++
> 4 files changed, 22 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>index d77a31039f24..3e003183a4f7 100644
>--- a/arch/x86/include/asm/vmx.h
>+++ b/arch/x86/include/asm/vmx.h
>@@ -237,6 +237,7 @@ enum vmcs_field {
> 	TSC_MULTIPLIER_HIGH             = 0x00002033,
> 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
> 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
>+	SHARED_EPT_POINTER		= 0x0000203C,
> 	PID_POINTER_TABLE		= 0x00002042,
> 	PID_POINTER_TABLE_HIGH		= 0x00002043,
> 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>index d63685ea95ce..c9dfa3aa866c 100644
>--- a/arch/x86/kvm/vmx/main.c
>+++ b/arch/x86/kvm/vmx/main.c
>@@ -100,6 +100,17 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	vmx_vcpu_reset(vcpu, init_event);
> }
> 
>+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>+			int pgd_level)
>+{
>+	if (is_td_vcpu(vcpu)) {
>+		tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
>+		return;
>+	}
>+
>+	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
>+}
>+
> static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> {
> 	if (!is_td(kvm))
>@@ -229,7 +240,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> 	.write_tsc_offset = vmx_write_tsc_offset,
> 	.write_tsc_multiplier = vmx_write_tsc_multiplier,
> 
>-	.load_mmu_pgd = vmx_load_mmu_pgd,
>+	.load_mmu_pgd = vt_load_mmu_pgd,
> 
> 	.check_intercept = vmx_check_intercept,
> 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
>diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>index 2ef95c84ee5b..8f43977ef4c6 100644
>--- a/arch/x86/kvm/vmx/tdx.c
>+++ b/arch/x86/kvm/vmx/tdx.c
>@@ -428,6 +428,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	 */
> }
> 
>+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>+{

pgd_level isn't used. So, I think we can either drop it or assert that it matches
the secure EPT level.

>+	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>+}
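
To illustrate the assert option (a sketch; which level field to compare against
is an assumption, not something from the posted series):

    void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
    {
            /* The mirror root level should match the TD's secure EPT level. */
            WARN_ON_ONCE(pgd_level != vcpu->arch.mmu->root_role.level);

            td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
    }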

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-11  1:17               ` Huang, Kai
@ 2024-09-11  2:48                 ` Edgecombe, Rick P
  2024-09-11 22:55                   ` Huang, Kai
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  2:48 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai
  Cc: Yao, Yuan, linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	Zhao, Yan Y, pbonzini@redhat.com, kvm@vger.kernel.org,
	nik.borisov@suse.com, dmatlack@google.com

On Wed, 2024-09-11 at 13:17 +1200, Huang, Kai wrote:
> > is the VM-Enter
> > error uniquely identifiable, 
> 
> When zero-step mitigation is active in the module, TDH.VP.ENTER tries to 
> grab the SEPT lock thus it can fail with SEPT BUSY error.  But if it 
> does grab the lock successfully, it exits to VMM with EPT violation on 
> that GPA immediately.
> 
> In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step 
> mitigation" must have been active.  

I think this isn't true. A SEPT-locking-related BUSY, maybe. But there are other
things going on that can return BUSY.

> A normal EPT violation _COULD_ mean 
> mitigation is already active, but AFAICT we don't have a way to tell 
> that in the EPT violation.
> 
> > and can KVM rely on HOST_PRIORITY to be set if KVM
> > runs afoul of the zero-step mitigation?
> 
> I think HOST_PRIORITY is always set if SEPT SEAMCALLs fail with BUSY.

What led you to think this? It seemed more limited to me.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 07/21] KVM: TDX: Add load_mmu_pgd method for TDX
  2024-09-11  2:48   ` Chao Gao
@ 2024-09-11  2:49     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11  2:49 UTC (permalink / raw)
  To: Gao, Chao
  Cc: linux-kernel@vger.kernel.org, seanjc@google.com, Huang, Kai,
	isaku.yamahata@gmail.com, Zhao, Yan Y, kvm@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com, dmatlack@google.com

On Wed, 2024-09-11 at 10:48 +0800, Chao Gao wrote:
> > index 2ef95c84ee5b..8f43977ef4c6 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -428,6 +428,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool
> > init_event)
> >          */
> > }
> > 
> > +void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> > +{
> 
> pgd_level isn't used. So, I think we can either drop it or assert that it
> matches
> the secure EPT level.

Oh, yea. Good point.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-04  3:07 ` [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX Rick Edgecombe
  2024-09-10  8:16   ` Paolo Bonzini
@ 2024-09-11  6:25   ` Xu Yilun
  2024-09-11 17:28     ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Xu Yilun @ 2024-09-11  6:25 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, isaku.yamahata,
	yan.y.zhao, nik.borisov, linux-kernel

> +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> +	 * private EPT will be flushed on the next TD enter.
> +	 * No need to call tdx_track() here again even when this callback is as
> +	 * a result of zapping private EPT.
> +	 * Just invoke invept() directly here to work for both shared EPT and
> +	 * private EPT.

IIUC, private EPT is already flushed in .remove_private_spte(), so in
theory we don't have to invept() for private EPT?

Thanks,
Yilun

> +	 */
> +	if (is_td_vcpu(vcpu)) {
> +		ept_sync_global();
> +		return;
> +	}
> +
> +	vmx_flush_tlb_all(vcpu);
> +}

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-04  3:07 ` [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem Rick Edgecombe
  2024-09-09 13:59   ` Paolo Bonzini
@ 2024-09-11  8:52   ` Chao Gao
  2024-09-11 16:29     ` Edgecombe, Rick P
  2024-09-12  0:39   ` Huang, Kai
  2024-09-12  1:19   ` Huang, Kai
  3 siblings, 1 reply; 139+ messages in thread
From: Chao Gao @ 2024-09-11  8:52 UTC (permalink / raw)
  To: Rick Edgecombe
  Cc: seanjc, pbonzini, kvm, kai.huang, dmatlack, isaku.yamahata,
	yan.y.zhao, nik.borisov, linux-kernel

On Tue, Sep 03, 2024 at 08:07:35PM -0700, Rick Edgecombe wrote:
>Teach EPT violation helper to check shared mask of a GPA to find out
>whether the GPA is for private memory.
>
>When EPT violation is triggered after TD accessing a private GPA, KVM will
>exit to user space if the corresponding GFN's attribute is not private.
>User space will then update GFN's attribute during its memory conversion
>process. After that, TD will re-access the private GPA and trigger EPT
>violation again. Only with GFN's attribute matches to private, KVM will
>fault in private page, map it in mirrored TDP root, and propagate changes
>to private EPT to resolve the EPT violation.
>
>Relying on GFN's attribute tracking xarray to determine if a GFN is
>private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
>violations.
>
>Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
>Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>---
>TDX MMU part 2 v1:
> - Split from "KVM: TDX: handle ept violation/misconfig exit"
>---
> arch/x86/kvm/vmx/common.h | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
>index 78ae39b6cdcd..10aa12d45097 100644
>--- a/arch/x86/kvm/vmx/common.h
>+++ b/arch/x86/kvm/vmx/common.h
>@@ -6,6 +6,12 @@
> 
> #include "mmu.h"
> 
>+static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
>+{
>+	/* For TDX the direct mask is the shared mask. */
>+	return !kvm_is_addr_direct(kvm, gpa);
>+}
>+
> static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> 					     unsigned long exit_qualification)
> {
>@@ -28,6 +34,13 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> 		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
> 			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> 
>+	/*
>+	 * Don't rely on GFN's attribute tracking xarray to prevent EPT violation
>+	 * loops.
>+	 */

The comment seems a bit odd to me. We cannot use the gfn attribute from the
attribute xarray simply because here we need to determine if *this access* is
to private memory, which may not match the gfn attribute. Even if there are
other ways to prevent an infinite EPT violation loop, we still need to check
the shared bit in the faulting GPA.

>+	if (kvm_is_private_gpa(vcpu->kvm, gpa))
>+		error_code |= PFERR_PRIVATE_ACCESS;
>+
> 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> }
> 
>-- 
>2.34.1
>
>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-11  0:30     ` Edgecombe, Rick P
@ 2024-09-11 10:39       ` Paolo Bonzini
  2024-09-11 16:36         ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-11 10:39 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kvm@vger.kernel.org, seanjc@google.com, Zhao, Yan Y,
	nik.borisov@suse.com, dmatlack@google.com, Huang, Kai,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org

On Wed, Sep 11, 2024 at 2:30 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> Arh, yes this has details that are not relevant to the patch.
>
> Squashing it seems fine, but I wasn't sure about whether we actually needed this
> nr_premapped. It was one of the things we decided to punt a decision on in order
> to continue our debates on the list. So we need to pick up the debate again.

I think keeping nr_premapped is safer.

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-11  8:52   ` Chao Gao
@ 2024-09-11 16:29     ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11 16:29 UTC (permalink / raw)
  To: Gao, Chao
  Cc: linux-kernel@vger.kernel.org, seanjc@google.com, Huang, Kai,
	isaku.yamahata@gmail.com, Zhao, Yan Y, kvm@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com, dmatlack@google.com

On Wed, 2024-09-11 at 16:52 +0800, Chao Gao wrote:
> > +       /*
> > +        * Don't rely on GFN's attribute tracking xarray to prevent EPT
> > violation
> > +        * loops.
> > +        */
> 
> The comment seems a bit odd to me. We cannot use the gfn attribute from the
> attribute xarray simply because here we need to determine if *this access* is
> to private memory, which may not match the gfn attribute. Even if there are
> other ways to prevent an infinite EPT violation loop, we still need to check
> the shared bit in the faulting GPA.

Yea this comment is not super informative. We can probably just drop it.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-11 10:39       ` Paolo Bonzini
@ 2024-09-11 16:36         ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11 16:36 UTC (permalink / raw)
  To: pbonzini@redhat.com
  Cc: seanjc@google.com, Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, Zhao, Yan Y, dmatlack@google.com,
	kvm@vger.kernel.org, nik.borisov@suse.com

On Wed, 2024-09-11 at 12:39 +0200, Paolo Bonzini wrote:
> On Wed, Sep 11, 2024 at 2:30 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > Arh, yes this has details that are not relevant to the patch.
> > 
> > Squashing it seems fine, but I wasn't sure about whether we actually needed
> > this
> > nr_premapped. It was one of the things we decided to punt a decision on in
> > order
> > to continue our debates on the list. So we need to pick up the debate again.
> 
> I think keeping nr_premapped is safer.

Heh, well it's not hurting anything except adding a small amount of complexity,
so I guess we can cancel the debate. Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-11  6:25   ` Xu Yilun
@ 2024-09-11 17:28     ` Edgecombe, Rick P
  2024-09-12  4:54       ` Yan Zhao
  2024-09-12  7:47       ` Xu Yilun
  0 siblings, 2 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-11 17:28 UTC (permalink / raw)
  To: yilun.xu@linux.intel.com
  Cc: linux-kernel@vger.kernel.org, seanjc@google.com, Huang, Kai,
	isaku.yamahata@gmail.com, Zhao, Yan Y, kvm@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com, dmatlack@google.com

On Wed, 2024-09-11 at 14:25 +0800, Xu Yilun wrote:
> > +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> > +{
> > +       /*
> > +        * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> > +        * private EPT will be flushed on the next TD enter.
> > +        * No need to call tdx_track() here again even when this callback
> > +        * is as a result of zapping private EPT.
> > +        * Just invoke invept() directly here to work for both shared EPT
> > +        * and private EPT.
> 
> IIUC, private EPT is already flushed in .remove_private_spte(), so in
> theory we don't have to invept() for private EPT?

I think you are talking about the comment, and not an optimization. So changing:
"Just invoke invept() directly here to work for both shared EPT and private EPT"
to just "Just invoke invept() directly here to work for shared EPT".

Seems good to me.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-11  2:48                 ` Edgecombe, Rick P
@ 2024-09-11 22:55                   ` Huang, Kai
  0 siblings, 0 replies; 139+ messages in thread
From: Huang, Kai @ 2024-09-11 22:55 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com
  Cc: Yao, Yuan, linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	Zhao, Yan Y, pbonzini@redhat.com, kvm@vger.kernel.org,
	nik.borisov@suse.com, dmatlack@google.com



On 11/09/2024 2:48 pm, Edgecombe, Rick P wrote:
> On Wed, 2024-09-11 at 13:17 +1200, Huang, Kai wrote:
>>> is the VM-Enter
>>> error uniquely identifiable,
>>
>> When zero-step mitigation is active in the module, TDH.VP.ENTER tries to
>> grab the SEPT lock thus it can fail with SEPT BUSY error.  But if it
>> does grab the lock successfully, it exits to VMM with EPT violation on
>> that GPA immediately.
>>
>> In other words, TDH.VP.ENTER returning SEPT BUSY means "zero-step
>> mitigation" must have been active.
> 
> I think this isn't true. A SEPT-locking-related BUSY, maybe. But there are other
> things going on that can return BUSY.

I thought we were talking about SEPT locking here.  For BUSY in general, 
yes, it tries to grab other locks too (e.g., the shared lock of 
TDR/TDCS/TDVPS, etc.), but I suppose those are impossible to contend in 
the current KVM TDX implementation?  Perhaps we need to look more closely 
to make sure.

> 
>> A normal EPT violation _COULD_ mean
>> mitigation is already active, but AFAICT we don't have a way to tell
>> that in the EPT violation.
>>
>>> and can KVM rely on HOST_PRIORITY to be set if KVM
>>> runs afoul of the zero-step mitigation?
>>
>> I think HOST_PRIORITY is always set if SEPT SEAMCALLs fail with BUSY.
> 
> What led you to think this? It seemed more limited to me.

I interpreted that from the spec (chapter 18.1.4 Concurrency Restrictions 
with Host Priority).  But looking at the module's public code, it seems 
HOST_PRIORITY is only set when the host fails to grab a lock that can 
also be contended from the guest (see acquire_sharex_lock_hp_ex() and 
acquire_sharex_lock_hp_sh()), which makes sense anyway.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX
  2024-09-04  3:07 ` [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX Rick Edgecombe
  2024-09-09 15:26   ` Paolo Bonzini
@ 2024-09-12  0:15   ` Huang, Kai
  1 sibling, 0 replies; 139+ messages in thread
From: Huang, Kai @ 2024-09-12  0:15 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com, pbonzini@redhat.com,
	kvm@vger.kernel.org
  Cc: dmatlack@google.com, isaku.yamahata@gmail.com, Zhao, Yan Y,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org



On 4/09/2024 3:07 pm, Edgecombe, Rick P wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Disable TDX support when TDP MMU or mmio caching aren't supported.
> 
> As the TDP MMU is becoming more mainstream than the legacy MMU, the legacy
> MMU support for TDX isn't implemented.

Nitpicks:

I suppose we should use imperative mode since this is part of what this 
patch does?

Like:

TDX needs extensive MMU code changes to make it work.  As the TDP MMU is 
becoming more mainstream than the legacy MMU, for simplicity only support 
TDX with the TDP MMU for now.

> 
> TDX requires KVM MMIO caching. Without MMIO caching, KVM will go to MMIO
> emulation without installing SPTEs for MMIOs. However, a TDX guest is
> protected and KVM would hit errors when trying to emulate MMIOs for a TDX
> guest during instruction decoding. So, a TDX guest relies on SPTEs being
> installed for MMIOs, with no RWX bits and with the VE suppress bit unset,
> to inject #VE into the TDX guest. The TDX guest would then issue a TDVMCALL
> in the #VE handler to perform instruction decoding and have the host do MMIO
> emulation.

AFAICT the above two paragraphs are talking about two different things, and 
one doesn't have a hard dependency on the other.

Should we separate this into two patches: one patch to change 'checking 
enable_ept' to 'checking tdp_mmu_enabled' (which justifies the first 
paragraph), and the other to add the MMIO caching check?

The final code after the two patches could still end up with ...

[...]

> +	if (!tdp_mmu_enabled || !enable_mmio_caching)
> +		return -EOPNOTSUPP;
> +

... this though.

But feel free to ignore (since these are nitpicks).


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-04  3:07 ` [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem Rick Edgecombe
  2024-09-09 13:59   ` Paolo Bonzini
  2024-09-11  8:52   ` Chao Gao
@ 2024-09-12  0:39   ` Huang, Kai
  2024-09-12 13:58     ` Sean Christopherson
  2024-09-12  1:19   ` Huang, Kai
  3 siblings, 1 reply; 139+ messages in thread
From: Huang, Kai @ 2024-09-12  0:39 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, pbonzini, kvm
  Cc: dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov, linux-kernel



On 4/09/2024 3:07 pm, Rick Edgecombe wrote:
> Teach EPT violation helper to check shared mask of a GPA to find out
> whether the GPA is for private memory.
> 
> When EPT violation is triggered after TD accessing a private GPA, KVM will
> exit to user space if the corresponding GFN's attribute is not private.
> User space will then update GFN's attribute during its memory conversion
> process. After that, TD will re-access the private GPA and trigger EPT
> violation again. Only when the GFN's attribute matches private will KVM
> fault in the private page, map it in the mirrored TDP root, and propagate
> changes to the private EPT to resolve the EPT violation.
> 
> Relying on GFN's attribute tracking xarray to determine if a GFN is
> private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
> violations.
> 
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>   - Split from "KVM: TDX: handle ept violation/misconfig exit"
> ---
>   arch/x86/kvm/vmx/common.h | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> index 78ae39b6cdcd..10aa12d45097 100644
> --- a/arch/x86/kvm/vmx/common.h
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -6,6 +6,12 @@
>   
>   #include "mmu.h"
>   
> +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> +	/* For TDX the direct mask is the shared mask. */
> +	return !kvm_is_addr_direct(kvm, gpa);
> +}

Does this get used in any other places?  If not, I think we can open-code 
this in __vmx_handle_ept_violation().

The reason is I think the name kvm_is_private_gpa() is too generic and 
this is in the header file.  E.g., one can come up with another 
kvm_is_private_gpa() checking the memory attributes to tell whether a 
GPA is private.

Or we rename it to something like

	__vmx_is_faulting_gpa_private()
?

Which clearly says it is checking the *faulting* GPA.

> +
>   static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>   					     unsigned long exit_qualification)
>   {
> @@ -28,6 +34,13 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
>   		error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
>   			      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>   
> +	/*
> +	 * Don't rely on GFN's attribute tracking xarray to prevent EPT violation
> +	 * loops.
> +	 */
> +	if (kvm_is_private_gpa(vcpu->kvm, gpa))
> +		error_code |= PFERR_PRIVATE_ACCESS;
> +
>   	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>   


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-04  3:07 ` [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem Rick Edgecombe
                     ` (2 preceding siblings ...)
  2024-09-12  0:39   ` Huang, Kai
@ 2024-09-12  1:19   ` Huang, Kai
  3 siblings, 0 replies; 139+ messages in thread
From: Huang, Kai @ 2024-09-12  1:19 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, pbonzini, kvm
  Cc: dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov, linux-kernel



On 4/09/2024 3:07 pm, Rick Edgecombe wrote:
> Teach EPT violation helper to check shared mask of a GPA to find out
> whether the GPA is for private memory.
> 
> When EPT violation is triggered after TD accessing a private GPA, KVM will
> exit to user space if the corresponding GFN's attribute is not private.
> User space will then update GFN's attribute during its memory conversion
> process. After that, TD will re-access the private GPA and trigger EPT
> violation again. Only when the GFN's attribute matches private will KVM
> fault in the private page, map it in the mirrored TDP root, and propagate
> changes to the private EPT to resolve the EPT violation.
> 
> Relying on GFN's attribute tracking xarray to determine if a GFN is
> private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
> violations.

Sorry for not finishing in the previous reply:

IMHO at the very beginning of the fault handler, we should just use the 
hardware as the source of truth to determine whether a *faulting* GPA is 
private or not.  It doesn't quite matter whether KVM maintains memory 
attributes and how it handles them -- it just must handle this properly.

E.g., even if using the memory attributes (to determine private) didn't 
lead to endless EPT violations, it would still be wrong to use them here, 
because at the beginning of the fault handler we must know the *hardware* 
behaviour.

So I think the changelog should be something like this (the title could 
be enhanced too perhaps):

When a TDX guest's memory access causes an EPT violation, TDX determines 
whether the faulting GPA is private or shared by checking whether the 
faulting GPA contains the shared bit (either bit 47 or bit 51, depending 
on the configuration of the guest).

KVM maintains an xarray to record whether a GPA is private or not, e.g., 
for KVM_X86_SW_PROTECTED_VM guests.  TDX needs to honor this too.  The 
memory attributes (private or shared) that KVM records for a given GPA 
may not match the type of the faulting GPA.  E.g., the TDX guest can 
explicitly convert a memory range from private to shared or the opposite. 
In this case KVM will exit to userspace to handle it (e.g., change to the 
new memory attributes, issue the memory conversion and go back to the 
guest).  After KVM determines the faulting type is legal and can 
proceed, it sets up the actual mapping, using TDX-specific ops for the 
private one.

The common KVM fault handler uses the PFERR_PRIVATE_ACCESS bit of the 
error code to tell whether a faulting GPA is private.  Check the 
faulting GPA for TDX and convert it to PFERR_PRIVATE_ACCESS so the 
common code can handle it.

The specific operations to set up the private mapping when the faulting 
GPA is private will follow in future patches.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-11 17:28     ` Edgecombe, Rick P
@ 2024-09-12  4:54       ` Yan Zhao
  2024-09-12 14:44         ` Edgecombe, Rick P
  2024-09-12  7:47       ` Xu Yilun
  1 sibling, 1 reply; 139+ messages in thread
From: Yan Zhao @ 2024-09-12  4:54 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: yilun.xu@linux.intel.com, linux-kernel@vger.kernel.org,
	seanjc@google.com, Huang, Kai, isaku.yamahata@gmail.com,
	kvm@vger.kernel.org, pbonzini@redhat.com, nik.borisov@suse.com,
	dmatlack@google.com

On Thu, Sep 12, 2024 at 01:28:18AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-09-11 at 14:25 +0800, Xu Yilun wrote:
> > > +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> > > +{
> > > +       /*
> > > +        * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> > > +        * private EPT will be flushed on the next TD enter.
> > > +        * No need to call tdx_track() here again even when this callback
> > > +        * is as a result of zapping private EPT.
> > > +        * Just invoke invept() directly here to work for both shared EPT
> > > +        * and private EPT.
> > 
> > IIUC, private EPT is already flushed in .remove_private_spte(), so in
> > theory we don't have to invept() for private EPT?
> 
> I think you are talking about the comment, and not an optimization. So changing:
> "Just invoke invept() directly here to work for both shared EPT and private EPT"
> to just "Just invoke invept() directly here to work for shared EPT".
> 
> Seems good to me.
Hmm, what about just adding
"Due to the lack of context within this callback function, it cannot
  determine which EPT has been affected by zapping."?

as below:

"TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
private EPT will be flushed on the next TD enter.
No need to call tdx_track() here again even when this callback is
as a result of zapping private EPT.

Due to the lack of context within this callback function, it cannot
determine which EPT has been affected by zapping.
Just invoke invept() directly here to work for both shared EPT and
private EPT for simplicity."

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-11 17:28     ` Edgecombe, Rick P
  2024-09-12  4:54       ` Yan Zhao
@ 2024-09-12  7:47       ` Xu Yilun
  1 sibling, 0 replies; 139+ messages in thread
From: Xu Yilun @ 2024-09-12  7:47 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: linux-kernel@vger.kernel.org, seanjc@google.com, Huang, Kai,
	isaku.yamahata@gmail.com, Zhao, Yan Y, kvm@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com, dmatlack@google.com

On Wed, Sep 11, 2024 at 05:28:18PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-09-11 at 14:25 +0800, Xu Yilun wrote:
> > > +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> > > +{
> > > +       /*
> > > +        * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> > > +        * private EPT will be flushed on the next TD enter.
> > > +        * No need to call tdx_track() here again even when this callback
> > > +        * is as a result of zapping private EPT.
> > > +        * Just invoke invept() directly here to work for both shared EPT
> > > +        * and private EPT.
> > 
> > IIUC, private EPT is already flushed in .remove_private_spte(), so in
> > theory we don't have to invept() for private EPT?
> 
> I think you are talking about the comment, and not an optimization. So changing:

Yes, just the comment.

> "Just invoke invept() directly here to work for both shared EPT and private EPT"
> to just "Just invoke invept() directly here to work for shared EPT".

Maybe also note that invept() is redundant for private EPT in some cases,
but we implement it like this for simplicity.

Thanks,
Yilun

> 
> Seems good to me.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-12  0:39   ` Huang, Kai
@ 2024-09-12 13:58     ` Sean Christopherson
  2024-09-12 14:43       ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-09-12 13:58 UTC (permalink / raw)
  To: Kai Huang
  Cc: Rick Edgecombe, pbonzini, kvm, dmatlack, isaku.yamahata,
	yan.y.zhao, nik.borisov, linux-kernel

On Thu, Sep 12, 2024, Kai Huang wrote:
> > +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	/* For TDX the direct mask is the shared mask. */
> > +	return !kvm_is_addr_direct(kvm, gpa);
> > +}
> 
> Does this get used in any other places?  If not, I think we can open-code this
> in __vmx_handle_ept_violation().
> 
> The reason is I think the name kvm_is_private_gpa() is too generic and this
> is in the header file.

+1, kvm_is_private_gpa() is much too generic.  I knew what the code was *supposed*
to do, but had to look at the implementation to verify that's actually what it did.

> E.g., one can come up with another kvm_is_private_gpa() checking the memory
> attributes to tell whether a GPA is private.
> 
> Or we rename it to something like
> 
> 	__vmx_is_faulting_gpa_private()
> ?
> 
> Which clearly says it is checking the *faulting* GPA.

I don't think that necessarily solves the problem either, because the reader has
to know that KVM looks at the shared bit.

If open coding is undesirable, maybe a very literal name, e.g. vmx_is_shared_bit_set()?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-12 13:58     ` Sean Christopherson
@ 2024-09-12 14:43       ` Edgecombe, Rick P
  2024-09-12 14:46         ` Paolo Bonzini
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-12 14:43 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com, nik.borisov@suse.com,
	dmatlack@google.com, isaku.yamahata@gmail.com, Zhao, Yan Y,
	linux-kernel@vger.kernel.org

On Thu, 2024-09-12 at 06:58 -0700, Sean Christopherson wrote:
> > Which clearly says it is checking the *faulting* GPA.
> 
> I don't think that necessarily solves the problem either, because the reader
> has to know that KVM looks at the shared bit.
> 
> If open coding is undesirable

Yea, I think it's used in enough places that a helper is worth it.

> , maybe a very literal name, e.g. vmx_is_shared_bit_set()?

Sure, thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-12  4:54       ` Yan Zhao
@ 2024-09-12 14:44         ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-12 14:44 UTC (permalink / raw)
  To: Zhao, Yan Y
  Cc: seanjc@google.com, Huang, Kai, yilun.xu@linux.intel.com,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, kvm@vger.kernel.org, nik.borisov@suse.com,
	dmatlack@google.com

On Thu, 2024-09-12 at 12:54 +0800, Yan Zhao wrote:
> "TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> private EPT will be flushed on the next TD enter.
> No need to call tdx_track() here again even when this callback is
> as a result of zapping private EPT.
> 
> Due to the lack of context within this callback function, it cannot
> determine which EPT has been affected by zapping.
> Just invoke invept() directly here to work for both shared EPT and
> private EPT for simplicity."

Yes, I agree this is better.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem
  2024-09-12 14:43       ` Edgecombe, Rick P
@ 2024-09-12 14:46         ` Paolo Bonzini
  0 siblings, 0 replies; 139+ messages in thread
From: Paolo Bonzini @ 2024-09-12 14:46 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Huang, Kai, kvm@vger.kernel.org,
	nik.borisov@suse.com, dmatlack@google.com,
	isaku.yamahata@gmail.com, Zhao, Yan Y,
	linux-kernel@vger.kernel.org

On Thu, Sep 12, 2024 at 4:43 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2024-09-12 at 06:58 -0700, Sean Christopherson wrote:
> > > Which clearly says it is checking the *faulting* GPA.
> >
> > I don't think that necessarily solves the problem either, because the reader
> > has to know that KVM looks at the shared bit.
> >
> > If open coding is undesirable
>
> Yea, I think it's used in enough places that a helper is worth it.
>
> > , maybe a very literal name, e.g. vmx_is_shared_bit_set()?
>
> Sure, thanks.

I didn't see a problem with kvm_is_private_gpa(), but I do prefer
something that has vmx_ or vt_ in the name after seeing it. My
preference would go to something like vt_is_tdx_private_gpa(), but I'm
not going to force one name or another.
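
(For illustration only, such a helper might look roughly like the sketch
below. The is_td()/to_kvm_tdx() helpers and the shared-bit mask field are
assumptions about the series, not settled code.)

/* Hypothetical sketch of the helper being discussed; names are assumptions. */
static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
{
	/* Non-TDX guests have no shared bit, so every GPA is "private". */
	if (!is_td(kvm))
		return true;

	/* For TDX, a GPA is private iff the shared bit is clear. */
	return !(gpa & to_kvm_tdx(kvm)->shared_gpa_mask);
}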

Paolo


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-10 17:42                   ` Sean Christopherson
@ 2024-09-13  8:36                     ` Yan Zhao
  2024-09-13 17:23                       ` Sean Christopherson
  2024-09-13 19:19                       ` Edgecombe, Rick P
  0 siblings, 2 replies; 139+ messages in thread
From: Yan Zhao @ 2024-09-13  8:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

This is a lock status report of TDX module for current SEAMCALL retry issue
based on code in TDX module public repo https://github.com/intel/tdx-module.git
branch TDX_1.5.05.

TL;DR:
- tdh_mem_track() can contend with tdh_vp_enter().
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
- tdg_mem_page_accept() can contend with other tdh_mem*().

Proposal:
- Return -EAGAIN directly in ops link_external_spt/set_external_spte when
  tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
- Kick off vCPUs at the beginning of page removal path, i.e. before the
  tdh_mem_range_block().
  Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
  (one possible optimization:
   since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
   do not kick off vCPUs in normal conditions.
   When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
   TD enter until page removal completes.)

Below is the detailed analysis:

=== Background ===
In TDX module, there are 4 kinds of locks:
1. sharex_lock:
   Normal read/write lock. (no host priority stuff)

2. sharex_hp_lock:
   Just like a normal read/write lock, except that the host can set a host
   priority bit on failure.
   When the guest tries to acquire the lock and sees the host priority bit set,
   it returns "busy host priority" directly, letting the host win.
   After the host acquires the lock successfully, the host priority bit is
   cleared. (See the sketch after this list.)

3. sept entry lock:
   Lock utilizing software bits in SEPT entry.
   HP bit (Host priority): bit 52 
   EL bit (Entry lock): bit 11, used as a bit lock.

   - host sets HP bit when host fails to acquire EL bit lock;
   - host resets HP bit when host wins.
   - guest returns "busy host priority" if HP bit is found set when guest tries
     to acquire EL bit lock.

4. mutex lock:
   Lock with only 2 states: free, locked.
   (Not the same as a Linux mutex: it does not reschedule; it can pause() for
   debugging.)
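
A rough toy model of the host-priority protocol in (2) and (3), for
illustration only (this is not TDX module code; the names and bit layout
below are made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EL_BIT  (1u << 0)	/* "entry lock", the actual bit lock      */
#define HP_BIT  (1u << 1)	/* "host priority", host wants this lock  */

struct hp_lock { _Atomic uint32_t bits; };

static bool host_try_lock(struct hp_lock *l)
{
	if (!(atomic_fetch_or(&l->bits, EL_BIT) & EL_BIT)) {
		/* Host won: clear the host priority bit on success. */
		atomic_fetch_and(&l->bits, ~HP_BIT);
		return true;
	}
	/* Lose now, win later: the SEAMCALL returns BUSY to the VMM. */
	atomic_fetch_or(&l->bits, HP_BIT);
	return false;
}

static bool guest_try_lock(struct hp_lock *l)
{
	/* Guest backs off with "busy host priority" whenever HP is set. */
	if (atomic_load(&l->bits) & HP_BIT)
		return false;

	return !(atomic_fetch_or(&l->bits, EL_BIT) & EL_BIT);
}

static void hp_unlock(struct hp_lock *l)
{
	atomic_fetch_and(&l->bits, ~EL_BIT);
}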

===Resources & users list===

Resources              SHARED  users              EXCLUSIVE users
------------------------------------------------------------------------
(1) TDR                tdh_mng_rdwr               tdh_mng_create
                       tdh_vp_create              tdh_mng_add_cx
                       tdh_vp_addcx               tdh_mng_init
                       tdh_vp_init                tdh_mng_vpflushdone
                       tdh_vp_enter               tdh_mng_key_config 
                       tdh_vp_flush               tdh_mng_key_freeid
                       tdh_vp_rd_wr               tdh_mr_extend
                       tdh_mem_sept_add           tdh_mr_finalize
                       tdh_mem_sept_remove        tdh_vp_init_apicid
                       tdh_mem_page_aug           tdh_mem_page_add
                       tdh_mem_page_remove
                       tdh_mem_range_block
                       tdh_mem_track
                       tdh_mem_range_unblock
                       tdh_phymem_page_reclaim
------------------------------------------------------------------------
(2) KOT                tdh_phymem_cache_wb        tdh_mng_create
                                                  tdh_mng_vpflushdone
                                                  tdh_mng_key_freeid
------------------------------------------------------------------------
(3) TDCS               tdh_mng_rdwr
                       tdh_vp_create
                       tdh_vp_addcx
                       tdh_vp_init
                       tdh_vp_init_apicid
                       tdh_vp_enter 
                       tdh_vp_rd_wr
                       tdh_mem_sept_add
                       tdh_mem_sept_remove
                       tdh_mem_page_aug
                       tdh_mem_page_remove
                       tdh_mem_range_block
                       tdh_mem_track
                       tdh_mem_range_unblock
------------------------------------------------------------------------
(4) TDVPR              tdh_vp_rd_wr                tdh_vp_create
                                                   tdh_vp_addcx
                                                   tdh_vp_init
                                                   tdh_vp_init_apicid
                                                   tdh_vp_enter 
                                                   tdh_vp_flush 
------------------------------------------------------------------------
(5) TDCS epoch         tdh_vp_enter                tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock    tdh_mem_sept_add            tdh_vp_enter
                       tdh_mem_page_aug            tdh_mem_sept_remove
                       tdh_mem_page_remove         tdh_mem_range_block
                                                   tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry                                     tdh_mem_sept_add
                                                   tdh_mem_sept_remove
                                                   tdh_mem_page_aug
                                                   tdh_mem_page_remove
                                                   tdh_mem_range_block
                                                   tdh_mem_range_unblock
                                                   tdg_mem_page_accept

Current KVM interested SEAMCALLs:
------------------------------------------------------------------------
  SEAMCALL                Lock Name        Lock Type        Resource       
tdh_mng_create          sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_lock        EXCLUSIVE        KOT

tdh_mng_add_cx          sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_mng_init            sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS

tdh_mng_vpflushdone     sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_lock        EXCLUSIVE        KOT

tdh_mng_key_config      sharex_hp_lock     EXCLUSIVE        TDR

tdh_mng_key_freeid      sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_lock        EXCLUSIVE        KOT

tdh_mng_rdwr            sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_mr_extend           sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS

tdh_mr_finalize         sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS

tdh_vp_create           sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_hp_lock     EXCLUSIVE        TDVPR

tdh_vp_addcx            sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_vp_init             sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_vp_init_apicid      sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_vp_enter(*)         sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS epoch_lock
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock

tdh_vp_flush            sharex_hp_lock     EXCLUSIVE        TDVPR
                        sharex_hp_lock     SHARED           TDR

tdh_vp_rd_wr            sharex_hp_lock     SHARED           TDVPR
                        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS

tdh_mem_sept_add        sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_mem_sept_remove     sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to remove 

tdh_mem_page_add        sharex_hp_lock     EXCLUSIVE        TDR
                        sharex_hp_lock     NO_LOCK          TDCS
                        sharex_lock        NO_LOCK          TDCS secure_ept_lock
                        sharex_hp_lock     EXCLUSIVE        page to add

tdh_mem_page_aug        sharex_hp_lock     SHARED           TDR 
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS secure_ept_lock 
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to aug

tdh_mem_page_remove     sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        SHARED           TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify
                        sharex_hp_lock     EXCLUSIVE        page to remove

tdh_mem_range_block     sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify

tdh_mem_track           sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS epoch_lock

tdh_mem_range_unblock   sharex_hp_lock     SHARED           TDR
                        sharex_hp_lock     SHARED           TDCS
                        sharex_lock        EXCLUSIVE        TDCS secure_ept_lock
                        sept entry lock    HOST,EXCLUSIVE   SEPT entry to modify

tdh_phymem_page_reclaim sharex_hp_lock     EXCLUSIVE        page to reclaim
                        sharex_hp_lock     SHARED           TDR

tdh_phymem_cache_wb     mutex_lock                       per package wbt_entries 
                        sharex_lock        SHARED           KOT

tdh_phymem_page_wbinvd  sharex_hp_lock     SHARED           page to be wbinvd


Current KVM interested TDCALLs:
------------------------------------------------------------------------
tdg_mem_page_accept     sept entry lock    GUEST            SEPT entry to modify

TDCALLs like tdg_mr_rtmr_extend(), tdg_servtd_rd_wr(), tdg_mem_page_attr_wr()
tdg_mem_page_attr_rd() are not included.

*:(a) tdh_vp_enter() holds shared TDR lock and exclusive TDVPR lock, the two
      locks are released when exiting to VMM.
  (b) tdh_vp_enter() holds shared TDCS lock and shared TDCS epoch_lock lock,
      releases them before entering non-root mode.
  (c) tdh_vp_enter() holds shared epoch lock, contending with tdh_mem_track(). 
  (d) tdh_vp_enter() only holds EXCLUSIVE secure_ept_lock when 0-stepping is
      suspected, i.e. when last_epf_gpa_list is not empty.
      When an EPT violation happens, the TDX module checks whether the guest
      RIP equals the guest RIP at the last TD entry. Only when this is true for
      6 consecutive times is the GPA recorded in last_epf_gpa_list. The list is
      reset once the guest RIP of an EPT violation and the last TD entry RIP
      differ.
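
A rough model of the heuristic described in (d), for illustration only (this
is not TDX module code; the structure, field names and list size below are
guesses based purely on the description above):

#include <stdint.h>

#define ZERO_STEP_THRESHOLD	6
#define EPF_GPA_LIST_SIZE	4	/* size of last_epf_gpa_list is a guess */

struct zero_step_state {
	uint64_t last_entry_rip;	/* guest RIP at the last TD entry         */
	unsigned int same_rip_faults;	/* consecutive EPT violations at that RIP */
	uint64_t epf_gpa_list[EPF_GPA_LIST_SIZE];
	unsigned int nr_gpas;		/* non-zero => 0-stepping suspected       */
};

static void on_ept_violation(struct zero_step_state *s, uint64_t guest_rip,
			     uint64_t gpa)
{
	if (guest_rip != s->last_entry_rip) {
		/* The guest made progress: reset tracking and empty the list. */
		s->same_rip_faults = 0;
		s->nr_gpas = 0;
		return;
	}

	/* Same RIP as the last TD entry: record the GPA after 6 hits in a row. */
	if (++s->same_rip_faults >= ZERO_STEP_THRESHOLD &&
	    s->nr_gpas < EPF_GPA_LIST_SIZE)
		s->epf_gpa_list[s->nr_gpas++] = gpa;
}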


=== Summary ===
For the 8 kinds of common resources protected in the TDX module:

(1) TDR:
    There are only shared accesses to TDR during runtime (i.e. after the TD is
    finalized and before TD teardown), if we don't need to support calling
    tdh_vp_init_apicid() at runtime (e.g. for vCPU hotplug).
    tdh_vp_enter() holds the shared TDR lock until exiting to the VMM.
    TDCALLs do not acquire the TDR lock.

(2) KOT (Key Ownership Table)
    Current KVM code should have avoided contention on this resource.

(3) TDCS:
    Accessed shared, or accessed with no lock when the TDR is exclusively
    locked.
    SEAMCALLs at runtime (after the TD is finalized and before TD teardown) do
    not contend with each other on TDCS.
    tdh_vp_enter() holds the shared TDCS lock and releases it before entering
    non-root mode.
    Current TDCALLs for basic TDX do not acquire this lock.

(4) TDVPR:
    Per-vCPU, accessed exclusively except for tdh_vp_rd_wr().
    tdh_vp_enter() holds the exclusive TDVPR lock until exiting to the VMM.
    TDCALLs do not acquire the TDVPR lock.

(5) TDCS epoch:
    tdh_mem_track() requests exclusive access, and tdh_vp_enter() requests
    shared access.
    tdh_mem_track() can contend with tdh_vp_enter().

(6) SEPT tree:
    Protected by secure_ept_lock (sharex_lock).
    tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() hold the shared
    lock; tdh_mem_sept_remove()/tdh_mem_range_block()/tdh_mem_range_unblock()
    hold the exclusive lock.
    tdh_vp_enter() requests exclusive access when 0-stepping is suspected,
    contending with all other tdh_mem*().
    The guest does not acquire this lock.

    So, the KVM mmu_lock already prevents contention between
    tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() and
    tdh_mem_sept_remove()/tdh_mem_range_block().
    Though tdh_mem_sept_add()/tdh_mem_page_aug() races with tdh_vp_enter(),
    returning -EAGAIN directly is fine for them.
    The remaining issue is the contention between tdh_vp_enter() and
    tdh_mem_page_remove()/tdh_mem_sept_remove()/tdh_mem_range_block().

(7) SEPT entry:
    All exclusive access.
    tdg_mem_page_accept() may contend with other tdh_mem*() on a specific SEPT
    entry.

(8) PAMT entry for target pages (e.g. page to add/aug/remove/reclaim/wbinvd):
    Though they are all exclusively locked, no contention should occur as long
    as the pages belong to different PAMT entries.

Conclusion:
Current KVM code should have avoided contention on resources (1)-(4) and (8),
while contention is still possible on (5), (6) and (7).
- tdh_mem_track() can contend with tdh_vp_enter() for (5)
- tdh_vp_enter() contends with tdh_mem*() for (6) when 0-stepping is suspected.
- tdg_mem_page_accept() can contend with other tdh_mem*() for (7).

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-11  0:19     ` Edgecombe, Rick P
@ 2024-09-13 13:33       ` Adrian Hunter
  2024-09-13 19:49         ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Adrian Hunter @ 2024-09-13 13:33 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org, pbonzini@redhat.com,
	seanjc@google.com
  Cc: Zhao, Yan Y, nik.borisov@suse.com, dmatlack@google.com,
	Huang, Kai, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org

On 11/09/24 03:19, Edgecombe, Rick P wrote:
> On Tue, 2024-09-10 at 12:24 +0200, Paolo Bonzini wrote:
>> On 9/4/24 05:07, Rick Edgecombe wrote:
>>> +static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
>>> +                                         enum pg_level level, kvm_pfn_t
>>> pfn)
>>> +{
>>> +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>>> +
>>> +       /* Returning error here to let TDP MMU bail out early. */
>>> +       if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm)) {
>>> +               tdx_unpin(kvm, pfn);
>>> +               return -EINVAL;
>>> +       }
>>
>> Should this "if" already be part of patch 14, and in 
>> tdx_sept_set_private_spte() rather than tdx_mem_page_record_premap_cnt()?
> 
> Hmm, makes sense to me. Thanks.

It is already in patch 14, so just remove it from this patch
presumably.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13  8:36                     ` Yan Zhao
@ 2024-09-13 17:23                       ` Sean Christopherson
  2024-09-13 19:19                         ` Edgecombe, Rick P
                                           ` (2 more replies)
  2024-09-13 19:19                       ` Edgecombe, Rick P
  1 sibling, 3 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-09-13 17:23 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Fri, Sep 13, 2024, Yan Zhao wrote:
> This is a lock status report of TDX module for current SEAMCALL retry issue
> based on code in TDX module public repo https://github.com/intel/tdx-module.git
> branch TDX_1.5.05.
> 
> TL;DR:
> - tdh_mem_track() can contend with tdh_vp_enter().
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.

The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
whatever reason.

Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
hits the fault?

For non-TDX, resuming the guest and letting the vCPU retry the instruction is
desirable because in many cases, the winning task will install a valid mapping
before KVM can re-run the vCPU, i.e. the fault will be fixed before the
instruction is re-executed.  In the happy case, that provides optimal performance
as KVM doesn't introduce any extra delay/latency.

But for TDX, the math is different as the cost of a re-hitting a fault is much,
much higher, especially in light of the zero-step issues.

E.g. if the TDP MMU returns a unique error code for the frozen case, and
kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
then the TDX EPT violation path can safely retry locally, similar to the do-while
loop in kvm_tdp_map_page().

The only part I don't like about this idea is having two "retry" return values,
which creates the potential for bugs due to checking one but not the other.

Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
option better even though the out-param is a bit gross, because it makes it more
obvious that the "frozen_spte" is a special case that doesn't need attention for
most paths.
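
For illustration, the consumer side in the TDX EPT violation handler could
then look something like this (function shape and names are assumptions, not
part of the diffs below):

static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 error_code)
{
	bool frozen_spte;
	int r;

	do {
		frozen_spte = false;
		/* The out-param reports that the fault hit a frozen SPTE. */
		r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0,
				       &frozen_spte);
		/* Retry "locally", i.e. without re-doing tdh_vp_enter(). */
	} while (r > 0 && frozen_spte);

	return r;
}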

> - tdg_mem_page_accept() can contend with other tdh_mem*().
> 
> Proposal:
> - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
>   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.

What is the result of returning -EAGAIN?  E.g. does KVM redo tdh_vp_enter()?

Also tdh_mem_sept_add() is strictly pre-finalize, correct?  I.e. should never
contend with tdg_mem_page_accept() because vCPUs can't yet be run.

Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
The page isn't yet mapped, so why would the guest be allowed to take a lock on
the S-EPT entry?

> - Kick off vCPUs at the beginning of page removal path, i.e. before the
>   tdh_mem_range_block().
>   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.

This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
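
Roughly like the sketch below; the request name is made up, the pattern just
mirrors the KVM_REQ_MCLOCK_INPROGRESS usage (the pending request keeps vCPUs
out of the guest until it is cleared on every vCPU):

	struct kvm_vcpu *vcpu;
	unsigned long i;

	/* Kick vCPUs out of the guest; the pending request blocks re-entry. */
	kvm_make_all_cpus_request(kvm, KVM_REQ_TDX_REMOVE_INPROGRESS);

	/* ... tdh_mem_range_block(), tdh_mem_track(), tdh_mem_page_remove() ... */

	/* Page removal is done, allow TD entries again. */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_TDX_REMOVE_INPROGRESS, vcpu);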

>   (one possible optimization:
>    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
>    do not kick off vCPUs in normal conditions.
>    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow

Which SEAMCALL is this specifically?  tdh_mem_range_block()?

>    TD enter until page removal completes.)


Idea #1:
---
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b45258285c9c..8113c17bd2f6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
                        return -EINTR;
                cond_resched();
                r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
-       } while (r == RET_PF_RETRY);
+       } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FROZEN);
 
        if (r < 0)
                return r;
@@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
                vcpu->stat.pf_spurious++;
 
        if (r != RET_PF_EMULATE)
-               return 1;
+               return r;
 
 emulate:
        return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8d3fb3c8c213..690f03d7daae 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * and of course kvm_mmu_do_page_fault().
  *
  * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
+ * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_RETRY: let CPU fault again on the address.
+ * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
+ *                     Let the CPU fault again on the address, or retry the
+ *                     fault "locally", i.e. without re-entering the guest.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
  *                         gfn and retry, or emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
- * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
  * Any names added to this enum should be exported to userspace for use in
@@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
  * will allow for efficient machine code when checking for CONTINUE, e.g.
  * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
+ *
+ * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
+ * scheme, where 1==success, translates '1' to RET_PF_FIXED.
  */
 enum {
        RET_PF_CONTINUE = 0,
+       RET_PF_FIXED    = 1,
        RET_PF_RETRY,
+       RET_PF_RETRY_FROZEN,
        RET_PF_EMULATE,
        RET_PF_WRITE_PROTECTED,
        RET_PF_INVALID,
-       RET_PF_FIXED,
        RET_PF_SPURIOUS,
 };
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5a475a6456d4..cbf9e46203f3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 retry:
        rcu_read_unlock();
+       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
+               return RET_PF_RETRY_FROZEN;
        return ret;
 }
 
---


Idea #2:
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/mmu/mmu.c          | 12 ++++++------
 arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
 arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  4 ++--
 6 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 46e0a466d7fb..200fecd1de88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
-		       void *insn, int insn_len);
+		       void *insn, int insn_len, bool *frozen_spte);
 void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b45258285c9c..207840a316d3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 		return;
 
 	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
-				  true, NULL, NULL);
+				  true, NULL, NULL, NULL);
 
 	/*
 	 * Account fixed page faults, otherwise they'll never be counted, but
@@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 		trace_kvm_page_fault(vcpu, fault_address, error_code);
 
 		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
-				insn_len);
+				       insn_len, NULL);
 	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
 		vcpu->arch.apf.host_apf_flags = 0;
 		local_irq_disable();
@@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
 		if (signal_pending(current))
 			return -EINTR;
 		cond_resched();
-		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
 	} while (r == RET_PF_RETRY);
 
 	if (r < 0)
@@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 }
 
 int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
-		       void *insn, int insn_len)
+				void *insn, int insn_len, bool *frozen_spte)
 {
 	int r, emulation_type = EMULTYPE_PF;
 	bool direct = vcpu->arch.mmu->root_role.direct;
@@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 		vcpu->stat.pf_taken++;
 
 		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
-					  &emulation_type, NULL);
+					  &emulation_type, NULL, frozen_spte);
 		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
 			return -EIO;
 	}
@@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
 		vcpu->stat.pf_spurious++;
 
 	if (r != RET_PF_EMULATE)
-		return 1;
+		return r;
 
 emulate:
 	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 8d3fb3c8c213..5b1fc77695c1 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -247,6 +247,9 @@ struct kvm_page_fault {
 	 * is changing its own translation in the guest page tables.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	/* Indicates the page fault needs to be retried due to a frozen SPTE. */
+	bool frozen_spte;
 };
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
@@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * and of course kvm_mmu_do_page_fault().
  *
  * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
+ * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_RETRY: let CPU fault again on the address.
  * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
  * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
  *                         gfn and retry, or emulate the instruction directly.
  * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
- * RET_PF_FIXED: The faulting entry has been fixed.
  * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
  *
  * Any names added to this enum should be exported to userspace for use in
@@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
  * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
  * will allow for efficient machine code when checking for CONTINUE, e.g.
  * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
+ *
+ * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
+ * scheme, where 1==success, translates '1' to RET_PF_FIXED.
  */
 enum {
 	RET_PF_CONTINUE = 0,
+	RET_PF_FIXED    = 1,
 	RET_PF_RETRY,
 	RET_PF_EMULATE,
 	RET_PF_WRITE_PROTECTED,
 	RET_PF_INVALID,
-	RET_PF_FIXED,
 	RET_PF_SPURIOUS,
 };
 
@@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 
 static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 					u64 err, bool prefetch,
-					int *emulation_type, u8 *level)
+					int *emulation_type, u8 *level,
+					bool *frozen_spte)
 {
 	struct kvm_page_fault fault = {
 		.addr = cr2_or_gpa,
@@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
 	if (level)
 		*level = fault.goal_level;
+	if (frozen_spte)
+		*frozen_spte = fault.frozen_spte;
 
 	return r;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5a475a6456d4..e7fc5ea4b437 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 retry:
 	rcu_read_unlock();
+	fault->frozen_spte = is_frozen_spte(iter.old_spte);
 	return ret;
 }
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 38723b0c435d..269de6a9eb13 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
 				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
 				svm->vmcb->control.insn_bytes : NULL,
-				svm->vmcb->control.insn_len);
+				svm->vmcb->control.insn_len, NULL);
 
 	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
 		sev_handle_rmp_fault(vcpu, fault_address, error_code);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 368acfebd476..fc2ff5d91a71 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
@@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 		return kvm_skip_emulated_instruction(vcpu);
 	}
 
-	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
 }
 
 static int handle_nmi_window(struct kvm_vcpu *vcpu)

base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
-- 


^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13  8:36                     ` Yan Zhao
  2024-09-13 17:23                       ` Sean Christopherson
@ 2024-09-13 19:19                       ` Edgecombe, Rick P
  2024-09-14 10:00                         ` Yan Zhao
  1 sibling, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-13 19:19 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Yao, Yuan, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, pbonzini@redhat.com,
	dmatlack@google.com, nik.borisov@suse.com, kvm@vger.kernel.org

On Fri, 2024-09-13 at 16:36 +0800, Yan Zhao wrote:
> 
Thanks Yan, this is great!

> This is a lock status report of TDX module for current SEAMCALL retry issue
> based on code in TDX module public repo
> https://github.com/intel/tdx-module.git
> branch TDX_1.5.05.
> 
> TL;DR:
> - tdh_mem_track() can contend with tdh_vp_enter().
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> - tdg_mem_page_accept() can contend with other tdh_mem*().
> 
> Proposal:
> - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
>   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.

Regarding the sept entry contention with the guest, I think KVM might not be
guaranteed to retry the same path and clear the sept entry host priority bit.
What if the first failure exited to userspace because of a pending signal or
something? Then the vcpu could reenter the guest, handle an NMI and go off in
another direction, never to trigger the EPT violation again. This would leave
the SEPT entry locked to the guest.

That is a convoluted scenario that could probably be considered a buggy guest,
but what I am sort of pondering is that the retry solution that loops outside the
fault handler guts will have more complex failure modes around the host priority
bit. The N local retries solution really is a brown paper bag design, but the
more proper looking solution actually has two downsides compared to it:
1. It is based on locking behavior that is not in the spec (yes we can work with
TDX module folks to keep it workable)
2. Failure modes get complex

I think I'm still onboard. Just trying to stress the design a bit.

(BTW it looks like Linux guest doesn't actually retry accept on host priority
busy, so they won't spin on it anyway. Probably any contention here would be a
buggy guest for Linux TDs at least.)

> - Kick off vCPUs at the beginning of page removal path, i.e. before the
>   tdh_mem_range_block().
>   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
>   (one possible optimization:
>    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
>    do not kick off vCPUs in normal conditions.
>    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
>    TD enter until page removal completes.)
> 
> Below is the detailed analysis:
> 
> === Background ===
> In TDX module, there are 4 kinds of locks:
> 1. sharex_lock:
>    Normal read/write lock. (no host priority stuff)
> 
> 2. sharex_hp_lock:
>    Just like normal read/write lock, except that host can set host priority
> bit
>    on failure.
>    when guest tries to acquire the lock and sees host priority bit set, it
> will
>    return "busy host priority" directly, letting host win.
>    After host acquires the lock successfully, host priority bit is cleared.
> 
> 3. sept entry lock:
>    Lock utilizing software bits in SEPT entry.
>    HP bit (Host priority): bit 52 
>    EL bit (Entry lock): bit 11, used as a bit lock.
> 
>    - host sets HP bit when host fails to acquire EL bit lock;
>    - host resets HP bit when host wins.
>    - guest returns "busy host priority" if HP bit is found set when guest
> tries
>      to acquire EL bit lock.
> 
> 4. mutex lock:
>    Lock with only 2 states: free, lock.
>    (not the same as linux mutex, not re-scheduled, could pause() for
> debugging).
> 
> ===Resources & users list===
> 
> Resources              SHARED  users              EXCLUSIVE users
> ------------------------------------------------------------------------
> (1) TDR                tdh_mng_rdwr               tdh_mng_create
>                        tdh_vp_create              tdh_mng_add_cx
>                        tdh_vp_addcx               tdh_mng_init
>                        tdh_vp_init                tdh_mng_vpflushdone
>                        tdh_vp_enter               tdh_mng_key_config 
>                        tdh_vp_flush               tdh_mng_key_freeid
>                        tdh_vp_rd_wr               tdh_mr_extend
>                        tdh_mem_sept_add           tdh_mr_finalize
>                        tdh_mem_sept_remove        tdh_vp_init_apicid
>                        tdh_mem_page_aug           tdh_mem_page_add
>                        tdh_mem_page_remove
>                        tdh_mem_range_block
>                        tdh_mem_track
>                        tdh_mem_range_unblock
>                        tdh_phymem_page_reclaim

In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into
pamt_walk(), and tdh_phymem_page_reclaim() passed TDX_LOCK_EXCLUSIVE. So that is
an exclusive lock. But we can ignore it because we only do reclaim at TD tear
down time?

Separately, I wonder if we should try to add this info as comments around the
SEAMCALL implementations. The locking is not part of the spec, but nevertheless
the kernel is being coded against these assumptions. So it can sort of be
like "the kernel assumes this" and we can at least record what the reason was.
Or maybe just comment the parts that KVM assumes.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13 17:23                       ` Sean Christopherson
@ 2024-09-13 19:19                         ` Edgecombe, Rick P
  2024-09-13 22:18                           ` Sean Christopherson
  2024-09-14  9:27                         ` Yan Zhao
  2024-09-17  2:11                         ` Huang, Kai
  2 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-13 19:19 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Yao, Yuan, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, pbonzini@redhat.com,
	dmatlack@google.com, nik.borisov@suse.com, kvm@vger.kernel.org

On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote:
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying
> to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it
> encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.

Can you explain more about what the concern is here? That the zero-step
mitigation activation will be a drag on the TD because of extra contention with
the TDH.MEM calls?

> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM
> retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU
> still
> hits the fault?

It seems like an optimization. To me, I would normally want to know how much it
helped before adding it. But if you think it's an obvious win I'll defer.

> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal
> performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is
> much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-
> while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return
> values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to
> communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it
> more
> obvious that the "frozen_spte" is a special case that doesn't need attention
> for
> most paths.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 16/21] KVM: TDX: Premap initial guest memory
  2024-09-13 13:33       ` Adrian Hunter
@ 2024-09-13 19:49         ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-09-13 19:49 UTC (permalink / raw)
  To: kvm@vger.kernel.org, pbonzini@redhat.com, Hunter, Adrian,
	seanjc@google.com
  Cc: isaku.yamahata@gmail.com, nik.borisov@suse.com,
	dmatlack@google.com, Zhao, Yan Y, Huang, Kai,
	linux-kernel@vger.kernel.org

On Fri, 2024-09-13 at 16:33 +0300, Adrian Hunter wrote:
> It is already in patch 14, so just remove it from this patch
> presumably.

Right, there is an equivalent check in tdx_sept_set_private_spte(), so we can
just drop this check.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13 19:19                         ` Edgecombe, Rick P
@ 2024-09-13 22:18                           ` Sean Christopherson
  0 siblings, 0 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-09-13 22:18 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: Yan Y Zhao, Yuan Yao, Kai Huang, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, pbonzini@redhat.com,
	dmatlack@google.com, nik.borisov@suse.com, kvm@vger.kernel.org

On Fri, Sep 13, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-09-13 at 10:23 -0700, Sean Christopherson wrote:
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > 
> > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying
> > to

I am getting a feeling of deja vu.  Please fix your mail client to not generate
newlines in the middle of quoted text.

> > install a page on behalf of two vCPUs, and KVM resumes the guest if it
> > encounters a FROZEN_SPTE when building the non-leaf SPTEs, then one of the
> > vCPUs could trigger the zero-step mitigation if the vCPU that "wins" and
> > gets delayed for whatever reason.
> 
> Can you explain more about what the concern is here? That the zero-step
> mitigation activation will be a drag on the TD because of extra contention with
> the TDH.MEM calls?
> 
> > 
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow
> > slow-path, what if instead of resuming the guest if a page fault hits
> > FROZEN_SPTE, KVM retries the fault "locally", i.e. _without_ redoing
> > tdh_vp_enter() to see if the vCPU still hits the fault?
> 
> It seems like an optimization. To me, I would normally want to know how much it
> helped before adding it. But if you think it's an obvious win I'll defer.

I'm not worried about any performance hit with zero-step, I'm worried about KVM
not being able to differentiate between a KVM bug and guest interference.  The
goal with a local retry is to make it so that KVM _never_ triggers zero-step,
unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
report the error to userspace instead of trying to suppress guest activity, and
potentially from other KVM tasks too.

It might even be simpler overall too.  E.g. report status up the call chain and
let the top-level TDX S-EPT handler to do its thing, versus adding various flags
and control knobs to ensure a vCPU can make forward progress.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13 17:23                       ` Sean Christopherson
  2024-09-13 19:19                         ` Edgecombe, Rick P
@ 2024-09-14  9:27                         ` Yan Zhao
  2024-09-15  9:53                           ` Yan Zhao
  2024-09-25 10:53                           ` Yan Zhao
  2024-09-17  2:11                         ` Huang, Kai
  2 siblings, 2 replies; 139+ messages in thread
From: Yan Zhao @ 2024-09-14  9:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
> > This is a lock status report of TDX module for current SEAMCALL retry issue
> > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > branch TDX_1.5.05.
> > 
> > TL;DR:
> > - tdh_mem_track() can contend with tdh_vp_enter().
> > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.
> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
Good idea.
But could we extend it a bit more to allow TDX's EPT violation handler to also
retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?

> 
> > - tdg_mem_page_accept() can contend with other tdh_mem*().
> > 
> > Proposal:
> > - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
> >   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
> What is the result of returning -EAGAIN? E.g. does KVM redo tdh_vp_enter()?
Sorry, I meant -EBUSY originally.

With the current code in kvm_tdp_map_page(), the vCPU should just retry without
tdh_vp_enter() except when there are signals pending.
With a real EPT violation, tdh_vp_enter() should be called again.

I realized that this is not good enough.
So, is it better to return -EAGAIN in ops link_external_spt/set_external_spte
and have kvm_tdp_mmu_map() return RET_PF_RETRY_FROZEN for -EAGAIN?
(or maybe some other name for RET_PF_RETRY_FROZEN).

> Also tdh_mem_sept_add() is strictly pre-finalize, correct?  I.e. should never
> contend with tdg_mem_page_accept() because vCPUs can't yet be run.
tdh_mem_page_add() is pre-finalize, tdh_mem_sept_add() is not.
tdh_mem_sept_add() can be called at runtime by tdp_mmu_link_sp().

 
> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> The page isn't yet mapped, so why would the guest be allowed to take a lock on
> the S-EPT entry?
Before tdg_mem_page_accept() accepts a GPA and sets the RWX bits in an SPTE, if
a second tdh_mem_page_aug() is called on the same GPA, the second one may
contend with tdg_mem_page_accept().

But given that KVM does not allow a second tdh_mem_page_aug(), it looks like the
contention between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.

> 
> > - Kick off vCPUs at the beginning of page removal path, i.e. before the
> >   tdh_mem_range_block().
> >   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
> 
> This is easy enough to do via a request, e.g. see KVM_REQ_MCLOCK_INPROGRESS.
Great!

> 
> >   (one possible optimization:
> >    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
> >    do not kick off vCPUs in normal conditions.
> >    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
> 
> Which SEAMCALL is this specifically?  tdh_mem_range_block()?
Yes, they are:
- tdh_mem_range_block() contends with tdh_vp_enter() for secure_ept_lock.
- tdh_mem_track() contends with tdh_vp_enter() for the TD epoch.
  (current code in MMU part 2 just retries tdh_mem_track() endlessly),
- tdh_mem_page_remove()/tdh_mem_range_block() contend with
  tdg_mem_page_accept() for the SEPT entry lock.
  (this one should not happen on a sane guest).

 Resources              SHARED  users              EXCLUSIVE users      
------------------------------------------------------------------------
(5) TDCS epoch         tdh_vp_enter                tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock    tdh_mem_sept_add            tdh_vp_enter
                       tdh_mem_page_aug            tdh_mem_sept_remove
                       tdh_mem_page_remove         tdh_mem_range_block
                                                   tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry                                     tdh_mem_sept_add
                                                   tdh_mem_sept_remove
                                                   tdh_mem_page_aug
                                                   tdh_mem_page_remove
                                                   tdh_mem_range_block
                                                   tdh_mem_range_unblock
                                                   tdg_mem_page_accept


> 
> >    TD enter until page removal completes.)
> 
> 
> Idea #1:
> ---
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..8113c17bd2f6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4719,7 +4719,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>                         return -EINTR;
>                 cond_resched();
>                 r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> -       } while (r == RET_PF_RETRY);
> +       } while (r == RET_PF_RETRY || r == RET_PF_RETRY_FROZEN);
>  
>         if (r < 0)
>                 return r;
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>                 vcpu->stat.pf_spurious++;
>  
>         if (r != RET_PF_EMULATE)
> -               return 1;
> +               return r;
>  
>  emulate:
>         return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..690f03d7daae 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -256,12 +256,15 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
> + * RET_PF_RETRY_FROZEN: One or more SPTEs related to the address is frozen.
> + *                     Let the CPU fault again on the address, or retry the
> + *                     fault "locally", i.e. without re-entering the guest.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
It looks like "r > 0" represents success in vcpu_run()?
So, moving RET_PF_FIXED to 1 is not necessary?

>  enum {
>         RET_PF_CONTINUE = 0,
> +       RET_PF_FIXED    = 1,
>         RET_PF_RETRY,
> +       RET_PF_RETRY_FROZEN,
>         RET_PF_EMULATE,
>         RET_PF_WRITE_PROTECTED,
>         RET_PF_INVALID,
> -       RET_PF_FIXED,
>         RET_PF_SPURIOUS,
>  };
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>         rcu_read_unlock();
> +       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +               return RET_PF_RETRY_FROZEN;
>         return ret;
>  }
>  
> ---
> 
> 
> Idea #2:
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/mmu/mmu.c          | 12 ++++++------
>  arch/x86/kvm/mmu/mmu_internal.h | 15 ++++++++++++---
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/vmx.c          |  4 ++--
>  6 files changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..200fecd1de88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2183,7 +2183,7 @@ unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>  
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len);
> +		       void *insn, int insn_len, bool *frozen_spte);
>  void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg);
>  void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
>  void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b45258285c9c..207840a316d3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4283,7 +4283,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  		return;
>  
>  	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
> -				  true, NULL, NULL);
> +				  true, NULL, NULL, NULL);
>  
>  	/*
>  	 * Account fixed page faults, otherwise they'll never be counted, but
> @@ -4627,7 +4627,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  		trace_kvm_page_fault(vcpu, fault_address, error_code);
>  
>  		r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
> -				insn_len);
> +				       insn_len, NULL);
>  	} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
>  		vcpu->arch.apf.host_apf_flags = 0;
>  		local_irq_disable();
> @@ -4718,7 +4718,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
>  		if (signal_pending(current))
>  			return -EINTR;
>  		cond_resched();
> -		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> +		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level, NULL);
>  	} while (r == RET_PF_RETRY);
>  
>  	if (r < 0)
> @@ -6073,7 +6073,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  }
>  
>  int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> -		       void *insn, int insn_len)
> +				void *insn, int insn_len, bool *frozen_spte)
>  {
>  	int r, emulation_type = EMULTYPE_PF;
>  	bool direct = vcpu->arch.mmu->root_role.direct;
> @@ -6109,7 +6109,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_taken++;
>  
>  		r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
> -					  &emulation_type, NULL);
> +					  &emulation_type, NULL, frozen_spte);
>  		if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
>  			return -EIO;
>  	}
> @@ -6129,7 +6129,7 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
>  		vcpu->stat.pf_spurious++;
>  
>  	if (r != RET_PF_EMULATE)
> -		return 1;
> +		return r;
>  
>  emulate:
>  	return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 8d3fb3c8c213..5b1fc77695c1 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -247,6 +247,9 @@ struct kvm_page_fault {
>  	 * is changing its own translation in the guest page tables.
>  	 */
>  	bool write_fault_to_shadow_pgtable;
> +
> +	/* Indicates the page fault needs to be retried due to a frozen SPTE. */
> +	bool frozen_spte;
>  };
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> @@ -256,12 +259,12 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * and of course kvm_mmu_do_page_fault().
>   *
>   * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> + * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_RETRY: let CPU fault again on the address.
>   * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
>   * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
>   *                         gfn and retry, or emulate the instruction directly.
>   * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> - * RET_PF_FIXED: The faulting entry has been fixed.
>   * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
>   *
>   * Any names added to this enum should be exported to userspace for use in
> @@ -271,14 +274,17 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   * on -errno return values.  Somewhat arbitrarily use '0' for CONTINUE, which
>   * will allow for efficient machine code when checking for CONTINUE, e.g.
>   * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero.
> + *
> + * Note #2, RET_PF_FIXED _must_ be '1', so that KVM's -errno/0/1 return code
> + * scheme, where 1==success, translates '1' to RET_PF_FIXED.
>   */
>  enum {
>  	RET_PF_CONTINUE = 0,
> +	RET_PF_FIXED    = 1,
>  	RET_PF_RETRY,
>  	RET_PF_EMULATE,
>  	RET_PF_WRITE_PROTECTED,
>  	RET_PF_INVALID,
> -	RET_PF_FIXED,
>  	RET_PF_SPURIOUS,
>  };
>  
> @@ -292,7 +298,8 @@ static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  
>  static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  					u64 err, bool prefetch,
> -					int *emulation_type, u8 *level)
> +					int *emulation_type, u8 *level,
> +					bool *frozen_spte)
>  {
>  	struct kvm_page_fault fault = {
>  		.addr = cr2_or_gpa,
> @@ -341,6 +348,8 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
>  	if (level)
>  		*level = fault.goal_level;
> +	if (frozen_spte)
> +		*frozen_spte = fault.frozen_spte;
>  
>  	return r;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..e7fc5ea4b437 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>  	rcu_read_unlock();
> +	fault->frozen_spte = is_frozen_spte(iter.old_spte);
>  	return ret;
>  }
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 38723b0c435d..269de6a9eb13 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2075,7 +2075,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>  	rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
>  				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>  				svm->vmcb->control.insn_bytes : NULL,
> -				svm->vmcb->control.insn_len);
> +				svm->vmcb->control.insn_len, NULL);
>  
>  	if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
>  		sev_handle_rmp_fault(vcpu, fault_address, error_code);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 368acfebd476..fc2ff5d91a71 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5822,7 +5822,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  	if (unlikely(allow_smaller_maxphyaddr && !kvm_vcpu_is_legal_gpa(vcpu, gpa)))
>  		return kvm_emulate_instruction(vcpu, 0);
>  
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, NULL);
>  }
>  
>  static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> @@ -5843,7 +5843,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
>  		return kvm_skip_emulated_instruction(vcpu);
>  	}
>  
> -	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
> +	return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0, NULL);
>  }
>  
>  static int handle_nmi_window(struct kvm_vcpu *vcpu)
> 
> base-commit: bc87a2b4b5508d247ed2c30cd2829969d168adfe
> -- 
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13 19:19                       ` Edgecombe, Rick P
@ 2024-09-14 10:00                         ` Yan Zhao
  0 siblings, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-09-14 10:00 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Yao, Yuan, Huang, Kai,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	pbonzini@redhat.com, dmatlack@google.com, nik.borisov@suse.com,
	kvm@vger.kernel.org

> > ===Resources & users list===
> > 
> > Resources              SHARED  users              EXCLUSIVE users
> > ------------------------------------------------------------------------
> > (1) TDR                tdh_mng_rdwr               tdh_mng_create
> >                        tdh_vp_create              tdh_mng_add_cx
> >                        tdh_vp_addcx               tdh_mng_init
> >                        tdh_vp_init                tdh_mng_vpflushdone
> >                        tdh_vp_enter               tdh_mng_key_config 
> >                        tdh_vp_flush               tdh_mng_key_freeid
> >                        tdh_vp_rd_wr               tdh_mr_extend
> >                        tdh_mem_sept_add           tdh_mr_finalize
> >                        tdh_mem_sept_remove        tdh_vp_init_apicid
> >                        tdh_mem_page_aug           tdh_mem_page_add
> >                        tdh_mem_page_remove
> >                        tdh_mem_range_block
> >                        tdh_mem_track
> >                        tdh_mem_range_unblock
> >                        tdh_phymem_page_reclaim
> 
> In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into
> pamt_walk(), and tdh_phymem_page_reclaim() passed TDX_LOCK_EXCLUSIVE. So that is
> an exclusive lock. But we can ignore it because we only do reclaim at TD tear
> down time?
Hmm, if the page to reclaim is not a TDR page, lock_and_map_implicit_tdr() is
called to lock the page's corresponding TDR page with a SHARED lock.

If the page to reclaim is a TDR page, it's indeed locked EXCLUSIVE.

But in pamt_walk() it calls promote_sharex_lock_hp() for the passed-in
TDX_LOCK_EXCLUSIVE only when

if ((pamt_1gb->pt == PT_REG) || (target_size == PT_1GB)) or
if ((pamt_2mb->pt == PT_REG) || (target_size == PT_2MB))

"pamt_1gb->pt == PT_REG" (or "pamt_2mb->pt == PT_REG") is true when the entry
is assigned (not PT_NDA) and is a normal page (i.e. not TDR, TDVPR...).
This is true only after tdh_mem_page_add()/tdh_mem_page_aug() assigns the page
to a TD at huge page size.

This will not happen for a TDR page.

For normal pages, when huge page is supported in the future, it looks like we
need to update tdh_phymem_page_reclaim() to include size info too.
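
In pseudocode, my reading of the reclaim locking boils down to roughly the
below (heavily simplified; names follow the public repo but arguments are
trimmed and "page_is_tdr" is my own shorthand, not the module's code):

	/* TDR lock taken while reclaiming a page: */
	tdr_lock = page_is_tdr ? TDX_LOCK_EXCLUSIVE : TDX_LOCK_SHARED;

	/* PAMT lock promotion inside pamt_walk(..., TDX_LOCK_EXCLUSIVE): */
	if (pamt_1gb->pt == PT_REG || target_size == PT_1GB)
		promote_sharex_lock_hp(pamt_1gb);	/* only ADDed/AUGed huge pages */
	if (pamt_2mb->pt == PT_REG || target_size == PT_2MB)
		promote_sharex_lock_hp(pamt_2mb);	/* ditto */
	/* neither promotion is reachable for a TDR page */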

> 
> Separately, I wonder if we should try to add this info as comments around the
> SEAMCALL implementations. The locking is not part of the spec, but never-the-
> less the kernel is being coded against these assumptions. So it can sort of be
> like "the kernel assumes this" and we can at least record what the reason was.
> Or maybe just comment the parts that KVM assumes.
Agreed. 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-14  9:27                         ` Yan Zhao
@ 2024-09-15  9:53                           ` Yan Zhao
  2024-09-17  1:31                             ` Huang, Kai
  2024-09-25 10:53                           ` Yan Zhao
  1 sibling, 1 reply; 139+ messages in thread
From: Yan Zhao @ 2024-09-15  9:53 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe, pbonzini@redhat.com,
	Yuan Yao, Kai Huang, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	dmatlack@google.com, nik.borisov@suse.com

On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
> > The page isn't yet mapped, so why would the guest be allowed to take a lock on
> > the S-EPT entry?
> Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second
> tdh_mem_page_aug() is called on the same gpa, the second one may contend with
> tdg_mem_page_accept().
> 
> But given KVM does not allow the second tdh_mem_page_aug(), looks the contention
> between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.
I withdraw the reply above.

tdh_mem_page_aug() and tdg_mem_page_accept() both attempt to modify the same
SEPT entry, leading to contention.
- tdg_mem_page_accept() first walks the SEPT tree with no lock to get the SEPT
  entry. It then acquires the guest-side lock of the found SEPT entry before
  checking the entry state.
- tdh_mem_page_aug() first walks the SEPT tree with a shared lock to locate the
  SEPT entry to modify. It then acquires the host-side lock of the SEPT entry
  before checking the entry state.
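
So from KVM's perspective the contention can surface like below (just a
sketch; the busy check is a placeholder, not an actual helper in this series):

	/* vCPU A, propagating a mirror SPTE to the S-EPT: */
	err = tdh_mem_page_aug(...);
	if (seamcall_busy_on_sept(err)) {
		/*
		 * Placeholder for the TDX_OPERAND_BUSY + operand SEPT check.
		 * The guest's tdg_mem_page_accept() on another vCPU grabbed
		 * the guest-side lock on the same S-EPT entry first, so the
		 * AUG has to be retried.
		 */
	}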

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-15  9:53                           ` Yan Zhao
@ 2024-09-17  1:31                             ` Huang, Kai
  0 siblings, 0 replies; 139+ messages in thread
From: Huang, Kai @ 2024-09-17  1:31 UTC (permalink / raw)
  To: Zhao, Yan Y, Sean Christopherson, Edgecombe, Rick P,
	pbonzini@redhat.com, Yao, Yuan, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	dmatlack@google.com, nik.borisov@suse.com



On 15/09/2024 9:53 pm, Zhao, Yan Y wrote:
> On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
>>> Similarly, can tdh_mem_page_aug() actually contend with tdg_mem_page_accept()?
>>> The page isn't yet mapped, so why would the guest be allowed to take a lock on
>>> the S-EPT entry?
>> Before tdg_mem_page_accept() accepts a gpa and set rwx bits in a SPTE, if second
>> tdh_mem_page_aug() is called on the same gpa, the second one may contend with
>> tdg_mem_page_accept().
>>
>> But given KVM does not allow the second tdh_mem_page_aug(), looks the contention
>> between tdh_mem_page_aug() and tdg_mem_page_accept() will not happen.
> I withdraw the reply above.
> 
> tdh_mem_page_aug() and tdg_mem_page_accept() both attempt to modify the same
> SEPT entry, leading to contention.
> - tdg_mem_page_accept() first walks the SEPT tree with no lock to get the SEPT
>    entry. It then acquire the guest side lock of the found SEPT entry before
>    checking entry state.
> - tdh_mem_page_aug() first walks the SEPT tree with shared lock to locate the
>    SEPT entry to modify, it then aquires host side lock of the SEPT entry before
>    checking entry state.

This seems like it can only happen when there are multiple threads in the
guest trying to do tdg_mem_page_accept() on the same page.  This should be
extremely rare, and if it happens, it will eventually result in another
fault in KVM.

So now we set the SPTE to FROZEN_SPTE before doing the AUG to prevent other
threads from proceeding.  I think when tdh_mem_page_aug() fails with
secure EPT "entry" busy, we can reset FROZEN_SPTE back to old_spte and
return RET_PF_RETRY so that this thread and the other faulting thread can
both try to complete the AUG again?

The thread that fails the AUG can also go back to the guest though, but
since the host priority bit is already set, the subsequent PAGE.ACCEPT will
fail.  That is fine because another AUG in KVM will eventually resolve this
and let the guest make progress.
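
Roughly like below in the S-EPT propagation path (just a sketch; the busy
check and the unfreeze helper are placeholders for whatever mechanism we end
up with, not functions in this series):

	err = tdh_mem_page_aug(...);
	if (seamcall_busy_on_sept_entry(err)) {		/* placeholder check */
		/*
		 * The guest's PAGE.ACCEPT holds the guest-side lock on this
		 * S-EPT entry.  Undo the FROZEN_SPTE so other faulting
		 * threads aren't stuck, and let the fault handler retry.
		 */
		restore_mirror_spte(sptep, old_spte);	/* placeholder */
		return -EBUSY;				/* -> RET_PF_RETRY */
	}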


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-13 17:23                       ` Sean Christopherson
  2024-09-13 19:19                         ` Edgecombe, Rick P
  2024-09-14  9:27                         ` Yan Zhao
@ 2024-09-17  2:11                         ` Huang, Kai
  2 siblings, 0 replies; 139+ messages in thread
From: Huang, Kai @ 2024-09-17  2:11 UTC (permalink / raw)
  To: Sean Christopherson, Yan Zhao
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com



On 14/09/2024 5:23 am, Sean Christopherson wrote:
> On Fri, Sep 13, 2024, Yan Zhao wrote:
>> This is a lock status report of TDX module for current SEAMCALL retry issue
>> based on code in TDX module public repo https://github.com/intel/tdx-module.git
>> branch TDX_1.5.05.
>>
>> TL;DR:
>> - tdh_mem_track() can contend with tdh_vp_enter().
>> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> 
> The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> whatever reason.
> 
> Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> hits the fault?
> 
> For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> desirable because in many cases, the winning task will install a valid mapping
> before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> instruction is re-executed.  In the happy case, that provides optimal performance
> as KVM doesn't introduce any extra delay/latency.
> 
> But for TDX, the math is different as the cost of a re-hitting a fault is much,
> much higher, especially in light of the zero-step issues.
> 
> E.g. if the TDP MMU returns a unique error code for the frozen case, and
> kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> then the TDX EPT violation path can safely retry locally, similar to the do-while
> loop in kvm_tdp_map_page().
> 
> The only part I don't like about this idea is having two "retry" return values,
> which creates the potential for bugs due to checking one but not the other.
> 
> Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> option better even though the out-param is a bit gross, because it makes it more
> obvious that the "frozen_spte" is a special case that doesn't need attention for
> most paths.
> 

[...]

>   
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5a475a6456d4..cbf9e46203f3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1174,6 +1174,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   
>   retry:
>          rcu_read_unlock();
> +       if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte))
> +               return RET_PF_RETRY_FROZEN;

Ack the whole "retry on frozen" approach, either with RET_PF_RETRY_FROZEN
or fault->frozen_spte.

One minor side effect:

For normal VMs, the fault handler can also see a frozen SPTE, e.g. when
kvm_tdp_mmu_map() checks the middle-level SPTEs:

	/*
          * If SPTE has been frozen by another thread, just give up and
          * retry, avoiding unnecessary page table allocation and free.
          */
         if (is_frozen_spte(iter.old_spte))
         	goto retry;

So for normal VMs this RET_PF_RETRY_FROZEN will change "go back to the guest
to retry" into "retry in KVM internally".

As you mentioned above, for normal VMs we probably always want to "go back to
the guest to retry" even for a FROZEN SPTE, but I guess this is a minor issue
that we would hardly even notice.

Or we can additionally add:

	if (ret == RET_PF_RETRY && is_frozen_spte(iter.old_spte) &&
	    is_mirrored_sptep(iter.sptep))
		return RET_PF_RETRY_FROZEN;

So it only applies to TDX.



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-14  9:27                         ` Yan Zhao
  2024-09-15  9:53                           ` Yan Zhao
@ 2024-09-25 10:53                           ` Yan Zhao
  2024-10-08 14:51                             ` Sean Christopherson
  1 sibling, 1 reply; 139+ messages in thread
From: Yan Zhao @ 2024-09-25 10:53 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe, pbonzini@redhat.com,
	Yuan Yao, Kai Huang, isaku.yamahata@gmail.com,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	dmatlack@google.com, nik.borisov@suse.com

On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > branch TDX_1.5.05.
> > > 
> > > TL;DR:
> > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > 
> > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > whatever reason.
> > 
> > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > hits the fault?
> > 
> > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > desirable because in many cases, the winning task will install a valid mapping
> > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > instruction is re-executed.  In the happy case, that provides optimal performance
> > as KVM doesn't introduce any extra delay/latency.
> > 
> > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > much higher, especially in light of the zero-step issues.
> > 
> > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > loop in kvm_tdp_map_page().
> > 
> > The only part I don't like about this idea is having two "retry" return values,
> > which creates the potential for bugs due to checking one but not the other.
> > 
> > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > option better even though the out-param is a bit gross, because it makes it more
> > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > most paths.
> Good idea.
> But could we extend it a bit more to allow TDX's EPT violation handler to also
> retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
I'm asking this because merely avoiding tdh_vp_enter() on vCPUs that see a
FROZEN_SPTE might not be enough to prevent the zero-step mitigation.

E.g. in the selftest below, with a TD configured with pending_ve_disable=N,
the zero-step mitigation can be triggered on a vCPU that is stuck in EPT
violation VM exits more than 6 times (because user space does not do the
memslot conversion correctly).

So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
contend with the zero-step mitigation code in tdh_vp_enter() on vCPU B, which
is stuck in EPT violation VM exits.


#include <stdint.h>

#include "kvm_util.h"
#include "processor.h"
#include "tdx/tdcall.h"
#include "tdx/tdx.h"
#include "tdx/tdx_util.h"
#include "tdx/test_util.h"
#include "test_util.h"

/*
 * 0x80000000 is arbitrarily selected, but it should not overlap with selftest
 * code or boot page.
 */
#define ZERO_STEP_TEST_AREA_GPA (0x80000000)
/* The test area GVA is arbitrarily selected */
#define ZERO_STEP_AREA_GVA_PRIVATE (0x90000000)

/* The test area is 2MB in size */
#define ZERO_STEP_AREA_SIZE (2 << 20)

#define ZERO_STEP_ASSERT(x)                             \
        do {                                            \
                if (!(x))                               \
                        tdx_test_fatal(__LINE__);       \
        } while (0)


#define ZERO_STEP_ACCEPT_PRINT_PORT 0x87

#define ZERO_STEP_THRESHOLD 6
#define TRIGGER_ZERO_STEP_MITIGATION 1

static int convert_request_cnt;

static void guest_test_zero_step(void)
{
        void *test_area_gva_private = (void *)ZERO_STEP_AREA_GVA_PRIVATE;

        memset(test_area_gva_private, 1, 8);
        tdx_test_success();
}

static void guest_ve_handler(struct ex_regs *regs)
{
        uint64_t ret;
        struct ve_info ve;

        ret = tdg_vp_veinfo_get(&ve);
        ZERO_STEP_ASSERT(!ret);

        /* For this test, we will only handle EXIT_REASON_EPT_VIOLATION */
        ZERO_STEP_ASSERT(ve.exit_reason == EXIT_REASON_EPT_VIOLATION);


        tdx_test_send_64bit(ZERO_STEP_ACCEPT_PRINT_PORT, ve.gpa);

#define MEM_PAGE_ACCEPT_LEVEL_4K 0
#define MEM_PAGE_ACCEPT_LEVEL_2M 1
        ret = tdg_mem_page_accept(ve.gpa, MEM_PAGE_ACCEPT_LEVEL_4K);
        ZERO_STEP_ASSERT(!ret);
}

static void zero_step_test(void)
{
        struct kvm_vm *vm;
        struct kvm_vcpu *vcpu;
        void *guest_code;
        uint64_t test_area_npages;
        vm_vaddr_t test_area_gva_private;

        vm = td_create();
        td_initialize(vm, VM_MEM_SRC_ANONYMOUS, 0);
        guest_code = guest_test_zero_step;
        vcpu = td_vcpu_add(vm, 0, guest_code);
        vm_install_exception_handler(vm, VE_VECTOR, guest_ve_handler);

        test_area_npages = ZERO_STEP_AREA_SIZE / vm->page_size;
        vm_userspace_mem_region_add(vm,
                                    VM_MEM_SRC_ANONYMOUS, ZERO_STEP_TEST_AREA_GPA,
                                    3, test_area_npages, KVM_MEM_GUEST_MEMFD);
        vm->memslots[MEM_REGION_TEST_DATA] = 3;

        test_area_gva_private = ____vm_vaddr_alloc(
                vm, ZERO_STEP_AREA_SIZE, ZERO_STEP_AREA_GVA_PRIVATE,
                ZERO_STEP_TEST_AREA_GPA, MEM_REGION_TEST_DATA, true);
        TEST_ASSERT_EQ(test_area_gva_private, ZERO_STEP_AREA_GVA_PRIVATE);

        td_finalize(vm);
        handle_memory_conversion(vm, ZERO_STEP_TEST_AREA_GPA,
                                 ZERO_STEP_AREA_SIZE, false);
        for (;;) {
                vcpu_run(vcpu);
                if (vcpu->run->exit_reason == KVM_EXIT_IO &&
                        vcpu->run->io.port == ZERO_STEP_ACCEPT_PRINT_PORT) {
                        uint64_t gpa = tdx_test_read_64bit(
                                        vcpu, ZERO_STEP_ACCEPT_PRINT_PORT);
                        printf("\t ... guest accepting 1 page at GPA: 0x%lx\n", gpa);
                        continue;
                } else if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                        bool skip = TRIGGER_ZERO_STEP_MITIGATION &&
                                    (convert_request_cnt < ZERO_STEP_THRESHOLD -1);

                        convert_request_cnt++;

                        printf("guest request conversion of gpa 0x%llx - 0x%llx to %s, skip=%d\n",
                                vcpu->run->memory_fault.gpa, vcpu->run->memory_fault.size,
                                (vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE) ? "private" : "shared", skip);


                        if (skip)
                                continue;

                        handle_memory_conversion(
                                vm, vcpu->run->memory_fault.gpa,
                                vcpu->run->memory_fault.size,
                                vcpu->run->memory_fault.flags == KVM_MEMORY_EXIT_FLAG_PRIVATE);
                        continue;
                }
                printf("exit reason %d\n", vcpu->run->exit_reason);
                break;
        }

        kvm_vm_free(vm);
}

int main(int argc, char **argv)
{
        /* Disable stdout buffering */
        setbuf(stdout, NULL);

        if (!is_tdx_enabled()) {
                printf("TDX is not supported by the KVM\n"
                       "Skipping the TDX tests.\n");
                return 0;
        }

        run_in_new_process(&zero_step_test);
}

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-09-25 10:53                           ` Yan Zhao
@ 2024-10-08 14:51                             ` Sean Christopherson
  2024-10-10  5:23                               ` Yan Zhao
  0 siblings, 1 reply; 139+ messages in thread
From: Sean Christopherson @ 2024-10-08 14:51 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Wed, Sep 25, 2024, Yan Zhao wrote:
> On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > branch TDX_1.5.05.
> > > > 
> > > > TL;DR:
> > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > 
> > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > whatever reason.
> > > 
> > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > hits the fault?
> > > 
> > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > desirable because in many cases, the winning task will install a valid mapping
> > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > as KVM doesn't introduce any extra delay/latency.
> > > 
> > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > much higher, especially in light of the zero-step issues.
> > > 
> > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > loop in kvm_tdp_map_page().
> > > 
> > > The only part I don't like about this idea is having two "retry" return values,
> > > which creates the potential for bugs due to checking one but not the other.
> > > 
> > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > option better even though the out-param is a bit gross, because it makes it more
> > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > most paths.
> > Good idea.
> > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> FROZEN_SPTE might not be enough to prevent zero step mitigation.

The goal isn't to make it completely impossible for zero-step to fire, it's to
make it so that _if_ zero-step fires, KVM can report the error to userspace without
having to retry, because KVM _knows_ that advancing past the zero-step isn't
something KVM can solve.

 : I'm not worried about any performance hit with zero-step, I'm worried about KVM
 : not being able to differentiate between a KVM bug and guest interference.  The
 : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
 : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 : report the error to userspace instead of trying to suppress guest activity, and
 : potentially from other KVM tasks too.

In other words, for the selftest you crafted, KVM reporting an error to userspace
due to zero-step would be working as intended.  
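
For reference, the local retry I have in mind would look something like the
below in the TDX EPT violation path, using the frozen_spte out-param idea
quoted above (a sketch, untested):

	bool frozen_spte = false;
	int r;

	do {
		/*
		 * Retry without re-entering the guest when the fault merely
		 * lost the race on a frozen SPTE, so a vCPU can't rack up
		 * EPT violations on the same GPA and trip zero-step.
		 */
		r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0,
				       &frozen_spte);
	} while (r > 0 && frozen_spte);

	return r;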

> E.g. in below selftest with a TD configured with pending_ve_disable=N,
> zero step mitigation can be triggered on a vCPU that is stuck in EPT violation
> vm exit for more than 6 times (due to that user space does not do memslot
> conversion correctly).
> 
> So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
> contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck
> in EPT violation vm exits.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-08 14:51                             ` Sean Christopherson
@ 2024-10-10  5:23                               ` Yan Zhao
  2024-10-10 17:33                                 ` Sean Christopherson
  0 siblings, 1 reply; 139+ messages in thread
From: Yan Zhao @ 2024-10-10  5:23 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> On Wed, Sep 25, 2024, Yan Zhao wrote:
> > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > branch TDX_1.5.05.
> > > > > 
> > > > > TL;DR:
> > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > 
> > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > whatever reason.
> > > > 
> > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > hits the fault?
> > > > 
> > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > desirable because in many cases, the winning task will install a valid mapping
> > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > as KVM doesn't introduce any extra delay/latency.
> > > > 
> > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > much higher, especially in light of the zero-step issues.
> > > > 
> > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > loop in kvm_tdp_map_page().
> > > > 
> > > > The only part I don't like about this idea is having two "retry" return values,
> > > > which creates the potential for bugs due to checking one but not the other.
> > > > 
> > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > option better even though the out-param is a bit gross, because it makes it more
> > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > most paths.
> > > Good idea.
> > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> 
> The goal isn't to make it completely impossible for zero-step to fire, it's to
> make it so that _if_ zero-step fires, KVM can report the error to userspace without
> having to retry, because KVM _knows_ that advancing past the zero-step isn't
> something KVM can solve.
> 
>  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
>  : not being able to differentiate between a KVM bug and guest interference.  The
>  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
>  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  : report the error to userspace instead of trying to suppress guest activity, and
>  : potentially from other KVM tasks too.
> 
> In other words, for the selftest you crafted, KVM reporting an error to userspace
> due to zero-step would be working as intended.  
Hmm, but the selftest is an example to show that 6 consecutive EPT violations
on the same GPA could trigger zero-step.

For an extremely unlucky vCPU, is it still possible to fire zero-step when
nothing is wrong in either KVM or QEMU?
e.g.

1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)

 
> > E.g. in below selftest with a TD configured with pending_ve_disable=N,
> > zero step mitigation can be triggered on a vCPU that is stuck in EPT violation
> > vm exit for more than 6 times (due to that user space does not do memslot
> > conversion correctly).
> > 
> > So, if vCPU A wins the chance to call tdh_mem_page_aug(), the SEAMCALL may
> > contend with zero step mitigation code in tdh_vp_enter() in vCPU B stuck
> > in EPT violation vm exits.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-10  5:23                               ` Yan Zhao
@ 2024-10-10 17:33                                 ` Sean Christopherson
  2024-10-10 21:53                                   ` Edgecombe, Rick P
                                                     ` (2 more replies)
  0 siblings, 3 replies; 139+ messages in thread
From: Sean Christopherson @ 2024-10-10 17:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Thu, Oct 10, 2024, Yan Zhao wrote:
> On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > branch TDX_1.5.05.
> > > > > > 
> > > > > > TL;DR:
> > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > 
> > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > whatever reason.
> > > > > 
> > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > hits the fault?
> > > > > 
> > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > 
> > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > much higher, especially in light of the zero-step issues.
> > > > > 
> > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > loop in kvm_tdp_map_page().
> > > > > 
> > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > 
> > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > most paths.
> > > > Good idea.
> > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > 
> > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > something KVM can solve.
> > 
> >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> >  : not being able to differentiate between a KVM bug and guest interference.  The
> >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >  : report the error to userspace instead of trying to suppress guest activity, and
> >  : potentially from other KVM tasks too.
> > 
> > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > due to zero-step would be working as intended.  
> Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> the same GPA could trigger zero-step.
> 
> For an extremely unlucky vCPU, is it still possible to fire zero step when
> nothing is wrong both in KVM and QEMU?
> e.g.
> 
> 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)

Very technically, this shouldn't be possible.  The only way for there to be
contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
6th attempt should succeed, even if the faulting vCPU wasn't the one to create
the SPTE.

That said, a few thoughts:

1. Where did we end up on the idea of requiring userspace to pre-fault memory?

2. The zero-step logic really should have a slightly more conservative threshold.
   I have a hard time believing that e.g. 10 attempts would create a side channel,
   but 6 attempts is "fine".

3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
   Yes, I'm being somewhat hypocritical since I'm so against retrying for the
   S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
   for KVM to guarantee success.

E.g. for #3, the below (compile tested only) patch should make it impossible for
the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
should never trigger A/D assists, i.e. retry should always succeed.

---
 arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 3b996c1fdaab..e47573a652a9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 				   struct kvm_mmu_page *sp, bool shared);
 
+static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu,
+					       struct kvm_mmu_page *sp)
+{
+	if (!sp)
+		return tdp_mmu_alloc_sp(vcpu);
+
+	memset(sp, 0, sizeof(*sp));
+	memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value,
+		 PAGE_SIZE / sizeof(u64));
+	return sp;
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	struct kvm_mmu_page *sp = NULL;
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
-	struct kvm_mmu_page *sp;
 	int ret = RET_PF_RETRY;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	rcu_read_lock();
 
 	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
-		int r;
-
+		/*
+		 * Somewhat arbitrarily allow two local retries, e.g. to play
+		 * nice with the extremely unlikely case that KVM encounters a
+		 * huge SPTE with an Access-assist _and_ a subsequent Dirty-assist.
+		 * Retrying is inexpensive, but if KVM fails to install a SPTE
+		 * three times, then a fourth attempt is likely futile and it's
+		 * time to back off.
+		 */
+		int r, retry_locally = 2;
+again:
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
@@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * The SPTE is either non-present or points to a huge page that
 		 * needs to be split.
 		 */
-		sp = tdp_mmu_alloc_sp(vcpu);
+		sp = tdp_mmu_realloc_sp(vcpu, sp);
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
@@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
 
 		/*
-		 * Force the guest to retry if installing an upper level SPTE
-		 * failed, e.g. because a different task modified the SPTE.
+		 * If installing an upper level SPTE failed, retry the walk
+		 * locally before forcing the guest to retry.  If the SPTE was
+		 * modified by a different task, odds are very good the new
+		 * SPTE is usable as-is.  And if the SPTE was modified by the
+		 * CPU, e.g. to set A/D bits, then unless KVM gets *extremely*
+		 * unlucky, the CMPXCHG should succeed the second time around.
 		 */
 		if (r) {
-			tdp_mmu_free_sp(sp);
+			if (retry_locally--)
+				goto again;
 			goto retry;
 		}
 
@@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 				track_possible_nx_huge_page(kvm, sp);
 			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 		}
+		sp = NULL;
 	}
 
 	/*
@@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 retry:
 	rcu_read_unlock();
+
+	/*
+	 * Free the previously allocated MMU page if KVM retried locally and
+	 * ended up not using said page.
+	 */
+	if (sp)
+		tdp_mmu_free_sp(sp);
 	return ret;
 }
 

base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
-- 

^ permalink raw reply related	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-10 17:33                                 ` Sean Christopherson
@ 2024-10-10 21:53                                   ` Edgecombe, Rick P
  2024-10-11  2:30                                     ` Yan Zhao
  2024-10-14 10:54                                     ` Huang, Kai
  2024-10-11  2:06                                   ` Yan Zhao
  2024-10-16 14:13                                   ` Yan Zhao
  2 siblings, 2 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-10-10 21:53 UTC (permalink / raw)
  To: seanjc@google.com, Zhao, Yan Y
  Cc: Yao, Yuan, Huang, Kai, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, pbonzini@redhat.com,
	dmatlack@google.com, nik.borisov@suse.com, kvm@vger.kernel.org

On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)

Isn't there a more general scenario:

vcpu0                              vcpu1
1. Freezes PTE
2. External op to do the SEAMCALL
3.                                 Faults same PTE, hits frozen PTE
4.                                 Retries N times, triggers zero-step
5. Finally finishes external op

Am I missing something?

> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e.
> the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
> 
> That said, a few thoughts:
> 
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?

For others' reference, I think you are referring to the idea of pre-faulting
the entire S-EPT, even for GFNs that usually get AUGed, not the mirrored EPT
pre-faulting/PAGE.ADD dance we are already doing.

The last discussion with Paolo was to resume the retry-solution discussion on
the v2 posting because it would be easier "with everything else already
addressed". There was also some discussion that it was not immediately obvious
how pre-faulting everything would work for memory hotplug (i.e. memslots added
at runtime).

> 
> 2. The zero-step logic really should have a slightly more conservative
> threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side
> channel,
>    but 6 attempts is "fine".

No idea where the threshold came from. I'm not sure it affects the KVM
design, though. We can look into it for curiosity's sake in either case.

> 
> 3. This would be a good reason to implement a local retry in
> kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be
> easy
>    for KVM to guarantee success.
> 
> E.g. for #3, the below (compile tested only) patch should make it impossible
> for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror
> SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.

I don't see how it addresses the scenario above. More retries could just make
it rarer, but never fix it. Very possible I'm missing something though.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-10 17:33                                 ` Sean Christopherson
  2024-10-10 21:53                                   ` Edgecombe, Rick P
@ 2024-10-11  2:06                                   ` Yan Zhao
  2024-10-16 14:13                                   ` Yan Zhao
  2 siblings, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-10-11  2:06 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > > branch TDX_1.5.05.
> > > > > > > 
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > > 
> > > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > > whatever reason.
> > > > > > 
> > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > > hits the fault?
> > > > > > 
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > > 
> > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > > 
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > > 
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > > 
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > > 
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > > 
> > >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > >  : not being able to differentiate between a KVM bug and guest interference.  The
> > >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >  : report the error to userspace instead of trying to suppress guest activity, and
> > >  : potentially from other KVM tasks too.
> > > 
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.  
> > Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> > the same GPA could trigger zero-step.
> > 
> > For an extremely unlucky vCPU, is it still possible to fire zero step when
> > nothing is wrong both in KVM and QEMU?
> > e.g.
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
Hmm, a 7th EPT violation could still occur if the vCPU that sees the
try_cmpxchg64() failure returns to the guest faster than the one that
successfully installs the SPTE.

> 
> That said, a few thoughts:
> 
> 1. Where did we end up on the idea of requiring userspace to pre-fault memory?
I didn't follow this question.
Do you want to disallow userspace from pre-faulting memory after TD
finalization, or do you want to suggest that userspace do it?

> 
> 2. The zero-step logic really should have a slightly more conservative threshold.
>    I have a hard time believing that e.g. 10 attempts would create a side channel,
>    but 6 attempts is "fine".
Don't know where the value 6 comes from. :)
We may need to ask.

> 3. This would be a good reason to implement a local retry in kvm_tdp_mmu_map().
>    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
>    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be easy
>    for KVM to guarantee success.
It's reasonable.

But TDX code still needs to retry for RET_PF_RETRY_FROZEN without
re-entering the guest.

Would it be good for TDX code to retry whenever it sees RET_PF_RETRY or
RET_PF_RETRY_FROZEN?
We can have tdx_sept_link_private_spt()/tdx_sept_set_private_spte() return
-EBUSY on contention.
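
Roughly what I have in mind, as a sketch only (RET_PF_RETRY_FROZEN, the
-EBUSY propagation and the handler/helper names are assumptions from this
discussion, not existing code):

/*
 * Sketch: retry the fault locally, i.e. without re-doing tdh_vp_enter(),
 * while the failure is only transient (frozen SPTE, or a contended S-EPT
 * SEAMCALL surfacing as RET_PF_RETRY), similar to the do-while loop in
 * kvm_tdp_map_page().
 */
static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
				    u64 error_code)
{
	int ret;

	do {
		ret = __vmx_handle_ept_violation(vcpu, gpa, error_code);

		if (signal_pending(current))
			break;
		cond_resched();
	} while (ret == RET_PF_RETRY || ret == RET_PF_RETRY_FROZEN);

	return ret;
}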


> 
> E.g. for #3, the below (compile tested only) patch should make it impossible for
> the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror SPTEs
> should never trigger A/D assists, i.e. retry should always succeed.
> 
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 47 ++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 3b996c1fdaab..e47573a652a9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1097,6 +1097,18 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  				   struct kvm_mmu_page *sp, bool shared);
>  
> +static struct kvm_mmu_page *tdp_mmu_realloc_sp(struct kvm_vcpu *vcpu,
> +					       struct kvm_mmu_page *sp)
> +{
> +	if (!sp)
> +		return tdp_mmu_alloc_sp(vcpu);
> +
> +	memset(sp, 0, sizeof(*sp));
> +	memset64(sp->spt, vcpu->arch.mmu_shadow_page_cache.init_value,
> +		 PAGE_SIZE / sizeof(u64));
> +	return sp;
> +}
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1104,9 +1116,9 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
> +	struct kvm_mmu_page *sp = NULL;
>  	struct kvm *kvm = vcpu->kvm;
>  	struct tdp_iter iter;
> -	struct kvm_mmu_page *sp;
>  	int ret = RET_PF_RETRY;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1116,8 +1128,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	rcu_read_lock();
>  
>  	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> -		int r;
> -
> +		/*
> +		 * Somewhat arbitrarily allow two local retries, e.g. to play
> +		 * nice with the extremely unlikely case that KVM encounters a
> +		 * huge SPTE an Access-assist _and_ a subsequent Dirty-assist.
> +		 * Retrying is inexpensive, but if KVM fails to install a SPTE
> +		 * three times, then a fourth attempt is likely futile and it's
> +		 * time to back off.
> +		 */
> +		int r, retry_locally = 2;
> +again:
>  		if (fault->nx_huge_page_workaround_enabled)
>  			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>  
> @@ -1140,7 +1160,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		 * The SPTE is either non-present or points to a huge page that
>  		 * needs to be split.
>  		 */
> -		sp = tdp_mmu_alloc_sp(vcpu);
> +		sp = tdp_mmu_realloc_sp(vcpu, sp);
>  		tdp_mmu_init_child_sp(sp, &iter);
>  
>  		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> @@ -1151,11 +1171,16 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
>  
>  		/*
> -		 * Force the guest to retry if installing an upper level SPTE
> -		 * failed, e.g. because a different task modified the SPTE.
> +		 * If installing an upper level SPTE failed, retry the walk
> +		 * locally before forcing the guest to retry.  If the SPTE was
> +		 * modified by a different task, odds are very good the new
> +		 * SPTE is usable as-is.  And if the SPTE was modified by the
> +		 * CPU, e.g. to set A/D bits, then unless KVM gets *extremely*
> +		 * unlucky, the CMPXCHG should succeed the second time around.
>  		 */
>  		if (r) {
> -			tdp_mmu_free_sp(sp);
> +			if (retry_locally--)
> +				goto again;
>  			goto retry;
>  		}
>  
> @@ -1166,6 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  				track_possible_nx_huge_page(kvm, sp);
>  			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  		}
> +		sp = NULL;
>  	}
>  
>  	/*
> @@ -1180,6 +1206,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  retry:
>  	rcu_read_unlock();
> +
> +	/*
> +	 * Free the previously allocated MMU page if KVM retried locally and
> +	 * ended up not using said page.
> +	 */
> +	if (sp)
> +		tdp_mmu_free_sp(sp);
>  	return ret;
>  }
>  
> 
> base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
> -- 
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-10 21:53                                   ` Edgecombe, Rick P
@ 2024-10-11  2:30                                     ` Yan Zhao
  2024-10-14 10:54                                     ` Huang, Kai
  1 sibling, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-10-11  2:30 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: seanjc@google.com, Yao, Yuan, Huang, Kai,
	linux-kernel@vger.kernel.org, isaku.yamahata@gmail.com,
	pbonzini@redhat.com, dmatlack@google.com, nik.borisov@suse.com,
	kvm@vger.kernel.org

On Fri, Oct 11, 2024 at 05:53:29AM +0800, Edgecombe, Rick P wrote:
> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > 
> > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Isn't there a more general scenario:
> 
> vcpu0                              vcpu1
> 1. Freezes PTE
> 2. External op to do the SEAMCALL
> 3.                                 Faults same PTE, hits frozen PTE
> 4.                                 Retries N times, triggers zero-step
> 5. Finally finishes external op
> 
> Am I missing something?
Yes, it's a follow-up to Sean's proposal [1] of having TDX code retry on
RET_PF_RETRY_FROZEN to avoid zero-step.
My worry is that merely avoiding re-entering the guest for vCPUs that see
FROZEN_SPTE is not enough to prevent zero-step.
These two examples show that zero-step is possible even without re-entering
the guest on FROZEN_SPTE:
- The selftest [2]: a single vCPU can fire zero-step when userspace does
  something wrong (though KVM is correct).
- The above case: nothing is wrong in KVM/QEMU; the vCPU is just extremely unlucky.


[1] https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/
[2] https://lore.kernel.org/all/ZvPrqMj1BWrkkwqN@yzhao56-desk.sh.intel.com/

> 
> > 
> > Very technically, this shouldn't be possible.  The only way for there to be
> > contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e.
> > the
> > 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> > the SPTE.
> > 
> > That said, a few thoughts:
> > 
> > 1. Where did we end up on the idea of requiring userspace to pre-fault memory?
> 
> For others' reference, I think you are referring to the idea of pre-faulting the
> entire S-EPT even for GFNs that usually get AUGed, not the mirrored EPT pre-
> faulting/PAGE.ADD dance we are already doing.
> 
> The last discussion with Paolo was to resume the retry solution discussion on
> the v2 posting because it would be easier "with everything else already
> addressed". There was also some discussion that it was not immediately
> obvious how prefaulting everything would work for memory hotplug (i.e. memslots
> added during runtime).
> 
> > 
> > 2. The zero-step logic really should have a slightly more conservative
> > threshold.
> >    I have a hard time believing that e.g. 10 attempts would create a side
> > channel,
> >    but 6 attempts is "fine".
> 
> No idea where the threshold came from. I'm not sure if it affects the KVM
> design? We can look into it for curiosity's sake in either case.
> 
> > 
> > 3. This would be a good reason to implement a local retry in
> > kvm_tdp_mmu_map().
> >    Yes, I'm being somewhat hypocritical since I'm so against retrying for the
> >    S-EPT case, but my objection to retrying for S-EPT is that it _should_ be
> > easy
> >    for KVM to guarantee success.
> > 
> > E.g. for #3, the below (compile tested only) patch should make it impossible
> > for
> > the S-EPT case to fail, as dirty logging isn't (yet) supported and mirror
> > SPTEs
> > should never trigger A/D assists, i.e. retry should always succeed.
> 
> I don't see how it addresses the scenario above. More retries could just make it
> rarer, but never fix it. Very possible I'm missing something though.
I'm also not 100% sure that zero-step cannot happen after this change, even
when KVM/QEMU do nothing wrong.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX
  2024-09-10  8:16   ` Paolo Bonzini
  2024-09-10 23:49     ` Edgecombe, Rick P
@ 2024-10-14  6:34     ` Yan Zhao
  1 sibling, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-10-14  6:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rick Edgecombe, seanjc, kvm, kai.huang, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel

On Tue, Sep 10, 2024 at 10:16:27AM +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> > +{
> > +	/*
> > +	 * TDX calls tdx_track() in tdx_sept_remove_private_spte() to ensure
> > +	 * private EPT will be flushed on the next TD enter.
> > +	 * No need to call tdx_track() here again even when this callback is as
> > +	 * a result of zapping private EPT.
> > +	 * Just invoke invept() directly here to work for both shared EPT and
> > +	 * private EPT.
> > +	 */
> > +	if (is_td_vcpu(vcpu)) {
> > +		ept_sync_global();
> > +		return;
> > +	}
> > +
> > +	vmx_flush_tlb_all(vcpu);
> > +}
> > +
> > +static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
> > +{
> > +	if (is_td_vcpu(vcpu)) {
> > +		tdx_flush_tlb_current(vcpu);
> > +		return;
> > +	}
> > +
> > +	vmx_flush_tlb_current(vcpu);
> > +}
> > +
> 
> I'd do it slightly different:
> 
> static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> {
> 	if (is_td_vcpu(vcpu)) {
> 		tdx_flush_tlb_all(vcpu);
> 		return;
> 	}
> 
> 	vmx_flush_tlb_all(vcpu);
> }
Thanks!
This is better.

> 
> static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
> {
> 	if (is_td_vcpu(vcpu)) {
> 		/*
> 		 * flush_tlb_current() is used only the first time for
> 		 * the vcpu runs, since TDX supports neither shadow
> 		 * nested paging nor SMM.  Keep this function simple.
> 		 */
> 		tdx_flush_tlb_all(vcpu);
Could we still keep tdx_flush_tlb_current()?
Though both tdx_flush_tlb_all() and tdx_flush_tlb_current() simply invoke
ept_sync_global(), their purposes are different:

- The ept_sync_global() in tdx_flush_tlb_current() is there to avoid
  retrieving the private EPTP that a single-context invalidation of the
  shared EPT would require;
- while the ept_sync_global() in tdx_flush_tlb_all() is simply the right
  operation for the shared EPT.

Adding a tdx_flush_tlb_current() can help document the differences in tdx.c.

like this:

void tdx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
        /*
         * flush_tlb_current() is invoked the first time the vcpu runs or
         * when the root of the shared EPT is invalidated.
         * KVM only needs to flush the TLB for the shared EPT because the
         * TDX module handles TLB invalidation for the private EPT in
         * tdh_vp_enter().
         *
         * A single-context invalidation for the shared EPT could be
         * performed here.  However, that invalidation requires the private
         * EPTP rather than the shared EPTP, as the shared EPT uses the
         * private EPTP as its ASID for TLB invalidation.
         *
         * To avoid reading back the private EPTP, perform a global
         * invalidation instead to keep this function simple.
         */
        ept_sync_global();
}

void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
{
        /*
         * TDX has called tdx_track() in tdx_sept_remove_private_spte() to
         * ensure that the private EPT will be flushed on the next TD enter.
         * No need to call tdx_track() here again even when this callback is
         * a result of zapping private EPT.
         *
         * Due to the lack of context to determine which EPT has been
         * affected by the zapping, invoke invept() directly here for both
         * the shared EPT and the private EPT for simplicity, though it's
         * not necessary for the private EPT.
         */
        ept_sync_global();
}



> 		return;
> 	}
> 
> 	vmx_flush_tlb_current(vcpu);
> }
> 

> and put the implementation details close to tdx_track:
> void tdx_flush_tlb_all(struct kvm_vcpu *vcpu)
> {
> 	/*
> 	 * TDX calls tdx_track() in tdx_sept_remove_private_spte() to
> 	 * ensure private EPT will be flushed on the next TD enter.
> 	 * No need to call tdx_track() here again, even when this
> 	 * callback is a result of zapping private EPT.  Just
> 	 * invoke invept() directly here, which works for both shared
> 	 * EPT and private EPT.
> 	 */
> 	ept_sync_global();
> }
Got it! 
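
So in tdx.c there would be the two thin helpers as above, and the vt_
wrappers would keep the symmetric dispatch from the original patch, just
pointing at the two separate TDX helpers (restated here for clarity, as a
sketch):

static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
	if (is_td_vcpu(vcpu)) {
		tdx_flush_tlb_all(vcpu);
		return;
	}

	vmx_flush_tlb_all(vcpu);
}

static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	if (is_td_vcpu(vcpu)) {
		tdx_flush_tlb_current(vcpu);
		return;
	}

	vmx_flush_tlb_current(vcpu);
}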

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-10 21:53                                   ` Edgecombe, Rick P
  2024-10-11  2:30                                     ` Yan Zhao
@ 2024-10-14 10:54                                     ` Huang, Kai
  2024-10-14 17:36                                       ` Edgecombe, Rick P
  1 sibling, 1 reply; 139+ messages in thread
From: Huang, Kai @ 2024-10-14 10:54 UTC (permalink / raw)
  To: seanjc@google.com, Edgecombe, Rick P, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Yao, Yuan, pbonzini@redhat.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com

On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > 
> > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Isn't there a more general scenario:
> 
> vcpu0                              vcpu1
> 1. Freezes PTE
> 2. External op to do the SEAMCALL
> 3.                                 Faults same PTE, hits frozen PTE
> 4.                                 Retries N times, triggers zero-step
> 5. Finally finishes external op
> 
> Am I missing something?

I must be missing something.  I thought KVM is going to retry internally in
step 4 (retries N times) because it sees the frozen PTE, and will not go back
to the guest until the fault is resolved?  How can step 4 trigger zero-step?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-14 10:54                                     ` Huang, Kai
@ 2024-10-14 17:36                                       ` Edgecombe, Rick P
  2024-10-14 23:03                                         ` Huang, Kai
  0 siblings, 1 reply; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-10-14 17:36 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Yao, Yuan, pbonzini@redhat.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com

On Mon, 2024-10-14 at 10:54 +0000, Huang, Kai wrote:
> On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
> > > > 
> > > > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > > > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> > 
> > Isn't there a more general scenario:
> > 
> > vcpu0                              vcpu1
> > 1. Freezes PTE
> > 2. External op to do the SEAMCALL
> > 3.                                 Faults same PTE, hits frozen PTE
> > 4.                                 Retries N times, triggers zero-step
> > 5. Finally finishes external op
> > 
> > Am I missing something?
> 
> I must be missing something.  I thought KVM is going to 
> 

"Is going to", as in "will be changed to"? Or "does today"?

> retry internally for
> step 4 (retries N times) because it sees the frozen PTE, but will never go back
> to guest after the fault is resolved?  How can step 4 triggers zero-step?

Steps 3-4 are saying it will go back to the guest and fault again.


As far as what KVM will do in the future, I think it is still open. I've not had
the chance to think about this for more than 30 min at a time, but the plan to
handle OPERAND_BUSY by taking an expensive path to break any contention (i.e.
kick+lock + whatever TDX module changes we come up with) seems to be the leading
idea.

Retrying N times is too hacky. Retrying internally forever might be awkward to
implement. Because of the signal_pending() check, you would have to handle
exiting to userspace and going back to an EPT violation the next time the vcpu
tries to enter.
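
To illustrate the awkward part, a rough sketch of "retry internally but
still honor signals" (assumed names, not a proposal):

	/*
	 * Sketch only: loop locally on a frozen SPTE, but break out on a
	 * pending signal so userspace isn't blocked forever.  The cost is
	 * having to exit to userspace here and take the EPT violation again
	 * on the next entry.
	 */
	for (;;) {
		ret = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
		if (ret != RET_PF_RETRY_FROZEN)
			break;

		if (signal_pending(current)) {
			vcpu->run->exit_reason = KVM_EXIT_INTR;
			return -EINTR;
		}
		cond_resched();
	}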

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-14 17:36                                       ` Edgecombe, Rick P
@ 2024-10-14 23:03                                         ` Huang, Kai
  2024-10-15  1:24                                           ` Edgecombe, Rick P
  0 siblings, 1 reply; 139+ messages in thread
From: Huang, Kai @ 2024-10-14 23:03 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Yao, Yuan, pbonzini@redhat.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com



On 15/10/2024 6:36 am, Edgecombe, Rick P wrote:
> On Mon, 2024-10-14 at 10:54 +0000, Huang, Kai wrote:
>> On Thu, 2024-10-10 at 21:53 +0000, Edgecombe, Rick P wrote:
>>> On Thu, 2024-10-10 at 10:33 -0700, Sean Christopherson wrote:
>>>>>
>>>>> 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
>>>>> 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
>>>
>>> Isn't there a more general scenario:
>>>
>>> vcpu0                              vcpu1
>>> 1. Freezes PTE
>>> 2. External op to do the SEAMCALL
>>> 3.                                 Faults same PTE, hits frozen PTE
>>> 4.                                 Retries N times, triggers zero-step
>>> 5. Finally finishes external op
>>>
>>> Am I missing something?
>>
>> I must be missing something.  I thought KVM is going to
>>
> 
> "Is going to", as in "will be changed to"? Or "does today"?

Will be changed to (today's behaviour is to go back to the guest and let the
fault happen again as the retry).

AFAICT this is what Sean suggested:

https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/

The whole point is to let KVM loop internally but not go back to the guest
when the fault handler sees a frozen PTE.  And in this proposal this
applies to both leaf and non-leaf PTEs IIUC, so it should handle the
case where try_cmpxchg64() fails, as mentioned by Yan.

> 
>> retry internally for
>> step 4 (retries N times) because it sees the frozen PTE, but will never go back
>> to guest after the fault is resolved?  How can step 4 triggers zero-step?
> 
> Step 3-4 is saying it will go back to the guest and fault again.

As said above, the whole point is to make KVM loop internally when it
sees a frozen PTE, but not go back to the guest.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-14 23:03                                         ` Huang, Kai
@ 2024-10-15  1:24                                           ` Edgecombe, Rick P
  0 siblings, 0 replies; 139+ messages in thread
From: Edgecombe, Rick P @ 2024-10-15  1:24 UTC (permalink / raw)
  To: seanjc@google.com, Huang, Kai, Zhao, Yan Y
  Cc: kvm@vger.kernel.org, Yao, Yuan, pbonzini@redhat.com,
	nik.borisov@suse.com, linux-kernel@vger.kernel.org,
	isaku.yamahata@gmail.com, dmatlack@google.com

On Tue, 2024-10-15 at 12:03 +1300, Huang, Kai wrote:
> > "Is going to", as in "will be changed to"? Or "does today"?
> 
> Will be changed to (today's behaviour is to go back to guest to let the 
> fault happen again to retry).
> 
> AFAICT this is what Sean suggested:
> 
> https://lore.kernel.org/all/ZuR09EqzU1WbQYGd@google.com/
> 
> The whole point is to let KVM loop internally but not go back to guest 
> when the fault handler sees a frozen PTE.  And in this proposal this 
> applies to both leaf and non-leaf PTEs IIUC, so it should handle the 
> case where try_cmpxchg64() fails as mentioned by Yan.
> 
> > 
> > > retry internally for
> > > step 4 (retries N times) because it sees the frozen PTE, but will never go
> > > back
> > > to guest after the fault is resolved?  How can step 4 triggers zero-step?
> > 
> > Step 3-4 is saying it will go back to the guest and fault again.
> 
> As said above, the whole point is to make KVM loop internally when it 
> sees a frozen PTE, but not go back to guest.

Yea, what I was saying about that idea is that looping forever without checking
for a signal would be problematic. But once you do check and exit, userspace
could re-enter the TD. I don't know if it's a show stopper.

In any case, the discussion across these threads and the LPC/KVM Forum hallway
chatter has gotten a bit fragmented. I don't think there is any concrete
consensus solution at this point.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2024-10-10 17:33                                 ` Sean Christopherson
  2024-10-10 21:53                                   ` Edgecombe, Rick P
  2024-10-11  2:06                                   ` Yan Zhao
@ 2024-10-16 14:13                                   ` Yan Zhao
  2 siblings, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-10-16 14:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Rick P Edgecombe, pbonzini@redhat.com, Yuan Yao, Kai Huang,
	isaku.yamahata@gmail.com, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, dmatlack@google.com, nik.borisov@suse.com

On Thu, Oct 10, 2024 at 10:33:30AM -0700, Sean Christopherson wrote:
> On Thu, Oct 10, 2024, Yan Zhao wrote:
> > On Tue, Oct 08, 2024 at 07:51:13AM -0700, Sean Christopherson wrote:
> > > On Wed, Sep 25, 2024, Yan Zhao wrote:
> > > > On Sat, Sep 14, 2024 at 05:27:32PM +0800, Yan Zhao wrote:
> > > > > On Fri, Sep 13, 2024 at 10:23:00AM -0700, Sean Christopherson wrote:
> > > > > > On Fri, Sep 13, 2024, Yan Zhao wrote:
> > > > > > > This is a lock status report of TDX module for current SEAMCALL retry issue
> > > > > > > based on code in TDX module public repo https://github.com/intel/tdx-module.git
> > > > > > > branch TDX_1.5.05.
> > > > > > > 
> > > > > > > TL;DR:
> > > > > > > - tdh_mem_track() can contend with tdh_vp_enter().
> > > > > > > - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> > > > > > 
> > > > > > The zero-step logic seems to be the most problematic.  E.g. if KVM is trying to
> > > > > > install a page on behalf of two vCPUs, and KVM resumes the guest if it encounters
> > > > > > a FROZEN_SPTE when building the non-leaf SPTEs, then one of the vCPUs could
> > > > > > trigger the zero-step mitigation if the vCPU that "wins" and gets delayed for
> > > > > > whatever reason.
> > > > > > 
> > > > > > Since FROZEN_SPTE is essentially bit-spinlock with a reaaaaaly slow slow-path,
> > > > > > what if instead of resuming the guest if a page fault hits FROZEN_SPTE, KVM retries
> > > > > > the fault "locally", i.e. _without_ redoing tdh_vp_enter() to see if the vCPU still
> > > > > > hits the fault?
> > > > > > 
> > > > > > For non-TDX, resuming the guest and letting the vCPU retry the instruction is
> > > > > > desirable because in many cases, the winning task will install a valid mapping
> > > > > > before KVM can re-run the vCPU, i.e. the fault will be fixed before the
> > > > > > instruction is re-executed.  In the happy case, that provides optimal performance
> > > > > > as KVM doesn't introduce any extra delay/latency.
> > > > > > 
> > > > > > But for TDX, the math is different as the cost of a re-hitting a fault is much,
> > > > > > much higher, especially in light of the zero-step issues.
> > > > > > 
> > > > > > E.g. if the TDP MMU returns a unique error code for the frozen case, and
> > > > > > kvm_mmu_page_fault() is modified to return the raw return code instead of '1',
> > > > > > then the TDX EPT violation path can safely retry locally, similar to the do-while
> > > > > > loop in kvm_tdp_map_page().
> > > > > > 
> > > > > > The only part I don't like about this idea is having two "retry" return values,
> > > > > > which creates the potential for bugs due to checking one but not the other.
> > > > > > 
> > > > > > Hmm, that could be avoided by passing a bool pointer as an out-param to communicate
> > > > > > to the TDX S-EPT fault handler that the SPTE is frozen.  I think I like that
> > > > > > option better even though the out-param is a bit gross, because it makes it more
> > > > > > obvious that the "frozen_spte" is a special case that doesn't need attention for
> > > > > > most paths.
> > > > > Good idea.
> > > > > But could we extend it a bit more to allow TDX's EPT violation handler to also
> > > > > retry directly when tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY?
> > > > I'm asking this because merely avoiding invoking tdh_vp_enter() in vCPUs seeing
> > > > FROZEN_SPTE might not be enough to prevent zero step mitigation.
> > > 
> > > The goal isn't to make it completely impossible for zero-step to fire, it's to
> > > make it so that _if_ zero-step fires, KVM can report the error to userspace without
> > > having to retry, because KVM _knows_ that advancing past the zero-step isn't
> > > something KVM can solve.
> > > 
> > >  : I'm not worried about any performance hit with zero-step, I'm worried about KVM
> > >  : not being able to differentiate between a KVM bug and guest interference.  The
> > >  : goal with a local retry is to make it so that KVM _never_ triggers zero-step,
> > >  : unless there is a bug somewhere.  At that point, if zero-step fires, KVM can
> > >    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >  : report the error to userspace instead of trying to suppress guest activity, and
> > >  : potentially from other KVM tasks too.
> > > 
> > > In other words, for the selftest you crafted, KVM reporting an error to userspace
> > > due to zero-step would be working as intended.  
> > Hmm, but the selftest is an example to show that 6 continuous EPT violations on
> > the same GPA could trigger zero-step.
> > 
> > For an extremely unlucky vCPU, is it still possible to fire zero step when
> > nothing is wrong both in KVM and QEMU?
> > e.g.
> > 
> > 1st: "fault->is_private != kvm_mem_is_private(kvm, fault->gfn)" is found.
> > 2nd-6th: try_cmpxchg64() fails on each level SPTEs (5 levels in total)
> 
> Very technically, this shouldn't be possible.  The only way for there to be
> contention on the leaf SPTE is if some other KVM task installed a SPTE, i.e. the
> 6th attempt should succeed, even if the faulting vCPU wasn't the one to create
> the SPTE.
You are right!
I just realized that if TDX code retries internally for FROZEN_SPTEs, the 6th
attempt should succeed.

But I found that the below might be another case that returns RET_PF_RETRY and
triggers zero-step:

Suppose GFNs 0x80000 - 0x80200 are shared,
with the HVA range starting at hva1, of size 0x200:


     vCPU 0                              vCPU 1
                                         1. Access GFN 0x80002
                                         2. Convert GFN 0x80002 to private

3. munmap hva1 of size 0x200
   kvm_mmu_invalidate_begin()
   mmu_invalidate_range_start = 0x80000
   mmu_invalidate_range_end   = 0x80200

                                         4. kvm_faultin_pfn() hits
                                            mmu_invalidate_retry_gfn_unsafe()
                                            for GFN 0x80002 and returns
                                            RET_PF_RETRY!

5. kvm_mmu_invalidate_end()


Until step 5 runs, step 4 will keep producing RET_PF_RETRY and re-entering the guest.
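
Simplified to the relevant check (a sketch, not the exact upstream code):

/*
 * Between kvm_mmu_invalidate_begin() and kvm_mmu_invalidate_end(), a fault
 * on a GFN inside the in-progress range is bounced with RET_PF_RETRY, so
 * vCPU 1 re-enters the guest and takes the EPT violation again.
 */
static int faultin_pfn_sketch(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	struct kvm *kvm = vcpu->kvm;

	fault->mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Step 4: GFN 0x80002 falls inside [0x80000, 0x80200) being unmapped. */
	if (mmu_invalidate_retry_gfn_unsafe(kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

	return RET_PF_CONTINUE;	/* proceed with the actual pfn lookup */
}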


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-09-04  3:07 ` [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Rick Edgecombe
  2024-09-06  2:10   ` Huang, Kai
@ 2024-10-30  3:03   ` Binbin Wu
  2024-11-04  9:09     ` Yan Zhao
  1 sibling, 1 reply; 139+ messages in thread
From: Binbin Wu @ 2024-10-30  3:03 UTC (permalink / raw)
  To: Rick Edgecombe, seanjc, pbonzini, kvm
  Cc: kai.huang, dmatlack, isaku.yamahata, yan.y.zhao, nik.borisov,
	linux-kernel




On 9/4/2024 11:07 AM, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
[...]
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 6feb3ab96926..b8cd5a629a80 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -447,6 +447,177 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
>   }
>   
> +static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +
> +	put_page(page);
Nit: It can be
put_page(pfn_to_page(pfn));


> +}
> +
> +static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> +			    enum pg_level level, kvm_pfn_t pfn)
> +{
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	hpa_t hpa = pfn_to_hpa(pfn);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	u64 entry, level_state;
> +	u64 err;
> +
> +	err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
Nit:
Usually, the kernel prefers to handle error conditions and return first.

But in this case, all the error conditions need to unpin the page.
Is it better to handle the successful case first, so that tdx_unpin() only
needs to be called once? (See the sketch at the end of this mail.)

> +	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> +		tdx_unpin(kvm, pfn);
> +		return -EAGAIN;
> +	}
> +	if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
> +		if (tdx_get_sept_level(level_state) == tdx_level &&
> +		    tdx_get_sept_state(level_state) == TDX_SEPT_PENDING &&
> +		    is_last_spte(entry, level) &&
> +		    spte_to_pfn(entry) == pfn &&
> +		    entry & VMX_EPT_SUPPRESS_VE_BIT) {
Can this condition be triggered?
For contention from multiple vCPUs, the winner has frozen the SPTE, so it
shouldn't trigger this.
Could KVM do PAGE.AUG for the same page multiple times somehow?


> +			tdx_unpin(kvm, pfn);
> +			return -EAGAIN;
> +		}
> +	}
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> +		tdx_unpin(kvm, pfn);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> +			      enum pg_level level, kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +	/* TODO: handle large pages. */
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +		return -EINVAL;
> +
> +	/*
> +	 * Because guest_memfd doesn't support page migration with
> +	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> +	 * migration.  Until guest_memfd supports page migration, prevent page
> +	 * migration.
> +	 * TODO: Once guest_memfd introduces callback on page migration,
> +	 * implement it and remove get_page/put_page().
> +	 */
> +	get_page(pfn_to_page(pfn));
> +
> +	if (likely(is_td_finalized(kvm_tdx)))
> +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> +
> +	/*
> +	 * TODO: KVM_MAP_MEMORY support to populate before finalize comes
> +	 * here for the initial memory.
> +	 */
> +	return 0;
Is it better to return an error before adding the support?

> +}
> +
[...]


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table
  2024-10-30  3:03   ` Binbin Wu
@ 2024-11-04  9:09     ` Yan Zhao
  0 siblings, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-11-04  9:09 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Rick Edgecombe, seanjc, pbonzini, kvm, kai.huang, dmatlack,
	isaku.yamahata, nik.borisov, linux-kernel

On Wed, Oct 30, 2024 at 11:03:39AM +0800, Binbin Wu wrote:
> On 9/4/2024 11:07 AM, Rick Edgecombe wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> [...]
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 6feb3ab96926..b8cd5a629a80 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -447,6 +447,177 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> >   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
> >   }
> > +static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
> > +{
> > +	struct page *page = pfn_to_page(pfn);
> > +
> > +	put_page(page);
> Nit: It can be
> put_page(pfn_to_page(pfn));
> 
Yes, thanks.

> 
> > +}
> > +
> > +static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
> > +			    enum pg_level level, kvm_pfn_t pfn)
> > +{
> > +	int tdx_level = pg_level_to_tdx_sept_level(level);
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	hpa_t hpa = pfn_to_hpa(pfn);
> > +	gpa_t gpa = gfn_to_gpa(gfn);
> > +	u64 entry, level_state;
> > +	u64 err;
> > +
> > +	err = tdh_mem_page_aug(kvm_tdx, gpa, hpa, &entry, &level_state);
> Nit:
> Usually, kernel prefers to handle and return for error conditions first.
> 
> But for this case, for all error conditions, it needs to unpin the page.
> Is it better to return the successful case first, so that it only needs
> to call tdx_unpin() once?
> 
> > +	if (unlikely(err == TDX_ERROR_SEPT_BUSY)) {
> > +		tdx_unpin(kvm, pfn);
> > +		return -EAGAIN;
> > +	}
As how to handle the BUSY case has not been decided yet, I prefer to leave this
hunk as is.

> > +	if (unlikely(err == (TDX_EPT_ENTRY_STATE_INCORRECT | TDX_OPERAND_ID_RCX))) {
> > +		if (tdx_get_sept_level(level_state) == tdx_level &&
> > +		    tdx_get_sept_state(level_state) == TDX_SEPT_PENDING &&
> > +		    is_last_spte(entry, level) &&
> > +		    spte_to_pfn(entry) == pfn &&
> > +		    entry & VMX_EPT_SUPPRESS_VE_BIT) {
> Can this condition be triggered?
> For contention from multiple vCPUs, the winner has frozen the SPTE,
> it shouldn't trigger this.
> Could KVM  do page aug for a same page multiple times somehow?
This condition should not be triggered due to the BUG_ON in
set_external_spte_present().

With Isaku's series [1], this condition will not happen either.

Will remove it. Thanks!


[1] https://lore.kernel.org/all/cover.1728718232.git.isaku.yamahata@intel.com/

> 
> > +			tdx_unpin(kvm, pfn);
> > +			return -EAGAIN;
> > +		}
> > +	}
> > +	if (KVM_BUG_ON(err, kvm)) {
> > +		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
> > +		tdx_unpin(kvm, pfn);
> > +		return -EIO;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> > +			      enum pg_level level, kvm_pfn_t pfn)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +
> > +	/* TODO: handle large pages. */
> > +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Because guest_memfd doesn't support page migration with
> > +	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
> > +	 * migration.  Until guest_memfd supports page migration, prevent page
> > +	 * migration.
> > +	 * TODO: Once guest_memfd introduces callback on page migration,
> > +	 * implement it and remove get_page/put_page().
> > +	 */
> > +	get_page(pfn_to_page(pfn));
> > +
> > +	if (likely(is_td_finalized(kvm_tdx)))
> > +		return tdx_mem_page_aug(kvm, gfn, level, pfn);
> > +
> > +	/*
> > +	 * TODO: KVM_MAP_MEMORY support to populate before finalize comes
> > +	 * here for the initial memory.
> > +	 */
> > +	return 0;
> Is it better to return error before adding the support?
Hmm, returning an error is better, though returning 0 is not wrong.
The future tdx_mem_page_record_premap_cnt() just increases
kvm_tdx->nr_premapped, and tdx_vcpu_init_mem_region() can be implemented
without checking kvm_tdx->nr_premapped.
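
Something like the below until the pre-finalize path is added (a sketch
based on the quoted code, not a tested change):

int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
			      enum pg_level level, kvm_pfn_t pfn)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	/* TODO: handle large pages. */
	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
		return -EINVAL;

	/* Pin the page; guest_memfd doesn't support page migration yet. */
	get_page(pfn_to_page(pfn));

	if (likely(is_td_finalized(kvm_tdx)))
		return tdx_mem_page_aug(kvm, gfn, level, pfn);

	/*
	 * Fail clearly until the pre-finalize (initial memory / PAGE.ADD)
	 * path is wired up, rather than silently succeeding.
	 */
	put_page(pfn_to_page(pfn));
	return -EIO;
}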


> > +}
> > +
> [...]
> 
> 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 21/21] KVM: TDX: Handle vCPU dissociation
  2024-09-10 10:45   ` Paolo Bonzini
  2024-09-11  0:17     ` Edgecombe, Rick P
@ 2024-11-04  9:45     ` Yan Zhao
  1 sibling, 0 replies; 139+ messages in thread
From: Yan Zhao @ 2024-11-04  9:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Rick Edgecombe, seanjc, kvm, kai.huang, dmatlack, isaku.yamahata,
	nik.borisov, linux-kernel

On Tue, Sep 10, 2024 at 12:45:07PM +0200, Paolo Bonzini wrote:
> On 9/4/24 05:07, Rick Edgecombe wrote:
> > +/*
> > + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> > + * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUS.
> 
> ... or when a vCPU is migrated.
> 
> > + * Protected by interrupt mask.  This list is manipulated in process context
> > + * of vCPU and IPI callback.  See tdx_flush_vp_on_cpu().
> > + */
> > +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> 
> It may be a bit more modern, or cleaner, to use a local_lock here instead of
> just relying on local_irq_disable/enable.
Hi Paolo,
After converting local_irq_disable/enable to a local_lock (see the fixup patch at
the bottom), lockdep reported "BUG: Invalid wait context" in the kvm_shutdown
path.

This is because local_lock_irqsave() internally takes a spinlock, which is not a
raw spinlock, and is therefore regarded by lockdep as sleepable in the atomic
context introduced by on_each_cpu() in kvm_shutdown().

kvm_shutdown
  |->on_each_cpu(__kvm_disable_virtualization, NULL, 1);

__kvm_disable_virtualization
  kvm_arch_hardware_disable
    tdx_hardware_disable
      local_lock_irqsave


Given that
(1) tdx_hardware_disable() is called per-CPU and only manipulates the
    per-CPU list of the CPU it runs on;
(2) tdx_vcpu_load() also only updates the per-CPU list of the CPU it runs on,

do you think we can keep using plain local_irq_disable/enable?
We can add a KVM_BUG_ON() in tdx_vcpu_load() to ensure (2) (see the sketch
after the fixup patch below):
       KVM_BUG_ON(cpu != raw_smp_processor_id(), vcpu->kvm);

Or do you still prefer a per-vcpu raw_spin_lock + local_irq_disable/enable?

Thanks
Yan

+struct associated_tdvcpus {
+       struct list_head list;
+       local_lock_t lock;
+};
+
 /*
  * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
  * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUS.
- * Protected by interrupt mask.  This list is manipulated in process context
+ * Protected by local lock.  This list is manipulated in process context
  * of vCPU and IPI callback.  See tdx_flush_vp_on_cpu().
  */
-static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+static DEFINE_PER_CPU(struct associated_tdvcpus, associated_tdvcpus);

 static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 {
@@ -338,19 +344,18 @@ static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)

 void tdx_hardware_disable(void)
 {
-       int cpu = raw_smp_processor_id();
-       struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+       struct list_head *tdvcpus = this_cpu_ptr(&associated_tdvcpus.list);
        struct tdx_flush_vp_arg arg;
        struct vcpu_tdx *tdx, *tmp;
        unsigned long flags;

-       local_irq_save(flags);
+       local_lock_irqsave(&associated_tdvcpus.lock, flags);
        /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
        list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list) {
                arg.vcpu = &tdx->vcpu;
                tdx_flush_vp(&arg);
        }
-       local_irq_restore(flags);
+       local_unlock_irqrestore(&associated_tdvcpus.lock, flags);
 }

 static void smp_func_do_phymem_cache_wb(void *unused)
@@ -609,15 +614,16 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

        tdx_flush_vp_on_cpu(vcpu);

-       local_irq_disable();
+       KVM_BUG_ON(cpu != raw_smp_processor_id(), vcpu->kvm);
+       local_lock_irq(&associated_tdvcpus.lock);
        /*
         * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
         * vcpu->cpu is read before tdx->cpu_list.
         */
        smp_rmb();

-       list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
-       local_irq_enable();
+       list_add(&tdx->cpu_list, this_cpu_ptr(&associated_tdvcpus.list));
+       local_unlock_irq(&associated_tdvcpus.lock);
 }

 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
@@ -2091,8 +2097,10 @@ static int __init __tdx_bringup(void)
        }

        /* tdx_hardware_disable() uses associated_tdvcpus. */
-       for_each_possible_cpu(i)
-               INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, i));
+       for_each_possible_cpu(i) {
+               INIT_LIST_HEAD(&per_cpu(associated_tdvcpus.list, i));
+               local_lock_init(&per_cpu(associated_tdvcpus.lock, i));
+       }

        /*
         * Enabling TDX requires enabling hardware virtualization first,
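
For reference, the alternative asked about above would keep the plain
local_irq_disable()/enable() and only add the assertion, e.g. (a sketch that
assumes the rest of tdx_vcpu_load() from the posted patch):

void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	struct vcpu_tdx *tdx = to_tdx(vcpu);

	if (vcpu->cpu == cpu)
		return;

	tdx_flush_vp_on_cpu(vcpu);

	/* Ensure (2): only the local CPU's list is ever updated. */
	KVM_BUG_ON(cpu != raw_smp_processor_id(), vcpu->kvm);

	local_irq_disable();
	/*
	 * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
	 * vcpu->cpu is read before tdx->cpu_list.
	 */
	smp_rmb();

	list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
	local_irq_enable();
}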


^ permalink raw reply	[flat|nested] 139+ messages in thread

end of thread

Thread overview: 139+ messages
2024-09-04  3:07 [PATCH 00/21] TDX MMU Part 2 Rick Edgecombe
2024-09-04  3:07 ` [PATCH 01/21] KVM: x86/mmu: Implement memslot deletion for TDX Rick Edgecombe
2024-09-09 13:44   ` Paolo Bonzini
2024-09-09 21:06     ` Edgecombe, Rick P
2024-09-04  3:07 ` [PATCH 02/21] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU Rick Edgecombe
2024-09-09 13:51   ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 03/21] KVM: x86/mmu: Do not enable page track for TD guest Rick Edgecombe
2024-09-09 13:53   ` Paolo Bonzini
2024-09-09 21:07     ` Edgecombe, Rick P
2024-09-04  3:07 ` [PATCH 04/21] KVM: VMX: Split out guts of EPT violation to common/exposed function Rick Edgecombe
2024-09-09 13:57   ` Paolo Bonzini
2024-09-09 16:07   ` Sean Christopherson
2024-09-10  7:36     ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 05/21] KVM: VMX: Teach EPT violation helper about private mem Rick Edgecombe
2024-09-09 13:59   ` Paolo Bonzini
2024-09-11  8:52   ` Chao Gao
2024-09-11 16:29     ` Edgecombe, Rick P
2024-09-12  0:39   ` Huang, Kai
2024-09-12 13:58     ` Sean Christopherson
2024-09-12 14:43       ` Edgecombe, Rick P
2024-09-12 14:46         ` Paolo Bonzini
2024-09-12  1:19   ` Huang, Kai
2024-09-04  3:07 ` [PATCH 06/21] KVM: TDX: Add accessors VMX VMCS helpers Rick Edgecombe
2024-09-09 14:19   ` Paolo Bonzini
2024-09-09 21:29     ` Edgecombe, Rick P
2024-09-10 10:48       ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 07/21] KVM: TDX: Add load_mmu_pgd method for TDX Rick Edgecombe
2024-09-11  2:48   ` Chao Gao
2024-09-11  2:49     ` Edgecombe, Rick P
2024-09-04  3:07 ` [PATCH 08/21] KVM: TDX: Set gfn_direct_bits to shared bit Rick Edgecombe
2024-09-09 15:21   ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT Rick Edgecombe
2024-09-06  1:41   ` Huang, Kai
2024-09-09 20:25     ` Edgecombe, Rick P
2024-09-09 15:25   ` Paolo Bonzini
2024-09-09 20:22     ` Edgecombe, Rick P
2024-09-09 21:11       ` Sean Christopherson
2024-09-09 21:23         ` Sean Christopherson
2024-09-09 22:34           ` Edgecombe, Rick P
2024-09-09 23:58             ` Sean Christopherson
2024-09-10  0:50               ` Edgecombe, Rick P
2024-09-10  1:46                 ` Sean Christopherson
2024-09-11  1:17               ` Huang, Kai
2024-09-11  2:48                 ` Edgecombe, Rick P
2024-09-11 22:55                   ` Huang, Kai
2024-09-10 13:15         ` Paolo Bonzini
2024-09-10 13:57           ` Sean Christopherson
2024-09-10 15:16             ` Paolo Bonzini
2024-09-10 15:57               ` Sean Christopherson
2024-09-10 16:28                 ` Edgecombe, Rick P
2024-09-10 17:42                   ` Sean Christopherson
2024-09-13  8:36                     ` Yan Zhao
2024-09-13 17:23                       ` Sean Christopherson
2024-09-13 19:19                         ` Edgecombe, Rick P
2024-09-13 22:18                           ` Sean Christopherson
2024-09-14  9:27                         ` Yan Zhao
2024-09-15  9:53                           ` Yan Zhao
2024-09-17  1:31                             ` Huang, Kai
2024-09-25 10:53                           ` Yan Zhao
2024-10-08 14:51                             ` Sean Christopherson
2024-10-10  5:23                               ` Yan Zhao
2024-10-10 17:33                                 ` Sean Christopherson
2024-10-10 21:53                                   ` Edgecombe, Rick P
2024-10-11  2:30                                     ` Yan Zhao
2024-10-14 10:54                                     ` Huang, Kai
2024-10-14 17:36                                       ` Edgecombe, Rick P
2024-10-14 23:03                                         ` Huang, Kai
2024-10-15  1:24                                           ` Edgecombe, Rick P
2024-10-11  2:06                                   ` Yan Zhao
2024-10-16 14:13                                   ` Yan Zhao
2024-09-17  2:11                         ` Huang, Kai
2024-09-13 19:19                       ` Edgecombe, Rick P
2024-09-14 10:00                         ` Yan Zhao
2024-09-04  3:07 ` [PATCH 10/21] KVM: TDX: Require TDP MMU and mmio caching for TDX Rick Edgecombe
2024-09-09 15:26   ` Paolo Bonzini
2024-09-12  0:15   ` Huang, Kai
2024-09-04  3:07 ` [PATCH 11/21] KVM: x86/mmu: Add setter for shadow_mmio_value Rick Edgecombe
2024-09-09 15:33   ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 12/21] KVM: TDX: Set per-VM shadow_mmio_value to 0 Rick Edgecombe
2024-09-09 15:33   ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 13/21] KVM: TDX: Handle TLB tracking for TDX Rick Edgecombe
2024-09-10  8:16   ` Paolo Bonzini
2024-09-10 23:49     ` Edgecombe, Rick P
2024-10-14  6:34     ` Yan Zhao
2024-09-11  6:25   ` Xu Yilun
2024-09-11 17:28     ` Edgecombe, Rick P
2024-09-12  4:54       ` Yan Zhao
2024-09-12 14:44         ` Edgecombe, Rick P
2024-09-12  7:47       ` Xu Yilun
2024-09-04  3:07 ` [PATCH 14/21] KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table Rick Edgecombe
2024-09-06  2:10   ` Huang, Kai
2024-09-09 21:03     ` Edgecombe, Rick P
2024-09-10  1:52       ` Yan Zhao
2024-09-10  9:33       ` Paolo Bonzini
2024-09-10 23:58         ` Edgecombe, Rick P
2024-09-11  1:05           ` Yan Zhao
2024-10-30  3:03   ` Binbin Wu
2024-11-04  9:09     ` Yan Zhao
2024-09-04  3:07 ` [PATCH 15/21] KVM: TDX: Implement hook to get max mapping level of private pages Rick Edgecombe
2024-09-10 10:17   ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 16/21] KVM: TDX: Premap initial guest memory Rick Edgecombe
2024-09-10 10:24   ` Paolo Bonzini
2024-09-11  0:19     ` Edgecombe, Rick P
2024-09-13 13:33       ` Adrian Hunter
2024-09-13 19:49         ` Edgecombe, Rick P
2024-09-10 10:49   ` Paolo Bonzini
2024-09-11  0:30     ` Edgecombe, Rick P
2024-09-11 10:39       ` Paolo Bonzini
2024-09-11 16:36         ` Edgecombe, Rick P
2024-09-04  3:07 ` [PATCH 17/21] KVM: TDX: MTRR: implement get_mt_mask() for TDX Rick Edgecombe
2024-09-10 10:04   ` Paolo Bonzini
2024-09-10 14:05     ` Sean Christopherson
2024-09-04  3:07 ` [PATCH 18/21] KVM: x86/mmu: Export kvm_tdp_map_page() Rick Edgecombe
2024-09-10 10:02   ` Paolo Bonzini
2024-09-04  3:07 ` [PATCH 19/21] KVM: TDX: Add an ioctl to create initial guest memory Rick Edgecombe
2024-09-04  4:53   ` Yan Zhao
2024-09-04 14:01     ` Edgecombe, Rick P
2024-09-06 16:30       ` Edgecombe, Rick P
2024-09-09  1:29         ` Yan Zhao
2024-09-10 10:13         ` Paolo Bonzini
2024-09-11  0:11           ` Edgecombe, Rick P
2024-09-04 13:56   ` Edgecombe, Rick P
2024-09-10 10:16   ` Paolo Bonzini
2024-09-11  0:12     ` Edgecombe, Rick P
2024-09-04  3:07 ` [PATCH 20/21] KVM: TDX: Finalize VM initialization Rick Edgecombe
2024-09-04 15:37   ` Adrian Hunter
2024-09-04 16:09     ` Edgecombe, Rick P
2024-09-10 10:33     ` Paolo Bonzini
2024-09-10 11:15       ` Adrian Hunter
2024-09-10 11:28         ` Paolo Bonzini
2024-09-10 11:31         ` Adrian Hunter
2024-09-10 10:25   ` Paolo Bonzini
2024-09-10 11:54     ` Adrian Hunter
2024-09-04  3:07 ` [PATCH 21/21] KVM: TDX: Handle vCPU dissociation Rick Edgecombe
2024-09-09 15:41   ` Paolo Bonzini
2024-09-09 23:30     ` Edgecombe, Rick P
2024-09-10 10:45   ` Paolo Bonzini
2024-09-11  0:17     ` Edgecombe, Rick P
2024-11-04  9:45     ` Yan Zhao
