public inbox for linux-kernel@vger.kernel.org
From: Dave Hansen <dave.hansen@intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	luto@kernel.org, peterz@infradead.org
Cc: sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com,
	ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com,
	hpa@zytor.com, jgross@suse.com, jmattson@google.com,
	joro@8bytes.org, jpoimboe@redhat.com, knsathya@kernel.org,
	pbonzini@redhat.com, sdeep@vmware.com, seanjc@google.com,
	tony.luck@intel.com, vkuznets@redhat.com, wanpengli@tencent.com,
	thomas.lendacky@amd.com, brijesh.singh@amd.com, x86@kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCHv4 30/30] Documentation/x86: Document TDX kernel architecture
Date: Fri, 25 Feb 2022 09:42:37 -0800	[thread overview]
Message-ID: <98248dc2-7802-72e9-4936-ccfca330d1d3@intel.com> (raw)
In-Reply-To: <20220224155630.52734-31-kirill.shutemov@linux.intel.com>

> +#VE Exceptions:
> +===============
> +
> +In TDX guests, #VE Exceptions are delivered to TDX guests in following
> +scenarios:
> +
> +* Execution of certain instructions (see list below)
> +* Certain MSR accesses.
> +* CPUID usage (only for certain leaves)
> +* Shared memory access (including MMIO)

This makes it sound like *ALL* MMIO will cause a #VE.  Is this strictly
true?  I didn't see anything in the spec that completely disallowed a
host from passing through an MMIO range to a guest in a shared memory
range.  Granted, the host can unilaterally make that range start causing
a #VE at any time.  But, is MMIO itself disallowed?  Or, do guests just
have to be *prepared* for a #VE when accessing something that might be MMIO?

> +#VE due to instruction execution
> +---------------------------------
> +
> +Intel TDX dis-allows execution of certain instructions in non-root

		^ disallows

> +mode. Execution of these instructions would lead to #VE or #GP.


Some instruction behavior changes when running inside a TDX guest.
These are typically instructions that would have been trapped by a
hypervisor and emulated.  In a TDX guest, these instructions either lead
to a #VE or #GP.

* Instructions that always cause a #VE:

> +* String I/O (INS, OUTS), IN, OUT
> +* HLT
> +* MONITOR, MWAIT
> +* WBINVD, INVD
> +* VMCALL

* Instructions that always cause a #GP:

> +* All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
> +  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
> +* ENCLS, ENCLV

		^ ENCLU

I don't think there's an "ENCLV" instruction.

> +* GETSEC
> +* RSM
> +* ENQCMD


* Instructions that conditionally cause a #VE (details below):
 * WRMSR, RDMSR
 * CPUID

> +#VE due to MSR access
> +----------------------

<Sigh> The title of this section is #VE.  Then, it talks about how #GPs
are triggered.

> +In TDX guest, MSR access behavior can be categorized as,
> +
> +* Native supported (also called "context switched MSR")
> +  No special handling is required for these MSRs in TDX guests.
> +* #GP triggered
> +  Dis-allowed MSR read/write would lead to #GP.
> +* #VE triggered
> +  All MSRs that are not natively supported or dis-allowed
> +  (triggers #GP) will trigger #VE. To support access to
> +  these MSRs, it needs to be emulated using TDCALL.

This is really struggling to do anything useful.  I mean, it says: "look
there are three categories."  It defines the third category as
"everything not in the other two". <sigh>  That's just a waste of bytes.

--

MSR access behavior falls into three categories:

 * #GP generated
 * #VE generated
 * MSR "just works"

In general, the #GP MSRs should not be used in guests.  Their use likely
indicates a bug in the guest.  The guest _can_ try to handle the #GP
with a hypercall but it is unlikely to succeed.

The #VE MSRs can typically be handled by the hypervisor.  Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling.  They
might be implemented by directly passing through the MSR to the hardware
or by trapping and handling in the TDX module.  Other than possibly
being slow, these MSRs appear to function just as they would on bare metal.
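
To make the three buckets concrete, here's a toy lookup in C.  The MSR
numbers and their categories below are placeholders, not real TDX
behavior -- the authoritative mapping is the "MSR Virtualization"
section of the TDX module spec:

```c
#include <assert.h>

/* The three ways a RDMSR/WRMSR can behave inside a TD guest. */
enum msr_class { MSR_JUST_WORKS, MSR_GP, MSR_VE };

/*
 * Hypothetical table entries for illustration only; the real
 * per-MSR behavior is defined by the TDX module spec.
 */
struct msr_rule {
	unsigned int msr;
	enum msr_class class;
};

static const struct msr_rule rules[] = {
	{ 0x0000001b, MSR_JUST_WORKS },	/* hypothetical: "just works" */
	{ 0x00000c80, MSR_GP },		/* hypothetical: disallowed   */
};

/*
 * Everything not explicitly listed lands in the #VE bucket and gets
 * emulated via a TDCALL-based hypercall.
 */
static enum msr_class classify_msr(unsigned int msr)
{
	for (unsigned int i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
		if (rules[i].msr == msr)
			return rules[i].class;
	return MSR_VE;
}
```

That also makes the guest-side handling obvious: #GP MSRs get fixed by
not using them, #VE MSRs get a hypercall, and "just works" MSRs need
nothing.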

> +Look Intel TDX Module Specification, sec "MSR Virtualization" for the complete
> +list of MSRs that fall under the categories above.

Could we try to write some actual coherent text here, please?  This
isn't even a complete sentence.

> +#VE due to CPUID instruction
> +----------------------------
> +
> +In TDX guests, most of CPUID leaf/sub-leaf combinations are virtualized by
> +the TDX module while some trigger #VE. Whether the leaf/sub-leaf triggers #VE
> +defined in the TDX spec.
> +
> +VMM during the TD initialization time (using TDH.MNG.INIT) configures if
> +a feature bits in specific leaf-subleaf are exposed to TD guest or not.

This needs to *say* something.  Otherwise, it's just useless bytes.
Basically, this is a long-winded way of saying "if you want to know
anything about CPUID, look at the TDX spec".

What do we want the reader to take away from this?

> +#VE on Memory Accesses
> +----------------------
> +
> +A TD guest is in control of whether its memory accesses are treated as
> +private or shared.  It selects the behavior with a bit in its page table
> +entries.

... and what?

Why does this matter?  What does it have to do with #VE?

> +#VE on Shared Pages
> +-------------------
> +
> +Access to shared mappings can cause a #VE. The hypervisor controls whether
> +access of shared mapping causes a #VE, so the guest must be careful to only
> +reference shared pages it can safely handle a #VE, avoid nested #VEs.
> +
> +Content of shared mapping is not trusted since shared memory is writable
> +by the hypervisor. Shared mappings are never used for sensitive memory content
> +like stacks or kernel text, only for I/O buffers and MMIO regions. The kernel
> +will not encounter shared mappings in sensitive contexts like syscall entry
> +or NMIs.
> +
> +#VE on Private Pages
> +--------------------
> +
> +Some accesses to private mappings may cause #VEs.  Before a mapping is
> +accepted (AKA in the SEPT_PENDING state), a reference would cause a #VE.
> +But, after acceptance, references typically succeed.
> +
> +The hypervisor can cause a private page reference to fail if it chooses
> +to move an accepted page to a "blocked" state.  However, if it does
> +this, page access will not generate a #VE.  It will, instead, cause a
> +"TD Exit" where the hypervisor is required to handle the exception.
> +
> +Linux #VE handler
> +-----------------
> +
> +Both user/kernel #VE exceptions are handled by the tdx_handle_virt_exception()
> +handler. If successfully handled, the instruction pointer is incremented to
> +complete the handling process. If failed to handle, it is treated as a regular
> +exception and handled via fixup handlers.
> +
> +In TD guests, #VE nesting (a #VE triggered before handling the current one
> +or AKA syscall gap issue) problem is handled by TDX module ensuring that
> +interrupts, including NMIs, are blocked. The hardware blocks interrupts
> +starting with #VE delivery until TDGETVEINFO is called.
> +
> +The kernel must avoid triggering #VE in entry paths: do not touch TD-shared
> +memory, including MMIO regions, and do not use #VE triggering MSRs,
> +instructions, or CPUID leaves that might generate #VE.
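
For readers following along, the flow above boils down to roughly this
userspace model (struct fields and the success path are illustrative,
not the kernel's exact struct ve_info or tdx_handle_virt_exception()
code; the exit reason values are the real VMX ones from the SDM):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models the information TDGETVEINFO returns. */
struct ve_info {
	uint64_t exit_reason;
	uint64_t exit_qual;
	uint64_t gpa;
	uint32_t instr_len;	/* length of the faulting instruction */
};

/* VMX exit reasons reused for #VE (values per the Intel SDM). */
#define EXIT_REASON_CPUID		10
#define EXIT_REASON_HLT			12
#define EXIT_REASON_MSR_READ		31
#define EXIT_REASON_MSR_WRITE		32
#define EXIT_REASON_EPT_VIOLATION	48

/*
 * Sketch of the dispatch: handle the event, then skip past the
 * faulting instruction on success.  On failure, the caller falls
 * back to the regular exception fixup path.
 */
static bool handle_ve(struct ve_info *ve, uint64_t *rip)
{
	bool ok;

	switch (ve->exit_reason) {
	case EXIT_REASON_CPUID:
	case EXIT_REASON_HLT:
	case EXIT_REASON_MSR_READ:
	case EXIT_REASON_MSR_WRITE:
	case EXIT_REASON_EPT_VIOLATION:	/* shared memory / MMIO access */
		ok = true;		/* real code issues a TDCALL here */
		break;
	default:
		ok = false;
	}

	if (ok)
		*rip += ve->instr_len;	/* complete the handling */
	return ok;
}
```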
> +
> +MMIO handling:
> +==============
> +
> +In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
> +mapping which will cause a VMEXIT on access, and then the VMM emulates the
> +access. That's not possible in TDX guests because VMEXIT will expose the
> +register state to the host. TDX guests don't trust the host and can't have
> +their state exposed to the host.
> +
> +In TDX the MMIO regions are instead configured to trigger a #VE
> +exception in the guest. The guest #VE handler then emulates the MMIO
> +instructions inside the guest and converts them into a controlled TDCALL
> +to the host, rather than completely exposing the state to the host.
> +
> +MMIO addresses on x86 are just special physical addresses. They can be
> +accessed with any instruction that accesses memory. However, the
> +introduced instruction decoding method is limited. It is only designed
> +to decode instructions like those generated by io.h macros.
> +
> +MMIO access via other means (like structure overlays) may result in
> +MMIO_DECODE_FAILED and an oops.
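
To see why only io.h-style accesses decode: readl()/writel() compile
down to a single plain MOV between a register and memory, which is all
the limited decoder needs to recognize.  A toy model in C (the MOV
opcodes are the real single-byte forms from the SDM; the kernel's
actual decoder handles more than this sketch, but the rejection path
models the MMIO_DECODE_FAILED oops):

```c
#include <stddef.h>
#include <stdint.h>

enum decode_result { MMIO_DECODE_OK, MMIO_DECODE_FAILED };

/*
 * Accept only the simple register<->memory MOV forms that io.h
 * accessors generate; reject anything else, e.g. the string ops
 * a struct-overlay memcpy() might produce.
 */
static enum decode_result decode_mmio_insn(const uint8_t *insn, size_t len)
{
	if (len == 0)
		return MMIO_DECODE_FAILED;

	switch (insn[0]) {
	case 0x88: case 0x89:	/* MOV r/m, r  (MMIO write) */
	case 0x8a: case 0x8b:	/* MOV r, r/m  (MMIO read)  */
		return MMIO_DECODE_OK;
	default:		/* e.g. 0xa5 MOVS from memcpy() */
		return MMIO_DECODE_FAILED;
	}
}
```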
> +
> +Shared memory:
> +==============
> +
> +Intel TDX doesn't allow the VMM to access guest private memory. Any
> +memory that is required for communication with VMM must be shared
> +explicitly by setting the bit in the page table entry. The shared bit
> +can be enumerated with TDX_GET_INFO.
> +
> +After setting the shared bit, the conversion must be completed with
> +MapGPA hypercall. The call informs the VMM about the conversion between
> +private/shared mappings.
> +
> +set_memory_decrypted() converts a range of pages to shared.
> +set_memory_encrypted() converts memory back to private.
> +
> +Device drivers are the primary user of shared memory, but there's no
> +need in touching every driver. DMA buffers and ioremap()'ed regions are
> +converted to shared automatically.
> +
> +TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
> +converted to shared on boot.
> +
> +For coherent DMA allocation, the DMA buffer gets converted on the
> +allocation. Check force_dma_unencrypted() for details.
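
A toy model of the page-table side of the conversion.  Bit 51 is a
placeholder for the demo -- the real shared-bit position is enumerated
at runtime via TDX_GET_INFO, and the real set_memory_decrypted()/
set_memory_encrypted() helpers also flush TLBs and issue the MapGPA
hypercall:

```c
#include <stdint.h>

/*
 * Placeholder position; the actual shared bit is enumerated with
 * TDX_GET_INFO at boot.
 */
static uint64_t td_shared_mask = 1ULL << 51;

/* What converting a PTE to shared looks like. */
static uint64_t pte_set_shared(uint64_t pte)
{
	return pte | td_shared_mask;
}

/* And converting it back to private. */
static uint64_t pte_set_private(uint64_t pte)
{
	return pte & ~td_shared_mask;
}
```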
> +


> +More details about TDX module (and its response for MSR, memory access,
> +IO, CPUID etc) can be found at,
> +
> +https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
> +
> +More details about TDX hypercall and TDX module call ABI can be found
> +at,
> +
> +https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface-1.0-344426-002.pdf
> +
> +More details about TDVF requirements can be found at,
> +
> +https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf

None of these are stable URLs.  Let's just get rid of them.
