Re: [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches

From: Boqun Feng <boqun.feng@intel.com>
To: xen-devel@lists.xen.org
Cc: Kevin Tian <kevin.tian@intel.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Wei Liu <wei.liu2@citrix.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>, Tim Deegan <tim@xen.org>,
	kai.huang@linux.intel.com, Jan Beulich <jbeulich@suse.com>
Subject: Re: [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches
Date: Mon, 25 Dec 2017 13:01:19 +0800
Message-ID: <20171225050119.GA748@winterfell.sh.intel.com>
In-Reply-To: <20171204001528.1342-1-boqun.feng@intel.com>

On Mon, Dec 04, 2017 at 08:15:11AM +0800, Boqun Feng wrote:
> Hi all,
> 
> This is the v2 of RFC SGX Virtualization design and draft patches, you

Ping ;-)

Any comments?

Regards,
Boqun

> can find v1 at:
> 
>     https://lists.gt.net/xen/devel/483404
> 
> In the new version, I fixed a few things according to the feedback on the
> previous version (mostly cleanups and code movement).
> 
> Besides, Kai and I redesigned the setup of the SGX MSRs and introduced the
> new XL parameters 'lehash' and 'lewr'.
> 
> Another big change is that I modified the EPC management to fit EPC pages
> into 'struct page_info', and in patches #6 and #7, unscrubbable pages,
> 'PGC_epc', 'MEMF_epc' and 'XENZONE_EPC' are introduced, so that EPC
> management is fully integrated into the existing memory management of Xen.
> This might be the controversial bit, so patches 6-8 are simply meant to show
> the idea and drive deeper discussion.
> 
> Detailed changes since v1 (modifications tagged "[New]" are entirely new in
> this series; reviews and comments on those parts are highly welcome):
> 
> *   Make SGX-related code mostly common for x86 by: 1) moving sgx.[ch] to
>     arch/x86/ and include/asm-x86/, and 2) renaming EPC-related functions
>     with a domain_* prefix.
> 
> *   Rename ioremap_cache() to ioremap_wb() and make it x86-specific, as
>     suggested by Jan Beulich.
> 
> *   Remove percpu sgx_cpudata; during bootup, secondary CPUs now check
>     whether they read different values than the boot CPU, and if so, SGX
>     is disabled.
> 
> *   Remove domain_has_sgx_{,launch_control}, and make sure we can
>     rely on the domain's arch.cpuid->feat.sgx{_lc} for setting checks.
> 
> *   Cleanup the code for CPUID handling as suggested by Andrew Cooper.
> 
> *   Adjust to the msr_policy framework for SGX MSR handling, and remove
>     unnecessary fields like 'readable' and 'writable'.
> 
> *   Use 'page_info' to maintain EPC pages, and [New] add a draft
>     implementation for employing xenheap for EPC page management. Please
>     see patches 6-8.
> 
> *   [New] Modify the XL parameter for SGX; please see section 2.1.1 in
>     the updated design doc.
> 
> *   [New] Use the _set_vcpu_msrs hypercall in the toolstack to set the
>     SGX-related MSRs. Please see patch #17.
> 
> *   ACPI-related tool changes are temporarily dropped from this patchset,
>     as I need more time to address the comments and do related tests.
> 
> The updated design doc follows. As in the previous version, there are some
> particular points in the design where we don't know which implementation is
> better; for those, a question mark (?) is added to the right of the menu
> entry. For SGX live migration, thanks to Wei Liu for commenting in the
> previous version's review that it would be nice to support it if we can, but
> we'd like to hear more from you guys, so we still put a question mark on this
> item. Your comments on those "question mark (?)" parts (and other comments as
> well, of course) are highly appreciated.
> 
> ===================================================================
> 1. SGX Introduction
>     1.1 Overview
>         1.1.1 Enclave
>         1.1.2 EPC (Enclave Page Cache)
>         1.1.3 ENCLS and ENCLU
>     1.2 Discovering SGX Capability
>         1.2.1 Enumerate SGX via CPUID
>         1.2.2 Intel SGX Opt-in Configuration
>     1.3 Enclave Life Cycle
>         1.3.1 Constructing & Destroying Enclave
>         1.3.2 Enclave Entry and Exit
>             1.3.2.1 Synchronous Entry and Exit
>             1.3.2.2 Asynchronous Enclave Exit
>         1.3.3 EPC Eviction and Reload
>     1.4 SGX Launch Control
>     1.5 SGX Interaction with IA32 and IA64 Architecture
> 2. SGX Virtualization Design
>     2.1 High Level Toolstack Changes
>         2.1.1 New 'sgx' XL configure file parameter
>         2.1.2 New XL commands (?)
>         2.1.3 Notify domain's virtual EPC base and size to Xen
>     2.2 High Level Hypervisor Changes
>         2.2.1 EPC Management
>         2.2.2 EPC Virtualization
>         2.2.3 Populate EPC for Guest
>         2.2.4 Launch Control Support
>         2.2.5 CPUID Emulation
>         2.2.6 EPT Violation & ENCLS Trapping Handling
>         2.2.7 Guest Suspend & Resume
>         2.2.8 Destroying Domain
>     2.3 Additional Point: Live Migration, Snapshot Support (?)
> 3. Reference
> 
> 1. SGX Introduction
> 
> 1.1 Overview
> 
> 1.1.1 Enclave
> 
> Intel Software Guard Extensions (SGX) is a set of instructions and mechanisms
> for memory accesses that provide secure access for sensitive applications
> and data. SGX allows an application to use a particular part of its address
> space as an *enclave*, a protected area that provides confidentiality and
> integrity even in the presence of privileged malware. Accesses to the enclave
> memory area from any software not resident in the enclave are prevented,
> including those from privileged software. The diagram below illustrates the
> presence of enclaves in an application.
> 
>         |-----------------------|
>         |                       |
>         |   |---------------|   |
>         |   |   OS kernel   |   |       |-----------------------|
>         |   |---------------|   |       |                       |
>         |   |               |   |       |   |---------------|   |
>         |   |---------------|   |       |   | Entry table   |   |
>         |   |   Enclave     |---|-----> |   |---------------|   |
>         |   |---------------|   |       |   | Enclave stack |   |
>         |   |   App code    |   |       |   |---------------|   |
>         |   |---------------|   |       |   | Enclave heap  |   |
>         |   |   Enclave     |   |       |   |---------------|   |
>         |   |---------------|   |       |   | Enclave code  |   |
>         |   |   App code    |   |       |   |---------------|   |
>         |   |---------------|   |       |                       |
>         |           |           |       |-----------------------|
>         |-----------------------|
> 
> SGX supports SGX1 and SGX2 extensions. SGX1 provides basic enclave support,
> and SGX2 allows additional flexibility in runtime management of enclave
> resources and thread execution within an enclave.
> 
> 1.1.2 EPC (Enclave Page Cache)
> 
> Just like normal application memory management, enclave memory management can
> be divided into two parts: address space allocation and memory commitment.
> Address space allocation is allocating a particular range of linear address
> space for the enclave. Memory commitment is assigning actual resources to the
> enclave.
> 
> Enclave Page Cache (EPC) is the physical resource used to commit memory to
> enclaves. EPC is divided into 4K pages; each EPC page is 4K in size and always
> aligned to a 4K boundary. Hardware performs additional access control checks
> to restrict access to EPC pages. The Enclave Page Cache Map (EPCM) is a
> secure structure which holds one entry for each EPC page, and is used by
> hardware to track the status of each EPC page (it is invisible to software).
> Typically EPC and EPCM are reserved by BIOS as Processor Reserved Memory, but
> the actual amount, size, and layout of EPC are model-specific and dependent on
> BIOS settings. EPC is enumerated via the new SGX CPUID leaf, and is reported
> as reserved memory.
> 
> EPC pages can either be invalid or valid. There are 4 valid EPC page types in
> SGX1: regular EPC page, SGX Enclave Control Structure (SECS) page, Thread
> Control Structure (TCS) page, and Version Array (VA) page. SGX2 adds the
> Trimmed EPC page. Each enclave is associated with one SECS page. Each thread
> in an enclave is associated with one TCS page. VA pages are used in EPC page
> eviction and reload. The Trimmed EPC page is used in SGX2 when a particular 4K
> page in an enclave is going to be freed (trimmed) at runtime after the enclave
> is initialized.
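The page types listed above could be tracked in a hypervisor with an enum like the one below. This is a minimal sketch: the identifier names are hypothetical, and the numeric values are our reading of the EPCM PAGE_TYPE encoding in the SDM, so please double-check them there before relying on them.

```c
#include <stdbool.h>

/* Hypothetical names; values follow the EPCM PAGE_TYPE encoding as we read it. */
enum epc_page_type {
    EPC_PT_SECS = 0,  /* SGX Enclave Control Structure, one per enclave */
    EPC_PT_TCS  = 1,  /* Thread Control Structure, one per enclave thread */
    EPC_PT_REG  = 2,  /* regular enclave code/data page */
    EPC_PT_VA   = 3,  /* Version Array page, used by EWB and ELDU/ELDB */
    EPC_PT_TRIM = 4,  /* trimmed page, SGX2 only */
};

/* The four SGX1 types are always available; only PT_TRIM needs SGX2. */
static bool epc_page_type_needs_sgx2(enum epc_page_type t)
{
    return t == EPC_PT_TRIM;
}

static const char *epc_page_type_name(enum epc_page_type t)
{
    switch (t) {
    case EPC_PT_SECS: return "SECS";
    case EPC_PT_TCS:  return "TCS";
    case EPC_PT_REG:  return "REG";
    case EPC_PT_VA:   return "VA";
    case EPC_PT_TRIM: return "TRIM";
    }
    return "invalid";
}
```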
> 
> 1.1.3 ENCLS and ENCLU
> 
> Two new instructions, ENCLS and ENCLU, are introduced to manage enclaves and
> EPC. ENCLS can only run in ring 0, while ENCLU can only run in ring 3. Both
> ENCLS and ENCLU have multiple leaf functions, with EAX indicating the specific
> leaf function.
> 
> SGX1 supports the following ENCLS and ENCLU leaves:
> 
>     ENCLS:
>     - ECREATE, EADD, EEXTEND, EINIT, EREMOVE (Enclave build and destroy)
>     - EPA, EBLOCK, ETRACK, EWB, ELDU/ELDB (EPC eviction & reload)
> 
>     ENCLU:
>     - EENTER, EEXIT, ERESUME (Enclave entry, exit, re-enter)
>     - EGETKEY, EREPORT (SGX key derivation, attestation)
> 
> Additionally, SGX2 supports the following ENCLS and ENCLU leaves for adding
> EPC pages to and removing them from an enclave at runtime after the enclave is
> initialized, along with permission changes.
> 
>     ENCLS:
>     - EAUG, EMODT, EMODPR
>     
>     ENCLU:
>     - EACCEPT, EACCEPTCOPY, EMODPE
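For reference when trapping or emulating ENCLS, the leaf numbers passed in EAX can be captured as enums. A sketch only: the enum and helper names are hypothetical, the values are our reading of the SDM, and the table also includes the debug leaves EDBGRD/EDBGWR, which the list above omits.

```c
#include <stdbool.h>

/* ENCLS leaf numbers (EAX), per our reading of the SDM; names hypothetical. */
enum encls_leaf {
    ENCLS_ECREATE = 0x00, ENCLS_EADD   = 0x01, ENCLS_EINIT  = 0x02,
    ENCLS_EREMOVE = 0x03, ENCLS_EDBGRD = 0x04, ENCLS_EDBGWR = 0x05,
    ENCLS_EEXTEND = 0x06, ENCLS_ELDB   = 0x07, ENCLS_ELDU   = 0x08,
    ENCLS_EBLOCK  = 0x09, ENCLS_EPA    = 0x0A, ENCLS_EWB    = 0x0B,
    ENCLS_ETRACK  = 0x0C,
    /* SGX2 leaves */
    ENCLS_EAUG    = 0x0D, ENCLS_EMODPR = 0x0E, ENCLS_EMODT  = 0x0F,
};

/* ENCLU leaf numbers (EAX). */
enum enclu_leaf {
    ENCLU_EREPORT = 0x00, ENCLU_EGETKEY = 0x01, ENCLU_EENTER = 0x02,
    ENCLU_ERESUME = 0x03, ENCLU_EEXIT   = 0x04,
    /* SGX2 leaves */
    ENCLU_EACCEPT = 0x05, ENCLU_EMODPE  = 0x06, ENCLU_EACCEPTCOPY = 0x07,
};

static bool encls_leaf_is_sgx2(unsigned int leaf)
{
    return leaf >= ENCLS_EAUG && leaf <= ENCLS_EMODT;
}

static bool enclu_leaf_is_sgx2(unsigned int leaf)
{
    return leaf >= ENCLU_EACCEPT && leaf <= ENCLU_EACCEPTCOPY;
}
```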
> 
> The VMM is able to interfere with ENCLS running in a guest (see 1.5.1 VMX
> Changes for Supporting SGX Virtualization) but is unable to interfere with
> ENCLU.
> 
> 1.2 Discovering SGX Capability
> 
> 1.2.1 Enumerate SGX via CPUID
> 
> If CPUID.0x7.0:EBX.SGX (bit 2) is 1, then the processor supports SGX, and SGX
> capabilities and resources can be enumerated via the new SGX CPUID leaf
> (0x12). CPUID.0x12.0x0 reports SGX capabilities, such as the presence of SGX1
> and SGX2, and the enclave's maximum size for both 32-bit and 64-bit
> applications. CPUID.0x12.0x1 reports which bits can be set in
> SECS.ATTRIBUTES. CPUID.0x12.0x2 reports the EPC resource's base and size. The
> platform may support multiple EPC sections, and CPUID.0x12.0x3 and further
> sub-leaves can be used to detect them (until CPUID reports an invalid EPC
> section).
> 
> Refer to 37.7.2 Intel SGX Resource Enumeration Leaves for full description of
> SGX CPUID 0x12.
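Decoding an EPC section sub-leaf (0x2 and above) is mostly bit assembly. A minimal sketch, assuming the register layout we read from the SDM: EAX[3:0] = 1 marks a valid section, EAX[31:12] and EBX[19:0] hold bits 31:12 and 51:32 of the base, and ECX[31:12] and EDX[19:0] hold the same bits of the size. `struct epc_section` and `decode_epc_section()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

struct epc_section {        /* hypothetical */
    uint64_t base;          /* physical base address */
    uint64_t size;          /* size in bytes */
};

/*
 * Decode the registers returned by CPUID.0x12 for one sub-leaf >= 2.
 * Returns false when EAX[3:0] != 1, i.e. no valid EPC section.
 */
static bool decode_epc_section(uint32_t eax, uint32_t ebx, uint32_t ecx,
                               uint32_t edx, struct epc_section *sec)
{
    if ((eax & 0xfu) != 1)
        return false;
    /* Bits 31:12 from EAX/ECX, bits 51:32 from EBX[19:0]/EDX[19:0]. */
    sec->base = (uint64_t)(eax & 0xfffff000u) |
                ((uint64_t)(ebx & 0x000fffffu) << 32);
    sec->size = (uint64_t)(ecx & 0xfffff000u) |
                ((uint64_t)(edx & 0x000fffffu) << 32);
    return true;
}
```

A caller would issue CPUID.0x12 with sub-leaf 2, 3, ... and stop at the first sub-leaf for which the helper returns false.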
> 
> 1.2.2 Intel SGX Opt-in Configuration
> 
> On processors that support Intel SGX, IA32_FEATURE_CONTROL also provides the
> SGX_ENABLE bit (bit 18) to turn SGX on/off. Before system software can enable
> and use SGX, BIOS is required to set IA32_FEATURE_CONTROL.SGX_ENABLE = 1 to
> opt in to SGX.
> 
> Setting SGX_ENABLE follows the rules of IA32_FEATURE_CONTROL.LOCK (bit 0).
> Software is considered to have opted into Intel SGX if and only if
> IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK are set to 1.
> 
> The setting of IA32_FEATURE_CONTROL.SGX_ENABLE (bit 18) is not reflected by
> SGX CPUID. Enclave instructions behave differently depending on the value of
> CPUID.0x7.0x0:EBX.SGX and whether BIOS has opted in to SGX.
> 
> Refer to 37.7.1 Intel SGX Opt-in Configuration for more information.
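The opt-in rule above reduces to a two-bit test on IA32_FEATURE_CONTROL (MSR 0x3A). A sketch with hypothetical macro and function names; the caller is assumed to have already read the MSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_FEATURE_CONTROL    0x0000003au   /* hypothetical macro names */
#define FEATURE_CONTROL_LOCK        (1ULL << 0)
#define FEATURE_CONTROL_SGX_ENABLE  (1ULL << 18)

/* Opted in iff both LOCK (bit 0) and SGX_ENABLE (bit 18) are set. */
static bool sgx_opted_in(uint64_t feature_control)
{
    const uint64_t need = FEATURE_CONTROL_LOCK | FEATURE_CONTROL_SGX_ENABLE;

    return (feature_control & need) == need;
}
```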
> 
> 1.3 Enclave Life Cycle
> 
> 1.3.1 Constructing & Destroying Enclave
> 
> An enclave is created by privileged software via the ENCLS[ECREATE] leaf.
> Basically, ECREATE converts an invalid EPC page into a SECS page, according to
> a source SECS structure residing in normal memory. The source SECS contains
> the enclave's info, such as base (linear) address, size, enclave attributes,
> enclave measurement, etc.
>
> After ECREATE, for each 4K page of the enclave's linear address space,
> privileged software uses EADD and EEXTEND to add one EPC page to it. Enclave
> code/data (residing in normal memory) is loaded into the enclave during EADD
> for each of the enclave's 4K pages. After all EPC pages are added to the
> enclave, privileged software calls EINIT to initialize the enclave, and then
> the enclave is ready to run.
>
> While the enclave is being constructed, the enclave measurement, which is a
> SHA256 hash value, is also built according to the enclave's size, the
> code/data itself and its location in the enclave, etc. The measurement can be
> used to uniquely identify the enclave. SIGSTRUCT in the EINIT leaf also
> contains a measurement, specified by untrusted software, in MRENCLAVE. EINIT
> checks the two measurements and only succeeds when they match.
>
> An enclave is destroyed by running EREMOVE on all of the enclave's EPC pages,
> and then on the enclave's SECS. EREMOVE reports the SGX_CHILD_PRESENT error if
> it is called on a SECS while there are still regular EPC pages that haven't
> been removed from the enclave.
>
> Please refer to SDM chapter 39.1 Constructing an Enclave for more information.
> 
> 1.3.2 Enclave Entry and Exit
> 
> 1.3.2.1 Synchonous Entry and Exit
> 
> After the enclave is constructed, non-privileged software uses ENCLU[EENTER]
> to enter the enclave and run. While running in the enclave, non-privileged
> software can use ENCLU[EEXIT] to exit from the enclave and return to normal
> mode.
> 
> 1.3.2.2 Asynchounous Enclave Exit
> 
> Asynchronous and synchronous events, such as exceptions, interrupts, traps,
> SMIs, and VM exits may occur while executing inside an enclave. These events
> are referred to as Enclave Exiting Events (EEE). Upon an EEE, the processor
> state is securely saved inside the enclave and then replaced by a synthetic
> state to prevent leakage of secrets. The process of securely saving state and
> establishing the synthetic state is called an Asynchronous Enclave Exit (AEX).
> 
> After an AEX, non-privileged software uses ENCLU[ERESUME] to re-enter the
> enclave. The SGX userspace software maintains a small piece of code (residing
> in normal memory) which basically calls ERESUME to re-enter the enclave. The
> address of this code is called the Asynchronous Exit Pointer (AEP). The AEP is
> specified as a parameter to EENTER and is kept internally in the enclave. Upon
> AEX, the AEP is pushed onto the stack, and upon returning from EEE handling
> (e.g., via IRET), the AEP is loaded into RIP and ERESUME is subsequently
> called to re-enter the enclave.
>
> During AEX the processor saves and restores context automatically, therefore
> no change to the interrupt handling of the OS kernel or VMM is required. It is
> the SGX userspace software's responsibility to set up the AEP correctly.
>
> Please refer to SDM chapter 39.2 Enclave Entry and Exit for more information.
> 
> 1.3.3 EPC Eviction and Reload
> 
> SGX also allows privileged software to evict any EPC pages that are used by
> enclaves. The idea is the same as normal memory swapping. The details of how
> to evict EPC pages follow.
>
> Below is the sequence to evict regular EPC pages:
> 
> 	1) Select one or multiple regular EPC pages from one enclave
> 	2) Remove EPT/PT mapping for selected EPC pages
> 	3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
> 	4) EBLOCK on selected EPC pages
> 	5) ETRACK on enclave's SECS page
> 	6) allocate one available slot (8-byte) in VA page
> 	7) EWB on selected EPC pages
> 
> EWB takes:
>
> 	- a VA slot, to store eviction version info.
> 	- one normal 4K page in memory, to store the encrypted content of the EPC page.
> 	- one struct PCMD in memory, to store metadata.
>
>     (A VA slot is an 8-byte slot in a VA page, which is a particular type of
>     EPC page.)
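Since a VA page is an ordinary 4K EPC page split into 8-byte slots, it holds 4096 / 8 = 512 slots, and step 6) only needs a small free-slot allocator per VA page. The sketch below uses a bitmap; `struct va_page`, `va_slot_alloc()` and `va_slot_free()` are hypothetical, and releasing a slot once ELDU/ELDB has consumed it is an assumption about the software's bookkeeping, not something EWB itself does.

```c
#include <stdint.h>

#define EPC_PAGE_SIZE  4096
#define VA_SLOT_SIZE   8
#define VA_SLOTS       (EPC_PAGE_SIZE / VA_SLOT_SIZE)   /* 512 */

/* Hypothetical software bookkeeping for one VA page: one bit per slot. */
struct va_page {
    uint64_t used[VA_SLOTS / 64];
};

/* Return the byte offset of a newly reserved slot, or -1 if the page is full. */
static int va_slot_alloc(struct va_page *va)
{
    for (unsigned int w = 0; w < VA_SLOTS / 64; w++) {
        if (va->used[w] == UINT64_MAX)
            continue;                       /* all 64 slots in this word taken */
        for (unsigned int b = 0; b < 64; b++) {
            if (!(va->used[w] & (1ULL << b))) {
                va->used[w] |= 1ULL << b;
                return (int)((w * 64 + b) * VA_SLOT_SIZE);
            }
        }
    }
    return -1;
}

/* Release the slot at byte offset 'off' (assumed: after ELDU/ELDB used it). */
static void va_slot_free(struct va_page *va, int off)
{
    unsigned int slot = (unsigned int)off / VA_SLOT_SIZE;

    va->used[slot / 64] &= ~(1ULL << (slot % 64));
}
```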
> 
> And below is the sequence to evict a SECS page or VA page:
>
> 	1) Locate the SECS (or VA) page
> 	2) Remove EPT/PT mapping for the SECS (or VA) page
> 	3) Send IPIs to remote CPUs
> 	4) Allocate one available slot (8-byte) in a VA page
> 	5) EWB on the SECS (or VA) page
>
> When evicting a SECS page, all regular EPC pages that belong to that SECS
> must be evicted first, otherwise EWB returns the SGX_CHILD_PRESENT error.
> 
> And to reload an EPC page:
> 
> 	1) ELDU/ELDB on EPC page
> 	2) setup EPT/PT mapping
> 
> ELDU/ELDB takes:
> 
> 	- location of SECS page
> 	- linear address of enclave's 4K page (that we are going to reload to)
> 	- VA slot (used in EWB)
> 	- 4K page in memory (used in EWB)
> 	- struct PCMD in memory (used in EWB)
> 
> Please refer to SDM chapter 39.5 EPC and Management of EPC pages for more
> information.
> 
> 1.4 SGX Launch Control
> 
> SGX requires running the "Launch Enclave" (LE) before running any other
> enclave. This is because the LE is the only enclave that does not require an
> EINITTOKEN in EINIT. Running any other enclave requires a valid EINITTOKEN,
> which contains a MAC of (the first 192 bytes of) the EINITTOKEN, calculated
> with the EINITTOKEN key. EINIT verifies the MAC by internally deriving the
> EINITTOKEN key, and only accepts an EINITTOKEN with a matching MAC. The
> EINITTOKEN key derivation depends on some info from the LE. Typically, the LE
> generates the EINITTOKEN for another enclave according to the LE itself and
> the target enclave, and calculates the MAC using the EINITTOKEN key obtained
> via ENCLU[EGETKEY]. Only the LE is able to get the EINITTOKEN key.
>
> Running the LE requires the SHA256 hash of the LE signer's RSA public key
> (SHA256 of sigstruct->modulus) to equal IA32_SGXLEPUBKEYHASH[0-3] MSRs (the 4
> MSRs together make up the 256-bit SHA256 hash value).
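The four MSRs can be filled from a 32-byte digest as below. This is a sketch under the assumption that digest bytes 8*i..8*i+7 go into IA32_SGXLEPUBKEYHASHi (MSR 0x8C + i) in little-endian order; please verify the byte order against the SDM. `lehash_to_msrs()` is a hypothetical helper.

```c
#include <stdint.h>

#define MSR_IA32_SGXLEPUBKEYHASH0  0x0000008cu   /* HASH1..3 at 0x8d..0x8f */

/*
 * Hypothetical helper: split a SHA256 digest into the four 64-bit MSR
 * values, assuming byte 8*i is the least significant byte of MSR i.
 */
static void lehash_to_msrs(const uint8_t digest[32], uint64_t msr[4])
{
    for (int i = 0; i < 4; i++) {
        uint64_t v = 0;

        for (int j = 7; j >= 0; j--)        /* most significant byte first */
            v = (v << 8) | digest[8 * i + j];
        msr[i] = v;
    }
}
```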
> 
> If CPUID.0x7.0x0:EBX.SGX and CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL (bit 30) are
> set, then IA32_SGXLEPUBKEYHASHn are available, and the IA32_FEATURE_CONTROL
> MSR has the SGX_LAUNCH_CONTROL_ENABLE bit (bit 17) available. Setting
> SGX_LAUNCH_CONTROL_ENABLE to 1 allows runtime changes of IA32_SGXLEPUBKEYHASHn
> after IA32_FEATURE_CONTROL is locked. Otherwise, IA32_SGXLEPUBKEYHASHn are
> read-only after IA32_FEATURE_CONTROL is locked. After reset,
> IA32_SGXLEPUBKEYHASHn are set to the hash of Intel's default key. On systems
> that have only CPUID.0x7.0x0:EBX.SGX set, IA32_SGXLEPUBKEYHASHn are not
> available. On such systems EINIT always treats IA32_SGXLEPUBKEYHASHn as
> Intel's default value, thus only Intel's LE is able to run.
>
> On systems with IA32_SGXLEPUBKEYHASHn available, it is up to the BIOS
> implementation to decide whether to provide configurations for the user to set
> IA32_SGXLEPUBKEYHASHn in *locked* mode (IA32_SGXLEPUBKEYHASHn are read-only
> after IA32_FEATURE_CONTROL is locked) or *unlocked* mode (IA32_SGXLEPUBKEYHASHn
> are writable by the kernel at runtime). Also, the BIOS may or may not provide
> configurations to allow the user to set a custom value of IA32_SGXLEPUBKEYHASHn.
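The host configurations described above can be classified as follows. A sketch with hypothetical names, assuming the BIOS has already locked IA32_FEATURE_CONTROL, which is the normal state once SGX is opted in.

```c
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_CONTROL_SGX_LC_ENABLE  (1ULL << 17)   /* hypothetical macro name */

enum lehash_mode {
    LEHASH_UNAVAILABLE,  /* no SGX_LAUNCH_CONTROL: EINIT uses Intel's default hash */
    LEHASH_LOCKED,       /* IA32_SGXLEPUBKEYHASHn read-only */
    LEHASH_UNLOCKED,     /* IA32_SGXLEPUBKEYHASHn writable at runtime */
};

/* Assumes IA32_FEATURE_CONTROL has already been locked by the BIOS. */
static enum lehash_mode host_lehash_mode(bool cpuid_sgx_lc,
                                         uint64_t feature_control)
{
    if (!cpuid_sgx_lc)
        return LEHASH_UNAVAILABLE;
    return (feature_control & FEATURE_CONTROL_SGX_LC_ENABLE) ? LEHASH_UNLOCKED
                                                              : LEHASH_LOCKED;
}
```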
> 
> 1.5 SGX Interaction with IA32 and IA64 Architecture
> 
> SDM Chapter 42 describes SGX interaction with various features in the IA32 and
> IA64 architecture. The major ones are outlined below. Refer to Chapter 42 for a
> full description of SGX interaction with various IA32 and IA64 features.
> 
> 1.5.1 VMX Changes for Supporting SGX Virtualization
> 
> A new 64-bit ENCLS-exiting bitmap control field is added to the VMCS
> (encoding 0202EH) to control VMEXIT on ENCLS leaf functions. A new "Enable
> ENCLS exiting" control bit (bit 15) is defined in the secondary
> processor-based VM-execution controls. Setting "Enable ENCLS exiting" to 1
> enables the ENCLS-exiting bitmap control, which determines which ENCLS leaves
> trigger VMEXIT.
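Building the ENCLS-exiting bitmap is a matter of setting one bit per leaf. A sketch, assuming (per our reading of the SDM) that bit n covers leaf n for n < 63 and bit 63 covers every leaf >= 63; the macro and function names are hypothetical. Note that with "Static Partitioning" (see 2.2.2) Xen would not need to enable ENCLS exiting at all.

```c
#include <stddef.h>
#include <stdint.h>

#define VMCS_ENCLS_EXITING_BITMAP         0x0000202eu  /* 64-bit field encoding */
#define SECONDARY_EXEC_ENABLE_ENCLS_EXIT  (1u << 15)

/* Bit n traps ENCLS leaf n for n < 63; bit 63 traps every leaf >= 63. */
static uint64_t encls_exiting_bitmap(const unsigned int *leaves, size_t n)
{
    uint64_t bitmap = 0;

    for (size_t i = 0; i < n; i++)
        bitmap |= 1ULL << (leaves[i] < 63 ? leaves[i] : 63);
    return bitmap;
}
```

The result would be written to the VMCS field above, with SECONDARY_EXEC_ENABLE_ENCLS_EXIT set in the secondary controls for the bitmap to take effect.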
> 
> Additionally, two new bits are added to indicate whether a VMEXIT (of any
> kind) is from an enclave. The two bits below are set if the VMEXIT is from an
> enclave:
>     - Bit 27 in the Exit reason field of Basic VM-exit information.
>     - Bit 4 in the Interruptibility State of Guest Non-Register State of VMCS.
>
> Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information, and
> 27.3.4 Saving Non-Register State.
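A VM exit handler could test the two new bits like this. Minimal sketch; the names are hypothetical, and the basic exit reason is taken from bits 15:0 of the exit reason field.

```c
#include <stdbool.h>
#include <stdint.h>

#define VMX_EXIT_REASONS_FROM_ENCLAVE (1u << 27)   /* exit reason field */
#define VMX_INTR_SHADOW_ENCLAVE_INTR  (1u << 4)    /* interruptibility state */

/* Bits 15:0 of the exit reason field hold the basic exit reason. */
static unsigned int vmexit_basic_reason(uint32_t exit_reason)
{
    return exit_reason & 0xffffu;
}

static bool vmexit_from_enclave(uint32_t exit_reason)
{
    return (exit_reason & VMX_EXIT_REASONS_FROM_ENCLAVE) != 0;
}

static bool guest_intr_state_in_enclave(uint32_t intr_state)
{
    return (intr_state & VMX_INTR_SHADOW_ENCLAVE_INTR) != 0;
}
```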
> 
> 1.5.2 Interaction with XSAVE
> 
> SGX defines a sub-field called X-Feature Request Mask (XFRM) in the attributes
> field of SECS. On enclave entry, SGX HW verifies that the features requested
> in SECS.ATTRIBUTES.XFRM are already enabled in XCR0.
>
> Upon AEX, SGX saves the processor extended state and miscellaneous state to
> the enclave's state-save area (SSA), and clears the secrets from the processor
> extended state used by the enclave (to prevent leaking secrets).
>
> Refer to 42.7 Interaction with Processor Extended State and Miscellaneous
> State.
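The XCR0 condition on entry amounts to a subset test on the XFRM mask. A sketch with hypothetical helper names; it does not model the other checks EENTER performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* XFRM bits that the current XCR0 does not enable (hypothetical helper). */
static uint64_t xfrm_missing_in_xcr0(uint64_t xfrm, uint64_t xcr0)
{
    return xfrm & ~xcr0;
}

/* Entry is only allowed when every XFRM bit is enabled in XCR0. */
static bool xfrm_enabled_in_xcr0(uint64_t xfrm, uint64_t xcr0)
{
    return xfrm_missing_in_xcr0(xfrm, xcr0) == 0;
}
```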
> 
> 1.5.3 Interaction with S state
> 
> When the processor goes into S3-S5 state, EPC is destroyed, and consequently
> all enclaves are destroyed as well.
> 
> Refer to 42.14 Interaction with S States.
> 
> 2. SGX Virtualization Design
> 
> 2.1 High Level Toolstack Changes:
> 
> 2.1.1 New 'sgx' XL configure file parameter
> 
> EPC is a limited resource. In order to use EPC efficiently among all domains,
> the administrator should be able to specify a domain's virtual EPC size when
> creating a guest. The admin should also be able to get all domains' virtual
> EPC sizes.
>
> For SGX Launch Control virtualization, we should allow the admin to create a
> VM with the VM's virtual IA32_SGXLEPUBKEYHASHn either locked or unlocked, and
> we should also allow the admin to create a VM with a custom
> IA32_SGXLEPUBKEYHASHn value.
>
> For the above purposes, the following new 'sgx' XL configure file parameter
> is added:
> 
> 	sgx = 'epc=<size>,lehash=<sha256-hash>,lewr=<0|1>'
> 
> 'epc' specifies the VM's EPC size in MB and is mandatory.
>
> When the physical machine is in *locked* mode, neither 'lehash' nor 'lewr'
> can be specified, as the physical machine is unable to change
> IA32_SGXLEPUBKEYHASHn at runtime. Adding either 'lehash' or 'lewr' causes VM
> creation to fail in that case. Otherwise, the VM's initial
> IA32_SGXLEPUBKEYHASHn value is set to the value of the physical MSRs.
>
> When the physical machine is in *unlocked* mode, the VM's initial
> IA32_SGXLEPUBKEYHASHn value will be set to 'lehash' if specified, or to
> Intel's default value otherwise. The VM's SGX_LAUNCH_CONTROL_ENABLE bit in
> IA32_FEATURE_CONTROL will be set or cleared depending on whether 'lewr' is
> specified (or explicitly set to true or false).
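A toolstack-side parser for the 'sgx' string, together with the locked-mode rule above, might look like this. It is a sketch only: `struct sgx_cfg`, `parse_sgx_cfg()` and `sgx_cfg_allowed()` are hypothetical and not libxl's actual parsing code, and it only checks that 'lehash' is 64 hex digits.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of: sgx = 'epc=<size>,lehash=<sha256-hash>,lewr=<0|1>' */
struct sgx_cfg {
    unsigned long epc_mb;   /* mandatory, in MB */
    bool has_lehash;
    char lehash[65];        /* 64 hex digits + NUL */
    bool has_lewr;
    bool lewr;
};

/* Returns 0 on success, -1 on an unknown/malformed key or a missing 'epc'. */
static int parse_sgx_cfg(const char *s, struct sgx_cfg *cfg)
{
    bool has_epc = false;

    memset(cfg, 0, sizeof(*cfg));
    while (*s) {
        const char *end = strchr(s, ',');
        size_t len = end ? (size_t)(end - s) : strlen(s);

        if (len > 4 && strncmp(s, "epc=", 4) == 0) {
            char *stop;

            if (!isdigit((unsigned char)s[4]))
                return -1;
            cfg->epc_mb = strtoul(s + 4, &stop, 10);
            if (stop != s + len || cfg->epc_mb == 0)
                return -1;
            has_epc = true;
        } else if (len == 7 + 64 && strncmp(s, "lehash=", 7) == 0) {
            for (size_t i = 0; i < 64; i++)
                if (!isxdigit((unsigned char)s[7 + i]))
                    return -1;
            memcpy(cfg->lehash, s + 7, 64);
            cfg->lehash[64] = '\0';
            cfg->has_lehash = true;
        } else if (len == 6 && strncmp(s, "lewr=", 5) == 0 &&
                   (s[5] == '0' || s[5] == '1')) {
            cfg->lewr = (s[5] == '1');
            cfg->has_lewr = true;
        } else {
            return -1;
        }
        s = end ? end + 1 : s + len;
    }
    return has_epc ? 0 : -1;
}

/* In *locked* host mode, neither 'lehash' nor 'lewr' may be given. */
static bool sgx_cfg_allowed(const struct sgx_cfg *cfg, bool host_unlocked)
{
    return host_unlocked || (!cfg->has_lehash && !cfg->has_lewr);
}
```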
> 
> Please also refer to 2.2.4 Launch Control Support.
> 
> 2.1.2 New XL commands (?)
> 
> The administrator should be able to get the physical EPC size and all domains'
> virtual EPC sizes. For this purpose, we can introduce 2 additional commands:
> 
>     # xl sgxinfo
> 
> Which will print out physical EPC size, and other SGX info (such as SGX1, SGX2,
> etc) if necessary.
> 
>     # xl sgxlist <did>
> 
> Which will print out particular domain's virtual EPC size, or list all virtual
> EPC sizes for all supported domains.
> 
> Alternatively, we can extend existing XL commands by adding a new option:
> 
>     # xl info -sgx
> 
> Which will print out physical EPC size along with other physinfo. And
> 
>     # xl list <did> -sgx
> 
> Which will print out domain's virtual EPC size.
> 
> Comments?
> 
> In this RFC the two new commands are not implemented yet.
> 
> 2.1.3 Notify domain's virtual EPC base and size to Xen
> 
> Xen needs to know the guest's EPC base and size in order to populate EPC pages
> for it. The toolstack notifies Xen of the EPC base and size via
> XEN_DOMCTL_set_cpuid.
> 
> 2.2 High Level Xen Hypervisor Changes:
> 
> 2.2.1 EPC Management
> 
> The Xen hypervisor needs to detect SGX, discover EPC, and manage EPC before
> supporting SGX for guests. EPC is detected via SGX CPUID 0x12.0x2. It's
> possible that there are multiple EPC sections (enumerated via sub-leaves 0x3
> and so on, until an invalid EPC section is reported), but this is typically
> the case on multi-socket servers, on which each package has its own EPC.
>
> EPC is reported as reserved memory (so it is not reported as normal memory).
> EPC must be managed in 4K pages. CPU hardware uses EPCM to track the status of
> each EPC page. Xen needs to manage EPC and provide functions to, e.g.,
> allocate and free EPC pages for guests.
>
> Although on physical machines (at least existing ones) EPC is typically
> ~100M in size at most, we cannot assume the EPC size. Thus it's better to
> integrate EPC management into Xen's memory management framework to take
> advantage of Xen's existing memory management algorithms.
>
> Specifically, one 'struct page_info' will be created for each EPC page, just
> like for normal memory, and a new flag will be defined to identify whether a
> 'struct page_info' describes EPC or normal memory. The existing memory
> allocation API alloc_domheap_pages will be reused to allocate EPC pages, by
> adding a new memflag 'MEMF_epc' to indicate EPC allocation rather than memory
> allocation. The new 'MEMF_epc' can also be used for EPC ballooning (if
> required in the future), as with the new flag, the existing
> XENMEM_{increase,decrease}_reservation and XENMEM_populate_physmap can be
> reused for EPC as well.
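To illustrate the idea of keeping EPC in its own zone behind the same allocation API, here is a toy model with one free list for RAM and one for EPC. It is purely illustrative: the `struct page_info` below is a stand-in, not Xen's, and the flag values and `toy_*` functions are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PGC_epc   (1u << 0)   /* hypothetical: page_info describes an EPC page */
#define MEMF_epc  (1u << 0)   /* hypothetical: allocation request is for EPC */

/* Toy stand-in for Xen's struct page_info. */
struct page_info {
    uint32_t count_info;
    struct page_info *next;   /* free-list link */
};

/* Toy heap: EPC pages live on their own free list ("zone"). */
struct toy_heap {
    struct page_info *free_ram;
    struct page_info *free_epc;
};

static void toy_free_page(struct toy_heap *h, struct page_info *pg)
{
    struct page_info **list = (pg->count_info & PGC_epc) ? &h->free_epc
                                                          : &h->free_ram;
    pg->next = *list;
    *list = pg;
}

/* Same entry point for both kinds; the memflag selects the zone. */
static struct page_info *toy_alloc_page(struct toy_heap *h, unsigned int memflags)
{
    struct page_info **list = (memflags & MEMF_epc) ? &h->free_epc
                                                     : &h->free_ram;
    struct page_info *pg = *list;

    if (pg)
        *list = pg->next;
    return pg;
}
```

The point of the design is that a normal allocation can never be satisfied from EPC and vice versa, while callers keep using one API.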
> 
> 2.2.2 EPC Virtualization
> 
> This part is about how to populate EPC for guests. We have 3 choices:
>     - Static Partitioning
>     - Oversubscription
>     - Ballooning
> 
> Static Partitioning means all EPC pages are allocated and mapped to the guest
> when it is created, and there's no runtime change of page table mappings for
> EPC pages. Oversubscription means the Xen hypervisor supports EPC page
> swapping between domains, meaning Xen is able to evict an EPC page from
> another domain and assign it to the domain that needs it. With
> oversubscription, EPC can be assigned to a domain on demand, when an EPT
> violation happens. Ballooning is similar to memory ballooning: it is basically
> "Static Partitioning" + "Balloon driver" in the guest.
>
> Static Partitioning is the easiest to implement, and there is no hypervisor
> overhead (except EPT overhead, of course), because with "Static Partitioning"
> there is no EPT violation for EPC, and Xen doesn't need to turn on ENCLS
> VMEXIT for the guest, as ENCLS runs perfectly in non-root mode.
>
> Ballooning is "Static Partitioning" + "Balloon driver" in the guest. Like
> "Static Partitioning", ballooning doesn't need to turn on ENCLS VMEXIT, and
> doesn't have EPT violations for EPC either. To support ballooning, we need a
> balloon driver in the guest to issue hypercalls to give up or reclaim EPC
> pages. In terms of hypercalls, we have two choices: 1) add a new hypercall for
> EPC ballooning; 2) use the existing XENMEM_{increase/decrease}_reservation
> with a new memory flag, i.e., XENMEMF_epc. I'll discuss whether to add a
> dedicated hypercall later.
> 
> Oversubscription looks nice but requires a more complicated implementation.
> Firstly, as explained in 1.3.3 EPC Eviction and Reload, we need to follow
> specific steps to evict EPC pages, and in order to do that, Xen basically
> needs to trap ENCLS from guests and keep track of EPC page status and enclave
> info from all guests. This is because:
>     - To evict a regular EPC page, Xen needs to know the SECS location.
>     - Xen needs to know the EPC page type: evicting a regular EPC page and
>       evicting a SECS or VA page have different steps.
>     - Xen needs to know the EPC page status: whether the page is blocked or
>       not.
>
> This info can only be obtained by trapping ENCLS from the guest and parsing
> its parameters (to identify the SECS page, etc). Parsing ENCLS parameters
> means we need to know which ENCLS leaf is being trapped, and we need to
> translate the guest's virtual addresses to physical addresses in order to
> locate the EPC page. And once ENCLS is trapped, we have to emulate ENCLS in
> Xen, which means we need to reconstruct the ENCLS parameters by remapping all
> of the guest's virtual addresses to Xen's virtual addresses
> (gva->gpa->pa->xen_va), as ENCLS always uses *effective addresses*, which are
> translated by the processor when running ENCLS.
> 
>     --------------------------------------------------------------
>                 |   ENCLS   |
>     --------------------------------------------------------------
>                 |          /|\
>     ENCLS VMEXIT|           | VMENTRY
>                 |           |
>                \|/          |
> 
> 		1) parse ENCLS parameters
> 		2) reconstruct(remap) guest's ENCLS parameters
> 		3) run ENCLS on behalf of guest (and skip ENCLS)
> 		4) on success, update EPC/enclave info, or inject error
> 
> And Xen needs to maintain each EPC page's status (type, blocked or not, in an
> enclave or not, etc). Xen also needs to maintain the info of all enclaves from
> all guests, in order to find the correct SECS for a regular EPC page, and the
> enclave's linear address as well.
>
> So in general, "Static Partitioning" has the simplest implementation, but is
> obviously not the most efficient way to use EPC; "Ballooning" has all the pros
> of "Static Partitioning" but requires a guest balloon driver;
> "Oversubscription" is best in terms of flexibility but requires a complicated
> hypervisor implementation.
>
> We will start with "Static Partitioning". If "Ballooning" is required in the
> future, we will support it. "Oversubscription" should not be needed in the
> foreseeable future.
> 
> 2.2.3 Populate EPC for Guest
> 
> The toolstack notifies Xen of the domain's EPC base and size via
> XEN_DOMCTL_set_cpuid, so currently Xen populates all EPC pages for the guest
> while handling XEN_DOMCTL_set_cpuid, specifically for CPUID.0x12.0x2. Once Xen
> has checked that the values passed from the toolstack are valid, Xen
> allocates all EPC pages and sets up EPT mappings for the guest.
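The validity check mentioned above could start with something like this. A sketch under the assumption that both base and size must be 4K aligned and must not wrap around; `guest_epc_pages()` is hypothetical. For example, a 64MB EPC yields 64 * 256 = 16384 pages to allocate and map.

```c
#include <stdint.h>

#define EPC_PAGE_SHIFT 12

/* Hypothetical check: return the number of 4K EPC pages, or 0 if invalid. */
static uint64_t guest_epc_pages(uint64_t base_gpa, uint64_t size_bytes)
{
    const uint64_t mask = (1ULL << EPC_PAGE_SHIFT) - 1;

    if (size_bytes == 0 || (base_gpa & mask) || (size_bytes & mask))
        return 0;
    if (base_gpa + size_bytes < base_gpa)   /* range wraps around */
        return 0;
    return size_bytes >> EPC_PAGE_SHIFT;
}
```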
> 
> 2.2.4 Launch Control Support
> 
> To support running multiple domains, each running its own LE signed by a
> different owner, the physical machine's BIOS must leave IA32_SGXLEPUBKEYHASHn
> *unlocked* before handing over to Xen. Xen will trap the domain's writes to
> IA32_SGXLEPUBKEYHASHn, keep the values in the vcpu internally, and write them
> to the physical MSRs when the vcpu is scheduled in. This guarantees that when
> EINIT runs in the guest, the guest's virtual IA32_SGXLEPUBKEYHASHn have been
> written to the physical MSRs.
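The schedule-in path could be sketched as below. Caching the last values written on each physical CPU to skip redundant WRMSRs is an assumed refinement, not something the design above specifies; `fake_wrmsr_lehash()` stands in for a real WRMSR so the sketch can run anywhere, and all names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-vcpu and per-pcpu state. */
struct vcpu_sgx { uint64_t lehash[4]; };
struct pcpu_sgx { uint64_t cached_lehash[4]; unsigned int msr_writes; };

/* Stand-in for wrmsr(MSR_IA32_SGXLEPUBKEYHASH0 + i, val); counts writes. */
static void fake_wrmsr_lehash(struct pcpu_sgx *p, int i, uint64_t val)
{
    p->cached_lehash[i] = val;
    p->msr_writes++;
}

/* On schedule-in, only write MSRs whose value differs from the cache. */
static void sgx_ctxt_switch_to(struct pcpu_sgx *p, const struct vcpu_sgx *v)
{
    for (int i = 0; i < 4; i++)
        if (p->cached_lehash[i] != v->lehash[i])
            fake_wrmsr_lehash(p, i, v->lehash[i]);
}
```

Such a cache is only safe if nothing else writes the physical MSRs behind Xen's back.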
> 
> The SGX_LAUNCH_CONTROL_ENABLE bit in the guest's IA32_FEATURE_CONTROL is
> controlled by the newly added 'lewr' XL parameter (see 2.1.1 New 'sgx' XL
> configure file parameter).
>
> If the physical IA32_SGXLEPUBKEYHASHn are *locked* in the machine's BIOS, then
> only MSR reads are allowed from the guest, and Xen will inject an error for
> the guest's MSR writes.
>
> In addition, if the physical IA32_SGXLEPUBKEYHASHn are *locked*, then
> creating a guest with the 'lehash' or 'lewr' parameter will fail, as in that
> case Xen is not able to propagate the guest's virtual IA32_SGXLEPUBKEYHASHn to
> the physical MSRs.
>
> If the physical IA32_SGXLEPUBKEYHASHn are not available
> (CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL is not present), then creating a VM with
> 'lehash' or 'lewr' will also fail. In addition, any MSR read/write of
> IA32_SGXLEPUBKEYHASHn from the guest is invalid, and Xen will inject an error
> in that case.
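The guest MSR access rules above can be summarized as a small decision function. A sketch with hypothetical names; it assumes the injected error is #GP, and it additionally assumes that in unlocked mode a guest write is refused when the guest's virtual SGX_LAUNCH_CONTROL_ENABLE ('lewr') is clear, mirroring hardware behaviour after IA32_FEATURE_CONTROL is locked.

```c
#include <stdbool.h>

enum lehash_mode {
    LEHASH_UNAVAILABLE,  /* CPUID SGX_LAUNCH_CONTROL not present */
    LEHASH_LOCKED,       /* physical MSRs read-only */
    LEHASH_UNLOCKED,     /* physical MSRs writable at runtime */
};

enum msr_result { MSR_OK, MSR_INJECT_GP };

static enum msr_result
guest_lehash_msr_access(enum lehash_mode mode, bool is_write,
                        bool guest_lc_enabled)
{
    switch (mode) {
    case LEHASH_UNAVAILABLE:
        return MSR_INJECT_GP;                      /* no reads or writes */
    case LEHASH_LOCKED:
        return is_write ? MSR_INJECT_GP : MSR_OK;  /* read-only */
    case LEHASH_UNLOCKED:
        if (!is_write)
            return MSR_OK;
        /* Assumption: honour the guest's virtual SGX_LAUNCH_CONTROL_ENABLE. */
        return guest_lc_enabled ? MSR_OK : MSR_INJECT_GP;
    }
    return MSR_INJECT_GP;
}
```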
> 
> 2.2.5 CPUID Emulation
> 
> Most of the native SGX CPUID info can be exposed to the guest, except for the
> following two parts:
>     - Sub-leaf 0x2 needs to report the domain's virtual EPC base and size,
>       instead of the physical EPC info.
>     - Sub-leaf 0x1 needs to be consistent with the guest's XCR0. For the
>       reason, please refer to 1.5.2 Interaction with XSAVE.
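For sub-leaf 0x1, the XFRM half of SECS.ATTRIBUTES is reported in ECX:EDX (attribute bits 127:64), so the fixup is a mask with the XCR0 bits the guest may enable. A sketch; `struct cpuid_leaf` and the function name are hypothetical.

```c
#include <stdint.h>

struct cpuid_leaf { uint32_t a, b, c, d; };   /* hypothetical */

/*
 * CPUID.0x12.0x1: EAX:EBX = SECS.ATTRIBUTES[63:0], ECX:EDX = [127:64] (XFRM).
 * Only advertise XFRM bits that the guest may enable in XCR0.
 */
static void sgx_cpuid_leaf1_fixup(struct cpuid_leaf *l, uint64_t guest_xcr0_mask)
{
    l->c &= (uint32_t)guest_xcr0_mask;
    l->d &= (uint32_t)(guest_xcr0_mask >> 32);
}
```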
> 
> 2.2.6 EPT Violation & ENCLS Trapping Handling
> 
> Only needed when Xen supports EPC Oversubscription, as explained above.
> 
> 2.2.7 Guest Suspend & Resume
> 
> On hardware, EPC is destroyed when power goes to S3-S5, so Xen will destroy
> the guest's EPC when the guest goes into S3-S5. Currently Xen is notified by
> Qemu of S state changes via HVM_PARAM_ACPI_S_STATE, at which point Xen will
> destroy EPC if the S state is S3-S5.
>
> Specifically, Xen will run EREMOVE on each of the guest's EPC pages, because
> the guest may not handle EPC suspend & resume correctly, in which case the
> guest's EPC pages may physically still be valid. Xen therefore needs to run
> EREMOVE to make sure all EPC pages become invalid; otherwise further EPC
> operations in the guest may fault, as the guest assumes all EPC pages are
> invalid after it is resumed.
>
> For a SECS page, EREMOVE may fail with SGX_CHILD_PRESENT, in which case Xen
> will put this SECS page on a list, and call EREMOVE on it again after EREMOVE
> has been called on all other EPC pages. This time EREMOVE on the SECS will
> succeed, as all its children (regular EPC pages) have already been removed.
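The two-pass EREMOVE described above can be simulated without hardware. In this toy model (all names hypothetical, and EREMOVE itself is simulated) a SECS page counts its live children and refuses removal while any remain.

```c
#include <stdbool.h>
#include <stddef.h>

enum { EREMOVE_OK, EREMOVE_CHILD_PRESENT };   /* simulated result codes */

/* Toy model of guest EPC: each page is free, a SECS, or a child of a SECS. */
struct toy_epc_page {
    bool valid;
    bool is_secs;
    int secs;          /* index of the owning SECS, child pages only */
    int children;      /* live child count, SECS pages only */
};

/* Simulated EREMOVE, including the SGX_CHILD_PRESENT rule. */
static int toy_eremove(struct toy_epc_page *epc, int i)
{
    struct toy_epc_page *pg = &epc[i];

    if (!pg->valid)
        return EREMOVE_OK;
    if (pg->is_secs && pg->children > 0)
        return EREMOVE_CHILD_PRESENT;
    if (!pg->is_secs)
        epc[pg->secs].children--;
    pg->valid = false;
    return EREMOVE_OK;
}

/*
 * Pass 1: EREMOVE every page, deferring SECS pages that still have children.
 * Pass 2: retry the deferred SECS pages. Returns the number still failing.
 */
static size_t remove_all_epc(struct toy_epc_page *epc, size_t n, int *deferred)
{
    size_t ndef = 0, failed = 0;

    for (size_t i = 0; i < n; i++)
        if (toy_eremove(epc, (int)i) == EREMOVE_CHILD_PRESENT)
            deferred[ndef++] = (int)i;
    for (size_t k = 0; k < ndef; k++)
        if (toy_eremove(epc, deferred[k]) != EREMOVE_OK)
            failed++;
    return failed;
}
```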
> 
> 2.2.8 Destroying Domain
> 
> Normally Xen just frees all of a domain's EPC pages when it is destroyed. But
> Xen will also run EREMOVE on all of the guest's EPC pages (as described in
> 2.2.7 above) before freeing them, as the guest may shut down unexpectedly
> (e.g., the user kills the guest), in which case the guest's EPC may still be
> valid.
> 
> 2.3 Additional Point: Live Migration, Snapshot Support (?)
> 
> Actually, from the hardware's point of view, SGX is not migratable. There are
> two reasons:
> 
>     - SGX key architecture cannot be virtualized.
> 
>     Some keys are bound to the CPU, for example the Sealing key, the EREPORT
>     key, etc. If the VM is migrated to another machine, the same enclave will
>     derive different keys. Taking the Sealing key as an example: it is
>     typically used by an enclave (which gets it via EGETKEY) to *seal* its
>     secrets to the outside (e.g., persistent storage) for later use. If the
>     Sealing key changes after VM migration, the enclave can never get the
>     sealed secrets back, as the key has changed and the old Sealing key cannot
>     be recovered.
> 
>     - There is no ENCLS leaf that evicts an EPC page to normal memory while, at
>     the same time, keeping its content in EPC. Currently, once an EPC page is
>     evicted, it becomes invalid. So technically, we are unable to implement
>     live migration (or checkpointing, or snapshots) for enclaves.
> 
> But with some workarounds, and given some facts about the existing SGX drivers,
> we are technically able to support live migration (or even checkpointing and
> snapshots). This is because:
> 
>     - Changing key (which is bound to CPU) is not a problem in reality
> 
>     Take the Sealing key as an example. Losing sealed data is not a problem,
>     because the Sealing key is only supposed to encrypt secrets that can be
>     provisioned again. The typical work model is: the enclave gets secrets
>     provisioned from a remote party (the service provider), and uses the
>     Sealing key to store them for later use. When the enclave tries to *unseal*
>     them using the Sealing key and the key has changed, the enclave will find
>     the data corrupted (integrity check failure), so it will ask for the
>     secrets to be provisioned again by the remote party. Another reason is
>     that, in a data center, VMs typically share lots of data, and as the
>     Sealing key is bound to the CPU, data sealed by an enclave on one machine
>     cannot be unsealed by an enclave on another machine. So from the SGX app
>     writer's point of view, the Sealing key should be treated as a changeable
>     key, and loss of sealed data must be handled anyway. The Sealing key should
>     only be used to seal secrets that can easily be provisioned again.
> 
>     For other keys such as the EREPORT key and the provisioning key, which are
>     used for local and remote attestation, losing them is not a problem either,
>     due to the second reason below.
> 
>     - Sudden loss of EPC is not a problem.
> 
>     On hardware, EPC will be lost if the system goes to S3-S5, is reset, or is
>     shut down, and the SGX driver needs to handle loss of EPC due to power
>     transitions. This is done by cooperation between the SGX driver and the
>     userspace SGX SDK/apps. However, during live migration there may be no
>     power transition in the guest, and hence no EPC loss. And since technically
>     we cannot *really* live migrate an enclave (explained above), it looks
>     infeasible. But in fact both the Linux SGX driver and the Windows SGX
>     driver already support *sudden* loss of EPC (not only EPC loss during power
>     transitions), which means both drivers are able to recover if EPC is lost
>     at any point at runtime. With this, we are technically able to support live
>     migration by simply ignoring EPC. After the VM is migrated, the destination
>     VM will only suffer a *sudden* loss of EPC, which both the Windows and the
>     Linux SGX drivers are already able to handle.
> 
>     But we must point out that such *sudden* loss of EPC is not hardware
>     behavior, and SGX drivers for other OSes (such as FreeBSD) may not
>     implement this, so for those guests the destination VM will behave in an
>     unexpected manner. But I am not sure we need to care about other OSes.
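To illustrate the "treat the Sealing key as changeable" work model from the first point above, here is a toy C model with no cryptography at all: a sealed blob just records which (simulated) CPU-bound key it was sealed with, and a key mismatch plays the role of the integrity-check failure. `struct sealed_blob`, `toy_unseal()`, `get_secret()` and `demo_provision()` are hypothetical names, not SGX SDK APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: key_id stands in for the CPU-bound sealing key. */
struct sealed_blob {
    uint64_t key_id;
    char secret[32];
};

static int provision_calls;   /* counts round trips to the "remote" party */

/* Hypothetical stand-in for remote provisioning by the service provider. */
static void demo_provision(char *out, size_t len)
{
    provision_calls++;
    strncpy(out, "fresh-secret", len - 1);
    out[len - 1] = '\0';
}

/* Fails when the key changed, modelling an integrity-check failure. */
static bool toy_unseal(const struct sealed_blob *b, uint64_t cur_key_id,
                       char *out, size_t len)
{
    if (b->key_id != cur_key_id)
        return false;
    strncpy(out, b->secret, len - 1);
    out[len - 1] = '\0';
    return true;
}

/* Unseal if possible, otherwise re-provision and re-seal with the new key. */
static void get_secret(struct sealed_blob *b, uint64_t cur_key_id,
                       void (*provision)(char *, size_t), char *out, size_t len)
{
    if (toy_unseal(b, cur_key_id, out, len))
        return;
    provision(out, len);
    b->key_id = cur_key_id;
    strncpy(b->secret, out, sizeof(b->secret) - 1);
    b->secret[sizeof(b->secret) - 1] = '\0';
}
```

After a migration the key changes, so the first call falls back to provisioning once and later calls unseal locally again.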
> 
> For the same reason, we are able to support checkpointing for SGX guests (only
> Linux and Windows).
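The recovery that this relies on (a driver/SDK stack that tolerates *sudden* EPC loss) can be pictured with the toy model below: an enclave call that fails because the EPC is gone triggers a rebuild of the enclave and a retry. This is only an assumed shape of the behaviour, not code from the Linux or Windows drivers; every name here is hypothetical, and the rebuild step merely stands in for ECREATE/EADD/EINIT:

```c
#include <assert.h>
#include <stdbool.h>

enum ecall_status { ECALL_OK, ECALL_ENCLAVE_LOST };

/* Hypothetical enclave handle; 'generation' counts rebuilds. */
struct enclave {
    bool loaded;
    int generation;
};

/* Stand-in for entering the enclave: fails once its EPC pages are gone. */
static enum ecall_status toy_ecall(const struct enclave *e)
{
    return e->loaded ? ECALL_OK : ECALL_ENCLAVE_LOST;
}

/* Models EPC vanishing underneath the guest, e.g. after migration. */
static void toy_epc_lost(struct enclave *e)
{
    e->loaded = false;
}

/* Stand-in for re-creating the enclave from scratch. */
static void toy_reload(struct enclave *e)
{
    e->loaded = true;
    e->generation++;
}

/* Retry after rebuilding the enclave, at most max_reloads times. */
static enum ecall_status ecall_with_recovery(struct enclave *e, int max_reloads)
{
    enum ecall_status st = toy_ecall(e);

    for (int i = 0; i < max_reloads && st == ECALL_ENCLAVE_LOST; i++) {
        toy_reload(e);
        st = toy_ecall(e);
    }
    return st;
}
```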
> 
> For snapshots, we can support snapshotting an SGX guest by either:
> 
>     - Suspending the guest (S3-S5) before the snapshot. This works for all
>       guests but requires the user to manually suspend the guest.
>     - Issuing a hypercall to destroy the guest's EPC in save_vm. This only
>       works for Linux and Windows but doesn't require user intervention.
> 
> What are your comments?
> 
> 3. Reference
> 
>     - Intel SGX Homepage
>     https://software.intel.com/en-us/sgx
> 
>     - Linux SGX SDK
>     https://01.org/intel-software-guard-extensions
> 
>     - Linux SGX driver for upstreaming
>     https://github.com/01org/linux-sgx
> 
>     - Intel SGX Specification (SDM Vol 3D)
>     https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf
> 
>     - Paper: Intel SGX Explained
>     https://eprint.iacr.org/2016/086.pdf
> 
>     - ISCA 2015 tutorial slides for Intel® SGX - Intel® Software
>     https://software.intel.com/sites/default/files/332680-002.pdf
> 
> Boqun Feng (5):
>   xen: mm: introduce non-scrubbable pages
>   xen: mm: manage EPC pages in Xen heaps
>   xen: x86/mm: add SGX EPC management
>   xen: x86: add functions to populate and destroy EPC for domain
>   xen: tools: add SGX to applying MSR policy
> 
> Kai Huang (12):
>   xen: x86: expose SGX to HVM domain in CPU featureset
>   xen: x86: add early stage SGX feature detection
>   xen: vmx: detect ENCLS VMEXIT
>   xen: x86/mm: introduce ioremap_wb()
>   xen: p2m: new 'p2m_epc' type for EPC mapping
>   xen: x86: add SGX cpuid handling support.
>   xen: vmx: handle SGX related MSRs
>   xen: vmx: handle ENCLS VMEXIT
>   xen: vmx: handle VMEXIT from SGX enclave
>   xen: x86: reset EPC when guest got suspended.
>   xen: tools: add new 'sgx' parameter support
>   xen: tools: add SGX to applying CPUID policy
> 
>  docs/misc/xen-command-line.markdown         |   8 +
>  tools/libxc/Makefile                        |   1 +
>  tools/libxc/include/xc_dom.h                |   4 +
>  tools/libxc/include/xenctrl.h               |  16 +
>  tools/libxc/xc_cpuid_x86.c                  |  68 ++-
>  tools/libxc/xc_msr_x86.h                    |  10 +
>  tools/libxc/xc_sgx.c                        |  82 +++
>  tools/libxl/libxl.h                         |   3 +-
>  tools/libxl/libxl_cpuid.c                   |  15 +-
>  tools/libxl/libxl_create.c                  |  10 +
>  tools/libxl/libxl_dom.c                     |  65 ++-
>  tools/libxl/libxl_internal.h                |   2 +
>  tools/libxl/libxl_nocpuid.c                 |   4 +-
>  tools/libxl/libxl_types.idl                 |  11 +
>  tools/libxl/libxl_x86.c                     |  12 +
>  tools/ocaml/libs/xc/xenctrl_stubs.c         |  11 +-
>  tools/python/xen/lowlevel/xc/xc.c           |  11 +-
>  tools/xl/xl_parse.c                         |  86 +++
>  tools/xl/xl_parse.h                         |   1 +
>  xen/arch/x86/Makefile                       |   1 +
>  xen/arch/x86/cpu/common.c                   |  15 +
>  xen/arch/x86/cpuid.c                        |  62 ++-
>  xen/arch/x86/domctl.c                       |  87 ++-
>  xen/arch/x86/hvm/hvm.c                      |   3 +
>  xen/arch/x86/hvm/vmx/vmcs.c                 |  16 +-
>  xen/arch/x86/hvm/vmx/vmx.c                  |  68 +++
>  xen/arch/x86/hvm/vmx/vvmx.c                 |  11 +
>  xen/arch/x86/mm.c                           |   9 +-
>  xen/arch/x86/mm/p2m-ept.c                   |   3 +
>  xen/arch/x86/mm/p2m.c                       |  41 ++
>  xen/arch/x86/msr.c                          |   6 +-
>  xen/arch/x86/sgx.c                          | 815 ++++++++++++++++++++++++++++
>  xen/common/page_alloc.c                     |  39 +-
>  xen/include/asm-arm/mm.h                    |   9 +
>  xen/include/asm-x86/cpufeature.h            |   4 +
>  xen/include/asm-x86/cpuid.h                 |  29 +-
>  xen/include/asm-x86/hvm/hvm.h               |   3 +
>  xen/include/asm-x86/hvm/vmx/vmcs.h          |   8 +
>  xen/include/asm-x86/hvm/vmx/vmx.h           |   3 +
>  xen/include/asm-x86/mm.h                    |  19 +-
>  xen/include/asm-x86/msr-index.h             |   6 +
>  xen/include/asm-x86/msr.h                   |   5 +
>  xen/include/asm-x86/p2m.h                   |  12 +-
>  xen/include/asm-x86/sgx.h                   |  86 +++
>  xen/include/public/arch-x86/cpufeatureset.h |   3 +-
>  xen/include/xen/mm.h                        |   2 +
>  xen/tools/gen-cpuid.py                      |   3 +
>  47 files changed, 1757 insertions(+), 31 deletions(-)
>  create mode 100644 tools/libxc/xc_sgx.c
>  create mode 100644 xen/arch/x86/sgx.c
>  create mode 100644 xen/include/asm-x86/sgx.h
> 
> -- 
> 2.15.0
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel