* [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches
@ 2017-12-04 0:15 Boqun Feng
2017-12-04 0:15 ` [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset Boqun Feng
` (17 more replies)
0 siblings, 18 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
Hi all,
This is v2 of the RFC SGX virtualization design and draft patches; you can
find v1 at:
https://lists.gt.net/xen/devel/483404
In this new version, I fixed a few things according to the feedback on the
previous version (mostly cleanups and code movement).
Besides, Kai and I redesigned the SGX MSR setup part and introduced the
new XL parameters 'lehash' and 'lewr'.
Another big change is that I modified the EPC management to fit EPC pages
into 'struct page_info'; in patches #6 and #7, unscrubbable pages,
'PGC_epc', 'MEMF_epc' and 'XENZONE_EPC' are introduced, so that EPC
management is fully integrated into Xen's existing memory management.
This might be the controversial bit, so patches 6~8 are simply meant to show
the idea and drive deeper discussion.
Detailed changes since v1: (modifications tagged "[New]" are entirely new
in this series; reviews and comments are highly welcome for those parts)
* Make SGX related mostly common for x86 by: 1) moving sgx.[ch] to
arch/x86/ and include/asm-x86/ and 2) renaming EPC related functions
with domain_* prefix.
* Rename ioremap_cache() with ioremap_wb() and make it x86-specific as
suggested by Jan Beulich.
* Remove percpu sgx_cpudata; during bootup, secondary CPUs now check
whether they read a different value than the boot CPU, and if so SGX is
disabled.
* Remove domain_has_sgx_{,launch_control}, and make sure we can
rely on domain's arch.cpuid->feat.sgx{_lc} for setting checks.
* Cleanup the code for CPUID handling as suggested by Andrew Cooper.
* Adjust to the msr_policy framework for SGX MSRs handling, and remove
unnecessary fields like 'readable' and 'writable'.
* Use 'page_info' to maintain EPC pages, and [New] add a draft
implementation employing the Xen heap for EPC page management. Please
see patches 6~8.
* [New] Modify the XL parameter for SGX, please see section 2.1.1 in
the updated design doc.
* [New] Use the _set_vcpu_msrs hypercall in the toolstack to set the SGX
related MSRs. Please see patch #17.
* ACPI related tool changes are temporarily dropped in this patchset,
as I need more time to resolve the comments and do related tests.
The updated design doc is as follows. As in the previous version, there are
some particular points in the design for which we don't know which
implementation is better; a question mark (?) is added to the right of those
menu items. For SGX live migration, thanks to Wei Liu for commenting in the
previous version's review that it would be nice to support if we can, but
we'd like to hear more from you, so we still put a question mark on this
item. Your comments on those "question mark (?)" parts (and other comments
as well, of course) are highly appreciated.
===================================================================
1. SGX Introduction
1.1 Overview
1.1.1 Enclave
1.1.2 EPC (Enclave Page Cache)
1.1.3 ENCLS and ENCLU
1.2 Discovering SGX Capability
1.2.1 Enumerate SGX via CPUID
1.2.2 Intel SGX Opt-in Configuration
1.3 Enclave Life Cycle
1.3.1 Constructing & Destroying Enclave
1.3.2 Enclave Entry and Exit
1.3.2.1 Synchronous Entry and Exit
1.3.2.2 Asynchronous Enclave Exit
1.3.3 EPC Eviction and Reload
1.4 SGX Launch Control
1.5 SGX Interaction with IA32 and Intel 64 Architecture
2. SGX Virtualization Design
2.1 High Level Toolstack Changes
2.1.1 New 'sgx' XL configure file parameter
2.1.2 New XL commands (?)
2.1.3 Notify domain's virtual EPC base and size to Xen
2.2 High Level Hypervisor Changes
2.2.1 EPC Management
2.2.2 EPC Virtualization
2.2.3 Populate EPC for Guest
2.2.4 Launch Control Support
2.2.5 CPUID Emulation
2.2.6 EPT Violation & ENCLS Trapping Handling
2.2.7 Guest Suspend & Resume
2.2.8 Destroying Domain
2.3 Additional Point: Live Migration, Snapshot Support (?)
3. Reference
1. SGX Introduction
1.1 Overview
1.1.1 Enclave
Intel Software Guard Extensions (SGX) is a set of instructions and memory
access mechanisms providing secure access for sensitive applications and
data. SGX allows an application to use a particular part of its address
space as an *enclave*, which is a protected area providing confidentiality
and integrity even in the presence of privileged malware. Accesses to the
enclave memory area from any software not resident in the enclave are
prevented, including those from privileged software. The diagram below
illustrates an enclave within an application.
|-----------------------|
| |
| |---------------| |
| | OS kernel | | |-----------------------|
| |---------------| | | |
| | | | | |---------------| |
| |---------------| | | | Entry table | |
| | Enclave |---|-----> | |---------------| |
| |---------------| | | | Enclave stack | |
| | App code | | | |---------------| |
| |---------------| | | | Enclave heap | |
| | Enclave | | | |---------------| |
| |---------------| | | | Enclave code | |
| | App code | | | |---------------| |
| |---------------| | | |
| | | |-----------------------|
|-----------------------|
SGX supports SGX1 and SGX2 extensions. SGX1 provides basic enclave support,
and SGX2 allows additional flexibility in runtime management of enclave
resources and thread execution within an enclave.
1.1.2 EPC (Enclave Page Cache)
Just like normal application memory management, enclave memory management
can be divided into two parts: address space allocation and memory
commitment. Address space allocation means allocating a particular range of
linear address space for the enclave. Memory commitment means assigning
actual resources to the enclave.
The Enclave Page Cache (EPC) is the physical resource used to commit to
enclaves. EPC is divided into 4K pages: an EPC page is 4K in size and always
aligned to a 4K boundary. Hardware performs additional access control checks
to restrict access to EPC pages. The Enclave Page Cache Map (EPCM) is a
secure structure which holds one entry for each EPC page and is used by
hardware to track the status of each EPC page (it is invisible to software).
Typically EPC and EPCM are reserved by BIOS as Processor Reserved Memory,
but the actual amount, size, and layout of EPC are model-specific and depend
on BIOS settings. EPC is enumerated via the new SGX CPUID leaf, and is
reported as reserved memory.
EPC pages can be either invalid or valid. There are 4 valid EPC page types
in SGX1: regular EPC page, SGX Enclave Control Structure (SECS) page, Thread
Control Structure (TCS) page, and Version Array (VA) page. SGX2 adds the
Trimmed EPC page. Each enclave is associated with one SECS page, and each
thread in an enclave is associated with one TCS page. VA pages are used in
EPC page eviction and reload. The Trimmed EPC page type is introduced in
SGX2 for when a particular 4K page of an enclave is going to be freed
(trimmed) at runtime after the enclave has been initialized.
1.1.3 ENCLS and ENCLU
Two new instructions, ENCLS and ENCLU, are introduced to manage enclaves and
EPC. ENCLS can only run in ring 0, while ENCLU can only run in ring 3. Both
ENCLS and ENCLU have multiple leaf functions, with EAX indicating the
specific leaf function.
SGX1 supports the following ENCLS and ENCLU leaves:
ENCLS:
- ECREATE, EADD, EEXTEND, EINIT, EREMOVE (Enclave build and destroy)
- EPA, EBLOCK, ETRACK, EWB, ELDU/ELDB (EPC eviction & reload)
ENCLU:
- EENTER, EEXIT, ERESUME (Enclave entry, exit, re-enter)
- EGETKEY, EREPORT (SGX key derivation, attestation)
Additionally, SGX2 supports the following ENCLS and ENCLU leaves for adding
and removing EPC pages to/from an enclave at runtime, after the enclave has
been initialized, along with changing page permissions.
ENCLS:
- EAUG, EMODT, EMODPR
ENCLU:
- EACCEPT, EACCEPTCOPY, EMODPE
The VMM is able to interfere with ENCLS running in a guest (see 1.5.1 VMX
Changes for Supporting SGX Virtualization) but is unable to interfere with
ENCLU.
1.2 Discovering SGX Capability
1.2.1 Enumerate SGX via CPUID
If CPUID.0x7.0:EBX.SGX (bit 2) is 1, then the processor supports SGX, and
SGX capabilities and resources can be enumerated via the new SGX CPUID leaf
(0x12). CPUID.0x12.0x0 reports SGX capabilities, such as the presence of
SGX1 and SGX2, and the enclave's maximum size for both 32-bit and 64-bit
applications. CPUID.0x12.0x1 reports the bits that may be set in
SECS.ATTRIBUTES. CPUID.0x12.0x2 reports the EPC resource's base and size.
A platform may support multiple EPC sections, and CPUID.0x12.0x3 and further
sub-leaves can be used to detect the existence of multiple EPC sections
(until CPUID reports an invalid EPC section).
Refer to 37.7.2 Intel SGX Resource Enumeration Leaves for full description of
SGX CPUID 0x12.
1.2.2 Intel SGX Opt-in Configuration
On processors that support Intel SGX, IA32_FEATURE_CONTROL also provides the
SGX_ENABLE bit (bit 18) to turn SGX on or off. Before system software can
enable and use SGX, BIOS is required to set IA32_FEATURE_CONTROL.SGX_ENABLE
= 1 to opt in to SGX.
Setting SGX_ENABLE follows the rules of IA32_FEATURE_CONTROL.LOCK (bit 0).
Software is considered to have opted into Intel SGX if and only if
IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK are both set
to 1.
The setting of IA32_FEATURE_CONTROL.SGX_ENABLE (bit 18) is not reflected in
the SGX CPUID leaves. Enclave instructions will behave differently according
to the value of CPUID.0x7.0x0:EBX.SGX and whether BIOS has opted in to SGX.
Refer to 37.7.1 Intel SGX Opt-in Configuration for more information.
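For illustration, the opt-in check then boils down to something like the
snippet below (mirroring the check done by detect_sgx() in patch #2):

    /* SGX is only usable once BIOS has opted in: both LOCK (bit 0) and
     * SGX_ENABLE (bit 18) must be set in IA32_FEATURE_CONTROL. */
    static bool sgx_opted_in(void)
    {
        const uint64_t required = IA32_FEATURE_CONTROL_SGX_ENABLE |
                                  IA32_FEATURE_CONTROL_LOCK;
        uint64_t val;

        rdmsrl(MSR_IA32_FEATURE_CONTROL, val);

        return (val & required) == required;
    }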
1.3 Enclave Life Cycle
1.3.1 Constructing & Destroying Enclave
An enclave is created via the ENCLS[ECREATE] leaf by privileged software.
Basically, ECREATE converts an invalid EPC page into a SECS page, according
to a source SECS structure residing in normal memory. The source SECS
contains the enclave's info such as base (linear) address, size, enclave
attributes, enclave measurement, etc.
After ECREATE, for each 4K page of the enclave's linear address space,
privileged software uses EADD and EEXTEND to add one EPC page to it. Enclave
code/data (residing in normal memory) is loaded into the enclave during EADD
for each of the enclave's 4K pages. After all EPC pages are added to the
enclave, privileged software calls EINIT to initialize the enclave, and then
the enclave is ready to run.
While the enclave is constructed, the enclave measurement, which is a SHA256
hash value, is also built according to the enclave's size, its code/data and
their location in the enclave, etc. The measurement can be used to uniquely
identify the enclave. The SIGSTRUCT passed to the EINIT leaf also contains
the measurement specified by untrusted software, via MRENCLAVE. EINIT will
check the two measurements and will only succeed when the two match.
An enclave is destroyed by running EREMOVE on all of the enclave's EPC
pages, and then on the enclave's SECS. EREMOVE will report an
SGX_CHILD_PRESENT error if it is called on a SECS while there are still
regular EPC pages that haven't been removed from the enclave.
Please refer to SDM chapter 39.1 Constructing an Enclave for more information.
1.3.2 Enclave Entry and Exit
1.3.2.1 Synchronous Entry and Exit
After the enclave is constructed, non-privileged software uses ENCLU[EENTER]
to enter the enclave to run. While a process runs in the enclave,
non-privileged software can use ENCLU[EEXIT] to exit the enclave and return
to normal mode.
1.3.2.2 Asynchronous Enclave Exit
Asynchronous and synchronous events, such as exceptions, interrupts, traps,
SMIs, and VM exits, may occur while executing inside an enclave. These events
are referred to as Enclave Exiting Events (EEE). Upon an EEE, the processor
state is securely saved inside the enclave and then replaced by a synthetic
state to prevent leakage of secrets. The process of securely saving state and
establishing the synthetic state is called an Asynchronous Enclave Exit (AEX).
After an AEX, non-privileged software uses ENCLU[ERESUME] to re-enter the
enclave. The SGX userspace software maintains a small piece of code (residing
in normal memory) which basically calls ERESUME to re-enter the enclave. The
address of this piece of code is called the Asynchronous Exit Pointer (AEP).
The AEP is specified as a parameter to EENTER and is kept internally by the
enclave. Upon an AEX, the AEP is pushed onto the stack, and upon returning
from EEE handling, e.g. via IRET, the AEP is loaded into RIP and ERESUME is
subsequently called to re-enter the enclave.
During an AEX the processor does the context saving and restoring
automatically, therefore no change to the interrupt handling of the OS kernel
or VMM is required. It is the SGX userspace software's responsibility to set
up the AEP correctly.
Please refer to SDM chapter 39.2 Enclave Entry and Exit for more information.
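As a rough userspace illustration (not part of this patchset; the ENCLU leaf
number and register usage are from my reading of the SDM), entry with an AEP
looks roughly like:

    /* Illustrative only: ENCLU is opcode 0F 01 D7; for EENTER, RBX holds
     * the TCS address and RCX the AEP.  On an AEX the AEP ends up as the
     * interrupted RIP, so returning from the event handler lands on code
     * that simply ERESUMEs the enclave. */
    #define ENCLU_EENTER 2

    static void enter_enclave(void *tcs, void *aep)
    {
        asm volatile ( ".byte 0x0f, 0x01, 0xd7"
                       :
                       : "a" (ENCLU_EENTER), "b" (tcs), "c" (aep)
                       : "memory", "cc" );
    }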
1.3.3 EPC Eviction and Reload
SGX also allows privileged software to evict any EPC pages that are used by
an enclave. The idea is the same as normal memory swapping.
Below is the sequence to evict a regular EPC page:
1) Select one or multiple regular EPC pages from one enclave
2) Remove EPT/PT mapping for selected EPC pages
3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
4) EBLOCK on selected EPC pages
5) ETRACK on enclave's SECS page
6) allocate one available slot (8-byte) in VA page
7) EWB on selected EPC pages
With EWB taking:
- a VA slot, to store the eviction version info.
- one normal 4K page in memory, to store the encrypted content of the EPC page.
- one struct PCMD in memory, to store metadata.
(A VA slot is an 8-byte slot in a VA page, which is a particular type of EPC
page.)
And below is the sequence to evict a SECS page or VA page:
1) locate the SECS (or VA) page
2) remove the EPT/PT mapping for the SECS (or VA) page
3) send IPIs to remote CPUs to flush TLBs
4) allocate one available slot (8-byte) in a VA page
5) EWB on the SECS (or VA) page
And for evicting a SECS page, all regular EPC pages that belong to that SECS
must be evicted first, otherwise EWB returns an SGX_CHILD_PRESENT error.
And to reload an EPC page:
1) ELDU/ELDB on EPC page
2) setup EPT/PT mapping
With ELDU/ELDB taking:
- location of SECS page
- linear address of enclave's 4K page (that we are going to reload to)
- VA slot (used in EWB)
- 4K page in memory (used in EWB)
- struct PCMD in memory (used in EWB)
Please refer to SDM chapter 39.5 EPC and Management of EPC pages for more
information.
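Purely as an illustration of the regular-page sequence above, the flow could
be structured like the sketch below; every type and helper name here is
hypothetical and only stands in for the corresponding step:

    /* Hypothetical helpers; ordering follows the sequence above. */
    struct epc_page;
    struct pcmd;

    extern void unmap_epc_page(struct epc_page *page);          /* step 2 */
    extern void flush_epc_tlbs(struct epc_page *page);          /* step 3 */
    extern int encls_eblock(struct epc_page *page);             /* step 4 */
    extern int encls_etrack(struct epc_page *secs);             /* step 5 */
    extern int encls_ewb(struct epc_page *page, void *backing,
                         struct pcmd *pcmd, uint64_t *va_slot); /* step 7 */

    static int evict_epc_page(struct epc_page *secs, struct epc_page *page,
                              void *backing, struct pcmd *pcmd,
                              uint64_t *va_slot)
    {
        int rc;

        unmap_epc_page(page);
        flush_epc_tlbs(page);

        if ( (rc = encls_eblock(page)) || (rc = encls_etrack(secs)) )
            return rc;

        /* EWB writes the encrypted page to 'backing', metadata to 'pcmd'
         * and the version counter to the VA slot. */
        return encls_ewb(page, backing, pcmd, va_slot);
    }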
1.4 SGX Launch Control
SGX requires running a "Launch Enclave" (LE) before running any other
enclave. This is because the LE is the only enclave that does not require an
EINITTOKEN in EINIT. Running any other enclave requires a valid EINITTOKEN,
which contains a MAC over (the first 192 bytes of) the EINITTOKEN, calculated
with the EINITTOKEN key. EINIT verifies the MAC by internally deriving the
EINITTOKEN key, and only an EINITTOKEN with a matching MAC is accepted by
EINIT. The EINITTOKEN key derivation depends on some info from the LE. The
typical process is that the LE generates an EINITTOKEN for another enclave
according to the LE itself and the target enclave, and calculates the MAC by
using ENCLU[EGETKEY] to get the EINITTOKEN key. Only the LE is able to get
the EINITTOKEN key.
Running an LE requires the SHA256 hash of the LE signer's RSA public key
(SHA256 of sigstruct->modulus) to equal the value in the
IA32_SGXLEPUBKEYHASH[0-3] MSRs (the 4 MSRs together make up the 256-bit
SHA256 hash value).
If CPUID.0x7.0x0:EBX.SGX and CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL (bit 30)
are both set, then the IA32_SGXLEPUBKEYHASHn MSRs are available, and
IA32_FEATURE_CONTROL has the SGX_LAUNCH_CONTROL_ENABLE bit (bit 17).
Setting SGX_LAUNCH_CONTROL_ENABLE to 1 enables runtime changes of
IA32_SGXLEPUBKEYHASHn after IA32_FEATURE_CONTROL is locked. Otherwise,
IA32_SGXLEPUBKEYHASHn are read-only after IA32_FEATURE_CONTROL is locked.
After reset, IA32_SGXLEPUBKEYHASHn are set to the hash of Intel's default
key. On a system that has only CPUID.0x7.0x0:EBX.SGX set,
IA32_SGXLEPUBKEYHASHn are not available. On such a system EINIT will always
treat IA32_SGXLEPUBKEYHASHn as having Intel's default value, thus only
Intel's LE is able to run.
On systems with IA32_SGXLEPUBKEYHASHn available, it is up to the BIOS
implementation whether to let the user configure IA32_SGXLEPUBKEYHASHn in
*locked* mode (IA32_SGXLEPUBKEYHASHn are read-only after IA32_FEATURE_CONTROL
is locked) or *unlocked* mode (IA32_SGXLEPUBKEYHASHn remain writable by the
kernel at runtime). BIOS may or may not also provide a way for the user to
set a custom IA32_SGXLEPUBKEYHASHn value.
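As a small sketch of the *unlocked* case (the MSR index 0x8C for
IA32_SGXLEPUBKEYHASH0 is an assumption on my part, not something defined
anywhere in this series yet):

    /* Assumed MSR index; IA32_SGXLEPUBKEYHASH1..3 follow consecutively. */
    #define MSR_IA32_SGXLEPUBKEYHASH0 0x0000008c

    /* Install the SHA256 hash of an LE signer's public key, 64 bits per
     * MSR, so that EINIT will accept that signer's launch enclave. */
    static void set_le_pubkey_hash(const uint64_t hash[4])
    {
        unsigned int i;

        for ( i = 0; i < 4; i++ )
            wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, hash[i]);
    }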
1.5 SGX Interaction with IA32 and Intel 64 Architecture
SDM Chapter 42 describes SGX interaction with various features of the IA32
and Intel 64 architecture. Below are the major ones; refer to Chapter 42 for
the full description of SGX interaction with the various features.
1.5.1 VMX Changes for Supporting SGX Virtualization
A new 64-bit ENCLS-exiting bitmap control field is added to the VMCS
(encoding 202EH) to control VMEXIT on ENCLS leaf functions. A new "enable
ENCLS exiting" control bit (bit 15) is defined in the secondary
processor-based VM-execution controls; setting it to 1 enables the
ENCLS-exiting bitmap control. The ENCLS-exiting bitmap controls which ENCLS
leaves will trigger VMEXIT.
Additionally, two new bits are added to indicate whether a VMEXIT (of any
kind) came from an enclave. The two bits below are set if the VMEXIT is from
an enclave:
- Bit 27 in the Exit Reason field of the Basic VM-exit information.
- Bit 4 in the Interruptibility State of the Guest Non-Register State of the
VMCS.
Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information, and
27.3.4 Saving Non-Register State.
1.5.2 Interaction with XSAVE
SGX defines a sub-field called X-Feature Request Mask (XFRM) in the
ATTRIBUTES field of the SECS. On enclave entry, SGX hardware verifies that
the features requested in SECS.ATTRIBUTES.XFRM are already enabled in XCR0.
Upon an AEX, SGX saves the processor extended state and miscellaneous state
to the enclave's state-save area (SSA), and clears the secrets from the
processor extended state used by the enclave (to avoid leaking secrets).
Refer to 42.7 Interaction with Processor Extended State and Miscellaneous
State.
1.5.3 Interaction with S state
When the processor goes into an S3-S5 state, EPC content is destroyed, and
consequently all enclaves are destroyed as well.
Refer to 42.14 Interaction with S States.
2. SGX Virtualization Design
2.1 High Level Toolstack Changes:
2.1.1 New 'sgx' XL configure file parameter
EPC is a limited resource. In order to use EPC efficiently among all
domains, the administrator should be able to specify a domain's virtual EPC
size when creating the guest, and should also be able to query every
domain's virtual EPC size.
For SGX Launch Control virtualization, we should allow the admin to create a
VM with the VM's virtual IA32_SGXLEPUBKEYHASHn either locked or unlocked, and
we should also allow the admin to create a VM with a custom
IA32_SGXLEPUBKEYHASHn value.
For the above purposes, the new 'sgx' XL config file parameter below is added:
sgx = 'epc=<size>,lehash=<sha256-hash>,lewr=<0|1>'
Here 'epc' specifies the VM's EPC size in MB and is mandatory.
When the physical machine is in *locked* mode, neither 'lehash' nor 'lewr'
can be specified, as the physical machine is unable to change
IA32_SGXLEPUBKEYHASHn at runtime. Specifying either 'lehash' or 'lewr' will
cause VM creation to fail in that case, and the VM's initial
IA32_SGXLEPUBKEYHASHn value will be set to the value of the physical MSRs.
When the physical machine is in *unlocked* mode, the VM's initial
IA32_SGXLEPUBKEYHASHn value will be set to 'lehash' if specified, or to
Intel's default value otherwise. The VM's SGX_LAUNCH_CONTROL_ENABLE bit in
IA32_FEATURE_CONTROL will be set or cleared depending on whether 'lewr' is
specified (or explicitly set to true or false).
Please also refer to 2.2.4 Launch Control Support.
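For example, a guest config might contain something like the line below
(values purely illustrative); a 64-hex-digit SHA256 value would be given for
'lehash' when a non-default launch enclave signer is wanted:

    # 64MB of virtual EPC; guest kernel may rewrite IA32_SGXLEPUBKEYHASHn
    sgx = 'epc=64,lewr=1'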
2.1.2 New XL commands (?)
The administrator should be able to get the physical EPC size, and every
domain's virtual EPC size. For this purpose, we can introduce 2 additional
commands:
# xl sgxinfo
Which will print out the physical EPC size, and other SGX info (such as SGX1,
SGX2, etc.) if necessary.
# xl sgxlist <did>
Which will print out a particular domain's virtual EPC size, or list the
virtual EPC sizes of all supported domains.
Alternatively, we can also extend existing XL commands by adding a new option:
# xl info -sgx
Which will print out physical EPC size along with other physinfo. And
# xl list <did> -sgx
Which will print out domain's virtual EPC size.
Comments?
In this RFC the two new commands are not implemented yet.
2.1.3 Notify domain's virtual EPC base and size to Xen
Xen needs to know a guest's EPC base and size in order to populate EPC pages
for it. The toolstack notifies the EPC base and size to Xen via
XEN_DOMCTL_set_cpuid.
2.2 High Level Xen Hypervisor Changes:
2.2.1 EPC Management
The Xen hypervisor needs to detect SGX, discover EPC, and manage EPC before
exposing SGX to guests. EPC is detected via SGX CPUID 0x12.0x2. It's possible
that there are multiple EPC sections (enumerated via sub-leaves 0x3 and so
on, until an invalid EPC section is reported), but this is typically on
multi-socket servers, on which each package would have its own EPC.
EPC is reported as reserved memory (so it is not reported as normal memory).
EPC must be managed in 4K pages. CPU hardware uses the EPCM to track the
status of each EPC page. Xen needs to manage EPC and provide functions to,
e.g., allocate and free EPC pages for guests.
Although on typical physical machines (at least existing machines) EPC is at
most ~100MB in size, we cannot assume anything about the EPC size, so in
terms of EPC management it's better to integrate EPC management into Xen's
memory management framework to take advantage of Xen's existing memory
management algorithms. Specifically, one 'struct page_info' will be created
for each EPC page, just like for normal memory, and a new flag will be
defined to identify whether a 'struct page_info' is EPC or normal memory.
The existing memory allocation API alloc_domheap_pages will be reused to
allocate EPC pages, by adding a new memflag 'MEMF_epc' to indicate EPC
allocation rather than normal memory allocation. The new 'MEMF_epc' can also
be used for EPC ballooning (if required in the future), as with the new
flag, the existing XENMEM_increase{decrease}_reservation and
XENMEM_populate_physmap can be reused for EPC as well.
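The intended usage, with the MEMF_epc flag proposed in patches 6~8, would
then look roughly like the sketch below (helper names made up for
illustration):

    /* EPC pages come from the normal domheap allocator; only the proposed
     * MEMF_epc flag distinguishes them from ordinary RAM. */
    static struct page_info *alloc_epc_page(struct domain *d)
    {
        return alloc_domheap_pages(d, 0, MEMF_epc);   /* one 4K EPC page */
    }

    static void free_epc_page(struct page_info *pg)
    {
        free_domheap_page(pg);
    }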
2.2.2 EPC Virtualization
This part covers how to populate EPC for guests. We have 3 choices:
- Static Partitioning
- Oversubscription
- Ballooning
Static Partitioning means all EPC pages are allocated and mapped to the guest
when it is created, and there is no runtime change of page table mappings for
EPC pages. Oversubscription means the Xen hypervisor supports EPC page
swapping between domains, i.e. Xen is able to evict an EPC page from one
domain and assign it to the domain that needs the EPC. With oversubscription,
EPC can be assigned to a domain on demand, when an EPT violation happens.
Ballooning is similar to memory ballooning: it is basically "Static
Partitioning" + a balloon driver in the guest.
Static Partitioning is the easiest in terms of implementation, and there will
be no hypervisor overhead (except EPT overhead of course), because with
"Static Partitioning" there are no EPT violations for EPC, and Xen doesn't
need to turn on ENCLS VMEXIT for the guest, as ENCLS runs perfectly in
non-root mode.
Ballooning is "Static Partitioning" + a balloon driver in the guest. Like
"Static Partitioning", ballooning doesn't need to turn on ENCLS VMEXIT, and
doesn't have EPT violations for EPC either. To support ballooning, we need a
balloon driver in the guest to issue hypercalls to give up or reclaim EPC
pages. In terms of the hypercall, we have two choices: 1) add a new hypercall
for EPC ballooning; 2) use the existing
XENMEM_{increase/decrease}_reservation with a new memory flag, i.e.
XENMEMF_epc. I'll discuss adding a dedicated hypercall (or not) in more
detail later.
Oversubscription looks nice but requires a more complicated implementation.
Firstly, as explained in 1.3.3 EPC Eviction and Reload, we need to follow
specific steps to evict EPC pages, and in order to do that, Xen basically
needs to trap ENCLS from guests and keep track of EPC page status and
enclave info from all guests. This is because:
- To evict a regular EPC page, Xen needs to know the SECS location.
- Xen needs to know the EPC page type: evicting a regular EPC page and
evicting a SECS or VA page require different steps.
- Xen needs to know the EPC page status: whether the page is blocked or not.
This info can only be obtained by trapping ENCLS from the guest and parsing
its parameters (to identify the SECS page, etc). Parsing ENCLS parameters
means we need to know which ENCLS leaf is being trapped, and we need to
translate the guest's virtual addresses to get physical addresses in order
to locate EPC pages. And once ENCLS is trapped, we have to emulate ENCLS in
Xen, which means we need to reconstruct the ENCLS parameters by remapping
all of the guest's virtual addresses to Xen's virtual addresses
(gva->gpa->pa->xen_va), as ENCLS always uses *effective addresses*, which
are translated by the processor when running ENCLS.
--------------------------------------------------------------
| ENCLS |
--------------------------------------------------------------
| /|\
ENCLS VMEXIT| | VMENTRY
| |
\|/ |
1) parse ENCLS parameters
2) reconstruct(remap) guest's ENCLS parameters
3) run ENCLS on behalf of guest (and skip ENCLS)
4) on success, update EPC/enclave info, or inject error
And Xen needs to maintain each EPC page's status (type, blocked or not, in an
enclave or not, etc). Xen also needs to maintain all enclaves' info from all
guests, in order to find the correct SECS for a regular EPC page, and the
enclave's linear address as well.
So in general, "Static Partitioning" has the simplest implementation, but is
obviously not the best way to use EPC efficiently; "Ballooning" has all the
pros of Static Partitioning but requires a guest balloon driver;
"Oversubscription" is best in terms of flexibility but requires a complicated
hypervisor implementation.
We will start with "Static Partitioning". If "Ballooning" is required in the
future, we will support it. "Oversubscription" should not be needed in the
foreseeable future.
2.2.3 Populate EPC for Guest
The toolstack notifies Xen about the domain's EPC base and size via
XEN_DOMCTL_set_cpuid, so currently Xen populates all EPC pages for the guest
in XEN_DOMCTL_set_cpuid, particularly in the handling of
XEN_DOMCTL_set_cpuid for CPUID.0x12.0x2. Once Xen checks that the values
passed from the toolstack are valid, Xen will allocate all EPC pages and set
up the EPT mappings for the guest.
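A rough sketch of that populate step (the helper is hypothetical; p2m_epc is
the new p2m type added in patch #5 and MEMF_epc the flag from patches 6~8):

    /* Back every gfn of the guest's virtual EPC range with a freshly
     * allocated EPC page, mapped with the new p2m_epc type. */
    static int populate_epc(struct domain *d, unsigned long gfn_base,
                            unsigned long nr_pages)
    {
        unsigned long i;

        for ( i = 0; i < nr_pages; i++ )
        {
            struct page_info *pg = alloc_domheap_pages(d, 0, MEMF_epc);
            int rc;

            if ( !pg )
                return -ENOMEM;

            rc = guest_physmap_add_entry(d, _gfn(gfn_base + i),
                                         _mfn(page_to_mfn(pg)), 0, p2m_epc);
            if ( rc )
                return rc;
        }

        return 0;
    }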
2.2.4 Launch Control Support
To support running multiple domains, each running its own LE signed by a
different owner, the physical machine's BIOS must leave
IA32_SGXLEPUBKEYHASHn *unlocked* before handing over to Xen. Xen will trap
the domain's writes to IA32_SGXLEPUBKEYHASHn and keep the values in the vcpu
internally, and write the values to the physical MSRs when the vcpu is
scheduled in. This guarantees that when EINIT runs in the guest, the guest's
virtual IA32_SGXLEPUBKEYHASHn have been written to the physical MSRs.
The SGX_LAUNCH_CONTROL_ENABLE bit in the guest's IA32_FEATURE_CONTROL is
controlled by the newly added 'lewr' XL parameter (see 2.1.1 New 'sgx' XL
configure file parameter).
If the physical IA32_SGXLEPUBKEYHASHn are *locked* by the machine's BIOS,
then only MSR reads are allowed from the guest, and Xen will inject an error
for the guest's MSR writes. In addition, if the physical
IA32_SGXLEPUBKEYHASHn are *locked*, then creating a guest with the 'lehash'
or 'lewr' parameter will fail, as in that case Xen is not able to write the
guest's virtual IA32_SGXLEPUBKEYHASHn to the physical MSRs.
If the physical IA32_SGXLEPUBKEYHASHn are not available
(CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL is not present), then creating a VM
with 'lehash' or 'lewr' will also fail. In addition, any MSR read/write of
IA32_SGXLEPUBKEYHASHn from the guest is invalid and Xen will inject an error
in that case.
2.2.5 CPUID Emulation
Most of the native SGX CPUID info can be exposed to the guest, except the
two parts below:
- Sub-leaf 0x2 needs to report the domain's virtual EPC base and size,
instead of the physical EPC info.
- Sub-leaf 0x1 needs to be consistent with the guest's XCR0. For the reason
behind this, please refer to 1.5.2 Interaction with XSAVE.
2.2.6 EPT Violation & ENCLS Trapping Handling
Only needed when Xen supports EPC Oversubscription, as explained above.
2.2.7 Guest Suspend & Resume
On hardware, EPC is destroyed when the power state goes to S3-S5. So Xen
will destroy the guest's EPC when the guest's power state goes into S3-S5.
Currently Xen is notified by QEMU of S state changes via
HVM_PARAM_ACPI_S_STATE, where Xen will destroy the EPC if the S state is
S3-S5.
Specifically, Xen will run EREMOVE on each of the guest's EPC pages, as the
guest may not handle EPC suspend & resume correctly, in which case the
guest's EPC pages may physically still be valid, so Xen needs to run EREMOVE
to make sure all EPC pages become invalid. Otherwise further EPC operations
in the guest may fault, as the guest assumes all EPC pages are invalid after
it is resumed.
For a SECS page, EREMOVE may fail with SGX_CHILD_PRESENT, in which case Xen
will keep the SECS page on a list, and call EREMOVE on it again after
EREMOVE has been called on all other EPC pages. This time EREMOVE on the
SECS will succeed, as all children (regular EPC pages) have already been
removed.
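In its simplest form, this reset could be structured like the sketch below
(the eremove() wrapper around ENCLS[EREMOVE] is hypothetical; the "list of
deferred SECS pages" degenerates here into a plain second pass over the
whole range):

    extern int eremove(void *epc_page_va);   /* ENCLS[EREMOVE] wrapper */

    /* Run EREMOVE over the guest's EPC twice: the first pass invalidates
     * all regular pages (EREMOVE on a SECS with children just fails with
     * SGX_CHILD_PRESENT), and the second pass then succeeds on the SECS
     * pages because their children are gone. */
    static void reset_guest_epc(unsigned char *epc_va, unsigned long nr_pages)
    {
        unsigned long i;

        for ( i = 0; i < nr_pages; i++ )
            eremove(epc_va + i * PAGE_SIZE);

        for ( i = 0; i < nr_pages; i++ )
            eremove(epc_va + i * PAGE_SIZE);
    }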
2.2.8 Destroying Domain
Normally Xen just frees all EPC pages of a domain when it is destroyed. But
Xen will also do EREMOVE on all of the guest's EPC pages (as described in
2.2.7 above) before freeing them, as the guest may shut down unexpectedly
(e.g. the user kills the guest), in which case the guest's EPC pages may
still be valid.
2.3 Additional Point: Live Migration, Snapshot Support (?)
Actually, from the hardware's point of view, SGX is not migratable. There
are two reasons:
- The SGX key architecture cannot be virtualized.
Some keys are bound to the CPU, for example the Sealing key and the EREPORT
key. If a VM is migrated to another machine, the same enclave will derive
different keys. Taking the Sealing key as an example, the Sealing key is
typically used by an enclave (an enclave can get the sealing key via EGETKEY)
to *seal* its secrets to the outside (e.g. persistent storage) for further
use. If the Sealing key changes after VM migration, the enclave can never get
the sealed secrets back using the sealing key, as it has changed and the old
sealing key cannot be recovered.
- There is no ENCLS leaf to evict an EPC page to normal memory while at the
same time keeping its content in EPC. Currently, once an EPC page is evicted,
the EPC page becomes invalid. So technically, we are unable to implement live
migration (or checkpointing, or snapshot) for enclaves.
But, with some workarounds, and given some facts about existing SGX drivers,
technically we are able to support live migration (or even checkpointing and
snapshot). This is because:
- Changing a key (which is bound to the CPU) is not a problem in reality.
Take the Sealing key as an example. Losing sealed data is not a problem,
because the sealing key is only supposed to encrypt secrets that can be
provisioned again. The typical work model is: the enclave gets secrets
provisioned from remote (the service provider), and uses the sealing key to
store them for further use. When the enclave tries to *unseal* using the
sealing key, if the sealing key has changed, the enclave will find the data
is corrupted (integrity check failure), so it will ask for the secrets to be
provisioned again from remote.
Another reason is that, in a data center, VMs typically share lots of data,
and as the sealing key is bound to the CPU, data encrypted by one enclave on
one machine cannot be shared by another enclave on another machine. So from
the SGX app writer's point of view, the developer should treat the Sealing
key as a changeable key, and should handle loss of sealed data anyway. The
Sealing key should only be used to seal secrets that can easily be
provisioned again.
For other keys such as the EREPORT key and the provisioning key, which are
used for local attestation and remote attestation, losing them is not a
problem either, due to the second reason below.
- Sudden loss of EPC is not a problem.
On hardware, EPC is lost if the system goes to S3-S5, or is reset, or shut
down, and the SGX driver needs to handle loss of EPC due to power
transitions. This is done by cooperation between the SGX driver and the
userspace SGX SDK/apps. However, during live migration there may be no power
transition in the guest, so there may be no EPC loss during live migration.
And technically we cannot *really* live migrate enclaves (explained above),
so it looks infeasible. But the fact is that both the Linux SGX driver and
the Windows SGX driver already support *sudden* loss of EPC (not just EPC
loss during power transitions), which means both drivers are able to recover
in case EPC is lost at any point at runtime. With this, technically we are
able to support live migration by simply ignoring EPC. After the VM is
migrated, the destination VM will only suffer a *sudden* loss of EPC, which
both the Windows SGX driver and the Linux SGX driver are already able to
handle.
But we must point out that such *sudden* loss of EPC is not hardware
behavior, and SGX drivers for other OSes (such as FreeBSD) may not implement
this, so for those guests the destination VM will behave in an unexpected
manner. But I am not sure we need to care about other OSes.
For the same reason, we are able to support checkpointing for SGX guests
(only Linux and Windows).
For snapshot, we can support snapshotting an SGX guest by either:
- Suspending the guest before the snapshot (S3-S5). This works for all
guests but requires the user to manually suspend the guest.
- Issuing a hypercall to destroy the guest's EPC in save_vm. This only works
for Linux and Windows but doesn't require user intervention.
What are your comments?
3. Reference
- Intel SGX Homepage
https://software.intel.com/en-us/sgx
- Linux SGX SDK
https://01.org/intel-software-guard-extensions
- Linux SGX driver for upstreaming
https://github.com/01org/linux-sgx
- Intel SGX Specification (SDM Vol 3D)
https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf
- Paper: Intel SGX Explained
https://eprint.iacr.org/2016/086.pdf
- ISCA 2015 tutorial slides for Intel® SGX - Intel® Software
https://software.intel.com/sites/default/files/332680-002.pdf
Boqun Feng (5):
xen: mm: introduce non-scrubbable pages
xen: mm: manage EPC pages in Xen heaps
xen: x86/mm: add SGX EPC management
xen: x86: add functions to populate and destroy EPC for domain
xen: tools: add SGX to applying MSR policy
Kai Huang (12):
xen: x86: expose SGX to HVM domain in CPU featureset
xen: x86: add early stage SGX feature detection
xen: vmx: detect ENCLS VMEXIT
xen: x86/mm: introduce ioremap_wb()
xen: p2m: new 'p2m_epc' type for EPC mapping
xen: x86: add SGX cpuid handling support.
xen: vmx: handle SGX related MSRs
xen: vmx: handle ENCLS VMEXIT
xen: vmx: handle VMEXIT from SGX enclave
xen: x86: reset EPC when guest got suspended.
xen: tools: add new 'sgx' parameter support
xen: tools: add SGX to applying CPUID policy
docs/misc/xen-command-line.markdown | 8 +
tools/libxc/Makefile | 1 +
tools/libxc/include/xc_dom.h | 4 +
tools/libxc/include/xenctrl.h | 16 +
tools/libxc/xc_cpuid_x86.c | 68 ++-
tools/libxc/xc_msr_x86.h | 10 +
tools/libxc/xc_sgx.c | 82 +++
tools/libxl/libxl.h | 3 +-
tools/libxl/libxl_cpuid.c | 15 +-
tools/libxl/libxl_create.c | 10 +
tools/libxl/libxl_dom.c | 65 ++-
tools/libxl/libxl_internal.h | 2 +
tools/libxl/libxl_nocpuid.c | 4 +-
tools/libxl/libxl_types.idl | 11 +
tools/libxl/libxl_x86.c | 12 +
tools/ocaml/libs/xc/xenctrl_stubs.c | 11 +-
tools/python/xen/lowlevel/xc/xc.c | 11 +-
tools/xl/xl_parse.c | 86 +++
tools/xl/xl_parse.h | 1 +
xen/arch/x86/Makefile | 1 +
xen/arch/x86/cpu/common.c | 15 +
xen/arch/x86/cpuid.c | 62 ++-
xen/arch/x86/domctl.c | 87 ++-
xen/arch/x86/hvm/hvm.c | 3 +
xen/arch/x86/hvm/vmx/vmcs.c | 16 +-
xen/arch/x86/hvm/vmx/vmx.c | 68 +++
xen/arch/x86/hvm/vmx/vvmx.c | 11 +
xen/arch/x86/mm.c | 9 +-
xen/arch/x86/mm/p2m-ept.c | 3 +
xen/arch/x86/mm/p2m.c | 41 ++
xen/arch/x86/msr.c | 6 +-
xen/arch/x86/sgx.c | 815 ++++++++++++++++++++++++++++
xen/common/page_alloc.c | 39 +-
xen/include/asm-arm/mm.h | 9 +
xen/include/asm-x86/cpufeature.h | 4 +
xen/include/asm-x86/cpuid.h | 29 +-
xen/include/asm-x86/hvm/hvm.h | 3 +
xen/include/asm-x86/hvm/vmx/vmcs.h | 8 +
xen/include/asm-x86/hvm/vmx/vmx.h | 3 +
xen/include/asm-x86/mm.h | 19 +-
xen/include/asm-x86/msr-index.h | 6 +
xen/include/asm-x86/msr.h | 5 +
xen/include/asm-x86/p2m.h | 12 +-
xen/include/asm-x86/sgx.h | 86 +++
xen/include/public/arch-x86/cpufeatureset.h | 3 +-
xen/include/xen/mm.h | 2 +
xen/tools/gen-cpuid.py | 3 +
47 files changed, 1757 insertions(+), 31 deletions(-)
create mode 100644 tools/libxc/xc_sgx.c
create mode 100644 xen/arch/x86/sgx.c
create mode 100644 xen/include/asm-x86/sgx.h
--
2.15.0
* [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 11:13 ` Julien Grall
2017-12-04 0:15 ` [PATCH v2 02/17] xen: x86: add early stage SGX feature detection Boqun Feng
` (16 subsequent siblings)
17 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
Expose SGX in the CPU featureset for HVM domains. SGX will not be supported
for PV domains, as ENCLS (which the SGX driver in the guest essentially
runs) must run in ring 0, while a PV kernel runs in ring 3. Theoretically we
could support SGX in PV domains by either emulating the #GP caused by ENCLS
running in ring 3, or by a PV ENCLS interface, but that is really not
necessary at this stage.
SGX Launch Control is also exposed in the CPU featureset for HVM domains.
SGX Launch Control depends on SGX.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/include/public/arch-x86/cpufeatureset.h | 3 ++-
xen/tools/gen-cpuid.py | 3 +++
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index be6da8eaf17c..1f8510eebb1d 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -193,7 +193,7 @@ XEN_CPUFEATURE(XSAVES, 4*32+ 3) /*S XSAVES/XRSTORS instructions */
/* Intel-defined CPU features, CPUID level 0x00000007:0.ebx, word 5 */
XEN_CPUFEATURE(FSGSBASE, 5*32+ 0) /*A {RD,WR}{FS,GS}BASE instructions */
XEN_CPUFEATURE(TSC_ADJUST, 5*32+ 1) /*S TSC_ADJUST MSR available */
-XEN_CPUFEATURE(SGX, 5*32+ 2) /* Software Guard extensions */
+XEN_CPUFEATURE(SGX, 5*32+ 2) /*H Intel Software Guard extensions */
XEN_CPUFEATURE(BMI1, 5*32+ 3) /*A 1st bit manipulation extensions */
XEN_CPUFEATURE(HLE, 5*32+ 4) /*A Hardware Lock Elision */
XEN_CPUFEATURE(AVX2, 5*32+ 5) /*A AVX2 instructions */
@@ -230,6 +230,7 @@ XEN_CPUFEATURE(PKU, 6*32+ 3) /*H Protection Keys for Userspace */
XEN_CPUFEATURE(OSPKE, 6*32+ 4) /*! OS Protection Keys Enable */
XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A POPCNT for vectors of DW/QW */
XEN_CPUFEATURE(RDPID, 6*32+22) /*A RDPID instruction */
+XEN_CPUFEATURE(SGX_LC, 6*32+30) /*H Intel SGX Launch Control */
/* AMD-defined CPU features, CPUID level 0x80000007.edx, word 7 */
XEN_CPUFEATURE(ITSC, 7*32+ 8) /* Invariant TSC */
diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index 9ec4486f2b4b..4fef21203086 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -256,6 +256,9 @@ def crunch_numbers(state):
AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
AVX512_4FMAPS, AVX512_VPOPCNTDQ],
+
+ # SGX Launch Control depends on SGX
+ SGX: [SGX_LC],
}
deep_features = tuple(sorted(deps.keys()))
--
2.15.0
* [PATCH v2 02/17] xen: x86: add early stage SGX feature detection
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
2017-12-04 0:15 ` [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 03/17] xen: vmx: detect ENCLS VMEXIT Boqun Feng
` (15 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
This patch adds early stage SGX feature detection via SGX CPUID 0x12.
Function detect_sgx is added to detect SGX info on each CPU (called from
identify_cpu). The SDM says the SGX info returned by CPUID is per-thread,
and we cannot assume all threads will return the same SGX info, so we have
to detect SGX for each CPU. For simplicity, currently SGX is only supported
when all CPUs report the same SGX info.
Besides, a boot parameter 'sgx' is added to allow the sysadmin to control
whether SGX is exposed to guests.
The SDM also says it's possible to have multiple EPC sections, but this is
only for multi-socket servers, which we don't support now (there are other
things that need to be done as well, e.g. NUMA EPC, scheduling, etc), so
currently only one EPC section is supported.
The detection result is in the X86_FEATURE_SGX bit of 'boot_cpu_data',
and 'cpu_has_sgx' should be the only way to query whether SGX support is
enabled in the whole system.
Dedicated files sgx.c and sgx.h are added for the bulk of the above SGX
detection code, and for further SGX code as well.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
docs/misc/xen-command-line.markdown | 8 ++
xen/arch/x86/Makefile | 1 +
xen/arch/x86/cpu/common.c | 15 +++
xen/arch/x86/sgx.c | 191 ++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/cpufeature.h | 1 +
xen/include/asm-x86/msr-index.h | 1 +
xen/include/asm-x86/sgx.h | 61 ++++++++++++
7 files changed, 278 insertions(+)
create mode 100644 xen/arch/x86/sgx.c
create mode 100644 xen/include/asm-x86/sgx.h
diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 781110d4b2a5..81f9936face2 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1601,6 +1601,14 @@ hypervisors handle SErrors:
All SErrors will crash the whole system. This option will avoid all overhead
of the dsb/isb pairs.
+### sgx (Intel)
+> = <boolean>
+
+> Default: false
+
+Flag to enable Software Guard Extensions support
+for guest.
+
### smap
> `= <boolean> | hvm`
diff --git a/xen/arch/x86/Makefile b/xen/arch/x86/Makefile
index d5d58a205ec8..c8a843fef540 100644
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -54,6 +54,7 @@ obj-y += platform_hypercall.o x86_64/platform_hypercall.o
obj-y += psr.o
obj-y += setup.o
obj-y += shutdown.o
+obj-y += sgx.o
obj-y += smp.o
obj-y += smpboot.o
obj-y += srat.o
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 6cf362849e85..0a93d5759a76 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -11,6 +11,7 @@
#include <asm/apic.h>
#include <mach_apic.h>
#include <asm/setup.h>
+#include <asm/sgx.h>
#include <public/sysctl.h> /* for XEN_INVALID_{SOCKET,CORE}_ID */
#include "cpu.h"
@@ -430,14 +431,28 @@ void identify_cpu(struct cpuinfo_x86 *c)
* executed, c == &boot_cpu_data.
*/
if ( c != &boot_cpu_data ) {
+ struct sgx_cpuinfo tmp;
/* AND the already accumulated flags with these */
for ( i = 0 ; i < NCAPINTS ; i++ )
boot_cpu_data.x86_capability[i] &= c->x86_capability[i];
mcheck_init(c, false);
+ /*
+ * Check SGX CPUID info all for all CPUs, and only support SGX when all
+ * CPUs report the same SGX info. SDM (37.7.2 Intel SGX Resource
+ * Enumeration Leaves) says "software should not assume that if Intel
+ * SGX instructions are supported on one hardware thread, they are also
+ * supported elsewhere.". For simplicity, we only support SGX when all
+ * CPUs reports consistent SGX info.
+ */
+ detect_sgx(&tmp);
+ if ( memcmp(&tmp, &boot_sgx_cpudata, sizeof(tmp)) )
+ disable_sgx();
} else {
mcheck_init(c, true);
+ detect_sgx(&boot_sgx_cpudata);
+
mtrr_bp_init();
}
}
diff --git a/xen/arch/x86/sgx.c b/xen/arch/x86/sgx.c
new file mode 100644
index 000000000000..ead917543f3e
--- /dev/null
+++ b/xen/arch/x86/sgx.c
@@ -0,0 +1,191 @@
+/*
+ * Intel Software Guard Extensions support
+ *
+ * Copyright (c) 2017, Intel Corporation
+ *
+ * Author: Kai Huang <kai.huang@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <xen/sched.h>
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/sgx.h>
+
+struct sgx_cpuinfo __read_mostly boot_sgx_cpudata;
+
+static bool __read_mostly opt_sgx_enabled = false;
+boolean_param("sgx", opt_sgx_enabled);
+
+static void __detect_sgx(struct sgx_cpuinfo *sgxinfo)
+{
+ u32 eax, ebx, ecx, edx;
+ uint64_t val;
+ uint64_t sgx_enabled = IA32_FEATURE_CONTROL_SGX_ENABLE |
+ IA32_FEATURE_CONTROL_LOCK;
+ int cpu = smp_processor_id();
+
+ memset(sgxinfo, 0, sizeof(*sgxinfo));
+
+ /*
+ * In reality if SGX is not enabled in BIOS, SGX CPUID should report
+ * invalid SGX info, but we do the check anyway to make sure.
+ */
+ rdmsrl(MSR_IA32_FEATURE_CONTROL, val);
+
+ if ( (val & sgx_enabled) != sgx_enabled )
+ {
+ printk("CPU%d: SGX disabled in BIOS.\n", cpu);
+ goto not_supported;
+ }
+
+ sgxinfo->lewr = !!(val & IA32_FEATURE_CONTROL_SGX_LE_WR);
+
+ /*
+ * CPUID.0x12.0x0:
+ *
+ * EAX [0]: whether SGX1 is supported.
+ * [1]: whether SGX2 is supported.
+ * EBX [31:0]: miscselect
+ * ECX [31:0]: reserved
+ * EDX [7:0]: MaxEnclaveSize_Not64
+ * [15:8]: MaxEnclaveSize_64
+ */
+ cpuid_count(SGX_CPUID, 0x0, &eax, &ebx, &ecx, &edx);
+ sgxinfo->cap = eax & (SGX_CAP_SGX1 | SGX_CAP_SGX2);
+ sgxinfo->miscselect = ebx;
+ sgxinfo->max_enclave_size32 = edx & 0xff;
+ sgxinfo->max_enclave_size64 = (edx & 0xff00) >> 8;
+
+ if ( !(eax & SGX_CAP_SGX1) )
+ {
+ /* We may reach here if BIOS doesn't enable SGX */
+ printk("CPU%d: CPUID.0x12.0x0 reports not SGX support.\n", cpu);
+ goto not_supported;
+ }
+
+ /*
+ * CPUID.0x12.0x1:
+ *
+ * EAX [31:0]: bitmask of 1-setting of SECS.ATTRIBUTES[31:0]
+ * EBX [31:0]: bitmask of 1-setting of SECS.ATTRIBUTES[63:32]
+ * ECX [31:0]: bitmask of 1-setting of SECS.ATTRIBUTES[95:64]
+ * EDX [31:0]: bitmask of 1-setting of SECS.ATTRIBUTES[127:96]
+ */
+ cpuid_count(SGX_CPUID, 0x1, &eax, &ebx, &ecx, &edx);
+ sgxinfo->secs_attr_bitmask[0] = eax;
+ sgxinfo->secs_attr_bitmask[1] = ebx;
+ sgxinfo->secs_attr_bitmask[2] = ecx;
+ sgxinfo->secs_attr_bitmask[3] = edx;
+
+ /*
+ * CPUID.0x12.0x2:
+ *
+ * EAX [3:0]: 0000: this sub-leaf is invalid
+ * 0001: this sub-leaf enumerates EPC resource
+ * [11:4]: reserved
+ * [31:12]: bits 31:12 of physical address of EPC base (when
+ * EAX[3:0] is 0001, which applies to following)
+ * EBX [19:0]: bits 51:32 of physical address of EPC base
+ * [31:20]: reserved
+ * ECX [3:0]: 0000: EDX:ECX are 0
+ * 0001: this is EPC section.
+ * [11:4]: reserved
+ * [31:12]: bits 31:12 of EPC size
+ * EDX [19:0]: bits 51:32 of EPC size
+ * [31:20]: reserved
+ *
+ * TODO: So far assume there's only one EPC resource.
+ */
+ cpuid_count(SGX_CPUID, 0x2, &eax, &ebx, &ecx, &edx);
+ if ( !(eax & 0x1) || !(ecx & 0x1) )
+ {
+ /* We may reach here if BIOS doesn't enable SGX */
+ printk("CPU%d: CPUID.0x12.0x2 reports invalid EPC resource.\n", cpu);
+ goto not_supported;
+ }
+ sgxinfo->epc_base = (((u64)(ebx & 0xfffff)) << 32) | (eax & 0xfffff000);
+ sgxinfo->epc_size = (((u64)(edx & 0xfffff)) << 32) | (ecx & 0xfffff000);
+
+ return;
+
+not_supported:
+ memset(sgxinfo, 0, sizeof(*sgxinfo));
+ disable_sgx();
+}
+
+void detect_sgx(struct sgx_cpuinfo *sgxinfo)
+{
+ if ( !opt_sgx_enabled )
+ {
+ setup_clear_cpu_cap(X86_FEATURE_SGX);
+ return;
+ }
+ else if ( sgxinfo != &boot_sgx_cpudata &&
+ ( !cpu_has_sgx || boot_cpu_data.cpuid_level < SGX_CPUID ))
+ {
+ setup_clear_cpu_cap(X86_FEATURE_SGX);
+ return;
+ }
+
+ __detect_sgx(sgxinfo);
+}
+
+void disable_sgx(void)
+{
+ /*
+ * X86_FEATURE_SGX is cleared in boot_cpu_data so that cpu_has_sgx
+ * can be used anywhere to check whether SGX is supported by Xen.
+ *
+ * FIXME: also adjust boot_cpu_data.cpuid_level ?
+ */
+ setup_clear_cpu_cap(X86_FEATURE_SGX);
+ opt_sgx_enabled = false;
+}
+
+static void __init print_sgx_cpuinfo(struct sgx_cpuinfo *sgxinfo)
+{
+ printk("SGX: \n"
+ "\tCAP: %s,%s\n"
+ "\tEPC: [0x%"PRIx64", 0x%"PRIx64")\n",
+ boot_sgx_cpudata.cap & SGX_CAP_SGX1 ? "SGX1" : "",
+ boot_sgx_cpudata.cap & SGX_CAP_SGX2 ? "SGX2" : "",
+ boot_sgx_cpudata.epc_base,
+ boot_sgx_cpudata.epc_base + boot_sgx_cpudata.epc_size);
+}
+
+static int __init sgx_init(void)
+{
+ if ( !cpu_has_sgx )
+ goto not_supported;
+
+ print_sgx_cpuinfo(&boot_sgx_cpudata);
+
+ return 0;
+not_supported:
+ disable_sgx();
+ return -EINVAL;
+}
+__initcall(sgx_init);
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 84cc51d2bdc8..9793f8c1c586 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -85,6 +85,7 @@
/* CPUID level 0x00000007:0.ebx */
#define cpu_has_fsgsbase boot_cpu_has(X86_FEATURE_FSGSBASE)
+#define cpu_has_sgx boot_cpu_has(X86_FEATURE_SGX)
#define cpu_has_bmi1 boot_cpu_has(X86_FEATURE_BMI1)
#define cpu_has_hle boot_cpu_has(X86_FEATURE_HLE)
#define cpu_has_avx2 boot_cpu_has(X86_FEATURE_AVX2)
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index b99c623367b8..63e11931cd09 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -298,6 +298,7 @@
#define IA32_FEATURE_CONTROL_ENABLE_SENTER 0x8000
#define IA32_FEATURE_CONTROL_SGX_ENABLE 0x40000
#define IA32_FEATURE_CONTROL_LMCE_ON 0x100000
+#define IA32_FEATURE_CONTROL_SGX_LE_WR 0x20000
#define MSR_IA32_TSC_ADJUST 0x0000003b
diff --git a/xen/include/asm-x86/sgx.h b/xen/include/asm-x86/sgx.h
new file mode 100644
index 000000000000..b37ebde64e84
--- /dev/null
+++ b/xen/include/asm-x86/sgx.h
@@ -0,0 +1,61 @@
+/*
+ * Intel Software Guard Extensions support
+ *
+ * Copyright (c) 2016-2017, Intel Corporation.
+ *
+ * Author: Kai Huang <kai.huang@linux.intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef __ASM_X86_SGX_H__
+#define __ASM_X86_SGX_H__
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/init.h>
+#include <asm/processor.h>
+
+#define SGX_CPUID 0x12
+
+/*
+ * SGX info reported by SGX CPUID.
+ *
+ * TODO:
+ *
+ * SDM (37.7.2 Intel SGX Resource Enumeration Leaves) actually says it's
+ * possible there are multiple EPC resources on the machine (CPUID.0x12,
+ * ECX starting with 0x2 enumerates available EPC resources until invalid
+ * EPC resource is returned). But this is only for multiple socket server,
+ * which we current don't support now (there are additional things need to
+ * be done as well). So far for simplicity we assume there is only one EPC.
+ */
+struct sgx_cpuinfo {
+#define SGX_CAP_SGX1 (1UL << 0)
+#define SGX_CAP_SGX2 (1UL << 1)
+ uint32_t cap;
+ uint32_t miscselect;
+ uint8_t max_enclave_size64;
+ uint8_t max_enclave_size32;
+ uint32_t secs_attr_bitmask[4];
+ uint64_t epc_base;
+ uint64_t epc_size;
+ bool lewr;
+};
+
+extern struct sgx_cpuinfo __read_mostly boot_sgx_cpudata;
+/* Detect SGX info for particular CPU via SGX CPUID */
+void detect_sgx(struct sgx_cpuinfo *sgxinfo);
+void disable_sgx(void);
+#define sgx_lewr() (boot_sgx_cpudata.lewr)
+
+#endif /* __ASM_X86_SGX_H__ */
--
2.15.0
* [PATCH v2 03/17] xen: vmx: detect ENCLS VMEXIT
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
2017-12-04 0:15 ` [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset Boqun Feng
2017-12-04 0:15 ` [PATCH v2 02/17] xen: x86: add early stage SGX feature detection Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 04/17] xen: x86/mm: introduce ioremap_wb() Boqun Feng
` (14 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
If ENCLS VMEXIT is not present then we cannot support SGX virtualization.
This patch detects the presence of ENCLS VMEXIT, and disables SGX if ENCLS
VMEXIT is not present.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/arch/x86/hvm/vmx/vmcs.c | 16 +++++++++++++++-
xen/include/asm-x86/hvm/vmx/vmcs.h | 3 +++
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index b5100b50215a..dfcecc4fd1b0 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -40,6 +40,7 @@
#include <asm/shadow.h>
#include <asm/tboot.h>
#include <asm/apic.h>
+#include <asm/sgx.h>
static bool_t __read_mostly opt_vpid_enabled = 1;
boolean_param("vpid", opt_vpid_enabled);
@@ -143,6 +144,7 @@ static void __init vmx_display_features(void)
P(cpu_has_vmx_virt_exceptions, "Virtualisation Exceptions");
P(cpu_has_vmx_pml, "Page Modification Logging");
P(cpu_has_vmx_tsc_scaling, "TSC Scaling");
+ P(cpu_has_vmx_encls, "SGX ENCLS Exiting");
#undef P
if ( !printed )
@@ -238,7 +240,8 @@ static int vmx_init_vmcs_config(void)
SECONDARY_EXEC_ENABLE_VM_FUNCTIONS |
SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS |
SECONDARY_EXEC_XSAVES |
- SECONDARY_EXEC_TSC_SCALING);
+ SECONDARY_EXEC_TSC_SCALING |
+ SECONDARY_EXEC_ENABLE_ENCLS);
rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap);
if ( _vmx_misc_cap & VMX_MISC_VMWRITE_ALL )
opt |= SECONDARY_EXEC_ENABLE_VMCS_SHADOWING;
@@ -341,6 +344,14 @@ static int vmx_init_vmcs_config(void)
_vmx_secondary_exec_control &= ~ SECONDARY_EXEC_PAUSE_LOOP_EXITING;
}
+ /*
+ * Turn off SGX if ENCLS VMEXIT is not present. Actually on real machine,
+ * if SGX CPUID is present (CPUID.0x7.0x0:EBX.SGX = 1), then ENCLS VMEXIT
+ * will always be present. We do the check anyway here.
+ */
+ if ( !(_vmx_secondary_exec_control & SECONDARY_EXEC_ENABLE_ENCLS) )
+ disable_sgx();
+
min = VM_EXIT_ACK_INTR_ON_EXIT;
opt = VM_EXIT_SAVE_GUEST_PAT | VM_EXIT_LOAD_HOST_PAT |
VM_EXIT_CLEAR_BNDCFGS;
@@ -1136,6 +1147,9 @@ static int construct_vmcs(struct vcpu *v)
/* Disable PML anyway here as it will only be enabled in log dirty mode */
v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
+ /* Disable ENCLS VMEXIT. It will only be turned on when needed. */
+ v->arch.hvm_vmx.secondary_exec_control &= ~SECONDARY_EXEC_ENABLE_ENCLS;
+
/* Host data selectors. */
__vmwrite(HOST_SS_SELECTOR, __HYPERVISOR_DS);
__vmwrite(HOST_DS_SELECTOR, __HYPERVISOR_DS);
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 8fb9e3ceee4e..d0293b1a3620 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -245,6 +245,7 @@ extern u32 vmx_vmentry_control;
#define SECONDARY_EXEC_ENABLE_INVPCID 0x00001000
#define SECONDARY_EXEC_ENABLE_VM_FUNCTIONS 0x00002000
#define SECONDARY_EXEC_ENABLE_VMCS_SHADOWING 0x00004000
+#define SECONDARY_EXEC_ENABLE_ENCLS 0x00008000
#define SECONDARY_EXEC_ENABLE_PML 0x00020000
#define SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS 0x00040000
#define SECONDARY_EXEC_XSAVES 0x00100000
@@ -325,6 +326,8 @@ extern u64 vmx_ept_vpid_cap;
(vmx_secondary_exec_control & SECONDARY_EXEC_XSAVES)
#define cpu_has_vmx_tsc_scaling \
(vmx_secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
+#define cpu_has_vmx_encls \
+ (vmx_secondary_exec_control & SECONDARY_EXEC_ENABLE_ENCLS)
#define VMCS_RID_TYPE_MASK 0x80000000
--
2.15.0
* [PATCH v2 04/17] xen: x86/mm: introduce ioremap_wb()
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (2 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 03/17] xen: vmx: detect ENCLS VMEXIT Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 05/17] xen: p2m: new 'p2m_epc' type for EPC mapping Boqun Feng
` (13 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
Currently Xen only has a non-cacheable version of ioremap for x86.
Although EPC is reported as reserved memory in e820, it can be mapped
as cacheable. This patch introduces ioremap_wb() (ioremap for cacheable,
write-back memory).
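As an illustration only (this is not part of the patch; the real caller
appears in the EPC management patch later in the series), the new helper
is meant to be used roughly like:

    /* Sketch: map the whole EPC range cacheable/write-back instead of UC-. */
    void *epc_va = ioremap_wb(epc_base_maddr, epc_size);

    if ( !epc_va )
        return -EFAULT;   /* mapping failed, SGX will be disabled */

where epc_base_maddr and epc_size stand for the EPC base address and size
reported by CPUID leaf 0x12.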
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/arch/x86/mm.c | 9 +++++++--
xen/include/asm-x86/mm.h | 7 +++++++
2 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 886a5ee327df..1111db1d1f40 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5207,7 +5207,7 @@ void *__init arch_vmap_virt_end(void)
return (void *)fix_to_virt(__end_of_fixed_addresses);
}
-void __iomem *ioremap(paddr_t pa, size_t len)
+void __iomem *__ioremap(paddr_t pa, size_t len, unsigned int flags)
{
mfn_t mfn = _mfn(PFN_DOWN(pa));
void *va;
@@ -5222,12 +5222,17 @@ void __iomem *ioremap(paddr_t pa, size_t len)
unsigned int offs = pa & (PAGE_SIZE - 1);
unsigned int nr = PFN_UP(offs + len);
- va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_UCMINUS, VMAP_DEFAULT) + offs;
+ va = __vmap(&mfn, nr, 1, 1, flags, VMAP_DEFAULT) + offs;
}
return (void __force __iomem *)va;
}
+void __iomem *ioremap(paddr_t pa, size_t len)
+{
+ return __ioremap(pa, len, PAGE_HYPERVISOR_UCMINUS);
+}
+
int create_perdomain_mapping(struct domain *d, unsigned long va,
unsigned int nr, l1_pgentry_t **pl1tab,
struct page_info **ppg)
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 83626085e0a6..77e3c3ba68d1 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -629,4 +629,11 @@ static inline bool arch_mfn_in_directmap(unsigned long mfn)
return mfn <= (virt_to_mfn(eva - 1) + 1);
}
+extern void __iomem *__ioremap(paddr_t, size_t, unsigned int);
+
+static inline void __iomem *ioremap_wb(paddr_t pa, size_t len)
+{
+ return __ioremap(pa, len, PAGE_HYPERVISOR);
+}
+
#endif /* __ASM_X86_MM_H__ */
--
2.15.0
* [PATCH v2 05/17] xen: p2m: new 'p2m_epc' type for EPC mapping
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (3 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 04/17] xen: x86/mm: introduce ioremap_wb() Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 06/17] xen: mm: introduce non-scrubbable pages Boqun Feng
` (12 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
A new 'p2m_epc' type is added for EPC mappings. Two wrapper functions,
set_epc_p2m_entry and clear_epc_p2m_entry, are also added for later use.
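For illustration only (not part of this patch; the actual caller is the
later EPC population patch), the wrappers are intended to be used along
these lines:

    /* Sketch: install an EPC mapping at guest frame 'gfn'... */
    rc = set_epc_p2m_entry(d, gfn, _mfn(page_to_mfn(epg)));

    /* ...and tear it down again when the page is returned. */
    rc = clear_epc_p2m_entry(d, gfn, _mfn(page_to_mfn(epg)));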
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
xen/arch/x86/mm/p2m-ept.c | 3 +++
xen/arch/x86/mm/p2m.c | 41 +++++++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/p2m.h | 12 ++++++++++--
3 files changed, 54 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index b4996ce658ac..34c2e2f8ac1c 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -182,6 +182,9 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
entry->a = !!cpu_has_vmx_ept_ad;
entry->d = 0;
break;
+ case p2m_epc:
+ entry->r = entry->w = entry->x = 1;
+ break;
}
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index c72a3cdebb81..8eeafe4b250c 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1192,6 +1192,12 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn_l,
return ret;
}
+int set_epc_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
+{
+ return set_typed_p2m_entry(d, gfn, mfn, PAGE_ORDER_4K, p2m_epc,
+ p2m_get_hostp2m(d)->default_access);
+}
+
/*
* Returns:
* 0 for success
@@ -1278,6 +1284,41 @@ int clear_identity_p2m_entry(struct domain *d, unsigned long gfn_l)
return ret;
}
+int clear_epc_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
+{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
+ mfn_t omfn;
+ p2m_type_t ot;
+ p2m_access_t oa;
+ int ret = 0;
+
+ gfn_lock(p2m, gfn, 0);
+
+ omfn = p2m->get_entry(p2m, _gfn(gfn), &ot, &oa, 0, NULL, NULL);
+ if ( mfn_eq(omfn, INVALID_MFN) || !p2m_is_epc(ot) )
+ {
+ printk(XENLOG_G_WARNING
+ "d%d: invalid EPC map to clear: gfn 0x%lx, type %d.\n",
+ d->domain_id, gfn, ot);
+ goto out;
+ }
+ if ( !mfn_eq(mfn, omfn) )
+ {
+ printk(XENLOG_G_WARNING
+ "d%d: mistaken EPC mfn to clear: gfn 0x%lx, "
+ "omfn 0x%lx, mfn 0x%lx.\n",
+ d->domain_id, gfn, mfn_x(omfn), mfn_x(mfn));
+ }
+
+ ret = p2m_set_entry(p2m, _gfn(gfn), INVALID_MFN, PAGE_ORDER_4K, p2m_invalid,
+ p2m->default_access);
+
+out:
+ gfn_unlock(p2m, gfn, 0);
+
+ return ret;
+}
+
/* Returns: 0 for success, -errno for failure */
int set_shared_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn)
{
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 17b1d0c8d326..40a40dd54380 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -72,6 +72,7 @@ typedef enum {
p2m_ram_broken = 13, /* Broken page, access cause domain crash */
p2m_map_foreign = 14, /* ram pages from foreign domain */
p2m_ioreq_server = 15,
+ p2m_epc = 16, /* EPC */
} p2m_type_t;
/* Modifiers to the query */
@@ -142,10 +143,13 @@ typedef unsigned int p2m_query_t;
| p2m_to_mask(p2m_ram_logdirty) )
#define P2M_SHARED_TYPES (p2m_to_mask(p2m_ram_shared))
+#define P2M_EPC_TYPES (p2m_to_mask(p2m_epc))
+
/* Valid types not necessarily associated with a (valid) MFN. */
#define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES \
| p2m_to_mask(p2m_mmio_direct) \
- | P2M_PAGING_TYPES)
+ | P2M_PAGING_TYPES \
+ | P2M_EPC_TYPES)
/* Broken type: the frame backing this pfn has failed in hardware
* and must not be touched. */
@@ -153,6 +157,7 @@ typedef unsigned int p2m_query_t;
/* Useful predicates */
#define p2m_is_ram(_t) (p2m_to_mask(_t) & P2M_RAM_TYPES)
+#define p2m_is_epc(_t) (p2m_to_mask(_t) & P2M_EPC_TYPES)
#define p2m_is_hole(_t) (p2m_to_mask(_t) & P2M_HOLE_TYPES)
#define p2m_is_mmio(_t) (p2m_to_mask(_t) & P2M_MMIO_TYPES)
#define p2m_is_readonly(_t) (p2m_to_mask(_t) & P2M_RO_TYPES)
@@ -163,7 +168,7 @@ typedef unsigned int p2m_query_t;
/* Grant types are *not* considered valid, because they can be
unmapped at any time and, unless you happen to be the shadow or p2m
implementations, there's no way of synchronising against that. */
-#define p2m_is_valid(_t) (p2m_to_mask(_t) & (P2M_RAM_TYPES | P2M_MMIO_TYPES))
+#define p2m_is_valid(_t) (p2m_to_mask(_t) & (P2M_RAM_TYPES | P2M_MMIO_TYPES | P2M_EPC_TYPES))
#define p2m_has_emt(_t) (p2m_to_mask(_t) & (P2M_RAM_TYPES | p2m_to_mask(p2m_mmio_direct)))
#define p2m_is_pageable(_t) (p2m_to_mask(_t) & P2M_PAGEABLE_TYPES)
#define p2m_is_paging(_t) (p2m_to_mask(_t) & P2M_PAGING_TYPES)
@@ -635,6 +640,9 @@ int clear_identity_p2m_entry(struct domain *d, unsigned long gfn);
int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
unsigned long gpfn, domid_t foreign_domid);
+int set_epc_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
+int clear_epc_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
+
/*
* Populate-on-demand
*/
--
2.15.0
* [PATCH v2 06/17] xen: mm: introduce non-scrubbable pages
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (4 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 05/17] xen: p2m: new 'p2m_epc' type for EPC mapping Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 07/17] xen: mm: manage EPC pages in Xen heaps Boqun Feng
` (11 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
We are about to use the existing heap allocator for EPC page management,
and we need to prevent EPC pages from being scrubbed or merged with
normal memory pages, because EPC pages cannot be accessed outside
enclaves.
To do so, we use one bit in 'page_info::u::free' to record whether a
page can be scrubbed. 'page_scrubbable' is introduced to test this bit;
for now it always returns 'true' on architectures that have no
unscrubbable pages such as EPC (i.e. ARM).
Besides, during the page merging stage, we cannot allow scrubbable and
unscrubbable pages to be merged, so 'page_mergeable' is introduced,
which simply tests whether two pages have the same scrubbable
attribute.
In 'scrub_one_page', scrubbing is aborted once the page is found to be
unscrubbable.
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/common/page_alloc.c | 10 +++++++---
xen/include/asm-arm/mm.h | 7 +++++++
xen/include/asm-x86/mm.h | 7 +++++++
3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 5616a8226376..220d7d91c62b 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1364,6 +1364,8 @@ static void free_heap_pages(
if ( pg[i].u.free.need_tlbflush )
page_set_tlbflush_timestamp(&pg[i]);
+ pg[i].u.free.scrubbable = true;
+
/* This page is not a guest frame any more. */
page_set_owner(&pg[i], NULL); /* set_gpfn_from_mfn snoops pg owner */
set_gpfn_from_mfn(mfn + i, INVALID_M2P_ENTRY);
@@ -1402,7 +1404,8 @@ static void free_heap_pages(
if ( !mfn_valid(_mfn(page_to_mfn(predecessor))) ||
!page_state_is(predecessor, free) ||
(PFN_ORDER(predecessor) != order) ||
- (phys_to_nid(page_to_maddr(predecessor)) != node) )
+ (phys_to_nid(page_to_maddr(predecessor)) != node) ||
+ !page_mergeable(predecessor, pg) )
break;
check_and_stop_scrub(predecessor);
@@ -1425,7 +1428,8 @@ static void free_heap_pages(
if ( !mfn_valid(_mfn(page_to_mfn(successor))) ||
!page_state_is(successor, free) ||
(PFN_ORDER(successor) != order) ||
- (phys_to_nid(page_to_maddr(successor)) != node) )
+ (phys_to_nid(page_to_maddr(successor)) != node) ||
+ !page_mergeable(successor, pg) )
break;
check_and_stop_scrub(successor);
@@ -2379,7 +2383,7 @@ __initcall(pagealloc_keyhandler_init);
void scrub_one_page(struct page_info *pg)
{
- if ( unlikely(pg->count_info & PGC_broken) )
+ if ( !page_scrubbable(pg) || unlikely(pg->count_info & PGC_broken) )
return;
#ifndef NDEBUG
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index ad2f2a43dcbc..c715e2290510 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -55,6 +55,9 @@ struct page_info
/* Do TLBs need flushing for safety before next page use? */
bool need_tlbflush:1;
+ /* Could this page be scrubbed when it's free? */
+ bool scrubbable:1;
+
#define BUDDY_NOT_SCRUBBING 0
#define BUDDY_SCRUBBING 1
#define BUDDY_SCRUB_ABORT 2
@@ -150,6 +153,10 @@ extern vaddr_t xenheap_virt_start;
(mfn_valid(_mfn(mfn)) && is_xen_heap_page(__mfn_to_page(mfn)))
#endif
+#define page_scrubbable(_p) true
+
+#define page_mergeable(_p1, _p2) true
+
#define is_xen_fixed_mfn(mfn) \
((pfn_to_paddr(mfn) >= virt_to_maddr(&_start)) && \
(pfn_to_paddr(mfn) <= virt_to_maddr(&_end)))
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 77e3c3ba68d1..b0f0ea0a8b5d 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -98,6 +98,8 @@ struct page_info
/* Do TLBs need flushing for safety before next page use? */
bool need_tlbflush;
+ /* Could this page be scrubbed when it's free? */
+ bool scrubbable;
#define BUDDY_NOT_SCRUBBING 0
#define BUDDY_SCRUBBING 1
@@ -283,6 +285,11 @@ struct page_info
/* OOS fixup entries */
#define SHADOW_OOS_FIXUPS 2
+#define page_scrubbable(_p) ((_p)->u.free.scrubbable)
+
+#define page_mergeable(_p1, _p2) \
+ (page_scrubbable(_p1) == page_scrubbable(_p2))
+
#define page_get_owner(_p) \
((struct domain *)((_p)->v.inuse._domain ? \
pdx_to_virt((_p)->v.inuse._domain) : NULL))
--
2.15.0
* [PATCH v2 07/17] xen: mm: manage EPC pages in Xen heaps
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (5 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 06/17] xen: mm: introduce non-scrubbable pages Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 08/17] xen: x86/mm: add SGX EPC management Boqun Feng
` (10 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
EPC is a limited resource reserved by the BIOS, and is reported as
reserved memory in e820 rather than as normal memory. EPC must be
managed in 4K pages, and cannot be accessed outside of enclaves.
Using the existing memory allocation API (i.e. the heaps) allows us to
manage EPC pages in an efficient way, and may benefit an EPC ballooning
implementation in the future.
In order to use the existing heap mechanism to manage EPC pages, a
dedicated MEMZONE is required, because we need to avoid mixing EPC
pages and normal pages in one zone. And for page_to_zone() to return
the proper zone number, 'PGC_epc' and 'is_epc_page' are introduced,
similar to 'PGC_xen_heap' and 'is_xen_heap_page'.
In 'free_heap_pages', 'need_scrub' is cleared if the page is found to
be an EPC page, because EPC pages cannot be scrubbed. There are no
entries for EPC pages in the M2P table, as it is not used for them, so
the related setup is skipped.
Besides, a 'MEMF_epc' memflag is introduced to tell the allocator to
return EPC pages rather than normal memory.
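For illustration (a sketch, not part of this patch; thin wrappers around
this are added in a later patch), EPC allocation and freeing then go
through the ordinary domheap interfaces:

    /* Sketch: ask the allocator for an EPC page instead of normal RAM. */
    struct page_info *epg = alloc_domheap_page(NULL, MEMF_epc);

    if ( epg )
        free_domheap_page(epg);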
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/common/page_alloc.c | 31 +++++++++++++++++++++++++------
xen/include/asm-arm/mm.h | 2 ++
xen/include/asm-x86/mm.h | 5 ++++-
xen/include/xen/mm.h | 2 ++
4 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 220d7d91c62b..3b9d2c1a534f 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -377,12 +377,14 @@ mfn_t __init alloc_boot_pages(unsigned long nr_pfns, unsigned long pfn_align)
* BINARY BUDDY ALLOCATOR
*/
-#define MEMZONE_XEN 0
+#define MEMZONE_EPC 0
+#define MEMZONE_XEN 1
#define NR_ZONES (PADDR_BITS - PAGE_SHIFT + 1)
#define bits_to_zone(b) (((b) < (PAGE_SHIFT + 1)) ? 1 : ((b) - PAGE_SHIFT))
-#define page_to_zone(pg) (is_xen_heap_page(pg) ? MEMZONE_XEN : \
- (flsl(page_to_mfn(pg)) ? : 1))
+#define page_to_zone(pg) (is_epc_page(pg) ? MEMZONE_EPC : \
+ is_xen_heap_page(pg) ? MEMZONE_XEN : \
+ (flsl(page_to_mfn(pg)) ? : MEMZONE_XEN + 1))
typedef struct page_list_head heap_by_zone_and_order_t[NR_ZONES][MAX_ORDER+1];
static heap_by_zone_and_order_t *_heap[MAX_NUMNODES];
@@ -921,7 +923,12 @@ static struct page_info *alloc_heap_pages(
}
node = phys_to_nid(page_to_maddr(pg));
- zone = page_to_zone(pg);
+
+ if ( memflags & MEMF_epc )
+ zone = MEMZONE_EPC;
+ else
+ zone = page_to_zone(pg);
+
buddy_order = PFN_ORDER(pg);
first_dirty = pg->u.free.first_dirty;
@@ -1332,10 +1339,14 @@ static void free_heap_pages(
unsigned long mask, mfn = page_to_mfn(pg);
unsigned int i, node = phys_to_nid(page_to_maddr(pg)), tainted = 0;
unsigned int zone = page_to_zone(pg);
+ bool is_epc = false;
ASSERT(order <= MAX_ORDER);
ASSERT(node >= 0);
+ is_epc = is_epc_page(pg);
+ need_scrub = need_scrub && !is_epc;
+
spin_lock(&heap_lock);
for ( i = 0; i < (1 << order); i++ )
@@ -1364,11 +1375,13 @@ static void free_heap_pages(
if ( pg[i].u.free.need_tlbflush )
page_set_tlbflush_timestamp(&pg[i]);
- pg[i].u.free.scrubbable = true;
+ pg[i].u.free.scrubbable = !is_epc;
/* This page is not a guest frame any more. */
page_set_owner(&pg[i], NULL); /* set_gpfn_from_mfn snoops pg owner */
- set_gpfn_from_mfn(mfn + i, INVALID_M2P_ENTRY);
+
+ if ( !is_epc )
+ set_gpfn_from_mfn(mfn + i, INVALID_M2P_ENTRY);
if ( need_scrub )
{
@@ -2232,6 +2245,12 @@ struct page_info *alloc_domheap_pages(
if ( memflags & MEMF_no_owner )
memflags |= MEMF_no_refcount;
+ /* MEMF_epc implies MEMF_no_scrub */
+ if ((memflags & MEMF_epc) &&
+ !(pg = alloc_heap_pages(MEMZONE_EPC, MEMZONE_EPC, order,
+ memflags | MEMF_no_scrub, d)))
+ return NULL;
+
if ( dma_bitsize && ((dma_zone = bits_to_zone(dma_bitsize)) < zone_hi) )
pg = alloc_heap_pages(dma_zone + 1, zone_hi, order, memflags, d);
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index c715e2290510..bca26f027402 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -153,6 +153,8 @@ extern vaddr_t xenheap_virt_start;
(mfn_valid(_mfn(mfn)) && is_xen_heap_page(__mfn_to_page(mfn)))
#endif
+#define is_epc_page(page) false
+
#define page_scrubbable(_p) true
#define page_mergeable(_p1, _p2) true
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index b0f0ea0a8b5d..1dedb8099801 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -259,8 +259,10 @@ struct page_info
#define PGC_state_free PG_mask(3, 9)
#define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st)
+#define _PGC_epc PG_shift(10)
+#define PGC_epc PG_mask(1, 10)
/* Count of references to this frame. */
-#define PGC_count_width PG_shift(9)
+#define PGC_count_width PG_shift(10)
#define PGC_count_mask ((1UL<<PGC_count_width)-1)
/*
@@ -271,6 +273,7 @@ struct page_info
#define PGC_need_scrub PGC_allocated
#define is_xen_heap_page(page) ((page)->count_info & PGC_xen_heap)
+#define is_epc_page(page) ((page)->count_info & PGC_epc)
#define is_xen_heap_mfn(mfn) \
(__mfn_valid(mfn) && is_xen_heap_page(__mfn_to_page(mfn)))
#define is_xen_fixed_mfn(mfn) \
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index e813c07b225c..721a2975c1d4 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -250,6 +250,8 @@ struct npfec {
#define MEMF_no_icache_flush (1U<<_MEMF_no_icache_flush)
#define _MEMF_no_scrub 8
#define MEMF_no_scrub (1U<<_MEMF_no_scrub)
+#define _MEMF_epc 9
+#define MEMF_epc (1U<<_MEMF_epc)
#define _MEMF_node 16
#define MEMF_node_mask ((1U << (8 * sizeof(nodeid_t))) - 1)
#define MEMF_node(n) ((((n) + 1) & MEMF_node_mask) << _MEMF_node)
--
2.15.0
* [PATCH v2 08/17] xen: x86/mm: add SGX EPC management
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (6 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 07/17] xen: mm: manage EPC pages in Xen heaps Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 09/17] xen: x86: add functions to populate and destroy EPC for domain Boqun Feng
` (9 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
Now that the heap allocator supports EPC pages, managing EPC pages is
simply a matter of putting them into the heap at boot, provided SGX is
supported and the EPC section is reported consistently. Allocation and
reclamation are just heap allocation and reclamation with MEMF_epc.
One more thing we need to do is to populate the portion of the
'frame_table' covering the EPC pages and set up the mapping properly.
SGX is disabled if EPC initialization finds any problem.
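The boot-time flow this patch implements can be summarised as follows
(an illustrative outline only; the real code is in the hunks below):

    /*
     * Sketch of sgx_init_epc():
     *   1. Map the EPC range write-back:  epc_base_vaddr = ioremap_wb(...);
     *   2. Extend frame_table over EPC:   init_epc_frametable(epc_base_mfn, n);
     *   3. Tag every EPC page:            pg->count_info |= PGC_epc;
     *   4. Hand the range to the heap:    init_domheap_pages(base, end);
     */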
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/arch/x86/sgx.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/sgx.h | 3 +
2 files changed, 164 insertions(+)
diff --git a/xen/arch/x86/sgx.c b/xen/arch/x86/sgx.c
index ead917543f3e..9409b041e4f7 100644
--- a/xen/arch/x86/sgx.c
+++ b/xen/arch/x86/sgx.c
@@ -22,6 +22,8 @@
#include <asm/cpufeature.h>
#include <asm/msr-index.h>
#include <asm/msr.h>
+#include <xen/errno.h>
+#include <xen/mm.h>
#include <asm/sgx.h>
struct sgx_cpuinfo __read_mostly boot_sgx_cpudata;
@@ -29,6 +31,13 @@ struct sgx_cpuinfo __read_mostly boot_sgx_cpudata;
static bool __read_mostly opt_sgx_enabled = false;
boolean_param("sgx", opt_sgx_enabled);
+#define total_epc_npages (boot_sgx_cpudata.epc_size >> PAGE_SHIFT)
+#define epc_base_mfn (boot_sgx_cpudata.epc_base >> PAGE_SHIFT)
+#define epc_base_maddr (boot_sgx_cpudata.epc_base)
+#define epc_end_maddr (epc_base_maddr + boot_sgx_cpudata.epc_size)
+
+static void *epc_base_vaddr = NULL;
+
static void __detect_sgx(struct sgx_cpuinfo *sgxinfo)
{
u32 eax, ebx, ecx, edx;
@@ -166,11 +175,163 @@ static void __init print_sgx_cpuinfo(struct sgx_cpuinfo *sgxinfo)
boot_sgx_cpudata.epc_base + boot_sgx_cpudata.epc_size);
}
+struct ft_page {
+ struct page_info *pg;
+ unsigned int order;
+ unsigned long idx;
+ struct list_head list;
+};
+
+static int extend_epc_frametable(unsigned long smfn, unsigned long emfn)
+{
+ unsigned long idx;
+ LIST_HEAD(ft_pages);
+ struct ft_page *ftp, *nftp;
+ int rc = 0;
+
+ for ( ; smfn < emfn; smfn += PDX_GROUP_COUNT )
+ {
+ idx = pfn_to_pdx(smfn) / PDX_GROUP_COUNT;
+
+ if (!test_bit(idx, pdx_group_valid))
+ {
+ unsigned long s = (unsigned long)pdx_to_page(idx * PDX_GROUP_COUNT);
+ struct page_info *pg;
+
+ ftp = xzalloc(struct ft_page);
+
+ if ( !ftp )
+ {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ pg = alloc_domheap_pages(NULL, PDX_GROUP_SHIFT - PAGE_SHIFT, 0);
+
+ if ( !pg )
+ {
+ xfree(ftp);
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ ftp->order = PDX_GROUP_SHIFT - PAGE_SHIFT;
+ ftp->pg = pg;
+ ftp->idx = idx;
+
+ list_add_tail(&ftp->list, &ft_pages);
+
+ map_pages_to_xen(s, page_to_mfn(pg),
+ 1UL << (PDX_GROUP_SHIFT - PAGE_SHIFT),
+ PAGE_HYPERVISOR);
+ memset((void *)s, 0, sizeof(struct page_info) * PDX_GROUP_COUNT);
+ }
+ }
+
+out:
+ list_for_each_entry_safe(ftp, nftp, &ft_pages, list)
+ {
+ if ( rc )
+ {
+ unsigned long s = (unsigned long)pdx_to_page(ftp->idx * PDX_GROUP_COUNT);
+
+ destroy_xen_mappings(s, s + (1UL << PDX_GROUP_SHIFT));
+ free_domheap_pages(ftp->pg, ftp->order);
+ }
+ list_del(&ftp->list);
+ xfree(ftp);
+ }
+
+ if ( !rc )
+ set_pdx_range(smfn, emfn);
+
+ return rc;
+}
+
+static int __init init_epc_frametable(unsigned long mfn, unsigned long npages)
+{
+ return extend_epc_frametable(mfn, mfn + npages);
+}
+
+static int __init init_epc_heap(void)
+{
+ struct page_info *pg;
+ unsigned long nrpages = total_epc_npages;
+ unsigned long i;
+ int rc = 0;
+
+ rc = init_epc_frametable(epc_base_mfn, nrpages);
+
+ if ( rc )
+ return rc;
+
+ for ( i = 0; i < nrpages; i++ )
+ {
+ pg = mfn_to_page(epc_base_mfn + i);
+ pg->count_info |= PGC_epc;
+ }
+
+ init_domheap_pages(epc_base_maddr, epc_end_maddr);
+
+ return rc;
+}
+
+struct page_info *alloc_epc_page(void)
+{
+ struct page_info *pg = alloc_domheap_page(NULL, MEMF_epc);
+
+ if ( !pg )
+ return NULL;
+
+ /*
+ * PGC_epc will be cleared in free_heap_pages(), so we add it back at
+ * allocation time, so that is_epc_page() will return true, when this page
+ * gets freed.
+ */
+ pg->count_info |= PGC_epc;
+
+ return pg;
+}
+
+void free_epc_page(struct page_info *epg)
+{
+ free_domheap_page(epg);
+}
+
+
+static int __init sgx_init_epc(void)
+{
+ int rc = 0;
+
+ epc_base_vaddr = ioremap_wb(epc_base_maddr,
+ total_epc_npages << PAGE_SHIFT);
+
+ if ( !epc_base_vaddr )
+ {
+ printk("Failed to ioremap_wb EPC range. Disable SGX.\n");
+
+ return -EFAULT;
+ }
+
+ rc = init_epc_heap();
+
+ if ( rc )
+ {
+ printk("Failed to init heap for EPC pages. Disable SGX.\n");
+ iounmap(epc_base_vaddr);
+ }
+
+ return rc;
+}
+
static int __init sgx_init(void)
{
if ( !cpu_has_sgx )
goto not_supported;
+ if ( sgx_init_epc() )
+ goto not_supported;
+
print_sgx_cpuinfo(&boot_sgx_cpudata);
return 0;
diff --git a/xen/include/asm-x86/sgx.h b/xen/include/asm-x86/sgx.h
index b37ebde64e84..8fed664fa154 100644
--- a/xen/include/asm-x86/sgx.h
+++ b/xen/include/asm-x86/sgx.h
@@ -58,4 +58,7 @@ void detect_sgx(struct sgx_cpuinfo *sgxinfo);
void disable_sgx(void);
#define sgx_lewr() (boot_sgx_cpudata.lewr)
+struct page_info *alloc_epc_page(void);
+void free_epc_page(struct page_info *epg);
+
#endif /* __ASM_X86_SGX_H__ */
--
2.15.0
* [PATCH v2 09/17] xen: x86: add functions to populate and destroy EPC for domain
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (7 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 08/17] xen: x86/mm: add SGX EPC management Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 10/17] xen: x86: add SGX cpuid handling support Boqun Feng
` (8 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
Add a per-domain structure to store SGX per-domain info. Currently only the
domain's EPC base and size are stored. Also add new functions for later use:
- domain_populate_epc # populate EPC when EPC base & size are notified.
- domain_reset_epc # reset domain's EPC to be invalid. Used when the domain
goes to S3-S5, or is being destroyed.
- domain_destroy_epc # destroy and free domain's EPC.
For now, those functions only work for HVM domains, and will return
-EFAULT if called for a non-HVM domain.
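An illustrative call sequence (not in this patch itself; later patches wire
these calls into the domctl and VMX teardown paths):

    /* Sketch: EPC lifecycle for an HVM guest. */
    if ( !domain_epc_populated(d) )
        rc = domain_populate_epc(d, epc_base_pfn, epc_npages); /* at creation */

    rc = domain_reset_epc(d, false);   /* e.g. S3-S5: EREMOVE, keep the pages */

    rc = domain_destroy_epc(d);        /* destruction: EREMOVE and free pages */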
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/arch/x86/hvm/vmx/vmx.c | 3 +
xen/arch/x86/sgx.c | 340 +++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/hvm/vmx/vmcs.h | 2 +
xen/include/asm-x86/sgx.h | 13 ++
4 files changed, 358 insertions(+)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index b18cceab55b2..92fb85b13a0c 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -417,6 +417,9 @@ static int vmx_domain_initialise(struct domain *d)
static void vmx_domain_destroy(struct domain *d)
{
+ if ( domain_epc_populated(d) )
+ domain_destroy_epc(d);
+
if ( !has_vlapic(d) )
return;
diff --git a/xen/arch/x86/sgx.c b/xen/arch/x86/sgx.c
index 9409b041e4f7..0c898c3086cb 100644
--- a/xen/arch/x86/sgx.c
+++ b/xen/arch/x86/sgx.c
@@ -25,6 +25,8 @@
#include <xen/errno.h>
#include <xen/mm.h>
#include <asm/sgx.h>
+#include <xen/sched.h>
+#include <asm/p2m.h>
struct sgx_cpuinfo __read_mostly boot_sgx_cpudata;
@@ -38,6 +40,344 @@ boolean_param("sgx", opt_sgx_enabled);
static void *epc_base_vaddr = NULL;
+static void *map_epc_page_to_xen(struct page_info *pg)
+{
+ BUG_ON(!epc_base_vaddr);
+
+ return (void *)((unsigned long)epc_base_vaddr +
+ ((page_to_mfn(pg) - epc_base_mfn) << PAGE_SHIFT));
+}
+
+/* ENCLS opcode */
+#define ENCLS .byte 0x0f, 0x01, 0xcf
+
+/*
+ * ENCLS leaf functions
+ *
+ * However, currently we only need EREMOVE.
+ */
+enum {
+ ECREATE = 0x0,
+ EADD = 0x1,
+ EINIT = 0x2,
+ EREMOVE = 0x3,
+ EDGBRD = 0x4,
+ EDGBWR = 0x5,
+ EEXTEND = 0x6,
+ ELDU = 0x8,
+ EBLOCK = 0x9,
+ EPA = 0xA,
+ EWB = 0xB,
+ ETRACK = 0xC,
+ EAUG = 0xD,
+ EMODPR = 0xE,
+ EMODT = 0xF,
+};
+
+/*
+ * ENCLS error code
+ *
+ * Currently we only need SGX_CHILD_PRESENT
+ */
+#define SGX_CHILD_PRESENT 13
+
+static inline int __encls(unsigned long rax, unsigned long rbx,
+ unsigned long rcx, unsigned long rdx)
+{
+ int ret;
+
+ asm volatile ( "ENCLS;\n\t"
+ : "=a" (ret)
+ : "a" (rax), "b" (rbx), "c" (rcx), "d" (rdx)
+ : "memory", "cc");
+
+ return ret;
+}
+
+static inline int __eremove(void *epc)
+{
+ unsigned long rbx = 0, rdx = 0;
+
+ return __encls(EREMOVE, rbx, (unsigned long)epc, rdx);
+}
+
+static int sgx_eremove(struct page_info *epg)
+{
+ void *addr = map_epc_page_to_xen(epg);
+ int ret;
+
+ BUG_ON(!addr);
+
+ ret = __eremove(addr);
+
+ return ret;
+}
+
+struct sgx_domain *to_sgx(struct domain *d)
+{
+ if (!is_hvm_domain(d))
+ return NULL;
+ else
+ return &d->arch.hvm_domain.vmx.sgx;
+}
+
+bool domain_epc_populated(struct domain *d)
+{
+ BUG_ON(!to_sgx(d));
+
+ return !!to_sgx(d)->epc_base_pfn;
+}
+
+/*
+ * Reset domain's EPC with EREMOVE. free_epc indicates whether to free EPC
+ * pages during reset. This will be called when domain goes into S3-S5 state
+ * (with free_epc being false), and when domain is destroyed (with free_epc
+ * being true).
+ *
+ * It is possible that EREMOVE will be called for SECS when it still has
+ * children present, in which case SGX_CHILD_PRESENT will be returned. In this
+ * case, SECS page is kept to a tmp list and after all EPC pages have been
+ * called with EREMOVE, we call EREMOVE for all the SECS pages again, and this
+ * time SGX_CHILD_PRESENT should never occur as all children should have been
+ * removed.
+ *
+ * If unexpected error returned by EREMOVE, it means the EPC page becomes
+ * abnormal, so it will not be freed even free_epc is true, as further use of
+ * this EPC can cause unexpected error, potentially damaging other domains.
+ */
+static int __domain_reset_epc(struct domain *d, unsigned long epc_base_pfn,
+ unsigned long epc_npages, bool free_epc)
+{
+ struct page_list_head secs_list;
+ struct page_info *epg, *tmp;
+ unsigned long i;
+ int ret = 0;
+
+ INIT_PAGE_LIST_HEAD(&secs_list);
+
+ for ( i = 0; i < epc_npages; i++ )
+ {
+ unsigned long gfn;
+ mfn_t mfn;
+ p2m_type_t t;
+ int r;
+
+ gfn = i + epc_base_pfn;
+ mfn = get_gfn_query(d, gfn, &t);
+ if ( unlikely(mfn_eq(mfn, INVALID_MFN)) )
+ {
+ printk("Domain %d: Reset EPC error: invalid MFN for gfn 0x%lx\n",
+ d->domain_id, gfn);
+ put_gfn(d, gfn);
+ ret = -EFAULT;
+ continue;
+ }
+
+ if ( unlikely(!p2m_is_epc(t)) )
+ {
+ printk("Domain %d: Reset EPC error: (gfn 0x%lx, mfn 0x%lx): "
+ "is not p2m_epc.\n", d->domain_id, gfn, mfn_x(mfn));
+ put_gfn(d, gfn);
+ ret = -EFAULT;
+ continue;
+ }
+
+ put_gfn(d, gfn);
+
+ epg = mfn_to_page(mfn_x(mfn));
+
+ /* EREMOVE the EPC page to make it invalid */
+ r = sgx_eremove(epg);
+ if ( r == SGX_CHILD_PRESENT )
+ {
+ page_list_add_tail(epg, &secs_list);
+ continue;
+ }
+
+ if ( r )
+ {
+ printk("Domain %d: Reset EPC error: (gfn 0x%lx, mfn 0x%lx): "
+ "EREMOVE returns %d\n", d->domain_id, gfn, mfn_x(mfn), r);
+ ret = r;
+ if ( free_epc )
+ printk("WARNING: EPC (mfn 0x%lx) becomes abnormal. "
+ "Remove it from useable EPC.", mfn_x(mfn));
+ continue;
+ }
+
+ if ( free_epc )
+ {
+ /* If EPC page is going to be freed, then also remove the mapping */
+ if ( clear_epc_p2m_entry(d, gfn, mfn) )
+ {
+ printk("Domain %d: Reset EPC error: (gfn 0x%lx, mfn 0x%lx): "
+ "clear p2m entry failed.\n", d->domain_id, gfn,
+ mfn_x(mfn));
+ ret = -EFAULT;
+ }
+ free_epc_page(epg);
+ }
+ }
+
+ page_list_for_each_safe(epg, tmp, &secs_list)
+ {
+ int r;
+
+ r = sgx_eremove(epg);
+ if ( r )
+ {
+ printk("Domain %d: Reset EPC error: mfn 0x%lx: "
+ "EREMOVE returns %d for SECS page\n",
+ d->domain_id, page_to_mfn(epg), r);
+ ret = r;
+ page_list_del(epg, &secs_list);
+
+ if ( free_epc )
+ printk("WARNING: EPC (mfn 0x%lx) becomes abnormal. "
+ "Remove it from useable EPC.",
+ page_to_mfn(epg));
+ continue;
+ }
+
+ if ( free_epc )
+ free_epc_page(epg);
+ }
+
+ return ret;
+}
+
+static void __domain_unpopulate_epc(struct domain *d,
+ unsigned long epc_base_pfn, unsigned long populated_npages)
+{
+ unsigned long i;
+
+ for ( i = 0; i < populated_npages; i++ )
+ {
+ struct page_info *epg;
+ unsigned long gfn;
+ mfn_t mfn;
+ p2m_type_t t;
+
+ gfn = i + epc_base_pfn;
+ mfn = get_gfn_query(d, gfn, &t);
+ if ( unlikely(mfn_eq(mfn, INVALID_MFN)) )
+ {
+ /*
+ * __domain_unpopulate_epc only called when creating the domain on
+ * failure, therefore we can just ignore this error.
+ */
+ printk("%s: Domain %u gfn 0x%lx returns invalid mfn\n", __func__,
+ d->domain_id, gfn);
+ put_gfn(d, gfn);
+ continue;
+ }
+
+ if ( unlikely(!p2m_is_epc(t)) )
+ {
+ printk("%s: Domain %u gfn 0x%lx returns non-EPC p2m type: %d\n",
+ __func__, d->domain_id, gfn, (int)t);
+ put_gfn(d, gfn);
+ continue;
+ }
+
+ put_gfn(d, gfn);
+
+ if ( clear_epc_p2m_entry(d, gfn, mfn) )
+ {
+ printk("clear_epc_p2m_entry failed: gfn 0x%lx, mfn 0x%lx\n",
+ gfn, mfn_x(mfn));
+ continue;
+ }
+
+ epg = mfn_to_page(mfn_x(mfn));
+ free_epc_page(epg);
+ }
+}
+
+static int __domain_populate_epc(struct domain *d, unsigned long epc_base_pfn,
+ unsigned long epc_npages)
+{
+ unsigned long i;
+ int ret;
+
+ for ( i = 0; i < epc_npages; i++ )
+ {
+ struct page_info *epg = alloc_epc_page();
+ unsigned long mfn;
+
+ if ( !epg )
+ {
+ printk("%s: Out of EPC\n", __func__);
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ mfn = page_to_mfn(epg);
+ ret = set_epc_p2m_entry(d, i + epc_base_pfn, _mfn(mfn));
+ if ( ret )
+ {
+ printk("%s: set_epc_p2m_entry failed with %d: gfn 0x%lx, "
+ "mfn 0x%lx\n", __func__, ret, i + epc_base_pfn, mfn);
+ free_epc_page(epg);
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ __domain_unpopulate_epc(d, epc_base_pfn, i);
+ return ret;
+}
+
+int domain_populate_epc(struct domain *d, unsigned long epc_base_pfn,
+ unsigned long epc_npages)
+{
+ struct sgx_domain *sgx = to_sgx(d);
+ int ret;
+
+ if ( !sgx )
+ return -EFAULT;
+
+ if ( domain_epc_populated(d) )
+ return -EBUSY;
+
+ if ( !epc_base_pfn || !epc_npages )
+ return -EINVAL;
+
+ if ( (ret = __domain_populate_epc(d, epc_base_pfn, epc_npages)) )
+ return ret;
+
+ sgx->epc_base_pfn = epc_base_pfn;
+ sgx->epc_npages = epc_npages;
+
+ return 0;
+}
+
+/*
+ * Reset the domain's EPC with EREMOVE, optionally freeing the pages.
+ *
+ * This function returns an error immediately if there's any unexpected error
+ * during this process.
+ */
+int domain_reset_epc(struct domain *d, bool free_epc)
+{
+ struct sgx_domain *sgx = to_sgx(d);
+
+ if ( !sgx )
+ return -EFAULT;
+
+ if ( !domain_epc_populated(d) )
+ return 0;
+
+ return __domain_reset_epc(d, sgx->epc_base_pfn, sgx->epc_npages, free_epc);
+}
+
+int domain_destroy_epc(struct domain *d)
+{
+ return domain_reset_epc(d, true);
+}
+
static void __detect_sgx(struct sgx_cpuinfo *sgxinfo)
{
u32 eax, ebx, ecx, edx;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index d0293b1a3620..44ff4f0a113f 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -20,6 +20,7 @@
#include <asm/hvm/io.h>
#include <irq_vectors.h>
+#include <asm/sgx.h>
extern void vmcs_dump_vcpu(struct vcpu *v);
extern void setup_vmcs_dump(void);
@@ -63,6 +64,7 @@ struct vmx_domain {
unsigned long apic_access_mfn;
/* VMX_DOMAIN_* */
unsigned int status;
+ struct sgx_domain sgx;
};
/*
diff --git a/xen/include/asm-x86/sgx.h b/xen/include/asm-x86/sgx.h
index 8fed664fa154..855e7e638743 100644
--- a/xen/include/asm-x86/sgx.h
+++ b/xen/include/asm-x86/sgx.h
@@ -24,6 +24,7 @@
#include <xen/types.h>
#include <xen/init.h>
#include <asm/processor.h>
+#include <public/hvm/params.h> /* HVM_PARAM_SGX */
#define SGX_CPUID 0x12
@@ -61,4 +62,16 @@ void disable_sgx(void);
struct page_info *alloc_epc_page(void);
void free_epc_page(struct page_info *epg);
+struct sgx_domain {
+ unsigned long epc_base_pfn;
+ unsigned long epc_npages;
+};
+
+struct sgx_domain *to_sgx(struct domain *d);
+bool domain_epc_populated(struct domain *d);
+int domain_populate_epc(struct domain *d, unsigned long epc_base_pfn,
+ unsigned long epc_npages);
+int domain_reset_epc(struct domain *d, bool free_epc);
+int domain_destroy_epc(struct domain *d);
+
#endif /* __ASM_X86_SGX_H__ */
--
2.15.0
* [PATCH v2 10/17] xen: x86: add SGX cpuid handling support.
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (8 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 09/17] xen: x86: add functions to populate and destroy EPC for domain Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 11/17] xen: vmx: handle SGX related MSRs Boqun Feng
` (7 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
This patch adds SGX CPUID handling support. The SGX feature bit is
reported in raw_policy and passed along to a guest, but in
recalculate_cpuid_policy() we clear it if SGX has been disabled for some
reason. For EPC info, the physical values are reported into raw_policy
and recalculated for the *_policy. For a particular domain, its EPC base
and size info will be filled in by the toolstack. Before the domain's
EPC base and size are properly configured, the guest's SGX CPUID should
report invalid EPC, which is also consistent with HW behavior.
Currently all EPC pages are fully populated for the domain when it is
created. Xen gets the domain's EPC base and size from the toolstack via
XEN_DOMCTL_set_cpuid, so the domain's EPC pages are also populated in
XEN_DOMCTL_set_cpuid, after receiving a valid EPC base and size. Failure
to populate EPC (e.g. there are not enough free EPC pages) results in
domain creation failure by making XEN_DOMCTL_set_cpuid return an error.
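To make the sub-leaf 2 encoding concrete (hypothetical numbers, for
illustration only): an EPC region at guest physical address 0x80000000
with a size of 64MB corresponds to base_pfn = 0x80000 and npages =
0x4000; the toolstack splits each value into a 20-bit low and a 20-bit
high field, and Xen reassembles them as in the domctl hunk below:

    /* Sketch: rebuild the EPC base pfn and page count from CPUID.0x12.0x2. */
    base_pfn = ((uint64_t)p->sgx.base_high << 20) | p->sgx.base_low;
    npages   = ((uint64_t)p->sgx.npages_high << 20) | p->sgx.npages_low;
    /* e.g. base_low = 0x80000, base_high = 0  =>  base_pfn = 0x80000 */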
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/arch/x86/cpuid.c | 62 ++++++++++++++++++++++++++++++++++++++++++++-
xen/arch/x86/domctl.c | 59 +++++++++++++++++++++++++++++++++++++++++-
xen/include/asm-x86/cpuid.h | 29 ++++++++++++++++++++-
3 files changed, 147 insertions(+), 3 deletions(-)
diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 5ee82d39d7cd..fcffbdec6bbe 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -9,6 +9,7 @@
#include <asm/paging.h>
#include <asm/processor.h>
#include <asm/xstate.h>
+#include <asm/sgx.h>
const uint32_t known_features[] = INIT_KNOWN_FEATURES;
const uint32_t special_features[] = INIT_SPECIAL_FEATURES;
@@ -152,6 +153,33 @@ static void recalculate_xstate(struct cpuid_policy *p)
}
}
+static void recalculate_sgx(struct cpuid_policy *p)
+{
+ if ( !p->feat.sgx || !p->sgx.sgx1 )
+ {
+ memset(&p->sgx, 0, sizeof (p->sgx));
+ return;
+ }
+
+ /*
+ * SDM 42.7.2.1 SECS.ATTRIBUTE.XFRM:
+ *
+ * Legal values for SECS.ATTRIBUTE.XFRM conform to these requirements:
+ * - XFRM[1:0] must be set to 0x3;
+ * - If processor does not support XSAVE, or if the system software has not
+ * enabled XSAVE, then XFRM[63:2] must be 0.
+ * - If the processor does support XSAVE, XFRM must contain a value that
+ * would be legal if loaded into XCR0.
+ */
+ p->sgx.xfrm_low = 0x3;
+ p->sgx.xfrm_high = 0;
+ if ( p->basic.xsave )
+ {
+ p->sgx.xfrm_low |= p->xstate.xcr0_low;
+ p->sgx.xfrm_high |= p->xstate.xcr0_high;
+ }
+}
+
/*
* Misc adjustments to the policy. Mostly clobbering reserved fields and
* duplicating shared fields. Intentionally hidden fields are annotated.
@@ -233,7 +261,7 @@ static void __init calculate_raw_policy(void)
{
switch ( i )
{
- case 0x4: case 0x7: case 0xd:
+ case 0x4: case 0x7: case 0xd: case 0x12:
/* Multi-invocation leaves. Deferred. */
continue;
}
@@ -293,6 +321,19 @@ static void __init calculate_raw_policy(void)
}
}
+ if ( p->basic.max_leaf >= SGX_CPUID )
+ {
+ /*
+ * For raw policy we just report native CPUID. For EPC on native it's
+ * possible that we will have multiple EPC sections (meaning subleaf 3,
+ * 4, ... may also be valid), but as the policy is for guest so we only
+ * need one EPC section (subleaf 2).
+ */
+ cpuid_count_leaf(SGX_CPUID, 0, &p->sgx.raw[0]);
+ cpuid_count_leaf(SGX_CPUID, 1, &p->sgx.raw[1]);
+ cpuid_count_leaf(SGX_CPUID, 2, &p->sgx.raw[2]);
+ }
+
/* Extended leaves. */
cpuid_leaf(0x80000000, &p->extd.raw[0]);
for ( i = 1; i < min(ARRAY_SIZE(p->extd.raw),
@@ -318,6 +359,7 @@ static void __init calculate_host_policy(void)
cpuid_featureset_to_policy(boot_cpu_data.x86_capability, p);
recalculate_xstate(p);
recalculate_misc(p);
+ recalculate_sgx(p);
if ( p->extd.svm )
{
@@ -351,6 +393,7 @@ static void __init calculate_pv_max_policy(void)
sanitise_featureset(pv_featureset);
cpuid_featureset_to_policy(pv_featureset, p);
recalculate_xstate(p);
+ recalculate_sgx(p);
p->extd.raw[0xa] = EMPTY_LEAF; /* No SVM for PV guests. */
}
@@ -408,6 +451,7 @@ static void __init calculate_hvm_max_policy(void)
sanitise_featureset(hvm_featureset);
cpuid_featureset_to_policy(hvm_featureset, p);
recalculate_xstate(p);
+ recalculate_sgx(p);
}
void __init init_guest_cpuid(void)
@@ -523,6 +567,14 @@ void recalculate_cpuid_policy(struct domain *d)
if ( p->basic.max_leaf < XSTATE_CPUID )
__clear_bit(X86_FEATURE_XSAVE, fs);
+ /*
+ * We check cpu_has_sgx here because during boot up SGX may be disabled
+ * via disable_sgx(), e.g. BIOS disables SGX by setting
+ * IA32_FEATURE_CONTROL_SGX_ENABLE=0
+ */
+ if ( p->basic.max_leaf < SGX_CPUID || !cpu_has_sgx )
+ __clear_bit(X86_FEATURE_SGX, fs);
+
sanitise_featureset(fs);
/* Fold host's FDP_EXCP_ONLY and NO_FPU_SEL into guest's view. */
@@ -545,6 +597,7 @@ void recalculate_cpuid_policy(struct domain *d)
recalculate_xstate(p);
recalculate_misc(p);
+ recalculate_sgx(p);
for ( i = 0; i < ARRAY_SIZE(p->cache.raw); ++i )
{
@@ -641,6 +694,13 @@ void guest_cpuid(const struct vcpu *v, uint32_t leaf,
*res = p->xstate.raw[subleaf];
break;
+ case SGX_CPUID:
+ if ( !p->feat.sgx || subleaf >= ARRAY_SIZE(p->sgx.raw) )
+ return;
+
+ *res = p->sgx.raw[subleaf];
+ break;
+
default:
*res = p->basic.raw[leaf];
break;
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 80b4df9ec95b..0ee9fb6458ec 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -53,6 +53,7 @@ static int update_domain_cpuid_info(struct domain *d,
struct cpuid_policy *p = d->arch.cpuid;
const struct cpuid_leaf leaf = { ctl->eax, ctl->ebx, ctl->ecx, ctl->edx };
int old_vendor = p->x86_vendor;
+ int ret = 0;
/*
* Skip update for leaves we don't care about. This avoids the overhead
@@ -74,6 +75,11 @@ static int update_domain_cpuid_info(struct domain *d,
if ( ctl->input[0] == XSTATE_CPUID &&
ctl->input[1] != 1 ) /* Everything else automatically calculated. */
return 0;
+
+ if ( ctl->input[0] == SGX_CPUID &&
+ ctl->input[1] >= ARRAY_SIZE(p->sgx.raw) )
+ return 0;
+
break;
case 0x40000000: case 0x40000100:
@@ -104,6 +110,10 @@ static int update_domain_cpuid_info(struct domain *d,
p->xstate.raw[ctl->input[1]] = leaf;
break;
+ case SGX_CPUID:
+ p->sgx.raw[ctl->input[1]] = leaf;
+ break;
+
default:
p->basic.raw[ctl->input[0]] = leaf;
break;
@@ -255,6 +265,53 @@ static int update_domain_cpuid_info(struct domain *d,
}
break;
+ case 0x12:
+ {
+ uint64_t base_pfn, npages;
+ struct sgx_domain *sd;
+
+ if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL )
+ break;
+
+ if ( ctl->input[1] != 2 )
+ break;
+
+ /* SGX is not enabled */
+ if ( !p->feat.sgx || !p->sgx.sgx1 )
+ break;
+
+ /*
+ * If SGX is enabled in CPUID, then we expect a valid EPC resource in
+ * sub-leaf 0x2. Return -EINVAL to notify the toolstack that there's
+ * something wrong.
+ */
+ if ( !p->sgx.base_valid || !p->sgx.size_valid )
+ {
+ ret = -EINVAL;
+ break;
+ }
+
+ base_pfn = (((uint64_t)(p->sgx.base_high)) << 20) |
+ (uint64_t)p->sgx.base_low;
+ npages = (((uint64_t)(p->sgx.npages_high)) << 20) |
+ (uint64_t)p->sgx.npages_low;
+
+ sd = to_sgx(d);
+
+ if ( !sd )
+ {
+ ret = -EFAULT;
+ break;
+ }
+
+ if ( !domain_epc_populated(d) )
+ ret = domain_populate_epc(d, base_pfn, npages);
+ else
+ if ( base_pfn != sd->epc_base_pfn || npages != sd->epc_npages )
+ ret = -EINVAL;
+
+ break;
+ }
case 0x80000001:
if ( is_pv_domain(d) && ((levelling_caps & LCAP_e1cd) == LCAP_e1cd) )
{
@@ -299,7 +356,7 @@ static int update_domain_cpuid_info(struct domain *d,
break;
}
- return 0;
+ return ret;
}
static int vcpu_set_vmce(struct vcpu *v,
diff --git a/xen/include/asm-x86/cpuid.h b/xen/include/asm-x86/cpuid.h
index d2dd841e1581..6d043843713a 100644
--- a/xen/include/asm-x86/cpuid.h
+++ b/xen/include/asm-x86/cpuid.h
@@ -61,10 +61,11 @@ extern struct cpuidmasks cpuidmask_defaults;
/* Whether or not cpuid faulting is available for the current domain. */
DECLARE_PER_CPU(bool, cpuid_faulting_enabled);
-#define CPUID_GUEST_NR_BASIC (0xdu + 1)
+#define CPUID_GUEST_NR_BASIC (0x12u + 1)
#define CPUID_GUEST_NR_FEAT (0u + 1)
#define CPUID_GUEST_NR_CACHE (5u + 1)
#define CPUID_GUEST_NR_XSTATE (62u + 1)
+#define CPUID_GUEST_NR_SGX (0x2u + 1)
#define CPUID_GUEST_NR_EXTD_INTEL (0x8u + 1)
#define CPUID_GUEST_NR_EXTD_AMD (0x1cu + 1)
#define CPUID_GUEST_NR_EXTD MAX(CPUID_GUEST_NR_EXTD_INTEL, \
@@ -169,6 +170,32 @@ struct cpuid_policy
} comp[CPUID_GUEST_NR_XSTATE];
} xstate;
+ union {
+ struct cpuid_leaf raw[CPUID_GUEST_NR_SGX];
+
+ struct {
+ /* Subleaf 0. */
+ bool sgx1:1, sgx2:1; uint32_t :30;
+ uint32_t miscselect;
+ uint32_t /* c */ :32;
+ uint8_t maxsize_legecy, maxsize_long; uint32_t :16; /* d */
+
+ /* Subleaf 1. */
+ bool init:1, debug:1, mode64:1, /*reserve*/:1, provisionkey:1,
+ einittokenkey:1; uint32_t :26;
+ uint32_t /* SW reserved */ :32;
+ uint32_t xfrm_low, xfrm_high;
+
+ /* Subleaf 2. */
+ bool base_valid:1; uint32_t :11;
+ uint32_t base_low:20;
+ uint32_t base_high:20, :12;
+ bool size_valid:1; uint32_t :11;
+ uint32_t npages_low:20;
+ uint32_t npages_high:20, :12;
+ };
+ } sgx;
+
/* Extended leaves: 0x800000xx */
union {
struct cpuid_leaf raw[CPUID_GUEST_NR_EXTD];
--
2.15.0
* [PATCH v2 11/17] xen: vmx: handle SGX related MSRs
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (9 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 10/17] xen: x86: add SGX cpuid handling support Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 12/17] xen: vmx: handle ENCLS VMEXIT Boqun Feng
` (6 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
This patch handles the IA32_FEATURE_CONTROL and IA32_SGXLEPUBKEYHASHn MSRs.
For IA32_FEATURE_CONTROL, if SGX is exposed to the domain, then the
SGX_ENABLE bit is always set. The SGX_LE_WR bit defaults to 0, unless
1) SGX launch control is exposed to the domain and 2) the XL parameter
'lewr' is true (the handling of this parameter is in a later patch, so
for this patch the SGX_LE_WR bit is always 0). Writes to
IA32_FEATURE_CONTROL will fault.
For IA32_SGXLEPUBKEYHASHn, the vcpu's virtual ia32_sgxlepubkeyhash[0-3]
are added to the 'sgx' field of 'struct msr_vcpu_policy'.
When a vcpu is initialized, its virtual ia32_sgxlepubkeyhash values are
also initialized. The default values are the physical values of the
physical machine. Later on, we may reset those values with the content
of the XL parameter 'lehash'. Besides, if 'lewr' is true and no 'lehash'
is provided, we reset those values to Intel's default value, as physical
machines have Intel's default value in those MSRs.
For IA32_SGXLEPUBKEYHASHn MSR reads from the guest, if SGX launch
control is not exposed to the domain, the guest is not allowed to read
them either; otherwise the vcpu's virtual MSR value is returned.
For IA32_SGXLEPUBKEYHASHn MSR writes from the guest, we allow the guest
to write only if 'lewr' is set (so for this patch, writes will fault).
To make EINIT run successfully in the guest, the vcpu's virtual
IA32_SGXLEPUBKEYHASHn values are written to the physical MSRs when the
vcpu is scheduled in. Moreover, we cache the most recent
IA32_SGXLEPUBKEYHASHn values in a percpu variable, so that we don't need
to update them with wrmsr if the values have not changed.
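The per-cpu caching mentioned above amounts to the following (an
illustrative, loop-style equivalent of the macro used in the patch; the
four MSRs are architecturally consecutive, 0x8c-0x8f):

    /* Sketch: only execute wrmsr when the cached value actually differs. */
    for ( i = 0; i < 4; i++ )
        if ( vp->sgx.ia32_sgxlepubkeyhash[i] !=
             this_cpu(cpu_ia32_sgxlepubkeyhash[i]) )
        {
            wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i,
                   vp->sgx.ia32_sgxlepubkeyhash[i]);
            this_cpu(cpu_ia32_sgxlepubkeyhash[i]) =
                vp->sgx.ia32_sgxlepubkeyhash[i];
        }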
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
xen/arch/x86/domctl.c | 28 ++++++++-
xen/arch/x86/hvm/vmx/vmx.c | 19 ++++++
xen/arch/x86/msr.c | 6 +-
xen/arch/x86/sgx.c | 123 +++++++++++++++++++++++++++++++++++++++
xen/include/asm-x86/cpufeature.h | 3 +
xen/include/asm-x86/msr-index.h | 5 ++
xen/include/asm-x86/msr.h | 5 ++
xen/include/asm-x86/sgx.h | 9 +++
8 files changed, 196 insertions(+), 2 deletions(-)
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 0ee9fb6458ec..eb5d4b346313 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1352,13 +1352,16 @@ long arch_do_domctl(
ret = -EINVAL;
if ( (v == curr) || /* no vcpu_pause() */
- !is_pv_domain(d) )
+ (!is_pv_domain(d) && !d->arch.cpuid->feat.sgx_lc) )
break;
/* Count maximum number of optional msrs. */
if ( boot_cpu_has(X86_FEATURE_DBEXT) )
nr_msrs += 4;
+ if ( d->arch.cpuid->feat.sgx_lc )
+ nr_msrs += 5;
+
if ( domctl->cmd == XEN_DOMCTL_get_vcpu_msrs )
{
ret = 0; copyback = true;
@@ -1447,6 +1450,29 @@ long arch_do_domctl(
msr.index -= MSR_AMD64_DR1_ADDRESS_MASK - 1;
v->arch.pv_vcpu.dr_mask[msr.index] = msr.value;
continue;
+ case MSR_IA32_FEATURE_CONTROL:
+ if ( msr.value & IA32_FEATURE_CONTROL_SGX_LE_WR )
+ {
+ if ( d->arch.cpuid->feat.sgx_lc && sgx_lewr())
+ {
+ v->arch.msr->sgx.lewr = true;
+ continue;
+ }
+ else /* Try to set LE_WR while not supported */
+ break;
+ }
+ continue;
+ case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
+ if ( d->arch.cpuid->feat.sgx_lc && sgx_lewr() )
+ {
+ sgx_set_vcpu_sgxlepubkeyhash(v,
+ msr.index - MSR_IA32_SGXLEPUBKEYHASH0,
+ msr.value);
+ continue;
+ }
+ else
+ break;
+ continue;
}
break;
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 92fb85b13a0c..ce1c95f69062 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1049,6 +1049,9 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
if ( v->domain->arch.hvm_domain.pi_ops.switch_to )
v->domain->arch.hvm_domain.pi_ops.switch_to(v);
+
+ if ( v->domain->arch.cpuid->feat.sgx_lc && sgx_lewr() )
+ sgx_ctxt_switch_to(v);
}
@@ -2892,6 +2895,8 @@ static int is_last_branch_msr(u32 ecx)
static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
{
const struct vcpu *curr = current;
+ const struct msr_vcpu_policy *vp = curr->arch.msr;
+ const struct domain *d = current->domain;
HVM_DBG_LOG(DBG_LEVEL_MSR, "ecx=%#x", msr);
@@ -2915,11 +2920,19 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
*msr_content |= IA32_FEATURE_CONTROL_LMCE_ON;
if ( nestedhvm_enabled(curr->domain) )
*msr_content |= IA32_FEATURE_CONTROL_ENABLE_VMXON_OUTSIDE_SMX;
+ if ( d->arch.cpuid->feat.sgx )
+ *msr_content |= IA32_FEATURE_CONTROL_SGX_ENABLE;
+ if ( vp->sgx.lewr )
+ *msr_content |= IA32_FEATURE_CONTROL_SGX_LE_WR;
break;
case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_VMFUNC:
if ( !nvmx_msr_read_intercept(msr, msr_content) )
goto gp_fault;
break;
+ case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
+ if ( !sgx_msr_read_intercept(current, msr, msr_content) )
+ goto gp_fault;
+ break;
case MSR_IA32_MISC_ENABLE:
rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
/* Debug Trace Store is not supported. */
@@ -3146,6 +3159,12 @@ static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content)
case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
/* None of these MSRs are writeable. */
goto gp_fault;
+ break;
+
+ case MSR_IA32_SGXLEPUBKEYHASH0...MSR_IA32_SGXLEPUBKEYHASH3:
+ if ( !sgx_msr_write_intercept(current, msr, msr_content) )
+ goto gp_fault;
+ break;
case MSR_P6_PERFCTR(0)...MSR_P6_PERFCTR(7):
case MSR_P6_EVNTSEL(0)...MSR_P6_EVNTSEL(7):
diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index baba44f43d05..95cb41b4d825 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -23,6 +23,7 @@
#include <xen/lib.h>
#include <xen/sched.h>
#include <asm/msr.h>
+#include <asm/sgx.h>
struct msr_domain_policy __read_mostly hvm_max_msr_domain_policy,
__read_mostly pv_max_msr_domain_policy;
@@ -112,6 +113,8 @@ int init_vcpu_msr_policy(struct vcpu *v)
if ( is_control_domain(d) )
vp->misc_features_enables.available = false;
+ sgx_msr_vcpu_init(v, vp);
+
v->arch.msr = vp;
return 0;
@@ -119,8 +122,9 @@ int init_vcpu_msr_policy(struct vcpu *v)
int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
{
- const struct msr_domain_policy *dp = v->domain->arch.msr;
const struct msr_vcpu_policy *vp = v->arch.msr;
+ const struct domain *d = v->domain;
+ const struct msr_domain_policy *dp = d->arch.msr;
switch ( msr )
{
diff --git a/xen/arch/x86/sgx.c b/xen/arch/x86/sgx.c
index 0c898c3086cb..d103eb243e7a 100644
--- a/xen/arch/x86/sgx.c
+++ b/xen/arch/x86/sgx.c
@@ -27,12 +27,15 @@
#include <asm/sgx.h>
#include <xen/sched.h>
#include <asm/p2m.h>
+#include <xen/percpu.h>
struct sgx_cpuinfo __read_mostly boot_sgx_cpudata;
static bool __read_mostly opt_sgx_enabled = false;
boolean_param("sgx", opt_sgx_enabled);
+DEFINE_PER_CPU(uint64_t[4], cpu_ia32_sgxlepubkeyhash);
+
#define total_epc_npages (boot_sgx_cpudata.epc_size >> PAGE_SHIFT)
#define epc_base_mfn (boot_sgx_cpudata.epc_base >> PAGE_SHIFT)
#define epc_base_maddr (boot_sgx_cpudata.epc_base)
@@ -378,6 +381,126 @@ int domain_destroy_epc(struct domain *d)
return domain_reset_epc(d, true);
}
+/* Digest of Intel signing key. MSR's default value after reset. */
+#define SGX_INTEL_DEFAULT_LEPUBKEYHASH0 0xa6053e051270b7ac
+#define SGX_INTEL_DEFAULT_LEPUBKEYHASH1 0x6cfbe8ba8b3b413d
+#define SGX_INTEL_DEFAULT_LEPUBKEYHASH2 0xc4916d99f2b3735d
+#define SGX_INTEL_DEFAULT_LEPUBKEYHASH3 0xd4f8c05909f9bb3b
+
+void sgx_set_vcpu_sgxlepubkeyhash(struct vcpu *v, int idx, uint64_t val)
+{
+ BUG_ON(idx < 0 || idx > 3);
+
+ v->arch.msr->sgx.ia32_sgxlepubkeyhash[idx] = val;
+}
+
+void sgx_msr_vcpu_init(struct vcpu *v, struct msr_vcpu_policy *vp)
+{
+ const struct domain *d = v->domain;
+
+ /* lewr is default false */
+ vp->sgx.lewr = false;
+
+ if ( d->arch.cpuid->feat.sgx_lc )
+ {
+ if ( sgx_lewr() )
+ {
+ vp->sgx.ia32_sgxlepubkeyhash[0] = SGX_INTEL_DEFAULT_LEPUBKEYHASH0;
+ vp->sgx.ia32_sgxlepubkeyhash[1] = SGX_INTEL_DEFAULT_LEPUBKEYHASH1;
+ vp->sgx.ia32_sgxlepubkeyhash[2] = SGX_INTEL_DEFAULT_LEPUBKEYHASH2;
+ vp->sgx.ia32_sgxlepubkeyhash[3] = SGX_INTEL_DEFAULT_LEPUBKEYHASH3;
+ }
+ else
+ {
+ rdmsrl(MSR_IA32_SGXLEPUBKEYHASH0, vp->sgx.ia32_sgxlepubkeyhash[0]);
+ rdmsrl(MSR_IA32_SGXLEPUBKEYHASH1, vp->sgx.ia32_sgxlepubkeyhash[1]);
+ rdmsrl(MSR_IA32_SGXLEPUBKEYHASH2, vp->sgx.ia32_sgxlepubkeyhash[2]);
+ rdmsrl(MSR_IA32_SGXLEPUBKEYHASH3, vp->sgx.ia32_sgxlepubkeyhash[3]);
+ }
+ }
+}
+
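+/*
+ * Lazily sync the guest's IA32_SGXLEPUBKEYHASHn values: only issue a WRMSR
+ * when the vcpu's value differs from what is currently programmed on this
+ * pCPU (cached in cpu_ia32_sgxlepubkeyhash).
+ */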
+#define sgx_try_to_write_msr(vp, i) \
+do \
+{ \
+ if ((vp)->sgx.ia32_sgxlepubkeyhash[i] != \
+ this_cpu(cpu_ia32_sgxlepubkeyhash[i])) \
+ { \
+ wrmsrl(MSR_IA32_SGXLEPUBKEYHASH##i, \
+ (vp)->sgx.ia32_sgxlepubkeyhash[i]); \
+ this_cpu(cpu_ia32_sgxlepubkeyhash[i]) = \
+ (vp)->sgx.ia32_sgxlepubkeyhash[i]; \
+ } \
+} while (0)
+
+void sgx_ctxt_switch_to(struct vcpu *v)
+{
+ struct msr_vcpu_policy *vp = v->arch.msr;
+
+ sgx_try_to_write_msr(vp, 0);
+ sgx_try_to_write_msr(vp, 1);
+ sgx_try_to_write_msr(vp, 2);
+ sgx_try_to_write_msr(vp, 3);
+}
+
+int sgx_msr_read_intercept(struct vcpu *v, unsigned int msr, u64 *msr_content)
+{
+ const struct msr_vcpu_policy *vp = v->arch.msr;
+ const struct domain *d = v->domain;
+ u64 data;
+ int r = 1;
+
+ if ( !d->arch.cpuid->feat.sgx_lc )
+ return 0;
+
+ switch ( msr )
+ {
+ case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
+ data = vp->sgx.ia32_sgxlepubkeyhash[msr - MSR_IA32_SGXLEPUBKEYHASH0];
+ *msr_content = data;
+
+ break;
+ default:
+ r = 0;
+ break;
+ }
+
+ return r;
+}
+
+int sgx_msr_write_intercept(struct vcpu *v, unsigned int msr, u64 msr_content)
+{
+ struct msr_vcpu_policy *vp = v->arch.msr;
+ const struct domain *d = v->domain;
+ int r = 1;
+
+ /*
+ * SDM 35.1 Model-Specific Registers, table 35-2.
+ *
+ * IA32_SGXLEPUBKEYHASH[0..3]:
+ *
+ * - If CPUID.0x7.0:ECX[30] = 1, FEATURE_CONTROL[17] is available.
+ * - Write permitted if CPUID.0x12.0:EAX[0] = 1 &&
+ * FEATURE_CONTROL[17] = 1 && FEATURE_CONTROL[0] = 1.
+ */
+ if ( !d->arch.cpuid->feat.sgx_lc || !vp->sgx.lewr )
+ return 0;
+
+ switch ( msr )
+ {
+ case MSR_IA32_SGXLEPUBKEYHASH0 ... MSR_IA32_SGXLEPUBKEYHASH3:
+ vp->sgx.ia32_sgxlepubkeyhash[msr - MSR_IA32_SGXLEPUBKEYHASH0] =
+ msr_content;
+
+ break;
+ default:
+ r = 0;
+ break;
+ }
+
+ return r;
+}
+
static void __detect_sgx(struct sgx_cpuinfo *sgxinfo)
{
u32 eax, ebx, ecx, edx;
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 9793f8c1c586..f15deb535871 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -98,6 +98,9 @@
#define cpu_has_smap boot_cpu_has(X86_FEATURE_SMAP)
#define cpu_has_sha boot_cpu_has(X86_FEATURE_SHA)
+/* CPUID level 0x00000007:0.ecx */
+#define cpu_has_sgx_lc boot_cpu_has(X86_FEATURE_SGX_LC)
+
/* CPUID level 0x80000007.edx */
#define cpu_has_itsc boot_cpu_has(X86_FEATURE_ITSC)
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 63e11931cd09..004e0fb249d5 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -300,6 +300,11 @@
#define IA32_FEATURE_CONTROL_LMCE_ON 0x100000
#define IA32_FEATURE_CONTROL_SGX_LE_WR 0x20000
+#define MSR_IA32_SGXLEPUBKEYHASH0 0x0000008c
+#define MSR_IA32_SGXLEPUBKEYHASH1 0x0000008d
+#define MSR_IA32_SGXLEPUBKEYHASH2 0x0000008e
+#define MSR_IA32_SGXLEPUBKEYHASH3 0x0000008f
+
#define MSR_IA32_TSC_ADJUST 0x0000003b
#define MSR_IA32_APICBASE 0x0000001b
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 751fa25a3694..e255a28f7fec 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -220,6 +220,11 @@ struct msr_vcpu_policy
bool available; /* This MSR is non-architectural */
bool cpuid_faulting;
} misc_features_enables;
+
+ struct {
+ bool lewr;
+ uint64_t ia32_sgxlepubkeyhash[4];
+ } sgx;
};
void init_guest_msr_policy(void);
diff --git a/xen/include/asm-x86/sgx.h b/xen/include/asm-x86/sgx.h
index 855e7e638743..97ca5dd5b1ef 100644
--- a/xen/include/asm-x86/sgx.h
+++ b/xen/include/asm-x86/sgx.h
@@ -24,6 +24,7 @@
#include <xen/types.h>
#include <xen/init.h>
#include <asm/processor.h>
+#include <asm/msr.h>
#include <public/hvm/params.h> /* HVM_PARAM_SGX */
#define SGX_CPUID 0x12
@@ -59,6 +60,8 @@ void detect_sgx(struct sgx_cpuinfo *sgxinfo);
void disable_sgx(void);
#define sgx_lewr() (boot_sgx_cpudata.lewr)
+DECLARE_PER_CPU(uint64_t[4], cpu_ia32_sgxlepubkeyhash);
+
struct page_info *alloc_epc_page(void);
void free_epc_page(struct page_info *epg);
@@ -74,4 +77,10 @@ int domain_populate_epc(struct domain *d, unsigned long epc_base_pfn,
int domain_reset_epc(struct domain *d, bool free_epc);
int domain_destroy_epc(struct domain *d);
+void sgx_set_vcpu_sgxlepubkeyhash(struct vcpu *v, int idx, uint64_t val);
+void sgx_ctxt_switch_to(struct vcpu *v);
+void sgx_msr_vcpu_init(struct vcpu *v, struct msr_vcpu_policy *vp);
+int sgx_msr_read_intercept(struct vcpu *v, unsigned int msr, u64 *msr_content);
+int sgx_msr_write_intercept(struct vcpu *v, unsigned int msr, u64 msr_content);
+
#endif /* __ASM_X86_SGX_H__ */
--
2.15.0
* [PATCH v2 12/17] xen: vmx: handle ENCLS VMEXIT
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (10 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 11/17] xen: vmx: handle SGX related MSRs Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 13/17] xen: vmx: handle VMEXIT from SGX enclave Boqun Feng
` (5 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
Currently EPC is statically allocated and mapped into the guest, so we don't
have to trap ENCLS as it runs perfectly in VMX non-root mode. But exposing SGX
to the guest means we also expose the ENABLE_ENCLS bit to the L1 hypervisor,
therefore we cannot stop L1 from enabling ENCLS VMEXIT. An ENCLS VMEXIT from
an L2 guest is simply injected to L1; otherwise the ENCLS VMEXIT is unexpected
in L0 and we simply crash the domain.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
xen/arch/x86/hvm/vmx/vmx.c | 10 ++++++++++
xen/arch/x86/hvm/vmx/vvmx.c | 11 +++++++++++
xen/include/asm-x86/hvm/vmx/vmcs.h | 1 +
xen/include/asm-x86/hvm/vmx/vmx.h | 1 +
4 files changed, 23 insertions(+)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index ce1c95f69062..c48c44565fc5 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -4118,6 +4118,16 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
vmx_handle_apic_write();
break;
+ case EXIT_REASON_ENCLS:
+ /*
+ * Currently L0 doesn't turn on ENCLS VMEXIT, but L0 cannot stop L1
+ * from enabling ENCLS VMEXIT. An ENCLS VMEXIT from an L2 guest has
+ * already been handled, so reaching here is a bug. We simply crash
+ * the domain.
+ */
+ domain_crash(v->domain);
+ break;
+
case EXIT_REASON_PML_FULL:
vmx_vcpu_flush_pml_buffer(v);
break;
diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
index dde02c076b9f..9c6123dc35ee 100644
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -2094,6 +2094,12 @@ int nvmx_msr_read_intercept(unsigned int msr, u64 *msr_content)
SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_UNRESTRICTED_GUEST |
SECONDARY_EXEC_ENABLE_EPT;
+ /*
+ * If SGX is exposed to the guest, then the ENABLE_ENCLS bit must also
+ * be exposed to the guest.
+ */
+ if ( d->arch.cpuid->feat.sgx )
+ data |= SECONDARY_EXEC_ENABLE_ENCLS;
data = gen_vmx_msr(data, 0, host_data);
break;
case MSR_IA32_VMX_EXIT_CTLS:
@@ -2316,6 +2322,11 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
case EXIT_REASON_VMXON:
case EXIT_REASON_INVEPT:
case EXIT_REASON_XSETBV:
+ /*
+ * L0 doesn't turn on ENCLS VMEXIT now, so an ENCLS VMEXIT must come from
+ * an L2 guest and must be because ENCLS VMEXIT is turned on by L1.
+ */
+ case EXIT_REASON_ENCLS:
/* inject to L1 */
nvcpu->nv_vmexit_pending = 1;
break;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 44ff4f0a113f..f68f3d0f6801 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -407,6 +407,7 @@ enum vmcs_field {
VIRT_EXCEPTION_INFO = 0x0000202a,
XSS_EXIT_BITMAP = 0x0000202c,
TSC_MULTIPLIER = 0x00002032,
+ ENCLS_EXITING_BITMAP = 0x0000202e,
GUEST_PHYSICAL_ADDRESS = 0x00002400,
VMCS_LINK_POINTER = 0x00002800,
GUEST_IA32_DEBUGCTL = 0x00002802,
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 7341cb191ef2..8547de9168eb 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -215,6 +215,7 @@ static inline void pi_clear_sn(struct pi_desc *pi_desc)
#define EXIT_REASON_APIC_WRITE 56
#define EXIT_REASON_INVPCID 58
#define EXIT_REASON_VMFUNC 59
+#define EXIT_REASON_ENCLS 60
#define EXIT_REASON_PML_FULL 62
#define EXIT_REASON_XSAVES 63
#define EXIT_REASON_XRSTORS 64
--
2.15.0
* [PATCH v2 13/17] xen: vmx: handle VMEXIT from SGX enclave
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (11 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 12/17] xen: vmx: handle ENCLS VMEXIT Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 14/17] xen: x86: reset EPC when guest got suspended Boqun Feng
` (4 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
VMX adds a new bit to both exit_reason and GUEST_INTERRUPT_STATE to indicate
whether the VMEXIT happened inside an enclave. Several instructions are also
invalid or behave differently inside an enclave according to the SDM. This
patch handles those cases.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
xen/arch/x86/hvm/vmx/vmx.c | 29 +++++++++++++++++++++++++++++
xen/include/asm-x86/hvm/vmx/vmcs.h | 2 ++
xen/include/asm-x86/hvm/vmx/vmx.h | 2 ++
3 files changed, 33 insertions(+)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c48c44565fc5..280fc82ca1ff 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -58,6 +58,7 @@
#include <asm/mce.h>
#include <asm/monitor.h>
#include <public/arch-x86/cpuid.h>
+#include <asm/sgx.h>
static bool_t __initdata opt_force_ept;
boolean_param("force-ept", opt_force_ept);
@@ -3536,6 +3537,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
unsigned long exit_qualification, exit_reason, idtv_info, intr_info = 0;
unsigned int vector = 0, mode;
struct vcpu *v = current;
+ bool_t exit_from_sgx_enclave;
__vmread(GUEST_RIP, ®s->rip);
__vmread(GUEST_RSP, ®s->rsp);
@@ -3561,6 +3563,11 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
perfc_incra(vmexits, exit_reason);
+ /*
+ * We need to handle several VMEXITs specially if the VMEXIT is from an
+ * enclave. Also clear bit 27 as it is of no further use.
+ */
+ exit_from_sgx_enclave = !!(exit_reason & VMX_EXIT_REASONS_FROM_ENCLAVE);
+ exit_reason &= ~VMX_EXIT_REASONS_FROM_ENCLAVE;
+
/* Handle the interrupt we missed before allowing any more in. */
switch ( (uint16_t)exit_reason )
{
@@ -4062,6 +4069,18 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
break;
case EXIT_REASON_INVD:
+ /*
+ * SDM 39.6.5 INVD Handling when Enclaves Are Enabled
+ *
+ * INVD causes #GP if EPC is enabled.
+ * FIXME: WBINVD??
+ */
+ if ( exit_from_sgx_enclave )
+ {
+ hvm_inject_hw_exception(TRAP_gp_fault, 0);
+ break;
+ }
+ /* Otherwise fall through */
case EXIT_REASON_WBINVD:
{
update_guest_eip(); /* Safe: INVD, WBINVD */
@@ -4073,6 +4092,16 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
{
paddr_t gpa;
+ /*
+ * Currently an EPT violation from an enclave is not possible, as all EPC
+ * pages are statically allocated to the guest when it is created. We
+ * simply crash the guest in this case.
+ */
+ if ( exit_from_sgx_enclave )
+ {
+ domain_crash(v->domain);
+ break;
+ }
__vmread(GUEST_PHYSICAL_ADDRESS, &gpa);
__vmread(EXIT_QUALIFICATION, &exit_qualification);
ept_handle_violation(exit_qualification, gpa);
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index f68f3d0f6801..52f137437b97 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -338,6 +338,8 @@ extern u64 vmx_ept_vpid_cap;
#define VMX_INTR_SHADOW_MOV_SS 0x00000002
#define VMX_INTR_SHADOW_SMI 0x00000004
#define VMX_INTR_SHADOW_NMI 0x00000008
+#define VMX_INTR_ENCLAVE_INTR 0x00000010 /* VMEXIT was incident to
+ enclave mode */
#define VMX_BASIC_REVISION_MASK 0x7fffffff
#define VMX_BASIC_VMCS_SIZE_MASK (0x1fffULL << 32)
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 8547de9168eb..88d0dd600500 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -158,6 +158,8 @@ static inline void pi_clear_sn(struct pi_desc *pi_desc)
* Exit Reasons
*/
#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000
+/* Bit 27 is also set if VMEXIT is from SGX enclave mode */
+#define VMX_EXIT_REASONS_FROM_ENCLAVE 0x08000000
#define EXIT_REASON_EXCEPTION_NMI 0
#define EXIT_REASON_EXTERNAL_INTERRUPT 1
--
2.15.0
* [PATCH v2 14/17] xen: x86: reset EPC when guest got suspended.
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (12 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 13/17] xen: vmx: handle VMEXIT from SGX enclave Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 15/17] xen: tools: add new 'sgx' parameter support Boqun Feng
` (3 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
EPC is destroyed when power state goes to S3-S5. Emulate this behavior.
A new function s3_suspend is added to hvm_function_table for this purpose.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
xen/arch/x86/hvm/hvm.c | 3 +++
xen/arch/x86/hvm/vmx/vmx.c | 7 +++++++
xen/include/asm-x86/hvm/hvm.h | 3 +++
3 files changed, 13 insertions(+)
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index c5e8467f3219..053c15afc46a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3952,6 +3952,9 @@ static void hvm_s3_suspend(struct domain *d)
hvm_vcpu_reset_state(d->vcpu[0], 0xf000, 0xfff0);
+ if ( hvm_funcs.s3_suspend )
+ hvm_funcs.s3_suspend(d);
+
domain_unlock(d);
}
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 280fc82ca1ff..17190b06a421 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2307,6 +2307,12 @@ static bool vmx_get_pending_event(struct vcpu *v, struct x86_event *info)
return true;
}
+static void vmx_s3_suspend(struct domain *d)
+{
+ if ( d->arch.cpuid->feat.sgx )
+ domain_reset_epc(d, false);
+}
+
static struct hvm_function_table __initdata vmx_function_table = {
.name = "VMX",
.cpu_up_prepare = vmx_cpu_up_prepare,
@@ -2378,6 +2384,7 @@ static struct hvm_function_table __initdata vmx_function_table = {
.max_ratio = VMX_TSC_MULTIPLIER_MAX,
.setup = vmx_setup_tsc_scaling,
},
+ .s3_suspend = vmx_s3_suspend,
};
/* Handle VT-d posted-interrupt when VCPU is blocked. */
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 6ecad3331695..d9ff98a1b0ed 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -227,6 +227,9 @@ struct hvm_function_table {
/* Architecture function to setup TSC scaling ratio */
void (*setup)(struct vcpu *v);
} tsc_scaling;
+
+ /* Domain S3 suspend */
+ void (*s3_suspend)(struct domain *d);
};
extern struct hvm_function_table hvm_funcs;
--
2.15.0
* [PATCH v2 15/17] xen: tools: add new 'sgx' parameter support
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (13 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 14/17] xen: x86: reset EPC when guest got suspended Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 16/17] xen: tools: add SGX to applying CPUID policy Boqun Feng
` (2 subsequent siblings)
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
In order to configure a domain's SGX related attributes (EPC size, Launch
Enclave hash key, etc.), a new parameter 'sgx' is added to the XL
configuration file. The parameter should be in the following format:
sgx = 'epc=<size in MB>,lehash=<..>,lewr=<0|1>'
, in which 'lehash=<..>' and 'lewr=<0|1>' are optional.
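For example (the values below are purely illustrative), a guest with 64MB of
EPC whose launch enclave hash MSRs are left writable would use:

    sgx = 'epc=64,lewr=1'

'lehash=<..>', when present, takes the 64 hex characters of the SHA256 hash
of the launch enclave's public key.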
A new 'libxl_sgx_buildinfo', which contains the EPC base and size, the
Launch Enclave hash key and its writable permission, is also added to
libxl_domain_build_info. EPC base and size are also added to
'xc_dom_image' in order to add EPC to the e820 table. The EPC base is
calculated internally by the toolstack.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
tools/libxc/include/xc_dom.h | 4 +++
tools/libxl/libxl_create.c | 10 ++++++
tools/libxl/libxl_dom.c | 30 +++++++++++++++++
tools/libxl/libxl_internal.h | 2 ++
tools/libxl/libxl_types.idl | 11 +++++++
tools/libxl/libxl_x86.c | 12 +++++++
tools/xl/xl_parse.c | 76 ++++++++++++++++++++++++++++++++++++++++++++
tools/xl/xl_parse.h | 1 +
8 files changed, 146 insertions(+)
diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index cdcdd07d2bc2..8440532d0e9d 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -203,6 +203,10 @@ struct xc_dom_image {
xen_paddr_t lowmem_end;
xen_paddr_t highmem_end;
xen_pfn_t vga_hole_size;
+#if defined(__i386__) || defined(__x86_64__)
+ xen_paddr_t epc_base;
+ xen_paddr_t epc_size;
+#endif
/* If unset disables the setup of the IOREQ pages. */
bool device_model;
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index f15fb215c24b..6a5863cd9637 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -59,6 +59,14 @@ void libxl__rdm_setdefault(libxl__gc *gc, libxl_domain_build_info *b_info)
LIBXL_RDM_MEM_BOUNDARY_MEMKB_DEFAULT;
}
+void libxl__sgx_setdefault(libxl__gc *gc, libxl_domain_build_info *b_info)
+{
+ if (b_info->u.hvm.sgx.epckb == LIBXL_MEMKB_DEFAULT)
+ b_info->u.hvm.sgx.epckb = 0;
+ b_info->u.hvm.sgx.epcbase = 0;
+ libxl_defbool_setdefault(&b_info->u.hvm.sgx.lewr, false);
+}
+
int libxl__domain_build_info_setdefault(libxl__gc *gc,
libxl_domain_build_info *b_info)
{
@@ -359,6 +367,8 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
libxl_defbool_setdefault(&b_info->u.hvm.gfx_passthru, false);
libxl__rdm_setdefault(gc, b_info);
+
+ libxl__sgx_setdefault(gc, b_info);
break;
case LIBXL_DOMAIN_TYPE_PV:
libxl_defbool_setdefault(&b_info->u.pv.e820_host, false);
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index ef834e652d65..bbdba7e6e292 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -1213,6 +1213,36 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
highmem_end = (1ull << 32) + (lowmem_end - mmio_start);
lowmem_end = mmio_start;
}
+#if defined(__i386__) || defined(__x86_64__)
+ if (info->u.hvm.sgx.epckb) {
+ /*
+ * FIXME:
+ *
+ * Currently EPC base is put at highmem_end + 8G, which should be
+ * safe in most cases.
+ *
+ * I am not quite sure which is the best way to calculate the EPC base.
+ * IMO we can either:
+ * 1) put EPC between lowmem_end and mmio_start, but this brings
+ * additional logic to handle, e.g. lowmem_end may become too small
+ * if EPC is large (shall we limit the domain's EPC size?), and hvmloader
+ * will try to enlarge the MMIO space up to lowmem_end, or even relocate
+ * lowmem -- all of this makes things complicated, so putting EPC
+ * in the hole between lowmem_end and mmio_start is probably not good.
+ * 2) put EPC after highmem_end, but hvmloader may also relocate MMIO
+ * resources to the area after highmem_end. Maybe the ideal way is to
+ * put EPC right after highmem_end, and change hvmloader to detect
+ * EPC and put high MMIO resources after EPC. I've done this but I
+ * found a strange bug where the EPT mappings of EPC (at least part
+ * of them) get removed, and I still cannot find by what.
+ * Currently the EPC base is put at highmem_end + 8G, and hvmloader code
+ * is not changed to handle EPC, but this should be safe for most cases.
+ */
+ info->u.hvm.sgx.epcbase = highmem_end + (2ULL << 32);
+ }
+ dom->epc_size = (info->u.hvm.sgx.epckb << 10);
+ dom->epc_base = info->u.hvm.sgx.epcbase;
+#endif
dom->lowmem_end = lowmem_end;
dom->highmem_end = highmem_end;
dom->mmio_start = mmio_start;
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index bfa95d861901..ec3522f1b0e0 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1253,6 +1253,8 @@ _hidden int libxl__domain_build_info_setdefault(libxl__gc *gc,
libxl_domain_build_info *b_info);
_hidden void libxl__rdm_setdefault(libxl__gc *gc,
libxl_domain_build_info *b_info);
+_hidden void libxl__sgx_setdefault(libxl__gc *gc,
+ libxl_domain_build_info *b_info);
_hidden const char *libxl__device_nic_devname(libxl__gc *gc,
uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index a23932434163..762de807c7ed 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -457,6 +457,16 @@ libxl_altp2m_mode = Enumeration("altp2m_mode", [
(3, "limited"),
], init_val = "LIBXL_ALTP2M_MODE_DISABLED")
+libxl_sgx_buildinfo = Struct("sgx_buildinfo", [
+ ("epcbase", uint64), # EPC base address
+ ("epckb", MemKB), # EPC size in KB
+ ("lehash0", uint64), # Default SGXPUBKEYHASH
+ ("lehash1", uint64), # Default SGXPUBKEYHASH
+ ("lehash2", uint64), # Default SGXPUBKEYHASH
+ ("lehash3", uint64), # Default SGXPUBKEYHASH
+ ("lewr", libxl_defbool), # SGXPUBKEYHASH writable or not
+ ], dir=DIR_IN)
+
libxl_domain_build_info = Struct("domain_build_info",[
("max_vcpus", integer),
("avail_vcpus", libxl_bitmap),
@@ -581,6 +591,7 @@ libxl_domain_build_info = Struct("domain_build_info",[
("rdm", libxl_rdm_reserve),
("rdm_mem_boundary_memkb", MemKB),
("mca_caps", uint64),
+ ("sgx", libxl_sgx_buildinfo),
])),
("pv", Struct(None, [("kernel", string),
("slack_memkb", MemKB),
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 5f91fe4f92d8..01bd2f8eeef0 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -539,6 +539,9 @@ int libxl__arch_domain_construct_memmap(libxl__gc *gc,
if (dom->acpi_modules[i].length)
e820_entries++;
+ if (dom->epc_base && dom->epc_size)
+ e820_entries++;
+
if (e820_entries >= E820MAX) {
LOGD(ERROR, domid, "Ooops! Too many entries in the memory map!");
rc = ERROR_INVAL;
@@ -579,6 +582,15 @@ int libxl__arch_domain_construct_memmap(libxl__gc *gc,
e820[nr].addr = ((uint64_t)1 << 32);
e820[nr].size = highmem_size;
e820[nr].type = E820_RAM;
+ nr++;
+ }
+
+ /* EPC */
+ if (dom->epc_base && dom->epc_size) {
+ e820[nr].addr = dom->epc_base;
+ e820[nr].size = dom->epc_size;
+ e820[nr].type = E820_RESERVED;
+ nr++;
}
if (xc_domain_set_memory_map(CTX->xch, domid, e820, e820_entries) != 0) {
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 9a692d5ae644..e96612bc71f3 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -804,6 +804,60 @@ int parse_usbdev_config(libxl_device_usbdev *usbdev, char *token)
return 0;
}
+static uint64_t swap_uint64(uint64_t u)
+{
+ u = ((u << 8) & 0xFF00FF00FF00FF00ULL) | ((u >> 8) & 0x00FF00FF00FF00FFULL);
+ u = ((u << 16) & 0xFFFF0000FFFF0000ULL) | ((u >> 16) & 0x0000FFFF0000FFFFULL);
+ return (u << 32) | (u >> 32);
+}
+
+int parse_sgx_config(libxl_sgx_buildinfo *sgx, char *token)
+{
+ char *oparg;
+ long l;
+
+ if (MATCH_OPTION("epc", token, oparg)) {
+ l = strtol(oparg, NULL, 0);
+
+ /* Get EPC size. EPC base is calculated by toolstack later. */
+ if (l >= 0) {
+ sgx->epckb = l * 1024;
+ }
+ } else if (MATCH_OPTION("lehash", token, oparg)) {
+ if (strlen(oparg) != 64) { /* not 256bit hash */
+ fprintf(stderr, "'lehash=<...>' requires 256bit SHA256 hash\n");
+ return 1;
+ }
+
+ char buf[17];
+
+ memset(buf, 0, 17);
+
+ memcpy(buf, oparg, 16);
+ oparg += 16;
+ sgx->lehash0 = swap_uint64(strtoull(buf, NULL, 16));
+
+ memcpy(buf, oparg, 16);
+ oparg += 16;
+ sgx->lehash1 = swap_uint64(strtoull(buf, NULL, 16));
+
+ memcpy(buf, oparg, 16);
+ oparg += 16;
+ sgx->lehash2 = swap_uint64(strtoull(buf, NULL, 16));
+
+ memcpy(buf, oparg, 16);
+ oparg += 16;
+ sgx->lehash3 = swap_uint64(strtoull(buf, NULL, 16));
+ } else if (MATCH_OPTION("lewr", token, oparg)) {
+ libxl_defbool_set(&sgx->lewr, !!strtoul(oparg, NULL, 0));
+ } else {
+ fprintf(stderr, "Unknown string `%s' in sgx config\n", token);
+ return 1;
+ }
+
+ return 0;
+}
+
int parse_vdispl_config(libxl_device_vdispl *vdispl, char *token)
{
char *oparg;
@@ -1323,6 +1377,28 @@ void parse_config_data(const char *config_source,
if (!xlu_cfg_get_long (config, "rdm_mem_boundary", &l, 0))
b_info->u.hvm.rdm_mem_boundary_memkb = l * 1024;
+ if (!xlu_cfg_get_string(config, "sgx", &buf, 0)) {
+ char *buf2 = strdup(buf);
+ char *p;
+
+ b_info->u.hvm.sgx.lehash0 = 0;
+ b_info->u.hvm.sgx.lehash1 = 0;
+ b_info->u.hvm.sgx.lehash2 = 0;
+ b_info->u.hvm.sgx.lehash3 = 0;
+
+ p = strtok(buf2, ",");
+ if (!p)
+ goto skip_sgx;
+ do {
+ while (*p == ' ')
+ p++;
+ if (parse_sgx_config(&b_info->u.hvm.sgx, p))
+ exit(1);
+ } while ((p = strtok(NULL, ",")) != NULL);
+skip_sgx:
+ free(buf2);
+ }
+
switch (xlu_cfg_get_list(config, "mca_caps",
&mca_caps, &num_mca_caps, 1))
{
diff --git a/tools/xl/xl_parse.h b/tools/xl/xl_parse.h
index cc459fb43f4a..14eb69b8e6aa 100644
--- a/tools/xl/xl_parse.h
+++ b/tools/xl/xl_parse.h
@@ -31,6 +31,7 @@ void parse_disk_config_multistring(XLU_Config **config,
libxl_device_disk *disk);
int parse_usbctrl_config(libxl_device_usbctrl *usbctrl, char *token);
int parse_usbdev_config(libxl_device_usbdev *usbdev, char *token);
+int parse_sgx_config(libxl_sgx_buildinfo *sgx, char *token);
int parse_cpurange(const char *cpu, libxl_bitmap *cpumap);
int parse_nic_config(libxl_device_nic *nic, XLU_Config **config, char *token);
int parse_vdispl_config(libxl_device_vdispl *vdispl, char *token);
--
2.15.0
* [PATCH v2 16/17] xen: tools: add SGX to applying CPUID policy
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (14 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 15/17] xen: tools: add new 'sgx' parameter support Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-04 0:15 ` [PATCH v2 17/17] xen: tools: add SGX to applying MSR policy Boqun Feng
2017-12-25 5:01 ` [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
From: Kai Huang <kai.huang@linux.intel.com>
In libxc, a new structure 'xc_cpuid_policy_build_info_t' is added to carry
the domain's EPC base and size info from libxl. libxl_cpuid_apply_policy is
also changed to take 'libxl_domain_build_info_t' as a parameter, from which
the domain's EPC base and size can be obtained and passed to
xc_cpuid_apply_policy.
xc_cpuid_apply_policy is extended to support the SGX CPUID leaf. If the
hypervisor doesn't report the SGX feature in the host cpufeatureset, then
using the 'epc' parameter results in domain creation failure, as SGX cannot
be supported.
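As a purely illustrative example of the sub-leaf 2 fixup done below (the
numbers are made up): with epc_base = 0x240000000 and epc_size = 64MB
(0x4000000), CPUID leaf 0x12, sub-leaf 2 would be set to

    EAX = 0x40000001   bits 31:12 of the EPC base; low nibble 0x1 marks a
                       valid EPC section
    EBX = 0x00000002   upper bits of the EPC base
    ECX = 0x04000001   bits 31:12 of the EPC size; low nibble 0x1
    EDX = 0x00000000   upper bits of the EPC size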
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
tools/libxc/include/xenctrl.h | 14 ++++++++
tools/libxc/xc_cpuid_x86.c | 68 ++++++++++++++++++++++++++++++++++---
tools/libxl/libxl.h | 3 +-
tools/libxl/libxl_cpuid.c | 15 ++++++--
tools/libxl/libxl_dom.c | 6 +++-
tools/libxl/libxl_nocpuid.c | 4 ++-
tools/ocaml/libs/xc/xenctrl_stubs.c | 11 +++++-
tools/python/xen/lowlevel/xc/xc.c | 11 +++++-
8 files changed, 121 insertions(+), 11 deletions(-)
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 666db0b9193e..ad4429ca5ffd 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1827,6 +1827,19 @@ int xc_domain_debug_control(xc_interface *xch,
uint32_t vcpu);
#if defined(__i386__) || defined(__x86_64__)
+typedef struct xc_cpuid_policy_build_info_sgx {
+ uint64_t epc_base;
+ uint64_t epc_size;
+} xc_cpuid_policy_build_info_sgx_t;
+
+typedef struct xc_cpuid_policy_build_info {
+ xc_cpuid_policy_build_info_sgx_t sgx;
+} xc_cpuid_policy_build_info_t;
+
+int xc_cpuid_check(xc_interface *xch,
+ const unsigned int *input,
+ const char **config,
+ char **config_transformed);
int xc_cpuid_set(xc_interface *xch,
uint32_t domid,
const unsigned int *input,
@@ -1834,6 +1847,7 @@ int xc_cpuid_set(xc_interface *xch,
char **config_transformed);
int xc_cpuid_apply_policy(xc_interface *xch,
uint32_t domid,
+ xc_cpuid_policy_build_info_t *b_info,
uint32_t *featureset,
unsigned int nr_features);
void xc_cpuid_to_str(const unsigned int *regs,
diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
index 25b922ea2184..a778acf79a64 100644
--- a/tools/libxc/xc_cpuid_x86.c
+++ b/tools/libxc/xc_cpuid_x86.c
@@ -38,7 +38,7 @@ enum {
#define clear_feature(idx, dst) ((dst) &= ~bitmaskof(idx))
#define set_feature(idx, dst) ((dst) |= bitmaskof(idx))
-#define DEF_MAX_BASE 0x0000000du
+#define DEF_MAX_BASE 0x00000012u
#define DEF_MAX_INTELEXT 0x80000008u
#define DEF_MAX_AMDEXT 0x8000001cu
@@ -178,6 +178,8 @@ struct cpuid_domain_info
/* HVM-only information. */
bool pae;
bool nestedhvm;
+
+ xc_cpuid_policy_build_info_t *b_info;
};
static void cpuid(const unsigned int *input, unsigned int *regs)
@@ -369,6 +371,12 @@ static void intel_xc_cpuid_policy(xc_interface *xch,
const struct cpuid_domain_info *info,
const unsigned int *input, unsigned int *regs)
{
+ xc_cpuid_policy_build_info_t *b_info = info->b_info;
+ xc_cpuid_policy_build_info_sgx_t *sgx = NULL;
+
+ if ( b_info )
+ sgx = &b_info->sgx;
+
switch ( input[0] )
{
case 0x00000004:
@@ -381,6 +389,30 @@ static void intel_xc_cpuid_policy(xc_interface *xch,
regs[3] &= 0x3ffu;
break;
+ case 0x00000012:
+ if ( !sgx ) {
+ regs[0] = regs[1] = regs[2] = regs[3] = 0;
+ break;
+ }
+
+ if ( !sgx->epc_base || !sgx->epc_size ) {
+ regs[0] = regs[1] = regs[2] = regs[3] = 0;
+ break;
+ }
+
+ if ( input[1] == 2 ) {
+ /*
+ * Fix up EPC base and size for SGX CPUID leaf 2. The Xen hypervisor
+ * depends on XEN_DOMCTL_set_cpuid to learn the domain's EPC base
+ * and size.
+ */
+ regs[0] = (uint32_t)(sgx->epc_base & 0xfffff000) | 0x1;
+ regs[1] = (uint32_t)(sgx->epc_base >> 32);
+ regs[2] = (uint32_t)(sgx->epc_size & 0xfffff000) | 0x1;
+ regs[3] = (uint32_t)(sgx->epc_size >> 32);
+ }
+ break;
+
case 0x80000000:
if ( regs[0] > DEF_MAX_INTELEXT )
regs[0] = DEF_MAX_INTELEXT;
@@ -444,6 +476,10 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
regs[1] = regs[2] = regs[3] = 0;
break;
+ case 0x00000012:
+ /* Intel SGX. Passthrough to Intel function */
+ break;
+
case 0x80000000:
/* Passthrough to cpu vendor specific functions */
break;
@@ -649,12 +685,13 @@ void xc_cpuid_to_str(const unsigned int *regs, char **strs)
}
}
-static void sanitise_featureset(struct cpuid_domain_info *info)
+static int sanitise_featureset(struct cpuid_domain_info *info)
{
const uint32_t fs_size = xc_get_cpu_featureset_size();
uint32_t disabled_features[fs_size];
static const uint32_t deep_features[] = INIT_DEEP_FEATURES;
unsigned int i, b;
+ xc_cpuid_policy_build_info_t *b_info = info->b_info;
if ( info->hvm )
{
@@ -707,9 +744,19 @@ static void sanitise_featureset(struct cpuid_domain_info *info)
disabled_features[i] &= ~dfs[i];
}
}
+
+ /* Cannot support 'epc' parameter if SGX is unavailable */
+ if ( b_info && b_info->sgx.epc_base && b_info->sgx.epc_size )
+ if (!test_bit(X86_FEATURE_SGX, info->featureset)) {
+ printf("Xen hypervisor doesn't support SGX.\n");
+ return -EFAULT;
+ }
+
+ return 0;
}
int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid,
+ xc_cpuid_policy_build_info_t *b_info,
uint32_t *featureset,
unsigned int nr_features)
{
@@ -722,6 +769,8 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid,
if ( rc )
goto out;
+ info.b_info = b_info;
+
cpuid(input, regs);
base_max = (regs[0] <= DEF_MAX_BASE) ? regs[0] : DEF_MAX_BASE;
input[0] = 0x80000000;
@@ -732,7 +781,9 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid,
else
ext_max = (regs[0] <= DEF_MAX_INTELEXT) ? regs[0] : DEF_MAX_INTELEXT;
- sanitise_featureset(&info);
+ rc = sanitise_featureset(&info);
+ if ( rc )
+ goto out;
input[0] = 0;
input[1] = XEN_CPUID_INPUT_UNUSED;
@@ -757,12 +808,21 @@ int xc_cpuid_apply_policy(xc_interface *xch, uint32_t domid,
continue;
}
+ /* Intel SGX */
+ if ( input[0] == 0x12 )
+ {
+ input[1]++;
+ /* Intel SGX leaf 0x12 has 3 sub-leaves */
+ if ( input[1] < 3 )
+ continue;
+ }
+
input[0]++;
if ( !(input[0] & 0x80000000u) && (input[0] > base_max ) )
input[0] = 0x80000000u;
input[1] = XEN_CPUID_INPUT_UNUSED;
- if ( (input[0] == 4) || (input[0] == 7) )
+ if ( (input[0] == 4) || (input[0] == 7) || (input[0] == 0x12) )
input[1] = 0;
else if ( input[0] == 0xd )
input[1] = 1; /* Xen automatically calculates almost everything. */
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 5e9aed739d7a..1a8a1d786ceb 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -2049,7 +2049,8 @@ libxl_device_pci *libxl_device_pci_assignable_list(libxl_ctx *ctx, int *num);
int libxl_cpuid_parse_config(libxl_cpuid_policy_list *cpuid, const char* str);
int libxl_cpuid_parse_config_xend(libxl_cpuid_policy_list *cpuid,
const char* str);
-void libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid);
+int libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid,
+ libxl_domain_build_info *info);
void libxl_cpuid_set(libxl_ctx *ctx, uint32_t domid,
libxl_cpuid_policy_list cpuid);
diff --git a/tools/libxl/libxl_cpuid.c b/tools/libxl/libxl_cpuid.c
index e692b6156979..5fb74322b99a 100644
--- a/tools/libxl/libxl_cpuid.c
+++ b/tools/libxl/libxl_cpuid.c
@@ -386,9 +386,20 @@ int libxl_cpuid_parse_config_xend(libxl_cpuid_policy_list *cpuid,
return 0;
}
-void libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid)
+int libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid,
+ libxl_domain_build_info *info)
{
- xc_cpuid_apply_policy(ctx->xch, domid, NULL, 0);
+ xc_cpuid_policy_build_info_t cpuid_binfo;
+
+ memset(&cpuid_binfo, 0, sizeof (xc_cpuid_policy_build_info_t));
+
+ /* Currently only Intel SGX needs info when applying CPUID policy */
+ if (info->type == LIBXL_DOMAIN_TYPE_HVM) {
+ cpuid_binfo.sgx.epc_base = info->u.hvm.sgx.epcbase;
+ cpuid_binfo.sgx.epc_size = (info->u.hvm.sgx.epckb << 10);
+ }
+
+ return xc_cpuid_apply_policy(ctx->xch, domid, &cpuid_binfo, NULL, 0);
}
void libxl_cpuid_set(libxl_ctx *ctx, uint32_t domid,
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index bbdba7e6e292..ac38ad65dd19 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -597,7 +597,11 @@ int libxl__build_post(libxl__gc *gc, uint32_t domid,
return ERROR_FAIL;
}
- libxl_cpuid_apply_policy(ctx, domid);
+ rc = libxl_cpuid_apply_policy(ctx, domid, info);
+ if (rc) {
+ LOG(ERROR, "Failed to apply CPUID policy (%d)", rc);
+ return ERROR_FAIL;
+ }
if (info->cpuid != NULL)
libxl_cpuid_set(ctx, domid, info->cpuid);
diff --git a/tools/libxl/libxl_nocpuid.c b/tools/libxl/libxl_nocpuid.c
index ef1161c4342b..70e0486e981b 100644
--- a/tools/libxl/libxl_nocpuid.c
+++ b/tools/libxl/libxl_nocpuid.c
@@ -34,8 +34,10 @@ int libxl_cpuid_parse_config_xend(libxl_cpuid_policy_list *cpuid,
return 0;
}
-void libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid)
+int libxl_cpuid_apply_policy(libxl_ctx *ctx, uint32_t domid,
+ libxl_domain_build_info *info)
{
+ return 0;
}
void libxl_cpuid_set(libxl_ctx *ctx, uint32_t domid,
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c
index c66732f67c89..4c469dd22f6e 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -796,7 +796,16 @@ CAMLprim value stub_xc_domain_cpuid_apply_policy(value xch, value domid)
#if defined(__i386__) || defined(__x86_64__)
int r;
- r = xc_cpuid_apply_policy(_H(xch), _D(domid), NULL, 0);
+ /*
+ * FIXME:
+ *
+ * Don't support passing SGX info to xc_cpuid_apply_policy here. To be
+ * honest I don't know the purpose of this CAML function, so I don't
+ * know whether we need to allow *caller* of this function to pass SGX
+ * info. As the EPC base is calculated internally by the toolstack, I
+ * think it is also impossible to pass the EPC base from the *user*.
+ */
+ r = xc_cpuid_apply_policy(_H(xch), _D(domid), NULL, NULL, 0);
if (r < 0)
failwith_xc(_H(xch));
#else
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index f501764100ad..bdecd0466a9a 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -719,7 +719,16 @@ static PyObject *pyxc_dom_set_policy_cpuid(XcObject *self,
if ( !PyArg_ParseTuple(args, "i", &domid) )
return NULL;
- if ( xc_cpuid_apply_policy(self->xc_handle, domid, NULL, 0) )
+ /*
+ * FIXME:
+ *
+ * Don't support passing SGX info to xc_cpuid_apply_policy here. To be
+ * honest I don't know the purpose of this python function, so I don't
+ * know whether we need to allow *caller* of this function to pass SGX
+ * info. As the EPC base is calculated internally by the toolstack, I
+ * think it is also impossible to pass the EPC base from the *user*.
+ */
+ if ( xc_cpuid_apply_policy(self->xc_handle, domid, NULL, NULL, 0) )
return pyxc_error_to_exception(self->xc_handle);
Py_INCREF(zero);
--
2.15.0
* [PATCH v2 17/17] xen: tools: add SGX to applying MSR policy
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (15 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 16/17] xen: tools: add SGX to applying CPUID policy Boqun Feng
@ 2017-12-04 0:15 ` Boqun Feng
2017-12-25 5:01 ` [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 0:15 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jun Nakajima,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jan Beulich, David Scott, Boqun Feng
In libxc, a new function 'xc_msr_sgx_set' is added, which applies the SGX
related MSR policy to the target domain. This function takes the values of
'lewr' and 'lehash*' in 'libxl_sgx_buildinfo', and sets the proper MSRs in
all vcpus via the 'XEN_DOMCTL_set_vcpu_msrs' hypercall.
If the physical IA32_SGXLEPUBKEYHASHn MSRs are writable:
* The domain's IA32_FEATURE_CONTROL_SGX_LE_WR bit depends on 'lewr'
(default false)
* If 'lehash' is unset, do nothing, as we already set the proper value
in sgx_domain_msr_init().
* If 'lehash' is set, set the domain's virtual IA32_SGXLEPUBKEYHASHn
with its value, and later on the vcpu's virtual IA32_SGXLEPUBKEYHASHn
will be set with the same value.
If the physical IA32_SGXLEPUBKEYHASHn MSRs are not writable, using the
'lehash' or 'lewr' parameter results in domain creation failure.
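As a purely illustrative sketch of the resulting policy: with 'lewr=1' and a
'lehash' value in the domain config, xc_msr_sgx_set() issues
XEN_DOMCTL_set_vcpu_msrs for every vcpu with five entries:

    IA32_FEATURE_CONTROL     = LOCK | SGX_ENABLE | SGX_LE_WR
    IA32_SGXLEPUBKEYHASH0..3 = the four 64-bit words parsed from 'lehash'

With only 'lewr=1' and no 'lehash', a single IA32_FEATURE_CONTROL entry is
sent and the hash MSRs keep the defaults chosen by the hypervisor.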
Signed-off-by: Boqun Feng <boqun.feng@intel.com>
---
tools/libxc/Makefile | 1 +
tools/libxc/include/xenctrl.h | 2 ++
tools/libxc/xc_msr_x86.h | 10 ++++++
tools/libxc/xc_sgx.c | 82 +++++++++++++++++++++++++++++++++++++++++++
tools/libxl/libxl_dom.c | 29 +++++++++++++++
tools/xl/xl_parse.c | 10 ++++++
6 files changed, 134 insertions(+)
create mode 100644 tools/libxc/xc_sgx.c
diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 9a019e8dfed5..428430a15c40 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -41,6 +41,7 @@ CTRL_SRCS-y += xc_foreign_memory.c
CTRL_SRCS-y += xc_kexec.c
CTRL_SRCS-y += xc_resource.c
CTRL_SRCS-$(CONFIG_X86) += xc_psr.c
+CTRL_SRCS-$(CONFIG_X86) += xc_sgx.c
CTRL_SRCS-$(CONFIG_X86) += xc_pagetab.c
CTRL_SRCS-$(CONFIG_Linux) += xc_linux.c
CTRL_SRCS-$(CONFIG_FreeBSD) += xc_freebsd.c
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index ad4429ca5ffd..abc9f711141a 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1855,6 +1855,8 @@ void xc_cpuid_to_str(const unsigned int *regs,
int xc_mca_op(xc_interface *xch, struct xen_mc *mc);
int xc_mca_op_inject_v2(xc_interface *xch, unsigned int flags,
xc_cpumap_t cpumap, unsigned int nr_cpus);
+int xc_msr_sgx_set(xc_interface *xch, uint32_t domid, bool lewr,
+ uint64_t *lehash, int max_vcpu);
#endif
struct xc_px_val {
diff --git a/tools/libxc/xc_msr_x86.h b/tools/libxc/xc_msr_x86.h
index 7f100e71a7a1..54eaa4de8945 100644
--- a/tools/libxc/xc_msr_x86.h
+++ b/tools/libxc/xc_msr_x86.h
@@ -24,6 +24,16 @@
#define MSR_IA32_CMT_EVTSEL 0x00000c8d
#define MSR_IA32_CMT_CTR 0x00000c8e
+#define MSR_IA32_FEATURE_CONTROL 0x0000003a
+#define IA32_FEATURE_CONTROL_LOCK 0x0001
+#define IA32_FEATURE_CONTROL_SGX_ENABLE 0x40000
+#define IA32_FEATURE_CONTROL_SGX_LE_WR 0x20000
+
+#define MSR_IA32_SGXLEPUBKEYHASH0 0x0000008c
+#define MSR_IA32_SGXLEPUBKEYHASH1 0x0000008d
+#define MSR_IA32_SGXLEPUBKEYHASH2 0x0000008e
+#define MSR_IA32_SGXLEPUBKEYHASH3 0x0000008f
+
#endif
/*
diff --git a/tools/libxc/xc_sgx.c b/tools/libxc/xc_sgx.c
new file mode 100644
index 000000000000..8f97ca0042e0
--- /dev/null
+++ b/tools/libxc/xc_sgx.c
@@ -0,0 +1,82 @@
+/*
+ * xc_sgx.c
+ *
+ * SGX related MSR setup
+ *
+ * Copyright (C) 2017 Intel Corporation
+ * Author Boqun Feng <boqun.feng@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser General Public License for more details.
+ */
+
+#include <assert.h>
+#include "xc_private.h"
+#include "xc_msr_x86.h"
+
+int xc_msr_sgx_set(xc_interface *xch, uint32_t domid, bool lewr,
+ uint64_t *lehash, int max_vcpu)
+{
+ int rc, i, nr_msrs;
+ DECLARE_DOMCTL;
+ xen_domctl_vcpu_msr_t sgx_msrs[5];
+ DECLARE_HYPERCALL_BUFFER(void, buffer);
+
+ if ( !lehash && !lewr )
+ return 0;
+
+ sgx_msrs[0].index = MSR_IA32_FEATURE_CONTROL;
+ sgx_msrs[0].reserved = 0;
+ sgx_msrs[0].value = IA32_FEATURE_CONTROL_LOCK |
+ IA32_FEATURE_CONTROL_SGX_ENABLE |
+ (lewr ? IA32_FEATURE_CONTROL_SGX_LE_WR : 0);
+
+ if ( !lehash )
+ nr_msrs = 1;
+ else
+ {
+ nr_msrs = 5;
+
+ for ( i = 0; i < 4; i++ )
+ {
+ sgx_msrs[i+1].index = MSR_IA32_SGXLEPUBKEYHASH0 + i;
+ sgx_msrs[i+1].reserved = 0;
+ sgx_msrs[i+1].value = lehash[i];
+ }
+ }
+
+ buffer = xc_hypercall_buffer_alloc(xch, buffer,
+ nr_msrs * sizeof(xen_domctl_vcpu_msr_t));
+ if ( !buffer )
+ {
+ ERROR("Unable to allocate %zu bytes for msr hypercall buffer",
+ nr_msrs * sizeof(xen_domctl_vcpu_msr_t));
+ return -1;
+ }
+
+ domctl.cmd = XEN_DOMCTL_set_vcpu_msrs;
+ domctl.domain = domid;
+ domctl.u.vcpu_msrs.msr_count = nr_msrs;
+ set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);
+
+ memcpy(buffer, sgx_msrs, nr_msrs * sizeof(xen_domctl_vcpu_msr_t));
+
+ for ( i = 0; i < max_vcpu; i++ ) {
+ domctl.u.vcpu_msrs.vcpu = i;
+ rc = xc_domctl(xch, &domctl);
+
+ if (rc)
+ break;
+ }
+
+ xc_hypercall_buffer_free(xch, buffer);
+
+ return rc;
+}
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index ac38ad65dd19..d5e33f8940ba 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -358,6 +358,35 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
return ERROR_FAIL;
}
+ if (info->type == LIBXL_DOMAIN_TYPE_HVM)
+ {
+ uint64_t lehash[4];
+
+ if ( !info->u.hvm.sgx.lehash0 && !info->u.hvm.sgx.lehash1 &&
+ !info->u.hvm.sgx.lehash2 && !info->u.hvm.sgx.lehash3 )
+ {
+ rc = xc_msr_sgx_set(ctx->xch, domid,
+ libxl_defbool_val(info->u.hvm.sgx.lewr),
+ NULL, info->max_vcpus);
+ }
+ else
+ {
+ lehash[0] = info->u.hvm.sgx.lehash0;
+ lehash[1] = info->u.hvm.sgx.lehash1;
+ lehash[2] = info->u.hvm.sgx.lehash2;
+ lehash[3] = info->u.hvm.sgx.lehash3;
+
+ rc = xc_msr_sgx_set(ctx->xch, domid,
+ libxl_defbool_val(info->u.hvm.sgx.lewr),
+ lehash, info->max_vcpus);
+ }
+
+ if (rc) {
+ LOG(ERROR, "Unable to set SGX related MSRs (%d)", rc);
+ return ERROR_FAIL;
+ }
+ }
+
if (xc_domain_set_gnttab_limits(ctx->xch, domid, info->max_grant_frames,
info->max_maptrack_frames) != 0) {
LOG(ERROR, "Couldn't set grant table limits");
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index e96612bc71f3..211ee832ca31 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -828,6 +828,16 @@ int parse_sgx_config(libxl_sgx_buildinfo *sgx, char *token)
fprintf(stderr, "'lehash=<...>' requires 256bit SHA256 hash\n");
return 1;
}
+
+ /*
+ * 'lehash' is a hex string of 32 bytes in little-endian, i.e. the
+ * leftmost byte is the least significant byte.
+ *
+ * We convert the hex string 8 bytes(64 bit) a time to uint64 via
+ * strtoull(). And strtoull() treats the string as big-endian,
+ * therefore we need to swap the value afterwards to get the correct
+ * value.
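+ *
+ * For example (hypothetical value): the 16-character chunk
+ * "0123456789abcdef" is parsed by strtoull() as 0x0123456789abcdef and
+ * byte-swapped to 0xefcdab8967452301, which is the value that ends up in
+ * the corresponding IA32_SGXLEPUBKEYHASHn MSR.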
+ */
char buf[17];
--
2.15.0
* Re: [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset
2017-12-04 0:15 ` [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset Boqun Feng
@ 2017-12-04 11:13 ` Julien Grall
2017-12-04 13:10 ` Boqun Feng
0 siblings, 1 reply; 23+ messages in thread
From: Julien Grall @ 2017-12-04 11:13 UTC (permalink / raw)
To: Boqun Feng, xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, Jan Beulich,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, Tim Deegan, kai.huang,
Julien Grall, Jun Nakajima, David Scott
Hello,
I am not sure to understand why I am being CCed. But it looks like you
CC everyone on each patch... Please CC only relevant person on each patch.
Cheers,
On 04/12/17 00:15, Boqun Feng wrote:
> From: Kai Huang <kai.huang@linux.intel.com>
>
> Expose SGX in CPU featureset for HVM domain. SGX will not be supported for
> PV domain, as ENCLS (which SGX driver in guest essentially runs) must run
> in ring 0, while PV kernel runs in ring 3. Theoretically we can support SGX
> in PV domain via either emulating #GP caused by ENCLS running in ring 3, or
> by PV ENCLS but it is really not necessary at this stage.
>
> SGX Launch Control is also exposed in CPU featureset for HVM domain. SGX
> Launch Control depends on SGX.
>
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> Signed-off-by: Boqun Feng <boqun.feng@intel.com>
> ---
> xen/include/public/arch-x86/cpufeatureset.h | 3 ++-
> xen/tools/gen-cpuid.py | 3 +++
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
> index be6da8eaf17c..1f8510eebb1d 100644
> --- a/xen/include/public/arch-x86/cpufeatureset.h
> +++ b/xen/include/public/arch-x86/cpufeatureset.h
> @@ -193,7 +193,7 @@ XEN_CPUFEATURE(XSAVES, 4*32+ 3) /*S XSAVES/XRSTORS instructions */
> /* Intel-defined CPU features, CPUID level 0x00000007:0.ebx, word 5 */
> XEN_CPUFEATURE(FSGSBASE, 5*32+ 0) /*A {RD,WR}{FS,GS}BASE instructions */
> XEN_CPUFEATURE(TSC_ADJUST, 5*32+ 1) /*S TSC_ADJUST MSR available */
> -XEN_CPUFEATURE(SGX, 5*32+ 2) /* Software Guard extensions */
> +XEN_CPUFEATURE(SGX, 5*32+ 2) /*H Intel Software Guard extensions */
> XEN_CPUFEATURE(BMI1, 5*32+ 3) /*A 1st bit manipulation extensions */
> XEN_CPUFEATURE(HLE, 5*32+ 4) /*A Hardware Lock Elision */
> XEN_CPUFEATURE(AVX2, 5*32+ 5) /*A AVX2 instructions */
> @@ -230,6 +230,7 @@ XEN_CPUFEATURE(PKU, 6*32+ 3) /*H Protection Keys for Userspace */
> XEN_CPUFEATURE(OSPKE, 6*32+ 4) /*! OS Protection Keys Enable */
> XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A POPCNT for vectors of DW/QW */
> XEN_CPUFEATURE(RDPID, 6*32+22) /*A RDPID instruction */
> +XEN_CPUFEATURE(SGX_LC, 6*32+30) /*H Intel SGX Launch Control */
>
> /* AMD-defined CPU features, CPUID level 0x80000007.edx, word 7 */
> XEN_CPUFEATURE(ITSC, 7*32+ 8) /* Invariant TSC */
> diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
> index 9ec4486f2b4b..4fef21203086 100755
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -256,6 +256,9 @@ def crunch_numbers(state):
> AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
> AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
> AVX512_4FMAPS, AVX512_VPOPCNTDQ],
> +
> + # SGX Launch Control depends on SGX
> + SGX: [SGX_LC],
> }
>
> deep_features = tuple(sorted(deps.keys()))
>
--
Julien Grall
* Re: [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset
2017-12-04 11:13 ` Julien Grall
@ 2017-12-04 13:10 ` Boqun Feng
2017-12-04 14:13 ` Jan Beulich
0 siblings, 1 reply; 23+ messages in thread
From: Boqun Feng @ 2017-12-04 13:10 UTC (permalink / raw)
To: Julien Grall
Cc: Tim Deegan, Kevin Tian, Stefano Stabellini, Wei Liu, Jan Beulich,
George Dunlap, Andrew Cooper, Ian Jackson,
Marek Marczykowski-Górecki, xen-devel, kai.huang,
Julien Grall, Jun Nakajima, David Scott
On Mon, Dec 04, 2017 at 11:13:45AM +0000, Julien Grall wrote:
> Hello,
>
Hi Julien,
> I am not sure to understand why I am being CCed. But it looks like you CC
> everyone on each patch... Please CC only relevant person on each patch.
>
Apologies... I thought the whole patchset would provide more context for
the reviewers. Will drop you from irrelevant patches in the next version. And
I guess it's OK for me to drop you from replies on irrelevant patches of
this version too?
Regards,
Boqun
> Cheers,
>
> On 04/12/17 00:15, Boqun Feng wrote:
> > From: Kai Huang <kai.huang@linux.intel.com>
> >
> > Expose SGX in CPU featureset for HVM domain. SGX will not be supported for
> > PV domain, as ENCLS (which SGX driver in guest essentially runs) must run
> > in ring 0, while PV kernel runs in ring 3. Theoretically we can support SGX
> > in PV domain via either emulating #GP caused by ENCLS running in ring 3, or
> > by PV ENCLS but it is really not necessary at this stage.
> >
> > SGX Launch Control is also exposed in CPU featureset for HVM domain. SGX
> > Launch Control depends on SGX.
> >
> > Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> > Signed-off-by: Boqun Feng <boqun.feng@intel.com>
> > ---
> > xen/include/public/arch-x86/cpufeatureset.h | 3 ++-
> > xen/tools/gen-cpuid.py | 3 +++
> > 2 files changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
> > index be6da8eaf17c..1f8510eebb1d 100644
> > --- a/xen/include/public/arch-x86/cpufeatureset.h
> > +++ b/xen/include/public/arch-x86/cpufeatureset.h
> > @@ -193,7 +193,7 @@ XEN_CPUFEATURE(XSAVES, 4*32+ 3) /*S XSAVES/XRSTORS instructions */
> > /* Intel-defined CPU features, CPUID level 0x00000007:0.ebx, word 5 */
> > XEN_CPUFEATURE(FSGSBASE, 5*32+ 0) /*A {RD,WR}{FS,GS}BASE instructions */
> > XEN_CPUFEATURE(TSC_ADJUST, 5*32+ 1) /*S TSC_ADJUST MSR available */
> > -XEN_CPUFEATURE(SGX, 5*32+ 2) /* Software Guard extensions */
> > +XEN_CPUFEATURE(SGX, 5*32+ 2) /*H Intel Software Guard extensions */
> > XEN_CPUFEATURE(BMI1, 5*32+ 3) /*A 1st bit manipulation extensions */
> > XEN_CPUFEATURE(HLE, 5*32+ 4) /*A Hardware Lock Elision */
> > XEN_CPUFEATURE(AVX2, 5*32+ 5) /*A AVX2 instructions */
> > @@ -230,6 +230,7 @@ XEN_CPUFEATURE(PKU, 6*32+ 3) /*H Protection Keys for Userspace */
> > XEN_CPUFEATURE(OSPKE, 6*32+ 4) /*! OS Protection Keys Enable */
> > XEN_CPUFEATURE(AVX512_VPOPCNTDQ, 6*32+14) /*A POPCNT for vectors of DW/QW */
> > XEN_CPUFEATURE(RDPID, 6*32+22) /*A RDPID instruction */
> > +XEN_CPUFEATURE(SGX_LC, 6*32+30) /*H Intel SGX Launch Control */
> > /* AMD-defined CPU features, CPUID level 0x80000007.edx, word 7 */
> > XEN_CPUFEATURE(ITSC, 7*32+ 8) /* Invariant TSC */
> > diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
> > index 9ec4486f2b4b..4fef21203086 100755
> > --- a/xen/tools/gen-cpuid.py
> > +++ b/xen/tools/gen-cpuid.py
> > @@ -256,6 +256,9 @@ def crunch_numbers(state):
> > AVX512F: [AVX512DQ, AVX512IFMA, AVX512PF, AVX512ER, AVX512CD,
> > AVX512BW, AVX512VL, AVX512VBMI, AVX512_4VNNIW,
> > AVX512_4FMAPS, AVX512_VPOPCNTDQ],
> > +
> > + # SGX Launch Control depends on SGX
> > + SGX: [SGX_LC],
> > }
> > deep_features = tuple(sorted(deps.keys()))
> >
>
> --
> Julien Grall
* Re: [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset
2017-12-04 13:10 ` Boqun Feng
@ 2017-12-04 14:13 ` Jan Beulich
2017-12-05 0:22 ` Boqun Feng
0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2017-12-04 14:13 UTC (permalink / raw)
To: Boqun Feng
Cc: Tim Deegan, Kevin Tian, Stefano Stabellini, Wei Liu,
George Dunlap, Andrew Cooper, Julien Grall, Ian Jackson,
Marek Marczykowski-Górecki, xen-devel, kai.huang,
Julien Grall, Jun Nakajima, David Scott
>>> On 04.12.17 at 14:10, <boqun.feng@intel.com> wrote:
> On Mon, Dec 04, 2017 at 11:13:45AM +0000, Julien Grall wrote:
>> I am not sure to understand why I am being CCed. But it looks like you CC
>> everyone on each patch... Please CC only relevant person on each patch.
>>
>
> Apologies... I thought the whole patchset would provide more context for
> the reviewers. Will drop you from irrelevant patches in the next version. And
> I guess it's OK for me to drop you from replies on irrelevant patches of
> this version too?
You shouldn't do this for just Julien - Cc lists of patches should
generally be composed per patch. Most people are subscribed to
the list anyway, and hence receive a copy of the other patches.
In the worst case people can either tell you to always be Cc-ed
on an entire patch set, or go to the list archives. Yet when you
Cc everyone on everything, it is quite difficult for an individual to
tell which parts to actually pay special attention to.
Jan
* Re: [PATCH v2 01/17] xen: x86: expose SGX to HVM domain in CPU featureset
2017-12-04 14:13 ` Jan Beulich
@ 2017-12-05 0:22 ` Boqun Feng
0 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-05 0:22 UTC (permalink / raw)
To: Jan Beulich
Cc: Tim Deegan, Kevin Tian, Stefano Stabellini, Wei Liu,
George Dunlap, Andrew Cooper, Julien Grall, Ian Jackson,
Marek Marczykowski-Górecki, xen-devel, kai.huang,
Julien Grall, Jun Nakajima, David Scott
On Mon, Dec 04, 2017 at 07:13:52AM -0700, Jan Beulich wrote:
> >>> On 04.12.17 at 14:10, <boqun.feng@intel.com> wrote:
> > On Mon, Dec 04, 2017 at 11:13:45AM +0000, Julien Grall wrote:
> >> I am not sure to understand why I am being CCed. But it looks like you CC
> >> everyone on each patch... Please CC only relevant person on each patch.
> >>
> >
> > Apologies... I thought the whole patch set would provide more context for
> > the reviewers. Will drop you from irrelevant patches in the next version. And
> > I guess it's OK for me to drop you from replies on irrelevant patches of
> > this version too?
>
> You shouldn't do this for just Julien - Cc lists of patches should
> generally be composed per patch. Most people are subscribed to
> the list anyway, and hence receive a copy of the other patches.
> In the worst case people can either tell you to always be Cc-ed
> on an entire patch set, or go to the list archives. Yet when you
> Cc everyone on everything, it is quite difficult for an individual to
> tell which parts to actually pay special attention to.
>
Good point ;-) I will compose the Cc lists per patch in next version.
Regards,
Boqun
> Jan
>
* Re: [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches
2017-12-04 0:15 [RFC PATCH v2 00/17] RFC: SGX Virtualization design and draft patches Boqun Feng
` (16 preceding siblings ...)
2017-12-04 0:15 ` [PATCH v2 17/17] xen: tools: add SGX to applying MSR policy Boqun Feng
@ 2017-12-25 5:01 ` Boqun Feng
17 siblings, 0 replies; 23+ messages in thread
From: Boqun Feng @ 2017-12-25 5:01 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Stefano Stabellini, Wei Liu, George Dunlap,
Andrew Cooper, Ian Jackson, Tim Deegan, kai.huang, Jan Beulich
On Mon, Dec 04, 2017 at 08:15:11AM +0800, Boqun Feng wrote:
> Hi all,
>
> This is the v2 of RFC SGX Virtualization design and draft patches, you
Ping ;-)
Any comments?
Regards,
Boqun
> can find v1 at:
>
> https://lists.gt.net/xen/devel/483404
>
> In the new version, I fix a few things according to the feedbacks for
> previous version(mostly are cleanups and code movement).
>
> Besides, Kai and I redesign the SGX MSRs setting up part and introduce
> new XL parameter 'lehash' and 'lewr'.
>
> Another big change is that I modify the EPC management to fit EPC pages
> in 'struct page_info', and in patch #6 and #7, unscrubbable pages,
> 'PGC_epc', 'MEMF_epc' and 'XENZONE_EPC' are introduced, so that EPC
> management is fully integrated into existing memory management of xen.
> This might be the controversial bit, so patch 6~8 are simply to show the
> idea and drive deep discussion.
>
> Detailed changes since v1: (modifications with tag "[New]" is totally
> new in this series, reviews and comments are highly welcome for those
> parts)
>
> * Make SGX related mostly common for x86 by: 1) moving sgx.[ch] to
> arch/x86/ and include/asm-x86/ and 2) renaming EPC related functions
> with domain_* prefix.
>
> * Rename ioremap_cache() with ioremap_wb() and make it x86-specific as
> suggested by Jan Beulich.
>
> * Remove percpu sgx_cpudata, during bootup secondary CPUs now check
> whether they read different value than boot CPU, if so SGX is
> disabled.
>
> * Remove domain_has_sgx_{,launch_control}, and make sure we can
> rely on domain's arch.cpuid->feat.sgx{_lc} for setting checks.
>
> * Cleanup the code for CPUID handling as suggested by Andrew Cooper.
>
> * Adjust to msr_policy framework for SGX MSRs handling, and remove
> unnecessary fields like 'readable' and 'writable'
>
> * Use 'page_info' to maintain EPC pages, and [NEW] add an draft
> implementation for employing xenheap for EPC page management. Please
> see patch 6~8
>
> * [New] Modify the XL parameter for SGX, please see section 2.1.1 in
> the updated design doc.
>
> * [New] Use _set_vcpu_msrs hypercall in the toolstack to set the SGX
> related. Please see patch #17.
>
> * ACPI related tool changes are temporarily dropped in this patchset,
> as I need more time to resolve the comments and do related tests.
>
> And the update design doc is as follow, as the previous version in the
> design there are some particualr points that we don't know which
> implementation is better. For those a question mark (?) is added at the
> right of the menu. And for SGX live migration, thanks to Wei Liu for
> providing comments that it's nice to support if we can in previous
> version review, but we'd like hear more from you guys so we still put a
> question mark fot this item. Your comments on those "question mark (?)"
> parts (and other comments as well, of course) are highly appreciated.
>
> ===================================================================
> 1. SGX Introduction
> 1.1 Overview
> 1.1.1 Enclave
> 1.1.2 EPC (Enclave Page Cache)
> 1.1.3 ENCLS and ENCLU
> 1.2 Discovering SGX Capability
> 1.2.1 Enumerate SGX via CPUID
> 1.2.2 Intel SGX Opt-in Configuration
> 1.3 Enclave Life Cycle
> 1.3.1 Constructing & Destroying Enclave
> 1.3.2 Enclave Entry and Exit
> 1.3.2.1 Synchronous Entry and Exit
> 1.3.2.2 Asynchronous Enclave Exit
> 1.3.3 EPC Eviction and Reload
> 1.4 SGX Launch Control
> 1.5 SGX Interaction with IA32 and IA64 Architecture
> 2. SGX Virtualization Design
> 2.1 High Level Toolstack Changes
> 2.1.1 New 'sgx' XL configure file parameter
> 2.1.2 New XL commands (?)
> 2.1.3 Notify domain's virtual EPC base and size to Xen
> 2.2 High Level Hypervisor Changes
> 2.2.1 EPC Management
> 2.2.2 EPC Virtualization
> 2.2.3 Populate EPC for Guest
> 2.2.4 Launch Control Support
> 2.2.5 CPUID Emulation
> 2.2.6 EPT Violation & ENCLS Trapping Handling
> 2.2.7 Guest Suspend & Resume
> 2.2.8 Destroying Domain
> 2.3 Additional Point: Live Migration, Snapshot Support (?)
> 3. Reference
>
> 1. SGX Introduction
>
> 1.1 Overview
>
> 1.1.1 Enclave
>
> Intel Software Guard Extensions (SGX) is a set of instructions and memory
> access mechanisms that provide secure access for sensitive applications and
> data. SGX allows an application to set aside a particular part of its address
> space as an *enclave*, a protected area that provides confidentiality and
> integrity even in the presence of privileged malware. Accesses to the enclave
> memory area from any software not resident in the enclave are prevented,
> including those from privileged software. The diagram below illustrates the
> presence of an enclave in an application.
>
> |-----------------------|
> | |
> | |---------------| |
> | | OS kernel | | |-----------------------|
> | |---------------| | | |
> | | | | | |---------------| |
> | |---------------| | | | Entry table | |
> | | Enclave |---|-----> | |---------------| |
> | |---------------| | | | Enclave stack | |
> | | App code | | | |---------------| |
> | |---------------| | | | Enclave heap | |
> | | Enclave | | | |---------------| |
> | |---------------| | | | Enclave code | |
> | | App code | | | |---------------| |
> | |---------------| | | |
> | | | |-----------------------|
> |-----------------------|
>
> SGX supports SGX1 and SGX2 extensions. SGX1 provides basic enclave support,
> and SGX2 allows additional flexibility in runtime management of enclave
> resources and thread execution within an enclave.
>
> 1.1.2 EPC (Enclave Page Cache)
>
> Just like normal application memory management, enclave memory management can be
> divided into two parts: address space allocation and memory commitment. Address
> space allocation means allocating a particular range of linear address space for
> the enclave. Memory commitment means assigning actual resources to the enclave.
>
> Enclave Page Cache (EPC) is the physical resource used to commit to enclaves.
> EPC is divided into 4K pages. An EPC page is 4K in size and always aligned to a
> 4K boundary. Hardware performs additional access control checks to restrict
> access to EPC pages. The Enclave Page Cache Map (EPCM) is a secure structure
> which holds one entry for each EPC page, and is used by hardware to track the
> status of each EPC page (invisible to software). Typically EPC and EPCM are
> reserved by BIOS as Processor Reserved Memory, but the actual amount, size, and
> layout of EPC are model-specific and dependent on BIOS settings. EPC is
> enumerated via the new SGX CPUID leaf, and is reported as reserved memory.
>
> EPC pages can either be invalid or valid. There are 4 valid EPC types in SGX1:
> regular EPC page, SGX Enclave Control Structure (SECS) page, Thread Control
> Structure (TCS) page, and Version Array (VA) page. SGX2 adds Trimmed EPC page.
> Each enclave is associated with one SECS page. Each thread in an enclave is
> associated with one TCS page. VA pages are used in EPC page eviction and reload.
> The Trimmed EPC page type is introduced in SGX2 for when a particular 4K page in
> an enclave is going to be freed (trimmed) at runtime after the enclave is
> initialized.
>
> 1.1.3 ENCLS and ENCLU
>
> Two new instructions ENCLS and ENCLU are introduced to manage enclave and EPC.
> ENCLS can only run in ring 0, while ENCLU can only run in ring 3. Both ENCLS and
> ENCLU have multiple leaf functions, with EAX indicating the specific leaf
> function.
>
> SGX1 supports below ENCLS and ENCLU leaves:
>
> ENCLS:
> - ECREATE, EADD, EEXTEND, EINIT, EREMOVE (Enclave build and destroy)
> - EPA, EBLOCK, ETRACK, EWB, ELDU/ELDB (EPC eviction & reload)
>
> ENCLU:
> - EENTER, EEXIT, ERESUME (Enclave entry, exit, re-enter)
> - EGETKEY, EREPORT (SGX key derivation, attestation)
>
> Additionally, SGX2 supports the below ENCLS and ENCLU leaves for adding/removing
> EPC pages to/from an enclave at runtime after the enclave is initialized, along
> with permission changes.
>
> ENCLS:
> - EAUG, EMODT, EMODPR
>
> ENCLU:
> - EACCEPT, EACCEPTCOPY, EMODPE
>
> The VMM is able to interfere with ENCLS running in the guest (see 1.5.1 VMX
> Changes for Supporting SGX Virtualization) but is unable to interfere with ENCLU.
>
> 1.2 Discovering SGX Capability
>
> 1.2.1 Enumerate SGX via CPUID
>
> If CPUID.0x7.0:EBX.SGX (bit 2) is 1, then the processor supports SGX, and SGX
> capabilities and resources can be enumerated via the new SGX CPUID leaf (0x12).
> CPUID.0x12.0x0 reports SGX capability, such as the presence of SGX1, SGX2, and
> the enclave's maximum size for both 32-bit and 64-bit applications.
> CPUID.0x12.0x1 reports the availability of bits that can be set in
> SECS.ATTRIBUTES. CPUID.0x12.0x2 reports the EPC resource's base and size. The
> platform may support multiple EPC sections, and CPUID.0x12.0x3 and further
> sub-leaves can be used to detect the existence of multiple EPC sections (until
> CPUID reports an invalid EPC section).
>
> Refer to 37.7.2 Intel SGX Resource Enumeration Leaves for full description of
> SGX CPUID 0x12.
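>
> As a purely illustrative sketch (not code from this series), the enumeration
> described above could look like the following; it assumes GCC's <cpuid.h>
> helper and the sub-leaf field layout of leaf 0x12 as documented in the SDM:
>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <cpuid.h>          /* __get_cpuid_count() */
>
>     static void enumerate_sgx_epc(void)
>     {
>         uint32_t eax, ebx, ecx, edx;
>         unsigned int subleaf;
>
>         /* CPUID.0x7.0x0:EBX bit 2 is the SGX feature bit. */
>         __get_cpuid_count(0x7, 0, &eax, &ebx, &ecx, &edx);
>         if ( !(ebx & (1u << 2)) )
>             return;
>
>         /* Sub-leaves 0x2 and onwards of leaf 0x12 describe EPC sections. */
>         for ( subleaf = 2; ; subleaf++ )
>         {
>             uint64_t base, size;
>
>             __get_cpuid_count(0x12, subleaf, &eax, &ebx, &ecx, &edx);
>             if ( (eax & 0xf) == 0 )   /* type 0: invalid, no more sections */
>                 break;
>
>             base = ((uint64_t)(ebx & 0xfffff) << 32) | (eax & 0xfffff000);
>             size = ((uint64_t)(edx & 0xfffff) << 32) | (ecx & 0xfffff000);
>             printf("EPC section %u: base %#llx size %#llx\n", subleaf - 2,
>                    (unsigned long long)base, (unsigned long long)size);
>         }
>     }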
>
> 1.2.2 Intel SGX Opt-in Configuration
>
> On processors that support Intel SGX, IA32_FEATURE_CONTROL also provides the
> SGX_ENABLE bit (bit 18) to turn on/off SGX. Before system software can enable
> and use SGX, BIOS is required to set IA32_FEATURE_CONTROL.SGX_ENABLE = 1 to
> opt in to SGX.
>
> Setting SGX_ENABLE follows the rules of IA32_FEATURE_CONTROL.LOCK (bit 0).
> Software is considered to have opted into Intel SGX if and only if
> IA32_FEATURE_CONTROL.SGX_ENABLE and IA32_FEATURE_CONTROL.LOCK are set to 1.
>
> The setting of IA32_FEATURE_CONTROL.SGX_ENABLE (bit 18) is not reflected by
> SGX CPUID. Enclave instructions will behave differently according to the value
> of CPUID.0x7.0x0:EBX.SGX and whether BIOS has opted in to SGX.
>
> Refer to 37.7.1 Intel SGX Opt-in Configuration for more information.
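>
> For illustration only, a boot-time check of the opt-in state might look like
> the sketch below; the MSR index and bit positions are taken from the SDM, and
> rdmsrl() is assumed to behave like Xen's helper of the same name:
>
>     #define MSR_IA32_FEATURE_CONTROL    0x0000003a
>     #define FEATURE_CONTROL_LOCK        (1ULL << 0)
>     #define FEATURE_CONTROL_SGX_ENABLE  (1ULL << 18)
>
>     static bool sgx_bios_opted_in(void)
>     {
>         uint64_t fc;
>
>         rdmsrl(MSR_IA32_FEATURE_CONTROL, fc);
>
>         /* SGX is usable only if BIOS set both SGX_ENABLE and LOCK. */
>         return (fc & (FEATURE_CONTROL_LOCK | FEATURE_CONTROL_SGX_ENABLE)) ==
>                (FEATURE_CONTROL_LOCK | FEATURE_CONTROL_SGX_ENABLE);
>     }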
>
> 1.3 Enclave Life Cycle
>
> 1.3.1 Constructing & Destroying Enclave
>
> An enclave is created via the ENCLS[ECREATE] leaf by privileged software.
> Basically ECREATE converts an invalid EPC page into a SECS page, according to a
> source SECS structure that resides in normal memory. The source SECS contains
> the enclave's info such as base (linear) address, size, enclave attributes,
> enclave measurement, etc.
>
> After ECREATE, for each 4K linear address space page, privileged software uses
> EADD and EEXTEND to add one EPC page to it. Enclave code/data (residing in
> normal memory) is loaded into the enclave during EADD for each of the enclave's
> 4K pages. After all EPC pages are added to the enclave, privileged software
> calls EINIT to initialize the enclave, and then the enclave is ready to run.
>
> While the enclave is being constructed, the enclave measurement, which is a
> SHA256 hash value, is also built according to the enclave's size, the code/data
> itself and its location in the enclave, etc. The measurement can be used to
> uniquely identify the enclave. The SIGSTRUCT passed to the EINIT leaf also
> contains the measurement specified by untrusted software, via MRENCLAVE. EINIT
> will check the two measurements and will only succeed when the two match.
>
> An enclave is destroyed by running EREMOVE for all of the enclave's EPC pages,
> and then for the enclave's SECS. EREMOVE will report an SGX_CHILD_PRESENT error
> if it is called for the SECS while there are still regular EPC pages that
> haven't been removed from the enclave.
>
> Please refer to SDM chapter 39.1 Constructing an Enclave for more information.
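>
> To make the ENCLS flow above more concrete, here is a rough sketch of how
> ring-0 software drives enclave construction. It is not code from this series:
> encls() is a hypothetical wrapper (ENCLS selects its leaf via EAX, with
> leaf-specific operands in RBX/RCX/RDX and an error code returned in EAX), and
> the PAGEINFO/SIGSTRUCT/EINITTOKEN setup defined by the SDM is omitted:
>
>     enum encls_leaf { ECREATE = 0x0, EADD = 0x1, EINIT = 0x2, EREMOVE = 0x3,
>                       EEXTEND = 0x6 };
>
>     static inline int encls(enum encls_leaf leaf, unsigned long rbx,
>                             unsigned long rcx, unsigned long rdx)
>     {
>         int ret;
>
>         /* ENCLS is encoded as 0F 01 CF. */
>         asm volatile ( ".byte 0x0f, 0x01, 0xcf"
>                        : "=a" (ret)
>                        : "a" (leaf), "b" (rbx), "c" (rcx), "d" (rdx)
>                        : "memory", "cc" );
>         return ret;
>     }
>
>     /* secs_epc: free EPC page for the SECS; epc[]: free EPC pages for the
>      * enclave contents; pginfo is re-pointed at each source page by the
>      * caller's setup code (omitted here). */
>     static int build_enclave_sketch(void *secs_epc, void *pginfo,
>                                     void **epc, unsigned int npages,
>                                     void *sigstruct, void *token)
>     {
>         unsigned int i, off;
>
>         /* 1) ECREATE: turn a free EPC page into the enclave's SECS. */
>         encls(ECREATE, (unsigned long)pginfo, (unsigned long)secs_epc, 0);
>
>         /* 2) EADD each 4K page, then EEXTEND measures it in 256-byte chunks. */
>         for ( i = 0; i < npages; i++ )
>         {
>             encls(EADD, (unsigned long)pginfo, (unsigned long)epc[i], 0);
>             for ( off = 0; off < 4096; off += 256 )
>                 encls(EEXTEND, (unsigned long)secs_epc,
>                       (unsigned long)epc[i] + off, 0);
>         }
>
>         /* 3) EINIT: check SIGSTRUCT/token against the measurement built above. */
>         return encls(EINIT, (unsigned long)sigstruct, (unsigned long)secs_epc,
>                      (unsigned long)token);
>     }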
>
> 1.3.2 Enclave Entry and Exit
>
> 1.3.2.1 Synchronous Entry and Exit
>
> After the enclave is constructed, non-privileged software uses ENCLU[EENTER] to
> enter the enclave and run it. While the process runs in the enclave,
> non-privileged software can use ENCLU[EEXIT] to exit from the enclave and
> return to normal mode.
>
> 1.3.2.2 Asynchronous Enclave Exit
>
> Asynchronous and synchronous events, such as exceptions, interrupts, traps,
> SMIs, and VM exits may occur while executing inside an enclave. These events
> are referred to as Enclave Exiting Events (EEE). Upon an EEE, the processor
> state is securely saved inside the enclave and then replaced by a synthetic
> state to prevent leakage of secrets. The process of securely saving state and
> establishing the synthetic state is called an Asynchronous Enclave Exit (AEX).
>
> After AEX, non-privileged software uses ENCLU[ERESUME] to re-enter the enclave.
> The SGX userspace software maintains a small piece of code (residing in normal
> memory) which basically calls ERESUME to re-enter the enclave. The address of
> this piece of code is called the Asynchronous Exit Pointer (AEP). The AEP is
> specified as a parameter to EENTER and is kept internally in the enclave. Upon
> AEX, the AEP is pushed to the stack, and upon returning from EEE handling, such
> as via IRET, the AEP is loaded into RIP and ERESUME is subsequently called to
> re-enter the enclave.
>
> During AEX the processor does context saving and restoring automatically,
> therefore no change to the interrupt handling of the OS kernel or VMM is
> required. It is the SGX userspace software's responsibility to set up the AEP
> correctly.
>
> Please refer to SDM chapter 39.2 Enclave Entry and Exit for more information.
>
> 1.3.3 EPC Eviction and Reload
>
> SGX also allows privileged software to evict any EPC pages that are used by an
> enclave. The idea is the same as normal memory swapping. Below are the details
> of how to evict EPC pages.
>
> Below is the sequence to evict regular EPC page:
>
> 1) Select one or multiple regular EPC pages from one enclave
> 2) Remove EPT/PT mapping for selected EPC pages
> 3) Send IPIs to remote CPUs to flush TLB of selected EPC pages
> 4) EBLOCK on selected EPC pages
> 5) ETRACK on enclave's SECS page
> 6) allocate one available slot (8-byte) in VA page
> 7) EWB on selected EPC pages
>
> With EWB taking:
>
> - VA slot, to restore eviction version info.
> - one normal 4K page in memory, to store encrypted content of EPC page.
> - one struct PCMD in memory, to store meta data.
>
> (A VA slot is an 8-byte slot in a VA page, which is a particular type of EPC page.)
>
> And below is the sequence to evict a SECS page or VA page:
>
> 1) locate the SECS (or VA) page
> 2) remove EPT/PT mapping for the SECS (or VA) page
> 3) Send IPIs to remote CPUs
> 4) allocate one available slot (8-byte) in a VA page
> 5) EWB on the SECS (or VA) page
>
> And for evicting a SECS page, all regular EPC pages that belong to that SECS
> must be evicted first, otherwise EWB returns an SGX_CHILD_PRESENT error.
>
> And to reload an EPC page:
>
> 1) ELDU/ELDB on EPC page
> 2) setup EPT/PT mapping
>
> With ELDU/ELDB taking:
>
> - location of SECS page
> - linear address of enclave's 4K page (that we are going to reload to)
> - VA slot (used in EWB)
> - 4K page in memory (used in EWB)
> - struct PCMD in memory (used in EWB)
>
> Please refer to SDM chapter 39.5 EPC and Management of EPC pages for more
> information.
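>
> Purely as a structural sketch of the regular-page sequence above (every helper
> below is a placeholder, not an API from this series), the ordering constraints
> could be expressed as:
>
>     /* Caller has already picked the page (step 1) and allocated 'va_slot'
>      * (step 6); 'backing_page' and 'pcmd' live in normal memory. */
>     static int evict_epc_page_sketch(void *secs_page, void *epc_page,
>                                      uint64_t *va_slot, void *backing_page,
>                                      void *pcmd)
>     {
>         unmap_epc_page(epc_page);   /* 2) remove the EPT/PT mapping           */
>         flush_epc_tlbs();           /* 3) IPI remote CPUs to flush stale TLBs */
>         encls_eblock(epc_page);     /* 4) mark the page BLOCKED               */
>         encls_etrack(secs_page);    /* 5) start the tracking cycle on the SECS */
>
>         /* 7) EWB: encrypted contents go to 'backing_page', metadata to
>          * 'pcmd', and the version counter to the VA slot from step 6. */
>         return encls_ewb(epc_page, va_slot, backing_page, pcmd);
>     }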
>
> 1.4 SGX Launch Control
>
> SGX requires running "Launch Enclave" (LE) before running any other enclaves.
> This is because the LE is the only enclave that does not require an EINITTOKEN
> in EINIT. Running any other enclave requires a valid EINITTOKEN, which contains
> a MAC of (the first 192 bytes of) the EINITTOKEN calculated with the EINITTOKEN
> key. EINIT will verify the MAC by internally deriving the EINITTOKEN key, and
> only an EINITTOKEN with a matching MAC will be accepted by EINIT. The EINITTOKEN
> key derivation depends on some info from the LE. The typical process is that the
> LE generates an EINITTOKEN for another enclave according to the LE itself and
> the target enclave, and calculates the MAC by using ENCLU[EGETKEY] to get the
> EINITTOKEN key. Only the LE is able to get the EINITTOKEN key.
>
> Running the LE requires the SHA256 hash of the LE signer's RSA public key
> (SHA256 of sigstruct->modulus) to equal the value of the
> IA32_SGXLEPUBKEYHASH[0-3] MSRs (the 4 MSRs together make up the 256-bit SHA256
> hash value).
>
> If both CPUID.0x7.0x0:EBX.SGX and CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL[bit 30]
> are set, then the IA32_SGXLEPUBKEYHASHn MSRs are available, and the
> IA32_FEATURE_CONTROL MSR has the SGX_LAUNCH_CONTROL_ENABLE bit (bit 17)
> available. Setting the SGX_LAUNCH_CONTROL_ENABLE bit to 1 enables runtime
> changes of IA32_SGXLEPUBKEYHASHn after IA32_FEATURE_CONTROL is locked.
> Otherwise, IA32_SGXLEPUBKEYHASHn are read-only after IA32_FEATURE_CONTROL is
> locked. After reset, IA32_SGXLEPUBKEYHASHn will be set to the hash of Intel's
> default key. On systems that have only CPUID.0x7.0x0:EBX.SGX set,
> IA32_SGXLEPUBKEYHASHn are not available. On such systems EINIT will always
> treat IA32_SGXLEPUBKEYHASHn as Intel's default value, thus only Intel's LE is
> able to run.
>
> On systems with IA32_SGXLEPUBKEYHASHn available, it is up to the BIOS
> implementation to decide whether to let the user choose between leaving
> IA32_SGXLEPUBKEYHASHn in *locked* mode (IA32_SGXLEPUBKEYHASHn are read-only
> after IA32_FEATURE_CONTROL is locked) or *unlocked* mode (IA32_SGXLEPUBKEYHASHn
> are writable by the kernel at runtime). The BIOS may or may not also provide
> options to allow the user to set a custom value of IA32_SGXLEPUBKEYHASHn.
>
> 1.5 SGX Interaction with IA32 and IA64 Architecture
>
> SDM Chapter 42 describes SGX interaction with various features in IA32 and IA64
> architecture. Below outlines the major ones. Refer to Chapter 42 for full
> description of SGX interaction with various IA32 and IA64 features.
>
> 1.5.1 VMX Changes for Supporting SGX Virtualization
>
> A new 64-bit ENCLS-exiting bitmap control field is added to the VMCS (encoding
> 0202EH) to control VMEXIT on ENCLS leaf functions. A new "Enable ENCLS
> exiting" control bit (bit 15) is also defined in the secondary processor-based
> VM-execution controls. Setting "Enable ENCLS exiting" to 1 enables the
> ENCLS-exiting bitmap control. The ENCLS-exiting bitmap controls which ENCLS
> leaves will trigger VMEXIT.
>
> Additionally, two new bits are added to indicate whether a VMEXIT (of any kind)
> came from an enclave. The below two bits will be set if the VMEXIT is from an
> enclave:
> - Bit 27 in the Exit Reason field of the Basic VM-exit information.
> - Bit 4 in the Interruptibility State of the Guest Non-Register State of the VMCS.
>
> Refer to 42.5 Interactions with VMX, 27.2.1 Basic VM-Exit Information, and
> 27.3.4 Saving Non-Register State.
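>
> A minimal sketch of wiring these controls up, in the style of Xen's VMX code
> (the two #defines below are local to the sketch and use the encodings from the
> SDM; the names eventually used by the series may differ):
>
>     #define SECONDARY_EXEC_ENABLE_ENCLS_EXITING  (1u << 15)
>     #define ENCLS_EXITING_BITMAP                 0x0000202e
>
>     static void vmx_enable_encls_trapping(uint64_t encls_bitmap)
>     {
>         unsigned long secondary;
>
>         /* Turn on "Enable ENCLS exiting" in the secondary exec controls. */
>         __vmread(SECONDARY_VM_EXEC_CONTROL, &secondary);
>         __vmwrite(SECONDARY_VM_EXEC_CONTROL,
>                   secondary | SECONDARY_EXEC_ENABLE_ENCLS_EXITING);
>
>         /* Each bit set in the bitmap makes the corresponding ENCLS leaf
>          * (indexed by EAX) cause a VMEXIT. */
>         __vmwrite(ENCLS_EXITING_BITMAP, encls_bitmap);
>     }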
>
> 1.5.2 Interaction with XSAVE
>
> SGX defines a sub-field called X-Feature Request Mask (XFRM) in the attributes
> field of the SECS. On enclave entry, SGX hardware verifies that the XFRM bits in
> SECS.ATTRIBUTES are already enabled in XCR0.
>
> Upon AEX, SGX saves the processor extended state and miscellaneous state to the
> enclave's state-save area (SSA), and clears the processor extended state used by
> the enclave (to prevent leaking secrets).
>
> Refer to 42.7 Interaction with Processor Extended State and Miscellaneous State
>
> 1.5.3 Interaction with S state
>
> When the processor goes into the S3-S5 states, EPC is destroyed, and thus all
> enclaves are destroyed as well.
>
> Refer to 42.14 Interaction with S States.
>
> 2. SGX Virtualization Design
>
> 2.1 High Level Toolstack Changes:
>
> 2.1.1 New 'sgx' XL configure file parameter
>
> EPC is a limited resource. In order to use EPC efficiently among all domains,
> when creating a guest, the administrator should be able to specify the domain's
> virtual EPC size, and should also be able to query every domain's virtual EPC
> size.
>
> For SGX Launch Control virtualization, we should allow the admin to create a VM
> with the VM's virtual IA32_SGXLEPUBKEYHASHn either locked or unlocked, and we
> should also allow the admin to create a VM with a custom IA32_SGXLEPUBKEYHASHn
> value.
>
> For above purposes, below new 'sgx' XL configure file parameter is added:
>
> sgx = 'epc=<size>,lehash=<sha256-hash>,lewr=<0|1>'
>
> Here 'epc' specifies the VM's EPC size in MB; it is mandatory.
>
> When the physical machine is in *locked* mode, neither 'lehash' nor 'lewr'
> can be specified, as the physical machine is unable to change
> IA32_SGXLEPUBKEYHASHn at runtime. Adding either 'lehash' or 'lewr' will
> cause VM creation to fail in that case, and the VM's initial
> IA32_SGXLEPUBKEYHASHn value will be set to the value of the physical MSRs.
>
> When the physical machine is in *unlocked* mode, the VM's initial
> IA32_SGXLEPUBKEYHASHn value will be set to 'lehash' if specified, or to
> Intel's default value. The VM's SGX_LAUNCH_CONTROL_ENABLE bit in
> IA32_FEATURE_CONTROL will be set or cleared depending on whether 'lewr'
> is specified (or explicitly set to true or false).
>
> Please also refer to 2.2.4 Launch Control Support.
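>
> For illustration, a guest config using the proposed parameter might contain
> something like the below (the 64MB size is a made-up example value):
>
>     # 64MB of virtual EPC; guest may rewrite its virtual IA32_SGXLEPUBKEYHASHn
>     sgx = 'epc=64,lewr=1'
>
>     # or: fix the launch enclave signer's key hash at creation time
>     # sgx = 'epc=64,lehash=<sha256-of-LE-signer-public-key>'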
>
> 2.1.2 New XL commands (?)
>
> The administrator should be able to get the physical EPC size, and all domains'
> virtual EPC sizes. For this purpose, we can introduce 2 additional commands:
>
> # xl sgxinfo
>
> Which will print out physical EPC size, and other SGX info (such as SGX1, SGX2,
> etc) if necessary.
>
> # xl sgxlist <did>
>
> Which will print out particular domain's virtual EPC size, or list all virtual
> EPC sizes for all supported domains.
>
> Alternatively, we can also extend existing XL commands by adding new option
>
> # xl info -sgx
>
> Which will print out physical EPC size along with other physinfo. And
>
> # xl list <did> -sgx
>
> Which will print out domain's virtual EPC size.
>
> Comments?
>
> In this RFC the two new commands are not implemented yet.
>
> 2.1.3 Notify domain's virtual EPC base and size to Xen
>
> Xen needs to know guest's EPC base and size in order to populate EPC pages for
> it. Toolstack notifies EPC base and size to Xen via XEN_DOMCTL_set_cpuid.
>
> 2.2 High Level Xen Hypervisor Changes:
>
> 2.2.1 EPC Management
>
> The Xen hypervisor needs to detect SGX, discover EPC, and manage EPC before
> supporting SGX for guests. EPC is detected via SGX CPUID 0x12.0x2. It's possible
> that there are multiple EPC sections (enumerated via sub-leaves 0x3 and so on,
> until an invalid EPC section is reported), but this is typically the case on
> multi-socket servers, where each package has its own EPC.
>
> EPC is reported as reserved memory (so it is not reported as normal memory).
> EPC must be managed in 4K pages. CPU hardware uses the EPCM to track the status
> of each EPC page. Xen needs to manage EPC and provide functions to, e.g.,
> allocate and free EPC pages for guests.
>
> Although on typical physical machines (at least existing ones) EPC is at most
> ~100M in size, we cannot assume a bound on the EPC size; thus, in terms of EPC
> management, it's better to integrate EPC management into Xen's memory management
> framework to take advantage of Xen's existing memory management algorithms.
>
> Specifically, one 'struct page_info' will be created for each EPC page, just
> like normal memory, and a new flag will be defined to identify whether a 'struct
> page_info' is EPC or normal memory. The existing memory allocation API
> alloc_domheap_pages will be reused to allocate EPC pages, by adding a new memflag
> 'MEMF_epc' to indicate EPC allocation rather than normal memory allocation. The
> new 'MEMF_epc' can also be used for EPC ballooning (if required in the future),
> as with the new flag the existing XENMEM_increase{decrease}_reservation and
> XENMEM_populate_physmap can be reused for EPC as well.
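>
> A minimal sketch of what EPC allocation could look like once such a flag
> exists (names follow the description above; the actual code in patches 6~8
> may differ):
>
>     /* Allocate one EPC page for domain 'd' from the EPC zone. */
>     static struct page_info *domain_alloc_epc_page(struct domain *d)
>     {
>         /* Order-0 allocation; MEMF_epc steers the allocator to EPC
>          * ("XENZONE_EPC") instead of normal RAM, and the returned page is
>          * expected to carry the new PGC_epc flag. */
>         return alloc_domheap_pages(d, 0, MEMF_epc);
>     }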
>
> 2.2.2 EPC Virtualization
>
> This part is how to populate EPC for guests. We have 3 choices:
> - Static Partitioning
> - Oversubscription
> - Ballooning
>
> Static Partitioning means all EPC pages will be allocated and mapped to guest
> when it is created, and there's no runtime change of page table mappings for EPC
> pages. Oversubscription means Xen hypervisor supports EPC page swapping between
> domains, meaning Xen is able to evict EPC page from another domain and assign it
> to the domain that needs the EPC. With oversubscription, EPC can be assigned to
> domain on demand, when EPT violation happens. Ballooning is similar to memory
> ballooning. It is basically "Static Partitioning" + "Balloon driver" in guest.
>
> Static Partitioning is the easiest way in terms of implementation, and there
> will be no hypervisor overhead (except EPT overhead of course), because in
> "Static partitioning", there is no EPT violation for EPC, and Xen doesn't need
> to turn on ENCLS VMEXIT for guest as ENCLS runs perfectly in non-root mode.
>
> Ballooning is "Static Partitioning" + "Balloon driver" in the guest. Like
> "Static Partitioning", ballooning doesn't need to turn on ENCLS VMEXIT, and
> doesn't have EPT violations for EPC either. To support ballooning, we need a
> balloon driver in the guest to issue hypercalls to give up or reclaim EPC pages.
> In terms of the hypercall, we have two choices: 1) add a new hypercall for EPC
> ballooning; 2) use the existing XENMEM_{increase/decrease}_reservation with a
> new memory flag, i.e., XENMEMF_epc. I'll discuss whether to add a dedicated
> hypercall in more detail later.
>
> Oversubscription looks nice but requires a more complicated implementation.
> Firstly, as explained in 1.3.3 EPC Eviction and Reload, we need to follow
> specific steps to evict EPC pages, and in order to do that, Xen basically needs
> to trap ENCLS from guests and keep track of EPC page status and enclave info
> from all guests. This is because:
> - To evict a regular EPC page, Xen needs to know the SECS location
> - Xen needs to know the EPC page type: evicting a regular EPC page and evicting
>   a SECS or VA page follow different steps.
> - Xen needs to know the EPC page status: whether the page is blocked or not.
>
> This information can only be obtained by trapping ENCLS from the guest and
> parsing its parameters (to identify the SECS page, etc). Parsing ENCLS
> parameters means we need to know which ENCLS leaf is being trapped, and we need
> to translate the guest's virtual addresses to get physical addresses in order
> to locate EPC pages. And once ENCLS is trapped, we have to emulate ENCLS in
> Xen, which means we need to reconstruct the ENCLS parameters by remapping all
> of the guest's virtual addresses to Xen's virtual addresses
> (gva->gpa->pa->xen_va), as ENCLS always uses *effective addresses*, which are
> translated by the processor when ENCLS runs.
>
> --------------------------------------------------------------
> | ENCLS |
> --------------------------------------------------------------
> | /|\
> ENCLS VMEXIT| | VMENTRY
> | |
> \|/ |
>
> 1) parse ENCLS parameters
> 2) reconstruct(remap) guest's ENCLS parameters
> 3) run ENCLS on behalf of guest (and skip ENCLS)
> 4) on success, update EPC/enclave info, or inject error
>
> And Xen needs to maintain each EPC page's status (type, blocked or not, in an
> enclave or not, etc). Xen also needs to maintain all enclaves' info from all
> guests, in order to find the correct SECS for a regular EPC page, as well as
> the enclave's linear address.
>
> So in general, "Static Partitioning" has the simplest implementation, but is
> obviously not the best way to use EPC efficiently; "Ballooning" has all the
> pros of Static Partitioning but requires a guest balloon driver;
> "Oversubscription" is best in terms of flexibility but requires a complicated
> hypervisor implementation.
>
> We will start with "Static Partitioning". If "Ballooning" is required in the
> future, we will support it. "Oversubscription" should not be needed in the
> foreseeable future.
>
> 2.2.3 Populate EPC for Guest
>
> The toolstack notifies Xen about the domain's EPC base and size via
> XEN_DOMCTL_set_cpuid, so currently Xen populates all EPC pages for the guest in
> XEN_DOMCTL_set_cpuid, particularly when handling XEN_DOMCTL_set_cpuid for
> CPUID.0x12.0x2. Once Xen has checked that the values passed from the toolstack
> are valid, Xen will allocate all EPC pages and set up EPT mappings for the guest.
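>
> A very rough sketch of that population step, assuming static partitioning
> (the helper name and the exact use of the p2m API are illustrative; the real
> code lives in the "populate and destroy EPC" patch):
>
>     static int domain_populate_epc_sketch(struct domain *d,
>                                           unsigned long base_gfn,
>                                           unsigned long npages)
>     {
>         unsigned long i;
>         int rc;
>
>         for ( i = 0; i < npages; i++ )
>         {
>             struct page_info *pg = alloc_domheap_pages(d, 0, MEMF_epc);
>
>             if ( !pg )
>                 return -ENOMEM;
>
>             /* Map the EPC page into the guest with the new p2m_epc type. */
>             rc = guest_physmap_add_entry(d, _gfn(base_gfn + i),
>                                          _mfn(page_to_mfn(pg)), 0, p2m_epc);
>             if ( rc )
>                 return rc;
>         }
>
>         return 0;
>     }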
>
> 2.2.4 Launch Control Support
>
> To support running multiple domains, each running its own LE signed by a
> different owner, the physical machine's BIOS must leave IA32_SGXLEPUBKEYHASHn
> *unlocked* before handing over to Xen. Xen will trap the domain's writes to
> IA32_SGXLEPUBKEYHASHn and keep the values internally per vcpu, and write the
> values to the physical MSRs when the vcpu is scheduled in. This guarantees that
> when EINIT runs in the guest, the guest's virtual IA32_SGXLEPUBKEYHASHn have
> been written to the physical MSRs.
>
> The SGX_LAUNCH_CONTROL_ENABLE bit in the guest's IA32_FEATURE_CONTROL is
> controlled by the newly added 'lewr' XL parameter (see 2.1.1 New 'sgx' XL
> configure file parameter).
>
> If the physical IA32_SGXLEPUBKEYHASHn are *locked* by the machine's BIOS, then
> only MSR reads are allowed from the guest, and Xen will inject an error for the
> guest's MSR writes.
>
> In addition, if the physical IA32_SGXLEPUBKEYHASHn are *locked*, then creating
> a guest with the 'lehash' or 'lewr' parameter will fail, as in that case Xen is
> not able to propagate the guest's virtual IA32_SGXLEPUBKEYHASHn to the physical
> MSRs.
>
> If the physical IA32_SGXLEPUBKEYHASHn are not available
> (CPUID.0x7.0x0:ECX.SGX_LAUNCH_CONTROL is not present), then creating a VM with
> 'lehash' or 'lewr' will also fail. In addition, any MSR read/write of
> IA32_SGXLEPUBKEYHASHn from the guest is invalid and Xen will inject an error in
> that case.
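>
> A rough sketch of the WRMSR trap and the context-switch update described
> above; the real series routes this through the msr_policy framework, so the
> helper names and the per-vcpu field below are illustrative only:
>
>     #define MSR_IA32_SGXLEPUBKEYHASH0  0x0000008c  /* HASH1..3 are 0x8d-0x8f */
>
>     /* WRMSR interception: cache the value the guest wants per vcpu. */
>     static int vmx_write_sgxlepubkeyhash(struct vcpu *v, unsigned int msr,
>                                          uint64_t val)
>     {
>         unsigned int idx = msr - MSR_IA32_SGXLEPUBKEYHASH0;
>
>         /* Refuse if the guest has no SGX_LC, or writes are not permitted
>          * ('lewr' not set, or the physical MSRs are locked). */
>         if ( !v->domain->arch.cpuid->feat.sgx_lc ||
>              !sgx_lewr_allowed(v->domain) )
>             return X86EMUL_EXCEPTION;                /* inject #GP */
>
>         v->arch.sgx_lepubkeyhash[idx] = val;
>         return X86EMUL_OKAY;
>     }
>
>     /* On context switch in: make the physical MSRs match the vcpu's virtual
>      * values, so a subsequent EINIT in the guest sees the right hash. */
>     static void sgx_ctxt_switch_to(struct vcpu *v)
>     {
>         unsigned int i;
>
>         for ( i = 0; i < 4; i++ )
>             wrmsrl(MSR_IA32_SGXLEPUBKEYHASH0 + i, v->arch.sgx_lepubkeyhash[i]);
>     }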
>
> 2.2.5 CPUID Emulation
>
> Most of the native SGX CPUID info can be exposed to the guest, except the below
> two parts:
> - Sub-leaf 0x2 needs to report the domain's virtual EPC base and size, instead
>   of the physical EPC info.
> - Sub-leaf 0x1 needs to be consistent with the guest's XCR0. For the reason
>   behind this please refer to 1.5.2 Interaction with XSAVE.
>
> 2.2.6 EPT Violation & ENCLS Trapping Handling
>
> Only needed when Xen supports EPC Oversubscription, as explained above.
>
> 2.2.7 Guest Suspend & Resume
>
> On hardware, EPC is destroyed when power goes to S3-S5. So Xen will destroy the
> guest's EPC when the guest's power goes into S3-S5. Currently Xen is notified
> of S state changes by QEMU via HVM_PARAM_ACPI_S_STATE, where Xen will destroy
> the EPC if the S state is S3-S5.
>
> Specifically, Xen will run EREMOVE for each of the guest's EPC pages, as the
> guest may not handle EPC suspend & resume correctly, in which case the guest's
> EPC pages may physically still be valid; Xen needs to run EREMOVE to make sure
> all EPC pages become invalid. Otherwise further guest operations on EPC may
> fault, as the guest assumes all EPC pages are invalid after it is resumed.
>
> For SECS pages, EREMOVE may fail with SGX_CHILD_PRESENT, in which case Xen will
> keep the SECS page on a list, and call EREMOVE for those pages again after
> EREMOVE has been run on all other EPC pages. This time the EREMOVE on the SECS
> will succeed, as all children (regular EPC pages) have already been removed.
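>
> The "EREMOVE everything, retry SECS pages" idea above could be sketched as
> follows (encls_eremove() is a placeholder; the page-list usage mirrors Xen's
> page_list_* helpers, and SGX_CHILD_PRESENT is the SDM-defined error code):
>
>     static void domain_reset_epc_sketch(struct page_list_head *epc_list)
>     {
>         struct page_info *pg, *tmp;
>         PAGE_LIST_HEAD(secs_retry);
>
>         /* Pass 1: EREMOVE every page; defer SECS pages that still have
>          * children to a second pass. */
>         page_list_for_each_safe ( pg, tmp, epc_list )
>         {
>             if ( encls_eremove(page_to_maddr(pg)) == SGX_CHILD_PRESENT )
>             {
>                 page_list_del(pg, epc_list);
>                 page_list_add_tail(pg, &secs_retry);
>             }
>         }
>
>         /* Pass 2: all regular pages are gone now, so SECS removal succeeds. */
>         page_list_for_each ( pg, &secs_retry )
>             encls_eremove(page_to_maddr(pg));
>     }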
>
> 2.2.8 Destroying Domain
>
> Normally Xen just frees all of a domain's EPC pages when it is destroyed. But
> Xen will also do EREMOVE on all of the guest's EPC pages (as described in 2.2.7
> above) before freeing them, as the guest may shut down unexpectedly (e.g., the
> user kills the guest), in which case the guest's EPC may still be valid.
>
> 2.3 Additional Point: Live Migration, Snapshot Support (?)
>
> Actually from hardware's point of view, SGX is not migratable. There are two
> reasons:
>
> - SGX key architecture cannot be virtualized.
>
>   For example, some keys are bound to the CPU, such as the Sealing key, the
>   EREPORT key, etc. If a VM is migrated to another machine, the same enclave
>   will derive different keys. Taking the Sealing key as an example, the
>   Sealing key is typically used by an enclave (an enclave can get the sealing
>   key via EGETKEY) to *seal* its secrets to the outside (e.g., persistent
>   storage) for later use. If the Sealing key changes after VM migration, then
>   the enclave can never get the sealed secrets back using the sealing key, as
>   it has changed, and the old sealing key cannot be recovered.
>
> - There's no ENCLS leaf to evict an EPC page to normal memory while, at the
>   same time, keeping its content in EPC. Currently once an EPC page is
>   evicted, the EPC page becomes invalid. So technically, we are unable to
>   implement live migration (or checkpointing, or snapshots) for enclaves.
>
> But with some workarounds, and given some facts about existing SGX drivers,
> technically we are able to support live migration (or even checkpointing and
> snapshots). This is because:
>
> - Changing keys (which are bound to the CPU) is not a problem in reality
>
>   Take the Sealing key as an example. Losing sealed data is not a problem,
>   because the sealing key is only supposed to encrypt secrets that can be
>   provisioned again. The typical working model is: the enclave gets secrets
>   provisioned from a remote party (the service provider), and uses the sealing
>   key to store them for later use. When the enclave tries to *unseal* using
>   the sealing key, if the sealing key has changed, the enclave will find the
>   data corrupted (integrity check failure), so it will ask for the secrets to
>   be provisioned again from the remote party. Another reason is that, in a
>   data center, VMs typically share lots of data, and as the sealing key is
>   bound to the CPU, data encrypted by one enclave on one machine cannot be
>   shared by another enclave on another machine. So from an SGX app writer's
>   point of view, the developer should treat the Sealing key as a changeable
>   key, and should handle loss of sealed data anyway. The Sealing key should
>   only be used to seal secrets that can be easily provisioned again.
>
>   For other keys such as the EREPORT key and provisioning key, which are used
>   for local attestation and remote attestation, losing them is not a problem
>   either, due to the second reason below.
>
> - Sudden loss of EPC is not a problem.
>
>   On hardware, EPC will be lost if the system goes to S3-S5, or is reset, or
>   shut down, and the SGX driver needs to handle loss of EPC due to power
>   transitions. This is done by cooperation between the SGX driver and
>   userspace SGX SDK/apps. However during live migration there may be no power
>   transition in the guest, so there may be no EPC loss during live migration.
>   And technically we cannot *really* live migrate an enclave (explained
>   above), so it looks infeasible. But the fact is that both the Linux SGX
>   driver and the Windows SGX driver already support *sudden* loss of EPC (not
>   just EPC loss during power transitions), which means both drivers are able
>   to recover if EPC is lost at any point at runtime. With this, technically we
>   are able to support live migration by simply ignoring EPC. After the VM is
>   migrated, the destination VM will only suffer a *sudden* loss of EPC, which
>   both the Windows SGX driver and the Linux SGX driver are already able to
>   handle.
>
>   But we must point out that such *sudden* loss of EPC is not hardware
>   behavior, and SGX drivers for other OSes (such as FreeBSD) may not implement
>   this, so for those guests the destination VM will behave in an unexpected
>   manner. But I am not sure we need to care about other OSes.
>
> For the same reason, we are able to support checkpointing for SGX guests (only
> Linux and Windows).
>
> For snapshots, we can support snapshotting an SGX guest by either:
>
> - Suspending the guest before the snapshot (S3-S5). This works for all guests
>   but requires the user to manually suspend the guest.
> - Issuing a hypercall to destroy the guest's EPC in save_vm. This only works
>   for Linux and Windows but doesn't require user intervention.
>
> What's your comments?
>
> 3. Reference
>
> - Intel SGX Homepage
> https://software.intel.com/en-us/sgx
>
> - Linux SGX SDK
> https://01.org/intel-software-guard-extensions
>
> - Linux SGX driver for upstreaming
> https://github.com/01org/linux-sgx
>
> - Intel SGX Specification (SDM Vol 3D)
> https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf
>
> - Paper: Intel SGX Explained
> https://eprint.iacr.org/2016/086.pdf
>
> - ISCA 2015 tutorial slides for Intel® SGX - Intel® Software
> https://software.intel.com/sites/default/files/332680-002.pdf
>
> Boqun Feng (5):
> xen: mm: introduce non-scrubbable pages
> xen: mm: manage EPC pages in Xen heaps
> xen: x86/mm: add SGX EPC management
> xen: x86: add functions to populate and destroy EPC for domain
> xen: tools: add SGX to applying MSR policy
>
> Kai Huang (12):
> xen: x86: expose SGX to HVM domain in CPU featureset
> xen: x86: add early stage SGX feature detection
> xen: vmx: detect ENCLS VMEXIT
> xen: x86/mm: introduce ioremap_wb()
> xen: p2m: new 'p2m_epc' type for EPC mapping
> xen: x86: add SGX cpuid handling support.
> xen: vmx: handle SGX related MSRs
> xen: vmx: handle ENCLS VMEXIT
> xen: vmx: handle VMEXIT from SGX enclave
> xen: x86: reset EPC when guest got suspended.
> xen: tools: add new 'sgx' parameter support
> xen: tools: add SGX to applying CPUID policy
>
> docs/misc/xen-command-line.markdown | 8 +
> tools/libxc/Makefile | 1 +
> tools/libxc/include/xc_dom.h | 4 +
> tools/libxc/include/xenctrl.h | 16 +
> tools/libxc/xc_cpuid_x86.c | 68 ++-
> tools/libxc/xc_msr_x86.h | 10 +
> tools/libxc/xc_sgx.c | 82 +++
> tools/libxl/libxl.h | 3 +-
> tools/libxl/libxl_cpuid.c | 15 +-
> tools/libxl/libxl_create.c | 10 +
> tools/libxl/libxl_dom.c | 65 ++-
> tools/libxl/libxl_internal.h | 2 +
> tools/libxl/libxl_nocpuid.c | 4 +-
> tools/libxl/libxl_types.idl | 11 +
> tools/libxl/libxl_x86.c | 12 +
> tools/ocaml/libs/xc/xenctrl_stubs.c | 11 +-
> tools/python/xen/lowlevel/xc/xc.c | 11 +-
> tools/xl/xl_parse.c | 86 +++
> tools/xl/xl_parse.h | 1 +
> xen/arch/x86/Makefile | 1 +
> xen/arch/x86/cpu/common.c | 15 +
> xen/arch/x86/cpuid.c | 62 ++-
> xen/arch/x86/domctl.c | 87 ++-
> xen/arch/x86/hvm/hvm.c | 3 +
> xen/arch/x86/hvm/vmx/vmcs.c | 16 +-
> xen/arch/x86/hvm/vmx/vmx.c | 68 +++
> xen/arch/x86/hvm/vmx/vvmx.c | 11 +
> xen/arch/x86/mm.c | 9 +-
> xen/arch/x86/mm/p2m-ept.c | 3 +
> xen/arch/x86/mm/p2m.c | 41 ++
> xen/arch/x86/msr.c | 6 +-
> xen/arch/x86/sgx.c | 815 ++++++++++++++++++++++++++++
> xen/common/page_alloc.c | 39 +-
> xen/include/asm-arm/mm.h | 9 +
> xen/include/asm-x86/cpufeature.h | 4 +
> xen/include/asm-x86/cpuid.h | 29 +-
> xen/include/asm-x86/hvm/hvm.h | 3 +
> xen/include/asm-x86/hvm/vmx/vmcs.h | 8 +
> xen/include/asm-x86/hvm/vmx/vmx.h | 3 +
> xen/include/asm-x86/mm.h | 19 +-
> xen/include/asm-x86/msr-index.h | 6 +
> xen/include/asm-x86/msr.h | 5 +
> xen/include/asm-x86/p2m.h | 12 +-
> xen/include/asm-x86/sgx.h | 86 +++
> xen/include/public/arch-x86/cpufeatureset.h | 3 +-
> xen/include/xen/mm.h | 2 +
> xen/tools/gen-cpuid.py | 3 +
> 47 files changed, 1757 insertions(+), 31 deletions(-)
> create mode 100644 tools/libxc/xc_sgx.c
> create mode 100644 xen/arch/x86/sgx.c
> create mode 100644 xen/include/asm-x86/sgx.h
>
> --
> 2.15.0
>