From mboxrd@z Thu Jan  1 00:00:00 1970
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Re: [PATCH v3] docs: add PVH specification
Date: Sat, 20 Sep 2014 15:15:10 -0400
Message-ID: <20140920191510.GA2882@laptop.dumpdata.com>
References: <1411060764-4016-1-git-send-email-roger.pau@citrix.com>
In-Reply-To: <1411060764-4016-1-git-send-email-roger.pau@citrix.com>
To: Roger Pau Monne <roger.pau@citrix.com>
Cc: xen-devel@lists.xenproject.org, David Vrabel <david.vrabel@citrix.com>, Jan Beulich <JBeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

On Thu, Sep 18, 2014 at 07:19:24PM +0200, Roger Pau Monne wrote:
> Introduce a document that describes the interfaces used on PVH. This
> document has been designed from a guest OS point of view (i.e.: what a guest
> needs to do in order to support PVH).
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Acked-by: David Vrabel <david.vrabel@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Mukesh Rathor <mukesh.rathor@oracle.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
> The document is still far from complete IMHO, but it might be best to just
> commit what we currently have rather than wait for a full document.
>
> I will try to fill the gaps as I go implementing new features on FreeBSD.
>
> I've retained David's Ack from v2 in this version.
> ---
>  docs/misc/pvh.markdown | 367 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 367 insertions(+)
>  create mode 100644 docs/misc/pvh.markdown
>
> diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
> new file mode 100644
> index 0000000..120ede7
> --- /dev/null
> +++ b/docs/misc/pvh.markdown
> @@ -0,0 +1,367 @@
> +# PVH Specification #
> +
> +## Rationale ##
> +
> +PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
> +on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
> +virtualization extensions present in modern x86 CPUs in order to
> +improve performance.
> +
> +PVH is considered a mix between PV and HVM, and can be seen as a PV guest
> +that runs inside of an HVM container, or as a PVHVM guest without any emulated
> +devices. The design goal of PVH is to provide the best performance possible and
> +to reduce the amount of modifications needed for a guest OS to run in this mode
> +(compared to pure PV).
> +
> +This document tries to describe the interfaces used by PVH guests, focusing
> +on how an OS should make use of them in order to support PVH.
> +
> +## Early boot ##
> +
> +PVH guests use the PV boot mechanism, that means that the kernel is loaded and
> +directly launched by Xen (by jumping into the entry point). In order to do this
> +Xen ELF Notes need to be added to the guest kernel, so that they contain the
> +information needed by Xen. Here is an example of the ELF Notes added to the
> +FreeBSD amd64 kernel in order to boot as PVH:
> +
> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,       .asciz, "FreeBSD")
> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION,  .asciz, __XSTRING(__FreeBSD_version))
> +    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION,    .asciz, "xen-3.0")
> +    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      .quad,  KERNBASE)
> +    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET,   .quad,  KERNBASE)
> +    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          .quad,  xen_start)
> +    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad,  hypercall_page)
> +    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW,   .quad,  HYPERVISOR_VIRT_START)
> +    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,       .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
> +    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "yes")
> +    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,   .long,  PG_V, PG_V)
> +    ELFNOTE(Xen, XEN_ELFNOTE_LOADER,         .asciz, "generic")
> +    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long,  0)
> +    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB,     .asciz, "yes")
> +
> +On the linux side, the above can be found in `arch/x86/xen/xen-head.S`.

s/linux/Linux/

> +
> +It is important to highlight the following notes:
> +
> +  * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry
> +    point.
> +  * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
> +    hypercal page inside of the guest kernel (this memory region will be filled
> +    by Xen prior to booting).
> +  * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel.
> +    In the example above the kernel is only able to boot as a PVH guest, but
> +    those options can be mixed with the ones used by pure PV guests in order to
> +    have a kernel that supports both PV and PVH (like Linux). The list of
> +    options available can be found in the `features.h` public header.
> +


Note that 'hvm_callback_vector' is in XEN_ELFNOTE_FEATURES. Older hypervisors
will balk at this being part of it, so it can also be put in
XEN_ELFNOTE_SUPPORTED_FEATURES, which older hypervisors will ignore.

> +Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
> +paging enabled (either long mode or protected mode with paging turned on
> +depending on the kernel bitness) and some basic page tables setup. An important
> +distinction for a 64bit PVH is that it is launched at privilege level 0 as
> +opposed to a 64bit PV guest which is launched at privilege level 3.
> +
> +Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
> +memory address were Xen has placed the `start_info` structure. The `rsp` (`esp`
> +on 32bits) will point to the top of an initial single page stack, that can be
> +used by the guest kernel. The `start_info` structure contains all the info the
> +guest needs in order to initialize. More information about the contents can be
> +found on the `xen.h` public header.

s/on/in/
> +
> +### Initial amd64 control registers values ###
> +
> +Initial values for the control registers are set up by Xen before booting the
> +guest kernel. The guest kernel can expect to find the following features
> +enabled by Xen.
> +
> +`CR0` has the following bits set by Xen:
> +
> +  * PE (bit 0): protected mode enable.
> +  * ET (bit 4): 387 or newer processor.
> +  * PG (bit 31): paging enabled.

Also TS, at least according to this comment in the Linux code:

/* Some of these are setup in 'secondary_startup_64'. The others:
 * X86_CR0_TS, X86_CR0_PE, X86_CR0_ET are set by Xen for HVM guests
 * (which PVH shared codepaths), while X86_CR0_PG is for PVH. */

Perhaps it is incorrect?
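
If it helps while checking this, the bit positions can be tested with a
small C sketch like the one below. It is only an illustration: the PVH_CR0_*
macro names and both helper functions are made up for this mail (Linux spells
the bits X86_CR0_*), and whether TS is really set on entry is exactly the
open question above, so it is kept out of the "documented" mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR0 bit positions. The PVH_ prefix is hypothetical. */
#define PVH_CR0_PE (UINT64_C(1) << 0)   /* protected mode enable */
#define PVH_CR0_TS (UINT64_C(1) << 3)   /* task switched */
#define PVH_CR0_ET (UINT64_C(1) << 4)   /* extension type (387 or newer) */
#define PVH_CR0_PG (UINT64_C(1) << 31)  /* paging enabled */

/* The three bits the document lists as set by Xen on entry. */
#define PVH_CR0_DOCUMENTED (PVH_CR0_PE | PVH_CR0_ET | PVH_CR0_PG)

/* True if every documented bit is set in the given CR0 value. */
static bool cr0_has_documented_bits(uint64_t cr0)
{
    return (cr0 & PVH_CR0_DOCUMENTED) == PVH_CR0_DOCUMENTED;
}

/* True if TS is set as well (the bit the Linux comment mentions). */
static bool cr0_has_ts(uint64_t cr0)
{
    return (cr0 & PVH_CR0_TS) != 0;
}
```

A guest could feed the value it reads from CR0 at entry into both helpers
to see which of the two descriptions matches the hypervisor it runs on.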

> +
> +`CR4` has the following bits set by Xen:
> +
> +  * PAE (bit 5): PAE enabled.
> +
> +And finally in `EFER` the following features are enabled:
> +
> +  * LME (bit 8): Long mode enable.
> +  * LMA (bit 10): Long mode active.
> +
> +At least the following flags in `EFER` are guaranteed to be disabled:
> +
> +  * SCE (bit 0): System call extensions disabled.
> +  * NXE (bit 11): No-Execute disabled.
> +
> +There's no guarantee about the state of the other bits in the `EFER` register.
> +
> +All the segments selectors are set with a flat base at zero.
> +
> +The `cs` segment selector attributes are set to 0x0a09b, which describes an
> +executable and readable code segment only accessible by the most privileged
> +level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag
> +unset).
> +
> +The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set
> +to the same values. The attributes are set to 0x0c093, which implies a read and
> +write data segment only accessible by the most privileged level.

I think the SS, ES, FS, GS are set to the null selector in 64-bit mode.
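
For reference, the two attribute values quoted above can be decoded with the
sketch below. It assumes the VMX access-rights layout (type in bits 0-3, S in
bit 4, DPL in bits 5-6, P in bit 7, AVL/L/D/G in bits 12-15), which is the
layout that matches the description in the text. The struct and function
names are made up for illustration and are not part of any Xen header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of a segment attribute word. */
struct seg_attr {
    unsigned int type;  /* 4-bit segment type */
    bool s;             /* 1 = code/data, 0 = system segment */
    unsigned int dpl;   /* descriptor privilege level */
    bool present;
    bool avl;
    bool l;             /* 64-bit code segment */
    bool db;            /* default operand size (the `D` flag) */
    bool g;             /* granularity */
};

/* Decode an attribute word, assuming the VMX access-rights bit layout. */
static struct seg_attr decode_seg_attr(uint32_t attr)
{
    struct seg_attr a;

    a.type    = attr & 0xf;
    a.s       = (attr >> 4) & 1;
    a.dpl     = (attr >> 5) & 3;
    a.present = (attr >> 7) & 1;
    a.avl     = (attr >> 12) & 1;
    a.l       = (attr >> 13) & 1;
    a.db      = (attr >> 14) & 1;
    a.g       = (attr >> 15) & 1;
    return a;
}
```

Under that assumption 0xa09b decodes to type 0xb (execute/read, accessed),
DPL 0, present, L set and D clear, and 0xc093 decodes to type 0x3
(read/write, accessed), DPL 0, present, D set and L clear.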

> +
> +The `FS.base` and `GS.base` MSRs are zeroed out.

... and the 'KERNEL_GS.base' MSR as well.

> +
> +The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to
> +not trigger a fault until after they have been properly set. The way of setting
> +the IDT and the GDT is using the native instructions as would be done on bare
> +metal.
> +
> +The `RFLAGS` register is guaranteed to be clear when jumping into the kernel
> +entry point, with the exception of the reserved bit 1 set.
> +
> +## Memory ##
> +
> +Since PVH guests rely on virtualization extensions provided by the CPU, they
> +have access to a hardware virtualized MMU, which means page-table related
> +operations should use the same instructions used on native.
> +
> +There are however some differences with native. The usage of native MTRR
> +operations is forbidden, and `XENPF_*_memtype` hypercalls should be used
> +instead. This can be avoided by simply not using MTRR and setting all the
> +memory attributes using PAT, which doesn't require the usage of any hypercalls.
> +
> +Since PVH doesn't use a BIOS in order to boot, the physical memory map has
> +to be retrieved using the `XENMEM_memory_map` hypercall, which will return
> +an e820 map. This memory map might contain holes that describe MMIO regions,
> +that will be already setup by Xen.
> +
> +*TODO*: we need to figure out what to do with MMIO regions, right now Xen
> +sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We
> +need to decide what to do with MMIO regions above 4GB on Dom0, and what to do
> +for PVH DomUs with pci-passthrough.
> +
> +In the case of a guest started with memory != maxmem, the e820 memory map
> +returned by Xen will contain the memory up to maxmem. The guest has to be very
> +careful to only use the lower memory pages up to the value contained in
> +`start_info->nr_pages` because any memory page above that value will not be
> +populated.
> +
> +## Physical devices ##
> +
> +When running as Dom0 the guest OS has the ability to interact with the physical
> +devices present in the system. A note should be made that PVH guests require
> +a working IOMMU in order to interact with physical devices.
> +
> +The first step in order to manipulate the devices is to make Xen aware of
> +them. Due to the fact that all the hardware description on x86 comes from
> +ACPI, Dom0 is responsible of parsing the ACPI tables and notify Xen about the
> +devices it finds. This is done with the `PHYSDEVOP_pci_device_add` hypercall.
> +
> +*TODO*: explain the way to register the different kinds of PCI devices, like
> +devices with virtual functions.
> +
> +## Interrupts ##
> +
> +All interrupts on PVH guests are routed over event channels, see
> +[Event Channel Internals][event_channels] for more detailed information about
> +event channels. In order to inject interrupts into the guest an IDT vector is
> +used. This is the same mechanism used on PVHVM guests, and allows having
> +per-cpu interrupts that can be used to deliver timers or IPIs.
> +
> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> +is used with the following values:
> +
> +    domid = DOMID_SELF
> +    index = HVM_PARAM_CALLBACK_IRQ
> +    value = (0x2 << 56) | vector_value

And naturally the OS has to program the IDT entry for 'vector_value' using
the bare-metal mechanism.
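
One detail that might be worth spelling out for C readers: written literally,
`0x2 << 56` shifts a plain int and is undefined, so the constant has to be
widened to 64 bits first. A minimal sketch of how the value could be built
follows; the macro and function names are hypothetical, only the "type 2 in
the top byte, vector in the low byte" encoding comes from the text above.

```c
#include <stdint.h>

/* Hypothetical names; the encoding itself is the one quoted above. */
#define CALLBACK_VIA_TYPE_VECTOR UINT64_C(2)
#define CALLBACK_VIA_TYPE_SHIFT  56

/* Build the 64-bit value passed along with HVM_PARAM_CALLBACK_IRQ. */
static uint64_t callback_via_vector(uint8_t vector)
{
    return (CALLBACK_VIA_TYPE_VECTOR << CALLBACK_VIA_TYPE_SHIFT) | vector;
}
```

For example, vector 0xf3 would give 0x02000000000000f3.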

> +
> +In order to know which event channel has fired, we need to look into the
> +information provided in the `shared_info` structure. The `evtchn_pending`
> +array is used as a bitmap in order to find out which event channel has
> +fired. Event channels can also be masked by setting it's port value in the
> +`shared_info->evtchn_mask` bitmap.
> +
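
A short example of the bitmap lookup might help here. The sketch below works
on plain arrays so it can be read on its own; it assumes 64-bit bitmap words
(as on amd64) and the function name is made up. A real handler would also
have to clear the pending bit atomically and, as far as I remember, consult
the per-vCPU `evtchn_pending_sel` word first, so treat this as an outline
only.

```c
#include <stddef.h>
#include <stdint.h>

#define EVTCHN_WORD_BITS 64  /* assumption: 64-bit bitmap words */

/*
 * Return the lowest port that is pending and not masked,
 * or -1 if no such port exists in the first nwords words.
 */
static int evtchn_next_pending(const uint64_t *pending,
                               const uint64_t *mask, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++) {
        uint64_t ready = pending[w] & ~mask[w];

        if (ready == 0)
            continue;
        for (int bit = 0; bit < EVTCHN_WORD_BITS; bit++) {
            if (ready & (UINT64_C(1) << bit))
                return (int)(w * EVTCHN_WORD_BITS + bit);
        }
    }
    return -1;
}
```
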
> +### Interrupts from physical devices ###
> +
> +When running as Dom0 (or when using pci-passthrough) interrupts from physical
> +devices are routed over event channels. There are 3 different kind of
> +physical interrupts that can be routed over event channels by Xen: IO APIC,
> +MSI and MSI-X interrupts.
> +
> +Since physical interrupts usually need EOI (End Of Interrupt), Xen allows the
> +registration of a memory region that will contain whether a physical interrupt
> +needs EOI from the guest or not. This is done with the
> +`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a parameter containing the
> +physical address of the memory page that will act as a bitmap. Then in order to
> +find out if an IRQ needs EOI or not, the OS can perform a simple bit test on the
> +memory page using the PIRQ value.
> +
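
The "simple bit test" could be shown explicitly. A minimal sketch, assuming
the page registered with `PHYSDEVOP_pirq_eoi_gmfn_v2` is mapped at `eoi_page`
and that bit N of the page corresponds to PIRQ N (byte-wise indexing gives
the same answer as a word-wise test on little-endian x86). The function name
is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the bit for this PIRQ is set in the EOI bitmap page. */
static bool pirq_needs_eoi(const uint8_t *eoi_page, unsigned int pirq)
{
    return (eoi_page[pirq / 8] >> (pirq % 8)) & 1;
}
```

The caller is expected to keep `pirq` below the number of bits in the page
(4096 * 8 for a 4KB page).
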
> +### IO APIC interrupt routing ###
> +
> +IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
> +hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
> +hypercall, as an example IRQ#9 is used here:
> +
> +    domid = DOMID_SELF
> +    type = MAP_PIRQ_TYPE_GSI
> +    index = 9
> +    pirq = 9
> +
> +The IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also
> +be configured using the `PHYSDEVOP_setup_gsi` hypercall:
> +
> +    gsi = 9 # This is the IRQ value.
> +    triggering = 0
> +    polarity = 0
> +
> +In this example the IRQ would be configured to use edge triggering and high
> +polarity.
> +
> +Finally the PIRQ can be bound to an event channel using the
> +`EVTCHNOP_bind_pirq`, that will return the event channel port the PIRQ has been
> +assigned. After this the event channel will be ready for delivery.
> +
> +*NOTE*: when running as Dom0, the guest has to parse the interrupt overrides
> +found on the ACPI tables and notify Xen about them.
> +
> +### MSI ###
> +
> +In order to configure MSI interrupts for a device, Xen must be made aware of
> +it's presence first by using the `PHYSDEVOP_pci_device_add` as described above.
> +Then the `PHYSDEVOP_map_pirq` hypercall is used:
> +
> +    domid = DOMID_SELF
> +    type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
> +    index = -1
> +    pirq = -1
> +    bus = pci_device_bus
> +    devfn = pci_device_function
> +    entry_nr = number of MSI interrupts
> +
> +The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt
> +source is being configured. On devices that support MSI interrupt groups
> +`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the
> +number of MSI interrupts in the `entry_nr` field.
> +
> +The values in the `bus` and `devfn` field should be the same as the ones used
> +when registering the device with `PHYSDEVOP_pci_device_add`.
> +
> +### MSI-X ###
> +
> +*TODO*: how to register/use them.
> +
> +## Event timers and timecounters ##
> +
> +Since some hardware is not available on PVH (like the local APIC), Xen provides
> +the OS with suitable replacements in order to get the same functionality. One
> +of them is the timer interface. Using a set of hypercalls, a guest OS can set
> +event timers that will deliver and event channel interrupt to the guest.
> +
> +In order to use the timer provided by Xen the guest OS first needs to register
> +a VIRQ event channel to be used by the timer to deliver the interrupts. The
> +event channel is registered using the `EVTCHNOP_bind_virq` hypercall, that
> +only takes two parameters:
> +
> +    virq = VIRQ_TIMER
> +    vcpu = vcpu_id
> +
> +The port that's going to be used by Xen in order to deliver the interrupt is
> +returned in the `port` field. Once the interrupt is set, the timer can be
> +programmed using the `VCPUOP_set_singleshot_timer` hypercall.
> +
> +    flags = VCPU_SSHOTTMR_future
> +    timeout_abs_ns = absolute value when the timer should fire
> +
> +It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must
> +be executed from the same vCPU where the timer should fire, or else Xen will
> +refuse to set it. This is a single-shot timer, so it must be set by the OS
> +every time it fires if a periodic timer is desired.
> +
> +Xen also shares a memory region with the guest OS that contains time related
> +values that are updated periodically. This values can be used to implement a
> +timecounter or to obtain the current time. This information is placed inside of
> +`shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the guest has
> +been launched) can be calculated using the following expression and the values
> +stored in the `vcpu_time_info` struct:
> +
> +    system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
> +
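
The expression above hides two details that may deserve a note: `tsc_shift`
is signed, so a negative value means a right shift, and the 64-bit by 32-bit
multiplication needs more than 64 bits of intermediate precision. A possible
C sketch is below. The `time_sample` struct is a hypothetical local copy of
the fields named in the text, not the real `vcpu_time_info` layout, and a
real reader would also have to retry around the struct's version counter to
get a consistent snapshot.

```c
#include <stdint.h>

/* Hypothetical snapshot of the fields used by the expression above. */
struct time_sample {
    uint64_t tsc_timestamp;
    uint64_t system_time;        /* ns at tsc_timestamp */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;          /* signed: negative means shift right */
};

/* Compute (delta shifted by shift) * mul >> 32 without losing high bits. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* Split delta so each partial product fits in 64 bits. */
    return ((delta >> 32) * mul) + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Nanoseconds since guest start for a given TSC reading. */
static uint64_t uptime_ns(const struct time_sample *t, uint64_t tsc)
{
    return t->system_time +
           scale_delta(tsc - t->tsc_timestamp, t->tsc_to_system_mul,
                       t->tsc_shift);
}
```

With a multiplier of 0x80000000 (0.5 in 32.32 fixed point) and a shift of 0,
a TSC delta of 1000 ticks adds 500ns to `system_time`.
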
> +The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
> +calculated using the above value, plus the timeout the system wants to set.
> +
> +If the OS also wants to obtain the current wallclock time, the value calculated
> +above has to be added to the values found in `shared_info->wc_sec` and
> +`shared_info->wc_nsec`.
> +
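
The wallclock addition could be illustrated as well, since the two fields
have different units. A minimal sketch, assuming `wc_sec` and `wc_nsec` are
32-bit values as read from `shared_info` and that the uptime in nanoseconds
has already been computed from `vcpu_time_info`; the struct and function
names are made up.

```c
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

/* Hypothetical result type: seconds plus nanoseconds. */
struct wallclock {
    uint64_t sec;
    uint32_t nsec;
};

/* Add the guest uptime to the boot-time wallclock and normalise. */
static struct wallclock wallclock_now(uint32_t wc_sec, uint32_t wc_nsec,
                                      uint64_t uptime_ns)
{
    uint64_t total = (uint64_t)wc_sec * NSEC_PER_SEC + wc_nsec + uptime_ns;
    struct wallclock now;

    now.sec = total / NSEC_PER_SEC;
    now.nsec = (uint32_t)(total % NSEC_PER_SEC);
    return now;
}
```
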
> +## SMP discover and bring up ##
> +
> +The process of bringing up secondary CPUs is obviously different from native,
> +since PVH doesn't have a local APIC. The first thing to do is to figure out
> +how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall,
> +using for example this simple loop:
> +
> +    for (i = 0; i < MAXCPU; i++) {
> +        ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
> +        if (ret >= 0)
> +            /* vCPU#i is present */
> +    }
> +
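
As written, the `if` in the loop has no statement after it, so it would not
compile if someone copies it. A compilable variant could look like the sketch
below. The probe is passed in as a function pointer so the example stands on
its own; in a real kernel it would wrap
`HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL)`. All names here are hypothetical,
and `fake_is_up` only simulates a guest with four vCPUs.

```c
/* Hypothetical stand-in for the VCPUOP_is_up hypercall wrapper. */
typedef int (*vcpu_is_up_fn)(unsigned int vcpu_id);

/* Count the vCPUs for which the probe returns a non-negative value. */
static unsigned int count_present_vcpus(vcpu_is_up_fn probe,
                                        unsigned int max_cpus)
{
    unsigned int present = 0;

    for (unsigned int i = 0; i < max_cpus; i++) {
        if (probe(i) >= 0)
            present++;  /* vCPU#i is present */
    }
    return present;
}

/* Simulated probe: vCPUs 0-3 exist, everything else returns an error. */
static int fake_is_up(unsigned int vcpu_id)
{
    return vcpu_id < 4 ? 0 : -2;
}
```
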
> +Note than when running as Dom0, the ACPI tables might report a different number
> +of available CPUs. This is because the value on the ACPI tables is the
> +number of physical CPUs the host has, and it might bear no resemblance with the
> +number of vCPUs Dom0 actually has so it should be ignored.
> +
> +In order to bring up the secondary vCPUs they must be configured first. This is
> +achieved using the `VCPUOP_initialise` hypercall. A valid context has to be
> +passed to the vCPU in order to boot. The relevant fields for PVH guests are
> +the following:
> +
> +  * `flags`: contains `VGCF_*` flags (see `arch-x86/xen.h` public header).
> +  * `user_regs`: struct that contains the register values that will be set on
> +    the vCPU before booting. All GPRs are available to be set, however, the
> +    most relevant ones are `rip` and `rsp` in order to set the start address
> +    and the stack. Please note, all selectors must be null.
> +  * `ctrlreg[3]`: contains the address of the page tables that will be used by
> +    the vCPU. Other control registers should be set to zero, or else the
> +    hypercall will fail with -EINVAL.
> +
> +After the vCPU is initialized with the proper values, it can be started by
> +using the `VCPUOP_up` hypercall. The values of the other control registers of
> +the vCPU will be the same as the ones described in the `control registers`
> +section.
> +
> +Examples about how to bring up secondary CPUs can be found on the FreeBSD
> +code base in `sys/x86/xen/pv.c` and on Linux `arch/x86/xen/smp.c`.
> +
> +## Control operations (reboot/shutdown) ##
> +
> +Reboot and shutdown operations on PVH guests are performed using hypercalls.
> +In order to issue a reboot, a guest must use the `SHUTDOWN_reboot` hypercall.
> +In order to perform a power off from a guest DomU, the `SHUTDOWN_poweroff`
> +hypercall should be used.
> +
> +The way to perform a full system power off from Dom0 is different than what's
> +done in a DomU guest. In order to perform a power off from Dom0 the native
> +ACPI path should be followed, but the guest should not write the `SLP_EN`
> +bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall
> +should be used, filling the following data in the `xen_platform_op` struct:
> +
> +    cmd = XENPF_enter_acpi_sleep
> +    interface_version = XENPF_INTERFACE_VERSION
> +    u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
> +    u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
> +
> +This will allow Xen to do it's clean up and to power off the system. If the
> +host is using hardware reduced ACPI, the following field should also be set:
> +
> +    u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
> +
> +## CPUID ##
> +
> +*TDOD*: describe which cpuid flags a guest should ignore and also which flags
> +describe features can be used. It would also be good to describe the set of
> +cpuid flags that will always be present when running as PVH.

Perhaps start with:
The cpuid instruction that should be used is the normal 'cpuid', not
the emulated 'cpuid' that PV guests usually require.

> +
> +## Final notes ##
> +
> +All the other hardware functionality not described in this document should be
> +assumed to be performed in the same way as native.
> +
> +[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals

And with those changes:

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

> --
> 1.8.5.2 (Apple Git-48)
>