* RFC: very initial PVH design document
From: Roger Pau Monné @ 2014-08-22 14:55 UTC (permalink / raw)
To: xen-devel; +Cc: David Vrabel, Jan Beulich
Hello,
I've started writing a document in order to describe the interface
exposed by Xen to PVH guests, and how it should be used (by guest
OSes). The document is far from complete (see the amount of TODOs
scattered around), but given the lack of documentation regarding PVH I
think it's a good starting point. The aim of this is that it should be
committed to the Xen repository once it's ready. Given that this is
still a *very* early version I'm not even posting it as a patch.
Please comment, and try to fill the holes if possible ;).
Roger.
---
# PVH Specification #
## Rationale ##
PVH is a new kind of guest that has been introduced in Xen 4.4 as a DomU, and
in Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
virtualization extensions present in modern x86 CPUs in order to
improve performance.
PVH is considered a mix between PV and HVM, and can be seen as a PV guest
that runs inside of an HVM container, or as a PVHVM guest without any emulated
devices. The design goal of PVH is to provide the best performance possible and
to reduce the amount of modifications needed for a guest OS to run in this mode
(compared to pure PV).
This document tries to describe the interfaces used by PVH guests, focusing
on how an OS should make use of them in order to support PVH.
## Early boot ##
PVH guests use the PV boot mechanism, which means that the kernel is loaded and
directly launched by Xen (by jumping into the entry point). In order to do this,
Xen ELF notes need to be added to the guest kernel so that they convey the
information needed by Xen. Here is an example of the ELF notes added to the
FreeBSD amd64 kernel in order to boot as PVH:
    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
    ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
It is important to highlight the following notes:
* XEN_ELFNOTE_ENTRY: contains the memory address of the kernel entry point.
* XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the hypercall
page inside of the guest kernel (this memory region will be filled by Xen
prior to booting).
* XEN_ELFNOTE_FEATURES: contains the list of features supported by the kernel.
In this case the kernel is only able to boot as a PVH guest, but those
options can be mixed with the ones used by pure PV guests in order to
have a kernel that supports both PV and PVH (like Linux). The list of
options available can be found in the `features.h` public header.
Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
paging enabled (either long or protected mode depending on the kernel bitness)
and some basic page tables set up.
Also, the `rsi` (`esi` on 32-bit) register will contain the virtual memory
address where Xen has placed the start_info structure, and the `rsp` (`esp` on
32-bit) register will point to a stack that can be used by the guest kernel.
The start_info structure contains all the information the guest needs in order
to initialize. More details about its contents can be found in the `xen.h`
public header.
### Initial amd64 control registers values ###
Initial values for the control registers are set up by Xen before booting the
guest kernel. The guest kernel can expect to find the following features
enabled by Xen.
On `CR0` the following bits are set by Xen:
* PE (bit 0): protected mode enable.
* ET (bit 4): 80387 external math coprocessor.
* PG (bit 31): paging enabled.
On `CR4` the following bits are set by Xen:
* PAE (bit 5): PAE enabled.
And finally on `EFER` the following features are enabled:
* LME (bit 8): Long mode enable.
* LMA (bit 10): Long mode active.
*TODO*: do we expect these flags to change? Are there other flags that might be
enabled depending on the hardware we are running on?
## Memory ##
Since PVH guests rely on virtualization extensions provided by the CPU, they
have access to a hardware virtualized MMU, which means page-table related
operations should use the same instructions used on native.
There are however some differences with native. The usage of native MTRR
operations is forbidden, and `XENPF_*_memtype` hypercalls should be used
instead. This can be avoided by simply not using MTRR and setting all the
memory attributes using PAT, which doesn't require the usage of any hypercalls.
Since PVH doesn't use a BIOS in order to boot, the physical memory map has
to be retrieved using the `XENMEM_memory_map` hypercall, which will return
an e820 map. This memory map might contain holes that describe MMIO regions,
which will already be set up by Xen.
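As an illustration, a minimal sketch of retrieving the map (assuming the usual
`HYPERVISOR_memory_op` guest wrapper and `set_xen_guest_handle` helper, a
guest-defined e820 entry layout, and no error handling) could look like:

    struct e820entry {
        uint64_t addr;
        uint64_t size;
        uint32_t type;
    } __attribute__((packed));

    static struct e820entry e820_map[128];

    static int get_e820_map(void)
    {
        struct xen_memory_map memmap;

        memmap.nr_entries = 128;    /* IN: size of the buffer in entries */
        set_xen_guest_handle(memmap.buffer, e820_map);

        /* On success, nr_entries holds the number of entries filled in. */
        return HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
    }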
*TODO*: we need to figure out what to do with MMIO regions, right now Xen
sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We
need to decide what to do with MMIO regions above 4GB on Dom0, and what to do
for PVH DomUs with pci-passthrough.
In the case of a guest started with memory != maxmem, the e820 memory map
returned by Xen will contain the memory up to maxmem. The guest has to be very
careful to only use the lower memory pages up to the value contained in
`start_info->nr_pages` because any memory page above that value will not be
populated.
## Physical devices ##
When running as Dom0 the guest OS has the ability to interact with the physical
devices present in the system. A note should be made that PVH guests require
a working IOMMU in order to interact with physical devices.
The first step in order to manipulate the devices is to make Xen aware of
them. Since all the hardware description on x86 comes from ACPI, Dom0 is
responsible for parsing the ACPI tables and notifying Xen about the devices it
finds. This is done with the `PHYSDEVOP_pci_device_add` hypercall.
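As an illustration, a minimal sketch of this call (assuming the usual
`HYPERVISOR_physdev_op` guest wrapper; the struct layout comes from the
`physdev.h` public header and error handling is omitted) could look like:

    /* Report a PCI device discovered while parsing the ACPI tables. */
    static int report_pci_device(uint16_t seg, uint8_t bus, uint8_t devfn)
    {
        struct physdev_pci_device_add add;

        memset(&add, 0, sizeof(add));
        add.seg = seg;      /* PCI segment (domain) */
        add.bus = bus;
        add.devfn = devfn;  /* device/function in the usual devfn encoding */

        return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);
    }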
*TODO*: explain the way to register the different kinds of PCI devices, like
devices with virtual functions.
## Interrupts ##
All interrupts on PVH guests are routed over event channels; see
[Event Channel Internals][event_channels] for more detailed information about
event channels. In order to inject interrupts into the guest an IDT vector is
used. This is the same mechanism used on PVHVM guests, and allows having
per-CPU interrupts that can be used to deliver timers or IPIs.
In order to register the callback IDT vector the `HVMOP_set_param` hypercall
is used with the following values:
    domid = DOMID_SELF
    index = HVM_PARAM_CALLBACK_IRQ
    value = (0x2 << 56) | vector_value
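As an illustration, a minimal sketch of this registration (assuming the usual
`HVM_PARAM_CALLBACK_IRQ` definitions and a `HYPERVISOR_hvm_op` guest wrapper;
the `xen_hvm_param` struct comes from the `hvm/hvm_op.h` public header) could
look like:

    /* Register an IDT vector for event channel upcalls; the 0x2 in the
     * top byte selects the "vector" delivery method. */
    static int register_callback_vector(uint8_t vector)
    {
        struct xen_hvm_param xhp;

        xhp.domid = DOMID_SELF;
        xhp.index = HVM_PARAM_CALLBACK_IRQ;
        xhp.value = ((uint64_t)0x2 << 56) | vector;

        return HYPERVISOR_hvm_op(HVMOP_set_param, &xhp);
    }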
In order to know which event channel has fired, we need to look at the
information provided in the `shared_info` structure. The `evtchn_pending`
array is used as a bitmap in order to find out which event channel has
fired. Event channels can also be masked by setting the bit corresponding to
their port in the `shared_info->evtchn_mask` bitmap.
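As an illustration, a simplified sketch of scanning the bitmap (ignoring the
per-vCPU pending selector and the atomic operations a real implementation
needs; `shared_info` is assumed to be already mapped by the guest and
`handle_event()` is a placeholder for the OS handler) could look like:

    static void scan_pending_events(struct shared_info *s)
    {
        unsigned int w, port;
        xen_ulong_t pending;
        const unsigned int bits = sizeof(xen_ulong_t) * 8;

        for (w = 0; w < sizeof(s->evtchn_pending) / sizeof(xen_ulong_t); w++) {
            pending = s->evtchn_pending[w] & ~s->evtchn_mask[w];
            while (pending) {
                port = w * bits + __builtin_ctzl(pending);
                pending &= pending - 1;     /* clear lowest set bit */
                /* Atomically clear the pending bit in real code. */
                handle_event(port);         /* placeholder handler */
            }
        }
    }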
*TODO*: provide a reference about how to interact with FIFO event channels?
### Interrupts from physical devices ###
When running as Dom0 (or when using pci-passthrough) interrupts from physical
devices are routed over event channels. There are 3 different kinds of
physical interrupts that can be routed over event channels by Xen: IO APIC,
MSI and MSI-X interrupts.
Since physical interrupts usually need an EOI (End Of Interrupt), Xen allows the
registration of a memory region that will contain whether a physical interrupt
needs an EOI from the guest or not. This is done with the
`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall, which takes a parameter containing the
guest frame number of the memory page that will act as a bitmap. Then, in order
to find out whether an IRQ needs an EOI, the OS can perform a simple bit test on
the memory page using the PIRQ value.
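As an illustration, a minimal sketch (assuming the usual `HYPERVISOR_physdev_op`
wrapper, a `PAGE_SIZE` definition, and a placeholder `virt_to_gfn()` helper
standing in for whatever the OS uses to obtain the guest frame number of the
page) could look like:

    static unsigned long pirq_eoi_map[PAGE_SIZE / sizeof(unsigned long)]
        __attribute__((aligned(PAGE_SIZE)));

    static int register_eoi_bitmap(void)
    {
        struct physdev_pirq_eoi_gmfn eoi;

        eoi.gmfn = virt_to_gfn(pirq_eoi_map);   /* placeholder helper */
        return HYPERVISOR_physdev_op(PHYSDEVOP_pirq_eoi_gmfn_v2, &eoi);
    }

    static int pirq_needs_eoi(unsigned int pirq)
    {
        const unsigned int bits = sizeof(unsigned long) * 8;

        return !!(pirq_eoi_map[pirq / bits] & (1UL << (pirq % bits)));
    }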
### IO APIC interrupt routing ###
IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
hypercall; as an example, IRQ#9 is used here:
    domid = DOMID_SELF
    type = MAP_PIRQ_TYPE_GSI
    index = 9
    pirq = 9
After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to allocate a vector:
    irq = 9
    vector = 0
*TODO*: I'm not sure why we need those two hypercalls, and their usage is not
documented anywhere. Need to clarify what the parameters mean and what effect
they have.
IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also
be configured using the `PHYSDEVOP_setup_gsi` hypercall:
    gsi = 9 # This is the IRQ value.
    triggering = 0
    polarity = 0
In this example the IRQ would be configured to use edge triggering and high
polarity.
Finally the PIRQ can be bound to an event channel using the
`EVTCHNOP_bind_pirq` hypercall, which will return the event channel port the
PIRQ has been assigned to. After this the event channel will be ready for
delivery.
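As an illustration, a sketch of the whole sequence (leaving out the
`PHYSDEVOP_alloc_irq_vector` step questioned in the TODO above, and assuming
the usual `HYPERVISOR_physdev_op`/`HYPERVISOR_event_channel_op` wrappers with
error handling trimmed) could look like:

    static int route_gsi_to_evtchn(int gsi, uint8_t trig, uint8_t pol,
                                   evtchn_port_t *port)
    {
        struct physdev_map_pirq map = {
            .domid = DOMID_SELF,
            .type  = MAP_PIRQ_TYPE_GSI,
            .index = gsi,
            .pirq  = gsi,
        };
        struct physdev_setup_gsi setup = {
            .gsi = gsi, .triggering = trig, .polarity = pol,
        };
        struct evtchn_bind_pirq bind = {
            .pirq = gsi, .flags = BIND_PIRQ__WILL_SHARE,
        };
        int rc;

        rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);
        if (rc == 0)
            rc = HYPERVISOR_physdev_op(PHYSDEVOP_setup_gsi, &setup);
        if (rc == 0)
            rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind);
        if (rc == 0)
            *port = bind.port;      /* event channel assigned by Xen */
        return rc;
    }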
*NOTE*: when running as Dom0, the guest has to parse the interrupt overrides
found in the ACPI tables and notify Xen about them.
### MSI ###
In order to configure MSI interrupts for a device, Xen must first be made aware
of its presence by using `PHYSDEVOP_pci_device_add` as described above.
Then the `PHYSDEVOP_map_pirq` hypercall is used:
    domid = DOMID_SELF
    type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
    index = -1
    pirq = -1
    bus = pci_device_bus
    devfn = pci_device_function
    entry_nr = number of MSI interrupts
The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt
source is being configured. On devices that support MSI interrupt groups
`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the
number of MSI interrupts in the `entry_nr` field.
The values in the `bus` and `devfn` fields should be the same as the ones used
when registering the device with `PHYSDEVOP_pci_device_add`.
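As an illustration, a sketch for mapping a single MSI interrupt (assuming the
usual `HYPERVISOR_physdev_op` wrapper; the returned PIRQ can then be bound with
`EVTCHNOP_bind_pirq` as for GSIs) could look like:

    static int map_msi_pirq(uint8_t bus, uint8_t devfn, int *pirq)
    {
        struct physdev_map_pirq map = {
            .domid    = DOMID_SELF,
            .type     = MAP_PIRQ_TYPE_MSI_SEG,
            .index    = -1,
            .pirq     = -1,         /* let Xen pick the PIRQ */
            .bus      = bus,
            .devfn    = devfn,
            .entry_nr = 1,          /* single MSI interrupt */
        };
        int rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);

        if (rc == 0)
            *pirq = map.pirq;
        return rc;
    }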
### MSI-X ###
*TODO*: how to register/use them.
## Event timers and timecounters ##
Since some hardware is not available on PVH (like the local APIC), Xen provides
the OS with suitable replacements in order to get the same functionality. One
of them is the timer interface. Using a set of hypercalls, a guest OS can set
event timers that will deliver an event channel interrupt to the guest.
In order to use the timer provided by Xen the guest OS first needs to register
a VIRQ event channel to be used by the timer to deliver the interrupts. The
event channel is registered using the `EVTCHNOP_bind_virq` hypercall, which
takes only two parameters:
    virq = VIRQ_TIMER
    vcpu = vcpu_id
The port that's going to be used by Xen in order to deliver the interrupt is
returned in the `port` field. Once the interrupt is set, the timer can be
programmed using the `VCPUOP_set_singleshot_timer` hypercall.
    flags = VCPU_SSHOTTMR_future
    timeout_abs_ns = absolute value when the timer should fire
It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must
be executed from the same vCPU where the timer should fire, or else Xen will
refuse to set it. This is a single-shot timer, so it must be set by the OS
every time it fires if a periodic timer is desired.
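As an illustration, a sketch of binding the VIRQ and arming a one-shot timer on
the current vCPU (assuming the usual `HYPERVISOR_event_channel_op` and
`HYPERVISOR_vcpu_op` wrappers; `now_ns` is the current system time calculated
as described below) could look like:

    static int arm_timer(uint32_t vcpu_id, uint64_t now_ns, uint64_t delta_ns,
                         evtchn_port_t *port)
    {
        struct evtchn_bind_virq bind = {
            .virq = VIRQ_TIMER,
            .vcpu = vcpu_id,
        };
        struct vcpu_set_singleshot_timer single = {
            .timeout_abs_ns = now_ns + delta_ns,
            .flags = VCPU_SSHOTTMR_future,
        };
        int rc;

        rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, &bind);
        if (rc == 0) {
            *port = bind.port;
            /* Must be issued from the vCPU the timer is meant for. */
            rc = HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, vcpu_id,
                                    &single);
        }
        return rc;
    }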
Xen also shares a memory region with the guest OS that contains time related
values that are updated periodically. These values can be used to implement a
timecounter or to obtain the current time. This information is placed inside of
`shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the guest has
been launched) can be calculated using the following expression and the values
stored in the `vcpu_time_info` struct:
    system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
calculated using the above value, plus the timeout the system wants to set.
If the OS also wants to obtain the current wallclock time, the value calculated
above has to be added to the values found in `shared_info->wc_sec` and
`shared_info->wc_nsec`.
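As an illustration, the uptime calculation above could be implemented as
sketched below (the `version` field is used as a sequence counter to get a
consistent snapshot, and a 128-bit multiply avoids overflowing the 64x32
product; compiler-specific bits like `__uint128_t` are assumptions):

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static uint64_t get_system_time_ns(volatile struct vcpu_time_info *t)
    {
        uint32_t version, mul;
        int8_t shift;
        uint64_t system_time, delta;

        do {
            version = t->version;               /* odd => update in progress */
            __asm__ __volatile__("" ::: "memory");
            system_time = t->system_time;
            delta = rdtsc() - t->tsc_timestamp;
            mul = t->tsc_to_system_mul;
            shift = t->tsc_shift;
            __asm__ __volatile__("" ::: "memory");
        } while ((version & 1) || version != t->version);

        delta = shift >= 0 ? delta << shift : delta >> -shift;
        return system_time + (uint64_t)(((__uint128_t)delta * mul) >> 32);
    }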
## SMP discovery and bring up ##
The process of bringing up secondary CPUs is obviously different from native,
since PVH doesn't have a local APIC. The first thing to do is to figure out
how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall,
using for example this simple loop:
    for (i = 0; i < MAXCPU; i++) {
        ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
        if (ret >= 0) {
            /* vCPU#i is present */
        }
    }
Note that when running as Dom0, the ACPI tables might report a different number
of available CPUs. This is because the value in the ACPI tables is the
number of physical CPUs the host has, and it might bear no resemblance to the
number of vCPUs Dom0 actually has, so it should be ignored.
In order to bring up the secondary vCPUs they must be configured first. This is
achieved using the `VCPUOP_initialise` hypercall. A valid context has to be
passed to the vCPU in order to boot. The relevant fields for PVH guests are
the following:
* `flags`: contains VGCF_* flags (see `arch-x86/xen.h` public header).
* `user_regs`: struct that contains the register values that will be set on
the vCPU before booting. The most relevant ones are `rip` and `rsp` in order
to set the start address and the stack.
* `ctrlreg[3]`: contains the address of the page tables that will be used by
the vCPU.
After the vCPU is initialized with the proper values, it can be started by
using the `VCPUOP_up` hypercall. The values of the other control registers of
the vCPU will be the same as the ones described in the `control registers`
section.
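As an illustration, a sketch of bringing up one secondary vCPU (assuming the
usual `HYPERVISOR_vcpu_op` wrapper; the entry point, stack and page-table
address are values supplied by the guest OS, and error handling is trimmed)
could look like:

    static int bring_up_vcpu(unsigned int cpu, unsigned long entry,
                             unsigned long stack, unsigned long cr3)
    {
        struct vcpu_guest_context ctxt;
        int rc;

        memset(&ctxt, 0, sizeof(ctxt));
        ctxt.flags = VGCF_in_kernel;
        ctxt.user_regs.rip = entry;     /* secondary vCPU entry point */
        ctxt.user_regs.rsp = stack;     /* initial stack */
        ctxt.ctrlreg[3] = cr3;          /* page tables for this vCPU */

        rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctxt);
        if (rc == 0)
            rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);
        return rc;
    }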
## Control operations (reboot/shutdown) ##
Reboot and shutdown operations on PVH guests are performed using hypercalls.
In order to issue a reboot, a guest must use the `SCHEDOP_shutdown` hypercall
with the `SHUTDOWN_reboot` reason. In order to perform a power off from a guest
DomU, the `SHUTDOWN_poweroff` reason should be used.
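As an illustration, a minimal sketch (assuming the usual `HYPERVISOR_sched_op`
wrapper; the `sched_shutdown` struct comes from the `sched.h` public header)
could look like:

    static void guest_shutdown(unsigned int reason)
    {
        struct sched_shutdown shutdown = {
            .reason = reason,   /* SHUTDOWN_reboot or SHUTDOWN_poweroff */
        };

        HYPERVISOR_sched_op(SCHEDOP_shutdown, &shutdown);
    }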
The way to perform a full system power off from Dom0 is different from what's
done in a DomU guest. In order to perform a power off from Dom0 the native
ACPI path should be followed, but the guest should not write the SLP_EN
bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall
should be used, filling the following data in the `xen_platform_op` struct:
    cmd = XENPF_enter_acpi_sleep
    interface_version = XENPF_INTERFACE_VERSION
    u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
    u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
This will allow Xen to do its cleanup and power off the system. If the
host is using hardware reduced ACPI, the following field should also be set:
    u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
## CPUID ##
*TODO*: describe which cpuid flags a guest should ignore and also which flags
describe features that can be used. It would also be good to describe the set of
cpuid flags that will always be present when running as PVH.
## Final notes ##
All the other hardware functionality not described in this document should be
assumed to be performed in the same way as native.
[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
* Re: RFC: very initial PVH design document
From: Jan Beulich @ 2014-08-22 15:13 UTC (permalink / raw)
To: Roger Pau Monné; +Cc: David Vrabel, xen-devel
>>> On 22.08.14 at 16:55, <roger.pau@citrix.com> wrote:
> It is important to highlight the following notes:
>
> * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel entry point.
... the virtual memory address ...
> * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the hypercall
> page inside of the guest kernel (this memory region will be filled by Xen
> prior to booting).
Same here.
> * XEN_ELFNOTE_FEATURES: contains the list of features supported by the kernel.
> In this case the kernel is only able to boot as a PVH guest, but those
"In this case" is not clear what it relates to. You should probably say
something like "In the example above".
> Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
> paging enabled (either long or protected mode depending on the kernel bitness)
If I understand right how 32-bit PVH is intended to function, "protected
mode" is insufficient to state here, it ought to be "paged protect mode"
or "protected mode with paging turned on".
> ## Physical devices ##
>
> When running as Dom0 the guest OS has the ability to interact with the
> physical
> devices present in the system. A note should be made that PVH guests require
> a working IOMMU in order to interact with physical devices.
>
> The first step in order to manipulate the devices is to make Xen aware of
> them. Due to the fact that all the hardware description on x86 comes from
> ACPI, Dom0 is responsible of parsing the ACPI tables and notify Xen about
> the
> devices it finds. This is done with the `PHYSDEVOP_pci_device_add`
> hypercall.
>
> *TODO*: explain the way to register the different kinds of PCI devices, like
> devices with virtual functions.
I think both the second paragraph and the TODO don't belong here,
as there's no difference to PV, and this shouldn't be subject of this
document.
> ## Interrupts ##
>
> All interrupts on PVH guests are routed over event channels, see
> [Event Channel Internals][event_channels] for more detailed information
> about
> event channels. In order to inject interrupts into the guest an IDT vector
> is
> used. This is the same mechanism used on PVHVM guests, and allows having
> per-cpu interrupts that can be used to deliver timers or IPIs.
>
> In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> is used with the following values:
>
> domid = DOMID_SELF
> index = HVM_PARAM_CALLBACK_IRQ
> value = (0x2 << 56) | vector_value
>
> In order to know which event channel has fired, we need to look into the
> information provided in the `shared_info` structure. The `evtchn_pending`
> array is used as a bitmap in order to find out which event channel has
> fired. Event channels can also be masked by setting it's port value in the
> `shared_info->evtchn_mask` bitmap.
>
> *TODO*: provide a reference about how to interact with FIFO event channels?
Or better don't be event-channel-ABI specific in the paragraph right
before the TODO, as this again isn't PVH-specific? (There are more
such items further down which I'll not further comment on; actually
it looks like very little of the rest of the document is really on PVH.)
Jan
* Re: RFC: very initial PVH design document
From: Roger Pau Monné @ 2014-08-22 15:49 UTC (permalink / raw)
To: Jan Beulich; +Cc: David Vrabel, xen-devel
On 22/08/14 17:13, Jan Beulich wrote:
>>>> On 22.08.14 at 16:55, <roger.pau@citrix.com> wrote:
>> It is important to highlight the following notes:
>>
>> * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel entry point.
>
> ... the virtual memory address ...
>
>> * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the hypercall
>> page inside of the guest kernel (this memory region will be filled by Xen
>> prior to booting).
>
> Same here.
>
>> * XEN_ELFNOTE_FEATURES: contains the list of features supported by the kernel.
>> In this case the kernel is only able to boot as a PVH guest, but those
>
> "In this case" is not clear what it relates to. You should probably say
> something like "In the example above".
>
>> Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
>> paging enabled (either long or protected mode depending on the kernel bitness)
>
> If I understand right how 32-bit PVH is intended to function, "protected
> mode" is insufficient to state here, it ought to be "paged protect mode"
> or "protected mode with paging turned on".
Thanks, will fix the comments above.
>> ## Physical devices ##
>>
>> When running as Dom0 the guest OS has the ability to interact with the
>> physical
>> devices present in the system. A note should be made that PVH guests require
>> a working IOMMU in order to interact with physical devices.
>>
>> The first step in order to manipulate the devices is to make Xen aware of
>> them. Due to the fact that all the hardware description on x86 comes from
>> ACPI, Dom0 is responsible of parsing the ACPI tables and notify Xen about
>> the
>> devices it finds. This is done with the `PHYSDEVOP_pci_device_add`
>> hypercall.
>>
>> *TODO*: explain the way to register the different kinds of PCI devices, like
>> devices with virtual functions.
>
> I think both the second paragraph and the TODO don't belong here,
> as there's no difference to PV, and this shouldn't be subject of this
> document.
IMHO, I think of this document as a reference that could be used by
people when trying to port OSes to PVH, and I wanted it to be complete.
Also, I don't see any of these hypercalls being documented, either in
the header files or in a document in the repository, which makes their
usage completely obscure.
>
>> ## Interrupts ##
>>
>> All interrupts on PVH guests are routed over event channels, see
>> [Event Channel Internals][event_channels] for more detailed information
>> about
>> event channels. In order to inject interrupts into the guest an IDT vector
>> is
>> used. This is the same mechanism used on PVHVM guests, and allows having
>> per-cpu interrupts that can be used to deliver timers or IPIs.
>>
>> In order to register the callback IDT vector the `HVMOP_set_param` hypercall
>> is used with the following values:
>>
>> domid = DOMID_SELF
>> index = HVM_PARAM_CALLBACK_IRQ
>> value = (0x2 << 56) | vector_value
>>
>> In order to know which event channel has fired, we need to look into the
>> information provided in the `shared_info` structure. The `evtchn_pending`
>> array is used as a bitmap in order to find out which event channel has
>> fired. Event channels can also be masked by setting it's port value in the
>> `shared_info->evtchn_mask` bitmap.
>>
>> *TODO*: provide a reference about how to interact with FIFO event channels?
>
> Or better don't be event-channel-ABI specific in the paragraph right
> before the TODO, as this again isn't PVH-specific?
Ack, will probably remove it in next version unless someone has a
different opinion.
> (There are more
> such items further down which I'll not further comment on; actually
> it looks like very little of the rest of the document is really on PVH.)
See the reply above regarding the documentation of shared interfaces
used by both PV and PVH.
Roger.
* Re: RFC: very initial PVH design document
From: Mukesh Rathor @ 2014-08-27 0:33 UTC (permalink / raw)
To: Roger Pau Monné; +Cc: David Vrabel, Jan Beulich, xen-devel
On Fri, 22 Aug 2014 16:55:08 +0200
Roger Pau Monné <roger.pau@citrix.com> wrote:
> Hello,
>
> I've started writing a document in order to describe the interface
> exposed by Xen to PVH guests, and how it should be used (by guest
> OSes). The document is far from complete (see the amount of TODOs
> scattered around), but given the lack of documentation regarding PVH
> I think it's a good starting point. The aim of this is that it should
> be committed to the Xen repository once it's ready. Given that this
> is still a *very* early version I'm not even posting it as a patch.
>
> Please comment, and try to fill the holes if possible ;).
>
> Roger.
>
> ---
> # PVH Specification #
>
> ## Rationale ##
>
> PVH is a new kind of guest that has been introduced on Xen 4.4 as a
> DomU, and on Xen 4.5 as a Dom0. The aim of PVH is to make use of the
> hardware virtualization extensions present in modern x86 CPUs in
> order to improve performance.
>
> PVH is considered a mix between PV and HVM, and can be seen as a PV
> guest that runs inside of an HVM container, or as a PVHVM guest
> without any emulated devices. The design goal of PVH is to provide
> the best performance possible and to reduce the amount of
> modifications needed for a guest OS to run in this mode (compared to
> pure PV).
>
> This document tries to describe the interfaces used by PVH guests,
> focusing on how an OS should make use of them in order to support PVH.
>
> ## Early boot ##
>
> PVH guests use the PV boot mechanism, that means that the kernel is
> loaded and directly launched by Xen (by jumping into the entry
> point). In order to do this Xen ELF Notes need to be added to the
> guest kernel, so that they contain the information needed by Xen.
> Here is an example of the ELF Notes added to the FreeBSD amd64 kernel
> in order to boot as PVH:
>
> ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
> ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz,
> __XSTRING(__FreeBSD_version)) ELFNOTE(Xen,
> XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0") ELFNOTE(Xen,
> XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE) ELFNOTE(Xen,
> XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE) ELFNOTE(Xen,
> XEN_ELFNOTE_ENTRY, .quad, xen_start) ELFNOTE(Xen,
> XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page) ELFNOTE(Xen,
> XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
> ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz,
> "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
> ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes") ELFNOTE(Xen,
> XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V) ELFNOTE(Xen,
> XEN_ELFNOTE_LOADER, .asciz, "generic") ELFNOTE(Xen,
> XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0) ELFNOTE(Xen,
> XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
It will be helpful to add:
On the linux side, the above can be found in arch/x86/xen/xen-head.S.
> It is important to highlight the following notes:
>
> * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
> entry point.
> * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
> hypercall page inside of the guest kernel (this memory region will be
> filled by Xen prior to booting).
> * XEN_ELFNOTE_FEATURES: contains the list of features supported by
> the kernel. In this case the kernel is only able to boot as a PVH
> guest, but those options can be mixed with the ones used by pure PV
> guests in order to have a kernel that supports both PV and PVH (like
> Linux). The list of options available can be found in the
> `features.h` public header.
Hmm... for linux I'd word that as follows:
A PVH guest is started by specifying pvh=1 in the config file. However,
for the guest to be launched as a PVH guest, it must minimally advertise
certain features which are: auto_translated_physmap, hvm_callback_vector,
writable_descriptor_tables, and supervisor_mode_kernel. This is done
via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES. See
linux:arch/x86/xen/xen-head.S for more info. A list of all xen features
can be found in xen:include/public/features.h. However, at present
the absence of these features does not make it automatically boot in PV
mode, but that may change in future. The ultimate goal is, if a guest
supports these features, then boot it automatically in PVH mode, otherwise
boot it in PV mode.
[You can leave out the last part if you want, or just take whatever from
above].
> Xen will jump into the kernel entry point defined in
> `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
> mode depending on the kernel bitness) and some basic page tables
> setup.
If I may rephrase:
Guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
with paging, PAE, and long mode enabled. At present only 64bit mode
is supported, however, in future compat mode support will be added.
An important distinction for a 64bit PVH is that it is launched at
privilege level 0 as opposed to a 64bit PV guest which is launched at
privilege level 3.
> Also, the `rsi` (`esi` on 32bits) register is going to contain the
> virtual memory address were Xen has placed the start_info structure.
> The `rsp` (`esp` on 32bits) will contain a stack, that can be used by
> the guest kernel. The start_info structure contains all the info the
> guest needs in order to initialize. More information about the
> contents can be found on the `xen.h` public header.
Since the above is all true for PV guest, you could begin it with:
Just like a PV guest, the rsi ....
>
> ### Initial amd64 control registers values ###
>
> Initial values for the control registers are set up by Xen before
> booting the guest kernel. The guest kernel can expect to find the
> following features enabled by Xen.
>
> On `CR0` the following bits are set by Xen:
>
> * PE (bit 0): protected mode enable.
> * ET (bit 4): 80387 external math coprocessor.
> * PG (bit 31): paging enabled.
>
> On `CR4` the following bits are set by Xen:
>
> * PAE (bit 5): PAE enabled.
>
> And finally on `EFER` the following features are enabled:
>
> * LME (bit 8): Long mode enable.
> * LMA (bit 10): Long mode active.
>
> *TODO*: do we expect this flags to change? Are there other flags that
> might be enabled depending on the hardware we are running on?
Can't think of anything...
> ## Memory ##
>
> Since PVH guests rely on virtualization extensions provided by the
> CPU, they have access to a hardware virtualized MMU, which means
> page-table related operations should use the same instructions used
> on native.
Do you wanna expand a bit since this is another big distinction from
a PV guest?
which means that page tables are native and guest managed.
This also implies that mmu_update hypercall is not available to a PVH
guest, unlike a PV guest. The guest is configured at start so it can
access all pages up to start_info->nr_pages.
> There are however some differences with native. The usage of native
> MTRR operations is forbidden, and `XENPF_*_memtype` hypercalls should
> be used instead. This can be avoided by simply not using MTRR and
> setting all the memory attributes using PAT, which doesn't require
> the usage of any hypercalls.
>
> Since PVH doesn't use a BIOS in order to boot, the physical memory
> map has to be retrieved using the `XENMEM_memory_map` hypercall,
> which will return an e820 map. This memory map might contain holes
> that describe MMIO regions, that will be already setup by Xen.
>
> *TODO*: we need to figure out what to do with MMIO regions, right now
> Xen sets all the holes in the native e820 to MMIO regions for Dom0 up
> to 4GB. We need to decide what to do with MMIO regions above 4GB on
> Dom0, and what to do for PVH DomUs with pci-passthrough.
We map all non-ram regions for dom0 1:1 till the highest non-ram e820
entry. If there is anything that is beyond the last e820 entry,
it will remain unmapped.
Correct, passthru needs to be figured.
> In the case of a guest started with memory != maxmem, the e820 memory
> map returned by Xen will contain the memory up to maxmem. The guest
> has to be very careful to only use the lower memory pages up to the
> value contained in `start_info->nr_pages` because any memory page
> above that value will not be populated.
>
> ## Physical devices ##
>
> When running as Dom0 the guest OS has the ability to interact with
> the physical devices present in the system. A note should be made
> that PVH guests require a working IOMMU in order to interact with
> physical devices.
>
> The first step in order to manipulate the devices is to make Xen
> aware of them. Due to the fact that all the hardware description on
> x86 comes from ACPI, Dom0 is responsible of parsing the ACPI tables
> and notify Xen about the devices it finds. This is done with the
> `PHYSDEVOP_pci_device_add` hypercall.
>
> *TODO*: explain the way to register the different kinds of PCI
> devices, like devices with virtual functions.
>
> ## Interrupts ##
>
> All interrupts on PVH guests are routed over event channels, see
> [Event Channel Internals][event_channels] for more detailed
> information about event channels. In order to inject interrupts into
> the guest an IDT vector is used. This is the same mechanism used on
> PVHVM guests, and allows having per-cpu interrupts that can be used
> to deliver timers or IPIs.
>
> In order to register the callback IDT vector the `HVMOP_set_param`
> hypercall is used with the following values:
>
> domid = DOMID_SELF
> index = HVM_PARAM_CALLBACK_IRQ
> value = (0x2 << 56) | vector_value
>
> In order to know which event channel has fired, we need to look into
> the information provided in the `shared_info` structure. The
> `evtchn_pending` array is used as a bitmap in order to find out which
> event channel has fired. Event channels can also be masked by setting
> it's port value in the `shared_info->evtchn_mask` bitmap.
>
> *TODO*: provide a reference about how to interact with FIFO event
> channels?
>
> ### Interrupts from physical devices ###
>
> When running as Dom0 (or when using pci-passthrough) interrupts from
> physical devices are routed over event channels. There are 3
> different kind of physical interrupts that can be routed over event
> channels by Xen: IO APIC, MSI and MSI-X interrupts.
>
> Since physical interrupts usually need EOI (End Of Interrupt), Xen
> allows the registration of a memory region that will contain whether
> a physical interrupt needs EOI from the guest or not. This is done
> with the `PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a
> parameter containing the physical address of the memory page that
> will act as a bitmap. Then in order to find out if an IRQ needs EOI
> or not, the OS can perform a simple bit test on the memory page using
> the PIRQ value.
>
> ### IO APIC interrupt routing ###
>
> IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
> hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
> hypercall, as an example IRQ#9 is used here:
>
> domid = DOMID_SELF
> type = MAP_PIRQ_TYPE_GSI
> index = 9
> pirq = 9
>
> After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to
> allocate a vector:
>
> irq = 9
> vector = 0
>
> *TODO*: I'm not sure why we need those two hypercalls, and it's usage
> is not documented anywhere. Need to clarify what the parameters mean
> and what effect they have.
>
> The IRQ#9 is now registered as PIRQ#9. The triggering and polarity
> can also be configured using the `PHYSDEVOP_setup_gsi` hypercall:
>
> gsi = 9 # This is the IRQ value.
> triggering = 0
> polarity = 0
>
> In this example the IRQ would be configured to use edge triggering
> and high polarity.
>
> Finally the PIRQ can be bound to an event channel using the
> `EVTCHNOP_bind_pirq`, that will return the event channel port the
> PIRQ has been assigned. After this the event channel will be ready
> for delivery.
>
> *NOTE*: when running as Dom0, the guest has to parse the interrupt
> overwrites found on the ACPI tables and notify Xen about them.
>
> ### MSI ###
>
> In order to configure MSI interrupts for a device, Xen must be made
> aware of it's presence first by using the `PHYSDEVOP_pci_device_add`
> as described above. Then the `PHYSDEVOP_map_pirq` hypercall is used:
>
> domid = DOMID_SELF
> type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
> index = -1
> pirq = -1
> bus = pci_device_bus
> devfn = pci_device_function
> entry_nr = number of MSI interrupts
>
> The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI
> interrupt source is being configured. On devices that support MSI
> interrupt groups `MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure
> them by also placing the number of MSI interrupts in the `entry_nr`
> field.
>
> The values in the `bus` and `devfn` field should be the same as the
> ones used when registering the device with `PHYSDEVOP_pci_device_add`.
>
> ### MSI-X ###
>
> *TODO*: how to register/use them.
>
> ## Event timers and timecounters ##
>
> Since some hardware is not available on PVH (like the local APIC),
> Xen provides the OS with suitable replacements in order to get the
> same functionality. One of them is the timer interface. Using a set
> of hypercalls, a guest OS can set event timers that will deliver and
> event channel interrupt to the guest.
>
> In order to use the timer provided by Xen the guest OS first needs to
> register a VIRQ event channel to be used by the timer to deliver the
> interrupts. The event channel is registered using the
> `EVTCHNOP_bind_virq` hypercall, that only takes two parameters:
>
> virq = VIRQ_TIMER
> vcpu = vcpu_id
>
> The port that's going to be used by Xen in order to deliver the
> interrupt is returned in the `port` field. Once the interrupt is set,
> the timer can be programmed using the `VCPUOP_set_singleshot_timer`
> hypercall.
>
> flags = VCPU_SSHOTTMR_future
> timeout_abs_ns = absolute value when the timer should fire
>
> It is important to notice that the `VCPUOP_set_singleshot_timer`
> hypercall must be executed from the same vCPU where the timer should
> fire, or else Xen will refuse to set it. This is a single-shot timer,
> so it must be set by the OS every time it fires if a periodic timer
> is desired.
>
> Xen also shares a memory region with the guest OS that contains time
> related values that are updated periodically. This values can be used
> to implement a timecounter or to obtain the current time. This
> information is placed inside of
> `shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the
> guest has been launched) can be calculated using the following
> expression and the values stored in the `vcpu_time_info` struct:
>
> system_time + ((((tsc - tsc_timestamp) << tsc_shift) *
> tsc_to_system_mul) >> 32)
>
> The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
> calculated using the above value, plus the timeout the system wants
> to set.
>
> If the OS also wants to obtain the current wallclock time, the value
> calculated above has to be added to the values found in
> `shared_info->wc_sec` and `shared_info->wc_nsec`.
All the above is great info, not PVH specific tho. May wanna mention
it fwiw.
> ## SMP discover and bring up ##
>
> The process of bringing up secondary CPUs is obviously different from
> native, since PVH doesn't have a local APIC. The first thing to do is
> to figure out how many vCPUs the guest has. This is done using the
> `VCPUOP_is_up` hypercall, using for example this simple loop:
>
> for (i = 0; i < MAXCPU; i++) {
> ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
> if (ret >= 0)
> /* vCPU#i is present */
> }
>
> Note than when running as Dom0, the ACPI tables might report a
> different number of available CPUs. This is because the value on the
> ACPI tables is the number of physical CPUs the host has, and it might
> bear no resemblance with the number of vCPUs Dom0 actually has so it
> should be ignored.
>
> In order to bring up the secondary vCPUs they must be configured
> first. This is achieved using the `VCPUOP_initialise` hypercall. A
> valid context has to be passed to the vCPU in order to boot. The
> relevant fields for PVH guests are the following:
>
> * `flags`: contains VGCF_* flags (see `arch-x86/xen.h` public
> header).
> * `user_regs`: struct that contains the register values that will
> be set on the vCPU before booting. The most relevant ones are `rip`
> and `rsp` in order to set the start address and the stack.
> * `ctrlreg[3]`: contains the address of the page tables that will
> be used by the vCPU.
>
> After the vCPU is initialized with the proper values, it can be
> started by using the `VCPUOP_up` hypercall. The values of the other
> control registers of the vCPU will be the same as the ones described
> in the `control registers` section.
If you want, you could put linux reference here:
For an example, please see cpu_initialize_context() in arch/x86/xen/smp.c
in linux.
> ## Control operations (reboot/shutdown) ##
>
> Reboot and shutdown operations on PVH guests are performed using
> hypercalls. In order to issue a reboot, a guest must use the
> `SHUTDOWN_reboot` hypercall. In order to perform a power off from a
> guest DomU, the `SHUTDOWN_poweroff` hypercall should be used.
>
> The way to perform a full system power off from Dom0 is different
> than what's done in a DomU guest. In order to perform a power off
> from Dom0 the native ACPI path should be followed, but the guest
> should not write the SLP_EN bit to the Pm1Control register. Instead
> the `XENPF_enter_acpi_sleep` hypercall should be used, filling the
> following data in the `xen_platform_op` struct:
>
> cmd = XENPF_enter_acpi_sleep
> interface_version = XENPF_INTERFACE_VERSION
> u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
> u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
>
> This will allow Xen to do it's clean up and to power off the system.
> If the host is using hardware reduced ACPI, the following field
> should also be set:
>
> u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
>
> ## CPUID ##
>
> *TDOD*: describe which cpuid flags a guest should ignore and also
> which flags describe features can be used. It would also be good to
> describe the set of cpuid flags that will always be present when
> running as PVH.
>
> ## Final notes ##
>
> All the other hardware functionality not described in this document
> should be assumed to be performed in the same way as native.
>
> [evnet_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
Great work Roger! Thanks a lot for writing it.
Mukesh
* Re: RFC: very initial PVH design document
From: Konrad Rzeszutek Wilk @ 2014-08-27 20:45 UTC (permalink / raw)
To: Mukesh Rathor; +Cc: xen-devel, David Vrabel, Jan Beulich, Roger Pau Monné
On Tue, Aug 26, 2014 at 05:33:21PM -0700, Mukesh Rathor wrote:
> On Fri, 22 Aug 2014 16:55:08 +0200
> Roger Pau Monné <roger.pau@citrix.com> wrote:
>
> > Hello,
> >
> > I've started writing a document in order to describe the interface
> > exposed by Xen to PVH guests, and how it should be used (by guest
> > OSes). The document is far from complete (see the amount of TODOs
> > scattered around), but given the lack of documentation regarding PVH
> > I think it's a good starting point. The aim of this is that it should
> > be committed to the Xen repository once it's ready. Given that this
> > is still a *very* early version I'm not even posting it as a patch.
> >
> > Please comment, and try to fill the holes if possible ;).
> >
> > Roger.
> >
> > ---
> > # PVH Specification #
> >
> > ## Rationale ##
> >
> > PVH is a new kind of guest that has been introduced on Xen 4.4 as a
> > DomU, and on Xen 4.5 as a Dom0. The aim of PVH is to make use of the
> > hardware virtualization extensions present in modern x86 CPUs in
> > order to improve performance.
> >
> > PVH is considered a mix between PV and HVM, and can be seen as a PV
> > guest that runs inside of an HVM container, or as a PVHVM guest
> > without any emulated devices. The design goal of PVH is to provide
> > the best performance possible and to reduce the amount of
> > modifications needed for a guest OS to run in this mode (compared to
> > pure PV).
> >
> > This document tries to describe the interfaces used by PVH guests,
> > focusing on how an OS should make use of them in order to support PVH.
> >
> > ## Early boot ##
> >
> > PVH guests use the PV boot mechanism, that means that the kernel is
> > loaded and directly launched by Xen (by jumping into the entry
> > point). In order to do this Xen ELF Notes need to be added to the
> > guest kernel, so that they contain the information needed by Xen.
> > Here is an example of the ELF Notes added to the FreeBSD amd64 kernel
> > in order to boot as PVH:
> >
> > ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
> > ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz,
> > __XSTRING(__FreeBSD_version)) ELFNOTE(Xen,
> > XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0") ELFNOTE(Xen,
> > XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE) ELFNOTE(Xen,
> > XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE) ELFNOTE(Xen,
> > XEN_ELFNOTE_ENTRY, .quad, xen_start) ELFNOTE(Xen,
> > XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page) ELFNOTE(Xen,
> > XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
> > ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz,
> > "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
> > ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes") ELFNOTE(Xen,
> > XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V) ELFNOTE(Xen,
> > XEN_ELFNOTE_LOADER, .asciz, "generic") ELFNOTE(Xen,
> > XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0) ELFNOTE(Xen,
> > XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
>
> It will be helpful to add:
>
> On the linux side, the above can be found in arch/x86/xen/xen-head.S.
>
>
> > It is important to highlight the following notes:
> >
> > * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
> > entry point.
> > * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
> > hypercall page inside of the guest kernel (this memory region will be
> > filled by Xen prior to booting).
> > * XEN_ELFNOTE_FEATURES: contains the list of features supported by
> > the kernel. In this case the kernel is only able to boot as a PVH
> > guest, but those options can be mixed with the ones used by pure PV
> > guests in order to have a kernel that supports both PV and PVH (like
> > Linux). The list of options available can be found in the
> > `features.h` public header.
>
> Hmm... for linux I'd word that as follows:
>
> A PVH guest is started by specifying pvh=1 in the config file. However,
> for the guest to be launched as a PVH guest, it must minimally advertise
> certain features which are: auto_translated_physmap, hvm_callback_vector,
> writable_descriptor_tables, and supervisor_mode_kernel. This is done
> via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES. See
> linux:arch/x86/xen/xen-head.S for more info. A list of all xen features
> can be found in xen:include/public/features.h. However, at present
> the absence of these features does not make it automatically boot in PV
> mode, but that may change in future. The ultimate goal is, if a guest
> supports these features, then boot it automatically in PVH mode, otherwise
> boot it in PV mode.
>
> [You can leave out the last part if you want, or just take whatever from
> above].
>
> > Xen will jump into the kernel entry point defined in
> > `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
> > mode depending on the kernel bitness) and some basic page tables
> > setup.
>
> If I may rephrase:
>
> Guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
> with paging, PAE, and long mode enabled. At present only 64bit mode
> is supported, however, in future compat mode support will be added.
> An important distinction for a 64bit PVH is that it is launched at
> privilege level 0 as opposed to a 64bit PV guest which is launched at
> privilege level 3.
>
> > Also, the `rsi` (`esi` on 32bits) register is going to contain the
> > virtual memory address were Xen has placed the start_info structure.
> > The `rsp` (`esp` on 32bits) will contain a stack, that can be used by
> > the guest kernel. The start_info structure contains all the info the
> > guest needs in order to initialize. More information about the
> > contents can be found on the `xen.h` public header.
>
> Since the above is all true for PV guest, you could begin it with:
>
> Just like a PV guest, the rsi ....
>
> >
> > ### Initial amd64 control registers values ###
> >
> > Initial values for the control registers are set up by Xen before
> > booting the guest kernel. The guest kernel can expect to find the
> > following features enabled by Xen.
> >
> > On `CR0` the following bits are set by Xen:
> >
> > * PE (bit 0): protected mode enable.
> > * ET (bit 4): 80387 external math coprocessor.
> > * PG (bit 31): paging enabled.
> >
> > On `CR4` the following bits are set by Xen:
> >
> > * PAE (bit 5): PAE enabled.
> >
> > And finally on `EFER` the following features are enabled:
> >
> > * LME (bit 8): Long mode enable.
> > * LMA (bit 10): Long mode active.
> >
> > *TODO*: do we expect this flags to change? Are there other flags that
> > might be enabled depending on the hardware we are running on?
>
> Can't think of anything...
What about the initial segments (ES, DS, FS, GS)? We boot with Xen
provided ones and need to swap over from them - so that means
the DS and CS are initially set to Xen ones. And we should probably
mention that when the OS switches from the Xen ones it MUST jump to a
CS with CS.L = 1 set, otherwise bad things happen.
We should probably mention that MSR_FS_BASE, MSR_GS_BASE
and MSR_KERNEL_GS_BASE are zeroed out. Not sure about any other MSR?
Should we have a blurb about IDT and GDT and that the PV hypercalls
for that will be ignored?
* Re: RFC: very initial PVH design document
From: Mukesh Rathor @ 2014-08-27 22:38 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: xen-devel, David Vrabel, Jan Beulich, Roger Pau Monné
On Wed, 27 Aug 2014 16:45:37 -0400
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Tue, Aug 26, 2014 at 05:33:21PM -0700, Mukesh Rathor wrote:
> > On Fri, 22 Aug 2014 16:55:08 +0200
> > Roger Pau Monné <roger.pau@citrix.com> wrote:
> >
> > > Hello,
> > >
> > > I've started writing a document in order to describe the
> > > interface exposed by Xen to PVH guests, and how it should be used
> > > (by guest OSes). The document is far from complete (see the
> > > amount of TODOs scattered around), but given the lack of
> > > documentation regarding PVH I think it's a good starting point.
> > > The aim of this is that it should be committed to the Xen
> > > repository once it's ready. Given that this is still a *very*
> > > early version I'm not even posting it as a patch.
> > >
> > > Please comment, and try to fill the holes if possible ;).
> > >
> > > Roger.
> > >
> > > ---
> > > # PVH Specification #
> > >
> > > ## Rationale ##
> > >
> > > PVH is a new kind of guest that has been introduced on Xen 4.4 as
> > > a DomU, and on Xen 4.5 as a Dom0. The aim of PVH is to make use
> > > of the hardware virtualization extensions present in modern x86
> > > CPUs in order to improve performance.
> > >
> > > PVH is considered a mix between PV and HVM, and can be seen as a
> > > PV guest that runs inside of an HVM container, or as a PVHVM guest
> > > without any emulated devices. The design goal of PVH is to provide
> > > the best performance possible and to reduce the amount of
> > > modifications needed for a guest OS to run in this mode (compared
> > > to pure PV).
> > >
> > > This document tries to describe the interfaces used by PVH guests,
> > > focusing on how an OS should make use of them in order to support
> > > PVH.
> > >
> > > ## Early boot ##
> > >
> > > PVH guests use the PV boot mechanism, that means that the kernel
> > > is loaded and directly launched by Xen (by jumping into the entry
> > > point). In order to do this Xen ELF Notes need to be added to the
> > > guest kernel, so that they contain the information needed by Xen.
> > > Here is an example of the ELF Notes added to the FreeBSD amd64
> > > kernel in order to boot as PVH:
> > >
> > > ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
> > > ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz,
> > > __XSTRING(__FreeBSD_version)) ELFNOTE(Xen,
> > > XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0") ELFNOTE(Xen,
> > > XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE) ELFNOTE(Xen,
> > > XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE) ELFNOTE(Xen,
> > > XEN_ELFNOTE_ENTRY, .quad, xen_start) ELFNOTE(Xen,
> > > XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page) ELFNOTE(Xen,
> > > XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
> > > ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz,
> > > "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
> > > ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
> > > ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
> > > ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
> > > ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0) ELFNOTE(Xen,
> > > XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
> >
> > It will be helpful to add:
> >
> > On the linux side, the above can be found in
> > arch/x86/xen/xen-head.S.
> >
> >
> > > It is important to highlight the following notes:
> > >
> > > * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
> > > entry point.
> > > * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
> > > hypercall page inside of the guest kernel (this memory region
> > > will be filled by Xen prior to booting).
> > > * XEN_ELFNOTE_FEATURES: contains the list of features supported
> > > by the kernel. In this case the kernel is only able to boot as a
> > > PVH guest, but those options can be mixed with the ones used by
> > > pure PV guests in order to have a kernel that supports both PV
> > > and PVH (like Linux). The list of options available can be found
> > > in the `features.h` public header.
> >
> > Hmm... for linux I'd word that as follows:
> >
> > A PVH guest is started by specifying pvh=1 in the config file.
> > However, for the guest to be launched as a PVH guest, it must
> > minimally advertise certain features which are:
> > auto_translated_physmap, hvm_callback_vector,
> > writable_descriptor_tables, and supervisor_mode_kernel. This is
> > done via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES.
> > See linux:arch/x86/xen/xen-head.S for more info. A list of all xen
> > features can be found in xen:include/public/features.h. However, at
> > present the absence of these features does not make it
> > automatically boot in PV mode, but that may change in future. The
> > ultimate goal is, if a guest supports these features, then boot it
> > automatically in PVH mode, otherwise boot it in PV mode.
> >
> > [You can leave out the last part if you want, or just take whatever
> > from above].
> >
> > > Xen will jump into the kernel entry point defined in
> > > `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
> > > mode depending on the kernel bitness) and some basic page tables
> > > setup.
> >
> > If I may rephrase:
> >
> > Guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
> > with paging, PAE, and long mode enabled. At present only 64bit mode
> > is supported, however, in future compat mode support will be added.
> > An important distinction for a 64bit PVH is that it is launched at
> > privilege level 0 as opposed to a 64bit PV guest which is launched
> > at privilege level 3.
> >
> > > Also, the `rsi` (`esi` on 32bits) register is going to contain the
> > > virtual memory address were Xen has placed the start_info
> > > structure. The `rsp` (`esp` on 32bits) will contain a stack, that
> > > can be used by the guest kernel. The start_info structure
> > > contains all the info the guest needs in order to initialize.
> > > More information about the contents can be found on the `xen.h`
> > > public header.
> >
> > Since the above is all true for PV guest, you could begin it with:
> >
> > Just like a PV guest, the rsi ....
> >
> > >
> > > ### Initial amd64 control registers values ###
> > >
> > > Initial values for the control registers are set up by Xen before
> > > booting the guest kernel. The guest kernel can expect to find the
> > > following features enabled by Xen.
> > >
> > > On `CR0` the following bits are set by Xen:
> > >
> > > * PE (bit 0): protected mode enable.
> > > * ET (bit 4): 80387 external math coprocessor.
> > > * PG (bit 31): paging enabled.
> > >
> > > On `CR4` the following bits are set by Xen:
> > >
> > > * PAE (bit 5): PAE enabled.
> > >
> > > And finally on `EFER` the following features are enabled:
> > >
> > > * LME (bit 8): Long mode enable.
> > > * LMA (bit 10): Long mode active.
> > >
> > > *TODO*: do we expect these flags to change? Are there other flags
> > > that might be enabled depending on the hardware we are running on?
> >
> > Can't think of anything...
>
> What about the initial segments (ES, DS, FS, GS)? We boot with
> Xen-provided ones and need to swap over from them - so that means
> the DS and CS are initially set to Xen ones. And we should probably
> mention that when the OS switches away from the Xen ones it MUST jump
> to a CS with CS.L = 1 set, otherwise bad things happen.
CS.L is already covered above:
with paging, PAE, and long mode enabled. At present only 64bit mode
is supported, however, in future compat mode support will be added.
that is the CS.L bit. CS.L==1 ==> 64bit mode, CS.L==0 ==> compat mode.
> We should probably mention that MSR_FS_BASE, MSR_GS_BASE
> and MSR_KERNEL_GS_BASE are zeroed out. Not sure about any other MSR?
Could.
> Should we have a blurb about IDT and GDT and that the PV hypercalls
> for those will be ignored?
and that they are native and guest managed.
* Re: RFC: very initial PVH design document
2014-08-27 22:38 ` Mukesh Rathor
@ 2014-08-29 15:09 ` Konrad Rzeszutek Wilk
0 siblings, 0 replies; 10+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-29 15:09 UTC (permalink / raw)
To: Mukesh Rathor; +Cc: xen-devel, David Vrabel, Jan Beulich, Roger Pau Monné
On Wed, Aug 27, 2014 at 03:38:42PM -0700, Mukesh Rathor wrote:
> On Wed, 27 Aug 2014 16:45:37 -0400
> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>
> > On Tue, Aug 26, 2014 at 05:33:21PM -0700, Mukesh Rathor wrote:
> > > On Fri, 22 Aug 2014 16:55:08 +0200
> > > Roger Pau Monné <roger.pau@citrix.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I've started writing a document in order to describe the
> > > > interface exposed by Xen to PVH guests, and how it should be used
> > > > (by guest OSes). The document is far from complete (see the
> > > > amount of TODOs scattered around), but given the lack of
> > > > documentation regarding PVH I think it's a good starting point.
> > > > The aim of this is that it should be committed to the Xen
> > > > repository once it's ready. Given that this is still a *very*
> > > > early version I'm not even posting it as a patch.
> > > >
> > > > Please comment, and try to fill the holes if possible ;).
> > > >
> > > > Roger.
> > > >
> > > > ---
> > > > # PVH Specification #
> > > >
> > > > ## Rationale ##
> > > >
> > > > PVH is a new kind of guest that has been introduced on Xen 4.4 as
> > > > a DomU, and on Xen 4.5 as a Dom0. The aim of PVH is to make use
> > > > of the hardware virtualization extensions present in modern x86
> > > > CPUs in order to improve performance.
> > > >
> > > > PVH is considered a mix between PV and HVM, and can be seen as a
> > > > PV guest that runs inside of an HVM container, or as a PVHVM guest
> > > > without any emulated devices. The design goal of PVH is to provide
> > > > the best performance possible and to reduce the amount of
> > > > modifications needed for a guest OS to run in this mode (compared
> > > > to pure PV).
> > > >
> > > > This document tries to describe the interfaces used by PVH guests,
> > > > focusing on how an OS should make use of them in order to support
> > > > PVH.
> > > >
> > > > ## Early boot ##
> > > >
> > > > PVH guests use the PV boot mechanism, that means that the kernel
> > > > is loaded and directly launched by Xen (by jumping into the entry
> > > > point). In order to do this Xen ELF Notes need to be added to the
> > > > guest kernel, so that they contain the information needed by Xen.
> > > > Here is an example of the ELF Notes added to the FreeBSD amd64
> > > > kernel in order to boot as PVH:
> > > >
> > > > ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
> > > > ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
> > > > ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
> > > > ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
> > > > ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
> > > > ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
> > > > ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
> > > > ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
> > >
> > > It will be helpful to add:
> > >
> > > On the linux side, the above can be found in
> > > arch/x86/xen/xen-head.S.
> > >
> > >
> > > > It is important to highlight the following notes:
> > > >
> > > > * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
> > > > entry point.
> > > > * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
> > > > hypercall page inside of the guest kernel (this memory region
> > > > will be filled by Xen prior to booting).
> > > > * XEN_ELFNOTE_FEATURES: contains the list of features supported
> > > > by the kernel. In this case the kernel is only able to boot as a
> > > > PVH guest, but those options can be mixed with the ones used by
> > > > pure PV guests in order to have a kernel that supports both PV
> > > > and PVH (like Linux). The list of options available can be found
> > > > in the `features.h` public header.
> > >
> > > Hmm... for linux I'd word that as follows:
> > >
> > > A PVH guest is started by specifying pvh=1 in the config file.
> > > However, for the guest to be launched as a PVH guest, it must
> > > minimally advertise certain features which are:
> > > auto_translated_physmap, hvm_callback_vector,
> > > writable_descriptor_tables, and supervisor_mode_kernel. This is
> > > done via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES.
> > > See linux:arch/x86/xen/xen-head.S for more info. A list of all xen
> > > features can be found in xen:include/public/features.h. However, at
> > > present the absence of these features does not make it
> > > automatically boot in PV mode, but that may change in future. The
> > > ultimate goal is, if a guest supports these features, then boot it
> > > automatically in PVH mode, otherwise boot it in PV mode.
> > >
> > > [You can leave out the last part if you want, or just take whatever
> > > from above].
> > >
> > > > Xen will jump into the kernel entry point defined in
> > > > `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
> > > > mode depending on the kernel bitness) and some basic page tables
> > > > setup.
> > >
> > > If I may rephrase:
> > >
> > > Guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
> > > with paging, PAE, and long mode enabled. At present only 64bit mode
> > > is supported, however, in future compat mode support will be added.
> > > An important distinction for a 64bit PVH is that it is launched at
> > > privilege level 0 as opposed to a 64bit PV guest which is launched
> > > at privilege level 3.
> > >
> > > > Also, the `rsi` (`esi` on 32bits) register is going to contain the
> > > > virtual memory address where Xen has placed the start_info
> > > > structure. The `rsp` (`esp` on 32bits) will contain a stack that
> > > > can be used by the guest kernel. The start_info structure
> > > > contains all the info the guest needs in order to initialize.
> > > > More information about the contents can be found in the `xen.h`
> > > > public header.
> > >
> > > Since the above is all true for PV guest, you could begin it with:
> > >
> > > Just like a PV guest, the rsi ....
> > >
> > > >
> > > > ### Initial amd64 control registers values ###
> > > >
> > > > Initial values for the control registers are set up by Xen before
> > > > booting the guest kernel. The guest kernel can expect to find the
> > > > following features enabled by Xen.
> > > >
> > > > On `CR0` the following bits are set by Xen:
> > > >
> > > > * PE (bit 0): protected mode enable.
> > > > * ET (bit 4): 80387 external math coprocessor.
> > > > * PG (bit 31): paging enabled.
> > > >
> > > > On `CR4` the following bits are set by Xen:
> > > >
> > > > * PAE (bit 5): PAE enabled.
> > > >
> > > > And finally on `EFER` the following features are enabled:
> > > >
> > > > * LME (bit 8): Long mode enable.
> > > > * LMA (bit 10): Long mode active.
> > > >
> > > > *TODO*: do we expect these flags to change? Are there other flags
> > > > that might be enabled depending on the hardware we are running on?
> > >
> > > Can't think of anything...
> >
> > What about the initial segments (ES, DS, FS, GS)? We boot with
> > Xen-provided ones and need to swap over from them - so that means
> > the DS and CS are initially set to Xen ones. And we should probably
> > mention that when the OS switches away from the Xen ones it MUST jump
> > to a CS with CS.L = 1 set, otherwise bad things happen.
>
> CS.L is already covered above:
> with paging, PAE, and long mode enabled. At present only 64bit mode
> is supported, however, in future compat mode support will be added.
>
> that is the CS.L bit. CS.L==1 ==> 64bit mode, CS.L==0 ==> compat mode.
I mean that we should include what the segments actually look like,
as in what the initial segments it boots with are.
>
>
> > We should probably mention that MSR_FS_BASE, MSR_GS_BASE
> > and MSR_KERNEL_GS_BASE are zeroed out. Not sure about any other MSR?
>
> Could.
Perhaps say that any other MSRs are treated the same as they are
under an HVM guest.
>
> > Should we have a blurb about IDT and GDT and that the PV hypercalls
> > for those will be ignored?
>
> and that they are native and guest managed.
Right. Which means that during early bootup one has to be extra
careful not to get a #GP, as there are no fault handlers set up yet.
>
* Re: RFC: very initial PVH design document
2014-08-22 14:55 RFC: very initial PVH design document Roger Pau Monné
2014-08-22 15:13 ` Jan Beulich
2014-08-27 0:33 ` Mukesh Rathor
@ 2014-09-12 20:38 ` Konrad Rzeszutek Wilk
2014-09-12 21:25 ` Mukesh Rathor
2 siblings, 1 reply; 10+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-12 20:38 UTC (permalink / raw)
To: Roger Pau Monné; +Cc: David Vrabel, Jan Beulich, xen-devel
> ## SMP discover and bring up ##
>
> The process of bringing up secondary CPUs is obviously different from native,
> since PVH doesn't have a local APIC. The first thing to do is to figure out
> how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall,
> using for example this simple loop:
>
> for (i = 0; i < MAXCPU; i++) {
> ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
> if (ret >= 0)
> /* vCPU#i is present */
> }
>
> Note that when running as Dom0, the ACPI tables might report a different number
> of available CPUs. This is because the value in the ACPI tables is the
> number of physical CPUs the host has, and it might bear no resemblance to the
> number of vCPUs Dom0 actually has, so it should be ignored.
>
> In order to bring up the secondary vCPUs they must be configured first. This is
> achieved using the `VCPUOP_initialise` hypercall. A valid context has to be
> passed to the vCPU in order to boot. The relevant fields for PVH guests are
> the following:
>
> * `flags`: contains VGCF_* flags (see `arch-x86/xen.h` public header).
> * `user_regs`: struct that contains the register values that will be set on
> the vCPU before booting. The most relevant ones are `rip` and `rsp` in order
> to set the start address and the stack.
The OS can use 'rdi' and 'rsi' for its own purposes.
[Any other ones that are free to be used?]
> * `ctrlreg[3]`: contains the address of the page tables that will be used by
> the vCPU.
Other registers, if not set to zero, will cause the hypercall to fail with
-EINVAL.
>
> After the vCPU is initialized with the proper values, it can be started by
> using the `VCPUOP_up` hypercall. The values of the other control registers of
> the vCPU will be the same as the ones described in the `control registers`
> section.
>
* Re: RFC: very initial PVH design document
2014-09-12 20:38 ` Konrad Rzeszutek Wilk
@ 2014-09-12 21:25 ` Mukesh Rathor
0 siblings, 0 replies; 10+ messages in thread
From: Mukesh Rathor @ 2014-09-12 21:25 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: xen-devel, David Vrabel, Jan Beulich, Roger Pau Monné
On Fri, 12 Sep 2014 16:38:20 -0400
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > ## SMP discover and bring up ##
> >
> > The process of bringing up secondary CPUs is obviously different
> > from native, since PVH doesn't have a local APIC. The first thing
> > to do is to figure out how many vCPUs the guest has. This is done
> > using the `VCPUOP_is_up` hypercall, using for example this simple
> > loop:
> >
> > for (i = 0; i < MAXCPU; i++) {
> > ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
> > if (ret >= 0)
> > /* vCPU#i is present */
> > }
> >
> > Note that when running as Dom0, the ACPI tables might report a
> > different number of available CPUs. This is because the value in
> > the ACPI tables is the number of physical CPUs the host has, and it
> > might bear no resemblance to the number of vCPUs Dom0 actually
> > has, so it should be ignored.
> >
> > In order to bring up the secondary vCPUs they must be configured
> > first. This is achieved using the `VCPUOP_initialise` hypercall. A
> > valid context has to be passed to the vCPU in order to boot. The
> > relevant fields for PVH guests are the following:
> >
> > * `flags`: contains VGCF_* flags (see `arch-x86/xen.h` public
> > header).
> > * `user_regs`: struct that contains the register values that will
> > be set on the vCPU before booting. The most relevant ones are `rip`
> > and `rsp` in order to set the start address and the stack.
>
> The OS can use 'rdi' and 'rsi' for their own purpose.
>
> [Any other ones that are free to be used?]
>
They all are free to be used. So, I would phrase it as:
`user_regs`: struct that contains the register values that will
be set on the vCPU before booting. All GPRs are available to
be set, however, the most relevant ones are `rip` and `rsp` in
order to set the start address and the stack. Please note, all
selectors must be null.
In retrospect, maybe I should have tried harder to create a union
here, or even a new subcall for pvh, VCPUOP_initialise_pvh with
its own struct. Anyways...
thanks,
Mukesh
* Re: RFC: very initial PVH design document
2014-08-27 0:33 ` Mukesh Rathor
2014-08-27 20:45 ` Konrad Rzeszutek Wilk
@ 2014-09-16 9:36 ` Roger Pau Monné
1 sibling, 0 replies; 10+ messages in thread
From: Roger Pau Monné @ 2014-09-16 9:36 UTC (permalink / raw)
To: Mukesh Rathor; +Cc: David Vrabel, Jan Beulich, xen-devel
On 27/08/14 at 2:33, Mukesh Rathor wrote:
> On Fri, 22 Aug 2014 16:55:08 +0200
> Roger Pau Monné <roger.pau@citrix.com> wrote:
>
>> Hello,
>>
>> I've started writing a document in order to describe the interface
>> exposed by Xen to PVH guests, and how it should be used (by guest
>> OSes). The document is far from complete (see the amount of TODOs
>> scattered around), but given the lack of documentation regarding PVH
>> I think it's a good starting point. The aim of this is that it should
>> be committed to the Xen repository once it's ready. Given that this
>> is still a *very* early version I'm not even posting it as a patch.
>>
>> Please comment, and try to fill the holes if possible ;).
>>
>> Roger.
>>
>> ---
>> # PVH Specification #
>>
>> ## Rationale ##
>>
>> PVH is a new kind of guest that has been introduced on Xen 4.4 as a
>> DomU, and on Xen 4.5 as a Dom0. The aim of PVH is to make use of the
>> hardware virtualization extensions present in modern x86 CPUs in
>> order to improve performance.
>>
>> PVH is considered a mix between PV and HVM, and can be seen as a PV
>> guest that runs inside of an HVM container, or as a PVHVM guest
>> without any emulated devices. The design goal of PVH is to provide
>> the best performance possible and to reduce the amount of
>> modifications needed for a guest OS to run in this mode (compared to
>> pure PV).
>>
>> This document tries to describe the interfaces used by PVH guests,
>> focusing on how an OS should make use of them in order to support PVH.
>>
>> ## Early boot ##
>>
>> PVH guests use the PV boot mechanism, that means that the kernel is
>> loaded and directly launched by Xen (by jumping into the entry
>> point). In order to do this Xen ELF Notes need to be added to the
>> guest kernel, so that they contain the information needed by Xen.
>> Here is an example of the ELF Notes added to the FreeBSD amd64 kernel
>> in order to boot as PVH:
>>
>> ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
>> ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
>> ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
>> ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
>> ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
>> ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
>> ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
>> ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
>> ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
>> ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
>> ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
>> ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
>> ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
>> ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
>
> It will be helpful to add:
>
> On the linux side, the above can be found in arch/x86/xen/xen-head.S.
Done, although I would prefer to limit the number of code examples
picked from Linux (or to at least try to provide alternate examples under a
more liberal license).
>> It is important to highlight the following notes:
>>
>> * XEN_ELFNOTE_ENTRY: contains the memory address of the kernel
>> entry point.
>> * XEN_ELFNOTE_HYPERCALL_PAGE: contains the memory address of the
>> hypercall page inside of the guest kernel (this memory region will be
>> filled by Xen prior to booting).
>> * XEN_ELFNOTE_FEATURES: contains the list of features supported by
>> the kernel. In this case the kernel is only able to boot as a PVH
>> guest, but those options can be mixed with the ones used by pure PV
>> guests in order to have a kernel that supports both PV and PVH (like
>> Linux). The list of options available can be found in the
>> `features.h` public header.
>
> Hmm... for linux I'd word that as follows:
>
> A PVH guest is started by specifying pvh=1 in the config file. However,
> for the guest to be launched as a PVH guest, it must minimally advertise
> certain features which are: auto_translated_physmap, hvm_callback_vector,
> writable_descriptor_tables, and supervisor_mode_kernel. This is done
> via XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES. See
> linux:arch/x86/xen/xen-head.S for more info. A list of all xen features
> can be found in xen:include/public/features.h. However, at present
> the absence of these features does not make it automatically boot in PV
> mode, but that may change in future. The ultimate goal is, if a guest
> supports these features, then boot it automatically in PVH mode, otherwise
> boot it in PV mode.
I don't think we should add tool-side stuff here (like setting pvh=1 in
the config file). I wanted this document to be a specification about the
interfaces used by a PVH guest, from the OS point of view. Xen supports
a wide variety of toolstacks, and I bet some of them will require a
different method in order to boot as PVH.
> [You can leave out the last part if you want, or just take whatever from
> above].
>
>> Xen will jump into the kernel entry point defined in
>> `XEN_ELFNOTE_ENTRY` with paging enabled (either long or protected
>> mode depending on the kernel bitness) and some basic page tables
>> setup.
>
> If I may rephrase:
>
> Guest is launched at the entry point specified in XEN_ELFNOTE_ENTRY
> with paging, PAE, and long mode enabled. At present only 64bit mode
> is supported, however, in future compat mode support will be added.
> An important distinction for a 64bit PVH is that it is launched at
> privilege level 0 as opposed to a 64bit PV guest which is launched at
> privilege level 3.
I've integrated a part of this paragraph, but I think some of this
content would go into the i386 section once we have support for 32bit
PVH guests.
>> Also, the `rsi` (`esi` on 32bits) register is going to contain the
>> virtual memory address where Xen has placed the start_info structure.
>> The `rsp` (`esp` on 32bits) will contain a stack that can be used by
>> the guest kernel. The start_info structure contains all the info the
>> guest needs in order to initialize. More information about the
>> contents can be found in the `xen.h` public header.
>
> Since the above is all true for PV guest, you could begin it with:
>
> Just like a PV guest, the rsi ....
>
>>
>> ### Initial amd64 control registers values ###
>>
>> Initial values for the control registers are set up by Xen before
>> booting the guest kernel. The guest kernel can expect to find the
>> following features enabled by Xen.
>>
>> On `CR0` the following bits are set by Xen:
>>
>> * PE (bit 0): protected mode enable.
>> * ET (bit 4): 80387 external math coprocessor.
>> * PG (bit 31): paging enabled.
>>
>> On `CR4` the following bits are set by Xen:
>>
>> * PAE (bit 5): PAE enabled.
>>
>> And finally on `EFER` the following features are enabled:
>>
>> * LME (bit 8): Long mode enable.
>> * LMA (bit 10): Long mode active.
>>
>> *TODO*: do we expect these flags to change? Are there other flags that
>> might be enabled depending on the hardware we are running on?
>
> Can't think of anything...
>
>
>> ## Memory ##
>>
>> Since PVH guests rely on virtualization extensions provided by the
>> CPU, they have access to a hardware virtualized MMU, which means
>> page-table related operations should use the same instructions used
>> on native.
>
> Do you wanna expand a bit since this is another big distinction from
> a PV guest?
>
> which means that page tables are native and guest managed.
> This also implies that the mmu_update hypercall is not available to a PVH
> guest, unlike a PV guest. The guest is configured at start so it can
> access all pages up to start_info->nr_pages.
This is already explained in the last paragraph of this section, and
since MMU hypercalls are not available to PVH guests I don't think we
should even mention them.
I like to see this document as something that can be used to add PVH
support from scratch, not something written to be used to migrate from
PV to PVH (although I think it also serves this purpose).
>
>> There are however some differences with native. The usage of native
>> MTRR operations is forbidden, and `XENPF_*_memtype` hypercalls should
>> be used instead. This can be avoided by simply not using MTRR and
>> setting all the memory attributes using PAT, which doesn't require
>> the usage of any hypercalls.
>>
>> Since PVH doesn't use a BIOS in order to boot, the physical memory
>> map has to be retrieved using the `XENMEM_memory_map` hypercall,
>> which will return an e820 map. This memory map might contain holes
>> that describe MMIO regions, that will be already setup by Xen.
>>
>> *TODO*: we need to figure out what to do with MMIO regions, right now
>> Xen sets all the holes in the native e820 to MMIO regions for Dom0 up
>> to 4GB. We need to decide what to do with MMIO regions above 4GB on
>> Dom0, and what to do for PVH DomUs with pci-passthrough.
>
> We map all non-ram regions for dom0 1:1 till the highest non-ram e820
> entry. If there is anything that is beyond the last e820 entry,
> it will remain unmapped.
>
> Correct, passthru needs to be figured.
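Maybe worth adding a small example of fetching the e820 map mentioned above; something along these lines should work (untested sketch, assuming the public headers and the usual hypercall wrappers are available; the buffer size is arbitrary, `struct e820entry` stands for whatever e820 entry layout the OS uses, and `parse_e820` is a made-up helper):

    /* Fetch the e820 memory map provided by Xen via XENMEM_memory_map. */
    struct xen_memory_map memmap;
    struct e820entry map[128];              /* arbitrary buffer size */
    int rc;

    memmap.nr_entries = 128;
    set_xen_guest_handle(memmap.buffer, map);
    rc = HYPERVISOR_memory_op(XENMEM_memory_map, &memmap);
    if (rc == 0)
        /* memmap.nr_entries now holds the number of valid entries. */
        parse_e820(map, memmap.nr_entries);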
>
>> In the case of a guest started with memory != maxmem, the e820 memory
>> map returned by Xen will contain the memory up to maxmem. The guest
>> has to be very careful to only use the lower memory pages up to the
>> value contained in `start_info->nr_pages` because any memory page
>> above that value will not be populated.
>>
>> ## Physical devices ##
>>
>> When running as Dom0 the guest OS has the ability to interact with
>> the physical devices present in the system. A note should be made
>> that PVH guests require a working IOMMU in order to interact with
>> physical devices.
>>
>> The first step in order to manipulate the devices is to make Xen
>> aware of them. Due to the fact that all the hardware description on
>> x86 comes from ACPI, Dom0 is responsible for parsing the ACPI tables
>> and notifying Xen about the devices it finds. This is done with the
>> `PHYSDEVOP_pci_device_add` hypercall.
>>
>> *TODO*: explain the way to register the different kinds of PCI
>> devices, like devices with virtual functions.
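A bare-bones example of the `PHYSDEVOP_pci_device_add` call could also go here; roughly (sketch; only the segment/bus/devfn fields are filled in, see the `physdev.h` public header for the full structure, and `bus`/`devfn` are whatever the Dom0 PCI scan discovered):

    struct physdev_pci_device_add add = {
        .seg   = 0,         /* PCI segment (domain) */
        .bus   = bus,       /* values discovered during the PCI scan */
        .devfn = devfn,
    };
    int rc;

    rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);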
>>
>> ## Interrupts ##
>>
>> All interrupts on PVH guests are routed over event channels, see
>> [Event Channel Internals][event_channels] for more detailed
>> information about event channels. In order to inject interrupts into
>> the guest an IDT vector is used. This is the same mechanism used on
>> PVHVM guests, and allows having per-cpu interrupts that can be used
>> to deliver timers or IPIs.
>>
>> In order to register the callback IDT vector the `HVMOP_set_param`
>> hypercall is used with the following values:
>>
>> domid = DOMID_SELF
>> index = HVM_PARAM_CALLBACK_IRQ
>> value = (0x2 << 56) | vector_value
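In C this would look something like the following (sketch; `callback_vector` is whatever free IDT vector the OS picked and installed a handler for):

    struct xen_hvm_param p = {
        .domid = DOMID_SELF,
        .index = HVM_PARAM_CALLBACK_IRQ,
        .value = (2ULL << 56) | callback_vector,
    };
    int rc;

    rc = HYPERVISOR_hvm_op(HVMOP_set_param, &p);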
>>
>> In order to know which event channel has fired, we need to look into
>> the information provided in the `shared_info` structure. The
>> `evtchn_pending` array is used as a bitmap in order to find out which
>> event channel has fired. Event channels can also be masked by setting
>> the bit corresponding to their port in the `shared_info->evtchn_mask` bitmap.
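A very simplified scan of the pending bitmap could look like this (sketch; a real handler would first consult `vcpu_info->evtchn_pending_sel` to narrow the search and would clear the pending bit atomically, `HYPERVISOR_shared_info` is assumed to point at the mapped shared info page, and `handle_event_channel` is a made-up callback):

    shared_info_t *s = HYPERVISOR_shared_info;
    unsigned int w;

    for (w = 0; w < sizeof(s->evtchn_pending) / sizeof(xen_ulong_t); w++) {
        xen_ulong_t pending = s->evtchn_pending[w] & ~s->evtchn_mask[w];

        while (pending != 0) {
            unsigned int bit = __builtin_ctzl(pending);

            handle_event_channel(w * sizeof(xen_ulong_t) * 8 + bit);
            pending &= pending - 1;          /* clear lowest set bit */
        }
    }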
>>
>> *TODO*: provide a reference about how to interact with FIFO event
>> channels?
>>
>> ### Interrupts from physical devices ###
>>
>> When running as Dom0 (or when using pci-passthrough) interrupts from
>> physical devices are routed over event channels. There are 3
>> different kinds of physical interrupts that can be routed over event
>> channels by Xen: IO APIC, MSI and MSI-X interrupts.
>>
>> Since physical interrupts usually need EOI (End Of Interrupt), Xen
>> allows the registration of a memory region that will contain whether
>> a physical interrupt needs EOI from the guest or not. This is done
>> with the `PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a
>> parameter containing the physical address of the memory page that
>> will act as a bitmap. Then in order to find out if an IRQ needs EOI
>> or not, the OS can perform a simple bit test on the memory page using
>> the PIRQ value.
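A sketch of how this could be used (assuming `pirq_eoi_map` is a zeroed, page-aligned page owned by the guest, and that `virt_to_gfn`/`test_bit` style helpers exist in the OS):

    struct physdev_pirq_eoi_gmfn args = {
        .gmfn = virt_to_gfn(pirq_eoi_map),   /* guest frame number of the page */
    };
    struct physdev_eoi eoi = { .irq = pirq };
    int rc;

    rc = HYPERVISOR_physdev_op(PHYSDEVOP_pirq_eoi_gmfn_v2, &args);

    /* Later, from the interrupt handler: only EOI if the bitmap says so. */
    if (test_bit(pirq, pirq_eoi_map))
        HYPERVISOR_physdev_op(PHYSDEVOP_eoi, &eoi);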
>>
>> ### IO APIC interrupt routing ###
>>
>> IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
>> hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
>> hypercall, as an example IRQ#9 is used here:
>>
>> domid = DOMID_SELF
>> type = MAP_PIRQ_TYPE_GSI
>> index = 9
>> pirq = 9
>>
>> After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to
>> allocate a vector:
>>
>> irq = 9
>> vector = 0
>>
>> *TODO*: I'm not sure why we need those two hypercalls, and their usage
>> is not documented anywhere. Need to clarify what the parameters mean
>> and what effect they have.
>>
>> The IRQ#9 is now registered as PIRQ#9. The triggering and polarity
>> can also be configured using the `PHYSDEVOP_setup_gsi` hypercall:
>>
>> gsi = 9 # This is the IRQ value.
>> triggering = 0
>> polarity = 0
>>
>> In this example the IRQ would be configured to use edge triggering
>> and high polarity.
>>
>> Finally the PIRQ can be bound to an event channel using the
>> `EVTCHNOP_bind_pirq` hypercall, which will return the event channel port the
>> PIRQ has been assigned. After this the event channel will be ready
>> for delivery.
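Putting the whole GSI path together, the IRQ#9 example could be coded roughly as follows (sketch, using the structures from the `physdev.h` and `event_channel.h` public headers):

    struct physdev_map_pirq map = {
        .domid = DOMID_SELF,
        .type  = MAP_PIRQ_TYPE_GSI,
        .index = 9,
        .pirq  = 9,
    };
    struct evtchn_bind_pirq bind = {
        .pirq  = 9,
        .flags = BIND_PIRQ__WILL_SHARE,
    };
    int rc;

    rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);
    if (rc == 0)
        rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind);
    /* On success bind.port holds the event channel used for IRQ#9. */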
>>
>> *NOTE*: when running as Dom0, the guest has to parse the interrupt
>> overrides found in the ACPI tables and notify Xen about them.
>>
>> ### MSI ###
>>
>> In order to configure MSI interrupts for a device, Xen must be made
>> aware of its presence first by using the `PHYSDEVOP_pci_device_add`
>> as described above. Then the `PHYSDEVOP_map_pirq` hypercall is used:
>>
>> domid = DOMID_SELF
>> type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
>> index = -1
>> pirq = -1
>> bus = pci_device_bus
>> devfn = pci_device_function
>> entry_nr = number of MSI interrupts
>>
>> The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI
>> interrupt source is being configured. On devices that support MSI
>> interrupt groups `MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure
>> them by also placing the number of MSI interrupts in the `entry_nr`
>> field.
>>
>> The values in the `bus` and `devfn` fields should be the same as the
>> ones used when registering the device with `PHYSDEVOP_pci_device_add`.
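For a single MSI source the `PHYSDEVOP_map_pirq` call could look like this (sketch; `bus` and `devfn` are the values used with `PHYSDEVOP_pci_device_add`):

    struct physdev_map_pirq map = {
        .domid    = DOMID_SELF,
        .type     = MAP_PIRQ_TYPE_MSI_SEG,
        .index    = -1,
        .pirq     = -1,
        .bus      = bus,
        .devfn    = devfn,
        .entry_nr = 1,
    };
    int rc;

    rc = HYPERVISOR_physdev_op(PHYSDEVOP_map_pirq, &map);
    /* On success map.pirq holds the allocated PIRQ, which can then be
     * bound to an event channel with EVTCHNOP_bind_pirq as above. */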
>>
>> ### MSI-X ###
>>
>> *TODO*: how to register/use them.
>>
>> ## Event timers and timecounters ##
>>
>> Since some hardware is not available on PVH (like the local APIC),
>> Xen provides the OS with suitable replacements in order to get the
>> same functionality. One of them is the timer interface. Using a set
>> of hypercalls, a guest OS can set event timers that will deliver an
>> event channel interrupt to the guest.
>>
>> In order to use the timer provided by Xen the guest OS first needs to
>> register a VIRQ event channel to be used by the timer to deliver the
>> interrupts. The event channel is registered using the
>> `EVTCHNOP_bind_virq` hypercall, which only takes two parameters:
>>
>> virq = VIRQ_TIMER
>> vcpu = vcpu_id
>>
>> The port that's going to be used by Xen in order to deliver the
>> interrupt is returned in the `port` field. Once the interrupt is set,
>> the timer can be programmed using the `VCPUOP_set_singleshot_timer`
>> hypercall.
>>
>> flags = VCPU_SSHOTTMR_future
>> timeout_abs_ns = absolute value when the timer should fire
>>
>> It is important to notice that the `VCPUOP_set_singleshot_timer`
>> hypercall must be executed from the same vCPU where the timer should
>> fire, or else Xen will refuse to set it. This is a single-shot timer,
>> so it must be set by the OS every time it fires if a periodic timer
>> is desired.
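As an example, binding the timer VIRQ and arming a one-shot timer 1ms in the future could be done like this (sketch; `now_ns` is the current system time computed from `vcpu_time_info` as described below):

    struct evtchn_bind_virq virq = {
        .virq = VIRQ_TIMER,
        .vcpu = vcpu_id,
    };
    struct vcpu_set_singleshot_timer timer = {
        .timeout_abs_ns = now_ns + 1000000ULL,   /* 1ms from now */
        .flags          = VCPU_SSHOTTMR_future,
    };
    int rc;

    rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, &virq);
    /* virq.port now holds the event channel used for timer delivery. */
    rc = HYPERVISOR_vcpu_op(VCPUOP_set_singleshot_timer, vcpu_id, &timer);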
>>
>> Xen also shares a memory region with the guest OS that contains time
>> related values that are updated periodically. These values can be used
>> to implement a timecounter or to obtain the current time. This
>> information is placed inside of
>> `shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the
>> guest has been launched) can be calculated using the following
>> expression and the values stored in the `vcpu_time_info` struct:
>>
>> system_time + ((((tsc - tsc_timestamp) << tsc_shift) *
>> tsc_to_system_mul) >> 32)
>>
>> The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
>> calculated using the above value, plus the timeout the system wants
>> to set.
>>
>> If the OS also wants to obtain the current wallclock time, the value
>> calculated above has to be added to the values found in
>> `shared_info->wc_sec` and `shared_info->wc_nsec`.
>
> All the above is great info, not PVH specific tho. May wanna mention
> it fwiw.
>
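A worked example of the time calculation might also help here; roughly (sketch; `rdtsc()` is a made-up helper returning the raw TSC, the `version` field is checked so the read is retried if Xen updates the structure mid-way, and the 128-bit multiply avoids overflowing the intermediate value):

    uint64_t get_system_time_ns(const struct vcpu_time_info *t)
    {
        uint32_t version;
        uint64_t delta, system_time;

        do {
            version = t->version;            /* odd while Xen is updating */
            __asm__ __volatile__("" ::: "memory");
            delta = rdtsc() - t->tsc_timestamp;
            if (t->tsc_shift >= 0)
                delta <<= t->tsc_shift;
            else
                delta >>= -t->tsc_shift;
            system_time = t->system_time +
                (uint64_t)(((unsigned __int128)delta * t->tsc_to_system_mul) >> 32);
            __asm__ __volatile__("" ::: "memory");
        } while ((version & 1) || version != t->version);

        return system_time;
    }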
>> ## SMP discover and bring up ##
>>
>> The process of bringing up secondary CPUs is obviously different from
>> native, since PVH doesn't have a local APIC. The first thing to do is
>> to figure out how many vCPUs the guest has. This is done using the
>> `VCPUOP_is_up` hypercall, using for example this simple loop:
>>
>> for (i = 0; i < MAXCPU; i++) {
>> ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
>> if (ret >= 0)
>> /* vCPU#i is present */
>> }
>>
>> Note that when running as Dom0, the ACPI tables might report a
>> different number of available CPUs. This is because the value in the
>> ACPI tables is the number of physical CPUs the host has, and it might
>> bear no resemblance to the number of vCPUs Dom0 actually has, so it
>> should be ignored.
>>
>> In order to bring up the secondary vCPUs they must be configured
>> first. This is achieved using the `VCPUOP_initialise` hypercall. A
>> valid context has to be passed to the vCPU in order to boot. The
>> relevant fields for PVH guests are the following:
>>
>> * `flags`: contains VGCF_* flags (see `arch-x86/xen.h` public
>> header).
>> * `user_regs`: struct that contains the register values that will
>> be set on the vCPU before booting. The most relevant ones are `rip`
>> and `rsp` in order to set the start address and the stack.
>> * `ctrlreg[3]`: contains the address of the page tables that will
>> be used by the vCPU.
>>
>> After the vCPU is initialized with the proper values, it can be
>> started by using the `VCPUOP_up` hypercall. The values of the other
>> control registers of the vCPU will be the same as the ones described
>> in the `control registers` section.
>
> If you want, you could put linux reference here:
>
> For an example, please see cpu_initialize_context() in arch/x86/xen/smp.c
> in linux.
Done, thanks for the comments.
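For completeness, the resulting code is roughly like this (untested sketch; `cpu` is the vCPU id being brought up, and `start_secondary`, `boot_stack_top` and `kernel_pml4` are made-up symbols for the AP entry point, its stack and the top-level page table):

    struct vcpu_guest_context ctx;
    int rc;

    memset(&ctx, 0, sizeof(ctx));          /* unused fields and selectors must be 0 */
    ctx.flags = VGCF_in_kernel;            /* VGCF_* flags, see arch-x86/xen.h */
    ctx.user_regs.rip = (unsigned long)start_secondary;
    ctx.user_regs.rsp = (unsigned long)boot_stack_top;
    ctx.ctrlreg[3] = xen_pfn_to_cr3(virt_to_gfn(kernel_pml4));

    rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, &ctx);
    if (rc == 0)
        rc = HYPERVISOR_vcpu_op(VCPUOP_up, cpu, NULL);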
>> ## Control operations (reboot/shutdown) ##
>>
>> Reboot and shutdown operations on PVH guests are performed using
>> hypercalls. In order to issue a reboot, a guest must use the
>> `SHUTDOWN_reboot` hypercall. In order to perform a power off from a
>> guest DomU, the `SHUTDOWN_poweroff` hypercall should be used.
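A small clarification that might be worth adding: `SHUTDOWN_reboot` and `SHUTDOWN_poweroff` are reason codes passed to the `SCHEDOP_shutdown` hypercall, e.g. (sketch):

    struct sched_shutdown shutdown = {
        .reason = SHUTDOWN_poweroff,        /* or SHUTDOWN_reboot */
    };

    HYPERVISOR_sched_op(SCHEDOP_shutdown, &shutdown);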
>>
>> The way to perform a full system power off from Dom0 is different
>> than what's done in a DomU guest. In order to perform a power off
>> from Dom0 the native ACPI path should be followed, but the guest
>> should not write the SLP_EN bit to the Pm1Control register. Instead
>> the `XENPF_enter_acpi_sleep` hypercall should be used, filling the
>> following data in the `xen_platform_op` struct:
>>
>> cmd = XENPF_enter_acpi_sleep
>> interface_version = XENPF_INTERFACE_VERSION
>> u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
>> u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
>>
>> This will allow Xen to do its cleanup and to power off the system.
>> If the host is using hardware reduced ACPI, the following field
>> should also be set:
>>
>> u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
>>
>> ## CPUID ##
>>
>> *TODO*: describe which cpuid flags a guest should ignore and also
>> which flags describe features that can be used. It would also be good to
>> describe the set of cpuid flags that will always be present when
>> running as PVH.
>>
>> ## Final notes ##
>>
>> All the other hardware functionality not described in this document
>> should be assumed to be performed in the same way as native.
>>
>> [event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
>
>
> Great work Roger! Thanks a lot for writing it.
>
> Mukesh
>
>
>
Thread overview: 10+ messages
2014-08-22 14:55 RFC: very initial PVH design document Roger Pau Monné
2014-08-22 15:13 ` Jan Beulich
2014-08-22 15:49 ` Roger Pau Monné
2014-08-27 0:33 ` Mukesh Rathor
2014-08-27 20:45 ` Konrad Rzeszutek Wilk
2014-08-27 22:38 ` Mukesh Rathor
2014-08-29 15:09 ` Konrad Rzeszutek Wilk
2014-09-16 9:36 ` Roger Pau Monné
2014-09-12 20:38 ` Konrad Rzeszutek Wilk
2014-09-12 21:25 ` Mukesh Rathor