[PATCH] docs: add PVH specification


* [PATCH] docs: add PVH specification
@ 2014-09-16 15:53 Roger Pau Monne
  2014-09-16 16:08 ` Ian Campbell
  2014-09-17 11:59 ` Jan Beulich
  0 siblings, 2 replies; 5+ messages in thread
From: Roger Pau Monne @ 2014-09-16 15:53 UTC (permalink / raw)
  To: xen-devel; +Cc: David Vrabel, Jan Beulich, Roger Pau Monne

Introduce a document that describes the interfaces used on PVH. This
document has been designed from a guest OS point of view (i.e.: what a guest
needs to do in order to support PVH).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Mukesh Rathor <mukesh.rathor@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
The document is still far from complete IMHO, but it might be best to just
commit what we currently have rather than wait for a full document.

I will try to fill the gaps as I go implementing new features on FreeBSD.
---
 docs/misc/pvh.markdown | 357 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 357 insertions(+)
 create mode 100644 docs/misc/pvh.markdown

diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
new file mode 100644
index 0000000..61f1b4e
--- /dev/null
+++ b/docs/misc/pvh.markdown
@@ -0,0 +1,357 @@
+# PVH Specification #
+
+## Rationale ##
+
+PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
+on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
+virtualization extensions present in modern x86 CPUs in order to
+improve performance.
+
+PVH is considered a mix between PV and HVM, and can be seen as a PV guest
+that runs inside of an HVM container, or as a PVHVM guest without any emulated
+devices. The design goal of PVH is to provide the best performance possible and
+to reduce the amount of modifications needed for a guest OS to run in this mode
+(compared to pure PV).
+
+This document tries to describe the interfaces used by PVH guests, focusing
+on how an OS should make use of them in order to support PVH.
+
+## Early boot ##
+
+PVH guests use the PV boot mechanism, which means that the kernel is loaded and
+directly launched by Xen (by jumping into the entry point). In order to do this
+Xen ELF Notes need to be added to the guest kernel, so that they contain the
+information needed by Xen. Here is an example of the ELF Notes added to the
+FreeBSD amd64 kernel in order to boot as PVH:
+
+    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,       .asciz, "FreeBSD")
+    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION,  .asciz, __XSTRING(__FreeBSD_version))
+    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION,    .asciz, "xen-3.0")
+    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      .quad,  KERNBASE)
+    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET,   .quad,  KERNBASE)
+    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          .quad,  xen_start)
+    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad,  hypercall_page)
+    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW,   .quad,  HYPERVISOR_VIRT_START)
+    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,       .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
+    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "yes")
+    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,   .long,  PG_V, PG_V)
+    ELFNOTE(Xen, XEN_ELFNOTE_LOADER,         .asciz, "generic")
+    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long,  0)
+    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB,     .asciz, "yes")
+
+On the Linux side, the above can be found in `arch/x86/xen/xen-head.S`.
+
+It is important to highlight the following notes:
+
+  * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry
+    point.
+  * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
+    hypercall page inside of the guest kernel (this memory region will be filled
+    by Xen prior to booting).
+  * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel.
+    In the example above the kernel is only able to boot as a PVH guest, but
+    those options can be mixed with the ones used by pure PV guests in order to
+    have a kernel that supports both PV and PVH (like Linux). The list of
+    options available can be found in the `features.h` public header.
+
+Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
+paging enabled (either long mode or protected mode with paging turned on
+depending on the kernel bitness) and some basic page tables set up. An important
+distinction for a 64bit PVH is that it is launched at privilege level 0 as
+opposed to a 64bit PV guest which is launched at privilege level 3.
+
+Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
+memory address where Xen has placed the `start_info` structure. The `rsp` (`esp`
+on 32bits) will contain a stack, that can be used by the guest kernel. The
+`start_info` structure contains all the info the guest needs in order to
+initialize. More information about the contents can be found in the
+`xen.h` public header.
+
+### Initial amd64 control registers values ###
+
+Initial values for the control registers are set up by Xen before booting the
+guest kernel. The guest kernel can expect to find the following features
+enabled by Xen.
+
+On `CR0` the following bits are set by Xen:
+
+  * PE (bit 0): protected mode enable.
+  * ET (bit 4): 80387 external math coprocessor.
+  * PG (bit 31): paging enabled.
+
+On `CR4` the following bits are set by Xen:
+
+  * PAE (bit 5): PAE enabled.
+
+And finally on `EFER` the following features are enabled:
+
+  * LME (bit 8): Long mode enable.
+  * LMA (bit 10): Long mode active.
+
+All the segment selectors (`cs`, `ds`, `ss`, `es`, `fs` and `gs`), the
+`FS.base` and `GS.base` MSRs are zeroed out. MSR registers should be treated
+like native.
+
+The `IDT` and `GDT` are also zeroed, so the guest must be especially careful not
+to trigger a fault until after they have been properly set. The way of setting
+the IDT and the GDT is using the native instructions as would be done on bare
+metal.
+
+## Memory ##
+
+Since PVH guests rely on virtualization extensions provided by the CPU, they
+have access to a hardware virtualized MMU, which means page-table related
+operations should use the same instructions used on native.
+
+There are however some differences with native. The usage of native MTRR
+operations is forbidden, and `XENPF_*_memtype` hypercalls should be used
+instead. This can be avoided by simply not using MTRR and setting all the
+memory attributes using PAT, which doesn't require the usage of any hypercalls.
+
+Since PVH doesn't use a BIOS in order to boot, the physical memory map has
+to be retrieved using the `XENMEM_memory_map` hypercall, which will return
+an e820 map. This memory map might contain holes that describe MMIO regions,
+which will already be set up by Xen.
+
+*TODO*: we need to figure out what to do with MMIO regions, right now Xen
+sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We
+need to decide what to do with MMIO regions above 4GB on Dom0, and what to do
+for PVH DomUs with pci-passthrough.
+
+In the case of a guest started with memory != maxmem, the e820 memory map
+returned by Xen will contain the memory up to maxmem. The guest has to be very
+careful to only use the lower memory pages up to the value contained in
+`start_info->nr_pages` because any memory page above that value will not be
+populated.
+
+## Physical devices ##
+
+When running as Dom0 the guest OS has the ability to interact with the physical
+devices present in the system. A note should be made that PVH guests require
+a working IOMMU in order to interact with physical devices.
+
+The first step in order to manipulate the devices is to make Xen aware of
+them. Due to the fact that all the hardware description on x86 comes from
+ACPI, Dom0 is responsible for parsing the ACPI tables and notifying Xen about
+the devices it finds, using the `PHYSDEVOP_pci_device_add` hypercall.
+
+*TODO*: explain the way to register the different kinds of PCI devices, like
+devices with virtual functions.
+
+## Interrupts ##
+
+All interrupts on PVH guests are routed over event channels, see
+[Event Channel Internals][event_channels] for more detailed information about
+event channels. In order to inject interrupts into the guest an IDT vector is
+used. This is the same mechanism used on PVHVM guests, and allows having
+per-cpu interrupts that can be used to deliver timers or IPIs.
+
+In order to register the callback IDT vector the `HVMOP_set_param` hypercall
+is used with the following values:
+
+    domid = DOMID_SELF
+    index = HVM_PARAM_CALLBACK_IRQ
+    value = (0x2 << 56) | vector_value
+
+In order to know which event channel has fired, we need to look into the
+information provided in the `shared_info` structure. The `evtchn_pending`
+array is used as a bitmap in order to find out which event channel has
+fired. An event channel can also be masked by setting the bit for its port in
+the `shared_info->evtchn_mask` bitmap.
+
+### Interrupts from physical devices ###
+
+When running as Dom0 (or when using pci-passthrough) interrupts from physical
+devices are routed over event channels. There are three different kinds of
+physical interrupts that can be routed over event channels by Xen: IO APIC,
+MSI and MSI-X interrupts.
+
+Since physical interrupts usually need EOI (End Of Interrupt), Xen allows the
+registration of a memory region that will contain whether a physical interrupt
+needs EOI from the guest or not. This is done with the
+`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a parameter containing the
+physical address of the memory page that will act as a bitmap. Then in order to
+find out if an IRQ needs EOI or not, the OS can perform a simple bit test on the
+memory page using the PIRQ value.
+
+### IO APIC interrupt routing ###
+
+IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
+hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
+hypercall; as an example, IRQ#9 is used here:
+
+    domid = DOMID_SELF
+    type = MAP_PIRQ_TYPE_GSI
+    index = 9
+    pirq = 9
+
+After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to allocate a vector:
+
+    irq = 9
+    vector = 0
+
+*TODO*: I'm not sure why we need those two hypercalls, and their usage is not
+documented anywhere. Need to clarify what the parameters mean and what effect
+they have.
+
+The IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also
+be configured using the `PHYSDEVOP_setup_gsi` hypercall:
+
+    gsi = 9 # This is the IRQ value.
+    triggering = 0
+    polarity = 0
+
+In this example the IRQ would be configured to use edge triggering and high
+polarity.
+
+Finally the PIRQ can be bound to an event channel using the
+`EVTCHNOP_bind_pirq` hypercall, which returns the event channel port the PIRQ
+has been assigned. After this the event channel will be ready for delivery.
+
+*NOTE*: when running as Dom0, the guest has to parse the interrupt overwrites
+found on the ACPI tables and notify Xen about them.
+
+### MSI ###
+
+In order to configure MSI interrupts for a device, Xen must be made aware of
+its presence first by using the `PHYSDEVOP_pci_device_add` as described above.
+Then the `PHYSDEVOP_map_pirq` hypercall is used:
+
+    domid = DOMID_SELF
+    type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
+    index = -1
+    pirq = -1
+    bus = pci_device_bus
+    devfn = pci_device_function
+    entry_nr = number of MSI interrupts
+
+The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt
+source is being configured. On devices that support MSI interrupt groups
+`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the
+number of MSI interrupts in the `entry_nr` field.
+
+The values in the `bus` and `devfn` field should be the same as the ones used
+when registering the device with `PHYSDEVOP_pci_device_add`.
+
+### MSI-X ###
+
+*TODO*: how to register/use them.
+
+## Event timers and timecounters ##
+
+Since some hardware is not available on PVH (like the local APIC), Xen provides
+the OS with suitable replacements in order to get the same functionality. One
+of them is the timer interface. Using a set of hypercalls, a guest OS can set
+event timers that will deliver an event channel interrupt to the guest.
+
+In order to use the timer provided by Xen the guest OS first needs to register
+a VIRQ event channel to be used by the timer to deliver the interrupts. The
+event channel is registered using the `EVTCHNOP_bind_virq` hypercall, which
+takes only two parameters:
+
+    virq = VIRQ_TIMER
+    vcpu = vcpu_id
+
+The port that's going to be used by Xen in order to deliver the interrupt is
+returned in the `port` field. Once the interrupt is set, the timer can be
+programmed using the `VCPUOP_set_singleshot_timer` hypercall.
+
+    flags = VCPU_SSHOTTMR_future
+    timeout_abs_ns = absolute value when the timer should fire
+
+It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must
+be executed from the same vCPU where the timer should fire, or else Xen will
+refuse to set it. This is a single-shot timer, so it must be set by the OS
+every time it fires if a periodic timer is desired.
+
+Xen also shares a memory region with the guest OS that contains time related
+values that are updated periodically. These values can be used to implement a
+timecounter or to obtain the current time. This information is placed inside of
+`shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the guest has
+been launched) can be calculated using the following expression and the values
+stored in the `vcpu_time_info` struct:
+
+    system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
+
+The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
+calculated using the above value, plus the timeout the system wants to set.
+
+If the OS also wants to obtain the current wallclock time, the value calculated
+above has to be added to the values found in `shared_info->wc_sec` and
+`shared_info->wc_nsec`.
+
+## SMP discover and bring up ##
+
+The process of bringing up secondary CPUs is obviously different from native,
+since PVH doesn't have a local APIC. The first thing to do is to figure out
+how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall,
+using for example this simple loop:
+
+    for (i = 0; i < MAXCPU; i++) {
+        ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
+        if (ret >= 0)
+            ncpus++; /* vCPU#i is present */
+    }
+
+Note that when running as Dom0, the ACPI tables might report a different number
+of available CPUs. This is because the value on the ACPI tables is the
+number of physical CPUs the host has, and it might bear no resemblance to the
+number of vCPUs Dom0 actually has, so it should be ignored.
+
+In order to bring up the secondary vCPUs they must be configured first. This is
+achieved using the `VCPUOP_initialise` hypercall. A valid context has to be
+passed to the vCPU in order to boot. The relevant fields for PVH guests are
+the following:
+
+  * `flags`: contains `VGCF_*` flags (see `arch-x86/xen.h` public header).
+  * `user_regs`: struct that contains the register values that will be set on
+    the vCPU before booting. All GPRs are available to be set, however, the
+    most relevant ones are `rip` and `rsp` in order to set the start address
+    and the stack. Please note, all selectors must be null.
+  * `ctrlreg[3]`: contains the address of the page tables that will be used by
+    the vCPU. Other control registers should be set to zero, or else the
+    hypercall will fail with -EINVAL.
+
+After the vCPU is initialized with the proper values, it can be started by
+using the `VCPUOP_up` hypercall. The values of the other control registers of
+the vCPU will be the same as the ones described in the `control registers`
+section.
+
+Examples of how to bring up secondary CPUs can be found in the FreeBSD
+code base in `sys/x86/xen/pv.c` and in Linux in `arch/x86/xen/smp.c`.
+
+## Control operations (reboot/shutdown) ##
+
+Reboot and shutdown operations on PVH guests are performed using hypercalls.
+In order to issue a reboot, a guest must use the `SCHEDOP_shutdown` hypercall
+with the `SHUTDOWN_reboot` reason. In order to perform a power off from a guest
+DomU, the `SHUTDOWN_poweroff` reason should be used instead.
+
+The way to perform a full system power off from Dom0 is different than what's
+done in a DomU guest. In order to perform a power off from Dom0 the native
+ACPI path should be followed, but the guest should not write the `SLP_EN`
+bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall
+should be used, filling the following data in the `xen_platform_op` struct:
+
+    cmd = XENPF_enter_acpi_sleep
+    interface_version = XENPF_INTERFACE_VERSION
+    u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
+    u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
+
+This will allow Xen to do its clean up and to power off the system. If the
+host is using hardware reduced ACPI, the following field should also be set:
+
+    u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
+
+## CPUID ##
+
+*TODO*: describe which cpuid flags a guest should ignore and also which flags
+describe features that can be used. It would also be good to describe the set
+of cpuid flags that will always be present when running as PVH.
+
+## Final notes ##
+
+All the other hardware functionality not described in this document should be
+assumed to be performed in the same way as native.
+
+[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
-- 
1.8.5.2 (Apple Git-48)
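
For illustration only (this is not part of the patch), the uptime expression
from the "Event timers and timecounters" section could be written in C roughly
as below. It is a sketch under a few assumptions: `struct pvh_time_sketch` is a
local stand-in that only mirrors the fields used by the expression, the names
are made up, `tsc_shift` is assumed to be signed (a negative value meaning a
right shift), and the consistency check a real guest needs while Xen updates
the structure is left out.

```c
#include <stdint.h>

/* Local stand-in for the vcpu_time_info fields used by the expression. */
struct pvh_time_sketch {
    uint64_t tsc_timestamp;     /* TSC value at the last update */
    uint64_t system_time;       /* ns since boot at the last update */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;         /* assumed signed: negative shifts right */
};

/* Scale the TSC delta to nanoseconds and add it to system_time. */
static uint64_t pvh_uptime_ns(const struct pvh_time_sketch *t, uint64_t tsc)
{
    uint64_t delta = tsc - t->tsc_timestamp;

    if (t->tsc_shift >= 0)
        delta <<= t->tsc_shift;
    else
        delta >>= -t->tsc_shift;

    /*
     * (delta * mul) >> 32, split into two 32-bit halves so that the
     * intermediate product does not need a 128-bit type.
     */
    uint64_t hi = (delta >> 32) * t->tsc_to_system_mul;
    uint64_t lo = (delta & 0xffffffffu) * t->tsc_to_system_mul;

    return t->system_time + hi + (lo >> 32);
}
```

The value returned here is what the timeout passed to
`VCPUOP_set_singleshot_timer` would be added to.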


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [PATCH] docs: add PVH specification
  2014-09-16 15:53 [PATCH] docs: add PVH specification Roger Pau Monne
@ 2014-09-16 16:08 ` Ian Campbell
  2014-09-17 11:59 ` Jan Beulich
  1 sibling, 0 replies; 5+ messages in thread
From: Ian Campbell @ 2014-09-16 16:08 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, David Vrabel, Jan Beulich

On Tue, 2014-09-16 at 17:53 +0200, Roger Pau Monne wrote:
> Introduce a document that describes the interfaces used on PVH. This
> document has been designed from a guest OS point of view (i.e.: what a guest
> needs to do in order to support PVH).
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Cc: Jan Beulich <JBeulich@suse.com>
> Cc: Mukesh Rathor <mukesh.rathor@oracle.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
> The document is still far from complete IMHO, but it might be best to just
> commit what we currently have rather than wait for a full document.

I haven't read this doc but I agree with checking something in now. In
general it is easier to motivate people to update an existing document
rather than to gather the activation energy to start a new one, thanks
for taking that first step!

Ian.



* Re: [PATCH] docs: add PVH specification
  2014-09-16 15:53 [PATCH] docs: add PVH specification Roger Pau Monne
  2014-09-16 16:08 ` Ian Campbell
@ 2014-09-17 11:59 ` Jan Beulich
  2014-09-18 11:00   ` Roger Pau Monné
  1 sibling, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2014-09-17 11:59 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, David Vrabel

>>> On 16.09.14 at 17:53, <roger.pau@citrix.com> wrote:
> +Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
> +memory address were Xen has placed the `start_info` structure. The `rsp` (`esp`
> +on 32bits) will contain a stack, that can be used by the guest kernel. The

... will point to the top of an initial single page stack, ...

> +`start_info` structure contains all the info the guest needs in order to
> +initialize. More information about the contents can be found on the
> +`xen.h` public header.
> +
> +### Initial amd64 control registers values ###
> +
> +Initial values for the control registers are set up by Xen before booting the
> +guest kernel. The guest kernel can expect to find the following features
> +enabled by Xen.
> +
> +On `CR0` the following bits are set by Xen:

"In ..." or "CR0 has the following bits set by Xen:".

> +
> +  * PE (bit 0): protected mode enable.
> +  * ET (bit 4): 80387 external math coprocessor.

This bit has nothing to do with an external coprocessor, it simply says
387 or newer as opposed to 287.

> +  * PG (bit 31): paging enabled.
> +
> +On `CR4` the following bits are set by Xen:
> +
> +  * PAE (bit 5): PAE enabled.
> +
> +And finally on `EFER` the following features are enabled:
> +
> +  * LME (bit 8): Long mode enable.
> +  * LMA (bit 10): Long mode active.

Perhaps also worth clarifying which bits are guaranteed to be clear
(right now one might imply all others, but that's not something we
can guarantee with forward compatibility in mind). Further I think
EFLAGS wants mentioning here too, and perhaps the debug registers.

> +
> +All the segment selectors (`cs`, `ds`, `ss`, `es`, `fs` and `gs`), the
> +`FS.base` and `GS.base` MSRs are zeroed out.

For the selector registers, specifying what the hidden portions hold
is a must I think, at the very least for %cs and %ss.

> MSR registers should be treated
> +like native.

Not sure what this is intended to mean.

> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> +is used with the following values:
> +
> +    domid = DOMID_SELF
> +    index = HVM_PARAM_CALLBACK_IRQ
> +    value = (0x2 << 56) | vector_value

If we don't have #define-s for these two numbers, we urgently ought
to add ones.
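
As a side illustration of what the handler behind that callback vector has to
do, the bitmap lookup described in the Interrupts section of the patch might
look roughly like the sketch below. The function and macro names are
hypothetical, the two arrays stand in for `shared_info->evtchn_pending` and
`shared_info->evtchn_mask`, and plain loads and stores are used for brevity; a
real guest would need atomic bit operations, since Xen updates these bitmaps
concurrently.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the lowest port that is pending and not masked, clearing its
 * pending bit, or -1 if no such port exists.
 */
static long sketch_next_port(unsigned long *pending,
                             const unsigned long *mask, size_t nwords)
{
    for (size_t w = 0; w < nwords; w++) {
        unsigned long ready = pending[w] & ~mask[w];

        if (ready == 0)
            continue;
        for (size_t b = 0; b < SKETCH_BITS_PER_WORD; b++) {
            if (ready & (1UL << b)) {
                pending[w] &= ~(1UL << b);
                return (long)(w * SKETCH_BITS_PER_WORD + b);
            }
        }
    }
    return -1;
}
```

Calling it in a loop until it returns -1 drains every unmasked pending port;
masked ports keep their pending bit set.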

> +### IO APIC interrupt routing ###
> +
> +IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
> +hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
> +hypercall, as an example IRQ#9 is used here:
> +
> +    domid = DOMID_SELF
> +    type = MAP_PIRQ_TYPE_GSI
> +    index = 9
> +    pirq = 9
> +
> +After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to allocate a vector:
> +
> +    irq = 9
> +    vector = 0
> +
> +*TODO*: I'm not sure why we need those two hypercalls, and it's usage is not
> +documented anywhere. Need to clarify what the parameters mean and what effect
> +they have.

PHYSDEVOP_alloc_irq_vector has been a dummy for a very long time
now - nothing should break if this call got omitted.

> +*NOTE*: when running as Dom0, the guest has to parse the interrupt overwrites
> +found on the ACPI tables and notify Xen about them.

... overrides ...

Jan


* Re: [PATCH] docs: add PVH specification
  2014-09-17 11:59 ` Jan Beulich
@ 2014-09-18 11:00   ` Roger Pau Monné
  2014-09-18 12:32     ` Jan Beulich
  0 siblings, 1 reply; 5+ messages in thread
From: Roger Pau Monné @ 2014-09-18 11:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, David Vrabel

On 17/09/14 at 13.59, Jan Beulich wrote:
>>>> On 16.09.14 at 17:53, <roger.pau@citrix.com> wrote:
>> +And finally on `EFER` the following features are enabled:
>> +
>> +  * LME (bit 8): Long mode enable.
>> +  * LMA (bit 10): Long mode active.
> 
> Perhaps also worth clarifying which bits are guaranteed to be clear
> (right now one might imply all others, but that's not something we
> can guarantee with forward compatibility in mind). Further I think
> EFLAGS wants mentioning here too, and perhaps the debug registers.

I've added that the SCE and NXE bits will not be enabled, and the
remaining ones will be in an unknown state, possibly we can also add
other bits that will surely not be enabled?

I've also added that RFLAGS is clear when jumping into the kernel entry
point.

>> +
>> +All the segment selectors (`cs`, `ds`, `ss`, `es`, `fs` and `gs`), the
>> +`FS.base` and `GS.base` MSRs are zeroed out.
> 
> For the selector registers, specifying what the hidden portions hold
> is a must I think, at the very least for %cs and %ss.

Done. I've added the following:

The `cs` segment selector is set by Xen with a base of 0x0 and a limit
of 0xfffff. The attributes are set to 0x9b, which describes an
executable and readable code segment only accessible by the most
privileged level.

The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are
all set to the same values. Both the selector and the base are set to 0x0
and the limit to 0xfffff. The attributes are set to 0x93, which implies
a read and write data segment only accessible by the most privileged level.

>> MSR registers should be treated
>> +like native.
> 
> Not sure what this is intended to mean.

This was requested in the last review round, but I think that it is
already clear that no hypercalls should be used to write to MSRs, so
I've removed it.

>> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
>> +is used with the following values:
>> +
>> +    domid = DOMID_SELF
>> +    index = HVM_PARAM_CALLBACK_IRQ
>> +    value = (0x2 << 56) | vector_value
> 
> If we don't have #define-s for these two numbers, we urgently ought
> to add ones.

We already have defines for those two values (in xen.h and hvm/params.h
respectively).
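
As for the `value` field itself, a sketch of how it could be built in C is
below. The macro and function names are made up for the example and are not
taken from the Xen public headers. Note that the literal has to be 64 bits
wide: written as a plain `int`, `0x2 << 56` shifts by more than the width of
the type, which is undefined in C.

```c
#include <stdint.h>

/* Hypothetical names for the two literals used in the document. */
#define SKETCH_VIA_TYPE_VECTOR 0x2ULL
#define SKETCH_VIA_TYPE_SHIFT  56

/* Build the HVM_PARAM_CALLBACK_IRQ value for a given IDT vector. */
static uint64_t sketch_callback_via(uint8_t vector)
{
    return (SKETCH_VIA_TYPE_VECTOR << SKETCH_VIA_TYPE_SHIFT) | vector;
}
```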

>> +### IO APIC interrupt routing ###
>> +
>> +IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
>> +hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
>> +hypercall, as an example IRQ#9 is used here:
>> +
>> +    domid = DOMID_SELF
>> +    type = MAP_PIRQ_TYPE_GSI
>> +    index = 9
>> +    pirq = 9
>> +
>> +After this hypercall, `PHYSDEVOP_alloc_irq_vector` is used to allocate a vector:
>> +
>> +    irq = 9
>> +    vector = 0
>> +
>> +*TODO*: I'm not sure why we need those two hypercalls, and it's usage is not
>> +documented anywhere. Need to clarify what the parameters mean and what effect
>> +they have.
> 
> PHYSDEVOP_alloc_irq_vector has been a dummy for a very long time
> now - nothing should break if this call got omitted.
> 
>> +*NOTE*: when running as Dom0, the guest has to parse the interrupt overwrites
>> +found on the ACPI tables and notify Xen about them.
> 
> ... overrides ...

Removed the mention of PHYSDEVOP_alloc_irq_vector and fixed the spelling
mistake.

Thanks for the review, Roger.


* Re: [PATCH] docs: add PVH specification
  2014-09-18 11:00   ` Roger Pau Monné
@ 2014-09-18 12:32     ` Jan Beulich
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Beulich @ 2014-09-18 12:32 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, David Vrabel

>>> On 18.09.14 at 13:00, <roger.pau@citrix.com> wrote:
> On 17/09/14 at 13.59, Jan Beulich wrote:
>>>>> On 16.09.14 at 17:53, <roger.pau@citrix.com> wrote:
>>> +All the segment selectors (`cs`, `ds`, `ss`, `es`, `fs` and `gs`), the
>>> +`FS.base` and `GS.base` MSRs are zeroed out.
>> 
>> For the selector registers, specifying what the hidden portions hold
>> is a must I think, at the very least for %cs and %ss.
> 
> Done. I've added the following:
> 
> The `cs` segment selector is set by Xen with a base of 0x0 and a limit
> of 0xfffff. The attributes are set to 0x9b, which describes an
> executable and readable code segment only accessible by the most
> privileged level.

Considering that we're talking of 64-bit guests only at this point,
base and limit of %cs don't matter at all. What does matter and
is not mentioned above is that CS.L is set and CS.DB is clear. (And
if talking about the limit, 0xfffff is only the raw value - together
with CS.G it would end up being 0xffffffff.)

> The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are
> all set to the same values. Both the selector and the base is set to 0x0
> and the limit to 0xfffff. The attributes are set to 0x93, which implies
> a read and write data segment only accessible by the most privileged level.

Mostly the same here: Limit doesn't matter, and base matters only
for %fs and %gs. For other than %ss, most of the other attributes
(including privilege level) don't matter either.

That said, I don't mind spelling out what we do when there's no
foreseeable reason for us to ever change those. I.e. all I'd really
like to see changed is the attributes of CS to be fully specified,
and the confusion about the limit removed - if you want to keep the
information on base and limit, just say "flat at base zero" or some
such.
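
To make the numbers in this exchange concrete, here is a small sketch that
decodes the low attribute byte (0x9b and 0x93 above) and computes the effective
limit when the G bit is set. The struct and function names are made up for
illustration. It only covers the low byte, so the L and D/B bits discussed
above are outside of what it decodes.

```c
#include <stdint.h>

/* Fields of the low attribute byte of an x86 segment descriptor. */
struct sketch_seg_attr {
    unsigned type;    /* bits 0-3: segment type */
    unsigned s;       /* bit 4: 1 = code/data, 0 = system */
    unsigned dpl;     /* bits 5-6: descriptor privilege level */
    unsigned present; /* bit 7: segment present */
};

static struct sketch_seg_attr sketch_decode_attr(uint8_t attr)
{
    struct sketch_seg_attr a = {
        .type = attr & 0xfu,
        .s = (attr >> 4) & 1u,
        .dpl = (attr >> 5) & 3u,
        .present = (attr >> 7) & 1u,
    };
    return a;
}

/* Effective byte limit for a raw 20-bit limit, honouring the G bit. */
static uint32_t sketch_effective_limit(uint32_t raw_limit, int granular)
{
    return granular ? (raw_limit << 12) | 0xfffu : raw_limit;
}
```

With these helpers 0x9b decodes to a present, DPL 0 code segment of type 0xb
(execute/read, accessed), 0x93 to a present, DPL 0 data segment of type 0x3
(read/write, accessed), and a raw limit of 0xfffff with G set gives 0xffffffff.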

Jan
