From: Konrad Rzeszutek Wilk
Subject: Re: [PATCH v3] docs: add PVH specification
Date: Sat, 20 Sep 2014 15:15:10 -0400
Message-ID: <20140920191510.GA2882@laptop.dumpdata.com>
References: <1411060764-4016-1-git-send-email-roger.pau@citrix.com>
In-Reply-To: <1411060764-4016-1-git-send-email-roger.pau@citrix.com>
To: Roger Pau Monne
Cc: xen-devel@lists.xenproject.org, David Vrabel, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On Thu, Sep 18, 2014 at 07:19:24PM +0200, Roger Pau Monne wrote:
> Introduce a document that describes the interfaces used on PVH. This
> document has been designed from a guest OS point of view (i.e.: what a guest
> needs to do in order to support PVH).
>
> Signed-off-by: Roger Pau Monné
> Acked-by: David Vrabel
> Cc: Jan Beulich
> Cc: Mukesh Rathor
> Cc: Konrad Rzeszutek Wilk
> Cc: David Vrabel
> ---
> The document is still far from complete IMHO, but it might be best to just
> commit what we currently have rather than wait for a full document.
>
> I will try to fill the gaps as I go implementing new features on FreeBSD.
>
> I've retained David's Ack from v2 in this version.
> ---
>  docs/misc/pvh.markdown | 367 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 367 insertions(+)
>  create mode 100644 docs/misc/pvh.markdown
>
> diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
> new file mode 100644
> index 0000000..120ede7
> --- /dev/null
> +++ b/docs/misc/pvh.markdown
> @@ -0,0 +1,367 @@
> +# PVH Specification #
> +
> +## Rationale ##
> +
> +PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
> +on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
> +virtualization extensions present in modern x86 CPUs in order to
> +improve performance.
> +
> +PVH is considered a mix between PV and HVM, and can be seen as a PV guest
> +that runs inside of an HVM container, or as a PVHVM guest without any emulated
> +devices. The design goal of PVH is to provide the best performance possible and
> +to reduce the amount of modifications needed for a guest OS to run in this mode
> +(compared to pure PV).
> +
> +This document tries to describe the interfaces used by PVH guests, focusing
> +on how an OS should make use of them in order to support PVH.
> +
> +## Early boot ##
> +
> +PVH guests use the PV boot mechanism, that means that the kernel is loaded and
> +directly launched by Xen (by jumping into the entry point). In order to do this
> +Xen ELF Notes need to be added to the guest kernel, so that they contain the
> +information needed by Xen.
> +Here is an example of the ELF Notes added to the
> +FreeBSD amd64 kernel in order to boot as PVH:
> +
> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
> +    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
> +    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
> +    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
> +    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
> +    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
> +    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
> +    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
> +    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
> +    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
> +    ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
> +    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
> +    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
> +
> +On the linux side, the above can be found in `arch/x86/xen/xen-head.S`.

s/linux/Linux/

> +
> +It is important to highlight the following notes:
> +
> +  * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry
> +    point.
> +  * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
> +    hypercal page inside of the guest kernel (this memory region will be filled
> +    by Xen prior to booting).
> +  * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel.
> +    In the example above the kernel is only able to boot as a PVH guest, but
> +    those options can be mixed with the ones used by pure PV guests in order to
> +    have a kernel that supports both PV and PVH (like Linux). The list of
> +    options available can be found in the `features.h` public header.
> +

Note that 'hvm_callback_vector' is in XEN_ELFNOTE_FEATURES.
Older hypervisors will balk at this being part of it, so it can also be put in
XEN_ELFNOTE_SUPPORTED_FEATURES, which older hypervisors will ignore.

> +Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
> +paging enabled (either long mode or protected mode with paging turned on
> +depending on the kernel bitness) and some basic page tables setup. An important
> +distinction for a 64bit PVH is that it is launched at privilege level 0 as
> +opposed to a 64bit PV guest which is launched at privilege level 3.
> +
> +Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
> +memory address were Xen has placed the `start_info` structure. The `rsp` (`esp`
> +on 32bits) will point to the top of an initial single page stack, that can be
> +used by the guest kernel. The `start_info` structure contains all the info the
> +guest needs in order to initialize. More information about the contents can be
> +found on the `xen.h` public header.

s/on/in/

> +
> +### Initial amd64 control registers values ###
> +
> +Initial values for the control registers are set up by Xen before booting the
> +guest kernel. The guest kernel can expect to find the following features
> +enabled by Xen.
> +
> +`CR0` has the following bits set by Xen:
> +
> +  * PE (bit 0): protected mode enable.
> +  * ET (bit 4): 387 or newer processor.
> +  * PG (bit 31): paging enabled.

Also TS (at least that is what the Linux code says):

    /* Some of these are setup in 'secondary_startup_64'. The others:
     * X86_CR0_TS, X86_CR0_PE, X86_CR0_ET are set by Xen for HVM guests
     * (which PVH shared codepaths), while X86_CR0_PG is for PVH. */

Perhaps it is incorrect?

> +
> +`CR4` has the following bits set by Xen:
> +
> +  * PAE (bit 5): PAE enabled.
> +
> +And finally in `EFER` the following features are enabled:
> +
> +  * LME (bit 8): Long mode enable.
> +  * LMA (bit 10): Long mode active.
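
For illustration only, the CR0/CR4/EFER bits listed above could be turned
into an early sanity check along the following lines. This is a sketch, not
code from Xen, FreeBSD or Linux: the `PVH_*` macro names and both functions
are made up for the example, and the register values are assumed to have
been read by the caller. TS is deliberately left out of the check, since
whether Xen sets it is exactly the open question raised above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bit positions as listed in the quoted text. The PVH_* names are
 * hypothetical and exist only for this example.
 */
#define PVH_CR0_PE   (UINT64_C(1) << 0)   /* protected mode enable */
#define PVH_CR0_ET   (UINT64_C(1) << 4)   /* 387 or newer processor */
#define PVH_CR0_PG   (UINT64_C(1) << 31)  /* paging enabled */
#define PVH_CR4_PAE  (UINT64_C(1) << 5)   /* PAE enabled */
#define PVH_EFER_LME (UINT64_C(1) << 8)   /* long mode enable */
#define PVH_EFER_LMA (UINT64_C(1) << 10)  /* long mode active */

/* True if every bit in 'required' is also set in 'value'. */
static bool has_bits(uint64_t value, uint64_t required)
{
    return (value & required) == required;
}

/*
 * Check only the bits the document says are set at the amd64 entry
 * point. Extra bits (for example CR0.TS) are tolerated, not required.
 */
static bool pvh_amd64_entry_state_ok(uint64_t cr0, uint64_t cr4,
                                     uint64_t efer)
{
    return has_bits(cr0, PVH_CR0_PE | PVH_CR0_ET | PVH_CR0_PG) &&
           has_bits(cr4, PVH_CR4_PAE) &&
           has_bits(efer, PVH_EFER_LME | PVH_EFER_LMA);
}
```

A check like this only tests for the presence of the promised bits, so it
stays valid whichever way the TS question is settled.
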
> +
> +At least the following flags in `EFER` are guaranteed to be disabled:
> +
> +  * SCE (bit 0): System call extensions disabled.
> +  * NXE (bit 11): No-Execute disabled.
> +
> +There's no guarantee about the state of the other bits in the `EFER` register.
> +
> +All the segments selectors are set with a flat base at zero.
> +
> +The `cs` segment selector attributes are set to 0x0a09b, which describes an
> +executable and readable code segment only accessible by the most privileged
> +level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag
> +unset).
> +
> +The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set
> +to the same values. The attributes are set to 0x0c093, which implies a read and
> +write data segment only accessible by the most privileged level.

I think the SS, ES, FS, GS are set to the null selector in 64-bit mode.

> +
> +The `FS.base` and `GS.base` MSRs are zeroed out.

.. and 'KERNEL_GS.base'

> +
> +The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to
> +not trigger a fault until after they have been properly set. The way of setting
> +the IDT and the GDT is using the native instructions as would be done on bare
> +metal.
> +
> +The `RFLAGS` register is guaranteed to be clear when jumping into the kernel
> +entry point, with the exception of the reserved bit 1 set.
> +
> +## Memory ##
> +
> +Since PVH guests rely on virtualization extensions provided by the CPU, they
> +have access to a hardware virtualized MMU, which means page-table related
> +operations should use the same instructions used on native.
> +
> +There are however some differences with native. The usage of native MTRR
> +operations is forbidden, and `XENPF_*_memtype` hypercalls should be used
> +instead. This can be avoided by simply not using MTRR and setting all the
> +memory attributes using PAT, which doesn't require the usage of any hypercalls.
> +
> +Since PVH doesn't use a BIOS in order to boot, the physical memory map has
> +to be retrieved using the `XENMEM_memory_map` hypercall, which will return
> +an e820 map. This memory map might contain holes that describe MMIO regions,
> +that will be already setup by Xen.
> +
> +*TODO*: we need to figure out what to do with MMIO regions, right now Xen
> +sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We
> +need to decide what to do with MMIO regions above 4GB on Dom0, and what to do
> +for PVH DomUs with pci-passthrough.
> +
> +In the case of a guest started with memory != maxmem, the e820 memory map
> +returned by Xen will contain the memory up to maxmem. The guest has to be very
> +careful to only use the lower memory pages up to the value contained in
> +`start_info->nr_pages` because any memory page above that value will not be
> +populated.
> +
> +## Physical devices ##
> +
> +When running as Dom0 the guest OS has the ability to interact with the physical
> +devices present in the system. A note should be made that PVH guests require
> +a working IOMMU in order to interact with physical devices.
> +
> +The first step in order to manipulate the devices is to make Xen aware of
> +them. Due to the fact that all the hardware description on x86 comes from
> +ACPI, Dom0 is responsible of parsing the ACPI tables and notify Xen about the
> +devices it finds. This is done with the `PHYSDEVOP_pci_device_add` hypercall.
> +
> +*TODO*: explain the way to register the different kinds of PCI devices, like
> +devices with virtual functions.
> +
> +## Interrupts ##
> +
> +All interrupts on PVH guests are routed over event channels, see
> +[Event Channel Internals][event_channels] for more detailed information about
> +event channels. In order to inject interrupts into the guest an IDT vector is
> +used.
> +This is the same mechanism used on PVHVM guests, and allows having
> +per-cpu interrupts that can be used to deliver timers or IPIs.
> +
> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> +is used with the following values:
> +
> +    domid = DOMID_SELF
> +    index = HVM_PARAM_CALLBACK_IRQ
> +    value = (0x2 << 56) | vector_value

And naturally the OS has to program the IDT for the 'vector_value'
using the baremetal mechanism.

> +
> +In order to know which event channel has fired, we need to look into the
> +information provided in the `shared_info` structure. The `evtchn_pending`
> +array is used as a bitmap in order to find out which event channel has
> +fired. Event channels can also be masked by setting it's port value in the
> +`shared_info->evtchn_mask` bitmap.
> +
> +### Interrupts from physical devices ###
> +
> +When running as Dom0 (or when using pci-passthrough) interrupts from physical
> +devices are routed over event channels. There are 3 different kind of
> +physical interrupts that can be routed over event channels by Xen: IO APIC,
> +MSI and MSI-X interrupts.
> +
> +Since physical interrupts usually need EOI (End Of Interrupt), Xen allows the
> +registration of a memory region that will contain whether a physical interrupt
> +needs EOI from the guest or not. This is done with the
> +`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a parameter containing the
> +physical address of the memory page that will act as a bitmap. Then in order to
> +find out if an IRQ needs EOI or not, the OS can perform a simple bit test on the
> +memory page using the PIRQ value.
> +
> +### IO APIC interrupt routing ###
> +
> +IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
> +hypercalls.
> +First the IRQ is registered using the `PHYSDEVOP_map_pirq`
> +hypercall, as an example IRQ#9 is used here:
> +
> +    domid = DOMID_SELF
> +    type = MAP_PIRQ_TYPE_GSI
> +    index = 9
> +    pirq = 9
> +
> +The IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also
> +be configured using the `PHYSDEVOP_setup_gsi` hypercall:
> +
> +    gsi = 9 # This is the IRQ value.
> +    triggering = 0
> +    polarity = 0
> +
> +In this example the IRQ would be configured to use edge triggering and high
> +polarity.
> +
> +Finally the PIRQ can be bound to an event channel using the
> +`EVTCHNOP_bind_pirq`, that will return the event channel port the PIRQ has been
> +assigned. After this the event channel will be ready for delivery.
> +
> +*NOTE*: when running as Dom0, the guest has to parse the interrupt overrides
> +found on the ACPI tables and notify Xen about them.
> +
> +### MSI ###
> +
> +In order to configure MSI interrupts for a device, Xen must be made aware of
> +it's presence first by using the `PHYSDEVOP_pci_device_add` as described above.
> +Then the `PHYSDEVOP_map_pirq` hypercall is used:
> +
> +    domid = DOMID_SELF
> +    type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
> +    index = -1
> +    pirq = -1
> +    bus = pci_device_bus
> +    devfn = pci_device_function
> +    entry_nr = number of MSI interrupts
> +
> +The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt
> +source is being configured. On devices that support MSI interrupt groups
> +`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the
> +number of MSI interrupts in the `entry_nr` field.
> +
> +The values in the `bus` and `devfn` field should be the same as the ones used
> +when registering the device with `PHYSDEVOP_pci_device_add`.
> +
> +### MSI-X ###
> +
> +*TODO*: how to register/use them.
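
Looking back at the whole Interrupts section, two of the steps it describes
may be easier to follow with a small sketch: building the
`HVM_PARAM_CALLBACK_IRQ` value and scanning the pending/mask bitmaps. The
code below is only an illustration under assumptions. The helper names are
hypothetical, the bitmaps are plain arrays standing in for
`shared_info->evtchn_pending` and `shared_info->evtchn_mask`, and a real
handler would have to test and clear bits atomically, which is ignored
here. One detail worth spelling out: written literally in C, `0x2 << 56`
shifts an `int` past its width, so the constant needs a 64-bit type.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Value for HVM_PARAM_CALLBACK_IRQ: type 0x2 in the top byte, the IDT
 * vector in the low byte. The cast keeps the shift in 64-bit arithmetic.
 */
static uint64_t callback_via_vector(uint8_t vector)
{
    return ((uint64_t)0x2 << 56) | vector;
}

/*
 * Return the lowest port that is pending and not masked, or -1 if there
 * is none. 'pending' and 'mask' are nwords-long bitmaps, one bit per
 * event channel port. Non-atomic; for illustration only.
 */
static int first_pending_port(const unsigned long *pending,
                              const unsigned long *mask, size_t nwords)
{
    const unsigned int bits = sizeof(unsigned long) * CHAR_BIT;

    for (size_t w = 0; w < nwords; w++) {
        unsigned long ready = pending[w] & ~mask[w];

        if (ready == 0)
            continue;
        for (unsigned int b = 0; b < bits; b++)
            if ((ready >> b) & 1UL)
                return (int)(w * bits + b);
    }
    return -1;
}
```

The same kind of single-bit test, indexed by the PIRQ value, is what the
text describes for the EOI bitmap page.
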
> +
> +## Event timers and timecounters ##
> +
> +Since some hardware is not available on PVH (like the local APIC), Xen provides
> +the OS with suitable replacements in order to get the same functionality. One
> +of them is the timer interface. Using a set of hypercalls, a guest OS can set
> +event timers that will deliver and event channel interrupt to the guest.
> +
> +In order to use the timer provided by Xen the guest OS first needs to register
> +a VIRQ event channel to be used by the timer to deliver the interrupts. The
> +event channel is registered using the `EVTCHNOP_bind_virq` hypercall, that
> +only takes two parameters:
> +
> +    virq = VIRQ_TIMER
> +    vcpu = vcpu_id
> +
> +The port that's going to be used by Xen in order to deliver the interrupt is
> +returned in the `port` field. Once the interrupt is set, the timer can be
> +programmed using the `VCPUOP_set_singleshot_timer` hypercall.
> +
> +    flags = VCPU_SSHOTTMR_future
> +    timeout_abs_ns = absolute value when the timer should fire
> +
> +It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must
> +be executed from the same vCPU where the timer should fire, or else Xen will
> +refuse to set it. This is a single-shot timer, so it must be set by the OS
> +every time it fires if a periodic timer is desired.
> +
> +Xen also shares a memory region with the guest OS that contains time related
> +values that are updated periodically. This values can be used to implement a
> +timecounter or to obtain the current time. This information is placed inside of
> +`shared_info->vcpu_info[vcpu_id].time`.
> +The uptime (time since the guest has
> +been launched) can be calculated using the following expression and the values
> +stored in the `vcpu_time_info` struct:
> +
> +    system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
> +
> +The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
> +calculated using the above value, plus the timeout the system wants to set.
> +
> +If the OS also wants to obtain the current wallclock time, the value calculated
> +above has to be added to the values found in `shared_info->wc_sec` and
> +`shared_info->wc_nsec`.
> +
> +## SMP discover and bring up ##
> +
> +The process of bringing up secondary CPUs is obviously different from native,
> +since PVH doesn't have a local APIC. The first thing to do is to figure out
> +how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall,
> +using for example this simple loop:
> +
> +    for (i = 0; i < MAXCPU; i++) {
> +        ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
> +        if (ret >= 0)
> +            /* vCPU#i is present */
> +    }
> +
> +Note than when running as Dom0, the ACPI tables might report a different number
> +of available CPUs. This is because the value on the ACPI tables is the
> +number of physical CPUs the host has, and it might bear no resemblance with the
> +number of vCPUs Dom0 actually has so it should be ignored.
> +
> +In order to bring up the secondary vCPUs they must be configured first. This is
> +achieved using the `VCPUOP_initialise` hypercall. A valid context has to be
> +passed to the vCPU in order to boot. The relevant fields for PVH guests are
> +the following:
> +
> +  * `flags`: contains `VGCF_*` flags (see `arch-x86/xen.h` public header).
> +  * `user_regs`: struct that contains the register values that will be set on
> +    the vCPU before booting.
> +    All GPRs are available to be set, however, the
> +    most relevant ones are `rip` and `rsp` in order to set the start address
> +    and the stack. Please note, all selectors must be null.
> +  * `ctrlreg[3]`: contains the address of the page tables that will be used by
> +    the vCPU. Other control registers should be set to zero, or else the
> +    hypercall will fail with -EINVAL.
> +
> +After the vCPU is initialized with the proper values, it can be started by
> +using the `VCPUOP_up` hypercall. The values of the other control registers of
> +the vCPU will be the same as the ones described in the `control registers`
> +section.
> +
> +Examples about how to bring up secondary CPUs can be found on the FreeBSD
> +code base in `sys/x86/xen/pv.c` and on Linux `arch/x86/xen/smp.c`.
> +
> +## Control operations (reboot/shutdown) ##
> +
> +Reboot and shutdown operations on PVH guests are performed using hypercalls.
> +In order to issue a reboot, a guest must use the `SHUTDOWN_reboot` hypercall.
> +In order to perform a power off from a guest DomU, the `SHUTDOWN_poweroff`
> +hypercall should be used.
> +
> +The way to perform a full system power off from Dom0 is different than what's
> +done in a DomU guest. In order to perform a power off from Dom0 the native
> +ACPI path should be followed, but the guest should not write the `SLP_EN`
> +bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall
> +should be used, filling the following data in the `xen_platform_op` struct:
> +
> +    cmd = XENPF_enter_acpi_sleep
> +    interface_version = XENPF_INTERFACE_VERSION
> +    u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
> +    u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
> +
> +This will allow Xen to do it's clean up and to power off the system.
> +If the
> +host is using hardware reduced ACPI, the following field should also be set:
> +
> +    u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
> +
> +## CPUID ##
> +
> +*TDOD*: describe which cpuid flags a guest should ignore and also which flags
> +describe features can be used. It would also be good to describe the set of
> +cpuid flags that will always be present when running as PVH.

Perhaps start with:

The cpuid instruction that should be used is the normal 'cpuid', not
the emulated 'cpuid' that PV guests usually require.

> +
> +## Final notes ##
> +
> +All the other hardware functionality not described in this document should be
> +assumed to be performed in the same way as native.
> +
> +[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals

And with those changes:

Reviewed-by: Konrad Rzeszutek Wilk

> --
> 1.8.5.2 (Apple Git-48)