From: Roger Pau Monné
Subject: Re: [PATCH v3] docs: add PVH specification
Date: Mon, 22 Sep 2014 13:36:41 +0200
Message-ID: <542009C9.7000503@citrix.com>
References: <1411060764-4016-1-git-send-email-roger.pau@citrix.com> <20140920191510.GA2882@laptop.dumpdata.com>
In-Reply-To: <20140920191510.GA2882@laptop.dumpdata.com>
To: Konrad Rzeszutek Wilk
Cc: xen-devel@lists.xenproject.org, David Vrabel, Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 20/09/14 at 21:15, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 18, 2014 at 07:19:24PM +0200, Roger Pau Monne wrote:
>> Introduce a document that describes the interfaces used on PVH. This
>> document has been designed from a guest OS point of view (i.e.: what a guest
>> needs to do in order to support PVH).
>>
>> Signed-off-by: Roger Pau Monné
>> Acked-by: David Vrabel
>> Cc: Jan Beulich
>> Cc: Mukesh Rathor
>> Cc: Konrad Rzeszutek Wilk
>> Cc: David Vrabel
>> ---
>> The document is still far from complete IMHO, but it might be best to just
>> commit what we currently have rather than wait for a full document.
>>
>> I will try to fill the gaps as I go implementing new features on FreeBSD.
>>
>> I've retained David's Ack from v2 in this version.
>> ---
>>  docs/misc/pvh.markdown | 367 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 367 insertions(+)
>>  create mode 100644 docs/misc/pvh.markdown
>>
>> diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
>> new file mode 100644
>> index 0000000..120ede7
>> --- /dev/null
>> +++ b/docs/misc/pvh.markdown
>> @@ -0,0 +1,367 @@
>> +# PVH Specification #
>> +
>> +## Rationale ##
>> +
>> +PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
>> +on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
>> +virtualization extensions present in modern x86 CPUs in order to
>> +improve performance.
>> +
>> +PVH is considered a mix between PV and HVM, and can be seen as a PV guest
>> +that runs inside of an HVM container, or as a PVHVM guest without any emulated
>> +devices. The design goal of PVH is to provide the best performance possible and
>> +to reduce the amount of modifications needed for a guest OS to run in this mode
>> +(compared to pure PV).
>> +
>> +This document tries to describe the interfaces used by PVH guests, focusing
>> +on how an OS should make use of them in order to support PVH.
>> +
>> +## Early boot ##
>> +
>> +PVH guests use the PV boot mechanism, that means that the kernel is loaded and
>> +directly launched by Xen (by jumping into the entry point). In order to do this
>> +Xen ELF Notes need to be added to the guest kernel, so that they contain the
>> +information needed by Xen.
>> +Here is an example of the ELF Notes added to the
>> +FreeBSD amd64 kernel in order to boot as PVH:
>> +
>> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
>> +    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
>> +    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
>> +    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
>> +
>> +On the linux side, the above can be found in `arch/x86/xen/xen-head.S`.
>
> s/linux/Linux/

Done.

>
>> +
>> +It is important to highlight the following notes:
>> +
>> +  * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry
>> +    point.
>> +  * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
>> +    hypercall page inside of the guest kernel (this memory region will be filled
>> +    by Xen prior to booting).
>> +  * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel.
>> +    In the example above the kernel is only able to boot as a PVH guest, but
>> +    those options can be mixed with the ones used by pure PV guests in order to
>> +    have a kernel that supports both PV and PVH (like Linux). The list of
>> +    options available can be found in the `features.h` public header.
>> +
>
>
> Note that 'hvm_callback_vector' is in XEN_ELFNOTE_FEATURES. Older hypervisors will
> balk at this being part of it, so it can also be put in
> XEN_ELFNOTE_SUPPORTED_FEATURES which older hypervisors will ignore.

Added to the XEN_ELFNOTE_FEATURES comment, thanks for the info.

>> +Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
>> +paging enabled (either long mode or protected mode with paging turned on
>> +depending on the kernel bitness) and some basic page tables setup. An important
>> +distinction for a 64bit PVH is that it is launched at privilege level 0 as
>> +opposed to a 64bit PV guest which is launched at privilege level 3.
>> +
>> +Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
>> +memory address where Xen has placed the `start_info` structure. The `rsp` (`esp`
>> +on 32bits) will point to the top of an initial single page stack, that can be
>> +used by the guest kernel. The `start_info` structure contains all the info the
>> +guest needs in order to initialize. More information about the contents can be
>> +found on the `xen.h` public header.
>
> s/on/in/

>> +
>> +### Initial amd64 control registers values ###
>> +
>> +Initial values for the control registers are set up by Xen before booting the
>> +guest kernel. The guest kernel can expect to find the following features
>> +enabled by Xen.
>> +
>> +`CR0` has the following bits set by Xen:
>> +
>> + * PE (bit 0): protected mode enable.
>> + * ET (bit 4): 387 or newer processor.
>> + * PG (bit 31): paging enabled.
>
> Also TS (at least that is what the Linux code says:
>
> /* Some of these are setup in 'secondary_startup_64'. The others:
>  * X86_CR0_TS, X86_CR0_PE, X86_CR0_ET are set by Xen for HVM guests
>  * (which PVH shared codepaths), while X86_CR0_PG is for PVH. */
>
> Perhaps it is incorrect?

I think this comment is outdated/incorrect.
This is the CR0 value I see on a FreeBSD PVH start-of-day:

0x80000011 (PE, ET and PG bits set)

>
>> +
>> +`CR4` has the following bits set by Xen:
>> +
>> + * PAE (bit 5): PAE enabled.
>> +
>> +And finally in `EFER` the following features are enabled:
>> +
>> + * LME (bit 8): Long mode enable.
>> + * LMA (bit 10): Long mode active.
>> +
>> +At least the following flags in `EFER` are guaranteed to be disabled:
>> +
>> + * SCE (bit 0): System call extensions disabled.
>> + * NXE (bit 11): No-Execute disabled.
>> +
>> +There's no guarantee about the state of the other bits in the `EFER` register.
>> +
>> +All the segment selectors are set with a flat base at zero.
>> +
>> +The `cs` segment selector attributes are set to 0x0a09b, which describes an
>> +executable and readable code segment only accessible by the most privileged
>> +level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag
>> +unset).
>> +
>> +The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set
>> +to the same values. The attributes are set to 0x0c093, which implies a read and
>> +write data segment only accessible by the most privileged level.
>
> I think the SS, ES, FS, GS are set to the null selector in 64-bit mode.

This is what I see when I dump the vcpu state of a PVH guest created
with the -p option (so that the guest is never started):

(XEN) CS: sel=0x0000, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000
(XEN) DS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) SS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) ES: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) FS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) GS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000

Am I missing something?
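
For anyone who wants to check the numbers quoted above, here is a minimal C sketch that decodes them. It assumes the `attr` field in the dump uses the VMX access-rights layout (type in bits 0-3, S in bit 4, DPL in bits 5-6, P in bit 7, L in bit 13, D/B in bit 14, G in bit 15). The struct and function names are made up for illustration and are not taken from Xen or FreeBSD.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR0 bits listed in the quoted document. */
#define CR0_PE (1u << 0)   /* protected mode enable */
#define CR0_ET (1u << 4)   /* 387 or newer processor */
#define CR0_PG (1u << 31)  /* paging enabled */

/* Hypothetical helper type: segment attribute fields, ASSUMING the
 * VMX access-rights bit layout described in the lead-in. */
struct seg_attr {
    unsigned type;  /* bits 0-3: segment type */
    bool s;         /* bit 4: 1 = code/data, 0 = system */
    unsigned dpl;   /* bits 5-6: descriptor privilege level */
    bool present;   /* bit 7 */
    bool l;         /* bit 13: 64-bit code segment */
    bool db;        /* bit 14: default operation size */
    bool g;         /* bit 15: granularity */
};

/* Split a raw attribute value into its fields. */
static struct seg_attr decode_seg_attr(uint32_t attr)
{
    struct seg_attr a = {
        .type    = attr & 0xf,
        .s       = (attr >> 4) & 1,
        .dpl     = (attr >> 5) & 3,
        .present = (attr >> 7) & 1,
        .l       = (attr >> 13) & 1,
        .db      = (attr >> 14) & 1,
        .g       = (attr >> 15) & 1,
    };
    return a;
}

/* True when exactly PE, ET and PG are set and nothing else (so TS is clear). */
static bool cr0_is_pe_et_pg_only(uint32_t cr0)
{
    return cr0 == (CR0_PE | CR0_ET | CR0_PG);
}
```

Under that assumption, 0x0a09b decodes to type 0xb (execute/read code, accessed), DPL 0, present, L set and D clear, and 0x0c093 decodes to type 0x3 (read/write data, accessed), DPL 0, present, L clear. Both agree with the description in the document.
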
I don't see a difference between SS, ES, FS, GS and DS. In
construct_vmcs on Xen we seem to set all the segments to the same values
with the exception of CS attributes.

>> +
>> +The `FS.base` and `GS.base` MSRs are zeroed out.
>
> .. and 'KERNEL_GS.base'

Done.

>> +
>> +The `IDT` and `GDT` are also zeroed, so the guest must be especially careful to
>> +not trigger a fault until after they have been properly set. The way of setting
>> +the IDT and the GDT is using the native instructions as would be done on bare
>> +metal.
>> +
>> +The `RFLAGS` register is guaranteed to be clear when jumping into the kernel
>> +entry point, with the exception of the reserved bit 1 set.

[...]

>> +## Interrupts ##
>> +
>> +All interrupts on PVH guests are routed over event channels, see
>> +[Event Channel Internals][event_channels] for more detailed information about
>> +event channels. In order to inject interrupts into the guest an IDT vector is
>> +used. This is the same mechanism used on PVHVM guests, and allows having
>> +per-cpu interrupts that can be used to deliver timers or IPIs.
>> +
>> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
>> +is used with the following values:
>> +
>> +    domid = DOMID_SELF
>> +    index = HVM_PARAM_CALLBACK_IRQ
>> +    value = (0x2 << 56) | vector_value
>
> And naturally the OS has to program the IDT for the 'vector_value' using
> the baremetal mechanism.

Added.

[...]

>> +## CPUID ##
>> +
>> +*TODO*: describe which cpuid flags a guest should ignore and also which flags
>> +describe features that can be used. It would also be good to describe the set of
>> +cpuid flags that will always be present when running as PVH.
>
> Perhaps start with:
> The cpuid instruction that should be used is the normal 'cpuid', not
> the emulated 'cpuid' that PV guests usually require.

Done.
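
Going back to the `HVM_PARAM_CALLBACK_IRQ` value in the Interrupts section quoted above: a minimal C sketch of how a guest could build it. The macro and function names are hypothetical and only mirror the `(0x2 << 56) | vector_value` expression from the patch. The hypercall itself is left out, since its wrapper differs between guest OSes.

```c
#include <stdint.h>

/* Hypothetical names; the values come from the expression in the patch. */
#define CALLBACK_VIA_TYPE_VECTOR 2ULL  /* deliver through an IDT vector */
#define CALLBACK_VIA_TYPE_SHIFT  56

/* Build the 64-bit value to pass with HVM_PARAM_CALLBACK_IRQ. */
static uint64_t pvh_callback_param(uint8_t vector)
{
    /* The ULL suffix matters: shifting a plain 32-bit int by 56 bits is
     * undefined behaviour in C, so the type bits must be 64-bit first. */
    return (CALLBACK_VIA_TYPE_VECTOR << CALLBACK_VIA_TYPE_SHIFT) |
           (uint64_t)vector;
}
```

The result would be stored in the `value` field, with `domid` set to `DOMID_SELF` and `index` set to `HVM_PARAM_CALLBACK_IRQ`, before issuing `HVMOP_set_param`. The same vector also has to be programmed in the guest IDT, as noted in the review.
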
>
>> +
>> +## Final notes ##
>> +
>> +All the other hardware functionality not described in this document should be
>> +assumed to be performed in the same way as native.
>> +
>> +[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
>
> And with those changes:
>
> Reviewed-by: Konrad Rzeszutek Wilk
>
>> --
>> 1.8.5.2 (Apple Git-48)
>>
>