From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xenproject.org,
David Vrabel <david.vrabel@citrix.com>,
Jan Beulich <JBeulich@suse.com>
Subject: Re: [PATCH v3] docs: add PVH specification
Date: Mon, 22 Sep 2014 13:36:41 +0200
Message-ID: <542009C9.7000503@citrix.com>
In-Reply-To: <20140920191510.GA2882@laptop.dumpdata.com>
On 20/09/14 at 21:15, Konrad Rzeszutek Wilk wrote:
> On Thu, Sep 18, 2014 at 07:19:24PM +0200, Roger Pau Monne wrote:
>> Introduce a document that describes the interfaces used on PVH. This
>> document has been designed from a guest OS point of view (i.e.: what a guest
>> needs to do in order to support PVH).
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>> Acked-by: David Vrabel <david.vrabel@citrix.com>
>> Cc: Jan Beulich <JBeulich@suse.com>
>> Cc: Mukesh Rathor <mukesh.rathor@oracle.com>
>> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> Cc: David Vrabel <david.vrabel@citrix.com>
>> ---
>> The document is still far from complete IMHO, but it might be best to just
>> commit what we currently have rather than wait for a full document.
>>
>> I will try to fill the gaps as I go implementing new features on FreeBSD.
>>
>> I've retained David's Ack from v2 in this version.
>> ---
>> docs/misc/pvh.markdown | 367 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 367 insertions(+)
>> create mode 100644 docs/misc/pvh.markdown
>>
>> diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
>> new file mode 100644
>> index 0000000..120ede7
>> --- /dev/null
>> +++ b/docs/misc/pvh.markdown
>> @@ -0,0 +1,367 @@
>> +# PVH Specification #
>> +
>> +## Rationale ##
>> +
>> +PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
>> +on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
>> +virtualization extensions present in modern x86 CPUs in order to
>> +improve performance.
>> +
>> +PVH is considered a mix between PV and HVM, and can be seen as a PV guest
>> +that runs inside of an HVM container, or as a PVHVM guest without any emulated
>> +devices. The design goal of PVH is to provide the best performance possible and
>> +to reduce the amount of modifications needed for a guest OS to run in this mode
>> +(compared to pure PV).
>> +
>> +This document tries to describe the interfaces used by PVH guests, focusing
>> +on how an OS should make use of them in order to support PVH.
>> +
>> +## Early boot ##
>> +
>> +PVH guests use the PV boot mechanism, that means that the kernel is loaded and
>> +directly launched by Xen (by jumping into the entry point). In order to do this
>> +Xen ELF Notes need to be added to the guest kernel, so that they contain the
>> +information needed by Xen. Here is an example of the ELF Notes added to the
>> +FreeBSD amd64 kernel in order to boot as PVH:
>> +
>> + ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS, .asciz, "FreeBSD")
>> + ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION, .asciz, __XSTRING(__FreeBSD_version))
>> + ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION, .asciz, "xen-3.0")
>> + ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE, .quad, KERNBASE)
>> + ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET, .quad, KERNBASE)
>> + ELFNOTE(Xen, XEN_ELFNOTE_ENTRY, .quad, xen_start)
>> + ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad, hypercall_page)
>> + ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW, .quad, HYPERVISOR_VIRT_START)
>> + ELFNOTE(Xen, XEN_ELFNOTE_FEATURES, .asciz, "writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
>> + ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE, .asciz, "yes")
>> + ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID, .long, PG_V, PG_V)
>> + ELFNOTE(Xen, XEN_ELFNOTE_LOADER, .asciz, "generic")
>> + ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long, 0)
>> + ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB, .asciz, "yes")
>> +
>> +On the linux side, the above can be found in `arch/x86/xen/xen-head.S`.
>
> s/linux/Linux/
Done.
>
>> +
>> +It is important to highlight the following notes:
>> +
>> + * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel entry
>> + point.
>> + * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
>> + hypercall page inside of the guest kernel (this memory region will be filled
>> + by Xen prior to booting).
>> + * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the kernel.
>> + In the example above the kernel is only able to boot as a PVH guest, but
>> + those options can be mixed with the ones used by pure PV guests in order to
>> + have a kernel that supports both PV and PVH (like Linux). The list of
>> + options available can be found in the `features.h` public header.
>> +
>
>
> Note that 'hvm_callback_vector' is in XEN_ELFNOTE_FEATURES. Older hypervisors will
> balk at this being part of it, so it can also be put in
> XEN_ELFNOTE_SUPPORTED_FEATURES which older hypervisors will ignore.
Added to the XEN_ELFNOTE_FEATURES comment, thanks for the info.
>> +Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
>> +paging enabled (either long mode or protected mode with paging turned on
>> +depending on the kernel bitness) and some basic page tables setup. An important
>> +distinction for a 64bit PVH is that it is launched at privilege level 0 as
>> +opposed to a 64bit PV guest which is launched at privilege level 3.
>> +
>> +Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
>> +memory address where Xen has placed the `start_info` structure. The `rsp` (`esp`
>> +on 32bits) will point to the top of an initial single page stack, that can be
>> +used by the guest kernel. The `start_info` structure contains all the info the
>> +guest needs in order to initialize. More information about the contents can be
>> +found on the `xen.h` public header.
>
> s/on/in/
>> +
>> +### Initial amd64 control registers values ###
>> +
>> +Initial values for the control registers are set up by Xen before booting the
>> +guest kernel. The guest kernel can expect to find the following features
>> +enabled by Xen.
>> +
>> +`CR0` has the following bits set by Xen:
>> +
>> + * PE (bit 0): protected mode enable.
>> + * ET (bit 4): 387 or newer processor.
>> + * PG (bit 31): paging enabled.
>
> Also TS (at least that is what the Linux code says:
>
> /* Some of these are setup in 'secondary_startup_64'. The others:
> * X86_CR0_TS, X86_CR0_PE, X86_CR0_ET are set by Xen for HVM guests
> * (which PVH shared codepaths), while X86_CR0_PG is for PVH. */
>
> Perhaps it is incorrect?
I think this comment is outdated/incorrect. This is the CR0 value I see
on a FreeBSD PVH start-of-day:
0x80000011 (PE, ET and PG bits set)
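
For reference, a minimal sketch of how that value breaks down, using the
CR0 bit positions listed in the document (plus TS at bit 3, which is the
architectural position). The macro and function names are only
illustrative, they are not taken from the Xen or FreeBSD headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* CR0 bit positions; names are illustrative, not from any Xen header. */
#define CR0_PE (UINT64_C(1) << 0)   /* protected mode enable */
#define CR0_TS (UINT64_C(1) << 3)   /* task switched */
#define CR0_ET (UINT64_C(1) << 4)   /* 387 or newer processor */
#define CR0_PG (UINT64_C(1) << 31)  /* paging enabled */

/* True if cr0 has exactly the PE, ET and PG bits set and nothing else. */
static bool cr0_matches_pvh_start(uint64_t cr0)
{
    return cr0 == (CR0_PE | CR0_ET | CR0_PG);
}

/* True if the TS bit is set in cr0. */
static bool cr0_has_ts(uint64_t cr0)
{
    return (cr0 & CR0_TS) != 0;
}
```

With the value above, PE | ET | PG is the whole content of the register
and the TS bit (0x8) is clear.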
>
>> +
>> +`CR4` has the following bits set by Xen:
>> +
>> + * PAE (bit 5): PAE enabled.
>> +
>> +And finally in `EFER` the following features are enabled:
>> +
>> + * LME (bit 8): Long mode enable.
>> + * LMA (bit 10): Long mode active.
>> +
>> +At least the following flags in `EFER` are guaranteed to be disabled:
>> +
>> + * SCE (bit 0): System call extensions disabled.
>> + * NXE (bit 11): No-Execute disabled.
>> +
>> +There's no guarantee about the state of the other bits in the `EFER` register.
>> +
>> +All the segments selectors are set with a flat base at zero.
>> +
>> +The `cs` segment selector attributes are set to 0x0a09b, which describes an
>> +executable and readable code segment only accessible by the most privileged
>> +level. The segment is also set as a 64-bit code segment (`L` flag set, `D` flag
>> +unset).
>> +
>> +The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set
>> +to the same values. The attributes are set to 0x0c093, which implies a read and
>> +write data segment only accessible by the most privileged level.
>
> I think the SS, ES, FS, GS are set to the null selector in 64-bit mode.
This is what I see when I dump the vcpu state of a PVH guest created
with the -p option (so that the guest is never started):
(XEN) CS: sel=0x0000, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000
(XEN) DS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) SS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) ES: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) FS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
(XEN) GS: sel=0x0000, attr=0x0c093, limit=0xffffffff, base=0x0000000000000000
Am I missing something? I don't see a difference between SS, ES, FS,
GS and DS. In construct_vmcs on Xen we seem to set all the segments
to the same values with the exception of CS attributes.
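
In case it helps when reading the dump, here is a rough sketch that
splits the attr values into fields. It assumes the dump uses the VMX
segment access-rights layout (type in bits 3:0, S in bit 4, DPL in bits
6:5, P in bit 7, AVL in bit 12, L in bit 13, D/B in bit 14, G in bit
15); the struct and function names are made up for the example:

```c
#include <stdint.h>

/* Decoded segment attributes; a hypothetical helper type for illustration. */
struct seg_attr {
    unsigned type; /* segment type (bits 3:0) */
    unsigned s;    /* descriptor type: 1 = code/data (bit 4) */
    unsigned dpl;  /* descriptor privilege level (bits 6:5) */
    unsigned p;    /* present (bit 7) */
    unsigned avl;  /* available for software use (bit 12) */
    unsigned l;    /* 64-bit code segment (bit 13) */
    unsigned db;   /* default operation size (bit 14) */
    unsigned g;    /* granularity (bit 15) */
};

/* Split an attr value, assuming the VMX access-rights bit layout. */
static struct seg_attr decode_seg_attr(uint32_t attr)
{
    struct seg_attr a;

    a.type = attr & 0xf;
    a.s    = (attr >> 4) & 0x1;
    a.dpl  = (attr >> 5) & 0x3;
    a.p    = (attr >> 7) & 0x1;
    a.avl  = (attr >> 12) & 0x1;
    a.l    = (attr >> 13) & 0x1;
    a.db   = (attr >> 14) & 0x1;
    a.g    = (attr >> 15) & 0x1;
    return a;
}
```

Under that assumption 0x0a09b decodes to type 0xb (execute/read,
accessed), DPL 0, L set and D/B clear, while 0x0c093 decodes to type 0x3
(read/write, accessed), DPL 0, L clear and D/B set, which agrees with
the description in the document.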
>> +
>> +The `FS.base` and `GS.base` MSRs are zeroed out.
>
> .. and 'KERNEL_GS.base'
Done.
>> +
>> +The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to
>> +not trigger a fault until after they have been properly set. The way of setting
>> +the IDT and the GDT is using the native instructions as would be done on bare
>> +metal.
>> +
>> +The `RFLAGS` register is guaranteed to be clear when jumping into the kernel
>> +entry point, with the exception of the reserved bit 1 set.
[...]
>> +## Interrupts ##
>> +
>> +All interrupts on PVH guests are routed over event channels, see
>> +[Event Channel Internals][event_channels] for more detailed information about
>> +event channels. In order to inject interrupts into the guest an IDT vector is
>> +used. This is the same mechanism used on PVHVM guests, and allows having
>> +per-cpu interrupts that can be used to deliver timers or IPIs.
>> +
>> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
>> +is used with the following values:
>> +
>> + domid = DOMID_SELF
>> + index = HVM_PARAM_CALLBACK_IRQ
>> + value = (0x2 << 56) | vector_value
>
> And naturally the OS has to program the IDT for the 'vector_value' using
> the baremetal mechanism.
Added.
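
One detail that might be worth spelling out for C consumers: the
`(0x2 << 56)` expression has to be evaluated in a 64-bit type, since
shifting a plain `int` by 56 is undefined. A minimal sketch of building
the value (the helper name is hypothetical, and filling the
`HVMOP_set_param` argument and issuing the hypercall are left out):

```c
#include <stdint.h>

/*
 * Build the HVM_PARAM_CALLBACK_IRQ value for vector delivery, following
 * the "(0x2 << 56) | vector_value" formula from the document. The helper
 * name is hypothetical and only used for this example.
 */
static uint64_t callback_via_vector(uint8_t vector)
{
    /* Widen to 64 bits before shifting so the type lands in bits 63:56. */
    return (UINT64_C(0x2) << 56) | vector;
}
```

For example, vector 0xf3 would give 0x02000000000000f3.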
[...]
>> +## CPUID ##
>> +
>> +*TODO*: describe which cpuid flags a guest should ignore and also which flags
>> +describe features that can be used. It would also be good to describe the set of
>> +cpuid flags that will always be present when running as PVH.
>
> Perhaps start with:
> The cpuid instruction that should be used is the normal 'cpuid', not
> the emulated 'cpuid' that PV guests usually require.
Done.
>
>> +
>> +## Final notes ##
>> +
>> +All the other hardware functionality not described in this document should be
>> +assumed to be performed in the same way as native.
>> +
>> +[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
>
> And with those changes:
>
> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>
>> --
>> 1.8.5.2 (Apple Git-48)
>>
>