HVMlite ABI specification DRAFT A

* HVMlite ABI specification DRAFT A
@ 2016-02-04 17:48 Roger Pau Monné
  2016-02-04 18:22 ` Andrew Cooper
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-04 17:48 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, samuel.thibault,
	Boris Ostrovsky

Hello,

I've Cced a bunch of people who have expressed interest in the HVMlite 
design/implementation, from either a Xen or an OS point of view. If you 
would like to be removed, please say so and I will remove you in 
further iterations. The same applies if you want to be added to the Cc.

This is an initial draft on the HVMlite design and implementation. I've 
mixed certain aspects of the design with the implementation, because I 
think we are quite tied by the implementation possibilities in certain 
aspects, so not speaking about it would make the document incomplete. I 
might be wrong on that, so feel free to comment otherwise if you would 
prefer a different approach. At least this should get the conversation 
started into a couple of pending items regarding HVMlite. I don't want 
to spoil the fun, but IMHO they are:

 - Local APIC: should we _always_ provide a local APIC to HVMlite 
   guests?
 - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ 
   event channels?
 - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?

The document is still far from complete, and I've only tried to 
represent the points where there's consensus (like the boot ABI) or 
parts where feedback is needed in order to reach a consensus (like the 
items pointed out above). I'm of course not as knowledgeable as some people 
on the Cc, so please correct me if you think there are mistakes or 
simply impossible goals.

Roger.
---

Xen HVMlite ABI
===============

Boot ABI
--------

Since the Xen entry point into the kernel can be different from the
native entry point, an `ELFNOTE` is used in order to tell the domain
builder how to load and jump into the kernel entry point:

    ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)

The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
kernel supports the boot ABI described in this document.

The domain builder must load the kernel into the guest memory space and
jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
following machine state:

 * `ebx`: contains the physical memory address where the loader has placed
   the boot start info structure.

 * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.

 * `cr4`: all bits are cleared.

 * `cs`: must be a 32-bit read/execute code segment with a base of '0'
   and a limit of '0xFFFFFFFF'. The selector value is unspecified.

 * `ds`, `es`: must be 32-bit read/write data segments with a base of
   '0' and a limit of '0xFFFFFFFF'. The selector values are all unspecified.

 * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of '0x67'.

 * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
   Bit 8 (TF) must be cleared. Other bits are all unspecified.

All other processor registers and flag bits are unspecified. The OS is in
charge of setting up its own stack, GDT and IDT.
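
As a rough illustration of the constraints above, here is a minimal sketch
of a check that a domain builder test (or a paranoid kernel) could run
against a register snapshot. The `entry_state` structure and the
`entry_state_ok` helper are hypothetical names made up for this example,
and only the paging bit is checked as a representative of "all the other
writeable bits" in `cr0`:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the registers constrained by the boot ABI. */
struct entry_state {
    uint32_t cr0;
    uint32_t cr4;
    uint32_t eflags;
};

#define CR0_PE    (1u << 0)   /* protected mode enable */
#define CR0_PG    (1u << 31)  /* paging */
#define EFLAGS_TF (1u << 8)   /* trap flag */
#define EFLAGS_IF (1u << 9)   /* interrupt enable flag */
#define EFLAGS_VM (1u << 17)  /* virtual-8086 mode */

/* Returns true if the snapshot satisfies the rules listed above. */
static bool entry_state_ok(const struct entry_state *s)
{
    if (!(s->cr0 & CR0_PE))
        return false;         /* protected mode must be enabled */
    if (s->cr0 & CR0_PG)
        return false;         /* paging is a writeable bit, so it must be clear */
    if (s->cr4 != 0)
        return false;         /* all cr4 bits are cleared */
    if (s->eflags & (EFLAGS_VM | EFLAGS_IF | EFLAGS_TF))
        return false;         /* VM, IF and TF must be cleared */
    return true;
}
```

The segment and `tr` requirements are left out because they cannot be
expressed with plain register values alone.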

The format of the boot start info structure is the following (pointed to
by %ebx):

    struct hvm_start_info {
    #define HVM_START_MAGIC_VALUE 0x336ec578
        uint32_t magic;             /* Contains the magic value 0x336ec578       */
                                    /* ("xEn3" with the 0x80 bit of the "E" set).*/
        uint32_t flags;             /* SIF_xxx flags.                            */
        uint32_t cmdline_paddr;     /* Physical address of the command line.     */
        uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
        uint32_t modlist_paddr;     /* Physical address of an array of           */
                                    /* hvm_modlist_entry.                        */
    };

    struct hvm_modlist_entry {
        uint32_t paddr;             /* Physical address of the module.           */
        uint32_t size;              /* Size of the module in bytes.              */
    };
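
To make the layout a bit more concrete, the following is a sketch of how a
guest could validate the structure and walk the module list. It assumes a
flat byte buffer standing in for guest physical memory starting at address
0, and `hvm_modules_total_size` is a hypothetical helper written for this
example only, not something Xen or any OS provides:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HVM_START_MAGIC_VALUE 0x336ec578

struct hvm_start_info {
    uint32_t magic;
    uint32_t flags;
    uint32_t cmdline_paddr;
    uint32_t nr_modules;
    uint32_t modlist_paddr;
};

struct hvm_modlist_entry {
    uint32_t paddr;
    uint32_t size;
};

/*
 * Hypothetical helper: returns the summed size of all modules, or -1 if
 * the data at 'info_paddr' is not a usable start info structure.
 * 'mem' and 'mem_size' model guest physical memory (an assumption made
 * so that the example can run outside of a guest).
 */
static int64_t hvm_modules_total_size(const uint8_t *mem, size_t mem_size,
                                      uint32_t info_paddr)
{
    struct hvm_start_info info;
    uint64_t list_bytes;
    int64_t total = 0;
    uint32_t i;

    if (info_paddr > mem_size || mem_size - info_paddr < sizeof(info))
        return -1;
    memcpy(&info, mem + info_paddr, sizeof(info));

    if (info.magic != HVM_START_MAGIC_VALUE)
        return -1;
    if (info.nr_modules == 0)
        return 0;

    /* Make sure the whole module array lies inside the modelled memory. */
    list_bytes = (uint64_t)info.nr_modules * sizeof(struct hvm_modlist_entry);
    if (info.modlist_paddr > mem_size ||
        mem_size - info.modlist_paddr < list_bytes)
        return -1;

    for (i = 0; i < info.nr_modules; i++) {
        struct hvm_modlist_entry e;

        memcpy(&e, mem + info.modlist_paddr + (size_t)i * sizeof(e),
               sizeof(e));
        total += e.size;
    }
    return total;
}
```

Read as a little-endian byte sequence the magic value is 'x', 'E' | 0x80,
'n', '3', which is where the comment in the structure comes from.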

Other relevant information needed in order to boot a guest kernel
(console page address, xenstore event channel...) can be obtained
using HVMPARAMS, just like it's done on HVM guests.

The setup of the hypercall page is also performed in the same way
as HVM guests, using the hypervisor cpuid leaves and msr ranges.
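
For reference, a small sketch of the first step of that procedure:
recognising the Xen signature in the registers returned by the hypervisor
base cpuid leaf. The `cpuid_regs` structure and the helper name are
assumptions made for this example; executing the `cpuid` instruction
itself is left to the caller so that the code stays portable:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Register values returned by a single cpuid invocation. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * Returns true if ebx:ecx:edx spell "XenVMMXenVMM". The bytes are copied
 * in memory order, so this relies on a little-endian host (x86 is).
 */
static bool xen_signature_present(const struct cpuid_regs *r)
{
    char sig[12];

    memcpy(sig + 0, &r->ebx, 4);
    memcpy(sig + 4, &r->ecx, 4);
    memcpy(sig + 8, &r->edx, 4);
    return memcmp(sig, "XenVMMXenVMM", 12) == 0;
}
```

As far as I understand the existing HVM interface, once the signature has
been found the leaf at base + 2 reports the number of hypercall pages in
`eax` and the index of the MSR used to install them in `ebx`; the guest
then writes the guest physical address of a page to that MSR.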

Hardware description
--------------------

Hardware description can come from two different sources, just like on (PV)HVM
guests.

Description of PV devices will always come from xenbus, and in fact
xenbus is the only hardware description that is guaranteed to always be
provided to HVMlite guests.

Description of physical hardware devices will always come from ACPI; in the
absence of any physical hardware device no ACPI tables will be provided. The
presence of ACPI tables can be detected by finding the RSDP, just like on
bare metal.
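
A minimal sketch of such an RSDP search follows. It applies the usual ACPI
rules (the "RSD PTR " signature on a 16-byte boundary, with the first 20
bytes summing to zero) to a caller-supplied buffer. Which memory region an
HVMlite guest should scan is not defined by this draft, so the region is a
parameter here and `find_rsdp` is a hypothetical helper:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSDP_V1_LEN 20  /* bytes covered by the ACPI 1.0 checksum */

/*
 * Returns the offset of the first valid RSDP inside 'region', or -1 if
 * none is found. Only the ACPI 1.0 checksum is verified.
 */
static ptrdiff_t find_rsdp(const uint8_t *region, size_t len)
{
    size_t off, i;

    for (off = 0; off + RSDP_V1_LEN <= len; off += 16) {
        uint8_t sum = 0;

        if (memcmp(region + off, "RSD PTR ", 8) != 0)
            continue;
        for (i = 0; i < RSDP_V1_LEN; i++)
            sum = (uint8_t)(sum + region[off + i]);
        if (sum == 0)
            return (ptrdiff_t)off;
    }
    return -1;
}
```

A guest that gets -1 back would conclude that no ACPI tables were provided
and rely on xenbus only.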

Non-PV devices exposed to the guest
-----------------------------------

The initial idea was simply not to provide any emulated devices to an HVMlite
guest as the default option. We have however identified certain situations
where emulated devices could be interesting, from both a performance and
an ease-of-implementation point of view. The following list tries to encompass
the different identified scenarios:

 * 1. HVMlite with no emulated devices at all
   ------------------------------------------
   This is the current implementation inside of Xen: everything is disabled
   by default and the guest has access to the PV devices only. This is of
   course the most secure design because it has the smallest attack surface.

 * 2. HVMlite with PCI-passthrough
   -------------------------------
   The current model of PCI-passthrough in PV guests is complex and requires
   heavy modifications to the guest OS. Going forward we would like to remove
   this limitation, by providing an interface that's the same as found on bare
   metal. In order to do this, at least an emulated local APIC and IO APIC
   should be provided to guests, together with the access to a PCI-Root complex.
   As said in the 'Hardware description' section above, this will also require
   ACPI. So this proposed scenario will require the following elements that are
   not present in the minimal (or default) HVMlite implementation: ACPI, local
   APIC, IO APIC and PCI-Root complex.

 * 3. HVMlite hardware domain
   --------------------------
   The aim is that a HVMlite hardware domain is going to work exactly like a
   HVMlite domain with passed-through devices. This means that the domain will
   need access to the same set of emulated devices, and that some ACPI tables
   must be fixed in order to reflect the reality of the container the hardware
   domain is running on. The ACPI section contains more detailed information
   about which/how these tables are going to be fixed.

   Note that in this scenario the hardware domain will *always* have a local
   APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
   channels is going to be removed in favour of the bare metal mechanisms.

There have been some opinions that the current model (1) should be replaced
with (2) without any passed-through devices, so that at least a local APIC is
provided. Should an RSDT, FADT and MADT then be provided? We would then be
able to switch the CPU enumeration to the one used on bare metal (i.e. using the
data in the MADT).

ACPI
----

ACPI tables will be provided to the hardware domain or to unprivileged
domains that have passed-through PCI devices. In the case of unprivileged
guests ACPI tables are going to be created by the toolstack and will only
contain the set of devices available to the guest, which will at least be
the following: local APIC, IO APIC, the passed-through device. In order to
provide this information from ACPI the following tables are needed as a
minimum: RSDT, FADT, MADT and DSDT.

In the case of the hardware domain, Xen has traditionally passed-through the
native ACPI tables to the guest. This is something that of course we still
want to do, but in the case of HVMlite Xen will have to make sure that
the data passed in the ACPI tables to the hardware domain contains an accurate
hardware description. This means that at least certain tables will have to
be modified/mangled before being presented to the guest:

 * MADT: the number of local APIC entries needs to be fixed to match the number
         of vCPUs available to the guest. The address of the IO APIC(s) also
         needs to be fixed in order to match the emulated ones that we are going
         to provide.

 * DSDT: certain devices reported in the DSDT may not be available to the guest,
         but since the DSDT is a run-time generated table we cannot fix it. In
         order to cope with this, a STAO table will be provided that should
         be able to signal which devices are not available to the hardware
         domain. This is in line with the Xen/ACPI implementation for ARM.

 * MPST, PMTT, SBTT and SRAT: won't be initially presented to the guest, until
                              we get our act together on the vNUMA stuff.
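
To illustrate the MADT item, here is one possible way the local APIC
entries could be trimmed: walk the interrupt controller structures, clear
the Enabled flag on every Processor Local APIC entry beyond the number of
vCPUs, and recompute the table checksum. This is only a sketch under the
assumption that disabling the surplus entries is acceptable (rebuilding
the table is another option), the helper name is made up, and the IO APIC
address fixup is not shown:

```c
#include <stddef.h>
#include <stdint.h>

#define ACPI_CHECKSUM_OFF  9   /* checksum byte in the ACPI table header */
#define MADT_ENTRIES_OFF   44  /* 36-byte header + LAPIC address + flags */
#define MADT_TYPE_LAPIC    0   /* Processor Local APIC structure */
#define MADT_LAPIC_LEN     8
#define MADT_LAPIC_ENABLED 1u  /* bit 0 of the entry flags */

/*
 * Hypothetical helper: disables local APIC entries past 'nr_vcpus'.
 * Returns the number of entries past the limit, or -1 if the table is
 * malformed.
 */
static int madt_limit_lapics(uint8_t *madt, size_t len, unsigned int nr_vcpus)
{
    size_t off = MADT_ENTRIES_OFF, i;
    unsigned int seen = 0;
    int trimmed = 0;
    uint8_t sum = 0;

    if (len < MADT_ENTRIES_OFF)
        return -1;

    while (off + 2 <= len) {
        uint8_t type = madt[off];
        uint8_t elen = madt[off + 1];

        if (elen < 2 || off + elen > len)
            return -1;
        if (type == MADT_TYPE_LAPIC && elen == MADT_LAPIC_LEN) {
            if (seen >= nr_vcpus) {
                /* Flags are little-endian at offset 4 of the entry. */
                madt[off + 4] &= (uint8_t)~MADT_LAPIC_ENABLED;
                trimmed++;
            }
            seen++;
        }
        off += elen;
    }

    /* All bytes of an ACPI table must sum to zero. */
    madt[ACPI_CHECKSUM_OFF] = 0;
    for (i = 0; i < len; i++)
        sum = (uint8_t)(sum + madt[i]);
    madt[ACPI_CHECKSUM_OFF] = (uint8_t)(0u - sum);

    return trimmed;
}
```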

NB: there are corner cases that I'm not sure how to solve properly. Currently
the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
of the following:

 * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
   since this table is only available to the hardware domain it has to report
   the PM info back to Xen so that Xen can perform proper PM.
 * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
   mixed with native ACPICA code in most OSes. This is awkward and requires
   the usage of hooks into ACPICA which we have not yet managed to upstream.
 * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
   intrusive in general, so I'm not that pushed to remove it. It's generally
   easy in any OS to add some kind of hook that's executed every time a PCI
   device is discovered.
 * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
   this one is the same as (3), it's not really intrusive so I'm not very
   pushed to remove it.

I would ideally like to get rid of (2) in the list above, since I'm quite sure
we are never going to be able to merge the needed hooks into ACPICA. AFAICT Xen
should be able to parse the FADT table and find the address of the PM1a and
PM1b control registers and trap on access.
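
A sketch of what decoding such a trapped write could look like is below.
The bit positions (SLP_TYP in bits 10-12, SLP_EN in bit 13) are the ones
defined by ACPI for the PM1 control register. The SLP_TYP value that means
S5 is platform specific and comes from the \_S5 object in the AML, so it
is taken as a parameter here; how Xen would learn it is left open, and the
helper name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define PM1_SLP_TYP_SHIFT 10
#define PM1_SLP_TYP_MASK  0x7u
#define PM1_SLP_EN        (1u << 13)

/*
 * Returns true if a 16-bit write of 'val' to PM1a_CNT or PM1b_CNT asks
 * for the sleep state whose SLP_TYP encoding is 's5_slp_typ'.
 */
static bool pm1_write_requests_s5(uint16_t val, uint8_t s5_slp_typ)
{
    if (!(val & PM1_SLP_EN))
        return false;  /* SLP_EN clear: no sleep transition requested */
    return ((val >> PM1_SLP_TYP_SHIFT) & PM1_SLP_TYP_MASK) == s5_slp_typ;
}
```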

(1) is also quite nasty, but I don't see any possible way to get rid of it.

AP startup
----------

AP startup is performed using hypercalls. The following VCPU operations
are used in order to bring up secondary vCPUs:

 * VCPUOP_initialise is used to set the initial state of the vCPU. The
   argument passed to the hypercall must be of the type vcpu_hvm_context.
   See public/hvm/hvm_vcpu.h for the layout of the structure. Note that
   this hypercall allows starting the vCPU in several modes (16/32/64-bit),
   regardless of the mode the BSP is currently running in.

 * VCPUOP_up is used to launch the vCPU once the initial state has been
   set using VCPUOP_initialise.

 * VCPUOP_down is used to bring down a vCPU.

 * VCPUOP_is_up is used to scan the number of available vCPUs.

Additionally, if a local APIC is available CPU bringup can also be performed
using the hardware native AP startup sequence (IPIs). In this case the
hypercall interface will still be provided, as a faster and more convenient
way of starting APs.
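
The hypercall based sequence could be sketched as follows. The command
numbers are the ones from public/vcpu.h, but the `vcpu_op_fn` wrapper
signature, `start_ap` and the mock hypervisor side are all assumptions
made so that the example is self-contained; a real guest would use its own
hypercall wrapper and pass a properly filled vcpu_hvm_context:

```c
#include <stddef.h>

/* VCPUOP command numbers as defined in Xen's public/vcpu.h. */
#define VCPUOP_initialise 0
#define VCPUOP_up         1
#define VCPUOP_down       2
#define VCPUOP_is_up      3

/* Hypothetical hypercall wrapper signature. */
typedef long (*vcpu_op_fn)(int cmd, unsigned int vcpuid, void *arg);

/* Initialise a vCPU, launch it and confirm that it is up. */
static long start_ap(vcpu_op_fn vcpu_op, unsigned int vcpuid, void *ctx)
{
    long rc = vcpu_op(VCPUOP_initialise, vcpuid, ctx);

    if (rc)
        return rc;
    rc = vcpu_op(VCPUOP_up, vcpuid, NULL);
    if (rc)
        return rc;
    /* VCPUOP_is_up returns a positive value when the vCPU is running. */
    return vcpu_op(VCPUOP_is_up, vcpuid, NULL) > 0 ? 0 : -1;
}

/* Mock hypervisor side, for illustration only. */
enum { MOCK_MAX_VCPUS = 4 };
static int mock_initialised[MOCK_MAX_VCPUS];
static int mock_up[MOCK_MAX_VCPUS];

static long mock_vcpu_op(int cmd, unsigned int vcpuid, void *arg)
{
    if (vcpuid >= MOCK_MAX_VCPUS)
        return -1;
    switch (cmd) {
    case VCPUOP_initialise:
        if (arg == NULL)
            return -1;
        mock_initialised[vcpuid] = 1;
        return 0;
    case VCPUOP_up:
        if (!mock_initialised[vcpuid])
            return -1;
        mock_up[vcpuid] = 1;
        return 0;
    case VCPUOP_down:
        mock_up[vcpuid] = 0;
        return 0;
    case VCPUOP_is_up:
        return mock_up[vcpuid];
    default:
        return -1;
    }
}
```

Calling `start_ap(mock_vcpu_op, 1, &ctx)` with any non-NULL `ctx` walks
through the three operations in the order described above.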

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: HVMlite ABI specification DRAFT A
  2016-02-04 17:48 HVMlite ABI specification DRAFT A Roger Pau Monné
@ 2016-02-04 18:22 ` Andrew Cooper
  2016-02-04 19:33   ` Roger Pau Monné
  2016-02-05  9:12   ` Jan Beulich
  2016-02-04 18:38 ` Boris Ostrovsky
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Andrew Cooper @ 2016-02-04 18:22 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, Boris Ostrovsky

On 04/02/16 17:48, Roger Pau Monné wrote:
> Hello,
>
> I've Cced a bunch of people who have expressed interest in the HVMlite 
> design/implementation, both from a Xen or OS point of view. If you 
> would like to be removed, please say so and I will remove you in 
> further iterations. The same applies if you want to be added to the Cc.
>
> This is an initial draft on the HVMlite design and implementation. I've 
> mixed certain aspects of the design with the implementation, because I 
> think we are quite tied by the implementation possibilities in certain 
> aspects, so not speaking about it would make the document incomplete. I 
> might be wrong on that, so feel free to comment otherwise if you would 
> prefer a different approach. At least this should get the conversation 
> started into a couple of pending items regarding HVMlite. I don't want 
> to spoil the fun, but IMHO they are:
>
>  - Local APIC: should we _always_ provide a local APIC to HVMlite 
>    guests?

I think it would be best to offer an LAPIC by default (to be helpful to
most modern OSes), but leave the option for an administrator to disable
if they specifically don't want one.

>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ 
>    event channels?
>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?

+1000, for both.

>
> The document is still far from complete, and I've only tried to 
> represent the points where there's consensus (like the boot ABI) or 
> parts where feedback is needed in order to reach a consensus (like the 
> items pointed above). I'm of course not as knowledgeable as some people 
> on the Cc, so please correct me if you think there are mistakes or 
> simply impossible goals.
>
> Roger.
> ---
>
> Xen HVMlite ABI
> ===============

Any chance this can end up living in docs/specs/HVMLite-ABI.$FOO,
alongside the existing formal specs?

Would it also be possible to write a feature document in
docs/features/HVMLite.$FOO ?

>
> Boot ABI
> --------
>
> Since the Xen entry point into the kernel can be different from the
> native entry point, a `ELFNOTE` is used in order to tell the domain
> builder how to load and jump into the kernel entry point:
>
>     ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)
>
> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
> kernel supports the boot ABI described in this document.
>
> The domain builder must load the kernel into the guest memory space and
> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
> following machine state:

Given multiple possible entries, the domain builder might have multiple
starting options available.

I would reword this to "When starting an HVMLite domain, the domain
builder shall load ...", which allows the domain builder to choose an
alternative entry method, at its discretion.

>
>  * `ebx`: contains the physical memory address where the loader has placed
>    the boot start info structure.
>
>  * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
>
>  * `cr4`: all bits are cleared.
>
>  * `cs`: must be a 32-bit read/execute code segment with a base of ‘0’
>    and a limit of ‘0xFFFFFFFF’. The selector value is unspecified.
>
>  * `ds`, `es`: must be a 32-bit read/write data segment with a base of
>    ‘0’ and a limit of ‘0xFFFFFFFF’. The selector values are all unspecified.
>
>  * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of '0x67'.
>
>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>    Bit 8 (TF) must be cleared. Other bits are all unspecified.

I would also specify that the direction flag shall be clear, to prevent
all kernels needing to `cld` on entry.

>
> All other processor registers and flag bits are unspecified. The OS is in
> charge of setting up it's own stack, GDT and IDT.
>
> The format of the boot start info structure is the following (pointed to
> be %ebx):
>
>     struct hvm_start_info {
>     #define HVM_START_MAGIC_VALUE 0x336ec578
>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
>         uint32_t flags;             /* SIF_xxx flags.                            */
>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>         uint32_t modlist_paddr;     /* Physical address of an array of           */
>                                     /* hvm_modlist_entry.                        */
>     };

For both paddr values, zero indicates "not provided".

>
>     struct hvm_modlist_entry {
>         uint32_t paddr;             /* Physical address of the module.           */
>         uint32_t size;              /* Size of the module in bytes.              */
>     };
>
> Other relevant information needed in order to boot a guest kernel
> (console page address, xenstore event channel...) can be obtained
> using HVMPARAMS, just like it's done on HVM guests.
>
> The setup of the hypercall page is also performed in the same way
> as HVM guests, using the hypervisor cpuid leaves and msr ranges.
>
> Hardware description
> --------------------
>
> Hardware description can come from two different sources, just like on (PV)HVM
> guests.
>
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
>
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided. The
> presence of ACPI tables can be detected by finding the RSDP, just like on
> bare metal.
>
> Non-PV devices exposed to the guest
> -----------------------------------
>
> The initial idea was to simply don't provide any emulated devices to a HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> easy implementation point of view. The following list tries to encompass
> the different identified scenarios:
>
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen, everything is disabled
>    by default and the guest has access to the PV devices only. This is of
>    course the most secure design because it has the smaller surface of attack.
>
>  * 2. HVMlite with PCI-passthrough
>    -------------------------------
>    The current model of PCI-passthrought in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation, by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC and IO APIC
>    should be provided to guests, together with the access to a PCI-Root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC IO APIC and PCI-Root complex.

The IOAPIC is only required when doing passthrough of non-VF devices. 
If the passthrough usecase is restricted to SRIOV VFs only, the IOAPIC
can be omitted, as the SRIOV spec forbids the use of legacy line
interrupts for VFs.  Again with security in mind, it should be possible
for an admin to specify this configuration if they really wish to reduce
the emulated attack surface in Xen.

Independently of the HVMLite angle, having a minimal host bridge in Xen
solves a lot of our current architectural problems with existing PCI
Passthrough, and in particular allows for device model disaggregation,
which will also be of interest for the plain HVM case.

>
>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
>
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.

We do need to cater for at least the RTC for the hardware domain.  This
can be done by not using the FADT "reduced" flag and actually wiring up
the legacy IO ports, which ought to be sufficient.

>
> There have been some opinions that the current model (1) should be replaced
> with (2) without any passed-through devices, so that at least a local APIC is
> provided. Should then a RSDT, FADT and MADT be provided? We would then be
> able to switch the CPU enumeration to the one used on bare metal (ie: using the
> data in the MADT).
>
> ACPI
> ----
>
> ACPI tables will be provided to the hardware domain or to unprivileged
> domains that have passed-through PCI devices. In the case of unprivileged
> guests ACPI tables are going to be created by the toolstack and will only
> contain the set of devices available to the guest, which will at least be
> the following: local APIC, IO APIC, the passed-through device. In order to
> provide this information from ACPI the following tables are needed as a
> minimum: RSDT, FADT, MADT and DSDT.
>
> In the case of the hardware domain, Xen has traditionally passed-through the
> native ACPI tables to the guest. This is something that of course we still
> want to do, but in the case of HVMlite Xen will have to make sure that
> the data passed in the ACPI tables to the hardware domain contain the accurate
> hardware description. This means that at least certain tables will have to
> be modified/mangled before being presented to the guest:
>
>  * MADT: the number of local APIC entries need to be fixed to match the number
>          of vCPUs available to the guest. The address of the IO APIC(s) also
>          need to be fixed in order to match the emulated ones that we are going
>          to provide.
>
>  * DSDT: certain devices reported in the DSDT may not be available to the guest,
>          but since the DSDT is a run-time generated table we cannot fix it. In
>          order to cope with this, a STAO table will be provided that should
>          be able to signal which devices are not available to the hardware
>          domain. This is in line with the Xen/ACPI implementation for ARM.
>
>  * MPST, PMTT, SBTT and SRAT: won't be initially presented to the guest, until
>                               we get our act together on the vNUMA stuff.

and SLIT.

>
> NB: there are corner cases that I'm not sure how to solve properly. Currently
> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
> of the following:
>
>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>    since this table is only available to the hardware domain it has to report
>    the PM info back to Xen so that Xen can perform proper PM.
>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>    mixed with native ACPICA code in most OSes. This is awkward and requires
>    the usage of hooks into ACPICA which we have not yet managed to upstream.
>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>    intrusive in general, so I'm not that pushed to remove it. It's generally
>    easy in any OS to add some kind of hook that's executed every time a PCI
>    device is discovered.
>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
>    this one is the same as (3), it's not really intrusive so I'm not very
>    pushed to remove it.
>
> I would ideally like to get rid of (2) in the list above, since I'm quite sure
> we are never going to be able to merge the needed hooks into ACPICA. AFAICT Xen
> should be able to parse the FADT table and find the address of the PM1a and
> PM1b control registers and trap on access.

Doing this would require more of (1), as the exact values written to the
PM1a and PM1b control registers are specified in the DSDT, iirc.

>
> (1) is also quite nasty, but I don't see any possible way to get rid of it.

Sadly not.

>
> AP startup
> ----------
>
> AP startup is performed using hypercalls. The following VCPU operations
> are used in order to bring up secondary vCPUs:
>
>  * VCPUOP_initialise is used to set the initial state of the vCPU. The
>    argument passed to the hypercall must be of the type vcpu_hvm_context.
>    See public/hvm/hvm_vcpu.h for the layout of the structure. Note that
>    this hypercall allows starting the vCPU in several modes (16/32/64bits),
>    regardless of the mode the BSP is currently running on.
>
>  * VCPUOP_up is used to launch the vCPU once the initial state has been
>    set using VCPUOP_initialise.
>
>  * VCPUOP_down is used to bring down a vCPU.
>
>  * VCPUOP_is_up is used to scan the number of available vCPUs.
>
> Additionally, if a local APIC is available CPU bringup can also be performed
> using the hardware native AP startup sequence (IPIs). In this case the
> hypercall interface will still be provided, as a faster and more convenient
> way of starting APs.

+1

~Andrew


* Re: HVMlite ABI specification DRAFT A
  2016-02-04 17:48 HVMlite ABI specification DRAFT A Roger Pau Monné
  2016-02-04 18:22 ` Andrew Cooper
@ 2016-02-04 18:38 ` Boris Ostrovsky
  2016-02-04 18:51   ` Samuel Thibault
  2016-02-04 19:09 ` Samuel Thibault
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Boris Ostrovsky @ 2016-02-04 18:38 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Wei Liu, Andrew Cooper, Stefano Stabellini, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, samuel.thibault

On 02/04/2016 12:48 PM, Roger Pau Monné wrote:
>
> The format of the boot start info structure is the following (pointed to
> be %ebx):
>
>      struct hvm_start_info {
>      #define HVM_START_MAGIC_VALUE 0x336ec578
>          uint32_t magic;             /* Contains the magic value 0x336ec578       */
>                                      /* ("xEn3" with the 0x80 bit of the "E" set).*/
>          uint32_t flags;             /* SIF_xxx flags.                            */
>          uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>          uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>          uint32_t modlist_paddr;     /* Physical address of an array of           */
>                                      /* hvm_modlist_entry.                        */
>      };
>
>      struct hvm_modlist_entry {
>          uint32_t paddr;             /* Physical address of the module.           */
>          uint32_t size;              /* Size of the module in bytes.              */
>      };

If there is more than one module, how is the guest expected to sort out 
which module is what?

-boris


* Re: HVMlite ABI specification DRAFT A
  2016-02-04 18:38 ` Boris Ostrovsky
@ 2016-02-04 18:51   ` Samuel Thibault
  2016-02-04 19:21     ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Samuel Thibault @ 2016-02-04 18:51 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, xen-devel,
	Roger Pau Monné

Boris Ostrovsky, on Thu 04 Feb 2016 13:38:02 -0500, wrote:
> On 02/04/2016 12:48 PM, Roger Pau Monné wrote:
> >The format of the boot start info structure is the following (pointed to
> >be %ebx):
> >
> >     struct hvm_start_info {
> >     #define HVM_START_MAGIC_VALUE 0x336ec578
> >         uint32_t magic;             /* Contains the magic value 0x336ec578       */
> >                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
> >         uint32_t flags;             /* SIF_xxx flags.                            */
> >         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
> >         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
> >         uint32_t modlist_paddr;     /* Physical address of an array of           */
> >                                     /* hvm_modlist_entry.                        */
> >     };
> >
> >     struct hvm_modlist_entry {
> >         uint32_t paddr;             /* Physical address of the module.           */
> >         uint32_t size;              /* Size of the module in bytes.              */
> >     };
> 
> If there is more than one module, how is the guest expected to sort out
> which module is what?

+1
We need that to pass parameters to gnumach modules.

Samuel


* Re: HVMlite ABI specification DRAFT A
  2016-02-04 17:48 HVMlite ABI specification DRAFT A Roger Pau Monné
  2016-02-04 18:22 ` Andrew Cooper
  2016-02-04 18:38 ` Boris Ostrovsky
@ 2016-02-04 19:09 ` Samuel Thibault
  2016-02-04 19:18   ` Boris Ostrovsky
  2016-02-05 10:20 ` Ian Campbell
  2016-02-05 16:01 ` Tim Deegan
  4 siblings, 1 reply; 41+ messages in thread
From: Samuel Thibault @ 2016-02-04 19:09 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, xen-devel,
	Boris Ostrovsky

Roger Pau Monné, on Thu 04 Feb 2016 18:48:14 +0100, wrote:
>     struct hvm_start_info {
>     #define HVM_START_MAGIC_VALUE 0x336ec578
>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
>         uint32_t flags;             /* SIF_xxx flags.                            */
>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>         uint32_t modlist_paddr;     /* Physical address of an array of           */
>                                     /* hvm_modlist_entry.                        */
>     };

Mmm, don't we also need a description of the initial page table, so that
the guest kernel knows which part of the memory it shouldn't use until
having initialized its own page table?  Or is there none in the guest
physical memory at startup of HVMlite mode?

Samuel


* Re: HVMlite ABI specification DRAFT A
  2016-02-04 19:09 ` Samuel Thibault
@ 2016-02-04 19:18   ` Boris Ostrovsky
  2016-02-04 22:21     ` Samuel Thibault
  0 siblings, 1 reply; 41+ messages in thread
From: Boris Ostrovsky @ 2016-02-04 19:18 UTC (permalink / raw)
  To: Samuel Thibault, Roger Pau Monné, xen-devel, Andrew Cooper,
	Jan Beulich, David Vrabel, Paul Durrant, Stefano Stabellini,
	Konrad Rzeszutek Wilk, Wei Liu, Tim Deegan

On 02/04/2016 02:09 PM, Samuel Thibault wrote:
> Roger Pau Monné, on Thu 04 Feb 2016 18:48:14 +0100, wrote:
>>      struct hvm_start_info {
>>      #define HVM_START_MAGIC_VALUE 0x336ec578
>>          uint32_t magic;             /* Contains the magic value 0x336ec578       */
>>                                      /* ("xEn3" with the 0x80 bit of the "E" set).*/
>>          uint32_t flags;             /* SIF_xxx flags.                            */
>>          uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>>          uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>>          uint32_t modlist_paddr;     /* Physical address of an array of           */
>>                                      /* hvm_modlist_entry.                        */
>>      };
> Mmm, don't we also need a description of the initial page table, so that
> the guest kernel knows which part of the memory it shouldn't use until
> having initialized its own page table?  Or is there none in the guest
> physical memory at startup of HVMlite mode?

We start with paging off. CR0 only has PE bit set when guest is loaded.

-boris

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 18:51   ` Samuel Thibault
@ 2016-02-04 19:21     ` Roger Pau Monné
  2016-02-04 20:17       ` Boris Ostrovsky
  2016-02-04 22:23       ` Samuel Thibault
  0 siblings, 2 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-04 19:21 UTC (permalink / raw)
  To: Samuel Thibault, Boris Ostrovsky, xen-devel, Wei Liu,
	Andrew Cooper, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich

On 4/2/16 at 19:51, Samuel Thibault wrote:
> Boris Ostrovsky, on Thu 04 Feb 2016 13:38:02 -0500, wrote:
>> On 02/04/2016 12:48 PM, Roger Pau Monné wrote:
>>> The format of the boot start info structure is the following (pointed to
>>> by %ebx):
>>>
>>>     struct hvm_start_info {
>>>     #define HVM_START_MAGIC_VALUE 0x336ec578
>>>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
>>>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
>>>         uint32_t flags;             /* SIF_xxx flags.                            */
>>>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>>>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>>>         uint32_t modlist_paddr;     /* Physical address of an array of           */
>>>                                     /* hvm_modlist_entry.                        */
>>>     };
>>>
>>>     struct hvm_modlist_entry {
>>>         uint32_t paddr;             /* Physical address of the module.           */
>>>         uint32_t size;              /* Size of the module in bytes.              */
>>>     };
>>
>> If there is more than one module, how is the guest expected to sort out
>> which module is what?

In general I was expecting this would be done by position, or if that's
not enough an additional module (at either position 0 or n) should be
passed to contain that information.

> +1
> We need that to pass parameters to gnumach modules.

Hm, parameters as in a string that's paired with a module, or something
more complex like a metadata block?

I see that multiboot provides a string associated with each module, we
could do the same IMHO. I'm fine with adding it to the boot ABI, but I
would prefer if someone with access to such an OS does the actual
implementation of this feature.

Just to be clear that we are on the same page, then the _entry struct
becomes:

struct hvm_modlist_entry {
	uint32_t paddr;
	uint32_t size;
	uint32_t cmdline_paddr;
};

cmdline_paddr would work the same way as it does in the hvm_start_info
struct (ie: physical address of a zero-terminated ASCII string).

I think I'm going to rewrite this in binary form (getting rid of the
structs), or else people are going to get the implementation wrong due
to padding.
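
To illustrate the padding concern, here is a minimal sketch of how an
implementation could pin the layout down at compile time and read the
fields byte by byte. The structs are the ones proposed in this thread;
the offset macros and read_le32() are hypothetical names used only for
illustration and are not part of the ABI.

```c
#include <stddef.h>
#include <stdint.h>

struct hvm_start_info {
    uint32_t magic;
    uint32_t flags;
    uint32_t cmdline_paddr;
    uint32_t nr_modules;
    uint32_t modlist_paddr;
};

/* Proposed three-field entry from this thread; not final. */
struct hvm_modlist_entry {
    uint32_t paddr;
    uint32_t size;
    uint32_t cmdline_paddr;
};

/* Fail the build if the compiler inserted any padding. */
_Static_assert(sizeof(struct hvm_start_info) == 20, "unexpected padding");
_Static_assert(offsetof(struct hvm_start_info, modlist_paddr) == 16,
               "unexpected padding");
_Static_assert(sizeof(struct hvm_modlist_entry) == 12, "unexpected padding");

/* Hypothetical byte offsets, as a "binary form" description might list them. */
#define START_INFO_OFF_MAGIC          0
#define START_INFO_OFF_FLAGS          4
#define START_INFO_OFF_CMDLINE_PADDR  8
#define START_INFO_OFF_NR_MODULES     12
#define START_INFO_OFF_MODLIST_PADDR  16

/* Read a 32-bit little-endian value without relying on struct layout. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

With only uint32_t members no padding is expected on x86, but the
assertions make that assumption explicit instead of implicit.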

Roger.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 18:22 ` Andrew Cooper
@ 2016-02-04 19:33   ` Roger Pau Monné
  2016-02-04 20:24     ` Boris Ostrovsky
  2016-02-05 14:44     ` Ian Campbell
  2016-02-05  9:12   ` Jan Beulich
  1 sibling, 2 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-04 19:33 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, Boris Ostrovsky

On 4/2/16 at 19:22, Andrew Cooper wrote:
> On 04/02/16 17:48, Roger Pau Monné wrote:
>> Hello,
>>
>> I've Cced a bunch of people who have expressed interest in the HVMlite 
>> design/implementation, both from a Xen or OS point of view. If you 
>> would like to be removed, please say so and I will remove you in 
>> further iterations. The same applies if you want to be added to the Cc.
>>
>> This is an initial draft on the HVMlite design and implementation. I've 
>> mixed certain aspects of the design with the implementation, because I 
>> think we are quite tied by the implementation possibilities in certain 
>> aspects, so not speaking about it would make the document incomplete. I 
>> might be wrong on that, so feel free to comment otherwise if you would 
>> prefer a different approach. At least this should get the conversation 
>> started into a couple of pending items regarding HVMlite. I don't want 
>> to spoil the fun, but IMHO they are:
>>
>>  - Local APIC: should we _always_ provide a local APIC to HVMlite 
>>    guests?
> 
> I think it would be best to offer an LAPIC by default (to be helpful to
> most modern OSes), but leave the option for an administrator to disable
> if they specifically don't want one.

So this also implies that we will provide ACPI by default (RSDT,
FADT, MADT)? IMHO the local APIC is especially helpful if it comes with a
MADT, so that we can do CPU enumeration from it.

>>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ 
>>    event channels?
>>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
> 
> +1000, for both.
> 
>>
>> The document is still far from complete, and I've only tried to 
>> represent the points where there's consensus (like the boot ABI) or 
>> parts where feedback is needed in order to reach a consensus (like the 
>> items pointed above). I'm of course not as knowledgeable as some people 
>> on the Cc, so please correct me if you think there are mistakes or 
>> simply impossible goals.
>>
>> Roger.
>> ---
>>
>> Xen HVMlite ABI
>> ===============
> 
> Any chance this can end up living in docs/specs/HVMLite-ABI.$FOO,
> alongside the existing formal specs?
> 
> Would it also be possible to write a feature document in
> docs/features/HVMLite.$FOO ?

Sure, I haven't even sent this in the form of a patch so that we can
discuss it more freely.

>>
>> Boot ABI
>> --------
>>
>> Since the Xen entry point into the kernel can be different from the
>> native entry point, an `ELFNOTE` is used in order to tell the domain
>> builder how to load and jump into the kernel entry point:
>>
>>     ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)
>>
>> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
>> kernel supports the boot ABI described in this document.
>>
>> The domain builder must load the kernel into the guest memory space and
>> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
>> following machine state:
> 
> Given multiple possible entries, the domain builder might have multiple
> starting options available.
> 
> I would reword this to "When starting an HVMLite domain, the domain
> builder shall load ...", which allows the domain builder to choose an
> alternative entry method, at its discretion.
> 
>>
>>  * `ebx`: contains the physical memory address where the loader has placed
>>    the boot start info structure.
>>
>>  * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
>>
>>  * `cr4`: all bits are cleared.
>>
>>  * `cs`: must be a 32-bit read/execute code segment with a base of ‘0’
>>    and a limit of ‘0xFFFFFFFF’. The selector value is unspecified.
>>
>>  * `ds`, `es`: must be a 32-bit read/write data segment with a base of
>>    ‘0’ and a limit of ‘0xFFFFFFFF’. The selector values are all unspecified.
>>
>>  * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of '0x67'.
>>
>>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>>    Bit 8 (TF) must be cleared. Other bits are all unspecified.
> 
> I would also specify that the direction flag shall be clear, to prevent
> all kernels needing to `cld` on entry.
>
>>
>> All other processor registers and flag bits are unspecified. The OS is in
>> charge of setting up its own stack, GDT and IDT.
>>
>> The format of the boot start info structure is the following (pointed to
>> by %ebx):
>>
>>     struct hvm_start_info {
>>     #define HVM_START_MAGIC_VALUE 0x336ec578
>>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
>>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
>>         uint32_t flags;             /* SIF_xxx flags.                            */
>>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>>         uint32_t modlist_paddr;     /* Physical address of an array of           */
>>                                     /* hvm_modlist_entry.                        */
>>     };
> 
> For both paddr values, zero indicates "not provided".

Ack to all of the above.
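
As a rough sketch of how a guest could consume the start info with the
"zero means not provided" convention suggested above: the code below
checks the magic value and returns the command line only if one was
passed. The helper name start_info_cmdline() and the `mem` parameter
(a pointer standing in for guest physical address 0) are assumptions
made for illustration; a real kernel would use its own way of
accessing physical memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HVM_START_MAGIC_VALUE 0x336ec578

/* "xEn3" stored little-endian, with the 0x80 bit of the 'E' set. */
_Static_assert(HVM_START_MAGIC_VALUE ==
               ('x' | (('E' | 0x80) << 8) | ('n' << 16) | ('3' << 24)),
               "magic does not match its description");

struct hvm_start_info {
    uint32_t magic;          /* HVM_START_MAGIC_VALUE.                  */
    uint32_t flags;          /* SIF_xxx flags.                          */
    uint32_t cmdline_paddr;  /* Physical address of the command line,   */
                             /* 0 if not provided.                      */
    uint32_t nr_modules;     /* Number of modules passed to the kernel. */
    uint32_t modlist_paddr;  /* Physical address of the module array,   */
                             /* 0 if not provided.                      */
};

/*
 * Hypothetical helper: return the kernel command line, or NULL if the
 * magic value is wrong or no command line was provided.
 */
static const char *start_info_cmdline(const uint8_t *mem,
                                      uint32_t start_info_paddr)
{
    struct hvm_start_info si;

    /* Copy out of "physical memory" to avoid alignment assumptions. */
    memcpy(&si, mem + start_info_paddr, sizeof(si));

    if (si.magic != HVM_START_MAGIC_VALUE)
        return NULL;
    if (si.cmdline_paddr == 0)
        return NULL;

    return (const char *)(mem + si.cmdline_paddr);
}
```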

>>
>>     struct hvm_modlist_entry {
>>         uint32_t paddr;             /* Physical address of the module.           */
>>         uint32_t size;              /* Size of the module in bytes.              */
>>     };
>>
>> Other relevant information needed in order to boot a guest kernel
>> (console page address, xenstore event channel...) can be obtained
>> using HVMPARAMS, just like it's done on HVM guests.
>>
>> The setup of the hypercall page is also performed in the same way
>> as HVM guests, using the hypervisor cpuid leaves and msr ranges.
>>
>> Hardware description
>> --------------------
>>
>> Hardware description can come from two different sources, just like on (PV)HVM
>> guests.
>>
>> Description of PV devices will always come from xenbus, and in fact
>> xenbus is the only hardware description that is guaranteed to always be
>> provided to HVMlite guests.
>>
>> Description of physical hardware devices will always come from ACPI, in the
>> absence of any physical hardware device no ACPI tables will be provided. The
>> presence of ACPI tables can be detected by finding the RSDP, just like on
>> bare metal.
>>
>> Non-PV devices exposed to the guest
>> -----------------------------------
>>
>> The initial idea was to simply not provide any emulated devices to an HVMlite
>> guest as the default option. We have however identified certain situations
>> where emulated devices could be interesting, both from a performance and
>> easy implementation point of view. The following list tries to encompass
>> the different identified scenarios:
>>
>>  * 1. HVMlite with no emulated devices at all
>>    ------------------------------------------
>>    This is the current implementation inside of Xen, everything is disabled
>>    by default and the guest has access to the PV devices only. This is of
>>    course the most secure design because it has the smallest attack surface.
>>
>>  * 2. HVMlite with PCI-passthrough
>>    -------------------------------
>>    The current model of PCI-passthrough in PV guests is complex and requires
>>    heavy modifications to the guest OS. Going forward we would like to remove
>>    this limitation, by providing an interface that's the same as found on bare
>>    metal. In order to do this, at least an emulated local APIC and IO APIC
>>    should be provided to guests, together with the access to a PCI-Root complex.
>>    As said in the 'Hardware description' section above, this will also require
>>    ACPI. So this proposed scenario will require the following elements that are
>>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>>    APIC, IO APIC and PCI-Root complex.
> 
> The IOAPIC is only required when doing passthrough of non-VF devices. 
> If the passthrough usecase is restricted to SRIOV VFs only, the IOAPIC
> can be omitted, as the SRIOV spec forbids the use of legacy line
> interrupts for VFs.  Again with security in mind, it should be possible
> for an admin to specify this configuration if they really wish to reduce
> the emulated attack surface in Xen.
> 
> Independently of the HVMLite angle, having a minimal host bridge in Xen
> solves a lot of our current architectural problems with existing PCI
> Passthrough, and in particular allows for device model disaggregation,
> which will also be of interest for the plain HVM case.

So we should provide a lapic/ioapic set of options to xl configuration
files?

>>
>>  * 3. HVMlite hardware domain
>>    --------------------------
>>    The aim is that a HVMlite hardware domain is going to work exactly like a
>>    HVMlite domain with passed-through devices. This means that the domain will
>>    need access to the same set of emulated devices, and that some ACPI tables
>>    must be fixed in order to reflect the reality of the container the hardware
>>    domain is running on. The ACPI section contains more detailed information
>>    about which/how these tables are going to be fixed.
>>
>>    Note that in this scenario the hardware domain will *always* have a local
>>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>>    channels is going to be removed in favour of the bare metal mechanisms.
> 
> We do need to cater for at least the RTC for the hardware domain.  This
> can be done by not using the FADT "reduced" flag and actually wiring up
> the legacy IO ports, which ought to be sufficient.

Yes, the reduced flag should be set for DomU, but not for the hardware
domain.

>>
>> There have been some opinions that the current model (1) should be replaced
>> with (2) without any passed-through devices, so that at least a local APIC is
>> provided. Should then a RSDT, FADT and MADT be provided? We would then be
>> able to switch the CPU enumeration to the one used on bare metal (ie: using the
>> data in the MADT).
>>
>> ACPI
>> ----
>>
>> ACPI tables will be provided to the hardware domain or to unprivileged
>> domains that have passed-through PCI devices. In the case of unprivileged
>> guests ACPI tables are going to be created by the toolstack and will only
>> contain the set of devices available to the guest, which will at least be
>> the following: local APIC, IO APIC, the passed-through device. In order to
>> provide this information from ACPI the following tables are needed as a
>> minimum: RSDT, FADT, MADT and DSDT.
>>
>> In the case of the hardware domain, Xen has traditionally passed-through the
>> native ACPI tables to the guest. This is something that of course we still
>> want to do, but in the case of HVMlite Xen will have to make sure that
>> the data passed in the ACPI tables to the hardware domain contain the accurate
>> hardware description. This means that at least certain tables will have to
>> be modified/mangled before being presented to the guest:
>>
>>  * MADT: the number of local APIC entries needs to be fixed to match the number
>>          of vCPUs available to the guest. The address of the IO APIC(s) also
>>          needs to be fixed in order to match the emulated ones that we are going
>>          to provide.
>>
>>  * DSDT: certain devices reported in the DSDT may not be available to the guest,
>>          but since the DSDT is a run-time generated table we cannot fix it. In
>>          order to cope with this, a STAO table will be provided that should
>>          be able to signal which devices are not available to the hardware
>>          domain. This is in line with the Xen/ACPI implementation for ARM.
>>
>>  * MPST, PMTT, SBTT and SRAT: won't be initially presented to the guest, until
>>                               we get our act together on the vNUMA stuff.
> 
> and SLIT.
> 
>>
>> NB: there are corner cases that I'm not sure how to solve properly. Currently
>> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
>> of the following:
>>
>>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>>    since this table is only available to the hardware domain it has to report
>>    the PM info back to Xen so that Xen can perform proper PM.
>>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>>    mixed with native ACPICA code in most OSes. This is awkward and requires
>>    the usage of hooks into ACPICA which we have not yet managed to upstream.
>>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>>    intrusive in general, so I'm not that pushed to remove it. It's generally
>>    easy in any OS to add some kind of hook that's executed every time a PCI
>>    device is discovered.
>>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion regarding
>>    this one is the same as (3), it's not really intrusive so I'm not very
>>    pushed to remove it.
>>
>> I would ideally like to get rid of (2) in the list above, since I'm quite sure
>> we are never going to be able to merge the needed hooks into ACPICA. AFAICT Xen
>> should be able to parse the FADT table and find the address of the PM1a and
>> PM1b control registers and trap on access.
> 
> Doing this would require more of (1), as the exact values written to the
> PM1a and PM1b control registers are specified in the DSDT, iirc.

Ouch, I was hoping the values would be constants defined somewhere...

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 19:21     ` Roger Pau Monné
@ 2016-02-04 20:17       ` Boris Ostrovsky
  2016-02-04 20:29         ` Konrad Rzeszutek Wilk
  2016-02-05  8:23         ` Roger Pau Monné
  2016-02-04 22:23       ` Samuel Thibault
  1 sibling, 2 replies; 41+ messages in thread
From: Boris Ostrovsky @ 2016-02-04 20:17 UTC (permalink / raw)
  To: Roger Pau Monné, Samuel Thibault, xen-devel, Wei Liu,
	Andrew Cooper, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich

On 02/04/2016 02:21 PM, Roger Pau Monné wrote:
> On 4/2/16 at 19:51, Samuel Thibault wrote:
>> Boris Ostrovsky, on Thu 04 Feb 2016 13:38:02 -0500, wrote:
>>> On 02/04/2016 12:48 PM, Roger Pau Monné wrote:
>>>> The format of the boot start info structure is the following (pointed to
>>>> by %ebx):
>>>>
>>>>      struct hvm_start_info {
>>>>      #define HVM_START_MAGIC_VALUE 0x336ec578
>>>>          uint32_t magic;             /* Contains the magic value 0x336ec578       */
>>>>                                      /* ("xEn3" with the 0x80 bit of the "E" set).*/
>>>>          uint32_t flags;             /* SIF_xxx flags.                            */
>>>>          uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>>>>          uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>>>>          uint32_t modlist_paddr;     /* Physical address of an array of           */
>>>>                                      /* hvm_modlist_entry.                        */
>>>>      };
>>>>
>>>>      struct hvm_modlist_entry {
>>>>          uint32_t paddr;             /* Physical address of the module.           */
>>>>          uint32_t size;              /* Size of the module in bytes.              */
>>>>      };
>>> If there is more than one module, how is the guest expected to sort out
>>> which module is what?
> In general I was expecting this would be done by position, or if that's
> not enough an additional module (at either position 0 or n) should be
> passed to contain that information.

Then we should specify it somehow --- e.g. that first module is always 
the ramdisk.

>> +1
>> We need that to pass parameters to gnumach modules.
> Hm, parameters as in a string that's paired with a module, or something
> more complex like a metadata block?
>
> I see that multiboot provides a string associated with each module, we
> could do the same IMHO. I'm fine with adding it to the boot ABI, but I
> would prefer if someone with access to such an OS does the actual
> implementation of this feature.
>
> Just to be clear that we are on the same page, then the _entry struct
> becomes:
>
> struct hvm_modlist_entry {
> 	uint32_t paddr;
> 	uint32_t size;
> 	uint32_t cmdline_paddr;
> };
>
> cmdline_paddr would work the same way as it does in the hvm_start_info
> struct (ie: physical address of a zero-terminated ASCII string).

Doesn't this imply that strings should be part of this spec? Like "initrd"?
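
For the sake of discussion, a minimal sketch of what string-based
identification could look like on the guest side, assuming the
three-field hvm_modlist_entry proposed above. Nothing here is agreed:
find_module(), the `mem` parameter (standing in for guest physical
address 0), matching on a command line prefix, and treating a zero
cmdline_paddr as "no string" are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Proposed layout from this thread; not final. */
struct hvm_modlist_entry {
    uint32_t paddr;          /* Physical address of the module.        */
    uint32_t size;           /* Size of the module in bytes.           */
    uint32_t cmdline_paddr;  /* Physical address of a zero-terminated  */
                             /* ASCII string, 0 if none (assumption).  */
};

/*
 * Hypothetical helper: find the first module whose string starts with
 * `key`. Returns 1 and fills `out` on success, 0 otherwise.
 */
static int find_module(const uint8_t *mem, uint32_t modlist_paddr,
                       uint32_t nr_modules, const char *key,
                       struct hvm_modlist_entry *out)
{
    size_t keylen = strlen(key);
    uint32_t i;

    if (modlist_paddr == 0)
        return 0;

    for (i = 0; i < nr_modules; i++) {
        struct hvm_modlist_entry e;

        /* Copy each entry out to avoid alignment assumptions. */
        memcpy(&e, mem + modlist_paddr + i * sizeof(e), sizeof(e));

        if (e.cmdline_paddr == 0)
            continue;   /* Module without a string: cannot match. */

        if (strncmp((const char *)(mem + e.cmdline_paddr), key, keylen) == 0) {
            *out = e;
            return 1;
        }
    }

    return 0;
}
```

A guest would then call find_module(mem, modlist_paddr, nr_modules,
"initrd", &m) instead of relying on the position of the module in the
list.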

-boris


>
> I think I'm going to re-write this in binary form (getting rid of the
> structs), or else people are going to get the implementation wrong due
> to paddings.
>
> Roger.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 19:33   ` Roger Pau Monné
@ 2016-02-04 20:24     ` Boris Ostrovsky
  2016-02-05 14:44     ` Ian Campbell
  1 sibling, 0 replies; 41+ messages in thread
From: Boris Ostrovsky @ 2016-02-04 20:24 UTC (permalink / raw)
  To: Roger Pau Monné, Andrew Cooper, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault

On 02/04/2016 02:33 PM, Roger Pau Monné wrote:
>
> So we should provide a lapic/ioapic set of options to xl configuration
> files?

We already have 'apic' option. We can also use 'acpi=false' since then 
that will mean no MADT and thus no APIC/IOAPIC.


-boris

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 20:17       ` Boris Ostrovsky
@ 2016-02-04 20:29         ` Konrad Rzeszutek Wilk
  2016-02-04 20:37           ` Andrew Cooper
  2016-02-05  8:23         ` Roger Pau Monné
  1 sibling, 1 reply; 41+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-02-04 20:29 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, xen-devel,
	Samuel Thibault, Roger Pau Monné

> >>>If there is more than one module, how is the guest expected to sort out
> >>>which module is what?
> >In general I was expecting this would be done by position, or if that's
> >not enough an additional module (at either position 0 or n) should be
> >passed to contain that information.
> 
> Then we should specify it somehow --- e.g. that first module is always the
> ramdisk.

Keep in mind that with Linux you can actually append the initrd in the
vmlinuz file - so you only have "one" file. Hence the first module
could be optional.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 20:29         ` Konrad Rzeszutek Wilk
@ 2016-02-04 20:37           ` Andrew Cooper
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Cooper @ 2016-02-04 20:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Boris Ostrovsky
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, xen-devel, Samuel Thibault,
	Roger Pau Monné

On 04/02/16 20:29, Konrad Rzeszutek Wilk wrote:
>>>>> If there is more than one module, how is the guest expected to sort out
>>>>> which module is what?
>>> In general I was expecting this would be done by position, or if that's
>>> not enough an additional module (at either position 0 or n) should be
>>> passed to contain that information.
>> Then we should specify it somehow --- e.g. that first module is always the
>> ramdisk.
> Keep in mind that with Linux you can actually append the initrd in the
> vmlinuz file - so you only have "one" file. Hence the first module
> could be optional.

The PV ABI suffers from false assumptions and expectations like this. 
Let's not repeat the same mistakes for HVMLite.

~Andrew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 19:18   ` Boris Ostrovsky
@ 2016-02-04 22:21     ` Samuel Thibault
  2016-02-04 22:25       ` Andrew Cooper
  0 siblings, 1 reply; 41+ messages in thread
From: Samuel Thibault @ 2016-02-04 22:21 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, xen-devel,
	Roger Pau Monné

Boris Ostrovsky, on Thu 04 Feb 2016 14:18:46 -0500, wrote:
> On 02/04/2016 02:09 PM, Samuel Thibault wrote:
> >Roger Pau Monné, on Thu 04 Feb 2016 18:48:14 +0100, wrote:
> >>     struct hvm_start_info {
> >>     #define HVM_START_MAGIC_VALUE 0x336ec578
> >>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
> >>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
> >>         uint32_t flags;             /* SIF_xxx flags.                            */
> >>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
> >>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
> >>         uint32_t modlist_paddr;     /* Physical address of an array of           */
> >>                                     /* hvm_modlist_entry.                        */
> >>     };
> >Mmm, don't we also need a description of the initial page table, so that
> >the guest kernel knows which part of the memory it shouldn't use until
> >having initialized its own page table?  Or is there none in the guest
> >physical memory at startup of HVMlite mode?
> 
> We start with paging off.

So a 32bit hypervisor *has* to use segmentation to protect itself from
domU?

Samuel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 19:21     ` Roger Pau Monné
  2016-02-04 20:17       ` Boris Ostrovsky
@ 2016-02-04 22:23       ` Samuel Thibault
  1 sibling, 0 replies; 41+ messages in thread
From: Samuel Thibault @ 2016-02-04 22:23 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, xen-devel,
	Boris Ostrovsky

Roger Pau Monné, on Thu 04 Feb 2016 20:21:24 +0100, wrote:
> > +1
> > We need that to pass parameters to gnumach modules.
> 
> Hm, parameters as in a string that's paired with a module,

That, yes. Just like the kernel command line. One per module.

> I see that multiboot provides a string associated with each module, we
> could do the same IMHO.

That's it.

> Just to be clear that we are on the same page, then the _entry struct
> becomes:
> 
> struct hvm_modlist_entry {
> 	uint32_t paddr;
> 	uint32_t size;
> 	uint32_t cmdline_paddr;
> };
> 
> cmdline_paddr would work the same way as it does in the hvm_start_info
> struct (ie: physical address of a zero-terminated ASCII string).

That looks alright to me.

Samuel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 22:21     ` Samuel Thibault
@ 2016-02-04 22:25       ` Andrew Cooper
  2016-02-04 22:41         ` Samuel Thibault
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Cooper @ 2016-02-04 22:25 UTC (permalink / raw)
  To: Samuel Thibault, Boris Ostrovsky, Roger Pau Monné, xen-devel,
	Jan Beulich, David Vrabel, Paul Durrant, Stefano Stabellini,
	Konrad Rzeszutek Wilk, Wei Liu, Tim Deegan

On 04/02/2016 22:21, Samuel Thibault wrote:
> Boris Ostrovsky, on Thu 04 Feb 2016 14:18:46 -0500, wrote:
>> On 02/04/2016 02:09 PM, Samuel Thibault wrote:
>>> Roger Pau Monné, on Thu 04 Feb 2016 18:48:14 +0100, wrote:
>>>>     struct hvm_start_info {
>>>>     #define HVM_START_MAGIC_VALUE 0x336ec578
>>>>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
>>>>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
>>>>         uint32_t flags;             /* SIF_xxx flags.                            */
>>>>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>>>>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>>>>         uint32_t modlist_paddr;     /* Physical address of an array of           */
>>>>                                     /* hvm_modlist_entry.                        */
>>>>     };
>>> Mmm, don't we also need a description of the initial page table, so that
>>> the guest kernel knows which part of the memory it shouldn't use until
>>> having initialized its own page table?  Or is there none in the guest
>>> physical memory at startup of HVMlite mode?
>> We start with paging off.
> So a 32bit hypervisor *has* to use segmentation to protect itself from
> domU?

This is an HVM domain, so uses hardware virtualisation extensions.  It
is not like a PV guest.

The HVMLite binary is free to choose its width and paging mode.  All
this document states is that the entry point shall be 32bit flat unpaged
mode.
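
For reference, a minimal sketch of the entry conditions as a check over
a register snapshot, based on the machine state listed in the draft.
The struct entry_state and entry_state_ok() names are hypothetical, the
direction flag check reflects the suggestion made earlier in this
thread rather than the draft itself, and only the PE and PG bits of
cr0 are examined, which is a simplification.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_PE     (1u << 0)   /* Protected mode enable. */
#define X86_CR0_PG     (1u << 31)  /* Paging.                */
#define X86_EFLAGS_TF  (1u << 8)   /* Trap flag.             */
#define X86_EFLAGS_IF  (1u << 9)   /* Interrupt enable flag. */
#define X86_EFLAGS_DF  (1u << 10)  /* Direction flag.        */
#define X86_EFLAGS_VM  (1u << 17)  /* Virtual-8086 mode.     */

/* Hypothetical snapshot of the registers the draft constrains. */
struct entry_state {
    uint32_t cr0;
    uint32_t cr4;
    uint32_t eflags;
};

static bool entry_state_ok(const struct entry_state *s)
{
    /* Protected mode on, paging off: 32-bit flat unpaged mode. */
    if (!(s->cr0 & X86_CR0_PE) || (s->cr0 & X86_CR0_PG))
        return false;

    /* The draft requires all cr4 bits to be cleared. */
    if (s->cr4 != 0)
        return false;

    /* VM, IF and TF must be clear; DF clear is a proposed addition. */
    if (s->eflags & (X86_EFLAGS_VM | X86_EFLAGS_IF |
                     X86_EFLAGS_TF | X86_EFLAGS_DF))
        return false;

    return true;
}
```

Anything the kernel does after this point (enabling paging, switching
to long mode) is its own choice and outside the entry ABI.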

~Andrew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 22:25       ` Andrew Cooper
@ 2016-02-04 22:41         ` Samuel Thibault
  0 siblings, 0 replies; 41+ messages in thread
From: Samuel Thibault @ 2016-02-04 22:41 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, xen-devel, Boris Ostrovsky,
	Roger Pau Monné

Andrew Cooper, on Thu 04 Feb 2016 22:25:47 +0000, wrote:
> On 04/02/2016 22:21, Samuel Thibault wrote:
> > Boris Ostrovsky, on Thu 04 Feb 2016 14:18:46 -0500, wrote:
> >> On 02/04/2016 02:09 PM, Samuel Thibault wrote:
> >>> Roger Pau Monné, on Thu 04 Feb 2016 18:48:14 +0100, wrote:
> >>>>     struct hvm_start_info {
> >>>>     #define HVM_START_MAGIC_VALUE 0x336ec578
> >>>>         uint32_t magic;             /* Contains the magic value 0x336ec578       */
> >>>>                                     /* ("xEn3" with the 0x80 bit of the "E" set).*/
> >>>>         uint32_t flags;             /* SIF_xxx flags.                            */
> >>>>         uint32_t cmdline_paddr;     /* Physical address of the command line.     */
> >>>>         uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
> >>>>         uint32_t modlist_paddr;     /* Physical address of an array of           */
> >>>>                                     /* hvm_modlist_entry.                        */
> >>>>     };
> >>> Mmm, don't we also need a description of the initial page table, so that
> >>> the guest kernel knows which part of the memory it shouldn't use until
> >>> having initialized its own page table?  Or is there none in the guest
> >>> physical memory at startup of HVMlite mode?
> >> We start with paging off.
> > So a 32bit hypervisor *has* to use segmentation to protect itself from
> > domU?
> 
> This is an HVM domain, so uses hardware virtualisation extensions.  It
> is not like a PV guest.

Ah, right, sorry, too much used to PV :)

Samuel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 20:17       ` Boris Ostrovsky
  2016-02-04 20:29         ` Konrad Rzeszutek Wilk
@ 2016-02-05  8:23         ` Roger Pau Monné
  1 sibling, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05  8:23 UTC (permalink / raw)
  To: Boris Ostrovsky, Samuel Thibault, xen-devel, Wei Liu,
	Andrew Cooper, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich

El 4/2/16 a les 21:17, Boris Ostrovsky ha escrit:
> On 02/04/2016 02:21 PM, Roger Pau Monné wrote:
>> El 4/2/16 a les 19:51, Samuel Thibault ha escrit:
>>> Boris Ostrovsky, on Thu 04 Feb 2016 13:38:02 -0500, wrote:
>>>> On 02/04/2016 12:48 PM, Roger Pau Monné wrote:
>>>>> The format of the boot start info structure is the following
>>>>> (pointed to by %ebx):
>>>>>
>>>>>      struct hvm_start_info {
>>>>>      #define HVM_START_MAGIC_VALUE 0x336ec578
>>>>>          uint32_t magic;             /* Contains the magic value 0x336ec578       */
>>>>>                                      /* ("xEn3" with the 0x80 bit of the "E" set).*/
>>>>>          uint32_t flags;             /* SIF_xxx flags.                            */
>>>>>          uint32_t cmdline_paddr;     /* Physical address of the command line.     */
>>>>>          uint32_t nr_modules;        /* Number of modules passed to the kernel.   */
>>>>>          uint32_t modlist_paddr;     /* Physical address of an array of           */
>>>>>                                      /* hvm_modlist_entry.                        */
>>>>>      };
>>>>>
>>>>>      struct hvm_modlist_entry {
>>>>>          uint32_t paddr;             /* Physical address of the module.           */
>>>>>          uint32_t size;              /* Size of the module in bytes.              */
>>>>>      };
>>>> If there is more than one module, how is the guest expected to sort out
>>>> which module is what?
>> In general I was expecting this would be done by position, or if that's
>> not enough an additional module (at either position 0 or n) should be
>> passed to contain that information.
> 
> Then we should specify it somehow --- e.g. that first module is always
> the ramdisk.

No, that's how Linux uses it, but it's not part of the spec at all. From
a Xen PoV, these 'modules' are just memory regions; Xen doesn't know
anything else about them, nor does it need to.

>>> +1
>>> We need that to pass parameters to gnumach modules.
>> Hm, parameters as in a string that's paired with a module, or something
>> more complex like a metadata block?
>>
>> I see that multiboot provides a string associated with each module, we
>> could do the same IMHO. I'm fine with adding it to the boot ABI, but I
>> would prefer if someone with access to such an OS does the actual
>> implementation of this feature.
>>
>> Just to be clear that we are on the same page, then the _entry struct
>> becomes:
>>
>> struct hvm_modlist_entry {
>>     uint32_t paddr;
>>     uint32_t size;
>>     uint32_t cmdline_paddr;
>> };
>>
>> cmdline_paddr would work the same way as it does in the hvm_start_info
>> struct (ie: physical address of a zero-terminated ASCII string).
> 
> Doesn't this imply that strings should be part of this spec? Like "initrd"?

cmdline_paddr needs to be added to the spec, and I will do it in the
next revision (note that this will also require changes to the current
implementation). I'm not sure about the other part of your question:
the concrete strings are up to the implementation and completely out of
scope for the spec, but I guess you mean something else which I don't get.

Roger.
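
To make the layout discussed above concrete, here is a minimal C sketch of
how a guest might validate the start info and pick a module by position.
It assumes the three-field hvm_modlist_entry proposed in this message,
which is not part of DRAFT A. The helper names hvm_magic_from_chars() and
hvm_get_module() are hypothetical, and the caller is assumed to have
already turned modlist_paddr into a usable pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HVM_START_MAGIC_VALUE 0x336ec578u

struct hvm_start_info {
    uint32_t magic;          /* HVM_START_MAGIC_VALUE                   */
    uint32_t flags;          /* SIF_xxx flags                           */
    uint32_t cmdline_paddr;  /* Physical address of the command line    */
    uint32_t nr_modules;     /* Number of modules passed to the kernel  */
    uint32_t modlist_paddr;  /* Physical address of the module array    */
};

/* Entry layout as proposed in this thread (assumption, not in DRAFT A). */
struct hvm_modlist_entry {
    uint32_t paddr;          /* Physical address of the module          */
    uint32_t size;           /* Size of the module in bytes             */
    uint32_t cmdline_paddr;  /* Zero-terminated ASCII string, or 0      */
};

/* Both structures are plain arrays of 32-bit fields with no padding. */
static_assert(sizeof(struct hvm_start_info) == 20, "five 32-bit fields");
static_assert(sizeof(struct hvm_modlist_entry) == 12, "three 32-bit fields");

/* "xEn3" read as a little-endian word, with the 0x80 bit of the 'E' set. */
static inline uint32_t hvm_magic_from_chars(void)
{
    return (uint32_t)'x' |
           ((uint32_t)('E' | 0x80) << 8) |
           ((uint32_t)'n' << 16) |
           ((uint32_t)'3' << 24);
}

/*
 * Hypothetical helper: return module number idx, or NULL if the magic is
 * wrong, the list is missing, or idx is out of range. What each position
 * means is left to the guest OS; the ABI only describes memory regions.
 */
static inline const struct hvm_modlist_entry *
hvm_get_module(const struct hvm_start_info *si,
               const struct hvm_modlist_entry *modlist, uint32_t idx)
{
    if (si == NULL || modlist == NULL)
        return NULL;
    if (si->magic != HVM_START_MAGIC_VALUE || idx >= si->nr_modules)
        return NULL;
    return &modlist[idx];
}
```

A guest that follows the Linux convention would treat index 0 as the
ramdisk, but that is a choice made by the guest rather than something the
structures encode.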

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 18:22 ` Andrew Cooper
  2016-02-04 19:33   ` Roger Pau Monné
@ 2016-02-05  9:12   ` Jan Beulich
  2016-02-05  9:50     ` Roger Pau Monné
  1 sibling, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2016-02-05  9:12 UTC (permalink / raw)
  To: Andrew Cooper, roger.pau
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, xen-devel, samuel.thibault, Boris Ostrovsky

>>> On 04.02.16 at 19:22, <andrew.cooper3@citrix.com> wrote:
> On 04/02/16 17:48, Roger Pau Monné wrote:
>>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ 
>>    event channels?
>>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
> 
> +1000, for both.

I'm a little lost here: However nice that would be, how do you
envision this to work? For the first one, as pointed out before,
there are physdevops which the hardware domain needs to
issue to assist Xen (as a result of parsing and executing AML).
And for the second one, something needs to translate virtual
guest PCI topology to host physical one as well as mediate
config space accesses.

>>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>>    Bit 8 (TF) must be cleared. Other bits are all unspecified.
> 
> I would also specify that the direction flag shall be clear, to prevent
> all kernels needing to `cld` on entry.

In which case IOPL and AC state should perhaps also be nailed down?
Possibly even all of the control ones (leaving only the status flags
unspecified)?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05  9:12   ` Jan Beulich
@ 2016-02-05  9:50     ` Roger Pau Monné
  2016-02-05 10:40       ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05  9:50 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, xen-devel, samuel.thibault, Boris Ostrovsky

El 5/2/16 a les 10:12, Jan Beulich ha escrit:
>>>> On 04.02.16 at 19:22, <andrew.cooper3@citrix.com> wrote:
>> On 04/02/16 17:48, Roger Pau Monné wrote:
>>>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ 
>>>    event channels?
>>>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
>>
>> +1000, for both.
> 
> I'm a little lost here: However nice that would be, how do you
> envision this to work? For the first one, as pointed out before,
> there are physdevops which the hardware domain needs to
> issue to assist Xen (as a result of parsing and executing AML).
> And for the second one, something needs to translate virtual
> guest PCI topology to host physical one as well as mediate
> config space accesses.

I got a little carried away in that first statement. Inside the "ACPI"
section of the document there's a list of physdevops that we cannot get
rid of; however, it is considerably smaller than the current set. We are
at least going to keep PHYSDEVOP_pci_device_add and
PHYSDEVOP_pci_mmcfg_reserved.

Regarding PIRQs, for MSI/MSI-X I think we already have the ability to
trap and emulate IIRC, which should allow us to detect when the hardware
domain is trying to set them and act accordingly. Xen should receive
the native interrupts and inject them to the guest, but I assume this is
quite similar to what's already done for PCI-passthrough.

For legacy PCI interrupts, we can parse the MADT inside of Xen in order
to properly set up the lines/overrides and inject the interrupts that
are not handled by Xen straight into the hardware domain. This will
require us to be able to emulate the same topology as what is found in
native (eg: if there are two IO APICs in the hardware we should also
provide two emulated ones to the hw domain).

As for PCI config space accesses, don't we already do that? We trap on
access to the 0xcf8 io port.

>>>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>>>    Bit 8 (TF) must be cleared. Other bits are all unspecified.
>>
>> I would also specify that the direction flag shall be clear, to prevent
>> all kernels needing to `cld` on entry.
> 
> In which case IOPL and AC state should perhaps also be nailed down?
> Possibly even all of the control ones (leaving only the status flags
> unspecified)?

Status flag? Why don't we just say that all user-settable bits in the
status register will be set to 0 (or cleared)?

Roger.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 17:48 HVMlite ABI specification DRAFT A Roger Pau Monné
                   ` (2 preceding siblings ...)
  2016-02-04 19:09 ` Samuel Thibault
@ 2016-02-05 10:20 ` Ian Campbell
  2016-02-05 16:01 ` Tim Deegan
  4 siblings, 0 replies; 41+ messages in thread
From: Ian Campbell @ 2016-02-05 10:20 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, Jan Beulich, samuel.thibault,
	Boris Ostrovsky

On Thu, 2016-02-04 at 18:48 +0100, Roger Pau Monné wrote:
> Hello,
> 
> I've Cced a bunch of people who have expressed interest in the HVMlite 
> design/implementation,

I think "HVMlite" has now reached the point where we should start the
transition from PVH (classic) to PVH (hvmlite) naming rather than
introducing yet another guest type terminology where end users are going to
see it (the 4.7 release, specifications in tree, etc).

So IMHO HVMlite should be referred to as "PVH" throughout, with the
original implementation retconned to be called "PVH (classic)" or
"Prototype-PVH" or something. A short paragraph explaining the background
might be appropriate.

Calling them PVHv1 and PVHv2 would also be tolerable.

This should extend to all the documentation etc as well as IMHO to patch
postings (in that case "PVH (hvmlite)" might be appropriate in places where
there might be confusion until "PVH (classic)" really goes away).

The point is that HVMlite was always supposed to be a reimplementation of
the PVH concept using the lessons learned from the "come at it from the PV
end" attempt, it's not (from a user PoV) a new operating mode.

Ian.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05  9:50     ` Roger Pau Monné
@ 2016-02-05 10:40       ` Jan Beulich
  2016-02-05 11:04         ` Andrew Cooper
  2016-02-05 11:30         ` Roger Pau Monné
  0 siblings, 2 replies; 41+ messages in thread
From: Jan Beulich @ 2016-02-05 10:40 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
> to properly setup the lines/overwrites and inject the interrupts that
> are not handled by Xen straight into the hardware domain. This will
> require us to be able to emulate the same topology as what is found in
> native (eg: if there are two IO APICs in the hardware we should also
> provide two emulated ones to the hw domain).

I don't think MADT contains all the needed information, or else we
wouldn't need PHYSDEVOP_setup_gsi.

> As for PCI config space accesses, don't we already do that? We trap on
> access to the 0xcf8 io port.

We intercept that, but iirc we do no translation (and for DomU
these get forwarded to qemu anyway).

>>>>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>>>>    Bit 8 (TF) must be cleared. Other bits are all unspecified.
>>>
>>> I would also specify that the direction flag shall be clear, to prevent
>>> all kernels needing to `cld` on entry.
>> 
>> In which case IOPL and AC state should perhaps also be nailed down?
>> Possibly even all of the control ones (leaving only the status flags
>> unspecified)?
> 
> Status flag? Why don't we just say that all user-settable bits in the
> status register will be set to 0 (or cleared)?

Would be an option too.

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 10:40       ` Jan Beulich
@ 2016-02-05 11:04         ` Andrew Cooper
  2016-02-05 11:07           ` Jan Beulich
  2016-02-05 11:30         ` Roger Pau Monné
  1 sibling, 1 reply; 41+ messages in thread
From: Andrew Cooper @ 2016-02-05 11:04 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, xen-devel, samuel.thibault, Boris Ostrovsky

On 05/02/16 10:40, Jan Beulich wrote:
>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>> to properly setup the lines/overwrites and inject the interrupts that
>> are not handled by Xen straight into the hardware domain. This will
>> require us to be able to emulate the same topology as what is found in
>> native (eg: if there are two IO APICs in the hardware we should also
>> provide two emulated ones to the hw domain).
> I don't think MADT contains all the needed information, or else we
> wouldn't need PHYSDEVOP_setup_gsi.
>
>> As for PCI config space accesses, don't we already do that? We trap on
>> access to the 0xcf8 io port.
> We intercept that, but iirc we do no translation (and for DomU
> these get forwarded to qemu anyway).

This is one aspect which will change with the proposed plans to have a
small host bridge/root complex in Xen.

Currently, cf8/cfc handling is already done partly in Xen because of
multiple ioreq server handling.  However, the current setup completely
fails if the guest attempts to renumber the PCI Buses, and requires each
ioreq server to coordinate with their introduced topology.

A small host bridge and root complex in Xen solves all of these problems
for us, reduces the number of broadcast ioreqs Xen needs to make, and
allows multiple ioreq servers to function completely without any
self-coordination.
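
For reference, the address written to port 0xcf8 carries the bus, device
and function numbers that any such bridge would have to interpret. The C
sketch below shows the standard encoding of that register; the struct and
function names are hypothetical and this is not Xen's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xcf8u  /* address port */
#define PCI_CONFIG_DATA    0xcfcu  /* data port    */

/* Decoded view of a CONFIG_ADDRESS value (hypothetical helper type). */
struct pci_cfg_addr {
    bool    enabled;  /* bit 31: access is forwarded to config space */
    uint8_t bus;      /* bits 23..16                                 */
    uint8_t dev;      /* bits 15..11                                 */
    uint8_t fn;       /* bits 10..8                                  */
    uint8_t reg;      /* bits 7..2, kept as a dword-aligned offset   */
};

static inline struct pci_cfg_addr pci_decode_cf8(uint32_t v)
{
    struct pci_cfg_addr a = {
        .enabled = (v >> 31) & 1u,
        .bus     = (v >> 16) & 0xffu,
        .dev     = (v >> 11) & 0x1fu,
        .fn      = (v >> 8) & 0x7u,
        .reg     = v & 0xfcu,
    };
    return a;
}

static inline uint32_t pci_encode_cf8(uint8_t bus, uint8_t dev,
                                      uint8_t fn, uint8_t reg)
{
    assert(dev < 32 && fn < 8);
    return 0x80000000u |
           ((uint32_t)bus << 16) |
           ((uint32_t)dev << 11) |
           ((uint32_t)fn << 8) |
           (reg & 0xfcu);
}
```

Because the bus number is part of every access, whoever decodes these
writes needs a consistent view of the bus numbering the guest has
programmed.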

>
>>>>>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>>>>>    Bit 8 (TF) must be cleared. Other bits are all unspecified.
>>>> I would also specify that the direction flag shall be clear, to prevent
>>>> all kernels needing to `cld` on entry.
>>> In which case IOPL and AC state should perhaps also be nailed down?
>>> Possibly even all of the control ones (leaving only the status flags
>>> unspecified)?
>> Status flag? Why don't we just say that all user-settable bits in the
>> status register will be set to 0 (or cleared)?
> Would be an option too.

What about the ID bit, which probably ought to be set?

~Andrew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 11:04         ` Andrew Cooper
@ 2016-02-05 11:07           ` Jan Beulich
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Beulich @ 2016-02-05 11:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, xen-devel, samuel.thibault, Boris Ostrovsky,
	roger.pau

>>> On 05.02.16 at 12:04, <andrew.cooper3@citrix.com> wrote:
> On 05/02/16 10:40, Jan Beulich wrote:
>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>> Status flag? Why don't we just say that all user-settable bits in the
>>> status register will be set to 0 (or cleared)?
>> Would be an option too.
> 
> What about the ID bit, which probably ought to be set?

Why that? This flag exists solely to indicate presence of CPUID,
and does so by being modifiable (not by being set).

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 10:40       ` Jan Beulich
  2016-02-05 11:04         ` Andrew Cooper
@ 2016-02-05 11:30         ` Roger Pau Monné
  2016-02-05 11:45           ` Jan Beulich
  1 sibling, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 11:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

El 5/2/16 a les 11:40, Jan Beulich ha escrit:
>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>> to properly setup the lines/overwrites and inject the interrupts that
>> are not handled by Xen straight into the hardware domain. This will
>> require us to be able to emulate the same topology as what is found in
>> native (eg: if there are two IO APICs in the hardware we should also
>> provide two emulated ones to the hw domain).
> 
> I don't think MADT contains all the needed information, or else we
> wouldn't need PHYSDEVOP_setup_gsi.

AFAICT, we could do something like:

 - IRQs [0, 15]: edge-triggered, low-polarity.
 - IRQs [16, n]: level-triggered, high-polarity.

Unless there's an override in the MADT. Then there are interrupts that
are handled by Xen, which would not be passed through to the hardware
domain; the rest would be.

I expect that Xen will already have some code to deal with this, since
it's also used for regular PCI-passthrough.
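
As a rough illustration of the rule proposed above, the following C sketch
picks a trigger mode and polarity for a GSI, preferring an override when
one exists. The defaults are copied from the list in this message as
written; the types and the function name are hypothetical, and struct
gsi_override is a simplified stand-in for a MADT interrupt source
override entry rather than the real table layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum gsi_trigger  { GSI_EDGE, GSI_LEVEL };
enum gsi_polarity { GSI_ACTIVE_LOW, GSI_ACTIVE_HIGH };

struct gsi_config {
    enum gsi_trigger  trigger;
    enum gsi_polarity polarity;
};

/* Simplified stand-in for one MADT interrupt source override. */
struct gsi_override {
    uint32_t          gsi;
    struct gsi_config cfg;
};

/*
 * Return the configuration for a GSI: an override wins if present,
 * otherwise fall back to the defaults listed in this message.
 */
static inline struct gsi_config
gsi_guess_config(uint32_t gsi, const struct gsi_override *ovr, size_t nr_ovr)
{
    assert(ovr != NULL || nr_ovr == 0);

    for (size_t i = 0; i < nr_ovr; i++)
        if (ovr[i].gsi == gsi)
            return ovr[i].cfg;

    if (gsi < 16)
        return (struct gsi_config){ GSI_EDGE, GSI_ACTIVE_LOW };

    return (struct gsi_config){ GSI_LEVEL, GSI_ACTIVE_HIGH };
}
```

The sketch only encodes the guess; it says nothing about whether the
guess matches what a given platform actually wires up.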

>> As for PCI config space accesses, don't we already do that? We trap on
>> access to the 0xcf8 io port.
> 
> We intercept that, but iirc we do no translation (and for DomU
> these get forwarded to qemu anyway).
> 
>>>>>  * `eflags`: bit 17 (VM) must be cleared. Bit 9 (IF) must be cleared.
>>>>>    Bit 8 (TF) must be cleared. Other bits are all unspecified.
>>>>
>>>> I would also specify that the direction flag shall be clear, to prevent
>>>> all kernels needing to `cld` on entry.
>>>
>>> In which case IOPL and AC state should perhaps also be nailed down?
>>> Possibly even all of the control ones (leaving only the status flags
>>> unspecified)?
>>
>> Status flag? Why don't we just say that all user-settable bits in the
>> status register will be set to 0 (or cleared)?
> 
> Would be an option too.

AFAICT that's what we already do, so I will add it to the next iteration.
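
To keep track of which bits are being discussed, here is a small C sketch
of the relevant eflags masks. The bit positions are the architectural
ones; the HVMLITE_* names and the eflags_ok() helper are hypothetical and
only illustrate the two variants mentioned in this thread (the draft
wording, and the draft plus DF, IOPL and AC).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_TF   (1u << 8)   /* trap flag                  */
#define X86_EFLAGS_IF   (1u << 9)   /* interrupt enable flag      */
#define X86_EFLAGS_DF   (1u << 10)  /* direction flag             */
#define X86_EFLAGS_IOPL (3u << 12)  /* I/O privilege level        */
#define X86_EFLAGS_VM   (1u << 17)  /* virtual-8086 mode          */
#define X86_EFLAGS_AC   (1u << 18)  /* alignment check            */
#define X86_EFLAGS_ID   (1u << 21)  /* CPUID detection flag       */

/* Bits the draft text requires to be clear on entry. */
#define HVMLITE_EFLAGS_CLEAR_DRAFT \
    (X86_EFLAGS_VM | X86_EFLAGS_IF | X86_EFLAGS_TF)

/* Draft bits plus the ones raised in the replies (DF, IOPL, AC). */
#define HVMLITE_EFLAGS_CLEAR_DISCUSSED \
    (HVMLITE_EFLAGS_CLEAR_DRAFT | X86_EFLAGS_DF | \
     X86_EFLAGS_IOPL | X86_EFLAGS_AC)

static_assert(HVMLITE_EFLAGS_CLEAR_DRAFT == 0x20300u, "VM | IF | TF");

/* True if none of the bits in must_be_clear are set in eflags. */
static inline bool eflags_ok(uint32_t eflags, uint32_t must_be_clear)
{
    return (eflags & must_be_clear) == 0;
}
```

For example, an entry state with only DF set satisfies the draft wording
but not the stricter variant, which is exactly the case where a kernel
would otherwise need its own `cld`.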

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 11:30         ` Roger Pau Monné
@ 2016-02-05 11:45           ` Jan Beulich
  2016-02-05 11:50             ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2016-02-05 11:45 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

>>> On 05.02.16 at 12:30, <roger.pau@citrix.com> wrote:
> El 5/2/16 a les 11:40, Jan Beulich ha escrit:
>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>>> to properly setup the lines/overwrites and inject the interrupts that
>>> are not handled by Xen straight into the hardware domain. This will
>>> require us to be able to emulate the same topology as what is found in
>>> native (eg: if there are two IO APICs in the hardware we should also
>>> provide two emulated ones to the hw domain).
>> 
>> I don't think MADT contains all the needed information, or else we
>> wouldn't need PHYSDEVOP_setup_gsi.
> 
> AFAICT, I think we could do something like:
> 
>  - IRQs [0, 15]: edge-trigger, low-polarity.
>  - IRQs [16, n]: level-triggered, high-polarity.

That's not a valid assumption - I've seen systems with other settings
on GSI >= 16 ...

> Unless there's an overwrite in the MADT.

... and iirc that was without any MADT override (but instead coming
from the DSDT/SSDT).

> I expect that Xen will already have some code to deal with this, since
> it's also used for regular PCI-passthrough.

This has little to do with pass-through - we first of all need to get
the host working correctly on its own.

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 11:45           ` Jan Beulich
@ 2016-02-05 11:50             ` Roger Pau Monné
  2016-02-05 13:22               ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 11:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

El 5/2/16 a les 12:45, Jan Beulich ha escrit:
>>>> On 05.02.16 at 12:30, <roger.pau@citrix.com> wrote:
>> El 5/2/16 a les 11:40, Jan Beulich ha escrit:
>>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>>>> to properly setup the lines/overwrites and inject the interrupts that
>>>> are not handled by Xen straight into the hardware domain. This will
>>>> require us to be able to emulate the same topology as what is found in
>>>> native (eg: if there are two IO APICs in the hardware we should also
>>>> provide two emulated ones to the hw domain).
>>>
>>> I don't think MADT contains all the needed information, or else we
>>> wouldn't need PHYSDEVOP_setup_gsi.
>>
>> AFAICT, I think we could do something like:
>>
>>  - IRQs [0, 15]: edge-trigger, low-polarity.
>>  - IRQs [16, n]: level-triggered, high-polarity.
> 
> That's not a valid assumption - I've seen systems with other settings
> on GSI >= 16 ...

Then we just propagate how the emulated IO APIC pins are set up to the
real one. This should match reality, and is no different from using
PHYSDEVOP_setup_gsi. AFAICT it's just a different way of getting the
same information.

Roger.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 11:50             ` Roger Pau Monné
@ 2016-02-05 13:22               ` Jan Beulich
  2016-02-05 14:27                 ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2016-02-05 13:22 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

>>> On 05.02.16 at 12:50, <roger.pau@citrix.com> wrote:
> El 5/2/16 a les 12:45, Jan Beulich ha escrit:
>>>>> On 05.02.16 at 12:30, <roger.pau@citrix.com> wrote:
>>> El 5/2/16 a les 11:40, Jan Beulich ha escrit:
>>>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>>>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>>>>> to properly setup the lines/overwrites and inject the interrupts that
>>>>> are not handled by Xen straight into the hardware domain. This will
>>>>> require us to be able to emulate the same topology as what is found in
>>>>> native (eg: if there are two IO APICs in the hardware we should also
>>>>> provide two emulated ones to the hw domain).
>>>>
>>>> I don't think MADT contains all the needed information, or else we
>>>> wouldn't need PHYSDEVOP_setup_gsi.
>>>
>>> AFAICT, I think we could do something like:
>>>
>>>  - IRQs [0, 15]: edge-trigger, low-polarity.
>>>  - IRQs [16, n]: level-triggered, high-polarity.
>> 
>> That's not a valid assumption - I've seen systems with other settings
>> on GSI >= 16 ...
> 
> Then we just propagate how the emulated IO APIC pins are setup to the
> real one, this should match reality, and is no different from using
> PHYSDEVOP_setup_gsi. AFAICT it's just a different way of getting the
> same information.

That won't work either I'm afraid: For one, Dom0 may not even write
RTEs for interrupts it never enables. And even if it did, it would write
them masked, yet we mustn't derive information from masked RTEs -
see commit 669d4b85c4 ("x86/IO-APIC: don't create pIRQ mapping
from masked RTE"). Also consider e.g. the device IRQ which the
serial driver may be using: We specifically suppress modifications to
RTEs for in-use IRQs in current code and would of course need to
do so in the PVHv2 code too. That way there would be no proper
way to establish the two bits (short of grabbing the data from what
Dom0 tries to write despite us otherwise suppressing the write).
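
For readers less familiar with the RTE layout, the "two bits" are the
polarity and trigger mode fields in the low word of an IO-APIC
redirection entry. The C sketch below decodes them and refuses to report
anything for a masked entry, in the spirit of the commit cited above. It
is an illustration only, with hypothetical names, and not the code Xen
uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields in the low 32 bits of an IO-APIC redirection table entry. */
#define IOAPIC_RTE_VECTOR_MASK   0xffu       /* bits 7..0             */
#define IOAPIC_RTE_POLARITY_LOW  (1u << 13)  /* 0 = high, 1 = low     */
#define IOAPIC_RTE_TRIGGER_LEVEL (1u << 15)  /* 0 = edge, 1 = level   */
#define IOAPIC_RTE_MASKED        (1u << 16)  /* 1 = interrupt masked  */

static_assert(IOAPIC_RTE_MASKED == 0x10000u, "mask bit is bit 16");

/* Pin configuration derived from an RTE write (hypothetical type). */
struct rte_pin_cfg {
    bool valid;       /* false if the RTE was masked */
    bool level;       /* level-triggered             */
    bool active_low;  /* active-low polarity         */
};

/*
 * Decode trigger mode and polarity from the low word of an RTE.
 * A masked entry yields valid == false, so callers cannot derive
 * configuration from it by accident.
 */
static inline struct rte_pin_cfg ioapic_rte_pin_cfg(uint32_t rte_lo)
{
    struct rte_pin_cfg cfg = { false, false, false };

    if (rte_lo & IOAPIC_RTE_MASKED)
        return cfg;

    cfg.valid = true;
    cfg.level = (rte_lo & IOAPIC_RTE_TRIGGER_LEVEL) != 0;
    cfg.active_low = (rte_lo & IOAPIC_RTE_POLARITY_LOW) != 0;
    return cfg;
}
```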

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 13:22               ` Jan Beulich
@ 2016-02-05 14:27                 ` Roger Pau Monné
  2016-02-05 14:31                   ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 14:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

El 5/2/16 a les 14:22, Jan Beulich ha escrit:
>>>> On 05.02.16 at 12:50, <roger.pau@citrix.com> wrote:
>> El 5/2/16 a les 12:45, Jan Beulich ha escrit:
>>>>>> On 05.02.16 at 12:30, <roger.pau@citrix.com> wrote:
>>>> El 5/2/16 a les 11:40, Jan Beulich ha escrit:
>>>>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>>>>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>>>>>> to properly setup the lines/overwrites and inject the interrupts that
>>>>>> are not handled by Xen straight into the hardware domain. This will
>>>>>> require us to be able to emulate the same topology as what is found in
>>>>>> native (eg: if there are two IO APICs in the hardware we should also
>>>>>> provide two emulated ones to the hw domain).
>>>>>
>>>>> I don't think MADT contains all the needed information, or else we
>>>>> wouldn't need PHYSDEVOP_setup_gsi.
>>>>
>>>> AFAICT, I think we could do something like:
>>>>
>>>>  - IRQs [0, 15]: edge-trigger, low-polarity.
>>>>  - IRQs [16, n]: level-triggered, high-polarity.
>>>
>>> That's not a valid assumption - I've seen systems with other settings
>>> on GSI >= 16 ...
>>
>> Then we just propagate how the emulated IO APIC pins are setup to the
>> real one, this should match reality, and is no different from using
>> PHYSDEVOP_setup_gsi. AFAICT it's just a different way of getting the
>> same information.
> 
> That won't work either I'm afraid: For one, Dom0 may not even write
> RTEs for interrupts it never enables. And even if it did, it would write
> them masked, yet we mustn't derive information from masked RTEs -
> see commit 669d4b85c4 ("x86/IO-APIC: don't create pIRQ mapping
> from masked RTE").

In which case, why does Xen need to set up this interrupt/RTE if it's
never used by Dom0?

> Also consider e.g. the device IRQ which the
> serial driver may be using: We specifically suppress modifications to
> RTEs for in-use IRQs in current code and would of course need to
> do so in the PVHv2 code too. That way there would be no proper
> way to establish the two bits (short of grabbing the data from what
> Dom0 tries to write despite us otherwise suppressing the write).

For devices in use by Xen itself, like the uart, doesn't Xen already
take care of setting the right interrupt configuration? Or else how does
the uart work before Dom0 is launched?

The plan was to use the STAO ACPI table in order to notify Dom0 that
certain devices (like the uart) are not accessible, thus preventing Dom0
from setting any interrupts for these devices at all (ie: they should
just be ignored/skipped by Dom0 when doing device enumeration).

And in any case, writes to pins that are in use by Xen should not be
propagated to the physical IO APIC at all, since I would assume Xen has
already set them up properly.

Roger.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-05 14:27                 ` Roger Pau Monné
@ 2016-02-05 14:31                   ` Jan Beulich
  2016-02-05 15:00                     ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2016-02-05 14:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

>>> On 05.02.16 at 15:27, <roger.pau@citrix.com> wrote:
> El 5/2/16 a les 14:22, Jan Beulich ha escrit:
>>>>> On 05.02.16 at 12:50, <roger.pau@citrix.com> wrote:
>>> El 5/2/16 a les 12:45, Jan Beulich ha escrit:
>>>>>>> On 05.02.16 at 12:30, <roger.pau@citrix.com> wrote:
>>>>> El 5/2/16 a les 11:40, Jan Beulich ha escrit:
>>>>>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>>>>>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>>>>>>> to properly setup the lines/overwrites and inject the interrupts that
>>>>>>> are not handled by Xen straight into the hardware domain. This will
>>>>>>> require us to be able to emulate the same topology as what is found in
>>>>>>> native (eg: if there are two IO APICs in the hardware we should also
>>>>>>> provide two emulated ones to the hw domain).
>>>>>>
>>>>>> I don't think MADT contains all the needed information, or else we
>>>>>> wouldn't need PHYSDEVOP_setup_gsi.
>>>>>
>>>>> AFAICT, I think we could do something like:
>>>>>
>>>>>  - IRQs [0, 15]: edge-trigger, low-polarity.
>>>>>  - IRQs [16, n]: level-triggered, high-polarity.
>>>>
>>>> That's not a valid assumption - I've seen systems with other settings
>>>> on GSI >= 16 ...
>>>
>>> Then we just propagate how the emulated IO APIC pins are setup to the
>>> real one, this should match reality, and is no different from using
>>> PHYSDEVOP_setup_gsi. AFAICT it's just a different way of getting the
>>> same information.
>> 
>> That won't work either I'm afraid: For one, Dom0 may not even write
>> RTEs for interrupts it never enables. And even if it did, it would write
>> them masked, yet we mustn't derive information from masked RTEs -
>> see commit 669d4b85c4 ("x86/IO-APIC: don't create pIRQ mapping
>> from masked RTE").
> 
> In which case, why does Xen need to setup this interrupt/RTE if it's
> never used by Dom0?

Because (a) Xen itself may use it and (b) it may be used by a guest
(for example, Dom0 may not have a driver for a device, causing its
interrupt to not get enabled, but a guest handed the device would
very likely then also know how to deal with it).

>> Also consider e.g. the device IRQ which the
>> serial driver may be using: We specifically suppress modifications to
>> RTEs for in-use IRQs in current code and would of course need to
>> do so in the PVHv2 code too. That way there would be no proper
>> way to establish the two bits (short of grabbing the data from what
>> Dom0 tries to write despite us otherwise suppressing the write).
> 
> For devices in use by Xen itself, like the uart, doesn't Xen already
> take care of setting the right interrupt configuration? Or else how does
> the uart work before Dom0 is launched?

In polling mode.

> The plan was to use the STAO ACPI table in order to notify Dom0 that
> certain devices (like the uart) are not accessible, thus preventing Dom0
> from setting any interrupts for this devices at all (ie: they should
> just be ignored/skipped by Dom0 when doing device enumeration).
> 
> And in any case, writes to pins that are in use by Xen should not be
> propagated to the physical IO APIC at all, since I would assume Xen has
> already set them up properly.

Once again - it can't without Dom0's help if the interrupt isn't in
the legacy GSI range (below 16).

Jan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: HVMlite ABI specification DRAFT A
  2016-02-04 19:33   ` Roger Pau Monné
  2016-02-04 20:24     ` Boris Ostrovsky
@ 2016-02-05 14:44     ` Ian Campbell
  2016-02-05 14:46       ` Roger Pau Monné
  1 sibling, 1 reply; 41+ messages in thread
From: Ian Campbell @ 2016-02-05 14:44 UTC (permalink / raw)
  To: Roger Pau Monné, Andrew Cooper, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, Boris Ostrovsky

On Thu, 2016-02-04 at 20:33 +0100, Roger Pau Monné wrote:
> On 4/2/16 at 19:22, Andrew Cooper wrote:
> > On 04/02/16 17:48, Roger Pau Monné wrote:
> > > Hello,
> > > 
> > > I've Cced a bunch of people who have expressed interest in the
> > > HVMlite 
> > > design/implementation, both from a Xen or OS point of view. If you 
> > > would like to be removed, please say so and I will remove you in 
> > > further iterations. The same applies if you want to be added to the
> > > Cc.
> > > 
> > > This is an initial draft on the HVMlite design and implementation.
> > > I've 
> > > mixed certain aspects of the design with the implementation, because
> > > I 
> > > think we are quite tied by the implementation possibilities in
> > > certain 
> > > aspects, so not speaking about it would make the document incomplete.
> > > I 
> > > might be wrong on that, so feel free to comment otherwise if you
> > > would 
> > > prefer a different approach. At least this should get the
> > > conversation 
> > > started into a couple of pending items regarding HVMlite. I don't
> > > want 
> > > to spoil the fun, but IMHO they are:
> > > 
> > >  - Local APIC: should we _always_ provide a local APIC to HVMlite 
> > >    guests?
> > 
> > I think it would be best to offer an LAPIC by default (to be helpful to
> > most modern OSes), but leave the option for an administrator to disable
> > if they specifically don't want one.
> 
> So this also implies that we will also provide ACPI by default (RSDT,
> FADT, MADT)? IMHO the local APIC is specially helpful if it comes with a
> MADT, so that we can do CPU enumeration from it.

Just to be clear, we aren't talking about _requiring_ all (SMP) PVH guests
to be ACPI aware are we? Just about providing some of this stuff in ACPI
format for the benefit of OSes which happen to already be ACPI aware.

Right?


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 14:44     ` Ian Campbell
@ 2016-02-05 14:46       ` Roger Pau Monné
  0 siblings, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 14:46 UTC (permalink / raw)
  To: Ian Campbell, Andrew Cooper, xen-devel
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, Boris Ostrovsky

On 5/2/16 at 15:44, Ian Campbell wrote:
> On Thu, 2016-02-04 at 20:33 +0100, Roger Pau Monné wrote:
>> On 4/2/16 at 19:22, Andrew Cooper wrote:
>>> On 04/02/16 17:48, Roger Pau Monné wrote:
>>>> Hello,
>>>>
>>>> I've Cced a bunch of people who have expressed interest in the
>>>> HVMlite 
>>>> design/implementation, both from a Xen or OS point of view. If you 
>>>> would like to be removed, please say so and I will remove you in 
>>>> further iterations. The same applies if you want to be added to the
>>>> Cc.
>>>>
>>>> This is an initial draft on the HVMlite design and implementation.
>>>> I've 
>>>> mixed certain aspects of the design with the implementation, because
>>>> I 
>>>> think we are quite tied by the implementation possibilities in
>>>> certain 
>>>> aspects, so not speaking about it would make the document incomplete.
>>>> I 
>>>> might be wrong on that, so feel free to comment otherwise if you
>>>> would 
>>>> prefer a different approach. At least this should get the
>>>> conversation 
>>>> started into a couple of pending items regarding HVMlite. I don't
>>>> want 
>>>> to spoil the fun, but IMHO they are:
>>>>
>>>>  - Local APIC: should we _always_ provide a local APIC to HVMlite 
>>>>    guests?
>>>
>>> I think it would be best to offer an LAPIC by default (to be helpful to
>>> most modern OSes), but leave the option for an administrator to disable
>>> if they specifically don't want one.
>>
>> So this also implies that we will also provide ACPI by default (RSDT,
>> FADT, MADT)? IMHO the local APIC is specially helpful if it comes with a
>> MADT, so that we can do CPU enumeration from it.
> 
> Just to be clear, we aren't talking about _requiring_ all (SMP) PVH guests
> to be ACPI aware are we? Just about providing some of this stuff in ACPI
> format for the benefit of OSes which happen to already be ACPI aware.
> 
> Right?

Yes, that's right. ACPI is just going to be a requirement for
PCI-passthrough, since we don't plan to support pciback/pcifront.

Roger.




* Re: HVMlite ABI specification DRAFT A
  2016-02-05 14:31                   ` Jan Beulich
@ 2016-02-05 15:00                     ` Roger Pau Monné
  2016-02-05 15:29                       ` Jan Beulich
  0 siblings, 1 reply; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 15:00 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

On 5/2/16 at 15:31, Jan Beulich wrote:
>>>> On 05.02.16 at 15:27, <roger.pau@citrix.com> wrote:
>> On 5/2/16 at 14:22, Jan Beulich wrote:
>>>>>> On 05.02.16 at 12:50, <roger.pau@citrix.com> wrote:
>>>> On 5/2/16 at 12:45, Jan Beulich wrote:
>>>>>>>> On 05.02.16 at 12:30, <roger.pau@citrix.com> wrote:
>>>>>> On 5/2/16 at 11:40, Jan Beulich wrote:
>>>>>>>>>> On 05.02.16 at 10:50, <roger.pau@citrix.com> wrote:
>>>>>>>> For legacy PCI interrupts, we can parse the MADT inside of Xen in order
>>>>>>>> to properly setup the lines/overwrites and inject the interrupts that
>>>>>>>> are not handled by Xen straight into the hardware domain. This will
>>>>>>>> require us to be able to emulate the same topology as what is found in
>>>>>>>> native (eg: if there are two IO APICs in the hardware we should also
>>>>>>>> provide two emulated ones to the hw domain).
>>>>>>>
>>>>>>> I don't think MADT contains all the needed information, or else we
>>>>>>> wouldn't need PHYSDEVOP_setup_gsi.
>>>>>>
>>>>>> AFAICT, I think we could do something like:
>>>>>>
>>>>>>  - IRQs [0, 15]: edge-trigger, low-polarity.
>>>>>>  - IRQs [16, n]: level-triggered, high-polarity.
>>>>>
>>>>> That's not a valid assumption - I've seen systems with other settings
>>>>> on GSI >= 16 ...
>>>>
>>>> Then we just propagate how the emulated IO APIC pins are setup to the
>>>> real one, this should match reality, and is no different from using
>>>> PHYSDEVOP_setup_gsi. AFAICT it's just a different way of getting the
>>>> same information.
>>>
>>> That won't work either I'm afraid: For one, Dom0 may not even write
>>> RTEs for interrupts it never enables. And even if it did, it would write
>>> them masked, yet we mustn't derive information from masked RTEs -
>>> see commit 669d4b85c4 ("x86/IO-APIC: don't create pIRQ mapping
>>> from masked RTE").
>>
>> In which case, why does Xen need to setup this interrupt/RTE if it's
>> never used by Dom0?
> 
> Because (a) Xen itself may use it and (b) it may be used by a guest
> (for example, Dom0 may not have a driver for a device, causing its
> interrupt to not get enabled, but a guest handed the device would
> very likely then also know how to deal with it).

I would say that in case (a) Xen will have to use polling forever (I
guess our current approach was to switch to an interrupt driven model
once the IRQ was setup).

For (b) I'm quite sure we could force pciback (or whichever driver in
Dom0 gets the device assigned) to perform the IRQ configuration, even if
the device itself is not going to be used.

>>> Also consider e.g. the device IRQ which the
>>> serial driver may be using: We specifically suppress modifications to
>>> RTEs for in-use IRQs in current code and would of course need to
>>> do so in the PVHv2 code too. That way there would be no proper
>>> way to establish the two bits (short of grabbing the data from what
>>> Dom0 tries to write despite us otherwise suppressing the write).
>>
>> For devices in use by Xen itself, like the uart, doesn't Xen already
>> take care of setting the right interrupt configuration? Or else how does
>> the uart work before Dom0 is launched?
> 
> In polling mode.

I guess this is not very common, since most uarts use a GSI < 16. In
which case, couldn't the ones that use a GSI >= 16 just be used in
polling mode _forever_?
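
To make the trade-off concrete, here is a rough sketch in C of how the choice
between polling and interrupt mode could look. It assumes, as stated earlier
in the thread, that Xen can configure legacy GSIs (below 16) on its own and
needs trigger/polarity information from Dom0 for anything else. The names
`uart_pick_mode`, `enum uart_mode` and `NR_LEGACY_GSIS` are hypothetical and
do not refer to existing Xen code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_LEGACY_GSIS 16u  /* GSIs 0..15 are the legacy range */

enum uart_mode { UART_POLLING, UART_INTERRUPT };

/*
 * Pick the UART operating mode (hypothetical helper).
 *
 * gsi:              the GSI the UART is wired to.
 * have_dom0_config: true once trigger mode and polarity for this GSI
 *                   have been supplied by Dom0.
 */
static enum uart_mode uart_pick_mode(unsigned int gsi, bool have_dom0_config)
{
    /* Legacy GSIs can be set up without any help from Dom0. */
    if (gsi < NR_LEGACY_GSIS)
        return UART_INTERRUPT;

    /* Otherwise stay in polling mode until the configuration is known. */
    return have_dom0_config ? UART_INTERRUPT : UART_POLLING;
}
```

With this model, a UART on GSI 4 is interrupt driven from the start, while one
on GSI 20 would stay in polling mode for as long as `have_dom0_config` is
false, which is the "forever" case asked about above.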

>> The plan was to use the STAO ACPI table in order to notify Dom0 that
>> certain devices (like the uart) are not accessible, thus preventing Dom0
>> from setting any interrupts for these devices at all (ie: they should
>> just be ignored/skipped by Dom0 when doing device enumeration).
>>
>> And in any case, writes to pins that are in use by Xen should not be
>> propagated to the physical IO APIC at all, since I would assume Xen has
>> already set them up properly.
> 
> Once again - it can't without Dom0's help if the interrupt isn't in
> the legacy GSI range (below 16).

Which devices is Xen expected to use with a GSI >= 16? I can only think
of the uart, but maybe there are others which I'm missing?

Roger.


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 15:00                     ` Roger Pau Monné
@ 2016-02-05 15:29                       ` Jan Beulich
  2016-02-05 15:35                         ` Roger Pau Monné
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Beulich @ 2016-02-05 15:29 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

>>> On 05.02.16 at 16:00, <roger.pau@citrix.com> wrote:
> On 5/2/16 at 15:31, Jan Beulich wrote:
>>>>> On 05.02.16 at 15:27, <roger.pau@citrix.com> wrote:
>>> On 5/2/16 at 14:22, Jan Beulich wrote:
>>>> Also consider e.g. the device IRQ which the
>>>> serial driver may be using: We specifically suppress modifications to
>>>> RTEs for in-use IRQs in current code and would of course need to
>>>> do so in the PVHv2 code too. That way there would be no proper
>>>> way to establish the two bits (short of grabbing the data from what
>>>> Dom0 tries to write despite us otherwise suppressing the write).
>>>
>>> For devices in use by Xen itself, like the uart, doesn't Xen already
>>> take care of setting the right interrupt configuration? Or else how does
>>> the uart work before Dom0 is launched?
>> 
>> In polling mode.
> 
> I guess this is not very common, since most uarts use a GSI < 16. In
> which case, couldn't the ones that use a GSI >= 16 just be used in
> polling mode _forever_?

It could, but it's inefficient.

>>> The plan was to use the STAO ACPI table in order to notify Dom0 that
>>> certain devices (like the uart) are not accessible, thus preventing Dom0
>>> from setting any interrupts for these devices at all (ie: they should
>>> just be ignored/skipped by Dom0 when doing device enumeration).
>>>
>>> And in any case, writes to pins that are in use by Xen should not be
>>> propagated to the physical IO APIC at all, since I would assume Xen has
>>> already set them up properly.
>> 
>> Once again - it can't without Dom0's help if the interrupt isn't in
>> the legacy GSI range (below 16).
> 
> Which devices is Xen expected to use with a GSI >= 16? I can only think
> of the uart, but maybe there are others which I'm missing?

Right now only the UART, but who knows what's to come?

Jan


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 15:29                       ` Jan Beulich
@ 2016-02-05 15:35                         ` Roger Pau Monné
  0 siblings, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 15:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Tim Deegan,
	Paul Durrant, David Vrabel, xen-devel, samuel.thibault,
	Boris Ostrovsky

On 5/2/16 at 16:29, Jan Beulich wrote:
>>>> On 05.02.16 at 16:00, <roger.pau@citrix.com> wrote:
>> On 5/2/16 at 15:31, Jan Beulich wrote:
>>>>>> On 05.02.16 at 15:27, <roger.pau@citrix.com> wrote:
>>>> On 5/2/16 at 14:22, Jan Beulich wrote:
>>>>> Also consider e.g. the device IRQ which the
>>>>> serial driver may be using: We specifically suppress modifications to
>>>>> RTEs for in-use IRQs in current code and would of course need to
>>>>> do so in the PVHv2 code too. That way there would be no proper
>>>>> way to establish the two bits (short of grabbing the data from what
>>>>> Dom0 tries to write despite us otherwise suppressing the write).
>>>>
>>>> For devices in use by Xen itself, like the uart, doesn't Xen already
>>>> take care of setting the right interrupt configuration? Or else how does
>>>> the uart work before Dom0 is launched?
>>>
>>> In polling mode.
>>
>> I guess this is not very common, since most uarts use a GSI < 16. In
>> which case, couldn't the ones that use a GSI >= 16 just be used in
>> polling mode _forever_?
> 
> It could, but it's inefficient.
> 
>>>> The plan was to use the STAO ACPI table in order to notify Dom0 that
>>>> certain devices (like the uart) are not accessible, thus preventing Dom0
>>>> from setting any interrupts for these devices at all (ie: they should
>>>> just be ignored/skipped by Dom0 when doing device enumeration).
>>>>
>>>> And in any case, writes to pins that are in use by Xen should not be
>>>> propagated to the physical IO APIC at all, since I would assume Xen has
>>>> already set them up properly.
>>>
>>> Once again - it can't without Dom0's help if the interrupt isn't in
>>> the legacy GSI range (below 16).
>>
>> Which devices is Xen expected to use with a GSI >= 16? I can only think
>> of the uart, but maybe there are others which I'm missing?
> 
> Right now only the UART, but who knows what's to come?

TBH (and maybe I'm being overly confident here) I expect that anything
new will just use MSI.

Roger.


* Re: HVMlite ABI specification DRAFT A
  2016-02-04 17:48 HVMlite ABI specification DRAFT A Roger Pau Monné
                   ` (3 preceding siblings ...)
  2016-02-05 10:20 ` Ian Campbell
@ 2016-02-05 16:01 ` Tim Deegan
  2016-02-05 16:13   ` Roger Pau Monné
  2016-02-05 17:14   ` Andrew Cooper
  4 siblings, 2 replies; 41+ messages in thread
From: Tim Deegan @ 2016-02-05 16:01 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, xen-devel,
	Boris Ostrovsky

At 18:48 +0100 on 04 Feb (1454611694), Roger Pau Monné wrote:
> Hello,
> 
> I've Cced a bunch of people who have expressed interest in the HVMlite
> design/implementation, both from a Xen or OS point of view. If you
> would like to be removed, please say so and I will remove you in
> further iterations. The same applies if you want to be added to the Cc.
> 
> This is an initial draft on the HVMlite design and implementation. I've
> mixed certain aspects of the design with the implementation, because I
> think we are quite tied by the implementation possibilities in certain
> aspects, so not speaking about it would make the document incomplete. I
> might be wrong on that, so feel free to comment otherwise if you would
> prefer a different approach. At least this should get the conversation
> started into a couple of pending items regarding HVMlite. I don't want
> to spoil the fun, but IMHO they are:
> 
>  - Local APIC: should we _always_ provide a local APIC to HVMlite
>    guests?
>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ
>    event channels?
>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?

FWIW, I think we should err on the side of _not_ emulating hardware or
providing ACPI; if the hypervisor interfaces are insufficient/unpleasant
we should make them better.

I understand that PCI passthrough is difficult because the hardware
design is so awkward to retrofit isolation onto.  But I'm very
uncomfortable with the idea of faking out things like PCI root
complexes inside the hypervisor -- as a way of getting rid of qemu
it's laughable.  I'd be much happier saying that PCI passthrough
requires PV or legacy HVM until a better plan can be found
(e.g. depriv helpers).

Cheers,

Tim.


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 16:01 ` Tim Deegan
@ 2016-02-05 16:13   ` Roger Pau Monné
  2016-02-05 17:14   ` Andrew Cooper
  1 sibling, 0 replies; 41+ messages in thread
From: Roger Pau Monné @ 2016-02-05 16:13 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, xen-devel,
	Boris Ostrovsky

On 5/2/16 at 17:01, Tim Deegan wrote:
> At 18:48 +0100 on 04 Feb (1454611694), Roger Pau Monné wrote:
>> Hello,
>>
>> I've Cced a bunch of people who have expressed interest in the HVMlite
>> design/implementation, both from a Xen or OS point of view. If you
>> would like to be removed, please say so and I will remove you in
>> further iterations. The same applies if you want to be added to the Cc.
>>
>> This is an initial draft on the HVMlite design and implementation. I've
>> mixed certain aspects of the design with the implementation, because I
>> think we are quite tied by the implementation possibilities in certain
>> aspects, so not speaking about it would make the document incomplete. I
>> might be wrong on that, so feel free to comment otherwise if you would
>> prefer a different approach. At least this should get the conversation
>> started into a couple of pending items regarding HVMlite. I don't want
>> to spoil the fun, but IMHO they are:
>>
>>  - Local APIC: should we _always_ provide a local APIC to HVMlite
>>    guests?
>>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ
>>    event channels?
>>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
> 
> FWIW, I think we should err on the side of _not_ emulating hardware or
> providing ACPI; if the hypervisor interfaces are insufficient/unpleasant
> we should make them better.

Well, that's one side of it. It's not really that they are
insufficient/unpleasant (or not properly documented) it's just that they
require substantial work to implement and maintain in guest OSes. So we
have the burden of maintaining them inside of Xen and inside of OSes. If
we switch to emulation as much as possible when it makes sense we will
only have the burden of maintaining the hypervisor side.

We will anyway need to provide a local APIC at least in order to benefit
from posted-interrupts (and probably new hw features as they come up),
and the only way to do it is to provide an interface that's exactly the
same as on native.

> I understand that PCI passthrough is difficult because the hardware
> design is so awkward to retrofit isolation onto.  But I'm very
> uncomfortable with the idea of faking out things like PCI root
> complexes inside the hypervisor -- as a way of getting rid of qemu
> it's laughable.  I'd be much happier saying that PCI passthrough
> requires PV or legacy HVM until a better plan can be found
> (e.g. depriv helpers).

That would be great, and it's indeed my preferred route, but if we cannot
hold PCI-passthrough until we have the depriv mode in place it would have
to be done inside the hypervisor. Now that we have Kconfig it could even
be left out if someone doesn't really trust the code.

Roger.


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 16:01 ` Tim Deegan
  2016-02-05 16:13   ` Roger Pau Monné
@ 2016-02-05 17:14   ` Andrew Cooper
  2016-02-05 18:05     ` Tim Deegan
  2016-02-08 12:10     ` Stefano Stabellini
  1 sibling, 2 replies; 41+ messages in thread
From: Andrew Cooper @ 2016-02-05 17:14 UTC (permalink / raw)
  To: Tim Deegan, Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Paul Durrant, David Vrabel,
	Jan Beulich, samuel.thibault, xen-devel, Boris Ostrovsky

On 05/02/16 16:01, Tim Deegan wrote:
> At 18:48 +0100 on 04 Feb (1454611694), Roger Pau Monné wrote:
>> Hello,
>>
>> I've Cced a bunch of people who have expressed interest in the HVMlite
>> design/implementation, both from a Xen or OS point of view. If you
>> would like to be removed, please say so and I will remove you in
>> further iterations. The same applies if you want to be added to the Cc.
>>
>> This is an initial draft on the HVMlite design and implementation. I've
>> mixed certain aspects of the design with the implementation, because I
>> think we are quite tied by the implementation possibilities in certain
>> aspects, so not speaking about it would make the document incomplete. I
>> might be wrong on that, so feel free to comment otherwise if you would
>> prefer a different approach. At least this should get the conversation
>> started into a couple of pending items regarding HVMlite. I don't want
>> to spoil the fun, but IMHO they are:
>>
>>  - Local APIC: should we _always_ provide a local APIC to HVMlite
>>    guests?
>>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ
>>    event channels?
>>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
> FWIW, I think we should err on the side of _not_ emulating hardware or
> providing ACPI; if the hypervisor interfaces are insufficient/unpleasant
> we should make them better.
>
> I understand that PCI passthrough is difficult because the hardware
> design is so awkward to retrofit isolation onto.  But I'm very
> uncomfortable with the idea of faking out things like PCI root
> complexes inside the hypervisor -- as a way of getting rid of qemu
> it's laughable.

Most certainly not.

90% of the necessary PCI infrastructure is already in the hypervisor,
and actively used for tracking interrupt mask bits.  Some of this was
even introduced in XSAs, and isn't going away.

As far as I am aware, the remaining 10% is a bus 0, and PCI-compliant
bus handling (a few extra registers in legacy PCI configuration space),
to be able to steer all other PCI related accesses to the appropriate
ioreq server, and splitting of the two GPE blocks.

Yes, this does involve adding a little extra emulation to Xen, but the
benefits are a substantially cleaner architecture for device models,
which doesn't require them to self-coordinate about their layout, or
have to talk to Qemu directly to negotiate hotplug notifications.

>   I'd be much happier saying that PCI passthrough
> requires PV or legacy HVM until a better plan can be found
> (e.g. depriv helpers).

The current pci-front/back and Qemu-based methods have substantial
architectural deficiencies, and are incredibly fragile to change.  When
was the last XSA to PCI Passthrough which didn't end up requiring
further bugfixes to undo the collateral damage?

It is my hope that we can correct the architecture as part of developing
HVMLite (at which point HVM will immediately benefit), and vastly simplify
the device model interfaces.  Of course, if during the course of
development this proves not to be the case, we will have to sit down and
replan.

From where I am sitting, most of the risks are already present, and the
potential benefits vastly outweigh the downsides.

~Andrew


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 17:14   ` Andrew Cooper
@ 2016-02-05 18:05     ` Tim Deegan
  2016-02-05 18:44       ` Andrew Cooper
  2016-02-08 12:10     ` Stefano Stabellini
  1 sibling, 1 reply; 41+ messages in thread
From: Tim Deegan @ 2016-02-05 18:05 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Paul Durrant, David Vrabel,
	Jan Beulich, samuel.thibault, xen-devel, Boris Ostrovsky,
	Roger Pau Monné

At 17:14 +0000 on 05 Feb (1454692488), Andrew Cooper wrote:
> On 05/02/16 16:01, Tim Deegan wrote:
> > At 18:48 +0100 on 04 Feb (1454611694), Roger Pau Monné wrote:
> >> Hello,
> >>
> >> I've Cced a bunch of people who have expressed interest in the HVMlite
> >> design/implementation, both from a Xen or OS point of view. If you
> >> would like to be removed, please say so and I will remove you in
> >> further iterations. The same applies if you want to be added to the Cc.
> >>
> >> This is an initial draft on the HVMlite design and implementation. I've
> >> mixed certain aspects of the design with the implementation, because I
> >> think we are quite tied by the implementation possibilities in certain
> >> aspects, so not speaking about it would make the document incomplete. I
> >> might be wrong on that, so feel free to comment otherwise if you would
> >> prefer a different approach. At least this should get the conversation
> >> started into a couple of pending items regarding HVMlite. I don't want
> >> to spoil the fun, but IMHO they are:
> >>
> >>  - Local APIC: should we _always_ provide a local APIC to HVMlite
> >>    guests?
> >>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ
> >>    event channels?
> >>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
> > FWIW, I think we should err on the side of _not_ emulating hardware or
> > providing ACPI; if the hypervisor interfaces are insufficient/unpleasant
> > we should make them better.
> >
> > I understand that PCI passthrough is difficult because the hardware
> > design is so awkward to retrofit isolation onto.  But I'm very
> > uncomfortable with the idea of faking out things like PCI root
> > complexes inside the hypervisor -- as a way of getting rid of qemu
> > it's laughable.
> 
> Most certainly not.
> 
> 90% of the necessary PCI infrastructure is already in the hypervisor,
> and actively used for tracking interrupt mask bits.  Some of this was
> even introduced in XSAs, and isn't going away.

This is the chance to _make_ it go away.  If we commit to modelling
IO-APICs and PCI bridges now, we'll be stuck with it for a while.

I'm not suggesting that we have to stick with pcifront, and I
appreciate the argument that at some point Xen must control the PCI
devices, but it doesn't follow that emulated hardware is the ABI Xen
should expose for that.

> Yes, this does involve adding a little extra emulation to Xen, but the
> benefits are a substantially cleaner architecture for device models,
> which doesn't require them to self-coordinate about their layout, or
> have to talk to Qemu directly to negotiate hotplug notifications.

Now that's a different thing altogether -- emulated device models
presenting as PCI devices.  And here I still disagree with you -- Xen
shouldn't have to decide device models' layouts.  That's _policy_, and
the hypervisor's job is _enforcement_.

Tim.


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 18:05     ` Tim Deegan
@ 2016-02-05 18:44       ` Andrew Cooper
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Cooper @ 2016-02-05 18:44 UTC (permalink / raw)
  To: Tim Deegan, Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Paul Durrant, David Vrabel,
	Jan Beulich, samuel.thibault, xen-devel, Boris Ostrovsky

On 05/02/16 18:05, Tim Deegan wrote:
> At 17:14 +0000 on 05 Feb (1454692488), Andrew Cooper wrote:
>> On 05/02/16 16:01, Tim Deegan wrote:
>>> At 18:48 +0100 on 04 Feb (1454611694), Roger Pau Monné wrote:
>>>> Hello,
>>>>
>>>> I've Cced a bunch of people who have expressed interest in the HVMlite
>>>> design/implementation, both from a Xen or OS point of view. If you
>>>> would like to be removed, please say so and I will remove you in
>>>> further iterations. The same applies if you want to be added to the Cc.
>>>>
>>>> This is an initial draft on the HVMlite design and implementation. I've
>>>> mixed certain aspects of the design with the implementation, because I
>>>> think we are quite tied by the implementation possibilities in certain
>>>> aspects, so not speaking about it would make the document incomplete. I
>>>> might be wrong on that, so feel free to comment otherwise if you would
>>>> prefer a different approach. At least this should get the conversation
>>>> started into a couple of pending items regarding HVMlite. I don't want
>>>> to spoil the fun, but IMHO they are:
>>>>
>>>>  - Local APIC: should we _always_ provide a local APIC to HVMlite
>>>>    guests?
>>>>  - HVMlite hardware domain: can we get rid of the PHYSDEV ops and PIRQ
>>>>    event channels?
>>>>  - HVMlite PCI-passthrough: can we get rid of pciback/pcifront?
>>> FWIW, I think we should err on the side of _not_ emulating hardware or
>>> providing ACPI; if the hypervisor interfaces are insufficient/unpleasant
>>> we should make them better.
>>>
>>> I understand that PCI passthrough is difficult because the hardware
>>> design is so awkward to retrofit isolation onto.  But I'm very
>>> uncomfortable with the idea of faking out things like PCI root
>>> complexes inside the hypervisor -- as a way of getting rid of qemu
>>> it's laughable.
>> Most certainly not.
>>
>> 90% of the necessary PCI infrastructure is already in the hypervisor,
>> and actively used for tracking interrupt mask bits.  Some of this was
>> even introduced in XSAs, and isn't going away.
> This is the chance to _make_ it go away.  If we commit to modelling
> IO-APICs and PCI bridges now, we'll be stuck with it for a while.

HVMLite at the moment has no emulated devices, and we definitely want to
keep that option available.

Both FreeBSD and Linux expect an LAPIC, and this appears to be a common
assumption (Reasonable as well, as the LAPIC is part of the CPU these
days).  I think it is worth offering an LAPIC by default, but retaining
the ability for the admin to configure it off.

Hardware extensions such as APICV/AVIC necessitate an LAPIC emulation
for the guest, and I expect there will be demand for using a
configuration like this, simply for the performance benefit (ARAT being
a common clocksource for guests which can now run without Xen interaction).

>
> I'm not suggesting that we have to stick with pcifront, and I
> appreciate the argument that at some point Xen must control the PCI
> devices, but it doesn't follow that emulated hardware is the ABI Xen
> should expose for that.

I don't think the IOAPIC or PCI bridges should be in the base ABI. 
Apologies if I gave that impression.

I expect the overwhelming majority of the use of HVMLite domains will be
without PCI passthrough.

However, if passthrough is wanted, these devices are going to be needed,
one way or another.

>
>> Yes, this does involve adding a little extra emulation to Xen, but the
>> benefits are a substantially cleaner architecture for device models,
>> which doesn't require them to self-coordinate about their layout, or
>> have to talk to Qemu directly to negotiate hotplug notifications.
> Now that's a different thing altogether -- emulated device models
> presenting as PCI devices.  And here I still disagree with you -- Xen
> shouldn't have to decide device models' layouts.  That's _policy_, and
> the hypervisor's job is _enforcement_.

I am not suggesting that policy moves into Xen.

Currently, policy is in Qemu, even when multiple device models are
involved, and there is no enforcement anywhere.  All config accesses
must be broadcast to all ioreq servers because Xen has no idea which
ioreq server is serving which devices.  Secondary device models have to
choose a PCI BDF which they know Qemu will ignore accesses for.

Instead, Xen should "own" bus 0, and be able to say yes/no to ioreq
servers requesting to set up emulation for a new device.  A traditional
device model would come along saying "I have $A, $B, $C and $D, and they
must be laid out like this".  A secondary device model can come along
and say "I have a hotplug $E. Please choose a free slot for me".
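
As a sketch of what "owning" bus 0 could mean, the C fragment below keeps one
owner per device slot, lets a device model either request a specific slot or
ask for any free one, and routes an access to exactly one owner instead of
broadcasting it. Everything here is an assumption made for illustration:
`struct bus0`, `bus0_claim`, `bus0_claim_any` and `bus0_route` are
hypothetical names, ioreq servers are reduced to a non-zero 16-bit id, and
PCI functions are ignored so that a slot is just a device number (0 to 31).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUS0_SLOTS 32u  /* device numbers 0..31 on a single PCI bus */
#define NO_SERVER  0u   /* id 0 is reserved to mark a free slot */

struct bus0 {
    uint16_t owner[BUS0_SLOTS];  /* ioreq server id owning each slot */
};

/* "I have $A and it must live in this slot": fails if the slot is taken. */
static int bus0_claim(struct bus0 *bus, unsigned int slot, uint16_t server)
{
    assert(bus != NULL);

    if (server == NO_SERVER || slot >= BUS0_SLOTS ||
        bus->owner[slot] != NO_SERVER)
        return -1;

    bus->owner[slot] = server;
    return 0;
}

/* "Please choose a free slot for me": returns the slot, or -1 if full. */
static int bus0_claim_any(struct bus0 *bus, uint16_t server)
{
    unsigned int slot;

    assert(bus != NULL);

    if (server == NO_SERVER)
        return -1;

    for (slot = 0; slot < BUS0_SLOTS; slot++) {
        if (bus->owner[slot] == NO_SERVER) {
            bus->owner[slot] = server;
            return (int)slot;
        }
    }
    return -1;
}

/* Route a config access to its single owner (NO_SERVER if unclaimed). */
static uint16_t bus0_route(const struct bus0 *bus, unsigned int slot)
{
    assert(bus != NULL);

    return slot < BUS0_SLOTS ? bus->owner[slot] : NO_SERVER;
}
```

Note that the table only records and enforces claims. Which device goes where
is still decided by whoever calls `bus0_claim`, so the layout policy stays
outside the hypervisor in this model.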

~Andrew


* Re: HVMlite ABI specification DRAFT A
  2016-02-05 17:14   ` Andrew Cooper
  2016-02-05 18:05     ` Tim Deegan
@ 2016-02-08 12:10     ` Stefano Stabellini
  2016-02-08 13:21       ` David Vrabel
  1 sibling, 1 reply; 41+ messages in thread
From: Stefano Stabellini @ 2016-02-08 12:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Stefano Stabellini, Tim Deegan, Paul Durrant,
	David Vrabel, Jan Beulich, samuel.thibault, xen-devel,
	Boris Ostrovsky, Roger Pau Monné

On Fri, 5 Feb 2016, Andrew Cooper wrote:
> The current pci-front/back and Qemu-based methods have substantial
> architectural deficiencies, and are incredibly fragile to change.  When
> was the last XSA to PCI Passthrough which didn't end up requiring
> further bugfixes to undo the collateral damage?

What's the problem with the pcifront/pciback model exactly (aside from
being tied to pirqs, but we already had a plan to fix that so that we
could use them on ARM, where we always have a GIC interrupt controller)?


* Re: HVMlite ABI specification DRAFT A
  2016-02-08 12:10     ` Stefano Stabellini
@ 2016-02-08 13:21       ` David Vrabel
  0 siblings, 0 replies; 41+ messages in thread
From: David Vrabel @ 2016-02-08 13:21 UTC (permalink / raw)
  To: Stefano Stabellini, Andrew Cooper
  Cc: Wei Liu, Tim Deegan, Paul Durrant, David Vrabel, Jan Beulich,
	xen-devel, samuel.thibault, Boris Ostrovsky, Roger Pau Monné

On 08/02/16 12:10, Stefano Stabellini wrote:
> On Fri, 5 Feb 2016, Andrew Cooper wrote:
>> The current pci-front/back and Qemu-based methods have substantial
>> architectural deficiencies, and are incredibly fragile to change.  When
>> was the last XSA to PCI Passthrough which didn't end up requiring
>> further bugfixes to undo the collateral damage?
> 
> What's the problem with the pcifront/pciback model exactly (aside from
> being tied to pirqs, but we already had a plan to fix that so that we
> could use them on ARM, where we always have a GIC interrupt controller)?

The most obvious is that it doesn't work for dom0.

David

