From: "Edgar E. Iglesias" <edgar.iglesias@gmail.com>
To: "Lan, Tianyu" <tianyu.lan@intel.com>
Cc: "yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Jan Beulich <JBeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	"ian.jackson@eu.citrix.com" <ian.jackson@eu.citrix.com>,
	xuquan8@huawei.com, Julien Grall <julien.grall@arm.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	"anthony.perard@citrix.com" <anthony.perard@citrix.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: Xen virtual IOMMU high level design doc
Date: Wed, 23 Nov 2016 19:19:37 +0100
Message-ID: <20161123181937.GZ9606@toto>
In-Reply-To: <b17f9cc8-955d-c7e7-f944-a20a77f75d64@intel.com>

On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> Hi All:
>      The following is our Xen vIOMMU high level design for detailed
> discussion. Please have a look. Your comments are much appreciated.
> This design doesn't cover changes when root port is moved to hypervisor.
> We may design it later.

Hi,

I have a few questions.

If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
So guests will essentially create Intel IOMMU style page tables.

If we were to use this on Xen/ARM, we would likely be modelling an ARM
SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
hypervisor ops for QEMU's Xen dummy IOMMU queries would not really be used.
Do I understand this correctly?

Has a platform-agnostic PV-IOMMU been considered to support 2-stage
translation (i.e. VFIO in the guest)? Perhaps that would hurt map/unmap
performance too much?

Best regards,
Edgar




> 
> 
> Content:
> ===============================================================================
> 1. Motivation of vIOMMU
> 	1.1 Enable more than 255 vcpus
> 	1.2 Support VFIO-based user space driver
> 	1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 	2.1 2nd level translation overview
> 	2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 	3.1 New vIOMMU hypercall interface
> 	3.2 2nd level translation
> 	3.3 Interrupt remapping
> 	3.4 1st level translation
> 	3.5 Implementation consideration
> 4. Qemu
> 	4.1 Qemu vIOMMU framework
> 	4.2 Dummy xen-vIOMMU driver
> 	4.3 Q35 vs. i440x
> 	4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===============================================================================
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires more than 255 vcpus in a single VM to meet
> parallel computing requirements. Supporting more than 255 vcpus requires
> the interrupt remapping capability to be present on the vIOMMU in order
> to deliver interrupts to vcpus numbered above 255. Otherwise a Linux
> guest fails to boot with more than 255 vcpus.
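
To make the 255 limit concrete: in the compatibility (non-remapped) MSI
address format the destination APIC ID occupies only bits 19:12, i.e. 8
bits. The following is a minimal sketch of that encoding; the function
and macro names are made up for illustration and are not taken from Xen
or Linux.

```c
#include <stdint.h>
#include <stdbool.h>

/* Compatibility-format MSI address: 0xFEE in bits 31:20,
 * destination APIC ID in bits 19:12 (8 bits only). */
#define MSI_ADDR_BASE        0xFEE00000u
#define MSI_ADDR_DEST_SHIFT  12
#define MSI_ADDR_DEST_MASK   0xFFu

/* Encode an APIC ID into a compatibility-format MSI address.
 * IDs wider than 8 bits are silently truncated by the field width. */
static uint32_t msi_compat_addr(uint32_t apic_id)
{
    return MSI_ADDR_BASE |
           ((apic_id & MSI_ADDR_DEST_MASK) << MSI_ADDR_DEST_SHIFT);
}

/* Decode the destination APIC ID from a compatibility-format address. */
static uint32_t msi_compat_dest(uint32_t addr)
{
    return (addr >> MSI_ADDR_DEST_SHIFT) & MSI_ADDR_DEST_MASK;
}

/* True if the APIC ID survives a round trip through the 8-bit field. */
static bool msi_compat_can_target(uint32_t apic_id)
{
    return msi_compat_dest(msi_compat_addr(apic_id)) == apic_id;
}
```

An ID such as 300 does not fit and would alias to a different CPU, which
is why a remapping table with a wider destination field is needed.
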
> 
> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the 2nd level translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the 1st level translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> ================================================================================
> 
> * vIOMMU will be inside the Xen hypervisor for the following reasons:
> 	1) Avoid round trips between Qemu and Xen hypervisor
> 	2) Ease of integration with the rest of the hypervisor
> 	3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu acts as a wrapper of the new hypercall to
> create/destroy the vIOMMU in the hypervisor and to deal with virtual PCI
> devices' 2nd level translation.
> 
> 2.1 2nd level translation overview
> For virtual PCI devices, the dummy xen-vIOMMU in Qemu does the
> translation via the new hypercall.
> 
> For physical PCI devices, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads the shadow page table
> into the physical IOMMU.
> 
> The following diagram shows the 2nd level translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview.
> Interrupts from virtual devices and physical devices will be delivered
> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupts during this
> procedure.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |Hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                               +-------------------+
>                               |   PCI Device      |
>                               +-------------------+
> 
> 
> 
> 
> 
> 3 Xen hypervisor
> ==========================================================================
> 
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> 
> struct xen_sysctl_viommu_op {
> 	u32 cmd;
> 	u32 domid;
> 	union {
> 		struct {
> 			u32 capabilities;
> 		} query_capabilities;
> 		struct {
> 			u32 capabilities;
> 			u64 base_address;
> 		} create_iommu;
> 		struct {
> 			u8  bus;
> 			u8  devfn;
> 			u64 iova;
> 			u64 translated_addr;
> 			u64 addr_mask; /* Translation page size */
> 			IOMMUAccessFlags permission;
> 		} second_level_translation;
> 	} u;
> };
> 
> typedef enum {
> 	IOMMU_NONE = 0,
> 	IOMMU_RO   = 1,
> 	IOMMU_WO   = 2,
> 	IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> 
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability		0
> #define XEN_SYSCTL_viommu_create			1
> #define XEN_SYSCTL_viommu_destroy			2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3
> 
> Definition of VIOMMU capabilities:
> #define XEN_VIOMMU_CAPABILITY_1st_level_translation	(1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation	(1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)
> 
> 
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>      Get vIOMMU capabilities (1st/2nd level translation and interrupt
> remapping).
> 
> - XEN_SYSCTL_viommu_create
>      Create a vIOMMU in the Xen hypervisor with dom_id, capabilities and
> register base address as parameters.
> 
> - XEN_SYSCTL_viommu_destroy
>      Destroy the vIOMMU in the Xen hypervisor with dom_id as parameter.
> 
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>      Translate an IOVA to a GPA for the specified virtual PCI device. The
> caller passes dom_id, the PCI device's BDF and the IOVA; the Xen
> hypervisor returns the translated GPA, address mask and access permission.
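
As an illustration of how a caller might consume the result of the
translation subop, here is a minimal sketch. It assumes addr_mask covers
the low (in-page) address bits, e.g. 0xfff for a 4K page, in the style of
Qemu's IOMMU TLB entries; the design doc only says "translation page
size", so that interpretation is an assumption. The struct and helper
names are hypothetical and no hypercall is actually issued.

```c
#include <stdint.h>

/* Mirror of the access flags listed in the proposal. */
typedef enum {
    IOMMU_NONE = 0,
    IOMMU_RO   = 1,
    IOMMU_WO   = 2,
    IOMMU_RW   = 3,
} IOMMUAccessFlags;

/* Hypothetical stand-in for the translation part of the hypercall
 * parameter: bus/devfn/iova are inputs, the rest are outputs. */
struct viommu_translate {
    uint8_t          bus;
    uint8_t          devfn;
    uint64_t         iova;
    uint64_t         translated_addr;
    uint64_t         addr_mask;   /* assumed: mask of in-page bits */
    IOMMUAccessFlags permission;
};

/* Combine the translated page base with the offset inside the page. */
static uint64_t viommu_result_to_gpa(const struct viommu_translate *t)
{
    return (t->translated_addr & ~t->addr_mask) | (t->iova & t->addr_mask);
}

/* Check whether the returned permission allows the requested access:
 * writes need the IOMMU_WO bit, reads need the IOMMU_RO bit. */
static int viommu_access_allowed(const struct viommu_translate *t,
                                 int is_write)
{
    IOMMUAccessFlags need = is_write ? IOMMU_WO : IOMMU_RO;
    return (t->permission & need) != 0;
}
```

With iova 0x12345678, translated_addr 0xabcde000 and addr_mask 0xfff, the
helper yields GPA 0xabcde678.
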
> 
> 
> 3.2 2nd level translation
> 1) For virtual PCI device
> The dummy xen-vIOMMU in Qemu translates IOVA to target GPA via the new
> hypercall when a DMA operation happens.
> 
> 2) For physical PCI device
> DMA operations go through the physical IOMMU directly, so the IO page
> table for IOVA->HPA must be loaded into the physical IOMMU. When the
> guest updates the Second-level Page-table pointer field, it provides an
> IO page table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level
> translation table, translate GPA->HPA and write the shadow page table
> (IOVA->HPA) pointer into the Second-level Page-table pointer field of
> the physical IOMMU's context entry.
> 
> Currently all PCI devices in the same hvm domain share one IO page table
> (GPA->HPA) in the physical IOMMU driver of Xen. To support 2nd level
> translation in the vIOMMU, the IOMMU driver needs to support multiple
> address spaces per device entry: use the existing IO page table
> (GPA->HPA) by default and switch to the shadow IO page table (IOVA->HPA)
> when the 2nd level translation function is enabled. These changes will
> not affect the current P2M logic.
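
The shadowing step for physical devices amounts to composing two
mappings. A minimal sketch follows, assuming flat single-level tables
indexed by page frame number; real VT-d tables are multi-level, and the
names here are hypothetical rather than existing Xen functions.

```c
#include <stdint.h>
#include <stddef.h>

#define INVALID_PFN UINT64_MAX

/*
 * Build a shadow IOVA->HPA table from:
 *   guest_io_pt: guest-provided IOVA->GPA table (entry = guest frame)
 *   p2m:         hypervisor GPA->HPA table      (entry = host frame)
 * Unmapped or out-of-range guest frames become INVALID_PFN in the
 * shadow, so the physical IOMMU would fault on them.
 */
static void viommu_shadow_build(const uint64_t *guest_io_pt, size_t n_iova,
                                const uint64_t *p2m, size_t n_gfn,
                                uint64_t *shadow)
{
    for (size_t i = 0; i < n_iova; i++) {
        uint64_t gfn = guest_io_pt[i];

        shadow[i] = (gfn != INVALID_PFN && gfn < n_gfn) ? p2m[gfn]
                                                        : INVALID_PFN;
    }
}
```

A real implementation would also have to track guest updates to the
IOVA->GPA table and rebuild or patch the shadow accordingly; that part is
not shown.
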
> 
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to vlapic from vIOAPIC and vMSI. Interrupt remapping hooks need to be
> added to vmsi_deliver() and ioapic_deliver() to find the target vlapic
> according to the interrupt remapping table. The diagram in section 2.2
> shows the logic.
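
The lookup such a hook would perform can be sketched as below. The entry
layout is a simplified, hypothetical one (not the real VT-d IRTE format),
and viommu_remap_irq() is an invented name, not an existing Xen function.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified interrupt remapping table entry. */
struct virte {
    uint8_t  present;    /* entry is valid */
    uint8_t  vector;     /* vector to inject */
    uint32_t dest_vcpu;  /* target vcpu, may exceed 255 */
};

struct irq_target {
    uint32_t vcpu;
    uint8_t  vector;
};

/*
 * Look up the remapping entry selected by 'handle'.
 * Returns 0 and fills *out on success, -1 if the handle is out of
 * range or the entry is not present (the interrupt is then blocked).
 */
static int viommu_remap_irq(const struct virte *table, size_t nr_entries,
                            uint16_t handle, struct irq_target *out)
{
    if (handle >= nr_entries || !table[handle].present)
        return -1;

    out->vcpu   = table[handle].dest_vcpu;
    out->vector = table[handle].vector;
    return 0;
}
```

The delivery path would call this before picking the target vlapic and
drop the interrupt when it returns -1.
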
> 
> 
> 3.4 1st level translation
> When nested translation is enabled, any address generated by first-level
> translation is used as the input address for nesting with second-level
> translation. Physical IOMMU needs to enable both 1st level and 2nd level
> translation in nested translation mode(GVA->GPA->HPA) for passthrough
> device.
> 
> VT-d context entry points to guest 1st level translation table which
> will be nest-translated by 2nd level translation table and so it
> can be directly linked to context entry of physical IOMMU.
> 
> To enable 1st level translation in VM
> 1) Xen IOMMU driver enables nested translation mode
> 2) Update GPA root of guest 1st level translation table to context entry
> of physical IOMMU.
> 
> All of this is handled in the hypervisor; no interaction with Qemu is
> needed.
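
Conceptually, nested mode chains the two lookups at translation time
instead of precomputing a shadow table. A minimal sketch, again assuming
flat frame-number tables and hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>

#define XLATE_FAULT UINT64_MAX

/* Single-stage lookup in a flat table; out-of-range frames fault. */
static uint64_t nested_lookup(const uint64_t *tbl, size_t n, uint64_t pfn)
{
    return (pfn < n) ? tbl[pfn] : XLATE_FAULT;
}

/*
 * Nested translation GVA->GPA->HPA:
 *   l1: guest-owned first-level table  (GVA frame -> GPA frame)
 *   l2: hypervisor second-level table  (GPA frame -> HPA frame)
 */
static uint64_t nested_translate(const uint64_t *l1, size_t n1,
                                 const uint64_t *l2, size_t n2,
                                 uint64_t gva_pfn)
{
    uint64_t gfn = nested_lookup(l1, n1, gva_pfn);

    if (gfn == XLATE_FAULT)
        return XLATE_FAULT;
    return nested_lookup(l2, n2, gfn);
}
```

In hardware the first-level table pointers themselves are also
translated through the second level, which this flat sketch does not
model.
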
> 
> 
> 3.5 Implementation consideration
> Linux Intel IOMMU driver will fail to be loaded without 2nd level
> translation support even if interrupt remapping and 1st level
> translation are available. This means 2nd level translation needs to be
> enabled first, before the other functions.
> 
> 
> 4 Qemu
> ==============================================================================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> and AMD IOMMU) and report it in the guest ACPI table. So on the Xen side,
> a dummy xen-vIOMMU wrapper is required to connect with the actual vIOMMU
> in Xen, especially for 2nd level translation of virtual PCI devices,
> because virtual PCI devices are emulated in Qemu. Qemu's vIOMMU framework
> provides a callback to deal with 2nd level translation when DMA
> operations of virtual PCI devices happen.
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capability (e.g. DMA translation, interrupt remapping and
> Shared Virtual Memory) via hypercall.
> 
> 2) Create vIOMMU in Xen hypervisor via new hypercall with DRHU register
> address and desired capability as parameters. Destroy vIOMMU when VM is
> closed.
> 
> 3) Virtual PCI device's 2nd level translation
> Qemu already provides a DMA translation hook. It is called when DMA
> translation of a virtual PCI device happens. The dummy xen-vIOMMU passes
> the device BDF and IOVA into the Xen hypervisor via the new iommu
> hypercall and returns the translated GPA.
> 
> 
> 4.3 Q35 vs. i440x
> VT-d was introduced with the Q35 chipset. The previous concern was that
> IOMMU drivers assume VT-d only exists on Q35 and newer chipsets, so we
> would have to enable Q35 first.
> 
> We consulted Linux/Windows IOMMU driver experts and learned that these
> drivers don't have such an assumption. So we may skip the Q35
> implementation and emulate the vIOMMU on the i440x chipset. KVM already
> has vIOMMU support with virtual PCI device DMA translation and interrupt
> remapping. We are using KVM to experiment with adding a vIOMMU on i440x
> and to test Linux/Windows guests. We will report back when we have some
> results.
> 
> 
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and the
> OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to know
> whether the vIOMMU is enabled and what its capabilities are in order to
> prepare the ACPI DMAR table for the guest OS.
> 
> There are three ways to do that.
> 1) Extend struct hvm_info_table and add variables to it to pass vIOMMU
> information to hvmloader. But this requires adding a new xc interface to
> use struct hvm_info_table in Qemu.
> 
> 2) Pass vIOMMU information to hvmloader via Xenstore
> 
> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design(4.3.1
> Building Guest ACPI Tables
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> The third option seems cleaner: hvmloader doesn't need to deal with
> vIOMMU specifics and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling is done in the dummy xen-vIOMMU driver.
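
For reference, a minimal DMAR table with a single DRHD entry could be
assembled roughly as follows. This is a sketch under assumptions: it
fills only the fields needed for the example, assumes a little-endian
host, leaves the OEM/creator fields zeroed, and the function names are
invented. It is not how Qemu or hvmloader actually builds the table.

```c
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
struct acpi_header {            /* standard 36-byte ACPI table header */
    char     signature[4];
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    char     creator_id[4];
    uint32_t creator_revision;
};

struct dmar_drhd {              /* DMA remapping hardware unit definition */
    uint16_t type;              /* 0 = DRHD */
    uint16_t length;
    uint8_t  flags;             /* bit 0: INCLUDE_PCI_ALL */
    uint8_t  reserved;
    uint16_t segment;
    uint64_t base_address;      /* vIOMMU register base */
};

struct dmar_table {
    struct acpi_header hdr;
    uint8_t  host_address_width; /* address width in bits, minus 1 */
    uint8_t  flags;              /* bit 0: INTR_REMAP */
    uint8_t  reserved[10];
    struct dmar_drhd drhd;
};
#pragma pack(pop)

#define DMAR_FLAG_INTR_REMAP       (1u << 0)
#define DRHD_FLAG_INCLUDE_PCI_ALL  (1u << 0)

/* Value that makes the byte sum of the whole table equal 0 mod 256
 * (computed while the checksum field itself is still zero). */
static uint8_t acpi_checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint8_t sum = 0;

    while (len--)
        sum += *p++;
    return (uint8_t)(0 - sum);
}

static void build_dmar(struct dmar_table *t, uint64_t drhd_base,
                       uint8_t haw_bits, int intr_remap)
{
    memset(t, 0, sizeof(*t));
    memcpy(t->hdr.signature, "DMAR", 4);
    t->hdr.length         = sizeof(*t);
    t->hdr.revision       = 1;
    t->host_address_width = haw_bits - 1;
    t->flags              = intr_remap ? DMAR_FLAG_INTR_REMAP : 0;
    t->drhd.type          = 0;
    t->drhd.length        = sizeof(t->drhd);
    t->drhd.flags         = DRHD_FLAG_INCLUDE_PCI_ALL;
    t->drhd.base_address  = drhd_base;
    t->hdr.checksum       = acpi_checksum(t, sizeof(*t));
}
```

With option 3 the resulting blob would simply be handed to hvmloader,
which copies it into guest memory without interpreting it.
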
> 
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel

