From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Auger
Subject: [RFC v3 21/21] vfio: Document nested stage control
Date: Tue, 8 Jan 2019 11:26:33 +0100
Message-ID: <20190108102633.17482-22-eric.auger@redhat.com>
References: <20190108102633.17482-1-eric.auger@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: marc.zyngier-5wv7dgnIgG8@public.gmane.org,
 peter.maydell-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
 kevin.tian-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
 ashok.raj-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org,
 christoffer.dall-5wv7dgnIgG8@public.gmane.org
To: eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
 eric.auger-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
 iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
 linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 kvmarm-FPEHb7Xf0XXUo1n7N8X6UoWGPAHP3yOg@public.gmane.org,
 joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org,
 alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
 jacob.jun.pan-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
 yi.l.liu-VuQAYsv1563Yd54FQh9/CA@public.gmane.org,
 jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org,
 will.deacon-5wv7dgnIgG8@public.gmane.org,
 robin.murphy-5wv7dgnIgG8@public.gmane.org
In-Reply-To: <20190108102633.17482-1-eric.auger-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
List-Id: kvm.vger.kernel.org

New ioctls were introduced to pass information about guest stage 1 to the
host through VFIO. Let's document the nested stage control.

Signed-off-by: Eric Auger
---

v2 -> v3:
- document the new fault API

v1 -> v2:
- use the new ioctl names
- add doc related to fault handling
---
 Documentation/vfio.txt | 62 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index f1a4d3c3ba0b..620e38ed0c4a 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -239,6 +239,68 @@ group and can access them as follows::
 	/* Gratuitous device reset and go... */
 	ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMU Dual Stage Control
+------------------------
+
+Some IOMMUs support 2 stages/levels of translation. "Stage" corresponds to
+the ARM terminology while "level" corresponds to Intel's VT-d terminology. In
+the following text we use either without distinction.
+
+This is useful when a virtual IOMMU is exposed to the guest and some
+devices are assigned to the guest through VFIO. Then the guest OS can use
+stage 1 (IOVA -> GPA), while the hypervisor uses stage 2 for VM isolation
+(GPA -> HPA).
+
+The guest gets ownership of the stage 1 page tables and also owns stage 1
+configuration structures. The hypervisor owns the root configuration structure
+(for security reasons), including stage 2 configuration. This works as long as
+the configuration structures and page table format are compatible between the
+virtual IOMMU and the physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use stage 2, leaving stage 1 available for
+guest usage.
+
+Once groups are attached to the container, the guest stage 1 translation
+configuration data can be passed to VFIO by using
+
+ioctl(container, VFIO_IOMMU_BIND_PASID_TABLE, &pasid_table_info);
+
+This allows the guest stage 1 configuration structures to be combined with
+the hypervisor stage 2 configuration structures. The stage 1 configuration
+structures depend on the IOMMU type.
+
+As the stage 1 translation is fully delegated to the HW, physical events that
+may occur (especially translation faults) need to be propagated up to
+the virtualizer and re-injected into the guest.
+
+By using VFIO_DEVICE_SET_IRQS along with the VFIO_PCI_DMA_FAULT_IRQ_INDEX
+index, the virtualizer can register an eventfd signalled whenever a
+fault is observed at the physical level. The actual faults can be retrieved
+from the device fault region whose type/subtype is:
+VFIO_REGION_TYPE_NESTED/VFIO_REGION_SUBTYPE_NESTED_FAULT_REGION.
+
+This region can be mmapped. When a fault is consumed, the user must increment
+the consumer index.
+
+When the guest invalidates stage 1 related caches, invalidations must be
+forwarded to the host through
+ioctl(container, VFIO_IOMMU_CACHE_INVALIDATE, &inv_data);
+Those invalidations can happen at various granularities (page, context, ...).
+
+The ARM SMMU specification introduces another challenge: MSIs are translated by
+both the virtual SMMU and the physical SMMU. To build a nested mapping for the
+IOVA programmed into the assigned device, the guest needs to pass its IOVA/MSI
+doorbell GPA binding to the host. Then the hypervisor can build a nested stage 2
+binding eventually translating into the physical MSI doorbell.
+
+This is achieved by
+ioctl(container, VFIO_IOMMU_BIND_MSI, &guest_binding);
+
 VFIO User API
 -------------------------------------------------------------------------------
 
-- 
2.17.2
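
The fault handling text says that the fault region can be mmapped and that
the user must increment the consumer index once a fault is consumed. A
minimal sketch of that producer/consumer idea follows. Everything in it is
hypothetical and chosen for illustration only: struct fault_ring, struct
fault_record, FAULT_RING_ENTRIES, fault_ring_drain() and record_last_addr()
are not the layout or API defined by this series, and the memory ordering a
real consumer of an mmapped region would need is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL layout, for illustration only; not the ABI of this series. */
#define FAULT_RING_ENTRIES 8u	/* assumed to be a power of two */

struct fault_record {		/* hypothetical fault record */
	uint64_t addr;		/* faulting IOVA */
	uint32_t reason;	/* fault reason code */
	uint32_t pad;
};

struct fault_ring {		/* hypothetical fault region layout */
	uint32_t prod;		/* producer index, advanced by the host */
	uint32_t cons;		/* consumer index, advanced by userspace */
	struct fault_record rec[FAULT_RING_ENTRIES];
};

/*
 * Drain all pending records: call handle() on each one, then increment
 * the consumer index so the slot can be reused by the producer.
 * Returns the number of records consumed.
 *
 * The indices are free-running; because FAULT_RING_ENTRIES is a power of
 * two, the modulo stays correct across 32-bit wrap-around.
 * NOTE: barriers/atomics needed for a real shared mapping are omitted.
 */
static size_t fault_ring_drain(struct fault_ring *ring,
			       void (*handle)(const struct fault_record *,
					      void *),
			       void *opaque)
{
	size_t n = 0;

	assert(ring != NULL && handle != NULL);

	while (ring->cons != ring->prod) {
		const struct fault_record *r =
			&ring->rec[ring->cons % FAULT_RING_ENTRIES];

		handle(r, opaque);	/* e.g. re-inject into the guest */
		ring->cons++;		/* fault consumed */
		n++;
	}
	return n;
}

/* Example handler: remember the address of the last fault seen. */
static void record_last_addr(const struct fault_record *r, void *opaque)
{
	*(uint64_t *)opaque = r->addr;
}
```

In a virtualizer, fault_ring_drain() would typically be called when the
eventfd registered through VFIO_DEVICE_SET_IRQS fires, with a handler that
re-injects each fault into the guest instead of record_last_addr().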