xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device
@ 2015-09-16 13:23 Quan Xu
  2015-09-16 10:46 ` Ian Jackson
                   ` (16 more replies)
  0 siblings, 17 replies; 84+ messages in thread
From: Quan Xu @ 2015-09-16 13:23 UTC (permalink / raw)
  To: andrew.cooper3, eddie.dong, ian.campbell, ian.jackson, jbeulich,
	jun.nakajima, keir, kevin.tian, tim, yang.z.zhang, george.dunlap
  Cc: Quan Xu, xen-devel

Introduction
============

   VT-d code currently has a number of cases where completion of certain operations
is being waited for by way of spinning. The majority of instances use that variable
indirectly through IOMMU_WAIT_OP() macro , allowing for loops of up to 1 second
(DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the
invalidation case seems particularly problematic.

Currently hypervisor polls the status address of wait descriptor up to 1 second to
get Invalidation flush result. When Invalidation queue includes Device-TLB invalidation,
using 1 second is a mistake here in the validation sync. As the 1 second timeout here is
related to response times by the IOMMU engine, Instead of Device-TLB invalidation with
PCI-e Address Translation Services (ATS) in use. the ATS specification mandates a timeout
of 1 _minute_ for cache flush. The ATS case needs to be taken into consideration when
doing invalidations. Obviously we can't spin for a minute, so invalidation absolutely
needs to be converted to a non-spinning model.

   Also i should fix the new memory security issue.
The page freed from the domain should be on held, until the Device-TLB flush is completed (ATS timeout of 1 _minute_).
The page previously associated  with the freed portion of GPA should not be reallocated for
another purpose until the appropriate invalidations have been performed. Otherwise, the original
page owner can still access freed page though DMA.

Why RFC
=======
    Patch 0001--0005, 0013 are IOMMU related.
    Patch 0006 is about new flag (vCPU / MMU related).
    Patch 0007 is vCPU related.
    Patch 0008--0012 are MMU related.

    1. Xen MMU is very complicated. Could Xen MMU experts help me verify whether I
       have covered all of the case?

    2. For gnttab_transfer, If the Device-TLB flush is still not completed when to
       map the transferring page to a remote domain, schedule and wait on a waitqueue
       until the Device-TLB flush is completed. Is it correct?

       (I have tested waitqueue in decrease_reservation() [do_memory_op()  hypercall])
        I wake up domain(with only one vCPU) with debug-key tool, and the domain(with only one vCPU)
        is still working after waiting 60s in a waitqueue. )


Design Overview
===============

This design implements a non-spinning model for Device-TLB invalidation -- using an interrupt
based mechanism. Track the Device-TLB invalidation status in an invalidation table per-domain. The
invalidation table keeps the count of in-flight Device-TLB invalidation requests, and also
provides a global polling parameter per domain for in-flight Device-TLB invalidation requests.
Update invalidation table's count of in-flight Device-TLB invalidation requests and assign the
address of global polling parameter per domain in the Status Address of each invalidation wait
descriptor, when to submit invalidation requests.

For example:
  .

|invl |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
|wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
|dsc  |
  .
  .

|invl |
|wait | Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
|dsc  | Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
  .
  .

|invl |
|wait |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
|dsc  |  Status Address =  virt_to_maddr(&_a_global_polling_parameter_per_domain_)
  .
  .

More information about VT-d Invalidation Wait Descriptor, please refer to
  http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
  6.5.2.8 Invalidation Wait Descriptor.
Status Address and Data: Status address and data is used by hardware to perform wait descriptor
                         completion status write when the Status Write(SW) field is Set. Hardware Behavior
                         is undefined if the Status address range of 0xFEEX_XXXX etc.). The Status Address
                         and Data fields are ignored by hardware when the Status Write field is Clear.

The invalidation completion event interrupt is generated only after the invalidation wait descriptor
completes. In invalidation interrupt handler, it will schedule a soft-irq to do the following check:

  if invalidation table's count of in-flight Device-TLB invalidation requests == polling parameter:
    This domain has no in-flight Device-TLB invalidation requests.
  else
    This domain has in-flight Device-TLB invalidation requests.

Track domain Status:
   The vCPU is NOT allowed to entry guest mode and put into SCHEDOP_yield list if it has in-flight
Device-TLB invalidation requests.

Memory security issue:
    In case with PCI-e Address Translation Services(ATS) in use, ATS spec mandates a timeout of 1 minute
for cache flush.
    The page freed from the domain should be on held, until the Device-TLB flush is completed. The page
previously associated  with the freed portion of GPA should not be reallocated for another purpose until
the appropriate invalidations have been performed. Otherwise, the original page owner can still access
freed page though DMA.

   *Held on The page until the Device-TLB flush is completed.
      - Unlink the page from the original owner.
      - Remove the page from the page_list of domain.
      - Decrease the total pages count of domain.
      - Add the page to qi_hold_page_list.

    *Put the page in Queued Invalidation(QI) interrupt handler if the Device-TLB is completed.

Invalidation Fault:
A fault event will be generated if an invalidation failed. We can disable the devices.

For Context Invalidation and IOTLB invalidation without Device-TLB invalidation, Queued Invalidation(QI) submits
invalidation requests as before(This is a tradeoff and the cost of interrupt is overhead. It will be modified
in coming series of patch).

More details
============

1. invalidation table. We define qi_table structure per domain.
+struct qi_talbe {
+    u64 qi_table_poll_slot;
+    u32 qi_table_status_data;
+};

@ struct hvm_iommu {
+    /* IOMMU Queued Invalidation(QI) */
+    struct qi_talbe talbe;
}

2. Modification to Device IOTLB invalidation:
    - Enabled interrupt notification when hardware completes the invalidations:
      Set FN, IF and SW bits in Invalidation Wait Descriptor. The reason why also set SW bit is that
      the interrupt for notification is global not per domain.
      So we still need to poll the status address to know which Device-TLB invalidation request is
      completed in QI interrupt handler.
    - A new per-domain flag (*qi_flag) is used to track the status of Device-TLB invalidation request.
      The *qi_flag will be set before sbumitting the Device-TLB invalidation requests. The vCPU is NOT
      allowed to entry guest mode and put into SCHEDOP_yield list, if the *qi_flag is Set.
    - new logic to do synchronize.
        if no Device-TLB invalidation:
            Back to current invalidation logic.
        else
            Set IF, SW, FN bit in wait descriptor and prepare the Status Data.
            Set *qi_flag.
            Put the domain in pending flush list (The vCPU is NOT allowed to entry guest mode and put into SCHEDOP_yield if the *qi_flag is Set.)
        Return

More information about VT-d Invalidation Wait Descriptor, please refer to
  http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
  6.5.2.8 Invalidation Wait Descriptor.
   SW: Indicate the invalidation wait descriptor completion by performing a coherent DWORD write of the value in the Status Data field
       to the address specified in the Status Address.
   FN: Indicate the descriptors following the invalidation wait descriptor must be processed by hardware only after the invalidation
       Wait descriptor completes.
   IF: Indicate the invalidation wait descriptor completion by generating an invalidation completion event per the programing of the
       Invalidation Completion Event Registers.

3. Modification to domain running lifecycle:
    - When the *qi_flag is set, the domain is not allowed to enter guest mode and put into SCHEDOP_yield list
      if there are in-flight Device-TLB invalidation requests.

4. New interrupt handler for invalidation completion:
    - when hardware completes the Device-TLB invalidation requests, it generates an interrupt to notify hypervisor.
    - In interrupt handler, schedule a tasklet to handle it.
    - tasklet to handle below:
        *Clear IWC field in the Invalidation Completion Status register. If the IWC field in the Invalidation
         Completion Status register was already Set at the time of setting this field, it is not treated as a new
         interrupt condition.
        *Scan the domain list. (the domain is with vt-d passthrough devices. scan 'iommu->domid_bitmap')
                for each domain:
                check the values invalidation table (qi_table_poll_slot and qi_table_status_data) of each domain.
                if equal:
                   Put the on hold pages.
                   Clear the invalidation table.
                   Clear *qi_flag.

        *If IP field of Invalidation Event Control Register is Set, try to *Clear IWC and *Scan the domain list again, instead of
         generating another interrupt.
        *Clear IM field of Invalidation Event Control Register.

((
  Logic of IWC / IP / IM as below:

                          Interrupt condition (An invalidation wait descriptor with Interrupt Flag(IF) field Set completed.)
                                  ||
                                   v
           ----------------------(IWC) ----------------------
     (IWC is Set)                                (IWC is not Set)
          ||                                            ||
          V                                             ||
(Not treated as an new interrupt condition)             ||
                                                         V
                                                   (Set IWC / IP)
                                                        ||
                                                         V
                                  ---------------------(IM)---------------------
                              (IM is Set)                               (IM not Set)
                                  ||                                        ||
                                  ||                                        V
                                  ||                    (cause Interrupt message / then hardware clear IP)
                                   V
   (interrupt is held pending, clearing IM to cause interrupt message)

* If IWC field is being clear, the IP field is cleared.
))

5. invalidation failed.
    - A fault event will be generated if invalidation failed. we can disable the devices if receive an
      invalidation fault event.

6. Memory security issue:

    The page freed from the domain should be on held, until the Device-TLB flush is completed. The page
previously associated  with the freed portion of GPA should not be reallocated for another purpose until
the appropriate invalidations have been performed. Otherwise, the original page owner can still access
freed page though DMA.

   *Held on The page unitl the Device-TLB flush is completed.
      - Unlink the page from the original owner.
      - Remove the page from the page_list of domain.
      - Decrease the total pages count of domain.
      - Add the page to qi_hold_page_list.

  *Put the page in Queued Invalidation(QI) interrupt handler if the Device-TLB is completed.


----
There are 3 reasons to submit device-TLB invalidation requests:
    *VT-d initialization.
    *Reassign device ownership.
    *Memory modification.

6.1 *VT-d initialization
    When VT-d is initializing, there is no guest domain running. So no memory security issue.
iotlb(iotlb/device-tlb)
|-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
                                              |--init_vtd_hw()
6.2 *Reassign device ownership
    Reassign device ownership is invoked by 2 hypercalls: do_physdev_op() and arch_do_domctl().
As the *qi_flag is Set, the domain is not allowed to enter guest mode. If the appropriate invalidations maybe have
not been performed, the *qi_flag is still Set, and these devices are not ready for guest domains to launch
DMA with these devices. So if the *qi_flag is introduced, there is no memory security issue.

iotlb(iotlb/device-tlb)
|-iommu_flush_iotlb_dsi()
                       |--domain_context_mapping_one() ...
                       |--domain_context_unmap_one() ...

|-iommu_flush_iotlb_psi()
                       |--domain_context_mapping_one() ...
                       |--domain_context_unmap_one() ...

6.3 *Memory modification.
While memory is modified, There are a lot of invoker flow for updating EPT, but not all of them will update IOMMU page tables. All
of the following three conditions are met.
  * P2M is hostp2m. ( p2m_is_hostp2m(p2m) )
  * Previous mfn is not equal to new mfn. (prev_mfn != new_mfn)
  * This domain needs IOMMU. (need_iommu(d))

##
|--iommu_pte_flush()--ept_set_entry()

#PoD(populate on demand) is not supported while IOMMU passthrough is enabled. So ignore PoD invoker flow below.
      |--p2m_pod_zero_check_superpage()  ...
      |--p2m_pod_zero_check()  ...
      |--p2m_pod_demand_populate()  ...
      |--p2m_pod_decrease_reservation()  ...
      |--guest_physmap_mark_populate_on_demand() ...

#Xen paging is not supported while IOMMU passthrough is enabled. So ignore Xen paging invoker flow below.
      |--p2m_mem_paging_evict() ...
      |--p2m_mem_paging_resume()...
      |--p2m_mem_paging_prep()...
      |--p2m_mem_paging_populate()...
      |--p2m_mem_paging_nominate()...
      |--p2m_alloc_table()--shadow_enable() --paging_enable()--shadow_domctl() --paging_domctl()--arch_do_domctl() --do_domctl()
                                                                                                                  |--paging_domctl_continuation()

#Xen sharing is not supported while IOMMU passthrough is enabled. So ignore Xen paging invoker flow below.
      |--set_shared_p2m_entry()...


#Domain is paused, the domain can't launch DMA.
      |--relinquish_shared_pages()--domain_relinquish_resources( case RELMEM_shared: ) --domain_kill()--do_domctl()

#The below p2m is not hostp2m. It is L2 to L0. So ignore invoker flow below.
      |--nestedhap_fix_p2m() --nestedhvm_hap_nested_page_fault() --hvm_hap_nested_page_fault() --ept_handle_violation()--vmx_vmexit_handler()

#If prev_mfn == new_mfn, it will not update IOMMU page tables. So ignore invoker flow below.
      |--p2m_mem_access_check()-- hvm_hap_nested_page_fault() --ept_handle_violation()--vmx_vmexit_handler()(L1 --> L0 / but just only check p2m_type_t)
      |--p2m_set_mem_access() ...
      |--guest_physmap_mark_populate_on_demand() ...
      |--p2m_change_type_one() ...
# The previous page is not put and allocated for Xen or other guest domains. So there is no memory security issue. Ignore invoker flow below.
   |--p2m_remove_page()--guest_physmap_remove_page() ...

   |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
                           |--map_mmio_regions()--do_domctl()


# Held on the pages which are removed in guest_remove_page(), and put in QI interrupt handler when it has no in-flight Device-TLB invalidation requests.

|--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
                                               |--xenmem_add_to_physmap_one() --xenmem_add_to_physmap() /xenmem_add_to_physmap_batch()  .. --do_memory_op()
                                               |--p2m_add_foreign() -- xenmem_add_to_physmap_one() ..--do_memory_op()
                                                                   |--guest_physmap_add_entry()--create_grant_p2m_mapping()  ... --do_grant_table_op()

((
   Much more explanation:
   Actually, the previous pages are maybe mapped from Xen heap for guest domains in decrease_reservation() / xenmem_add_to_physmap_one()
   / p2m_add_foreign(), but they are not mapped to IOMMU table. The below 4 functions will map xen heap page for guest domains:
          * share page for xen Oprofile.
          * vLAPIC mapping.
          * grant table shared page.
          * domain share_info page.
))

# For grant_unmap*. ignore it at this point, as we can held on the page when domain free xenbllooned page.

    |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref() --gnttab_unmap_grant_ref()--do_grant_table_op()
                                               |--__gnttab_unmap_and_replace() -- gnttab_unmap_and_replace() --do_grant_table_op()

# For grant_map*, ignore it as there is no pfn<--->mfn in Device-TLB.

# For grant_transfer:
  |--p2m_remove_page()--guest_physmap_remove_page()
                                                 |--gnttab_transfer() ...  --do_grant_table_op()

    If the Device-TLB flush is still not completed when to map the transferring page to a remote domain,
    schedule and wait on a waitqueue until the Device-TLB flush is completed.

   Plan B:
   ((If the Device-TLB flush is still not completed before adding the transferring page to the target domain,
   allocate a new page for target domain and held on the old transferring page which will be put in QI interrupt
   handler when there are no in-flight Device-TLB invalidation requests.))


Quan Xu (13):
  vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
  vt-d: Register MSI for async invalidation completion interrupt.
  vt-d: Track the Device-TLB invalidation status in an invalidation table.
  vt-d: Clear invalidation table in invaidation interrupt handler
  vt-d: Clear the IWC field of Invalidation Event Control Register in
  vt-d: Introduce a new per-domain flag - qi_flag.
  vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  vt-d: Held on the freed page until the Device-TLB flush is completed.
  vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
  vt-d: Held on the removed page until the Device-TLB flush is completed.
  vt-d: If the Device-TLB flush is still not completed when
  vt-d: For gnttab_transfer, If the Device-TLB flush is still
  vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB

 xen/arch/x86/hvm/vmx/entry.S         |  10 ++
 xen/arch/x86/x86_64/asm-offsets.c    |   1 +
 xen/common/domain.c                  |  15 ++
 xen/common/grant_table.c             |  16 ++
 xen/common/memory.c                  |  16 +-
 xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h  |  18 +++
 xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
 xen/include/xen/hvm/iommu.h          |  42 +++++
 9 files changed, 443 insertions(+), 16 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2015-10-15  9:50 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-16 13:23 [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Quan Xu
2015-09-16 10:46 ` Ian Jackson
2015-09-16 11:22   ` Julien Grall
2015-09-16 13:47     ` Ian Jackson
2015-09-17  9:06       ` Julien Grall
2015-09-17 10:16         ` Ian Jackson
2015-09-16 13:33   ` Xu, Quan
2015-09-16 13:23 ` [Patch RFC 01/13] vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt Quan Xu
2015-09-29  8:43   ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 02/13] vt-d: Register MSI for async invalidation completion interrupt Quan Xu
2015-09-29  8:57   ` Jan Beulich
2015-10-10  8:22     ` Xu, Quan
2015-10-12  7:11       ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 03/13] vt-d: Track the Device-TLB invalidation status in an invalidation table Quan Xu
2015-09-16  9:33   ` Julien Grall
2015-09-16 13:43     ` Xu, Quan
2015-09-29  9:24   ` Jan Beulich
2015-10-10 12:27     ` Xu, Quan
2015-10-12  7:15       ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 04/13] vt-d: Clear invalidation table in invaidation interrupt handler Quan Xu
2015-09-29  9:33   ` Jan Beulich
2015-09-16 13:23 ` [Patch RFC 05/13] vt-d: Clear the IWC field of Invalidation Event Control Register in Quan Xu
2015-09-29  9:44   ` Jan Beulich
2015-09-16 13:24 ` [Patch RFC 06/13] vt-d: Introduce a new per-domain flag - qi_flag Quan Xu
2015-09-16  9:34   ` Julien Grall
2015-09-16 13:24 ` [Patch RFC 07/13] vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to Quan Xu
2015-09-16  9:44   ` Julien Grall
2015-09-16 14:03     ` Xu, Quan
2015-09-16 13:24 ` [Patch RFC 08/13] vt-d: Held on the freed page until the Device-TLB flush is completed Quan Xu
2015-09-16  9:45   ` Julien Grall
2015-09-16 13:24 ` [Patch RFC 09/13] vt-d: Put the page in Queued Invalidation(QI) interrupt handler if Quan Xu
2015-09-16 13:24 ` [Patch RFC 10/13] vt-d: Held on the removed page until the Device-TLB flush is completed Quan Xu
2015-09-16  9:52   ` Julien Grall
2015-09-16 13:24 ` [Patch RFC 11/13] vt-d: If the Device-TLB flush is still not completed when Quan Xu
2015-09-16  9:56   ` Julien Grall
2015-09-23 17:38   ` Konrad Rzeszutek Wilk
2015-09-24  1:40     ` Xu, Quan
2015-09-16 13:24 ` [Patch RFC 12/13] vt-d: For gnttab_transfer, If the Device-TLB flush is still Quan Xu
2015-09-16 13:24 ` [Patch RFC 13/13] vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB Quan Xu
2015-09-29  9:46   ` Jan Beulich
2015-09-17  3:26 ` [Patch RFC 00/13] VT-d Asynchronous Device-TLB Flush for ATS Device Xu, Quan
2015-09-21  8:51   ` Jan Beulich
2015-09-21  9:46     ` Xu, Quan
2015-09-21 12:03       ` Jan Beulich
2015-09-21 14:03         ` Xu, Quan
2015-09-21 14:20           ` Jan Beulich
2015-09-21 14:09 ` Xu, Quan
2015-09-23 16:26   ` Tim Deegan
2015-09-28  3:08     ` Xu, Quan
2015-09-28  6:47       ` Jan Beulich
2015-09-29  2:53         ` Xu, Quan
2015-09-29  7:21           ` Jan Beulich
2015-09-30 13:55             ` Xu, Quan
2015-09-30 14:03               ` Jan Beulich
2015-10-13 14:29             ` Xu, Quan
2015-10-13 14:50               ` Jan Beulich
2015-10-14 14:54                 ` Xu, Quan
2015-09-29  9:11       ` Tim Deegan
2015-09-29  9:57         ` Jan Beulich
2015-09-30 15:05         ` Xu, Quan
2015-10-01  9:09           ` Tim Deegan
2015-10-07 17:02             ` Xu, Quan
2015-10-08  8:51               ` Jan Beulich
2015-10-09  7:06                 ` Xu, Quan
2015-10-09  7:18                   ` Jan Beulich
2015-10-09  7:51                     ` Xu, Quan
2015-10-10 18:24               ` Tim Deegan
2015-10-11 11:09                 ` Xu, Quan
2015-10-12 12:25                   ` Jan Beulich
2015-10-13  9:34                   ` Tim Deegan
2015-10-14 14:44                     ` Xu, Quan
2015-10-12  1:42 ` Zhang, Yang Z
2015-10-12 12:34   ` Jan Beulich
2015-10-13  5:27     ` Zhang, Yang Z
2015-10-13  9:15       ` Jan Beulich
2015-10-14  5:12         ` Zhang, Yang Z
2015-10-14  9:30           ` Jan Beulich
2015-10-15  1:03             ` Zhang, Yang Z
2015-10-15  6:46               ` Jan Beulich
2015-10-15  7:28                 ` Zhang, Yang Z
2015-10-15  8:25                   ` Jan Beulich
2015-10-15  8:52                     ` Zhang, Yang Z
2015-10-15  9:24                       ` Jan Beulich
2015-10-15  9:50                         ` Zhang, Yang Z

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).