xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/3] VT-d Device-TLB flush issue
@ 2015-12-23  8:25 Quan Xu
  2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 60+ messages in thread
From: Quan Xu @ 2015-12-23  8:25 UTC (permalink / raw)
  To: jbeulich, kevin.tian
  Cc: feng.wu, eddie.dong, george.dunlap, andrew.cooper3, tim,
	xen-devel, jun.nakajima, Quan Xu, keir

This patches are based on Kevin Tian's previous discussion 'Revisit VT-d asynchronous flush issue'.
Fix current timeout concern and also allow limited ATS support in a light way:

1. Check VT-d Device-TLB flush error.
   This patch checks all kinds of error and all the way up the call trees of VT-d Device-TLB flush.

2. Reduce spin timeout to 1ms, which can be boot-time changed with 'iommu_qi_timeout_ms'.
   For example:
           multiboot /boot/xen.gz ats=1 iommu_qi_timeout_ms=100

3. Fix vt-d Device-TLB flush timeout issue.
    Now if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch
    set will fix it.

    If Device-TLB flush is timeout, we'll hide the target ATS
    device and crash the domain owning this ATS device.

    If impacted domain is hardware domain, just throw out a warning.

    The hided Device will be disallowed to be further assigned to
    any domain.

--

 * DMAR_OPERATION_TIMEOUT should be also chopped down to a low number of milliseconds.
   As Kevin Tian mentioned in 'Revisit VT-d asynchronous flush issue', We also confirmed with hardware team
   that 1ms is large enough for IOMMU internal flush. So I can change DMAR_OPERATION_TIMEOUT from 1000 ms to 1 ms.

   IOMMU_WAIT_OP() is only for VT-d registers read/write, and there is also a panic. We need a further discussion
   whether or how to remove this panic in next patch set.

 * if IOTLB/Context/IETC flush is timeout, panic hypervisor. The coming patch set will fix it.


Quan Xu (3):
  VT-d: Check VT-d Device-TLB flush error.
  VT-d: Reduce spin timeout to 1ms, which can be boot-time changed.
  VT-d: Fix vt-d Device-TLB flush timeout issue.

 xen/arch/x86/acpi/power.c                     |   8 +-
 xen/arch/x86/crash.c                          |   3 +-
 xen/arch/x86/domain_build.c                   |   5 +-
 xen/arch/x86/mm.c                             |  15 ++-
 xen/arch/x86/mm/p2m-ept.c                     |  14 ++-
 xen/arch/x86/mm/p2m-pt.c                      |  14 ++-
 xen/arch/x86/mm/p2m.c                         |  19 +++-
 xen/arch/x86/x86_64/mm.c                      |   7 +-
 xen/common/domain.c                           |   3 +-
 xen/common/grant_table.c                      |   5 +-
 xen/common/memory.c                           |  13 ++-
 xen/drivers/passthrough/amd/iommu_init.c      |   4 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c   |   4 +-
 xen/drivers/passthrough/arm/smmu.c            |  13 ++-
 xen/drivers/passthrough/iommu.c               |  47 +++++---
 xen/drivers/passthrough/pci.c                 |   2 +-
 xen/drivers/passthrough/vtd/extern.h          |   6 +-
 xen/drivers/passthrough/vtd/iommu.c           | 157 ++++++++++++++++++++------
 xen/drivers/passthrough/vtd/qinval.c          |  93 ++++++++++++++-
 xen/drivers/passthrough/vtd/quirks.c          |  26 +++--
 xen/drivers/passthrough/vtd/x86/ats.c         |  13 +++
 xen/drivers/passthrough/vtd/x86/vtd.c         |  13 ++-
 xen/drivers/passthrough/x86/iommu.c           |   6 +-
 xen/include/asm-x86/hvm/svm/amd-iommu-proto.h |   4 +-
 xen/include/asm-x86/iommu.h                   |   2 +-
 xen/include/xen/iommu.h                       |  20 ++--
 26 files changed, 403 insertions(+), 113 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue.
@ 2016-01-07  1:39 Xu, Quan
  2016-01-07 13:28 ` Jan Beulich
  0 siblings, 1 reply; 60+ messages in thread
From: Xu, Quan @ 2016-01-07  1:39 UTC (permalink / raw)
  To: jbeulich@suse.com
  Cc: Tian, Kevin, Wu, Feng, Xu, Quan, george.dunlap@eu.citrix.com,
	tim@xen.org, Dong, Eddie, xen-devel@lists.xen.org, Nakajima, Jun,
	andrew.cooper3@citrix.com, keir@xen.org

On January 06, 2016 7:26 PM, <quan.xu@intel.com> wrote: 
> > > diff --git a/xen/drivers/passthrough/vtd/qinval.c
> > > b/xen/drivers/passthrough/vtd/qinval.c
> > > index b227e4e..7330c5d 100644
> > > --- a/xen/drivers/passthrough/vtd/qinval.c
> > > +++ b/xen/drivers/passthrough/vtd/qinval.c
> > > @@ -190,9 +190,19 @@ static int queue_invalidate_wait(struct iommu
> > > *iommu,  static int invalidate_sync(struct iommu *iommu)  {
> > >      struct qi_ctrl *qi_ctrl = iommu_qi_ctrl(iommu);
> > > +    int rc;
> > >
> > >      if ( qi_ctrl->qinval_maddr )
> > > -        return queue_invalidate_wait(iommu, 0, 1, 1);
> > > +    {
> > > +        rc = queue_invalidate_wait(iommu, 0, 1, 1);
> > > +        if ( rc )
> > > +        {
> > > +            if ( rc == -ETIMEDOUT )
> > > +                panic("Queue invalidate wait descriptor was
> > timeout.\n");
> > > +            return rc;
> > > +        }
> > > +    }
> > > +
> >
> > why do you still do panic here? w/ PATCH 1/3 you should return rc to
> > caller directly, right?
> >
> 
> This panic is original in queue_invalidate_wait(). Patch 1/3 is just for Device-TLB
> flush error ( Actually it covers iotlb flush error).
> If I fix VT-d iotlb/context/iec flush error to propagated callers, then I can
> remove this panic and return rc to propagated caller directly.
> 
> 
> 
> > > +static void dev_invalidate_iotlb_timeout(struct iommu *iommu, u16 did,
> > > +                                         u16 seg, u8 bus, u8
> devfn)
> > {
> > > +    struct domain *d;
> > > +    struct pci_dev *pdev;
> > > +
> > > +    d = rcu_lock_domain_by_id(iommu->domid_map[did]);
> > > +    ASSERT(d);
> > > +    for_each_pdev(d, pdev)
> > > +    {
> > > +        if ( (pdev->seg == seg) &&
> > > +             (pdev->bus == bus) &&
> > > +             (pdev->devfn == devfn) )
> > > +        {
> > > +            if ( pdev->domain )
> > > +            {
> > > +                list_del(&pdev->domain_list);
> > > +                pdev->domain = NULL;
> > > +            }
> > > +
> > > +            if ( pci_hide_device(bus, devfn) )
> > > +            {
> > > +                printk(XENLOG_ERR
> > > +                       "IOMMU hide device %04x:%02x:%02x error.",
> > > +                       seg, bus, devfn);
> > > +                break;
> > > +            }
> >
> > I'm not sure whether using pci_hide_device is enough or not break
> > other code path here? For example, now you have errors propagated to
> callers.
> 
> More information, after call pci_hide_device(), ..
>          pdev->domain = dom_xen;
>          list_add(&pdev->domain_list, &dom_xen->arch.pdev_list); ..
> 
> Hidden PCI devices are associated with 'dom_xen'.
> Now I found it may not clear rmrr mapping / context mapping. .i.e
> iommu_remove_device() -- [hd->platform_ops is NULL for dom_xen].
> But I think it is acceptable, IMO dom_xen is a part hypervisor, and there
> mappings would not a security issue.
> Or I can remove these mappings before to hide this device.
> 
> So I think it will not break other code break other code.
> 
> > What's the
> > final policy against flush error?
> >
> 
> If Device-TLB flush is timeout, we'll hide the target ATS device and crash the
> domain owning this ATS device.
> If impacted domain is hardware domain, just throw out a warning, instead of
> crash the hardware domain.
> The hided Device will be disallowed to be further assigned to any domain.
> 
> 
> 
> >If they will finally go to cleanup some more state  relying on
> >pdev->domain, then that code path might be broken. If it's fine to do
> >cleanup at this point, then just hidden is not enough. If you look at
> >pci_remove_device, there's quite some state to be further cleaned up...
> >
> 
> 
> As the pdev->domain is 'dom_xen';
> So for this case, it is still working. Correct me if it is not correct.
> BTW, a quick question, which case to call pci_remove_device()?
> 
> 
> > I didn't think about the full logic thoroughly now. But it would
> > always be good to not hide device now until a point where all related
> > states have been cleaned up in error handling path chained up.
> >

Jan, could you help me to double check it? thanks.

Quan

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-01-27 14:35 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-12-23  8:25 [PATCH v4 0/3] VT-d Device-TLB flush issue Quan Xu
2015-12-23  8:25 ` [PATCH v4 1/3] VT-d: Check VT-d Device-TLB flush error Quan Xu
2015-12-25  2:53   ` Tian, Kevin
2016-01-06  2:12     ` Xu, Quan
2016-01-14 17:01     ` Jan Beulich
2016-01-14 17:15   ` Jan Beulich
2015-12-23  8:25 ` [PATCH v4 2/3] VT-d: Reduce spin timeout to 1ms, which can be boot-time changed Quan Xu
2015-12-25  2:56   ` Tian, Kevin
2015-12-25  2:58     ` Xu, Quan
2016-01-13 16:55   ` Jan Beulich
2016-01-14  1:53     ` Xu, Quan
2016-01-13 16:57   ` Jan Beulich
2016-01-14  1:59     ` Xu, Quan
2016-01-14  9:03       ` Jan Beulich
2016-01-14 10:29         ` Xu, Quan
2016-01-14 12:17           ` Jan Beulich
2015-12-23  8:25 ` [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Quan Xu
2015-12-25  3:06   ` Tian, Kevin
2016-01-06 11:25     ` Xu, Quan
2016-01-15 13:09   ` Jan Beulich
2016-01-20 10:26     ` Xu, Quan
2016-01-20 11:29       ` Jan Beulich
2016-01-21 16:16         ` Xu, Quan
2016-01-21 16:31           ` Jan Beulich
2016-01-22 15:57             ` Xu, Quan
2016-01-25 14:09               ` Jan Beulich
2016-01-26 13:47                 ` Xu, Quan
2016-01-26 13:59                   ` Jan Beulich
2016-01-26 15:27                     ` Xu, Quan
2016-01-26 15:52                       ` Jan Beulich
2016-01-26 22:47                         ` Tian, Kevin
2016-01-27 11:09                           ` Xu, Quan
2016-01-27 11:24                             ` Jan Beulich
2016-01-27 12:38                               ` Xu, Quan
2016-01-27 13:15                                 ` Jan Beulich
2016-01-27 14:13                                   ` Xu, Quan
2016-01-27 14:29                                     ` Jan Beulich
2016-01-27 14:35                                       ` Xu, Quan
2015-12-23 10:19 ` [PATCH v4 0/3] VT-d Device-TLB flush issue Xu, Quan
2015-12-23 10:40   ` Jan Beulich
2015-12-23 10:59     ` Xu, Quan
2015-12-23 10:39 ` Jan Beulich
2015-12-23 11:09   ` Xu, Quan
2015-12-25  1:50   ` Tian, Kevin
2015-12-25  2:26     ` Xu, Quan
2015-12-25  1:53 ` Tian, Kevin
2015-12-25  2:04   ` Xu, Quan
  -- strict thread matches above, loose matches on Subject: below --
2016-01-07  1:39 [PATCH v4 3/3] VT-d: Fix vt-d Device-TLB flush timeout issue Xu, Quan
2016-01-07 13:28 ` Jan Beulich
2016-01-07 13:46   ` Xu, Quan
2016-01-08  1:06     ` Xu, Quan
2016-01-13  6:12     ` Tian, Kevin
2016-01-13 11:02       ` Jan Beulich
2016-01-14  1:22         ` Tian, Kevin
2016-01-14  9:00           ` Jan Beulich
2016-01-14  9:12             ` Tian, Kevin
2016-01-14 10:46               ` Xu, Quan
2016-01-18 15:32                 ` Tim Deegan
2016-01-19  0:46                   ` Tian, Kevin
2016-01-19  6:54                   ` Xu, Quan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).