All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] On unmap, flush IOMMU TLB and return correct size
@ 2013-09-02  2:24 Shankar, Hari
       [not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Shankar, Hari @ 2013-09-02  2:24 UTC (permalink / raw)
  To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Spiller, John


[-- Attachment #1.1: Type: text/plain, Size: 2121 bytes --]

Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU TLB flush for the particular domain and return correct unmap size when intel_iommu_unmap() is called

The patch is generated on Linux kernel version 3.6.11

--- linux/drivers/iommu/intel-iommu.c.orig      2013-09-01 10:10:14.723958000 -0700
+++ linux/drivers/iommu/intel-iommu.c   2013-09-01 18:22:58.497578000 -0700
@@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i
 {
        struct dmar_domain *dmar_domain = domain->priv;
        int order;
+       struct intel_iommu *iommu;
+       unsigned long start_pfn, last_pfn;
+       unsigned int npages;
+       int iommu_id, num, ndomains;

-       order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
-                           (iova + size - 1) >> VTD_PAGE_SHIFT);
+       start_pfn = iova >> VTD_PAGE_SHIFT;
+       last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
+       order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
+       last_pfn |= (1UL << order) - 1;
+       npages = last_pfn - start_pfn + 1;
+       for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) {
+               iommu = g_iommus[iommu_id];

-       if (dmar_domain->max_addr == iova + size)
+               /*
+                * find bit position of dmar_domain
+                */
+               ndomains = cap_ndoms(iommu->cap);
+               for_each_set_bit(num, iommu->domain_ids, ndomains)
+                       if (iommu->domains[num] == dmar_domain)
+                               iommu_flush_iotlb_psi(iommu, num,
+                                                       start_pfn, npages, 0);
+       }
+       if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT))
                dmar_domain->max_addr = iova;
-
-       return PAGE_SIZE << order;
+       return npages << VTD_PAGE_SHIFT;
 }

 static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,

Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

[-- Attachment #1.2: Type: text/html, Size: 4116 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
@ 2013-09-03  4:25   ` Alex Williamson
       [not found]     ` <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
  2013-09-22  2:59   ` David Woodhouse
  1 sibling, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2013-09-03  4:25 UTC (permalink / raw)
  To: Shankar, Hari
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Spiller, John

Hi Hari,

On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote:
> Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU TLB flush for the particular domain and return correct unmap size when intel_iommu_unmap() is called

"Patch Description:" is unnecessary.  Look at other commits that have
touched this file to get an idea how to format the subject and commit
log:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers/iommu/intel-iommu.c

Sign-off should go here, not below the patch.  Subject should include a
subsystem, such as "iommu/intel:" or "intel-iommu:".  Also a good idea
to provide some justification why this is necessary for stable (unmaps
from userspace through vfio results in coherency issue...)

> The patch is generated on Linux kernel version 3.6.11

This should be noted, but not part of the commit message.  See section
15 of SubmittingPatches and note the additional comments between the
"---" marker and the beginning of the patch.

> --- linux/drivers/iommu/intel-iommu.c.orig      2013-09-01 10:10:14.723958000 -0700
> +++ linux/drivers/iommu/intel-iommu.c   2013-09-01 18:22:58.497578000 -0700
> @@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i

Tabs seem to have been clobbered in your patch, scripts/checkpatch.pl
will note this for you.

>  {
>         struct dmar_domain *dmar_domain = domain->priv;
>         int order;
> +       struct intel_iommu *iommu;
> +       unsigned long start_pfn, last_pfn;
> +       unsigned int npages;
> +       int iommu_id, num, ndomains;
> 
> -       order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
> -                           (iova + size - 1) >> VTD_PAGE_SHIFT);
> +       start_pfn = iova >> VTD_PAGE_SHIFT;
> +       last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
> +       order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
> +       last_pfn |= (1UL << order) - 1;
> +       npages = last_pfn - start_pfn + 1;

Doesn't order tell us npages more directly?  npages = 1UL << order?

> +       for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) {
> +               iommu = g_iommus[iommu_id];
> 
> -       if (dmar_domain->max_addr == iova + size)
> +               /*
> +                * find bit position of dmar_domain
> +                */
> +               ndomains = cap_ndoms(iommu->cap);
> +               for_each_set_bit(num, iommu->domain_ids, ndomains)
> +                       if (iommu->domains[num] == dmar_domain)
> +                               iommu_flush_iotlb_psi(iommu, num,
> +                                                       start_pfn, npages, 0);

This second bitmap search seems unnecessary, I think
iommu_flush_iotlb_psi wants dmar_domain->id as the second parameter in
all cases, not num.  So this could be reduced to :

for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus)
	iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain->id, start_pfn, npages, 0);

> +       }
> +       if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT))
>                 dmar_domain->max_addr = iova;
> -
> -       return PAGE_SIZE << order;
> +       return npages << VTD_PAGE_SHIFT;
>  }
> 
>  static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
> 
> Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Please configure your mailer to drop the html version.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]     ` <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
@ 2013-09-05  4:05       ` Shankar, Hari
       [not found]         ` <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Shankar, Hari @ 2013-09-05  4:05 UTC (permalink / raw)
  To: Alex Williamson, Shankar, Hari
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Spiller, John

Hi Alex,
See inline,

On 9/2/13 9:25 PM, "Alex Williamson" <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

>Hi Hari,
>
>On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote:
>> Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU
>>TLB flush for the particular domain and return correct unmap size when
>>intel_iommu_unmap() is called
>
>"Patch Description:" is unnecessary.  Look at other commits that have
>touched this file to get an idea how to format the subject and commit
>log:
>
>http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers
>/iommu/intel-iommu.c
>
>Sign-off should go here, not below the patch.  Subject should include a
>subsystem, such as "iommu/intel:" or "intel-iommu:".  Also a good idea
>to provide some justification why this is necessary for stable (unmaps
>from userspace through vfio results in coherency issue...)
>
>> The patch is generated on Linux kernel version 3.6.11
>
>This should be noted, but not part of the commit message.  See section
>15 of SubmittingPatches and note the additional comments between the
>"---" marker and the beginning of the patch.
>
>> --- linux/drivers/iommu/intel-iommu.c.orig      2013-09-01
>>10:10:14.723958000 -0700
>> +++ linux/drivers/iommu/intel-iommu.c   2013-09-01 18:22:58.497578000
>>-0700
>> @@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i
>
>Tabs seem to have been clobbered in your patch, scripts/checkpatch.pl
>will note this for you.
>
>>  {
>>         struct dmar_domain *dmar_domain = domain->priv;
>>         int order;
>> +       struct intel_iommu *iommu;
>> +       unsigned long start_pfn, last_pfn;
>> +       unsigned int npages;
>> +       int iommu_id, num, ndomains;
>> 
>> -       order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
>> -                           (iova + size - 1) >> VTD_PAGE_SHIFT);
>> +       start_pfn = iova >> VTD_PAGE_SHIFT;
>> +       last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
>> +       order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
>> +       last_pfn |= (1UL << order) - 1;
>> +       npages = last_pfn - start_pfn + 1;
>
>Doesn't order tell us npages more directly?  npages = 1UL << order?


[Hari] 1UL << order formula will not work for all cases. Jeff Kimmel
(cc'ed) has a good explanation for that. Jeff's response below,

<Jeff>
If start_pfn is always aligned modulo (1 << order),
then '1UL << order' would get the same npages value, but if
start_pfn is not guaranteed to have such alignment, then '1UL << order'
would
produce an incorrect npages value.  For example, start_pfn could fall in
some range that was mapped with small pages, but the range could end in a
region mapped with large pages.

</Jeff>

>
>> +       for_each_set_bit(iommu_id, dmar_domain->iommu_bmp,
>>g_num_of_iommus) {
>> +               iommu = g_iommus[iommu_id];
>> 
>> -       if (dmar_domain->max_addr == iova + size)
>> +               /*
>> +                * find bit position of dmar_domain
>> +                */
>> +               ndomains = cap_ndoms(iommu->cap);
>> +               for_each_set_bit(num, iommu->domain_ids, ndomains)
>> +                       if (iommu->domains[num] == dmar_domain)
>> +                               iommu_flush_iotlb_psi(iommu, num,
>> +                                                       start_pfn,
>>npages, 0);
>
>This second bitmap search seems unnecessary, I think
>iommu_flush_iotlb_psi wants dmar_domain->id as the second parameter in
>all cases, not num.  So this could be reduced to :
>
>for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus)
>	iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain->id, start_pfn,
>npages, 0);

[Hari] For VFIO, iommu_alloc_vm_domain() is called to get VM domain and
has DOMAIN_FLAG_VIRTUAL_MACHINE flag set. In this case, domain->id is
derived from static variable 'vm_domid' and not by using
find_first_zero_bit in iommu->domain_ids, which in-turn also claims a slot
on iommu->domains[] array, see iommu_attach_domain(). For domain with
DOMAIN_FLAG_VIRTUAL_MACHINE flag, domain id is actually derived later in
domain_context_mapping_one(), logic under 'if (domain->flags &
DOMAIN_FLAG_VIRTUAL_MACHINE'. As you can see domain->id for VM domain is
not the actual domain id and can also conflict with non VM domain as
vm_domid starts from 0. Using domain->id in the loop will give wrong
result. Using second loop and passing 'num' makes sure we always select
the right domain to flush.

Hari.

>
>> +       }
>> +       if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT))
>>                 dmar_domain->max_addr = iova;
>> -
>> -       return PAGE_SIZE << order;
>> +       return npages << VTD_PAGE_SHIFT;
>>  }
>> 
>>  static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain
>>*domain,
>> 
>> Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
>> cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>
>Please configure your mailer to drop the html version.  Thanks,
>
>Alex
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]         ` <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
@ 2013-09-05  4:58           ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2013-09-05  4:58 UTC (permalink / raw)
  To: Shankar, Hari
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Spiller, John

On Thu, 2013-09-05 at 04:05 +0000, Shankar, Hari wrote:
> Hi Alex,
> See inline,
> 
> On 9/2/13 9:25 PM, "Alex Williamson" <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> >Hi Hari,
> >
> >On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote:
> >> Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU
> >>TLB flush for the particular domain and return correct unmap size when
> >>intel_iommu_unmap() is called
> >
> >"Patch Description:" is unnecessary.  Look at other commits that have
> >touched this file to get an idea how to format the subject and commit
> >log:
> >
> >http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers
> >/iommu/intel-iommu.c
> >
> >Sign-off should go here, not below the patch.  Subject should include a
> >subsystem, such as "iommu/intel:" or "intel-iommu:".  Also a good idea
> >to provide some justification why this is necessary for stable (unmaps
> >from userspace through vfio results in coherency issue...)
> >
> >> The patch is generated on Linux kernel version 3.6.11
> >
> >This should be noted, but not part of the commit message.  See section
> >15 of SubmittingPatches and note the additional comments between the
> >"---" marker and the beginning of the patch.
> >
> >> --- linux/drivers/iommu/intel-iommu.c.orig      2013-09-01
> >>10:10:14.723958000 -0700
> >> +++ linux/drivers/iommu/intel-iommu.c   2013-09-01 18:22:58.497578000
> >>-0700
> >> @@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i
> >
> >Tabs seem to have been clobbered in your patch, scripts/checkpatch.pl
> >will note this for you.
> >
> >>  {
> >>         struct dmar_domain *dmar_domain = domain->priv;
> >>         int order;
> >> +       struct intel_iommu *iommu;
> >> +       unsigned long start_pfn, last_pfn;
> >> +       unsigned int npages;
> >> +       int iommu_id, num, ndomains;
> >> 
> >> -       order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
> >> -                           (iova + size - 1) >> VTD_PAGE_SHIFT);
> >> +       start_pfn = iova >> VTD_PAGE_SHIFT;
> >> +       last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
> >> +       order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
> >> +       last_pfn |= (1UL << order) - 1;
> >> +       npages = last_pfn - start_pfn + 1;
> >
> >Doesn't order tell us npages more directly?  npages = 1UL << order?
> 
> 
> [Hari] 1UL << order formula will not work for all cases. Jeff Kimmel
> (cc'ed) has a good explanation for that. Jeff's response below,
> 
> <Jeff>
> If start_pfn is always aligned modulo (1 << order),
> then '1UL << order' would get the same npages value, but if
> start_pfn is not guaranteed to have such alignment, then '1UL << order'
> would
> produce an incorrect npages value.  For example, start_pfn could fall in
> some range that was mapped with small pages, but the range could end in a
> region mapped with large pages.
> 
> </Jeff>

I think iommu_map()/iommu_unmap() only passes naturally aligned regions
to the iommu drivers, so I don't think we need to worry about that.
Feel free to verify.

> >> +       for_each_set_bit(iommu_id, dmar_domain->iommu_bmp,
> >>g_num_of_iommus) {
> >> +               iommu = g_iommus[iommu_id];
> >> 
> >> -       if (dmar_domain->max_addr == iova + size)
> >> +               /*
> >> +                * find bit position of dmar_domain
> >> +                */
> >> +               ndomains = cap_ndoms(iommu->cap);
> >> +               for_each_set_bit(num, iommu->domain_ids, ndomains)
> >> +                       if (iommu->domains[num] == dmar_domain)
> >> +                               iommu_flush_iotlb_psi(iommu, num,
> >> +                                                       start_pfn,
> >>npages, 0);
> >
> >This second bitmap search seems unnecessary, I think
> >iommu_flush_iotlb_psi wants dmar_domain->id as the second parameter in
> >all cases, not num.  So this could be reduced to :
> >
> >for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus)
> >	iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain->id, start_pfn,
> >npages, 0);
> 
> [Hari] For VFIO, iommu_alloc_vm_domain() is called to get VM domain and
> has DOMAIN_FLAG_VIRTUAL_MACHINE flag set. In this case, domain->id is
> derived from static variable 'vm_domid' and not by using
> find_first_zero_bit in iommu->domain_ids, which in-turn also claims a slot
> on iommu->domains[] array, see iommu_attach_domain(). For domain with
> DOMAIN_FLAG_VIRTUAL_MACHINE flag, domain id is actually derived later in
> domain_context_mapping_one(), logic under 'if (domain->flags &
> DOMAIN_FLAG_VIRTUAL_MACHINE'. As you can see domain->id for VM domain is
> not the actual domain id and can also conflict with non VM domain as
> vm_domid starts from 0. Using domain->id in the loop will give wrong
> result. Using second loop and passing 'num' makes sure we always select
> the right domain to flush.

Ok, that does sound familiar.  I agree, you do need to search the
bitmaps.  Thanks,

Alex

> >> +       }
> >> +       if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT))
> >>                 dmar_domain->max_addr = iova;
> >> -
> >> -       return PAGE_SIZE << order;
> >> +       return npages << VTD_PAGE_SHIFT;
> >>  }
> >> 
> >>  static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain
> >>*domain,
> >> 
> >> Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> >> cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >
> >Please configure your mailer to drop the html version.  Thanks,
> >
> >Alex
> >
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
  2013-09-03  4:25   ` Alex Williamson
@ 2013-09-22  2:59   ` David Woodhouse
       [not found]     ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2013-09-22  2:59 UTC (permalink / raw)
  To: Shankar, Hari
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Singh, Varinder, Sundaram, Rajesh, Spiller, John, Kimmel, Jeff


[-- Attachment #1.1: Type: text/plain, Size: 12348 bytes --]

On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote:
> Patch Description: The patch is for Intel IOMMU driver. It forces
> IOMMU TLB flush for the particular domain and return correct unmap
> size when intel_iommu_unmap() is called

I *hate* the bizarre calling convention for iommu_unmap(). Is it
actually clearly documented anywhere? Why on earth is it not just
returning void, and expected to unmap what it was *asked* to unmap?

I keep getting patches which "fix" the return value, with no clear
explanation of what it's actually supposed to do. I ask sometimes, but
the answer is always so weird that I end up forgetting it again.

The main user seems to be kvm_iommu_put_pages() in virt/kvm/iommu.c
which appears to be *designed* to get poor performance, making a single
call (with associated clflush and IOTLB flush) for every individual
*page* instead of just unmapping a whole virtual range at a time.

One thing you neglected to fix in this patch is the fact that the
intel_iommu_unmap() function doesn't call dma_pte_free_pagetable() after
it calls dma_pte_clear_range(). For reasons not entirely clear to me,
adding that call fixes a significant performance issue where the time
taken to do unmaps is exponential w.r.t. the size of the region.

And in fact, if we fix that then you suffer the same bug that exists
everywhere *else* we do unmaps... which is that we actually need to
flush the IOTLB (with the IH bit cleared) *before* we free the page
table pages, because the hardware could have cached them and could still
be doing a page walk after you've returned the pages to the pool.

Here's a completely untested work-in-progress that attempts to fix that.
I'll be able to test it myself on about Tuesday when I'm home from New
Orleans and awake...

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 15e9b57..69980a1 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -413,6 +413,7 @@ struct deferred_flush_tables {
 	int next;
 	struct iova *iova[HIGH_WATER_MARK];
 	struct dmar_domain *domain[HIGH_WATER_MARK];
+	struct page *freelist[HIGH_WATER_MARK];
 };
 
 static struct deferred_flush_tables *deferred_flush;
@@ -945,6 +946,114 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 	}
 }
 
+/* When a page at a given level is being unlinked from its parent, we don't
+   need to *modify* it at all. All we need to do is make a list of all the
+   pages which can be freed just as soon as we've flushed the IOTLB and we
+   know the hardware page-walk will no longer touch them.
+   The 'pte' argument is the *parent* PTE, pointing to the page that is to
+   be freed. */
+static struct page *dma_pte_list_pagetables(struct dmar_domain *domain, int level,
+					    struct dma_pte *pte, struct page *freelist)
+{
+	struct page *pg;
+
+	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
+	pg->freelist = freelist;
+	freelist = pg;
+
+	if (level > 1) {
+		pte = page_address(pg);
+
+		do {
+			if (dma_pte_present(pte) && !dma_pte_superpage(pte))
+				freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+
+		} while (!first_pte_in_page(++pte));
+	}
+
+	return freelist;
+}
+
+static struct page *dma_pte_clear_level(struct dmar_domain *domain, int level,
+					struct dma_pte *pte, unsigned long pfn,
+					unsigned long start_pfn, unsigned long last_pfn,
+					struct page *freelist)
+{
+	struct dma_pte *first_pte = NULL, *last_pte = NULL;
+
+	pfn = max(start_pfn, pfn);
+	pte = &pte[pfn_level_offset(pfn, level)];
+
+	do {
+		unsigned long level_pfn;
+		struct dma_pte *level_pte;
+
+		if (!dma_pte_present(pte))
+			goto next;
+
+		level_pfn = pfn & level_mask(level - 1);
+		level_pte = phys_to_virt(dma_pte_addr(pte));
+
+		/* If range covers entire pagetable, free it */
+		if (start_pfn <= level_pfn &&
+		    last_pfn >= level_pfn + level_size(level)) {
+
+			/* These suborbinate page tables are going away entirely. Don't
+			   bother to clear them; we're just going to *free* them. */
+			if (level > 1 && !dma_pte_superpage(pte))
+				freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+
+			dma_clear_pte(pte);
+			if (!first_pte)
+				first_pte = pte;
+			last_pte = pte;
+		} else {
+			/* Recurse down into a level that isn't *entirely* obsolete */
+			freelist = dma_pte_clear_level(domain, level - 1, level_pte,
+						       level_pfn, start_pfn, last_pfn, freelist);
+		}
+next:
+		pfn += level_size(level);
+	} while (!first_pte_in_page(++pte) && pfn <= last_pfn);
+
+	if (first_pte)
+		domain_flush_cache(domain, first_pte,
+				   (void *)++last_pte - (void *)first_pte);
+
+	return freelist;
+}
+
+/* We can't just free the pages because the IOMMU may still be walking
+   the page tables, and may have cached the intermediate levels. The
+   pages can only be freed after the IOTLB flush has been done. */
+struct page *domain_unmap(struct dmar_domain *domain,
+			  unsigned long start_pfn,
+			  unsigned long last_pfn)
+{
+	int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT;
+	struct page *freelist = NULL;
+
+	BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width);
+	BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width);
+	BUG_ON(start_pfn > last_pfn);
+
+	/* we don't need lock here; nobody else touches the iova range */
+	freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
+				       domain->pgd, 0, start_pfn, last_pfn, NULL);
+
+	return freelist;
+}
+
+void dma_free_pagelist(struct page *freelist)
+{
+	struct page *pg;
+
+	while ((pg = freelist)) {
+		freelist = pg->freelist;
+		free_pgtable_page(pg);
+	}
+}
+
 /* iommu handling */
 static int iommu_alloc_root_entry(struct intel_iommu *iommu)
 {
@@ -1054,7 +1163,7 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 		break;
 	case DMA_TLB_PSI_FLUSH:
 		val = DMA_TLB_PSI_FLUSH|DMA_TLB_IVT|DMA_TLB_DID(did);
-		/* Note: always flush non-leaf currently */
+		/* IH bit is passed in as part of address */
 		val_iva = size_order | addr;
 		break;
 	default:
@@ -1165,13 +1274,15 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 }
 
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
-				  unsigned long pfn, unsigned int pages, int map)
+				  unsigned long pfn, unsigned int pages, int ih, int map)
 {
 	unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 	uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
 
 	BUG_ON(pages == 0);
 
+	if (ih)
+		ih = 1 << 6;
 	/*
 	 * Fallback to domain selective flush if no PSI support or the size is
 	 * too big.
@@ -1182,7 +1293,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 		iommu->flush.flush_iotlb(iommu, did, 0, 0,
 						DMA_TLB_DSI_FLUSH);
 	else
-		iommu->flush.flush_iotlb(iommu, did, addr, mask,
+		iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
 						DMA_TLB_PSI_FLUSH);
 
 	/*
@@ -2850,7 +2961,7 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr,
 
 	/* it's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
-		iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 1);
+		iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 0, 1);
 	else
 		iommu_flush_write_buffer(iommu);
 
@@ -2902,13 +3013,16 @@ static void flush_unmaps(void)
 			/* On real hardware multiple invalidations are expensive */
 			if (cap_caching_mode(iommu->cap))
 				iommu_flush_iotlb_psi(iommu, domain->id,
-				iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1, 0);
+					iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1,
+					!deferred_flush[i].freelist[j], 0);
 			else {
 				mask = ilog2(mm_to_dma_pfn(iova->pfn_hi - iova->pfn_lo + 1));
 				iommu_flush_dev_iotlb(deferred_flush[i].domain[j],
 						(uint64_t)iova->pfn_lo << PAGE_SHIFT, mask);
 			}
 			__free_iova(&deferred_flush[i].domain[j]->iovad, iova);
+			if (deferred_flush[i].freelist[j])
+				dma_free_pagelist(deferred_flush[i].freelist[j]);
 		}
 		deferred_flush[i].next = 0;
 	}
@@ -2925,7 +3039,7 @@ static void flush_unmaps_timeout(unsigned long data)
 	spin_unlock_irqrestore(&async_umap_flush_lock, flags);
 }
 
-static void add_unmap(struct dmar_domain *dom, struct iova *iova)
+static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist)
 {
 	unsigned long flags;
 	int next, iommu_id;
@@ -2941,6 +3055,7 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova)
 	next = deferred_flush[iommu_id].next;
 	deferred_flush[iommu_id].domain[next] = dom;
 	deferred_flush[iommu_id].iova[next] = iova;
+	deferred_flush[iommu_id].freelist[next] = freelist;
 	deferred_flush[iommu_id].next++;
 
 	if (!timer_on) {
@@ -2960,6 +3075,7 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr,
 	unsigned long start_pfn, last_pfn;
 	struct iova *iova;
 	struct intel_iommu *iommu;
+	struct page *freelist;
 
 	if (iommu_no_mapping(dev))
 		return;
@@ -2980,19 +3096,16 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr,
 	pr_debug("Device %s unmapping: pfn %lx-%lx\n",
 		 pci_name(pdev), start_pfn, last_pfn);
 
-	/*  clear the whole page */
-	dma_pte_clear_range(domain, start_pfn, last_pfn);
-
-	/* free page tables */
-	dma_pte_free_pagetable(domain, start_pfn, last_pfn);
+	freelist = domain_unmap(domain, start_pfn, last_pfn);
 
 	if (intel_iommu_strict) {
 		iommu_flush_iotlb_psi(iommu, domain->id, start_pfn,
-				      last_pfn - start_pfn + 1, 0);
+				      last_pfn - start_pfn + 1, !freelist, 0);
 		/* free iova */
 		__free_iova(&domain->iovad, iova);
+		dma_free_pagelist(freelist);
 	} else {
-		add_unmap(domain, iova);
+		add_unmap(domain, iova, freelist);
 		/*
 		 * queue up the release of the unmap to save the 1/6th of the
 		 * cpu used up by the iotlb flush operation...
@@ -3054,6 +3167,7 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist,
 	unsigned long start_pfn, last_pfn;
 	struct iova *iova;
 	struct intel_iommu *iommu;
+	struct page *freelist;
 
 	if (iommu_no_mapping(hwdev))
 		return;
@@ -3071,19 +3185,16 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist,
 	start_pfn = mm_to_dma_pfn(iova->pfn_lo);
 	last_pfn = mm_to_dma_pfn(iova->pfn_hi + 1) - 1;
 
-	/*  clear the whole page */
-	dma_pte_clear_range(domain, start_pfn, last_pfn);
-
-	/* free page tables */
-	dma_pte_free_pagetable(domain, start_pfn, last_pfn);
+	freelist = domain_unmap(domain, start_pfn, last_pfn);
 
 	if (intel_iommu_strict) {
 		iommu_flush_iotlb_psi(iommu, domain->id, start_pfn,
-				      last_pfn - start_pfn + 1, 0);
+				      last_pfn - start_pfn + 1, !freelist, 0);
 		/* free iova */
 		__free_iova(&domain->iovad, iova);
+		dma_free_pagelist(freelist);
 	} else {
-		add_unmap(domain, iova);
+		add_unmap(domain, iova, freelist);
 		/*
 		 * queue up the release of the unmap to save the 1/6th of the
 		 * cpu used up by the iotlb flush operation...
@@ -3166,7 +3277,7 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int ne
 
 	/* it's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
-		iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 1);
+		iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 0, 1);
 	else
 		iommu_flush_write_buffer(iommu);
 
@@ -4116,7 +4227,10 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
 	int order;
 
 	order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
-			    (iova + size - 1) >> VTD_PAGE_SHIFT);
+				    (iova + size - 1) >> VTD_PAGE_SHIFT);
+
+	dma_pte_free_pagetable(dmar_domain, iova >> VTD_PAGE_SHIFT,
+			       (iova + size - 1) >> VTD_PAGE_SHIFT);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;



-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org                              Intel Corporation

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]     ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
@ 2013-09-25 15:54       ` Joerg Roedel
       [not found]         ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
  2013-10-02 15:04       ` David Woodhouse
  1 sibling, 1 reply; 17+ messages in thread
From: Joerg Roedel @ 2013-09-25 15:54 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John

On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote:
> I *hate* the bizarre calling convention for iommu_unmap(). Is it
> actually clearly documented anywhere? Why on earth is it not just
> returning void, and expected to unmap what it was *asked* to unmap?

Yeah, I agree that this should be documented since it is quite
non-standard/non-obvious behaviour of a function. The reason the
interface was implemented this way is that the caller should not need to
know (or keep track of) the page-size which was used to map a given iova.

So the interface is basically, that you give an iova and a size and the
iommu driver unmaps _at_least_ a region of that size. But if you ask for
a 4k region which is mapped by a 2M page then the whole 2M are unmapped.
The return value tells you how much was actually unmapped.

Not the best interface, I know. We should come up with a better way to
handle this.


	Joerg

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]         ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
@ 2013-09-25 16:05           ` David Woodhouse
       [not found]             ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
  2013-09-25 16:33           ` Alex Williamson
  1 sibling, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2013-09-25 16:05 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John


[-- Attachment #1.1: Type: text/plain, Size: 1627 bytes --]

On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote:
> On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote:
> > I *hate* the bizarre calling convention for iommu_unmap(). Is it
> > actually clearly documented anywhere? Why on earth is it not just
> > returning void, and expected to unmap what it was *asked* to unmap?
> 
> Yeah, I agree that this should be documented since it is quite
> non-standard/non-obvious behaviour of a function. The reason the
> interface was implemented this way is that the caller should not need to
> know (or keep track of) the page-size which was used to map a given iova.

Why would it ever care? If it *happens* to map something that can use
large pages, yay!. If it subsequently breaks apart those large pages by
unmapping 4KiB in the middle, let the IOMMU driver break that apart.

And why are we not allowed to unmap more than a single page at a time?

Doing unmaps in small pieces (with a full flush between every one) is
*horribly* slow.

> So the interface is basically, that you give an iova and a size and the
> iommu driver unmaps _at_least_ a region of that size. But if you ask for
> a 4k region which is mapped by a 2M page then the whole 2M are unmapped.
> The return value tells you how much was actually unmapped.
> 
> Not the best interface, I know. We should come up with a better way to
> handle this.

Seriously, just the address and the size. Let the IOMMU code deal with
whether it can *actually* use large pages on the particular set of
IOMMUs that are in use for the devices you've added to the domain so
far.

-- 
dwmw2

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]         ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
  2013-09-25 16:05           ` David Woodhouse
@ 2013-09-25 16:33           ` Alex Williamson
  1 sibling, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2013-09-25 16:33 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, David Woodhouse, Spiller, John

On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote:
> On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote:
> > I *hate* the bizarre calling convention for iommu_unmap(). Is it
> > actually clearly documented anywhere? Why on earth is it not just
> > returning void, and expected to unmap what it was *asked* to unmap?
> 
> Yeah, I agree that this should be documented since it is quite
> non-standard/non-obvious behaviour of a function. The reason the
> interface was implemented this way is that the caller should not need to
> know (or keep track of) the page-size which was used to map a given iova.
> 
> So the interface is basically, that you give an iova and a size and the
> iommu driver unmaps _at_least_ a region of that size. But if you ask for
> a 4k region which is mapped by a 2M page then the whole 2M are unmapped.
> The return value tells you how much was actually unmapped.
> 
> Not the best interface, I know. We should come up with a better way to
> handle this.

Actually, unmap can return zero too, which happens if you try to unmap
an offset into a super page mapping.  The current interface has issues,
but both legacy kvm assignment and vfio rely on it so we don't need to
track individual mappings.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]             ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
@ 2013-09-25 16:58               ` Joerg Roedel
  2013-09-25 17:36               ` Alex Williamson
  1 sibling, 0 replies; 17+ messages in thread
From: Joerg Roedel @ 2013-09-25 16:58 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John

On Wed, Sep 25, 2013 at 05:05:13PM +0100, David Woodhouse wrote:
> On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote:
> > Yeah, I agree that this should be documented since it is quite
> > non-standard/non-obvious behaviour of a function. The reason the
> > interface was implemented this way is that the caller should not need to
> > know (or keep track of) the page-size which was used to map a given iova.
> 
> Why would it ever care? If it *happens* to map something that can use
> large pages, yay!. If it subsequently breaks apart those large pages by
> unmapping 4KiB in the middle, let the IOMMU driver break that apart.

All that was implemented back in the old days where KVM was the only
user of that code. And back then it didn't make sense for the IOMMU
driver to split huge-pages because in the next step KVM would unmap all
the small pages anyway. So I took the other route and just told KVM how
much was unmapped so that the unmap code could skip over it.

> And why are we not allowed to unmap more than a single page at a time?

The user of iommu_unmap is actually allowed to do that. But the
page-size splitting it done in the iommu_unmap() function already to
have that code common between different IOMMU drivers. In the end the
IOMMU driver only exports the page-size bitmap and is fine.

> Doing unmaps in small pieces (with a full flush between every one) is
> *horribly* slow.

I proposed an iommu_commit() function to move all the necessary flushes
there after a couple of page-table changes. And this also allows to keep
the generic page-size splitting code generic.

> Seriously, just the address and the size. Let the IOMMU code deal with
> whether it can *actually* use large pages on the particular set of
> IOMMUs that are in use for the devices you've added to the domain so
> far.

Actually we can get rid of the unmapped_size thing by putting the same
constraints the DMA-API has in place. The user of the API has to unmap a
region with the same iova,size parameters used for mapping it. Then we
can guarantee that only the requested range will be unmapped.


	Joerg

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]             ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
  2013-09-25 16:58               ` Joerg Roedel
@ 2013-09-25 17:36               ` Alex Williamson
       [not found]                 ` <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
  1 sibling, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2013-09-25 17:36 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John

On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote:
> On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote:
> > On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote:
> > > I *hate* the bizarre calling convention for iommu_unmap(). Is it
> > > actually clearly documented anywhere? Why on earth is it not just
> > > returning void, and expected to unmap what it was *asked* to unmap?
> > 
> > Yeah, I agree that this should be documented since it is quite
> > non-standard/non-obvious behaviour of a function. The reason the
> > interface was implemented this way is that the caller should not need to
> > know (or keep track of) the page-size which was used to map a given iova.
> 
> Why would it ever care? If it *happens* to map something that can use
> large pages, yay!. If it subsequently breaks apart those large pages by
> unmapping 4KiB in the middle, let the IOMMU driver break that apart.

Can this be done atomically?  I thought part of the reason for this
interface was that iommu drivers typically couldn't replace a huge page
with multiple smaller pages in the presence of DMA.

> And why are we not allowed to unmap more than a single page at a time?
> 
> Doing unmaps in small pieces (with a full flush between every one) is
> *horribly* slow.

While we call iommu_unmap with PAGE_SIZE, we certainly expect that
larger chunks will be used if they were mapped that way.

> > So the interface is basically, that you give an iova and a size and the
> > iommu driver unmaps _at_least_ a region of that size. But if you ask for
> > a 4k region which is mapped by a 2M page then the whole 2M are unmapped.
> > The return value tells you how much was actually unmapped.
> > 
> > Not the best interface, I know. We should come up with a better way to
> > handle this.
> 
> Seriously, just the address and the size. Let the IOMMU code deal with
> whether it can *actually* use large pages on the particular set of
> IOMMUs that are in use for the devices you've added to the domain so
> far.

You're contradicting yourself here.  Just let the iommu deal with it,
but now you're raising concerns that Intel may be mixing super page
IOMMU hardware with non-super page IOMMU hardware on the same box, so I
do need to be concerned, yet there's no interface to discover super page
support.  What happens if my IOMMU domain makes use of super pages and
then I add a new device behind a new IOMMU without hardware super page
support?  We have the same problem with snoop control already.  There's
no atomic re-mapping interface, so KVM just pretends it can unmap and
remap without anyone noticing.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]                 ` <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
@ 2013-09-25 18:52                   ` David Woodhouse
       [not found]                     ` <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2013-09-25 18:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John


[-- Attachment #1.1: Type: text/plain, Size: 2450 bytes --]

On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote:
> On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote:
> > Why would it ever care? If it *happens* to map something that can use
> > large pages, yay!. If it subsequently breaks apart those large pages by
> > unmapping 4KiB in the middle, let the IOMMU driver break that apart.
> 
> Can this be done atomically?  I thought part of the reason for this
> interface was that iommu drivers typically couldn't replace a huge page
> with multiple smaller pages in the presence of DMA.

For the Intel IOMMU it can. You can atomically change from a large page
entry, to a pointer to a full set of smaller page tables. Do the IOTLB
flush, and at no time is there an interruption in service.

Not sure if this is true for *all* IOMMU hardware; I'd be perfectly
happy to accept a variant of Jörg's proposal that we should only ever
unmap exactly the same range that we mapped. Except we should allow the
unmapping of adjacent regions together; just not a partial unmap of
something that was mapped in one go.


> > Seriously, just the address and the size. Let the IOMMU code deal with
> > whether it can *actually* use large pages on the particular set of
> > IOMMUs that are in use for the devices you've added to the domain so
> > far.
> 
> You're contradicting yourself here.  Just let the iommu deal with it,
> but now you're raising concerns that Intel may be mixing super page
> IOMMU hardware with non-super page IOMMU hardware on the same box, so I
> do need to be concerned, yet there's no interface to discover super page
> support. 

No contradiction. Do Not Concern Yourself. Just tell the IOMMU what you
want mapped, and where. Let *it* worry about whether it can use
superpages or not. Like it already *does*.

>  What happens if my IOMMU domain makes use of super pages and
> then I add a new device behind a new IOMMU without hardware super page
> support? 

Currently, you end up with the domain happily including superpages, and
the less capable IOMMU that you added later won't cope. What we probably
*ought* to do is walk the page tables and convert any pre-existing
superpages to small pages, at the time we add the non-SP-capable IOMMU.

FWIW we currently screw up the handling of cache-coherent vs.
non-coherent page tables too. That one wants a wbinvd somewhere when we
add a non-coherent IOMMU to the domain.

-- 
dwmw2

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]                     ` <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
@ 2013-09-25 19:44                       ` Alex Williamson
       [not found]                         ` <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2013-09-25 19:44 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John

On Wed, 2013-09-25 at 19:52 +0100, David Woodhouse wrote:
> On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote:
> > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote:
> > > Why would it ever care? If it *happens* to map something that can use
> > > large pages, yay!. If it subsequently breaks apart those large pages by
> > > unmapping 4KiB in the middle, let the IOMMU driver break that apart.
> > 
> > Can this be done atomically?  I thought part of the reason for this
> > interface was that iommu drivers typically couldn't replace a huge page
> > with multiple smaller pages in the presence of DMA.
> 
> For the Intel IOMMU it can. You can atomically change from a large page
> entry, to a pointer to a full set of smaller page tables. Do the IOTLB
> flush, and at no time is there an interruption in service.

Cool

> Not sure if this is true for *all* IOMMU hardware; I'd be perfectly
> happy to accept a variant of Jörg's proposal that we should only ever
> unmap exactly the same range that we mapped. Except we should allow the
> unmapping of adjacent regions together; just not a partial unmap of
> something that was mapped in one go.

Well, except if we've just trusted the IOMMU driver to add a device
behind a non-SP capable IOMMU to our domain and convert the page tables,
that partial unmap is no longer partial and now we get different
behavior than before so we can't depend on that adjacent unmapping.

> > > Seriously, just the address and the size. Let the IOMMU code deal with
> > > whether it can *actually* use large pages on the particular set of
> > > IOMMUs that are in use for the devices you've added to the domain so
> > > far.
> > 
> > You're contradicting yourself here.  Just let the iommu deal with it,
> > but now you're raising concerns that Intel may be mixing super page
> > IOMMU hardware with non-super page IOMMU hardware on the same box, so I
> > do need to be concerned, yet there's no interface to discover super page
> > support. 
> 
> No contradiction. Do Not Concern Yourself. Just tell the IOMMU what you
> want mapped, and where. Let *it* worry about whether it can use
> superpages or not. Like it already *does*.

I'd love to be able to do that, but...

> >  What happens if my IOMMU domain makes use of super pages and
> > then I add a new device behind a new IOMMU without hardware super page
> > support? 
> 
> Currently, you end up with the domain happily including superpages, and
> the less capable IOMMU that you added later won't cope.

This is the trouble with trusting the iommu driver. ;)

>  What we probably
> *ought* to do is walk the page tables and convert any pre-existing
> superpages to small pages, at the time we add the non-SP-capable IOMMU.

And then we need to figure out how to handle that in the proposed
interface changes above since it changes the unmap behavior to the naive
user.  There's also the question of whether the IOMMU driver should
re-evaluate super pages when the less capable IOMMU is removed from the
domain.

> FWIW we currently screw up the handling of cache-coherent vs.
> non-coherent page tables too. That one wants a wbinvd somewhere when we
> add a non-coherent IOMMU to the domain.

You're not selling the "trust the IOMMU driver" story very well here.
Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately
by non-coherent IOMMUs?  Is there any downside to ignoring it and always
setting SNP in the IOMMU page tables?  AMD IOMMU ignores it, but it's
also always cache coherent.  Thanks,

Alex

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]                         ` <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
@ 2013-09-25 20:11                           ` David Woodhouse
       [not found]                             ` <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2013-09-25 20:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John


[-- Attachment #1.1: Type: text/plain, Size: 4178 bytes --]

On Wed, 2013-09-25 at 13:44 -0600, Alex Williamson wrote:
> On Wed, 2013-09-25 at 19:52 +0100, David Woodhouse wrote:
> > On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote:
> > > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote:
> > > > Why would it ever care? If it *happens* to map something that can use
> > > > large pages, yay!. If it subsequently breaks apart those large pages by
> > > > unmapping 4KiB in the middle, let the IOMMU driver break that apart.
> > > 
> > > Can this be done atomically?  I thought part of the reason for this
> > > interface was that iommu drivers typically couldn't replace a huge page
> > > with multiple smaller pages in the presence of DMA.
> > 
> > For the Intel IOMMU it can. You can atomically change from a large page
> > entry, to a pointer to a full set of smaller page tables. Do the IOTLB
> > flush, and at no time is there an interruption in service.
> 
> Cool
> 
> > Not sure if this is true for *all* IOMMU hardware; I'd be perfectly
> > happy to accept a variant of Jörg's proposal that we should only ever
> > unmap exactly the same range that we mapped. Except we should allow the
> > unmapping of adjacent regions together; just not a partial unmap of
> > something that was mapped in one go.
> 
> Well, except if we've just trusted the IOMMU driver to add a device
> behind a non-SP capable IOMMU to our domain and convert the page tables,
> that partial unmap is no longer partial and now we get different
> behavior than before so we can't depend on that adjacent unmapping.

Que?

Jörg's proposal was that if you add a mapping at a given address+size,
you should always remove *exactly* that address+size. Which will always
work exactly the same, regardless of superpages.

My slight change to that was that if you also added an *adjacent*
mapping at address2+size2, you should be able to unmap both at the same
time. Which will *also* always work the same regardless of superpages.

Even if your two mappings were also *physically* contiguous, and *could*
have used superpages, they probably won't anyway because you mapped them
in two parts.

> > >  What happens if my IOMMU domain makes use of super pages and
> > > then I add a new device behind a new IOMMU without hardware super page
> > > support? 
> > 
> > Currently, you end up with the domain happily including superpages, and
> > the less capable IOMMU that you added later won't cope.
> 
> This is the trouble with trusting the iommu driver. ;)

Sorry, I should have made it clearer that this is a *bug*. It's not by
design. The IOMMU driver ought to get this right, and will do.

> >  What we probably
> > *ought* to do is walk the page tables and convert any pre-existing
> > superpages to small pages, at the time we add the non-SP-capable IOMMU.
> 
> And then we need to figure out how to handle that in the proposed
> interface changes above since it changes the unmap behavior to the naive
> user. 

Isn't that what you'd *expect*? Surely you don't *expect* the breakage
you currently get?

>  There's also the question of whether the IOMMU driver should
> re-evaluate super pages when the less capable IOMMU is removed from the
> domain.

I wouldn't bother to go looking for opportunities to use super pages if
we remove the last non-SP-capable IOMMU from the domain.

> > FWIW we currently screw up the handling of cache-coherent vs.
> > non-coherent page tables too. That one wants a wbinvd somewhere when we
> > add a non-coherent IOMMU to the domain.
> 
> You're not selling the "trust the IOMMU driver" story very well here.
> Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately
> by non-coherent IOMMUs?  Is there any downside to ignoring it and always
> setting SNP in the IOMMU page tables?  AMD IOMMU ignores it, but it's
> also always cache coherent.  Thanks,

SNP is a separate issue. I'm speaking of cache coherency of the hardware
page table walk — the feature bit that all the horrid clflush calls are
predicated on.

Again, this is just a bug. We *should* be getting this right, but don't
yet.

-- 
dwmw2

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]                             ` <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
@ 2013-09-25 21:46                               ` Alex Williamson
       [not found]                                 ` <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2013-09-25 21:46 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John

On Wed, 2013-09-25 at 21:11 +0100, David Woodhouse wrote:
> On Wed, 2013-09-25 at 13:44 -0600, Alex Williamson wrote:
> > On Wed, 2013-09-25 at 19:52 +0100, David Woodhouse wrote:
> > > On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote:
> > > > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote:
> > > > > Why would it ever care? If it *happens* to map something that can use
> > > > > large pages, yay!. If it subsequently breaks apart those large pages by
> > > > > unmapping 4KiB in the middle, let the IOMMU driver break that apart.
> > > > 
> > > > Can this be done atomically?  I thought part of the reason for this
> > > > interface was that iommu drivers typically couldn't replace a huge page
> > > > with multiple smaller pages in the presence of DMA.
> > > 
> > > For the Intel IOMMU it can. You can atomically change from a large page
> > > entry, to a pointer to a full set of smaller page tables. Do the IOTLB
> > > flush, and at no time is there an interruption in service.
> > 
> > Cool
> > 
> > > Not sure if this is true for *all* IOMMU hardware; I'd be perfectly
> > > happy to accept a variant of Jörg's proposal that we should only ever
> > > unmap exactly the same range that we mapped. Except we should allow the
> > > unmapping of adjacent regions together; just not a partial unmap of
> > > something that was mapped in one go.
> > 
> > Well, except if we've just trusted the IOMMU driver to add a device
> > behind a non-SP capable IOMMU to our domain and convert the page tables,
> > that partial unmap is no longer partial and now we get different
> > behavior than before so we can't depend on that adjacent unmapping.
> 
> Que?
> 
> Jörg's proposal was that if you add a mapping at a given address+size,
> you should always remove *exactly* that address+size. Which will always
> work exactly the same, regardless of superpages.
> 
> My slight change to that was that if you also added an *adjacent*
> mapping at address2+size2, you should be able to unmap both at the same
> time. Which will *also* always work the same regardless of superpages.
> 
> Even if your two mappings were also *physically* contiguous, and *could*
> have used superpages, they probably won't anyway because you mapped them
> in two parts.

Ok, sounds reasonable.  I somehow read it to still include some aspect
of the "fill in the size" API we have now.

> > > >  What happens if my IOMMU domain makes use of super pages and
> > > > then I add a new device behind a new IOMMU without hardware super page
> > > > support? 
> > > 
> > > Currently, you end up with the domain happily including superpages, and
> > > the less capable IOMMU that you added later won't cope.
> > 
> > This is the trouble with trusting the iommu driver. ;)
> 
> Sorry, I should have made it clearer that this is a *bug*. It's not by
> design. The IOMMU driver ought to get this right, and will do.
> 
> > >  What we probably
> > > *ought* to do is walk the page tables and convert any pre-existing
> > > superpages to small pages, at the time we add the non-SP-capable IOMMU.
> > 
> > And then we need to figure out how to handle that in the proposed
> > interface changes above since it changes the unmap behavior to the naive
> > user. 
> 
> Isn't that what you'd *expect*? Surely you don't *expect* the breakage
> you currently get?

For vfio I currently make the same statement that you're pushing for the
IOMMU API; I only guarantee unmapping at the same granularity as the
original mapping.  So that seems fine to me.

> >  There's also the question of whether the IOMMU driver should
> > re-evaluate super pages when the less capable IOMMU is removed from the
> > domain.
> 
> I wouldn't bother to go looking for opportunities to use super pages if
> we remove the last non-SP-capable IOMMU from the domain.

I predict bugs getting filed if a guest sees a performance hit after
adding a device that is not restored when the device is removed.  If
only we could assume a similar feature set among IOMMUs in a system.

> > > FWIW we currently screw up the handling of cache-coherent vs.
> > > non-coherent page tables too. That one wants a wbinvd somewhere when we
> > > add a non-coherent IOMMU to the domain.
> > 
> > You're not selling the "trust the IOMMU driver" story very well here.
> > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately
> > by non-coherent IOMMUs?  Is there any downside to ignoring it and always
> > setting SNP in the IOMMU page tables?  AMD IOMMU ignores it, but it's
> > also always cache coherent.  Thanks,
> 
> SNP is a separate issue. I'm speaking of cache coherency of the hardware
> page table walk — the feature bit that all the horrid clflush calls are
> predicated on.
> 
> Again, this is just a bug. We *should* be getting this right, but don't
> yet.

And for DMA_PTE_SNP?  intel_iommu_map() won't let us set this bit if the
domain contains a hardware unit that doesn't support ecap.SC, but it
also doesn't update existing mappings.  Barring hardware bugs, it seems
much easier to unconditionally set DMA_PTE_SNP but still advertise
IOMMU_CAP_CACHE_COHERENCY based on the composition of the domain.
Otherwise we have to reject adding devices to a domain that change the
coherency once DMA mappings are in play or never set IOMMU_CACHE and
advertise to KVM that the domain is always non-coherent.  Thanks,

Alex




_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]                                 ` <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
@ 2013-09-25 22:15                                   ` David Woodhouse
       [not found]                                     ` <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2013-09-25 22:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John


[-- Attachment #1.1: Type: text/plain, Size: 2342 bytes --]

On Wed, 2013-09-25 at 15:46 -0600, Alex Williamson wrote:
> On Wed, 2013-09-25 at 21:11 +0100, David Woodhouse wrote:
> > I wouldn't bother to go looking for opportunities to use super pages if
> > we remove the last non-SP-capable IOMMU from the domain.
> 
> I predict bugs getting filed if a guest sees a performance hit after
> adding a device that is not restored when the device is removed.  If
> only we could assume a similar feature set among IOMMUs in a system.

Ok, fine. As long as you don't have *too* much mapped, that shouldn't
suck too much.

> > > > FWIW we currently screw up the handling of cache-coherent vs.
> > > > non-coherent page tables too. That one wants a wbinvd somewhere when we
> > > > add a non-coherent IOMMU to the domain.
> > > 
> > > You're not selling the "trust the IOMMU driver" story very well here.
> > > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately
> > > by non-coherent IOMMUs?  Is there any downside to ignoring it and always
> > > setting SNP in the IOMMU page tables?  AMD IOMMU ignores it, but it's
> > > also always cache coherent.  Thanks,
> > 
> > SNP is a separate issue. I'm speaking of cache coherency of the hardware
> > page table walk — the feature bit that all the horrid clflush calls are
> > predicated on.
> > 
> > Again, this is just a bug. We *should* be getting this right, but don't
> > yet.
> 
> And for DMA_PTE_SNP?  intel_iommu_map() won't let us set this bit if the
> domain contains a hardware unit that doesn't support ecap.SC, but it
> also doesn't update existing mappings.  Barring hardware bugs, it seems
> much easier to unconditionally set DMA_PTE_SNP but still advertise
> IOMMU_CAP_CACHE_COHERENCY based on the composition of the domain.
> Otherwise we have to reject adding devices to a domain that change the
> coherency once DMA mappings are in play or never set IOMMU_CACHE and
> advertise to KVM that the domain is always non-coherent.  Thanks,

It's not that the domain is non-coherent, is it? The SNP bit allows you
to *force* a PCIe DMA request to snoop the CPU cache, even if it didn't
*request* that.

If your device is actually *trying* to do cache-coherent (i.e. snooping)
DMA, surely that works regardless of the DMA_PTE_SNP setting?

Or am I missing something?


-- 
dwmw2

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]                                     ` <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
@ 2013-09-25 22:40                                       ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2013-09-25 22:40 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Shankar, Hari, Spiller, John

On Wed, 2013-09-25 at 23:15 +0100, David Woodhouse wrote:
> On Wed, 2013-09-25 at 15:46 -0600, Alex Williamson wrote:
> > On Wed, 2013-09-25 at 21:11 +0100, David Woodhouse wrote:
> > > I wouldn't bother to go looking for opportunities to use super pages if
> > > we remove the last non-SP-capable IOMMU from the domain.
> > 
> > I predict bugs getting filed if a guest sees a performance hit after
> > adding a device that is not restored when the device is removed.  If
> > only we could assume a similar feature set among IOMMUs in a system.
> 
> Ok, fine. As long as you don't have *too* much mapped, that shouldn't
> suck too much.

Of course if we knew this was going to happen in advance of attaching
the new device, I wonder if we might opt to manage it with a separate
IOMMU domain.

> > > > > FWIW we currently screw up the handling of cache-coherent vs.
> > > > > non-coherent page tables too. That one wants a wbinvd somewhere when we
> > > > > add a non-coherent IOMMU to the domain.
> > > > 
> > > > You're not selling the "trust the IOMMU driver" story very well here.
> > > > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately
> > > > by non-coherent IOMMUs?  Is there any downside to ignoring it and always
> > > > setting SNP in the IOMMU page tables?  AMD IOMMU ignores it, but it's
> > > > also always cache coherent.  Thanks,
> > > 
> > > SNP is a separate issue. I'm speaking of cache coherency of the hardware
> > > page table walk — the feature bit that all the horrid clflush calls are
> > > predicated on.
> > > 
> > > Again, this is just a bug. We *should* be getting this right, but don't
> > > yet.
> > 
> > And for DMA_PTE_SNP?  intel_iommu_map() won't let us set this bit if the
> > domain contains a hardware unit that doesn't support ecap.SC, but it
> > also doesn't update existing mappings.  Barring hardware bugs, it seems
> > much easier to unconditionally set DMA_PTE_SNP but still advertise
> > IOMMU_CAP_CACHE_COHERENCY based on the composition of the domain.
> > Otherwise we have to reject adding devices to a domain that change the
> > coherency once DMA mappings are in play or never set IOMMU_CACHE and
> > advertise to KVM that the domain is always non-coherent.  Thanks,
> 
> It's not that the domain is non-coherent, is it? The SNP bit allows you
> to *force* a PCIe DMA request to snoop the CPU cache, even if it didn't
> *request* that.
> 
> If your device is actually *trying* to do cache-coherent (i.e. snooping)
> DMA, surely that works regardless of the DMA_PTE_SNP setting?
> 
> Or am I missing something?

It's the opposite, devices are trying to do No-Snoop transactions and
KVM would really rather prefer wbinvd was a no-op since emulated devices
don't need it.  We therefore need to toggle KVM's behavior when we have
an assigned device attached to an IOMMU domain where the hardware isn't
stripping the No-Snoop attribute.  Some graphics cards seem to make use
of this and will get error codes loading their driver if it doesn't
work.  Thanks,

Alex

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size
       [not found]     ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
  2013-09-25 15:54       ` Joerg Roedel
@ 2013-10-02 15:04       ` David Woodhouse
  1 sibling, 0 replies; 17+ messages in thread
From: David Woodhouse @ 2013-10-02 15:04 UTC (permalink / raw)
  To: Shankar, Hari
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
	Singh, Varinder, Sundaram, Rajesh, Spiller, John, Kimmel, Jeff


[-- Attachment #1.1: Type: text/plain, Size: 13836 bytes --]

On Sat, 2013-09-21 at 21:59 -0500, David Woodhouse wrote:
> Here's a completely untested work-in-progress that attempts to fix that.
> I'll be able to test it myself on about Tuesday when I'm home from New
> Orleans and awake...

Or might have been if my laptop's hard drive hadn't died. Here's an
updated version which extends the use of domain_unmap() to other places
it needs to happen. Similarly untested by anything but the compiler.

I'll be able to test it myself on about Monday when I'm home from
Portland and awake...

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 15e9b57..cd6e568 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -413,6 +413,7 @@ struct deferred_flush_tables {
 	int next;
 	struct iova *iova[HIGH_WATER_MARK];
 	struct dmar_domain *domain[HIGH_WATER_MARK];
+	struct page *freelist[HIGH_WATER_MARK];
 };
 
 static struct deferred_flush_tables *deferred_flush;
@@ -945,6 +946,123 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain,
 	}
 }
 
+/* When a page at a given level is being unlinked from its parent, we don't
+   need to *modify* it at all. All we need to do is make a list of all the
+   pages which can be freed just as soon as we've flushed the IOTLB and we
+   know the hardware page-walk will no longer touch them.
+   The 'pte' argument is the *parent* PTE, pointing to the page that is to
+   be freed. */
+static struct page *dma_pte_list_pagetables(struct dmar_domain *domain, int level,
+					    struct dma_pte *pte, struct page *freelist)
+{
+	struct page *pg;
+
+	pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT);
+	pg->freelist = freelist;
+	freelist = pg;
+
+	if (level > 1) {
+		pte = page_address(pg);
+
+		do {
+			if (dma_pte_present(pte) && !dma_pte_superpage(pte))
+				freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+
+		} while (!first_pte_in_page(++pte));
+	}
+
+	return freelist;
+}
+
+static struct page *dma_pte_clear_level(struct dmar_domain *domain, int level,
+					struct dma_pte *pte, unsigned long pfn,
+					unsigned long start_pfn, unsigned long last_pfn,
+					struct page *freelist)
+{
+	struct dma_pte *first_pte = NULL, *last_pte = NULL;
+
+	pfn = max(start_pfn, pfn);
+	pte = &pte[pfn_level_offset(pfn, level)];
+
+	do {
+		unsigned long level_pfn;
+		struct dma_pte *level_pte;
+
+		if (!dma_pte_present(pte))
+			goto next;
+
+		level_pfn = pfn & level_mask(level - 1);
+		level_pte = phys_to_virt(dma_pte_addr(pte));
+
+		/* If range covers entire pagetable, free it */
+		if (start_pfn <= level_pfn &&
+		    last_pfn >= level_pfn + level_size(level)) {
+
+			/* These suborbinate page tables are going away entirely. Don't
+			   bother to clear them; we're just going to *free* them. */
+			if (level > 1 && !dma_pte_superpage(pte))
+				freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist);
+
+			dma_clear_pte(pte);
+			if (!first_pte)
+				first_pte = pte;
+			last_pte = pte;
+		} else {
+			/* Recurse down into a level that isn't *entirely* obsolete */
+			freelist = dma_pte_clear_level(domain, level - 1, level_pte,
+						       level_pfn, start_pfn, last_pfn, freelist);
+		}
+next:
+		pfn += level_size(level);
+	} while (!first_pte_in_page(++pte) && pfn <= last_pfn);
+
+	if (first_pte)
+		domain_flush_cache(domain, first_pte,
+				   (void *)++last_pte - (void *)first_pte);
+
+	return freelist;
+}
+
+/* We can't just free the pages because the IOMMU may still be walking
+   the page tables, and may have cached the intermediate levels. The
+   pages can only be freed after the IOTLB flush has been done. */
+struct page *domain_unmap(struct dmar_domain *domain,
+			  unsigned long start_pfn,
+			  unsigned long last_pfn)
+{
+	int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT;
+	struct page *freelist = NULL;
+
+	BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width);
+	BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width);
+	BUG_ON(start_pfn > last_pfn);
+
+	/* we don't need lock here; nobody else touches the iova range */
+	freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw),
+				       domain->pgd, 0, start_pfn, last_pfn, NULL);
+
+	/* free pgd */
+	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
+		struct page *pgd_page = virt_to_page(domain->pgd);
+		pgd_page->freelist = freelist;
+		freelist = pgd_page;
+
+		domain->pgd = NULL;
+	}
+
+	return freelist;
+}
+
+void dma_free_pagelist(struct page *freelist)
+{
+	struct page *pg;
+
+	while ((pg = freelist)) {
+		freelist = pg->freelist;
+		free_pgtable_page(pg);
+	}
+}
+
 /* iommu handling */
 static int iommu_alloc_root_entry(struct intel_iommu *iommu)
 {
@@ -1054,7 +1172,7 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
 		break;
 	case DMA_TLB_PSI_FLUSH:
 		val = DMA_TLB_PSI_FLUSH|DMA_TLB_IVT|DMA_TLB_DID(did);
-		/* Note: always flush non-leaf currently */
+		/* IH bit is passed in as part of address */
 		val_iva = size_order | addr;
 		break;
 	default:
@@ -1165,13 +1283,15 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain,
 }
 
 static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
-				  unsigned long pfn, unsigned int pages, int map)
+				  unsigned long pfn, unsigned int pages, int ih, int map)
 {
 	unsigned int mask = ilog2(__roundup_pow_of_two(pages));
 	uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
 
 	BUG_ON(pages == 0);
 
+	if (ih)
+		ih = 1 << 6;
 	/*
 	 * Fallback to domain selective flush if no PSI support or the size is
 	 * too big.
@@ -1182,7 +1302,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
 		iommu->flush.flush_iotlb(iommu, did, 0, 0,
 						DMA_TLB_DSI_FLUSH);
 	else
-		iommu->flush.flush_iotlb(iommu, did, addr, mask,
+		iommu->flush.flush_iotlb(iommu, did, addr | ih, mask,
 						DMA_TLB_PSI_FLUSH);
 
 	/*
@@ -1517,6 +1637,7 @@ static void domain_exit(struct dmar_domain *domain)
 {
 	struct dmar_drhd_unit *drhd;
 	struct intel_iommu *iommu;
+	struct page *freelist = NULL;
 
 	/* Domain 0 is reserved, so dont process it */
 	if (!domain)
@@ -1530,16 +1651,14 @@ static void domain_exit(struct dmar_domain *domain)
 	/* destroy iovas */
 	put_iova_domain(&domain->iovad);
 
-	/* clear ptes */
-	dma_pte_clear_range(domain, 0, DOMAIN_MAX_PFN(domain->gaw));
-
-	/* free page tables */
-	dma_pte_free_pagetable(domain, 0, DOMAIN_MAX_PFN(domain->gaw));
+	freelist = domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw));
 
 	for_each_active_iommu(iommu, drhd)
 		if (test_bit(iommu->seq_id, domain->iommu_bmp))
 			iommu_detach_domain(domain, iommu);
 
+	dma_free_pagelist(freelist);
+
 	free_domain_mem(domain);
 }
 
@@ -2850,7 +2969,7 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr,
 
 	/* it's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
-		iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 1);
+		iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 0, 1);
 	else
 		iommu_flush_write_buffer(iommu);
 
@@ -2902,13 +3021,16 @@ static void flush_unmaps(void)
 			/* On real hardware multiple invalidations are expensive */
 			if (cap_caching_mode(iommu->cap))
 				iommu_flush_iotlb_psi(iommu, domain->id,
-				iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1, 0);
+					iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1,
+					!deferred_flush[i].freelist[j], 0);
 			else {
 				mask = ilog2(mm_to_dma_pfn(iova->pfn_hi - iova->pfn_lo + 1));
 				iommu_flush_dev_iotlb(deferred_flush[i].domain[j],
 						(uint64_t)iova->pfn_lo << PAGE_SHIFT, mask);
 			}
 			__free_iova(&deferred_flush[i].domain[j]->iovad, iova);
+			if (deferred_flush[i].freelist[j])
+				dma_free_pagelist(deferred_flush[i].freelist[j]);
 		}
 		deferred_flush[i].next = 0;
 	}
@@ -2925,7 +3047,7 @@ static void flush_unmaps_timeout(unsigned long data)
 	spin_unlock_irqrestore(&async_umap_flush_lock, flags);
 }
 
-static void add_unmap(struct dmar_domain *dom, struct iova *iova)
+static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist)
 {
 	unsigned long flags;
 	int next, iommu_id;
@@ -2941,6 +3063,7 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova)
 	next = deferred_flush[iommu_id].next;
 	deferred_flush[iommu_id].domain[next] = dom;
 	deferred_flush[iommu_id].iova[next] = iova;
+	deferred_flush[iommu_id].freelist[next] = freelist;
 	deferred_flush[iommu_id].next++;
 
 	if (!timer_on) {
@@ -2960,6 +3083,7 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr,
 	unsigned long start_pfn, last_pfn;
 	struct iova *iova;
 	struct intel_iommu *iommu;
+	struct page *freelist;
 
 	if (iommu_no_mapping(dev))
 		return;
@@ -2980,19 +3104,16 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr,
 	pr_debug("Device %s unmapping: pfn %lx-%lx\n",
 		 pci_name(pdev), start_pfn, last_pfn);
 
-	/*  clear the whole page */
-	dma_pte_clear_range(domain, start_pfn, last_pfn);
-
-	/* free page tables */
-	dma_pte_free_pagetable(domain, start_pfn, last_pfn);
+	freelist = domain_unmap(domain, start_pfn, last_pfn);
 
 	if (intel_iommu_strict) {
 		iommu_flush_iotlb_psi(iommu, domain->id, start_pfn,
-				      last_pfn - start_pfn + 1, 0);
+				      last_pfn - start_pfn + 1, !freelist, 0);
 		/* free iova */
 		__free_iova(&domain->iovad, iova);
+		dma_free_pagelist(freelist);
 	} else {
-		add_unmap(domain, iova);
+		add_unmap(domain, iova, freelist);
 		/*
 		 * queue up the release of the unmap to save the 1/6th of the
 		 * cpu used up by the iotlb flush operation...
@@ -3054,6 +3175,7 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist,
 	unsigned long start_pfn, last_pfn;
 	struct iova *iova;
 	struct intel_iommu *iommu;
+	struct page *freelist;
 
 	if (iommu_no_mapping(hwdev))
 		return;
@@ -3071,19 +3193,16 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist,
 	start_pfn = mm_to_dma_pfn(iova->pfn_lo);
 	last_pfn = mm_to_dma_pfn(iova->pfn_hi + 1) - 1;
 
-	/*  clear the whole page */
-	dma_pte_clear_range(domain, start_pfn, last_pfn);
-
-	/* free page tables */
-	dma_pte_free_pagetable(domain, start_pfn, last_pfn);
+	freelist = domain_unmap(domain, start_pfn, last_pfn);
 
 	if (intel_iommu_strict) {
 		iommu_flush_iotlb_psi(iommu, domain->id, start_pfn,
-				      last_pfn - start_pfn + 1, 0);
+				      last_pfn - start_pfn + 1, !freelist, 0);
 		/* free iova */
 		__free_iova(&domain->iovad, iova);
+		dma_free_pagelist(freelist);
 	} else {
-		add_unmap(domain, iova);
+		add_unmap(domain, iova, freelist);
 		/*
 		 * queue up the release of the unmap to save the 1/6th of the
 		 * cpu used up by the iotlb flush operation...
@@ -3166,7 +3285,7 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int ne
 
 	/* it's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
-		iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 1);
+		iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 0, 1);
 	else
 		iommu_flush_write_buffer(iommu);
 
@@ -3952,6 +4071,8 @@ static void iommu_free_vm_domain(struct dmar_domain *domain)
 
 static void vm_domain_exit(struct dmar_domain *domain)
 {
+	struct page *freelist;
+
 	/* Domain 0 is reserved, so dont process it */
 	if (!domain)
 		return;
@@ -3960,13 +4081,12 @@ static void vm_domain_exit(struct dmar_domain *domain)
 	/* destroy iovas */
 	put_iova_domain(&domain->iovad);
 
-	/* clear ptes */
-	dma_pte_clear_range(domain, 0, DOMAIN_MAX_PFN(domain->gaw));
-
-	/* free page tables */
-	dma_pte_free_pagetable(domain, 0, DOMAIN_MAX_PFN(domain->gaw));
+	freelist = domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw));
 
 	iommu_free_vm_domain(domain);
+
+	dma_free_pagelist(freelist);
+
 	free_domain_mem(domain);
 }
 
@@ -4110,18 +4230,43 @@ static int intel_iommu_map(struct iommu_domain *domain,
 }
 
 static size_t intel_iommu_unmap(struct iommu_domain *domain,
-			     unsigned long iova, size_t size)
+				unsigned long iova, size_t size)
 {
 	struct dmar_domain *dmar_domain = domain->priv;
-	int order;
+	struct page *freelist = NULL;
+	struct intel_iommu *iommu;
+	unsigned long start_pfn, last_pfn;
+	unsigned int npages;
+	int iommu_id, num, ndomains;
+
+	start_pfn = iova >> VTD_PAGE_SHIFT;
+	last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
+
+	freelist = domain_unmap(dmar_domain, start_pfn, last_pfn);
+
+	npages = last_pfn - start_pfn + 1;
+
+	for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) {
+               iommu = g_iommus[iommu_id];
+
+               /*
+                * find bit position of dmar_domain
+                */
+               ndomains = cap_ndoms(iommu->cap);
+               for_each_set_bit(num, iommu->domain_ids, ndomains) {
+                       if (iommu->domains[num] == dmar_domain)
+                               iommu_flush_iotlb_psi(iommu, num, start_pfn,
+						     npages, !freelist, 0);
+	       }
+
+	}
 
-	order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
-			    (iova + size - 1) >> VTD_PAGE_SHIFT);
+	dma_free_pagelist(freelist);
 
 	if (dmar_domain->max_addr == iova + size)
 		dmar_domain->max_addr = iova;
 
-	return PAGE_SIZE << order;
+	return size;
 }
 
 static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,


-- 
dwmw2


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-10-02 15:04 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-02  2:24 [PATCH] On unmap, flush IOMMU TLB and return correct size Shankar, Hari
     [not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
2013-09-03  4:25   ` Alex Williamson
     [not found]     ` <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-05  4:05       ` Shankar, Hari
     [not found]         ` <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
2013-09-05  4:58           ` Alex Williamson
2013-09-22  2:59   ` David Woodhouse
     [not found]     ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
2013-09-25 15:54       ` Joerg Roedel
     [not found]         ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
2013-09-25 16:05           ` David Woodhouse
     [not found]             ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 16:58               ` Joerg Roedel
2013-09-25 17:36               ` Alex Williamson
     [not found]                 ` <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-25 18:52                   ` David Woodhouse
     [not found]                     ` <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 19:44                       ` Alex Williamson
     [not found]                         ` <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-25 20:11                           ` David Woodhouse
     [not found]                             ` <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 21:46                               ` Alex Williamson
     [not found]                                 ` <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-25 22:15                                   ` David Woodhouse
     [not found]                                     ` <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 22:40                                       ` Alex Williamson
2013-09-25 16:33           ` Alex Williamson
2013-10-02 15:04       ` David Woodhouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.