* [PATCH] On unmap, flush IOMMU TLB and return correct size
@ 2013-09-02 2:24 Shankar, Hari
[not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 17+ messages in thread
From: Shankar, Hari @ 2013-09-02 2:24 UTC (permalink / raw)
To: dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org
Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
Spiller, John
[-- Attachment #1.1: Type: text/plain, Size: 2121 bytes --]
Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU TLB flush for the particular domain and return correct unmap size when intel_iommu_unmap() is called
The patch is generated on Linux kernel version 3.6.11
--- linux/drivers/iommu/intel-iommu.c.orig 2013-09-01 10:10:14.723958000 -0700
+++ linux/drivers/iommu/intel-iommu.c 2013-09-01 18:22:58.497578000 -0700
@@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i
{
struct dmar_domain *dmar_domain = domain->priv;
int order;
+ struct intel_iommu *iommu;
+ unsigned long start_pfn, last_pfn;
+ unsigned int npages;
+ int iommu_id, num, ndomains;
- order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT,
- (iova + size - 1) >> VTD_PAGE_SHIFT);
+ start_pfn = iova >> VTD_PAGE_SHIFT;
+ last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
+ order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
+ last_pfn |= (1UL << order) - 1;
+ npages = last_pfn - start_pfn + 1;
+ for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) {
+ iommu = g_iommus[iommu_id];
- if (dmar_domain->max_addr == iova + size)
+ /*
+ * find bit position of dmar_domain
+ */
+ ndomains = cap_ndoms(iommu->cap);
+ for_each_set_bit(num, iommu->domain_ids, ndomains)
+ if (iommu->domains[num] == dmar_domain)
+ iommu_flush_iotlb_psi(iommu, num,
+ start_pfn, npages, 0);
+ }
+ if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT))
dmar_domain->max_addr = iova;
-
- return PAGE_SIZE << order;
+ return npages << VTD_PAGE_SHIFT;
}
static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
[-- Attachment #1.2: Type: text/html, Size: 4116 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread[parent not found: <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> @ 2013-09-03 4:25 ` Alex Williamson [not found] ` <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> 2013-09-22 2:59 ` David Woodhouse 1 sibling, 1 reply; 17+ messages in thread From: Alex Williamson @ 2013-09-03 4:25 UTC (permalink / raw) To: Shankar, Hari Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Spiller, John Hi Hari, On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote: > Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU TLB flush for the particular domain and return correct unmap size when intel_iommu_unmap() is called "Patch Description:" is unnecessary. Look at other commits that have touched this file to get an idea how to format the subject and commit log: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers/iommu/intel-iommu.c Sign-off should go here, not below the patch. Subject should include a subsystem, such as "iommu/intel:" or "intel-iommu:". Also a good idea to provide some justification why this is necessary for stable (unmaps from userspace through vfio results in coherency issue...) > The patch is generated on Linux kernel version 3.6.11 This should be noted, but not part of the commit message. See section 15 of SubmittingPatches and note the additional comments between the "---" marker and the beginning of the patch. > --- linux/drivers/iommu/intel-iommu.c.orig 2013-09-01 10:10:14.723958000 -0700 > +++ linux/drivers/iommu/intel-iommu.c 2013-09-01 18:22:58.497578000 -0700 > @@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i Tabs seem to have been clobbered in your patch, scripts/checkpatch.pl will note this for you. > { > struct dmar_domain *dmar_domain = domain->priv; > int order; > + struct intel_iommu *iommu; > + unsigned long start_pfn, last_pfn; > + unsigned int npages; > + int iommu_id, num, ndomains; > > - order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT, > - (iova + size - 1) >> VTD_PAGE_SHIFT); > + start_pfn = iova >> VTD_PAGE_SHIFT; > + last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT; > + order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn); > + last_pfn |= (1UL << order) - 1; > + npages = last_pfn - start_pfn + 1; Doesn't order tell us npages more directly? npages = 1UL << order? > + for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) { > + iommu = g_iommus[iommu_id]; > > - if (dmar_domain->max_addr == iova + size) > + /* > + * find bit position of dmar_domain > + */ > + ndomains = cap_ndoms(iommu->cap); > + for_each_set_bit(num, iommu->domain_ids, ndomains) > + if (iommu->domains[num] == dmar_domain) > + iommu_flush_iotlb_psi(iommu, num, > + start_pfn, npages, 0); This second bitmap search seems unnecessary, I think iommu_flush_iotlb_psi wants dmar_domain->id as the second parameter in all cases, not num. So this could be reduced to : for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain->id, start_pfn, npages, 0); > + } > + if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT)) > dmar_domain->max_addr = iova; > - > - return PAGE_SIZE << order; > + return npages << VTD_PAGE_SHIFT; > } > > static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain, > > Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Please configure your mailer to drop the html version. Thanks, Alex ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> @ 2013-09-05 4:05 ` Shankar, Hari [not found] ` <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: Shankar, Hari @ 2013-09-05 4:05 UTC (permalink / raw) To: Alex Williamson, Shankar, Hari Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Spiller, John Hi Alex, See inline, On 9/2/13 9:25 PM, "Alex Williamson" <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >Hi Hari, > >On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote: >> Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU >>TLB flush for the particular domain and return correct unmap size when >>intel_iommu_unmap() is called > >"Patch Description:" is unnecessary. Look at other commits that have >touched this file to get an idea how to format the subject and commit >log: > >http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers >/iommu/intel-iommu.c > >Sign-off should go here, not below the patch. Subject should include a >subsystem, such as "iommu/intel:" or "intel-iommu:". Also a good idea >to provide some justification why this is necessary for stable (unmaps >from userspace through vfio results in coherency issue...) > >> The patch is generated on Linux kernel version 3.6.11 > >This should be noted, but not part of the commit message. See section >15 of SubmittingPatches and note the additional comments between the >"---" marker and the beginning of the patch. > >> --- linux/drivers/iommu/intel-iommu.c.orig 2013-09-01 >>10:10:14.723958000 -0700 >> +++ linux/drivers/iommu/intel-iommu.c 2013-09-01 18:22:58.497578000 >>-0700 >> @@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i > >Tabs seem to have been clobbered in your patch, scripts/checkpatch.pl >will note this for you. > >> { >> struct dmar_domain *dmar_domain = domain->priv; >> int order; >> + struct intel_iommu *iommu; >> + unsigned long start_pfn, last_pfn; >> + unsigned int npages; >> + int iommu_id, num, ndomains; >> >> - order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT, >> - (iova + size - 1) >> VTD_PAGE_SHIFT); >> + start_pfn = iova >> VTD_PAGE_SHIFT; >> + last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT; >> + order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn); >> + last_pfn |= (1UL << order) - 1; >> + npages = last_pfn - start_pfn + 1; > >Doesn't order tell us npages more directly? npages = 1UL << order? [Hari] 1UL << order formula will not work for all cases. Jeff Kimmel (cc'ed) has a good explanation for that. Jeff's response below, <Jeff> If start_pfn is always aligned modulo (1 << order), then '1UL << order' would get the same npages value, but if start_pfn is not guaranteed to have such alignment, then '1UL << order' would produce an incorrect npages value. For example, start_pfn could fall in some range that was mapped with small pages, but the range could end in a region mapped with large pages. </Jeff> > >> + for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, >>g_num_of_iommus) { >> + iommu = g_iommus[iommu_id]; >> >> - if (dmar_domain->max_addr == iova + size) >> + /* >> + * find bit position of dmar_domain >> + */ >> + ndomains = cap_ndoms(iommu->cap); >> + for_each_set_bit(num, iommu->domain_ids, ndomains) >> + if (iommu->domains[num] == dmar_domain) >> + iommu_flush_iotlb_psi(iommu, num, >> + start_pfn, >>npages, 0); > >This second bitmap search seems unnecessary, I think >iommu_flush_iotlb_psi wants dmar_domain->id as the second parameter in >all cases, not num. So this could be reduced to : > >for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) > iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain->id, start_pfn, >npages, 0); [Hari] For VFIO, iommu_alloc_vm_domain() is called to get VM domain and has DOMAIN_FLAG_VIRTUAL_MACHINE flag set. In this case, domain->id is derived from static variable 'vm_domid' and not by using find_first_zero_bit in iommu->domain_ids, which in-turn also claims a slot on iommu->domains[] array, see iommu_attach_domain(). For domain with DOMAIN_FLAG_VIRTUAL_MACHINE flag, domain id is actually derived later in domain_context_mapping_one(), logic under 'if (domain->flags & DOMAIN_FLAG_VIRTUAL_MACHINE'. As you can see domain->id for VM domain is not the actual domain id and can also conflict with non VM domain as vm_domid starts from 0. Using domain->id in the loop will give wrong result. Using second loop and passing 'num' makes sure we always select the right domain to flush. Hari. > >> + } >> + if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT)) >> dmar_domain->max_addr = iova; >> - >> - return PAGE_SIZE << order; >> + return npages << VTD_PAGE_SHIFT; >> } >> >> static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain >>*domain, >> >> Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> >> cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > >Please configure your mailer to drop the html version. Thanks, > >Alex > ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> @ 2013-09-05 4:58 ` Alex Williamson 0 siblings, 0 replies; 17+ messages in thread From: Alex Williamson @ 2013-09-05 4:58 UTC (permalink / raw) To: Shankar, Hari Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, Spiller, John On Thu, 2013-09-05 at 04:05 +0000, Shankar, Hari wrote: > Hi Alex, > See inline, > > On 9/2/13 9:25 PM, "Alex Williamson" <alex.williamson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > >Hi Hari, > > > >On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote: > >> Patch Description: The patch is for Intel IOMMU driver. It forces IOMMU > >>TLB flush for the particular domain and return correct unmap size when > >>intel_iommu_unmap() is called > > > >"Patch Description:" is unnecessary. Look at other commits that have > >touched this file to get an idea how to format the subject and commit > >log: > > > >http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers > >/iommu/intel-iommu.c > > > >Sign-off should go here, not below the patch. Subject should include a > >subsystem, such as "iommu/intel:" or "intel-iommu:". Also a good idea > >to provide some justification why this is necessary for stable (unmaps > >from userspace through vfio results in coherency issue...) > > > >> The patch is generated on Linux kernel version 3.6.11 > > > >This should be noted, but not part of the commit message. See section > >15 of SubmittingPatches and note the additional comments between the > >"---" marker and the beginning of the patch. > > > >> --- linux/drivers/iommu/intel-iommu.c.orig 2013-09-01 > >>10:10:14.723958000 -0700 > >> +++ linux/drivers/iommu/intel-iommu.c 2013-09-01 18:22:58.497578000 > >>-0700 > >> @@ -4060,14 +4060,31 @@ static size_t intel_iommu_unmap(struct i > > > >Tabs seem to have been clobbered in your patch, scripts/checkpatch.pl > >will note this for you. > > > >> { > >> struct dmar_domain *dmar_domain = domain->priv; > >> int order; > >> + struct intel_iommu *iommu; > >> + unsigned long start_pfn, last_pfn; > >> + unsigned int npages; > >> + int iommu_id, num, ndomains; > >> > >> - order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT, > >> - (iova + size - 1) >> VTD_PAGE_SHIFT); > >> + start_pfn = iova >> VTD_PAGE_SHIFT; > >> + last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT; > >> + order = dma_pte_clear_range(dmar_domain, start_pfn, last_pfn); > >> + last_pfn |= (1UL << order) - 1; > >> + npages = last_pfn - start_pfn + 1; > > > >Doesn't order tell us npages more directly? npages = 1UL << order? > > > [Hari] 1UL << order formula will not work for all cases. Jeff Kimmel > (cc'ed) has a good explanation for that. Jeff's response below, > > <Jeff> > If start_pfn is always aligned modulo (1 << order), > then '1UL << order' would get the same npages value, but if > start_pfn is not guaranteed to have such alignment, then '1UL << order' > would > produce an incorrect npages value. For example, start_pfn could fall in > some range that was mapped with small pages, but the range could end in a > region mapped with large pages. > > </Jeff> I think iommu_map()/iommu_unmap() only passes naturally aligned regions to the iommu drivers, so I don't think we need to worry about that. Feel free to verify. > >> + for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, > >>g_num_of_iommus) { > >> + iommu = g_iommus[iommu_id]; > >> > >> - if (dmar_domain->max_addr == iova + size) > >> + /* > >> + * find bit position of dmar_domain > >> + */ > >> + ndomains = cap_ndoms(iommu->cap); > >> + for_each_set_bit(num, iommu->domain_ids, ndomains) > >> + if (iommu->domains[num] == dmar_domain) > >> + iommu_flush_iotlb_psi(iommu, num, > >> + start_pfn, > >>npages, 0); > > > >This second bitmap search seems unnecessary, I think > >iommu_flush_iotlb_psi wants dmar_domain->id as the second parameter in > >all cases, not num. So this could be reduced to : > > > >for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) > > iommu_flush_iotlb_psi(g_iommus[iommu_id], dmar_domain->id, start_pfn, > >npages, 0); > > [Hari] For VFIO, iommu_alloc_vm_domain() is called to get VM domain and > has DOMAIN_FLAG_VIRTUAL_MACHINE flag set. In this case, domain->id is > derived from static variable 'vm_domid' and not by using > find_first_zero_bit in iommu->domain_ids, which in-turn also claims a slot > on iommu->domains[] array, see iommu_attach_domain(). For domain with > DOMAIN_FLAG_VIRTUAL_MACHINE flag, domain id is actually derived later in > domain_context_mapping_one(), logic under 'if (domain->flags & > DOMAIN_FLAG_VIRTUAL_MACHINE'. As you can see domain->id for VM domain is > not the actual domain id and can also conflict with non VM domain as > vm_domid starts from 0. Using domain->id in the loop will give wrong > result. Using second loop and passing 'num' makes sure we always select > the right domain to flush. Ok, that does sound familiar. I agree, you do need to search the bitmaps. Thanks, Alex > >> + } > >> + if (dmar_domain->max_addr <= iova + (npages << VTD_PAGE_SHIFT)) > >> dmar_domain->max_addr = iova; > >> - > >> - return PAGE_SIZE << order; > >> + return npages << VTD_PAGE_SHIFT; > >> } > >> > >> static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain > >>*domain, > >> > >> Signed-off-by: Hari Shankar <hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> > >> cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > >Please configure your mailer to drop the html version. Thanks, > > > >Alex > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org> 2013-09-03 4:25 ` Alex Williamson @ 2013-09-22 2:59 ` David Woodhouse [not found] ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> 1 sibling, 1 reply; 17+ messages in thread From: David Woodhouse @ 2013-09-22 2:59 UTC (permalink / raw) To: Shankar, Hari Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Singh, Varinder, Sundaram, Rajesh, Spiller, John, Kimmel, Jeff [-- Attachment #1.1: Type: text/plain, Size: 12348 bytes --] On Mon, 2013-09-02 at 02:24 +0000, Shankar, Hari wrote: > Patch Description: The patch is for Intel IOMMU driver. It forces > IOMMU TLB flush for the particular domain and return correct unmap > size when intel_iommu_unmap() is called I *hate* the bizarre calling convention for iommu_unmap(). Is it actually clearly documented anywhere? Why on earth is it not just returning void, and expected to unmap what it was *asked* to unmap? I keep getting patches which "fix" the return value, with no clear explanation of what it's actually supposed to do. I ask sometimes, but the answer is always so weird that I end up forgetting it again. The main user seems to be kvm_iommu_put_pages() in virt/kvm/iommu.c which appears to be *designed* to get poor performance, making a single call (with associated clflush and IOTLB flush) for every individual *page* instead of just unmapping a whole virtual range at a time. One thing you neglected to fix in this patch is the fact that the intel_iommu_unmap() function doesn't call dma_pte_free_pagetable() after it calls dma_pte_clear_range(). For reasons not entirely clear to me, adding that call fixes a significant performance issue where the time taken to do unmaps is exponential w.r.t. the size of the region. And in fact, if we fix that then you suffer the same bug that exists everywhere *else* we do unmaps... which is that we actually need to flush the IOTLB (with the IH bit cleared) *before* we free the page table pages, because the hardware could have cached them and could still be doing a page walk after you've returned the pages to the pool. Here's a completely untested work-in-progress that attempts to fix that. I'll be able to test it myself on about Tuesday when I'm home from New Orleans and awake... diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 15e9b57..69980a1 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -413,6 +413,7 @@ struct deferred_flush_tables { int next; struct iova *iova[HIGH_WATER_MARK]; struct dmar_domain *domain[HIGH_WATER_MARK]; + struct page *freelist[HIGH_WATER_MARK]; }; static struct deferred_flush_tables *deferred_flush; @@ -945,6 +946,114 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain, } } +/* When a page at a given level is being unlinked from its parent, we don't + need to *modify* it at all. All we need to do is make a list of all the + pages which can be freed just as soon as we've flushed the IOTLB and we + know the hardware page-walk will no longer touch them. + The 'pte' argument is the *parent* PTE, pointing to the page that is to + be freed. */ +static struct page *dma_pte_list_pagetables(struct dmar_domain *domain, int level, + struct dma_pte *pte, struct page *freelist) +{ + struct page *pg; + + pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT); + pg->freelist = freelist; + freelist = pg; + + if (level > 1) { + pte = page_address(pg); + + do { + if (dma_pte_present(pte) && !dma_pte_superpage(pte)) + freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist); + + } while (!first_pte_in_page(++pte)); + } + + return freelist; +} + +static struct page *dma_pte_clear_level(struct dmar_domain *domain, int level, + struct dma_pte *pte, unsigned long pfn, + unsigned long start_pfn, unsigned long last_pfn, + struct page *freelist) +{ + struct dma_pte *first_pte = NULL, *last_pte = NULL; + + pfn = max(start_pfn, pfn); + pte = &pte[pfn_level_offset(pfn, level)]; + + do { + unsigned long level_pfn; + struct dma_pte *level_pte; + + if (!dma_pte_present(pte)) + goto next; + + level_pfn = pfn & level_mask(level - 1); + level_pte = phys_to_virt(dma_pte_addr(pte)); + + /* If range covers entire pagetable, free it */ + if (start_pfn <= level_pfn && + last_pfn >= level_pfn + level_size(level)) { + + /* These suborbinate page tables are going away entirely. Don't + bother to clear them; we're just going to *free* them. */ + if (level > 1 && !dma_pte_superpage(pte)) + freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist); + + dma_clear_pte(pte); + if (!first_pte) + first_pte = pte; + last_pte = pte; + } else { + /* Recurse down into a level that isn't *entirely* obsolete */ + freelist = dma_pte_clear_level(domain, level - 1, level_pte, + level_pfn, start_pfn, last_pfn, freelist); + } +next: + pfn += level_size(level); + } while (!first_pte_in_page(++pte) && pfn <= last_pfn); + + if (first_pte) + domain_flush_cache(domain, first_pte, + (void *)++last_pte - (void *)first_pte); + + return freelist; +} + +/* We can't just free the pages because the IOMMU may still be walking + the page tables, and may have cached the intermediate levels. The + pages can only be freed after the IOTLB flush has been done. */ +struct page *domain_unmap(struct dmar_domain *domain, + unsigned long start_pfn, + unsigned long last_pfn) +{ + int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT; + struct page *freelist = NULL; + + BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width); + BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width); + BUG_ON(start_pfn > last_pfn); + + /* we don't need lock here; nobody else touches the iova range */ + freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw), + domain->pgd, 0, start_pfn, last_pfn, NULL); + + return freelist; +} + +void dma_free_pagelist(struct page *freelist) +{ + struct page *pg; + + while ((pg = freelist)) { + freelist = pg->freelist; + free_pgtable_page(pg); + } +} + /* iommu handling */ static int iommu_alloc_root_entry(struct intel_iommu *iommu) { @@ -1054,7 +1163,7 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, break; case DMA_TLB_PSI_FLUSH: val = DMA_TLB_PSI_FLUSH|DMA_TLB_IVT|DMA_TLB_DID(did); - /* Note: always flush non-leaf currently */ + /* IH bit is passed in as part of address */ val_iva = size_order | addr; break; default: @@ -1165,13 +1274,15 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain, } static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, - unsigned long pfn, unsigned int pages, int map) + unsigned long pfn, unsigned int pages, int ih, int map) { unsigned int mask = ilog2(__roundup_pow_of_two(pages)); uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT; BUG_ON(pages == 0); + if (ih) + ih = 1 << 6; /* * Fallback to domain selective flush if no PSI support or the size is * too big. @@ -1182,7 +1293,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH); else - iommu->flush.flush_iotlb(iommu, did, addr, mask, + iommu->flush.flush_iotlb(iommu, did, addr | ih, mask, DMA_TLB_PSI_FLUSH); /* @@ -2850,7 +2961,7 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr, /* it's a non-present to present mapping. Only flush if caching mode */ if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 1); + iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 0, 1); else iommu_flush_write_buffer(iommu); @@ -2902,13 +3013,16 @@ static void flush_unmaps(void) /* On real hardware multiple invalidations are expensive */ if (cap_caching_mode(iommu->cap)) iommu_flush_iotlb_psi(iommu, domain->id, - iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1, 0); + iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1, + !deferred_flush[i].freelist[j], 0); else { mask = ilog2(mm_to_dma_pfn(iova->pfn_hi - iova->pfn_lo + 1)); iommu_flush_dev_iotlb(deferred_flush[i].domain[j], (uint64_t)iova->pfn_lo << PAGE_SHIFT, mask); } __free_iova(&deferred_flush[i].domain[j]->iovad, iova); + if (deferred_flush[i].freelist[j]) + dma_free_pagelist(deferred_flush[i].freelist[j]); } deferred_flush[i].next = 0; } @@ -2925,7 +3039,7 @@ static void flush_unmaps_timeout(unsigned long data) spin_unlock_irqrestore(&async_umap_flush_lock, flags); } -static void add_unmap(struct dmar_domain *dom, struct iova *iova) +static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist) { unsigned long flags; int next, iommu_id; @@ -2941,6 +3055,7 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova) next = deferred_flush[iommu_id].next; deferred_flush[iommu_id].domain[next] = dom; deferred_flush[iommu_id].iova[next] = iova; + deferred_flush[iommu_id].freelist[next] = freelist; deferred_flush[iommu_id].next++; if (!timer_on) { @@ -2960,6 +3075,7 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr, unsigned long start_pfn, last_pfn; struct iova *iova; struct intel_iommu *iommu; + struct page *freelist; if (iommu_no_mapping(dev)) return; @@ -2980,19 +3096,16 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr, pr_debug("Device %s unmapping: pfn %lx-%lx\n", pci_name(pdev), start_pfn, last_pfn); - /* clear the whole page */ - dma_pte_clear_range(domain, start_pfn, last_pfn); - - /* free page tables */ - dma_pte_free_pagetable(domain, start_pfn, last_pfn); + freelist = domain_unmap(domain, start_pfn, last_pfn); if (intel_iommu_strict) { iommu_flush_iotlb_psi(iommu, domain->id, start_pfn, - last_pfn - start_pfn + 1, 0); + last_pfn - start_pfn + 1, !freelist, 0); /* free iova */ __free_iova(&domain->iovad, iova); + dma_free_pagelist(freelist); } else { - add_unmap(domain, iova); + add_unmap(domain, iova, freelist); /* * queue up the release of the unmap to save the 1/6th of the * cpu used up by the iotlb flush operation... @@ -3054,6 +3167,7 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, unsigned long start_pfn, last_pfn; struct iova *iova; struct intel_iommu *iommu; + struct page *freelist; if (iommu_no_mapping(hwdev)) return; @@ -3071,19 +3185,16 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, start_pfn = mm_to_dma_pfn(iova->pfn_lo); last_pfn = mm_to_dma_pfn(iova->pfn_hi + 1) - 1; - /* clear the whole page */ - dma_pte_clear_range(domain, start_pfn, last_pfn); - - /* free page tables */ - dma_pte_free_pagetable(domain, start_pfn, last_pfn); + freelist = domain_unmap(domain, start_pfn, last_pfn); if (intel_iommu_strict) { iommu_flush_iotlb_psi(iommu, domain->id, start_pfn, - last_pfn - start_pfn + 1, 0); + last_pfn - start_pfn + 1, !freelist, 0); /* free iova */ __free_iova(&domain->iovad, iova); + dma_free_pagelist(freelist); } else { - add_unmap(domain, iova); + add_unmap(domain, iova, freelist); /* * queue up the release of the unmap to save the 1/6th of the * cpu used up by the iotlb flush operation... @@ -3166,7 +3277,7 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int ne /* it's a non-present to present mapping. Only flush if caching mode */ if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 1); + iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 0, 1); else iommu_flush_write_buffer(iommu); @@ -4116,7 +4227,10 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain, int order; order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT, - (iova + size - 1) >> VTD_PAGE_SHIFT); + (iova + size - 1) >> VTD_PAGE_SHIFT); + + dma_pte_free_pagetable(dmar_domain, iova >> VTD_PAGE_SHIFT, + (iova + size - 1) >> VTD_PAGE_SHIFT); if (dmar_domain->max_addr == iova + size) dmar_domain->max_addr = iova; -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation [-- Attachment #1.2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5745 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply related [flat|nested] 17+ messages in thread
[parent not found: <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2013-09-25 15:54 ` Joerg Roedel [not found] ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> 2013-10-02 15:04 ` David Woodhouse 1 sibling, 1 reply; 17+ messages in thread From: Joerg Roedel @ 2013-09-25 15:54 UTC (permalink / raw) To: David Woodhouse Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote: > I *hate* the bizarre calling convention for iommu_unmap(). Is it > actually clearly documented anywhere? Why on earth is it not just > returning void, and expected to unmap what it was *asked* to unmap? Yeah, I agree that this should be documented since it is quite non-standard/non-obvious behaviour of a function. The reason the interface was implemented this way is that the caller should not need to know (or keep track of) the page-size which was used to map a given iova. So the interface is basically, that you give an iova and a size and the iommu driver unmaps _at_least_ a region of that size. But if you ask for a 4k region which is mapped by a 2M page then the whole 2M are unmapped. The return value tells you how much was actually unmapped. Not the best interface, I know. We should come up with a better way to handle this. Joerg ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> @ 2013-09-25 16:05 ` David Woodhouse [not found] ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> 2013-09-25 16:33 ` Alex Williamson 1 sibling, 1 reply; 17+ messages in thread From: David Woodhouse @ 2013-09-25 16:05 UTC (permalink / raw) To: Joerg Roedel Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John [-- Attachment #1.1: Type: text/plain, Size: 1627 bytes --] On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote: > On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote: > > I *hate* the bizarre calling convention for iommu_unmap(). Is it > > actually clearly documented anywhere? Why on earth is it not just > > returning void, and expected to unmap what it was *asked* to unmap? > > Yeah, I agree that this should be documented since it is quite > non-standard/non-obvious behaviour of a function. The reason the > interface was implemented this way is that the caller should not need to > know (or keep track of) the page-size which was used to map a given iova. Why would it ever care? If it *happens* to map something that can use large pages, yay!. If it subsequently breaks apart those large pages by unmapping 4KiB in the middle, let the IOMMU driver break that apart. And why are we not allowed to unmap more than a single page at a time? Doing unmaps in small pieces (with a full flush between every one) is *horribly* slow. > So the interface is basically, that you give an iova and a size and the > iommu driver unmaps _at_least_ a region of that size. But if you ask for > a 4k region which is mapped by a 2M page then the whole 2M are unmapped. > The return value tells you how much was actually unmapped. > > Not the best interface, I know. We should come up with a better way to > handle this. Seriously, just the address and the size. Let the IOMMU code deal with whether it can *actually* use large pages on the particular set of IOMMUs that are in use for the devices you've added to the domain so far. -- dwmw2 [-- Attachment #1.2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5745 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> @ 2013-09-25 16:58 ` Joerg Roedel 2013-09-25 17:36 ` Alex Williamson 1 sibling, 0 replies; 17+ messages in thread From: Joerg Roedel @ 2013-09-25 16:58 UTC (permalink / raw) To: David Woodhouse Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John On Wed, Sep 25, 2013 at 05:05:13PM +0100, David Woodhouse wrote: > On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote: > > Yeah, I agree that this should be documented since it is quite > > non-standard/non-obvious behaviour of a function. The reason the > > interface was implemented this way is that the caller should not need to > > know (or keep track of) the page-size which was used to map a given iova. > > Why would it ever care? If it *happens* to map something that can use > large pages, yay!. If it subsequently breaks apart those large pages by > unmapping 4KiB in the middle, let the IOMMU driver break that apart. All that was implemented back in the old days where KVM was the only user of that code. And back then it didn't make sense for the IOMMU driver to split huge-pages because in the next step KVM would unmap all the small pages anyway. So I took the other route and just told KVM how much was unmapped so that the unmap code could skip over it. > And why are we not allowed to unmap more than a single page at a time? The user of iommu_unmap is actually allowed to do that. But the page-size splitting it done in the iommu_unmap() function already to have that code common between different IOMMU drivers. In the end the IOMMU driver only exports the page-size bitmap and is fine. > Doing unmaps in small pieces (with a full flush between every one) is > *horribly* slow. I proposed an iommu_commit() function to move all the necessary flushes there after a couple of page-table changes. And this also allows to keep the generic page-size splitting code generic. > Seriously, just the address and the size. Let the IOMMU code deal with > whether it can *actually* use large pages on the particular set of > IOMMUs that are in use for the devices you've added to the domain so > far. Actually we can get rid of the unmapped_size thing by putting the same constraints the DMA-API has in place. The user of the API has to unmap a region with the same iova,size parameters used for mapping it. Then we can guarantee that only the requested range will be unmapped. Joerg ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> 2013-09-25 16:58 ` Joerg Roedel @ 2013-09-25 17:36 ` Alex Williamson [not found] ` <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> 1 sibling, 1 reply; 17+ messages in thread From: Alex Williamson @ 2013-09-25 17:36 UTC (permalink / raw) To: David Woodhouse Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote: > On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote: > > On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote: > > > I *hate* the bizarre calling convention for iommu_unmap(). Is it > > > actually clearly documented anywhere? Why on earth is it not just > > > returning void, and expected to unmap what it was *asked* to unmap? > > > > Yeah, I agree that this should be documented since it is quite > > non-standard/non-obvious behaviour of a function. The reason the > > interface was implemented this way is that the caller should not need to > > know (or keep track of) the page-size which was used to map a given iova. > > Why would it ever care? If it *happens* to map something that can use > large pages, yay!. If it subsequently breaks apart those large pages by > unmapping 4KiB in the middle, let the IOMMU driver break that apart. Can this be done atomically? I thought part of the reason for this interface was that iommu drivers typically couldn't replace a huge page with multiple smaller pages in the presence of DMA. > And why are we not allowed to unmap more than a single page at a time? > > Doing unmaps in small pieces (with a full flush between every one) is > *horribly* slow. While we call iommu_unmap with PAGE_SIZE, we certainly expect that larger chunks will be used if they were mapped that way. > > So the interface is basically, that you give an iova and a size and the > > iommu driver unmaps _at_least_ a region of that size. But if you ask for > > a 4k region which is mapped by a 2M page then the whole 2M are unmapped. > > The return value tells you how much was actually unmapped. > > > > Not the best interface, I know. We should come up with a better way to > > handle this. > > Seriously, just the address and the size. Let the IOMMU code deal with > whether it can *actually* use large pages on the particular set of > IOMMUs that are in use for the devices you've added to the domain so > far. You're contradicting yourself here. Just let the iommu deal with it, but now you're raising concerns that Intel may be mixing super page IOMMU hardware with non-super page IOMMU hardware on the same box, so I do need to be concerned, yet there's no interface to discover super page support. What happens if my IOMMU domain makes use of super pages and then I add a new device behind a new IOMMU without hardware super page support? We have the same problem with snoop control already. There's no atomic re-mapping interface, so KVM just pretends it can unmap and remap without anyone noticing. Thanks, Alex ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> @ 2013-09-25 18:52 ` David Woodhouse [not found] ` <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: David Woodhouse @ 2013-09-25 18:52 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John [-- Attachment #1.1: Type: text/plain, Size: 2450 bytes --] On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote: > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote: > > Why would it ever care? If it *happens* to map something that can use > > large pages, yay!. If it subsequently breaks apart those large pages by > > unmapping 4KiB in the middle, let the IOMMU driver break that apart. > > Can this be done atomically? I thought part of the reason for this > interface was that iommu drivers typically couldn't replace a huge page > with multiple smaller pages in the presence of DMA. For the Intel IOMMU it can. You can atomically change from a large page entry, to a pointer to a full set of smaller page tables. Do the IOTLB flush, and at no time is there an interruption in service. Not sure if this is true for *all* IOMMU hardware; I'd be perfectly happy to accept a variant of Jörg's proposal that we should only ever unmap exactly the same range that we mapped. Except we should allow the unmapping of adjacent regions together; just not a partial unmap of something that was mapped in one go. > > Seriously, just the address and the size. Let the IOMMU code deal with > > whether it can *actually* use large pages on the particular set of > > IOMMUs that are in use for the devices you've added to the domain so > > far. > > You're contradicting yourself here. Just let the iommu deal with it, > but now you're raising concerns that Intel may be mixing super page > IOMMU hardware with non-super page IOMMU hardware on the same box, so I > do need to be concerned, yet there's no interface to discover super page > support. No contradiction. Do Not Concern Yourself. Just tell the IOMMU what you want mapped, and where. Let *it* worry about whether it can use superpages or not. Like it already *does*. > What happens if my IOMMU domain makes use of super pages and > then I add a new device behind a new IOMMU without hardware super page > support? Currently, you end up with the domain happily including superpages, and the less capable IOMMU that you added later won't cope. What we probably *ought* to do is walk the page tables and convert any pre-existing superpages to small pages, at the time we add the non-SP-capable IOMMU. FWIW we currently screw up the handling of cache-coherent vs. non-coherent page tables too. That one wants a wbinvd somewhere when we add a non-coherent IOMMU to the domain. -- dwmw2 [-- Attachment #1.2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5745 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> @ 2013-09-25 19:44 ` Alex Williamson [not found] ` <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: Alex Williamson @ 2013-09-25 19:44 UTC (permalink / raw) To: David Woodhouse Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John On Wed, 2013-09-25 at 19:52 +0100, David Woodhouse wrote: > On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote: > > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote: > > > Why would it ever care? If it *happens* to map something that can use > > > large pages, yay!. If it subsequently breaks apart those large pages by > > > unmapping 4KiB in the middle, let the IOMMU driver break that apart. > > > > Can this be done atomically? I thought part of the reason for this > > interface was that iommu drivers typically couldn't replace a huge page > > with multiple smaller pages in the presence of DMA. > > For the Intel IOMMU it can. You can atomically change from a large page > entry, to a pointer to a full set of smaller page tables. Do the IOTLB > flush, and at no time is there an interruption in service. Cool > Not sure if this is true for *all* IOMMU hardware; I'd be perfectly > happy to accept a variant of Jörg's proposal that we should only ever > unmap exactly the same range that we mapped. Except we should allow the > unmapping of adjacent regions together; just not a partial unmap of > something that was mapped in one go. Well, except if we've just trusted the IOMMU driver to add a device behind a non-SP capable IOMMU to our domain and convert the page tables, that partial unmap is no longer partial and now we get different behavior than before so we can't depend on that adjacent unmapping. > > > Seriously, just the address and the size. Let the IOMMU code deal with > > > whether it can *actually* use large pages on the particular set of > > > IOMMUs that are in use for the devices you've added to the domain so > > > far. > > > > You're contradicting yourself here. Just let the iommu deal with it, > > but now you're raising concerns that Intel may be mixing super page > > IOMMU hardware with non-super page IOMMU hardware on the same box, so I > > do need to be concerned, yet there's no interface to discover super page > > support. > > No contradiction. Do Not Concern Yourself. Just tell the IOMMU what you > want mapped, and where. Let *it* worry about whether it can use > superpages or not. Like it already *does*. I'd love to be able to do that, but... > > What happens if my IOMMU domain makes use of super pages and > > then I add a new device behind a new IOMMU without hardware super page > > support? > > Currently, you end up with the domain happily including superpages, and > the less capable IOMMU that you added later won't cope. This is the trouble with trusting the iommu driver. ;) > What we probably > *ought* to do is walk the page tables and convert any pre-existing > superpages to small pages, at the time we add the non-SP-capable IOMMU. And then we need to figure out how to handle that in the proposed interface changes above since it changes the unmap behavior to the naive user. There's also the question of whether the IOMMU driver should re-evaluate super pages when the less capable IOMMU is removed from the domain. > FWIW we currently screw up the handling of cache-coherent vs. > non-coherent page tables too. That one wants a wbinvd somewhere when we > add a non-coherent IOMMU to the domain. You're not selling the "trust the IOMMU driver" story very well here. Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately by non-coherent IOMMUs? Is there any downside to ignoring it and always setting SNP in the IOMMU page tables? AMD IOMMU ignores it, but it's also always cache coherent. Thanks, Alex _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> @ 2013-09-25 20:11 ` David Woodhouse [not found] ` <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: David Woodhouse @ 2013-09-25 20:11 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John [-- Attachment #1.1: Type: text/plain, Size: 4178 bytes --] On Wed, 2013-09-25 at 13:44 -0600, Alex Williamson wrote: > On Wed, 2013-09-25 at 19:52 +0100, David Woodhouse wrote: > > On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote: > > > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote: > > > > Why would it ever care? If it *happens* to map something that can use > > > > large pages, yay!. If it subsequently breaks apart those large pages by > > > > unmapping 4KiB in the middle, let the IOMMU driver break that apart. > > > > > > Can this be done atomically? I thought part of the reason for this > > > interface was that iommu drivers typically couldn't replace a huge page > > > with multiple smaller pages in the presence of DMA. > > > > For the Intel IOMMU it can. You can atomically change from a large page > > entry, to a pointer to a full set of smaller page tables. Do the IOTLB > > flush, and at no time is there an interruption in service. > > Cool > > > Not sure if this is true for *all* IOMMU hardware; I'd be perfectly > > happy to accept a variant of Jörg's proposal that we should only ever > > unmap exactly the same range that we mapped. Except we should allow the > > unmapping of adjacent regions together; just not a partial unmap of > > something that was mapped in one go. > > Well, except if we've just trusted the IOMMU driver to add a device > behind a non-SP capable IOMMU to our domain and convert the page tables, > that partial unmap is no longer partial and now we get different > behavior than before so we can't depend on that adjacent unmapping. Que? Jörg's proposal was that if you add a mapping at a given address+size, you should always remove *exactly* that address+size. Which will always work exactly the same, regardless of superpages. My slight change to that was that if you also added an *adjacent* mapping at address2+size2, you should be able to unmap both at the same time. Which will *also* always work the same regardless of superpages. Even if your two mappings were also *physically* contiguous, and *could* have used superpages, they probably won't anyway because you mapped them in two parts. > > > What happens if my IOMMU domain makes use of super pages and > > > then I add a new device behind a new IOMMU without hardware super page > > > support? > > > > Currently, you end up with the domain happily including superpages, and > > the less capable IOMMU that you added later won't cope. > > This is the trouble with trusting the iommu driver. ;) Sorry, I should have made it clearer that this is a *bug*. It's not by design. The IOMMU driver ought to get this right, and will do. > > What we probably > > *ought* to do is walk the page tables and convert any pre-existing > > superpages to small pages, at the time we add the non-SP-capable IOMMU. > > And then we need to figure out how to handle that in the proposed > interface changes above since it changes the unmap behavior to the naive > user. Isn't that what you'd *expect*? Surely you don't *expect* the breakage you currently get? > There's also the question of whether the IOMMU driver should > re-evaluate super pages when the less capable IOMMU is removed from the > domain. I wouldn't bother to go looking for opportunities to use super pages if we remove the last non-SP-capable IOMMU from the domain. > > FWIW we currently screw up the handling of cache-coherent vs. > > non-coherent page tables too. That one wants a wbinvd somewhere when we > > add a non-coherent IOMMU to the domain. > > You're not selling the "trust the IOMMU driver" story very well here. > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately > by non-coherent IOMMUs? Is there any downside to ignoring it and always > setting SNP in the IOMMU page tables? AMD IOMMU ignores it, but it's > also always cache coherent. Thanks, SNP is a separate issue. I'm speaking of cache coherency of the hardware page table walk — the feature bit that all the horrid clflush calls are predicated on. Again, this is just a bug. We *should* be getting this right, but don't yet. -- dwmw2 [-- Attachment #1.2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5745 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> @ 2013-09-25 21:46 ` Alex Williamson [not found] ` <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: Alex Williamson @ 2013-09-25 21:46 UTC (permalink / raw) To: David Woodhouse Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John On Wed, 2013-09-25 at 21:11 +0100, David Woodhouse wrote: > On Wed, 2013-09-25 at 13:44 -0600, Alex Williamson wrote: > > On Wed, 2013-09-25 at 19:52 +0100, David Woodhouse wrote: > > > On Wed, 2013-09-25 at 11:36 -0600, Alex Williamson wrote: > > > > On Wed, 2013-09-25 at 17:05 +0100, David Woodhouse wrote: > > > > > Why would it ever care? If it *happens* to map something that can use > > > > > large pages, yay!. If it subsequently breaks apart those large pages by > > > > > unmapping 4KiB in the middle, let the IOMMU driver break that apart. > > > > > > > > Can this be done atomically? I thought part of the reason for this > > > > interface was that iommu drivers typically couldn't replace a huge page > > > > with multiple smaller pages in the presence of DMA. > > > > > > For the Intel IOMMU it can. You can atomically change from a large page > > > entry, to a pointer to a full set of smaller page tables. Do the IOTLB > > > flush, and at no time is there an interruption in service. > > > > Cool > > > > > Not sure if this is true for *all* IOMMU hardware; I'd be perfectly > > > happy to accept a variant of Jörg's proposal that we should only ever > > > unmap exactly the same range that we mapped. Except we should allow the > > > unmapping of adjacent regions together; just not a partial unmap of > > > something that was mapped in one go. > > > > Well, except if we've just trusted the IOMMU driver to add a device > > behind a non-SP capable IOMMU to our domain and convert the page tables, > > that partial unmap is no longer partial and now we get different > > behavior than before so we can't depend on that adjacent unmapping. > > Que? > > Jörg's proposal was that if you add a mapping at a given address+size, > you should always remove *exactly* that address+size. Which will always > work exactly the same, regardless of superpages. > > My slight change to that was that if you also added an *adjacent* > mapping at address2+size2, you should be able to unmap both at the same > time. Which will *also* always work the same regardless of superpages. > > Even if your two mappings were also *physically* contiguous, and *could* > have used superpages, they probably won't anyway because you mapped them > in two parts. Ok, sounds reasonable. I somehow read it to still include some aspect of the "fill in the size" API we have now. > > > > What happens if my IOMMU domain makes use of super pages and > > > > then I add a new device behind a new IOMMU without hardware super page > > > > support? > > > > > > Currently, you end up with the domain happily including superpages, and > > > the less capable IOMMU that you added later won't cope. > > > > This is the trouble with trusting the iommu driver. ;) > > Sorry, I should have made it clearer that this is a *bug*. It's not by > design. The IOMMU driver ought to get this right, and will do. > > > > What we probably > > > *ought* to do is walk the page tables and convert any pre-existing > > > superpages to small pages, at the time we add the non-SP-capable IOMMU. > > > > And then we need to figure out how to handle that in the proposed > > interface changes above since it changes the unmap behavior to the naive > > user. > > Isn't that what you'd *expect*? Surely you don't *expect* the breakage > you currently get? For vfio I currently make the same statement that you're pushing for the IOMMU API; I only guarantee unmapping at the same granularity as the original mapping. So that seems fine to me. > > There's also the question of whether the IOMMU driver should > > re-evaluate super pages when the less capable IOMMU is removed from the > > domain. > > I wouldn't bother to go looking for opportunities to use super pages if > we remove the last non-SP-capable IOMMU from the domain. I predict bugs getting filed if a guest sees a performance hit after adding a device that is not restored when the device is removed. If only we could assume a similar feature set among IOMMUs in a system. > > > FWIW we currently screw up the handling of cache-coherent vs. > > > non-coherent page tables too. That one wants a wbinvd somewhere when we > > > add a non-coherent IOMMU to the domain. > > > > You're not selling the "trust the IOMMU driver" story very well here. > > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately > > by non-coherent IOMMUs? Is there any downside to ignoring it and always > > setting SNP in the IOMMU page tables? AMD IOMMU ignores it, but it's > > also always cache coherent. Thanks, > > SNP is a separate issue. I'm speaking of cache coherency of the hardware > page table walk — the feature bit that all the horrid clflush calls are > predicated on. > > Again, this is just a bug. We *should* be getting this right, but don't > yet. And for DMA_PTE_SNP? intel_iommu_map() won't let us set this bit if the domain contains a hardware unit that doesn't support ecap.SC, but it also doesn't update existing mappings. Barring hardware bugs, it seems much easier to unconditionally set DMA_PTE_SNP but still advertise IOMMU_CAP_CACHE_COHERENCY based on the composition of the domain. Otherwise we have to reject adding devices to a domain that change the coherency once DMA mappings are in play or never set IOMMU_CACHE and advertise to KVM that the domain is always non-coherent. Thanks, Alex _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org> @ 2013-09-25 22:15 ` David Woodhouse [not found] ` <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> 0 siblings, 1 reply; 17+ messages in thread From: David Woodhouse @ 2013-09-25 22:15 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John [-- Attachment #1.1: Type: text/plain, Size: 2342 bytes --] On Wed, 2013-09-25 at 15:46 -0600, Alex Williamson wrote: > On Wed, 2013-09-25 at 21:11 +0100, David Woodhouse wrote: > > I wouldn't bother to go looking for opportunities to use super pages if > > we remove the last non-SP-capable IOMMU from the domain. > > I predict bugs getting filed if a guest sees a performance hit after > adding a device that is not restored when the device is removed. If > only we could assume a similar feature set among IOMMUs in a system. Ok, fine. As long as you don't have *too* much mapped, that shouldn't suck too much. > > > > FWIW we currently screw up the handling of cache-coherent vs. > > > > non-coherent page tables too. That one wants a wbinvd somewhere when we > > > > add a non-coherent IOMMU to the domain. > > > > > > You're not selling the "trust the IOMMU driver" story very well here. > > > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately > > > by non-coherent IOMMUs? Is there any downside to ignoring it and always > > > setting SNP in the IOMMU page tables? AMD IOMMU ignores it, but it's > > > also always cache coherent. Thanks, > > > > SNP is a separate issue. I'm speaking of cache coherency of the hardware > > page table walk — the feature bit that all the horrid clflush calls are > > predicated on. > > > > Again, this is just a bug. We *should* be getting this right, but don't > > yet. > > And for DMA_PTE_SNP? intel_iommu_map() won't let us set this bit if the > domain contains a hardware unit that doesn't support ecap.SC, but it > also doesn't update existing mappings. Barring hardware bugs, it seems > much easier to unconditionally set DMA_PTE_SNP but still advertise > IOMMU_CAP_CACHE_COHERENCY based on the composition of the domain. > Otherwise we have to reject adding devices to a domain that change the > coherency once DMA mappings are in play or never set IOMMU_CACHE and > advertise to KVM that the domain is always non-coherent. Thanks, It's not that the domain is non-coherent, is it? The SNP bit allows you to *force* a PCIe DMA request to snoop the CPU cache, even if it didn't *request* that. If your device is actually *trying* to do cache-coherent (i.e. snooping) DMA, surely that works regardless of the DMA_PTE_SNP setting? Or am I missing something? -- dwmw2 [-- Attachment #1.2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5745 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>]
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org> @ 2013-09-25 22:40 ` Alex Williamson 0 siblings, 0 replies; 17+ messages in thread From: Alex Williamson @ 2013-09-25 22:40 UTC (permalink / raw) To: David Woodhouse Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, Spiller, John On Wed, 2013-09-25 at 23:15 +0100, David Woodhouse wrote: > On Wed, 2013-09-25 at 15:46 -0600, Alex Williamson wrote: > > On Wed, 2013-09-25 at 21:11 +0100, David Woodhouse wrote: > > > I wouldn't bother to go looking for opportunities to use super pages if > > > we remove the last non-SP-capable IOMMU from the domain. > > > > I predict bugs getting filed if a guest sees a performance hit after > > adding a device that is not restored when the device is removed. If > > only we could assume a similar feature set among IOMMUs in a system. > > Ok, fine. As long as you don't have *too* much mapped, that shouldn't > suck too much. Of course if we knew this was going to happen in advance of attaching the new device, I wonder if we might opt to manage it with a separate IOMMU domain. > > > > > FWIW we currently screw up the handling of cache-coherent vs. > > > > > non-coherent page tables too. That one wants a wbinvd somewhere when we > > > > > add a non-coherent IOMMU to the domain. > > > > > > > > You're not selling the "trust the IOMMU driver" story very well here. > > > > Can we assume that the IOMMU_CACHE flag (SNP) is ignored appropriately > > > > by non-coherent IOMMUs? Is there any downside to ignoring it and always > > > > setting SNP in the IOMMU page tables? AMD IOMMU ignores it, but it's > > > > also always cache coherent. Thanks, > > > > > > SNP is a separate issue. I'm speaking of cache coherency of the hardware > > > page table walk — the feature bit that all the horrid clflush calls are > > > predicated on. > > > > > > Again, this is just a bug. We *should* be getting this right, but don't > > > yet. > > > > And for DMA_PTE_SNP? intel_iommu_map() won't let us set this bit if the > > domain contains a hardware unit that doesn't support ecap.SC, but it > > also doesn't update existing mappings. Barring hardware bugs, it seems > > much easier to unconditionally set DMA_PTE_SNP but still advertise > > IOMMU_CAP_CACHE_COHERENCY based on the composition of the domain. > > Otherwise we have to reject adding devices to a domain that change the > > coherency once DMA mappings are in play or never set IOMMU_CACHE and > > advertise to KVM that the domain is always non-coherent. Thanks, > > It's not that the domain is non-coherent, is it? The SNP bit allows you > to *force* a PCIe DMA request to snoop the CPU cache, even if it didn't > *request* that. > > If your device is actually *trying* to do cache-coherent (i.e. snooping) > DMA, surely that works regardless of the DMA_PTE_SNP setting? > > Or am I missing something? It's the opposite, devices are trying to do No-Snoop transactions and KVM would really rather prefer wbinvd was a no-op since emulated devices don't need it. We therefore need to toggle KVM's behavior when we have an assigned device attached to an IOMMU domain where the hardware isn't stripping the No-Snoop attribute. Some graphics cards seem to make use of this and will get error codes loading their driver if it doesn't work. Thanks, Alex _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> 2013-09-25 16:05 ` David Woodhouse @ 2013-09-25 16:33 ` Alex Williamson 1 sibling, 0 replies; 17+ messages in thread From: Alex Williamson @ 2013-09-25 16:33 UTC (permalink / raw) To: Joerg Roedel Cc: Singh, Varinder, Sundaram, Rajesh, Kimmel, Jeff, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Shankar, Hari, David Woodhouse, Spiller, John On Wed, 2013-09-25 at 17:54 +0200, Joerg Roedel wrote: > On Sat, Sep 21, 2013 at 09:59:00PM -0500, David Woodhouse wrote: > > I *hate* the bizarre calling convention for iommu_unmap(). Is it > > actually clearly documented anywhere? Why on earth is it not just > > returning void, and expected to unmap what it was *asked* to unmap? > > Yeah, I agree that this should be documented since it is quite > non-standard/non-obvious behaviour of a function. The reason the > interface was implemented this way is that the caller should not need to > know (or keep track of) the page-size which was used to map a given iova. > > So the interface is basically, that you give an iova and a size and the > iommu driver unmaps _at_least_ a region of that size. But if you ask for > a 4k region which is mapped by a 2M page then the whole 2M are unmapped. > The return value tells you how much was actually unmapped. > > Not the best interface, I know. We should come up with a better way to > handle this. Actually, unmap can return zero too, which happens if you try to unmap an offset into a super page mapping. The current interface has issues, but both legacy kvm assignment and vfio rely on it so we don't need to track individual mappings. Thanks, Alex ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] On unmap, flush IOMMU TLB and return correct size [not found] ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> 2013-09-25 15:54 ` Joerg Roedel @ 2013-10-02 15:04 ` David Woodhouse 1 sibling, 0 replies; 17+ messages in thread From: David Woodhouse @ 2013-10-02 15:04 UTC (permalink / raw) To: Shankar, Hari Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Singh, Varinder, Sundaram, Rajesh, Spiller, John, Kimmel, Jeff [-- Attachment #1.1: Type: text/plain, Size: 13836 bytes --] On Sat, 2013-09-21 at 21:59 -0500, David Woodhouse wrote: > Here's a completely untested work-in-progress that attempts to fix that. > I'll be able to test it myself on about Tuesday when I'm home from New > Orleans and awake... Or might have been if my laptop's hard drive hadn't died. Here's an updated version which extends the use of domain_unmap() to other places it needs to happen. Similarly untested by anything but the compiler. I'll be able to test it myself on about Monday when I'm home from Portland and awake... diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 15e9b57..cd6e568 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -413,6 +413,7 @@ struct deferred_flush_tables { int next; struct iova *iova[HIGH_WATER_MARK]; struct dmar_domain *domain[HIGH_WATER_MARK]; + struct page *freelist[HIGH_WATER_MARK]; }; static struct deferred_flush_tables *deferred_flush; @@ -945,6 +946,123 @@ static void dma_pte_free_pagetable(struct dmar_domain *domain, } } +/* When a page at a given level is being unlinked from its parent, we don't + need to *modify* it at all. All we need to do is make a list of all the + pages which can be freed just as soon as we've flushed the IOTLB and we + know the hardware page-walk will no longer touch them. + The 'pte' argument is the *parent* PTE, pointing to the page that is to + be freed. */ +static struct page *dma_pte_list_pagetables(struct dmar_domain *domain, int level, + struct dma_pte *pte, struct page *freelist) +{ + struct page *pg; + + pg = pfn_to_page(dma_pte_addr(pte) >> PAGE_SHIFT); + pg->freelist = freelist; + freelist = pg; + + if (level > 1) { + pte = page_address(pg); + + do { + if (dma_pte_present(pte) && !dma_pte_superpage(pte)) + freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist); + + } while (!first_pte_in_page(++pte)); + } + + return freelist; +} + +static struct page *dma_pte_clear_level(struct dmar_domain *domain, int level, + struct dma_pte *pte, unsigned long pfn, + unsigned long start_pfn, unsigned long last_pfn, + struct page *freelist) +{ + struct dma_pte *first_pte = NULL, *last_pte = NULL; + + pfn = max(start_pfn, pfn); + pte = &pte[pfn_level_offset(pfn, level)]; + + do { + unsigned long level_pfn; + struct dma_pte *level_pte; + + if (!dma_pte_present(pte)) + goto next; + + level_pfn = pfn & level_mask(level - 1); + level_pte = phys_to_virt(dma_pte_addr(pte)); + + /* If range covers entire pagetable, free it */ + if (start_pfn <= level_pfn && + last_pfn >= level_pfn + level_size(level)) { + + /* These suborbinate page tables are going away entirely. Don't + bother to clear them; we're just going to *free* them. */ + if (level > 1 && !dma_pte_superpage(pte)) + freelist = dma_pte_list_pagetables(domain, level - 1, pte, freelist); + + dma_clear_pte(pte); + if (!first_pte) + first_pte = pte; + last_pte = pte; + } else { + /* Recurse down into a level that isn't *entirely* obsolete */ + freelist = dma_pte_clear_level(domain, level - 1, level_pte, + level_pfn, start_pfn, last_pfn, freelist); + } +next: + pfn += level_size(level); + } while (!first_pte_in_page(++pte) && pfn <= last_pfn); + + if (first_pte) + domain_flush_cache(domain, first_pte, + (void *)++last_pte - (void *)first_pte); + + return freelist; +} + +/* We can't just free the pages because the IOMMU may still be walking + the page tables, and may have cached the intermediate levels. The + pages can only be freed after the IOTLB flush has been done. */ +struct page *domain_unmap(struct dmar_domain *domain, + unsigned long start_pfn, + unsigned long last_pfn) +{ + int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT; + struct page *freelist = NULL; + + BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width); + BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width); + BUG_ON(start_pfn > last_pfn); + + /* we don't need lock here; nobody else touches the iova range */ + freelist = dma_pte_clear_level(domain, agaw_to_level(domain->agaw), + domain->pgd, 0, start_pfn, last_pfn, NULL); + + /* free pgd */ + if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) { + struct page *pgd_page = virt_to_page(domain->pgd); + pgd_page->freelist = freelist; + freelist = pgd_page; + + domain->pgd = NULL; + } + + return freelist; +} + +void dma_free_pagelist(struct page *freelist) +{ + struct page *pg; + + while ((pg = freelist)) { + freelist = pg->freelist; + free_pgtable_page(pg); + } +} + /* iommu handling */ static int iommu_alloc_root_entry(struct intel_iommu *iommu) { @@ -1054,7 +1172,7 @@ static void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, break; case DMA_TLB_PSI_FLUSH: val = DMA_TLB_PSI_FLUSH|DMA_TLB_IVT|DMA_TLB_DID(did); - /* Note: always flush non-leaf currently */ + /* IH bit is passed in as part of address */ val_iva = size_order | addr; break; default: @@ -1165,13 +1283,15 @@ static void iommu_flush_dev_iotlb(struct dmar_domain *domain, } static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, - unsigned long pfn, unsigned int pages, int map) + unsigned long pfn, unsigned int pages, int ih, int map) { unsigned int mask = ilog2(__roundup_pow_of_two(pages)); uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT; BUG_ON(pages == 0); + if (ih) + ih = 1 << 6; /* * Fallback to domain selective flush if no PSI support or the size is * too big. @@ -1182,7 +1302,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did, iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH); else - iommu->flush.flush_iotlb(iommu, did, addr, mask, + iommu->flush.flush_iotlb(iommu, did, addr | ih, mask, DMA_TLB_PSI_FLUSH); /* @@ -1517,6 +1637,7 @@ static void domain_exit(struct dmar_domain *domain) { struct dmar_drhd_unit *drhd; struct intel_iommu *iommu; + struct page *freelist = NULL; /* Domain 0 is reserved, so dont process it */ if (!domain) @@ -1530,16 +1651,14 @@ static void domain_exit(struct dmar_domain *domain) /* destroy iovas */ put_iova_domain(&domain->iovad); - /* clear ptes */ - dma_pte_clear_range(domain, 0, DOMAIN_MAX_PFN(domain->gaw)); - - /* free page tables */ - dma_pte_free_pagetable(domain, 0, DOMAIN_MAX_PFN(domain->gaw)); + freelist = domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw)); for_each_active_iommu(iommu, drhd) if (test_bit(iommu->seq_id, domain->iommu_bmp)) iommu_detach_domain(domain, iommu); + dma_free_pagelist(freelist); + free_domain_mem(domain); } @@ -2850,7 +2969,7 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr, /* it's a non-present to present mapping. Only flush if caching mode */ if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 1); + iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 0, 1); else iommu_flush_write_buffer(iommu); @@ -2902,13 +3021,16 @@ static void flush_unmaps(void) /* On real hardware multiple invalidations are expensive */ if (cap_caching_mode(iommu->cap)) iommu_flush_iotlb_psi(iommu, domain->id, - iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1, 0); + iova->pfn_lo, iova->pfn_hi - iova->pfn_lo + 1, + !deferred_flush[i].freelist[j], 0); else { mask = ilog2(mm_to_dma_pfn(iova->pfn_hi - iova->pfn_lo + 1)); iommu_flush_dev_iotlb(deferred_flush[i].domain[j], (uint64_t)iova->pfn_lo << PAGE_SHIFT, mask); } __free_iova(&deferred_flush[i].domain[j]->iovad, iova); + if (deferred_flush[i].freelist[j]) + dma_free_pagelist(deferred_flush[i].freelist[j]); } deferred_flush[i].next = 0; } @@ -2925,7 +3047,7 @@ static void flush_unmaps_timeout(unsigned long data) spin_unlock_irqrestore(&async_umap_flush_lock, flags); } -static void add_unmap(struct dmar_domain *dom, struct iova *iova) +static void add_unmap(struct dmar_domain *dom, struct iova *iova, struct page *freelist) { unsigned long flags; int next, iommu_id; @@ -2941,6 +3063,7 @@ static void add_unmap(struct dmar_domain *dom, struct iova *iova) next = deferred_flush[iommu_id].next; deferred_flush[iommu_id].domain[next] = dom; deferred_flush[iommu_id].iova[next] = iova; + deferred_flush[iommu_id].freelist[next] = freelist; deferred_flush[iommu_id].next++; if (!timer_on) { @@ -2960,6 +3083,7 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr, unsigned long start_pfn, last_pfn; struct iova *iova; struct intel_iommu *iommu; + struct page *freelist; if (iommu_no_mapping(dev)) return; @@ -2980,19 +3104,16 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr, pr_debug("Device %s unmapping: pfn %lx-%lx\n", pci_name(pdev), start_pfn, last_pfn); - /* clear the whole page */ - dma_pte_clear_range(domain, start_pfn, last_pfn); - - /* free page tables */ - dma_pte_free_pagetable(domain, start_pfn, last_pfn); + freelist = domain_unmap(domain, start_pfn, last_pfn); if (intel_iommu_strict) { iommu_flush_iotlb_psi(iommu, domain->id, start_pfn, - last_pfn - start_pfn + 1, 0); + last_pfn - start_pfn + 1, !freelist, 0); /* free iova */ __free_iova(&domain->iovad, iova); + dma_free_pagelist(freelist); } else { - add_unmap(domain, iova); + add_unmap(domain, iova, freelist); /* * queue up the release of the unmap to save the 1/6th of the * cpu used up by the iotlb flush operation... @@ -3054,6 +3175,7 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, unsigned long start_pfn, last_pfn; struct iova *iova; struct intel_iommu *iommu; + struct page *freelist; if (iommu_no_mapping(hwdev)) return; @@ -3071,19 +3193,16 @@ static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sglist, start_pfn = mm_to_dma_pfn(iova->pfn_lo); last_pfn = mm_to_dma_pfn(iova->pfn_hi + 1) - 1; - /* clear the whole page */ - dma_pte_clear_range(domain, start_pfn, last_pfn); - - /* free page tables */ - dma_pte_free_pagetable(domain, start_pfn, last_pfn); + freelist = domain_unmap(domain, start_pfn, last_pfn); if (intel_iommu_strict) { iommu_flush_iotlb_psi(iommu, domain->id, start_pfn, - last_pfn - start_pfn + 1, 0); + last_pfn - start_pfn + 1, !freelist, 0); /* free iova */ __free_iova(&domain->iovad, iova); + dma_free_pagelist(freelist); } else { - add_unmap(domain, iova); + add_unmap(domain, iova, freelist); /* * queue up the release of the unmap to save the 1/6th of the * cpu used up by the iotlb flush operation... @@ -3166,7 +3285,7 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist, int ne /* it's a non-present to present mapping. Only flush if caching mode */ if (cap_caching_mode(iommu->cap)) - iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 1); + iommu_flush_iotlb_psi(iommu, domain->id, start_vpfn, size, 0, 1); else iommu_flush_write_buffer(iommu); @@ -3952,6 +4071,8 @@ static void iommu_free_vm_domain(struct dmar_domain *domain) static void vm_domain_exit(struct dmar_domain *domain) { + struct page *freelist; + /* Domain 0 is reserved, so dont process it */ if (!domain) return; @@ -3960,13 +4081,12 @@ static void vm_domain_exit(struct dmar_domain *domain) /* destroy iovas */ put_iova_domain(&domain->iovad); - /* clear ptes */ - dma_pte_clear_range(domain, 0, DOMAIN_MAX_PFN(domain->gaw)); - - /* free page tables */ - dma_pte_free_pagetable(domain, 0, DOMAIN_MAX_PFN(domain->gaw)); + freelist = domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw)); iommu_free_vm_domain(domain); + + dma_free_pagelist(freelist); + free_domain_mem(domain); } @@ -4110,18 +4230,43 @@ static int intel_iommu_map(struct iommu_domain *domain, } static size_t intel_iommu_unmap(struct iommu_domain *domain, - unsigned long iova, size_t size) + unsigned long iova, size_t size) { struct dmar_domain *dmar_domain = domain->priv; - int order; + struct page *freelist = NULL; + struct intel_iommu *iommu; + unsigned long start_pfn, last_pfn; + unsigned int npages; + int iommu_id, num, ndomains; + + start_pfn = iova >> VTD_PAGE_SHIFT; + last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT; + + freelist = domain_unmap(dmar_domain, start_pfn, last_pfn); + + npages = last_pfn - start_pfn + 1; + + for_each_set_bit(iommu_id, dmar_domain->iommu_bmp, g_num_of_iommus) { + iommu = g_iommus[iommu_id]; + + /* + * find bit position of dmar_domain + */ + ndomains = cap_ndoms(iommu->cap); + for_each_set_bit(num, iommu->domain_ids, ndomains) { + if (iommu->domains[num] == dmar_domain) + iommu_flush_iotlb_psi(iommu, num, start_pfn, + npages, !freelist, 0); + } + + } - order = dma_pte_clear_range(dmar_domain, iova >> VTD_PAGE_SHIFT, - (iova + size - 1) >> VTD_PAGE_SHIFT); + dma_free_pagelist(freelist); if (dmar_domain->max_addr == iova + size) dmar_domain->max_addr = iova; - return PAGE_SIZE << order; + return size; } static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain, -- dwmw2 [-- Attachment #1.2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5745 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply related [flat|nested] 17+ messages in thread
end of thread, other threads:[~2013-10-02 15:04 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-02 2:24 [PATCH] On unmap, flush IOMMU TLB and return correct size Shankar, Hari
[not found] ` <CE494510.41547%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
2013-09-03 4:25 ` Alex Williamson
[not found] ` <1378182340.3246.19.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-05 4:05 ` Shankar, Hari
[not found] ` <CE4D489F.41EE8%hshankar-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
2013-09-05 4:58 ` Alex Williamson
2013-09-22 2:59 ` David Woodhouse
[not found] ` <1379818740.2547.51.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
2013-09-25 15:54 ` Joerg Roedel
[not found] ` <20130925155459.GA20372-zLv9SwRftAIdnm+yROfE0A@public.gmane.org>
2013-09-25 16:05 ` David Woodhouse
[not found] ` <1380125113.28494.14.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 16:58 ` Joerg Roedel
2013-09-25 17:36 ` Alex Williamson
[not found] ` <1380130588.3030.342.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-25 18:52 ` David Woodhouse
[not found] ` <1380135148.28494.26.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 19:44 ` Alex Williamson
[not found] ` <1380138243.5197.20.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-25 20:11 ` David Woodhouse
[not found] ` <1380139886.28494.31.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 21:46 ` Alex Williamson
[not found] ` <1380145560.5197.94.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>
2013-09-25 22:15 ` David Woodhouse
[not found] ` <1380147348.28494.37.camel-W2I5cNIroUsVm/YvaOjsyQ@public.gmane.org>
2013-09-25 22:40 ` Alex Williamson
2013-09-25 16:33 ` Alex Williamson
2013-10-02 15:04 ` David Woodhouse
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.