* [PATCH v3 2/4] iommu/iova: Optimise the padding calculation
[not found] ` <cover.1503412074.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2017-08-22 15:17 ` [PATCH v3 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
@ 2017-08-22 15:17 ` Robin Murphy
2017-08-22 15:17 ` [PATCH v3 4/4] iommu/iova: Make dma_32bit_pfn implicit Robin Murphy
2017-08-25 18:52 ` [PATCH v3 0/4] Optimise 64-bit IOVA allocations Nate Watterson
3 siblings, 0 replies; 7+ messages in thread
From: Robin Murphy @ 2017-08-22 15:17 UTC (permalink / raw)
To: joro-zLv9SwRftAIdnm+yROfE0A
Cc: ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
ray.jui-dY08KVG/lbpWk0Htik3J/w,
thunder.leizhen-hv44wF8Li93QT0dZR+AlfA
From: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
The mask for calculating the padding size doesn't change, so there's no
need to recalculate it every loop iteration. Furthermore, Once we've
done that, it becomes clear that we don't actually need to calculate a
padding size at all - by flipping the arithmetic around, we can just
combine the upper limit, size, and mask directly to check against the
lower limit.
For an arm64 build, this alone knocks 16% off the size of the entire
alloc_iova() function!
Signed-off-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Tested-by: Ard Biesheuvel <ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Tested-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
[rm: simplified more of the arithmetic, rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
v3: No change
drivers/iommu/iova.c | 40 ++++++++++++++--------------------------
1 file changed, 14 insertions(+), 26 deletions(-)
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8f7552dc4e04..d094d1ca8f23 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -129,16 +129,6 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
rb_insert_color(&iova->node, root);
}
-/*
- * Computes the padding size required, to make the start address
- * naturally aligned on the power-of-two order of its size
- */
-static unsigned int
-iova_get_pad_size(unsigned int size, unsigned int limit_pfn)
-{
- return (limit_pfn - size) & (__roundup_pow_of_two(size) - 1);
-}
-
static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
unsigned long size, unsigned long limit_pfn,
struct iova *new, bool size_aligned)
@@ -146,7 +136,10 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
struct rb_node *prev, *curr = NULL;
unsigned long flags;
unsigned long saved_pfn;
- unsigned int pad_size = 0;
+ unsigned long align_mask = ~0UL;
+
+ if (size_aligned)
+ align_mask <<= __fls(size);
/* Walk the tree backwards */
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
@@ -156,31 +149,26 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
while (curr) {
struct iova *curr_iova = rb_entry(curr, struct iova, node);
- if (limit_pfn <= curr_iova->pfn_lo) {
+ if (limit_pfn <= curr_iova->pfn_lo)
goto move_left;
- } else if (limit_pfn > curr_iova->pfn_hi) {
- if (size_aligned)
- pad_size = iova_get_pad_size(size, limit_pfn);
- if ((curr_iova->pfn_hi + size + pad_size) < limit_pfn)
- break; /* found a free slot */
- }
+
+ if (((limit_pfn - size) & align_mask) > curr_iova->pfn_hi)
+ break; /* found a free slot */
+
limit_pfn = curr_iova->pfn_lo;
move_left:
prev = curr;
curr = rb_prev(curr);
}
- if (!curr) {
- if (size_aligned)
- pad_size = iova_get_pad_size(size, limit_pfn);
- if ((iovad->start_pfn + size + pad_size) > limit_pfn) {
- spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
- return -ENOMEM;
- }
+ if (limit_pfn < size ||
+ (!curr && ((limit_pfn - size) & align_mask) < iovad->start_pfn)) {
+ spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+ return -ENOMEM;
}
/* pfn_lo will point to size aligned address if size_aligned is set */
- new->pfn_lo = limit_pfn - (size + pad_size);
+ new->pfn_lo = (limit_pfn - size) & align_mask;
new->pfn_hi = new->pfn_lo + size - 1;
/* If we have 'prev', it's a valid place to start the insertion. */
--
2.13.4.dirty
^ permalink raw reply related [flat|nested] 7+ messages in thread* [PATCH v3 4/4] iommu/iova: Make dma_32bit_pfn implicit
[not found] ` <cover.1503412074.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
2017-08-22 15:17 ` [PATCH v3 1/4] iommu/iova: Optimise rbtree searching Robin Murphy
2017-08-22 15:17 ` [PATCH v3 2/4] iommu/iova: Optimise the padding calculation Robin Murphy
@ 2017-08-22 15:17 ` Robin Murphy
2017-08-25 18:52 ` [PATCH v3 0/4] Optimise 64-bit IOVA allocations Nate Watterson
3 siblings, 0 replies; 7+ messages in thread
From: Robin Murphy @ 2017-08-22 15:17 UTC (permalink / raw)
To: joro-zLv9SwRftAIdnm+yROfE0A
Cc: Sudeep Dutt, ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A, David Airlie,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jonathan Hunter,
Ashutosh Dixit, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Thierry Reding, ray.jui-dY08KVG/lbpWk0Htik3J/w,
thunder.leizhen-hv44wF8Li93QT0dZR+AlfA
From: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Now that the cached node optimisation can apply to all allocations, the
couple of users which were playing tricks with dma_32bit_pfn in order to
benefit from it can stop doing so. Conversely, there is also no need for
all the other users to explicitly calculate a 'real' 32-bit PFN, when
init_iova_domain() can happily do that itself from the page granularity.
CC: Thierry Reding <thierry.reding-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Jonathan Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
CC: David Airlie <airlied-cv59FeDIM0c@public.gmane.org>
CC: Sudeep Dutt <sudeep.dutt-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
CC: Ashutosh Dixit <ashutosh.dixit-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Tested-by: Ard Biesheuvel <ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Tested-by: Zhen Lei <thunder.leizhen-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
[rm: use iova_shift(), rewrote commit message]
Signed-off-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
---
v3: No change
drivers/gpu/drm/tegra/drm.c | 3 +--
drivers/gpu/host1x/dev.c | 3 +--
drivers/iommu/amd_iommu.c | 7 ++-----
drivers/iommu/dma-iommu.c | 18 +-----------------
drivers/iommu/intel-iommu.c | 11 +++--------
drivers/iommu/iova.c | 4 ++--
drivers/misc/mic/scif/scif_rma.c | 3 +--
include/linux/iova.h | 5 ++---
8 files changed, 13 insertions(+), 41 deletions(-)
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 518f4b69ea53..81e9ae1ee90b 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -150,8 +150,7 @@ static int tegra_drm_load(struct drm_device *drm, unsigned long flags)
order = __ffs(tegra->domain->pgsize_bitmap);
init_iova_domain(&tegra->carveout.domain, 1UL << order,
- carveout_start >> order,
- carveout_end >> order);
+ carveout_start >> order);
tegra->carveout.shift = iova_shift(&tegra->carveout.domain);
tegra->carveout.limit = carveout_end >> tegra->carveout.shift;
diff --git a/drivers/gpu/host1x/dev.c b/drivers/gpu/host1x/dev.c
index 2c58a390123a..57c8eed0ed71 100644
--- a/drivers/gpu/host1x/dev.c
+++ b/drivers/gpu/host1x/dev.c
@@ -193,8 +193,7 @@ static int host1x_probe(struct platform_device *pdev)
order = __ffs(host->domain->pgsize_bitmap);
init_iova_domain(&host->iova, 1UL << order,
- geometry->aperture_start >> order,
- geometry->aperture_end >> order);
+ geometry->aperture_start >> order);
host->iova_end = geometry->aperture_end;
}
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 688e77576e5a..a12e3e12014a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -63,7 +63,6 @@
/* IO virtual address start page frame number */
#define IOVA_START_PFN (1)
#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
/* Reserved IOVA ranges */
#define MSI_RANGE_START (0xfee00000)
@@ -2010,8 +2009,7 @@ static struct dma_ops_domain *dma_ops_domain_alloc(void)
if (!dma_dom->domain.pt_root)
goto free_dma_dom;
- init_iova_domain(&dma_dom->iovad, PAGE_SIZE,
- IOVA_START_PFN, DMA_32BIT_PFN);
+ init_iova_domain(&dma_dom->iovad, PAGE_SIZE, IOVA_START_PFN);
/* Initialize reserved ranges */
copy_reserved_iova(&reserved_iova_ranges, &dma_dom->iovad);
@@ -2912,8 +2910,7 @@ static int init_reserved_iova_ranges(void)
struct pci_dev *pdev = NULL;
struct iova *val;
- init_iova_domain(&reserved_iova_ranges, PAGE_SIZE,
- IOVA_START_PFN, DMA_32BIT_PFN);
+ init_iova_domain(&reserved_iova_ranges, PAGE_SIZE, IOVA_START_PFN);
lockdep_set_class(&reserved_iova_ranges.iova_rbtree_lock,
&reserved_rbtree_key);
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe7f6cb..191be9c80a8a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -292,18 +292,7 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
/* ...then finally give it a kicking to make sure it fits */
base_pfn = max_t(unsigned long, base_pfn,
domain->geometry.aperture_start >> order);
- end_pfn = min_t(unsigned long, end_pfn,
- domain->geometry.aperture_end >> order);
}
- /*
- * PCI devices may have larger DMA masks, but still prefer allocating
- * within a 32-bit mask to avoid DAC addressing. Such limitations don't
- * apply to the typical platform device, so for those we may as well
- * leave the cache limit at the top of their range to save an rb_last()
- * traversal on every allocation.
- */
- if (dev && dev_is_pci(dev))
- end_pfn &= DMA_BIT_MASK(32) >> order;
/* start_pfn is always nonzero for an already-initialised domain */
if (iovad->start_pfn) {
@@ -312,16 +301,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
pr_warn("Incompatible range for DMA domain\n");
return -EFAULT;
}
- /*
- * If we have devices with different DMA masks, move the free
- * area cache limit down for the benefit of the smaller one.
- */
- iovad->dma_32bit_pfn = min(end_pfn + 1, iovad->dma_32bit_pfn);
return 0;
}
- init_iova_domain(iovad, 1UL << order, base_pfn, end_pfn);
+ init_iova_domain(iovad, 1UL << order, base_pfn);
if (!dev)
return 0;
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 687f18f65cea..afa3b4e765e7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -82,8 +82,6 @@
#define IOVA_START_PFN (1)
#define IOVA_PFN(addr) ((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN IOVA_PFN(DMA_BIT_MASK(64))
/* page table handling */
#define LEVEL_STRIDE (9)
@@ -1874,8 +1872,7 @@ static int dmar_init_reserved_ranges(void)
struct iova *iova;
int i;
- init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN,
- DMA_32BIT_PFN);
+ init_iova_domain(&reserved_iova_list, VTD_PAGE_SIZE, IOVA_START_PFN);
lockdep_set_class(&reserved_iova_list.iova_rbtree_lock,
&reserved_rbtree_key);
@@ -1933,8 +1930,7 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
int adjust_width, agaw;
unsigned long sagaw;
- init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
- DMA_32BIT_PFN);
+ init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
domain_reserve_special_ranges(domain);
/* calculate AGAW */
@@ -4989,8 +4985,7 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
{
int adjust_width;
- init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN,
- DMA_32BIT_PFN);
+ init_iova_domain(&domain->iovad, VTD_PAGE_SIZE, IOVA_START_PFN);
domain_reserve_special_ranges(domain);
/* calculate AGAW */
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b208f656f6a4..612b8f6ce34d 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -35,7 +35,7 @@ static void free_iova_rcaches(struct iova_domain *iovad);
void
init_iova_domain(struct iova_domain *iovad, unsigned long granule,
- unsigned long start_pfn, unsigned long pfn_32bit)
+ unsigned long start_pfn)
{
/*
* IOVA granularity will normally be equal to the smallest
@@ -50,7 +50,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
iovad->cached32_node = NULL;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
- iovad->dma_32bit_pfn = pfn_32bit + 1;
+ iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
init_iova_rcaches(iovad);
}
EXPORT_SYMBOL_GPL(init_iova_domain);
diff --git a/drivers/misc/mic/scif/scif_rma.c b/drivers/misc/mic/scif/scif_rma.c
index 329727e00e97..c824329f7012 100644
--- a/drivers/misc/mic/scif/scif_rma.c
+++ b/drivers/misc/mic/scif/scif_rma.c
@@ -39,8 +39,7 @@ void scif_rma_ep_init(struct scif_endpt *ep)
struct scif_endpt_rma_info *rma = &ep->rma_info;
mutex_init(&rma->rma_lock);
- init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN,
- SCIF_DMA_64BIT_PFN);
+ init_iova_domain(&rma->iovad, PAGE_SIZE, SCIF_IOVA_START_PFN);
spin_lock_init(&rma->tc_lock);
mutex_init(&rma->mmn_lock);
INIT_LIST_HEAD(&rma->reg_list);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 0bb8df43b393..58c2a365c45f 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -102,7 +102,7 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
unsigned long pfn_hi);
void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
- unsigned long start_pfn, unsigned long pfn_32bit);
+ unsigned long start_pfn);
struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
void put_iova_domain(struct iova_domain *iovad);
struct iova *split_and_remove_iova(struct iova_domain *iovad,
@@ -170,8 +170,7 @@ static inline void copy_reserved_iova(struct iova_domain *from,
static inline void init_iova_domain(struct iova_domain *iovad,
unsigned long granule,
- unsigned long start_pfn,
- unsigned long pfn_32bit)
+ unsigned long start_pfn)
{
}
--
2.13.4.dirty
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH v3 0/4] Optimise 64-bit IOVA allocations
[not found] ` <cover.1503412074.git.robin.murphy-5wv7dgnIgG8@public.gmane.org>
` (2 preceding siblings ...)
2017-08-22 15:17 ` [PATCH v3 4/4] iommu/iova: Make dma_32bit_pfn implicit Robin Murphy
@ 2017-08-25 18:52 ` Nate Watterson
2017-08-30 12:14 ` Joerg Roedel
3 siblings, 1 reply; 7+ messages in thread
From: Nate Watterson @ 2017-08-25 18:52 UTC (permalink / raw)
To: Robin Murphy, joro-zLv9SwRftAIdnm+yROfE0A
Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
ard.biesheuvel-QSEj5FYQhm4dnm+yROfE0A,
ray.jui-dY08KVG/lbpWk0Htik3J/w,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
thunder.leizhen-hv44wF8Li93QT0dZR+AlfA
Hi Robin,
On 8/22/2017 11:17 AM, Robin Murphy wrote:
> Hi all,
>
> Just a quick repost of v2[1] with a small fix for the bug reported by Nate.
I tested the series and can confirm that the crash I reported on v2
no longer occurs with this version.
> To recap, whilst this mostly only improves worst-case performance, those
> worst-cases have a tendency to be pathologically bad:
>
> Ard reports general desktop performance with Chromium on AMD Seattle going
> from ~1-2FPS to perfectly usable.
>
> Leizhen reports gigabit ethernet throughput going from ~6.5Mbit/s to line
> speed.
>
> I also inadvertantly found that the HiSilicon hns_dsaf driver was taking ~35s
> to probe simply becuase of the number of DMA buffers it maps on startup (perf
> shows around 76% of that was spent under the lock in alloc_iova()). With this
> series applied it takes a mere ~1s, mostly of unrelated mdelay()s, with
> alloc_iova() entirely lost in the noise.
Are any of these cases PCI devices attached to domains that have run
out of 32-bit IOVAs and have to retry the allocation using dma_limit?
iommu_dma_alloc_iova() {
[...]
if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev)) [<- TRUE]
iova = alloc_iova_fast(DMA_BIT_MASK(32)); [<- NULL]
if (!iova)
iova = alloc_iova_fast(dma_limit); [<- OK ]
[...]
}
I am asking because, when using 64k pages, the Mellanox CX4 adapter
exhausts the supply 32-bit IOVAs simply allocating per-cpu IOVA space
during 'ifconfig up' so the code path outlined above is taken for
nearly all subsequent allocations. Although I do see a notable (~2x)
performance improvement with this series, I would still characterize it
as "pathologically bad" at < 10% of iommu passthrough performance.
This was a bit surprising to me as I thought the iova_rcache would
have eliminated the need to walk the rbtree for runtime allocations.
Unfortunately, it looks like the failed attempt to allocate a 32-bit
IOVA actually drops the cached IOVAs that we could have used when
subsequently performing the allocation at dma_limit.
alloc_iova_fast() {
[...]
iova_pfn = iova_rcache_get(...); [<- Fail, no 32-bit IOVAs]
if (iova_pfn)
return iova_pfn;
retry:
new_iova = alloc_iova(...); [<- Fail, no 32-bit IOVAs]
if (!new_iova) {
unsigned int cpu;
if (flushed_rcache)
return 0;
/* Try replenishing IOVAs by flushing rcache. */
flushed_rcache = true;
for_each_online_cpu(cpu)
free_cpu_cached_iovas(cpu, iovad); [<- :( ]
goto retry;
}
}
As an experiment, I added code to skip the rcache flushing/retry for
the 32-bit allocations. In this configuration, 100% of passthrough mode
performance was achieved. I made the same change in the baseline and
measured performance at ~95% of passthrough mode.
I also got similar results by altogether removing the 32-bit allocation
from iommu_dma_alloc_iova() which makes me wonder why we even bother.
What (PCIe) workloads have been shown to actually benefit from it?
Tested-by: Nate Watterson <nwatters-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org>
-Nate
>
> Robin.
>
> [1] https://www.mail-archive.com/iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org/msg19139.html
>
> Robin Murphy (1):
> iommu/iova: Extend rbtree node caching
>
> Zhen Lei (3):
> iommu/iova: Optimise rbtree searching
> iommu/iova: Optimise the padding calculation
> iommu/iova: Make dma_32bit_pfn implicit
>
> drivers/gpu/drm/tegra/drm.c | 3 +-
> drivers/gpu/host1x/dev.c | 3 +-
> drivers/iommu/amd_iommu.c | 7 +--
> drivers/iommu/dma-iommu.c | 18 +------
> drivers/iommu/intel-iommu.c | 11 ++--
> drivers/iommu/iova.c | 114 +++++++++++++++++----------------------
> drivers/misc/mic/scif/scif_rma.c | 3 +-
> include/linux/iova.h | 8 +--
> 8 files changed, 62 insertions(+), 105 deletions(-)
>
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
^ permalink raw reply [flat|nested] 7+ messages in thread