* [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems
@ 2017-09-21  8:59 Ganapatrao Kulkarni
  2017-09-21  8:59 ` [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit Ganapatrao Kulkarni
  ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-21  8:59 UTC (permalink / raw)
To: linux-arm-kernel

Add NUMA-aware memory allocations for IOMMU DMA allocation and for the
memory allocated for SMMU stream tables, page walk tables and command
queues.

With this series, iperf testing on ThunderX2 with a 40G NIC card on
NODE 1 PCI showed the same performance (around 30% improvement) as
NODE 0.

Ganapatrao Kulkarni (4):
  mm: move function alloc_pages_exact_nid out of __meminit
  numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
    translation tables
  iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
    command queues
  iommu/dma, numa: Use NUMA aware memory allocations in
    __iommu_dma_alloc_pages

 drivers/iommu/arm-smmu-v3.c    | 57 +++++++++++++++++++++++++++++++++++++----
 drivers/iommu/dma-iommu.c      | 17 +++++++------
 drivers/iommu/io-pgtable-arm.c |  4 ++-
 include/linux/gfp.h            |  2 +-
 mm/page_alloc.c                |  3 ++-
 5 files changed, 67 insertions(+), 16 deletions(-)

-- 
2.9.4

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit
  2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
@ 2017-09-21  8:59 ` Ganapatrao Kulkarni
  2017-09-26 13:35   ` Michal Hocko
  2017-09-21  8:59 ` [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables Ganapatrao Kulkarni
  ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-21  8:59 UTC (permalink / raw)
To: linux-arm-kernel

This function can be used on NUMA systems in place of alloc_pages_exact().
Add code to export it and to remove the __meminit section tagging.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
---
 include/linux/gfp.h | 2 +-
 mm/page_alloc.c     | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f780718..a4bd234 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -528,7 +528,7 @@ extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
 void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
 void free_pages_exact(void *virt, size_t size);
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
+void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 #define __get_free_page(gfp_mask) \
 		__get_free_pages((gfp_mask), 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af8..7975870 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4442,7 +4442,7 @@ EXPORT_SYMBOL(alloc_pages_exact);
  * Like alloc_pages_exact(), but try to allocate on node nid first before falling
  * back.
  */
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
+void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
 {
 	unsigned int order = get_order(size);
 	struct page *p = alloc_pages_node(nid, gfp_mask, order);
@@ -4450,6 +4450,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
 		return NULL;
 	return make_alloc_exact((unsigned long)page_address(p), order, size);
 }
+EXPORT_SYMBOL(alloc_pages_exact_nid);
 
 /**
  * free_pages_exact - release memory allocated via alloc_pages_exact()
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit
  2017-09-21  8:59 ` [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit Ganapatrao Kulkarni
@ 2017-09-26 13:35   ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-09-26 13:35 UTC (permalink / raw)
To: linux-arm-kernel

On Thu 21-09-17 14:29:19, Ganapatrao Kulkarni wrote:
> This function can be used on NUMA systems in place of alloc_pages_exact().
> Add code to export it and to remove the __meminit section tagging.

It is usually better to fold such a change into a patch which adds a new
user. Other than that I do not have any objections.

> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>

Acked-by: Michal Hocko <mhocko@suse.com>

[...]

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables
  2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
  2017-09-21  8:59 ` [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit Ganapatrao Kulkarni
@ 2017-09-21  8:59 ` Ganapatrao Kulkarni
  2017-09-21 11:11   ` Robin Murphy
  2017-09-21  8:59 ` [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues Ganapatrao Kulkarni
  ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-21  8:59 UTC (permalink / raw)
To: linux-arm-kernel

The function __arm_lpae_alloc_pages is used to allocate memory for SMMU
translation tables. Update it to allocate memory/pages from the
proximity domain of the SMMU device.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
---
 drivers/iommu/io-pgtable-arm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e8018a3..f6d01f6 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -215,8 +215,10 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
 {
 	struct device *dev = cfg->iommu_dev;
 	dma_addr_t dma;
-	void *pages = alloc_pages_exact(size, gfp | __GFP_ZERO);
+	void *pages;
 
+	pages = alloc_pages_exact_nid(dev_to_node(dev), size,
+				      gfp | __GFP_ZERO);
 	if (!pages)
 		return NULL;
 
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables
  2017-09-21  8:59 ` [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables Ganapatrao Kulkarni
@ 2017-09-21 11:11   ` Robin Murphy
  2017-09-22 15:33     ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2017-09-21 11:11 UTC (permalink / raw)
To: linux-arm-kernel

On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
> The function __arm_lpae_alloc_pages is used to allocate memory for SMMU
> translation tables. Update it to allocate memory/pages from the
> proximity domain of the SMMU device.

AFAICS, data->pgd_size always works out to a power-of-two number of
pages, so I'm not sure why we've ever needed alloc_pages_exact() here. I
think we could simply use alloc_pages_node() and drop patch #1.

Robin.

> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> ---
>  drivers/iommu/io-pgtable-arm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index e8018a3..f6d01f6 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -215,8 +215,10 @@ static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
>  {
>  	struct device *dev = cfg->iommu_dev;
>  	dma_addr_t dma;
> -	void *pages = alloc_pages_exact(size, gfp | __GFP_ZERO);
> +	void *pages;
>
> +	pages = alloc_pages_exact_nid(dev_to_node(dev), size,
> +				      gfp | __GFP_ZERO);
>  	if (!pages)
>  		return NULL;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables
  2017-09-21 11:11   ` Robin Murphy
@ 2017-09-22 15:33     ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-22 15:33 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Sep 21, 2017 at 4:41 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
>> The function __arm_lpae_alloc_pages is used to allocate memory for SMMU
>> translation tables. Update it to allocate memory/pages from the
>> proximity domain of the SMMU device.
>
> AFAICS, data->pgd_size always works out to a power-of-two number of
> pages, so I'm not sure why we've ever needed alloc_pages_exact() here. I
> think we could simply use alloc_pages_node() and drop patch #1.

Thanks Robin, I think we can replace it with alloc_pages_node(). I will
change it as suggested in the next version.

>
> Robin.

[...]

thanks
Ganapat

^ permalink raw reply	[flat|nested] 21+ messages in thread
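For reference, the change agreed above would look roughly like the sketch
below: alloc_pages_node() on the SMMU's node, relying on the observation
that the table size is a power-of-two number of pages. This is an
illustrative sketch of the suggestion, not the patch that was eventually
merged; the surrounding DMA-mapping logic is elided.

/* Assumed kernel headers: linux/gfp.h, linux/device.h (sketch only) */
static void *__arm_lpae_alloc_pages(size_t size, gfp_t gfp,
				    struct io_pgtable_cfg *cfg)
{
	struct device *dev = cfg->iommu_dev;
	struct page *p;

	/*
	 * Table sizes here are a power-of-two number of pages, so plain
	 * page allocation on the SMMU's node suffices, and the "exact"
	 * variant (and hence patch #1) becomes unnecessary.
	 */
	p = alloc_pages_node(dev_to_node(dev), gfp | __GFP_ZERO,
			     get_order(size));
	if (!p)
		return NULL;

	/* ... DMA mapping/cleanup of the table proceeds as before ... */
	return page_address(p);
}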
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
  2017-09-21  8:59 ` [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit Ganapatrao Kulkarni
  2017-09-21  8:59 ` [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables Ganapatrao Kulkarni
@ 2017-09-21  8:59 ` Ganapatrao Kulkarni
  2017-09-21 11:58   ` Robin Murphy
  2017-09-21  8:59 ` [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages Ganapatrao Kulkarni
  ` (2 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-21  8:59 UTC (permalink / raw)
To: linux-arm-kernel

Introduce smmu_alloc_coherent and smmu_free_coherent functions to
allocate/free DMA coherent memory from the NUMA node associated with
the SMMU. Replace all calls of dmam_alloc_coherent with
smmu_alloc_coherent for SMMU stream tables and command queues.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
---
 drivers/iommu/arm-smmu-v3.c | 57 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index e67ba6c..bc4ba1f 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1158,6 +1158,50 @@ static void arm_smmu_init_bypass_stes(u64 *strtab, unsigned int nent)
 	}
 }
 
+static void *smmu_alloc_coherent(struct arm_smmu_device *smmu, size_t size,
+				dma_addr_t *dma_handle, gfp_t gfp)
+{
+	struct device *dev = smmu->dev;
+	void *pages;
+	dma_addr_t dma;
+	int numa_node = dev_to_node(dev);
+
+	pages = alloc_pages_exact_nid(numa_node, size, gfp | __GFP_ZERO);
+	if (!pages)
+		return NULL;
+
+	if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY)) {
+		dma = dma_map_single(dev, pages, size, DMA_TO_DEVICE);
+		if (dma_mapping_error(dev, dma))
+			goto out_free;
+		/*
+		 * We depend on the SMMU being able to work with any physical
+		 * address directly, so if the DMA layer suggests otherwise by
+		 * translating or truncating them, that bodes very badly...
+		 */
+		if (dma != virt_to_phys(pages))
+			goto out_unmap;
+	}
+
+	*dma_handle = (dma_addr_t)virt_to_phys(pages);
+	return pages;
+
+out_unmap:
+	dev_err(dev, "Cannot accommodate DMA translation for IOMMU page tables\n");
+	dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
+out_free:
+	free_pages_exact(pages, size);
+	return NULL;
+}
+
+static void smmu_free_coherent(struct arm_smmu_device *smmu, size_t size,
+				void *pages, dma_addr_t dma_handle)
+{
+	if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
+		dma_unmap_single(smmu->dev, dma_handle, size, DMA_TO_DEVICE);
+	free_pages_exact(pages, size);
+}
+
 static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 {
 	size_t size;
@@ -1172,7 +1216,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 	strtab = &cfg->strtab[(sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS];
 
 	desc->span = STRTAB_SPLIT + 1;
-	desc->l2ptr = dmam_alloc_coherent(smmu->dev, size, &desc->l2ptr_dma,
+	desc->l2ptr = smmu_alloc_coherent(smmu, size, &desc->l2ptr_dma,
 					  GFP_KERNEL | __GFP_ZERO);
 	if (!desc->l2ptr) {
 		dev_err(smmu->dev,
@@ -1487,7 +1531,7 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
 		struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg;
 
 		if (cfg->cdptr) {
-			dmam_free_coherent(smmu_domain->smmu->dev,
+			smmu_free_coherent(smmu,
 					   CTXDESC_CD_DWORDS << 3,
 					   cfg->cdptr,
 					   cfg->cdptr_dma);
@@ -1515,7 +1559,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 	if (asid < 0)
 		return asid;
 
-	cfg->cdptr = dmam_alloc_coherent(smmu->dev, CTXDESC_CD_DWORDS << 3,
+	cfg->cdptr = smmu_alloc_coherent(smmu, CTXDESC_CD_DWORDS << 3,
 					 &cfg->cdptr_dma,
 					 GFP_KERNEL | __GFP_ZERO);
 	if (!cfg->cdptr) {
@@ -1984,7 +2028,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
 {
 	size_t qsz = ((1 << q->max_n_shift) * dwords) << 3;
 
-	q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma, GFP_KERNEL);
+	q->base = smmu_alloc_coherent(smmu, qsz, &q->base_dma, GFP_KERNEL);
 	if (!q->base) {
 		dev_err(smmu->dev, "failed to allocate queue (0x%zx bytes)\n",
 			qsz);
@@ -2069,7 +2113,7 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
 			 size, smmu->sid_bits);
 
 	l1size = cfg->num_l1_ents * (STRTAB_L1_DESC_DWORDS << 3);
-	strtab = dmam_alloc_coherent(smmu->dev, l1size, &cfg->strtab_dma,
+	strtab = smmu_alloc_coherent(smmu, l1size, &cfg->strtab_dma,
 				     GFP_KERNEL | __GFP_ZERO);
 	if (!strtab) {
 		dev_err(smmu->dev,
@@ -2097,8 +2141,9 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 	u32 size;
 	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
 
+
 	size = (1 << smmu->sid_bits) * (STRTAB_STE_DWORDS << 3);
-	strtab = dmam_alloc_coherent(smmu->dev, size, &cfg->strtab_dma,
+	strtab = smmu_alloc_coherent(smmu, size, &cfg->strtab_dma,
 				     GFP_KERNEL | __GFP_ZERO);
 	if (!strtab) {
 		dev_err(smmu->dev,
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-09-21  8:59 ` [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues Ganapatrao Kulkarni
@ 2017-09-21 11:58   ` Robin Murphy
  2017-09-21 14:26     ` Christoph Hellwig
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Robin Murphy @ 2017-09-21 11:58 UTC (permalink / raw)
To: linux-arm-kernel

[+Christoph and Marek]

On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
> Introduce smmu_alloc_coherent and smmu_free_coherent functions to
> allocate/free DMA coherent memory from the NUMA node associated with
> the SMMU. Replace all calls of dmam_alloc_coherent with
> smmu_alloc_coherent for SMMU stream tables and command queues.

This doesn't work - not only do you lose the 'managed' aspect and risk
leaking various tables on probe failure or device removal, but more
importantly, unless you add DMA syncs around all the CPU accesses to the
tables, you lose the critical 'coherent' aspect, and that's a horribly
invasive change that I really don't want to make.

Christoph, Marek; how reasonable do you think it is to expect
dma_alloc_coherent() to be inherently NUMA-aware on NUMA-capable
systems? SWIOTLB looks fairly straightforward to fix up (for the simple
allocation case; I'm not sure it's even worth it for bounce-buffering),
but the likes of CMA might be a little trickier...

Robin.

[...]

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-09-21 11:58   ` Robin Murphy
@ 2017-09-21 14:26     ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2017-09-21 14:26 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Sep 21, 2017 at 12:58:04PM +0100, Robin Murphy wrote:
> Christoph, Marek; how reasonable do you think it is to expect
> dma_alloc_coherent() to be inherently NUMA-aware on NUMA-capable
> systems? SWIOTLB looks fairly straightforward to fix up (for the simple
> allocation case; I'm not sure it's even worth it for bounce-buffering),
> but the likes of CMA might be a little trickier...

I think allocating data node-local to dev is a good default. I'm not
sure if we'd still need a version that takes an explicit node, though.
On the one hand, devices like NVMe or RDMA NICs have queues that are
assigned to specific CPUs and thus have an inherent affinity to given
nodes. On the other hand, we'd still need to access the PCIe device, so
for it to make sense we'd need to access the DMA memory a lot more from
the host than from the device, and I'm not sure if we ever have devices
where that is the case (which would not be optimal to start with).

^ permalink raw reply	[flat|nested] 21+ messages in thread
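As a concrete illustration of the node-local default being discussed, the
core of such an allocator could look like the sketch below. This is
hypothetical glue code, not the actual SWIOTLB or dma-direct
implementation, and the helper name is invented:

/* Assumed kernel headers: linux/gfp.h, linux/device.h (sketch only) */
static struct page *dma_alloc_pages_node_local(struct device *dev,
					       size_t size, gfp_t gfp)
{
	/*
	 * Hypothetical helper: default coherent allocations to the
	 * device's node. dev_to_node() may return NUMA_NO_NODE, which
	 * alloc_pages_node() already treats as "closest to the caller".
	 */
	return alloc_pages_node(dev_to_node(dev), gfp | __GFP_ZERO,
				get_order(size));
}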
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-09-21 11:58   ` Robin Murphy
  2017-09-21 14:26     ` Christoph Hellwig
@ 2017-09-29 12:13     ` Marek Szyprowski
  0 siblings, 0 replies; 21+ messages in thread
From: Marek Szyprowski @ 2017-09-29 12:13 UTC (permalink / raw)
To: linux-arm-kernel

Hi Robin,

On 2017-09-21 13:58, Robin Murphy wrote:
> [+Christoph and Marek]
>
> On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
>> Introduce smmu_alloc_coherent and smmu_free_coherent functions to
>> allocate/free DMA coherent memory from the NUMA node associated with
>> the SMMU. Replace all calls of dmam_alloc_coherent with
>> smmu_alloc_coherent for SMMU stream tables and command queues.
>
> This doesn't work - not only do you lose the 'managed' aspect and risk
> leaking various tables on probe failure or device removal, but more
> importantly, unless you add DMA syncs around all the CPU accesses to the
> tables, you lose the critical 'coherent' aspect, and that's a horribly
> invasive change that I really don't want to make.
>
> Christoph, Marek; how reasonable do you think it is to expect
> dma_alloc_coherent() to be inherently NUMA-aware on NUMA-capable
> systems? SWIOTLB looks fairly straightforward to fix up (for the simple
> allocation case; I'm not sure it's even worth it for bounce-buffering),
> but the likes of CMA might be a little trickier...

I'm not sure if there is any dma-coherent implementation that is NUMA
aware. Maybe the author should provide some benchmarks which show that
those structures should be allocated in a NUMA-aware way?

On the other hand, it is not that hard to add the required dma_sync_*
calls around all the code which updates those tables.

> ...

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 21+ messages in thread
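To show what those syncs would look like, here is a minimal sketch of a
CPU update to a streaming-mapped table entry, with hypothetical function
and parameter names: the kind of pattern that would have to wrap every
table write if the coherent API were dropped, which is why Robin calls
the change invasive.

/* Hypothetical wrapper, not driver code (assumes linux/dma-mapping.h) */
static void smmu_write_table_entry(struct device *dev, __le64 *ent,
				   dma_addr_t ent_dma, u64 val)
{
	/* Hand ownership of the entry back to the CPU... */
	dma_sync_single_for_cpu(dev, ent_dma, sizeof(*ent), DMA_TO_DEVICE);

	*ent = cpu_to_le64(val);

	/* ...then back to the device so the SMMU observes the update. */
	dma_sync_single_for_device(dev, ent_dma, sizeof(*ent),
				   DMA_TO_DEVICE);
}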
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-09-21 11:58   ` Robin Murphy
  2017-09-21 14:26     ` Christoph Hellwig
  2017-09-29 12:13     ` Marek Szyprowski
@ 2017-10-04 13:53     ` Ganapatrao Kulkarni
  2017-10-18 13:36       ` Robin Murphy
  2 siblings, 1 reply; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-10-04 13:53 UTC (permalink / raw)
To: linux-arm-kernel

Hi Robin,

On Thu, Sep 21, 2017 at 5:28 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> [+Christoph and Marek]
>
> On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
>> Introduce smmu_alloc_coherent and smmu_free_coherent functions to
>> allocate/free DMA coherent memory from the NUMA node associated with
>> the SMMU. Replace all calls of dmam_alloc_coherent with
>> smmu_alloc_coherent for SMMU stream tables and command queues.
>
> This doesn't work - not only do you lose the 'managed' aspect and risk
> leaking various tables on probe failure or device removal, but more
> importantly, unless you add DMA syncs around all the CPU accesses to the
> tables, you lose the critical 'coherent' aspect, and that's a horribly
> invasive change that I really don't want to make.

This implementation is similar to the function used to allocate memory
for the translation tables. Why do you see it affecting the stream
tables but not the page tables? At runtime, both tables are accessed by
the SMMU only.

As said in the cover letter, having the stream table on the respective
NUMA node yields around a 30% performance improvement!
Please suggest if there is any better way to address this issue.

>
> Christoph, Marek; how reasonable do you think it is to expect
> dma_alloc_coherent() to be inherently NUMA-aware on NUMA-capable
> systems? SWIOTLB looks fairly straightforward to fix up (for the simple
> allocation case; I'm not sure it's even worth it for bounce-buffering),
> but the likes of CMA might be a little trickier...
>
> Robin.

[...]

thanks
Ganapat

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-10-04 13:53     ` Ganapatrao Kulkarni
@ 2017-10-18 13:36       ` Robin Murphy
  2017-11-06  9:04         ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2017-10-18 13:36 UTC (permalink / raw)
To: linux-arm-kernel

On 04/10/17 14:53, Ganapatrao Kulkarni wrote:
> Hi Robin,
>
> On Thu, Sep 21, 2017 at 5:28 PM, Robin Murphy <robin.murphy@arm.com> wrote:
>> [+Christoph and Marek]
>>
>> On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
>>> Introduce smmu_alloc_coherent and smmu_free_coherent functions to
>>> allocate/free DMA coherent memory from the NUMA node associated with
>>> the SMMU. Replace all calls of dmam_alloc_coherent with
>>> smmu_alloc_coherent for SMMU stream tables and command queues.
>>
>> This doesn't work - not only do you lose the 'managed' aspect and risk
>> leaking various tables on probe failure or device removal, but more
>> importantly, unless you add DMA syncs around all the CPU accesses to the
>> tables, you lose the critical 'coherent' aspect, and that's a horribly
>> invasive change that I really don't want to make.
>
> This implementation is similar to the function used to allocate memory
> for the translation tables.

The concept is similar, yes, and would work if implemented *correctly*
with the aforementioned comprehensive and hugely invasive changes. The
implementation as presented in this patch, however, is incomplete and
badly broken.

By way of comparison, the io-pgtable implementations contain all the
necessary dma_sync_* calls, never relied on devres, and only have one
DMA direction to worry about (hint: the queues don't all work
identically). There are also a couple of practical reasons for using
streaming mappings with the DMA == phys restriction there - tracking
both the CPU and DMA addresses for each table would significantly
increase the memory overhead, and using the cacheable linear map
address in all cases sidesteps any potential problems with the atomic
PTE updates. Neither of those concerns apply to the SMMUv3 data
structures, which are textbook coherent DMA allocations (being tied to
the lifetime of the device, rather than transient).

> Why do you see it affecting the stream tables but not the page tables?
> At runtime, both tables are accessed by the SMMU only.
>
> As said in the cover letter, having the stream table on the respective
> NUMA node yields around a 30% performance improvement!
> Please suggest if there is any better way to address this issue.

I fully agree that NUMA-aware allocations are a worthwhile thing that we
want. I just don't like the idea of going around individual drivers
replacing coherent API usage with bodged-up streaming mappings - I
really think it's worth making the effort to tackle it once, in the
proper place, in a way that benefits all users together.

Robin.

[...]

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues
  2017-10-18 13:36       ` Robin Murphy
@ 2017-11-06  9:04         ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-11-06  9:04 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Oct 18, 2017 at 7:06 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 04/10/17 14:53, Ganapatrao Kulkarni wrote:

[...]

>>> Christoph, Marek; how reasonable do you think it is to expect
>>> dma_alloc_coherent() to be inherently NUMA-aware on NUMA-capable
>>> systems? SWIOTLB looks fairly straightforward to fix up (for the simple
>>> allocation case; I'm not sure it's even worth it for bounce-buffering),
>>> but the likes of CMA might be a little trickier...

IIUC, having per-node DMA allocation may become an issue for 32-bit PCI
devices connected to NODE 1 on IOMMU-less platforms.
Most platforms may have NODE 1 RAM located beyond 4GB, and allocating
DMA memory beyond 32 bits for NODE 1 (and above) devices may make
32-bit PCI devices unusable.

DMA/IOMMU experts, please advise?

[...]

thanks
Ganapat

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages
  2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
  ` (2 preceding siblings ...)
  2017-09-21  8:59 ` [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues Ganapatrao Kulkarni
@ 2017-09-21  8:59 ` Ganapatrao Kulkarni
  2017-09-21 11:41   ` Robin Murphy
  2017-10-18 13:28 ` [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Will Deacon
  2018-08-22 13:44 ` John Garry
  5 siblings, 1 reply; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-21  8:59 UTC (permalink / raw)
To: linux-arm-kernel

Change the function __iommu_dma_alloc_pages to allocate memory/pages
for DMA from the respective device's NUMA node.

Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
---
 drivers/iommu/dma-iommu.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9d1cebe..0626b58 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -428,20 +428,21 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
 	kvfree(pages);
 }
 
-static struct page **__iommu_dma_alloc_pages(unsigned int count,
-		unsigned long order_mask, gfp_t gfp)
+static struct page **__iommu_dma_alloc_pages(struct device *dev,
+		unsigned int count, unsigned long order_mask, gfp_t gfp)
 {
 	struct page **pages;
 	unsigned int i = 0, array_size = count * sizeof(*pages);
+	int numa_node = dev_to_node(dev);
 
 	order_mask &= (2U << MAX_ORDER) - 1;
 	if (!order_mask)
 		return NULL;
 
 	if (array_size <= PAGE_SIZE)
-		pages = kzalloc(array_size, GFP_KERNEL);
+		pages = kzalloc_node(array_size, GFP_KERNEL, numa_node);
 	else
-		pages = vzalloc(array_size);
+		pages = vzalloc_node(array_size, numa_node);
 	if (!pages)
 		return NULL;
 
@@ -462,8 +463,9 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
 		unsigned int order = __fls(order_mask);
 
 		order_size = 1U << order;
-		page = alloc_pages((order_mask - order_size) ?
-				   gfp | __GFP_NORETRY : gfp, order);
+		page = alloc_pages_node(numa_node,
+					(order_mask - order_size) ?
+					gfp | __GFP_NORETRY : gfp, order);
 		if (!page)
 			continue;
 		if (!order)
@@ -548,7 +550,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 		alloc_sizes = min_size;
 
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
+					gfp);
 	if (!pages)
 		return NULL;
-- 
2.9.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread
* [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages
  2017-09-21  8:59 ` [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages Ganapatrao Kulkarni
@ 2017-09-21 11:41   ` Robin Murphy
  2017-09-22 15:44     ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2017-09-21 11:41 UTC (permalink / raw)
To: linux-arm-kernel

On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
> Change the function __iommu_dma_alloc_pages to allocate memory/pages
> for DMA from the respective device's NUMA node.
>
> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
> ---
>  drivers/iommu/dma-iommu.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 9d1cebe..0626b58 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -428,20 +428,21 @@ static void __iommu_dma_free_pages(struct page **pages, int count)
>  	kvfree(pages);
>  }
>
> -static struct page **__iommu_dma_alloc_pages(unsigned int count,
> -		unsigned long order_mask, gfp_t gfp)
> +static struct page **__iommu_dma_alloc_pages(struct device *dev,
> +		unsigned int count, unsigned long order_mask, gfp_t gfp)
>  {
>  	struct page **pages;
>  	unsigned int i = 0, array_size = count * sizeof(*pages);
> +	int numa_node = dev_to_node(dev);
>
>  	order_mask &= (2U << MAX_ORDER) - 1;
>  	if (!order_mask)
>  		return NULL;
>
>  	if (array_size <= PAGE_SIZE)
> -		pages = kzalloc(array_size, GFP_KERNEL);
> +		pages = kzalloc_node(array_size, GFP_KERNEL, numa_node);
>  	else
> -		pages = vzalloc(array_size);
> +		pages = vzalloc_node(array_size, numa_node);

kvzalloc{,_node}() didn't exist when this code was first written, but it
does now - since you're touching it you may as well get rid of the whole
if-else and array_size local.

Further nit: some of the indentation below is a bit messed up.

Robin.

>  	if (!pages)
>  		return NULL;
>
> @@ -462,8 +463,9 @@ static struct page **__iommu_dma_alloc_pages(unsigned int count,
>  		unsigned int order = __fls(order_mask);
>
>  		order_size = 1U << order;
> -		page = alloc_pages((order_mask - order_size) ?
> -				   gfp | __GFP_NORETRY : gfp, order);
> +		page = alloc_pages_node(numa_node,
> +					(order_mask - order_size) ?
> +					gfp | __GFP_NORETRY : gfp, order);
>  		if (!page)
>  			continue;
>  		if (!order)
> @@ -548,7 +550,8 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
>  		alloc_sizes = min_size;
>
>  	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
> -	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
> +	pages = __iommu_dma_alloc_pages(dev, count, alloc_sizes >> PAGE_SHIFT,
> +					gfp);
>  	if (!pages)
>  		return NULL;
>

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages
  2017-09-21 11:41   ` Robin Murphy
@ 2017-09-22 15:44     ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2017-09-22 15:44 UTC (permalink / raw)
To: linux-arm-kernel

Hi Robin,

On Thu, Sep 21, 2017 at 5:11 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
>> Change the function __iommu_dma_alloc_pages to allocate memory/pages
>> for DMA from the respective device's NUMA node.
>>
[...]
>>  	if (array_size <= PAGE_SIZE)
>> -		pages = kzalloc(array_size, GFP_KERNEL);
>> +		pages = kzalloc_node(array_size, GFP_KERNEL, numa_node);
>>  	else
>> -		pages = vzalloc(array_size);
>> +		pages = vzalloc_node(array_size, numa_node);
>
> kvzalloc{,_node}() didn't exist when this code was first written, but it
> does now - since you're touching it you may as well get rid of the whole
> if-else and array_size local.

Thanks, I will update it in the next version.

>
> Further nit: some of the indentation below is a bit messed up.

OK, will fix it.

>
> Robin.

[...]

thanks
Ganapat

^ permalink raw reply	[flat|nested] 21+ messages in thread
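A rough sketch of the simplification agreed above, for illustration:
kvzalloc_node() subsumes the kzalloc/vzalloc split. This assumes the
unposted next version rather than reflecting any merged code, and the
page allocation loop is elided.

/* Assumed kernel headers: linux/mm.h, linux/gfp.h (sketch only) */
static struct page **__iommu_dma_alloc_pages(struct device *dev,
		unsigned int count, unsigned long order_mask, gfp_t gfp)
{
	struct page **pages;
	int numa_node = dev_to_node(dev);

	order_mask &= (2U << MAX_ORDER) - 1;
	if (!order_mask)
		return NULL;

	/*
	 * kvzalloc_node() tries kmalloc first and falls back to vmalloc,
	 * replacing the old array_size <= PAGE_SIZE special case and the
	 * array_size local.
	 */
	pages = kvzalloc_node(count * sizeof(*pages), GFP_KERNEL, numa_node);
	if (!pages)
		return NULL;

	/* ... page allocation loop continues as in the posted patch ... */
	return pages;
}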
* [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems
  2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
  ` (3 preceding siblings ...)
  2017-09-21  8:59 ` [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages Ganapatrao Kulkarni
@ 2017-10-18 13:28 ` Will Deacon
  2018-08-22 13:44 ` John Garry
  5 siblings, 0 replies; 21+ messages in thread
From: Will Deacon @ 2017-10-18 13:28 UTC (permalink / raw)
To: linux-arm-kernel

Hi Ganapat,

On Thu, Sep 21, 2017 at 02:29:18PM +0530, Ganapatrao Kulkarni wrote:
> Add NUMA-aware memory allocations for IOMMU DMA allocation and for the
> memory allocated for SMMU stream tables, page walk tables and command
> queues.
>
> With this series, iperf testing on ThunderX2 with a 40G NIC card on
> NODE 1 PCI showed the same performance (around 30% improvement) as
> NODE 0.

Are you planning to repost this series? The idea looks good, but it
needs some rework before it can be merged.

Thanks,

Will

^ permalink raw reply	[flat|nested] 21+ messages in thread
* [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems
  2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
                   ` (4 preceding siblings ...)
  2017-10-18 13:28 ` [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Will Deacon
@ 2018-08-22 13:44 ` John Garry
  2018-08-22 14:56   ` Robin Murphy
  5 siblings, 1 reply; 21+ messages in thread
From: John Garry @ 2018-08-22 13:44 UTC (permalink / raw)
To: linux-arm-kernel

On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:
> Adding numa aware memory allocations used for iommu dma allocation and
> memory allocated for SMMU stream tables, page walk tables and command queues.
>
> With this patch, iperf testing on ThunderX2, with 40G NIC card on
> NODE 1 PCI, showed the same performance (around 30% improvement) as NODE 0.
>
> Ganapatrao Kulkarni (4):
>   mm: move function alloc_pages_exact_nid out of __meminit
>   numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
>     translation tables
>   iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
>     command queues
>   iommu/dma, numa: Use NUMA aware memory allocations in
>     __iommu_dma_alloc_pages
>
>  drivers/iommu/arm-smmu-v3.c    | 57 +++++++++++++++++++++++++++++++++++++-----
>  drivers/iommu/dma-iommu.c      | 17 +++++++------
>  drivers/iommu/io-pgtable-arm.c |  4 ++-
>  include/linux/gfp.h            |  2 +-
>  mm/page_alloc.c                |  3 ++-
>  5 files changed, 67 insertions(+), 16 deletions(-)
>

Hi Ganapatrao,

Have you any plans for further work on this patchset? I have not seen
anything since this v1 was posted and discussed.

Thanks,
John

> --
> 2.9.4
>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> .
>

^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems
  2018-08-22 13:44 ` John Garry
@ 2018-08-22 14:56   ` Robin Murphy
  2018-08-22 16:07     ` John Garry
  0 siblings, 1 reply; 21+ messages in thread
From: Robin Murphy @ 2018-08-22 14:56 UTC (permalink / raw)
To: linux-arm-kernel

Hi John,

On 22/08/18 14:44, John Garry wrote:
> On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:
>> Adding numa aware memory allocations used for iommu dma allocation and
>> memory allocated for SMMU stream tables, page walk tables and command
>> queues.
>>
>> With this patch, iperf testing on ThunderX2, with 40G NIC card on
>> NODE 1 PCI, showed the same performance (around 30% improvement) as NODE 0.
>>
>> Ganapatrao Kulkarni (4):
>>   mm: move function alloc_pages_exact_nid out of __meminit
>>   numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
>>     translation tables
>>   iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
>>     command queues
>>   iommu/dma, numa: Use NUMA aware memory allocations in
>>     __iommu_dma_alloc_pages
>>
>>  drivers/iommu/arm-smmu-v3.c    | 57 +++++++++++++++++++++++++++++++++++++-----
>>  drivers/iommu/dma-iommu.c      | 17 +++++++------
>>  drivers/iommu/io-pgtable-arm.c |  4 ++-
>>  include/linux/gfp.h            |  2 +-
>>  mm/page_alloc.c                |  3 ++-
>>  5 files changed, 67 insertions(+), 16 deletions(-)
>>
>
> Hi Ganapatrao,
>
> Have you any plans for further work on this patchset? I have not seen
> anything since this v1 was posted and discussed.

Looks like I ended up doing the version of the io-pgtable change that I
suggested here, which was merged recently (4b123757eeaa). Patch #3
should also be effectively obsolete now since the SWIOTLB/dma-direct
rework (21f237e4d085). Apparently I also started reworking patch #4 in
my tree at some point but sidelined it - I think that was at least
partly due to another thread[1] which made it seem less clear-cut
whether this is always the right thing to do.

Robin.

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1693026.html

^ permalink raw reply [flat|nested] 21+ messages in thread
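For reference, the merged io-pgtable change (4b123757eeaa) essentially allocates the translation table pages on the node of the SMMU device already recorded in the io_pgtable_cfg. A rough from-memory sketch of its core, not the verbatim merged hunk:

	/* In __arm_lpae_alloc_pages(); dev is taken from cfg->iommu_dev */
	struct device *dev = cfg->iommu_dev;
	struct page *p;

	p = alloc_pages_node(dev ? dev_to_node(dev) : NUMA_NO_NODE,
			     gfp | __GFP_ZERO, order);

Since dev_to_node() returns NUMA_NO_NODE for devices with no node affinity, and alloc_pages_node() then falls back to the local node, non-NUMA systems should see no change in behaviour.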
* [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems
  2018-08-22 14:56 ` Robin Murphy
@ 2018-08-22 16:07   ` John Garry
  2018-08-22 17:57     ` Ganapatrao Kulkarni
  0 siblings, 1 reply; 21+ messages in thread
From: John Garry @ 2018-08-22 16:07 UTC (permalink / raw)
To: linux-arm-kernel

On 22/08/2018 15:56, Robin Murphy wrote:
> Hi John,
>
> On 22/08/18 14:44, John Garry wrote:
>> On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:
>>> Adding numa aware memory allocations used for iommu dma allocation and
>>> memory allocated for SMMU stream tables, page walk tables and command
>>> queues.
>>>
>>> With this patch, iperf testing on ThunderX2, with 40G NIC card on
>>> NODE 1 PCI, showed the same performance (around 30% improvement) as NODE 0.
>>>
>>> Ganapatrao Kulkarni (4):
>>>   mm: move function alloc_pages_exact_nid out of __meminit
>>>   numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
>>>     translation tables
>>>   iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
>>>     command queues
>>>   iommu/dma, numa: Use NUMA aware memory allocations in
>>>     __iommu_dma_alloc_pages
>>>
>>>  drivers/iommu/arm-smmu-v3.c    | 57 +++++++++++++++++++++++++++++++++++++-----
>>>  drivers/iommu/dma-iommu.c      | 17 +++++++------
>>>  drivers/iommu/io-pgtable-arm.c |  4 ++-
>>>  include/linux/gfp.h            |  2 +-
>>>  mm/page_alloc.c                |  3 ++-
>>>  5 files changed, 67 insertions(+), 16 deletions(-)
>>>
>>
>> Hi Ganapatrao,
>>
>> Have you any plans for further work on this patchset? I have not seen
>> anything since this v1 was posted and discussed.

Hi Robin,

Thanks for the info. I thought I remembered 4b12 but couldn't put my
finger on it.

> Looks like I ended up doing the version of the io-pgtable change that I
> suggested here, which was merged recently (4b123757eeaa). Patch #3
> should also be effectively obsolete now since the SWIOTLB/dma-direct
> rework (21f237e4d085). Apparently I also started reworking patch #4 in
> my tree at some point but sidelined it - I think that was at least
> partly due to another thread[1] which made it seem less clear-cut
> whether this is always the right thing to do.

Right, so #4 seems less straightforward and is not directly related to
the IOMMU driver anyway.

Cheers,
John

>
> Robin.
>
> [1]
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1693026.html
>
> .
>

^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems
  2018-08-22 16:07 ` John Garry
@ 2018-08-22 17:57   ` Ganapatrao Kulkarni
  0 siblings, 0 replies; 21+ messages in thread
From: Ganapatrao Kulkarni @ 2018-08-22 17:57 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Aug 22, 2018 at 9:08 AM John Garry <john.garry@huawei.com> wrote:
>
> On 22/08/2018 15:56, Robin Murphy wrote:
> > Hi John,
> >
> > On 22/08/18 14:44, John Garry wrote:
> >> On 21/09/2017 09:59, Ganapatrao Kulkarni wrote:
> >>> Adding numa aware memory allocations used for iommu dma allocation and
> >>> memory allocated for SMMU stream tables, page walk tables and command
> >>> queues.
> >>>
> >>> With this patch, iperf testing on ThunderX2, with 40G NIC card on
> >>> NODE 1 PCI, showed the same performance (around 30% improvement) as NODE 0.
> >>>
> >>> Ganapatrao Kulkarni (4):
> >>>   mm: move function alloc_pages_exact_nid out of __meminit
> >>>   numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu
> >>>     translation tables
> >>>   iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and
> >>>     command queues
> >>>   iommu/dma, numa: Use NUMA aware memory allocations in
> >>>     __iommu_dma_alloc_pages
> >>>
> >>>  drivers/iommu/arm-smmu-v3.c    | 57 +++++++++++++++++++++++++++++++++++++-----
> >>>  drivers/iommu/dma-iommu.c      | 17 +++++++------
> >>>  drivers/iommu/io-pgtable-arm.c |  4 ++-
> >>>  include/linux/gfp.h            |  2 +-
> >>>  mm/page_alloc.c                |  3 ++-
> >>>  5 files changed, 67 insertions(+), 16 deletions(-)
> >>>
> >>
> >> Hi Ganapatrao,
> >>
> >> Have you any plans for further work on this patchset? I have not seen
> >> anything since this v1 was posted and discussed.
>
> Hi Robin,
>
> Thanks for the info. I thought I remembered 4b12 but couldn't put my
> finger on it.
>
> > Looks like I ended up doing the version of the io-pgtable change that I
> > suggested here, which was merged recently (4b123757eeaa). Patch #3
> > should also be effectively obsolete now since the SWIOTLB/dma-direct
> > rework (21f237e4d085). Apparently I also started reworking patch #4 in
> > my tree at some point but sidelined it - I think that was at least
> > partly due to another thread[1] which made it seem less clear-cut
> > whether this is always the right thing to do.
>
> Right, so #4 seems less straightforward and is not directly related to
> the IOMMU driver anyway.
>

Thanks, Robin, for picking this series back up. I couldn't follow up on
it due to other tasks.

> Cheers,
> John
>
> >
> > Robin.
> >
> > [1]
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1693026.html
> >
> > .
> >

Thanks,
Ganapat

^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2018-08-22 17:57 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-21  8:59 [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Ganapatrao Kulkarni
2017-09-21  8:59 ` [PATCH 1/4] mm: move function alloc_pages_exact_nid out of __meminit Ganapatrao Kulkarni
2017-09-26 13:35   ` Michal Hocko
2017-09-21  8:59 ` [PATCH 2/4] numa, iommu/io-pgtable-arm: Use NUMA aware memory allocation for smmu translation tables Ganapatrao Kulkarni
2017-09-21 11:11   ` Robin Murphy
2017-09-22 15:33     ` Ganapatrao Kulkarni
2017-09-21  8:59 ` [PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues Ganapatrao Kulkarni
2017-09-21 11:58   ` Robin Murphy
2017-09-21 14:26     ` Christoph Hellwig
2017-09-29 12:13       ` Marek Szyprowski
2017-10-04 13:53         ` Ganapatrao Kulkarni
2017-10-18 13:36           ` Robin Murphy
2017-11-06  9:04             ` Ganapatrao Kulkarni
2017-09-21  8:59 ` [PATCH 4/4] iommu/dma, numa: Use NUMA aware memory allocations in __iommu_dma_alloc_pages Ganapatrao Kulkarni
2017-09-21 11:41   ` Robin Murphy
2017-09-22 15:44     ` Ganapatrao Kulkarni
2017-10-18 13:28 ` [PATCH 0/4] numa, iommu/smmu: IOMMU/SMMU driver optimization for NUMA systems Will Deacon
2018-08-22 13:44   ` John Garry
2018-08-22 14:56     ` Robin Murphy
2018-08-22 16:07       ` John Garry
2018-08-22 17:57         ` Ganapatrao Kulkarni