* [PATCH for-rc 0/3] Fixes for 64K page size support
@ 2023-11-15 19:17 Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz Shiraz Saleem
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Shiraz Saleem @ 2023-11-15 19:17 UTC (permalink / raw)
To: jgg, leon, linux-rdma; +Cc: Shiraz Saleem
This is a three-patch series.
The first patch corrects the core umem iterator to use __sg_advance to skip
any 4k HCA pages that precede the MR data.
The second patch corrects an iWarp issue: the SQ queue memory must be
PAGE_SIZE aligned.
The third patch corrects the irdma driver's use of
ib_umem_find_best_pgsz(). QP and CQ allocations pass PAGE_SIZE as the
only bitmap bit; this is incorrect and should use the exact 4k value.
Mike Marciniszyn (3):
RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
RDMA/irdma: Ensure iWarp QP queue memory is OS page aligned
RDMA/irdma: Fix support for 64k pages
drivers/infiniband/core/umem.c | 6 ------
drivers/infiniband/hw/irdma/verbs.c | 7 ++++++-
include/rdma/ib_umem.h | 4 +++-
include/rdma/ib_verbs.h | 1 +
4 files changed, 10 insertions(+), 8 deletions(-)
--
1.8.3.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-15 19:17 [PATCH for-rc 0/3] Fixes for 64K page size support Shiraz Saleem
@ 2023-11-15 19:17 ` Shiraz Saleem
2023-11-16 17:12 ` Jason Gunthorpe
2023-11-17 12:13 ` Zhu Yanjun
2023-11-15 19:17 ` [PATCH for-rc 2/3] RDMA/irdma: Ensure iWarp QP queue memory is OS page aligned Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 3/3] RDMA/irdma: Fix support for 64k pages Shiraz Saleem
2 siblings, 2 replies; 11+ messages in thread
From: Shiraz Saleem @ 2023-11-15 19:17 UTC (permalink / raw)
To: jgg, leon, linux-rdma; +Cc: Mike Marciniszyn, Shiraz Saleem
From: Mike Marciniszyn <mike.marciniszyn@intel.com>
64k pages introduce the situation in this diagram when the HCA
4k page size is being used:
+-------------------------------------------+ <--- 64k aligned VA
| |
| HCA 4k page |
| |
+-------------------------------------------+
| o |
| |
| o |
| |
| o |
+-------------------------------------------+
| |
| HCA 4k page |
| |
+-------------------------------------------+ <--- Live HCA page
|OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
| | <--- VA
| MR data |
+-------------------------------------------+
| |
| HCA 4k page |
| |
+-------------------------------------------+
| o |
| |
| o |
| |
| o |
+-------------------------------------------+
| |
| HCA 4k page |
| |
+-------------------------------------------+
The VA addresses coming from rdma-core in this diagram can
be arbitrary, but with 64k pages the MR data may be preceded by
some number of HCA 4k pages and followed by some number of HCA 4k
pages within the same 64k OS pages.
The current iterator doesn't account for either the preceding
4k pages or the following 4k pages.
Fix the issue by extending struct ib_block_iter to contain
the number of DMA blocks, as comment [1] suggests, and
by augmenting the macro's limit test to count that value down.
This prevents returning the extra pages that follow the user MR data.
Fix the preceding pages by using the __sg_advance field to start
at the first 4k page containing MR data.
This fix allows for the elimination of the small page crutch noted
in the Fixes.
Fixes: 10c75ccb54e4 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/rdma/ib_umem.h#n91 [1]
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
---
drivers/infiniband/core/umem.c | 6 ------
include/rdma/ib_umem.h | 4 +++-
include/rdma/ib_verbs.h | 1 +
3 files changed, 4 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f9ab671c8eda..07c571c7b699 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
return page_size;
}
- /* rdma_for_each_block() has a bug if the page size is smaller than the
- * page size used to build the umem. For now prevent smaller page sizes
- * from being returned.
- */
- pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
-
/* The best result is the smallest page size that results in the minimum
* number of required pages. Compute the largest page size that could
* work based on VA address bits that don't change.
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 95896472a82b..e775d1b4910c 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
{
__rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
umem->sgt_append.sgt.nents, pgsz);
+ biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
+ biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
}
/**
@@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
*/
#define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
- __rdma_block_iter_next(biter);)
+ __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
#ifdef CONFIG_INFINIBAND_USER_MEM
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fb1a2d6b1969..b7b6b58dd348 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2850,6 +2850,7 @@ struct ib_block_iter {
/* internal states */
struct scatterlist *__sg; /* sg holding the current aligned block */
dma_addr_t __dma_addr; /* unaligned DMA address of this block */
+ size_t __sg_numblocks; /* ib_umem_num_dma_blocks() */
unsigned int __sg_nents; /* number of SG entries */
unsigned int __sg_advance; /* number of bytes to advance in sg in next step */
unsigned int __pg_bit; /* alignment of current block */
--
1.8.3.1
* [PATCH for-rc 2/3] RDMA/irdma: Ensure iWarp QP queue memory is OS page aligned
2023-11-15 19:17 [PATCH for-rc 0/3] Fixes for 64K page size support Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz Shiraz Saleem
@ 2023-11-15 19:17 ` Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 3/3] RDMA/irdma: Fix support for 64k pages Shiraz Saleem
2 siblings, 0 replies; 11+ messages in thread
From: Shiraz Saleem @ 2023-11-15 19:17 UTC (permalink / raw)
To: jgg, leon, linux-rdma; +Cc: Mike Marciniszyn, Shiraz Saleem
From: Mike Marciniszyn <mike.marciniszyn@intel.com>
The SQ is shared between kernel and user by storing
the kernel page pointer and passing that to kmap_atomic().
This requires that the queue memory be PAGE_SIZE aligned.
Fix by adding an iWarp-specific alignment check.
Fixes: e965ef0e7b2c ("RDMA/irdma: Split QP handler into irdma_reg_user_mr_type_qp")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
---
drivers/infiniband/hw/irdma/verbs.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 6415ada63c5f..b072aa5179e0 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -2934,6 +2934,11 @@ static int irdma_reg_user_mr_type_qp(struct irdma_mem_reg_req req,
int err;
u8 lvl;
+ /* iWarp: Catch page not starting on OS page boundary */
+ if (!rdma_protocol_roce(&iwdev->ibdev, 1) &&
+ ib_umem_offset(iwmr->region))
+ return -EINVAL;
+
total = req.sq_pages + req.rq_pages + 1;
if (total > iwmr->page_cnt)
return -EINVAL;
--
1.8.3.1
* [PATCH for-rc 3/3] RDMA/irdma: Fix support for 64k pages
2023-11-15 19:17 [PATCH for-rc 0/3] Fixes for 64K page size support Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 2/3] RDMA/irdma: Ensure iWarp QP queue memory is OS page aligned Shiraz Saleem
@ 2023-11-15 19:17 ` Shiraz Saleem
2 siblings, 0 replies; 11+ messages in thread
From: Shiraz Saleem @ 2023-11-15 19:17 UTC (permalink / raw)
To: jgg, leon, linux-rdma; +Cc: Mike Marciniszyn, Shiraz Saleem
From: Mike Marciniszyn <mike.marciniszyn@intel.com>
Virtual QP and CQ require a 4K HW page size but the driver
passes PAGE_SIZE to ib_umem_find_best_pgsz() instead.
Fix this by using the appropriate 4k value in the bitmap passed to
ib_umem_find_best_pgsz().
Fixes: 693a5386eff0 ("RDMA/irdma: Split mr alloc and free into new functions")
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
---
drivers/infiniband/hw/irdma/verbs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index b072aa5179e0..7c31d2d606bb 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -2902,7 +2902,7 @@ static struct irdma_mr *irdma_alloc_iwmr(struct ib_umem *region,
iwmr->type = reg_type;
pgsz_bitmap = (reg_type == IRDMA_MEMREG_TYPE_MEM) ?
- iwdev->rf->sc_dev.hw_attrs.page_size_cap : PAGE_SIZE;
+ iwdev->rf->sc_dev.hw_attrs.page_size_cap : SZ_4K;
iwmr->page_size = ib_umem_find_best_pgsz(region, pgsz_bitmap, virt);
if (unlikely(!iwmr->page_size)) {
--
1.8.3.1
* Re: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-15 19:17 ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz Shiraz Saleem
@ 2023-11-16 17:12 ` Jason Gunthorpe
2023-11-19 22:24 ` Saleem, Shiraz
2023-11-17 12:13 ` Zhu Yanjun
1 sibling, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2023-11-16 17:12 UTC (permalink / raw)
To: Shiraz Saleem; +Cc: leon, linux-rdma, Mike Marciniszyn
On Wed, Nov 15, 2023 at 01:17:50PM -0600, Shiraz Saleem wrote:
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index f9ab671c8eda..07c571c7b699 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> return page_size;
> }
>
> - /* rdma_for_each_block() has a bug if the page size is smaller than the
> - * page size used to build the umem. For now prevent smaller page sizes
> - * from being returned.
> - */
> - pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> -
> /* The best result is the smallest page size that results in the minimum
> * number of required pages. Compute the largest page size that could
> * work based on VA address bits that don't change.
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 95896472a82b..e775d1b4910c 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
> {
> __rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
> umem->sgt_append.sgt.nents, pgsz);
> + biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
> + biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
> }
>
> /**
> @@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
> */
> #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
> for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
> - __rdma_block_iter_next(biter);)
> + __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
Shouldn't this sg_numblocks handling be in __rdma_block_iter_next()?
It makes sense to me
Leon, we should be sure to check this on mlx5 also
Thanks,
Jason
* Re: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-15 19:17 ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz Shiraz Saleem
2023-11-16 17:12 ` Jason Gunthorpe
@ 2023-11-17 12:13 ` Zhu Yanjun
2023-11-18 14:54 ` Marciniszyn, Mike
1 sibling, 1 reply; 11+ messages in thread
From: Zhu Yanjun @ 2023-11-17 12:13 UTC (permalink / raw)
To: Shiraz Saleem, jgg, leon, linux-rdma; +Cc: Mike Marciniszyn
On 2023/11/16 3:17, Shiraz Saleem wrote:
> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
>
> 64k pages introduce the situation in this diagram when the HCA
Does only the ARM64 architecture support a 64K page size?
Is it possible that x86_64 also supports a 64K page size?
Zhu Yanjun
> 4k page size is being used:
>
> +-------------------------------------------+ <--- 64k aligned VA
> | |
> | HCA 4k page |
> | |
> +-------------------------------------------+
> | o |
> | |
> | o |
> | |
> | o |
> +-------------------------------------------+
> | |
> | HCA 4k page |
> | |
> +-------------------------------------------+ <--- Live HCA page
> |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
> | | <--- VA
> | MR data |
> +-------------------------------------------+
> | |
> | HCA 4k page |
> | |
> +-------------------------------------------+
> | o |
> | |
> | o |
> | |
> | o |
> +-------------------------------------------+
> | |
> | HCA 4k page |
> | |
> +-------------------------------------------+
>
> The VA addresses are coming from rdma-core in this diagram can
> be arbitrary, but for 64k pages, the VA may be offset by some
> number of HCA 4k pages and followed by some number of HCA 4k
> pages.
>
> The current iterator doesn't account for either the preceding
> 4k pages or the following 4k pages.
>
> Fix the issue by extending the ib_block_iter to contain
> the number of DMA pages like comment [1] says and
> by augmenting the macro limit test to downcount that value.
>
> This prevents the extra pages following the user MR data.
>
> Fix the preceding pages by using the __sg_advance field to start
> at the first 4k page containing MR data.
>
> This fix allows for the elimination of the small page crutch noted
> in the Fixes.
>
> Fixes: 10c75ccb54e4 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/rdma/ib_umem.h#n91 [1]
> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
> ---
> drivers/infiniband/core/umem.c | 6 ------
> include/rdma/ib_umem.h | 4 +++-
> include/rdma/ib_verbs.h | 1 +
> 3 files changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index f9ab671c8eda..07c571c7b699 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
> return page_size;
> }
>
> - /* rdma_for_each_block() has a bug if the page size is smaller than the
> - * page size used to build the umem. For now prevent smaller page sizes
> - * from being returned.
> - */
> - pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> -
> /* The best result is the smallest page size that results in the minimum
> * number of required pages. Compute the largest page size that could
> * work based on VA address bits that don't change.
> diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
> index 95896472a82b..e775d1b4910c 100644
> --- a/include/rdma/ib_umem.h
> +++ b/include/rdma/ib_umem.h
> @@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
> {
> __rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
> umem->sgt_append.sgt.nents, pgsz);
> + biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
> + biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
> }
>
> /**
> @@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
> */
> #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
> for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
> - __rdma_block_iter_next(biter);)
> + __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
>
> #ifdef CONFIG_INFINIBAND_USER_MEM
>
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index fb1a2d6b1969..b7b6b58dd348 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -2850,6 +2850,7 @@ struct ib_block_iter {
> /* internal states */
> struct scatterlist *__sg; /* sg holding the current aligned block */
> dma_addr_t __dma_addr; /* unaligned DMA address of this block */
> + size_t __sg_numblocks; /* ib_umem_num_dma_blocks() */
> unsigned int __sg_nents; /* number of SG entries */
> unsigned int __sg_advance; /* number of bytes to advance in sg in next step */
> unsigned int __pg_bit; /* alignment of current block */
* RE: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-17 12:13 ` Zhu Yanjun
@ 2023-11-18 14:54 ` Marciniszyn, Mike
2023-11-18 14:59 ` Marciniszyn, Mike
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Marciniszyn, Mike @ 2023-11-18 14:54 UTC (permalink / raw)
To: Zhu Yanjun, Saleem, Shiraz, jgg@nvidia.com, leon@kernel.org,
linux-rdma@vger.kernel.org
> > From: Mike Marciniszyn <mike.marciniszyn@intel.com>
> >
> > 64k pages introduce the situation in this diagram when the HCA
>
> Only ARM64 architecture supports 64K page size?
Arm supports multiple page_sizes. The problematic combination is when
the HCA needs a SMALLER page size than the PAGE_SIZE.
The kernel configuration can select from
> Is it possible that x86_64 also supports 64K page size?
>
x86_64 supports larger page_sizes for TLB optimization, but the default minimum is always 4K.
Mike
* RE: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-18 14:54 ` Marciniszyn, Mike
@ 2023-11-18 14:59 ` Marciniszyn, Mike
2023-11-19 1:04 ` Zhu Yanjun
2023-11-19 1:31 ` Zhu Yanjun
2 siblings, 0 replies; 11+ messages in thread
From: Marciniszyn, Mike @ 2023-11-18 14:59 UTC (permalink / raw)
To: Marciniszyn, Mike, Zhu Yanjun, Saleem, Shiraz, jgg@nvidia.com,
leon@kernel.org, linux-rdma@vger.kernel.org
>
> The kernel configuration can select from
>
... multiple page sizes.
Mike
* Re: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-18 14:54 ` Marciniszyn, Mike
2023-11-18 14:59 ` Marciniszyn, Mike
@ 2023-11-19 1:04 ` Zhu Yanjun
2023-11-19 1:31 ` Zhu Yanjun
2 siblings, 0 replies; 11+ messages in thread
From: Zhu Yanjun @ 2023-11-19 1:04 UTC (permalink / raw)
To: Marciniszyn, Mike, Saleem, Shiraz, jgg@nvidia.com,
leon@kernel.org, RDMA mailing list
On 2023/11/18 22:54, Marciniszyn, Mike wrote:
>>> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
>>>
>>> 64k pages introduce the situation in this diagram when the HCA
>> Only ARM64 architecture supports 64K page size?
> Arm supports multiple page_sizes. The problematic combination is when
> the HCA needs a SMALLER page size than the PAGE_SIZE.
>
> The kernel configuration can select from
Got it. Thanks a lot. On the ARM architecture, kernel configuration
options can be selected to enable multiple page sizes.
>
>> Is it possible that x86_64 also supports 64K page size?
>>
> x86_64 supports larger page_sizes for TLB optimization, but the default minimum is always 4K.
On the x86_64 architecture, how can a non-4K page size be enabled? For
example, 16K, 64K, and so on.
Thanks a lot.
Zhu Yanjun
>
> Mike
* Re: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-18 14:54 ` Marciniszyn, Mike
2023-11-18 14:59 ` Marciniszyn, Mike
2023-11-19 1:04 ` Zhu Yanjun
@ 2023-11-19 1:31 ` Zhu Yanjun
2 siblings, 0 replies; 11+ messages in thread
From: Zhu Yanjun @ 2023-11-19 1:31 UTC (permalink / raw)
To: Marciniszyn, Mike, Saleem, Shiraz, jgg@nvidia.com,
leon@kernel.org, linux-rdma@vger.kernel.org
On 2023/11/18 22:54, Marciniszyn, Mike wrote:
>>> From: Mike Marciniszyn <mike.marciniszyn@intel.com>
>>>
>>> 64k pages introduce the situation in this diagram when the HCA
>>
>> Only ARM64 architecture supports 64K page size?
>
> Arm supports multiple page_sizes. The problematic combination is when
> the HCA needs a SMALLER page size than the PAGE_SIZE.
Thanks a lot. Perhaps RXE also needs to handle this case where "the HCA
needs a SMALLER page size than the PAGE_SIZE", but I do not have such a
test environment at hand.
If we can set up a test environment on the x86_64 architecture, it would
be very convenient for me to test and develop this.
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Zhu Yanjun
>
> The kernel configuration can select from
>
>> Is it possible that x86_64 also supports 64K page size?
>>
>
> x86_64 supports larger page_sizes for TLB optimization, but the default minimum is always 4K.
>
> Mike
* RE: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz
2023-11-16 17:12 ` Jason Gunthorpe
@ 2023-11-19 22:24 ` Saleem, Shiraz
0 siblings, 0 replies; 11+ messages in thread
From: Saleem, Shiraz @ 2023-11-19 22:24 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: leon@kernel.org, linux-rdma@vger.kernel.org, Marciniszyn, Mike
> Subject: Re: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE
> is greater than HCA pgsz
>
> On Wed, Nov 15, 2023 at 01:17:50PM -0600, Shiraz Saleem wrote:
> > diff --git a/drivers/infiniband/core/umem.c
> > b/drivers/infiniband/core/umem.c index f9ab671c8eda..07c571c7b699
> > 100644
> > --- a/drivers/infiniband/core/umem.c
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct
> ib_umem *umem,
> > return page_size;
> > }
> >
> > - /* rdma_for_each_block() has a bug if the page size is smaller than the
> > - * page size used to build the umem. For now prevent smaller page sizes
> > - * from being returned.
> > - */
> > - pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
> > -
> > /* The best result is the smallest page size that results in the minimum
> > * number of required pages. Compute the largest page size that could
> > * work based on VA address bits that don't change.
> > diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index
> > 95896472a82b..e775d1b4910c 100644
> > --- a/include/rdma/ib_umem.h
> > +++ b/include/rdma/ib_umem.h
> > @@ -77,6 +77,8 @@ static inline void
> > __rdma_umem_block_iter_start(struct ib_block_iter *biter, {
> > __rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
> > umem->sgt_append.sgt.nents, pgsz);
> > + biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
> > + biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
> > }
> >
> > /**
> > @@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct
> ib_block_iter *biter,
> > */
> > #define rdma_umem_for_each_dma_block(umem, biter, pgsz) \
> > for (__rdma_umem_block_iter_start(biter, umem, pgsz); \
> > - __rdma_block_iter_next(biter);)
> > + __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
>
> This sg_numblocks should be in the __rdma_block_iter_next() ?
>
> It makes sense to me
>
The __rdma_block_iter_next() is common to two iterators: rdma_umem_for_each_dma_block() and rdma_for_each_block.
The patch makes adjustments to protect users of rdma_for_each_block().
We are working on a v2 to add a umem specific next function that will implement the downcount.
Thread overview: 11+ messages
2023-11-15 19:17 [PATCH for-rc 0/3] Fixes for 64K page size support Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater than HCA pgsz Shiraz Saleem
2023-11-16 17:12 ` Jason Gunthorpe
2023-11-19 22:24 ` Saleem, Shiraz
2023-11-17 12:13 ` Zhu Yanjun
2023-11-18 14:54 ` Marciniszyn, Mike
2023-11-18 14:59 ` Marciniszyn, Mike
2023-11-19 1:04 ` Zhu Yanjun
2023-11-19 1:31 ` Zhu Yanjun
2023-11-15 19:17 ` [PATCH for-rc 2/3] RDMA/irdma: Ensure iWarp QP queue memory is OS page aligned Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 3/3] RDMA/irdma: Fix support for 64k pages Shiraz Saleem