Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
From: Shiraz Saleem <shiraz.saleem@intel.com>
To: jgg@nvidia.com, leon@kernel.org, linux-rdma@vger.kernel.org
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>,
	Shiraz Saleem <shiraz.saleem@intel.com>
Subject: [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
Date: Wed, 15 Nov 2023 13:17:50 -0600	[thread overview]
Message-ID: <20231115191752.266-2-shiraz.saleem@intel.com> (raw)
In-Reply-To: <20231115191752.266-1-shiraz.saleem@intel.com>

From: Mike Marciniszyn <mike.marciniszyn@intel.com>

64k pages introduce the situation in this diagram when the HCA
4k page size is being used:

 +-------------------------------------------+ <--- 64k aligned VA
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+ <--- Live HCA page
 |OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO| <--- offset
 |                                           | <--- VA
 |                MR data                    |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+
 |                   o                       |
 |                                           |
 |                   o                       |
 |                                           |
 |                   o                       |
 +-------------------------------------------+
 |                                           |
 |              HCA 4k page                  |
 |                                           |
 +-------------------------------------------+

The VA addresses are coming from rdma-core in this diagram can
be arbitrary, but for 64k pages, the VA may be offset by some
number of HCA 4k pages and followed by some number of HCA 4k
pages.

The current iterator doesn't account for either the preceding
4k pages or the following 4k pages.

Fix the issue by extending the ib_block_iter to contain
the number of DMA pages like comment [1] says and
by augmenting the macro limit test to downcount that value.

This prevents the extra pages following the user MR data.

Fix the preceding pages by using the __sq_advance field to start
at the first 4k page containing MR data.

This fix allows for the elimination of the small page crutch noted
in the Fixes.

Fixes: 10c75ccb54e4 ("RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()")
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/rdma/ib_umem.h#n91 [1]
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
---
 drivers/infiniband/core/umem.c | 6 ------
 include/rdma/ib_umem.h         | 4 +++-
 include/rdma/ib_verbs.h        | 1 +
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index f9ab671c8eda..07c571c7b699 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -96,12 +96,6 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 		return page_size;
 	}
 
-	/* rdma_for_each_block() has a bug if the page size is smaller than the
-	 * page size used to build the umem. For now prevent smaller page sizes
-	 * from being returned.
-	 */
-	pgsz_bitmap &= GENMASK(BITS_PER_LONG - 1, PAGE_SHIFT);
-
 	/* The best result is the smallest page size that results in the minimum
 	 * number of required pages. Compute the largest page size that could
 	 * work based on VA address bits that don't change.
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 95896472a82b..e775d1b4910c 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -77,6 +77,8 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
 {
 	__rdma_block_iter_start(biter, umem->sgt_append.sgt.sgl,
 				umem->sgt_append.sgt.nents, pgsz);
+	biter->__sg_advance = ib_umem_offset(umem) & ~(pgsz - 1);
+	biter->__sg_numblocks = ib_umem_num_dma_blocks(umem, pgsz);
 }
 
 /**
@@ -92,7 +94,7 @@ static inline void __rdma_umem_block_iter_start(struct ib_block_iter *biter,
  */
 #define rdma_umem_for_each_dma_block(umem, biter, pgsz)                        \
 	for (__rdma_umem_block_iter_start(biter, umem, pgsz);                  \
-	     __rdma_block_iter_next(biter);)
+	     __rdma_block_iter_next(biter) && (biter)->__sg_numblocks--;)
 
 #ifdef CONFIG_INFINIBAND_USER_MEM
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fb1a2d6b1969..b7b6b58dd348 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2850,6 +2850,7 @@ struct ib_block_iter {
 	/* internal states */
 	struct scatterlist *__sg;	/* sg holding the current aligned block */
 	dma_addr_t __dma_addr;		/* unaligned DMA address of this block */
+	size_t __sg_numblocks;		/* ib_umem_num_dma_blocks() */
 	unsigned int __sg_nents;	/* number of SG entries */
 	unsigned int __sg_advance;	/* number of bytes to advance in sg in next step */
 	unsigned int __pg_bit;		/* alignment of current block */
-- 
1.8.3.1


  reply	other threads:[~2023-11-15 19:18 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-15 19:17 [PATCH for-rc 0/3] Fixes for 64K page size support Shiraz Saleem
2023-11-15 19:17 ` Shiraz Saleem [this message]
2023-11-16 17:12   ` [PATCH for-rc 1/3] RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz Jason Gunthorpe
2023-11-19 22:24     ` Saleem, Shiraz
2023-11-17 12:13   ` Zhu Yanjun
2023-11-18 14:54     ` Marciniszyn, Mike
2023-11-18 14:59       ` Marciniszyn, Mike
2023-11-19  1:04       ` Zhu Yanjun
2023-11-19  1:31       ` Zhu Yanjun
2023-11-15 19:17 ` [PATCH for-rc 2/3] RDMA/irdma: Ensure iWarp QP queue memory is OS paged aligned Shiraz Saleem
2023-11-15 19:17 ` [PATCH for-rc 3/3] RDMA/irdma: Fix support for 64k pages Shiraz Saleem

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231115191752.266-2-shiraz.saleem@intel.com \
    --to=shiraz.saleem@intel.com \
    --cc=jgg@nvidia.com \
    --cc=leon@kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=mike.marciniszyn@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox