linux-rdma.vger.kernel.org archive mirror
* [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
@ 2025-12-26  9:52 Li Zhijian
  2025-12-27  5:24 ` Zhu Yanjun
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Li Zhijian @ 2025-12-26  9:52 UTC (permalink / raw)
  To: linux-rdma; +Cc: linux-kernel, zyjzyj2000, jgg, leon, Yi Zhang, Li Zhijian

The current implementation incorrectly handles memory regions (MRs) with
page sizes different from the system PAGE_SIZE. The core issue is that
rxe_set_page() is called with mr->page_size step increments, but the
page_list stores individual struct page pointers, each representing
PAGE_SIZE of memory.

Problem scenarios with concrete examples:

1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):

   Suppose: PAGE_SIZE=4K, mr->page_size=64K, mr->ibmr.iova=0x13010
   When rxe_set_page() is called with dma_addr=0x10000 (64K-aligned),
   it stores only one 4K page at page_list[0], but should store 16 pages.

   Accessing iova=0x13034:
   - Current code: index = (0x13034 >> 16) - (0x13010 >> 16) = 0
                  offset = 0x13034 & (64K-1) = 0x3034
   This calculates offset within 64K page, but page_list[0] is only 4K.

   - Expected: The iova=0x13034 should map to:
                  base_align = 0x13010 & ~(64K-1) = 0x10000
                  index = (0x13034 - 0x10000) / 4K = 0x3034 / 0x1000 = 3
                  offset = 0x13034 & (4K-1) = 0x034
                  So: page_list[3] with offset 0x34

2. mr->page_size < PAGE_SIZE (e.g., 2K MR with 4K system pages):

   Suppose: PAGE_SIZE=4K, mr->page_size=2K, mr->ibmr.iova=0x1100
   When rxe_set_page() is called with dma_addr=0x1000, it stores a 4K page
   at page_list[0]. Another call with dma_addr=0x1800 should not store
   a new page since 0x1800 is within the same 4K page as 0x1000.

   Accessing iova=0x1890:
   - Current code: index = (0x1890 >> 11) - (0x1100 >> 11) = 1
                  offset = 0x1890 & (2K-1) = 0x90
                  This assumes page_list[1] exists and is a 2K page.

   - Expected: Both 0x1000 and 0x1800 are within the same 4K page:
                  index = (0x1890 >> 12) - (0x1100 >> 12) = 0
                  offset = 0x1890 & (4K-1) = 0x890
                  So: page_list[0] with offset 0x890

Yi Zhang reported a kernel panic[1] years ago related to this defect.

The fix introduces:

1. Enhanced iova-to-index conversion with proper alignment handling:
   - For mr->page_size > PAGE_SIZE: Uses aligned base address
     (mr->ibmr.iova & mr->page_mask) as reference, correctly calculating
     which PAGE_SIZE sub-page contains the iova
   - For mr->page_size <= PAGE_SIZE: Uses PAGE_SIZE shift arithmetic
     to ensure each page_list entry corresponds to a PAGE_SIZE page

2. Page splitting in rxe_set_page():
   - When mr->page_size > PAGE_SIZE, set_pages_per_mr() splits each
     MR page into multiple PAGE_SIZE pages stored consecutively
   - For mr->page_size <= PAGE_SIZE, maintains original behavior

3. Always use PAGE_SIZE for offset calculation:
   - Since page_list stores PAGE_SIZE pages, offsets must be within
     [0, PAGE_SIZE-1]

4. Compatibility checks:
   - Ensures mr->page_size and PAGE_SIZE have a compatible relationship
     (one is multiple of the other) for correct mapping

The solution ensures correct iova-to-va conversion for all MR page sizes
while maintaining the existing IB verbs semantics for MR registration
and memory access.
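
For reference, a minimal standalone model of the index/offset arithmetic
described above (illustrative only, not part of this patch; the 4K/64K
values and the iova are hard-coded from example 1):

  /* cc -o iova_model iova_model.c && ./iova_model */
  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define SYS_PAGE_SIZE 0x1000UL   /* 4K system page */
  #define MR_PAGE_SIZE  0x10000UL  /* 64K MR page    */

  static unsigned long iova_to_index(uint64_t mr_iova, uint64_t iova)
  {
          if (MR_PAGE_SIZE > SYS_PAGE_SIZE) {
                  /* use the MR-page-aligned base as reference */
                  uint64_t base = mr_iova & ~(MR_PAGE_SIZE - 1);

                  return (iova - base) / SYS_PAGE_SIZE;
          }
          return (iova / SYS_PAGE_SIZE) - (mr_iova / SYS_PAGE_SIZE);
  }

  static unsigned long iova_to_offset(uint64_t iova)
  {
          /* page_list holds PAGE_SIZE pages, so offset is mod PAGE_SIZE */
          return iova & (SYS_PAGE_SIZE - 1);
  }

  int main(void)
  {
          /* example 1 above: iova 0x13034 -> page_list[3], offset 0x34 */
          assert(iova_to_index(0x13010, 0x13034) == 3);
          assert(iova_to_offset(0x13034) == 0x34);
          printf("index=%lu offset=%#lx\n",
                 iova_to_index(0x13010, 0x13034), iova_to_offset(0x13034));
          return 0;
  }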

This patch enables srp, rnbd and nvme to work in a 64K system page environment.

Tests on (4K and 64K PAGE_SIZE):
- rdma-core/pytests
  $ ./build/bin/run_tests.py  --dev eth0_rxe
- blktest:
  $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd

In the 64K environment, srp/012 fails, while it passes in the 4K environment.

[1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
Fixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 83 +++++++++++++++++++++++++++---
 1 file changed, 76 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index b28b56db725a..8ad7d163b418 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -72,14 +72,40 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
 	mr->ibmr.type = IB_MR_TYPE_DMA;
 }
 
+/*
+ * Convert iova to page_list index. The page_list stores pages of size
+ * PAGE_SIZE, but MRs can have different page sizes. This function
+ * handles the conversion for all cases:
+ *
+ * 1. mr->page_size > PAGE_SIZE:
+ *    The MR's iova may not be aligned to mr->page_size. We use the
+ *    aligned base (iova & page_mask) as reference, then calculate
+ *    which PAGE_SIZE sub-page the iova falls into.
+ *
+ * 2. mr->page_size <= PAGE_SIZE:
+ *    Use simple shift arithmetic since each page_list entry corresponds
+ *    to one or more MR pages.
+ *
+ * Example for mr->page_size=64K, PAGE_SIZE=4K, iova=0x11034:
+ *   base = iova & ~(64K-1) = 0x10000
+ *   index = (0x11034 - 0x10000) / 4K = 0x1034 / 0x1000 = 1
+ */
 static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
 {
-	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
+	if (mr_page_size(mr) > PAGE_SIZE)
+		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
+	else
+		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
 }
 
+/*
+ * Always return offset within a PAGE_SIZE page since page_list
+ * stores individual struct page pointers, each representing
+ * PAGE_SIZE of memory.
+ */
 static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
 {
-	return iova & (mr_page_size(mr) - 1);
+	return iova & (PAGE_SIZE - 1);
 }
 
 static bool is_pmem_page(struct page *pg)
@@ -205,11 +231,40 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
+/*
+ * Split a large MR page (mr->page_size) into multiple PAGE_SIZE
+ * sub-pages and store them in page_list.
+ *
+ * Called when mr->page_size > PAGE_SIZE. Each call to rxe_set_page()
+ * represents one mr->page_size region, which we must split into
+ * (mr->page_size / PAGE_SIZE) individual pages.
+ */
+static int set_pages_per_mr(struct ib_mr *ibmr, u64 dma_addr)
+{
+	struct rxe_mr *mr = to_rmr(ibmr);
+	u32 page_size = mr_page_size(mr);
+	u64 addr = dma_addr & ~(u64)(page_size - 1);
+	u32 i, pages_per_mr = page_size / PAGE_SIZE;
+
+	for (i = 0; i < pages_per_mr; i++) {
+		struct page *sub_page =
+			ib_virt_dma_to_page(addr + i * PAGE_SIZE);
+		int err = xa_err(xa_store(&mr->page_list, mr->nbuf, sub_page,
+					  GFP_KERNEL));
+		if (err)
+			return err;
+
+		mr->nbuf++;
+	}
+	return 0;
+}
+
 static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
 	struct page *page = ib_virt_dma_to_page(dma_addr);
 	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
+	u32 page_size = mr_page_size(mr);
 	int err;
 
 	if (persistent && !is_pmem_page(page)) {
@@ -217,9 +272,13 @@ static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
 		return -EINVAL;
 	}
 
-	if (unlikely(mr->nbuf == mr->num_buf))
+	if (unlikely(mr->nbuf >= mr->num_buf))
 		return -ENOMEM;
 
+	if (page_size > PAGE_SIZE)
+		return set_pages_per_mr(ibmr, dma_addr);
+
+	/* page_size <= PAGE_SIZE */
 	err = xa_err(xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL));
 	if (err)
 		return err;
@@ -234,6 +293,18 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
 	struct rxe_mr *mr = to_rmr(ibmr);
 	unsigned int page_size = mr_page_size(mr);
 
+	/*
+	 * Ensure page_size and PAGE_SIZE are compatible for mapping.
+	 * We require one to be a multiple of the other for correct
+	 * iova-to-page conversion.
+	 */
+	if (!IS_ALIGNED(page_size, PAGE_SIZE) &&
+	    !IS_ALIGNED(PAGE_SIZE, page_size)) {
+		rxe_err_mr(mr, "MR page size %u must be compatible with PAGE_SIZE %lu\n",
+			   page_size, PAGE_SIZE);
+		return -EINVAL;
+	}
+
 	mr->nbuf = 0;
 	mr->page_shift = ilog2(page_size);
 	mr->page_mask = ~((u64)page_size - 1);
@@ -255,8 +326,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
 		if (!page)
 			return -EFAULT;
 
-		bytes = min_t(unsigned int, length,
-				mr_page_size(mr) - page_offset);
+		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
 		va = kmap_local_page(page);
 		if (dir == RXE_FROM_MR_OBJ)
 			memcpy(addr, va + page_offset, bytes);
@@ -442,8 +512,7 @@ static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int leng
 		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
 		if (!page)
 			return -EFAULT;
-		bytes = min_t(unsigned int, length,
-			      mr_page_size(mr) - page_offset);
+		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
 
 		va = kmap_local_page(page);
 		arch_wb_cache_pmem(va + page_offset, bytes);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2025-12-26  9:52 [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE Li Zhijian
@ 2025-12-27  5:24 ` Zhu Yanjun
  2025-12-30  4:02   ` Zhu Yanjun
  2026-01-05  6:55 ` Zhijian Li (Fujitsu)
  2026-01-06  1:01 ` Jason Gunthorpe
  2 siblings, 1 reply; 9+ messages in thread
From: Zhu Yanjun @ 2025-12-27  5:24 UTC (permalink / raw)
  To: Li Zhijian, linux-rdma; +Cc: linux-kernel, zyjzyj2000, jgg, leon, Yi Zhang

在 2025/12/26 1:52, Li Zhijian 写道:
> The current implementation incorrectly handles memory regions (MRs) with
> page sizes different from the system PAGE_SIZE. The core issue is that
> rxe_set_page() is called with mr->page_size step increments, but the
> page_list stores individual struct page pointers, each representing
> PAGE_SIZE of memory.
> 
> Problem scenarios with concrete examples:
> 
> 1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):
> 
>     Suppose: PAGE_SIZE=4K, mr->page_size=64K, mr->ibmr.iova=0x13010
>     When rxe_set_page() is called with dma_addr=0x10000 (64K-aligned),
>     it stores only one 4K page at page_list[0], but should store 16 pages.
> 
>     Accessing iova=0x13034:
>     - Current code: index = (0x13034 >> 16) - (0x13010 >> 16) = 0
>                    offset = 0x13034 & (64K-1) = 0x3034
>     This calculates offset within 64K page, but page_list[0] is only 4K.
> 
>     - Expected: The iova=0x13034 should map to:
>                    base_align = 0x13010 & ~(64K-1) = 0x10000
>                    index = (0x13034 - 0x10000) / 4K = 0x3034 / 0x1000 = 3
>                    offset = 0x13034 & (4K-1) = 0x034
>                    So: page_list[3] with offset 0x34
> 
> 2. mr->page_size < PAGE_SIZE (e.g., 2K MR with 4K system pages):
> 
>     Suppose: PAGE_SIZE=4K, mr->page_size=2K, mr->ibmr.iova=0x1100
>     When rxe_set_page() is called with dma_addr=0x1000, it stores a 4K page
>     at page_list[0]. Another call with dma_addr=0x1800 should not store
>     a new page since 0x1800 is within the same 4K page as 0x1000.
> 
>     Accessing iova=0x1890:
>     - Current code: index = (0x1890 >> 11) - (0x1100 >> 11) = 3
>                    offset = 0x1890 & (2K-1) = 0x90
>                    This assumes page_list[3] exists and is a 2K page.
> 
>     - Expected: Both 0x1000 and 0x1800 are within the same 4K page:
>                    index = (0x1890 >> 12) - (0x1100 >> 12) = 0
>                    offset = 0x1890 & (4K-1) = 0x890
>                    So: page_list[0] with offset 0x890
> 
> Yi Zhang reported a kernel panic[1] years ago related to this defect.
> 
> The fix introduces:
> 
> 1. Enhanced iova-to-index conversion with proper alignment handling:
>     - For mr->page_size > PAGE_SIZE: Uses aligned base address
>       (mr->ibmr.iova & mr->page_mask) as reference, correctly calculating
>       which PAGE_SIZE sub-page contains the iova
>     - For mr->page_size <= PAGE_SIZE: Uses PAGE_SIZE shift arithmetic
>       to ensure each page_list entry corresponds to a PAGE_SIZE page
> 
> 2. Page splitting in rxe_set_page():
>     - When mr->page_size > PAGE_SIZE, set_pages_per_mr() splits each
>       MR page into multiple PAGE_SIZE pages stored consecutively
>     - For mr->page_size <= PAGE_SIZE, maintains original behavior
> 
> 3. Always use PAGE_SIZE for offset calculation:
>     - Since page_list stores PAGE_SIZE pages, offsets must be within
>       [0, PAGE_SIZE-1]
> 
> 4. Compatibility checks:
>     - Ensures mr->page_size and PAGE_SIZE have a compatible relationship
>       (one is multiple of the other) for correct mapping
> 
> The solution ensures correct iova-to-va conversion for all MR page sizes
> while maintaining the existing IB verbs semantics for MR registration
> and memory access.
> 
> This patch enables srp rnbd nvme in 64K system page environment.
> 
> Tests on (4K and 64K PAGE_SIZE):
> - rdma-core/pytests
>    $ ./build/bin/run_tests.py  --dev eth0_rxe
> - blktest:
>    $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd
> 
> In 64K environment, srp/012 is failed while it's passed in 4K environment.

Hi, Yi

The mentioned testcase, the link is:
https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/

Could you help run tests to verify whether this problem is fixed?

Thanks a lot.
Zhu Yanjun

> 
> [1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
> ixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_mr.c | 83 +++++++++++++++++++++++++++---
>   1 file changed, 76 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index b28b56db725a..8ad7d163b418 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -72,14 +72,40 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>   	mr->ibmr.type = IB_MR_TYPE_DMA;
>   }
>   
> +/*
> + * Convert iova to page_list index. The page_list stores pages of size
> + * PAGE_SIZE, but MRs can have different page sizes. This function
> + * handles the conversion for all cases:
> + *
> + * 1. mr->page_size > PAGE_SIZE:
> + *    The MR's iova may not be aligned to mr->page_size. We use the
> + *    aligned base (iova & page_mask) as reference, then calculate
> + *    which PAGE_SIZE sub-page the iova falls into.
> + *
> + * 2. mr->page_size <= PAGE_SIZE:
> + *    Use simple shift arithmetic since each page_list entry corresponds
> + *    to one or more MR pages.
> + *
> + * Example for mr->page_size=64K, PAGE_SIZE=4K, iova=0x11034:
> + *   base = iova & ~(64K-1) = 0x10000
> + *   index = (0x11034 - 0x10000) / 4K = 0x1034 / 0x1000 = 4
> + */
>   static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>   {
> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> +	if (mr_page_size(mr) > PAGE_SIZE)
> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
> +	else
> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>   }
>   
> +/*
> + * Always return offset within a PAGE_SIZE page since page_list
> + * stores individual struct page pointers, each representing
> + * PAGE_SIZE of memory.
> + */
>   static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>   {
> -	return iova & (mr_page_size(mr) - 1);
> +	return iova & (PAGE_SIZE - 1);
>   }
>   
>   static bool is_pmem_page(struct page *pg)
> @@ -205,11 +231,40 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
>   	return err;
>   }
>   
> +/*
> + * Split a large MR page (mr->page_size) into multiple PAGE_SIZE
> + * sub-pages and store them in page_list.
> + *
> + * Called when mr->page_size > PAGE_SIZE. Each call to rxe_set_page()
> + * represents one mr->page_size region, which we must split into
> + * (mr->page_size / PAGE_SIZE) individual pages.
> + */
> +static int set_pages_per_mr(struct ib_mr *ibmr, u64 dma_addr)
> +{
> +	struct rxe_mr *mr = to_rmr(ibmr);
> +	u32 page_size = mr_page_size(mr);
> +	u64 addr = dma_addr & ~(u64)(page_size - 1);
> +	u32 i, pages_per_mr = page_size / PAGE_SIZE;
> +
> +	for (i = 0; i < pages_per_mr; i++) {
> +		struct page *sub_page =
> +			ib_virt_dma_to_page(addr + i * PAGE_SIZE);
> +		int err = xa_err(xa_store(&mr->page_list, mr->nbuf, sub_page,
> +					  GFP_KERNEL));
> +		if (err)
> +			return err;
> +
> +		mr->nbuf++;
> +	}
> +	return 0;
> +}
> +
>   static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>   {
>   	struct rxe_mr *mr = to_rmr(ibmr);
>   	struct page *page = ib_virt_dma_to_page(dma_addr);
>   	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
> +	u32 page_size = mr_page_size(mr);
>   	int err;
>   
>   	if (persistent && !is_pmem_page(page)) {
> @@ -217,9 +272,13 @@ static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>   		return -EINVAL;
>   	}
>   
> -	if (unlikely(mr->nbuf == mr->num_buf))
> +	if (unlikely(mr->nbuf >= mr->num_buf))
>   		return -ENOMEM;
>   
> +	if (page_size > PAGE_SIZE)
> +		return set_pages_per_mr(ibmr, dma_addr);
> +
> +	/* page_size <= PAGE_SIZE */
>   	err = xa_err(xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL));
>   	if (err)
>   		return err;
> @@ -234,6 +293,18 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
>   	struct rxe_mr *mr = to_rmr(ibmr);
>   	unsigned int page_size = mr_page_size(mr);
>   
> +	/*
> +	 * Ensure page_size and PAGE_SIZE are compatible for mapping.
> +	 * We require one to be a multiple of the other for correct
> +	 * iova-to-page conversion.
> +	 */
> +	if (!IS_ALIGNED(page_size, PAGE_SIZE) &&
> +	    !IS_ALIGNED(PAGE_SIZE, page_size)) {
> +		rxe_err_mr(mr, "MR page size %u must be compatible with PAGE_SIZE %lu\n",
> +			   page_size, PAGE_SIZE);
> +		return -EINVAL;
> +	}
> +
>   	mr->nbuf = 0;
>   	mr->page_shift = ilog2(page_size);
>   	mr->page_mask = ~((u64)page_size - 1);
> @@ -255,8 +326,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>   		if (!page)
>   			return -EFAULT;
>   
> -		bytes = min_t(unsigned int, length,
> -				mr_page_size(mr) - page_offset);
> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>   		va = kmap_local_page(page);
>   		if (dir == RXE_FROM_MR_OBJ)
>   			memcpy(addr, va + page_offset, bytes);
> @@ -442,8 +512,7 @@ static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int leng
>   		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>   		if (!page)
>   			return -EFAULT;
> -		bytes = min_t(unsigned int, length,
> -			      mr_page_size(mr) - page_offset);
> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>   
>   		va = kmap_local_page(page);
>   		arch_wb_cache_pmem(va + page_offset, bytes);


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2025-12-27  5:24 ` Zhu Yanjun
@ 2025-12-30  4:02   ` Zhu Yanjun
  2025-12-30  4:47     ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 9+ messages in thread
From: Zhu Yanjun @ 2025-12-30  4:02 UTC (permalink / raw)
  To: Li Zhijian, linux-rdma; +Cc: linux-kernel, zyjzyj2000, jgg, leon, Yi Zhang


在 2025/12/26 21:24, Zhu Yanjun 写道:
> 在 2025/12/26 1:52, Li Zhijian 写道:
>> The current implementation incorrectly handles memory regions (MRs) with
>> page sizes different from the system PAGE_SIZE. The core issue is that
>> rxe_set_page() is called with mr->page_size step increments, but the
>> page_list stores individual struct page pointers, each representing
>> PAGE_SIZE of memory.
>>
>> Problem scenarios with concrete examples:
>>
>> 1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):
>>
>>     Suppose: PAGE_SIZE=4K, mr->page_size=64K, mr->ibmr.iova=0x13010
>>     When rxe_set_page() is called with dma_addr=0x10000 (64K-aligned),
>>     it stores only one 4K page at page_list[0], but should store 16 
>> pages.
>>
>>     Accessing iova=0x13034:
>>     - Current code: index = (0x13034 >> 16) - (0x13010 >> 16) = 0
>>                    offset = 0x13034 & (64K-1) = 0x3034
>>     This calculates offset within 64K page, but page_list[0] is only 4K.
>>
>>     - Expected: The iova=0x13034 should map to:
>>                    base_align = 0x13010 & ~(64K-1) = 0x10000
>>                    index = (0x13034 - 0x10000) / 4K = 0x3034 / 0x1000 
>> = 3
>>                    offset = 0x13034 & (4K-1) = 0x034
>>                    So: page_list[3] with offset 0x34
>>
>> 2. mr->page_size < PAGE_SIZE (e.g., 2K MR with 4K system pages):
>>
>>     Suppose: PAGE_SIZE=4K, mr->page_size=2K, mr->ibmr.iova=0x1100
>>     When rxe_set_page() is called with dma_addr=0x1000, it stores a 
>> 4K page
>>     at page_list[0]. Another call with dma_addr=0x1800 should not store
>>     a new page since 0x1800 is within the same 4K page as 0x1000.
>>
>>     Accessing iova=0x1890:
>>     - Current code: index = (0x1890 >> 11) - (0x1100 >> 11) = 3
>>                    offset = 0x1890 & (2K-1) = 0x90
>>                    This assumes page_list[3] exists and is a 2K page.
>>
>>     - Expected: Both 0x1000 and 0x1800 are within the same 4K page:
>>                    index = (0x1890 >> 12) - (0x1100 >> 12) = 0
>>                    offset = 0x1890 & (4K-1) = 0x890
>>                    So: page_list[0] with offset 0x890
>>
>> Yi Zhang reported a kernel panic[1] years ago related to this defect.
>>
>> The fix introduces:
>>
>> 1. Enhanced iova-to-index conversion with proper alignment handling:
>>     - For mr->page_size > PAGE_SIZE: Uses aligned base address
>>       (mr->ibmr.iova & mr->page_mask) as reference, correctly 
>> calculating
>>       which PAGE_SIZE sub-page contains the iova
>>     - For mr->page_size <= PAGE_SIZE: Uses PAGE_SIZE shift arithmetic
>>       to ensure each page_list entry corresponds to a PAGE_SIZE page
>>
>> 2. Page splitting in rxe_set_page():
>>     - When mr->page_size > PAGE_SIZE, set_pages_per_mr() splits each
>>       MR page into multiple PAGE_SIZE pages stored consecutively
>>     - For mr->page_size <= PAGE_SIZE, maintains original behavior
>>
>> 3. Always use PAGE_SIZE for offset calculation:
>>     - Since page_list stores PAGE_SIZE pages, offsets must be within
>>       [0, PAGE_SIZE-1]
>>
>> 4. Compatibility checks:
>>     - Ensures mr->page_size and PAGE_SIZE have a compatible relationship
>>       (one is multiple of the other) for correct mapping
>>
>> The solution ensures correct iova-to-va conversion for all MR page sizes
>> while maintaining the existing IB verbs semantics for MR registration
>> and memory access.
>>
>> This patch enables srp rnbd nvme in 64K system page environment.
>>
>> Tests on (4K and 64K PAGE_SIZE):
>> - rdma-core/pytests
>>    $ ./build/bin/run_tests.py  --dev eth0_rxe
>> - blktest:
>>    $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme 
>> srp rnbd
>>
>> In 64K environment, srp/012 is failed while it's passed in 4K 
>> environment.
>
> Hi, Yi
>
> The mentioned testcase, the link is:
> https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/ 
>
>
> Can you help to make tests to verify whether this problem is fixed or 
> not?
>

On x86_64 hosts, after this commit is applied, all the test cases in
rdma-core pass. But on Fedora Core 42 with the ARM64 architecture, there
are some rdma-core failures.

I did not run blktest.

Zhu Yanjun


> Thanks a lot.
> Zhu Yanjun
>
>>
>> [1] 
>> https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>> ixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by 
>> xarray")
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>   drivers/infiniband/sw/rxe/rxe_mr.c | 83 +++++++++++++++++++++++++++---
>>   1 file changed, 76 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c 
>> b/drivers/infiniband/sw/rxe/rxe_mr.c
>> index b28b56db725a..8ad7d163b418 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>> @@ -72,14 +72,40 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>>       mr->ibmr.type = IB_MR_TYPE_DMA;
>>   }
>>   +/*
>> + * Convert iova to page_list index. The page_list stores pages of size
>> + * PAGE_SIZE, but MRs can have different page sizes. This function
>> + * handles the conversion for all cases:
>> + *
>> + * 1. mr->page_size > PAGE_SIZE:
>> + *    The MR's iova may not be aligned to mr->page_size. We use the
>> + *    aligned base (iova & page_mask) as reference, then calculate
>> + *    which PAGE_SIZE sub-page the iova falls into.
>> + *
>> + * 2. mr->page_size <= PAGE_SIZE:
>> + *    Use simple shift arithmetic since each page_list entry 
>> corresponds
>> + *    to one or more MR pages.
>> + *
>> + * Example for mr->page_size=64K, PAGE_SIZE=4K, iova=0x11034:
>> + *   base = iova & ~(64K-1) = 0x10000
>> + *   index = (0x11034 - 0x10000) / 4K = 0x1034 / 0x1000 = 4
>> + */
>>   static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>   {
>> -    return (iova >> mr->page_shift) - (mr->ibmr.iova >> 
>> mr->page_shift);
>> +    if (mr_page_size(mr) > PAGE_SIZE)
>> +        return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
>> +    else
>> +        return (iova >> mr->page_shift) - (mr->ibmr.iova >> 
>> mr->page_shift);
>>   }
>>   +/*
>> + * Always return offset within a PAGE_SIZE page since page_list
>> + * stores individual struct page pointers, each representing
>> + * PAGE_SIZE of memory.
>> + */
>>   static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, 
>> u64 iova)
>>   {
>> -    return iova & (mr_page_size(mr) - 1);
>> +    return iova & (PAGE_SIZE - 1);
>>   }
>>     static bool is_pmem_page(struct page *pg)
>> @@ -205,11 +231,40 @@ int rxe_mr_init_fast(int max_pages, struct 
>> rxe_mr *mr)
>>       return err;
>>   }
>>   +/*
>> + * Split a large MR page (mr->page_size) into multiple PAGE_SIZE
>> + * sub-pages and store them in page_list.
>> + *
>> + * Called when mr->page_size > PAGE_SIZE. Each call to rxe_set_page()
>> + * represents one mr->page_size region, which we must split into
>> + * (mr->page_size / PAGE_SIZE) individual pages.
>> + */
>> +static int set_pages_per_mr(struct ib_mr *ibmr, u64 dma_addr)
>> +{
>> +    struct rxe_mr *mr = to_rmr(ibmr);
>> +    u32 page_size = mr_page_size(mr);
>> +    u64 addr = dma_addr & ~(u64)(page_size - 1);
>> +    u32 i, pages_per_mr = page_size / PAGE_SIZE;
>> +
>> +    for (i = 0; i < pages_per_mr; i++) {
>> +        struct page *sub_page =
>> +            ib_virt_dma_to_page(addr + i * PAGE_SIZE);
>> +        int err = xa_err(xa_store(&mr->page_list, mr->nbuf, sub_page,
>> +                      GFP_KERNEL));
>> +        if (err)
>> +            return err;
>> +
>> +        mr->nbuf++;
>> +    }
>> +    return 0;
>> +}
>> +
>>   static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>>   {
>>       struct rxe_mr *mr = to_rmr(ibmr);
>>       struct page *page = ib_virt_dma_to_page(dma_addr);
>>       bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
>> +    u32 page_size = mr_page_size(mr);
>>       int err;
>>         if (persistent && !is_pmem_page(page)) {
>> @@ -217,9 +272,13 @@ static int rxe_set_page(struct ib_mr *ibmr, u64 
>> dma_addr)
>>           return -EINVAL;
>>       }
>>   -    if (unlikely(mr->nbuf == mr->num_buf))
>> +    if (unlikely(mr->nbuf >= mr->num_buf))
>>           return -ENOMEM;
>>   +    if (page_size > PAGE_SIZE)
>> +        return set_pages_per_mr(ibmr, dma_addr);
>> +
>> +    /* page_size <= PAGE_SIZE */
>>       err = xa_err(xa_store(&mr->page_list, mr->nbuf, page, 
>> GFP_KERNEL));
>>       if (err)
>>           return err;
>> @@ -234,6 +293,18 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct 
>> scatterlist *sgl,
>>       struct rxe_mr *mr = to_rmr(ibmr);
>>       unsigned int page_size = mr_page_size(mr);
>>   +    /*
>> +     * Ensure page_size and PAGE_SIZE are compatible for mapping.
>> +     * We require one to be a multiple of the other for correct
>> +     * iova-to-page conversion.
>> +     */
>> +    if (!IS_ALIGNED(page_size, PAGE_SIZE) &&
>> +        !IS_ALIGNED(PAGE_SIZE, page_size)) {
>> +        rxe_err_mr(mr, "MR page size %u must be compatible with 
>> PAGE_SIZE %lu\n",
>> +               page_size, PAGE_SIZE);
>> +        return -EINVAL;
>> +    }
>> +
>>       mr->nbuf = 0;
>>       mr->page_shift = ilog2(page_size);
>>       mr->page_mask = ~((u64)page_size - 1);
>> @@ -255,8 +326,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, 
>> u64 iova, void *addr,
>>           if (!page)
>>               return -EFAULT;
>>   -        bytes = min_t(unsigned int, length,
>> -                mr_page_size(mr) - page_offset);
>> +        bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>>           va = kmap_local_page(page);
>>           if (dir == RXE_FROM_MR_OBJ)
>>               memcpy(addr, va + page_offset, bytes);
>> @@ -442,8 +512,7 @@ static int rxe_mr_flush_pmem_iova(struct rxe_mr 
>> *mr, u64 iova, unsigned int leng
>>           page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>>           if (!page)
>>               return -EFAULT;
>> -        bytes = min_t(unsigned int, length,
>> -                  mr_page_size(mr) - page_offset);
>> +        bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>>             va = kmap_local_page(page);
>>           arch_wb_cache_pmem(va + page_offset, bytes);
>
-- 
Best Regards,
Yanjun.Zhu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2025-12-30  4:02   ` Zhu Yanjun
@ 2025-12-30  4:47     ` Zhijian Li (Fujitsu)
  2025-12-30  4:57       ` Zhu Yanjun
  0 siblings, 1 reply; 9+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-12-30  4:47 UTC (permalink / raw)
  To: Zhu Yanjun, linux-rdma@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, zyjzyj2000@gmail.com, jgg@ziepe.ca,
	leon@kernel.org, Yi Zhang

Hi Yanjun,


On 30/12/2025 12:02, Zhu Yanjun wrote:
>>>
>>
>> Hi, Yi
>>
>> The mentioned testcase, the link is:
>> https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>>
>> Can you help to make tests to verify whether this problem is fixed or not?
>>
> 
> On x86_64 hosts, after this commit is applied, all the testcases in rdma-core can pass. But in Fedora Core 42 with ARM64 architecture,
> 
> there are some errors with rdma-core.


Many thanks for your testing.

This commit should only affect IB_MR_TYPE_MEM_REG MRs, which are exposed to
Upper Layer Protocols (ULPs) in the current kernel.

Since rdma-core typically uses IB_MR_TYPE_USER (where mr.page_size is always
the same as PAGE_SIZE), it should not be impacted.

Thanks
Zhijian

> 
> I did not make tests with blktest.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2025-12-30  4:47     ` Zhijian Li (Fujitsu)
@ 2025-12-30  4:57       ` Zhu Yanjun
  0 siblings, 0 replies; 9+ messages in thread
From: Zhu Yanjun @ 2025-12-30  4:57 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), linux-rdma@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, zyjzyj2000@gmail.com, jgg@ziepe.ca,
	leon@kernel.org, Yi Zhang


在 2025/12/29 20:47, Zhijian Li (Fujitsu) 写道:
> Hi Yanjun,
>
>
> On 30/12/2025 12:02, Zhu Yanjun wrote:
>>> Hi, Yi
>>>
>>> The mentioned testcase, the link is:
>>> https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>>>
>>> Can you help to make tests to verify whether this problem is fixed or not?
>>>
>> On x86_64 hosts, after this commit is applied, all the testcases in rdma-core can pass. But in Fedora Core 42 with ARM64 architecture,
>>
>> there are some errors with rdma-core.
>
> Many thanks for your testing.
>
> This commit should only affect IB_MR_TYPE_MEM_REG MRs, which are exposed to the
> Upper Layer Protocol (ULP) in the current kernel.
>
> Since rdma-core typically utilizes IB_MR_TYPE_USER(mr.page_size always same with PAGE_SIZE), it should not be impacted.

Got it. No worries. If this commit passes Yi's tests, I will run the tests
again.

Let us wait for Yi's test results.

Thanks a lot for your efforts.

Zhu Yanjun

>
> Thanks
> Zhijian
>
>> I did not make tests with blktest.

-- 
Best Regards,
Yanjun.Zhu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2025-12-26  9:52 [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE Li Zhijian
  2025-12-27  5:24 ` Zhu Yanjun
@ 2026-01-05  6:55 ` Zhijian Li (Fujitsu)
  2026-01-05  7:39   ` Zhijian Li (Fujitsu)
  2026-01-06  1:07   ` Jason Gunthorpe
  2026-01-06  1:01 ` Jason Gunthorpe
  2 siblings, 2 replies; 9+ messages in thread
From: Zhijian Li (Fujitsu) @ 2026-01-05  6:55 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, zyjzyj2000@gmail.com, jgg@ziepe.ca,
	leon@kernel.org, Yi Zhang, Bob Pearson

+Bob


On 26/12/2025 17:52, Li Zhijian wrote:
> The current implementation incorrectly handles memory regions (MRs) with
> page sizes different from the system PAGE_SIZE. The core issue is that
> rxe_set_page() is called with mr->page_size step increments, but the
> page_list stores individual struct page pointers, each representing
> PAGE_SIZE of memory.
> 
> Problem scenarios with concrete examples:
> 
> 1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):
> 
>     Suppose: PAGE_SIZE=4K, mr->page_size=64K, mr->ibmr.iova=0x13010
>     When rxe_set_page() is called with dma_addr=0x10000 (64K-aligned),
>     it stores only one 4K page at page_list[0], but should store 16 pages.
> 
>     Accessing iova=0x13034:
>     - Current code: index = (0x13034 >> 16) - (0x13010 >> 16) = 0
>                    offset = 0x13034 & (64K-1) = 0x3034
>     This calculates offset within 64K page, but page_list[0] is only 4K.
> 
>     - Expected: The iova=0x13034 should map to:
>                    base_align = 0x13010 & ~(64K-1) = 0x10000
>                    index = (0x13034 - 0x10000) / 4K = 0x3034 / 0x1000 = 3
>                    offset = 0x13034 & (4K-1) = 0x034
>                    So: page_list[3] with offset 0x34
> 
> 2. mr->page_size < PAGE_SIZE (e.g., 2K MR with 4K system pages):
> 
>     Suppose: PAGE_SIZE=4K, mr->page_size=2K, mr->ibmr.iova=0x1100
>     When rxe_set_page() is called with dma_addr=0x1000, it stores a 4K page
>     at page_list[0]. Another call with dma_addr=0x1800 should not store
>     a new page since 0x1800 is within the same 4K page as 0x1000.
> 
>     Accessing iova=0x1890:
>     - Current code: index = (0x1890 >> 11) - (0x1100 >> 11) = 3
>                    offset = 0x1890 & (2K-1) = 0x90
>                    This assumes page_list[3] exists and is a 2K page.
> 
>     - Expected: Both 0x1000 and 0x1800 are within the same 4K page:
>                    index = (0x1890 >> 12) - (0x1100 >> 12) = 0
>                    offset = 0x1890 & (4K-1) = 0x890
>                    So: page_list[0] with offset 0x890
> 
> Yi Zhang reported a kernel panic[1] years ago related to this defect.
> 
> The fix introduces:
> 
> 1. Enhanced iova-to-index conversion with proper alignment handling:
>     - For mr->page_size > PAGE_SIZE: Uses aligned base address
>       (mr->ibmr.iova & mr->page_mask) as reference, correctly calculating
>       which PAGE_SIZE sub-page contains the iova
>     - For mr->page_size <= PAGE_SIZE: Uses PAGE_SIZE shift arithmetic
>       to ensure each page_list entry corresponds to a PAGE_SIZE page
> 
> 2. Page splitting in rxe_set_page():
>     - When mr->page_size > PAGE_SIZE, set_pages_per_mr() splits each
>       MR page into multiple PAGE_SIZE pages stored consecutively
>     - For mr->page_size <= PAGE_SIZE, maintains original behavior
> 
> 3. Always use PAGE_SIZE for offset calculation:
>     - Since page_list stores PAGE_SIZE pages, offsets must be within
>       [0, PAGE_SIZE-1]
> 
> 4. Compatibility checks:
>     - Ensures mr->page_size and PAGE_SIZE have a compatible relationship
>       (one is multiple of the other) for correct mapping
> 
> The solution ensures correct iova-to-va conversion for all MR page sizes
> while maintaining the existing IB verbs semantics for MR registration
> and memory access.
> 
> This patch enables srp rnbd nvme in 64K system page environment.
> 
> Tests on (4K and 64K PAGE_SIZE):
> - rdma-core/pytests
>    $ ./build/bin/run_tests.py  --dev eth0_rxe
> - blktest:
>    $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd
> 
> In 64K environment, srp/012 is failed while it's passed in 4K environment.


>   static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>   {
> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> +	if (mr_page_size(mr) > PAGE_SIZE)
> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
> +	else
> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>   }
>   
> +/*
> + * Always return offset within a PAGE_SIZE page since page_list
> + * stores individual struct page pointers, each representing
> + * PAGE_SIZE of memory.
> + */
>   static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>   {
> -	return iova & (mr_page_size(mr) - 1);
> +	return iova & (PAGE_SIZE - 1);
>   }


After digging into the behavior during the srp/012 test again, it turns out this fix is incomplete.
The current xarray page_list approach cannot correctly map memory regions composed of two or more scatter-gather segments.


Consider the following scenarios:
(I'm assuming a 1:1 mapping from dma_addr to va for simplicity in the examples)

=========================
I) page_size < PAGE_SIZE
Assuming dma_addr to va mapping is 1:1
MR page_size = 4K
PAGE_SIZE = 64K
ibmr->iova = 0x181000;

With this patch:
sg_dma_address(sg[0]): 0x181000 (va: 0x181000), dma_len: 0x1000, save to page_list[0]=page(0x181000)
sg_dma_address(sg[1]): 0x173000 (va: 0x173000), dma_len: 0x1000, save to page_list[1]=page(0x173000)

Now accessing iova: 0x182000, expectation: va: 0x173000
In current patch,
> (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
index: (0x182000 >> 16) - (0x181000 >> 16)=0x18-0x18=0

> iova & (PAGE_SIZE - 1)
offset: 0x182000 & (PAGE_SIZE -1) = 0x2000

va: va(page_list[0]) + 0x2000 = 0x180000 + 0x2000 = 0x182000(incorrect)

(srp/012 hit this case).


II) page_size > PAGE_SIZE
MR page_size = 64K
PAGE_SIZE = 4K
ibmr->iova = 0x181000;

sg_dma_address(sg[0]): 0x181000 (va: 0x181000), dma_len: 0x1000, save to page_list[0-15], page_list[0]=page(0x180000) page_list[1]=page(0x181000)...
sg_dma_address(sg[1]): 0x173000 (va: 0x173000), dma_len: 0x1000, save to page_list[16-31], page_list[16]=page(0x170000) page_list[17]=page(0x171000)...

Now accessing iova: 0x182000, expectation: va: 0x173000
> (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
index: (0x182000 - 0x180000)/4K = 2

> iova & (PAGE_SIZE - 1);
offset: 0x182000 & (PAGE_SIZE -1) = 0

va: va(page_list[2]) + 0 = 0x182000(incorrect)

=========================

Case I) could be solved by introducing another xarray to store an extra base_offset for each page_list
entry, but for case II) I have not figured out a *good* solution without adding more parameters to
rxe_set_page() to convey the dma_len...
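
To make the case I) idea concrete, a rough sketch (the second xarray
mr->base_offsets and the helper name are purely illustrative, they do not
exist in rxe today):

  static int sketch_set_page(struct rxe_mr *mr, u64 dma_addr)
  {
          unsigned long idx = mr->nbuf;
          int err;

          /* store the PAGE_SIZE page that contains this MR page */
          err = xa_err(xa_store(&mr->page_list, idx,
                                ib_virt_dma_to_page(dma_addr), GFP_KERNEL));
          if (err)
                  return err;

          /* hypothetical second xarray: where this MR page starts
           * inside its PAGE_SIZE page
           */
          err = xa_err(xa_store(&mr->base_offsets, idx,
                                xa_mk_value(dma_addr & (PAGE_SIZE - 1)),
                                GFP_KERNEL));
          if (err)
                  return err;

          mr->nbuf++;
          return 0;
  }

  /* lookup would then become:
   *   idx    = (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
   *   offset = xa_to_value(xa_load(&mr->base_offsets, idx)) +
   *            (iova & (mr_page_size(mr) - 1));
   * which resolves iova 0x182000 in case I) above to
   * page(0x170000) + 0x3000 = va 0x173000.
   */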

Before I go further down this path, I'd like to raise a concern: is storing struct page pointers directly in
the page_list still the right approach for this problem?

Perhaps we need to rethink how we store the mapping information entirely.

Any thoughts or suggestions would be greatly appreciated.

Thanks
Zhijian


> 
> [1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
> ixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_mr.c | 83 +++++++++++++++++++++++++++---
>   1 file changed, 76 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index b28b56db725a..8ad7d163b418 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -72,14 +72,40 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>   	mr->ibmr.type = IB_MR_TYPE_DMA;
>   }
>   
> +/*
> + * Convert iova to page_list index. The page_list stores pages of size
> + * PAGE_SIZE, but MRs can have different page sizes. This function
> + * handles the conversion for all cases:
> + *
> + * 1. mr->page_size > PAGE_SIZE:
> + *    The MR's iova may not be aligned to mr->page_size. We use the
> + *    aligned base (iova & page_mask) as reference, then calculate
> + *    which PAGE_SIZE sub-page the iova falls into.
> + *
> + * 2. mr->page_size <= PAGE_SIZE:
> + *    Use simple shift arithmetic since each page_list entry corresponds
> + *    to one or more MR pages.
> + *
> + * Example for mr->page_size=64K, PAGE_SIZE=4K, iova=0x11034:
> + *   base = iova & ~(64K-1) = 0x10000
> + *   index = (0x11034 - 0x10000) / 4K = 0x1034 / 0x1000 = 4
> + */
>   static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>   {
> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> +	if (mr_page_size(mr) > PAGE_SIZE)
> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
> +	else
> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>   }
>   
> +/*
> + * Always return offset within a PAGE_SIZE page since page_list
> + * stores individual struct page pointers, each representing
> + * PAGE_SIZE of memory.
> + */
>   static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>   {
> -	return iova & (mr_page_size(mr) - 1);
> +	return iova & (PAGE_SIZE - 1);
>   }
>   
>   static bool is_pmem_page(struct page *pg)
> @@ -205,11 +231,40 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
>   	return err;
>   }
>   
> +/*
> + * Split a large MR page (mr->page_size) into multiple PAGE_SIZE
> + * sub-pages and store them in page_list.
> + *
> + * Called when mr->page_size > PAGE_SIZE. Each call to rxe_set_page()
> + * represents one mr->page_size region, which we must split into
> + * (mr->page_size / PAGE_SIZE) individual pages.
> + */
> +static int set_pages_per_mr(struct ib_mr *ibmr, u64 dma_addr)
> +{
> +	struct rxe_mr *mr = to_rmr(ibmr);
> +	u32 page_size = mr_page_size(mr);
> +	u64 addr = dma_addr & ~(u64)(page_size - 1);
> +	u32 i, pages_per_mr = page_size / PAGE_SIZE;
> +
> +	for (i = 0; i < pages_per_mr; i++) {
> +		struct page *sub_page =
> +			ib_virt_dma_to_page(addr + i * PAGE_SIZE);
> +		int err = xa_err(xa_store(&mr->page_list, mr->nbuf, sub_page,
> +					  GFP_KERNEL));
> +		if (err)
> +			return err;
> +
> +		mr->nbuf++;
> +	}
> +	return 0;
> +}
> +
>   static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>   {
>   	struct rxe_mr *mr = to_rmr(ibmr);
>   	struct page *page = ib_virt_dma_to_page(dma_addr);
>   	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
> +	u32 page_size = mr_page_size(mr);
>   	int err;
>   
>   	if (persistent && !is_pmem_page(page)) {
> @@ -217,9 +272,13 @@ static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>   		return -EINVAL;
>   	}
>   
> -	if (unlikely(mr->nbuf == mr->num_buf))
> +	if (unlikely(mr->nbuf >= mr->num_buf))
>   		return -ENOMEM;
>   
> +	if (page_size > PAGE_SIZE)
> +		return set_pages_per_mr(ibmr, dma_addr);
> +
> +	/* page_size <= PAGE_SIZE */
>   	err = xa_err(xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL));
>   	if (err)
>   		return err;
> @@ -234,6 +293,18 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
>   	struct rxe_mr *mr = to_rmr(ibmr);
>   	unsigned int page_size = mr_page_size(mr);
>   
> +	/*
> +	 * Ensure page_size and PAGE_SIZE are compatible for mapping.
> +	 * We require one to be a multiple of the other for correct
> +	 * iova-to-page conversion.
> +	 */
> +	if (!IS_ALIGNED(page_size, PAGE_SIZE) &&
> +	    !IS_ALIGNED(PAGE_SIZE, page_size)) {
> +		rxe_err_mr(mr, "MR page size %u must be compatible with PAGE_SIZE %lu\n",
> +			   page_size, PAGE_SIZE);
> +		return -EINVAL;
> +	}
> +
>   	mr->nbuf = 0;
>   	mr->page_shift = ilog2(page_size);
>   	mr->page_mask = ~((u64)page_size - 1);
> @@ -255,8 +326,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>   		if (!page)
>   			return -EFAULT;
>   
> -		bytes = min_t(unsigned int, length,
> -				mr_page_size(mr) - page_offset);
> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>   		va = kmap_local_page(page);
>   		if (dir == RXE_FROM_MR_OBJ)
>   			memcpy(addr, va + page_offset, bytes);
> @@ -442,8 +512,7 @@ static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int leng
>   		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>   		if (!page)
>   			return -EFAULT;
> -		bytes = min_t(unsigned int, length,
> -			      mr_page_size(mr) - page_offset);
> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>   
>   		va = kmap_local_page(page);
>   		arch_wb_cache_pmem(va + page_offset, bytes);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2026-01-05  6:55 ` Zhijian Li (Fujitsu)
@ 2026-01-05  7:39   ` Zhijian Li (Fujitsu)
  2026-01-06  1:07   ` Jason Gunthorpe
  1 sibling, 0 replies; 9+ messages in thread
From: Zhijian Li (Fujitsu) @ 2026-01-05  7:39 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, zyjzyj2000@gmail.com, jgg@ziepe.ca,
	leon@kernel.org, Yi Zhang, Bob Pearson



On 05/01/2026 14:55, Zhijian Li (Fujitsu) wrote:
> +Bob
> 
> 
> On 26/12/2025 17:52, Li Zhijian wrote:
>> The current implementation incorrectly handles memory regions (MRs) with
>> page sizes different from the system PAGE_SIZE. The core issue is that
>> rxe_set_page() is called with mr->page_size step increments, but the
>> page_list stores individual struct page pointers, each representing
>> PAGE_SIZE of memory.
>>
>> Problem scenarios with concrete examples:
>>
>> 1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):
>>
>>      Suppose: PAGE_SIZE=4K, mr->page_size=64K, mr->ibmr.iova=0x13010
>>      When rxe_set_page() is called with dma_addr=0x10000 (64K-aligned),
>>      it stores only one 4K page at page_list[0], but should store 16 pages.
>>
>>      Accessing iova=0x13034:
>>      - Current code: index = (0x13034 >> 16) - (0x13010 >> 16) = 0
>>                     offset = 0x13034 & (64K-1) = 0x3034
>>      This calculates offset within 64K page, but page_list[0] is only 4K.
>>
>>      - Expected: The iova=0x13034 should map to:
>>                     base_align = 0x13010 & ~(64K-1) = 0x10000
>>                     index = (0x13034 - 0x10000) / 4K = 0x3034 / 0x1000 = 3
>>                     offset = 0x13034 & (4K-1) = 0x034
>>                     So: page_list[3] with offset 0x34
>>
>> 2. mr->page_size < PAGE_SIZE (e.g., 2K MR with 4K system pages):
>>
>>      Suppose: PAGE_SIZE=4K, mr->page_size=2K, mr->ibmr.iova=0x1100
>>      When rxe_set_page() is called with dma_addr=0x1000, it stores a 4K page
>>      at page_list[0]. Another call with dma_addr=0x1800 should not store
>>      a new page since 0x1800 is within the same 4K page as 0x1000.
>>
>>      Accessing iova=0x1890:
>>      - Current code: index = (0x1890 >> 11) - (0x1100 >> 11) = 3
>>                     offset = 0x1890 & (2K-1) = 0x90
>>                     This assumes page_list[3] exists and is a 2K page.
>>
>>      - Expected: Both 0x1000 and 0x1800 are within the same 4K page:
>>                     index = (0x1890 >> 12) - (0x1100 >> 12) = 0
>>                     offset = 0x1890 & (4K-1) = 0x890
>>                     So: page_list[0] with offset 0x890
>>
>> Yi Zhang reported a kernel panic[1] years ago related to this defect.
>>
>> The fix introduces:
>>
>> 1. Enhanced iova-to-index conversion with proper alignment handling:
>>      - For mr->page_size > PAGE_SIZE: Uses aligned base address
>>        (mr->ibmr.iova & mr->page_mask) as reference, correctly calculating
>>        which PAGE_SIZE sub-page contains the iova
>>      - For mr->page_size <= PAGE_SIZE: Uses PAGE_SIZE shift arithmetic
>>        to ensure each page_list entry corresponds to a PAGE_SIZE page
>>
>> 2. Page splitting in rxe_set_page():
>>      - When mr->page_size > PAGE_SIZE, set_pages_per_mr() splits each
>>        MR page into multiple PAGE_SIZE pages stored consecutively
>>      - For mr->page_size <= PAGE_SIZE, maintains original behavior
>>
>> 3. Always use PAGE_SIZE for offset calculation:
>>      - Since page_list stores PAGE_SIZE pages, offsets must be within
>>        [0, PAGE_SIZE-1]
>>
>> 4. Compatibility checks:
>>      - Ensures mr->page_size and PAGE_SIZE have a compatible relationship
>>        (one is multiple of the other) for correct mapping
>>
>> The solution ensures correct iova-to-va conversion for all MR page sizes
>> while maintaining the existing IB verbs semantics for MR registration
>> and memory access.
>>
>> This patch enables srp rnbd nvme in 64K system page environment.
>>
>> Tests on (4K and 64K PAGE_SIZE):
>> - rdma-core/pytests
>>     $ ./build/bin/run_tests.py  --dev eth0_rxe
>> - blktest:
>>     $ TIMEOUT=30 QUICK_RUN=1 USE_RXE=1 NVMET_TRTYPES=rdma ./check nvme srp rnbd
>>
>> In 64K environment, srp/012 is failed while it's passed in 4K environment.
> 
> 
>>    static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>> +	if (mr_page_size(mr) > PAGE_SIZE)
>> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
>> +	else
>> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>>    }
>>    
>> +/*
>> + * Always return offset within a PAGE_SIZE page since page_list
>> + * stores individual struct page pointers, each representing
>> + * PAGE_SIZE of memory.
>> + */
>>    static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return iova & (mr_page_size(mr) - 1);
>> +	return iova & (PAGE_SIZE - 1);
>>    }
> 
> 
> After digging into the behavior during the srp/012 test again, it turns out this fix is incomplete.
> The current xarray page_list approach cannot correctly map memory regions composed of two or more scatter-gather segments.
> 
> 
> Consider the following scenarios:
> (I'm assuming a 1:1 mapping from dma_addr to va for simplicity in the examples)
> 
> =========================
> I) page_size < PAGE_SIZE
> Assuming dam_addr to va mapping is 1:1
> MR page_size = 4K
> PAGE_SIZE = 64K
> ibmr->iova = 0x181000;
> 
> With this patch:
> sg_dma_address(sg[0]): 0x181000(va: 0x181000), dma_len: 0x1000, save to page_list[0]=page(0x181000)
> sg_dma_address(sg[1]): 0x173000(va: 0x173000, dma_len: 0x1000, save to page_list[1]=page(0x171000)
> 
> Now accessing iova: 0x182000, expectation: va: 0x173000
> In current patch,
>> (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> index: (0x182000 >> 16) - (0x181000 >> 16)=0x18-0x18=0
> 
>> iova & (PAGE_SIZE - 1)
> offset: 0x182000 & (PAGE_SIZE -1) = 0x2000
> 
> va: va(page_list[0]) + 0x2000 = 0x180000 + 0x2000 = 0x182000(incorrect)

mr->page_shift should be 12, so

re-edit:

> (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
index: (0x182000 >> 12) - (0x181000 >> 12)=0x182-0x181=1

> iova & (PAGE_SIZE - 1)
offset: 0x182000 & (PAGE_SIZE -1) = 0x2000
  
va: va(page_list[1]) + 0x2000 = 0x170000 + 0x2000 = 0x172000(incorrect)

Thanks
Zhijian


> 
> (srp/012 hit this case).
> 
> 
> II) page_size > PAGE_SIZE
> MR page_size = 64K
> PAGE_SIZE = 4K
> ibmr->iova = 0x181000;
> 
> sg_dma_address(sg[0]): 0x181000(va: 0x181000), dma_len: 0x1000, save to page_list[0-15], page_list[0]=page(0x180000) page_list[1]=page(0x181000)...
> sg_dma_address(sg[1]): 0x173000(va: 0x173000, dma_len: 0x1000, save to page_list[16-32], page_list[16]=page(0x170000) page_list[17]=page(0x171000)...
> 
> Now accessing iova: 0x182000, expectation: va: 0x173000
>> (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
> index: (0x182000 - 0x180000)/4K = 2
> 
>> iova & (PAGE_SIZE - 1);
> offset: 0x182000 & (PAGE_SIZE -1) = 0
> 
> va: va(page_list[2]) + 0 = 0x182000(incorrect)
> 
> =========================
> 
> Although the case I) can be solved by introducing another xarray to save extra base_offset for each page_list
> while the case II) I have not figured out a *good* solution without adding more parameters to rxe_set_page()
> to tell the dma_len...
> 
> Before I go further down this path, I'd like to raise a concern: is storing struct page pointers directly in
> the page_list still the right approach for this problem?
> 
> Perhaps we need to rethink how we store the mapping information entirely.
> 
> Any thoughts or suggestions would be greatly appreciated.
> 
> Thanks
> Zhijian
> 
> 
>>
>> [1] https://lore.kernel.org/all/CAHj4cs9XRqE25jyVw9rj9YugffLn5+f=1znaBEnu1usLOciD+g@mail.gmail.com/T/
>> ixes: 592627ccbdff ("RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray")
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>    drivers/infiniband/sw/rxe/rxe_mr.c | 83 +++++++++++++++++++++++++++---
>>    1 file changed, 76 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>> index b28b56db725a..8ad7d163b418 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>> @@ -72,14 +72,40 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>>    	mr->ibmr.type = IB_MR_TYPE_DMA;
>>    }
>>    
>> +/*
>> + * Convert iova to page_list index. The page_list stores pages of size
>> + * PAGE_SIZE, but MRs can have different page sizes. This function
>> + * handles the conversion for all cases:
>> + *
>> + * 1. mr->page_size > PAGE_SIZE:
>> + *    The MR's iova may not be aligned to mr->page_size. We use the
>> + *    aligned base (iova & page_mask) as reference, then calculate
>> + *    which PAGE_SIZE sub-page the iova falls into.
>> + *
>> + * 2. mr->page_size <= PAGE_SIZE:
>> + *    Use simple shift arithmetic since each page_list entry corresponds
>> + *    to one or more MR pages.
>> + *
>> + * Example for mr->page_size=64K, PAGE_SIZE=4K, iova=0x11034:
>> + *   base = iova & ~(64K-1) = 0x10000
>> + *   index = (0x11034 - 0x10000) / 4K = 0x1034 / 0x1000 = 4
>> + */
>>    static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>> +	if (mr_page_size(mr) > PAGE_SIZE)
>> +		return (iova - (mr->ibmr.iova & mr->page_mask)) / PAGE_SIZE;
>> +	else
>> +		return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>>    }
>>    
>> +/*
>> + * Always return offset within a PAGE_SIZE page since page_list
>> + * stores individual struct page pointers, each representing
>> + * PAGE_SIZE of memory.
>> + */
>>    static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>>    {
>> -	return iova & (mr_page_size(mr) - 1);
>> +	return iova & (PAGE_SIZE - 1);
>>    }
>>    
>>    static bool is_pmem_page(struct page *pg)
>> @@ -205,11 +231,40 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
>>    	return err;
>>    }
>>    
>> +/*
>> + * Split a large MR page (mr->page_size) into multiple PAGE_SIZE
>> + * sub-pages and store them in page_list.
>> + *
>> + * Called when mr->page_size > PAGE_SIZE. Each call to rxe_set_page()
>> + * represents one mr->page_size region, which we must split into
>> + * (mr->page_size / PAGE_SIZE) individual pages.
>> + */
>> +static int set_pages_per_mr(struct ib_mr *ibmr, u64 dma_addr)
>> +{
>> +	struct rxe_mr *mr = to_rmr(ibmr);
>> +	u32 page_size = mr_page_size(mr);
>> +	u64 addr = dma_addr & ~(u64)(page_size - 1);
>> +	u32 i, pages_per_mr = page_size / PAGE_SIZE;
>> +
>> +	for (i = 0; i < pages_per_mr; i++) {
>> +		struct page *sub_page =
>> +			ib_virt_dma_to_page(addr + i * PAGE_SIZE);
>> +		int err = xa_err(xa_store(&mr->page_list, mr->nbuf, sub_page,
>> +					  GFP_KERNEL));
>> +		if (err)
>> +			return err;
>> +
>> +		mr->nbuf++;
>> +	}
>> +	return 0;
>> +}
>> +
>>    static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>>    {
>>    	struct rxe_mr *mr = to_rmr(ibmr);
>>    	struct page *page = ib_virt_dma_to_page(dma_addr);
>>    	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
>> +	u32 page_size = mr_page_size(mr);
>>    	int err;
>>    
>>    	if (persistent && !is_pmem_page(page)) {
>> @@ -217,9 +272,13 @@ static int rxe_set_page(struct ib_mr *ibmr, u64 dma_addr)
>>    		return -EINVAL;
>>    	}
>>    
>> -	if (unlikely(mr->nbuf == mr->num_buf))
>> +	if (unlikely(mr->nbuf >= mr->num_buf))
>>    		return -ENOMEM;
>>    
>> +	if (page_size > PAGE_SIZE)
>> +		return set_pages_per_mr(ibmr, dma_addr);
>> +
>> +	/* page_size <= PAGE_SIZE */
>>    	err = xa_err(xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL));
>>    	if (err)
>>    		return err;
>> @@ -234,6 +293,18 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
>>    	struct rxe_mr *mr = to_rmr(ibmr);
>>    	unsigned int page_size = mr_page_size(mr);
>>    
>> +	/*
>> +	 * Ensure page_size and PAGE_SIZE are compatible for mapping.
>> +	 * We require one to be a multiple of the other for correct
>> +	 * iova-to-page conversion.
>> +	 */
>> +	if (!IS_ALIGNED(page_size, PAGE_SIZE) &&
>> +	    !IS_ALIGNED(PAGE_SIZE, page_size)) {
>> +		rxe_err_mr(mr, "MR page size %u must be compatible with PAGE_SIZE %lu\n",
>> +			   page_size, PAGE_SIZE);
>> +		return -EINVAL;
>> +	}
>> +
>>    	mr->nbuf = 0;
>>    	mr->page_shift = ilog2(page_size);
>>    	mr->page_mask = ~((u64)page_size - 1);
>> @@ -255,8 +326,7 @@ static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>>    		if (!page)
>>    			return -EFAULT;
>>    
>> -		bytes = min_t(unsigned int, length,
>> -				mr_page_size(mr) - page_offset);
>> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>>    		va = kmap_local_page(page);
>>    		if (dir == RXE_FROM_MR_OBJ)
>>    			memcpy(addr, va + page_offset, bytes);
>> @@ -442,8 +512,7 @@ static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int leng
>>    		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>>    		if (!page)
>>    			return -EFAULT;
>> -		bytes = min_t(unsigned int, length,
>> -			      mr_page_size(mr) - page_offset);
>> +		bytes = min_t(unsigned int, length, PAGE_SIZE - page_offset);
>>    
>>    		va = kmap_local_page(page);
>>    		arch_wb_cache_pmem(va + page_offset, bytes);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2025-12-26  9:52 [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE Li Zhijian
  2025-12-27  5:24 ` Zhu Yanjun
  2026-01-05  6:55 ` Zhijian Li (Fujitsu)
@ 2026-01-06  1:01 ` Jason Gunthorpe
  2 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2026-01-06  1:01 UTC (permalink / raw)
  To: Li Zhijian; +Cc: linux-rdma, linux-kernel, zyjzyj2000, leon, Yi Zhang

On Fri, Dec 26, 2025 at 05:52:37PM +0800, Li Zhijian wrote:
> The current implementation incorrectly handles memory regions (MRs) with
> page sizes different from the system PAGE_SIZE. The core issue is that
> rxe_set_page() is called with mr->page_size step increments, but the
> page_list stores individual struct page pointers, each representing
> PAGE_SIZE of memory.
> 
> Problem scenarios with concrete examples:
> 
> 1. mr->page_size > PAGE_SIZE (e.g., 64K MR with 4K system pages):

Why does RXE even have mr->page_size at all?

Real HW has this value to optimize for large page sizes, but I'm not
sure we really need this for RXE..

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE
  2026-01-05  6:55 ` Zhijian Li (Fujitsu)
  2026-01-05  7:39   ` Zhijian Li (Fujitsu)
@ 2026-01-06  1:07   ` Jason Gunthorpe
  1 sibling, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2026-01-06  1:07 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
	zyjzyj2000@gmail.com, leon@kernel.org, Yi Zhang, Bob Pearson

On Mon, Jan 05, 2026 at 06:55:22AM +0000, Zhijian Li (Fujitsu) wrote:

> After digging into the behavior during the srp/012 test again, it
> turns out this fix is incomplete.  The current xarray page_list
> approach cannot correctly map memory regions composed of two or more
> scatter-gather segments.

I seem to recall there are DMA API functions that can control what
kinds of scatterlists the block stack will push down.

For real HW we already cannot support less than 4K alignment of interior SGL
segments.

Maybe rxe can tell the block stack it can only support PAGE_SIZE
alignment of interior SGL segments?

If not then this would be the reason rxe needs mr->page_size, to
support 4k.

And obviously if the mr->page size is less than PAGE_SIZE the xarray
datastructure does not work. You'd have to store physical addresses
instead..
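
Roughly something like this untested sketch (all names are illustrative,
and it assumes the MR iova is mr->page_size aligned): keep one dma address
per MR page and only resolve it to a page + offset at access time:

  /* illustrative only, none of these names exist in rxe */
  struct sketch_mr_map {
          u64 *dma_addrs;   /* one entry per MR page (mr->page_size each) */
          u32  num_buf;
  };

  static struct page *sketch_iova_to_page(struct rxe_mr *mr,
                                          struct sketch_mr_map *map,
                                          u64 iova, unsigned long *offset)
  {
          unsigned long idx = (iova - mr->ibmr.iova) >> mr->page_shift;
          u64 dma;

          if (idx >= map->num_buf)
                  return NULL;

          /* exact dma address of this iova, independent of PAGE_SIZE */
          dma = map->dma_addrs[idx] + (iova & (mr_page_size(mr) - 1));
          *offset = dma & (PAGE_SIZE - 1);
          return ib_virt_dma_to_page(dma);
  }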

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-01-06  1:07 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-26  9:52 [PATCH RFC] rxe: Fix iova-to-va conversion for MR page sizes != PAGE_SIZE Li Zhijian
2025-12-27  5:24 ` Zhu Yanjun
2025-12-30  4:02   ` Zhu Yanjun
2025-12-30  4:47     ` Zhijian Li (Fujitsu)
2025-12-30  4:57       ` Zhu Yanjun
2026-01-05  6:55 ` Zhijian Li (Fujitsu)
2026-01-05  7:39   ` Zhijian Li (Fujitsu)
2026-01-06  1:07   ` Jason Gunthorpe
2026-01-06  1:01 ` Jason Gunthorpe
