public inbox for linux-rdma@vger.kernel.org
* [PATCH for-next v2 0/2] RDMA/rxe: RDMA FLUSH and ATOMIC WRITE with ODP
@ 2025-03-18  9:49 Daisuke Matsuda
  2025-03-18  9:49 ` [PATCH for-next v2 1/2] RDMA/rxe: Enable ODP in RDMA FLUSH operation Daisuke Matsuda
  2025-03-18  9:49 ` [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation Daisuke Matsuda
  0 siblings, 2 replies; 9+ messages in thread
From: Daisuke Matsuda @ 2025-03-18  9:49 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000; +Cc: lizhijian, Daisuke Matsuda

RDMA FLUSH[1] and ATOMIC WRITE[2] have been added to rxe, but they cannot
yet be used in ODP mode. This series provides the kernel-side enablement.

There are also minor changes in libibverbs and pyverbs, and rdma-core tests
are added so that people can exercise the new features.
PR: https://github.com/linux-rdma/rdma-core/pull/1580

You can try the patches with the tree below:
https://github.com/ddmatsu/linux/tree/odp-extension2

Note that the tree is a bit old (based on 6.13-rc1), because an issue[3]
in the for-next tree disabled ibv_query_device_ex(), which is used to
query ODP capabilities. However, a fix[4] already exists and is expected
to land in the next release. I will update the tree once it is ready.

[1] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/

[2] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/

[3] [bug report] RDMA/rxe: Failure of ibv_query_device() and ibv_query_device_ex() tests in rdma-core
https://lore.kernel.org/all/1b9d6286-62fc-4b42-b304-0054c4ebee02@linux.dev/T/

[4] [PATCH rdma-rc 1/1] RDMA/rxe: Fix the failure of ibv_query_device() and ibv_query_device_ex() tests
https://lore.kernel.org/linux-rdma/174102882930.42565.11864314726635251412.b4-ty@kernel.org/T/#t

History:
  v1->v2: Removed some code duplications

Daisuke Matsuda (2):
  RDMA/rxe: Enable ODP in RDMA FLUSH operation
  RDMA/rxe: Enable ODP in ATOMIC WRITE operation

 drivers/infiniband/sw/rxe/rxe.c      |   2 +
 drivers/infiniband/sw/rxe/rxe_loc.h  |  12 +++
 drivers/infiniband/sw/rxe/rxe_mr.c   |  48 +++++------
 drivers/infiniband/sw/rxe/rxe_odp.c  | 115 ++++++++++++++++++++++++++-
 drivers/infiniband/sw/rxe/rxe_resp.c |  15 ++--
 include/rdma/ib_verbs.h              |   2 +
 6 files changed, 161 insertions(+), 33 deletions(-)

-- 
2.43.0



* [PATCH for-next v2 1/2] RDMA/rxe: Enable ODP in RDMA FLUSH operation
  2025-03-18  9:49 [PATCH for-next v2 0/2] RDMA/rxe: RDMA FLUSH and ATOMIC WRITE with ODP Daisuke Matsuda
@ 2025-03-18  9:49 ` Daisuke Matsuda
  2025-03-20  6:59   ` Zhijian Li (Fujitsu)
  2025-03-18  9:49 ` [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation Daisuke Matsuda
  1 sibling, 1 reply; 9+ messages in thread
From: Daisuke Matsuda @ 2025-03-18  9:49 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000; +Cc: lizhijian, Daisuke Matsuda

For persistent memory, add rxe_odp_flush_pmem_iova() so that ODP-specific
steps are executed. Otherwise, no additional handling is required.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c      |  1 +
 drivers/infiniband/sw/rxe/rxe_loc.h  |  7 ++++
 drivers/infiniband/sw/rxe/rxe_mr.c   | 36 ++++++++++------
 drivers/infiniband/sw/rxe/rxe_odp.c  | 62 ++++++++++++++++++++++++++--
 drivers/infiniband/sw/rxe/rxe_resp.c |  4 --
 include/rdma/ib_verbs.h              |  1 +
 6 files changed, 91 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 4e56a371deb5..df66f8f9efa1 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -109,6 +109,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_FLUSH;
 	}
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index feb386d98d1d..0012bebe96ef 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -194,6 +194,8 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		    enum rxe_mr_copy_dir dir);
 int rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			 u64 compare, u64 swap_add, u64 *orig_val);
+int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
+			    unsigned int length);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int
 rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
@@ -212,6 +214,11 @@ rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 {
 	return RESPST_ERR_UNSUPPORTED_OPCODE;
 }
+static inline int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
+					  unsigned int length)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 868d2f0b74e9..93e4b5acd3ac 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -424,7 +424,7 @@ int copy_data(
 	return err;
 }
 
-int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
+static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
 {
 	unsigned int page_offset;
 	unsigned long index;
@@ -433,16 +433,6 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
 	int err;
 	u8 *va;
 
-	/* mr must be valid even if length is zero */
-	if (WARN_ON(!mr))
-		return -EINVAL;
-
-	if (length == 0)
-		return 0;
-
-	if (mr->ibmr.type == IB_MR_TYPE_DMA)
-		return -EFAULT;
-
 	err = mr_check_range(mr, iova, length);
 	if (err)
 		return err;
@@ -454,7 +444,7 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
 		if (!page)
 			return -EFAULT;
 		bytes = min_t(unsigned int, length,
-				mr_page_size(mr) - page_offset);
+			      mr_page_size(mr) - page_offset);
 
 		va = kmap_local_page(page);
 		arch_wb_cache_pmem(va + page_offset, bytes);
@@ -468,6 +458,28 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
 	return 0;
 }
 
+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 start, unsigned int length)
+{
+	int err;
+
+	/* mr must be valid even if length is zero */
+	if (WARN_ON(!mr))
+		return -EINVAL;
+
+	if (length == 0)
+		return 0;
+
+	if (mr->ibmr.type == IB_MR_TYPE_DMA)
+		return -EFAULT;
+
+	if (mr->umem->is_odp)
+		err = rxe_odp_flush_pmem_iova(mr, start, length);
+	else
+		err = rxe_mr_flush_pmem_iova(mr, start, length);
+
+	return err;
+}
+
 /* Guarantee atomicity of atomic operations at the machine level. */
 DEFINE_SPINLOCK(atomic_ops_lock);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 9f6e2bb2a269..9a9aae967486 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/hmm.h>
+#include <linux/libnvdimm.h>
 
 #include <rdma/ib_umem_odp.h>
 
@@ -147,6 +148,16 @@ static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp,
 	return need_fault;
 }
 
+static unsigned long rxe_odp_iova_to_index(struct ib_umem_odp *umem_odp, u64 iova)
+{
+	return (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
+}
+
+static unsigned long rxe_odp_iova_to_page_offset(struct ib_umem_odp *umem_odp, u64 iova)
+{
+	return iova & (BIT(umem_odp->page_shift) - 1);
+}
+
 static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u32 flags)
 {
 	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
@@ -190,8 +201,8 @@ static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 	size_t offset;
 	u8 *user_va;
 
-	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
-	offset = iova & (BIT(umem_odp->page_shift) - 1);
+	idx = rxe_odp_iova_to_index(umem_odp, iova);
+	offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
 
 	while (length > 0) {
 		u8 *src, *dest;
@@ -277,8 +288,8 @@ static int rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 		return RESPST_ERR_RKEY_VIOLATION;
 	}
 
-	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
-	page_offset = iova & (BIT(umem_odp->page_shift) - 1);
+	idx = rxe_odp_iova_to_index(umem_odp, iova);
+	page_offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
 	page = hmm_pfn_to_page(umem_odp->pfn_list[idx]);
 	if (!page)
 		return RESPST_ERR_RKEY_VIOLATION;
@@ -324,3 +335,46 @@ int rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 
 	return err;
 }
+
+int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
+			    unsigned int length)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	unsigned int page_offset;
+	unsigned long index;
+	struct page *page;
+	unsigned int bytes;
+	int err;
+	u8 *va;
+
+	err = rxe_odp_map_range_and_lock(mr, iova, length,
+					 RXE_PAGEFAULT_DEFAULT);
+	if (err)
+		return err;
+
+	while (length > 0) {
+		index = rxe_odp_iova_to_index(umem_odp, iova);
+		page_offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
+
+		page = hmm_pfn_to_page(umem_odp->pfn_list[index]);
+		if (!page) {
+			mutex_unlock(&umem_odp->umem_mutex);
+			return -EFAULT;
+		}
+
+		bytes = min_t(unsigned int, length,
+			      mr_page_size(mr) - page_offset);
+
+		va = kmap_local_page(page);
+		arch_wb_cache_pmem(va + page_offset, bytes);
+		kunmap_local(va);
+
+		length -= bytes;
+		iova += bytes;
+		page_offset = 0;
+	}
+
+	mutex_unlock(&umem_odp->umem_mutex);
+
+	return 0;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 54ba9ee1acc5..304e3de740ad 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -649,10 +649,6 @@ static enum resp_states process_flush(struct rxe_qp *qp,
 	struct rxe_mr *mr = qp->resp.mr;
 	struct resp_res *res = qp->resp.res;
 
-	/* ODP is not supported right now. WIP. */
-	if (mr->umem->is_odp)
-		return RESPST_ERR_UNSUPPORTED_OPCODE;
-
 	/* oA19-14, oA19-15 */
 	if (res && res->replay)
 		return RESPST_ACKNOWLEDGE;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9941f4185c79..da07d3e2db1d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -325,6 +325,7 @@ enum ib_odp_transport_cap_bits {
 	IB_ODP_SUPPORT_READ	= 1 << 3,
 	IB_ODP_SUPPORT_ATOMIC	= 1 << 4,
 	IB_ODP_SUPPORT_SRQ_RECV	= 1 << 5,
+	IB_ODP_SUPPORT_FLUSH	= 1 << 6,
 };
 
 struct ib_odp_caps {
-- 
2.43.0



* [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation
  2025-03-18  9:49 [PATCH for-next v2 0/2] RDMA/rxe: RDMA FLUSH and ATOMIC WRITE with ODP Daisuke Matsuda
  2025-03-18  9:49 ` [PATCH for-next v2 1/2] RDMA/rxe: Enable ODP in RDMA FLUSH operation Daisuke Matsuda
@ 2025-03-18  9:49 ` Daisuke Matsuda
  2025-03-18 10:10   ` Leon Romanovsky
  1 sibling, 1 reply; 9+ messages in thread
From: Daisuke Matsuda @ 2025-03-18  9:49 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000; +Cc: lizhijian, Daisuke Matsuda

Add rxe_odp_do_atomic_write() so that ODP-specific steps are applied to
ATOMIC WRITE requests.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c      |  1 +
 drivers/infiniband/sw/rxe/rxe_loc.h  |  5 +++
 drivers/infiniband/sw/rxe/rxe_mr.c   | 12 -------
 drivers/infiniband/sw/rxe/rxe_odp.c  | 53 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c | 11 +++++-
 include/rdma/ib_verbs.h              |  1 +
 6 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index df66f8f9efa1..21ce2d876b42 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -110,6 +110,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_FLUSH;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC_WRITE;
 	}
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 0012bebe96ef..8b1517c0894c 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -196,6 +196,7 @@ int rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			 u64 compare, u64 swap_add, u64 *orig_val);
 int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 			    unsigned int length);
+int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int
 rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
@@ -219,6 +220,10 @@ static inline int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 {
 	return -EOPNOTSUPP;
 }
+static inline int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
+{
+	return RESPST_ERR_UNSUPPORTED_OPCODE;
+}
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 93e4b5acd3ac..d40fbe10633f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -547,16 +547,6 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 	struct page *page;
 	u64 *va;
 
-	/* ODP is not supported right now. WIP. */
-	if (mr->umem->is_odp)
-		return RESPST_ERR_UNSUPPORTED_OPCODE;
-
-	/* See IBA oA19-28 */
-	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
-		rxe_dbg_mr(mr, "mr not in valid state\n");
-		return RESPST_ERR_RKEY_VIOLATION;
-	}
-
 	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
 		page_offset = iova & (PAGE_SIZE - 1);
 		page = ib_virt_dma_to_page(iova);
@@ -584,10 +574,8 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 	}
 
 	va = kmap_local_page(page);
-
 	/* Do atomic write after all prior operations have completed */
 	smp_store_release(&va[page_offset >> 3], value);
-
 	kunmap_local(va);
 
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 9a9aae967486..f3443c604a7f 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -378,3 +378,56 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
 
 	return 0;
 }
+
+#if defined CONFIG_64BIT
+/* only implemented or called for 64 bit architectures */
+int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	unsigned int page_offset;
+	unsigned long index;
+	struct page *page;
+	int err;
+	u64 *va;
+
+	/* See IBA oA19-28 */
+	err = mr_check_range(mr, iova, sizeof(value));
+	if (unlikely(err)) {
+		rxe_dbg_mr(mr, "iova out of range\n");
+		return RESPST_ERR_RKEY_VIOLATION;
+	}
+
+	err = rxe_odp_map_range_and_lock(mr, iova, sizeof(value),
+					 RXE_PAGEFAULT_DEFAULT);
+	if (err)
+		return RESPST_ERR_RKEY_VIOLATION;
+
+	page_offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
+	index = rxe_odp_iova_to_index(umem_odp, iova);
+	page = hmm_pfn_to_page(umem_odp->pfn_list[index]);
+	if (!page) {
+		mutex_unlock(&umem_odp->umem_mutex);
+		return RESPST_ERR_RKEY_VIOLATION;
+	}
+	/* See IBA A19.4.2 */
+	if (unlikely(page_offset & 0x7)) {
+		mutex_unlock(&umem_odp->umem_mutex);
+		rxe_dbg_mr(mr, "misaligned address\n");
+		return RESPST_ERR_MISALIGNED_ATOMIC;
+	}
+
+	va = kmap_local_page(page);
+	/* Do atomic write after all prior operations have completed */
+	smp_store_release(&va[page_offset >> 3], value);
+	kunmap_local(va);
+
+	mutex_unlock(&umem_odp->umem_mutex);
+
+	return 0;
+}
+#else
+int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
+{
+	return RESPST_ERR_UNSUPPORTED_OPCODE;
+}
+#endif
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 304e3de740ad..fd7bac5bce18 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -749,7 +749,16 @@ static enum resp_states atomic_write_reply(struct rxe_qp *qp,
 	value = *(u64 *)payload_addr(pkt);
 	iova = qp->resp.va + qp->resp.offset;
 
-	err = rxe_mr_do_atomic_write(mr, iova, value);
+	/* See IBA oA19-28 */
+	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
+		rxe_dbg_mr(mr, "mr not in valid state\n");
+		return RESPST_ERR_RKEY_VIOLATION;
+	}
+
+	if (mr->umem->is_odp)
+		err = rxe_odp_do_atomic_write(mr, iova, value);
+	else
+		err = rxe_mr_do_atomic_write(mr, iova, value);
 	if (err)
 		return err;
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index da07d3e2db1d..bfa1bff3c720 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -326,6 +326,7 @@ enum ib_odp_transport_cap_bits {
 	IB_ODP_SUPPORT_ATOMIC	= 1 << 4,
 	IB_ODP_SUPPORT_SRQ_RECV	= 1 << 5,
 	IB_ODP_SUPPORT_FLUSH	= 1 << 6,
+	IB_ODP_SUPPORT_ATOMIC_WRITE	= 1 << 7,
 };
 
 struct ib_odp_caps {
-- 
2.43.0



* Re: [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation
  2025-03-18  9:49 ` [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation Daisuke Matsuda
@ 2025-03-18 10:10   ` Leon Romanovsky
  2025-03-19  2:58     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Romanovsky @ 2025-03-18 10:10 UTC (permalink / raw)
  To: Daisuke Matsuda; +Cc: linux-rdma, jgg, zyjzyj2000, lizhijian

On Tue, Mar 18, 2025 at 06:49:32PM +0900, Daisuke Matsuda wrote:
> Add rxe_odp_do_atomic_write() so that ODP specific steps are applied to
> ATOMIC WRITE requests.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe.c      |  1 +
>  drivers/infiniband/sw/rxe/rxe_loc.h  |  5 +++
>  drivers/infiniband/sw/rxe/rxe_mr.c   | 12 -------
>  drivers/infiniband/sw/rxe/rxe_odp.c  | 53 ++++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_resp.c | 11 +++++-
>  include/rdma/ib_verbs.h              |  1 +
>  6 files changed, 70 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index df66f8f9efa1..21ce2d876b42 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -110,6 +110,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
>  		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
>  		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
>  		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_FLUSH;
> +		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC_WRITE;
>  	}
>  }

<...>

> +static inline int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> +{
> +	return RESPST_ERR_UNSUPPORTED_OPCODE;
> +}
>  #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */

You are returning "enum resp_states", while the function is declared to return "int". You should return -EOPNOTSUPP.

>  
>  #endif /* RXE_LOC_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 93e4b5acd3ac..d40fbe10633f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -547,16 +547,6 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
>  	struct page *page;
>  	u64 *va;
>  
> -	/* ODP is not supported right now. WIP. */
> -	if (mr->umem->is_odp)
> -		return RESPST_ERR_UNSUPPORTED_OPCODE;
> -
> -	/* See IBA oA19-28 */
> -	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
> -		rxe_dbg_mr(mr, "mr not in valid state\n");
> -		return RESPST_ERR_RKEY_VIOLATION;
> -	}
> -
>  	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
>  		page_offset = iova & (PAGE_SIZE - 1);
>  		page = ib_virt_dma_to_page(iova);
> @@ -584,10 +574,8 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
>  	}
>  
>  	va = kmap_local_page(page);
> -
>  	/* Do atomic write after all prior operations have completed */
>  	smp_store_release(&va[page_offset >> 3], value);
> -
>  	kunmap_local(va);
>  
>  	return 0;
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index 9a9aae967486..f3443c604a7f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -378,3 +378,56 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
>  
>  	return 0;
>  }
> +
> +#if defined CONFIG_64BIT
> +/* only implemented or called for 64 bit architectures */
> +int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> +{
> +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +	unsigned int page_offset;
> +	unsigned long index;
> +	struct page *page;
> +	int err;
> +	u64 *va;
> +
> +	/* See IBA oA19-28 */
> +	err = mr_check_range(mr, iova, sizeof(value));
> +	if (unlikely(err)) {
> +		rxe_dbg_mr(mr, "iova out of range\n");
> +		return RESPST_ERR_RKEY_VIOLATION;

Please don't redefine returned errors.

> +	}

<...>

> +#else
> +int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> +{
> +	return RESPST_ERR_UNSUPPORTED_OPCODE;
> +}
> +#endif

You already have empty declaration in rxe_loc.h, use it.

Thanks


* RE: [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation
  2025-03-18 10:10   ` Leon Romanovsky
@ 2025-03-19  2:58     ` Daisuke Matsuda (Fujitsu)
  2025-03-19  8:58       ` Leon Romanovsky
  0 siblings, 1 reply; 9+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2025-03-19  2:58 UTC (permalink / raw)
  To: 'Leon Romanovsky'
  Cc: linux-rdma@vger.kernel.org, jgg@ziepe.ca, zyjzyj2000@gmail.com,
	Zhijian Li (Fujitsu)

On Tue, Mar 18, 2025 7:10 PM Leon Romanovsky wrote:
> On Tue, Mar 18, 2025 at 06:49:32PM +0900, Daisuke Matsuda wrote:
> 
> <...>
> 
> > +static inline int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > +{
> > +	return RESPST_ERR_UNSUPPORTED_OPCODE;
> > +}
> >  #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> 
> You are returning "enum resp_states", while function expects to return "int". You should return -EOPNOTSUPP.

Apart from my patches, there are some existing functions that do the same thing.
I would like to post a patch to make them consistent, but I think we need to
reach an agreement on the design of the rxe responder before taking that up.
Please see my opinion below.

> 
> >
> >  #endif /* RXE_LOC_H */

<...>

> > diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> > index 9a9aae967486..f3443c604a7f 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> > @@ -378,3 +378,56 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
> >
> >  	return 0;
> >  }
> > +
> > +#if defined CONFIG_64BIT
> > +/* only implemented or called for 64 bit architectures */
> > +int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > +{
> > +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > +	unsigned int page_offset;
> > +	unsigned long index;
> > +	struct page *page;
> > +	int err;
> > +	u64 *va;
> > +
> > +	/* See IBA oA19-28 */
> > +	err = mr_check_range(mr, iova, sizeof(value));
> > +	if (unlikely(err)) {
> > +		rxe_dbg_mr(mr, "iova out of range\n");
> > +		return RESPST_ERR_RKEY_VIOLATION;
> 
> Please don't redefine returned errors.

As a general principle, I think your comment is totally correct.
The problem is that rxe_receiver(), the responder of rxe, was originally designed
as a state machine, and the returned values of "enum resp_states" are used
to specify the next state.

One thing to note is that rxe_receiver() runs solely in a workqueue, so errors
generated in bottom-half context are never returned to userspace. In that regard,
I think redefining the error codes with different enum values can be justified.

The responder's state-machine design is easy to understand and maintain.
TBH, I am not inclined to change the design, but I would like to hear the
opinions of other people.

> 
> > +	}
> 
> <...>
> 
> > +#else
> > +int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > +{
> > +	return RESPST_ERR_UNSUPPORTED_OPCODE;
> > +}
> > +#endif
> 
> You already have empty declaration in rxe_loc.h, use it.

That's right. I will change it.

Thanks,
Daisuke

> 
> Thanks


* Re: [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation
  2025-03-19  2:58     ` Daisuke Matsuda (Fujitsu)
@ 2025-03-19  8:58       ` Leon Romanovsky
  2025-03-24  8:05         ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Romanovsky @ 2025-03-19  8:58 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu)
  Cc: linux-rdma@vger.kernel.org, jgg@ziepe.ca, zyjzyj2000@gmail.com,
	Zhijian Li (Fujitsu)

On Wed, Mar 19, 2025 at 02:58:51AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> On Tue, Mar 18, 2025 7:10 PM Leon Romanovsky wrote:
> > On Tue, Mar 18, 2025 at 06:49:32PM +0900, Daisuke Matsuda wrote:
> > 
> > <...>
> > 
> > > +static inline int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > > +{
> > > +	return RESPST_ERR_UNSUPPORTED_OPCODE;
> > > +}
> > >  #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> > 
> > You are returning "enum resp_states", while function expects to return "int". You should return -EOPNOTSUPP.
> 
> Other than my patches, there are some functions that do the same thing.

Yes, but you are adding new code, and in new code you should keep the
function declaration and its return values consistent.

> I would like to post a patch to make them consistent, but I think we need
> reach an agreement on the design of rxe responder before taking up.
> Please see my opinion below.
> 
> > 
> > >
> > >  #endif /* RXE_LOC_H */
> 
> <...>
> 
> > > diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> > > index 9a9aae967486..f3443c604a7f 100644
> > > --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> > > +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> > > @@ -378,3 +378,56 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
> > >
> > >  	return 0;
> > >  }
> > > +
> > > +#if defined CONFIG_64BIT
> > > +/* only implemented or called for 64 bit architectures */
> > > +int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > > +{
> > > +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > > +	unsigned int page_offset;
> > > +	unsigned long index;
> > > +	struct page *page;
> > > +	int err;
> > > +	u64 *va;
> > > +
> > > +	/* See IBA oA19-28 */
> > > +	err = mr_check_range(mr, iova, sizeof(value));
> > > +	if (unlikely(err)) {
> > > +		rxe_dbg_mr(mr, "iova out of range\n");
> > > +		return RESPST_ERR_RKEY_VIOLATION;
> > 
> > Please don't redefine returned errors.
> 
> As a general principle, I think your comment is totally correct.
> The problem is that rxe_receiver(), the responder of rxe, is originally designed
> as a state machine, and the returned values of "enum resp_states" are used
> to specify the next state.
> 
> One thing to note is that rxe_receiver() run solely in workqueue, so the errors
> generated in the bottom half context are never returned to userspace. In that regard,
> I think redefining the error codes with different enum values can be justified.

In places where the result of rxe_odp_do_atomic_write() is important, you can
write something like:
err = rxe_odp_do_atomic_write(...)
if (err == -EPERM)
   state = RESPST_ERR_RKEY_VIOLATION
...

or declare rxe_odp_do_atomic_write() to return enum resp_states.

Thanks


* Re: [PATCH for-next v2 1/2] RDMA/rxe: Enable ODP in RDMA FLUSH operation
  2025-03-18  9:49 ` [PATCH for-next v2 1/2] RDMA/rxe: Enable ODP in RDMA FLUSH operation Daisuke Matsuda
@ 2025-03-20  6:59   ` Zhijian Li (Fujitsu)
  2025-03-24  5:16     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 9+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-03-20  6:59 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), linux-rdma@vger.kernel.org,
	leon@kernel.org, jgg@ziepe.ca, zyjzyj2000@gmail.com

Hi Matsuda-san

Thanks for your patches in ODP.

It looks good to me.

Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>


However, I find myself harboring a hint of hesitation.

I'm wondering if we really need to remap a page back from the back-end
memory/pmem device just to do a flush operation.

I am uncertain about the circumstances under which ODP might occur.
Does it possibly include scenarios
1) where a page has not yet been mapped, or
2) where a page, once mapped, is subsequently swapped out?

For a pmem page:
- for 1), it is meaningless to do the flush
- for 2), would a pmem page be swapped out to a swap partition without flushing?
Thanks
Zhijian

On 18/03/2025 17:49, Daisuke Matsuda wrote:
> For persistent memories, add rxe_odp_flush_pmem_iova() so that ODP specific
> steps are executed. Otherwise, no additional consideration is required.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe.c      |  1 +
>   drivers/infiniband/sw/rxe/rxe_loc.h  |  7 ++++
>   drivers/infiniband/sw/rxe/rxe_mr.c   | 36 ++++++++++------
>   drivers/infiniband/sw/rxe/rxe_odp.c  | 62 ++++++++++++++++++++++++++--
>   drivers/infiniband/sw/rxe/rxe_resp.c |  4 --
>   include/rdma/ib_verbs.h              |  1 +
>   6 files changed, 91 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 4e56a371deb5..df66f8f9efa1 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -109,6 +109,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
>   		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
>   		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
>   		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
> +		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_FLUSH;
>   	}
>   }
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index feb386d98d1d..0012bebe96ef 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -194,6 +194,8 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
>   		    enum rxe_mr_copy_dir dir);
>   int rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
>   			 u64 compare, u64 swap_add, u64 *orig_val);
> +int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
> +			    unsigned int length);
>   #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
>   static inline int
>   rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
> @@ -212,6 +214,11 @@ rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
>   {
>   	return RESPST_ERR_UNSUPPORTED_OPCODE;
>   }
> +static inline int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
> +					  unsigned int length)
> +{
> +	return -EOPNOTSUPP;
> +}
>   #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
>   
>   #endif /* RXE_LOC_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 868d2f0b74e9..93e4b5acd3ac 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -424,7 +424,7 @@ int copy_data(
>   	return err;
>   }
>   
> -int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
> +static int rxe_mr_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
>   {
>   	unsigned int page_offset;
>   	unsigned long index;
> @@ -433,16 +433,6 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
>   	int err;
>   	u8 *va;
>   
> -	/* mr must be valid even if length is zero */
> -	if (WARN_ON(!mr))
> -		return -EINVAL;
> -
> -	if (length == 0)
> -		return 0;
> -
> -	if (mr->ibmr.type == IB_MR_TYPE_DMA)
> -		return -EFAULT;
> -
>   	err = mr_check_range(mr, iova, length);
>   	if (err)
>   		return err;
> @@ -454,7 +444,7 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
>   		if (!page)
>   			return -EFAULT;
>   		bytes = min_t(unsigned int, length,
> -				mr_page_size(mr) - page_offset);
> +			      mr_page_size(mr) - page_offset);
>   
>   		va = kmap_local_page(page);
>   		arch_wb_cache_pmem(va + page_offset, bytes);
> @@ -468,6 +458,28 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
>   	return 0;
>   }
>   
> +int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 start, unsigned int length)
> +{
> +	int err;
> +
> +	/* mr must be valid even if length is zero */
> +	if (WARN_ON(!mr))
> +		return -EINVAL;
> +
> +	if (length == 0)
> +		return 0;
> +
> +	if (mr->ibmr.type == IB_MR_TYPE_DMA)
> +		return -EFAULT;
> +
> +	if (mr->umem->is_odp)
> +		err = rxe_odp_flush_pmem_iova(mr, start, length);
> +	else
> +		err = rxe_mr_flush_pmem_iova(mr, start, length);
> +
> +	return err;
> +}
> +
>   /* Guarantee atomicity of atomic operations at the machine level. */
>   DEFINE_SPINLOCK(atomic_ops_lock);
>   
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index 9f6e2bb2a269..9a9aae967486 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -4,6 +4,7 @@
>    */
>   
>   #include <linux/hmm.h>
> +#include <linux/libnvdimm.h>
>   
>   #include <rdma/ib_umem_odp.h>
>   
> @@ -147,6 +148,16 @@ static inline bool rxe_check_pagefault(struct ib_umem_odp *umem_odp,
>   	return need_fault;
>   }
>   
> +static unsigned long rxe_odp_iova_to_index(struct ib_umem_odp *umem_odp, u64 iova)
> +{
> +	return (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
> +}
> +
> +static unsigned long rxe_odp_iova_to_page_offset(struct ib_umem_odp *umem_odp, u64 iova)
> +{
> +	return iova & (BIT(umem_odp->page_shift) - 1);
> +}
> +
>   static int rxe_odp_map_range_and_lock(struct rxe_mr *mr, u64 iova, int length, u32 flags)
>   {
>   	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> @@ -190,8 +201,8 @@ static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
>   	size_t offset;
>   	u8 *user_va;
>   
> -	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
> -	offset = iova & (BIT(umem_odp->page_shift) - 1);
> +	idx = rxe_odp_iova_to_index(umem_odp, iova);
> +	offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
>   
>   	while (length > 0) {
>   		u8 *src, *dest;
> @@ -277,8 +288,8 @@ static int rxe_odp_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
>   		return RESPST_ERR_RKEY_VIOLATION;
>   	}
>   
> -	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
> -	page_offset = iova & (BIT(umem_odp->page_shift) - 1);
> +	idx = rxe_odp_iova_to_index(umem_odp, iova);
> +	page_offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
>   	page = hmm_pfn_to_page(umem_odp->pfn_list[idx]);
>   	if (!page)
>   		return RESPST_ERR_RKEY_VIOLATION;
> @@ -324,3 +335,46 @@ int rxe_odp_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
>   
>   	return err;
>   }
> +
> +int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
> +			    unsigned int length)
> +{
> +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +	unsigned int page_offset;
> +	unsigned long index;
> +	struct page *page;
> +	unsigned int bytes;
> +	int err;
> +	u8 *va;
> +
> +	err = rxe_odp_map_range_and_lock(mr, iova, length,
> +					 RXE_PAGEFAULT_DEFAULT);
> +	if (err)
> +		return err;
> +
> +	while (length > 0) {
> +		index = rxe_odp_iova_to_index(umem_odp, iova);
> +		page_offset = rxe_odp_iova_to_page_offset(umem_odp, iova);
> +
> +		page = hmm_pfn_to_page(umem_odp->pfn_list[index]);
> +		if (!page) {
> +			mutex_unlock(&umem_odp->umem_mutex);
> +			return -EFAULT;
> +		}
> +
> +		bytes = min_t(unsigned int, length,
> +			      mr_page_size(mr) - page_offset);
> +
> +		va = kmap_local_page(page);
> +		arch_wb_cache_pmem(va + page_offset, bytes);
> +		kunmap_local(va);
> +
> +		length -= bytes;
> +		iova += bytes;
> +		page_offset = 0;
> +	}
> +
> +	mutex_unlock(&umem_odp->umem_mutex);
> +
> +	return 0;
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index 54ba9ee1acc5..304e3de740ad 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -649,10 +649,6 @@ static enum resp_states process_flush(struct rxe_qp *qp,
>   	struct rxe_mr *mr = qp->resp.mr;
>   	struct resp_res *res = qp->resp.res;
>   
> -	/* ODP is not supported right now. WIP. */
> -	if (mr->umem->is_odp)
> -		return RESPST_ERR_UNSUPPORTED_OPCODE;
> -
>   	/* oA19-14, oA19-15 */
>   	if (res && res->replay)
>   		return RESPST_ACKNOWLEDGE;
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 9941f4185c79..da07d3e2db1d 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -325,6 +325,7 @@ enum ib_odp_transport_cap_bits {
>   	IB_ODP_SUPPORT_READ	= 1 << 3,
>   	IB_ODP_SUPPORT_ATOMIC	= 1 << 4,
>   	IB_ODP_SUPPORT_SRQ_RECV	= 1 << 5,
> +	IB_ODP_SUPPORT_FLUSH	= 1 << 6,
>   };
>   
>   struct ib_odp_caps {


* RE: [PATCH for-next v2 1/2] RDMA/rxe: Enable ODP in RDMA FLUSH operation
  2025-03-20  6:59   ` Zhijian Li (Fujitsu)
@ 2025-03-24  5:16     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 0 replies; 9+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2025-03-24  5:16 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), linux-rdma@vger.kernel.org, leon@kernel.org,
	jgg@ziepe.ca, zyjzyj2000@gmail.com

On Thu, Mar 20, 2025 3:59 PM Li, Zhijian wrote:
> Hi Matsuda-san
> 
> Thanks for your patches in ODP.
> 
> It looks good to me.
> 
> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
> 
Hi,
Thanks for the review.

> 
> However, I find myself harboring a hint of hesitation.
> 
> I'm wondering if we really need to remap a page back from the back-end
> memory/pmem device just to do a flush operation.

That is a difficult question, but I think there are two reasons we should
invoke the page fault in this case.
  1) Even if the pages are surely mapped, it may be possible that the target MR
    is truncate(2)-ed without notifying the kernel/HW of the metadata update.
    I think this could potentially result in illegal memory access, and ODP
    can prevent that by updating the driver/HW-side page table.
    Cf. https://lore.kernel.org/lkml/Y3UmaJil5slosqjA@unreal/T/
  2) It is likely that the behavior we are discussing is not strictly defined, so it
    would be better to choose the safer way since there is no penalty except
    for performance.

> 
> I am uncertain about the circumstances under which an ODP page fault might
> occur. Does it possibly include these scenarios?
> 1) where a page has not yet been mapped
> 2) where a page, once mapped, is subsequently swapped out
> 
> For a pmem page:
> - for 1), it's meaningless to do the flush
> - for 2), would a pmem page be swapped out to a swap partition without flushing?

Assuming the pmem is in fs-dax mode, I think the answer is no.
We do not use page cache, so page swap will not occur.

Regards,
Daisuke

> 
> Thanks
> Zhijian
> 
> On 18/03/2025 17:49, Daisuke Matsuda wrote:
> > For persistent memory, add rxe_odp_flush_pmem_iova() so that ODP-specific
> > steps are executed. Otherwise, no additional consideration is required.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >   drivers/infiniband/sw/rxe/rxe.c      |  1 +
> >   drivers/infiniband/sw/rxe/rxe_loc.h  |  7 ++++
> >   drivers/infiniband/sw/rxe/rxe_mr.c   | 36 ++++++++++------
> >   drivers/infiniband/sw/rxe/rxe_odp.c  | 62 ++++++++++++++++++++++++++--
> >   drivers/infiniband/sw/rxe/rxe_resp.c |  4 --
> >   include/rdma/ib_verbs.h              |  1 +
> >   6 files changed, 91 insertions(+), 20 deletions(-)
> >


* RE: [PATCH for-next v2 2/2] RDMA/rxe: Enable ODP in ATOMIC WRITE operation
  2025-03-19  8:58       ` Leon Romanovsky
@ 2025-03-24  8:05         ` Daisuke Matsuda (Fujitsu)
  0 siblings, 0 replies; 9+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2025-03-24  8:05 UTC (permalink / raw)
  To: 'Leon Romanovsky'
  Cc: linux-rdma@vger.kernel.org, jgg@ziepe.ca, zyjzyj2000@gmail.com,
	Zhijian Li (Fujitsu)

Hi Leon,

Thank you for taking a look.
I've submitted v3 patches to address your comment.

I will also work on rechecking the inconsistency in the whole rxe driver
after the patches are merged and for-next is rebased.

Thanks,
Daisuke

On Wed, Mar 19, 2025 5:58 PM Leon Romanovsky wrote:
> On Wed, Mar 19, 2025 at 02:58:51AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> > On Tue, Mar 18, 2025 7:10 PM Leon Romanovsky wrote:
> > > On Tue, Mar 18, 2025 at 06:49:32PM +0900, Daisuke Matsuda wrote:
> > >
> > > <...>
> > >
> > > > +static inline int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > > > +{
> > > > +	return RESPST_ERR_UNSUPPORTED_OPCODE;
> > > > +}
> > > >  #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> > >
> > > You are returning "enum resp_states", while function expects to return "int". You should return -EOPNOTSUPP.
> >
> > Other than my patches, there are some functions that do the same thing.
> 
> Yes, but you are adding new code, and in the new code you should try to
> keep the function declaration and the returned values consistent.
> 
> > I would like to post a patch to make them consistent, but I think we need
> > to reach an agreement on the design of the rxe responder before taking it up.
> > Please see my opinion below.
> >
> > >
> > > >
> > > >  #endif /* RXE_LOC_H */
> >
> > <...>
> >
> > > > diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> > > > index 9a9aae967486..f3443c604a7f 100644
> > > > --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> > > > +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> > > > @@ -378,3 +378,56 @@ int rxe_odp_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
> > > >
> > > >  	return 0;
> > > >  }
> > > > +
> > > > +#if defined CONFIG_64BIT
> > > > +/* only implemented or called for 64 bit architectures */
> > > > +int rxe_odp_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> > > > +{
> > > > +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > > > +	unsigned int page_offset;
> > > > +	unsigned long index;
> > > > +	struct page *page;
> > > > +	int err;
> > > > +	u64 *va;
> > > > +
> > > > +	/* See IBA oA19-28 */
> > > > +	err = mr_check_range(mr, iova, sizeof(value));
> > > > +	if (unlikely(err)) {
> > > > +		rxe_dbg_mr(mr, "iova out of range\n");
> > > > +		return RESPST_ERR_RKEY_VIOLATION;
> > >
> > > Please don't redefine returned errors.
> >
> > As a general principle, I think your comment is totally correct.
> > The problem is that rxe_receiver(), the responder of rxe, is originally designed
> > as a state machine, and the returned values of "enum resp_states" are used
> > to specify the next state.
> >
> > One thing to note is that rxe_receiver() runs solely in a workqueue, so the errors
> > generated in the bottom-half context are never returned to userspace. In that regard,
> > I think redefining the error codes with different enum values can be justified.
> 
> In places where rxe_odp_do_atomic_write() respond is important, you can
> write something like:
> err = rxe_odp_do_atomic_write(...)
> if (err == -EPERM)
>    state = RESPST_ERR_RKEY_VIOLATION
> ...
> 
> or declare rxe_odp_do_atomic_write() to return enum resp_state.
> 
> Thanks

