* [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray
@ 2023-01-17 17:25 Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

This patch series replaces the page map carried in each memory region
with a struct xarray. It is based on a sketch developed by Jason
Gunthorpe. The first five patches are preparation that tries to
cleanly isolate all the mr specific code into rxe_mr.c. The sixth
patch is the actual change.
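
As a rough illustration of the overall change (a simplified sketch,
not the patch text itself), looking up the page backing an iova goes
from walking a two-level array of rxe_phys_buf entries to a single
xa_load() on an xarray indexed by page number:

/* Sketch only: mirrors rxe_mr_iova_to_index() and the xa_load()
 * calls added in patch 6/6.  xa_load() is RCU-safe, so readers do
 * not need a per-mr lock.
 */
static struct page *sketch_lookup_page(struct rxe_mr *mr, u64 iova)
{
	unsigned long index = (iova >> mr->page_shift) -
			      (mr->ibmr.iova >> mr->page_shift);

	return xa_load(&mr->page_list, index);
}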

v6:
  Responded to a comment from Jason and moved the #if defined CONFIG_64BIT
  outside of the rxe_mr_do_atomic_write() function. I realized that the
  #else case isn't needed since the function is never called on 32-bit
  builds; if someone does try, it will fail at compile/link time. (A
  sketch of the resulting pattern follows this changelog.)
v5:
  Responded to a note from lizhijian@fujitsu.com and restored calls to
  is_pmem_page() which were accidentally dropped in earlier versions.
v4:
  Responded to a comment by Zhu and cleaned up error passing between
  rxe_mr.c and rxe_resp.c.
  Other various cleanups including more careful use of unsigned ints.
  Rebased to current for-next.
v3:
  Fixed an error reported by kernel test robot
v2:
  Rebased to 6.2.0-rc1+
  Minor cleanups
  Fixed an error reported by Jason in 4/6 (a missing if after else).
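
The #if pattern referred to in the v6 note, in sketch form (the body
below is only a placeholder; the real implementation is in the series):

/* No #else stub: on a 32-bit build the symbol simply does not exist,
 * so any accidental caller fails at compile/link time rather than
 * returning an error at run time.
 */
#if defined CONFIG_64BIT
int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
{
	/* 64-bit only implementation */
	return 0;
}
#endif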

Bob Pearson (6):
  RDMA/rxe: Cleanup mr_check_range
  RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c
  RDMA-rxe: Isolate mr code from atomic_reply()
  RDMA-rxe: Isolate mr code from atomic_write_reply()
  RDMA/rxe: Cleanup page variables in rxe_mr.c
  RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray

 drivers/infiniband/sw/rxe/rxe_loc.h   |  13 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    | 622 +++++++++++++++-----------
 drivers/infiniband/sw/rxe/rxe_resp.c  | 118 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.c |  36 --
 drivers/infiniband/sw/rxe/rxe_verbs.h |  37 +-
 5 files changed, 422 insertions(+), 404 deletions(-)


base-commit: 1ec82317a1daac78c04b0c15af89018ccf9fa2b7
-- 
2.37.2


* [PATCH for-next v6 1/6] RDMA/rxe: Cleanup mr_check_range
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
@ 2023-01-17 17:25 ` Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c Bob Pearson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Remove blank lines and replace EFAULT by EINVAL when an invalid
mr type is used.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 072eac4b65d2..632ee1e516a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -26,8 +26,6 @@ u8 rxe_get_next_key(u32 last_key)
 
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 {
-
-
 	switch (mr->ibmr.type) {
 	case IB_MR_TYPE_DMA:
 		return 0;
@@ -41,7 +39,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 
 	default:
 		rxe_dbg_mr(mr, "type (%d) not supported\n", mr->ibmr.type);
-		return -EFAULT;
+		return -EINVAL;
 	}
 }
 
-- 
2.37.2


* [PATCH for-next v6 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
@ 2023-01-17 17:25 ` Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 3/6] RDMA-rxe: Isolate mr code from atomic_reply() Bob Pearson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Move rxe_map_mr_sg() to rxe_mr.c where it makes a little more sense.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |  2 ++
 drivers/infiniband/sw/rxe/rxe_mr.c    | 36 +++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_verbs.c | 36 ---------------------------
 3 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 948ce4902b10..29b6c2143045 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -69,6 +69,8 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		enum rxe_mr_copy_dir dir);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
+int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+		  int sg_nents, unsigned int *sg_offset);
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 632ee1e516a1..229c7259644c 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -223,6 +223,42 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
+static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
+{
+	struct rxe_mr *mr = to_rmr(ibmr);
+	struct rxe_map *map;
+	struct rxe_phys_buf *buf;
+
+	if (unlikely(mr->nbuf == mr->num_buf))
+		return -ENOMEM;
+
+	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
+	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
+
+	buf->addr = addr;
+	buf->size = ibmr->page_size;
+	mr->nbuf++;
+
+	return 0;
+}
+
+int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+		  int sg_nents, unsigned int *sg_offset)
+{
+	struct rxe_mr *mr = to_rmr(ibmr);
+	int n;
+
+	mr->nbuf = 0;
+
+	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+
+	mr->page_shift = ilog2(ibmr->page_size);
+	mr->page_mask = ibmr->page_size - 1;
+	mr->offset = ibmr->iova & mr->page_mask;
+
+	return n;
+}
+
 static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
 			size_t *offset_out)
 {
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 025b35bf014e..7a902e0a0607 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -948,42 +948,6 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	return ERR_PTR(err);
 }
 
-static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
-{
-	struct rxe_mr *mr = to_rmr(ibmr);
-	struct rxe_map *map;
-	struct rxe_phys_buf *buf;
-
-	if (unlikely(mr->nbuf == mr->num_buf))
-		return -ENOMEM;
-
-	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
-	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
-
-	buf->addr = addr;
-	buf->size = ibmr->page_size;
-	mr->nbuf++;
-
-	return 0;
-}
-
-static int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
-			 int sg_nents, unsigned int *sg_offset)
-{
-	struct rxe_mr *mr = to_rmr(ibmr);
-	int n;
-
-	mr->nbuf = 0;
-
-	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
-
-	mr->page_shift = ilog2(ibmr->page_size);
-	mr->page_mask = ibmr->page_size - 1;
-	mr->offset = ibmr->iova & mr->page_mask;
-
-	return n;
-}
-
 static ssize_t parent_show(struct device *device,
 			   struct device_attribute *attr, char *buf)
 {
-- 
2.37.2


* [PATCH for-next v6 3/6] RDMA-rxe: Isolate mr code from atomic_reply()
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c Bob Pearson
@ 2023-01-17 17:25 ` Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply() Bob Pearson
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Isolate mr specific code from atomic_reply() in rxe_resp.c into
a subroutine rxe_mr_do_atomic_op() in rxe_mr.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |  2 +
 drivers/infiniband/sw/rxe/rxe_mr.c    | 83 ++++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_resp.c  | 49 +++++-----------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  5 ++
 4 files changed, 76 insertions(+), 63 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 29b6c2143045..bcb1bbcf50df 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -72,6 +72,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset);
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
+int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+			u64 compare, u64 swap_add, u64 *orig_val);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 229c7259644c..15a8d44daa35 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -32,13 +32,15 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 
 	case IB_MR_TYPE_USER:
 	case IB_MR_TYPE_MEM_REG:
-		if (iova < mr->ibmr.iova || length > mr->ibmr.length ||
-		    iova > mr->ibmr.iova + mr->ibmr.length - length)
-			return -EFAULT;
+		if (iova < mr->ibmr.iova ||
+		    iova + length > mr->ibmr.iova + mr->ibmr.length) {
+			rxe_dbg_mr(mr, "iova/length out of range");
+			return -ERANGE;
+		}
 		return 0;
 
 	default:
-		rxe_dbg_mr(mr, "type (%d) not supported\n", mr->ibmr.type);
+		rxe_dbg_mr(mr, "mr type not supported\n");
 		return -EINVAL;
 	}
 }
@@ -299,37 +301,22 @@ void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
 {
 	size_t offset;
 	int m, n;
-	void *addr;
 
-	if (mr->state != RXE_MR_STATE_VALID) {
-		rxe_dbg_mr(mr, "Not in valid state\n");
-		addr = NULL;
-		goto out;
-	}
+	if (mr->state != RXE_MR_STATE_VALID)
+		return NULL;
 
-	if (!mr->map) {
-		addr = (void *)(uintptr_t)iova;
-		goto out;
-	}
+	if (mr->ibmr.type == IB_MR_TYPE_DMA)
+		return (void *)(uintptr_t)iova;
 
-	if (mr_check_range(mr, iova, length)) {
-		rxe_dbg_mr(mr, "Range violation\n");
-		addr = NULL;
-		goto out;
-	}
+	if (mr_check_range(mr, iova, length))
+		return NULL;
 
 	lookup_iova(mr, iova, &m, &n, &offset);
 
-	if (offset + length > mr->map[m]->buf[n].size) {
-		rxe_dbg_mr(mr, "Crosses page boundary\n");
-		addr = NULL;
-		goto out;
-	}
-
-	addr = (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
+	if (offset + length > mr->map[m]->buf[n].size)
+		return NULL;
 
-out:
-	return addr;
+	return (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
 }
 
 int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
@@ -538,6 +525,46 @@ int copy_data(
 	return err;
 }
 
+/* Guarantee atomicity of atomic operations at the machine level. */
+static DEFINE_SPINLOCK(atomic_ops_lock);
+
+int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+			u64 compare, u64 swap_add, u64 *orig_val)
+{
+	u64 *va;
+	u64 value;
+
+	if (mr->state != RXE_MR_STATE_VALID) {
+		rxe_dbg_mr(mr, "mr not in valid state");
+		return -EINVAL;
+	}
+
+	va = iova_to_vaddr(mr, iova, sizeof(u64));
+	if (!va) {
+		rxe_dbg_mr(mr, "iova out of range");
+		return -ERANGE;
+	}
+
+	if ((uintptr_t)va & 0x7) {
+		rxe_dbg_mr(mr, "iova not aligned");
+		return RXE_ERR_NOT_ALIGNED;
+	}
+
+	spin_lock_bh(&atomic_ops_lock);
+	value = *orig_val = *va;
+
+	if (opcode == IB_OPCODE_RC_COMPARE_SWAP) {
+		if (value == compare)
+			*va = swap_add;
+	} else {
+		value += swap_add;
+		*va = value;
+	}
+	spin_unlock_bh(&atomic_ops_lock);
+
+	return 0;
+}
+
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 {
 	struct rxe_sge		*sge	= &dma->sge[dma->cur_sge];
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index c74972244f08..1e38e5da1f4c 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -725,17 +725,12 @@ static enum resp_states process_flush(struct rxe_qp *qp,
 	return RESPST_ACKNOWLEDGE;
 }
 
-/* Guarantee atomicity of atomic operations at the machine level. */
-static DEFINE_SPINLOCK(atomic_ops_lock);
-
 static enum resp_states atomic_reply(struct rxe_qp *qp,
-					 struct rxe_pkt_info *pkt)
+				     struct rxe_pkt_info *pkt)
 {
-	u64 *vaddr;
-	enum resp_states ret;
 	struct rxe_mr *mr = qp->resp.mr;
 	struct resp_res *res = qp->resp.res;
-	u64 value;
+	int err;
 
 	if (!res) {
 		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK);
@@ -743,33 +738,19 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 	}
 
 	if (!res->replay) {
-		if (mr->state != RXE_MR_STATE_VALID) {
-			ret = RESPST_ERR_RKEY_VIOLATION;
-			goto out;
-		}
-
-		vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset,
-					sizeof(u64));
-
-		/* check vaddr is 8 bytes aligned. */
-		if (!vaddr || (uintptr_t)vaddr & 7) {
-			ret = RESPST_ERR_MISALIGNED_ATOMIC;
-			goto out;
-		}
-
-		spin_lock_bh(&atomic_ops_lock);
-		res->atomic.orig_val = value = *vaddr;
-
-		if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) {
-			if (value == atmeth_comp(pkt))
-				value = atmeth_swap_add(pkt);
-		} else {
-			value += atmeth_swap_add(pkt);
+		u64 iova = qp->resp.va + qp->resp.offset;
+
+		err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
+					  atmeth_comp(pkt),
+					  atmeth_swap_add(pkt),
+					  &res->atomic.orig_val);
+		if (unlikely(err)) {
+			if (err == -RXE_ERR_NOT_ALIGNED)
+				return RESPST_ERR_MISALIGNED_ATOMIC;
+			else
+				return RESPST_ERR_RKEY_VIOLATION;
 		}
 
-		*vaddr = value;
-		spin_unlock_bh(&atomic_ops_lock);
-
 		qp->resp.msn++;
 
 		/* next expected psn, read handles this separately */
@@ -780,9 +761,7 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 		qp->resp.status = IB_WC_SUCCESS;
 	}
 
-	ret = RESPST_ACKNOWLEDGE;
-out:
-	return ret;
+	return RESPST_ACKNOWLEDGE;
 }
 
 #ifdef CONFIG_64BIT
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 19ddfa890480..691200d99d6b 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -273,6 +273,11 @@ enum rxe_mr_state {
 	RXE_MR_STATE_VALID,
 };
 
+/* some rxe specific error conditions */
+enum rxe_mr_error {
+	RXE_ERR_NOT_ALIGNED = 0x1001,	/* misaligned address */
+};
+
 enum rxe_mr_copy_dir {
 	RXE_TO_MR_OBJ,
 	RXE_FROM_MR_OBJ,
-- 
2.37.2


* [PATCH for-next v6 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (2 preceding siblings ...)
  2023-01-17 17:25 ` [PATCH for-next v6 3/6] RDMA-rxe: Isolate mr code from atomic_reply() Bob Pearson
@ 2023-01-17 17:25 ` Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c Bob Pearson
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Isolate mr specific code from atomic_write_reply() in rxe_resp.c into
a subroutine rxe_mr_do_atomic_write() in rxe_mr.c.
Check length for atomic write operation.
Make iova_to_vaddr() static.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h  |  1 +
 drivers/infiniband/sw/rxe/rxe_mr.c   | 34 ++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c | 69 ++++++++++++----------------
 3 files changed, 64 insertions(+), 40 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index bcb1bbcf50df..b1dda0cf891b 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -74,6 +74,7 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val);
+int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 15a8d44daa35..10484f671977 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -565,6 +565,40 @@ int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 	return 0;
 }
 
+/* only implemented for 64 bit architectures */
+int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
+{
+#if defined CONFIG_64BIT
+	u64 *va;
+
+	/* See IBA oA19-28 */
+	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
+		rxe_dbg_mr(mr, "mr not in valid state");
+		return -EINVAL;
+	}
+
+	va = iova_to_vaddr(mr, iova, sizeof(value));
+	if (unlikely(!va)) {
+		rxe_dbg_mr(mr, "iova out of range");
+		return -ERANGE;
+	}
+
+	/* See IBA A19.4.2 */
+	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
+		rxe_dbg_mr(mr, "misaligned address");
+		return -RXE_ERR_NOT_ALIGNED;
+	}
+
+	/* Do atomic write after all prior operations have completed */
+	smp_store_release(va, value);
+
+	return 0;
+#else
+	WARN_ON(1);
+	return -EINVAL;
+#endif
+}
+
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 {
 	struct rxe_sge		*sge	= &dma->sge[dma->cur_sge];
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 1e38e5da1f4c..49298ff88d25 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -764,30 +764,40 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 	return RESPST_ACKNOWLEDGE;
 }
 
-#ifdef CONFIG_64BIT
-static enum resp_states do_atomic_write(struct rxe_qp *qp,
-					struct rxe_pkt_info *pkt)
+static enum resp_states atomic_write_reply(struct rxe_qp *qp,
+					   struct rxe_pkt_info *pkt)
 {
-	struct rxe_mr *mr = qp->resp.mr;
-	int payload = payload_size(pkt);
-	u64 src, *dst;
-
-	if (mr->state != RXE_MR_STATE_VALID)
-		return RESPST_ERR_RKEY_VIOLATION;
+	struct resp_res *res = qp->resp.res;
+	struct rxe_mr *mr;
+	u64 value;
+	u64 iova;
+	int err;
 
-	memcpy(&src, payload_addr(pkt), payload);
+	if (!res) {
+		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_WRITE_MASK);
+		qp->resp.res = res;
+	}
 
-	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
-	/* check vaddr is 8 bytes aligned. */
-	if (!dst || (uintptr_t)dst & 7)
-		return RESPST_ERR_MISALIGNED_ATOMIC;
+	if (res->replay)
+		return RESPST_ACKNOWLEDGE;
 
-	/* Do atomic write after all prior operations have completed */
-	smp_store_release(dst, src);
+	mr = qp->resp.mr;
+	value = *(u64 *)payload_addr(pkt);
+	iova = qp->resp.va + qp->resp.offset;
 
-	/* decrease resp.resid to zero */
-	qp->resp.resid -= sizeof(payload);
+#if defined CONFIG_64BIT
+	err = rxe_mr_do_atomic_write(mr, iova, value);
+	if (unlikely(err)) {
+		if (err == -RXE_ERR_NOT_ALIGNED)
+			return RESPST_ERR_MISALIGNED_ATOMIC;
+		else
+			return RESPST_ERR_RKEY_VIOLATION;
+	}
+#else
+	return RESPST_ERR_UNSUPPORTED_OPCODE;
+#endif
 
+	qp->resp.resid = 0;
 	qp->resp.msn++;
 
 	/* next expected psn, read handles this separately */
@@ -796,29 +806,8 @@ static enum resp_states do_atomic_write(struct rxe_qp *qp,
 
 	qp->resp.opcode = pkt->opcode;
 	qp->resp.status = IB_WC_SUCCESS;
-	return RESPST_ACKNOWLEDGE;
-}
-#else
-static enum resp_states do_atomic_write(struct rxe_qp *qp,
-					struct rxe_pkt_info *pkt)
-{
-	return RESPST_ERR_UNSUPPORTED_OPCODE;
-}
-#endif /* CONFIG_64BIT */
 
-static enum resp_states atomic_write_reply(struct rxe_qp *qp,
-					   struct rxe_pkt_info *pkt)
-{
-	struct resp_res *res = qp->resp.res;
-
-	if (!res) {
-		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_WRITE_MASK);
-		qp->resp.res = res;
-	}
-
-	if (res->replay)
-		return RESPST_ACKNOWLEDGE;
-	return do_atomic_write(qp, pkt);
+	return RESPST_ACKNOWLEDGE;
 }
 
 static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
-- 
2.37.2


* [PATCH for-next v6 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (3 preceding siblings ...)
  2023-01-17 17:25 ` [PATCH for-next v6 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply() Bob Pearson
@ 2023-01-17 17:25 ` Bob Pearson
  2023-01-17 17:25 ` [PATCH for-next v6 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray Bob Pearson
  2023-01-23 19:36 ` [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Jason Gunthorpe
  6 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Clean up the usage of mr->page_shift and mr->page_mask and introduce
an extractor for mr->ibmr.page_size. Normal kernel usage has page_mask
mask out the offset within a page rather than the page number; the rxe
driver had reversed that, which was confusing. Implicitly each mr can
have its own page_size, which was not uniformly supported.
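
A short sketch of the new convention (illustration only, assuming
page_size is a power of two; these helpers are not part of the patch):

/* New rxe convention, matching PAGE_MASK usage elsewhere in the
 * kernel: page_mask keeps the page-aligned part of an address and
 * the in-page offset is computed separately.
 */
static u64 sketch_page_base(u64 iova, u64 page_size)
{
	return iova & ~(page_size - 1);	/* what mr->page_mask now holds */
}

static u64 sketch_page_offset(u64 iova, u64 page_size)
{
	return iova & (page_size - 1);	/* what the old page_mask held */
}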

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    | 31 ++++++++++++---------------
 drivers/infiniband/sw/rxe/rxe_verbs.h | 11 +++++++---
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 10484f671977..fdf76df4cf3e 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -62,6 +62,9 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->lkey = mr->ibmr.lkey = lkey;
 	mr->rkey = mr->ibmr.rkey = rkey;
 
+	mr->ibmr.page_size = PAGE_SIZE;
+	mr->page_mask = PAGE_MASK;
+	mr->page_shift = PAGE_SHIFT;
 	mr->state = RXE_MR_STATE_INVALID;
 }
 
@@ -151,9 +154,6 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		goto err_release_umem;
 	}
 
-	mr->page_shift = PAGE_SHIFT;
-	mr->page_mask = PAGE_SIZE - 1;
-
 	num_buf			= 0;
 	map = mr->map;
 	if (length > 0) {
@@ -182,7 +182,7 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 				goto err_release_umem;
 			}
 			buf->addr = (uintptr_t)vaddr;
-			buf->size = PAGE_SIZE;
+			buf->size = mr_page_size(mr);
 			num_buf++;
 			buf++;
 
@@ -191,10 +191,9 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 
 	mr->umem = umem;
 	mr->access = access;
-	mr->offset = ib_umem_offset(umem);
+	mr->page_offset = ib_umem_offset(umem);
 	mr->state = RXE_MR_STATE_VALID;
 	mr->ibmr.type = IB_MR_TYPE_USER;
-	mr->ibmr.page_size = PAGE_SIZE;
 
 	return 0;
 
@@ -248,29 +247,27 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	int n;
-
-	mr->nbuf = 0;
+	unsigned int page_size = mr_page_size(mr);
 
-	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+	mr->page_shift = ilog2(page_size);
+	mr->page_mask = ~((u64)page_size - 1);
+	mr->page_offset = ibmr->iova & (page_size - 1);
 
-	mr->page_shift = ilog2(ibmr->page_size);
-	mr->page_mask = ibmr->page_size - 1;
-	mr->offset = ibmr->iova & mr->page_mask;
+	mr->nbuf = 0;
 
-	return n;
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
 }
 
 static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
 			size_t *offset_out)
 {
-	size_t offset = iova - mr->ibmr.iova + mr->offset;
+	size_t offset = iova - mr->ibmr.iova + mr->page_offset;
 	int			map_index;
 	int			buf_index;
 	u64			length;
 
 	if (likely(mr->page_shift)) {
-		*offset_out = offset & mr->page_mask;
+		*offset_out = offset & (mr_page_size(mr) - 1);
 		offset >>= mr->page_shift;
 		*n_out = offset & mr->map_mask;
 		*m_out = offset >> mr->map_shift;
@@ -329,7 +326,7 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
 	if (mr->ibmr.type == IB_MR_TYPE_DMA)
 		return -EFAULT;
 
-	offset = (iova - mr->ibmr.iova + mr->offset) & mr->page_mask;
+	offset = (iova - mr->ibmr.iova + mr->page_offset) & mr->page_mask;
 	while (length > 0) {
 		u8 *va;
 		int bytes;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 691200d99d6b..5c3d1500ca68 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -315,11 +315,11 @@ struct rxe_mr {
 	u32			lkey;
 	u32			rkey;
 	enum rxe_mr_state	state;
-	u32			offset;
 	int			access;
 
-	int			page_shift;
-	int			page_mask;
+	unsigned int		page_offset;
+	unsigned int		page_shift;
+	u64			page_mask;
 	int			map_shift;
 	int			map_mask;
 
@@ -334,6 +334,11 @@ struct rxe_mr {
 	struct rxe_map		**map;
 };
 
+static inline unsigned int mr_page_size(struct rxe_mr *mr)
+{
+	return mr ? mr->ibmr.page_size : PAGE_SIZE;
+}
+
 enum rxe_mw_state {
 	RXE_MW_STATE_INVALID	= RXE_MR_STATE_INVALID,
 	RXE_MW_STATE_FREE	= RXE_MR_STATE_FREE,
-- 
2.37.2


* [PATCH for-next v6 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (4 preceding siblings ...)
  2023-01-17 17:25 ` [PATCH for-next v6 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c Bob Pearson
@ 2023-01-17 17:25 ` Bob Pearson
  2023-01-23 19:36   ` Jason Gunthorpe
  2023-01-23 19:36 ` [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Jason Gunthorpe
  6 siblings, 1 reply; 12+ messages in thread
From: Bob Pearson @ 2023-01-17 17:25 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Replace struct rxe_phys_buf and struct rxe_map by struct xarray
in rxe_verbs.h. This allows using RCU locking on reads of
the memory maps stored in each mr.

This is based on a sketch of a patch from Jason Gunthorpe in the
link below. Some changes were needed to make it work. It applies
cleanly to the current for-next and passes the pyverbs, perftest
and the same blktests test cases that run today.
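
For context, the read-side pattern this enables (an illustrative
sketch using the helpers introduced by this patch, not the patch
text itself):

/* xa_load() takes only the RCU read lock internally, so the
 * responder fast path can translate an iova to a mapping without
 * holding a per-mr lock.  The caller is assumed to have already
 * checked the range and must kunmap_local() the returned address.
 */
static void *sketch_iova_to_va(struct rxe_mr *mr, u64 iova)
{
	unsigned long index = rxe_mr_iova_to_index(mr, iova);
	struct page *page = xa_load(&mr->page_list, index);

	if (!page)
		return NULL;

	return kmap_local_page(page) + rxe_mr_iova_to_page_offset(mr, iova);
}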

Link: https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |   8 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    | 568 +++++++++++++-------------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  21 +-
 3 files changed, 288 insertions(+), 309 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index b1dda0cf891b..c1f78ed0f991 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -64,14 +64,14 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr);
 int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr);
 int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr);
-int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length);
-int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
-		enum rxe_mr_copy_dir dir);
+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
+			unsigned int length);
+int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+		unsigned int length, enum rxe_mr_copy_dir dir);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
 int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset);
-void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val);
 int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index fdf76df4cf3e..53eb2f3b6e07 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -62,60 +62,31 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->lkey = mr->ibmr.lkey = lkey;
 	mr->rkey = mr->ibmr.rkey = rkey;
 
+	mr->access = access;
 	mr->ibmr.page_size = PAGE_SIZE;
 	mr->page_mask = PAGE_MASK;
 	mr->page_shift = PAGE_SHIFT;
 	mr->state = RXE_MR_STATE_INVALID;
 }
 
-static int rxe_mr_alloc(struct rxe_mr *mr, int num_buf)
-{
-	int i;
-	int num_map;
-	struct rxe_map **map = mr->map;
-
-	num_map = (num_buf + RXE_BUF_PER_MAP - 1) / RXE_BUF_PER_MAP;
-
-	mr->map = kmalloc_array(num_map, sizeof(*map), GFP_KERNEL);
-	if (!mr->map)
-		goto err1;
-
-	for (i = 0; i < num_map; i++) {
-		mr->map[i] = kmalloc(sizeof(**map), GFP_KERNEL);
-		if (!mr->map[i])
-			goto err2;
-	}
-
-	BUILD_BUG_ON(!is_power_of_2(RXE_BUF_PER_MAP));
-
-	mr->map_shift = ilog2(RXE_BUF_PER_MAP);
-	mr->map_mask = RXE_BUF_PER_MAP - 1;
-
-	mr->num_buf = num_buf;
-	mr->num_map = num_map;
-	mr->max_buf = num_map * RXE_BUF_PER_MAP;
-
-	return 0;
-
-err2:
-	for (i--; i >= 0; i--)
-		kfree(mr->map[i]);
-
-	kfree(mr->map);
-	mr->map = NULL;
-err1:
-	return -ENOMEM;
-}
-
 void rxe_mr_init_dma(int access, struct rxe_mr *mr)
 {
 	rxe_mr_init(access, mr);
 
-	mr->access = access;
 	mr->state = RXE_MR_STATE_VALID;
 	mr->ibmr.type = IB_MR_TYPE_DMA;
 }
 
+static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
+{
+	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
+}
+
+static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
+{
+	return iova & (mr_page_size(mr) - 1);
+}
+
 static bool is_pmem_page(struct page *pg)
 {
 	unsigned long paddr = page_to_phys(pg);
@@ -125,82 +96,97 @@ static bool is_pmem_page(struct page *pg)
 				 IORES_DESC_PERSISTENT_MEMORY);
 }
 
+static int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt)
+{
+	XA_STATE(xas, &mr->page_list, 0);
+	struct sg_page_iter sg_iter;
+	struct page *page;
+	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
+
+	__sg_page_iter_start(&sg_iter, sgt->sgl, sgt->orig_nents, 0);
+	if (!__sg_page_iter_next(&sg_iter))
+		return 0;
+
+	do {
+		xas_lock(&xas);
+		while (true) {
+			page = sg_page_iter_page(&sg_iter);
+
+			if (persistent && !is_pmem_page(page)) {
+				rxe_dbg_mr(mr, "Unable to register persistent access to non-pmem device\n");
+				return -EINVAL;
+			}
+
+			xas_store(&xas, page);
+			if (xas_error(&xas))
+				break;
+			xas_next(&xas);
+			if (!__sg_page_iter_next(&sg_iter))
+				break;
+		}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	return xas_error(&xas);
+}
+
 int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr)
 {
-	struct rxe_map		**map;
-	struct rxe_phys_buf	*buf = NULL;
-	struct ib_umem		*umem;
-	struct sg_page_iter	sg_iter;
-	int			num_buf;
-	void			*vaddr;
+	struct ib_umem *umem;
 	int err;
 
+	rxe_mr_init(access, mr);
+
+	xa_init(&mr->page_list);
+
 	umem = ib_umem_get(&rxe->ib_dev, start, length, access);
 	if (IS_ERR(umem)) {
 		rxe_dbg_mr(mr, "Unable to pin memory region err = %d\n",
 			(int)PTR_ERR(umem));
-		err = PTR_ERR(umem);
-		goto err_out;
+		return PTR_ERR(umem);
 	}
 
-	num_buf = ib_umem_num_pages(umem);
-
-	rxe_mr_init(access, mr);
-
-	err = rxe_mr_alloc(mr, num_buf);
+	err = rxe_mr_fill_pages_from_sgt(mr, &umem->sgt_append.sgt);
 	if (err) {
-		rxe_dbg_mr(mr, "Unable to allocate memory for map\n");
-		goto err_release_umem;
+		ib_umem_release(umem);
+		return err;
 	}
 
-	num_buf			= 0;
-	map = mr->map;
-	if (length > 0) {
-		bool persistent_access = access & IB_ACCESS_FLUSH_PERSISTENT;
-
-		buf = map[0]->buf;
-		for_each_sgtable_page (&umem->sgt_append.sgt, &sg_iter, 0) {
-			struct page *pg = sg_page_iter_page(&sg_iter);
+	mr->umem = umem;
+	mr->ibmr.type = IB_MR_TYPE_USER;
+	mr->state = RXE_MR_STATE_VALID;
 
-			if (persistent_access && !is_pmem_page(pg)) {
-				rxe_dbg_mr(mr, "Unable to register persistent access to non-pmem device\n");
-				err = -EINVAL;
-				goto err_release_umem;
-			}
+	return 0;
+}
 
-			if (num_buf >= RXE_BUF_PER_MAP) {
-				map++;
-				buf = map[0]->buf;
-				num_buf = 0;
-			}
+static int rxe_mr_alloc(struct rxe_mr *mr, int num_buf)
+{
+	XA_STATE(xas, &mr->page_list, 0);
+	int i = 0;
+	int err;
 
-			vaddr = page_address(pg);
-			if (!vaddr) {
-				rxe_dbg_mr(mr, "Unable to get virtual address\n");
-				err = -ENOMEM;
-				goto err_release_umem;
-			}
-			buf->addr = (uintptr_t)vaddr;
-			buf->size = mr_page_size(mr);
-			num_buf++;
-			buf++;
+	xa_init(&mr->page_list);
 
+	do {
+		xas_lock(&xas);
+		while (i != num_buf) {
+			xas_store(&xas, XA_ZERO_ENTRY);
+			if (xas_error(&xas))
+				break;
+			xas_next(&xas);
+			i++;
 		}
-	}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
 
-	mr->umem = umem;
-	mr->access = access;
-	mr->page_offset = ib_umem_offset(umem);
-	mr->state = RXE_MR_STATE_VALID;
-	mr->ibmr.type = IB_MR_TYPE_USER;
+	err = xas_error(&xas);
+	if (err)
+		return err;
 
-	return 0;
+	mr->num_buf = num_buf;
 
-err_release_umem:
-	ib_umem_release(umem);
-err_out:
-	return err;
+	return 0;
 }
 
 int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
@@ -214,7 +200,6 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	if (err)
 		goto err1;
 
-	mr->max_buf = max_pages;
 	mr->state = RXE_MR_STATE_FREE;
 	mr->ibmr.type = IB_MR_TYPE_MEM_REG;
 
@@ -224,206 +209,144 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
-static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
+static int rxe_set_page(struct ib_mr *ibmr, u64 iova)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	struct rxe_map *map;
-	struct rxe_phys_buf *buf;
+	struct page *page = virt_to_page(iova & mr->page_mask);
+	XA_STATE(xas, &mr->page_list, mr->nbuf);
+	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
+	int err;
+
+	if (persistent && !is_pmem_page(page)) {
+		rxe_dbg_mr(mr, "Unable to register persistent access to non-pmem device\n");
+		return -EINVAL;
+	}
 
 	if (unlikely(mr->nbuf == mr->num_buf))
 		return -ENOMEM;
 
-	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
-	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
+	do {
+		xas_lock(&xas);
+		xas_store(&xas, page);
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
 
-	buf->addr = addr;
-	buf->size = ibmr->page_size;
-	mr->nbuf++;
+	err = xas_error(&xas);
+	if (err)
+		return err;
 
+	mr->nbuf++;
 	return 0;
 }
 
-int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
 		  int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
 	unsigned int page_size = mr_page_size(mr);
 
+	mr->nbuf = 0;
 	mr->page_shift = ilog2(page_size);
 	mr->page_mask = ~((u64)page_size - 1);
-	mr->page_offset = ibmr->iova & (page_size - 1);
+	mr->page_offset = mr->ibmr.iova & (page_size - 1);
 
-	mr->nbuf = 0;
-
-	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
 }
 
-static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
-			size_t *offset_out)
+/*
+ * TODO: Attempting to modify the mr page map between the time
+ * a packet is received and the map is referenced as here
+ * in xa_load(&mr->page_list) will cause problems. It is OK to
+ * deregister the mr since the mr reference counts will preserve
+ * it until memory accesses are complete. Currently reregister mr
+ * operations are not supported by the rxe driver but could be
+ * in the future. Invalidate followed by fast_reg mr will change
+ * the map and then the rkey so delayed packets arriving in the
+ * middle could use the wrong map entries. This isn't new but was
+ * already the case in the earlier implementation. This won't be
+ * a problem for well behaved programs which wait until all the
+ * outstanding packets for the first FMR before remapping to the
+ * second.
+ */
+static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
+			      unsigned int length, enum rxe_mr_copy_dir dir)
 {
-	size_t offset = iova - mr->ibmr.iova + mr->page_offset;
-	int			map_index;
-	int			buf_index;
-	u64			length;
-
-	if (likely(mr->page_shift)) {
-		*offset_out = offset & (mr_page_size(mr) - 1);
-		offset >>= mr->page_shift;
-		*n_out = offset & mr->map_mask;
-		*m_out = offset >> mr->map_shift;
-	} else {
-		map_index = 0;
-		buf_index = 0;
-
-		length = mr->map[map_index]->buf[buf_index].size;
-
-		while (offset >= length) {
-			offset -= length;
-			buf_index++;
+	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+	unsigned long index = rxe_mr_iova_to_index(mr, iova);
+	unsigned int bytes;
+	struct page *page;
+	void *va;
 
-			if (buf_index == RXE_BUF_PER_MAP) {
-				map_index++;
-				buf_index = 0;
-			}
-			length = mr->map[map_index]->buf[buf_index].size;
-		}
+	while (length) {
+		page = xa_load(&mr->page_list, index);
+		if (!page)
+			return -EFAULT;
 
-		*m_out = map_index;
-		*n_out = buf_index;
-		*offset_out = offset;
+		bytes = min_t(unsigned int, length,
+				mr_page_size(mr) - page_offset);
+		va = kmap_local_page(page);
+		if (dir == RXE_FROM_MR_OBJ)
+			memcpy(addr, va + page_offset, bytes);
+		else
+			memcpy(va + page_offset, addr, bytes);
+		kunmap_local(va);
+
+		page_offset = 0;
+		addr += bytes;
+		length -= bytes;
+		index++;
 	}
-}
 
-void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
-{
-	size_t offset;
-	int m, n;
-
-	if (mr->state != RXE_MR_STATE_VALID)
-		return NULL;
-
-	if (mr->ibmr.type == IB_MR_TYPE_DMA)
-		return (void *)(uintptr_t)iova;
-
-	if (mr_check_range(mr, iova, length))
-		return NULL;
-
-	lookup_iova(mr, iova, &m, &n, &offset);
-
-	if (offset + length > mr->map[m]->buf[n].size)
-		return NULL;
-
-	return (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
+	return 0;
 }
 
-int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
+static void rxe_mr_copy_dma(struct rxe_mr *mr, u64 iova, void *addr,
+			    unsigned int length, enum rxe_mr_copy_dir dir)
 {
-	size_t offset;
+	unsigned int page_offset = iova & (PAGE_SIZE - 1);
+	unsigned int bytes;
+	struct page *page;
+	u8 *va;
 
-	if (length == 0)
-		return 0;
-
-	if (mr->ibmr.type == IB_MR_TYPE_DMA)
-		return -EFAULT;
-
-	offset = (iova - mr->ibmr.iova + mr->page_offset) & mr->page_mask;
-	while (length > 0) {
-		u8 *va;
-		int bytes;
-
-		bytes = mr->ibmr.page_size - offset;
-		if (bytes > length)
-			bytes = length;
-
-		va = iova_to_vaddr(mr, iova, length);
-		if (!va)
-			return -EFAULT;
-
-		arch_wb_cache_pmem(va, bytes);
-
-		length -= bytes;
+	while (length) {
+		page = virt_to_page(iova & mr->page_mask);
+		bytes = min_t(unsigned int, length,
+				PAGE_SIZE - page_offset);
+		va = kmap_local_page(page);
+
+		if (dir == RXE_TO_MR_OBJ)
+			memcpy(va + page_offset, addr, bytes);
+		else
+			memcpy(addr, va + page_offset, bytes);
+
+		kunmap_local(va);
+		page_offset = 0;
 		iova += bytes;
-		offset = 0;
+		addr += bytes;
+		length -= bytes;
 	}
-
-	return 0;
 }
 
-/* copy data from a range (vaddr, vaddr+length-1) to or from
- * a mr object starting at iova.
- */
-int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
-		enum rxe_mr_copy_dir dir)
+int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+		unsigned int length, enum rxe_mr_copy_dir dir)
 {
-	int			err;
-	int			bytes;
-	u8			*va;
-	struct rxe_map		**map;
-	struct rxe_phys_buf	*buf;
-	int			m;
-	int			i;
-	size_t			offset;
+	int err;
 
 	if (length == 0)
 		return 0;
 
 	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
-		u8 *src, *dest;
-
-		src = (dir == RXE_TO_MR_OBJ) ? addr : ((void *)(uintptr_t)iova);
-
-		dest = (dir == RXE_TO_MR_OBJ) ? ((void *)(uintptr_t)iova) : addr;
-
-		memcpy(dest, src, length);
-
+		rxe_mr_copy_dma(mr, iova, addr, length, dir);
 		return 0;
 	}
 
-	WARN_ON_ONCE(!mr->map);
-
 	err = mr_check_range(mr, iova, length);
-	if (err) {
-		err = -EFAULT;
-		goto err1;
-	}
-
-	lookup_iova(mr, iova, &m, &i, &offset);
-
-	map = mr->map + m;
-	buf	= map[0]->buf + i;
-
-	while (length > 0) {
-		u8 *src, *dest;
-
-		va	= (u8 *)(uintptr_t)buf->addr + offset;
-		src = (dir == RXE_TO_MR_OBJ) ? addr : va;
-		dest = (dir == RXE_TO_MR_OBJ) ? va : addr;
-
-		bytes	= buf->size - offset;
-
-		if (bytes > length)
-			bytes = length;
-
-		memcpy(dest, src, bytes);
-
-		length	-= bytes;
-		addr	+= bytes;
-
-		offset	= 0;
-		buf++;
-		i++;
-
-		if (i == RXE_BUF_PER_MAP) {
-			i = 0;
-			map++;
-			buf = map[0]->buf;
-		}
+	if (unlikely(err)) {
+		rxe_dbg_mr(mr, "iova out of range");
+		return err;
 	}
 
-	return 0;
-
-err1:
-	return err;
+	return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
 }
 
 /* copy data in or out of a wqe, i.e. sg list
@@ -495,7 +418,6 @@ int copy_data(
 
 		if (bytes > 0) {
 			iova = sge->addr + offset;
-
 			err = rxe_mr_copy(mr, iova, addr, bytes, dir);
 			if (err)
 				goto err2;
@@ -522,50 +444,113 @@ int copy_data(
 	return err;
 }
 
+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
+{
+	unsigned int page_offset;
+	unsigned long index;
+	struct page *page;
+	unsigned int bytes;
+	int err;
+	u8 *va;
+
+	if (length == 0)
+		return 0;
+
+	if (mr->ibmr.type == IB_MR_TYPE_DMA)
+		return -EFAULT;
+
+	err = mr_check_range(mr, iova, length);
+	if (err)
+		return err;
+
+	while (length > 0) {
+		index = rxe_mr_iova_to_index(mr, iova);
+		page = xa_load(&mr->page_list, index);
+		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+		if (!page)
+			return -EFAULT;
+		bytes = min_t(unsigned int, length,
+				mr_page_size(mr) - page_offset);
+
+		va = kmap_local_page(page);
+		if (!va)
+			return -EFAULT;
+
+		arch_wb_cache_pmem(va + page_offset, bytes);
+
+		length -= bytes;
+		iova += bytes;
+		page_offset = 0;
+	}
+
+	return 0;
+}
+
 /* Guarantee atomicity of atomic operations at the machine level. */
 static DEFINE_SPINLOCK(atomic_ops_lock);
 
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val)
 {
-	u64 *va;
+	unsigned int page_offset;
+	struct page *page;
 	u64 value;
+	u64 *va;
 
-	if (mr->state != RXE_MR_STATE_VALID) {
+	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
 		rxe_dbg_mr(mr, "mr not in valid state");
 		return -EINVAL;
 	}
 
-	va = iova_to_vaddr(mr, iova, sizeof(u64));
-	if (!va) {
-		rxe_dbg_mr(mr, "iova out of range");
-		return -ERANGE;
+	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
+		page_offset = iova & (PAGE_SIZE - 1);
+		page = virt_to_page(iova & PAGE_MASK);
+	} else {
+		unsigned long index;
+		int err;
+
+		err = mr_check_range(mr, iova, sizeof(value));
+		if (err) {
+			rxe_dbg_mr(mr, "iova out of range");
+			return -ERANGE;
+		}
+		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+		index = rxe_mr_iova_to_index(mr, iova);
+		page = xa_load(&mr->page_list, index);
+		if (!page)
+			return -EFAULT;
 	}
 
-	if ((uintptr_t)va & 0x7) {
+	if (unlikely(page_offset & 0x7)) {
 		rxe_dbg_mr(mr, "iova not aligned");
-		return RXE_ERR_NOT_ALIGNED;
+		return -RXE_ERR_NOT_ALIGNED;
 	}
 
+	va = kmap_local_page(page);
+
 	spin_lock_bh(&atomic_ops_lock);
-	value = *orig_val = *va;
+	value = *orig_val = va[page_offset >> 3];
 
 	if (opcode == IB_OPCODE_RC_COMPARE_SWAP) {
 		if (value == compare)
-			*va = swap_add;
+			va[page_offset >> 3] = swap_add;
 	} else {
 		value += swap_add;
-		*va = value;
+		va[page_offset >> 3] = value;
 	}
 	spin_unlock_bh(&atomic_ops_lock);
 
+	kunmap_local(va);
+
 	return 0;
 }
 
-/* only implemented for 64 bit architectures */
+#if defined CONFIG_64BIT
+/* only implemented or called for 64 bit architectures */
 int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 {
-#if defined CONFIG_64BIT
+	unsigned int page_offset;
+	struct page *page;
 	u64 *va;
 
 	/* See IBA oA19-28 */
@@ -574,27 +559,49 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 		return -EINVAL;
 	}
 
-	va = iova_to_vaddr(mr, iova, sizeof(value));
-	if (unlikely(!va)) {
-		rxe_dbg_mr(mr, "iova out of range");
-		return -ERANGE;
+	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
+		page_offset = iova & (PAGE_SIZE - 1);
+		page = virt_to_page(iova & PAGE_MASK);
+	} else {
+		unsigned long index;
+		int err;
+
+		/* See IBA oA19-28 */
+		err = mr_check_range(mr, iova, sizeof(value));
+		if (unlikely(err)) {
+			rxe_dbg_mr(mr, "iova out of range");
+			return -ERANGE;
+		}
+		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+		index = rxe_mr_iova_to_index(mr, iova);
+		page = xa_load(&mr->page_list, index);
+		if (!page)
+			return -EFAULT;
 	}
 
 	/* See IBA A19.4.2 */
-	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
+	if (unlikely(page_offset & 0x7)) {
 		rxe_dbg_mr(mr, "misaligned address");
 		return -RXE_ERR_NOT_ALIGNED;
 	}
 
+	va = kmap_local_page(page);
+
 	/* Do atomic write after all prior operations have completed */
-	smp_store_release(va, value);
+	/* TODO: This is what was chosen by the implementer but I am
+	 * concerned it isn't what they want. This only guarantees that
+	 * the write will complete before any subsequent reads but the
+	 * comment says all prior operations have completed. That would
+	 * require a full mb() or matching acquire.
+	 * Normal usage has a matching load_acquire and store_release.
+	 */
+	smp_store_release(&va[page_offset >> 3], value);
+
+	kunmap_local(va);
 
 	return 0;
-#else
-	WARN_ON(1);
-	return -EINVAL;
-#endif
 }
+#endif
 
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 {
@@ -629,12 +636,6 @@ int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 	return 0;
 }
 
-/* (1) find the mr corresponding to lkey/rkey
- *     depending on lookup_type
- * (2) verify that the (qp) pd matches the mr pd
- * (3) verify that the mr can support the requested access
- * (4) verify that mr state is valid
- */
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type)
 {
@@ -755,15 +756,10 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 void rxe_mr_cleanup(struct rxe_pool_elem *elem)
 {
 	struct rxe_mr *mr = container_of(elem, typeof(*mr), elem);
-	int i;
 
 	rxe_put(mr_pd(mr));
 	ib_umem_release(mr->umem);
 
-	if (mr->map) {
-		for (i = 0; i < mr->num_map; i++)
-			kfree(mr->map[i]);
-
-		kfree(mr->map);
-	}
+	if (mr->ibmr.type != IB_MR_TYPE_DMA)
+		xa_destroy(&mr->page_list);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 5c3d1500ca68..09437077ceee 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -288,17 +288,6 @@ enum rxe_mr_lookup_type {
 	RXE_LOOKUP_REMOTE,
 };
 
-#define RXE_BUF_PER_MAP		(PAGE_SIZE / sizeof(struct rxe_phys_buf))
-
-struct rxe_phys_buf {
-	u64      addr;
-	u64      size;
-};
-
-struct rxe_map {
-	struct rxe_phys_buf	buf[RXE_BUF_PER_MAP];
-};
-
 static inline int rkey_is_mw(u32 rkey)
 {
 	u32 index = rkey >> 8;
@@ -316,22 +305,16 @@ struct rxe_mr {
 	u32			rkey;
 	enum rxe_mr_state	state;
 	int			access;
+	atomic_t		num_mw;
 
 	unsigned int		page_offset;
 	unsigned int		page_shift;
 	u64			page_mask;
-	int			map_shift;
-	int			map_mask;
 
 	u32			num_buf;
 	u32			nbuf;
 
-	u32			max_buf;
-	u32			num_map;
-
-	atomic_t		num_mw;
-
-	struct rxe_map		**map;
+	struct xarray		page_list;
 };
 
 static inline unsigned int mr_page_size(struct rxe_mr *mr)
-- 
2.37.2


* Re: [PATCH for-next v6 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray
  2023-01-17 17:25 ` [PATCH for-next v6 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray Bob Pearson
@ 2023-01-23 19:36   ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2023-01-23 19:36 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Tue, Jan 17, 2023 at 11:25:41AM -0600, Bob Pearson wrote:
> @@ -574,27 +559,49 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
>  		return -EINVAL;
>  	}
>  
> -	va = iova_to_vaddr(mr, iova, sizeof(value));
> -	if (unlikely(!va)) {
> -		rxe_dbg_mr(mr, "iova out of range");
> -		return -ERANGE;
> +	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
> +		page_offset = iova & (PAGE_SIZE - 1);
> +		page = virt_to_page(iova & PAGE_MASK);
> +	} else {
> +		unsigned long index;
> +		int err;
> +
> +		/* See IBA oA19-28 */
> +		err = mr_check_range(mr, iova, sizeof(value));
> +		if (unlikely(err)) {
> +			rxe_dbg_mr(mr, "iova out of range");
> +			return -ERANGE;
> +		}
> +		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
> +		index = rxe_mr_iova_to_index(mr, iova);
> +		page = xa_load(&mr->page_list, index);
> +		if (!page)
> +			return -EFAULT;
>  	}
>  
>  	/* See IBA A19.4.2 */
> -	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
> +	if (unlikely(page_offset & 0x7)) {
>  		rxe_dbg_mr(mr, "misaligned address");
>  		return -RXE_ERR_NOT_ALIGNED;
>  	}
>  
> +	va = kmap_local_page(page);
> +
>  	/* Do atomic write after all prior operations have completed */
> -	smp_store_release(va, value);
> +	/* TODO: This is what was chosen by the implementer but I am
> +	 * concerned it isn't what they want. This only guarantees that
> +	 * the write will complete before any subsequent reads but the
> +	 * comment says all prior operations have completed. That would
> +	 * require a full mb() or matching acquire.
> +	 * Normal usage has a matching load_acquire and store_release.
> +	 */
> +	smp_store_release(&va[page_offset >> 3], value);

The 'atomicness' is that the NIC side does a 'release' and the CPU
side will do an 'acquire' when it reads the same memory.

The 'acquire' from the CPU side will ensure that any prior writes or
atomics done by the NIC are visible to the CPU - because that is what
acquire/release means.

Eg if the NIC does an RDMA write to X and then an atomic update (and
release), then the acquire will observe X too if it observed the atomic
update.

acquire/release and rmb/wmb are two different models of the same
concept. acquire/release is more datacentric and tends to speak more
about how data observability is ordered, while the barriers tend to
talk more about how the CPU orders operations.
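
A minimal sketch of the pairing (hypothetical names, not rxe code):

/* NIC/responder side: the data store is ordered before the release
 * store to the atomic location.
 */
static void nic_side(u64 *data, u64 *atomic_loc, u64 payload, u64 new_val)
{
	WRITE_ONCE(*data, payload);		/* e.g. a prior RDMA write */
	smp_store_release(atomic_loc, new_val);	/* the atomic write */
}

/* CPU/consumer side: if the acquire load observes new_val, the
 * store to *data is guaranteed to be visible as well.
 */
static u64 cpu_side(u64 *data, u64 *atomic_loc, u64 new_val)
{
	if (smp_load_acquire(atomic_loc) == new_val)
		return READ_ONCE(*data);
	return 0;
}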

Jason

* Re: [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray
  2023-01-17 17:25 [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (5 preceding siblings ...)
  2023-01-17 17:25 ` [PATCH for-next v6 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray Bob Pearson
@ 2023-01-23 19:36 ` Jason Gunthorpe
  2023-01-24  3:43   ` Zhu Yanjun
  6 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2023-01-23 19:36 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Tue, Jan 17, 2023 at 11:25:35AM -0600, Bob Pearson wrote:
> This patch series replaces the page map carried in each memory region
> with a struct xarray. It is based on a sketch developed by Jason
> Gunthorpe. The first five patches are preparation that tries to
> cleanly isolate all the mr specific code into rxe_mr.c. The sixth
> patch is the actual change.

I think this is fine, are all the other people satisfied?

Thanks,
Jason

* Re: [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray
  2023-01-23 19:36 ` [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray Jason Gunthorpe
@ 2023-01-24  3:43   ` Zhu Yanjun
  2023-01-24  5:39     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2023-01-24  3:43 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Bob Pearson, leonro, yangx.jy, linux-rdma

On Tue, Jan 24, 2023 at 3:36 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Jan 17, 2023 at 11:25:35AM -0600, Bob Pearson wrote:
> > This patch series replaces the page map carried in each memory region
> > with a struct xarray. It is based on a sketch developed by Jason
> > Gunthorpe. The first five patches are preparation that tries to
> > cleanly isolate all the mr specific code into rxe_mr.c. The sixth
> > patch is the actual change.
>
> I think this is fine, are all the other people satisfied?

I noticed that RXE is disabled in RHEL 9.x, while in RHEL 8.x it is
still enabled. It seems RXE was disabled in RHEL 9.x because of
instability, and recently RXE has accepted several patch series.
IMO, we should take more time to test and fix bugs before new patch
series are accepted.

This would make RXE more stable, and more Linux distributions would
enable it. Otherwise, more and more distributions will disable RXE,
fewer and fewer users will use it, and eventually RXE will not be
used at all.

I think this is not what the RXE guys expect.

As such, it would be better to take more time for testing and make
RXE more stable. We should not accept too many patch series in a
short time; too many patch series bring risk.

Zhu Yanjun


>
> Thanks,
> Jason

* Re: [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray
  2023-01-24  3:43   ` Zhu Yanjun
@ 2023-01-24  5:39     ` Daisuke Matsuda (Fujitsu)
  2023-01-24 15:17       ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2023-01-24  5:39 UTC (permalink / raw)
  To: 'Zhu Yanjun', Jason Gunthorpe
  Cc: Bob Pearson, leonro@nvidia.com, yangx.jy@fujitsu.com,
	linux-rdma@vger.kernel.org

Tue, Jan 24, 2023 12:43 PM Zhu Yanjun wrote:
> On Tue, Jan 24, 2023 at 3:36 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 11:25:35AM -0600, Bob Pearson wrote:
> > > This patch series replaces the page map carried in each memory region
> > > with a struct xarray. It is based on a sketch developed by Jason
> > > Gunthorpe. The first five patches are preparation that tries to
> > > cleanly isolate all the mr specific code into rxe_mr.c. The sixth
> > > patch is the actual change.
> >
> > I think this is fine, are all the other people satisfied?
> 
> I noticed that RXE is disabled in RHEL9.x. And in RHEL 8.x, RXE still
> is enabled.
> It seems that RXE is disabled in RHEL 9.x because of instability.
> And recently RXE accepted several patch series.
> IMO, we should have more time to make tests and fix bugs before the
> new patch series are accepted.

I am relatively new here, but I think what Zhu says is true.

While some patch series are still pending, new large patch series that
are hard to review keep arriving, and they get merged without being
tested and inspected enough, resulting in new bugs. I suppose that is
what has been happening here.

> 
> This can make RXE more stable. And more linux distributions will enable it.
> Or else, more and more linux distributions will disable RXE. Less and
> less users will use RXE.
> Finally RXE will never be used in the future.
> 
> I think, this is not what the RXE guys expect.
> 
> As such, it had better have more time to make tests and let RXE more stable.
> We should not accept too many patch series in short time. Too many
> patch series will bring risks.

Blocking new patch series entirely will make rxe less attractive to
contributors. I propose that each of us have at most one pending patch
series at a time that consists of more than 4 patches or so. That will
make the situation a lot clearer and make it easier for us to review
each other's patches.

Daisuke

> 
> Zhu Yanjun
> 
> 
> >
> > Thanks,
> > Jason

* Re: [PATCH for-next v6 0/6] RDMA/rxe: Replace mr page map with an xarray
  2023-01-24  5:39     ` Daisuke Matsuda (Fujitsu)
@ 2023-01-24 15:17       ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2023-01-24 15:17 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu)
  Cc: 'Zhu Yanjun', Bob Pearson, leonro@nvidia.com,
	yangx.jy@fujitsu.com, linux-rdma@vger.kernel.org

On Tue, Jan 24, 2023 at 05:39:34AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> Tue, Jan 24, 2023 12:43 PM Zhu Yanjun wrote:
> > On Tue, Jan 24, 2023 at 3:36 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 11:25:35AM -0600, Bob Pearson wrote:
> > > > This patch series replaces the page map carried in each memory region
> > > > with a struct xarray. It is based on a sketch developed by Jason
> > > > Gunthorpe. The first five patches are preparation that tries to
> > > > cleanly isolate all the mr specific code into rxe_mr.c. The sixth
> > > > patch is the actual change.
> > >
> > > I think this is fine, are all the other people satisfied?
> > 
> > I noticed that RXE is disabled in RHEL9.x. And in RHEL 8.x, RXE still
> > is enabled.
> > It seems that RXE is disabled in RHEL 9.x because of instability.
> > And recently RXE accepted several patch series.
> > IMO, we should have more time to make tests and fix bugs before the
> > new patch series are accepted.
> 
> I am relatively a newcomer here, but I think what Zhu says is true.
> 
> While there are some pending patch series, there comes a new large
> patch series that is hard to review, and they get merged without being
> tested and inspected enough resulting in new bugs. I suppose that is 
> what have been happening here.

I think the counterpoint is that rxe isn't really a production piece
of software. If the goal of rxe is to experiment with these new
features people keep proposing then we should let it advance.

Stability comes from actual testing, not from time. If people aren't
testing then waiting longer won't magically make it better.

> Blocking new patch series totally will make the rxe less attractive to
> the contributors. 

Exactly, I would rather people work on rxe than chase some ideal of
bug-freeness

Jason
