linux-rdma.vger.kernel.org archive mirror
* [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
@ 2024-10-09  1:58 Daisuke Matsuda
  2024-10-09  1:58 ` [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code Daisuke Matsuda
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:58 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

This patch series implements the On-Demand Paging feature in the SoftRoCE (rxe)
driver. The feature has so far been available only in the mlx5 driver[1].

This series has been blocked by the hang issue in the blktests srp/002 test[2],
which was believed to be introduced by commit 9b4b7c1f9f54
("RDMA/rxe: Add workqueue support for rxe tasks"). My patches depend on that
commit because the ODP feature requires sleeping in kernel context, which is
impossible with the former tasklet implementation.

According to the original reporter[3], the hang issue is already gone in
v6.10. Additionally, tasklets have been marked deprecated[4]. I think the rxe
driver is ready to accept this series since there is no longer any reason
to consider reverting to the old tasklet implementation.

I have omitted some content, such as the motivation behind this series, from
this cover letter. Please see the cover letter of v3 for more details[5].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
its pages so that physical addresses never change during RDMA communication.
This requires the MR to fit in physical memory and inevitably leads to memory
pressure. On-Demand Paging (ODP), in contrast, allows applications to register
MRs without pinning pages. Pages are paged in when the driver requires them
and paged out when the OS reclaims them. As a result, it is possible to
register a large MR that does not fit in physical memory without consuming
that much physical memory up front.

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR at registration time. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table, in which DMA addresses
and PFNs are stored. The entries are updated on page-in and page-out, both of
which use the common interfaces in the ib_uverbs layer.
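
As an illustration, a lookup in this driver page table boils down to
computing an index into those arrays from an iova. Below is a minimal
sketch; the helper name odp_iova_to_index is made up for this example,
while the series itself uses rxe_mr_iova_to_index() introduced in patch 1.

  /* Illustrative only: index of 'iova' within the pfn_list/dma_list
   * arrays held by struct ib_umem_odp.
   */
  static unsigned long odp_iova_to_index(struct ib_umem_odp *umem_odp, u64 iova)
  {
          return (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
  }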

Page-in can occur when the requester, responder, or completer accesses an MR
in order to process RDMA operations. If they find that the pages being
accessed are not present in physical memory, or that the requisite permissions
are not set on the pages, they trigger a page fault to make the pages present
with the proper permissions and, at the same time, update the driver page
table. After confirming the presence of the pages, they execute the memory
access, such as read, write, or atomic operations.
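
In code, the fault path has roughly the following shape. This is a
condensed, hypothetical sketch based on rxe_odp_do_pagefault_and_lock()
and rxe_mr_set_xarray() from patch 4 of this series; the helper name
odp_fault_and_map is made up, and the real code keeps umem_mutex held
until the subsequent data access has finished.

  static int odp_fault_and_map(struct rxe_mr *mr, u64 user_va, int bcnt,
                               u64 access_mask, bool fault)
  {
          struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
          int np;

          /* Fault the pages in; returns with umem_odp->umem_mutex held on
           * success so that invalidation cannot race with the update below.
           */
          np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
                                            access_mask, fault);
          if (np < 0)
                  return np;      /* page fault failed; access is rejected */

          /* Mirror the freshly faulted PFNs into the MR xarray. */
          rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);

          mutex_unlock(&umem_odp->umem_mutex);
          return np;
  }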

Page-out is triggered by page reclaim or by filesystem events (e.g. a
metadata update of a file that is being used as an MR). When creating an
ODP-enabled MR, the driver registers an MMU notifier callback. When the
kernel issues a page invalidation notification, the callback is invoked to
unmap DMA addresses and update the driver page table. After that, the kernel
releases the pages.
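
Both pieces appear in this series: the invalidation callback is added in
patch 3, and the registration happens when an ODP-enabled MR is created in
patch 4. Schematically:

  /* Patch 3: the callback clears the MR xarray entries and unmaps the DMA
   * addresses for the invalidated range.
   */
  const struct mmu_interval_notifier_ops rxe_mn_ops = {
          .invalidate = rxe_ib_invalidate_range,
  };

  /* Patch 4: creating an ODP-enabled MR registers the callback for the
   * MR's address range via the common ib_uverbs helper.
   */
  umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
                             &rxe_mn_ops);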

[Supported operations]
All traditional operations are supported on RC connections. The new Atomic
write[6] and RDMA Flush[7] operations are not included in this patchset; I
will post them after this patchset is merged. On UD connections, Send, Recv,
and SRQ-Recv are supported.

[How to test ODP?]
There are only a few resources available for testing. The pyverbs test cases
in rdma-core and perftest[8] are the recommended ones. Other than those, the
ibv_rc_pingpong command can also be used for testing. Note that you may have
to build perftest from the upstream source because old versions do not handle
ODP capabilities correctly.
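
For a quick sanity check from user space, the generic verbs API can also be
used directly. The following is a minimal sketch (not part of this series)
that checks the ODP capability and registers an ODP-enabled MR; it assumes
ctx, pd, buf, and len are already set up, and error handling is omitted.

  #include <infiniband/verbs.h>

  struct ibv_device_attr_ex attr = {};
  struct ibv_mr *mr;

  /* Verify that the device reports ODP support before registering. */
  if (ibv_query_device_ex(ctx, NULL, &attr) ||
      !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
          return -1;

  /* IBV_ACCESS_ON_DEMAND corresponds to IB_ACCESS_ON_DEMAND in the kernel. */
  mr = ibv_reg_mr(pd, buf, len,
                  IBV_ACCESS_ON_DEMAND | IBV_ACCESS_LOCAL_WRITE |
                  IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);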

The latest ODP tree is available from github:
https://github.com/ddmatsu/linux/tree/odp_v8

[Future work]
My next work is to enable the new Atomic write[6] and RDMA Flush[7]
operations with ODP. After that, I am going to implement the prefetch
feature, which allows applications to trigger page faults via
ibv_advise_mr(3) to optimize performance. Some existing software, such as
librpma[9], uses this feature. Additionally, I think we can also add the
implicit ODP feature in the future.
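
For reference, the prefetch interface already exists in libibverbs. A
minimal usage sketch (not part of this series; it assumes pd, mr, buf, and
len refer to an ODP-enabled registration like the one above):

  struct ibv_sge sge = {
          .addr   = (uintptr_t)buf,
          .length = (uint32_t)len,
          .lkey   = mr->lkey,
  };

  /* Ask the provider to fault the range in ahead of time for writing. */
  ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);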

[1] Understanding On Demand Paging (ODP)
https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x

[2] [bug report] blktests srp/002 hang
https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/

[3] blktests failures with v6.10-rc1 kernel
https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/

[4] [00/15] ethernet: Convert from tasklet to BH workqueue
https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/

[5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/

[6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/

[7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/

[8] linux-rdma/perftest: Infiniband Verbs Performance Tests
https://github.com/linux-rdma/perftest

[9] librpma: Remote Persistent Memory Access Library
https://github.com/pmem/rpma

v7->v8:
 1) Dropped the first patch because the same change was made by Bob Pearson.
 cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
 2) Rebased to 6.12.1-rc2

v6->v7:
 1) Rebased to 6.6.0
 2) Disabled using hugepages with ODP
 3) Addressed comments on v6 from Jason and Zhu
   cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/

v5->v6:
 Fixed the implementation according to Jason's suggestions
   cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
   cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/

v4->v5:
 1) Rebased to 6.4.0-rc2+
 2) Changed to schedule all work on the responder and completer to a workqueue

v3->v4:
 1) Re-designed functions that access MRs to use the MR xarray.
 2) Rebased onto the latest jgg-for-next tree.

v2->v3:
 1) Removed a patch that changes the common ib_uverbs layer.
 2) Re-implemented patches for conversion to workqueue.
 3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
 4) Fixed some functions that returned incorrect errors.
 5) Temporarily disabled ODP for RDMA Flush and Atomic Write.

v1->v2:
 1) Fixed a crash issue reported by Haris Iqbal.
 2) Tried to make lock patterns clearer, as pointed out by Leon Romanovsky.
 3) Minor clean ups and fixes.

Daisuke Matsuda (6):
  RDMA/rxe: Make MR functions accessible from other rxe source code
  RDMA/rxe: Move resp_states definition to rxe_verbs.h
  RDMA/rxe: Add page invalidation support
  RDMA/rxe: Allow registering MRs for On-Demand Paging
  RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
  RDMA/rxe: Add support for the traditional Atomic operations with ODP

 drivers/infiniband/sw/rxe/Makefile    |   2 +
 drivers/infiniband/sw/rxe/rxe.c       |  18 ++
 drivers/infiniband/sw/rxe/rxe.h       |  37 ----
 drivers/infiniband/sw/rxe/rxe_loc.h   |  39 ++++
 drivers/infiniband/sw/rxe/rxe_mr.c    |  34 +++-
 drivers/infiniband/sw/rxe/rxe_odp.c   | 282 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c  |  18 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |  37 ++++
 9 files changed, 419 insertions(+), 53 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c

-- 
2.43.0



* [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
@ 2024-10-09  1:58 ` Daisuke Matsuda
  2024-10-09 14:13   ` Zhu Yanjun
  2024-12-09 19:19   ` Jason Gunthorpe
  2024-10-09  1:58 ` [PATCH for-next v8 2/6] RDMA/rxe: Move resp_states definition to rxe_verbs.h Daisuke Matsuda
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:58 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

Some functions in rxe_mr.c are going to be used in rxe_odp.c, which will be
created in a subsequent patch. Add declarations of these functions to
rxe_loc.h.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++++++
 drivers/infiniband/sw/rxe/rxe_mr.c  | 11 +++--------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index ded46119151b..866c36533b53 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -58,6 +58,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
 
 /* rxe_mr.c */
 u8 rxe_get_next_key(u32 last_key);
+void rxe_mr_init(int access, struct rxe_mr *mr);
 void rxe_mr_init_dma(int access, struct rxe_mr *mr);
 int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 		     int access, struct rxe_mr *mr);
@@ -69,6 +70,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
 int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset);
+int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
+		       unsigned int length, enum rxe_mr_copy_dir dir);
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val);
 int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
@@ -80,6 +83,11 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
 int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
 void rxe_mr_cleanup(struct rxe_pool_elem *elem);
 
+static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
+{
+	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
+}
+
 /* rxe_mw.c */
 int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
 int rxe_dealloc_mw(struct ib_mw *ibmw);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index da3dee520876..1f7b8cf93adc 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -45,7 +45,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 	}
 }
 
-static void rxe_mr_init(int access, struct rxe_mr *mr)
+void rxe_mr_init(int access, struct rxe_mr *mr)
 {
 	u32 key = mr->elem.index << 8 | rxe_get_next_key(-1);
 
@@ -72,11 +72,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
 	mr->ibmr.type = IB_MR_TYPE_DMA;
 }
 
-static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
-{
-	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
-}
-
 static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
 {
 	return iova & (mr_page_size(mr) - 1);
@@ -242,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
 	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
 }
 
-static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
-			      unsigned int length, enum rxe_mr_copy_dir dir)
+int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
+		       unsigned int length, enum rxe_mr_copy_dir dir)
 {
 	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
 	unsigned long index = rxe_mr_iova_to_index(mr, iova);
-- 
2.43.0



* [PATCH for-next v8 2/6] RDMA/rxe: Move resp_states definition to rxe_verbs.h
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
  2024-10-09  1:58 ` [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code Daisuke Matsuda
@ 2024-10-09  1:58 ` Daisuke Matsuda
  2024-12-09 19:20   ` Jason Gunthorpe
  2024-10-09  1:59 ` [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support Daisuke Matsuda
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:58 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

To use the resp_states values in rxe_loc.h, it is necessary to move the
definition to rxe_verbs.h, where other internal states of this driver are
defined.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.h       | 37 ---------------------------
 drivers/infiniband/sw/rxe/rxe_verbs.h | 37 +++++++++++++++++++++++++++
 2 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.h b/drivers/infiniband/sw/rxe/rxe.h
index d8fb2c7af30a..193f7caffaf2 100644
--- a/drivers/infiniband/sw/rxe/rxe.h
+++ b/drivers/infiniband/sw/rxe/rxe.h
@@ -100,43 +100,6 @@
 #define rxe_info_mw(mw, fmt, ...) ibdev_info_ratelimited((mw)->ibmw.device, \
 		"mw#%d %s:  " fmt, (mw)->elem.index, __func__, ##__VA_ARGS__)
 
-/* responder states */
-enum resp_states {
-	RESPST_NONE,
-	RESPST_GET_REQ,
-	RESPST_CHK_PSN,
-	RESPST_CHK_OP_SEQ,
-	RESPST_CHK_OP_VALID,
-	RESPST_CHK_RESOURCE,
-	RESPST_CHK_LENGTH,
-	RESPST_CHK_RKEY,
-	RESPST_EXECUTE,
-	RESPST_READ_REPLY,
-	RESPST_ATOMIC_REPLY,
-	RESPST_ATOMIC_WRITE_REPLY,
-	RESPST_PROCESS_FLUSH,
-	RESPST_COMPLETE,
-	RESPST_ACKNOWLEDGE,
-	RESPST_CLEANUP,
-	RESPST_DUPLICATE_REQUEST,
-	RESPST_ERR_MALFORMED_WQE,
-	RESPST_ERR_UNSUPPORTED_OPCODE,
-	RESPST_ERR_MISALIGNED_ATOMIC,
-	RESPST_ERR_PSN_OUT_OF_SEQ,
-	RESPST_ERR_MISSING_OPCODE_FIRST,
-	RESPST_ERR_MISSING_OPCODE_LAST_C,
-	RESPST_ERR_MISSING_OPCODE_LAST_D1E,
-	RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
-	RESPST_ERR_RNR,
-	RESPST_ERR_RKEY_VIOLATION,
-	RESPST_ERR_INVALIDATE_RKEY,
-	RESPST_ERR_LENGTH,
-	RESPST_ERR_CQ_OVERFLOW,
-	RESPST_ERROR,
-	RESPST_DONE,
-	RESPST_EXIT,
-};
-
 void rxe_set_mtu(struct rxe_dev *rxe, unsigned int dev_mtu);
 
 int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 3c1354f82283..e4656c7640f0 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -126,6 +126,43 @@ struct rxe_comp_info {
 	u32			rnr_retry;
 };
 
+/* responder states */
+enum resp_states {
+	RESPST_NONE,
+	RESPST_GET_REQ,
+	RESPST_CHK_PSN,
+	RESPST_CHK_OP_SEQ,
+	RESPST_CHK_OP_VALID,
+	RESPST_CHK_RESOURCE,
+	RESPST_CHK_LENGTH,
+	RESPST_CHK_RKEY,
+	RESPST_EXECUTE,
+	RESPST_READ_REPLY,
+	RESPST_ATOMIC_REPLY,
+	RESPST_ATOMIC_WRITE_REPLY,
+	RESPST_PROCESS_FLUSH,
+	RESPST_COMPLETE,
+	RESPST_ACKNOWLEDGE,
+	RESPST_CLEANUP,
+	RESPST_DUPLICATE_REQUEST,
+	RESPST_ERR_MALFORMED_WQE,
+	RESPST_ERR_UNSUPPORTED_OPCODE,
+	RESPST_ERR_MISALIGNED_ATOMIC,
+	RESPST_ERR_PSN_OUT_OF_SEQ,
+	RESPST_ERR_MISSING_OPCODE_FIRST,
+	RESPST_ERR_MISSING_OPCODE_LAST_C,
+	RESPST_ERR_MISSING_OPCODE_LAST_D1E,
+	RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
+	RESPST_ERR_RNR,
+	RESPST_ERR_RKEY_VIOLATION,
+	RESPST_ERR_INVALIDATE_RKEY,
+	RESPST_ERR_LENGTH,
+	RESPST_ERR_CQ_OVERFLOW,
+	RESPST_ERROR,
+	RESPST_DONE,
+	RESPST_EXIT,
+};
+
 enum rdatm_res_state {
 	rdatm_res_state_next,
 	rdatm_res_state_new,
-- 
2.43.0



* [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
  2024-10-09  1:58 ` [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code Daisuke Matsuda
  2024-10-09  1:58 ` [PATCH for-next v8 2/6] RDMA/rxe: Move resp_states definition to rxe_verbs.h Daisuke Matsuda
@ 2024-10-09  1:59 ` Daisuke Matsuda
  2024-10-13  6:15   ` Zhu Yanjun
                     ` (2 more replies)
  2024-10-09  1:59 ` [PATCH for-next v8 4/6] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:59 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

On page invalidation, an MMU notifier callback is invoked to unmap DMA
addresses and update the driver page table (umem_odp->dma_list). It also
sets the corresponding entries in the MR xarray to NULL to prevent any
access. The callback is registered when an ODP-enabled MR is created.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/Makefile  |  2 +
 drivers/infiniband/sw/rxe/rxe_odp.c | 57 +++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c

diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 5395a581f4bb..93134f1d1d0c 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -23,3 +23,5 @@ rdma_rxe-y := \
 	rxe_task.o \
 	rxe_net.o \
 	rxe_hw_counters.o
+
+rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
new file mode 100644
index 000000000000..ea55b79be0c6
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022-2023 Fujitsu Ltd. All rights reserved.
+ */
+
+#include <linux/hmm.h>
+
+#include <rdma/ib_umem_odp.h>
+
+#include "rxe.h"
+
+static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
+				unsigned long end)
+{
+	unsigned long upper = rxe_mr_iova_to_index(mr, end - 1);
+	unsigned long lower = rxe_mr_iova_to_index(mr, start);
+	void *entry;
+
+	XA_STATE(xas, &mr->page_list, lower);
+
+	/* make elements in xarray NULL */
+	xas_lock(&xas);
+	xas_for_each(&xas, entry, upper)
+		xas_store(&xas, NULL);
+	xas_unlock(&xas);
+}
+
+static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
+				    const struct mmu_notifier_range *range,
+				    unsigned long cur_seq)
+{
+	struct ib_umem_odp *umem_odp =
+		container_of(mni, struct ib_umem_odp, notifier);
+	struct rxe_mr *mr = umem_odp->private;
+	unsigned long start, end;
+
+	if (!mmu_notifier_range_blockable(range))
+		return false;
+
+	mutex_lock(&umem_odp->umem_mutex);
+	mmu_interval_set_seq(mni, cur_seq);
+
+	start = max_t(u64, ib_umem_start(umem_odp), range->start);
+	end = min_t(u64, ib_umem_end(umem_odp), range->end);
+
+	rxe_mr_unset_xarray(mr, start, end);
+
+	/* update umem_odp->dma_list */
+	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
+
+	mutex_unlock(&umem_odp->umem_mutex);
+	return true;
+}
+
+const struct mmu_interval_notifier_ops rxe_mn_ops = {
+	.invalidate = rxe_ib_invalidate_range,
+};
-- 
2.43.0



* [PATCH for-next v8 4/6] RDMA/rxe: Allow registering MRs for On-Demand Paging
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (2 preceding siblings ...)
  2024-10-09  1:59 ` [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support Daisuke Matsuda
@ 2024-10-09  1:59 ` Daisuke Matsuda
  2024-12-09 19:33   ` Jason Gunthorpe
  2024-10-09  1:59 ` [PATCH for-next v8 5/6] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP Daisuke Matsuda
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:59 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

Allow userspace to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, no RDMA
operations are enabled at this point; they will be added in the subsequent
two patches.

rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. When called
with the RXE_PAGEFAULT_SNAPSHOT flag, it syncs the process address space from
the CPU page table to the driver page table (dma_list/pfn_list in umem_odp).
Additionally, it can be used to trigger a page fault when the pages being
accessed are not present or do not have the proper read/write permissions,
and possibly to prefetch pages in the future.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c       |   7 ++
 drivers/infiniband/sw/rxe/rxe_loc.h   |  14 +++
 drivers/infiniband/sw/rxe/rxe_mr.c    |   9 +-
 drivers/infiniband/sw/rxe/rxe_odp.c   | 122 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c  |  15 +++-
 drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
 6 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 255677bc12b2..3ca73f8d96cc 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -75,6 +75,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
+
+	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
+		rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
+
+		/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
+		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+	}
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 866c36533b53..51b77e8827aa 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -189,4 +189,18 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
 	return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
 }
 
+/* rxe_odp.c */
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
+			 u64 iova, int access_flags, struct rxe_mr *mr);
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+static inline int
+rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
+		     int access_flags, struct rxe_mr *mr)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 1f7b8cf93adc..5589314a1e67 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -318,7 +318,10 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 		return err;
 	}
 
-	return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
+	if (mr->umem->is_odp)
+		return -EOPNOTSUPP;
+	else
+		return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
 }
 
 /* copy data in or out of a wqe, i.e. sg list
@@ -527,6 +530,10 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 	struct page *page;
 	u64 *va;
 
+	/* ODP is not supported right now. WIP. */
+	if (mr->umem->is_odp)
+		return RESPST_ERR_UNSUPPORTED_OPCODE;
+
 	/* See IBA oA19-28 */
 	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
 		rxe_dbg_mr(mr, "mr not in valid state\n");
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index ea55b79be0c6..c5e24901c141 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -9,6 +9,8 @@
 
 #include "rxe.h"
 
+#define RXE_ODP_WRITABLE_BIT    1UL
+
 static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
 				unsigned long end)
 {
@@ -25,6 +27,29 @@ static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
 	xas_unlock(&xas);
 }
 
+static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
+			      unsigned long end, unsigned long *pfn_list)
+{
+	unsigned long upper = rxe_mr_iova_to_index(mr, end - 1);
+	unsigned long lower = rxe_mr_iova_to_index(mr, start);
+	void *page, *entry;
+
+	XA_STATE(xas, &mr->page_list, lower);
+
+	xas_lock(&xas);
+	while (xas.xa_index <= upper) {
+		if (pfn_list[xas.xa_index] & HMM_PFN_WRITE) {
+			page = xa_tag_pointer(hmm_pfn_to_page(pfn_list[xas.xa_index]),
+					      RXE_ODP_WRITABLE_BIT);
+		} else
+			page = hmm_pfn_to_page(pfn_list[xas.xa_index]);
+
+		xas_store(&xas, page);
+		entry = xas_next(&xas);
+	}
+	xas_unlock(&xas);
+}
+
 static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 				    const struct mmu_notifier_range *range,
 				    unsigned long cur_seq)
@@ -55,3 +80,100 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 const struct mmu_interval_notifier_ops rxe_mn_ops = {
 	.invalidate = rxe_ib_invalidate_range,
 };
+
+#define RXE_PAGEFAULT_RDONLY BIT(1)
+#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
+static int rxe_odp_do_pagefault_and_lock(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
+	u64 access_mask;
+	int np;
+
+	access_mask = ODP_READ_ALLOWED_BIT;
+	if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
+		access_mask |= ODP_WRITE_ALLOWED_BIT;
+
+	/*
+	 * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
+	 * Callers must release the lock later to let invalidation handler
+	 * do its work again.
+	 */
+	np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
+					  access_mask, fault);
+	if (np < 0)
+		return np;
+
+	/*
+	 * umem_mutex is still locked here, so we can use hmm_pfn_to_page()
+	 * safely to fetch pages in the range.
+	 */
+	rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);
+
+	return np;
+}
+
+static int rxe_odp_init_pages(struct rxe_mr *mr)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	int ret;
+
+	ret = rxe_odp_do_pagefault_and_lock(mr, mr->umem->address,
+					    mr->umem->length,
+					    RXE_PAGEFAULT_SNAPSHOT);
+
+	if (ret >= 0)
+		mutex_unlock(&umem_odp->umem_mutex);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
+			 u64 iova, int access_flags, struct rxe_mr *mr)
+{
+	struct ib_umem_odp *umem_odp;
+	int err;
+
+	if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
+		return -EOPNOTSUPP;
+
+	rxe_mr_init(access_flags, mr);
+
+	xa_init(&mr->page_list);
+
+	if (!start && length == U64_MAX) {
+		if (iova != 0)
+			return -EINVAL;
+		if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
+			return -EINVAL;
+
+		/* Never reach here, for implicit ODP is not implemented. */
+	}
+
+	umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
+				   &rxe_mn_ops);
+	if (IS_ERR(umem_odp)) {
+		rxe_dbg_mr(mr, "Unable to create umem_odp err = %d\n",
+			   (int)PTR_ERR(umem_odp));
+		return PTR_ERR(umem_odp);
+	}
+
+	umem_odp->private = mr;
+
+	mr->umem = &umem_odp->umem;
+	mr->access = access_flags;
+	mr->ibmr.length = length;
+	mr->ibmr.iova = iova;
+	mr->page_offset = ib_umem_offset(&umem_odp->umem);
+
+	err = rxe_odp_init_pages(mr);
+	if (err) {
+		ib_umem_odp_release(umem_odp);
+		return err;
+	}
+
+	mr->state = RXE_MR_STATE_VALID;
+	mr->ibmr.type = IB_MR_TYPE_USER;
+
+	return err;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index c11ab280551a..e703a3ab82d4 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -649,6 +649,10 @@ static enum resp_states process_flush(struct rxe_qp *qp,
 	struct rxe_mr *mr = qp->resp.mr;
 	struct resp_res *res = qp->resp.res;
 
+	/* ODP is not supported right now. WIP. */
+	if (mr->umem->is_odp)
+		return RESPST_ERR_UNSUPPORTED_OPCODE;
+
 	/* oA19-14, oA19-15 */
 	if (res && res->replay)
 		return RESPST_ACKNOWLEDGE;
@@ -702,10 +706,13 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 	if (!res->replay) {
 		u64 iova = qp->resp.va + qp->resp.offset;
 
-		err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
-					  atmeth_comp(pkt),
-					  atmeth_swap_add(pkt),
-					  &res->atomic.orig_val);
+		if (mr->umem->is_odp)
+			err = RESPST_ERR_UNSUPPORTED_OPCODE;
+		else
+			err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
+						  atmeth_comp(pkt),
+						  atmeth_swap_add(pkt),
+						  &res->atomic.orig_val);
 		if (err)
 			return err;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 5c18f7e342f2..13064302d766 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1278,7 +1278,10 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, u64 start,
 	mr->ibmr.pd = ibpd;
 	mr->ibmr.device = ibpd->device;
 
-	err = rxe_mr_init_user(rxe, start, length, access, mr);
+	if (access & IB_ACCESS_ON_DEMAND)
+		err = rxe_odp_mr_init_user(rxe, start, length, iova, access, mr);
+	else
+		err = rxe_mr_init_user(rxe, start, length, access, mr);
 	if (err) {
 		rxe_dbg_mr(mr, "reg_user_mr failed, err = %d\n", err);
 		goto err_cleanup;
-- 
2.43.0



* [PATCH for-next v8 5/6] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (3 preceding siblings ...)
  2024-10-09  1:59 ` [PATCH for-next v8 4/6] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
@ 2024-10-09  1:59 ` Daisuke Matsuda
  2024-10-09  1:59 ` [PATCH for-next v8 6/6] RDMA/rxe: Add support for the traditional Atomic operations " Daisuke Matsuda
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:59 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

rxe_mr_copy() is widely used to copy data to/from a user MR. The requester
uses it to load payloads of request packets; the responder uses it to process
Send, Write, and Read operations; the completer uses it to copy data from
response packets of Read and Atomic operations to a user MR.

Allow these operations to be used with ODP by adding a subordinate function,
rxe_odp_mr_copy(). It consists of the following steps:
 1. Check page presence and R/W permission.
 2. If OK, just execute data copy to/from the pages and exit.
 3. Otherwise, trigger page fault to map the pages.
 4. Update the MR xarray using PFNs in umem_odp->pfn_list.
 5. Execute data copy to/from the pages.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c     | 10 ++++
 drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++
 drivers/infiniband/sw/rxe/rxe_mr.c  |  9 +++-
 drivers/infiniband/sw/rxe/rxe_odp.c | 73 +++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 3ca73f8d96cc..ea643ebf9667 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -81,6 +81,16 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 
 		/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
 		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+
+		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
+		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_RECV;
+		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
+
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
 	}
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 51b77e8827aa..2483e90a5443 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -193,6 +193,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 			 u64 iova, int access_flags, struct rxe_mr *mr);
+int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
+		    enum rxe_mr_copy_dir dir);
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int
 rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
@@ -200,6 +202,12 @@ rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 {
 	return -EOPNOTSUPP;
 }
+static inline int
+rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+		int length, enum rxe_mr_copy_dir dir)
+{
+	return -EOPNOTSUPP;
+}
 
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 5589314a1e67..eef3976309eb 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -247,7 +247,12 @@ int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
 	void *va;
 
 	while (length) {
-		page = xa_load(&mr->page_list, index);
+		if (mr->umem->is_odp)
+			page = xa_untag_pointer(xa_load(&mr->page_list,
+							index));
+		else
+			page = xa_load(&mr->page_list, index);
+
 		if (!page)
 			return -EFAULT;
 
@@ -319,7 +324,7 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 	}
 
 	if (mr->umem->is_odp)
-		return -EOPNOTSUPP;
+		return rxe_odp_mr_copy(mr, iova, addr, length, dir);
 	else
 		return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index c5e24901c141..979af279cf36 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -177,3 +177,76 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 
 	return err;
 }
+
+/* Take xarray spinlock before entry */
+static inline bool rxe_odp_check_pages(struct rxe_mr *mr, u64 iova,
+				       int length, u32 flags)
+{
+	unsigned long upper = rxe_mr_iova_to_index(mr, iova + length - 1);
+	unsigned long lower = rxe_mr_iova_to_index(mr, iova);
+	bool need_fault = false;
+	void *page, *entry;
+	size_t perm = 0;
+
+	if (!(flags & RXE_PAGEFAULT_RDONLY))
+		perm = RXE_ODP_WRITABLE_BIT;
+
+	XA_STATE(xas, &mr->page_list, lower);
+
+	while (xas.xa_index <= upper) {
+		page = xas_load(&xas);
+
+		/* Check page presence and write permission */
+		if (!page || (perm && !(xa_pointer_tag(page) & perm))) {
+			need_fault = true;
+			break;
+		}
+		entry = xas_next(&xas);
+	}
+
+	return need_fault;
+}
+
+int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
+		    enum rxe_mr_copy_dir dir)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	u32 flags = 0;
+	int err;
+
+	if (unlikely(!mr->umem->is_odp))
+		return -EOPNOTSUPP;
+
+	switch (dir) {
+	case RXE_TO_MR_OBJ:
+		break;
+
+	case RXE_FROM_MR_OBJ:
+		flags = RXE_PAGEFAULT_RDONLY;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	spin_lock(&mr->page_list.xa_lock);
+
+	if (rxe_odp_check_pages(mr, iova, length, flags)) {
+		spin_unlock(&mr->page_list.xa_lock);
+
+		/* umem_mutex is locked on success */
+		err = rxe_odp_do_pagefault_and_lock(mr, iova, length, flags);
+		if (err < 0)
+			return err;
+
+		/* spinlock to prevent page invalidation */
+		spin_lock(&mr->page_list.xa_lock);
+		mutex_unlock(&umem_odp->umem_mutex);
+	}
+
+	err =  rxe_mr_copy_xarray(mr, iova, addr, length, dir);
+
+	spin_unlock(&mr->page_list.xa_lock);
+
+	return err;
+}
-- 
2.43.0



* [PATCH for-next v8 6/6] RDMA/rxe: Add support for the traditional Atomic operations with ODP
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (4 preceding siblings ...)
  2024-10-09  1:59 ` [PATCH for-next v8 5/6] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP Daisuke Matsuda
@ 2024-10-09  1:59 ` Daisuke Matsuda
  2024-10-17 19:27 ` [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Jason Gunthorpe
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Daisuke Matsuda @ 2024-10-09  1:59 UTC (permalink / raw)
  To: linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian, Daisuke Matsuda

Enable 'fetch and add' and 'compare and swap' operations to be used with
ODP. This consists of the following steps:
 1. Verify that the page is present with write permission.
 2. If OK, execute the operation and exit.
 3. If not, then trigger page fault to map the page.
 4. Update the entry in the MR xarray.
 5. Execute the operation.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c      |  1 +
 drivers/infiniband/sw/rxe/rxe_loc.h  |  9 +++++++++
 drivers/infiniband/sw/rxe/rxe_mr.c   |  7 ++++++-
 drivers/infiniband/sw/rxe/rxe_odp.c  | 30 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c |  5 ++++-
 5 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index ea643ebf9667..08c69c637663 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -90,6 +90,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
 	}
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 2483e90a5443..5ea6d423d527 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -195,6 +195,9 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
 			 u64 iova, int access_flags, struct rxe_mr *mr);
 int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		    enum rxe_mr_copy_dir dir);
+int rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+			 u64 compare, u64 swap_add, u64 *orig_val);
+
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int
 rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
@@ -208,6 +211,12 @@ rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
 {
 	return -EOPNOTSUPP;
 }
+static inline int
+rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+		     u64 compare, u64 swap_add, u64 *orig_val)
+{
+	return RESPST_ERR_UNSUPPORTED_OPCODE;
+}
 
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index eef3976309eb..273da7dfca97 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -498,7 +498,12 @@ int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 		}
 		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
 		index = rxe_mr_iova_to_index(mr, iova);
-		page = xa_load(&mr->page_list, index);
+
+		if (mr->umem->is_odp)
+			page = xa_untag_pointer(xa_load(&mr->page_list, index));
+		else
+			page = xa_load(&mr->page_list, index);
+
 		if (!page)
 			return RESPST_ERR_RKEY_VIOLATION;
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 979af279cf36..a6d9a840a38c 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -250,3 +250,33 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 
 	return err;
 }
+
+int rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+			 u64 compare, u64 swap_add, u64 *orig_val)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	int err;
+
+	spin_lock(&mr->page_list.xa_lock);
+
+	/* Atomic operations manipulate a single char. */
+	if (rxe_odp_check_pages(mr, iova, sizeof(char), 0)) {
+		spin_unlock(&mr->page_list.xa_lock);
+
+		/* umem_mutex is locked on success */
+		err = rxe_odp_do_pagefault_and_lock(mr, iova, sizeof(char), 0);
+		if (err < 0)
+			return err;
+
+		/* spinlock to prevent page invalidation */
+		spin_lock(&mr->page_list.xa_lock);
+		mutex_unlock(&umem_odp->umem_mutex);
+	}
+
+	err = rxe_mr_do_atomic_op(mr, iova, opcode, compare,
+				  swap_add, orig_val);
+
+	spin_unlock(&mr->page_list.xa_lock);
+
+	return err;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index e703a3ab82d4..4c1e7337519a 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -707,7 +707,10 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 		u64 iova = qp->resp.va + qp->resp.offset;
 
 		if (mr->umem->is_odp)
-			err = RESPST_ERR_UNSUPPORTED_OPCODE;
+			err = rxe_odp_mr_atomic_op(mr, iova, pkt->opcode,
+						   atmeth_comp(pkt),
+						   atmeth_swap_add(pkt),
+						   &res->atomic.orig_val);
 		else
 			err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
 						  atmeth_comp(pkt),
-- 
2.43.0



* Re: [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code
  2024-10-09  1:58 ` [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code Daisuke Matsuda
@ 2024-10-09 14:13   ` Zhu Yanjun
  2024-10-10  7:24     ` Daisuke Matsuda (Fujitsu)
  2024-12-09 19:19   ` Jason Gunthorpe
  1 sibling, 1 reply; 27+ messages in thread
From: Zhu Yanjun @ 2024-10-09 14:13 UTC (permalink / raw)
  To: Daisuke Matsuda, linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian


On 2024/10/9 9:58, Daisuke Matsuda wrote:
> Some functions in rxe_mr.c are going to be used in rxe_odp.c, which is to
> be created in the subsequent patch. List the declarations of the functions
> in rxe_loc.h.
>
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++++++
>   drivers/infiniband/sw/rxe/rxe_mr.c  | 11 +++--------
>   2 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index ded46119151b..866c36533b53 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -58,6 +58,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
>   
>   /* rxe_mr.c */
>   u8 rxe_get_next_key(u32 last_key);
> +void rxe_mr_init(int access, struct rxe_mr *mr);
>   void rxe_mr_init_dma(int access, struct rxe_mr *mr);
>   int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
>   		     int access, struct rxe_mr *mr);
> @@ -69,6 +70,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
>   	      void *addr, int length, enum rxe_mr_copy_dir dir);
>   int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
>   		  int sg_nents, unsigned int *sg_offset);
> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> +		       unsigned int length, enum rxe_mr_copy_dir dir);
>   int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
>   			u64 compare, u64 swap_add, u64 *orig_val);
>   int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
> @@ -80,6 +83,11 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
>   int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
>   void rxe_mr_cleanup(struct rxe_pool_elem *elem);
>   
> +static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> +{
> +	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> +}

The return type of rxe_mr_iova_to_index() is "unsigned long". On
32-bit architectures, unsigned long is 32 bits.

The type of iova is u64, so it would be better to use u64 instead of
"unsigned long".

Zhu Yanjun

> +
>   /* rxe_mw.c */
>   int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
>   int rxe_dealloc_mw(struct ib_mw *ibmw);
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index da3dee520876..1f7b8cf93adc 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -45,7 +45,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
>   	}
>   }
>   
> -static void rxe_mr_init(int access, struct rxe_mr *mr)
> +void rxe_mr_init(int access, struct rxe_mr *mr)
>   {
>   	u32 key = mr->elem.index << 8 | rxe_get_next_key(-1);
>   
> @@ -72,11 +72,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>   	mr->ibmr.type = IB_MR_TYPE_DMA;
>   }
>   
> -static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> -{
> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> -}
> -
>   static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>   {
>   	return iova & (mr_page_size(mr) - 1);
> @@ -242,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
>   	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
>   }
>   
> -static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> -			      unsigned int length, enum rxe_mr_copy_dir dir)
> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> +		       unsigned int length, enum rxe_mr_copy_dir dir)
>   {
>   	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>   	unsigned long index = rxe_mr_iova_to_index(mr, iova);



* RE: [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code
  2024-10-09 14:13   ` Zhu Yanjun
@ 2024-10-10  7:24     ` Daisuke Matsuda (Fujitsu)
  2024-10-10  9:18       ` Zhu Yanjun
  0 siblings, 1 reply; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-10-10  7:24 UTC (permalink / raw)
  To: 'Zhu Yanjun', linux-rdma@vger.kernel.org, leon@kernel.org,
	jgg@ziepe.ca, zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On Wed, Oct 9, 2024 11:13 PM Zhu Yanjun wrote:
> 
> 
> On 2024/10/9 9:58, Daisuke Matsuda wrote:
> > Some functions in rxe_mr.c are going to be used in rxe_odp.c, which is to
> > be created in the subsequent patch. List the declarations of the functions
> > in rxe_loc.h.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >   drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++++++
> >   drivers/infiniband/sw/rxe/rxe_mr.c  | 11 +++--------
> >   2 files changed, 11 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> > index ded46119151b..866c36533b53 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> > +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> > @@ -58,6 +58,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
> >
> >   /* rxe_mr.c */
> >   u8 rxe_get_next_key(u32 last_key);
> > +void rxe_mr_init(int access, struct rxe_mr *mr);
> >   void rxe_mr_init_dma(int access, struct rxe_mr *mr);
> >   int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
> >   		     int access, struct rxe_mr *mr);
> > @@ -69,6 +70,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
> >   	      void *addr, int length, enum rxe_mr_copy_dir dir);
> >   int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
> >   		  int sg_nents, unsigned int *sg_offset);
> > +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> > +		       unsigned int length, enum rxe_mr_copy_dir dir);
> >   int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
> >   			u64 compare, u64 swap_add, u64 *orig_val);
> >   int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
> > @@ -80,6 +83,11 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
> >   int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
> >   void rxe_mr_cleanup(struct rxe_pool_elem *elem);
> >
> > +static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> > +{
> > +	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> > +}
> 
> The return type of rxe_mr_iova_to_index() is "unsigned long". On
> 32-bit architectures, unsigned long is 32 bits.
> 
> The type of iova is u64, so it would be better to use u64 instead of
> "unsigned long".
> 
> Zhu Yanjun

Hi,
thanks for the comment.

I think the current type declaration doesn't matter on a 32-bit OS.
The function returns the index of the page specified by 'iova'.
Assuming the typical 4 KiB page size, a u32 index can accommodate
16 TiB in total, which is larger than the theoretical limit imposed
on 32-bit systems (i.e. 4 GiB or 2^32 bytes).

Regards,
Daisuke Matsuda

> 
> > +
> >   /* rxe_mw.c */
> >   int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
> >   int rxe_dealloc_mw(struct ib_mw *ibmw);
> > diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> > index da3dee520876..1f7b8cf93adc 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> > @@ -45,7 +45,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
> >   	}
> >   }
> >
> > -static void rxe_mr_init(int access, struct rxe_mr *mr)
> > +void rxe_mr_init(int access, struct rxe_mr *mr)
> >   {
> >   	u32 key = mr->elem.index << 8 | rxe_get_next_key(-1);
> >
> > @@ -72,11 +72,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
> >   	mr->ibmr.type = IB_MR_TYPE_DMA;
> >   }
> >
> > -static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> > -{
> > -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> > -}
> > -
> >   static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
> >   {
> >   	return iova & (mr_page_size(mr) - 1);
> > @@ -242,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
> >   	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
> >   }
> >
> > -static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> > -			      unsigned int length, enum rxe_mr_copy_dir dir)
> > +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> > +		       unsigned int length, enum rxe_mr_copy_dir dir)
> >   {
> >   	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
> >   	unsigned long index = rxe_mr_iova_to_index(mr, iova);



* Re: [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code
  2024-10-10  7:24     ` Daisuke Matsuda (Fujitsu)
@ 2024-10-10  9:18       ` Zhu Yanjun
  2024-10-10 10:29         ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 27+ messages in thread
From: Zhu Yanjun @ 2024-10-10  9:18 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), 'Zhu Yanjun',
	linux-rdma@vger.kernel.org, leon@kernel.org, jgg@ziepe.ca,
	zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On 2024/10/10 15:24, Daisuke Matsuda (Fujitsu) wrote:
> On Wed, Oct 9, 2024 11:13 PM Zhu Yanjun wrote:
>>
>>
>> On 2024/10/9 9:58, Daisuke Matsuda wrote:
>>> Some functions in rxe_mr.c are going to be used in rxe_odp.c, which is to
>>> be created in the subsequent patch. List the declarations of the functions
>>> in rxe_loc.h.
>>>
>>> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
>>> ---
>>>    drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++++++
>>>    drivers/infiniband/sw/rxe/rxe_mr.c  | 11 +++--------
>>>    2 files changed, 11 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
>>> index ded46119151b..866c36533b53 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
>>> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
>>> @@ -58,6 +58,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
>>>
>>>    /* rxe_mr.c */
>>>    u8 rxe_get_next_key(u32 last_key);
>>> +void rxe_mr_init(int access, struct rxe_mr *mr);
>>>    void rxe_mr_init_dma(int access, struct rxe_mr *mr);
>>>    int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
>>>    		     int access, struct rxe_mr *mr);
>>> @@ -69,6 +70,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
>>>    	      void *addr, int length, enum rxe_mr_copy_dir dir);
>>>    int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
>>>    		  int sg_nents, unsigned int *sg_offset);
>>> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>>> +		       unsigned int length, enum rxe_mr_copy_dir dir);
>>>    int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
>>>    			u64 compare, u64 swap_add, u64 *orig_val);
>>>    int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
>>> @@ -80,6 +83,11 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
>>>    int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
>>>    void rxe_mr_cleanup(struct rxe_pool_elem *elem);
>>>
>>> +static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>> +{
>>> +	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>>> +}
>>
>> The return type of rxe_mr_iova_to_index() is "unsigned long". On
>> 32-bit architectures, unsigned long is 32 bits.
>>
>> The type of iova is u64, so it would be better to use u64 instead of
>> "unsigned long".
>>
>> Zhu Yanjun
> 
> Hi,
> thanks for the comment.
> 
> I think the current type declaration doesn't matter on a 32-bit OS.
> The function returns the index of the page specified by 'iova'.
> Assuming the typical 4 KiB page size, a u32 index can accommodate
> 16 TiB in total, which is larger than the theoretical limit imposed
> on 32-bit systems (i.e. 4 GiB or 2^32 bytes).

But on a 32-bit OS, this will likely produce a "type does not match"
warning, because "unsigned long" is 32 bits there while u64 is always 64
bits. So it is better to use the u64 type, which will not produce any
warnings on either a 32-bit or a 64-bit OS.

Zhu Yanjun
> 
> Regards,
> Daisuke Matsuda
> 
>>
>>> +
>>>    /* rxe_mw.c */
>>>    int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
>>>    int rxe_dealloc_mw(struct ib_mw *ibmw);
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>>> index da3dee520876..1f7b8cf93adc 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>>> @@ -45,7 +45,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
>>>    	}
>>>    }
>>>
>>> -static void rxe_mr_init(int access, struct rxe_mr *mr)
>>> +void rxe_mr_init(int access, struct rxe_mr *mr)
>>>    {
>>>    	u32 key = mr->elem.index << 8 | rxe_get_next_key(-1);
>>>
>>> @@ -72,11 +72,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
>>>    	mr->ibmr.type = IB_MR_TYPE_DMA;
>>>    }
>>>
>>> -static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
>>> -{
>>> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
>>> -}
>>> -
>>>    static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
>>>    {
>>>    	return iova & (mr_page_size(mr) - 1);
>>> @@ -242,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
>>>    	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
>>>    }
>>>
>>> -static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>>> -			      unsigned int length, enum rxe_mr_copy_dir dir)
>>> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
>>> +		       unsigned int length, enum rxe_mr_copy_dir dir)
>>>    {
>>>    	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
>>>    	unsigned long index = rxe_mr_iova_to_index(mr, iova);
> 



* RE: [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code
  2024-10-10  9:18       ` Zhu Yanjun
@ 2024-10-10 10:29         ` Daisuke Matsuda (Fujitsu)
  0 siblings, 0 replies; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-10-10 10:29 UTC (permalink / raw)
  To: 'Zhu Yanjun', 'Zhu Yanjun',
	linux-rdma@vger.kernel.org, leon@kernel.org, jgg@ziepe.ca,
	zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu), Daisuke Matsuda (Fujitsu)

On Thu, October 10, 2024 6:18 PM Zhu Yanjun wrote:
> 在 2024/10/10 15:24, Daisuke Matsuda (Fujitsu) 写道:
> > On Wed, Oct 9, 2024 11:13 PM Zhu Yanjun wrote:
> >>
> >>
> >> 在 2024/10/9 9:58, Daisuke Matsuda 写道:
> >>> Some functions in rxe_mr.c are going to be used in rxe_odp.c, which is to
> >>> be created in the subsequent patch. List the declarations of the functions
> >>> in rxe_loc.h.
> >>>
> >>> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> >>> ---
> >>>    drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++++++
> >>>    drivers/infiniband/sw/rxe/rxe_mr.c  | 11 +++--------
> >>>    2 files changed, 11 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> >>> index ded46119151b..866c36533b53 100644
> >>> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> >>> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> >>> @@ -58,6 +58,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
> >>>
> >>>    /* rxe_mr.c */
> >>>    u8 rxe_get_next_key(u32 last_key);
> >>> +void rxe_mr_init(int access, struct rxe_mr *mr);
> >>>    void rxe_mr_init_dma(int access, struct rxe_mr *mr);
> >>>    int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
> >>>    		     int access, struct rxe_mr *mr);
> >>> @@ -69,6 +70,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
> >>>    	      void *addr, int length, enum rxe_mr_copy_dir dir);
> >>>    int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
> >>>    		  int sg_nents, unsigned int *sg_offset);
> >>> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> >>> +		       unsigned int length, enum rxe_mr_copy_dir dir);
> >>>    int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
> >>>    			u64 compare, u64 swap_add, u64 *orig_val);
> >>>    int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
> >>> @@ -80,6 +83,11 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
> >>>    int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
> >>>    void rxe_mr_cleanup(struct rxe_pool_elem *elem);
> >>>
> >>> +static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> >>> +{
> >>> +	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> >>> +}
> >>
> >> The type of the function rxe_mr_iova_to_index is "unsigned long". In
> >> some 32 architecture, unsigned long is 32 bit.
> >>
> >> The type of iova is u64. So it had better use u64 instead of "unsigned
> >> long".
> >>
> >> Zhu Yanjun
> >
> > Hi,
> > thanks for the comment.
> >
> > I think the current type declaration doesn't matter in 32-bit OS.
> > The function returns an index of the page specified with 'iova'.
> > Assuming the page size is typical 4KiB, u32 index can accommodate
> > 16 TiB in total, which is larger than the theoretical limit imposed
> > on 32-bit systems (i.e. 4GiB or 2^32 Bytes).
> 
> But on a 32-bit OS, this will likely produce a "type does not match" warning,
> because "unsigned long" is 32 bits there while u64 is always 64 bits. So it is
> better to use the u64 type; that will not produce any warnings on either
> 32-bit or 64-bit systems.

That makes sense. The function was created in commit 592627ccbdff.
Cf. https://github.com/torvalds/linux/commit/592627ccbdff0ec6fff00fc761142a76db750dd4

It seems to me that rxe_mr_iova_to_page_offset() also has the same problem.
=== rxe_mr.c ===
static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
{
	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
}
static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
{
	return iova & (mr_page_size(mr) - 1);
}
=============

This patch in the ODP series just moves the function definition, so it is not
itself the cause of the problem. If my ODP v8 patches can go into the for-next
tree without objection, I can send a follow-up patch to fix both helpers.
Otherwise, we can fix them before I submit the ODP v9 patches.
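
For reference, the change under discussion would look roughly like this
(an untested sketch; callers that feed the result into an xarray index,
such as rxe_mr_copy_xarray(), would still narrow it to unsigned long):

=== sketch ===
static inline u64 rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
{
	/* both shifts are computed in 64-bit arithmetic */
	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
}

static inline u64 rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
{
	/* the offset always fits in a page; u64 just keeps the types uniform */
	return iova & (mr_page_size(mr) - 1);
}
==============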

Thanks,
Daisuke Matsuda

> 
> Zhu Yanjun
> >
> > Regards,
> > Daisuke Matsuda
> >
> >>
> >>> +
> >>>    /* rxe_mw.c */
> >>>    int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
> >>>    int rxe_dealloc_mw(struct ib_mw *ibmw);
> >>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> >>> index da3dee520876..1f7b8cf93adc 100644
> >>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> >>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> >>> @@ -45,7 +45,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
> >>>    	}
> >>>    }
> >>>
> >>> -static void rxe_mr_init(int access, struct rxe_mr *mr)
> >>> +void rxe_mr_init(int access, struct rxe_mr *mr)
> >>>    {
> >>>    	u32 key = mr->elem.index << 8 | rxe_get_next_key(-1);
> >>>
> >>> @@ -72,11 +72,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
> >>>    	mr->ibmr.type = IB_MR_TYPE_DMA;
> >>>    }
> >>>
> >>> -static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> >>> -{
> >>> -	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> >>> -}
> >>> -
> >>>    static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
> >>>    {
> >>>    	return iova & (mr_page_size(mr) - 1);
> >>> @@ -242,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
> >>>    	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
> >>>    }
> >>>
> >>> -static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> >>> -			      unsigned int length, enum rxe_mr_copy_dir dir)
> >>> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> >>> +		       unsigned int length, enum rxe_mr_copy_dir dir)
> >>>    {
> >>>    	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
> >>>    	unsigned long index = rxe_mr_iova_to_index(mr, iova);
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-10-09  1:59 ` [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support Daisuke Matsuda
@ 2024-10-13  6:15   ` Zhu Yanjun
  2024-10-28  7:25     ` Daisuke Matsuda (Fujitsu)
  2024-12-09 19:21   ` Jason Gunthorpe
  2024-12-09 19:31   ` Jason Gunthorpe
  2 siblings, 1 reply; 27+ messages in thread
From: Zhu Yanjun @ 2024-10-13  6:15 UTC (permalink / raw)
  To: Daisuke Matsuda, linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian

在 2024/10/9 9:59, Daisuke Matsuda 写道:
> On page invalidation, an MMU notifier callback is invoked to unmap DMA
> addresses and update the driver page table(umem_odp->dma_list). It also
> sets the corresponding entries in MR xarray to NULL to prevent any access.
> The callback is registered when an ODP-enabled MR is created.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/Makefile  |  2 +
>   drivers/infiniband/sw/rxe/rxe_odp.c | 57 +++++++++++++++++++++++++++++
>   2 files changed, 59 insertions(+)
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
> 
> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> index 5395a581f4bb..93134f1d1d0c 100644
> --- a/drivers/infiniband/sw/rxe/Makefile
> +++ b/drivers/infiniband/sw/rxe/Makefile
> @@ -23,3 +23,5 @@ rdma_rxe-y := \
>   	rxe_task.o \
>   	rxe_net.o \
>   	rxe_hw_counters.o
> +
> +rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> new file mode 100644
> index 000000000000..ea55b79be0c6
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -0,0 +1,57 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2022-2023 Fujitsu Ltd. All rights reserved.
> + */
> +
> +#include <linux/hmm.h>
> +
> +#include <rdma/ib_umem_odp.h>
> +
> +#include "rxe.h"
> +
> +static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
> +				unsigned long end)
> +{
> +	unsigned long upper = rxe_mr_iova_to_index(mr, end - 1);
> +	unsigned long lower = rxe_mr_iova_to_index(mr, start);
> +	void *entry;
> +
> +	XA_STATE(xas, &mr->page_list, lower);
> +
> +	/* make elements in xarray NULL */
> +	xas_lock(&xas);
> +	xas_for_each(&xas, entry, upper)
> +		xas_store(&xas, NULL);
> +	xas_unlock(&xas);
> +}
> +
> +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> +				    const struct mmu_notifier_range *range,
> +				    unsigned long cur_seq)
> +{
> +	struct ib_umem_odp *umem_odp =
> +		container_of(mni, struct ib_umem_odp, notifier);
> +	struct rxe_mr *mr = umem_odp->private;
> +	unsigned long start, end;
> +
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
> +
> +	mutex_lock(&umem_odp->umem_mutex);

guard(mutex)(&umem_odp->umem_mutex);

It seems that the above is more popular.

Zhu Yanjun
> +	mmu_interval_set_seq(mni, cur_seq);
> +
> +	start = max_t(u64, ib_umem_start(umem_odp), range->start);
> +	end = min_t(u64, ib_umem_end(umem_odp), range->end);
> +
> +	rxe_mr_unset_xarray(mr, start, end);
> +
> +	/* update umem_odp->dma_list */
> +	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
> +
> +	mutex_unlock(&umem_odp->umem_mutex);
> +	return true;
> +}
> +
> +const struct mmu_interval_notifier_ops rxe_mn_ops = {
> +	.invalidate = rxe_ib_invalidate_range,
> +};


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (5 preceding siblings ...)
  2024-10-09  1:59 ` [PATCH for-next v8 6/6] RDMA/rxe: Add support for the traditional Atomic operations " Daisuke Matsuda
@ 2024-10-17 19:27 ` Jason Gunthorpe
  2024-10-29  5:43   ` Daisuke Matsuda (Fujitsu)
  2024-10-18  7:06 ` Zhu Yanjun
  2024-12-09 19:36 ` Jason Gunthorpe
  8 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 19:27 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:58:57AM +0900, Daisuke Matsuda wrote:
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.
> 
> This series has been blocked because of the hang issue of srp 002 test[2],
> which was believed to be caused after applying the commit 9b4b7c1f9f54
> ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent
> on the commit because the ODP feature requires sleeping in kernel space,
> and it is impossible with the former tasklet implementation.
> 
> According to the original reporter[3], the hang issue is already gone in
> v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
> driver is ready to accept this series since there is no longer any reason
> to consider reverting back to the old tasklet.

Okay, and it seems we are just ignoring the rxe bugs these days, so
why not? Let's look at it.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (6 preceding siblings ...)
  2024-10-17 19:27 ` [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Jason Gunthorpe
@ 2024-10-18  7:06 ` Zhu Yanjun
  2024-10-28  7:59   ` Daisuke Matsuda (Fujitsu)
  2024-12-09 19:36 ` Jason Gunthorpe
  8 siblings, 1 reply; 27+ messages in thread
From: Zhu Yanjun @ 2024-10-18  7:06 UTC (permalink / raw)
  To: Daisuke Matsuda, linux-rdma, leon, jgg, zyjzyj2000
  Cc: linux-kernel, rpearsonhpe, lizhijian

在 2024/10/9 3:58, Daisuke Matsuda 写道:
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.
> 
> This series has been blocked because of the hang issue of srp 002 test[2],
> which was believed to be caused after applying the commit 9b4b7c1f9f54
> ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent
> on the commit because the ODP feature requires sleeping in kernel space,
> and it is impossible with the former tasklet implementation.
> 
> According to the original reporter[3], the hang issue is already gone in
> v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
> driver is ready to accept this series since there is no longer any reason
> to consider reverting back to the old tasklet.
> 
> I omitted some contents like the motive behind this series from the cover-
> letter. Please see the cover letter of v3 for more details[5].
> 
> [Overview]
> When applications register a memory region(MR), RDMA drivers normally pin
> pages in the MR so that physical addresses are never changed during RDMA
> communication. This requires the MR to fit in physical memory and
> inevitably leads to memory pressure. On the other hand, On-Demand Paging
> (ODP) allows applications to register MRs without pinning pages. They are
> paged-in when the driver requires and paged-out when the OS reclaims. As a
> result, it is possible to register a large MR that does not fit in physical
> memory without taking up so much physical memory.
> 
> [How does ODP work?]
> "struct ib_umem_odp" is used to manage pages. It is created for each
> ODP-enabled MR on its registration. This struct holds a pair of arrays
> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> PFNs are stored in the driver page table. They are updated on page-in and
> page-out, both of which use the common interfaces in the ib_uverbs layer.
> 
> Page-in can occur when requester, responder or completer access an MR in
> order to process RDMA operations. If they find that the pages being
> accessed are not present on physical memory or requisite permissions are
> not set on the pages, they provoke page fault to make the pages present
> with proper permissions and at the same time update the driver page table.
> After confirming the presence of the pages, they execute memory access such
> as read, write or atomic operations.
> 
> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> update of a file that is being used as an MR). When creating an ODP-enabled
> MR, the driver registers an MMU notifier callback. When the kernel issues a
> page invalidation notification, the callback is provoked to unmap DMA
> addresses and update the driver page table. After that, the kernel releases
> the pages.
> 
> [Supported operations]
> All traditional operations are supported on RC connection. The new Atomic
> write[6] and RDMA Flush[7] operations are not included in this patchset. I
> will post them later after this patchset is merged. On UD connection, Send,
> Recv, and SRQ-Recv are supported.
> 
> [How to test ODP?]
> There are only a few resources available for testing. pyverbs testcases in
> rdma-core and perftest[8] are recommendable ones. Other than them, the
> ibv_rc_pingpong command can also be used for testing. Note that you may
> have to build perftest from upstream because old versions do not handle ODP
> capabilities correctly.

Thanks a lot. I have tested these patches with perftest. Since ODP
(On-Demand Paging) is a distinct feature, can you also add some test cases
to rdma-core, so that we can use rdma-core to test this feature of rxe?

That is, add some test cases to run_tests.py, so that run_tests.py can be
used to verify the ODP feature on rxe.

Thanks,
Zhu Yanjun

> 
> The latest ODP tree is available from github:
> https://github.com/ddmatsu/linux/tree/odp_v8
> 
> [Future work]
> My next work is to enable the new Atomic write[6] and RDMA Flush[7]
> operations with ODP. After that, I am going to implement the prefetch
> feature. It allows applications to trigger page fault using
> ibv_advise_mr(3) to optimize performance. Some existing software like
> librpma[9] use this feature. Additionally, I think we can also add the
> implicit ODP feature in the future.
> 
> [1] Understanding On Demand Paging (ODP)
> https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x
> 
> [2] [bug report] blktests srp/002 hang
> https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/
> 
> [3] blktests failures with v6.10-rc1 kernel
> https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/
> 
> [4] [00/15] ethernet: Convert from tasklet to BH workqueue
> https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/
> 
> [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
> https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/
> 
> [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
> https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/
> 
> [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
> https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/
> 
> [8] linux-rdma/perftest: Infiniband Verbs Performance Tests
> https://github.com/linux-rdma/perftest
> 
> [9] librpma: Remote Persistent Memory Access Library
> https://github.com/pmem/rpma
> 
> v7->v8:
>   1) Dropped the first patch because the same change was made by Bob Pearson.
>   cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
>   2) Rebased to 6.12.1-rc2
> 
> v6->v7:
>   1) Rebased to 6.6.0
>   2) Disabled using hugepages with ODP
>   3) Addressed comments on v6 from Jason and Zhu
>     cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/
> 
> v5->v6:
>   Fixed the implementation according to Jason's suggestions
>     cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
>     cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/
> 
> v4->v5:
>   1) Rebased to 6.4.0-rc2+
>   2) Changed to schedule all works on responder and completer to workqueue
> 
> v3->v4:
>   1) Re-designed functions that access MRs to use the MR xarray.
>   2) Rebased onto the latest jgg-for-next tree.
> 
> v2->v3:
>   1) Removed a patch that changes the common ib_uverbs layer.
>   2) Re-implemented patches for conversion to workqueue.
>   3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
>   4) Fixed some functions that returned incorrect errors.
>   5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
> 
> v1->v2:
>   1) Fixed a crash issue reported by Haris Iqbal.
>   2) Tried to make lock patters clearer as pointed out by Romanovsky.
>   3) Minor clean ups and fixes.
> 
> Daisuke Matsuda (6):
>    RDMA/rxe: Make MR functions accessible from other rxe source code
>    RDMA/rxe: Move resp_states definition to rxe_verbs.h
>    RDMA/rxe: Add page invalidation support
>    RDMA/rxe: Allow registering MRs for On-Demand Paging
>    RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
>    RDMA/rxe: Add support for the traditional Atomic operations with ODP
> 
>   drivers/infiniband/sw/rxe/Makefile    |   2 +
>   drivers/infiniband/sw/rxe/rxe.c       |  18 ++
>   drivers/infiniband/sw/rxe/rxe.h       |  37 ----
>   drivers/infiniband/sw/rxe/rxe_loc.h   |  39 ++++
>   drivers/infiniband/sw/rxe/rxe_mr.c    |  34 +++-
>   drivers/infiniband/sw/rxe/rxe_odp.c   | 282 ++++++++++++++++++++++++++
>   drivers/infiniband/sw/rxe/rxe_resp.c  |  18 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.h |  37 ++++
>   9 files changed, 419 insertions(+), 53 deletions(-)
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-10-13  6:15   ` Zhu Yanjun
@ 2024-10-28  7:25     ` Daisuke Matsuda (Fujitsu)
  2024-10-28 20:26       ` Zhu Yanjun
  0 siblings, 1 reply; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-10-28  7:25 UTC (permalink / raw)
  To: 'Zhu Yanjun', linux-rdma@vger.kernel.org, leon@kernel.org,
	jgg@ziepe.ca, zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On Sun, Oct 13, 2024 3:16 PM Zhu Yanjun wrote:
> 在 2024/10/9 9:59, Daisuke Matsuda 写道:
> > On page invalidation, an MMU notifier callback is invoked to unmap DMA
> > addresses and update the driver page table(umem_odp->dma_list). It also
> > sets the corresponding entries in MR xarray to NULL to prevent any access.
> > The callback is registered when an ODP-enabled MR is created.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >   drivers/infiniband/sw/rxe/Makefile  |  2 +
> >   drivers/infiniband/sw/rxe/rxe_odp.c | 57 +++++++++++++++++++++++++++++
> >   2 files changed, 59 insertions(+)
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
> >
> > diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> > index 5395a581f4bb..93134f1d1d0c 100644
> > --- a/drivers/infiniband/sw/rxe/Makefile
> > +++ b/drivers/infiniband/sw/rxe/Makefile
> > @@ -23,3 +23,5 @@ rdma_rxe-y := \
> >   	rxe_task.o \
> >   	rxe_net.o \
> >   	rxe_hw_counters.o
> > +
> > +rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
> > diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> > new file mode 100644
> > index 000000000000..ea55b79be0c6
> > --- /dev/null
> > +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> > @@ -0,0 +1,57 @@
> > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> > +/*
> > + * Copyright (c) 2022-2023 Fujitsu Ltd. All rights reserved.
> > + */
> > +
> > +#include <linux/hmm.h>
> > +
> > +#include <rdma/ib_umem_odp.h>
> > +
> > +#include "rxe.h"
> > +
> > +static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
> > +				unsigned long end)
> > +{
> > +	unsigned long upper = rxe_mr_iova_to_index(mr, end - 1);
> > +	unsigned long lower = rxe_mr_iova_to_index(mr, start);
> > +	void *entry;
> > +
> > +	XA_STATE(xas, &mr->page_list, lower);
> > +
> > +	/* make elements in xarray NULL */
> > +	xas_lock(&xas);
> > +	xas_for_each(&xas, entry, upper)
> > +		xas_store(&xas, NULL);
> > +	xas_unlock(&xas);
> > +}
> > +
> > +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> > +				    const struct mmu_notifier_range *range,
> > +				    unsigned long cur_seq)
> > +{
> > +	struct ib_umem_odp *umem_odp =
> > +		container_of(mni, struct ib_umem_odp, notifier);
> > +	struct rxe_mr *mr = umem_odp->private;
> > +	unsigned long start, end;
> > +
> > +	if (!mmu_notifier_range_blockable(range))
> > +		return false;
> > +
> > +	mutex_lock(&umem_odp->umem_mutex);
> 
> guard(mutex)(&umem_odp->umem_mutex);
> 
> It seems that the above is more popular.

Thanks for the comment.

I have no objection to your suggestion, since an increasing number of
kernel components use the "guard(mutex)" syntax these days, but I would
rather make the change across the whole infiniband subsystem at once,
because there are multiple mutex lock/unlock pairs to convert.
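
For illustration, this is roughly how the invalidate callback from patch 3/6
would read with the scoped guard (a sketch only; it assumes guard(mutex) from
linux/cleanup.h is available in the target tree):

=== sketch ===
static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
				    const struct mmu_notifier_range *range,
				    unsigned long cur_seq)
{
	struct ib_umem_odp *umem_odp =
		container_of(mni, struct ib_umem_odp, notifier);
	struct rxe_mr *mr = umem_odp->private;
	unsigned long start, end;

	if (!mmu_notifier_range_blockable(range))
		return false;

	/* dropped automatically on every return path */
	guard(mutex)(&umem_odp->umem_mutex);

	mmu_interval_set_seq(mni, cur_seq);

	start = max_t(u64, ib_umem_start(umem_odp), range->start);
	end = min_t(u64, ib_umem_end(umem_odp), range->end);

	rxe_mr_unset_xarray(mr, start, end);

	/* update umem_odp->dma_list */
	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);

	return true;
}
==============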

Regards,
Daisuke Matsuda

> 
> Zhu Yanjun
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +
> > +	start = max_t(u64, ib_umem_start(umem_odp), range->start);
> > +	end = min_t(u64, ib_umem_end(umem_odp), range->end);
> > +
> > +	rxe_mr_unset_xarray(mr, start, end);
> > +
> > +	/* update umem_odp->dma_list */
> > +	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
> > +
> > +	mutex_unlock(&umem_odp->umem_mutex);
> > +	return true;
> > +}
> > +
> > +const struct mmu_interval_notifier_ops rxe_mn_ops = {
> > +	.invalidate = rxe_ib_invalidate_range,
> > +};


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
  2024-10-18  7:06 ` Zhu Yanjun
@ 2024-10-28  7:59   ` Daisuke Matsuda (Fujitsu)
  2024-10-28 20:19     ` Zhu Yanjun
  0 siblings, 1 reply; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-10-28  7:59 UTC (permalink / raw)
  To: 'Zhu Yanjun', linux-rdma@vger.kernel.org, leon@kernel.org,
	jgg@ziepe.ca, zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote:
> 在 2024/10/9 3:58, Daisuke Matsuda 写道:
> > This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> > driver, which has been available only in mlx5 driver[1] so far.
> >
> > This series has been blocked because of the hang issue of srp 002 test[2],
> > which was believed to be caused after applying the commit 9b4b7c1f9f54
> > ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent
> > on the commit because the ODP feature requires sleeping in kernel space,
> > and it is impossible with the former tasklet implementation.
> >
> > According to the original reporter[3], the hang issue is already gone in
> > v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
> > driver is ready to accept this series since there is no longer any reason
> > to consider reverting back to the old tasklet.
> >
> > I omitted some contents like the motive behind this series from the cover-
> > letter. Please see the cover letter of v3 for more details[5].
> >
> > [Overview]
> > When applications register a memory region(MR), RDMA drivers normally pin
> > pages in the MR so that physical addresses are never changed during RDMA
> > communication. This requires the MR to fit in physical memory and
> > inevitably leads to memory pressure. On the other hand, On-Demand Paging
> > (ODP) allows applications to register MRs without pinning pages. They are
> > paged-in when the driver requires and paged-out when the OS reclaims. As a
> > result, it is possible to register a large MR that does not fit in physical
> > memory without taking up so much physical memory.
> >
> > [How does ODP work?]
> > "struct ib_umem_odp" is used to manage pages. It is created for each
> > ODP-enabled MR on its registration. This struct holds a pair of arrays
> > (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> > PFNs are stored in the driver page table. They are updated on page-in and
> > page-out, both of which use the common interfaces in the ib_uverbs layer.
> >
> > Page-in can occur when requester, responder or completer access an MR in
> > order to process RDMA operations. If they find that the pages being
> > accessed are not present on physical memory or requisite permissions are
> > not set on the pages, they provoke page fault to make the pages present
> > with proper permissions and at the same time update the driver page table.
> > After confirming the presence of the pages, they execute memory access such
> > as read, write or atomic operations.
> >
> > Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> > update of a file that is being used as an MR). When creating an ODP-enabled
> > MR, the driver registers an MMU notifier callback. When the kernel issues a
> > page invalidation notification, the callback is provoked to unmap DMA
> > addresses and update the driver page table. After that, the kernel releases
> > the pages.
> >
> > [Supported operations]
> > All traditional operations are supported on RC connection. The new Atomic
> > write[6] and RDMA Flush[7] operations are not included in this patchset. I
> > will post them later after this patchset is merged. On UD connection, Send,
> > Recv, and SRQ-Recv are supported.
> >
> > [How to test ODP?]
> > There are only a few resources available for testing. pyverbs testcases in
> > rdma-core and perftest[8] are recommendable ones. Other than them, the
> > ibv_rc_pingpong command can also be used for testing. Note that you may
> > have to build perftest from upstream because old versions do not handle ODP
> > capabilities correctly.
> 
> Thanks a lot. I have tested these patches with perftest. Because ODP (On
> Demand Paging) is a feature, can you also add some testcases into rdma
> core? So we can use rdma-core to make tests with this feature of rxe.

I added Read/Write/Atomics tests two years ago.
Cf. https://github.com/linux-rdma/rdma-core/pull/1229

Each of the ODP test cases causes page invalidation, so that subsequent
RDMA traffic triggers the ODP page-in flow.

Currently, the 7 test cases below pass on the rxe ODP v8 implementation.
  test_odp_rc_atomic_cmp_and_swp
  test_odp_rc_atomic_fetch_and_add
  test_odp_rc_mixed_mr
  test_odp_rc_rdma_read
  test_odp_rc_rdma_write
  test_odp_rc_traffic
  test_odp_ud_traffic
The remaining 11 tests are skipped because the required capabilities are
not supported.
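
For reference, the capability check that decides whether those tests run or
get skipped is the standard libibverbs query; a rough sketch (not taken from
the test code itself, error handling kept minimal):

=== sketch ===
#include <infiniband/verbs.h>

/* return non-zero if the device advertises ODP for RC RDMA Read/Write */
static int rc_odp_rw_supported(struct ibv_context *ctx)
{
	struct ibv_device_attr_ex attr = {};

	if (ibv_query_device_ex(ctx, NULL, &attr))
		return 0;

	if (!(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
		return 0;

	return attr.odp_caps.per_transport_caps.rc_odp_caps &
	       (IBV_ODP_SUPPORT_READ | IBV_ODP_SUPPORT_WRITE);
}
==============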

Please let me know if you have any suggestions for improvement.

Thanks,
Daisuke Matsuda

> 
> That is, add some testcases in run_tests.py, so use run_tests.py to
> verify this (ODP) feature on rxe.
> 
> Thanks,
> Zhu Yanjun
> 
> >
> > The latest ODP tree is available from github:
> > https://github.com/ddmatsu/linux/tree/odp_v8
> >
> > [Future work]
> > My next work is to enable the new Atomic write[6] and RDMA Flush[7]
> > operations with ODP. After that, I am going to implement the prefetch
> > feature. It allows applications to trigger page fault using
> > ibv_advise_mr(3) to optimize performance. Some existing software like
> > librpma[9] use this feature. Additionally, I think we can also add the
> > implicit ODP feature in the future.
> >
> > [1] Understanding On Demand Paging (ODP)
> > https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x
> >
> > [2] [bug report] blktests srp/002 hang
> > https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/
> >
> > [3] blktests failures with v6.10-rc1 kernel
> > https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/
> >
> > [4] [00/15] ethernet: Convert from tasklet to BH workqueue
> > https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/
> >
> > [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
> > https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/
> >
> > [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
> > https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/
> >
> > [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
> > https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/
> >
> > [8] linux-rdma/perftest: Infiniband Verbs Performance Tests
> > https://github.com/linux-rdma/perftest
> >
> > [9] librpma: Remote Persistent Memory Access Library
> > https://github.com/pmem/rpma
> >
> > v7->v8:
> >   1) Dropped the first patch because the same change was made by Bob Pearson.
> >   cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
> >   2) Rebased to 6.12.1-rc2
> >
> > v6->v7:
> >   1) Rebased to 6.6.0
> >   2) Disabled using hugepages with ODP
> >   3) Addressed comments on v6 from Jason and Zhu
> >     cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/
> >
> > v5->v6:
> >   Fixed the implementation according to Jason's suggestions
> >     cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
> >     cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/
> >
> > v4->v5:
> >   1) Rebased to 6.4.0-rc2+
> >   2) Changed to schedule all works on responder and completer to workqueue
> >
> > v3->v4:
> >   1) Re-designed functions that access MRs to use the MR xarray.
> >   2) Rebased onto the latest jgg-for-next tree.
> >
> > v2->v3:
> >   1) Removed a patch that changes the common ib_uverbs layer.
> >   2) Re-implemented patches for conversion to workqueue.
> >   3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
> >   4) Fixed some functions that returned incorrect errors.
> >   5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
> >
> > v1->v2:
> >   1) Fixed a crash issue reported by Haris Iqbal.
> >   2) Tried to make lock patters clearer as pointed out by Romanovsky.
> >   3) Minor clean ups and fixes.
> >
> > Daisuke Matsuda (6):
> >    RDMA/rxe: Make MR functions accessible from other rxe source code
> >    RDMA/rxe: Move resp_states definition to rxe_verbs.h
> >    RDMA/rxe: Add page invalidation support
> >    RDMA/rxe: Allow registering MRs for On-Demand Paging
> >    RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
> >    RDMA/rxe: Add support for the traditional Atomic operations with ODP
> >
> >   drivers/infiniband/sw/rxe/Makefile    |   2 +
> >   drivers/infiniband/sw/rxe/rxe.c       |  18 ++
> >   drivers/infiniband/sw/rxe/rxe.h       |  37 ----
> >   drivers/infiniband/sw/rxe/rxe_loc.h   |  39 ++++
> >   drivers/infiniband/sw/rxe/rxe_mr.c    |  34 +++-
> >   drivers/infiniband/sw/rxe/rxe_odp.c   | 282 ++++++++++++++++++++++++++
> >   drivers/infiniband/sw/rxe/rxe_resp.c  |  18 +-
> >   drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
> >   drivers/infiniband/sw/rxe/rxe_verbs.h |  37 ++++
> >   9 files changed, 419 insertions(+), 53 deletions(-)
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
  2024-10-28  7:59   ` Daisuke Matsuda (Fujitsu)
@ 2024-10-28 20:19     ` Zhu Yanjun
  0 siblings, 0 replies; 27+ messages in thread
From: Zhu Yanjun @ 2024-10-28 20:19 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), linux-rdma@vger.kernel.org,
	leon@kernel.org, jgg@ziepe.ca, zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

在 2024/10/28 8:59, Daisuke Matsuda (Fujitsu) 写道:
> On Fri, Oct 18, 2024 4:07 PM Zhu Yanjun wrote:
>> 在 2024/10/9 3:58, Daisuke Matsuda 写道:
>>> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
>>> driver, which has been available only in mlx5 driver[1] so far.
>>>
>>> This series has been blocked because of the hang issue of srp 002 test[2],
>>> which was believed to be caused after applying the commit 9b4b7c1f9f54
>>> ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent
>>> on the commit because the ODP feature requires sleeping in kernel space,
>>> and it is impossible with the former tasklet implementation.
>>>
>>> According to the original reporter[3], the hang issue is already gone in
>>> v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
>>> driver is ready to accept this series since there is no longer any reason
>>> to consider reverting back to the old tasklet.
>>>
>>> I omitted some contents like the motive behind this series from the cover-
>>> letter. Please see the cover letter of v3 for more details[5].
>>>
>>> [Overview]
>>> When applications register a memory region(MR), RDMA drivers normally pin
>>> pages in the MR so that physical addresses are never changed during RDMA
>>> communication. This requires the MR to fit in physical memory and
>>> inevitably leads to memory pressure. On the other hand, On-Demand Paging
>>> (ODP) allows applications to register MRs without pinning pages. They are
>>> paged-in when the driver requires and paged-out when the OS reclaims. As a
>>> result, it is possible to register a large MR that does not fit in physical
>>> memory without taking up so much physical memory.
>>>
>>> [How does ODP work?]
>>> "struct ib_umem_odp" is used to manage pages. It is created for each
>>> ODP-enabled MR on its registration. This struct holds a pair of arrays
>>> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
>>> PFNs are stored in the driver page table. They are updated on page-in and
>>> page-out, both of which use the common interfaces in the ib_uverbs layer.
>>>
>>> Page-in can occur when requester, responder or completer access an MR in
>>> order to process RDMA operations. If they find that the pages being
>>> accessed are not present on physical memory or requisite permissions are
>>> not set on the pages, they provoke page fault to make the pages present
>>> with proper permissions and at the same time update the driver page table.
>>> After confirming the presence of the pages, they execute memory access such
>>> as read, write or atomic operations.
>>>
>>> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
>>> update of a file that is being used as an MR). When creating an ODP-enabled
>>> MR, the driver registers an MMU notifier callback. When the kernel issues a
>>> page invalidation notification, the callback is provoked to unmap DMA
>>> addresses and update the driver page table. After that, the kernel releases
>>> the pages.
>>>
>>> [Supported operations]
>>> All traditional operations are supported on RC connection. The new Atomic
>>> write[6] and RDMA Flush[7] operations are not included in this patchset. I
>>> will post them later after this patchset is merged. On UD connection, Send,
>>> Recv, and SRQ-Recv are supported.
>>>
>>> [How to test ODP?]
>>> There are only a few resources available for testing. pyverbs testcases in
>>> rdma-core and perftest[8] are recommendable ones. Other than them, the
>>> ibv_rc_pingpong command can also be used for testing. Note that you may
>>> have to build perftest from upstream because old versions do not handle ODP
>>> capabilities correctly.
>>
>> Thanks a lot. I have tested these patches with perftest. Because ODP (On
>> Demand Paging) is a feature, can you also add some testcases into rdma
>> core? So we can use rdma-core to make tests with this feature of rxe.
> 
> I added Read/Write/Atomics tests two years ago.
> Cf. https://github.com/linux-rdma/rdma-core/pull/1229
> 
> Each of the ODP test cases causes page invalidation, so that subsequent
> RDMA traffic triggers the ODP page-in flow.
> 
> Currently, the 7 test cases below pass on the rxe ODP v8 implementation.
>    test_odp_rc_atomic_cmp_and_swp
>    test_odp_rc_atomic_fetch_and_add
>    test_odp_rc_mixed_mr
>    test_odp_rc_rdma_read
>    test_odp_rc_rdma_write
>    test_odp_rc_traffic
>    test_odp_ud_traffic
> The remaining 11 tests are skipped because the required capabilities are
> not supported.

Thanks. Running the rdma-core tests, the cases above also pass in my
test environment.
I am fine with this.

Zhu Yanjun

> 
> Please let me know if you have any suggestions for improvement.
> 
> Thanks,
> Daisuke Matsuda
> 
>>
>> That is, add some testcases in run_tests.py, so use run_tests.py to
>> verify this (ODP) feature on rxe.
>>
>> Thanks,
>> Zhu Yanjun
>>
>>>
>>> The latest ODP tree is available from github:
>>> https://github.com/ddmatsu/linux/tree/odp_v8
>>>
>>> [Future work]
>>> My next work is to enable the new Atomic write[6] and RDMA Flush[7]
>>> operations with ODP. After that, I am going to implement the prefetch
>>> feature. It allows applications to trigger page fault using
>>> ibv_advise_mr(3) to optimize performance. Some existing software like
>>> librpma[9] use this feature. Additionally, I think we can also add the
>>> implicit ODP feature in the future.
>>>
>>> [1] Understanding On Demand Paging (ODP)
>>> https://enterprise-support.nvidia.com/s/article/understanding-on-demand-paging--odp-x
>>>
>>> [2] [bug report] blktests srp/002 hang
>>> https://lore.kernel.org/linux-rdma/dsg6rd66tyiei32zaxs6ddv5ebefr5vtxjwz6d2ewqrcwisogl@ge7jzan7dg5u/T/
>>>
>>> [3] blktests failures with v6.10-rc1 kernel
>>> https://lore.kernel.org/linux-block/wnucs5oboi4flje5yvtea7puvn6zzztcnlrfz3lpzlwgblrxgw@7wvqdzioejgl/
>>>
>>> [4] [00/15] ethernet: Convert from tasklet to BH workqueue
>>> https://patchwork.kernel.org/project/linux-rdma/cover/20240621050525.3720069-1-allen.lkml@gmail.com/
>>>
>>> [5] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
>>> https://lore.kernel.org/lkml/cover.1671772917.git.matsuda-daisuke@fujitsu.com/
>>>
>>> [6] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
>>> https://lore.kernel.org/linux-rdma/1669905432-14-1-git-send-email-yangx.jy@fujitsu.com/
>>>
>>> [7] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
>>> https://lore.kernel.org/lkml/20221206130201.30986-1-lizhijian@fujitsu.com/
>>>
>>> [8] linux-rdma/perftest: Infiniband Verbs Performance Tests
>>> https://github.com/linux-rdma/perftest
>>>
>>> [9] librpma: Remote Persistent Memory Access Library
>>> https://github.com/pmem/rpma
>>>
>>> v7->v8:
>>>    1) Dropped the first patch because the same change was made by Bob Pearson.
>>>    cf. https://github.com/torvalds/linux/commit/23bc06af547f2ca3b7d345e09fd8d04575406274
>>>    2) Rebased to 6.12.1-rc2
>>>
>>> v6->v7:
>>>    1) Rebased to 6.6.0
>>>    2) Disabled using hugepages with ODP
>>>    3) Addressed comments on v6 from Jason and Zhu
>>>      cf. https://lore.kernel.org/lkml/cover.1694153251.git.matsuda-daisuke@fujitsu.com/
>>>
>>> v5->v6:
>>>    Fixed the implementation according to Jason's suggestions
>>>      cf. https://lore.kernel.org/all/ZIdFXfDu4IMKE+BQ@nvidia.com/
>>>      cf. https://lore.kernel.org/all/ZIdGU709e1h5h4JJ@nvidia.com/
>>>
>>> v4->v5:
>>>    1) Rebased to 6.4.0-rc2+
>>>    2) Changed to schedule all works on responder and completer to workqueue
>>>
>>> v3->v4:
>>>    1) Re-designed functions that access MRs to use the MR xarray.
>>>    2) Rebased onto the latest jgg-for-next tree.
>>>
>>> v2->v3:
>>>    1) Removed a patch that changes the common ib_uverbs layer.
>>>    2) Re-implemented patches for conversion to workqueue.
>>>    3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
>>>    4) Fixed some functions that returned incorrect errors.
>>>    5) Temporarily disabled ODP for RDMA Flush and Atomic Write.
>>>
>>> v1->v2:
>>>    1) Fixed a crash issue reported by Haris Iqbal.
>>>    2) Tried to make lock patters clearer as pointed out by Romanovsky.
>>>    3) Minor clean ups and fixes.
>>>
>>> Daisuke Matsuda (6):
>>>     RDMA/rxe: Make MR functions accessible from other rxe source code
>>>     RDMA/rxe: Move resp_states definition to rxe_verbs.h
>>>     RDMA/rxe: Add page invalidation support
>>>     RDMA/rxe: Allow registering MRs for On-Demand Paging
>>>     RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
>>>     RDMA/rxe: Add support for the traditional Atomic operations with ODP
>>>
>>>    drivers/infiniband/sw/rxe/Makefile    |   2 +
>>>    drivers/infiniband/sw/rxe/rxe.c       |  18 ++
>>>    drivers/infiniband/sw/rxe/rxe.h       |  37 ----
>>>    drivers/infiniband/sw/rxe/rxe_loc.h   |  39 ++++
>>>    drivers/infiniband/sw/rxe/rxe_mr.c    |  34 +++-
>>>    drivers/infiniband/sw/rxe/rxe_odp.c   | 282 ++++++++++++++++++++++++++
>>>    drivers/infiniband/sw/rxe/rxe_resp.c  |  18 +-
>>>    drivers/infiniband/sw/rxe/rxe_verbs.c |   5 +-
>>>    drivers/infiniband/sw/rxe/rxe_verbs.h |  37 ++++
>>>    9 files changed, 419 insertions(+), 53 deletions(-)
>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>>>
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-10-28  7:25     ` Daisuke Matsuda (Fujitsu)
@ 2024-10-28 20:26       ` Zhu Yanjun
  0 siblings, 0 replies; 27+ messages in thread
From: Zhu Yanjun @ 2024-10-28 20:26 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), linux-rdma@vger.kernel.org,
	leon@kernel.org, jgg@ziepe.ca, zyjzyj2000@gmail.com
  Cc: linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

在 2024/10/28 8:25, Daisuke Matsuda (Fujitsu) 写道:
> On Sun, Oct 13, 2024 3:16 PM Zhu Yanjun wrote:
>> 在 2024/10/9 9:59, Daisuke Matsuda 写道:
>>> On page invalidation, an MMU notifier callback is invoked to unmap DMA
>>> addresses and update the driver page table(umem_odp->dma_list). It also
>>> sets the corresponding entries in MR xarray to NULL to prevent any access.
>>> The callback is registered when an ODP-enabled MR is created.
>>>
>>> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
>>> ---
>>>    drivers/infiniband/sw/rxe/Makefile  |  2 +
>>>    drivers/infiniband/sw/rxe/rxe_odp.c | 57 +++++++++++++++++++++++++++++
>>>    2 files changed, 59 insertions(+)
>>>    create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
>>> index 5395a581f4bb..93134f1d1d0c 100644
>>> --- a/drivers/infiniband/sw/rxe/Makefile
>>> +++ b/drivers/infiniband/sw/rxe/Makefile
>>> @@ -23,3 +23,5 @@ rdma_rxe-y := \
>>>    	rxe_task.o \
>>>    	rxe_net.o \
>>>    	rxe_hw_counters.o
>>> +
>>> +rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
>>> new file mode 100644
>>> index 000000000000..ea55b79be0c6
>>> --- /dev/null
>>> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
>>> @@ -0,0 +1,57 @@
>>> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>>> +/*
>>> + * Copyright (c) 2022-2023 Fujitsu Ltd. All rights reserved.
>>> + */
>>> +
>>> +#include <linux/hmm.h>
>>> +
>>> +#include <rdma/ib_umem_odp.h>
>>> +
>>> +#include "rxe.h"
>>> +
>>> +static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
>>> +				unsigned long end)
>>> +{
>>> +	unsigned long upper = rxe_mr_iova_to_index(mr, end - 1);
>>> +	unsigned long lower = rxe_mr_iova_to_index(mr, start);
>>> +	void *entry;
>>> +
>>> +	XA_STATE(xas, &mr->page_list, lower);
>>> +
>>> +	/* make elements in xarray NULL */
>>> +	xas_lock(&xas);
>>> +	xas_for_each(&xas, entry, upper)
>>> +		xas_store(&xas, NULL);
>>> +	xas_unlock(&xas);
>>> +}
>>> +
>>> +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
>>> +				    const struct mmu_notifier_range *range,
>>> +				    unsigned long cur_seq)
>>> +{
>>> +	struct ib_umem_odp *umem_odp =
>>> +		container_of(mni, struct ib_umem_odp, notifier);
>>> +	struct rxe_mr *mr = umem_odp->private;
>>> +	unsigned long start, end;
>>> +
>>> +	if (!mmu_notifier_range_blockable(range))
>>> +		return false;
>>> +
>>> +	mutex_lock(&umem_odp->umem_mutex);
>>
>> guard(mutex)(&umem_odp->umem_mutex);
>>
>> It seems that the above is more popular.
> 
> Thanks for the comment.
> 
> I have no objection to your suggestion, since an increasing number of
> kernel components use the "guard(mutex)" syntax these days, but I would
> rather make the change across the whole infiniband subsystem at once,
> because there are multiple mutex lock/unlock pairs to convert.

If you want to make such changes across the whole infiniband subsystem,
I am fine with it.

The "guard(mutex)" syntax is used in the following patch:

https://patchwork.kernel.org/project/linux-rdma/patch/20241009210048.4122518-1-bvanassche@acm.org/

Zhu Yanjun

> 
> Regards,
> Daisuke Matsuda
> 
>>
>> Zhu Yanjun
>>> +	mmu_interval_set_seq(mni, cur_seq);
>>> +
>>> +	start = max_t(u64, ib_umem_start(umem_odp), range->start);
>>> +	end = min_t(u64, ib_umem_end(umem_odp), range->end);
>>> +
>>> +	rxe_mr_unset_xarray(mr, start, end);
>>> +
>>> +	/* update umem_odp->dma_list */
>>> +	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
>>> +
>>> +	mutex_unlock(&umem_odp->umem_mutex);
>>> +	return true;
>>> +}
>>> +
>>> +const struct mmu_interval_notifier_ops rxe_mn_ops = {
>>> +	.invalidate = rxe_ib_invalidate_range,
>>> +};
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
  2024-10-17 19:27 ` [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Jason Gunthorpe
@ 2024-10-29  5:43   ` Daisuke Matsuda (Fujitsu)
  0 siblings, 0 replies; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-10-29  5:43 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: linux-rdma@vger.kernel.org, leon@kernel.org, zyjzyj2000@gmail.com,
	linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On Fri, Oct 18, 2024 4:28 AM Jason Gunthorpe wrote:
> On Wed, Oct 09, 2024 at 10:58:57AM +0900, Daisuke Matsuda wrote:
> > This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> > driver, which has been available only in mlx5 driver[1] so far.
> >
> > This series has been blocked because of the hang issue of srp 002 test[2],
> > which was believed to be caused after applying the commit 9b4b7c1f9f54
> > ("RDMA/rxe: Add workqueue support for rxe tasks"). My patches are dependent
> > on the commit because the ODP feature requires sleeping in kernel space,
> > and it is impossible with the former tasklet implementation.
> >
> > According to the original reporter[3], the hang issue is already gone in
> > v6.10. Additionally, tasklet is marked deprecated[4]. I think the rxe
> > driver is ready to accept this series since there is no longer any reason
> > to consider reverting back to the old tasklet.
> 
> Okay, and it seems we are just ignoring the rxe bugs these days, so
> why not? Let's look at it.

Hi Jason,

What we have seen so far suggests that the hang derives from a potential timing issue in the srp drivers.
I do not believe it should be a reason to delay this feature indefinitely.

However, I understand that your stance as a maintainer is reasonable. It is natural that you want to
improve the overall quality of the infiniband subsystem, including the ULP drivers. I am committed to
maintaining and improving rxe and the underlying drivers, but I am sorry that I cannot take
enough time to delve into the other components right now.

I must leave it to you whether to continue to block my patchset or not.
You are the maintainer and have the final word on it.

Thanks,
Daisuke Matsuda

> 
> Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code
  2024-10-09  1:58 ` [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code Daisuke Matsuda
  2024-10-09 14:13   ` Zhu Yanjun
@ 2024-12-09 19:19   ` Jason Gunthorpe
  1 sibling, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2024-12-09 19:19 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:58:58AM +0900, Daisuke Matsuda wrote:
> Some functions in rxe_mr.c are going to be used in rxe_odp.c, which is to
> be created in the subsequent patch. List the declarations of the functions
> in rxe_loc.h.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_loc.h |  8 ++++++++
>  drivers/infiniband/sw/rxe/rxe_mr.c  | 11 +++--------
>  2 files changed, 11 insertions(+), 8 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 2/6] RDMA/rxe: Move resp_states definition to rxe_verbs.h
  2024-10-09  1:58 ` [PATCH for-next v8 2/6] RDMA/rxe: Move resp_states definition to rxe_verbs.h Daisuke Matsuda
@ 2024-12-09 19:20   ` Jason Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2024-12-09 19:20 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:58:59AM +0900, Daisuke Matsuda wrote:
> To use the resp_states values in rxe_loc.h, it is necessary to move the
> definition to rxe_verbs.h, where other internal states of this driver are
> defined.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe.h       | 37 ---------------------------
>  drivers/infiniband/sw/rxe/rxe_verbs.h | 37 +++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+), 37 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-10-09  1:59 ` [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support Daisuke Matsuda
  2024-10-13  6:15   ` Zhu Yanjun
@ 2024-12-09 19:21   ` Jason Gunthorpe
  2024-12-10 12:00     ` Daisuke Matsuda (Fujitsu)
  2024-12-09 19:31   ` Jason Gunthorpe
  2 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2024-12-09 19:21 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:59:00AM +0900, Daisuke Matsuda wrote:

> +const struct mmu_interval_notifier_ops rxe_mn_ops = {
> +	.invalidate = rxe_ib_invalidate_range,
> +};

I think you'll get a W=1 warning here because there is no prototype
for this in a header?

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-10-09  1:59 ` [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support Daisuke Matsuda
  2024-10-13  6:15   ` Zhu Yanjun
  2024-12-09 19:21   ` Jason Gunthorpe
@ 2024-12-09 19:31   ` Jason Gunthorpe
  2024-12-10 12:12     ` Daisuke Matsuda (Fujitsu)
  2 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2024-12-09 19:31 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:59:00AM +0900, Daisuke Matsuda wrote:

> +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> +				    const struct mmu_notifier_range *range,
> +				    unsigned long cur_seq)
> +{
> +	struct ib_umem_odp *umem_odp =
> +		container_of(mni, struct ib_umem_odp, notifier);
> +	struct rxe_mr *mr = umem_odp->private;
> +	unsigned long start, end;
> +
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
> +
> +	mutex_lock(&umem_odp->umem_mutex);
> +	mmu_interval_set_seq(mni, cur_seq);
> +
> +	start = max_t(u64, ib_umem_start(umem_odp), range->start);
> +	end = min_t(u64, ib_umem_end(umem_odp), range->end);
> +
> +	rxe_mr_unset_xarray(mr, start, end);
> +
> +	/* update umem_odp->dma_list */
> +	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);

This seems like a strange thing to do, rxe has the xarray so why does
it use the odp->dma_list?

I think what you want is to have rxe disable the odp->dma_list and use
its xarray instead

Or use the odp lists as-is and don't include the xarray?

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 4/6] RDMA/rxe: Allow registering MRs for On-Demand Paging
  2024-10-09  1:59 ` [PATCH for-next v8 4/6] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
@ 2024-12-09 19:33   ` Jason Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2024-12-09 19:33 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:59:01AM +0900, Daisuke Matsuda wrote:

> +static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
> +			      unsigned long end, unsigned long *pfn_list)
> +{
> +	unsigned long upper = rxe_mr_iova_to_index(mr, end - 1);
> +	unsigned long lower = rxe_mr_iova_to_index(mr, start);
> +	void *page, *entry;
> +
> +	XA_STATE(xas, &mr->page_list, lower);
> +
> +	xas_lock(&xas);
> +	while (xas.xa_index <= upper) {
> +		if (pfn_list[xas.xa_index] & HMM_PFN_WRITE) {
> +			page = xa_tag_pointer(hmm_pfn_to_page(pfn_list[xas.xa_index]),
> +					      RXE_ODP_WRITABLE_BIT);
> +		} else
> +			page = hmm_pfn_to_page(pfn_list[xas.xa_index]);

Like here:

> +	rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);

So this is just copying the pfn_list to the xarray? Why not just
directly use pfn_list instead?

Though, you'd have to lock it with the mutex, is that the issue?

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE
  2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (7 preceding siblings ...)
  2024-10-18  7:06 ` Zhu Yanjun
@ 2024-12-09 19:36 ` Jason Gunthorpe
  8 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2024-12-09 19:36 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leon, zyjzyj2000, linux-kernel, rpearsonhpe,
	lizhijian

On Wed, Oct 09, 2024 at 10:58:57AM +0900, Daisuke Matsuda wrote:
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.

Other than my questions about the xarray vs pfn_list I didn't see
anything worrying about this. If you answer them and respin it I will
apply it in January unless someone who knows rxe points out something
wrong with it - as I really don't know much about rxe.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-12-09 19:21   ` Jason Gunthorpe
@ 2024-12-10 12:00     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 0 replies; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-12-10 12:00 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: linux-rdma@vger.kernel.org, leon@kernel.org, zyjzyj2000@gmail.com,
	linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On Tue, Dec 10, 2024 4:22 AM Jason Gunthorpe wrote:
> On Wed, Oct 09, 2024 at 10:59:00AM +0900, Daisuke Matsuda wrote:
> 
> > +const struct mmu_interval_notifier_ops rxe_mn_ops = {
> > +	.invalidate = rxe_ib_invalidate_range,
> > +};
> 
> I think you'll get a W=1 warning here because there is no prototype
> for this in a header?

Thank you for the review.
I will add a declaration for rxe_mn_ops in rxe_loc.h

Daisuke

> 
> Jason
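
For reference, the declaration being added here would presumably be a single line
in drivers/infiniband/sw/rxe/rxe_loc.h along these lines (the exact placement
within the header is an assumption):

/* rxe_loc.h */
extern const struct mmu_interval_notifier_ops rxe_mn_ops;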



* Re: [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support
  2024-12-09 19:31   ` Jason Gunthorpe
@ 2024-12-10 12:12     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 0 replies; 27+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2024-12-10 12:12 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: linux-rdma@vger.kernel.org, leon@kernel.org, zyjzyj2000@gmail.com,
	linux-kernel@vger.kernel.org, rpearsonhpe@gmail.com,
	Zhijian Li (Fujitsu)

On Tue, Dec 10, 2024 4:31 AM Jason Gunthorpe wrote:
> On Wed, Oct 09, 2024 at 10:59:00AM +0900, Daisuke Matsuda wrote:
> 
> > +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> > +				    const struct mmu_notifier_range *range,
> > +				    unsigned long cur_seq)
> > +{
> > +	struct ib_umem_odp *umem_odp =
> > +		container_of(mni, struct ib_umem_odp, notifier);
> > +	struct rxe_mr *mr = umem_odp->private;
> > +	unsigned long start, end;
> > +
> > +	if (!mmu_notifier_range_blockable(range))
> > +		return false;
> > +
> > +	mutex_lock(&umem_odp->umem_mutex);
> > +	mmu_interval_set_seq(mni, cur_seq);
> > +
> > +	start = max_t(u64, ib_umem_start(umem_odp), range->start);
> > +	end = min_t(u64, ib_umem_end(umem_odp), range->end);
> > +
> > +	rxe_mr_unset_xarray(mr, start, end);
> > +
> > +	/* update umem_odp->dma_list */
> > +	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
> 
> This seems like a strange thing to do, rxe has the xarray so why does
> it use the odp->dma_list?

I tried to reuse the existing rxe code for the RDMA operations, and that required
updating the xarray for ODP cases as well. I think using the pfn_list alone is
technically feasible.

> 
> I think what you want is to have rxe disable the odp->dma_list and use
> its xarray instead
> 
> Or use the odp lists as-is and don't include the xarray?

As you pointed out in reply to the next patch, the current implementation introduces
redundant copying overhead. That cannot be avoided as long as the xarray is kept,
so I would rather use the odp lists only.

Regards,
Daisuke

> 
> Jason



Thread overview: 27+ messages
2024-10-09  1:58 [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Daisuke Matsuda
2024-10-09  1:58 ` [PATCH for-next v8 1/6] RDMA/rxe: Make MR functions accessible from other rxe source code Daisuke Matsuda
2024-10-09 14:13   ` Zhu Yanjun
2024-10-10  7:24     ` Daisuke Matsuda (Fujitsu)
2024-10-10  9:18       ` Zhu Yanjun
2024-10-10 10:29         ` Daisuke Matsuda (Fujitsu)
2024-12-09 19:19   ` Jason Gunthorpe
2024-10-09  1:58 ` [PATCH for-next v8 2/6] RDMA/rxe: Move resp_states definition to rxe_verbs.h Daisuke Matsuda
2024-12-09 19:20   ` Jason Gunthorpe
2024-10-09  1:59 ` [PATCH for-next v8 3/6] RDMA/rxe: Add page invalidation support Daisuke Matsuda
2024-10-13  6:15   ` Zhu Yanjun
2024-10-28  7:25     ` Daisuke Matsuda (Fujitsu)
2024-10-28 20:26       ` Zhu Yanjun
2024-12-09 19:21   ` Jason Gunthorpe
2024-12-10 12:00     ` Daisuke Matsuda (Fujitsu)
2024-12-09 19:31   ` Jason Gunthorpe
2024-12-10 12:12     ` Daisuke Matsuda (Fujitsu)
2024-10-09  1:59 ` [PATCH for-next v8 4/6] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
2024-12-09 19:33   ` Jason Gunthorpe
2024-10-09  1:59 ` [PATCH for-next v8 5/6] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP Daisuke Matsuda
2024-10-09  1:59 ` [PATCH for-next v8 6/6] RDMA/rxe: Add support for the traditional Atomic operations " Daisuke Matsuda
2024-10-17 19:27 ` [PATCH for-next v8 0/6] On-Demand Paging on SoftRoCE Jason Gunthorpe
2024-10-29  5:43   ` Daisuke Matsuda (Fujitsu)
2024-10-18  7:06 ` Zhu Yanjun
2024-10-28  7:59   ` Daisuke Matsuda (Fujitsu)
2024-10-28 20:19     ` Zhu Yanjun
2024-12-09 19:36 ` Jason Gunthorpe
